# Exploring Hacker News Posts
![Hacker News](../JupyterProject2/hacker_news.jpg)

*Practice Project by: Dustin Mook*

---
## Skills covered
* GitHub
* Jupyter Notebook
* Data Engineering
* Object-Oriented Programming
* datetime module
* csv module

[The raw data csv for this project may be found here](HN_posts_year_to_Sep_26_2016.csv)

---
# Objectives
1. Do Ask HN or Show HN posts receive more comments on average?
2. Do posts created at a certain time receive more comments on average?

In [38]:
# Import packages and set the path to the raw data file
from csv import reader
from datetime import datetime as dt

original_csv = 'HN_posts_year_to_Sep_26_2016.csv'

In [33]:
# Read the csv and store the original data as a list of lists, `original_data`
with open(original_csv, 'r', encoding='UTF8') as file_object:
    original_data = list(reader(file_object))
headers = original_data[0]
hn = original_data[1:]
headers


['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
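
Since the header row names every column, the rows could also be read by column name rather than by position. The following is only a minimal sketch of that alternative, not the approach used in this notebook: it uses `csv.DictReader` from the standard library, and the two sample rows are made up for illustration (only the header matches the real file).

```python
import csv
import io

# Hypothetical two-row sample; only the header mirrors the real csv file.
sample_csv = io.StringIO(
    "id,title,url,num_points,num_comments,author,created_at\n"
    "1,Ask HN: Example question?,,3,7,alice,9/26/2016 2:53\n"
    "2,Show HN: Example project,http://example.com,5,2,bob,9/26/2016 3:10\n"
)

# DictReader uses the first row as keys, so each row becomes a dictionary.
rows = list(csv.DictReader(sample_csv))

first = rows[0]
print(first['title'], int(first['num_comments']))
```

With this layout, `row['num_comments']` replaces `row[4]`, which may be easier to read when a file has many columns.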

In [34]:
# Separate the hn data into three lists
ask_posts = []
show_posts = []
other_posts = []
for row in hn:
    title = row[1]
    if title.lower().startswith('ask hn'):
        ask_posts.append(row)
    elif title.lower().startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)
print(f'# of ask_posts = {len(ask_posts)}')
print(f'# of show_posts = {len(show_posts)}')
print(f'# of other_posts = {len(other_posts)}')

# of ask_posts = 9139
# of show_posts = 10158
# of other_posts = 273822


In [37]:
# Determine the average number of comments per type of post
total_ask_comments = 0
for row in ask_posts:
    num_comments = int(row[4])
    total_ask_comments += num_comments
avg_ask_comments = total_ask_comments / len(ask_posts)
print(f'avg_ask_comments = {avg_ask_comments:.2f}')

total_show_comments = 0
for row in show_posts:
    num_comments = int(row[4])
    total_show_comments += num_comments
# Divide by the number of Show HN posts
avg_show_comments = total_show_comments / len(show_posts)
print(f'avg_show_comments = {avg_show_comments:.2f}')

avg_ask_comments = 10.39
avg_show_comments = 4.89


### We've answered objective 1:

1. Ask HN posts receive roughly twice as many comments on average as Show HN posts, as seen above.
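
The average for each post type is its comment total divided by the number of posts of that same type. A minimal sketch of a reusable helper is shown here; the name `average_comments`, its `comments_index` parameter, and the sample rows are hypothetical and not part of the original notebook.

```python
def average_comments(posts, comments_index=4):
    """Return the mean comment count for a list of rows, or 0.0 if it is empty."""
    if not posts:
        return 0.0
    total = sum(int(row[comments_index]) for row in posts)
    # Divide by the length of the same list the total was computed from.
    return total / len(posts)


# Hypothetical rows in the same column order as the csv file.
sample_ask = [
    ['1', 'Ask HN: a', '', '1', '10', 'u1', '9/26/2016 2:53'],
    ['2', 'Ask HN: b', '', '1', '4', 'u2', '9/26/2016 3:10'],
]
sample_show = [
    ['3', 'Show HN: c', '', '1', '3', 'u3', '9/26/2016 4:00'],
    ['4', 'Show HN: d', '', '1', '1', 'u4', '9/26/2016 5:00'],
    ['5', 'Show HN: e', '', '1', '2', 'u5', '9/26/2016 6:00'],
]

print(f'{average_comments(sample_ask):.2f}')
print(f'{average_comments(sample_show):.2f}')
```

Because the helper takes the list as its only data argument, the total and the divisor always come from the same list.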

---


In [44]:
# Collect the creation time and comment count of every Ask HN post
result_list = []
for post in ask_posts:
    created_at = post[6]
    num_comments = int(post[4])
    result_list.append([created_at, num_comments])

# Count the posts and total the comments for each hour of the day
posts_by_hour = {}
comments_by_hour = {}

for row in result_list:
    created_at = row[0]
    num_comments = row[1]
    time_object = dt.strptime(created_at, '%m/%d/%Y %H:%M')
    hour_created = time_object.strftime('%H')
    if hour_created not in posts_by_hour:
        posts_by_hour[hour_created] = 1
        comments_by_hour[hour_created] = num_comments
    else:
        posts_by_hour[hour_created] += 1
        comments_by_hour[hour_created] += num_comments
print(posts_by_hour)
print(comments_by_hour)

{'02': 269, '01': 282, '22': 383, '21': 518, '19': 552, '17': 587, '15': 646, '14': 513, '13': 444, '11': 312, '10': 282, '09': 222, '07': 226, '03': 271, '23': 343, '20': 510, '16': 579, '08': 257, '00': 301, '18': 614, '12': 342, '04': 243, '06': 234, '05': 209}
{'02': 2996, '01': 2089, '22': 3372, '21': 4500, '19': 3954, '17': 5547, '15': 18525, '14': 4972, '13': 7245, '11': 2797, '10': 3013, '09': 1477, '07': 1585, '03': 2154, '23': 2297, '20': 4462, '16': 4466, '08': 2362, '00': 2277, '18': 4877, '12': 4234, '04': 2360, '06': 1587, '05': 1838}
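
The two dictionaries above are frequency tables keyed by the hour of creation. As a minimal sketch of an alternative, `collections.defaultdict` from the standard library removes the need for the `if`/`else` branch; the sample rows and the names `post_counts` and `comment_totals` are hypothetical.

```python
from collections import defaultdict
from datetime import datetime as dt

# Hypothetical [created_at, num_comments] pairs in the dataset's date format.
sample_rows = [
    ['9/26/2016 2:53', 7],
    ['9/26/2016 2:10', 3],
    ['9/25/2016 15:05', 20],
]

# Missing keys start at 0, so no membership check is required.
post_counts = defaultdict(int)
comment_totals = defaultdict(int)

for created_at, num_comments in sample_rows:
    hour = dt.strptime(created_at, '%m/%d/%Y %H:%M').strftime('%H')
    post_counts[hour] += 1
    comment_totals[hour] += num_comments

print(dict(post_counts))
print(dict(comment_totals))
```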


In [47]:
# Compute the average number of comments per post for each hour
avg_by_hour = []
for hour in posts_by_hour:
    avg = comments_by_hour[hour] / posts_by_hour[hour]
    avg_by_hour.append([int(hour), avg])
print(avg_by_hour)

[[2, 11.137546468401487], [1, 7.407801418439717], [22, 8.804177545691905], [21, 8.687258687258687], [19, 7.163043478260869], [17, 9.449744463373083], [15, 28.676470588235293], [14, 9.692007797270955], [13, 16.31756756756757], [11, 8.96474358974359], [10, 10.684397163120567], [9, 6.653153153153153], [7, 7.013274336283186], [3, 7.948339483394834], [23, 6.696793002915452], [20, 8.749019607843136], [16, 7.713298791018998], [8, 9.190661478599221], [0, 7.5647840531561465], [18, 7.94299674267101], [12, 12.380116959064328], [4, 9.7119341563786], [6, 6.782051282051282], [5, 8.794258373205741]]


In [66]:
sorted_avg_by_hour = sorted(avg_by_hour)
print(sorted_avg_by_hour)

[[0, 7.5647840531561465], [1, 7.407801418439717], [2, 11.137546468401487], [3, 7.948339483394834], [4, 9.7119341563786], [5, 8.794258373205741], [6, 6.782051282051282], [7, 7.013274336283186], [8, 9.190661478599221], [9, 6.653153153153153], [10, 10.684397163120567], [11, 8.96474358974359], [12, 12.380116959064328], [13, 16.31756756756757], [14, 9.692007797270955], [15, 28.676470588235293], [16, 7.713298791018998], [17, 9.449744463373083], [18, 7.94299674267101], [19, 7.163043478260869], [20, 8.749019607843136], [21, 8.687258687258687], [22, 8.804177545691905], [23, 6.696793002915452]]


In [49]:
# Swap the columns so that the list can be sorted by average comments
swap_avg_by_hour = []
for row in avg_by_hour:
    hour = row[0]
    avg = row[1]
    swap_avg_by_hour.append([avg, hour])
print(swap_avg_by_hour)

[[11.137546468401487, 2], [7.407801418439717, 1], [8.804177545691905, 22], [8.687258687258687, 21], [7.163043478260869, 19], [9.449744463373083, 17], [28.676470588235293, 15], [9.692007797270955, 14], [16.31756756756757, 13], [8.96474358974359, 11], [10.684397163120567, 10], [6.653153153153153, 9], [7.013274336283186, 7], [7.948339483394834, 3], [6.696793002915452, 23], [8.749019607843136, 20], [7.713298791018998, 16], [9.190661478599221, 8], [7.5647840531561465, 0], [7.94299674267101, 18], [12.380116959064328, 12], [9.7119341563786, 4], [6.782051282051282, 6], [8.794258373205741, 5]]


In [67]:
sorted_swap = sorted(swap_avg_by_hour, reverse = True)
print(sorted_swap)

[[28.676470588235293, 15], [16.31756756756757, 13], [12.380116959064328, 12], [11.137546468401487, 2], [10.684397163120567, 10], [9.7119341563786, 4], [9.692007797270955, 14], [9.449744463373083, 17], [9.190661478599221, 8], [8.96474358974359, 11], [8.804177545691905, 22], [8.794258373205741, 5], [8.749019607843136, 20], [8.687258687258687, 21], [7.948339483394834, 3], [7.94299674267101, 18], [7.713298791018998, 16], [7.5647840531561465, 0], [7.407801418439717, 1], [7.163043478260869, 19], [7.013274336283186, 7], [6.782051282051282, 6], [6.696793002915452, 23], [6.653153153153153, 9]]
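
Swapping the columns works because `sorted` compares lists element by element, starting with the first. A minimal sketch of an alternative is to leave the `[hour, average]` pairs as they are and pass a `key` function to `sorted`. The sample list is a small subset of the hours above with averages rounded to two decimals, and the names `sample_avg_by_hour` and `top_hours` are hypothetical.

```python
# Subset of the [hour, average] pairs above, rounded for readability.
sample_avg_by_hour = [[2, 11.14], [15, 28.68], [9, 6.65], [13, 16.32]]

# Sort by the second element (the average), highest first.
top_hours = sorted(sample_avg_by_hour, key=lambda pair: pair[1], reverse=True)

for hour, avg in top_hours:
    print(f'{hour:02d}:00: {avg:.2f} average comments per post')
```

This keeps the hour in the first position, so no swapped copy of the list is needed.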


In [65]:
# Print the five hours with the highest average number of comments
print('Top 5 hours for Ask Posts Comments')
for avg_comments, hour in sorted_swap[:5]:
    hour_object = dt.strptime(str(hour), '%H')
    formatted_hour = hour_object.strftime('%H:%M')
    print(f'{formatted_hour}: {avg_comments:.2f} average comments per post')

Top 5 hours for Ask Posts Comments
15:00: 28.68 average comments per post
13:00: 16.32 average comments per post
12:00: 12.38 average comments per post
02:00: 11.14 average comments per post
10:00: 10.68 average comments per post


### We've answered objective 2:

2. Ask HN posts created during hours 15, 13, 12, 2, and 10 have the highest average number of comments per post.

If your goal is to receive as many comments as possible, these averages suggest that posting an Ask HN question during one of these hours may help.

---