# Hacker News Posts Analysis

This analysis compares two types of Hacker News posts, "Ask HN" and "Show HN", and asks two questions:

* Which of these two types receives more comments on average?
* Do posts created at a certain time receive more comments on average?

To answer these questions, we use a dataset file, `hacker_news.csv`, containing posts submitted to the Hacker News website.

The dataset has the following fields:
* id -- Unique identifier for the Hacker News post.
* title -- Title of the post.
* url -- URL the post links to, if the post has a URL.
* num_points -- Number of points the post acquired, calculated as the total number of upvotes minus the total number of downvotes.
* num_comments -- Number of comments made on the post.
* author -- Username of the author who submitted the post.
* created_at -- Date and time the post was submitted.
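
The cells below address columns by position (for example, index 4 for `num_comments`), so it can help to see the same fields read by name. The sketch below is one possible alternative, not the approach used in this notebook: it runs `csv.DictReader` on a small in-memory sample copied from two rows of the dataset. The `sample` string and the variable names are assumptions for illustration only.

```python
import csv
import io

# In-memory stand-in for hacker_news.csv (header plus two rows from the dataset).
sample = io.StringIO(
    "id,title,url,num_points,num_comments,author,created_at\n"
    "12579005,SQLAR  the SQLite Archiver,"
    "https://www.sqlite.org/sqlar/doc/trunk/README.md,1,0,blacksqr,9/26/2016 3:24\n"
    "12578989,algorithmic music,"
    "http://cacm.acm.org/magazines/2011/7/109891-algorithmic-composition/fulltext,"
    "1,0,poindontcare,9/26/2016 3:16\n"
)

# DictReader uses the header row as keys, so each field is read by name.
rows = list(csv.DictReader(sample))

for row in rows:
    # Values are still strings; convert numeric fields explicitly.
    print(row["title"], int(row["num_comments"]), row["created_at"])
```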

In [1]:
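from csv import reader

# Read the whole file into a list of rows; the context manager closes the file.
with open("hacker_news.csv", encoding="UTF8", newline="") as f:
    hn = list(reader(f))

print(hn[:6])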

[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'], ['12579008', 'You have two days to comment if you want stem cells to be classified as your own', 'http://www.regulations.gov/document?D=FDA-2015-D-3719-0018', '1', '0', 'altstar', '9/26/2016 3:26'], ['12579005', 'SQLAR  the SQLite Archiver', 'https://www.sqlite.org/sqlar/doc/trunk/README.md', '1', '0', 'blacksqr', '9/26/2016 3:24'], ['12578997', 'What if we just printed a flatscreen television on the side of our boxes?', 'https://medium.com/vanmoof/our-secrets-out-f21c1f03fdc8#.ietxmez43', '1', '0', 'pavel_lishin', '9/26/2016 3:19'], ['12578989', 'algorithmic music', 'http://cacm.acm.org/magazines/2011/7/109891-algorithmic-composition/fulltext', '1', '0', 'poindontcare', '9/26/2016 3:16'], ['12578979', 'How the Data Vault Enables the Next-Gen Data Warehouse and Data Lake', 'https://www.talend.com/blog/2016/05/12/talend-and-Â\x93the-data-vaultÂ\x94', '1', '0', 'markgainor1', '9/26/2016 3:14']]


In [2]:
headers = hn[0]
hn = hn[1:]

print(headers)
print(hn[:5])

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
[['12579008', 'You have two days to comment if you want stem cells to be classified as your own', 'http://www.regulations.gov/document?D=FDA-2015-D-3719-0018', '1', '0', 'altstar', '9/26/2016 3:26'], ['12579005', 'SQLAR  the SQLite Archiver', 'https://www.sqlite.org/sqlar/doc/trunk/README.md', '1', '0', 'blacksqr', '9/26/2016 3:24'], ['12578997', 'What if we just printed a flatscreen television on the side of our boxes?', 'https://medium.com/vanmoof/our-secrets-out-f21c1f03fdc8#.ietxmez43', '1', '0', 'pavel_lishin', '9/26/2016 3:19'], ['12578989', 'algorithmic music', 'http://cacm.acm.org/magazines/2011/7/109891-algorithmic-composition/fulltext', '1', '0', 'poindontcare', '9/26/2016 3:16'], ['12578979', 'How the Data Vault Enables the Next-Gen Data Warehouse and Data Lake', 'https://www.talend.com/blog/2016/05/12/talend-and-Â\x93the-data-vaultÂ\x94', '1', '0', 'markgainor1', '9/26/2016 3:14']]


In [3]:
ask_posts = []
show_posts = []
other_posts = []

for row in hn:
    # Lowercase once so the prefix checks are case-insensitive.
    title = row[1].lower()
    if title.startswith("ask hn"):
        ask_posts.append(row)
    elif title.startswith("show hn"):
        show_posts.append(row)
    else:
        other_posts.append(row)

print(len(ask_posts))
print(len(show_posts))
print(len(other_posts))

9139
10158
273822


In [4]:
total_ask_comments = 0

for ask_post in ask_posts:
    total_ask_comments += int(ask_post[4])

avg_ask_comments = total_ask_comments / len(ask_posts)
print(avg_ask_comments)

10.393478498741656


In [5]:
total_show_comments = 0

for show_post in show_posts:
    total_show_comments += int(show_post[4])

avg_show_comments = total_show_comments / len(show_posts)
print(avg_show_comments)

4.886099625910612


## Analysis of Show Posts vs Ask Posts

Ask HN posts receive more comments on average than Show HN posts. Ask posts average about 10.4 comments per post, versus about 4.9 comments per show post.
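
The two averaging cells above repeat the same loop. As a minimal sketch, the calculation could be wrapped in a helper. The function name `average_comments`, the empty-list guard, and the tiny `demo_posts` rows are assumptions introduced here, not part of the original notebook.

```python
def average_comments(posts, comments_index=4):
    """Return the mean of the num_comments column for a list of rows."""
    if not posts:
        # Avoid ZeroDivisionError when a category has no posts.
        return 0.0
    total = sum(int(post[comments_index]) for post in posts)
    return total / len(posts)


# Hypothetical rows in the same column order as the dataset.
demo_posts = [
    ["1", "Ask HN: example A", "", "3", "6", "user_a", "9/26/2016 3:26"],
    ["2", "Ask HN: example B", "", "1", "2", "user_b", "9/26/2016 15:10"],
]

print(average_comments(demo_posts))  # 4.0
```

With the notebook's lists, `average_comments(ask_posts)` and `average_comments(show_posts)` would be expected to reproduce the two averages printed above.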

In [6]:
import datetime as dt

result_list = []

for ask_post in ask_posts:
    result_list.append([ask_post[6], int(ask_post[4])])

counts_by_hour = {}
comments_by_hour = {}

for result in result_list:
    # Extract the zero-padded hour (e.g. "02", "15") from the timestamp string.
    hr = dt.datetime.strptime(result[0], "%m/%d/%Y %H:%M").strftime("%H")

    if hr in counts_by_hour:
        counts_by_hour[hr] += 1
        comments_by_hour[hr] += result[1]
    else:
        counts_by_hour[hr] = 1
        comments_by_hour[hr] = result[1]

#print(counts_by_hour)
print(comments_by_hour)

{'02': 2996, '01': 2089, '22': 3372, '21': 4500, '19': 3954, '17': 5547, '15': 18525, '14': 4972, '13': 7245, '11': 2797, '10': 3013, '09': 1477, '07': 1585, '03': 2154, '23': 2297, '20': 4462, '16': 4466, '08': 2362, '00': 2277, '18': 4877, '12': 4234, '04': 2360, '06': 1587, '05': 1838}


In [7]:
avg_by_hour = []

for hr in counts_by_hour:
    avg_by_hour.append([hr, comments_by_hour[hr] / counts_by_hour[hr]])

print(avg_by_hour)

[['02', 11.137546468401487], ['01', 7.407801418439717], ['22', 8.804177545691905], ['21', 8.687258687258687], ['19', 7.163043478260869], ['17', 9.449744463373083], ['15', 28.676470588235293], ['14', 9.692007797270955], ['13', 16.31756756756757], ['11', 8.96474358974359], ['10', 10.684397163120567], ['09', 6.653153153153153], ['07', 7.013274336283186], ['03', 7.948339483394834], ['23', 6.696793002915452], ['20', 8.749019607843136], ['16', 7.713298791018998], ['08', 9.190661478599221], ['00', 7.5647840531561465], ['18', 7.94299674267101], ['12', 12.380116959064328], ['04', 9.7119341563786], ['06', 6.782051282051282], ['05', 8.794258373205741]]


In [8]:
swap_avg_by_hour = []

for row in avg_by_hour:
    swap_avg_by_hour.append([row[1], row[0]])

print(swap_avg_by_hour)

sorted_swap = sorted(swap_avg_by_hour, reverse=True)
print("Top 5 Hours for Ask Posts Comments")
for row in sorted_swap[:5]:
    format_str = "{0}: {1:.2f} average comments per post."
    dt_hour = dt.datetime.strptime(row[1], "%H")
    str_hour = dt_hour.strftime("%H:%M")
    print(format_str.format(str_hour, row[0]))


[[11.137546468401487, '02'], [7.407801418439717, '01'], [8.804177545691905, '22'], [8.687258687258687, '21'], [7.163043478260869, '19'], [9.449744463373083, '17'], [28.676470588235293, '15'], [9.692007797270955, '14'], [16.31756756756757, '13'], [8.96474358974359, '11'], [10.684397163120567, '10'], [6.653153153153153, '09'], [7.013274336283186, '07'], [7.948339483394834, '03'], [6.696793002915452, '23'], [8.749019607843136, '20'], [7.713298791018998, '16'], [9.190661478599221, '08'], [7.5647840531561465, '00'], [7.94299674267101, '18'], [12.380116959064328, '12'], [9.7119341563786, '04'], [6.782051282051282, '06'], [8.794258373205741, '05']]
Top 5 Hours for Ask Posts Comments
15:00: 28.68 average comments per post.
13:00: 16.32 average comments per post.
12:00: 12.38 average comments per post.
02:00: 11.14 average comments per post.
10:00: 10.68 average comments per post.


## Analysis of Average Comments per Hour

On average, Ask HN posts created in the 15:00 hour (3:00 pm EST, according to the dataset documentation) receive the most comments, at 28.68 comments per post. That is about 76% more than the second-highest hour, 13:00, at 16.32 comments per post.
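
To make the gap between the top two hours concrete, the sketch below sorts a few of the hourly averages printed above directly by value (using `key=` rather than swapping columns) and computes the relative increase. The `top_averages` list is a hand-copied subset of the output, and the variable names are assumptions for illustration.

```python
# Subset of avg_by_hour copied from the output above: [hour, average comments].
top_averages = [
    ["02", 11.137546468401487],
    ["15", 28.676470588235293],
    ["13", 16.31756756756757],
    ["12", 12.380116959064328],
]

# Sort by the average (second element), largest first.
ranked = sorted(top_averages, key=lambda pair: pair[1], reverse=True)

best_hour, best_avg = ranked[0]
second_hour, second_avg = ranked[1]

# Relative increase of the best hour over the runner-up, as a percentage.
increase_pct = (best_avg / second_avg - 1) * 100

print(f"{best_hour}:00 vs {second_hour}:00: {increase_pct:.0f}% more comments per post")
```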

## Conclusion

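Ask HN posts receive more comments on average than Show HN posts (about 10.39 versus 4.89 comments per post). Among Ask HN posts, those created in the 15:00 hour receive the most comments on average (28.68 per post), followed by 13:00 (16.32) and 12:00 (12.38).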