# Exploring Hacker News Posts

Hacker News is a site where user-submitted stories (posts) are voted and commented on, similar to Reddit. Posts that get more votes or comments make it to the top of the Hacker News listing and can attract more visitors.

In this project we explore the Hacker News dataset to see which type of post (Ask HN or Show HN) receives more comments on average, and whether posts created at a certain time receive more comments.

Importing required modules

In [24]:
import csv
import datetime as dt

Reading the data from the CSV file into a list of lists.

In [10]:
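# Use a context manager so the file is closed after reading;
# newline='' is the usage recommended by the csv module documentation
with open('hacker_news.csv', 'r', newline='') as file:
    reader = csv.reader(file)
    hn = list(reader)
print(hn[:3])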

[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'], ['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30']]


Separating the header row from the actual data

In [11]:
headers = hn[0]
hn = hn[1:]

print(headers)
print('\n')
print(hn[:1])

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']


[['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52']]


Now let us extract the posts beginning with 'Ask HN' and 'Show HN' into separate lists.

In [13]:
ask_posts = []
show_posts = []
other_posts = []

for row in hn:
    title = row[1]
    title = title.lower()
    if title.startswith('ask hn'):
        ask_posts.append(row)
    elif title.startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)
        
print(len(ask_posts))
print(len(show_posts))
print(len(other_posts))

1744
1162
17194
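The prefix check above can also be wrapped in a small function, which makes it easy to test on individual titles. This is only a sketch: the name `classify_title` is hypothetical and not part of the original notebook, and the sample titles are copied from rows in the dataset.

```python
def classify_title(title):
    """Return 'ask', 'show', or 'other' based on the title prefix (case-insensitive)."""
    lowered = title.lower()
    if lowered.startswith('ask hn'):
        return 'ask'
    if lowered.startswith('show hn'):
        return 'show'
    return 'other'

# Titles copied from rows in the dataset
print(classify_title('Ask HN: How to improve my personal website?'))  # ask
print(classify_title('Show HN: Something pointless I made'))          # show
print(classify_title('Interactive Dynamic Video'))                    # other
```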


Let us see the first few rows of each to verify.

In [14]:
print(ask_posts[:2])

[['12296411', 'Ask HN: How to improve my personal website?', '', '2', '6', 'ahmedbaracat', '8/16/2016 9:55'], ['10610020', 'Ask HN: Am I the only one outraged by Twitter shutting down share counts?', '', '28', '29', 'tkfx', '11/22/2015 13:43']]


We see that the title (element at index 1) of each row starts with 'Ask HN'.

In [15]:
print(show_posts[:2])

[['10627194', 'Show HN: Wio Link  ESP8266 Based Web of Things Hardware Development Platform', 'https://iot.seeed.cc', '26', '22', 'kfihihc', '11/25/2015 14:03'], ['10646440', 'Show HN: Something pointless I made', 'http://dn.ht/picklecat/', '747', '102', 'dhotson', '11/29/2015 22:46']]


We see that the title (element at index 1) of each row starts with 'Show HN'.

In [16]:
print(other_posts[:2])

[['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30']]


We see that the title (element at index 1) of each row starts with neither 'Ask HN' nor 'Show HN'.

Now let us proceed and see which type of post (Ask HN or Show HN) receives more comments on average.

In [22]:
total_ask_comments = 0
for row in ask_posts:
    num_comments = int(row[4]) #Since 'num_comments' is at index 4
    total_ask_comments += num_comments
avg_ask_comments = total_ask_comments / len(ask_posts)
print('Average no of comments for Ask HN posts is: {}'.format(round(avg_ask_comments,2)))

Average no of comments for Ask HN posts is: 14.04


In [23]:
total_show_comments = 0
for row in show_posts:
    num_comments = int(row[4]) #Since 'num_comments' is at index 4
    total_show_comments += num_comments
avg_show_comments = total_show_comments / len(show_posts)
print('Average no of comments for Show HN posts is : {}'.format(round(avg_show_comments, 2)))

Average no of comments for Show HN posts is : 10.32


The findings above show that 'Ask HN' posts receive more comments on average (14.04) than 'Show HN' posts (10.32).
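The two averaging loops differ only in the list they iterate over, so the calculation could be factored into a helper. The following is a minimal sketch, assuming a hypothetical function name `average_comments` (not in the original notebook) and using two 'Ask HN' rows copied from the earlier output as sample data.

```python
def average_comments(rows, comments_index=4):
    """Return the mean of the num_comments column for a list of rows."""
    if not rows:
        return 0.0  # avoid ZeroDivisionError on an empty list
    total = sum(int(row[comments_index]) for row in rows)
    return total / len(rows)

# Two 'Ask HN' rows copied from the output shown earlier
sample_ask = [
    ['12296411', 'Ask HN: How to improve my personal website?', '', '2', '6',
     'ahmedbaracat', '8/16/2016 9:55'],
    ['10610020', 'Ask HN: Am I the only one outraged by Twitter shutting down share counts?',
     '', '28', '29', 'tkfx', '11/22/2015 13:43'],
]
print(average_comments(sample_ask))  # (6 + 29) / 2 = 17.5
```

Calling `average_comments(ask_posts)` and `average_comments(show_posts)` on the full lists should then give the same averages as the two loops.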

Now that we know 'Ask HN' posts are likely to receive more comments, let us focus on these posts and find out whether 'Ask HN' posts created at a certain time are more likely to attract comments.

To do this, let us first calculate the number of posts created in each hour and the number of comments those posts received.

For this we need the 'created_at' and 'num_comments' values, which are at index 6 and index 4 respectively. Let us first extract these values into a separate list called 'result_list'.

In [40]:
result_list = []
date_format = '%m/%d/%Y %H:%M' #Format for parsing into datetime object
for row in ask_posts:
    date = row[6]
    dt_obj = dt.datetime.strptime(date, date_format)
    hour = dt_obj.strftime('%H') #Extracting the hour from datetime object
    num_comments = int(row[4])
    result_list.append([hour, num_comments])
    
print(result_list[:10])

[['09', 6], ['13', 29], ['10', 1], ['14', 3], ['16', 17], ['23', 1], ['12', 4], ['09', 1], ['17', 1], ['17', 2]]


The output above shows that, for each 'Ask HN' post, we extracted the hour of creation (in 24-hour format) and the number of comments the post received, stored as a list of lists.
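As a side note on the hour extraction, `strftime('%H')` returns a zero-padded string, while the `hour` attribute of a datetime object returns an integer. The sketch below parses one 'created_at' value copied from the dataset to show the difference; the variable names `hour_str` and `hour_int` are illustrative only.

```python
import datetime as dt

date_format = '%m/%d/%Y %H:%M'
dt_obj = dt.datetime.strptime('8/16/2016 9:55', date_format)

hour_str = dt_obj.strftime('%H')  # zero-padded string, as used in this notebook
hour_int = dt_obj.hour            # plain integer alternative

print(repr(hour_str), hour_int)  # '09' 9
```

The string form keeps the dictionary keys aligned with the 'HH:00' display used later, whereas the integer form would sort numerically without padding.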

Now, from 'result_list', let us build two frequency tables: 'counts_by_hour', holding the number of posts created in each hour, and 'comments_by_hour', holding the total number of comments received by the posts created in each hour.

In [42]:
counts_by_hour = {}
comments_by_hour = {}

for row in result_list:
    hour = row[0]
    num_comments = row[1]
    if hour not in counts_by_hour:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = num_comments
    else:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += num_comments

In [43]:
print(counts_by_hour)

{'09': 45, '13': 85, '10': 59, '14': 107, '16': 108, '23': 68, '12': 73, '17': 100, '15': 116, '21': 109, '20': 80, '02': 58, '18': 109, '03': 54, '05': 46, '19': 110, '01': 60, '22': 71, '08': 48, '04': 47, '00': 55, '06': 44, '07': 34, '11': 58}


In [44]:
print(comments_by_hour)

{'09': 251, '13': 1253, '10': 793, '14': 1416, '16': 1814, '23': 543, '12': 687, '17': 1146, '15': 4477, '21': 1745, '20': 1722, '02': 1381, '18': 1439, '03': 421, '05': 464, '19': 1188, '01': 683, '22': 479, '08': 492, '04': 337, '00': 447, '06': 397, '07': 267, '11': 641}
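The same two frequency tables could also be built with `collections.defaultdict`, which removes the need for the `if hour not in ...` check. This is a sketch only, run on the first ten pairs copied from the 'result_list' output; the names `sample`, `sample_counts` and `sample_comments` are hypothetical and chosen so they do not overwrite the notebook's variables.

```python
from collections import defaultdict

# First ten [hour, num_comments] pairs copied from the result_list output
sample = [['09', 6], ['13', 29], ['10', 1], ['14', 3], ['16', 17],
          ['23', 1], ['12', 4], ['09', 1], ['17', 1], ['17', 2]]

sample_counts = defaultdict(int)    # missing keys start at 0
sample_comments = defaultdict(int)

for hour, num_comments in sample:
    sample_counts[hour] += 1
    sample_comments[hour] += num_comments

print(sample_counts['09'], sample_comments['09'])  # 2 posts, 6 + 1 = 7 comments
```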


Now, using the two frequency tables (dictionaries) above, let us compute the average number of comments per post for each hour of the day.

For each hour, this average is the total number of comments on posts created in that hour (from the 'comments_by_hour' dictionary) divided by the number of posts created in that hour (from the 'counts_by_hour' dictionary).

In [47]:
avg_by_hour = []

for hour in counts_by_hour:
    average = round(comments_by_hour[hour]/counts_by_hour[hour], 2)
    avg_by_hour.append([hour, average])

print(avg_by_hour)

[['09', 5.58], ['13', 14.74], ['10', 13.44], ['14', 13.23], ['16', 16.8], ['23', 7.99], ['12', 9.41], ['17', 11.46], ['15', 38.59], ['21', 16.01], ['20', 21.52], ['02', 23.81], ['18', 13.2], ['03', 7.8], ['05', 10.09], ['19', 10.8], ['01', 11.38], ['22', 6.75], ['08', 10.25], ['04', 7.17], ['00', 8.13], ['06', 9.02], ['07', 7.85], ['11', 11.05]]


The results above contain what we need, but they are hard to read in this format.

Let us sort the 'avg_by_hour' list by the average number of comments in descending order, so that the top hours appear first.

In [50]:
avg_by_hour_sorted = sorted(avg_by_hour, key=lambda x: x[1], reverse=True)

print(avg_by_hour_sorted)

[['15', 38.59], ['02', 23.81], ['20', 21.52], ['16', 16.8], ['21', 16.01], ['13', 14.74], ['10', 13.44], ['14', 13.23], ['18', 13.2], ['17', 11.46], ['01', 11.38], ['11', 11.05], ['19', 10.8], ['08', 10.25], ['05', 10.09], ['12', 9.41], ['06', 9.02], ['00', 8.13], ['23', 7.99], ['07', 7.85], ['03', 7.8], ['04', 7.17], ['22', 6.75], ['09', 5.58]]
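The averaging and sorting steps can also be written with a dictionary comprehension followed by `sorted` on the dictionary items. The sketch below uses only three hours, with counts and comment totals copied from the frequency tables printed earlier; the names `counts`, `comments`, `averages` and `ranked` are illustrative and not part of the original notebook.

```python
# Values for three hours copied from counts_by_hour and comments_by_hour
counts = {'09': 45, '15': 116, '02': 58}
comments = {'09': 251, '15': 4477, '02': 1381}

# Average comments per post for each hour, rounded to two decimals
averages = {hour: round(comments[hour] / counts[hour], 2) for hour in counts}

# Sort (hour, average) pairs by the average, highest first
ranked = sorted(averages.items(), key=lambda pair: pair[1], reverse=True)

print(ranked)  # [('15', 38.59), ('02', 23.81), ('09', 5.58)]
```

Working through one value by hand, 4477 comments over 116 posts gives about 38.59 comments per post, which matches the top entry of the sorted list.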


Now let us print the results in a more readable way.

In [57]:
print("Top 5 Hours for Ask Posts Comments\n")
for row in avg_by_hour_sorted[:5]:
    print('{}:00 : {} average comments per post'.format(row[0], row[1]))

Top 5 Hours for Ask Posts Comments

15:00 : 38.59 average comments per post
02:00 : 23.81 average comments per post
20:00 : 21.52 average comments per post
16:00 : 16.8 average comments per post
21:00 : 16.01 average comments per post


According to the documentation of the dataset, the creation times are in EST (Eastern Standard Time).

Thus, 'Ask HN' posts created during the 15:00 hour (3:00 PM to 3:59 PM EST) receive the most comments on average (38.59 per post), so posting in that hour gives a good chance of attracting more comments.
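To read this result in another time zone, the top hour can be converted with the standard `datetime` module. This is a minimal sketch under a stated assumption: the timestamps are treated as a fixed UTC-5 offset (EST) with no daylight saving adjustment, which may not hold for every row in the dataset. The names `EST`, `top_hour_est` and `top_hour_utc` are illustrative.

```python
import datetime as dt

# Assumption: timestamps are at a fixed UTC-5 offset (EST), ignoring daylight saving
EST = dt.timezone(dt.timedelta(hours=-5), 'EST')

# The date is arbitrary; only the hour matters here
top_hour_est = dt.datetime(2016, 1, 1, 15, 0, tzinfo=EST)
top_hour_utc = top_hour_est.astimezone(dt.timezone.utc)

print(top_hour_utc.strftime('%H:%M'))  # 20:00
```

Replacing `dt.timezone.utc` with another fixed-offset `dt.timezone` object would give the equivalent local hour for that offset.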