# Data Analysis of Hacker News submissions

In this project, we'll analyze Hacker News submissions of two main kinds:
- Posts that start with "Ask HN"
- Posts that start with "Show HN"

"Ask HN" posts are posts where the author asks the community a question, such as "Is Tableau easier to use than Excel?"

"Show HN" posts are those where the author either showcases something they've made or shares something of interest with the community.

Our goals are to find out the following:

1) Do "Ask HN" or "Show HN" posts receive more comments on average?

2) Does the time of day a submission is posted have any bearing on the number of comments it receives?

In this project we'll work with a dataset that originally consisted of 300,000 entries. It was reduced to roughly 20,000 (20,100 rows in the file used here) by filtering out submissions that didn't receive any comments and then randomly sampling from the remaining posts.
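The reduction described above (drop comment-less submissions, then sample at random) could look roughly like the sketch below. This is not the code that produced `hacker_news.csv`; `reduce_dataset`, the seed, and the synthetic rows are assumptions made for illustration.

```python
import random


def reduce_dataset(rows, sample_size, seed=0):
    """Drop rows with zero comments, then draw a random sample without replacement."""
    commented = [row for row in rows if int(row['num_comments']) > 0]
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    return rng.sample(commented, min(sample_size, len(commented)))


# Tiny synthetic stand-in for the full dump (all values are made up).
rows = [{'id': str(i), 'num_comments': str(i % 3)} for i in range(30)]
subset = reduce_dataset(rows, sample_size=5)
print(len(subset))
```

Sampling after filtering means every row in the reduced file has at least one comment, which matters when interpreting the averages later on.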


In [1]:
from csv import reader

# Read the whole CSV into a list of rows; the with-block closes the file afterwards.
with open('hacker_news.csv') as opened_file:
    hn = list(reader(opened_file))

print(hn[:5])

[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'], ['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']]


The dataset comprises the following columns:


- id: the unique identifier of the post

- title: the title of the post

- url: the url of the item being linked to

- num_points: the number of upvotes the post received

- num_comments: the number of comments the post received

- author: the name of the account that made the post

- created_at: the date and time the post was made (the time zone is Eastern Time in the US)
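As an aside, the columns listed above can also be accessed by name rather than by position using `csv.DictReader` from the standard library. The sketch below is an alternative to the index-based access used in the rest of this notebook; it reads two lines copied from the file (held in an in-memory string, which is an assumption made so the example runs on its own).

```python
import csv
import io

# Header and first data row copied from hacker_news.csv, used here
# in place of the real file so the sketch is self-contained.
sample = io.StringIO(
    "id,title,url,num_points,num_comments,author,created_at\n"
    "12224879,Interactive Dynamic Video,"
    "http://www.interactivedynamicvideo.com/,386,52,ne0phyte,8/4/2016 11:52\n"
)

rows = list(csv.DictReader(sample))
first = rows[0]

# Every value is still a string, exactly as with csv.reader.
print(first['title'], first['num_comments'], first['created_at'])
```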

In [2]:
headers = hn[:1]

hn = hn[1:]

print(headers)
print('\n')
print(hn[:2])

[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']]


[['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30']]


In [6]:
ask_posts = []
show_posts = []
other_posts = []

for each in hn:
    title = each[1]
    title = title.lower()
    if title.startswith('ask hn'):
        ask_posts.append(each)
    elif title.startswith('show hn'):
        show_posts.append(each)
    else:
        other_posts.append(each)
        
print(len(ask_posts))
print(len(show_posts))
print(len(other_posts))

1744
1162
17194


Now that we've split up the "Ask HN" and "Show HN" posts, we'll tally the total number of comments each list received and then work out the averages.


In [11]:
total_ask_comments = 0
for each in ask_posts:
    comments = float(each[4])
    total_ask_comments += comments
    
avg_ask_comments = total_ask_comments / len(ask_posts)

print('Average number of comments "ask hn" posts receive: ', avg_ask_comments)

total_show_comments = 0
for each in show_posts:
    comments = float(each[4])
    total_show_comments += comments

avg_show_comments = total_show_comments / len(show_posts)

print('Average number of comments "show hn" posts receive: ', avg_show_comments)

Average number of comments "ask hn" posts receive:  14.038417431192661
Average number of comments "show hn" posts receive:  10.31669535283993


On average, "Ask HN" posts receive about 14.04 comments, whereas "Show HN" posts receive about 10.32 comments. This is a sizeable difference: "Ask HN" posts receive about 36% more comments per post.
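The comparison can be checked with a few lines of arithmetic. The sketch below wraps the averaging loop in a helper and computes the relative difference from the two averages printed above. `average_comments` and the demo rows are hypothetical and only illustrate the row layout.

```python
def average_comments(posts, comments_index=4):
    """Mean of the num_comments column (index 4 in this dataset's rows)."""
    total = sum(int(post[comments_index]) for post in posts)
    return total / len(posts)


# Hypothetical rows in the same layout as hn (all values are made up).
demo_posts = [
    ['1', 'Ask HN: example A', '', '5', '10', 'user_a', '1/1/2016 10:00'],
    ['2', 'Ask HN: example B', '', '3', '20', 'user_b', '1/2/2016 11:00'],
]
demo_average = average_comments(demo_posts)

# Relative difference between the two averages reported in the output above.
avg_ask = 14.038417431192661
avg_show = 10.31669535283993
percent_more = (avg_ask / avg_show - 1) * 100

print(demo_average, round(percent_more, 1))
```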

It's possible that this is because "Ask HN" posts inherently invite more discussion: people chime in with their opinions and then respond to one another's suggestions.

As such, the rest of the analysis will focus only on "Ask HN" posts.

# Finding out comments received based on time of day posted

Next, we'll find out at which hour of the day an "Ask HN" post should be submitted to receive the most comments on average.
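Because `created_at` is stored as a string such as `'8/4/2016 11:52'`, the hour has to be parsed out of it first. A minimal sketch of that step is shown here, using the same format string as the cell that follows; `hour_of` is a hypothetical helper name.

```python
import datetime as dt


def hour_of(created_at):
    """Parse a created_at string such as '8/4/2016 11:52' and return the hour as an int."""
    parsed = dt.datetime.strptime(created_at, "%m/%d/%Y %H:%M")
    return parsed.hour


# Both timestamps are taken from the rows printed at the top of the notebook.
print(hour_of('8/4/2016 11:52'))
print(hour_of('6/17/2016 0:01'))
```

The analysis itself keeps the hour as a zero-padded string (via `strftime('%H')`), which makes the dictionary keys sort naturally as text.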

In [16]:
import datetime as dt

result_list = []

for each in ask_posts:
    created = each[6]
    comments = int(each[4])
    result_list.append([created, comments])
    
counts_by_hour = {}
comments_by_hour = {}

for each in result_list:
    comments = each[1]
    date = each[0]
    date = dt.datetime.strptime(date, "%m/%d/%Y %H:%M")
    hour = date.strftime('%H')
    if hour not in counts_by_hour:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = comments
    else:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += comments
        
print(comments_by_hour)

{'12': 687, '22': 479, '00': 447, '04': 337, '01': 683, '20': 1722, '09': 251, '15': 4477, '14': 1416, '23': 543, '11': 641, '10': 793, '06': 397, '17': 1146, '21': 1745, '08': 492, '13': 1253, '18': 1439, '03': 421, '19': 1188, '16': 1814, '05': 464, '02': 1381, '07': 267}


Now that we have the total comments by hour and the number of posts by hour, we can determine the average number of comments received per post for each hour.
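For comparison, the tallying and averaging steps can be combined using `collections.defaultdict`, which removes the need for the `if hour not in ...` branch. This is an alternative sketch, not the approach used in this notebook; `average_by_hour` and the demo pairs are assumptions.

```python
from collections import defaultdict
import datetime as dt


def average_by_hour(pairs):
    """Map each hour ('00'-'23') to the mean comment count of posts created in it.

    pairs is a list of [created_at, num_comments] items, like result_list.
    """
    counts = defaultdict(int)
    totals = defaultdict(int)
    for created_at, comments in pairs:
        hour = dt.datetime.strptime(created_at, "%m/%d/%Y %H:%M").strftime("%H")
        counts[hour] += 1
        totals[hour] += comments
    return {hour: totals[hour] / counts[hour] for hour in counts}


# Made-up pairs, only to show the shape of the input and output.
demo = [['1/1/2016 15:05', 10], ['2/3/2016 15:40', 30], ['2/3/2016 2:10', 4]]
demo_averages = average_by_hour(demo)
print(demo_averages)
```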

In [20]:
hours = []
for each in comments_by_hour:
    hours.append(each)


avg_by_hour = []

for each in hours:
    comments = comments_by_hour[each]
    counts = counts_by_hour[each]
    
    average = comments/counts
    avg_by_hour.append([each, average])
    
print(avg_by_hour)

[['12', 9.41095890410959], ['22', 6.746478873239437], ['00', 8.127272727272727], ['04', 7.170212765957447], ['01', 11.383333333333333], ['20', 21.525], ['09', 5.5777777777777775], ['15', 38.5948275862069], ['14', 13.233644859813085], ['23', 7.985294117647059], ['11', 11.051724137931034], ['10', 13.440677966101696], ['06', 9.022727272727273], ['17', 11.46], ['21', 16.009174311926607], ['08', 10.25], ['13', 14.741176470588234], ['18', 13.20183486238532], ['03', 7.796296296296297], ['19', 10.8], ['16', 16.796296296296298], ['05', 10.08695652173913], ['02', 23.810344827586206], ['07', 7.852941176470588]]


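Above, we obtained a list of lists that pairs each hour with the average number of comments received by "Ask HN" posts submitted during that hour.

Now, we'll sort this list so that the hours with the highest averages come first. Since sorted() compares lists by their first element, we first swap the two columns.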

In [22]:
swap_avg_by_hour = []

for each in avg_by_hour: 
    swap_avg_by_hour.append([each[1], each[0]])
    
print(swap_avg_by_hour)

[[9.41095890410959, '12'], [6.746478873239437, '22'], [8.127272727272727, '00'], [7.170212765957447, '04'], [11.383333333333333, '01'], [21.525, '20'], [5.5777777777777775, '09'], [38.5948275862069, '15'], [13.233644859813085, '14'], [7.985294117647059, '23'], [11.051724137931034, '11'], [13.440677966101696, '10'], [9.022727272727273, '06'], [11.46, '17'], [16.009174311926607, '21'], [10.25, '08'], [14.741176470588234, '13'], [13.20183486238532, '18'], [7.796296296296297, '03'], [10.8, '19'], [16.796296296296298, '16'], [10.08695652173913, '05'], [23.810344827586206, '02'], [7.852941176470588, '07']]


In [27]:
sorted_swap = sorted(swap_avg_by_hour, reverse = True)

sorted_swap

[[38.5948275862069, '15'],
 [23.810344827586206, '02'],
 [21.525, '20'],
 [16.796296296296298, '16'],
 [16.009174311926607, '21'],
 [14.741176470588234, '13'],
 [13.440677966101696, '10'],
 [13.233644859813085, '14'],
 [13.20183486238532, '18'],
 [11.46, '17'],
 [11.383333333333333, '01'],
 [11.051724137931034, '11'],
 [10.8, '19'],
 [10.25, '08'],
 [10.08695652173913, '05'],
 [9.41095890410959, '12'],
 [9.022727272727273, '06'],
 [8.127272727272727, '00'],
 [7.985294117647059, '23'],
 [7.852941176470588, '07'],
 [7.796296296296297, '03'],
 [7.170212765957447, '04'],
 [6.746478873239437, '22'],
 [5.5777777777777775, '09']]
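
The swap step is one way to get the sort order; another is to pass a `key` function to `sorted()` so the list can be ordered by its second element directly. The sketch below shows this on a few pairs copied from `avg_by_hour`; the name `avg_by_hour_sample` is an assumption.

```python
# A few [hour, average] pairs copied from the avg_by_hour output.
avg_by_hour_sample = [
    ['12', 9.41095890410959],
    ['15', 38.5948275862069],
    ['02', 23.810344827586206],
    ['20', 21.525],
]

# Sort on the second element (the average), highest first.
ranked = sorted(avg_by_hour_sample, key=lambda pair: pair[1], reverse=True)

for hour, average in ranked:
    print(f"{hour}:00  {average:.2f}")
```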

In [32]:
print("Top 5 Hours for Ask Posts Comments:")

for each in sorted_swap[:5]:
    hour = each[1]
    hour_datetime = dt.datetime.strptime(hour, "%H")
    hour_string = hour_datetime.strftime("%H:%M")
    string = ("{hour}: {comments:.2f} average comments per post").format(hour = hour_string, comments = each[0])
    print(string)

Top 5 Hours for Ask Posts Comments:
15:00: 38.59 average comments per post
02:00: 23.81 average comments per post
20:00: 21.52 average comments per post
16:00: 16.80 average comments per post
21:00: 16.01 average comments per post


Above, we see the five hours during which an "Ask HN" post received the most comments on average.

These times are in US Eastern time.

The best hour by a clear margin was 15:00 Eastern time. Posts submitted during this hour received 38.59 comments on average, about 1.6 times the average for the next best hour, 02:00 Eastern time.

The popularity of posts submitted at 15:00 Eastern time might be because it is then evening in most of Europe and between midday and mid-afternoon across the continental US.

# Conclusion

The best way to maximize responses is to submit an "Ask HN" post around 15:00-16:00 US Eastern time. This corresponds to 3 a.m.-4 a.m. Singapore time while US daylight saving time is in effect (4 a.m.-5 a.m. otherwise), which is unfortunately a very poor choice of time for us here in Southeast Asia.

The second-best time slot was 02:00 US Eastern time, which corresponds to 2 p.m. Singapore time during US daylight saving time (3 p.m. otherwise). This might be more suitable for most of us.
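The Singapore conversions can be checked with fixed UTC offsets: US Eastern time is UTC-4 during daylight saving time and UTC-5 otherwise, while Singapore is UTC+8 all year. The sketch below uses only the standard library; `to_singapore_hour` is a hypothetical helper, and the calendar date inside it is arbitrary.

```python
from datetime import datetime, timedelta, timezone

EDT = timezone(timedelta(hours=-4), "EDT")   # US Eastern, daylight saving time
EST = timezone(timedelta(hours=-5), "EST")   # US Eastern, standard time
SGT = timezone(timedelta(hours=8), "SGT")    # Singapore


def to_singapore_hour(hour, eastern_tz):
    """Convert an Eastern wall-clock hour to the Singapore wall-clock hour."""
    # The date is arbitrary: with fixed offsets only the hour matters.
    eastern = datetime(2016, 7, 1, hour, 0, tzinfo=eastern_tz)
    return eastern.astimezone(SGT).hour


for hour in (15, 2):
    print(hour, to_singapore_hour(hour, EDT), to_singapore_hour(hour, EST))
```

Fixed offsets are used here instead of named time zones so the example does not depend on a time zone database being installed.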

It should be noted that this data excludes submissions that received no comments, so the averages reported here are likely higher than they would be with those submissions included, and the ranking of hours might differ as well.
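To illustrate the direction of that effect, the sketch below recomputes the "Ask HN" average after adding posts with zero comments. The totals come from this notebook (1744 posts, 24483 comments, which is the sum of `comments_by_hour`); the number of excluded posts is not known, so the value used for it is purely hypothetical.

```python
def mean_with_zero_comment_posts(total_comments, n_posts, n_zero_comment):
    """Mean comments per post after adding n_zero_comment posts that have 0 comments."""
    return total_comments / (n_posts + n_zero_comment)


# Figures from this notebook: 1744 "Ask HN" posts with 24483 comments in total.
total, n = 24483, 1744

as_reported = mean_with_zero_comment_posts(total, n, 0)

# HYPOTHETICAL: assume as many zero-comment posts as commented ones were excluded.
with_zeros = mean_with_zero_comment_posts(total, n, 1744)

print(as_reported, with_zeros)
```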