![hacker_news](hacker_news.png)

## Exploring Hacker News Posts

In this project, we'll work with a data set of submissions to the popular technology site Hacker News.

Hacker News is a site started by the startup incubator Y Combinator, where user-submitted stories (known as "posts") are voted on and commented on, similar to Reddit. Hacker News is extremely popular in technology and startup circles, and posts that make it to the top of Hacker News' listings can get hundreds of thousands of visitors as a result.

You can find the data set here, but note that it has been reduced from almost 300,000 rows to approximately 20,000 rows by removing all submissions that did not receive any comments, and then randomly sampling from the remaining submissions. Below are descriptions of the columns:

* id: The unique identifier from Hacker News for the post
* title: The title of the post
* url: The URL that the post links to, if the post has a URL
* num_points: The number of points the post acquired, calculated as the total number of upvotes minus the total number of downvotes
* num_comments: The number of comments that were made on the post
* author: The username of the person who submitted the post
* created_at: The date and time at which the post was submitted
     
Here is an example of one of the rows in the data set:

* id: 12224879
* title: Interactive Dynamic Video
* url: http://www.interactivedynamicvideo.com/
* num_points: 386
* num_comments: 52
* author: ne0phyte
* created_at: 8/4/2016 11:52


We're specifically interested in posts whose titles begin with either Ask HN or Show HN. Users submit Ask HN posts to ask the Hacker News community a specific question. Below are a few examples:

* Ask HN: How to improve my personal website?
* Ask HN: Am I the only one outraged by Twitter shutting down share counts?
* Ask HN: Aby recent changes to CSS that broke mobile?

Likewise, users submit Show HN posts to show the Hacker News community a project, product, or just generally something interesting. Below are a few examples:

* Show HN: Wio Link  ESP8266 Based Web of Things Hardware Development Platform
* Show HN: Something pointless I made
* Show HN: Shanhu.io, a programming playground powered by e8vm

We'll compare these two types of posts to determine the following:


* Do Ask HN or Show HN posts receive more comments on average?
* Do posts created at a certain time receive more comments on average?


## Data Exploration & Preparation

Let's start by importing the libraries we need and reading the data set into a list of lists.

In [2]:
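import datetime as dt
from csv import reader

# Read the whole file into a list of lists; the with-block closes the file afterwards.
with open('hacker_news.csv') as open_file:
    hn = list(reader(open_file))

headers = hn[0]  # the first row holds the column names
hn = hn[1:]      # the remaining rows are the posts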
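
To make the resulting structure concrete, here is a minimal sketch that runs the same header/rows split on an in-memory sample instead of the real file. The sample text is an assumption built from the column descriptions and the example row shown earlier, and `io.StringIO` merely stands in for the opened file.

```python
import io
from csv import reader

# Hypothetical two-line sample: the column names plus the example row from above.
sample_csv = (
    "id,title,url,num_points,num_comments,author,created_at\n"
    "12224879,Interactive Dynamic Video,http://www.interactivedynamicvideo.com/,"
    "386,52,ne0phyte,8/4/2016 11:52\n"
)

rows = list(reader(io.StringIO(sample_csv)))
sample_headers = rows[0]  # the column names
sample_posts = rows[1:]   # every remaining row is one post, as a list of strings

print(sample_headers)
print(sample_posts[0][1])       # title
print(int(sample_posts[0][4]))  # num_comments is read as a string, so convert it
```

Every field comes back as a string, which is why the later cells call int() on the comment counts.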

Since we are only concerned with posts whose titles begin with Ask HN or Show HN, we will separate those rows from the rest of the data.

In [3]:
ask_posts = []
show_posts = []
other_posts = []

# Compare lowercased titles so that the prefix match is case-insensitive.
for row in hn:
    title = row[1].lower()
    if title.startswith('ask hn'):
        ask_posts.append(row)
    elif title.startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)

print(len(ask_posts))
print(len(show_posts))
print(len(other_posts))

1744
1162
17194
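
The prefix test can be sketched as a small helper, shown here on titles that appear earlier in this write-up. The function name `classify_title` is hypothetical and not part of the notebook. The final lines check that the three counts printed above account for every row, which should be consistent with the roughly 20,000 rows mentioned in the introduction.

```python
def classify_title(title):
    """Return 'ask', 'show', or 'other' based on the lowercased title prefix."""
    lowered = title.lower()
    if lowered.startswith("ask hn"):
        return "ask"
    if lowered.startswith("show hn"):
        return "show"
    return "other"


examples = [
    "Ask HN: How to improve my personal website?",
    "Show HN: Something pointless I made",
    "Interactive Dynamic Video",
]
labels = [classify_title(title) for title in examples]
print(labels)

# The three group sizes printed above, added together.
total_rows = 1744 + 1162 + 17194
print(total_rows)
```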


Next, let's calculate the average number of comments for each type of post.

In [4]:
total_ask_comments = 0
total_show_comments = 0

for row in ask_posts:
    num_comments = int(row[4])
    total_ask_comments += num_comments

avg_ask_comments = total_ask_comments/len(ask_posts)
print("Average amount of ask comments per post: ",avg_ask_comments)

for row in show_posts:
    num_comments = int(row[4])
    total_show_comments += num_comments
    
avg_show_comments = total_show_comments/len(show_posts)
print("Average amount of show comments per post: ",avg_show_comments)

Average amount of ask comments per post:  14.038417431192661
Average amount of show comments per post:  10.31669535283993


Ask posts receive about 14 comments on average, compared with about 10 for show posts.

Since ask posts are more likely to receive comments, we'll focus our remaining analysis just on these posts.

Next, we'll determine if ask posts created at a certain time are more likely to attract comments. We'll use the following steps to perform this analysis:

1. Calculate the number of ask posts created in each hour of the day, along with the number of comments received.
2. Calculate the average number of comments ask posts receive by hour created.
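
The key building block for step 1 is turning a created_at string into an hour of the day. A minimal sketch, using the timestamp from the example row shown earlier:

```python
import datetime as dt

created_at = "8/4/2016 11:52"  # created_at value from the example row

# %m/%d/%Y %H:%M = month/day/four-digit year, then the time on a 24-hour clock.
parsed = dt.datetime.strptime(created_at, "%m/%d/%Y %H:%M")

hour_as_int = parsed.hour            # the hour as an integer
hour_as_str = parsed.strftime("%H")  # the hour as a zero-padded two-digit string

print(hour_as_int, hour_as_str)
```

Either form can serve as a dictionary key; the zero-padded string has the advantage that it can be parsed back with "%H" later on.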

In [5]:
result_list = []

# Loop over the ask posts and collect [created_at, number of comments] pairs.
for item in ask_posts:
    created_at = item[6]
    comments = int(item[4])
    result_list.append([created_at, comments])

counts_by_hour = {}
comments_by_hour = {}

# For each pair, parse created_at into a datetime object
# and keep only the hour as a zero-padded string (e.g. "09").
for item in result_list:
    created_at = item[0]
    comments = item[1]
    dt_hour = dt.datetime.strptime(created_at, "%m/%d/%Y %H:%M")
    hour = dt_hour.strftime("%H")
    # First post seen for this hour: create an entry in both dictionaries.
    # Otherwise: increment the post count and add to the comment total.
    if hour not in counts_by_hour:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = comments
    else:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += comments

# Let's see how that went.
print("Counts by hour:")
for key, value in counts_by_hour.items():
    print(key, value)
print()  # blank line to separate the two listings
print("Comments by hour:")
for key, value in comments_by_hour.items():
    print(key, value)

Counts by hour:
18 109
14 107
22 71
02 58
21 109
09 45
05 46
03 54
06 44
10 59
00 55
19 110
08 48
13 85
15 116
23 68
20 80
07 34
11 58
17 100
04 47
16 108
12 73
01 60

Comments by hour:
18 1439
14 1416
22 479
02 1381
21 1745
09 251
05 464
03 421
06 397
10 793
00 447
19 1188
08 492
13 1253
15 4477
23 543
20 1722
07 267
11 641
17 1146
04 337
16 1814
12 687
01 683
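
As a quick sanity check, the per-hour tallies should add up to the overall totals found earlier. The sketch below copies the printed values into dictionary literals so that it runs on its own; in the notebook you would use counts_by_hour and comments_by_hour directly. The names `counts` and `comments` are used here only for illustration.

```python
# Values copied from the output above (hour -> number of ask posts).
counts = {
    "18": 109, "14": 107, "22": 71, "02": 58, "21": 109, "09": 45,
    "05": 46, "03": 54, "06": 44, "10": 59, "00": 55, "19": 110,
    "08": 48, "13": 85, "15": 116, "23": 68, "20": 80, "07": 34,
    "11": 58, "17": 100, "04": 47, "16": 108, "12": 73, "01": 60,
}
# Values copied from the output above (hour -> total number of comments).
comments = {
    "18": 1439, "14": 1416, "22": 479, "02": 1381, "21": 1745, "09": 251,
    "05": 464, "03": 421, "06": 397, "10": 793, "00": 447, "19": 1188,
    "08": 492, "13": 1253, "15": 4477, "23": 543, "20": 1722, "07": 267,
    "11": 641, "17": 1146, "04": 337, "16": 1814, "12": 687, "01": 683,
}

total_posts = sum(counts.values())
total_comments = sum(comments.values())

print(total_posts)                   # should match len(ask_posts)
print(total_comments / total_posts)  # should match avg_ask_comments
```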


Now that we have the number of ask posts created in each hour of the day and the total number of comments they received, we can proceed with step 2: calculating the average number of comments per ask post for each hour of the day.

For example, in the hour starting at 15:00 there were 116 ask posts that received 4477 comments in total, so on average 4477/116 = 38.59 comments per post. To calculate this for each hour, we iterate over comments_by_hour: for each hour key, we take the total number of comments and divide it by the number of posts stored under the same key in counts_by_hour. Each [hour, average] pair is appended to a new list of lists.
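
The same division can be sketched for a handful of hours with a dictionary comprehension. The three hours below are copied from the output above; the variable names are illustrative only.

```python
# Three hours copied from the output above.
counts_sample = {"15": 116, "02": 58, "20": 80}
comments_sample = {"15": 4477, "02": 1381, "20": 1722}

# Average comments per post = total comments / number of posts, hour by hour.
avg_sample = {
    hour: comments_sample[hour] / counts_sample[hour]
    for hour in comments_sample
}

for hour, avg in avg_sample.items():
    print(hour, avg)
```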

In [48]:
avg_by_hour = []

for hour in comments_by_hour:
    avg_by_hour.append([hour, comments_by_hour[hour] / counts_by_hour[hour]])

for element in avg_by_hour:
    print(element)

# The hours are stored as strings, so they must be parsed before being formatted as times.
print(type(avg_by_hour[0][0]))

['18', 13.20183486238532]
['14', 13.233644859813085]
['22', 6.746478873239437]
['02', 23.810344827586206]
['21', 16.009174311926607]
['09', 5.5777777777777775]
['05', 10.08695652173913]
['03', 7.796296296296297]
['06', 9.022727272727273]
['10', 13.440677966101696]
['00', 8.127272727272727]
['19', 10.8]
['08', 10.25]
['13', 14.741176470588234]
['15', 38.5948275862069]
['23', 7.985294117647059]
['20', 21.525]
['07', 7.852941176470588]
['11', 11.051724137931034]
['17', 11.46]
['04', 7.170212765957447]
['16', 16.796296296296298]
['12', 9.41095890410959]
['01', 11.383333333333333]
<class 'str'>


Finally, let's sort the averages and print the five hours with the highest average number of comments per post.

In [65]:
swap_avg_by_hour = []

# Put the average first so that sorting the rows orders them by average.
for row in avg_by_hour:
    swap_avg_by_hour.append([row[1], row[0]])

sorted_swap = sorted(swap_avg_by_hour, reverse=True)
print()
print("Top 5 hours for Ask Posts Comments:")
print()
for element in sorted_swap[0:5]:
    avg_comments = "{:.2f}".format(element[0])
    hour = element[1]
    # Parse the hour string and format it as HH:MM.
    hour_obj = dt.datetime.strptime(hour, "%H")
    hour_str = hour_obj.strftime("%H:%M")
    print("{0}: {1} average comments per post".format(hour_str, avg_comments))


Top 5 hours for Ask Posts Comments:

15:00: 38.59 average comments per post
02:00: 23.81 average comments per post
20:00: 21.52 average comments per post
16:00: 16.80 average comments per post
21:00: 16.01 average comments per post
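
Swapping the columns is one way to sort by average. An alternative sketch, which is not the approach used in the notebook, passes a key function to sorted so that the [hour, average] pairs can stay in their original order. The list below is a subset of the averages printed earlier, chosen to include the five highest.

```python
# Subset of avg_by_hour from the output above, as [hour, average] pairs.
avg_subset = [
    ["18", 13.20183486238532],
    ["02", 23.810344827586206],
    ["21", 16.009174311926607],
    ["09", 5.5777777777777775],
    ["15", 38.5948275862069],
    ["20", 21.525],
    ["16", 16.796296296296298],
    ["13", 14.741176470588234],
]

# Sort by the second element (the average), largest first, and keep five rows.
top_five = sorted(avg_subset, key=lambda pair: pair[1], reverse=True)[:5]

for hour, avg in top_five:
    print("{}:00: {:.2f} average comments per post".format(hour, avg))
```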


## Conclusion

To conclude, there seem to be three time slots in which Ask posts receive the most comments on average:

* Between 15-17h in the afternoon, Eastern Time U.S. (21-23h CET)
* Between 20-22h in the evening, Eastern Time U.S. (02-04h CET)
* Between 02-03h at night, Eastern Time U.S. (08-09h CET)

That should give you plenty of options for choosing the right moment to create your Ask post, sit back, and enjoy seeing the comments roll in.
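
The CET equivalents follow from a six-hour difference between the two zones. Here is a minimal sketch of that conversion, assuming fixed standard-time offsets of UTC-5 for Eastern Time and UTC+1 for CET and ignoring daylight saving time. The names `EASTERN`, `CET`, and `eastern_hour_to_cet` are illustrative and not part of the notebook.

```python
import datetime as dt

# Assumed fixed offsets; daylight saving time is deliberately ignored.
EASTERN = dt.timezone(dt.timedelta(hours=-5))
CET = dt.timezone(dt.timedelta(hours=1))


def eastern_hour_to_cet(hour):
    """Convert an hour of the day in Eastern Time to the matching hour in CET."""
    # The calendar date is arbitrary; only the hour matters here.
    eastern_time = dt.datetime(2016, 1, 1, hour, tzinfo=EASTERN)
    return eastern_time.astimezone(CET).hour


for hour in (15, 20, 2):
    print("{:02d}:00 ET -> {:02d}:00 CET".format(hour, eastern_hour_to_cet(hour)))
```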

That's all on this analysis for now!