# Exploring Hacker News Posts

Hacker News is a popular website amongst technology and startup audiences. Users submit stories and can be voted and commented upon by the community. Successful stories or posts can get many views. 

This [data set](https://www.kaggle.com/hacker-news/hacker-news-posts) can be culled from over 300,000 posts down to posts which only include posts with comments. Within these posts, there are two main categories: 'Ask HN' and 'Show HN', where users either ask the community for advice, or show the community something they have done.

We will compare the two types of posts, 'Ask HN' and 'Show HN', to determine which type gets more comments on average, and to examine whether posts get more comments based on the time they were posted.

### Import Data

In [10]:
from csv import reader

opened_file = open('HN_posts_year_to_Sep_26_2016.csv', encoding='utf8')
read_file = reader(opened_file)
hn_data = list(read_file)
hn_header = hn_data[0]
hn = hn_data[1:]

print('Header\n', hn_header, '\n')
print('Sample Rows: \n', hn[:3])
print('\nRows of Data: ',len(hn), '\n')

Header
 ['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'] 

Sample Rows: 
 [['12579008', 'You have two days to comment if you want stem cells to be classified as your own', 'http://www.regulations.gov/document?D=FDA-2015-D-3719-0018', '1', '0', 'altstar', '9/26/2016 3:26'], ['12579005', 'SQLAR  the SQLite Archiver', 'https://www.sqlite.org/sqlar/doc/trunk/README.md', '1', '0', 'blacksqr', '9/26/2016 3:24'], ['12578997', 'What if we just printed a flatscreen television on the side of our boxes?', 'https://medium.com/vanmoof/our-secrets-out-f21c1f03fdc8#.ietxmez43', '1', '0', 'pavel_lishin', '9/26/2016 3:19']]

Rows of Data:  293119 



We can see there are a few useful column categories to help prune the data down from the nearly 300,000 entries. Most of these columns are self explanitory but to note:
- 'num_points' is the number of positive votes a post gets
- 'created_at' is the date and time at which the post was created according to the Eastern Time (ET) zone in the US

### Select Comment Provoking Posts

This is done by creating a new list for the selected data to stay, then using a for loop to check where the 'num_comments' value is larger than zero, to only include posts which provoked users to comment, as below.

In [15]:
hn_commented_posts = []

for row in hn:
    if int(row[4]) > 0:
        hn_commented_posts.append(row)
        
print('There are {} posts with comments.'.format(len(hn_commented_posts)))

There are 80401 posts with comments.


### Select Relevent Posts

Since we are only interested in posts that begin with'Ask HN'  or 'Show HN', we can further separate our date based on this criteria.

In [23]:
ask_posts = []
show_posts = []
other_posts = []

for row in hn_commented_posts:
    title = row[1]
    if title.lower().startswith('ask hn'):
        ask_posts.append(row)
    elif title.lower().startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)
        
print('There are {} \'Ask HN\' posts.'.format(len(ask_posts)))
print('\n\t', ask_posts[:2], '\n')
print('There are {} \'Show HN\' posts.'.format(len(show_posts)))
print('\n\t', show_posts[:2], '\n')
print('There are {} other posts.'.format(len(other_posts)))

There are 6911 'Ask HN' posts.

	 [['12578908', 'Ask HN: What TLD do you use for local development?', '', '4', '7', 'Sevrene', '9/26/2016 2:53'], ['12578522', 'Ask HN: How do you pass on your work when you die?', '', '6', '3', 'PascLeRasc', '9/26/2016 1:17']] 

There are 5059 'Show HN' posts.

	 [['12577142', 'Show HN: Jumble  Essays on the go #PaulInYourPocket', 'https://itunes.apple.com/us/app/jumble-find-startup-essay/id1150939197?ls=1&mt=8', '1', '1', 'ryderj', '9/25/2016 20:06'], ['12576813', 'Show HN: Learn Japanese Vocab via multiple choice questions', 'http://japanese.vul.io/', '1', '1', 'soulchild37', '9/25/2016 19:06']] 

There are 68431 other posts.


### A Wager for Ask or Show

With the data separated into 'Ask HN' or 'Show HN' we can find which type of post gets more votes and comments on *average*. This is done by simply counting the amount of votes or comments for each type and dividing by the total amounts of each post type as we have calculated above.

In [30]:
total_ask_votes = 0
total_ask_comments = 0
total_show_votes = 0
total_show_comments = 0

for row in ask_posts:
    total_ask_votes += int(row[3])
    total_ask_comments += int(row[4])
    
for row in show_posts:
    total_show_votes += int(row[3])
    total_show_comments += int(row[4])
    
avg_ask_votes = total_ask_votes / len(ask_posts)
avg_show_votes = total_show_votes / len(ask_posts)
avg_ask_comments = total_ask_comments / len(ask_posts)
avg_show_comments = total_show_comments / len(show_posts)

print('The average votes per \'Ask HN\' Post are {:.2f}'.format(avg_ask_votes))
print('The average votes per \'Show HN\' Post are {:.2f}'.format(avg_show_votes))
print('\nThe average comments per \'Ask HN\' Post are {:.2f}'.format(avg_ask_comments))
print('The average comments per \'Show HN\' Post are {:.2f}'.format(avg_show_comments))

The average votes per 'Ask HN' Post are 14.40
The average votes per 'Show HN' Post are 19.49

The average comments per 'Ask HN' Post are 13.74
The average comments per 'Show HN' Post are 9.81


As shown, there are more votes on average for 'Show HN' posts than 'Ask HN' posts. However the opposite is true for the amount of comments on average. This makes some sense -- If a user is asking HN something, the community can only give an answer through comments while showing HN something doesn't require it; instead users may just vote in favor of what's been shown and move on.

Based on how these votes and comments stem from an innate design of the website it is worth exploring both 'ask' and 'show' posts further.

### Time, Dr. Freeman?

What is time? Rather, when *is* the time... to post? Let's find out.

This can be done as follows:
 - Calculate the number of both ask and show posts created each hour of the day, along with their votes/comments
 - Calculate the average number of votes/comments posts receive by the hour they were created.

First let's look at the 'created_at' column and sort both types of posts by time. While we are at it let's create frequency tables as well, which will be discussed more below.

In [37]:
import datetime as dt

### ASK HN POSTS ###
ask_result_list = [] #list of lists. Votes, Comments, Created
for row in ask_posts:
    votes = int(row[3])
    comments = int(row[4])
    created = row[6]
    ask_result_list.append([votes, comments, created]) #pulls the votes, comments, time for each post
    
ask_counts_by_hour = {}
ask_votes_by_hour = {}
ask_comments_by_hour = {}

for row in ask_result_list:
    time = dt.datetime.strptime(row[2], '%m/%d/%Y %H:%M')
    hour = dt.datetime.strftime(time, '%H') #convert time of post to hour value
    
    if hour in ask_counts_by_hour: #creates frequency table for counts, votes, and comments of each ask post
        ask_counts_by_hour[hour] += 1
        ask_votes_by_hour[hour] += row[0]
        ask_comments_by_hour[hour] += row[1]
    else:
        ask_counts_by_hour[hour] = 1
        ask_votes_by_hour[hour] = row[0]
        ask_comments_by_hour[hour] = row[1]
    

### SHOW HN POSTS ###
show_result_list = [] #list of lists. votes, comments, created
for row in show_posts:
    votes = int(row[3])
    comments = int(row[4])
    created = row[6]
    show_result_list.append([votes, comments, created])
    
show_counts_by_hour = {}
show_votes_by_hour = {}
show_comments_by_hour = {}

for row in show_result_list:
    time = dt.datetime.strptime(row[2], '%m/%d/%Y %H:%M')
    hour = dt.datetime.strftime(time, '%H')
    
    if hour in show_counts_by_hour:
        show_counts_by_hour[hour] += 1
        show_votes_by_hour[hour] += row[0]
        show_comments_by_hour[hour] += row[1]
    else:
        show_counts_by_hour[hour] = 1
        show_votes_by_hour[hour] = row[0]
        show_comments_by_hour[hour] = row[1]
    
    

Three dictionaries were created for each type of post (6 total) which hold different information, as follows:
- `ask_counts_by_hour` -- The number of ask posts created for each hour of the day
- `ask_votes_by_hour` -- The corresponding number of votes those ask posts received 
- `ask_comments_by_hour` -- The corresponding number of comments those ask posts received


- `show_counts_by_hour` -- The number of show posts created for each hour of the day
- `show_votes_by_hour` -- The corresponding number of votes those show posts received
- `show_comments_by_hour` -- The corresponding number of comments those show posts received


Each set of three dictionaries can be used to calculate the average number of votes *and* comments for posts created during each hour of the day. As shown below:

In [61]:
### ASK HN POSTS ###
ask_avg_by_hour = []
for hour in ask_counts_by_hour:
    #Ordered in list as comments, votes, hour
    ask_avg_by_hour.append([ask_comments_by_hour[hour], ask_votes_by_hour[hour], hour])    

ask_avg_by_hour = sorted(ask_avg_by_hour, reverse=True) #Sort the list high-low by number of comments
print('\n-Top 5 Hours for Posting Ask HN Posts-\n')
for row in ask_avg_by_hour[:5]: #Format and print
    hr = dt.datetime.strptime(row[2], '%H')
    hr = dt.datetime.strftime(hr, '%H:%M')
    print('{hr}: {com} average comments per post\n       {vot} average votes per post\n'.format(hr=hr, com=row[0], vot=row[1]))



### SHOW HN POSTS ###
show_avg_by_hour = []
for hour in show_counts_by_hour:
    #Ordered in list as votes, comments, hour
    show_avg_by_hour.append([show_votes_by_hour[hour], show_comments_by_hour[hour], hour])
    
show_avg_by_hour = sorted(show_avg_by_hour, reverse=True) #Sort list high-low by number of votes
print('\n-Top 5 Hours for Posting Show HN Posts-\n')
for row in ask_avg_by_hour[:5]:
    hr = dt.datetime.strptime(row[2], '%H')
    hr = dt.datetime.strftime(hr, '%H:%M')
    print('{hr}: {vot} average votes per post\n       {com} average comments per post\n'.format(hr=hr, com=row[1], vot=row[0]))



-Top 5 Hours for Posting Ask HN Posts-

15:00: 18525 average comments per post
       13689 average votes per post

13:00: 7245 average comments per post
       7749 average votes per post

17:00: 5547 average comments per post
       6853 average votes per post

14:00: 4972 average comments per post
       5172 average votes per post

18:00: 4877 average comments per post
       6570 average votes per post


-Top 5 Hours for Posting Show HN Posts-

15:00: 18525 average votes per post
       13689 average comments per post

13:00: 7245 average votes per post
       7749 average comments per post

17:00: 5547 average votes per post
       6853 average comments per post

14:00: 4972 average votes per post
       5172 average comments per post

18:00: 4877 average votes per post
       6570 average comments per post



# Conclusion

The best time to post either 'Show HN' or 'Ask HN' type posts are sometime from 1 - 3 pm ET. Overall, it seems the average amount of votes and comments at these times are fairly close to the same amount, meaning users are voting and commenting about the same amount for posts overall. This also shows the highest amount of activity within the Hacker News community of users posting, voting, and commenting is in the afternoon into the evening. Likely towards the end of the work day.