# Exploring Hacker News Posts

Hacker News is a site started by the startup incubator Y Combinator, where user-submitted stories (known as "posts") are voted on and commented on, similar to Reddit. Hacker News is extremely popular in technology and startup circles, and posts that make it to the top of Hacker News' listings can get hundreds of thousands of visitors as a result.

In this notebook, I will determine whether Ask HN or Show HN posts tend to get more comments, and whether there is any specific time of day that will maximize the number of comments made on a post.

The dataset used below is a sample of roughly 20,000 Hacker News posts.

Before getting started, I'll load in the dataset and take a look at the header row, plus the first five rows of data, to get a feel for how the data is structured.

In [1]:
# Open and format the dataset as a list of lists
import csv

# Use a context manager so the file is closed once it has been read
with open('hacker_news.csv') as opened:
    hn = list(csv.reader(opened))

headers = hn[0]  # Move header row to its own list
hn = hn[1:]

# Explore dataset format
print(headers)
print()

for row in hn[0:5]:
    print(row)
    print()

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']

['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52']

['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30']

['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20']

['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']

['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']



## Do Ask HN or Show HN posts receive more comments on average?

To answer this question, I first separate the Ask HN and Show HN posts from everything else so that the two groups can be compared. The code below splits the dataset into three lists based on the prefix of each title.

In [2]:
# Split the dataset into three smaller sublists
ask_posts = []
show_posts = []
other_posts = []

for row in hn:
    title = row[1].lower()
    if title.startswith('ask hn'):
        ask_posts.append(row)
    elif title.startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)
        
print("Ask Posts: " + str(len(ask_posts)))
print("Show Posts: " + str(len(show_posts)))
print("Other Posts: " + str(len(other_posts)))
            

Ask Posts: 1744
Show Posts: 1162
Other Posts: 17194


The next step is to calculate the average number of comments for both ask and show posts.

In [3]:
# Calculate average ask comments
total_ask_comments = 0

for row in ask_posts:
    comments = int(row[4])
    total_ask_comments += comments
    
print(total_ask_comments)

avg_ask_comments = total_ask_comments / len(ask_posts)

print(avg_ask_comments)


# Calculate average show comments
total_show_comments = 0

for row in show_posts:
    comments = int(row[4])
    total_show_comments += comments
    
print(total_show_comments)

avg_show_comments = total_show_comments / len(show_posts)

print(avg_show_comments)

24483
14.038417431192661
11988
10.31669535283993


After the analysis above, I now know that Hacker News ask posts get more comments on average (14.04) than show posts do (10.32).
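
The two loops above differ only in the list they iterate over, so the same calculation could be wrapped in a small helper. The following is a minimal sketch rather than the code used in this notebook: the function name `average_column` and the two sample rows are hypothetical, and the rows only mimic the column order of the dataset.

```python
def average_column(rows, index):
    """Return the mean of the integer values stored at `index` in each row."""
    if not rows:
        return 0.0  # avoid dividing by zero for an empty list
    return sum(int(row[index]) for row in rows) / len(rows)


# Hypothetical rows that follow the dataset's column order:
# id, title, url, num_points, num_comments, author, created_at
sample_rows = [
    ['1', 'Ask HN: Example one', '', '5', '10', 'user_a', '1/1/2016 9:00'],
    ['2', 'Ask HN: Example two', '', '7', '20', 'user_b', '1/2/2016 10:00'],
]

print(average_column(sample_rows, 4))  # average of num_comments: 15.0
```

With a helper like this, `average_column(ask_posts, 4)` and `average_column(show_posts, 4)` should reproduce the two averages printed above.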

## Are Ask Posts Created at a Certain Time More Likely to Receive Comments?

Next, I determine whether there is an optimal time to submit Ask HN posts in order to maximize the number of comments they receive.
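
The hour of creation comes from the `created_at` column, which stores values such as `8/4/2016 11:52`. As a quick check of the format string, the sketch below parses the timestamp from the first data row shown earlier; the variable names are illustrative only.

```python
import datetime as dt

# Timestamp copied from the first data row printed at the top of the notebook
created_at = '8/4/2016 11:52'

# %m/%d/%Y matches month/day/year, %H:%M matches a 24-hour time
parsed = dt.datetime.strptime(created_at, '%m/%d/%Y %H:%M')

print(parsed.hour)            # hour as an integer: 11
print(parsed.strftime('%H'))  # hour as a zero-padded string: 11
```

Note that `strptime` accepts months, days, and hours without a leading zero, which is why the same format string works for values like `6/17/2016 0:01`.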

In [9]:
# Isolate the number of comments made to each post, and the hour of creation as a datetime object
import datetime as dt
result_list = []

for row in ask_posts:
    time = dt.datetime.strptime(row[6], '%m/%d/%Y %H:%M')
    result_list.append([time, int(row[4])])
 
# Calculate average number of comments by created hour
counts_by_hour = {}
comments_by_hour = {}

for row in result_list:
    hour = row[0].strftime('%H')
    comments = row[1]
    if hour not in counts_by_hour:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = comments
    else:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += comments
    
avg_by_hour = []  

for key in counts_by_hour:
    avg = comments_by_hour[key] / counts_by_hour[key]
    avg_by_hour.append([key, avg])


In [5]:
# Sort the list largest to smallest, then format it nicely
swapped = []

for row in avg_by_hour:
    swapped.append([row[1], row[0]])

sorted_swap = sorted(swapped, reverse=True)

print("Top 5 Hours for Ask Post Comments")
    
for row in sorted_swap[0:5]:
    hour_f = dt.datetime.strptime(row[1], "%H")
    print(str(hour_f.strftime("%H:%M")) + ": " + str(round(row[0], 2)) + " average comments per post")

Top 5 Hours for Ask Post Comments
15:00: 38.59 average comments per post
02:00: 23.81 average comments per post
20:00: 21.52 average comments per post
16:00: 16.8 average comments per post
21:00: 16.01 average comments per post
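
The counting, averaging, and sorting steps above could also be written more compactly. The sketch below is one possible alternative, not the code that produced the results in this notebook: it uses `collections.defaultdict` to drop the "key not seen yet" branch and sorts with a `key` function instead of swapping columns. The function name `average_by_hour` and the sample pairs are hypothetical.

```python
from collections import defaultdict


def average_by_hour(pairs):
    """Average the values for each hour, given (hour, value) pairs."""
    totals = defaultdict(int)
    counts = defaultdict(int)
    for hour, value in pairs:
        totals[hour] += value
        counts[hour] += 1
    return {hour: totals[hour] / counts[hour] for hour in totals}


# Hypothetical (hour, num_comments) pairs, not taken from the dataset
sample_pairs = [('15', 40), ('15', 20), ('02', 10), ('20', 24)]

averages = average_by_hour(sample_pairs)

# Sort by the average (second element of each item), largest first
top = sorted(averages.items(), key=lambda item: item[1], reverse=True)

for hour, avg in top:
    print(f"{hour}:00: {avg:.2f} average comments per post")
```

The f-string always prints two decimal places, so a value that the notebook shows as `16.8` would appear as `16.80` here.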


## Conclusion

These findings show that, on average, Ask HN posts created around 3 PM, 2 AM, 8 PM, 4 PM, and 9 PM accumulate the most comments. The dataset documentation specifies that all times are EST. As I am in the Pacific Time Zone, which is three hours behind, to generate the maximum number of comments I should submit Ask HN posts at 12 PM, 11 PM, 5 PM, 1 PM, or 6 PM Pacific time.
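
The shift from Eastern to Pacific time can be checked with a short script. This is a minimal sketch that assumes fixed UTC offsets (UTC-5 for EST and UTC-8 for PST) and ignores daylight saving time; the constant and function names are illustrative and the date used is arbitrary.

```python
import datetime as dt

# Fixed-offset time zones (assumption: no daylight saving adjustment)
EASTERN = dt.timezone(dt.timedelta(hours=-5), 'EST')
PACIFIC = dt.timezone(dt.timedelta(hours=-8), 'PST')

# Top five hours from the results above, in Eastern time
top_hours_est = [15, 2, 20, 16, 21]


def to_pacific_hour(hour_est):
    """Convert an hour of the day from Eastern to Pacific time."""
    moment = dt.datetime(2016, 1, 1, hour_est, tzinfo=EASTERN)  # arbitrary date
    return moment.astimezone(PACIFIC).hour


pacific_hours = [to_pacific_hour(hour) for hour in top_hours_est]
print(pacific_hours)  # [12, 23, 17, 13, 18]
```

The 2 AM Eastern slot maps to 11 PM on the previous day in Pacific time, which the conversion handles by rolling the date back.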

### Appendix A

Do Ask or Show posts receive more points on average?

In [6]:
# Calculate average ask points
total_ask_points = 0

for row in ask_posts:
    points = int(row[3])
    total_ask_points += points

avg_ask_points = total_ask_points / len(ask_posts)

print(avg_ask_points)


# Calculate average show points
total_show_points = 0

for row in show_posts:
    points = int(row[3])
    total_show_points += points

avg_show_points = total_show_points / len(show_posts)

print(avg_show_points)

15.061926605504587
27.555077452667813


Show posts receive more points on average (27.56) than ask posts do (15.06).

### Appendix B

Are Show posts created at a certain time of day more likely to receive more points?

In [7]:
# Isolate the number of points awarded to each post, and the time of creation as a datetime object
import datetime as dt
result_list = []

for row in show_posts:
    time = dt.datetime.strptime(row[6], '%m/%d/%Y %H:%M')
    result_list.append([time, int(row[3])])
    
# Calculate average number of points by created hour
counts_by_hour = {}
points_by_hour = {}

for row in result_list:
    hour = row[0].strftime('%H')
    points = row[1]
    if hour not in counts_by_hour:
        counts_by_hour[hour] = 1
        points_by_hour[hour] = points
    else:
        counts_by_hour[hour] += 1
        points_by_hour[hour] += points

avg_by_hour = []

for key in counts_by_hour:
    avg = points_by_hour[key] / counts_by_hour[key]
    avg_by_hour.append([key, avg])


In [8]:
# Sort the list largest to smallest, then format it nicely
swapped = []

for row in avg_by_hour:
    swapped.append([row[1], row[0]])

sorted_swap = sorted(swapped, reverse=True)

print("Top 5 Hours for Show Post Points")
    
for row in sorted_swap[0:5]:
    hour_f = dt.datetime.strptime(row[1], "%H")
    print(str(hour_f.strftime("%H:%M")) + ": " + str(round(row[0], 2)) + " average points per post")

Top 5 Hours for Show Post Points
23:00: 42.39 average points per post
12:00: 41.69 average points per post
22:00: 40.35 average points per post
00:00: 37.84 average points per post
18:00: 36.31 average points per post
