# Hacker News Post Statistics

We'll be working with a dataset of Hacker News posts. Hacker News is a site where users submit posts that are voted and commented on (similar to Reddit). It is popular in the tech and startup communities.

We'll be focusing on posts whose titles begin with either `Ask HN` or `Show HN`. Users submit `Ask HN` posts to ask the Hacker News community a specific question. Users submit `Show HN` posts to show the Hacker News community a project, product, or just something interesting.

We'll compare these two types of posts to determine the following:
- Do `Ask HN` or `Show HN` posts receive more comments on average?
- Do posts created at a certain time receive more comments on average?

We'll start by reading in the file and separating the header row from the rest of the data.

**Note:** The dataset we're using has been reduced from around 300K rows to approximately 20K rows by removing all posts that do not have any comments, and then randomly sampling from the remaining submissions. You can find the original dataset [here](https://www.kaggle.com/hacker-news/hacker-news-posts).

In [1]:
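from csv import reader

# Read the whole CSV into a list of rows (each row is a list of strings).
# The with-statement closes the file once the rows have been read.
file = 'hacker_news.csv'
with open(file) as opened_file:
    hn = list(reader(opened_file))
print(hn[:5])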

[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'], ['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']]


In [2]:
headers = hn[0]
hn = hn[1:]
print(headers)
print(hn[:5])

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
[['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01'], ['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']]
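
As an alternative to slicing, the header can be pulled off the reader with `next()` before the remaining rows are collected. The sketch below is illustrative only: `read_rows` is a hypothetical helper, and the in-memory `sample_csv` (built from two rows shown above) stands in for `hacker_news.csv` so the example runs on its own.

```python
import csv
import io

# Hypothetical in-memory sample standing in for hacker_news.csv
sample_csv = io.StringIO(
    "id,title,url,num_points,num_comments,author,created_at\n"
    "12224879,Interactive Dynamic Video,http://www.interactivedynamicvideo.com/,"
    "386,52,ne0phyte,8/4/2016 11:52\n"
    "10301696,Note by Note: The Making of Steinway L1037 (2007),"
    "http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0,"
    "8,2,walterbell,9/30/2015 4:12\n"
)


def read_rows(handle):
    """Return (header, rows) from an open CSV file-like object."""
    rows = csv.reader(handle)
    header = next(rows)        # the first row is the header
    return header, list(rows)  # everything after it is data


sample_headers, sample_rows = read_rows(sample_csv)
print(sample_headers)
print(len(sample_rows))
```

With the real file, the same helper could be called inside `with open('hacker_news.csv', newline='') as f:`, assuming the file's encoding matches the platform default.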


We'll iterate over the rows and count how many posts are `Ask HN` posts and how many are `Show HN` posts. Titles are lowercased before the check, so the match is case-insensitive.

In [3]:
ask_posts = []
show_posts = []
other_posts = []

for row in hn:
    title = row[1]
    if title.lower().startswith('ask hn'):
        ask_posts.append(row)
    elif title.lower().startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)

print("Number of posts with ASK HN: ", len(ask_posts))
print("Number of posts with SHOW HN: ", len(show_posts))
print("Number of other posts: ", len(other_posts))


Number of posts with ASK HN:  1744
Number of posts with SHOW HN:  1162
Number of other posts:  17194
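
The title check can also be isolated in a small function, which makes the case-insensitive matching easy to try on individual titles. This is a minimal sketch: `classify_title` is a hypothetical helper, and the first two sample titles are made up for illustration.

```python
def classify_title(title):
    """Return 'ask', 'show', or 'other' based on the title's prefix."""
    lowered = title.lower()  # lowercase once so the match ignores case
    if lowered.startswith('ask hn'):
        return 'ask'
    if lowered.startswith('show hn'):
        return 'show'
    return 'other'


# The first two titles are invented examples; the third appears in the dataset
sample_titles = [
    'Ask HN: How do you learn a new language?',
    'SHOW HN: My weekend project',
    'Interactive Dynamic Video',
]
for sample_title in sample_titles:
    print(classify_title(sample_title), '|', sample_title)
```

Note that a plain prefix check also matches any title that merely begins with those letters, which is the same behavior as the loop above.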


Now we'll calculate the following:
1. The total number of comments on the ask posts
2. The average number of comments per ask post
3. The total number of comments on the show posts
4. The average number of comments per show post

In [4]:
total_ask_comments = 0
for post in ask_posts:
    total_ask_comments += int(post[4])

avg_ask_comments = total_ask_comments/len(ask_posts)
print(avg_ask_comments)

total_show_comments = 0
for post in show_posts:
    total_show_comments += int(post[4])

avg_show_comments = total_show_comments/len(show_posts)
print(avg_show_comments)

14.038417431192661
10.31669535283993


On average, `Ask HN` posts receive more comments than `Show HN` posts: about 14.04 versus 10.32 comments per post.
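
Because the same total-then-divide calculation runs once per post type, it could be wrapped in a helper. The following is a sketch under assumptions: `average_comments` is a hypothetical function, it assumes the comment count sits at index 4 as in this dataset, and the sample rows are the first three data rows printed earlier.

```python
def average_comments(posts, comments_index=4):
    """Average the integer comment counts found at comments_index."""
    if not posts:
        return 0.0  # avoid dividing by zero on an empty list
    total = sum(int(post[comments_index]) for post in posts)
    return total / len(posts)


# First three data rows from the dataset preview (52, 10, and 1 comments)
sample_posts = [
    ['12224879', 'Interactive Dynamic Video',
     'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte',
     '8/4/2016 11:52'],
    ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time',
     'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/',
     '39', '10', 'josep2', '1/26/2016 19:30'],
    ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke",
     'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/',
     '2', '1', 'vezycash', '6/23/2016 22:20'],
]
print(average_comments(sample_posts))
```

In the notebook this would be called as `average_comments(ask_posts)` and `average_comments(show_posts)`.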

Next we'll calculate the number of `Ask HN` posts and the number of comments on them for each hour of the day, in two steps:
1. Extract the `created_at` value and the number of comments.
2. Create frequency tables for the number of posts at each hour and the number of comments at each hour.
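
Before running this over every post, it may help to see how a single `created_at` string turns into an hour key. The sketch below uses two timestamps that appear in the dataset preview; the variable names are illustrative only.

```python
import datetime as dt

# Two created_at values taken from the rows printed earlier
sample_timestamps = ['8/4/2016 11:52', '9/30/2015 4:12']

hour_keys = []
for stamp in sample_timestamps:
    # Parse month/day/year and a 24-hour time; unpadded digits are accepted
    parsed = dt.datetime.strptime(stamp, '%m/%d/%Y %H:%M')
    # Format the hour back out as a zero-padded two-digit string
    hour_keys.append(parsed.strftime('%H'))

print(hour_keys)
```

Formatting with `%H` zero-pads the hour, so `4:12` becomes the key `'04'`. That padding is what lets the hour strings sort correctly later.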

In [7]:
import datetime as dt

# Step 1: pair each post's creation time with its comment count
result_list = []
for post in ask_posts:
    result_list.append([post[6], int(post[4])])

# Step 2: build frequency tables keyed by zero-padded hour ('00' to '23')
counts_by_hour = {}
comments_by_hour = {}

for row in result_list:
    dt_obj = dt.datetime.strptime(row[0], '%m/%d/%Y %H:%M')
    hour = dt_obj.strftime("%H")
    num_comments = row[1]  # already converted to int in step 1
    if hour in counts_by_hour:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += num_comments
    else:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = num_comments

print(counts_by_hour)
print("-"*50)
print(comments_by_hour)

{'15': 116, '16': 108, '09': 45, '14': 107, '00': 55, '12': 73, '22': 71, '05': 46, '07': 34, '08': 48, '04': 47, '21': 109, '03': 54, '11': 58, '02': 58, '23': 68, '17': 100, '06': 44, '10': 59, '01': 60, '19': 110, '18': 109, '20': 80, '13': 85}
--------------------------------------------------
{'15': 4477, '16': 1814, '09': 251, '14': 1416, '00': 447, '12': 687, '22': 479, '05': 464, '07': 267, '08': 492, '04': 337, '21': 1745, '03': 421, '11': 641, '02': 1381, '23': 543, '17': 1146, '06': 397, '10': 793, '01': 683, '19': 1188, '18': 1439, '20': 1722, '13': 1253}


Now we'll calculate the average number of comments per post for each hour of the day:
- average = number of comments / number of posts

In [10]:
avg_by_hour = []
for hour in counts_by_hour:
    avg_by_hour.append([hour, comments_by_hour[hour]/counts_by_hour[hour]])

avg_by_hour.sort()
print(avg_by_hour)

[['00', 8.127272727272727], ['01', 11.383333333333333], ['02', 23.810344827586206], ['03', 7.796296296296297], ['04', 7.170212765957447], ['05', 10.08695652173913], ['06', 9.022727272727273], ['07', 7.852941176470588], ['08', 10.25], ['09', 5.5777777777777775], ['10', 13.440677966101696], ['11', 11.051724137931034], ['12', 9.41095890410959], ['13', 14.741176470588234], ['14', 13.233644859813085], ['15', 38.5948275862069], ['16', 16.796296296296298], ['17', 11.46], ['18', 13.20183486238532], ['19', 10.8], ['20', 21.525], ['21', 16.009174311926607], ['22', 6.746478873239437], ['23', 7.985294117647059]]
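
As a quick check on these averages, the division can be repeated for a few hours using the counts printed earlier. This sketch copies three entries from the two frequency tables into hypothetical `sample_` dictionaries and uses a dict comprehension in place of the loop.

```python
# Three entries copied from the frequency tables printed above
sample_counts = {'15': 116, '02': 58, '09': 45}
sample_comments = {'15': 4477, '02': 1381, '09': 251}

# average = number of comments / number of posts, per hour
sample_avg = {
    hour: sample_comments[hour] / sample_counts[hour]
    for hour in sample_counts
}

for hour, avg in sample_avg.items():
    print(hour, round(avg, 2))
```

The rounded values agree with the corresponding entries in `avg_by_hour`.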


Now we'll swap the two columns so that we can sort the list by average value in decreasing order. This makes it easy to see the top 5 hours at which to submit an `Ask HN` post.

In [11]:
swap_avg_by_hour = []
for row in avg_by_hour:
    swap_avg_by_hour.append([row[1], row[0]])

print(swap_avg_by_hour)

[[8.127272727272727, '00'], [11.383333333333333, '01'], [23.810344827586206, '02'], [7.796296296296297, '03'], [7.170212765957447, '04'], [10.08695652173913, '05'], [9.022727272727273, '06'], [7.852941176470588, '07'], [10.25, '08'], [5.5777777777777775, '09'], [13.440677966101696, '10'], [11.051724137931034, '11'], [9.41095890410959, '12'], [14.741176470588234, '13'], [13.233644859813085, '14'], [38.5948275862069, '15'], [16.796296296296298, '16'], [11.46, '17'], [13.20183486238532, '18'], [10.8, '19'], [21.525, '20'], [16.009174311926607, '21'], [6.746478873239437, '22'], [7.985294117647059, '23']]


In [12]:
sorted_swap = sorted(swap_avg_by_hour, reverse=True)
print("Top 5 Hours for Ask Posts Comments")
for row in sorted_swap[:5]:
    hour_obj = dt.datetime.strptime(row[1], "%H")
    pretty_hour = hour_obj.strftime("%H:%M")
    print("{} : {:.2f} average comments per post".format(pretty_hour, row[0]))

Top 5 Hours for Ask Posts Comments
15:00 : 38.59 average comments per post
02:00 : 23.81 average comments per post
20:00 : 21.52 average comments per post
16:00 : 16.80 average comments per post
21:00 : 16.01 average comments per post
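
The same ranking could be produced without swapping the columns by passing a `key` function to `sorted`. This is a sketch of that alternative, not the approach used above; `sample_avg_by_hour` holds four `[hour, average]` pairs copied from the earlier output.

```python
# Four [hour, average] pairs copied from avg_by_hour
sample_avg_by_hour = [
    ['02', 23.810344827586206],
    ['09', 5.5777777777777775],
    ['15', 38.5948275862069],
    ['20', 21.525],
]

# Sort on the second element (the average), largest first
ranked = sorted(sample_avg_by_hour, key=lambda row: row[1], reverse=True)

for hour, avg in ranked[:3]:
    print("{}:00 : {:.2f} average comments per post".format(hour, avg))
```

Sorting with a key leaves each row in its original `[hour, average]` order, so no second list is needed.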


If a user wants to submit an `Ask HN` post and get a lot of comments, they should post during one of the top hours listed above. The 15:00 hour stands out, with about 38.59 comments per post on average.