# Comparison of Hacker News Posts #

Hacker News is a technology site where users submit tech-related content and then vote and comment on those posts. This analysis looks at two types of posts, Ask Hacker News (Ask HN) and Show Hacker News (Show HN). In Ask HN posts, users ask the HN community questions, whereas in Show HN posts, users share projects or products with the HN community.

We will answer two questions.
- Do Ask HN or Show HN receive more comments on average?
- Do posts created at a certain time receive more comments on average?

In [1]:
from csv import reader

# Open the file in a with-block so it is closed once the rows are read.
with open('hacker_news.csv') as opened_file:
    read_file = reader(opened_file)
    hn = list(read_file)




In [2]:
def preview_data(data):
    for row in data[:5]:
        print(row)

# Introduction #

Let's preview the data by showing the first five rows.

In [3]:
preview_data(hn)

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52']
['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30']
['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20']
['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']


Here the header row is extracted and assigned to the variable headers.

In [4]:
headers = hn[:1]
preview_data(headers)

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
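
Note that `hn[:1]` is a slice, so `headers` is a list that contains the header row rather than the row itself. A minimal sketch with made-up rows (the `sample` list is hypothetical and only stands in for the real data) shows the difference between slicing and indexing:

```python
# Hypothetical two-row sample standing in for the real hn list.
sample = [
    ['id', 'title', 'num_comments'],
    ['1', 'Ask HN: Example question?', '3'],
]

header_slice = sample[:1]   # a list that contains the header row
header_row = sample[0]      # the header row itself

print(header_slice)
print(header_row)
```

Either form works for previewing, but indexing is the one to use if the column names are needed directly.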


We'll remove the header row from the hn variable.

In [5]:
hn = hn[1:]
preview_data(hn)

['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52']
['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30']
['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20']
['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']
['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']


# Filter Data #

We first need to sort the posts by title into three groups: Ask HN posts, Show HN posts, and all other posts.

In [6]:
ask_posts = []
show_posts = []
other_posts = []

for row in hn:
    title = row[1]
    if title.lower().startswith('ask hn'):
        ask_posts.append(row)
    elif title.lower().startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)
        

print('Ask HN posts: ', len(ask_posts))
print('Show HN posts: ', len(show_posts))
print('Other posts: ', len(other_posts))
print('Total rows in file: ', len(hn))
print('Total posts found in file: ', len(ask_posts) + len(show_posts) + len(other_posts))

Ask HN posts:  1744
Show HN posts:  1162
Other posts:  17194
Total rows in file:  20100
Total posts found in file:  20100
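
The prefix check used above can be tried on a few titles in isolation. The sketch below is illustrative only: `classify_title` and the sample titles are hypothetical names and data, not part of the notebook. It shows that lowercasing makes the match case-insensitive and that only titles that start with the prefix are counted.

```python
def classify_title(title):
    """Return 'ask', 'show', or 'other' based on the title prefix."""
    lowered = title.lower()
    if lowered.startswith('ask hn'):
        return 'ask'
    if lowered.startswith('show hn'):
        return 'show'
    return 'other'


# Made-up titles, not taken from the dataset.
sample_titles = [
    'Ask HN: How do you learn a new language?',
    'SHOW HN: A tiny CSV viewer',
    'Why I ask HN for advice',   # contains the phrase but does not start with it
]

labels = [classify_title(title) for title in sample_titles]
print(labels)
```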


# Question 1 #
Do Ask HN or Show HN receive more comments on average?

For both types of posts, we'll calculate the average by summing the number of comments for that post type, then dividing that total by the number of posts of that type.

Let us compute the average for Ask HN posts.

In [7]:
total_ask_comments = 0

for post in ask_posts:
    total_ask_comments += int(post[4])

avg_ask_comments = total_ask_comments / len(ask_posts)
print('Avg Ask HN comments per post:', avg_ask_comments)

Avg Ask HN comments per post: 14.038417431192661


Now for Show HN posts.

In [8]:
total_show_comments = 0

for post in show_posts:
    total_show_comments += int(post[4])

avg_show_comments = total_show_comments / len(show_posts)

print('Avg Show HN comments per post:', avg_show_comments)


Avg Show HN comments per post: 10.31669535283993


Ask HN posts receive more comments per post on average (about 14.04) than Show HN posts (about 10.32).
Since Ask HN posts generate more comments than Show HN posts, we will focus the rest of the analysis on Ask HN posts.
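
The two averages above are computed with the same loop, so the calculation could be wrapped in a small helper. This is only a sketch: `average_comments` and the sample rows are hypothetical, and the helper assumes the comment count sits at index 4, as it does in the rows shown earlier.

```python
def average_comments(posts, comments_index=4):
    """Average number of comments per post for a non-empty list of rows."""
    total = sum(int(post[comments_index]) for post in posts)
    return total / len(posts)


# Made-up rows with the same column order as the dataset:
# id, title, url, num_points, num_comments, author, created_at
sample_posts = [
    ['1', 'Ask HN: First example?', '', '5', '6', 'user_a', '8/16/2016 9:55'],
    ['2', 'Ask HN: Second example?', '', '2', '10', 'user_b', '5/2/2016 13:14'],
]

print(average_comments(sample_posts))
```

With the real lists this would be called as `average_comments(ask_posts)` and `average_comments(show_posts)`. An empty list would raise `ZeroDivisionError`, which the sketch does not handle.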

# Question 2 #

Do posts created at a specific time receive more comments on average?

We'll accomplish this in two steps.

1. Calculate the number of ask posts created in each hour of the day, along with the number of comments received.
2. Calculate the average number of comments ask posts receive by the hour created.

We start with step 1.

Create a list of lists called result_list, then iterate over ask_posts.
For each iteration, append a two-element sub-list. The first element is the time the post was created, and the second is the number of comments on the post.

There are two dictionaries, both initially empty, counts_by_hour and comments_by_hour.
- counts_by_hour: the total number of Ask HN posts created in each hour of the day.
- comments_by_hour: the total number of comments on Ask HN posts created in each hour of the day.


For each item in result_list, a datetime object is created from the time posted, and the hour from that datetime object is used as the key in both dictionaries. In counts_by_hour, the value is the number of posts created in that hour. In comments_by_hour, the value is the total number of comments on the posts created in that hour.

In [44]:
import datetime as dt

result_list = []

for post in ask_posts:
    created_at = post[6]
    num_comments = int(post[4])
    result_list.append([created_at, num_comments])
    
counts_by_hour = {}
comments_by_hour = {}

for result in result_list:
    date_dt = dt.datetime.strptime(result[0], '%m/%d/%Y %H:%M')
    hour = date_dt.hour
    if hour not in counts_by_hour:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = result[1]
    else:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += result[1]
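
As an alternative to the explicit `if`/`else` branch, the same tallies can be built with `collections.defaultdict`, which starts every missing key at zero. The sketch below uses a made-up `sample_results` list in the same `[created_at, num_comments]` shape as result_list. It is one possible variation, not the approach used in this notebook.

```python
import datetime as dt
from collections import defaultdict

# Made-up entries in the same shape as result_list.
sample_results = [
    ['8/16/2016 9:55', 6],
    ['11/22/2015 13:43', 29],
    ['5/2/2016 9:14', 1],
]

counts = defaultdict(int)     # hour -> number of posts
comments = defaultdict(int)   # hour -> total number of comments

for created_at, num_comments in sample_results:
    hour = dt.datetime.strptime(created_at, '%m/%d/%Y %H:%M').hour
    counts[hour] += 1
    comments[hour] += num_comments

print(dict(counts))
print(dict(comments))
```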

Now we'll need to calculate the average number of comments per post for each hour of the day.

In [45]:
avg_by_hour = []


for hour, total in counts_by_hour.items():
    avg = comments_by_hour[hour] / total
    avg_by_hour.append([hour, avg])

print(avg_by_hour)

[[0, 8.127272727272727], [1, 11.383333333333333], [2, 23.810344827586206], [3, 7.796296296296297], [4, 7.170212765957447], [5, 10.08695652173913], [6, 9.022727272727273], [7, 7.852941176470588], [8, 10.25], [9, 5.5777777777777775], [10, 13.440677966101696], [11, 11.051724137931034], [12, 9.41095890410959], [13, 14.741176470588234], [14, 13.233644859813085], [15, 38.5948275862069], [16, 16.796296296296298], [17, 11.46], [18, 13.20183486238532], [19, 10.8], [20, 21.525], [21, 16.009174311926607], [22, 6.746478873239437], [23, 7.985294117647059]]


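To sort by average comments, we first swap the two elements of each pair so that the average comes first and the hour second. The swapped list is then sorted in descending order.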

In [47]:
swap_avg_by_hour = []

for time in avg_by_hour:
    swap_avg_by_hour.append([time[1], time[0]])

print(swap_avg_by_hour)

[[8.127272727272727, 0], [11.383333333333333, 1], [23.810344827586206, 2], [7.796296296296297, 3], [7.170212765957447, 4], [10.08695652173913, 5], [9.022727272727273, 6], [7.852941176470588, 7], [10.25, 8], [5.5777777777777775, 9], [13.440677966101696, 10], [11.051724137931034, 11], [9.41095890410959, 12], [14.741176470588234, 13], [13.233644859813085, 14], [38.5948275862069, 15], [16.796296296296298, 16], [11.46, 17], [13.20183486238532, 18], [10.8, 19], [21.525, 20], [16.009174311926607, 21], [6.746478873239437, 22], [7.985294117647059, 23]]


In [68]:
sorted_swap = sorted(swap_avg_by_hour, reverse=True)
preview_data(sorted_swap)

[38.5948275862069, 15]
[23.810344827586206, 2]
[21.525, 20]
[16.796296296296298, 16]
[16.009174311926607, 21]


In [69]:
print("Top 5 Hours for Ask Posts Comments")

for data in sorted_swap[:5]:
    time_datetime_obj = dt.datetime.strptime(str(data[1]), '%H')
    hour = time_datetime_obj.strftime('%H:%M:')
    print('{} {: .2f} average comments per post'.format(hour, data[0]))

Top 5 Hours for Ask Posts Comments
15:00:  38.59 average comments per post
02:00:  23.81 average comments per post
20:00:  21.52 average comments per post
16:00:  16.80 average comments per post
21:00:  16.01 average comments per post


# Conclusion #

This analysis only pertains to posts that received comments. Among those posts, Ask HN posts created between 3:00 pm and 4:00 pm EST received the most comments on average (about 38.59 per post), so that is the best time to post a question on Hacker News.