# Hacker News Posts Data Exploration

This project explores a dataset of posts from the forum Hacker News to determine how a post's type and creation time affect engagement, measured here by the number of comments a post receives.


In [1]:
from csv import reader
with open('hacker_news.csv', 'r') as read_file:
    hn = list(reader(read_file))
    
print(hn[0:5])

[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'], ['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']]


Let's separate the headers from the dataset.

In [2]:
headers = hn[0]
del hn[0]
print(headers)
print(hn[0:4])

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
[['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']]


There are three types of posts: Ask HN posts, Show HN posts, and everything else. Here they are separated into the individual lists `ask_posts`, `show_posts`, and `other_posts`, based on how each title begins.


In [3]:
ask_posts = []
show_posts = []
other_posts = []

for row in hn:
    if row[1].lower().startswith('ask hn'):
        ask_posts.append(row)
    elif row[1].lower().startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)
        
print(len(ask_posts))
print(len(show_posts))
print(len(other_posts))

1744
1162
17194
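The same prefix check can be wrapped in a small function, which makes it easy to test on a few titles before running it over the whole dataset. This is only a sketch: `classify_post` and the sample titles are hypothetical names and values introduced for illustration, not part of the original notebook.

```python
def classify_post(title):
    """Return 'ask', 'show', or 'other' based on how the title begins."""
    lowered = title.lower()  # case-insensitive match, as in the loop above
    if lowered.startswith('ask hn'):
        return 'ask'
    if lowered.startswith('show hn'):
        return 'show'
    return 'other'

# Hypothetical titles, used only to exercise the three branches.
sample_titles = [
    'Ask HN: How do you learn a new language?',
    'show hn: A tiny CSV viewer',
    'Interactive Dynamic Video',
]
labels = [classify_post(title) for title in sample_titles]
print(labels)
```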


Let's identify which types of posts receive the most comments on average.

In [4]:
total_ask_comments = 0
for ask_post in ask_posts:
    total_ask_comments += int(ask_post[4])
average_ask_comments = total_ask_comments / len(ask_posts)

total_show_comments = 0
for show_post in show_posts:
    total_show_comments += int(show_post[4])
average_show_comments = total_show_comments / len(show_posts)

total_other_comments = 0
for other_post in other_posts:
    total_other_comments += int(other_post[4])
average_other_comments = total_other_comments / len(other_posts)

print(f'Average comments per ask post: {average_ask_comments}')
print(f'Average comments per show post: {average_show_comments}')
print(f'Average comments per other post: {average_other_comments}')

Average comments per ask post: 14.038417431192661
Average comments per show post: 10.31669535283993
Average comments per other post: 26.8730371059672
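The three loops above repeat the same computation, so one possible refactoring is a single helper. The sketch below assumes the comment count sits at index 4 of each row, as in this dataset; `average_comments` and the sample rows are hypothetical and exist only for illustration.

```python
def average_comments(posts, comments_index=4):
    """Mean of the comment counts, which are stored as strings in each row."""
    if not posts:
        return 0.0  # avoid dividing by zero for an empty category
    total = sum(int(post[comments_index]) for post in posts)
    return total / len(posts)

# Hypothetical rows in the same column order as the dataset.
sample_posts = [
    ['1', 'Ask HN: example one', '', '5', '10', 'user_a', '1/1/2016 10:00'],
    ['2', 'Ask HN: example two', '', '3', '4', 'user_b', '1/2/2016 11:30'],
]
print(average_comments(sample_posts))
```

With this helper, each category's average would become a single call such as `average_comments(ask_posts)`.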


Here we can see that Ask HN posts receive about 36% more comments on average than Show HN posts (14.04 versus 10.32). Posts in neither category average the most comments of all, at 26.87 per post.
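The relative difference between the Ask HN and Show HN averages can be checked with a short calculation. The two values below are copied from the printed output; the variable names are introduced here for illustration.

```python
# Averages copied from the output of the previous cell.
average_ask = 14.038417431192661
average_show = 10.31669535283993

# Relative difference, expressed as a percentage of the Show HN average.
percent_more = (average_ask / average_show - 1) * 100
print(f'Ask HN posts receive {percent_more:.1f}% more comments on average.')
```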

Let's explore how the hour at which an Ask HN post is created affects the number of comments it receives.

In [5]:
import datetime as dt
result_list = []
for post in ask_posts:
    result_list.append([post[6], int(post[4])])

counts_by_hour = {}
comments_by_hour = {}
for post in result_list:
    hour_of_post = dt.datetime.strptime(post[0], '%m/%d/%Y %H:%M').hour
    if hour_of_post in counts_by_hour:
        counts_by_hour[hour_of_post] += 1
        comments_by_hour[hour_of_post] += post[1]
    else:
        counts_by_hour[hour_of_post] = 1
        comments_by_hour[hour_of_post] = post[1]

avg_by_hour = []
for hour in counts_by_hour:
    average_comments = comments_by_hour[hour] / counts_by_hour[hour]
    avg_by_hour.append([hour, average_comments])

avg_by_hour

[[9, 5.5777777777777775],
 [13, 14.741176470588234],
 [10, 13.440677966101696],
 [14, 13.233644859813085],
 [16, 16.796296296296298],
 [23, 7.985294117647059],
 [12, 9.41095890410959],
 [17, 11.46],
 [15, 38.5948275862069],
 [21, 16.009174311926607],
 [20, 21.525],
 [2, 23.810344827586206],
 [18, 13.20183486238532],
 [3, 7.796296296296297],
 [5, 10.08695652173913],
 [19, 10.8],
 [1, 11.383333333333333],
 [22, 6.746478873239437],
 [8, 10.25],
 [4, 7.170212765957447],
 [0, 8.127272727272727],
 [6, 9.022727272727273],
 [7, 7.852941176470588],
 [11, 11.051724137931034]]
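As an alternative to the explicit `if`/`else` bookkeeping, the per-hour totals could be accumulated with `collections.defaultdict`. This is a minimal sketch on three sample `[created_at, num_comments]` pairs rather than the full dataset; the sample values and variable names are assumptions made for illustration.

```python
import datetime as dt
from collections import defaultdict

# Sample [created_at, num_comments] pairs in the dataset's timestamp format.
sample = [
    ['8/4/2016 11:52', 52],
    ['1/26/2016 19:30', 10],
    ['6/23/2016 11:05', 8],
]

counts = defaultdict(int)    # number of posts per hour
comments = defaultdict(int)  # total comments per hour
for created_at, num_comments in sample:
    hour = dt.datetime.strptime(created_at, '%m/%d/%Y %H:%M').hour
    counts[hour] += 1
    comments[hour] += num_comments

# Average comments per post for each hour that appears in the sample.
avg = {hour: comments[hour] / counts[hour] for hour in counts}
print(avg)
```

Missing keys default to 0, so no membership check is needed before incrementing.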

In [6]:
swap_avg_by_hour = []
for row in avg_by_hour:
    swap_avg_by_hour.append([row[1], row[0]])
    
sorted_swap = sorted(swap_avg_by_hour, reverse=True)
sorted_swap

[[38.5948275862069, 15],
 [23.810344827586206, 2],
 [21.525, 20],
 [16.796296296296298, 16],
 [16.009174311926607, 21],
 [14.741176470588234, 13],
 [13.440677966101696, 10],
 [13.233644859813085, 14],
 [13.20183486238532, 18],
 [11.46, 17],
 [11.383333333333333, 1],
 [11.051724137931034, 11],
 [10.8, 19],
 [10.25, 8],
 [10.08695652173913, 5],
 [9.41095890410959, 12],
 [9.022727272727273, 6],
 [8.127272727272727, 0],
 [7.985294117647059, 23],
 [7.852941176470588, 7],
 [7.796296296296297, 3],
 [7.170212765957447, 4],
 [6.746478873239437, 22],
 [5.5777777777777775, 9]]
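Swapping the columns is one way to sort by average; another is to pass a `key` to `sorted` so the `[hour, average]` pairs keep their original order. The sketch below uses four pairs copied from the output above; `sample_avg_by_hour` and `sorted_by_avg` are names introduced for illustration.

```python
from operator import itemgetter

# A few [hour, average_comments] pairs copied from the earlier output.
sample_avg_by_hour = [
    [9, 5.5777777777777775],
    [15, 38.5948275862069],
    [2, 23.810344827586206],
    [20, 21.525],
]

# Sort on the second element (the average), highest first.
sorted_by_avg = sorted(sample_avg_by_hour, key=itemgetter(1), reverse=True)
print(sorted_by_avg)
```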

We are now able to observe the hours with the highest engagement on Ask HN posts.

The dataset specifies that the times are all in the EST time zone.

In [7]:
for avg_comments, hour in sorted_swap[:5]:
    time = dt.datetime.strptime(str(hour), '%H').strftime('%H:%M')
    print(f'{time} EST : {avg_comments:.2f} comments per post.')

15:00 EST : 38.59 comments per post.
02:00 EST : 23.81 comments per post.
20:00 EST : 21.52 comments per post.
16:00 EST : 16.80 comments per post.
21:00 EST : 16.01 comments per post.
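Readers in other time zones may want these hours converted. The sketch below assumes EST means a fixed offset of UTC-5 with no daylight saving adjustment, which the dataset does not confirm; `est_hour_to_utc` is a hypothetical helper and the date used inside it is arbitrary.

```python
import datetime as dt

# Assumption: EST is treated as a fixed UTC-5 offset (no daylight saving).
EST = dt.timezone(dt.timedelta(hours=-5), 'EST')

def est_hour_to_utc(hour):
    """Convert an hour of the day in EST to the matching hour in UTC."""
    # The date is arbitrary; only the hour matters for a fixed offset.
    local = dt.datetime(2016, 1, 1, hour, tzinfo=EST)
    return local.astimezone(dt.timezone.utc).hour

for hour in [15, 2, 20]:
    print(f'{hour:02d}:00 EST -> {est_hour_to_utc(hour):02d}:00 UTC')
```

Note that 20:00 EST falls on the following calendar day in UTC, so only the hour is reported here.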


According to this dataset, Ask HN posts created during the 3 p.m., 2 a.m., and 8 p.m. EST hours receive the most comments per post on average (38.59, 23.81, and 21.52 respectively).