# Hacker News Analysis

In this project, we will analyze the site "Hacker News", where users can submit posts related to technology and startups. Users can upvote and downvote posts and comment on them, much like on Reddit.

Posts on Hacker News can be divided into three types:
- Regular news posts, where users share a news article along with its source site
- Ask HN posts, where users ask the Hacker News community questions about technology
- Show HN posts, where users show their creations to the Hacker News community

# Objective

Our objective is to find out whether Ask HN posts, Show HN posts, or both get more engagement from the community than regular posts, and which time of day is best for submitting a post to get the highest engagement.

# Loading the dataset

In [1]:
from csv import reader

# Read the whole CSV file into a list of rows
with open('hacker_news.csv') as opened_file:
    hn = list(reader(opened_file))

print(hn[:5])

# Separate the header row from the data rows
hn_header = hn[0]
hn = hn[1:]

[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'], ['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']]


# Splitting the different types of posts

As mentioned above, posts on Hacker News fall into three types. To achieve our objective, we will split the dataset into three lists, one per type.

In [2]:
ask_posts = []
show_posts = []
other_posts = []
for row in hn:
    title = row[1]
    if title.startswith("Ask HN"):
        ask_posts.append(row)
    elif title.startswith("Show HN"):
        show_posts.append(row)
    else:
        other_posts.append(row)
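
Note that `str.startswith` is case-sensitive, so a title such as "ask hn: ..." would land in `other_posts` in the cell above. A minimal sketch of a case-insensitive variant follows. `classify_title` is a hypothetical helper that is not part of the original notebook, and using it instead may change the counts reported below.

```python
def classify_title(title):
    """Return 'ask', 'show', or 'other' based on a case-insensitive title prefix."""
    lowered = title.lower()
    if lowered.startswith("ask hn"):
        return "ask"
    if lowered.startswith("show hn"):
        return "show"
    return "other"


# Made-up titles for illustration only (not rows from the dataset)
sample_titles = [
    "Ask HN: How do you learn?",
    "show hn: my side project",
    "Interactive Dynamic Video",
]
print([classify_title(t) for t in sample_titles])
```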


In [3]:
print(len(ask_posts))
print(len(show_posts))
print(len(other_posts))

1742
1161
17197


We can see that most posts are regular posts (about 17k), which deliver news to the community. In a distant second place are ask posts (about 1.7k), which seek advice from the community, and in last place are show posts (about 1.2k), which showcase a user's creation.
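
To put those counts in proportion, the share of each post type can be computed directly from the three numbers printed above. This is only a small arithmetic sketch; the dictionary `counts` is introduced here for illustration.

```python
# Post counts printed by the cell above
counts = {"ask": 1742, "show": 1161, "other": 17197}
total = sum(counts.values())

# Print each type's share of all posts as a percentage
for kind, n in counts.items():
    print("{}: {:.1%} of {} posts".format(kind, n / total, total))
```

Regular posts make up roughly 86% of the dataset, with ask and show posts at roughly 9% and 6%.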

Let's see which type of posts gets the most engagement. 

In [4]:
total_ask_comments = 0
total_ask_points = 0
for row in ask_posts:
    
    total_ask_comments += float(row[4])
    total_ask_points += float(row[3])
    
print(total_ask_comments)
print(total_ask_points)
    
    

24466.0
26264.0


In total, ask posts received 24,466 comments and 26,264 points. The comment count is impressive, but the ratio of points to comments doesn't seem too impressive.

In [5]:
ask_num_to_comm_ratio = total_ask_points / total_ask_comments
print(ask_num_to_comm_ratio)

1.0734897408648738


That means an ask post gets about 1.07 points for every comment.

Let's check the average points and comments

In [6]:
avg_ask_comments = total_ask_comments / len(ask_posts)
avg_ask_points = total_ask_points / len(ask_posts)
print(avg_ask_comments)
print(avg_ask_points)

14.044776119402986
15.076923076923077


So, the average number of comments on an ask post is about 14, while the average number of points is about 15. That's some good engagement on ask posts!
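
The same totals and averages are needed for every post type, so the loop can be wrapped in a function. The sketch below is one possible way to do that. `summarize` is a hypothetical helper, it assumes the column layout shown in the header row (`num_points` at index 3, `num_comments` at index 4), and it converts with `int` rather than `float`, so totals print without the trailing `.0`.

```python
def summarize(posts):
    """Return (total_comments, total_points, avg_comments, avg_points) for a list of rows.

    Assumes num_points is at index 3 and num_comments at index 4, both stored
    as strings, and that the list is not empty.
    """
    total_comments = sum(int(row[4]) for row in posts)
    total_points = sum(int(row[3]) for row in posts)
    n = len(posts)
    return total_comments, total_points, total_comments / n, total_points / n


# Two made-up rows for illustration only (not taken from the dataset)
demo_rows = [
    ["1", "Ask HN: example A", "", "10", "4", "user_a", "1/1/2016 9:00"],
    ["2", "Ask HN: example B", "", "20", "6", "user_b", "1/2/2016 15:30"],
]
print(summarize(demo_rows))  # (10, 30, 5.0, 15.0)
```

Calling `summarize(ask_posts)`, `summarize(show_posts)`, and `summarize(other_posts)` would then cover all three post types with the same logic.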

Now let's check on other types of posts

In [7]:
total_show_comments = 0
total_show_points = 0
for row in show_posts:
    
    total_show_comments += float(row[4])
    total_show_points += float(row[3])
    
print(total_show_comments)
print(total_show_points)

11987.0
32015.0


While the total number of comments is lower than for ask posts, the number of upvotes is impressive. The ratio of points to comments is higher too.

In [8]:
show_num_to_comm_ratio = total_show_points / total_show_comments
print(show_num_to_comm_ratio)

2.6708100442145657


That means a show post gets about 2.67 upvotes for every comment, so people tend to upvote a show post more than they comment on it. The likely reason is that ask posts explicitly invite users to reply with advice or opinions, while show posts do not.

Let's check the average comments and points.

In [9]:
avg_show_comments = total_show_comments / len(show_posts)
avg_show_points = total_show_points / len(show_posts)
print(avg_show_comments)
print(avg_show_points)

10.324720068906116
27.575366063738155


While the average number of comments on a show post is lower than on an ask post (10.3 versus 14.0), the average number of upvotes is higher (27.6 versus 15.1).

Let's see how these compare to regular posts

In [10]:
total_other_comments = 0
total_other_points = 0
for row in other_posts:
    
    total_other_comments += float(row[4])
    total_other_points += float(row[3])
    
print(total_other_comments)
print(total_other_points)

462073.0
952672.0


There are far more regular posts, which in turn generate far more comments and points in total.

In [11]:
other_num_to_comm_ratio = total_other_points / total_other_comments
print(other_num_to_comm_ratio)

2.061734834106299


At 2.06, the ratio of points to comments is closer to that of show posts (2.67) than to that of ask posts (1.07).

In [12]:
avg_other_comments = total_other_comments / len(other_posts)
avg_other_points = total_other_points / len(other_posts)
print(avg_other_comments)
print(avg_other_points)

26.86939582485317
55.39756934349014


Regular posts have the highest average engagement (about 27 comments and 55 points per post), though a few very popular posts may be skewing these averages. After all, a regular post arguably has a better chance of going viral than the other types of posts.
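
One way to check for that kind of skew is to compare the mean with the median, since the median is much less sensitive to a handful of extreme values. The sketch below uses made-up numbers and a hypothetical helper, `mean_and_median_comments`, which assumes `num_comments` sits at index 4 of each row as in this dataset. It was not run as part of the original notebook, so no results for the real data are shown.

```python
from statistics import mean, median


def mean_and_median_comments(posts):
    """Return (mean, median) of the comment counts for a list of rows."""
    comments = [float(row[4]) for row in posts]
    return mean(comments), median(comments)


# Made-up rows for illustration only: one very popular post among quiet ones
demo_rows = [
    ["1", "a", "", "5", "1", "u1", "1/1/2016 9:00"],
    ["2", "b", "", "5", "2", "u2", "1/1/2016 9:00"],
    ["3", "c", "", "5", "3", "u3", "1/1/2016 9:00"],
    ["4", "d", "", "5", "4", "u4", "1/1/2016 9:00"],
    ["5", "e", "", "5", "490", "u5", "1/1/2016 9:00"],
]
print(mean_and_median_comments(demo_rows))
```

In this toy example the mean is 100 comments per post while the median is only 3, which is the pattern a few viral posts would produce. Running `mean_and_median_comments(other_posts)` would show whether the same holds for the real data.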

# Best time to post

Now let's determine the best time to submit a post that usually gets better engagement.

In [13]:
import datetime as dt

In [16]:
result_list = []
for row in ask_posts:
    result_list.append([row[6], float(row[4])])
    
print(result_list[:5])
    

[['8/16/2016 9:55', 6.0], ['11/22/2015 13:43', 29.0], ['5/2/2016 10:14', 1.0], ['8/2/2016 14:20', 3.0], ['10/15/2015 16:38', 17.0]]


In [17]:
counts_by_hour = {}
comments_by_hour = {}
date_format = "%m/%d/%Y %H:%M"

for row in result_list:
    date = row[0]
    comment = row[1]
    datetime_format = dt.datetime.strptime(date, date_format)
    time = datetime_format.strftime("%H")
    if time in counts_by_hour:
        counts_by_hour[time] += 1
        comments_by_hour[time] += comment
    else:
        counts_by_hour[time] = 1
        comments_by_hour[time] = comment
    

In [18]:
print(counts_by_hour)

{'09': 45, '13': 85, '10': 59, '14': 107, '16': 108, '23': 68, '12': 73, '17': 100, '15': 116, '21': 109, '20': 80, '02': 58, '18': 108, '03': 54, '05': 46, '19': 110, '01': 60, '22': 71, '08': 48, '04': 47, '00': 54, '06': 44, '07': 34, '11': 58}


In [19]:
print(comments_by_hour)

{'09': 251.0, '13': 1253.0, '10': 793.0, '14': 1416.0, '16': 1814.0, '23': 543.0, '12': 687.0, '17': 1146.0, '15': 4477.0, '21': 1745.0, '20': 1722.0, '02': 1381.0, '18': 1430.0, '03': 421.0, '05': 464.0, '19': 1188.0, '01': 683.0, '22': 479.0, '08': 492.0, '04': 337.0, '00': 439.0, '06': 397.0, '07': 267.0, '11': 641.0}
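
The bucketing above relies on `strptime` accepting non-zero-padded values such as `8/16/2016 9:55`, and on `strftime("%H")` returning a zero-padded hour such as `'09'`. The sketch below shows the same tally written with `collections.defaultdict`, which removes the need for the `if`/`else` branch. `comments_per_hour` is a hypothetical helper, not part of the original notebook.

```python
import datetime as dt
from collections import defaultdict

DATE_FORMAT = "%m/%d/%Y %H:%M"


def comments_per_hour(pairs):
    """Tally post counts and comment totals per hour from [created_at, comments] pairs."""
    counts = defaultdict(int)
    comments = defaultdict(float)
    for created_at, num_comments in pairs:
        # Parse the timestamp and keep only the zero-padded hour, e.g. '09'
        hour = dt.datetime.strptime(created_at, DATE_FORMAT).strftime("%H")
        counts[hour] += 1
        comments[hour] += num_comments
    return dict(counts), dict(comments)


# First three entries of result_list as printed earlier in the notebook
sample = [["8/16/2016 9:55", 6.0], ["11/22/2015 13:43", 29.0], ["5/2/2016 10:14", 1.0]]
print(comments_per_hour(sample))
```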


Now let's calculate the average number of comments per post for each hour in which Ask HN posts were created.

In [21]:
avg_comm_hr_ask = []
for row in comments_by_hour:
    avg_comm_hr_ask.append([row, comments_by_hour[row] / counts_by_hour[row]] )

In [22]:
print(avg_comm_hr_ask)

[['09', 5.5777777777777775], ['13', 14.741176470588234], ['10', 13.440677966101696], ['14', 13.233644859813085], ['16', 16.796296296296298], ['23', 7.985294117647059], ['12', 9.41095890410959], ['17', 11.46], ['15', 38.5948275862069], ['21', 16.009174311926607], ['20', 21.525], ['02', 23.810344827586206], ['18', 13.24074074074074], ['03', 7.796296296296297], ['05', 10.08695652173913], ['19', 10.8], ['01', 11.383333333333333], ['22', 6.746478873239437], ['08', 10.25], ['04', 7.170212765957447], ['00', 8.12962962962963], ['06', 9.022727272727273], ['07', 7.852941176470588], ['11', 11.051724137931034]]


We can see that Ask HN posts created in the '15' hour, i.e. 3 PM - 4 PM, receive the most comments per post on average (about 38.6).

Let's sort these values so that they are easier to read.

In [23]:
swap_avg_comm_hr_ask = []
for row in avg_comm_hr_ask:
    swap_avg_comm_hr_ask.append([row[1], row[0]])
    
swap_sorted = sorted(swap_avg_comm_hr_ask, reverse=True)
print(swap_sorted)

[[38.5948275862069, '15'], [23.810344827586206, '02'], [21.525, '20'], [16.796296296296298, '16'], [16.009174311926607, '21'], [14.741176470588234, '13'], [13.440677966101696, '10'], [13.24074074074074, '18'], [13.233644859813085, '14'], [11.46, '17'], [11.383333333333333, '01'], [11.051724137931034, '11'], [10.8, '19'], [10.25, '08'], [10.08695652173913, '05'], [9.41095890410959, '12'], [9.022727272727273, '06'], [8.12962962962963, '00'], [7.985294117647059, '23'], [7.852941176470588, '07'], [7.796296296296297, '03'], [7.170212765957447, '04'], [6.746478873239437, '22'], [5.5777777777777775, '09']]


In [24]:
print("Top 5 Hours for 'Ask HN' Comments")
for avg, hr in swap_sorted[:5]:
    print("{}: {:.2f} average comments per post".format(dt.datetime.strptime(hr, "%H").strftime("%H:%M"), avg) )

Top 5 Hours for 'Ask HN' Comments
15:00: 38.59 average comments per post
02:00: 23.81 average comments per post
20:00: 21.52 average comments per post
16:00: 16.80 average comments per post
21:00: 16.01 average comments per post
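
The swap-and-sort step above can also be written without building a second list, by passing a `key` function to `sorted`. The sketch below uses only the first five `[hour, average]` pairs from the printed `avg_comm_hr_ask` output to keep the example short, so its ranking covers just those five hours rather than the full day.

```python
# First five [hour, average] pairs from avg_comm_hr_ask, copied from the output above
avg_by_hour = [
    ["09", 5.5777777777777775],
    ["13", 14.741176470588234],
    ["10", 13.440677966101696],
    ["14", 13.233644859813085],
    ["16", 16.796296296296298],
]

# Sort on the average (index 1) in descending order; each pair stays as [hour, average]
ranked = sorted(avg_by_hour, key=lambda pair: pair[1], reverse=True)

for hour, avg in ranked[:3]:
    print("{}:00: {:.2f} average comments per post".format(hour, avg))
```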


# Conclusion

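We can conclude a few things from our analysis of the Hacker News data:
- Regular posts are by far the most numerous, and they also have the highest average number of upvotes and comments per post, i.e. the highest engagement.
- Between the two special types, Ask HN posts average more comments (14.0 versus 10.3), while Show HN posts average more points (27.6 versus 15.1).
- Ask HN posts get the most comments on average if posted between 3 PM and 4 PM (Eastern US time).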