# Hacker News Post Analysis

In this project, we'll work with a data set of posts from Hacker News, a popular site where technology-related stories (or 'posts') are voted and commented on.

We're specifically interested in posts whose titles begin with either `Ask HN` or `Show HN`. Users submit `Ask HN` posts to ask the Hacker News community a specific question.

Likewise, users submit `Show HN` posts to show the Hacker News community a project, product, or just generally something interesting.

We'll compare these two types of posts to determine the following:

- Do `Ask HN` or `Show HN` posts receive more comments on average?
- Do posts created at a certain time receive more comments on average?

Let's start by importing the libraries we need and reading the data set into a list of lists.

In [6]:
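from csv import reader

# Read the whole CSV into a list of lists (one inner list per row).
# The with-block closes the file once reading is done.
with open('hacker_news.csv') as f:
    hn = list(reader(f))

print(hn[:5])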

[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'], ['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']]
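
To make the list-of-lists structure concrete, here is a minimal sketch that runs `csv.reader` on a small in-memory sample instead of the real file. The sample rows and the names `sample_csv`, `sample_headers`, and `sample_rows` are made up for illustration; only the column layout follows the output above.

```python
import io
from csv import reader

# Hypothetical sample in the same column layout as hacker_news.csv.
sample_csv = io.StringIO(
    "id,title,url,num_points,num_comments,author,created_at\n"
    "1,Ask HN: Example question?,,5,3,alice,8/4/2016 11:52\n"
    "2,Show HN: Example project,http://example.com,7,1,bob,1/26/2016 19:30\n"
)

rows = list(reader(sample_csv))

# The first inner list holds the column names; the rest are data rows.
sample_headers, sample_rows = rows[0], rows[1:]

print(sample_headers)
print(sample_rows[0][1])  # title of the first data row
```

Every value comes back as a string, which is why numeric columns such as `num_comments` need an explicit `int()` conversion before any arithmetic.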


Notice that the first inner list contains the column headers, and each list after it contains the data for one row. To analyze the data, we first need to remove the row containing the column headers. Let's do that next.

In [8]:
headers = hn[0]
hn = hn[1:]

print(headers)

print(hn[:5])

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
[['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01'], ['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']]

Now that we've removed the headers from `hn`, we're ready to filter our data. Since we're only concerned with posts whose titles begin with `Ask HN` or `Show HN`, we'll create new lists of lists containing just the data for those titles.

To find the posts that begin with either `Ask HN` or `Show HN`, we'll use regular expressions. We can find strings that begin with a certain word or words by placing the beginning anchor, `^`, at the start of the regular expression. For example, the regular expression below matches strings beginning with `Red`.

`pattern = r"^Red"`
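
As a short sketch of what the anchor changes: with `re.search`, a pattern without `^` matches the phrase anywhere in the string, while the anchored version only matches at the start. The titles below are made up for illustration, and the `re.I` flag makes both patterns case-insensitive.

```python
import re

# Made-up titles for illustration only.
titles = [
    "Ask HN: How do you learn regex?",
    "ask hn: lowercase works too",
    "Why I stopped reading Ask HN threads",
    "Show HN: My weekend project",
]

anchored = re.compile(r"^ask hn", re.I)    # phrase must start the title
unanchored = re.compile(r"ask hn", re.I)   # phrase may appear anywhere

starts_with_ask = [t for t in titles if anchored.search(t)]
contains_ask = [t for t in titles if unanchored.search(t)]

print(len(starts_with_ask))
print(len(contains_ask))
```

The third title mentions Ask HN in the middle, so only the unanchored pattern picks it up.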

Let's use regular expressions to separate posts beginning with Ask HN and Show HN into different lists next.

In [11]:
import re

ask_posts = []
show_posts = []
other_posts = []

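for row in hn:
    title = row[1]
    # Case-insensitive search. Without a leading ^ anchor, these patterns
    # match the phrase anywhere in the title, not only at the start;
    # use r'^ask hn' and r'^show hn' to require it at the beginning.
    match1 = re.search(r'ask hn', title, re.I)
    match2 = re.search(r'show hn', title, re.I)
    if match1:
        ask_posts.append(row)
    elif match2:
        show_posts.append(row)
    else:
        other_posts.append(row)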
        
print(len(ask_posts))
print(len(show_posts))
print(len(other_posts))

1745
1165
17189


Next, let's determine if Ask HN or Show HN posts receive more comments on average.

In [13]:
ask_comments = [int(row[4]) for row in ask_posts]
show_comments = [int(row[4]) for row in show_posts]

avg_ask_comments = sum(ask_comments) / len(ask_comments)
avg_show_comments = sum(show_comments) / len(show_comments)

print(avg_ask_comments,avg_show_comments)

14.031518624641834 10.302145922746782


The averages show that Ask HN posts receive more comments than Show HN posts: about 14 per post versus about 10.
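
The same kind of average can be wrapped in a small helper. This is only a sketch: the function name `average_comments`, its `comments_index` parameter, and the sample rows are hypothetical, and it uses `statistics.mean` from the standard library with a guard for empty lists.

```python
from statistics import mean

def average_comments(rows, comments_index=4):
    """Return the mean of the comment counts, or 0.0 for an empty list."""
    counts = [int(row[comments_index]) for row in rows]
    return mean(counts) if counts else 0.0

# Made-up rows in the data set's column order.
sample_ask = [
    ["1", "Ask HN: A?", "", "5", "10", "a", "8/4/2016 11:52"],
    ["2", "Ask HN: B?", "", "2", "4", "b", "8/4/2016 12:10"],
]

print(average_comments(sample_ask))
```

The guard matters because dividing by `len()` of an empty list, as in a plain `sum(...) / len(...)`, would raise `ZeroDivisionError`.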

Since Ask HN posts receive more comments on average, we'll focus our remaining analysis on these posts.

Next, we'll determine if Ask HN posts created at a certain time are more likely to attract comments. We'll use the following steps to perform this analysis:

1. Calculate the number of Ask HN posts created in each hour of the day, along with the number of comments received.
2. Calculate the average number of comments Ask HN posts receive by hour created.

Now, we'll tackle the first step: calculating the number of Ask HN posts and comments by hour created. We'll use the `datetime` module to work with the data in the `created_at` column.
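
As a minimal sketch of the parsing step, the snippet below converts one `created_at` value (taken from the data preview earlier) into a `datetime` object and extracts the hour in two ways. The variable names are illustrative.

```python
import datetime as dt

created_at = "6/17/2016 0:01"  # sample value from the data preview

# %m, %d and %H accept values without a leading zero when parsing.
parsed = dt.datetime.strptime(created_at, "%m/%d/%Y %H:%M")

hour_str = parsed.strftime("%H")  # zero-padded string
hour_int = parsed.hour            # plain integer

print(hour_str, hour_int)
```

Using the zero-padded string as a dictionary key keeps hours such as `00` and `09` two characters wide, which makes the formatted results line up.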

In [15]:
import datetime as dt

created_date = [row[6] for row in ask_posts]

counts_by_hour = {}
comments_by_hour = {}

result_list = zip(created_date,ask_comments)

for date,comment in result_list:
    hour_obj = dt.datetime.strptime(date, "%m/%d/%Y %H:%M")
    hour = hour_obj.strftime("%H")
    
    if hour in counts_by_hour:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += comment
    else:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = comment
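
The accumulation pattern above can also be sketched with `collections.defaultdict`, which removes the need for the `if`/`else` branch because missing keys start at zero. The sample pairs and the names `counts` and `comments` are made up for illustration.

```python
import datetime as dt
from collections import defaultdict

# Made-up (created_at, num_comments) pairs.
samples = [
    ("8/4/2016 11:52", 52),
    ("1/26/2016 19:30", 10),
    ("6/23/2016 11:05", 1),
]

counts = defaultdict(int)    # posts per hour
comments = defaultdict(int)  # total comments per hour

for created_at, num_comments in samples:
    hour = dt.datetime.strptime(created_at, "%m/%d/%Y %H:%M").strftime("%H")
    counts[hour] += 1
    comments[hour] += num_comments

print(dict(counts))
print(dict(comments))
```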

Next, we'll use these two dictionaries to calculate the average number of comments for posts created during each hour of the day.

In [22]:
avg_by_hour = [[hour, (comments_by_hour[hour] / counts_by_hour[hour])] for hour in comments_by_hour]

print(avg_by_hour)

[['20', 21.525], ['22', 6.746478873239437], ['05', 10.08695652173913], ['16', 16.796296296296298], ['07', 7.852941176470588], ['09', 5.5777777777777775], ['18', 13.20183486238532], ['02', 23.810344827586206], ['14', 13.233644859813085], ['21', 16.009174311926607], ['01', 11.383333333333333], ['08', 10.25], ['15', 38.5948275862069], ['12', 9.41095890410959], ['03', 7.796296296296297], ['19', 10.8], ['10', 13.440677966101696], ['13', 14.741176470588234], ['17', 11.46], ['00', 8.127272727272727], ['06', 9.022727272727273], ['04', 7.170212765957447], ['11', 11.051724137931034], ['23', 7.898550724637682]]


Although we now have the results we need, this format makes it hard to identify the hours with the highest values. Let's finish by sorting the list of lists and printing the five highest values in a format that's easier to read.
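
Here is a small sketch of the sort-then-slice idea on made-up `[hour, average]` pairs: `sorted` with `reverse=True` orders the pairs from highest to lowest average, and the `[:5]` slice keeps only the first five.

```python
# Made-up [hour, average] pairs for illustration.
averages = [
    ["09", 5.5], ["15", 38.6], ["02", 23.8],
    ["20", 21.5], ["16", 16.8], ["21", 16.0],
]

# Sort by the average (second element), highest first, then keep five.
top_five = sorted(averages, key=lambda pair: pair[1], reverse=True)[:5]

for hour, avg in top_five:
    print("{}:00: {:.2f} average comments per post".format(hour, avg))
```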

In [25]:
avg_by_hour = sorted(avg_by_hour, key = lambda row: row[1], reverse = True)

print("Top 5 Hours for 'Ask HN' Comments")

for row in avg_by_hour[:5]:
    print("{}:00: {:.2f} average comments per post".format(row[0],row[1]))

Top 5 Hours for 'Ask HN' Comments
15:00: 38.59 average comments per post
02:00: 23.81 average comments per post
20:00: 21.52 average comments per post
16:00: 16.80 average comments per post
21:00: 16.01 average comments per post


# Conclusion

According to the results above, Ask HN posts created during the 15:00 hour receive the most comments on average (38.59 per post), so this is the best time to submit an Ask HN post. Readers should convert this hour from the time zone of the `created_at` column to their own local time to find the best moment to post.
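
A time-zone conversion of this kind could be sketched as follows. Both UTC offsets are assumptions chosen purely for illustration, since the time zone of the data set is not stated here; substitute the real offsets before relying on the result.

```python
import datetime as dt

# Assumed offsets for illustration only; the data set's time zone is not stated.
source_tz = dt.timezone(dt.timedelta(hours=-5))  # hypothetical data set zone
reader_tz = dt.timezone(dt.timedelta(hours=1))   # hypothetical reader zone

# 15:00 in the assumed source zone (the date itself is arbitrary).
best_hour = dt.datetime(2016, 8, 4, 15, 0, tzinfo=source_tz)

# Express the same instant in the reader's zone.
local = best_hour.astimezone(reader_tz)

print(local.strftime("%H:%M"))
```

Fixed offsets ignore daylight saving time; a named zone from the standard `zoneinfo` module would handle that where time zone data is available.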