# Exploring Hacker News Posts

Hacker News is a site started by the startup incubator Y Combinator, where user-submitted stories (known as "posts") are voted on and commented on, similar to Reddit. Hacker News is extremely popular in technology and startup circles, and posts that reach the top of its listings can get hundreds of thousands of visitors as a result.

You can find the data set here, but note that it has been reduced from almost 300,000 rows to approximately 20,000 rows by removing all submissions that did not receive any comments and then randomly sampling from the remaining submissions. The columns are described below:

**id:** The unique identifier from Hacker News for the post  
**title:** The title of the post  
**url:** The URL that the post links to, if the post has a URL  
**num_points:** The number of points the post acquired, calculated as the total number of upvotes minus the total number of downvotes  
**num_comments:** The number of comments that were made on the post  
**author:** The username of the person who submitted the post  
**created_at:** The date and time at which the post was submitted  

We're specifically interested in posts whose titles begin with either Ask HN or Show HN. Users submit Ask HN posts to ask the Hacker News community a specific question.  

Likewise, users submit Show HN posts to show the Hacker News community a project, product, or just generally something interesting.  

## Importing the Dataset

In [11]:
import pandas as pd


hn = pd.read_csv("hacker_news.csv")

# Display the first 6 rows
hn.head(6)

| | id | title | url | num_points | num_comments | author | created_at |
|---|---|---|---|---|---|---|---|
| 0 | 12224879 | Interactive Dynamic Video | http://www.interactivedynamicvideo.com/ | 386 | 52 | ne0phyte | 8/4/2016 11:52 |
| 1 | 10975351 | How to Use Open Source and Shut the Fuck Up at... | http://hueniverse.com/2016/01/26/how-to-use-op... | 39 | 10 | josep2 | 1/26/2016 19:30 |
| 2 | 11964716 | Florida DJs May Face Felony for April Fools' W... | http://www.thewire.com/entertainment/2013/04/f... | 2 | 1 | vezycash | 6/23/2016 22:20 |
| 3 | 11919867 | Technology ventures: From Idea to Enterprise | https://www.amazon.com/Technology-Ventures-Ent... | 3 | 1 | hswarna | 6/17/2016 0:01 |
| 4 | 10301696 | Note by Note: The Making of Steinway L1037 (2007) | http://www.nytimes.com/2007/11/07/movies/07ste... | 8 | 2 | walterbell | 9/30/2015 4:12 |
| 5 | 10482257 | Title II kills investment? Comcast and other I... | http://arstechnica.com/business/2015/10/comcas... | 53 | 22 | Deinos | 10/31/2015 9:48 |


## Subsetting the Dataset for "Ask HN" and "Show HN" Posts

In [12]:
# Convert the title column to lowercase so the prefix match is case-insensitive
hn['title'] = hn['title'].str.lower()

# Subsetting the ask hn posts
boolean_ask = hn['title'].str.startswith('ask hn')
hn_ask_posts = hn[boolean_ask]
hn_ask_posts.head()

| | id | title | url | num_points | num_comments | author | created_at |
|---|---|---|---|---|---|---|---|
| 7 | 12296411 | ask hn: how to improve my personal website? | | 2 | 6 | ahmedbaracat | 8/16/2016 9:55 |
| 17 | 10610020 | ask hn: am i the only one outraged by twitter ... | | 28 | 29 | tkfx | 11/22/2015 13:43 |
| 22 | 11610310 | ask hn: aby recent changes to css that broke m... | | 1 | 1 | polskibus | 5/2/2016 10:14 |
| 30 | 12210105 | ask hn: looking for employee #3 how do i do it? | | 1 | 3 | sph130 | 8/2/2016 14:20 |
| 31 | 10394168 | ask hn: someone offered to buy my browser exte... | | 28 | 17 | roykolak | 10/15/2015 16:38 |


In [13]:
# Subsetting the show hn posts
boolean_show = hn['title'].str.startswith('show hn')
hn_show_posts = hn[boolean_show]
hn_show_posts.head()

| | id | title | url | num_points | num_comments | author | created_at |
|---|---|---|---|---|---|---|---|
| 13 | 10627194 | show hn: wio link esp8266 based web of things... | https://iot.seeed.cc | 26 | 22 | kfihihc | 11/25/2015 14:03 |
| 39 | 10646440 | show hn: something pointless i made | http://dn.ht/picklecat/ | 747 | 102 | dhotson | 11/29/2015 22:46 |
| 46 | 11590768 | show hn: shanhu.io, a programming playground p... | https://shanhu.io | 1 | 1 | h8liu | 4/28/2016 18:05 |
| 84 | 12178806 | show hn: webscope easy way for web developers... | http://webscopeapp.com | 3 | 3 | fastbrick | 7/28/2016 7:11 |
| 97 | 10872799 | show hn: geoscreenshot easily test geo-ip bas... | https://www.geoscreenshot.com/ | 1 | 9 | kpsychwave | 1/9/2016 20:45 |
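
Lowercasing `title` in place makes the prefix match case-insensitive, but it also discards the original capitalization. As a minimal sketch of an alternative, the example below labels each post in a separate column and leaves `title` untouched. The small `sample` frame, the `label_post_type` helper, and the `post_type` column are hypothetical names introduced only for illustration; they are not part of the original notebook.

```python
import pandas as pd

# Hypothetical miniature stand-in for the hacker_news.csv data
sample = pd.DataFrame({
    "title": [
        "Ask HN: How to improve my personal website?",
        "Show HN: Something pointless I made",
        "Interactive Dynamic Video",
        "ask hn: looking for employee #3 how do i do it?",
    ],
    "num_comments": [6, 102, 52, 3],
})


def label_post_type(titles: pd.Series) -> pd.Series:
    """Return 'ask', 'show', or 'other' for each title, ignoring case."""
    lowered = titles.str.lower()  # temporary lowercase copy; input is not modified
    labels = pd.Series("other", index=titles.index)
    labels[lowered.str.startswith("ask hn")] = "ask"
    labels[lowered.str.startswith("show hn")] = "show"
    return labels


# Store the label in a new column so the original titles keep their casing
sample["post_type"] = label_post_type(sample["title"])
print(sample[["title", "post_type"]])
```

With a label column like this, the ask and show subsets can be selected with an ordinary boolean filter such as `sample[sample["post_type"] == "ask"]`.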


## Determine if ask posts or show posts receive more comments on average

In [14]:
# Average number of comments per ask post
avg_ask_comments = hn_ask_posts['num_comments'].sum() / len(hn_ask_posts)
print("Average number of comments on ask posts:", avg_ask_comments)

# Average number of comments per show post
avg_show_comments = hn_show_posts['num_comments'].sum() / len(hn_show_posts)
print("Average number of comments on show posts:", avg_show_comments)

Average number of comments on ask posts: 14.038417431192661
Average number of comments on show posts: 10.31669535283993


On average, ask posts in our sample receive approximately 14 comments, whereas show posts receive approximately 10. Since ask posts receive more comments on average, we'll focus the remaining analysis on these posts.  
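
The same comparison can be made in one step by grouping on a post-type label and taking the mean. This is only a sketch on made-up numbers: the `posts` frame and its `post_type` column are assumptions for illustration and do not come from the data set.

```python
import pandas as pd

# Hypothetical toy data; the values are invented for illustration only
posts = pd.DataFrame({
    "post_type": ["ask", "ask", "show", "show", "other"],
    "num_comments": [6, 30, 22, 2, 52],
})

# Keep only ask and show posts, then average the comments within each group
avg_comments = (
    posts[posts["post_type"].isin(["ask", "show"])]
    .groupby("post_type")["num_comments"]
    .mean()
)
print(avg_comments)
```

Grouping avoids repeating the sum-and-divide arithmetic for each subset and scales to more post types without extra code.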


## Calculate the number of ask posts created per hour, along with the total number of comments

In [18]:
# Work on an explicit copy so that adding columns does not trigger SettingWithCopyWarning
result_list = hn_ask_posts[["num_comments", "created_at"]].copy()

# Parse the timestamps and extract the hour of day as a two-digit string
result_list['datetime'] = pd.to_datetime(result_list['created_at'])
result_list['freq'] = result_list['datetime'].dt.strftime('%H')

# Number of ask posts created in each hour
count_by_hour = result_list.groupby('freq').count()
count_by_hour = count_by_hour.drop(["created_at", "datetime"], axis=1)
count_by_hour


| freq | num_comments |
|---|---|
| 0 | 55 |
| 1 | 60 |
| 2 | 58 |
| 3 | 54 |
| 4 | 47 |
| 5 | 46 |
| 6 | 44 |
| 7 | 34 |
| 8 | 48 |
| 9 | 45 |
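
As a hedged sketch of the hour-extraction step, the example below parses `created_at` with an explicit format string and stores the hour as an integer rather than a string. The format `%m/%d/%Y %H:%M` is inferred from the sample rows displayed earlier and is an assumption about the whole file; the `ask_sample` frame and the `hour` column are hypothetical names.

```python
import pandas as pd

# Hypothetical rows in the same shape as the ask posts shown earlier
ask_sample = pd.DataFrame({
    "num_comments": [6, 29, 1],
    "created_at": ["8/16/2016 9:55", "11/22/2015 13:43", "5/2/2016 10:14"],
})

# Take an explicit copy before adding columns to a column subset
hours = ask_sample[["num_comments", "created_at"]].copy()

# An explicit format (assumed here) avoids ambiguity between month/day orderings
hours["datetime"] = pd.to_datetime(hours["created_at"], format="%m/%d/%Y %H:%M")

# Integer hours sort numerically (2 before 10), unlike unpadded strings
hours["hour"] = hours["datetime"].dt.hour
print(hours[["created_at", "hour"]])
```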


In [6]:
# Comments by Hour
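# Sum only num_comments; recent pandas versions no longer silently drop non-numeric columns
comments_by_hour = result_list.groupby('freq')[['num_comments']].sum()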
comments_by_hour = comments_by_hour.sort_values(by = 'num_comments')
comments_by_hour

| freq | num_comments |
|---|---|
| 9 | 251 |
| 7 | 267 |
| 4 | 337 |
| 6 | 397 |
| 3 | 421 |
| 0 | 447 |
| 5 | 464 |
| 22 | 479 |
| 8 | 492 |
| 23 | 543 |


In [7]:
# Counts by hour
count_by_hour = result_list.groupby('freq').count()
count_by_hour = count_by_hour.drop(["created_at","datetime"], axis = 1)
count_by_hour = count_by_hour.sort_values(by = 'num_comments')
count_by_hour

| freq | num_comments |
|---|---|
| 7 | 34 |
| 6 | 44 |
| 9 | 45 |
| 5 | 46 |
| 4 | 47 |
| 8 | 48 |
| 3 | 54 |
| 0 | 55 |
| 11 | 58 |
| 2 | 58 |


## Calculate the average number of comments per post for posts created during each hour of the day

In [8]:
avg_by_hour = comments_by_hour / count_by_hour
avg_by_hour = avg_by_hour.sort_values(by = 'num_comments')
avg_by_hour

| freq | num_comments |
|---|---|
| 9 | 5.577778 |
| 22 | 6.746479 |
| 4 | 7.170213 |
| 3 | 7.796296 |
| 7 | 7.852941 |
| 23 | 7.985294 |
| 0 | 8.127273 |
| 6 | 9.022727 |
| 12 | 9.410959 |
| 5 | 10.086957 |
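
The post count, comment total, and average per hour can also be computed in a single pass and sorted so the busiest hours appear first. The sketch below uses invented numbers; the `hours` frame and the output column names `posts`, `total_comments`, and `avg_comments` are assumptions for illustration.

```python
import pandas as pd

# Hypothetical toy data: one row per ask post, with its hour of creation
hours = pd.DataFrame({
    "hour": [15, 15, 9, 9, 9, 2],
    "num_comments": [40, 20, 3, 6, 9, 12],
})

# Named aggregation computes all three statistics in one groupby
summary = (
    hours.groupby("hour")["num_comments"]
    .agg(posts="count", total_comments="sum", avg_comments="mean")
    .sort_values("avg_comments", ascending=False)  # highest average first
)
print(summary)
```

Sorting in descending order puts the hours of interest at the top, which matters when the notebook display truncates long tables.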


The hour that receives the most comments per post on average is 15:00, with an average of 38.59 comments per post. That is about 60% more than the average for the hour with the second-highest number of comments per post.
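
For reference, a relative increase like the one quoted above is the difference between the two averages divided by the smaller one. The helper below is a hypothetical illustration, and the values passed to it are made up rather than taken from the data set.

```python
def percent_increase(highest: float, second_highest: float) -> float:
    """Return how much larger `highest` is than `second_highest`, in percent."""
    return (highest - second_highest) / second_highest * 100


# Made-up averages, used only to show the arithmetic
print(percent_increase(30.0, 20.0))
```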

According to the data set documentation, the timestamps are in US Eastern Time, so 15:00 corresponds to 3:00 p.m. ET.
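
Readers in other time zones may want to translate the recommended hour. The sketch below assumes the timestamps are naive Eastern Time values, as the documentation states, and uses the IANA zone name `America/New_York`. The example timestamp and the variable names are hypothetical.

```python
import pandas as pd

# Hypothetical naive timestamp, assumed to be recorded in US Eastern Time
naive = pd.Series(pd.to_datetime(["2016-08-16 15:00"]))

# Attach the Eastern zone, then convert to other zones
eastern = naive.dt.tz_localize("America/New_York")
utc = eastern.dt.tz_convert("UTC")
pacific = eastern.dt.tz_convert("America/Los_Angeles")

print(utc.dt.hour.tolist(), pacific.dt.hour.tolist())
```

Because the zone name carries daylight-saving rules, the offset from UTC depends on the date: a mid-August post falls under daylight time (UTC-4), not standard time (UTC-5).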

## Conclusion

In this project, we analyzed ask posts and show posts to determine which type of post and which time of day receive the most comments on average. Based on our analysis, to maximize the number of comments a post receives, we recommend submitting it as an Ask HN post between 15:00 and 16:00 (3:00 p.m. to 4:00 p.m. ET).