# Exploring Hacker News Posts

In this project, we'll work with a data set of submissions to the popular technology site Hacker News.

Hacker News is a site started by the startup incubator Y Combinator, where user-submitted stories (known as "posts") are voted and commented upon, similar to Reddit. Hacker News is extremely popular in technology and startup circles, and posts that make it to the top of Hacker News' listings can get hundreds of thousands of visitors as a result.

You can find the data set <a href = "https://www.kaggle.com/hacker-news/hacker-news-posts">here</a>. Below are descriptions of the columns:

*  __id__: The unique identifier from Hacker News for the post
*  __title__: The title of the post
*  __url__: The URL that the post links to, if the post has a URL
*  __num_points__: The number of points the post acquired, calculated as the total number of upvotes minus the total number of downvotes
*  __num_comments__: The number of comments that were made on the post
*  __author__: The username of the person who submitted the post
*  __created_at__: The date and time at which the post was submitted

The dataset has been reduced from almost 300,000 rows to approximately 80,000 rows by removing all submissions that did not receive any comments:

In [16]:
import pandas as pd
import numpy as np

HN_Posts = pd.read_csv('HN_posts.csv')
HN_Posts = HN_Posts.query("num_comments !=0")
HN_Posts.head()

Unnamed: 0,id,title,url,num_points,num_comments,author,created_at
5,12578975,Saving the Hassle of Shopping,https://blog.menswr.com/2016/09/07/whats-new-w...,1,1,bdoux,9/26/2016 3:13
10,12578908,Ask HN: What TLD do you use for local developm...,,4,7,Sevrene,9/26/2016 2:53
17,12578822,Amazons Algorithms Dont Find You the Best Deals,https://www.technologyreview.com/s/602442/amaz...,1,1,yarapavan,9/26/2016 2:26
28,12578694,Emergency dose of epinephrine that does not co...,http://m.imgur.com/gallery/th6Ua,2,1,dredmorbius,9/26/2016 1:54
34,12578624,Phone Makers Could Cut Off Drivers. So Why Don...,http://www.nytimes.com/2016/09/25/technology/p...,4,1,danso,9/26/2016 1:37


In [23]:
HN_Posts.shape[0]

80401
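
The Unnamed: 0 column in the output above typically appears when a CSV file was written together with its index. Assuming that is the case here, a minimal sketch of loading the file without that extra column could look like this. The CSV text below is a made-up miniature, not the real file:

```python
import io
import pandas as pd

# Hypothetical miniature of HN_posts.csv; the unnamed first column mimics a saved index.
csv_text = """,id,title,url,num_points,num_comments,author,created_at
5,12578975,Saving the Hassle of Shopping,https://example.com/a,1,1,bdoux,9/26/2016 3:13
6,12578954,A post nobody commented on,https://example.com/b,2,0,someone,9/26/2016 3:05
10,12578908,Ask HN: What TLD do you use?,,4,7,Sevrene,9/26/2016 2:53
"""

# index_col=0 uses the unnamed first column as the index instead of keeping it as data.
posts = pd.read_csv(io.StringIO(csv_text), index_col=0)

# Keep only posts that received at least one comment.
commented = posts[posts["num_comments"] != 0]

print(list(commented.columns))
print(commented.shape[0])
```

With the real file, io.StringIO(csv_text) would be replaced by the file name.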

### Choosing the Relevant Posts

We're specifically interested in posts whose titles begin with either Ask HN or Show HN. Users submit Ask HN posts to ask the Hacker News community a specific question, and Show HN posts to show the community a project, product, or something else of interest.

We'll compare these two types of posts to determine the following:

*  Do __Ask HN__ or __Show HN__ posts receive more comments on average?
*  Do posts created at a certain time receive more comments on average?

We're ready to filter our data. Since we're only concerned with posts whose titles begin with Ask HN or Show HN, we'll filter the dataset down to just those posts. Let's start with posts with Ask HN titles:

In [17]:
search = 'Ask HN'
Ask_HN = HN_Posts["title"].str.startswith(search, na=False)
Ask_HN_filtered = HN_Posts[Ask_HN]
Ask_HN_filtered[:3]

Unnamed: 0,id,title,url,num_points,num_comments,author,created_at
10,12578908,Ask HN: What TLD do you use for local developm...,,4,7,Sevrene,9/26/2016 2:53
42,12578522,Ask HN: How do you pass on your work when you ...,,6,3,PascLeRasc,9/26/2016 1:17
80,12577870,Ask HN: Why join a fund when you can be an angel?,,1,3,anthony_james,9/25/2016 22:48


Then posts with Show HN titles:

In [18]:
search = 'Show HN'
Show_HN = HN_Posts["title"].str.startswith(search, na=False)
Show_HN_filtered = HN_Posts[Show_HN]
Show_HN_filtered[:3]

Unnamed: 0,id,title,url,num_points,num_comments,author,created_at
140,12577142,Show HN: Jumble Essays on the go #PaulInYourP...,https://itunes.apple.com/us/app/jumble-find-st...,1,1,ryderj,9/25/2016 20:06
177,12576813,Show HN: Learn Japanese Vocab via multiple cho...,http://japanese.vul.io/,1,1,soulchild37,9/25/2016 19:06
246,12576090,Show HN: Markov chain Twitter bot. Trained on ...,https://twitter.com/botsonasty,3,1,keepingscore,9/25/2016 16:50


In [40]:
print('There are {} posts starting with Ask HN.'.format(Ask_HN_filtered.shape[0]))
print('There are {} posts starting with Show HN.'.format(Show_HN_filtered.shape[0]))

There are 6899 posts starting with Ask HN.
There are 5052 posts starting with Show HN.


Of the roughly 80,000 posts, just under 12,000 have titles starting with Ask HN or Show HN.

### Calculating the Average Number of Comments for Ask HN and Show HN Posts

Next, let's determine if ask posts or show posts receive more comments on average:

In [43]:
avg_comments_ask_hn = Ask_HN_filtered['num_comments'].mean()
avg_comments_show_hn = Show_HN_filtered['num_comments'].mean()

print('There are {} comments on Ask HN posts on average.'.format(avg_comments_ask_hn))
print('There are {} comments on Show HN posts on average.'.format(avg_comments_show_hn))

There are 13.759965212349616 comments on Ask HN posts on average.
There are 9.82125890736342 comments on Show HN posts on average.
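
The same comparison can also be made in one pass by labelling each post with its type and grouping on that label. The sketch below uses a small made-up sample, and the post_type column and its ask, show, and other labels are our own naming rather than anything in the data set:

```python
import numpy as np
import pandas as pd

# Hypothetical sample; only the title and num_comments columns matter here.
posts = pd.DataFrame({
    "title": ["Ask HN: A question", "Show HN: A project", "Ask HN: Another question",
              "Some news story", "Show HN: Another project"],
    "num_comments": [10, 4, 20, 7, 6],
})

# Label each post by its title prefix; anything else falls into "other".
is_ask = posts["title"].str.startswith("Ask HN", na=False)
is_show = posts["title"].str.startswith("Show HN", na=False)
posts["post_type"] = np.select([is_ask, is_show], ["ask", "show"], default="other")

# Count and mean per post type in one step, rounded for readability.
summary = posts.groupby("post_type")["num_comments"].agg(["count", "mean"]).round(2)
print(summary)
```

Keeping the count next to the mean makes it easier to see how many posts each average is based on.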


### Finding the Number of Ask Posts and Comments by Hour Created

Since ask posts receive more comments on average, we'll focus our remaining analysis just on these posts.

Next, we'll determine if we can maximize the number of comments an ask post receives by creating it at a certain time. First, we'll find the number of ask posts created during each hour of the day, along with the number of comments those posts received. Then, we'll calculate the average number of comments received by ask posts created at each hour of the day.

In [19]:
# Work on an explicit copy so the assignments below don't raise SettingWithCopyWarning:
Ask_HN_filtered = Ask_HN_filtered.copy()

# Change the created_at column from string to datetime, then create the hours column:
Ask_HN_filtered['created_at'] = pd.to_datetime(Ask_HN_filtered['created_at'])
Ask_HN_filtered['hours'] = Ask_HN_filtered['created_at'].dt.hour
Ask_HN_filtered[:3]

Unnamed: 0,id,title,url,num_points,num_comments,author,created_at,hours
10,12578908,Ask HN: What TLD do you use for local developm...,,4,7,Sevrene,2016-09-26 02:53:00,2
42,12578522,Ask HN: How do you pass on your work when you ...,,6,3,PascLeRasc,2016-09-26 01:17:00,1
80,12577870,Ask HN: Why join a fund when you can be an angel?,,1,3,anthony_james,2016-09-25 22:48:00,22
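
The created_at strings in this data set follow a month/day/year hour:minute layout, so the conversion can also be given an explicit format. This is a minimal sketch on three values copied from the tables above. It assumes every row uses the same layout:

```python
import pandas as pd

# Sample strings copied from the created_at column shown earlier.
created_at = pd.Series(["9/26/2016 3:13", "9/26/2016 2:53", "9/25/2016 22:48"])

# An explicit format avoids per-value guessing and rejects strings that don't match it.
parsed = pd.to_datetime(created_at, format="%m/%d/%Y %H:%M")
hours = parsed.dt.hour
print(hours.tolist())
```

If some rows used a different layout, this call would raise an error instead of silently guessing, which can be a useful check on a file of this size.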


In [64]:
# Dataframe of number of posts per hour:
ask_posts_by_hour_two = pd.DataFrame(Ask_HN_filtered.groupby(['hours'])['title'].agg('count'))
# Dataframe of sum of comments per hour:
ask_comments_by_hour = pd.DataFrame(Ask_HN_filtered.groupby(['hours'])['num_comments'].agg('sum'))

In [87]:
# Merge the two dataframes on the shared hours index:
ask_posts_by_hour_final = pd.merge(ask_posts_by_hour_two, ask_comments_by_hour, on='hours')

# Create the average comments per post column for each hour:
ask_posts_by_hour_final['avg_comm_per_hour'] = ask_posts_by_hour_final['num_comments'] / ask_posts_by_hour_final['title']

In [88]:
ask_posts_by_hour_final

hours,title,num_comments,avg_comm_per_hour
0,229,2265,9.89083
1,223,2089,9.367713
2,227,2996,13.198238
3,211,2153,10.203791
4,185,2358,12.745946
5,165,1838,11.139394
6,176,1587,9.017045
7,156,1584,10.153846
8,190,2362,12.431579
9,176,1477,8.392045
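
The post count, comment total, and average per hour can also be computed in a single groupby call with named aggregation, which avoids the separate merge step. The sketch below uses a small made-up sample. It assumes pandas 0.25 or later, and the num_posts and total_comments column names are our own:

```python
import pandas as pd

# Hypothetical sample of ask posts with the hours column already extracted.
ask_posts = pd.DataFrame({
    "title": ["Ask HN: a", "Ask HN: b", "Ask HN: c", "Ask HN: d"],
    "num_comments": [4, 8, 3, 9],
    "hours": [2, 2, 15, 15],
})

# Each keyword names an output column and maps it to (input column, aggregation).
by_hour = ask_posts.groupby("hours").agg(
    num_posts=("title", "count"),
    total_comments=("num_comments", "sum"),
    avg_comm_per_hour=("num_comments", "mean"),
)
print(by_hour)
```

Naming the count column num_posts instead of title also makes the resulting table easier to read.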


#### Top 5 Hours for Ask Posts Comments

Although we now have the results we need, this format makes it hard to identify the hours with the highest values. Let's finish by sorting the dataframe by average comments per post, highest first, and appending a row of column totals.

In [90]:
sorted_dataframe = ask_posts_by_hour_final.sort_values(by=['avg_comm_per_hour'], ascending=False)
sorted_dataframe.loc['total'] = sorted_dataframe.sum()
sorted_dataframe

hours,title,num_comments,avg_comm_per_hour
15,467.0,18525.0,39.668094
13,324.0,7227.0,22.305556
12,274.0,4234.0,15.452555
10,219.0,3013.0,13.757991
17,404.0,5547.0,13.730198
2,227.0,2996.0,13.198238
14,377.0,4970.0,13.183024
4,185.0,2358.0,12.745946
8,190.0,2362.0,12.431579
22,286.0,3369.0,11.77972


### Conclusion

In this project, we analyzed ask posts and show posts to determine which type of post and which posting time receive the most comments on average. Based on our analysis, to maximize the number of comments a post receives, we'd recommend that it be an ask post created between 15:00 and 16:00.

However, it should be noted that the data set we analyzed excluded posts without any comments. Given that, it's more accurate to say that, of the posts that received comments, ask posts received more comments on average, and ask posts created between 15:00 and 16:00 (3:00 pm to 4:00 pm EST) received the most comments on average.
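
Readers in other time zones may want to translate the recommended hour. The sketch below assumes the created_at values are recorded in US Eastern Time, as the conclusion suggests. The columns themselves do not state a time zone, so this is an assumption, and the two timestamps are made up for illustration:

```python
import pandas as pd

# Assumption: created_at is recorded in US Eastern Time.
created_at = pd.Series(pd.to_datetime(["2016-09-25 15:10", "2016-09-26 02:53"]))

# Attach the assumed time zone, then convert to UTC.
eastern = created_at.dt.tz_localize("America/New_York")
utc = eastern.dt.tz_convert("UTC")
print(utc.dt.hour.tolist())
```

In late September, Eastern Time is four hours behind UTC because of daylight saving time, so a post created at 15:00 Eastern corresponds to 19:00 UTC.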