# Exploring Hacker News Posts

In this project, we will explore [Hacker News](https://news.ycombinator.com/) posts and analyze the feedback on 'Ask HN' and 'Show HN' posts to see which type is more popular, measured by the average number of comments.

Users submit Ask HN posts to ask the Hacker News community a specific question. Likewise, users submit Show HN posts to show the community a project, product, or just generally something interesting. We'll compare these two types of posts to determine the following:

- Do Ask HN or Show HN posts receive more comments on average?
- Do posts created at a certain time receive more comments on average?


### Dataset:
We'll work with a data set of submissions to the Hacker News site. You can find the data set [here](https://www.kaggle.com/hacker-news/hacker-news-posts). Note that we reduced the data set by removing all submissions that did not receive any comments and then randomly sampling from the remaining submissions.


 

## 1. Opening the Data

In [20]:
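from csv import reader

# Read every row of the CSV into a list of lists; the with block closes the file.
with open('hacker_news.csv') as opened_file:
    hacker_news = list(reader(opened_file))

# Separate the header row from the data rows.
headers = hacker_news[0]
hacker_news = hacker_news[1:]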

print(headers)
print('\n')
for x in hacker_news[:5]:
    print(x)
    print('\n')

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']


['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52']


['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30']


['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20']


['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']


['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']




## 2. Extracting Ask HN and Show HN Posts

Now that we've removed the headers, we're ready to filter our data. We're specifically interested in posts whose titles begin with either Ask HN or Show HN.

So, we'll create new lists of lists containing just the data for those titles. To find the posts that begin with either Ask HN or Show HN, we'll lowercase each title so the match is case-insensitive and then use the string method **startswith**. Let's start by separating the posts beginning with Ask HN and Show HN:
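As a small illustration of the matching step, the sketch below classifies a few titles by prefix. The `classify_title` helper is hypothetical and introduced only for this example; two of the sample titles come from the data set, and the second one has its prefix upper-cased on purpose to show why the title is lowercased first.

```python
def classify_title(title):
    """Return 'ask', 'show', or 'other' based on the title prefix."""
    title_lower = title.lower()
    if title_lower.startswith('ask hn'):
        return 'ask'
    if title_lower.startswith('show hn'):
        return 'show'
    return 'other'


sample_titles = [
    'Ask HN: How to improve my personal website?',
    'SHOW HN: Something pointless I made',  # prefix upper-cased for illustration
    'Interactive Dynamic Video',
]
labels = [classify_title(t) for t in sample_titles]
print(labels)  # ['ask', 'show', 'other']
```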

In [21]:
ask_posts = []
show_posts = []
other_posts = []

for row in hacker_news:
    title = row[1]
    title_lower = title.lower()
    if title_lower.startswith('ask hn'):
        ask_posts.append(row)
    elif title_lower.startswith('show hn'): 
        show_posts.append(row)
    else:
        other_posts.append(row) 

print('The number of posts in ask_posts: ' , len(ask_posts))
print('The number of posts in show_posts: ' , len(show_posts))
print('The number of posts in other_posts: ' , len(other_posts))

The number of posts in ask_posts:  1744
The number of posts in show_posts:  1162
The number of posts in other_posts:  17194


In [24]:
for x in ask_posts[:3]:
    print(x)
    print('\n')
    

['12296411', 'Ask HN: How to improve my personal website?', '', '2', '6', 'ahmedbaracat', '8/16/2016 9:55']


['10610020', 'Ask HN: Am I the only one outraged by Twitter shutting down share counts?', '', '28', '29', 'tkfx', '11/22/2015 13:43']


['11610310', 'Ask HN: Aby recent changes to CSS that broke mobile?', '', '1', '1', 'polskibus', '5/2/2016 10:14']




In [25]:
for x in show_posts[:3]:
    print(x)
    print('\n')

['10627194', 'Show HN: Wio Link  ESP8266 Based Web of Things Hardware Development Platform', 'https://iot.seeed.cc', '26', '22', 'kfihihc', '11/25/2015 14:03']


['10646440', 'Show HN: Something pointless I made', 'http://dn.ht/picklecat/', '747', '102', 'dhotson', '11/29/2015 22:46']


['11590768', 'Show HN: Shanhu.io, a programming playground powered by e8vm', 'https://shanhu.io', '1', '1', 'h8liu', '4/28/2016 18:05']


## 3. Calculating the Average Number of Comments

Next, let's determine if ask posts or show posts receive more comments on average.

In [26]:
total_ask_comments = 0
for row in ask_posts:
    comments = row[4]
    comments = int(comments)
    total_ask_comments += comments
    
avg_ask_comments = total_ask_comments/len(ask_posts)
print('The Average Number of Comments for Ask HN Posts: ' , avg_ask_comments)

The Average Number of Comments for Ask HN Posts:  14.038417431192661


In [27]:
total_show_comments = 0
for row in show_posts:
    comments = row[4]
    comments = int(comments)
    total_show_comments += comments
    
avg_show_comments = total_show_comments/len(show_posts)
print('The Average Number of Comments for Show HN Posts: ', avg_show_comments)

The Average Number of Comments for Show HN Posts:  10.31669535283993


On average, ask posts receive more comments than show posts (about 14 versus about 10 per post). Since ask posts are more likely to receive comments, we'll focus our remaining analysis just on these posts.
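The two averaging loops above follow the same pattern, so they could be folded into one helper. The sketch below is one possible way to do that; `average_comments` and its `comments_index` parameter are hypothetical names, and the three sample rows are the first three ask posts printed earlier.

```python
def average_comments(posts, comments_index=4):
    """Mean of the comment counts stored as strings at comments_index."""
    if not posts:
        return 0.0  # avoid dividing by zero on an empty list
    total = sum(int(row[comments_index]) for row in posts)
    return total / len(posts)


sample_posts = [
    ['12296411', 'Ask HN: How to improve my personal website?', '', '2', '6', 'ahmedbaracat', '8/16/2016 9:55'],
    ['10610020', 'Ask HN: Am I the only one outraged by Twitter shutting down share counts?', '', '28', '29', 'tkfx', '11/22/2015 13:43'],
    ['11610310', 'Ask HN: Aby recent changes to CSS that broke mobile?', '', '1', '1', 'polskibus', '5/2/2016 10:14'],
]
sample_avg = average_comments(sample_posts)
print(sample_avg)  # (6 + 29 + 1) / 3 = 12.0
```

With the full lists, the same helper would be called as `average_comments(ask_posts)` and `average_comments(show_posts)`.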


## 4. Finding the Number of Ask Posts and Comments by Hour Created

Next, we'll determine if ask posts created at a certain time are more likely to attract comments. We'll use the following steps to perform this analysis:

1- Calculate the number of ask posts created in each hour of the day, along with the number of comments received.

2- Calculate the average number of comments ask posts receive by hour created.

We'll use the datetime module to work with the date in the **created_at** column.
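Before applying it to every row, it may help to see the parsing step on a single value. The sketch below parses one `created_at` string taken from the sample rows above and extracts its hour; the format string assumes every timestamp follows the `month/day/year hour:minute` layout seen in the data.

```python
import datetime as dt

created_at = '8/16/2016 9:55'  # sample value from the created_at column

# strptime turns the string into a datetime object.
parsed = dt.datetime.strptime(created_at, '%m/%d/%Y %H:%M')

# strftime turns the datetime back into a string; %H is the zero-padded hour.
hour = parsed.strftime('%H')

print(parsed)  # 2016-08-16 09:55:00
print(hour)    # 09
```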

In [30]:
import datetime as dt

# Pair each ask post's creation timestamp with its comment count.
result_list = []

for row in ask_posts:
    created_at = row[6]
    comments_num = int(row[4])
    result_list.append([created_at, comments_num])

counts_by_hour = {}    # hour -> number of ask posts created in that hour
comments_by_hour = {}  # hour -> total comments on those posts

for row in result_list:
    # Parse the timestamp, then keep only the zero-padded hour, e.g. '09'.
    date = dt.datetime.strptime(row[0], "%m/%d/%Y %H:%M")
    hour = date.strftime("%H")

    if hour not in counts_by_hour:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = row[1]
    else:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += row[1]

We created two dictionaries:
- **counts_by_hour:** contains the number of ask posts created during each hour of the day.
- **comments_by_hour:** contains the total number of comments received by ask posts created during each hour of the day.

Next, we'll use these two dictionaries to calculate the average number of comments for posts created during each hour of the day.
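As an aside, the same two tallies can be built without the explicit `if hour not in ...` check by using `collections.defaultdict`. This is only an alternative sketch, not the approach used in this notebook; the first three sample rows come from the ask posts shown earlier, and the fourth is made up so that one hour appears twice.

```python
from collections import defaultdict
import datetime as dt

# [created_at, comment count] pairs; the last row is hypothetical.
sample_rows = [
    ['8/16/2016 9:55', 6],
    ['11/22/2015 13:43', 29],
    ['5/2/2016 10:14', 1],
    ['1/1/2016 9:05', 4],
]

counts = defaultdict(int)    # missing hours start at 0
comments = defaultdict(int)

for created_at, num_comments in sample_rows:
    hour = dt.datetime.strptime(created_at, '%m/%d/%Y %H:%M').strftime('%H')
    counts[hour] += 1
    comments[hour] += num_comments

print(dict(counts))    # {'09': 2, '13': 1, '10': 1}
print(dict(comments))  # {'09': 10, '13': 29, '10': 1}
```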

## 5. Calculating the Average Number of Comments for Ask HN Posts by Hour

In [35]:
avg_by_hour = []
for hour in comments_by_hour:
    avg = comments_by_hour[hour] / counts_by_hour[hour]
    avg_by_hour.append([hour, avg])
    
print(len(avg_by_hour))

for item in avg_by_hour:
    print(item)

24
['09', 5.5777777777777775]
['13', 14.741176470588234]
['10', 13.440677966101696]
['14', 13.233644859813085]
['16', 16.796296296296298]
['23', 7.985294117647059]
['12', 9.41095890410959]
['17', 11.46]
['15', 38.5948275862069]
['21', 16.009174311926607]
['20', 21.525]
['02', 23.810344827586206]
['18', 13.20183486238532]
['03', 7.796296296296297]
['05', 10.08695652173913]
['19', 10.8]
['01', 11.383333333333333]
['22', 6.746478873239437]
['08', 10.25]
['04', 7.170212765957447]
['00', 8.127272727272727]
['06', 9.022727272727273]
['07', 7.852941176470588]
['11', 11.051724137931034]


Now, **avg_by_hour** contains the average number of comments for posts created during each hour of the day. Let's finish by sorting the values in **avg_by_hour** and printing the highest values in a format that's easier to read.


## 6. Sorting Values 

In [40]:
swap_avg_by_hour = []

for row in avg_by_hour:
    swap_avg_by_hour.append([row[1], row[0]])

for row in swap_avg_by_hour:
    print(row) 

[5.5777777777777775, '09']
[14.741176470588234, '13']
[13.440677966101696, '10']
[13.233644859813085, '14']
[16.796296296296298, '16']
[7.985294117647059, '23']
[9.41095890410959, '12']
[11.46, '17']
[38.5948275862069, '15']
[16.009174311926607, '21']
[21.525, '20']
[23.810344827586206, '02']
[13.20183486238532, '18']
[7.796296296296297, '03']
[10.08695652173913, '05']
[10.8, '19']
[11.383333333333333, '01']
[6.746478873239437, '22']
[10.25, '08']
[7.170212765957447, '04']
[8.127272727272727, '00']
[9.022727272727273, '06']
[7.852941176470588, '07']
[11.051724137931034, '11']


In [44]:
sorted_swap = sorted(swap_avg_by_hour, reverse = True) 

for row in sorted_swap:
    print(row) 


[38.5948275862069, '15']
[23.810344827586206, '02']
[21.525, '20']
[16.796296296296298, '16']
[16.009174311926607, '21']
[14.741176470588234, '13']
[13.440677966101696, '10']
[13.233644859813085, '14']
[13.20183486238532, '18']
[11.46, '17']
[11.383333333333333, '01']
[11.051724137931034, '11']
[10.8, '19']
[10.25, '08']
[10.08695652173913, '05']
[9.41095890410959, '12']
[9.022727272727273, '06']
[8.127272727272727, '00']
[7.985294117647059, '23']
[7.852941176470588, '07']
[7.796296296296297, '03']
[7.170212765957447, '04']
[6.746478873239437, '22']
[5.5777777777777775, '09']
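Swapping the columns is one way to sort by average. Another option is to leave each `[hour, average]` pair as it is and pass a `key` function to `sorted`. The sketch below shows this on four of the pairs above, with the averages rounded to two decimals for brevity; `avg_by_hour_sample` is a hypothetical name used only here.

```python
# [hour, average] pairs, rounded from the output above.
avg_by_hour_sample = [
    ['09', 5.58],
    ['15', 38.59],
    ['02', 23.81],
    ['20', 21.52],
]

# Sort by the second element (the average), highest first.
top_hours = sorted(avg_by_hour_sample, key=lambda pair: pair[1], reverse=True)

for hour, avg in top_hours:
    print(hour, avg)
```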


In [56]:
print('Top 5 Hours for Ask Posts Comments')
print('\n')

for average , hour in sorted_swap[0:5]:
    hour_obj = dt.time(hour = int(hour))
    hour = hour_obj.strftime("%H:%M")
    time = "At {h}, the average comments per post is {avg:.2f}. ".format(h =hour , avg= average)
    print(time)
    print('\n')
    
#increase = ((38.59 - 23.81)/23.81)*100 = 62%

Top 5 Hours for Ask Posts Comments


At 15:00, the average comments per post is 38.59. 


At 02:00, the average comments per post is 23.81. 


At 20:00, the average comments per post is 21.52. 


At 16:00, the average comments per post is 16.80. 


At 21:00, the average comments per post is 16.01. 




The results above show a clear gap between the hours with the highest and second-highest average number of comments: posts created at 15:00 received 38.59 comments on average, while posts created at 02:00 received 23.81. That is an increase of about 62%.
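The size of that gap can be checked with a short calculation using the unrounded averages printed earlier. The variable names are illustrative only.

```python
top_avg = 38.5948275862069       # average for 15:00
second_avg = 23.810344827586206  # average for 02:00

# Relative increase of the top hour over the runner-up, in percent.
pct_increase = (top_avg - second_avg) / second_avg * 100
print(f'{pct_increase:.1f}%')
```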


## Conclusions

Based on these averages, creating an Ask HN post during the following hours gives a higher chance of receiving comments:
**15:00, 02:00,** and **20:00** EST (US). Those times are **11 pm, 10 am,** and **4 am** in **Riyadh, Saudi Arabia.**
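The conversion to Riyadh time can be reproduced with fixed UTC offsets. This sketch assumes, as the conclusion does, that the timestamps are in US Eastern Standard Time (UTC-5); while daylight saving time is in effect (EDT, UTC-4) the difference to Riyadh would be seven hours instead of eight. The helper name `est_hour_to_riyadh` and the date used are arbitrary choices made for the example.

```python
import datetime as dt

EST = dt.timezone(dt.timedelta(hours=-5), 'EST')    # fixed UTC-5, no daylight saving
RIYADH = dt.timezone(dt.timedelta(hours=3), 'AST')  # Arabia Standard Time, UTC+3


def est_hour_to_riyadh(hour):
    """Convert a whole hour in EST to the matching HH:MM time in Riyadh."""
    # The date is arbitrary; only the time of day matters here.
    est_time = dt.datetime(2016, 1, 15, hour, 0, tzinfo=EST)
    return est_time.astimezone(RIYADH).strftime('%H:%M')


riyadh_times = [est_hour_to_riyadh(h) for h in (15, 2, 20)]
print(riyadh_times)  # ['23:00', '10:00', '04:00']
```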