# Exploring Hacker News Posts

In this project, we'll work with a data set of submissions to the popular technology site [Hacker News](https://news.ycombinator.com/).

We will work on a lighter version of this [database](https://www.kaggle.com/hacker-news/hacker-news-posts): about 20,000 rows instead of roughly 300,000 in the original.

We're specifically interested in posts whose titles begin with either Ask HN or Show HN. Ask HN posts ask the community a specific question; Show HN posts show the community a project, a product, or something similar.

We'll compare these two types of posts to determine the following:

- Do Ask HN or Show HN posts receive more comments on average?
- Do posts created at a certain time receive more comments on average?

In [1]:
from csv import reader

# Read the whole CSV into a list of rows; the with-block closes the file.
with open('hacker_news.csv') as opened:
    hacker_data = list(reader(opened))

header = hacker_data[0]  # column names
hn = hacker_data[1:]     # data rows only

In [2]:
print(header)

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']


In [3]:
# Print the first five rows to inspect the data.
for row in hn[:5]:
    print(row)

['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52']
['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30']
['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20']
['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']
['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']


Now we will separate the Ask HN and Show HN posts from the rest.

In [4]:
ask_posts = []
show_posts = []
other_posts = []

for row in hn:
    # Compare lowercase titles so the prefix match is case-insensitive.
    title = row[1].lower()

    if title.startswith('ask hn'):
        ask_posts.append(row)
    elif title.startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)

print('Ask Posts : ', len(ask_posts))
print('Show Posts :',len(show_posts))
print('Other Posts :',len(other_posts))

Ask Posts :  1744
Show Posts : 1162
Other Posts : 17194


In [5]:
total_ask_comments = 0

for row in ask_posts : 
    comment = int(row[4])
    total_ask_comments += comment

avg_ask_comment = total_ask_comments / len(ask_posts)
print('Ask Comment AVG : ', avg_ask_comment)

total_show_comments = 0

for row in show_posts : 
    comment = int(row[4])
    total_show_comments += comment
    
avg_show_comments = total_show_comments / len(show_posts)
print('Show Comment AVG : ', avg_show_comments)

Ask Comment AVG :  14.038417431192661
Show Comment AVG :  10.31669535283993


Since Ask HN posts receive more comments on average (about 14 versus about 10 for Show HN posts), we'll focus the remaining analysis on them.

We'll now use the `datetime` module to find out whether Ask HN posts created at certain hours attract more comments.
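
To make the parsing step concrete, here is a minimal sketch of how `datetime.strptime` turns a `created_at` string in this data set's format into an hour label. The sample timestamp is copied from the first row printed earlier; the variable names are illustrative only.

```python
import datetime as dt

# Sample created_at value copied from the first row of the data set.
sample = "8/4/2016 11:52"

# When parsing, %m and %d also accept months and days without a leading zero.
parsed = dt.datetime.strptime(sample, "%m/%d/%Y %H:%M")

# strftime("%H") returns the hour as a zero-padded two-digit string.
hour_label = parsed.strftime("%H")
print(hour_label)  # 11
```

Keeping the hour as a zero-padded string means the labels sort the same way alphabetically and numerically.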

In [6]:
import datetime as dt

# Pair each Ask HN post's creation timestamp with its comment count.
result_lists = []

for row in ask_posts:
    result_lists.append([row[6], int(row[4])])

counts_by_hour = {}    # number of posts created in each hour
comments_by_hour = {}  # total comments on posts created in each hour

for row in result_lists:
    date = row[0]
    comment = row[1]
    # Extract the hour as a zero-padded string, e.g. '09'.
    time = dt.datetime.strptime(date, "%m/%d/%Y %H:%M").strftime("%H")

    if time in counts_by_hour:
        counts_by_hour[time] += 1
        comments_by_hour[time] += comment
    else:
        counts_by_hour[time] = 1
        comments_by_hour[time] = comment

# Average comments per post for each hour.
avg_by_hour = []

for time in counts_by_hour:
    avg_by_hour.append([time, comments_by_hour[time] / counts_by_hour[time]])

In [7]:
print(avg_by_hour)

[['09', 5.5777777777777775], ['13', 14.741176470588234], ['10', 13.440677966101696], ['14', 13.233644859813085], ['16', 16.796296296296298], ['23', 7.985294117647059], ['12', 9.41095890410959], ['17', 11.46], ['15', 38.5948275862069], ['21', 16.009174311926607], ['20', 21.525], ['02', 23.810344827586206], ['18', 13.20183486238532], ['03', 7.796296296296297], ['05', 10.08695652173913], ['19', 10.8], ['01', 11.383333333333333], ['22', 6.746478873239437], ['08', 10.25], ['04', 7.170212765957447], ['00', 8.127272727272727], ['06', 9.022727272727273], ['07', 7.852941176470588], ['11', 11.051724137931034]]


In [8]:
swap_avg_by_hour = []

for row in avg_by_hour : 
    time = row[0]
    avgcomment = row[1]
    swap_avg_by_hour.append([avgcomment,time])

print(swap_avg_by_hour)

[[5.5777777777777775, '09'], [14.741176470588234, '13'], [13.440677966101696, '10'], [13.233644859813085, '14'], [16.796296296296298, '16'], [7.985294117647059, '23'], [9.41095890410959, '12'], [11.46, '17'], [38.5948275862069, '15'], [16.009174311926607, '21'], [21.525, '20'], [23.810344827586206, '02'], [13.20183486238532, '18'], [7.796296296296297, '03'], [10.08695652173913, '05'], [10.8, '19'], [11.383333333333333, '01'], [6.746478873239437, '22'], [10.25, '08'], [7.170212765957447, '04'], [8.127272727272727, '00'], [9.022727272727273, '06'], [7.852941176470588, '07'], [11.051724137931034, '11']]


In [9]:
sorted_swap = sorted(swap_avg_by_hour,reverse = True)
print(sorted_swap)

[[38.5948275862069, '15'], [23.810344827586206, '02'], [21.525, '20'], [16.796296296296298, '16'], [16.009174311926607, '21'], [14.741176470588234, '13'], [13.440677966101696, '10'], [13.233644859813085, '14'], [13.20183486238532, '18'], [11.46, '17'], [11.383333333333333, '01'], [11.051724137931034, '11'], [10.8, '19'], [10.25, '08'], [10.08695652173913, '05'], [9.41095890410959, '12'], [9.022727272727273, '06'], [8.127272727272727, '00'], [7.985294117647059, '23'], [7.852941176470588, '07'], [7.796296296296297, '03'], [7.170212765957447, '04'], [6.746478873239437, '22'], [5.5777777777777775, '09']]


In [10]:
print('Top 5 Hours for Ask Posts Comments')

Top 5 Hours for Ask Posts Comments


In [15]:
top5_sorted = sorted_swap[:5]

for row in top5_sorted:
    hour = row[1]
    avg = row[0]
    # '%H:%M' formats hours and minutes ('%S' would print seconds).
    new_hour = dt.datetime.strptime(hour, '%H').strftime('%H:%M')
    sentence = '{} : {:.2f} average comments per post'.format(new_hour, avg)
    print(sentence)

15:00 : 38.59 average comments per post
02:00 : 23.81 average comments per post
20:00 : 21.52 average comments per post
16:00 : 16.80 average comments per post
21:00 : 16.01 average comments per post


The hour that receives the most comments per post on average is 15:00, with an average of 38.59 comments per post.

According to the documentation of the dataset, the timezone used is US Eastern Time, so we need to add 6 hours to get the equivalent time in France: 15:00 Eastern corresponds to 21:00 French time.
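
As a rough illustration of that conversion, the sketch below shifts an hour label by a fixed offset using `timedelta`. The 6-hour offset comes from the text above; the helper name `shift_hour` is hypothetical. A fixed offset also ignores the few weeks each year when the US and Europe are on different daylight-saving schedules.

```python
import datetime as dt

def shift_hour(hour_label, offset_hours=6):
    """Shift an 'HH' hour label by a fixed number of hours, wrapping past midnight."""
    base = dt.datetime.strptime(hour_label, "%H")
    shifted = base + dt.timedelta(hours=offset_hours)
    return shifted.strftime("%H:%M")

# Convert a few of the top hours from Eastern Time to French time.
for label in ["15", "02", "20"]:
    print(label + ":00 Eastern ->", shift_hour(label), "in France")
```

For exact conversions on specific dates, a timezone-aware approach would be more robust than a fixed offset.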