# Exploring Hacker News Posts

Hacker News is a social news website focusing on computer science and entrepreneurship, where user-submitted stories are voted and commented on.

Specifically, we are going to focus on two types of posts:

1. Ask HN posts, where users ask questions to the Hacker News community
2. Show HN posts, where users share a project, a product, or anything else they find interesting

We'll compare these two types of posts to determine the following:

1. Do Ask HN or Show HN posts receive more comments on average?
2. Do posts created at a certain time receive more comments on average?

First, we import the CSV file and display the first five rows of the dataset (the header row plus four posts).

In [1]:
from csv import reader

# Open the file in a context manager so it is closed automatically
with open('hacker_news.csv') as open_file:
    hn = list(reader(open_file))
print(hn[:5])

[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'], ['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']]


## Removing headers from a list of lists

In [2]:
#Extract the header
headers = hn[0]
#Extract the data
hn = hn[1:]
print(headers)
print(hn[:5])

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
[['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01'], ['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']]
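As an alternative to slicing, the header can be taken off the reader before the remaining rows are collected. The sketch below is only an illustration: `sample_csv` is a hypothetical in-memory stand-in for `hacker_news.csv`, and `sample_headers` and `sample_data` are names introduced here so the example runs on its own.

```python
import io
from csv import reader

# Hypothetical one-post sample standing in for hacker_news.csv
sample_csv = io.StringIO(
    "id,title,url,num_points,num_comments,author,created_at\n"
    "12224879,Interactive Dynamic Video,"
    "http://www.interactivedynamicvideo.com/,386,52,ne0phyte,8/4/2016 11:52\n"
)

rows = reader(sample_csv)
sample_headers = next(rows)  # the first row is the header
sample_data = list(rows)     # the remaining rows are the data

print(sample_headers)
print(sample_data)
```

Either approach leaves the header and the data in separate variables; `next()` simply avoids keeping the header inside the data list at any point.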


Since we are concentrating on **Ask HN** and **Show HN** posts, we separate them from the rest by checking each lowercased title with the string method `startswith()`.

In [3]:
ask_posts = []
show_posts = []
other_posts = []
for row in hn:
    title = row[1]
    if title.lower().startswith('ask hn'):
        ask_posts.append(row)
    elif title.lower().startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)
        
print("Length of Ask_Posts   : ", len(ask_posts))
print("Length of Show_Posts  : ", len(show_posts))
print("Length of Other_Posts : ", len(other_posts))

Length of Ask_Posts   :  1744
Length of Show_Posts  :  1162
Length of Other_Posts :  17194
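The prefix check can be isolated in a small helper to make the case-insensitive matching easier to see. This is a minimal sketch: `classify_title` and the sample titles are hypothetical and are not part of the dataset.

```python
def classify_title(title):
    """Return 'ask', 'show', or 'other' based on the title prefix."""
    lowered = title.lower()
    if lowered.startswith('ask hn'):
        return 'ask'
    if lowered.startswith('show hn'):
        return 'show'
    return 'other'


# Made-up titles for illustration only
sample_titles = [
    'Ask HN: How do you learn a new language?',
    'SHOW HN: My weekend project',
    'Interactive Dynamic Video',
]
labels = [classify_title(title) for title in sample_titles]
print(labels)  # ['ask', 'show', 'other']
```

Because this is a plain prefix match, any title that merely begins with the same characters (for example `'Ask HNers ...'`) would also be counted as an ask post.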


## Calculate the average number of comments for Ask HN and Show HN posts

In [4]:
# Determine the average number of comments on Ask HN posts
total_ask_comments = 0
for post in ask_posts:
    total_ask_comments += int(post[4])

avg_ask_comments = total_ask_comments / len(ask_posts)
print("Average number of comments on ask posts : ", avg_ask_comments)

Average number of comments on ask posts :  14.038417431192661


In [5]:
# Determine the average number of comments on Show HN posts
total_show_comments = 0
for post in show_posts:
    total_show_comments += int(post[4])

avg_show_comments = total_show_comments / len(show_posts)
print("Average number of comments on show posts : ", avg_show_comments)

Average number of comments on show posts :  10.31669535283993


The analysis above shows that Ask HN posts receive more comments on average (approximately 14) than Show HN posts (approximately 10), so the remaining analysis focuses on Ask HN posts.
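The two averaging loops differ only in the list they iterate over, so the calculation could be factored into one function. The following is a sketch under assumptions: `average_comments` and `sample_posts` are hypothetical names, and the rows are made up in the same column order as the dataset.

```python
def average_comments(posts, comment_index=4):
    """Mean of the comment counts stored as strings at comment_index."""
    if not posts:
        return 0.0  # avoid dividing by zero on an empty list
    total = sum(int(post[comment_index]) for post in posts)
    return total / len(posts)


# Made-up rows in the same column order as the dataset
sample_posts = [
    ['1', 'Ask HN: A', '', '5', '10', 'user_a', '8/4/2016 11:52'],
    ['2', 'Ask HN: B', '', '3', '4', 'user_b', '8/4/2016 12:10'],
]
print(average_comments(sample_posts))  # 7.0
```

With such a helper, the same call would serve both `ask_posts` and `show_posts`.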

## Calculate the number of ask posts created in each hour of the day, along with the number of comments received

In [6]:
import datetime as dt
result_list = []

# Separate the creation date and the number of comments of each
# ask post into a list of lists
for post in ask_posts:
    result_list.append([post[6], int(post[4])])

# Dictionaries holding the number of posts and the number of
# comments for each hour
counts_by_hour = {}
comments_by_hour = {}
date_format = "%m/%d/%Y %H:%M"

for row in result_list:
    date = row[0]
    comment = row[1]
    # Extract the hour from the creation date
    time = dt.datetime.strptime(date,date_format).strftime("%H")
    if time not in counts_by_hour:
        counts_by_hour[time] = 1
        comments_by_hour[time] = comment
    else:
        counts_by_hour[time] += 1
        comments_by_hour[time] += comment

print("Counts by hour")
print(counts_by_hour)
print("Comments by hour")
print(comments_by_hour)

Counts by hour
{'09': 45, '13': 85, '10': 59, '14': 107, '16': 108, '23': 68, '12': 73, '17': 100, '15': 116, '21': 109, '20': 80, '02': 58, '18': 109, '03': 54, '05': 46, '19': 110, '01': 60, '22': 71, '08': 48, '04': 47, '00': 55, '06': 44, '07': 34, '11': 58}
Comments by hour
{'09': 251, '13': 1253, '10': 793, '14': 1416, '16': 1814, '23': 543, '12': 687, '17': 1146, '15': 4477, '21': 1745, '20': 1722, '02': 1381, '18': 1439, '03': 421, '05': 464, '19': 1188, '01': 683, '22': 479, '08': 492, '04': 337, '00': 447, '06': 397, '07': 267, '11': 641}
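The "first time seen" branch can be avoided with `collections.defaultdict`, which starts every missing key at zero. The sketch below swaps that in as an alternative and is not the notebook's own method; `sample_results`, `sample_counts`, and `sample_comments` are hypothetical names, and the pairs are made up.

```python
import datetime as dt
from collections import defaultdict

date_format = "%m/%d/%Y %H:%M"

# Made-up [created_at, num_comments] pairs for illustration
sample_results = [
    ['8/4/2016 11:52', 6],
    ['1/26/2016 19:30', 10],
    ['6/23/2016 11:05', 2],
]

sample_counts = defaultdict(int)    # posts per hour
sample_comments = defaultdict(int)  # comments per hour

for created_at, n_comments in sample_results:
    hour = dt.datetime.strptime(created_at, date_format).strftime("%H")
    sample_counts[hour] += 1
    sample_comments[hour] += n_comments

print(dict(sample_counts))    # {'11': 2, '19': 1}
print(dict(sample_comments))  # {'11': 8, '19': 10}
```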


## Calculate the average number of comments for posts created during each hour of the day 

In [7]:
avg_by_hour = []
for hour in comments_by_hour:
    avg_by_hour.append([hour,comments_by_hour[hour] / 
                        counts_by_hour[hour]])
print(avg_by_hour)    

[['09', 5.5777777777777775], ['13', 14.741176470588234], ['10', 13.440677966101696], ['14', 13.233644859813085], ['16', 16.796296296296298], ['23', 7.985294117647059], ['12', 9.41095890410959], ['17', 11.46], ['15', 38.5948275862069], ['21', 16.009174311926607], ['20', 21.525], ['02', 23.810344827586206], ['18', 13.20183486238532], ['03', 7.796296296296297], ['05', 10.08695652173913], ['19', 10.8], ['01', 11.383333333333333], ['22', 6.746478873239437], ['08', 10.25], ['04', 7.170212765957447], ['00', 8.127272727272727], ['06', 9.022727272727273], ['07', 7.852941176470588], ['11', 11.051724137931034]]
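Each average is simply the comment total for an hour divided by the post count for that hour. As a quick check, the sketch below recomputes three of them from totals copied from the output above; `counts_subset`, `comments_subset`, and `avg_subset` are hypothetical names used only here.

```python
# Totals copied from the counts and comments printed earlier
counts_subset = {'15': 116, '02': 58, '20': 80}
comments_subset = {'15': 4477, '02': 1381, '20': 1722}

# Average comments per post for each hour in the subset
avg_subset = {
    hour: comments_subset[hour] / counts_subset[hour]
    for hour in comments_subset
}

for hour, avg in avg_subset.items():
    print(hour, round(avg, 2))
```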


In [8]:
# Swap the columns so that sorting orders the rows by average comments
swap_avg_by_hour = []
for row in avg_by_hour:
    swap_avg_by_hour.append([row[1], row[0]])
    
print("After swapping")
print(swap_avg_by_hour)

print("\nAfter Sorting")
sorted_swap = sorted(swap_avg_by_hour, reverse = True)
print(sorted_swap)


After swapping
[[5.5777777777777775, '09'], [14.741176470588234, '13'], [13.440677966101696, '10'], [13.233644859813085, '14'], [16.796296296296298, '16'], [7.985294117647059, '23'], [9.41095890410959, '12'], [11.46, '17'], [38.5948275862069, '15'], [16.009174311926607, '21'], [21.525, '20'], [23.810344827586206, '02'], [13.20183486238532, '18'], [7.796296296296297, '03'], [10.08695652173913, '05'], [10.8, '19'], [11.383333333333333, '01'], [6.746478873239437, '22'], [10.25, '08'], [7.170212765957447, '04'], [8.127272727272727, '00'], [9.022727272727273, '06'], [7.852941176470588, '07'], [11.051724137931034, '11']]

After Sorting
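[[38.5948275862069, '15'], [23.810344827586206, '02'], [21.525, '20'], [16.796296296296298, '16'], [16.009174311926607, '21'], [14.741176470588234, '13'], [13.440677966101696, '10'], [13.233644859813085, '14'], [13.20183486238532, '18'], [11.46, '17'], [11.383333333333333, '01'], [11.051724137931034, '11'], [10.8, '19'], [10.25, '08'], [10.08695652173913, '05'], [9.41095890410959, '12'], [9.022727272727273, '06'], [8.127272727272727, '00'], [7.985294117647059, '23'], [7.852941176470588, '07'], [7.796296296296297, '03'], [7.170212765957447, '04'], [6.746478873239437, '22'], [5.5777777777777775, '09']]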
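Swapping the columns is one way to sort by the average. Another is to pass a `key` function to `sorted()`, which leaves the `[hour, average]` layout unchanged. A minimal sketch, using a few pairs copied from `avg_by_hour` and the hypothetical names `sample_avg_by_hour` and `ranked`:

```python
# A few [hour, average] pairs copied from avg_by_hour above
sample_avg_by_hour = [
    ['09', 5.5777777777777775],
    ['15', 38.5948275862069],
    ['02', 23.810344827586206],
    ['20', 21.525],
]

# Sort on the average (index 1), highest first, without swapping columns
ranked = sorted(sample_avg_by_hour, key=lambda pair: pair[1], reverse=True)
print([hour for hour, _ in ranked])  # ['15', '02', '20', '09']
```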

In [9]:
print("Top 5 Hours for Ask Posts Comments")
for row in sorted_swap[:5]:
    # Turn the hour string (e.g. '15') into an 'HH:MM' label
    hour_label = dt.datetime.strptime(row[1], "%H").strftime("%H:%M")
    print("{}: {:.2f} average comments per post".format(hour_label, row[0]))

Top 5 Hours for Ask Posts Comments
15:00: 38.59 average comments per post
02:00: 23.81 average comments per post
20:00: 21.52 average comments per post
16:00: 16.80 average comments per post
21:00: 16.01 average comments per post


The results show that Ask HN posts created during the 15:00 hour (3:00 pm EST) receive the most comments on average, about 38.59 per post. That is roughly 62% more than the second-ranked hour, 02:00, which averages 23.81 comments per post.
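The gap between the two top-ranked hours can be checked directly from the averages printed above. The variable names in this short sketch (`top_avg`, `second_avg`, `pct_increase`) are introduced here for illustration only.

```python
# Averages copied from the sorted output above
top_avg = 38.5948275862069       # posts created at 15:00
second_avg = 23.810344827586206  # posts created at 02:00

# Relative increase of the top hour over the second-ranked hour
pct_increase = (top_avg - second_avg) / second_avg * 100
print(round(pct_increase, 1))  # 62.1
```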