# Project: Hacker News Data Exploration/Analytics

This project examines two types of posts from Hacker News, Ask HN and Show HN, to find out which receive the most comments. An Ask HN post asks the Hacker News community a question. A Show HN post shares something interesting with the community, such as a project, video, or image.

In this project, we will answer two questions:
1) Which of those two types of posts receives more comments on average?
2) What is the ideal time to publish a post so that it receives more comments on average?


This data set contains Hacker News posts from the 12 months up to September 26, 2016.
It includes the following columns:

title: title of the post (self explanatory)
url: the url of the item being linked to
num_points: the number of upvotes the post received
num_comments: the number of comments the post received
author: the name of the account that made the post
created_at: the date and time the post was made (the time zone is Eastern Time in the US)
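
As a minimal sketch, the columns can be mapped to list positions so that later code reads by name instead of by bare index. The sample row is one of the rows printed later in this notebook. The rows carry a leading numeric identifier ahead of the listed columns; calling it `id` is an assumption, as are the names `COLUMNS` and `INDEX`.

```python
# Column positions inferred from the rows printed in this notebook.
# Assumption: the leading numeric field is a post identifier, named `id` here.
COLUMNS = ["id", "title", "url", "num_points", "num_comments", "author", "created_at"]
INDEX = {name: position for position, name in enumerate(COLUMNS)}

sample_row = [
    "10185057",
    "Why we set up Codified Security",
    "https://codifiedsecurity.com/2015/09/03/why-we-setup-codified-security/",
    "2",
    "1",
    "martinald",
    "9/8/2015 11:33",
]

# Look fields up by name; numeric fields are stored as strings and need int().
title = sample_row[INDEX["title"]]
num_comments = int(sample_row[INDEX["num_comments"]])
created_at = sample_row[INDEX["created_at"]]
print(title, num_comments, created_at)
```

With this mapping, `INDEX["title"]`, `INDEX["num_comments"]`, and `INDEX["created_at"]` correspond to the positions 1, 4, and 6 used by the cells in this notebook.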

License of the dataset: https://creativecommons.org/publicdomain/zero/1.0/

# Ingest MRF

First, we will ingest the MRF and remove the header row. Note: this dataset is from 2016 and contains almost 300,000 rows covering Ask HN, Show HN, and "Other" types of posts. We have processed this dataset so that it contains only posts that received one or more comments.

In [1]:
# Ingest the data
from csv import reader

# Read every row of the CSV into a list of lists; the file is closed automatically.
with open('hacker_news_2016_edited.csv') as raw_ingest_csv:
    hn = list(reader(raw_ingest_csv))

# Preview the first five rows
print(hn[:5])

[['ard-writing/', '3', '0', 'Ciotti', '9/8/2015 11:33'], ['10185057', 'Why we set up Codified Security', 'https://codifiedsecurity.com/2015/09/03/why-we-setup-codified-security/', '2', '1', 'martinald', '9/8/2015 11:33'], ['10185045', 'The Machine Stops', 'http://archive.ncsa.illinois.edu/prajlich/forster.html', '2', '0', 'jal278', '9/8/2015 11:30'], ['10185041', "Inside Popcorn Time  the world's fastest growing piracy site", 'http://www.dn.no/magasinet/2015/09/07/1606/Popcorn-Time/inside-popcorn-time--the-worlds-fastest-growing-piracy-site', '337', '336', 'sleepyhead', '9/8/2015 11:29'], ['10185040', 'Why Are Women Less Likely to Become Entrepreneurs', 'http://www.npr.org/2015/09/08/438473573/why-are-women-less-likely-to-become-entrepreneurs-than-men', '2', '0', 'dynofuz', '9/8/2015 11:29']]


# Removing the headers

In [2]:
headers = hn[0]
hn = hn[1:]
print(headers)
print(hn[:5])

['ard-writing/', '3', '0', 'Ciotti', '9/8/2015 11:33']
[['10185057', 'Why we set up Codified Security', 'https://codifiedsecurity.com/2015/09/03/why-we-setup-codified-security/', '2', '1', 'martinald', '9/8/2015 11:33'], ['10185045', 'The Machine Stops', 'http://archive.ncsa.illinois.edu/prajlich/forster.html', '2', '0', 'jal278', '9/8/2015 11:30'], ['10185041', "Inside Popcorn Time  the world's fastest growing piracy site", 'http://www.dn.no/magasinet/2015/09/07/1606/Popcorn-Time/inside-popcorn-time--the-worlds-fastest-growing-piracy-site', '337', '336', 'sleepyhead', '9/8/2015 11:29'], ['10185040', 'Why Are Women Less Likely to Become Entrepreneurs', 'http://www.npr.org/2015/09/08/438473573/why-are-women-less-likely-to-become-entrepreneurs-than-men', '2', '0', 'dynofuz', '9/8/2015 11:29'], ['10185034', 'The American Dream Is Dead', 'http://www.fastcoexist.com/3049643/the-american-dream-is-dead-heres-where-it-went', '3', '0', 'fsethi', '9/8/2015 11:27']]


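Each complete row holds what appears to be a numeric post ID, followed by the title, URL, number of points, number of comments, author, and creation date, in that order. Note that the first row of this edited file, which the cell above stores in `headers`, is a truncated data record rather than a list of column names. To answer the first question, we will focus on the number of comments each post received (index 4).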

# Extracting ONLY `Ask HN` and `Show HN` Posts From The Dataset

We need only the Ask HN and Show HN posts from the dataset. The cell below iterates through the dataset, checks whether each title starts with `Ask HN` or `Show HN` (ignoring case), and appends the post to the matching list. Every other post goes into a third list. Separating the posts up front makes the later analysis easier.

In [3]:
# Identify posts whose titles begin with `Ask HN` or `Show HN` and separate the data into different lists.
ask_hn_posts = []    # Ask HN posts
show_hn_posts = []   # Show HN posts
other_hn_posts = []  # All other posts

for post in hn:
    title = post[1].lower()  # Lowercase once so the comparison is case-insensitive
    if title.startswith('ask hn'):     # Title starts with "ask hn"
        ask_hn_posts.append(post)
    elif title.startswith('show hn'):  # Title starts with "show hn"
        show_hn_posts.append(post)
    else:                              # Neither Ask HN nor Show HN
        other_hn_posts.append(post)

# Print the lengths of the lists to see how many posts of each type we have.

print(len(ask_hn_posts))
print(len(show_hn_posts))
print(len(other_hn_posts))


39
42
1062


Based on this output, we have:
39 Ask HN posts
42 Show HN posts
1062 Other posts
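
The prefix check used for the split can also be sketched as a small standalone function, which makes the rule easy to try on individual titles. The function name `classify_post` and the first two example titles are hypothetical; the third title appears in the rows printed earlier.

```python
def classify_post(title):
    """Return 'ask', 'show', or 'other' based on the title prefix (case-insensitive)."""
    lowered = title.lower()
    if lowered.startswith("ask hn"):
        return "ask"
    if lowered.startswith("show hn"):
        return "show"
    return "other"


# The first two titles are made up for illustration.
examples = [
    "Ask HN: How do you learn a new language?",
    "SHOW HN: My weekend project",
    "The Machine Stops",
]
labels = [classify_post(title) for title in examples]
print(labels)
```

Because the title is lowercased before the comparison, `SHOW HN` and `Show HN` are treated the same way.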

# Average Number of Comments for `Ask HN` and `Show HN` Posts

The idea here is to take each of our lists and calculate the average number of comments per post for each post type.


In [4]:
total_ask_hn_comments = 0  # Running total of comments on Ask HN posts
for post in ask_hn_posts:
    total_ask_hn_comments += int(post[4])  # Add this post's comment count (index 4) to the total
avg_ask_hn_comments = total_ask_hn_comments / len(ask_hn_posts)  # Average: total comments divided by the number of Ask HN posts
print(avg_ask_hn_comments) 

7.256410256410256


In [5]:
total_show_hn_comments = 0 
for post in show_hn_posts:
    total_show_hn_comments += int(post[4])
avg_show_hn_comments = total_show_hn_comments / len(show_hn_posts)
print(avg_show_hn_comments)

8.119047619047619


It is a close comparison, but on average `Show HN` posts receive more comments than `Ask HN` posts (about 8.12 versus 7.26 per post). Consequently, we will focus on `Show HN` posts for the remainder of this project.
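
The two averaging cells repeat the same loop, so the calculation could be factored into one helper. This is a sketch only: the name `average_comments`, its `comment_index` parameter, and the toy rows are assumptions made for illustration and are not part of the dataset.

```python
def average_comments(posts, comment_index=4):
    """Mean of the comment counts stored as strings at `comment_index`."""
    if not posts:
        return 0.0  # Avoid dividing by zero when a list is empty
    total = sum(int(post[comment_index]) for post in posts)
    return total / len(posts)


# Toy rows in the same seven-column layout as the dataset (values are made up).
toy_posts = [
    ["1", "Show HN: A", "", "5", "10", "alice", "9/8/2015 05:10"],
    ["2", "Show HN: B", "", "3", "4", "bob", "9/8/2015 11:33"],
]
toy_average = average_comments(toy_posts)
print(toy_average)
```

In the notebook, `average_comments(ask_hn_posts)` and `average_comments(show_hn_posts)` would stand in for the two separate loops.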

# Finding the Number of `Show HN` Posts and Comments by Hour Created

Having determined which type of post receives more comments, our next task is to determine whether a `Show HN` post receives more comments when it is posted at a certain time. First, we need the number of `Show HN` posts created during each hour of the day and the number of comments those posts received. With that data, we can compute the mean number of comments per `Show HN` post for each hour of the day. We use the 24-hour format for convenience.

In [6]:
# Calculate the number of Show HN posts created during each hour of the day and the number of comments they received.
import datetime as d  # Used to parse the created_at strings

# Pair each post's creation timestamp with its comment count
results = []
for post in show_hn_posts:
    results.append([post[6], int(post[4])])

dateformat = "%m/%d/%Y %H:%M"
comment_per_hr = {}  # hour -> total comments on posts created in that hour
count_per_hr = {}    # hour -> number of posts created in that hour

for each_row in results:
    date = each_row[0]
    comment = each_row[1]
    time = d.datetime.strptime(date, dateformat).strftime("%H")  # Zero-padded hour, e.g. "05"

    if time in count_per_hr:
        comment_per_hr[time] += comment
        count_per_hr[time] += 1
    else:
        comment_per_hr[time] = comment
        count_per_hr[time] = 1

comment_per_hr

{'11': 45,
 '10': 0,
 '08': 0,
 '06': 36,
 '05': 166,
 '04': 27,
 '03': 2,
 '22': 1,
 '21': 3,
 '19': 14,
 '18': 0,
 '16': 0,
 '13': 9,
 '12': 32,
 '07': 0,
 '02': 6,
 '00': 0,
 '23': 0,
 '15': 0,
 '14': 0}
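
The bucketing step can also be sketched with `collections.defaultdict`, which removes the need for the `if`/`else` branch because missing hours start at zero. The function name `totals_by_hour` and the toy rows are assumptions made for illustration.

```python
from collections import defaultdict
from datetime import datetime

DATE_FORMAT = "%m/%d/%Y %H:%M"


def totals_by_hour(rows):
    """rows: iterable of (created_at string, comment count) pairs."""
    comments = defaultdict(int)  # hour -> total comments
    counts = defaultdict(int)    # hour -> number of posts
    for created_at, n_comments in rows:
        hour = datetime.strptime(created_at, DATE_FORMAT).strftime("%H")
        comments[hour] += n_comments
        counts[hour] += 1
    return dict(comments), dict(counts)


# Toy rows (made up): two posts in the 05:00 hour and one in the 11:00 hour.
toy_rows = [("9/8/2015 05:10", 100), ("9/9/2015 05:45", 66), ("9/8/2015 11:33", 4)]
toy_comments, toy_counts = totals_by_hour(toy_rows)
print(toy_comments)
print(toy_counts)
```

Keeping the per-hour post counts next to the comment totals is what later allows an average per post to be computed for each hour.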

# Average Number of Comments per `Show HN` Post by Hour

In [7]:
avg_by_hr = []

for hr in comment_per_hr:
    avg_by_hr.append([hr, comment_per_hr[hr] / count_per_hr[hr]])  # Average per post: total comments for the hour divided by the number of posts in that hour
avg_by_hr    

[['11', 11.25],
 ['10', 0.0],
 ['08', 0.0],
 ['06', 12.0],
 ['05', 83.0],
 ['04', 27.0],
 ['03', 1.0],
 ['22', 0.5],
 ['21', 3.0],
 ['19', 7.0],
 ['18', 0.0],
 ['16', 0.0],
 ['13', 1.8],
 ['12', 8.0],
 ['07', 0.0],
 ['02', 3.0],
 ['00', 0.0],
 ['23', 0.0],
 ['15', 0.0],
 ['14', 0.0]]

# Sorting Values from our List

In [8]:
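# Swap the columns so the average comes first and the list sorts by it
sort_avg_by_hr = []
for row in avg_by_hr:
    sort_avg_by_hr.append([row[1], row[0]])

print(sort_avg_by_hr)

reverse_sort = sorted(sort_avg_by_hr, reverse = True)  # Descending by average; ties fall back to the hour string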

reverse_sort

[[11.25, '11'], [0.0, '10'], [0.0, '08'], [12.0, '06'], [83.0, '05'], [27.0, '04'], [1.0, '03'], [0.5, '22'], [3.0, '21'], [7.0, '19'], [0.0, '18'], [0.0, '16'], [1.8, '13'], [8.0, '12'], [0.0, '07'], [3.0, '02'], [0.0, '00'], [0.0, '23'], [0.0, '15'], [0.0, '14']]


[[83.0, '05'],
 [27.0, '04'],
 [12.0, '06'],
 [11.25, '11'],
 [8.0, '12'],
 [7.0, '19'],
 [3.0, '21'],
 [3.0, '02'],
 [1.8, '13'],
 [1.0, '03'],
 [0.5, '22'],
 [0.0, '23'],
 [0.0, '18'],
 [0.0, '16'],
 [0.0, '15'],
 [0.0, '14'],
 [0.0, '10'],
 [0.0, '08'],
 [0.0, '07'],
 [0.0, '00']]
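
As an alternative sketch, `sorted` accepts a `key` function, so the list can be ordered by average without swapping the columns first. The sample pairs are taken from the averages printed earlier in this notebook; the variable names are assumptions. The key is the tuple `(average, hour)`, so tied averages are ordered the same way as in the swapped-column version.

```python
# A few [hour, average] pairs taken from the averages printed earlier.
sample_avg_by_hr = [["02", 3.0], ["04", 27.0], ["21", 3.0], ["05", 83.0]]

# Sort by average first and hour string second, both descending.
top_hours = sorted(
    sample_avg_by_hr,
    key=lambda pair: (pair[1], pair[0]),
    reverse=True,
)
print(top_hours)
```

With a key on the average alone, tied hours would keep their original relative order instead, because `sorted` is stable.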

In [9]:
# Print the hours in descending order of average comments (at most 24 entries).

print("Top Hours for 'Show HN' Comments\n")
for avg, hr in reverse_sort[:24]:
    print(
        "\n{}: {:.2f} Average comments per single 'Show HN' post".format(
            d.datetime.strptime(hr, "%H").strftime("%H:%M"), avg
        )
    )

Top Hours for 'Show HN' Comments


05:00: 83.00 Average comments per single 'Show HN' post

04:00: 27.00 Average comments per single 'Show HN' post

06:00: 12.00 Average comments per single 'Show HN' post

11:00: 11.25 Average comments per single 'Show HN' post

12:00: 8.00 Average comments per single 'Show HN' post

19:00: 7.00 Average comments per single 'Show HN' post

21:00: 3.00 Average comments per single 'Show HN' post

02:00: 3.00 Average comments per single 'Show HN' post

13:00: 1.80 Average comments per single 'Show HN' post

03:00: 1.00 Average comments per single 'Show HN' post

22:00: 0.50 Average comments per single 'Show HN' post

23:00: 0.00 Average comments per single 'Show HN' post

18:00: 0.00 Average comments per single 'Show HN' post

16:00: 0.00 Average comments per single 'Show HN' post

15:00: 0.00 Average comments per single 'Show HN' post

14:00: 0.00 Average comments per single 'Show HN' post

10:00: 0.00 Average comments per single 'Show HN' post

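08:00: 0.00 Average comments per single 'Show HN' post

07:00: 0.00 Average comments per single 'Show HN' post

00:00: 0.00 Average comments per single 'Show HN' post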

The hour with the highest average number of comments per post is 05:00, at 83 comments per post. The second-highest hour, 04:00, averages 27 comments per post, so the top hour draws roughly three times as many comments per post (an increase of about 207%). These averages rest on very few posts: dividing the hourly totals by the hourly averages shows two posts at 05:00 and one post at 04:00.
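
A quick arithmetic check of the gap between the two top hours, using the averages printed earlier (83.0 for 05:00 and 27.0 for 04:00). The variable names are assumptions made for illustration.

```python
top_avg = 83.0     # Average comments per post for the 05:00 hour
second_avg = 27.0  # Average comments per post for the 04:00 hour

ratio = top_avg / second_avg                              # How many times larger the top hour is
pct_increase = (top_avg - second_avg) / second_avg * 100  # Increase relative to 04:00
pct_lower = (top_avg - second_avg) / top_avg * 100        # How far 04:00 falls below 05:00

print(f"{ratio:.2f}x, +{pct_increase:.0f}%, -{pct_lower:.0f}%")
```

The percentage depends on which hour is used as the base: measured from 04:00 the increase is about 207%, while measured from 05:00 the drop is about 67%.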

NOTE: The timestamps in this data set are in Eastern Time (United States).
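
For readers in other time zones, the hours can be shifted before they are interpreted. The sketch below converts an Eastern timestamp to UTC with a fixed offset. It assumes the timestamp falls within daylight saving time, when Eastern Time is UTC-4 (true for the September 2015 rows printed earlier); a version that handles both halves of the year would use `zoneinfo.ZoneInfo("America/New_York")` instead. The function name `eastern_to_utc` is hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Assumption: the timestamp is in Eastern Daylight Time (UTC-4).
EASTERN_DAYLIGHT = timezone(timedelta(hours=-4), "EDT")


def eastern_to_utc(created_at, fmt="%m/%d/%Y %H:%M"):
    """Parse a created_at string as EDT and return the equivalent UTC datetime."""
    local = datetime.strptime(created_at, fmt).replace(tzinfo=EASTERN_DAYLIGHT)
    return local.astimezone(timezone.utc)


utc_time = eastern_to_utc("9/8/2015 05:00")
print(utc_time.strftime("%H:%M UTC"))
```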

# Conclusion


In this project, we looked at two types of posts on Hacker News, `Ask HN` and `Show HN`. `Show HN` posts received slightly more comments on average, and based on our analysis, the best time to publish one and receive the most comments is during the 4:00 AM and 5:00 AM hours (Eastern Time).