# Exploring Hacker News Posts 

In this notebook, we explore the Hacker News dataset on Kaggle. Hacker News is a site started by the startup incubator Y Combinator, where user-submitted stories (known as "posts") are voted on and commented on, similar to Reddit. Hacker News is extremely popular in technology and startup circles, and posts that make it to the top of Hacker News' listings can get hundreds of thousands of visitors as a result.

We're specifically interested in posts whose titles begin with either Ask HN or Show HN. Users submit Ask HN posts to ask the Hacker News community a specific question. Likewise, users submit Show HN posts to show the Hacker News community a project, product, or just generally something interesting.

We'll compare these two types of posts to determine the following:

- Do Ask HN or Show HN posts receive more comments on average?
- For whichever post type (Ask HN or Show HN) receives more comments on average, do posts created at a certain time receive more comments on average?

## 1. Library and Data Import

In [1]:
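import csv

# Read the whole CSV into a list of rows; the with-block closes the file afterwards
with open('C:/Users/Lenovo/Downloads/HN_posts_year_to_Sep_26_2016.csv', encoding='utf8', newline='') as open_file:
    read_file = csv.reader(open_file)
    hn = list(read_file)

print(hn[0:5])  # header row plus the first four posts
print(len(hn))  # number of rows, including the header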

[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'], ['12579008', 'You have two days to comment if you want stem cells to be classified as your own', 'http://www.regulations.gov/document?D=FDA-2015-D-3719-0018', '1', '0', 'altstar', '9/26/2016 3:26'], ['12579005', 'SQLAR  the SQLite Archiver', 'https://www.sqlite.org/sqlar/doc/trunk/README.md', '1', '0', 'blacksqr', '9/26/2016 3:24'], ['12578997', 'What if we just printed a flatscreen television on the side of our boxes?', 'https://medium.com/vanmoof/our-secrets-out-f21c1f03fdc8#.ietxmez43', '1', '0', 'pavel_lishin', '9/26/2016 3:19'], ['12578989', 'algorithmic music', 'http://cacm.acm.org/magazines/2011/7/109891-algorithmic-composition/fulltext', '1', '0', 'poindontcare', '9/26/2016 3:16']]
293120
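
The cells in this notebook access columns by position (for example `row[1]` for the title). As an alternative, `csv.DictReader` from the standard library exposes each row by column name. The sketch below is only an illustration: it reads a tiny in-memory sample (`sample_csv`, a hypothetical name) built from the header and one row shown in the output above, rather than the full file.

```python
import csv
import io

# In-memory stand-in for the real file: the header plus one row from the output above.
sample_csv = io.StringIO(
    "id,title,url,num_points,num_comments,author,created_at\n"
    "12578989,algorithmic music,"
    "http://cacm.acm.org/magazines/2011/7/109891-algorithmic-composition/fulltext,"
    "1,0,poindontcare,9/26/2016 3:16\n"
)

# DictReader consumes the header row itself and yields one dict per data row.
rows = list(csv.DictReader(sample_csv))

print(rows[0]['title'])
print(rows[0]['num_comments'])  # still a string; convert with int() before doing arithmetic
```

With this approach there is no separate header-removal step, at the cost of building a dictionary for every row.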


In [2]:
# Extract and remove header row
headers = hn[0]
hn = hn[1:]
print(hn[0:5])

[['12579008', 'You have two days to comment if you want stem cells to be classified as your own', 'http://www.regulations.gov/document?D=FDA-2015-D-3719-0018', '1', '0', 'altstar', '9/26/2016 3:26'], ['12579005', 'SQLAR  the SQLite Archiver', 'https://www.sqlite.org/sqlar/doc/trunk/README.md', '1', '0', 'blacksqr', '9/26/2016 3:24'], ['12578997', 'What if we just printed a flatscreen television on the side of our boxes?', 'https://medium.com/vanmoof/our-secrets-out-f21c1f03fdc8#.ietxmez43', '1', '0', 'pavel_lishin', '9/26/2016 3:19'], ['12578989', 'algorithmic music', 'http://cacm.acm.org/magazines/2011/7/109891-algorithmic-composition/fulltext', '1', '0', 'poindontcare', '9/26/2016 3:16'], ['12578979', 'How the Data Vault Enables the Next-Gen Data Warehouse and Data Lake', 'https://www.talend.com/blog/2016/05/12/talend-and-Â\x93the-data-vaultÂ\x94', '1', '0', 'markgainor1', '9/26/2016 3:14']]


## 2. Data Cleansing

In [3]:
# Split posts into Ask HN, Show HN, and all other posts based on the title prefix

ask_posts = []
show_posts = []
other_posts = []

for row in hn:
    title = row[1]
    if title.lower().startswith('ask hn'):
        ask_posts.append(row)
    elif title.lower().startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)

# Check that the three counts add up to the original number of posts
print(len(ask_posts))
print(len(show_posts))
print(len(other_posts))

9139
10158
273822
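
The tally check mentioned in the comment above can also be made explicit with an assertion. The following is a minimal sketch, assuming the same prefix test as the cell above; `split_posts` and `sample_rows` are hypothetical names, and the sample rows are made up for illustration rather than taken from the dataset.

```python
def split_posts(rows):
    """Split rows into Ask HN, Show HN, and other posts by title prefix."""
    ask, show, other = [], [], []
    for row in rows:
        title = row[1].lower()
        if title.startswith('ask hn'):
            ask.append(row)
        elif title.startswith('show hn'):
            show.append(row)
        else:
            other.append(row)
    return ask, show, other


# Hypothetical rows for illustration only; the column order matches the dataset header.
sample_rows = [
    ['1', 'Ask HN: How do you learn a new language?', '', '5', '12', 'user_a', '9/26/2016 2:53'],
    ['2', 'Show HN: A tiny static site generator', 'http://example.com', '8', '3', 'user_b', '9/25/2016 22:57'],
    ['3', 'algorithmic music', 'http://example.com/music', '1', '0', 'user_c', '9/26/2016 3:16'],
]

ask, show, other = split_posts(sample_rows)

# Every row must land in exactly one of the three lists.
assert len(ask) + len(show) + len(other) == len(sample_rows)
print(len(ask), len(show), len(other))
```

For the notebook's own output, 9139 + 10158 + 273822 = 293119, which matches the 293120 rows read from the file minus the header row.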


## 3. Data Analysis

### 3.1 Do Ask HN or Show HN receive more comments on average?

In [4]:
# Find the total number of comments in ask posts

total_ask_comments = 0

for row in ask_posts:
    number_of_comments = int(row[4])
    total_ask_comments += number_of_comments
    
# Find the average number of comments per ask post

avg_ask_comments = total_ask_comments / len(ask_posts)

print(avg_ask_comments)
    

10.393478498741656


In [5]:
# Find the total number of comments in show posts

total_show_comments = 0

for row in show_posts:
    number_of_comments = int(row[4])
    total_show_comments += number_of_comments
    
# Find the average number of comments per show post

avg_show_comments = total_show_comments / len(show_posts)

print(avg_show_comments)
    

4.886099625910612


**Finding: Based on the above results, Ask HN posts received more comments on average than Show HN posts (about 10.39 versus 4.89 comments per post).**
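
The two cells above repeat the same total-then-divide logic for each post type. One way to avoid the duplication is a small helper, sketched below. `average_comments` and `sample_posts` are hypothetical names, and the sample rows are made up for illustration; only the column position of `num_comments` (index 4) comes from the dataset header.

```python
def average_comments(posts, comments_index=4):
    """Return the mean of the num_comments column, or 0.0 for an empty list."""
    if not posts:
        return 0.0
    total = sum(int(row[comments_index]) for row in posts)
    return total / len(posts)


# Hypothetical rows for illustration only.
sample_posts = [
    ['1', 'Ask HN: A', '', '1', '7', 'user_a', '9/26/2016 2:53'],
    ['2', 'Ask HN: B', '', '1', '3', 'user_b', '9/26/2016 1:17'],
    ['3', 'Ask HN: C', '', '1', '2', 'user_c', '9/25/2016 21:50'],
]

print(average_comments(sample_posts))  # (7 + 3 + 2) / 3 = 4.0
```

In the notebook, the same helper could be called once with `ask_posts` and once with `show_posts`.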

### 3.2 Are Ask HN posts created at a certain time more likely to attract comments?

In [6]:
# Isolate post creation time and number of comments for each Ask HN Post
result_list = []

for row in ask_posts:
    create_time = row[6]
    number_of_comments = int(row[4])
    record = [create_time, number_of_comments]
    result_list.append(record)

print(result_list[0:5])

[['9/26/2016 2:53', 7], ['9/26/2016 1:17', 3], ['9/25/2016 22:57', 0], ['9/25/2016 22:48', 3], ['9/25/2016 21:50', 2]]


In [10]:
# Count Ask HN posts and total their comments by hour of creation
import datetime as dt

counts_by_hour = {}
comments_by_hour = {}

for row in result_list:
    date_created = row[0]
    comments = row[1]
    hour_created = dt.datetime.strptime(date_created, '%m/%d/%Y %H:%M').strftime('%H')
    if hour_created in counts_by_hour:
        counts_by_hour[hour_created] += 1
        comments_by_hour[hour_created] += comments
    else:
        counts_by_hour[hour_created] = 1
        comments_by_hour[hour_created] = comments
print(comments_by_hour)


{'02': 2996, '01': 2089, '22': 3372, '21': 4500, '19': 3954, '17': 5547, '15': 18525, '14': 4972, '13': 7245, '11': 2797, '10': 3013, '09': 1477, '07': 1585, '03': 2154, '23': 2297, '20': 4462, '16': 4466, '08': 2362, '00': 2277, '18': 4877, '12': 4234, '04': 2360, '06': 1587, '05': 1838}


In [11]:
print(counts_by_hour)

{'02': 269, '01': 282, '22': 383, '21': 518, '19': 552, '17': 587, '15': 646, '14': 513, '13': 444, '11': 312, '10': 282, '09': 222, '07': 226, '03': 271, '23': 343, '20': 510, '16': 579, '08': 257, '00': 301, '18': 614, '12': 342, '04': 243, '06': 234, '05': 209}
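
To make the hour extraction and tallying concrete, here is a small sketch that runs the same steps on the five records printed from `result_list` earlier. It swaps the explicit `if`/`else` for `collections.defaultdict`, which is an alternative to the notebook's approach rather than what the cell above does; `DATE_FORMAT`, `records`, `counts`, and `comments` are names chosen for this example.

```python
import datetime as dt
from collections import defaultdict

DATE_FORMAT = '%m/%d/%Y %H:%M'

# The five [created_at, num_comments] records printed from result_list earlier.
records = [
    ['9/26/2016 2:53', 7],
    ['9/26/2016 1:17', 3],
    ['9/25/2016 22:57', 0],
    ['9/25/2016 22:48', 3],
    ['9/25/2016 21:50', 2],
]

# Missing keys start at 0, so no membership check is needed.
counts = defaultdict(int)
comments = defaultdict(int)

for created_at, n_comments in records:
    # Parse the timestamp, then format the hour as a zero-padded string such as '02'.
    hour = dt.datetime.strptime(created_at, DATE_FORMAT).strftime('%H')
    counts[hour] += 1
    comments[hour] += n_comments

print(dict(counts))    # {'02': 1, '01': 1, '22': 2, '21': 1}
print(dict(comments))  # {'02': 7, '01': 3, '22': 3, '21': 2}
```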


In [12]:
# Calculate the average number of comments per post for each hour

avg_by_hour = []

for hour in comments_by_hour:
    avg_by_hour.append([hour, comments_by_hour[hour] / counts_by_hour[hour]])

print(avg_by_hour)
                             

[['02', 11.137546468401487], ['01', 7.407801418439717], ['22', 8.804177545691905], ['21', 8.687258687258687], ['19', 7.163043478260869], ['17', 9.449744463373083], ['15', 28.676470588235293], ['14', 9.692007797270955], ['13', 16.31756756756757], ['11', 8.96474358974359], ['10', 10.684397163120567], ['09', 6.653153153153153], ['07', 7.013274336283186], ['03', 7.948339483394834], ['23', 6.696793002915452], ['20', 8.749019607843136], ['16', 7.713298791018998], ['08', 9.190661478599221], ['00', 7.5647840531561465], ['18', 7.94299674267101], ['12', 12.380116959064328], ['04', 9.7119341563786], ['06', 6.782051282051282], ['05', 8.794258373205741]]


In [15]:
# Sort the hours by average number of comments, in descending order

sort_avg_by_hour = []

for each_hour in avg_by_hour:
    sort_avg_by_hour.append([each_hour[1], each_hour[0]])
sort_avg_by_hour = sorted(sort_avg_by_hour, reverse = True)
print(sort_avg_by_hour)

[[28.676470588235293, '15'], [16.31756756756757, '13'], [12.380116959064328, '12'], [11.137546468401487, '02'], [10.684397163120567, '10'], [9.7119341563786, '04'], [9.692007797270955, '14'], [9.449744463373083, '17'], [9.190661478599221, '08'], [8.96474358974359, '11'], [8.804177545691905, '22'], [8.794258373205741, '05'], [8.749019607843136, '20'], [8.687258687258687, '21'], [7.948339483394834, '03'], [7.94299674267101, '18'], [7.713298791018998, '16'], [7.5647840531561465, '00'], [7.407801418439717, '01'], [7.163043478260869, '19'], [7.013274336283186, '07'], [6.782051282051282, '06'], [6.696793002915452, '23'], [6.653153153153153, '09']]
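
The cell above swaps each pair so that `sorted` compares the averages first. An alternative, sketched below, is to pass a `key` function to `sorted` and leave the `[hour, average]` pairs as they are. The four sample pairs are copied from the `avg_by_hour` output above; `avg_sample` and `ranked` are names chosen for this example.

```python
# A few [hour, average] pairs copied from the avg_by_hour output above.
avg_sample = [
    ['02', 11.137546468401487],
    ['15', 28.676470588235293],
    ['13', 16.31756756756757],
    ['09', 6.653153153153153],
]

# Sort on the second element (the average) without rebuilding the list.
ranked = sorted(avg_sample, key=lambda pair: pair[1], reverse=True)

for hour, avg in ranked:
    print('{}:00: {:.2f} average comments per post'.format(hour, avg))
```

Because the pairs keep their original order, later code can unpack them as `hour, avg` instead of `avg, hr`.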


In [16]:
# Print top 5 hours with the highest average number of comments

print("Top 5 Hours for Ask HN Comments")

for avg, hr in sort_avg_by_hour[:5]:
    print('{}: {:.2f} average comments per post'.format(dt.datetime.strptime(hr, '%H').strftime('%H:%M'), avg))

Top 5 Hours for Ask HN Comments
15:00: 28.68 average comments per post
13:00: 16.32 average comments per post
12:00: 12.38 average comments per post
02:00: 11.14 average comments per post
10:00: 10.68 average comments per post


**Finding: Based on the above result, Ask HN posts created during the 15:00 hour receive the most comments on average (28.68 per post). To maximise the number of comments, it is recommended to create an Ask HN post between 3:00 PM and 3:59 PM (the timezone is Eastern Time in the US).**
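
Since the recommended hour is expressed in US Eastern Time, readers elsewhere need to convert it. The sketch below converts 15:00 Eastern to UTC using only the standard library. It assumes a fixed UTC-4 offset (Eastern Daylight Time), which applies on the dataset's final date of 9/26/2016 but not during winter months; `EASTERN_DAYLIGHT` and the other names are chosen for this example.

```python
import datetime as dt

# Assumption: Eastern Daylight Time (UTC-4), which is in effect on 9/26/2016.
# In winter, Eastern Standard Time (UTC-5) applies instead.
EASTERN_DAYLIGHT = dt.timezone(dt.timedelta(hours=-4), 'EDT')

# The start of the best-performing hour, tagged with the assumed offset.
best_hour_eastern = dt.datetime(2016, 9, 26, 15, 0, tzinfo=EASTERN_DAYLIGHT)

# Convert to UTC; any other timezone object could be passed to astimezone() instead.
best_hour_utc = best_hour_eastern.astimezone(dt.timezone.utc)

print(best_hour_utc.strftime('%H:%M'))  # 19:00
```

On Python 3.9 or later, `zoneinfo.ZoneInfo('America/New_York')` could replace the fixed offset so that daylight saving changes are handled automatically, provided time zone data is available on the system.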