# Analyzing Hacker News Posts

Hacker News is a site similar to Reddit where users submit technology-related posts, which are then voted and commented on.

I will be analyzing a subset of a Hacker News dataset found on Kaggle. Below are the descriptions of the columns:

id: The unique identifier from Hacker News for the post

title: The title of the post

url: The URL that the post links to, if the post has a URL

num_points: The number of points the post acquired, calculated as the total number of upvotes minus the total number of downvotes

num_comments: The number of comments that were made on the post

author: The username of the person who submitted the post

created_at: The date and time at which the post was submitted

I'm specifically interested in posts whose titles begin with 'Show HN' or 'Ask HN'.

The goal of this analysis is to answer two questions:

1. Do Ask HN or Show HN posts receive more comments on average?
2. Do posts created at a certain time receive more comments on average?

## Read in the data and print the first 5 rows

In [19]:
from csv import reader

# Read the whole CSV into a list of rows; the context manager closes the file.
with open('hacker_news.csv') as opened:
    hn = list(reader(opened))

for row in hn[:5]:
    print(row)
    print('\n')

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']


['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52']


['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30']


['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20']


['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']




## Separate the header row from the rest of the dataset

In [20]:
headers = hn[0]
hn = hn[1:]
print(headers)
print('\n')
for row in hn[:5]:
    print(row)
    print('\n')

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']


['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52']


['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30']


['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20']


['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']


['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']




## Filtering relevant data

Since I am interested in posts whose titles begin with Ask HN or Show HN, I isolate these.

In [21]:
ask_posts = []
show_posts = []
other_posts = []

# Sort each post into a bucket based on its (case-insensitive) title prefix.
for row in hn:
    title = row[1]
    if title.lower().startswith('ask hn'):
        ask_posts.append(row)
    elif title.lower().startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)
        
print(len(ask_posts))
print(len(show_posts))
print(len(other_posts))

1744
1162
17194
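
The prefix test used above can also be packaged as a small function, which makes it easy to check on a few titles. This is only a minimal sketch: `classify_title` is a hypothetical helper name that does not appear in the notebook, and the sample titles are copied from rows printed elsewhere in this analysis.

```python
def classify_title(title):
    """Return 'ask', 'show', or 'other' based on a case-insensitive title prefix."""
    lowered = title.lower()
    if lowered.startswith('ask hn'):
        return 'ask'
    if lowered.startswith('show hn'):
        return 'show'
    return 'other'


# Sample titles taken from rows shown in this notebook.
sample_titles = [
    'Ask HN: How to improve my personal website?',
    'Show HN: Something pointless I made',
    'Interactive Dynamic Video',
]
labels = [classify_title(t) for t in sample_titles]
print(labels)
```

Lowercasing before the comparison means variants such as 'ASK HN' or 'ask hn' land in the same bucket.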


Below are the first 5 rows for both ask_posts and show_posts

In [22]:
for post in ask_posts[:5]:
    print(post)                 
    print('\n')

['12296411', 'Ask HN: How to improve my personal website?', '', '2', '6', 'ahmedbaracat', '8/16/2016 9:55']


['10610020', 'Ask HN: Am I the only one outraged by Twitter shutting down share counts?', '', '28', '29', 'tkfx', '11/22/2015 13:43']


['11610310', 'Ask HN: Aby recent changes to CSS that broke mobile?', '', '1', '1', 'polskibus', '5/2/2016 10:14']


['12210105', 'Ask HN: Looking for Employee #3 How do I do it?', '', '1', '3', 'sph130', '8/2/2016 14:20']


['10394168', 'Ask HN: Someone offered to buy my browser extension from me. What now?', '', '28', '17', 'roykolak', '10/15/2015 16:38']




In [23]:
for post in show_posts[:5]:
    print(post)                 
    print('\n')

['10627194', 'Show HN: Wio Link  ESP8266 Based Web of Things Hardware Development Platform', 'https://iot.seeed.cc', '26', '22', 'kfihihc', '11/25/2015 14:03']


['10646440', 'Show HN: Something pointless I made', 'http://dn.ht/picklecat/', '747', '102', 'dhotson', '11/29/2015 22:46']


['11590768', 'Show HN: Shanhu.io, a programming playground powered by e8vm', 'https://shanhu.io', '1', '1', 'h8liu', '4/28/2016 18:05']


['12178806', 'Show HN: Webscope  Easy way for web developers to communicate with Clients', 'http://webscopeapp.com', '3', '3', 'fastbrick', '7/28/2016 7:11']


['10872799', 'Show HN: GeoScreenshot  Easily test Geo-IP based web pages', 'https://www.geoscreenshot.com/', '1', '9', 'kpsychwave', '1/9/2016 20:45']




### 1. Determine if ask_posts or show_posts receive more comments on average

In [24]:
total_ask_comments = 0
for post in ask_posts:
    num_comments = int(post[4])
    total_ask_comments += num_comments
    
avg_ask_comments = total_ask_comments / len(ask_posts) 
print(avg_ask_comments)  

14.038417431192661


In [25]:
total_show_comments = 0
for post in show_posts:
    num_comments = int(post[4])
    total_show_comments += num_comments
    
avg_show_comments = total_show_comments / len(show_posts) 
print(avg_show_comments) 

10.31669535283993


On average, ask posts receive more comments than show posts (about 14.04 versus 10.32). Since ask posts receive more comments on average, I focus on them for the rest of the analysis.
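
The two averaging loops follow the same pattern, so they could be folded into one helper. The sketch below assumes a hypothetical function name, `average_column`, and runs on three ask posts copied from the output above, so its result is the average of that small sample only, not of the full dataset.

```python
def average_column(rows, index):
    """Average the integer values stored as strings at position `index`."""
    if not rows:
        # Avoid dividing by zero when a bucket has no posts.
        return 0.0
    return sum(int(row[index]) for row in rows) / len(rows)


# Three ask posts copied from the rows printed earlier (sample data only).
sample_ask_posts = [
    ['12296411', 'Ask HN: How to improve my personal website?', '', '2', '6', 'ahmedbaracat', '8/16/2016 9:55'],
    ['10610020', 'Ask HN: Am I the only one outraged by Twitter shutting down share counts?', '', '28', '29', 'tkfx', '11/22/2015 13:43'],
    ['11610310', 'Ask HN: Aby recent changes to CSS that broke mobile?', '', '1', '1', 'polskibus', '5/2/2016 10:14'],
]

# Index 4 is the num_comments column.
sample_avg = average_column(sample_ask_posts, 4)
print(sample_avg)
```

Passing index 3 instead of 4 would average the num_points column in the same way.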

### 2. Do ask posts created at a certain time receive more comments?

The first step is to calculate the number of ask posts and comments by the hour they were created.

In [26]:
import datetime as dt

result_list = []
for post in ask_posts:
    result_list.append(
        [post[6], int(post[4])]
    )
    
print(result_list[:5])    


[['8/16/2016 9:55', 6], ['11/22/2015 13:43', 29], ['5/2/2016 10:14', 1], ['8/2/2016 14:20', 3], ['10/15/2015 16:38', 17]]


In [27]:
counts_by_hour = {}    
comments_by_hour = {}

for row in result_list:
    date = row[0]
    datetime_obj = dt.datetime.strptime(date,"%m/%d/%Y %H:%M")
    hour = datetime_obj.strftime("%H")
    if hour not in counts_by_hour:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = row[1]
    else:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += row[1]
        
comments_by_hour   

{'00': 447,
 '01': 683,
 '02': 1381,
 '03': 421,
 '04': 337,
 '05': 464,
 '06': 397,
 '07': 267,
 '08': 492,
 '09': 251,
 '10': 793,
 '11': 641,
 '12': 687,
 '13': 1253,
 '14': 1416,
 '15': 4477,
 '16': 1814,
 '17': 1146,
 '18': 1439,
 '19': 1188,
 '20': 1722,
 '21': 1745,
 '22': 479,
 '23': 543}

In [28]:
counts_by_hour

{'00': 55,
 '01': 60,
 '02': 58,
 '03': 54,
 '04': 47,
 '05': 46,
 '06': 44,
 '07': 34,
 '08': 48,
 '09': 45,
 '10': 59,
 '11': 58,
 '12': 73,
 '13': 85,
 '14': 107,
 '15': 116,
 '16': 108,
 '17': 100,
 '18': 109,
 '19': 110,
 '20': 80,
 '21': 109,
 '22': 71,
 '23': 68}
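
As an alternative to the explicit `if hour not in ...` check, the same tallies could be built with `collections.defaultdict`, which starts every missing key at zero. This is a sketch under that assumption, not the approach used in the notebook; the variable names `sample_results`, `sample_counts`, and `sample_comments` are illustrative, and the five input pairs are the ones printed from `result_list[:5]` above.

```python
import datetime as dt
from collections import defaultdict

# The five [created_at, num_comments] pairs printed from result_list[:5].
sample_results = [
    ['8/16/2016 9:55', 6],
    ['11/22/2015 13:43', 29],
    ['5/2/2016 10:14', 1],
    ['8/2/2016 14:20', 3],
    ['10/15/2015 16:38', 17],
]

sample_counts = defaultdict(int)    # posts per hour
sample_comments = defaultdict(int)  # comments per hour

for created_at, n_comments in sample_results:
    # Parse the timestamp, then keep only the zero-padded hour as the key.
    hour = dt.datetime.strptime(created_at, "%m/%d/%Y %H:%M").strftime("%H")
    sample_counts[hour] += 1
    sample_comments[hour] += n_comments

print(dict(sample_counts))
print(dict(sample_comments))
```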

Below, I calculate the average number of comments per post for each hour of creation.

In [29]:
avg_by_hour = []
for hour in comments_by_hour:
    avg_by_hour.append([hour, comments_by_hour[hour] / counts_by_hour[hour]])
avg_by_hour

[['06', 9.022727272727273],
 ['13', 14.741176470588234],
 ['15', 38.5948275862069],
 ['02', 23.810344827586206],
 ['04', 7.170212765957447],
 ['20', 21.525],
 ['23', 7.985294117647059],
 ['17', 11.46],
 ['18', 13.20183486238532],
 ['09', 5.5777777777777775],
 ['01', 11.383333333333333],
 ['22', 6.746478873239437],
 ['03', 7.796296296296297],
 ['12', 9.41095890410959],
 ['05', 10.08695652173913],
 ['14', 13.233644859813085],
 ['00', 8.127272727272727],
 ['21', 16.009174311926607],
 ['10', 13.440677966101696],
 ['07', 7.852941176470588],
 ['11', 11.051724137931034],
 ['08', 10.25],
 ['19', 10.8],
 ['16', 16.796296296296298]]

To make the results easier to read, I swap each pair so the average comes first, then sort the list in descending order.

In [30]:
swap_avg_by_hour = []
for row in avg_by_hour:
    swap_avg_by_hour.append([row[1], row[0]])
swap_avg_by_hour    

[[9.022727272727273, '06'],
 [14.741176470588234, '13'],
 [38.5948275862069, '15'],
 [23.810344827586206, '02'],
 [7.170212765957447, '04'],
 [21.525, '20'],
 [7.985294117647059, '23'],
 [11.46, '17'],
 [13.20183486238532, '18'],
 [5.5777777777777775, '09'],
 [11.383333333333333, '01'],
 [6.746478873239437, '22'],
 [7.796296296296297, '03'],
 [9.41095890410959, '12'],
 [10.08695652173913, '05'],
 [13.233644859813085, '14'],
 [8.127272727272727, '00'],
 [16.009174311926607, '21'],
 [13.440677966101696, '10'],
 [7.852941176470588, '07'],
 [11.051724137931034, '11'],
 [10.25, '08'],
 [10.8, '19'],
 [16.796296296296298, '16']]

In [31]:
sorted_swap = sorted(swap_avg_by_hour, reverse=True)

In [32]:
sorted_swap

[[38.5948275862069, '15'],
 [23.810344827586206, '02'],
 [21.525, '20'],
 [16.796296296296298, '16'],
 [16.009174311926607, '21'],
 [14.741176470588234, '13'],
 [13.440677966101696, '10'],
 [13.233644859813085, '14'],
 [13.20183486238532, '18'],
 [11.46, '17'],
 [11.383333333333333, '01'],
 [11.051724137931034, '11'],
 [10.8, '19'],
 [10.25, '08'],
 [10.08695652173913, '05'],
 [9.41095890410959, '12'],
 [9.022727272727273, '06'],
 [8.127272727272727, '00'],
 [7.985294117647059, '23'],
 [7.852941176470588, '07'],
 [7.796296296296297, '03'],
 [7.170212765957447, '04'],
 [6.746478873239437, '22'],
 [5.5777777777777775, '09']]
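
The swap is needed only because `sorted` compares lists by their first element. A possible shortcut, shown here as a sketch rather than as the notebook's method, is to pass a `key` function so the original `[hour, average]` pairs can be sorted directly. The name `sample_avg_by_hour` and its four entries are a small illustrative subset of the values above.

```python
# A small subset of [hour, average] pairs taken from avg_by_hour.
sample_avg_by_hour = [
    ['15', 38.5948275862069],
    ['09', 5.5777777777777775],
    ['02', 23.810344827586206],
    ['20', 21.525],
]

# Sort by the average (second element) in descending order, no swap required.
ranked = sorted(sample_avg_by_hour, key=lambda pair: pair[1], reverse=True)

for hour, avg in ranked:
    print(hour, avg)
```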

In [33]:
for avg, hr in sorted_swap[:5]:
    hr = dt.datetime.strptime(hr,"%H").strftime("%H:%M")
    print('{} {:.2f} average comments per post'.format(hr, avg))

15:00 38.59 average comments per post
02:00 23.81 average comments per post
20:00 21.52 average comments per post
16:00 16.80 average comments per post
21:00 16.01 average comments per post


### Conclusion

It appears that an Ask HN post created during the 15:00 hour receives the most comments on average (about 38.59 per post), followed at a distance by the 02:00 hour (23.81) and the 20:00 hour (21.52).

#### Next steps

Determine if show or ask posts receive more points on average.

Determine if posts created at a certain time are more likely to receive more points.

Compare these results to the average number of comments and points other posts receive.