# Analysis of Hacker News data

## Introduction

In this project, we work with a dataset from the popular technology site ``Hacker News``, with the following aims:

* To determine whether ``Ask HN`` or ``Show HN`` posts receive more comments on average.
* To find out whether posts created at a certain time receive more comments on average.

*``Ask HN`` posts ask the Hacker News community a specific question, while ``Show HN`` posts show the community a project, product, or just something interesting.*

The dataset can be found [here](https://www.kaggle.com/hacker-news/hacker-news-posts), but note that it has been reduced from almost 300,000 rows to approximately 20,000 rows by removing all submissions that didn't receive any comments and then randomly sampling from the remaining submissions.

Hacker News is a site started by the startup incubator Y Combinator, where user-submitted stories (known as "posts") receive votes and comments, similar to Reddit. Hacker News is extremely popular in technology and startup circles, and posts that make it to the top of the Hacker News listings can get hundreds of thousands of visitors as a result. Below are descriptions of the columns:

* ``id:`` the unique identifier from Hacker News for the post
* ``title:`` the title of the post
* ``url:`` the URL that the post links to, if the post has a URL
* ``num_points:`` the number of points the post acquired, calculated as the total number of upvotes minus the total number of downvotes
* ``num_comments:`` the number of comments on the post
* ``author:`` the username of the person who submitted the post
* ``created_at:`` the date and time of the post's submission.
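Because the rows are handled as plain lists later on, each column is addressed by its position. As a minimal sketch (the `col` dictionary is a hypothetical helper, not part of the original analysis), the header row can be turned into a name-to-index lookup so that positions such as 4 and 6 do not have to be memorized:

```python
# Header row as it appears in the dataset
headers = ['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']

# Hypothetical helper: map each column name to its position in a row
col = {name: index for index, name in enumerate(headers)}

print(col['title'], col['num_comments'], col['created_at'])
```

With this lookup, `row[col['num_comments']]` reads the same value as `row[4]`.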

In [22]:
# import the libraries we need
import datetime as dt
from csv import reader

# read the whole file into a list of lists; the with-block closes the file
with open("hacker_news.csv") as opened_file:
    read_file = reader(opened_file)
    hn = list(read_file)
print(hn[:5])


[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'], ['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']]


In [3]:
headers = hn[0]
hn = hn[1:]
print(headers)
print('\n')
print(hn[:5])

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']


[['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01'], ['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']]


Since we are only interested in the ``Ask HN`` and ``Show HN`` posts, we will separate them into new lists of lists. Titles are converted to lowercase before the comparison so that differences in capitalization do not affect the match.
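The case-insensitive prefix check can be sketched in isolation as follows. The function name `classify_post` and the sample titles are assumptions made up for illustration; they are not taken from the dataset.

```python
def classify_post(title):
    """Return 'ask', 'show', or 'other' based on the title prefix."""
    lowered = title.lower()  # normalize capitalization before comparing
    if lowered.startswith('ask hn'):
        return 'ask'
    if lowered.startswith('show hn'):
        return 'show'
    return 'other'

# Hypothetical titles, for illustration only
sample_titles = [
    'ASK HN: Is this thing on?',
    'Show HN: My side project',
    'Interactive Dynamic Video',
]
labels = [classify_post(title) for title in sample_titles]
print(labels)
```

Lowercasing first means that `Ask HN`, `ASK HN`, and `ask hn` all fall into the same category.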

In [4]:
# we sort the posts into 3 categories using the following lists
ask_posts = []
show_posts = []
other_posts = []

for row in hn:
    title = row[1]
    # lowercase the title so the prefix match ignores capitalization
    if title.lower().startswith('ask hn'):
        ask_posts.append(row)
    elif title.lower().startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)
        
print(len(ask_posts))
print(len(show_posts))
print(len(other_posts))

1744
1162
17194


Below are the first five rows of ``ask_posts`` and ``show_posts``, respectively:

In [5]:
print('First 5 rows in ask_posts:')
print(ask_posts[:5])
print('\n')
print('First 5 rows in show_posts:')
print(show_posts[:5])

First 5 rows in ask_posts:
[['12296411', 'Ask HN: How to improve my personal website?', '', '2', '6', 'ahmedbaracat', '8/16/2016 9:55'], ['10610020', 'Ask HN: Am I the only one outraged by Twitter shutting down share counts?', '', '28', '29', 'tkfx', '11/22/2015 13:43'], ['11610310', 'Ask HN: Aby recent changes to CSS that broke mobile?', '', '1', '1', 'polskibus', '5/2/2016 10:14'], ['12210105', 'Ask HN: Looking for Employee #3 How do I do it?', '', '1', '3', 'sph130', '8/2/2016 14:20'], ['10394168', 'Ask HN: Someone offered to buy my browser extension from me. What now?', '', '28', '17', 'roykolak', '10/15/2015 16:38']]


First 5 rows in show_posts:
[['10627194', 'Show HN: Wio Link  ESP8266 Based Web of Things Hardware Development Platform', 'https://iot.seeed.cc', '26', '22', 'kfihihc', '11/25/2015 14:03'], ['10646440', 'Show HN: Something pointless I made', 'http://dn.ht/picklecat/', '747', '102', 'dhotson', '11/29/2015 22:46'], ['11590768', 'Show HN: Shanhu.io, a programming playg

### Posts with the most comments

We will:
* Find the total number of comments for each of the two categories above. The comment count is in the fifth column (index 4) of each row.
* Compute the average number of comments per post for each category.
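The two steps above can be sketched as one reusable function. The name `average_comments` and the sample rows are assumptions for illustration. The sample rows follow the dataset's column layout but are not actual records.

```python
def average_comments(rows, comments_index=4):
    """Mean of the num_comments column (stored as strings) across rows."""
    if not rows:
        return 0.0  # avoid dividing by zero for an empty category
    total = sum(int(row[comments_index]) for row in rows)
    return total / len(rows)

# Hypothetical rows, for illustration only
sample_rows = [
    ['1', 'Ask HN: First question?', '', '2', '6', 'user_a', '8/16/2016 9:55'],
    ['2', 'Ask HN: Second question?', '', '28', '29', 'user_b', '11/22/2015 13:43'],
]
sample_avg = average_comments(sample_rows)
print(sample_avg)  # (6 + 29) / 2 = 17.5
```

Calling the same function on ``ask_posts`` and ``show_posts`` would avoid writing the summing loop twice.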

In [70]:
total_ask_comments = 0
total_show_comments = 0

for row in ask_posts:
    ask_num_comments = int(row[4])
    total_ask_comments = total_ask_comments + ask_num_comments
avg_ask_comments = total_ask_comments/len(ask_posts)

for row in show_posts:
    show_num_comments = int(row[4])
    total_show_comments = total_show_comments + show_num_comments
avg_show_comments = total_show_comments/len(show_posts)
print('The average ask comments is:', avg_ask_comments)
print('\n')
print('The average show comments is:', avg_show_comments)
    
    

The average ask comments is: 14.038417431192661


The average show comments is: 10.31669535283993


We can see from the averages computed above that ask posts received more comments on average than show posts (about 14.04 versus 10.32). As a result, we will focus on ask posts in the rest of the analysis.

We will now check whether ask posts created at a certain time are more likely to attract comments. To do this, we will:
* Calculate the number of ask posts created in each hour of the day, along with the number of comments they received.
* Calculate the average number of comments that ask posts receive, by hour created.

### Number of ask posts and comments by hour created
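The first step can be sketched with `collections.defaultdict`, which removes the need to check whether an hour has already been seen. The function name `tally_by_hour` and the sample pairs are assumptions for illustration. The date format matches the `created_at` values shown earlier.

```python
import datetime as dt
from collections import defaultdict

DATE_FORMAT = "%m/%d/%Y %H:%M"

def tally_by_hour(pairs):
    """Count posts and sum comments per two-digit hour string."""
    counts = defaultdict(int)
    comments = defaultdict(int)
    for created_at, num_comments in pairs:
        # parse the timestamp, then keep only the zero-padded hour
        hour = dt.datetime.strptime(created_at, DATE_FORMAT).strftime("%H")
        counts[hour] += 1
        comments[hour] += num_comments
    return dict(counts), dict(comments)

# Hypothetical (created_at, num_comments) pairs, for illustration only
sample_pairs = [
    ("8/16/2016 9:55", 6),
    ("8/2/2016 9:20", 3),
    ("11/22/2015 13:43", 29),
]
sample_counts, sample_comments = tally_by_hour(sample_pairs)
print(sample_counts)
print(sample_comments)
```

Missing keys in a `defaultdict(int)` start at zero, so a single `+=` covers both the first and later occurrences of an hour.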

In [29]:
result_list = []  # list of lists: [time the ask post was created, number of comments]
counts_by_hour = {}  # number of ask posts created during each hour
comments_by_hour = {}  # total number of comments received by ask posts created in each hour

for row in ask_posts:
    created_at = row[6]
    ask_num_comments = int(row[4])
    result_list.append([created_at, ask_num_comments])

date_format = "%m/%d/%Y %H:%M"
for row in result_list:
    date = row[0]
    comment = row[1]
    # parse the date string into a datetime object
    date_dt = dt.datetime.strptime(date, date_format)
    # extract the hour as a two-digit string, e.g. '09'
    hour_dt = date_dt.strftime("%H")
    if hour_dt not in counts_by_hour:
        counts_by_hour[hour_dt] = 1
        comments_by_hour[hour_dt] = comment
    else:
        counts_by_hour[hour_dt] += 1
        comments_by_hour[hour_dt] += comment
        
print(counts_by_hour)
print('\n')
print(comments_by_hour)


{'09': 45, '13': 85, '10': 59, '14': 107, '16': 108, '23': 68, '12': 73, '17': 100, '15': 116, '21': 109, '20': 80, '02': 58, '18': 109, '03': 54, '05': 46, '19': 110, '01': 60, '22': 71, '08': 48, '04': 47, '00': 55, '06': 44, '07': 34, '11': 58}


{'09': 251, '13': 1253, '10': 793, '14': 1416, '16': 1814, '23': 543, '12': 687, '17': 1146, '15': 4477, '21': 1745, '20': 1722, '02': 1381, '18': 1439, '03': 421, '05': 464, '19': 1188, '01': 683, '22': 479, '08': 492, '04': 337, '00': 447, '06': 397, '07': 267, '11': 641}


Next, we will use the two dictionaries, ``counts_by_hour`` and ``comments_by_hour``, to calculate the average number of comments for posts created during each hour of the day.

We will do this by creating a list of lists containing the hours during which posts were created and the average number of comments those posts received:
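As a quick check of the arithmetic, the sketch below applies the division to two hours only, using the totals printed above for hours `'15'` and `'09'`. The variable names with the `_sample` suffix are assumptions for illustration.

```python
# Values copied from the dictionaries printed above, for two hours only
counts_sample = {'15': 116, '09': 45}
comments_sample = {'15': 4477, '09': 251}

# average = total comments in the hour / number of posts in the hour
avg_sample = {
    hour: comments_sample[hour] / counts_sample[hour]
    for hour in counts_sample
}

for hour, avg in avg_sample.items():
    print(hour, round(avg, 2))
```

Both dictionaries share the same keys, so iterating over either one is enough.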

In [43]:
avg_by_hour = []
# iterate over the hours (both dictionaries share the same keys) and append
# [hour, average comments per post] for each one
for hour in counts_by_hour:
    avg_by_hour.append([hour, comments_by_hour[hour] / counts_by_hour[hour]])
    
print(avg_by_hour)


[['09', 5.5777777777777775], ['13', 14.741176470588234], ['10', 13.440677966101696], ['14', 13.233644859813085], ['16', 16.796296296296298], ['23', 7.985294117647059], ['12', 9.41095890410959], ['17', 11.46], ['15', 38.5948275862069], ['21', 16.009174311926607], ['20', 21.525], ['02', 23.810344827586206], ['18', 13.20183486238532], ['03', 7.796296296296297], ['05', 10.08695652173913], ['19', 10.8], ['01', 11.383333333333333], ['22', 6.746478873239437], ['08', 10.25], ['04', 7.170212765957447], ['00', 8.127272727272727], ['06', 9.022727272727273], ['07', 7.852941176470588], ['11', 11.051724137931034]]


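Next, we will sort the above result so that the highest values are easier to identify. Because ``sorted()`` compares lists by their first element, we first swap the two columns of ``avg_by_hour`` so that the average comes before the hour: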
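An alternative that avoids building a second list is to pass a `key` function to `sorted()`. This is only a sketch of that option, not the approach used in this notebook. The sample values are rounded from the averages printed above.

```python
# [hour, average] pairs, rounded from the output above
avg_by_hour_sample = [['09', 5.58], ['15', 38.59], ['20', 21.525], ['02', 23.81]]

# sort by the second element (the average), highest first
ranked = sorted(avg_by_hour_sample, key=lambda pair: pair[1], reverse=True)

for hour, avg in ranked:
    print(hour, avg)
```

The pairs keep their original `[hour, average]` order, and only the comparison uses the average.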

In [44]:
swap_avg_by_hour = []

for row in avg_by_hour:
    swap_avg_by_hour.append([row[1],row[0]])
print(swap_avg_by_hour)

[[5.5777777777777775, '09'], [14.741176470588234, '13'], [13.440677966101696, '10'], [13.233644859813085, '14'], [16.796296296296298, '16'], [7.985294117647059, '23'], [9.41095890410959, '12'], [11.46, '17'], [38.5948275862069, '15'], [16.009174311926607, '21'], [21.525, '20'], [23.810344827586206, '02'], [13.20183486238532, '18'], [7.796296296296297, '03'], [10.08695652173913, '05'], [10.8, '19'], [11.383333333333333, '01'], [6.746478873239437, '22'], [10.25, '08'], [7.170212765957447, '04'], [8.127272727272727, '00'], [9.022727272727273, '06'], [7.852941176470588, '07'], [11.051724137931034, '11']]


In [62]:
sorted_swap = sorted(swap_avg_by_hour,reverse = True)

Top 5 Hours for Ask Posts Comments

In [65]:
print('\n')
print("Top 5 Hours for Ask Posts Comments")
print('\n')

for avg, hour_str in sorted_swap[:5]:
    # parse the two-digit hour and render it as HH:MM
    hour = dt.datetime.strptime(hour_str, "%H").strftime("%H:%M")
    avg_c = "{h} : {c:.2f} average comments per post"
    print(avg_c.format(h=hour, c=avg))



Top 5 Hours for Ask Posts Comments


15:00 : 38.59 average comments per post
02:00 : 23.81 average comments per post
20:00 : 21.52 average comments per post
16:00 : 16.80 average comments per post
21:00 : 16.01 average comments per post


#### Findings

From the above analysis, we can see that ask posts created in the 15:00 hour received the most comments on average (38.59 per post). This is far above the overall average of 14.04 comments per ask post.

## Conclusion

In conclusion, the analysis of the Hacker News data shows that ``Ask HN`` posts attract more comments on average than ``Show HN`` posts, especially those created in the 15:00 hour.