# [Hacker News](https://news.ycombinator.com/)

Hacker News is a site started by the startup incubator Y Combinator, where user-submitted stories (known as "posts") receive votes and comments, similar to Reddit.

This small project explores a dataset of Hacker News posts to answer the following questions:

- Do **Ask HN** or **Show HN** receive more comments on average?
- Do posts created at a certain time receive more comments on average?

Below are descriptions of the data columns:

***id:*** the unique identifier from Hacker News for the post  
***title:*** the title of the post  
***url:*** the URL that the post links to, if the post has a URL  
***num_points:*** the number of points the post acquired, calculated as the total number of upvotes minus the total number of downvotes   
***num_comments:*** the number of comments on the post    
***author:*** the username of the person who submitted the post   
***created_at:*** the date and time of the post's submission    

### Ask HN vs Show HN

Let's start by importing the library we need and reading the dataset into a list of lists.

In [8]:
from csv import reader

Read the dataset and print its first few rows.

In [9]:
# Open the file in a with-block so it is closed once the rows are read.
with open("hacker_news.csv") as open_file:
    read_file = reader(open_file)
    hn = list(read_file)
hn[0:5]

[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'],
 ['12224879',
  'Interactive Dynamic Video',
  'http://www.interactivedynamicvideo.com/',
  '386',
  '52',
  'ne0phyte',
  '8/4/2016 11:52'],
 ['10975351',
  'How to Use Open Source and Shut the Fuck Up at the Same Time',
  'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/',
  '39',
  '10',
  'josep2',
  '1/26/2016 19:30'],
 ['11964716',
  "Florida DJs May Face Felony for April Fools' Water Joke",
  'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/',
  '2',
  '1',
  'vezycash',
  '6/23/2016 22:20'],
 ['11919867',
  'Technology ventures: From Idea to Enterprise',
  'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429',
  '3',
  '1',
  'hswarna',
  '6/17/2016 0:01']]

The first row in the dataset contains the column headers, so assign it to a `headers` variable and remove it from `hn`, leaving only the data rows.

In [10]:
headers = hn[0]
hn = hn[1:]
print(headers)
hn[0:5]

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']


[['12224879',
  'Interactive Dynamic Video',
  'http://www.interactivedynamicvideo.com/',
  '386',
  '52',
  'ne0phyte',
  '8/4/2016 11:52'],
 ['10975351',
  'How to Use Open Source and Shut the Fuck Up at the Same Time',
  'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/',
  '39',
  '10',
  'josep2',
  '1/26/2016 19:30'],
 ['11964716',
  "Florida DJs May Face Felony for April Fools' Water Joke",
  'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/',
  '2',
  '1',
  'vezycash',
  '6/23/2016 22:20'],
 ['11919867',
  'Technology ventures: From Idea to Enterprise',
  'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429',
  '3',
  '1',
  'hswarna',
  '6/17/2016 0:01'],
 ['10301696',
  'Note by Note: The Making of Steinway L1037 (2007)',
  'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0',
  '8',
  '2',
  'walterbell',
  '9/30/2015 4:12']]
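The rest of the analysis reaches into each row by position (`row[1]` for the title, `row[4]` for the comment count, `row[6]` for the creation time). As a minimal sketch, assuming the header row shown above, a name-to-index mapping makes those positions explicit. The name `col_index` is hypothetical and is not used elsewhere in this notebook.

```python
# Header row and first data row, copied from the output above.
headers = ['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
sample_row = ['12224879', 'Interactive Dynamic Video',
              'http://www.interactivedynamicvideo.com/',
              '386', '52', 'ne0phyte', '8/4/2016 11:52']

# Hypothetical helper: map each column name to its position in a row.
col_index = {name: position for position, name in enumerate(headers)}

print(col_index["title"])                     # 1
print(col_index["num_comments"])              # 4
print(sample_row[col_index["created_at"]])    # 8/4/2016 11:52
```

Note that every value in a row is a string, which is why the comment counts are converted with `int()` before they are summed.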

Now split the dataset into three separate lists: **Ask HN** posts, **Show HN** posts, and all other posts.

In [12]:
ask_posts = []
show_posts = []
other_posts = []

for row in hn:
    # Lowercase the title so the prefix check is case-insensitive.
    title = row[1].lower()
    if title.startswith("ask hn"):
        ask_posts.append(row)
    elif title.startswith("show hn"):
        show_posts.append(row)
    else:
        other_posts.append(row)

print(len(ask_posts))
print(len(show_posts))
print(len(other_posts))

1744
1162
17194
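The split above relies on a case-insensitive prefix match on the title. The following is a small sketch of that rule in isolation; `classify_title` is a hypothetical helper, and the second sample title is made up purely to show that capitalization does not matter.

```python
def classify_title(title):
    """Return 'ask', 'show', or 'other' based on a case-insensitive title prefix."""
    lowered = title.lower()
    if lowered.startswith("ask hn"):
        return "ask"
    if lowered.startswith("show hn"):
        return "show"
    return "other"


sample_titles = [
    "Ask HN: How to improve my personal website?",  # taken from the dataset
    "SHOW HN: A made-up title for illustration",    # invented example
    "Interactive Dynamic Video",                    # taken from the dataset
]

labels = [classify_title(title) for title in sample_titles]
print(labels)  # ['ask', 'show', 'other']
```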


Now, let's find the average number of comments on **Ask HN** posts.

In [14]:
total_ask_comments = 0
for row in ask_posts:
    num_comments = int(row[4])
    total_ask_comments += num_comments
avg_ask_comments = total_ask_comments/len(ask_posts)

In [15]:
print(avg_ask_comments)

14.038417431192661


Let's do the same for **Show HN** posts.

In [16]:
total_show_comments = 0
for row in show_posts:
    num_comments = int(row[4])
    total_show_comments += num_comments
avg_show_comments = total_show_comments/len(show_posts)
print(avg_show_comments)

10.31669535283993
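The two averaging loops differ only in the list they iterate over, so the calculation could be factored into one function. This is a sketch only: `average_comments` and its `comments_index` parameter are hypothetical names, and the two sample rows are Ask HN posts taken from the dataset.

```python
def average_comments(rows, comments_index=4):
    """Return the mean of the comment counts stored as strings in the given column."""
    if not rows:
        # Avoid dividing by zero when a category has no posts.
        return 0.0
    total = sum(int(row[comments_index]) for row in rows)
    return total / len(rows)


sample_rows = [
    ['12296411', 'Ask HN: How to improve my personal website?', '',
     '2', '6', 'ahmedbaracat', '8/16/2016 9:55'],
    ['10610020', 'Ask HN: Am I the only one outraged by Twitter shutting down share counts?', '',
     '28', '29', 'tkfx', '11/22/2015 13:43'],
]

print(average_comments(sample_rows))  # 17.5
print(average_comments([]))           # 0.0
```

With a helper like this, `average_comments(ask_posts)` and `average_comments(show_posts)` would replace the two loops.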


Comparing the two averages, **Ask HN posts receive more comments on average (about 14.04) than Show HN posts (about 10.32)**.

This answers the first question. Now let's move on to the second one.

### Best time for a post to receive more comments

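Because Ask HN posts receive more comments on average, the rest of the analysis focuses on them. Start by looking at the creation time and number of comments for each Ask HN post.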

In [17]:
import datetime as dt

In [18]:
print(ask_posts[0:5])

[['12296411', 'Ask HN: How to improve my personal website?', '', '2', '6', 'ahmedbaracat', '8/16/2016 9:55'], ['10610020', 'Ask HN: Am I the only one outraged by Twitter shutting down share counts?', '', '28', '29', 'tkfx', '11/22/2015 13:43'], ['11610310', 'Ask HN: Aby recent changes to CSS that broke mobile?', '', '1', '1', 'polskibus', '5/2/2016 10:14'], ['12210105', 'Ask HN: Looking for Employee #3 How do I do it?', '', '1', '3', 'sph130', '8/2/2016 14:20'], ['10394168', 'Ask HN: Someone offered to buy my browser extension from me. What now?', '', '28', '17', 'roykolak', '10/15/2015 16:38']]


Create two dictionaries keyed by hour of the day: one counting the ***posts*** created in each hour, and one summing the ***comments*** those posts received.

In [21]:
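# Pair each Ask HN post's creation time with its comment count.
result_list = []
for row in ask_posts:
    created_at = row[6]
    num_comments = int(row[4])
    result_list.append([created_at, num_comments])

counts_by_hour = {}    # hour -> number of Ask HN posts created in that hour
comments_by_hour = {}  # hour -> total comments on those posts

for created_at, num_comments in result_list:
    # Parse the timestamp and keep only the zero-padded hour, e.g. "09".
    created = dt.datetime.strptime(created_at, "%m/%d/%Y %H:%M")
    hour = created.strftime("%H")
    if hour in counts_by_hour:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += num_comments
    else:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = num_comments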
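As an alternative sketch of the same per-hour aggregation, `collections.defaultdict` removes the need for the `if`/`else` branch, because missing keys start at zero. The first five sample pairs come from the Ask HN rows printed earlier; the last pair is invented so that one hour appears twice.

```python
import datetime as dt
from collections import defaultdict

sample_pairs = [
    ["8/16/2016 9:55", 6],
    ["11/22/2015 13:43", 29],
    ["5/2/2016 10:14", 1],
    ["8/2/2016 14:20", 3],
    ["10/15/2015 16:38", 17],
    ["1/1/2016 9:05", 4],  # invented row, added only to repeat the 09 hour
]

counts_by_hour = defaultdict(int)    # hour -> number of posts
comments_by_hour = defaultdict(int)  # hour -> total comments

for created_at, num_comments in sample_pairs:
    # strptime parses the string; strftime("%H") yields a zero-padded hour.
    hour = dt.datetime.strptime(created_at, "%m/%d/%Y %H:%M").strftime("%H")
    counts_by_hour[hour] += 1
    comments_by_hour[hour] += num_comments

print(counts_by_hour["09"], comments_by_hour["09"])  # 2 10
print(counts_by_hour["13"], comments_by_hour["13"])  # 1 29
```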

##### Compute the average number of comments per post for each hour from the data above

In [22]:
avg_by_hour = []
for hour in counts_by_hour:
    average = comments_by_hour[hour]/counts_by_hour[hour]
    avg_by_hour.append([hour, average])

print(avg_by_hour)

[['09', 5.5777777777777775], ['13', 14.741176470588234], ['10', 13.440677966101696], ['14', 13.233644859813085], ['16', 16.796296296296298], ['23', 7.985294117647059], ['12', 9.41095890410959], ['17', 11.46], ['15', 38.5948275862069], ['21', 16.009174311926607], ['20', 21.525], ['02', 23.810344827586206], ['18', 13.20183486238532], ['03', 7.796296296296297], ['05', 10.08695652173913], ['19', 10.8], ['01', 11.383333333333333], ['22', 6.746478873239437], ['08', 10.25], ['04', 7.170212765957447], ['00', 8.127272727272727], ['06', 9.022727272727273], ['07', 7.852941176470588], ['11', 11.051724137931034]]


Create a new list with the two values in each pair swapped, so the average comes first. This makes the list easier to read and lets it be sorted by average.

In [23]:
swap_avg_by_hour = []
for row in avg_by_hour:
    swap_avg_by_hour.append([row[1], row[0]])
print(swap_avg_by_hour)

[[5.5777777777777775, '09'], [14.741176470588234, '13'], [13.440677966101696, '10'], [13.233644859813085, '14'], [16.796296296296298, '16'], [7.985294117647059, '23'], [9.41095890410959, '12'], [11.46, '17'], [38.5948275862069, '15'], [16.009174311926607, '21'], [21.525, '20'], [23.810344827586206, '02'], [13.20183486238532, '18'], [7.796296296296297, '03'], [10.08695652173913, '05'], [10.8, '19'], [11.383333333333333, '01'], [6.746478873239437, '22'], [10.25, '08'], [7.170212765957447, '04'], [8.127272727272727, '00'], [9.022727272727273, '06'], [7.852941176470588, '07'], [11.051724137931034, '11']]


In [25]:
sorted_swap = sorted(swap_avg_by_hour)
sorted_swap

[[5.5777777777777775, '09'],
 [6.746478873239437, '22'],
 [7.170212765957447, '04'],
 [7.796296296296297, '03'],
 [7.852941176470588, '07'],
 [7.985294117647059, '23'],
 [8.127272727272727, '00'],
 [9.022727272727273, '06'],
 [9.41095890410959, '12'],
 [10.08695652173913, '05'],
 [10.25, '08'],
 [10.8, '19'],
 [11.051724137931034, '11'],
 [11.383333333333333, '01'],
 [11.46, '17'],
 [13.20183486238532, '18'],
 [13.233644859813085, '14'],
 [13.440677966101696, '10'],
 [14.741176470588234, '13'],
 [16.009174311926607, '21'],
 [16.796296296296298, '16'],
 [21.525, '20'],
 [23.810344827586206, '02'],
 [38.5948275862069, '15']]

Sort the list above in descending order of average comments per post.

In [27]:
sorted_swap = sorted(swap_avg_by_hour, reverse=True)
sorted_swap

[[38.5948275862069, '15'],
 [23.810344827586206, '02'],
 [21.525, '20'],
 [16.796296296296298, '16'],
 [16.009174311926607, '21'],
 [14.741176470588234, '13'],
 [13.440677966101696, '10'],
 [13.233644859813085, '14'],
 [13.20183486238532, '18'],
 [11.46, '17'],
 [11.383333333333333, '01'],
 [11.051724137931034, '11'],
 [10.8, '19'],
 [10.25, '08'],
 [10.08695652173913, '05'],
 [9.41095890410959, '12'],
 [9.022727272727273, '06'],
 [8.127272727272727, '00'],
 [7.985294117647059, '23'],
 [7.852941176470588, '07'],
 [7.796296296296297, '03'],
 [7.170212765957447, '04'],
 [6.746478873239437, '22'],
 [5.5777777777777775, '09']]
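Swapping the columns is one way to sort by the average. Another option, sketched here on a four-entry subset of the `avg_by_hour` values shown earlier, is to pass a `key` function to `sorted()` and leave the `[hour, average]` order untouched. The name `top_hours` is hypothetical.

```python
# A subset of the [hour, average] pairs computed earlier.
avg_by_hour = [
    ['09', 5.5777777777777775],
    ['15', 38.5948275862069],
    ['02', 23.810344827586206],
    ['20', 21.525],
]

# Sort by the second element (the average), largest first.
top_hours = sorted(avg_by_hour, key=lambda pair: pair[1], reverse=True)

for hour, average in top_hours:
    print(hour, average)
# 15 38.5948275862069
# 02 23.810344827586206
# 20 21.525
# 09 5.5777777777777775
```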

Finally, print the five hours with the highest average number of comments per Ask HN post.


In [32]:
print("Top 5 Hours for ask Posts to get more comments")
for row in sorted_swap[:5]:
    daytime = dt.datetime.strptime(row[1], "%H").strftime("%H:%M")
    print("{}: {:.2f} average comments per post".format(daytime, row[0]))

15:00: 38.59 average comments per post
02:00: 23.81 average comments per post
20:00: 21.52 average comments per post
16:00: 16.80 average comments per post
21:00: 16.01 average comments per post
