# Work with a data set of submissions to Hacker News

## Background

Hacker News is a site started by the startup incubator Y Combinator, where user-submitted stories (known as "posts") are voted and commented upon, similar to reddit. Hacker News is extremely popular in technology and startup circles, and posts that make it to the top of Hacker News' listings can get hundreds of thousands of visitors as a result.

We'll be examining two types of posts from Hacker News. Ask HN are posts that users submit to ask the Hacker News community a specific question. Show HN are posts by users to show the Hacker News community a project, product, or just generally something interesting.

We'll compare Ask HN and Show HN to answer the following questions:

A. Do `Ask HN` or `Show HN` receive more comments on average?

B. Do posts created at a certain time receive more comments on average?

C. Do either `Ask HN` or `Show HN` receive more points?

D. During which hours are the posts more likely to receive higher points?



## Step 1: Opening and Exploring the Data

You can find the data set at Kaggle here, but note that it has been reduced from almost 300,000 rows to approximately 20,000 rows by removing all submissions that did not receive any comments, and then randomly sampling from the remaining submissions. Below are descriptions of the columns:

- `id`: The unique identifier from Hacker News for the post
- `title`: The title of the post
- `url`: The URL that the post links to, if the post has a URL
- `num_points`: The number of points the post acquired, calculated as the total number of upvotes minus the total number of downvotes
- `num_comments`: The number of comments that were made on the post
- `author`: The username of the person who submitted the post
- `created_at`: The date and time at which the post was submitted

Let's start by importing the libraries we need and reading the data set into a list of lists, hn.

In [1]:
opened_file = open("hacker_news.csv",encoding = "utf-8")
from csv import reader
read_file = reader(opened_file)
hn = list(read_file)
print(hn[:5])

[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'], ['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']]


Notice that the first inner list contains the column headers, and each list after it contains the data for one row.
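The cell above leaves the file handle open. A minimal sketch of the same read inside a `with` block follows. To keep it runnable without the data file, it reads from `sample_csv`, an in-memory two-line excerpt of the output above; that name, `rows`, and `data` are illustrative and not part of the original notebook.

```python
import io
from csv import reader

# Two-line excerpt (header + first row) standing in for hacker_news.csv.
sample_csv = (
    "id,title,url,num_points,num_comments,author,created_at\n"
    "12224879,Interactive Dynamic Video,"
    "http://www.interactivedynamicvideo.com/,386,52,ne0phyte,8/4/2016 11:52\n"
)

# With the real file, the line below would instead be:
#   with open("hacker_news.csv", encoding="utf-8", newline="") as opened_file:
with io.StringIO(sample_csv) as opened_file:
    rows = list(reader(opened_file))  # the handle is closed when the block exits

header, data = rows[0], rows[1:]
print(header)
print(data)
```

Every value comes back as a string, which is why later steps convert `num_points` and `num_comments` with `int()`.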

## Step 2: In order to analyze our data, we'll remove the first row of column headers:

In [2]:
header = hn[0]
hn = hn[1:]
print(hn[:5])

[['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01'], ['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']]


Now that we've removed the headers from hn, we're ready to filter our data. Since we're only concerned with post titles beginning with Ask HN or Show HN, we'll create new lists of lists containing just the data for those titles.

## Step 3: Filter our data to find the posts we're interested in

To find the posts that begin with either Ask HN or Show HN, we'll use the string method startswith. Given a string object, say, string1, we can check whether it starts with, say, dq by inspecting the result of string1.startswith('dq'). If string1 starts with dq, the call returns True; otherwise it returns False.

In [3]:
print('dataquest'.startswith('Data'))
print('dataquest'.startswith('data'))

False
True


In the example above, the first print call gives us False because dataquest does not start with Data. The second print call prints True because dataquest does start with data. Capitalization matters.

If we wish to control for case, we can use the lower method which returns a lowercase version of the starting string. Here's an example:

In [4]:
print('DataQuest'.lower())

dataquest


Let's use these methods to separate posts beginning with Ask HN and Show HN (and case variations) into two different lists next.

In [5]:
ask_posts = []
show_posts = []
other_posts = []
for row in hn:
    title = row[1]
    title = title.lower()
    if title.startswith('ask hn'):
        ask_posts.append(row)
    elif title.startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)
        
print("Total number of ask posts:", len(ask_posts))
print("Total number of show posts:", len(show_posts))
print("Total number of other posts:", len(other_posts))

Total number of ask posts: 1744
Total number of show posts: 1162
Total number of other posts: 17194


Above, we separated the "ask posts" and the "show posts" into two lists of lists named ask_posts and show_posts.

We note that the majority of posts fall into the category of other posts. It might be of interest to extend our sample and take a closer look at them. For the moment, we will accept the results and work with our filtered data.
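The filtering logic can also be wrapped in a small function so that the case-insensitive prefix test lives in one place. This is only a sketch: `classify_title`, `sample_titles`, and `category_counts` are hypothetical names, and the sample titles are adapted from the data shown in this notebook (one is upper-cased on purpose to exercise the case handling).

```python
from collections import Counter


def classify_title(title):
    """Return 'ask', 'show', or 'other' based on the title prefix, ignoring case."""
    lowered = title.lower()
    if lowered.startswith("ask hn"):
        return "ask"
    if lowered.startswith("show hn"):
        return "show"
    return "other"


# Small illustrative sample, not the full data set.
sample_titles = [
    "Ask HN: How to improve my personal website?",
    "SHOW HN: Something pointless I made",
    "Interactive Dynamic Video",
]

# Tally how many titles fall into each category.
category_counts = Counter(classify_title(title) for title in sample_titles)
print(category_counts)
```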

Below are the first five rows of each of the lists ask_posts and show_posts:

In [6]:
print(ask_posts[:5])

[['12296411', 'Ask HN: How to improve my personal website?', '', '2', '6', 'ahmedbaracat', '8/16/2016 9:55'], ['10610020', 'Ask HN: Am I the only one outraged by Twitter shutting down share counts?', '', '28', '29', 'tkfx', '11/22/2015 13:43'], ['11610310', 'Ask HN: Aby recent changes to CSS that broke mobile?', '', '1', '1', 'polskibus', '5/2/2016 10:14'], ['12210105', 'Ask HN: Looking for Employee #3 How do I do it?', '', '1', '3', 'sph130', '8/2/2016 14:20'], ['10394168', 'Ask HN: Someone offered to buy my browser extension from me. What now?', '', '28', '17', 'roykolak', '10/15/2015 16:38']]


In [7]:
print(show_posts[:5])

[['10627194', 'Show HN: Wio Link  ESP8266 Based Web of Things Hardware Development Platform', 'https://iot.seeed.cc', '26', '22', 'kfihihc', '11/25/2015 14:03'], ['10646440', 'Show HN: Something pointless I made', 'http://dn.ht/picklecat/', '747', '102', 'dhotson', '11/29/2015 22:46'], ['11590768', 'Show HN: Shanhu.io, a programming playground powered by e8vm', 'https://shanhu.io', '1', '1', 'h8liu', '4/28/2016 18:05'], ['12178806', 'Show HN: Webscope  Easy way for web developers to communicate with Clients', 'http://webscopeapp.com', '3', '3', 'fastbrick', '7/28/2016 7:11'], ['10872799', 'Show HN: GeoScreenshot  Easily test Geo-IP based web pages', 'https://www.geoscreenshot.com/', '1', '9', 'kpsychwave', '1/9/2016 20:45']]


## Step 4: Determine if ask posts or show posts receive more comments on average

In [8]:
total_ask_comments = 0
for row in ask_posts:
    num_comments = int(row[4])
    total_ask_comments += num_comments
avg_ask_comments = total_ask_comments/len(ask_posts)
print(avg_ask_comments)

total_show_comments = 0
for row in show_posts:
    num_comments = int(row[4])
    total_show_comments += num_comments
avg_show_comments = total_show_comments/len(show_posts)
print(avg_show_comments)

14.038417431192661
10.31669535283993


### Answer A: Our calculation indicates that on average, Ask HN posts receive more comments.

Since Ask HN posts receive more comments on average, we'll focus the next part of our analysis on just these posts.
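The two loops above differ only in the list they read, so the averaging could be factored into a helper. The sketch below assumes a hypothetical function `average_column` and uses two rows copied from the `ask_posts` output as `sample_rows`; it is not the notebook's own code.

```python
def average_column(rows, index):
    """Average of the integer values stored as strings at position `index`."""
    if not rows:
        return 0.0  # avoid dividing by zero on an empty list
    return sum(int(row[index]) for row in rows) / len(rows)


# Two rows taken from the ask_posts output above.
sample_rows = [
    ['12296411', 'Ask HN: How to improve my personal website?', '', '2', '6',
     'ahmedbaracat', '8/16/2016 9:55'],
    ['10610020', 'Ask HN: Am I the only one outraged by Twitter shutting down share counts?',
     '', '28', '29', 'tkfx', '11/22/2015 13:43'],
]

print(average_column(sample_rows, 4))  # average of num_comments: (6 + 29) / 2
print(average_column(sample_rows, 3))  # average of num_points: (2 + 28) / 2
```

The same helper would work for the points comparison later on by passing index 3 instead of 4.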

## Step 5: Determine if there is a certain time ask posts are more likely to attract comments

We'll use the following steps to perform this analysis:

1. Calculate the number of ask posts created in each hour of the day, along with the number of comments received.
2. Calculate the average number of comments ask posts received by hour created.

We'll use the datetime module to work with the data in the created_at column. Note that the times are in EST (Eastern Standard Time); we'll use that information later to convert the best posting times to our own time zone.
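As a quick illustration of what datetime does with one of these strings, the sketch below parses a single created_at value taken from the data and extracts the hour in two forms. The variable names are illustrative.

```python
import datetime as dt

created_at = "8/16/2016 9:55"  # one created_at value from the ask posts

# The format string mirrors the month/day/year hour:minute layout of the column.
parsed = dt.datetime.strptime(created_at, "%m/%d/%Y %H:%M")

print(parsed.hour)            # hour as an integer
print(parsed.strftime("%H"))  # hour as a zero-padded two-character string
```

The analysis in this notebook uses the zero-padded string form as its dictionary key; the integer form would work equally well.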

Below, we'll create a two element list corresponding to the time data and comments. This will allow us to focus on the information necessary to build a frequency table.

In [9]:
import datetime as dt
result_list = []
for row in ask_posts:
    created_at = row[6]
    num_comments = int(row[4])
    in_list = [created_at, num_comments]
    result_list.append(in_list)
print(result_list[:4])    

[['8/16/2016 9:55', 6], ['11/22/2015 13:43', 29], ['5/2/2016 10:14', 1], ['8/2/2016 14:20', 3]]


We can now create the frequency table with the date and comments data:

In [10]:
counts_by_hour = {}
comments_by_hour = {}

for row in result_list:
    dt_object = row[0]
    dt_parsed = dt.datetime.strptime(dt_object, "%m/%d/%Y %H:%M")
    hr = dt.datetime.strftime(dt_parsed, "%H")
    # print(hr)

    if hr not in counts_by_hour:
        counts_by_hour[hr] = 1
        comments_by_hour[hr] = int(row[1])
    else:
        counts_by_hour[hr] = counts_by_hour[hr] + 1
        comments_by_hour[hr] = comments_by_hour[hr] + int(row[1])

print(comments_by_hour)
print(counts_by_hour)

{'09': 251, '13': 1253, '10': 793, '14': 1416, '16': 1814, '23': 543, '12': 687, '17': 1146, '15': 4477, '21': 1745, '20': 1722, '02': 1381, '18': 1439, '03': 421, '05': 464, '19': 1188, '01': 683, '22': 479, '08': 492, '04': 337, '00': 447, '06': 397, '07': 267, '11': 641}
{'09': 45, '13': 85, '10': 59, '14': 107, '16': 108, '23': 68, '12': 73, '17': 100, '15': 116, '21': 109, '20': 80, '02': 58, '18': 109, '03': 54, '05': 46, '19': 110, '01': 60, '22': 71, '08': 48, '04': 47, '00': 55, '06': 44, '07': 34, '11': 58}
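The `if hr not in ...` bookkeeping can be avoided with `collections.defaultdict`, which supplies a starting value of 0 for unseen keys. The sketch below is an alternative formulation, not the notebook's method. `sample` holds the four entries printed from `result_list` plus one made-up entry (marked in the code) so that one hour occurs twice.

```python
import datetime as dt
from collections import defaultdict

sample = [
    ['8/16/2016 9:55', 6],
    ['11/22/2015 13:43', 29],
    ['5/2/2016 10:14', 1],
    ['8/2/2016 14:20', 3],
    ['1/1/2016 9:05', 4],  # hypothetical entry, added only to repeat hour 09
]

counts = defaultdict(int)    # posts per hour
comments = defaultdict(int)  # comments per hour

for created_at, num_comments in sample:
    hour = dt.datetime.strptime(created_at, "%m/%d/%Y %H:%M").strftime("%H")
    counts[hour] += 1
    comments[hour] += num_comments

print(dict(counts))
print(dict(comments))
```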


## Step 6: Average number of comments per post for each hour

We can now use these two dictionaries to calculate the average number of comments for posts created during each hour of the day. Below, we will build a list of lists containing the hours during which posts were created and the average number of comments those posts received.

In [11]:
avg_by_hour = []
for hour in comments_by_hour:
    avg_by_hour.append([hour, comments_by_hour[hour]/counts_by_hour[hour]])
print(avg_by_hour)

[['09', 5.5777777777777775], ['13', 14.741176470588234], ['10', 13.440677966101696], ['14', 13.233644859813085], ['16', 16.796296296296298], ['23', 7.985294117647059], ['12', 9.41095890410959], ['17', 11.46], ['15', 38.5948275862069], ['21', 16.009174311926607], ['20', 21.525], ['02', 23.810344827586206], ['18', 13.20183486238532], ['03', 7.796296296296297], ['05', 10.08695652173913], ['19', 10.8], ['01', 11.383333333333333], ['22', 6.746478873239437], ['08', 10.25], ['04', 7.170212765957447], ['00', 8.127272727272727], ['06', 9.022727272727273], ['07', 7.852941176470588], ['11', 11.051724137931034]]


Although we now have the results we need, this format makes it hard to identify the hours with the highest values.

## Step 7: Sort the list of lists and print the five highest values so it's easier to read

We'll swap the elements to display the average by hour below.

In [12]:
swap_avg_by_hour = []
for row in avg_by_hour:
    swap_avg_by_hour.append([row[1],row[0]])
print(swap_avg_by_hour)

[[5.5777777777777775, '09'], [14.741176470588234, '13'], [13.440677966101696, '10'], [13.233644859813085, '14'], [16.796296296296298, '16'], [7.985294117647059, '23'], [9.41095890410959, '12'], [11.46, '17'], [38.5948275862069, '15'], [16.009174311926607, '21'], [21.525, '20'], [23.810344827586206, '02'], [13.20183486238532, '18'], [7.796296296296297, '03'], [10.08695652173913, '05'], [10.8, '19'], [11.383333333333333, '01'], [6.746478873239437, '22'], [10.25, '08'], [7.170212765957447, '04'], [8.127272727272727, '00'], [9.022727272727273, '06'], [7.852941176470588, '07'], [11.051724137931034, '11']]


Now we can find the top 5 hours in which Ask HN posts receive the most comments on average:

In [13]:
sorted_swap = sorted(swap_avg_by_hour, reverse = True)
print("Top 5 Hours for Ask Posts Comments")
for row in sorted_swap[:5]:
    hr_obj = dt.datetime.strptime(row[1], "%H")
    hr_obj_string = dt.datetime.strftime(hr_obj, "%H:%M")
    Template = "{}: {:.2f} average comments per post."
    print(Template.format(hr_obj_string, row[0]))

Top 5 Hours for Ask Posts Comments
15:00: 38.59 average comments per post.
02:00: 23.81 average comments per post.
20:00: 21.52 average comments per post.
16:00: 16.80 average comments per post.
21:00: 16.01 average comments per post.
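Swapping the elements is one way to sort by the average. As an alternative sketch, `sorted` accepts a `key` function, so the original `[hour, average]` pairs can be sorted directly. `sample_avg_by_hour` below is a four-entry excerpt of the `avg_by_hour` output, and `top_two` is an illustrative name.

```python
# Excerpt of avg_by_hour from the output above: [hour, average comments].
sample_avg_by_hour = [
    ['09', 5.5777777777777775],
    ['13', 14.741176470588234],
    ['15', 38.5948275862069],
    ['02', 23.810344827586206],
]

# Sort on the second element (the average), highest first, without swapping.
top_two = sorted(sample_avg_by_hour, key=lambda pair: pair[1], reverse=True)[:2]

for hour, avg in top_two:
    print("{}:00: {:.2f} average comments per post.".format(hour, avg))
```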


### Answer B: On average, Ask HN posts created at 15:00 EST receive the most comments.

As previously noted, the above times are in EST (Eastern Standard Time). Pakistan Standard Time (UTC+5) is 9 hours ahead of US Eastern time while daylight saving time is in effect there, and 10 hours ahead otherwise. Using the 9-hour offset, posts created at 00:00, 11:00, 05:00, 01:00, and 06:00 Pakistan time have the best chance of receiving the most comments.
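The hour conversion is modular arithmetic on a 24-hour clock. The sketch below applies the 9-hour offset used in the text to the five top hours from the output above; `shift_hour` is a hypothetical helper, and the offset is passed in as a parameter because the right value depends on the time zone and on daylight saving time.

```python
def shift_hour(hour, offset):
    """Shift an hour of the day by `offset` hours, wrapping around midnight."""
    return (hour + offset) % 24


top_hours_est = [15, 2, 20, 16, 21]  # top five hours from the output above
shifted = [shift_hour(hour, 9) for hour in top_hours_est]

print(["{:02d}:00".format(hour) for hour in shifted])
```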

## Step 8: Find out if either Ask HN or Show HN receive more points

To know this, we will calculate the average number of points received by Ask and Show posts, in the code below:

In [14]:
total_ask_points = 0
for row in ask_posts:
    num_points = int(row[3])
    total_ask_points += num_points
avg_ask_points = total_ask_points/len(ask_posts)

total_show_points = 0
for row in show_posts:
    num_points = int(row[3])
    total_show_points += num_points
avg_show_points = total_show_points/len(show_posts)
print("Average like or points per ask post:",avg_ask_points)
print("Average like or points per show post:",avg_show_points)

Average like or points per ask post: 15.061926605504587
Average like or points per show post: 27.555077452667813


### Answer C: We've determined that Show HN posts receive more points.

## Step 9: During which hours are Show HN posts more likely to receive higher points

As the average number of points per Show HN post is greater, we will continue our analysis on the times that they are most likely to receive higher points.

We'll now construct a list to hold the data we're interested in - similar to what we did previously for ask_posts in Step 5.

In [15]:
show_result_list = []
for row in show_posts:
    created_at = row[6]
    num_points = int(row[3])
    show_result_list.append([created_at, num_points])
print(show_result_list[:4])

[['11/25/2015 14:03', 26], ['11/29/2015 22:46', 747], ['4/28/2016 18:05', 1], ['7/28/2016 7:11', 3]]


We can now create the frequency table with the date and points data:

In [16]:
counts_by_show_hours = {}
points_by_hours = {}
for row in show_result_list:
    dt_object = row[0]
    dt_parsed = dt.datetime.strptime(dt_object,"%m/%d/%Y %H:%M")
    hr = dt.datetime.strftime(dt_parsed, "%H")
    if hr not in counts_by_show_hours:
        counts_by_show_hours[hr] = 1
        points_by_hours[hr] = row[1]
    else:
        counts_by_show_hours[hr] += 1
        points_by_hours[hr] += row[1]
print(counts_by_show_hours)
print(points_by_hours)

{'14': 86, '22': 46, '18': 61, '07': 26, '20': 60, '05': 19, '16': 93, '19': 55, '15': 78, '03': 27, '17': 93, '06': 16, '02': 30, '13': 99, '08': 34, '21': 47, '04': 26, '11': 44, '12': 61, '23': 36, '09': 30, '01': 28, '10': 36, '00': 31}
{'14': 2187, '22': 1856, '18': 2215, '07': 494, '20': 1819, '05': 104, '16': 2634, '19': 1702, '15': 2228, '03': 679, '17': 2521, '06': 375, '02': 340, '13': 2438, '08': 519, '21': 866, '04': 386, '11': 1480, '12': 2543, '23': 1526, '09': 553, '01': 700, '10': 681, '00': 1173}


## Step 10: Average number of points per Show HN post for each hour

We can now use these two dictionaries for our calculation, very similar to Step 6.

In [17]:
avg_by_show_hour = []
for hour in points_by_hours:
    avg_by_show_hour.append([hour, points_by_hours[hour]/counts_by_show_hours[hour]])
print(avg_by_show_hour)

[['14', 25.430232558139537], ['22', 40.34782608695652], ['18', 36.31147540983606], ['07', 19.0], ['20', 30.316666666666666], ['05', 5.473684210526316], ['16', 28.322580645161292], ['19', 30.945454545454545], ['15', 28.564102564102566], ['03', 25.14814814814815], ['17', 27.107526881720432], ['06', 23.4375], ['02', 11.333333333333334], ['13', 24.626262626262626], ['08', 15.264705882352942], ['21', 18.425531914893618], ['04', 14.846153846153847], ['11', 33.63636363636363], ['12', 41.68852459016394], ['23', 42.388888888888886], ['09', 18.433333333333334], ['01', 25.0], ['10', 18.916666666666668], ['00', 37.83870967741935]]
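The per-hour average is the total points for the hour divided by the number of posts in that hour. A dictionary comprehension expresses this compactly. This is a sketch: `sample_points` and `sample_counts` are three-entry excerpts of the two dictionaries printed above, and `avg_points` is an illustrative name.

```python
# Excerpts of points_by_hours and counts_by_show_hours from the output above.
sample_points = {'14': 2187, '22': 1856, '05': 104}
sample_counts = {'14': 86, '22': 46, '05': 19}

# average points per post = total points in the hour / posts in the hour
avg_points = {
    hour: sample_points[hour] / sample_counts[hour]
    for hour in sample_points
}

for hour, avg in avg_points.items():
    print(hour, round(avg, 2))
```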


## Step 11: Sort the list of lists and print the five highest values so it's easier to read

We'll swap the elements as in Step 7 and arrange them below:

In [18]:
swap_show_avg_by_hour = []
for row in avg_by_show_hour:
    swap_show_avg_by_hour.append([row[1],row[0]])
swap_show_avg_by_hour

[[25.430232558139537, '14'],
 [40.34782608695652, '22'],
 [36.31147540983606, '18'],
 [19.0, '07'],
 [30.316666666666666, '20'],
 [5.473684210526316, '05'],
 [28.322580645161292, '16'],
 [30.945454545454545, '19'],
 [28.564102564102566, '15'],
 [25.14814814814815, '03'],
 [27.107526881720432, '17'],
 [23.4375, '06'],
 [11.333333333333334, '02'],
 [24.626262626262626, '13'],
 [15.264705882352942, '08'],
 [18.425531914893618, '21'],
 [14.846153846153847, '04'],
 [33.63636363636363, '11'],
 [41.68852459016394, '12'],
 [42.388888888888886, '23'],
 [18.433333333333334, '09'],
 [25.0, '01'],
 [18.916666666666668, '10'],
 [37.83870967741935, '00']]

In [19]:
swap_sorted = sorted(swap_show_avg_by_hour, reverse = True)

print("Top 5 Hours for Show Posts Points")
for row in swap_sorted[:5]:
    hr_obj = dt.datetime.strptime(row[1], "%H")
    hr_obj_string = dt.datetime.strftime(hr_obj, "%H:%M")
    Template = "{}: {:.2f} average Likes or Points per post."
    print(Template.format(hr_obj_string, row[0]))

Top 5 Hours for Show Posts Points
23:00: 42.39 average Likes or Points per post.
12:00: 41.69 average Likes or Points per post.
22:00: 40.35 average Likes or Points per post.
00:00: 37.84 average Likes or Points per post.
18:00: 36.31 average Likes or Points per post.


### Answer D: The best time for a Show HN post is 23:00 EST.

The times listed above are in Eastern Standard Time, so for Pakistan Standard Time we'll again advance 9 hours.

8:00, 21:00, and 7:00 are the best times, followed by 9:00 and 3:00, to obtain the highest average points per post.
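Steps 5 to 7 and Steps 9 to 11 follow the same pattern: group by hour, average a column, then sort. The sketch below folds that pattern into one hypothetical function, `top_hours`, shown on three rows copied from the `show_posts` output. It is an illustration under those assumptions, not the code used to produce the results above.

```python
import datetime as dt
from collections import defaultdict


def top_hours(rows, value_index, n=5, date_index=6):
    """Return up to `n` (average, hour) pairs, highest average first."""
    totals = defaultdict(int)
    counts = defaultdict(int)
    for row in rows:
        hour = dt.datetime.strptime(row[date_index], "%m/%d/%Y %H:%M").hour
        totals[hour] += int(row[value_index])
        counts[hour] += 1
    averages = [(totals[hour] / counts[hour], hour) for hour in totals]
    return sorted(averages, reverse=True)[:n]


# Three rows taken from the show_posts output above.
sample_show = [
    ['10627194', 'Show HN: Wio Link  ESP8266 Based Web of Things Hardware Development Platform',
     'https://iot.seeed.cc', '26', '22', 'kfihihc', '11/25/2015 14:03'],
    ['10646440', 'Show HN: Something pointless I made', 'http://dn.ht/picklecat/',
     '747', '102', 'dhotson', '11/29/2015 22:46'],
    ['11590768', 'Show HN: Shanhu.io, a programming playground powered by e8vm',
     'https://shanhu.io', '1', '1', 'h8liu', '4/28/2016 18:05'],
]

# value_index=3 averages num_points; value_index=4 would average num_comments.
for avg, hour in top_hours(sample_show, value_index=3, n=2):
    print("{:02d}:00: {:.2f} average points per post.".format(hour, avg))
```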

## Summary
To summarize our results:

A. Do `Ask HN` or `Show HN` receive more comments on average?

### On average, `Ask HN` posts receive more comments (about 14 vs. 10).

B. Do posts created at a certain time receive more comments on average?

### `Ask HN` posts created at 15:00 EST receive the most comments on average (about 38.6 per post).

C. Do either `Ask HN` or `Show HN` receive more points?

### `Show HN` posts receive more points on average (about 27.6 vs. 15.1 for `Ask HN`).

D. During which hours are the posts more likely to receive higher points?

### `Show HN` posts created at 23:00 EST receive the most points on average (42.39), followed closely by those created at 12:00 (41.69) and 22:00 (40.35).