# Analyzing Popularity of Hacker News Posts
## Introduction
The aim of this project is to identify which kinds of user-submitted posts on Hacker News, a popular technology site, receive more comments and points. In particular, we are interested in posts whose titles begin with either *Ask HN* (submitted to ask the Hacker News community a specific question) or *Show HN* (submitted to show a project, a product, or simply something interesting). We'll compare these two types of posts to determine the following:

- Do *Ask HN* or *Show HN* posts receive more comments/points on average?
- Do posts created at a certain time receive more comments/points on average?

The [original data set](https://www.kaggle.com/hacker-news/hacker-news-posts) for our analysis was reduced from almost 300,000 rows to approximately 20,000 rows by removing all submissions that did not receive any comments, and then randomly sampling from the remaining submissions. For descriptions of the columns please consult the [data set documentation](https://www.kaggle.com/hacker-news/hacker-news-posts).

Let's start by opening the data set and reading it into a list of lists.

## 1. Loading the Data

In [1]:
import csv

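# Read the whole CSV file into a list of rows; the context manager closes the file afterwards
with open("hacker_news.csv") as opened_file:
    read_file = csv.reader(opened_file)
    hn = list(read_file)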

In [2]:
print(hn[:5])

[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'], ['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']]


In order to analyze our data, we need to first remove the row containing the column headers.

In [3]:
headers = hn[0]
hn = hn[1:]

In [4]:
print(headers)

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']


In [5]:
print(hn[:5])

[['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01'], ['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']]
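
As an alternative to slicing, the header row can be separated while the file is being read. The sketch below is only an illustration: it assumes a small in-memory CSV (`sample_csv`, with made-up rows) in place of `hacker_news.csv`, and pulls off the first row with `next()` on the `csv.reader` iterator.

```python
import csv
import io

# Hypothetical two-row sample standing in for hacker_news.csv.
sample_csv = io.StringIO(
    "id,title,url,num_points,num_comments,author,created_at\n"
    "1,Ask HN: Example question?,,2,6,someone,8/16/2016 9:55\n"
    "2,Show HN: Example project,http://example.com,5,3,someone_else,8/17/2016 10:00\n"
)

reader = csv.reader(sample_csv)
sample_headers = next(reader)   # first row: the column names
sample_rows = list(reader)      # remaining rows: the data

print(sample_headers)
print(len(sample_rows))
```

Either approach leaves the header and the data in separate variables; this one simply avoids copying the list a second time.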


## 2. Extracting Ask HN Posts and Show HN Posts
Since we're only concerned with post titles beginning with *Ask HN* or *Show HN*, we'll create new lists of lists containing just the data for those titles.

In [6]:
ask_posts = []
show_posts = []
other_posts = []

for row in hn:
    title = row[1]
    if title.lower().startswith("ask hn"):
        ask_posts.append(row)
    elif title.lower().startswith("show hn"):
        show_posts.append(row)
    else:
        other_posts.append(row)

print("The number of ask posts:", len(ask_posts))
print("The number of show posts:", len(show_posts))
print("The number of other posts:", len(other_posts))

The number of ask posts: 1744
The number of show posts: 1162
The number of other posts: 17194
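
The three counts add up to 20,100 rows, so every post lands in exactly one group. The prefix test itself can be isolated in a small function, which makes its behavior on edge cases easier to see. The sketch below uses a hypothetical helper name (`classify_title`) and made-up titles rather than rows from the data set.

```python
def classify_title(title):
    """Return 'ask', 'show', or 'other' based on the title prefix (hypothetical helper)."""
    lowered = title.lower()
    if lowered.startswith("ask hn"):
        return "ask"
    if lowered.startswith("show hn"):
        return "show"
    return "other"

# Made-up titles, not taken from the data set.
examples = [
    "Ask HN: Example?",
    "SHOW HN: Example",
    "Task HN list",
    "I want to ask HN something",
]
labels = [classify_title(t) for t in examples]
print(labels)
```

Only titles that *begin* with the prefix are matched, regardless of letter case; a title that merely contains "ask HN" somewhere in the middle counts as "other".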


In [7]:
print(ask_posts[:5])

[['12296411', 'Ask HN: How to improve my personal website?', '', '2', '6', 'ahmedbaracat', '8/16/2016 9:55'], ['10610020', 'Ask HN: Am I the only one outraged by Twitter shutting down share counts?', '', '28', '29', 'tkfx', '11/22/2015 13:43'], ['11610310', 'Ask HN: Aby recent changes to CSS that broke mobile?', '', '1', '1', 'polskibus', '5/2/2016 10:14'], ['12210105', 'Ask HN: Looking for Employee #3 How do I do it?', '', '1', '3', 'sph130', '8/2/2016 14:20'], ['10394168', 'Ask HN: Someone offered to buy my browser extension from me. What now?', '', '28', '17', 'roykolak', '10/15/2015 16:38']]


In [8]:
print(show_posts[:5])

[['10627194', 'Show HN: Wio Link  ESP8266 Based Web of Things Hardware Development Platform', 'https://iot.seeed.cc', '26', '22', 'kfihihc', '11/25/2015 14:03'], ['10646440', 'Show HN: Something pointless I made', 'http://dn.ht/picklecat/', '747', '102', 'dhotson', '11/29/2015 22:46'], ['11590768', 'Show HN: Shanhu.io, a programming playground powered by e8vm', 'https://shanhu.io', '1', '1', 'h8liu', '4/28/2016 18:05'], ['12178806', 'Show HN: Webscope  Easy way for web developers to communicate with Clients', 'http://webscopeapp.com', '3', '3', 'fastbrick', '7/28/2016 7:11'], ['10872799', 'Show HN: GeoScreenshot  Easily test Geo-IP based web pages', 'https://www.geoscreenshot.com/', '1', '9', 'kpsychwave', '1/9/2016 20:45']]


## 3. Analyzing Comments for Ask HN and Show HN Posts
### 3.1. Calculating the Average Number of Comments

Let's determine if ask posts or show posts receive more comments on average.

In [9]:
total_ask_comments = 0

for post in ask_posts:
    num_comments = int(post[4])
    total_ask_comments += num_comments
    
avg_ask_comments = total_ask_comments / len(ask_posts)


total_show_comments = 0

for post in show_posts:
    num_comments = int(post[4])
    total_show_comments += num_comments
    
avg_show_comments = total_show_comments / len(show_posts)

print("The average number of comments on ask posts:", avg_ask_comments)
print("The average number of comments on show posts:", avg_show_comments)

The average number of comments on ask posts: 14.038417431192661
The average number of comments on show posts: 10.31669535283993


We can see that ask posts receive on average about 1.4 times as many comments as show posts. One possible explanation is that people who use the site to find answers to their questions, apart from submitting their own posts, probably also look through other submissions on similar topics, especially those that have already received some comments. As a result, they join the existing discussions, sharing their own experience and the problems they have encountered, which raises the average number of comments on ask posts.
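
The ratio quoted above can be checked directly from the two printed averages. This is a quick arithmetic sketch; the two values are copied from the output above rather than recomputed from the data.

```python
avg_ask_comments = 14.038417431192661   # value printed above
avg_show_comments = 10.31669535283993   # value printed above

ratio = avg_ask_comments / avg_show_comments
print(round(ratio, 2))
```

The result is a little under 1.4, which is where the rounded figure in the text comes from.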

Since ask posts are more likely to receive comments, we'll focus our further analysis just on these posts.

Let's determine if ask posts created at a certain time are more likely to attract comments. We'll use the following steps to perform this analysis:

- Calculate the number of ask posts created in each hour of the day, along with the number of comments received.
- Calculate the average number of comments ask posts receive by hour created.
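
The two steps above can be sketched on a handful of rows before applying them to the full list. The rows below are hypothetical `[created_at, num_comments]` pairs in the data set's date format, and `collections.defaultdict` is used here as one possible way to accumulate the totals.

```python
import datetime as dt
from collections import defaultdict

# Hypothetical [created_at, num_comments] pairs, in the data set's date format.
toy_rows = [
    ["8/16/2016 9:55", 6],
    ["11/22/2015 9:10", 4],
    ["5/2/2016 15:14", 10],
]

toy_counts = defaultdict(int)     # posts per hour
toy_comments = defaultdict(int)   # comments per hour
for created_at, num_comments in toy_rows:
    # Parse the timestamp and keep only the zero-padded hour, e.g. "09".
    hour = dt.datetime.strptime(created_at, "%m/%d/%Y %H:%M").strftime("%H")
    toy_counts[hour] += 1
    toy_comments[hour] += num_comments

# Average comments per post for each hour that has at least one post.
toy_avg = {hour: toy_comments[hour] / toy_counts[hour] for hour in toy_counts}
print(toy_avg)
```

Two of the toy posts fall in hour `09` and one in hour `15`, so the averages are 5.0 and 10.0 respectively.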

### 3.2. Finding the Number of Ask HN Posts and Comments by Hour Created

In [10]:
import datetime as dt

result_list = []

for post in ask_posts:
    created_at = post[6]
    num_comments = int(post[4])
    result_list.append([created_at, num_comments])
    
# Creating frequency tables for number of posts and comments per hour
counts_by_hour = {}
comments_by_hour = {}
for row in result_list:
    date = row[0]
    comment = row[1]
    # Extracting the hour as a zero-padded string, e.g. "09"
    hour = dt.datetime.strptime(date, "%m/%d/%Y %H:%M").strftime("%H")
    counts_by_hour[hour] = counts_by_hour.get(hour, 0) + 1
    comments_by_hour[hour] = comments_by_hour.get(hour, 0) + comment

In [11]:
counts_by_hour

{'00': 55,
 '01': 60,
 '02': 58,
 '03': 54,
 '04': 47,
 '05': 46,
 '06': 44,
 '07': 34,
 '08': 48,
 '09': 45,
 '10': 59,
 '11': 58,
 '12': 73,
 '13': 85,
 '14': 107,
 '15': 116,
 '16': 108,
 '17': 100,
 '18': 109,
 '19': 110,
 '20': 80,
 '21': 109,
 '22': 71,
 '23': 68}

In [12]:
comments_by_hour 

{'00': 447,
 '01': 683,
 '02': 1381,
 '03': 421,
 '04': 337,
 '05': 464,
 '06': 397,
 '07': 267,
 '08': 492,
 '09': 251,
 '10': 793,
 '11': 641,
 '12': 687,
 '13': 1253,
 '14': 1416,
 '15': 4477,
 '16': 1814,
 '17': 1146,
 '18': 1439,
 '19': 1188,
 '20': 1722,
 '21': 1745,
 '22': 479,
 '23': 543}

### 3.3. Calculating the Average Number of Comments for Ask HN Posts by Hour
Now we'll use these dictionaries to calculate the average number of comments for posts created during each hour of the day.

In [13]:
avg_by_hour = []

for hour in comments_by_hour:
    avg_by_hour.append([hour, comments_by_hour[hour] / counts_by_hour[hour]])
    
avg_by_hour

[['07', 7.852941176470588],
 ['21', 16.009174311926607],
 ['18', 13.20183486238532],
 ['20', 21.525],
 ['10', 13.440677966101696],
 ['15', 38.5948275862069],
 ['08', 10.25],
 ['09', 5.5777777777777775],
 ['01', 11.383333333333333],
 ['06', 9.022727272727273],
 ['05', 10.08695652173913],
 ['17', 11.46],
 ['13', 14.741176470588234],
 ['14', 13.233644859813085],
 ['03', 7.796296296296297],
 ['22', 6.746478873239437],
 ['12', 9.41095890410959],
 ['04', 7.170212765957447],
 ['19', 10.8],
 ['16', 16.796296296296298],
 ['02', 23.810344827586206],
 ['00', 8.127272727272727],
 ['11', 11.051724137931034],
 ['23', 7.985294117647059]]

This format makes it hard to identify the hours with the highest values. Let's sort the list of lists and print the 5 highest values in a format that is easier to read.

In [14]:
# Swapping each pair so that the average comes first and sorted() orders by it
swap_avg_by_hour = []
for row in avg_by_hour:
    swap_avg_by_hour.append([row[1], row[0]])

# Sorting the list
sorted_swap = sorted(swap_avg_by_hour, reverse = True)
print(sorted_swap)

[[38.5948275862069, '15'], [23.810344827586206, '02'], [21.525, '20'], [16.796296296296298, '16'], [16.009174311926607, '21'], [14.741176470588234, '13'], [13.440677966101696, '10'], [13.233644859813085, '14'], [13.20183486238532, '18'], [11.46, '17'], [11.383333333333333, '01'], [11.051724137931034, '11'], [10.8, '19'], [10.25, '08'], [10.08695652173913, '05'], [9.41095890410959, '12'], [9.022727272727273, '06'], [8.127272727272727, '00'], [7.985294117647059, '23'], [7.852941176470588, '07'], [7.796296296296297, '03'], [7.170212765957447, '04'], [6.746478873239437, '22'], [5.5777777777777775, '09']]


In [15]:
print("Top 5 Hours for Ask HN Comments")

# Finding the 5 highest values
for average, hour in sorted_swap[:5]:
    hour = dt.datetime.strptime(hour, "%H").strftime("%H:%M")
    print("{}: {:.2f} average comments per post".format(hour, average))

Top 5 Hours for Ask HN Comments
15:00: 38.59 average comments per post
02:00: 23.81 average comments per post
20:00: 21.52 average comments per post
16:00: 16.80 average comments per post
21:00: 16.01 average comments per post
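
Swapping the pairs is one way to sort by the average. Another option, sketched below, is to pass a `key` function to `sorted()` so the `[hour, average]` pairs can stay in their original order. The sample contains only a few pairs copied from `avg_by_hour` above.

```python
# A few [hour, average] pairs copied from avg_by_hour above.
avg_sample = [
    ["07", 7.852941176470588],
    ["15", 38.5948275862069],
    ["02", 23.810344827586206],
    ["20", 21.525],
]

# Sort by the second element of each pair (the average), highest first.
top = sorted(avg_sample, key=lambda pair: pair[1], reverse=True)

for hour, average in top[:3]:
    print("{}:00: {:.2f} average comments per post".format(hour, average))
```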


Thus, among all ask posts that received comments, the most commented ones were created in the following time ranges: 15:00-17:00, 2:00-3:00, and 20:00-22:00. The most favorable range, well ahead of the runner-up, is 15:00-16:00. According to the [data set documentation](https://www.kaggle.com/hacker-news/hacker-news-posts), the times are given in US Eastern Time. Rome is normally 6 hours ahead of Eastern Time, so in our time zone (Europe/Rome) 15:00-16:00 corresponds to 21:00-22:00. To have a higher chance of receiving comments on our *Ask HN* post, we should therefore create it between 21:00 and 22:00.
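
The time zone conversion can be checked with the standard library's `zoneinfo` module (Python 3.9+). This is a minimal sketch under two assumptions: that "Eastern Time" in the data set corresponds to the `America/New_York` zone, and that a made-up timestamp in the data set's format is representative.

```python
import datetime as dt
from zoneinfo import ZoneInfo  # standard library, Python 3.9+

eastern = ZoneInfo("America/New_York")   # assumed zone for "Eastern Time"
rome = ZoneInfo("Europe/Rome")

# Hypothetical timestamp in the data set's format, at the top hour found above.
created_at = dt.datetime.strptime("8/16/2016 15:00", "%m/%d/%Y %H:%M")
created_eastern = created_at.replace(tzinfo=eastern)   # mark the naive time as Eastern
created_rome = created_eastern.astimezone(rome)        # convert to Rome local time

print(created_rome.strftime("%H:%M"))
```

Because both zones observe daylight saving time, the offset is 6 hours for most of the year; it can briefly differ around the switch-over dates, which fall on different days in the US and in Europe.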

## 4. Analyzing Points for Ask HN and Show HN Posts
### 4.1. Calculating the Average Number of Points
Now we will conduct a similar analysis for the number of points received by ask posts and show posts to see if we can find some other insights.

Since the columns for the number of comments and the number of points have the same format, we can reuse the same code, changing only the column index and replacing "comments" with "points" in the names of the corresponding lists and dictionaries.
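
Because only the column index changes, the averaging logic could also be wrapped in a small function. The sketch below uses a hypothetical helper name (`average_column`) and two rows copied from the `ask_posts` sample printed earlier; it illustrates the idea and is not the code used in the cells that follow.

```python
def average_column(posts, index):
    """Average the integer values stored at `index` across `posts` (hypothetical helper)."""
    total = sum(int(post[index]) for post in posts)
    return total / len(posts)

# Two rows copied from the ask_posts sample shown earlier.
sample_posts = [
    ['12296411', 'Ask HN: How to improve my personal website?', '', '2', '6', 'ahmedbaracat', '8/16/2016 9:55'],
    ['10610020', 'Ask HN: Am I the only one outraged by Twitter shutting down share counts?', '', '28', '29', 'tkfx', '11/22/2015 13:43'],
]

print(average_column(sample_posts, 3))  # points column
print(average_column(sample_posts, 4))  # comments column
```

For these two rows the function returns 15.0 for points and 17.5 for comments.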

Again, we will start by determining which of these two groups receives more points on average.

In [16]:
total_ask_points = 0

for post in ask_posts:
    num_points = int(post[3])
    total_ask_points += num_points
    
avg_ask_points = total_ask_points / len(ask_posts)


total_show_points = 0

for post in show_posts:
    num_points = int(post[3])
    total_show_points += num_points
    
avg_show_points = total_show_points / len(show_posts)

print("The average number of points on ask posts:", avg_ask_points)
print("The average number of points on show posts:", avg_show_points)

The average number of points on ask posts: 15.061926605504587
The average number of points on show posts: 27.555077452667813


With points we observe the opposite picture: show posts receive on average about 1.8 times as many points as ask posts. This can be explained by two things:

- Show posts tend to present new findings, projects, or at least something less common and less discussed. They attract people who are not currently looking for an answer to a question but simply want to strengthen their technical skills and learn something new. Such readers probably don't have much to comment on, but they can express their interest by giving points to the posts they find interesting.
- The number of points a post has acquired is calculated as the total number of upvotes minus the total number of downvotes. Even though downvotes can be frustrating for the author of a post, they do occur, and they are most probably more common on ask posts, which are prone to debate, agreement, and disagreement. This leads to ask posts having more comments but, at the same time, fewer points.

Since show posts are more likely to receive points, we'll proceed by analyzing only this group.

### 4.2. Finding the Number of Show HN Posts and Points by Hour Created

In [17]:
result_list = []

for post in show_posts:
    created_at = post[6]
    num_points = int(post[3])
    result_list.append([created_at, num_points])
    
# Creating frequency tables for number of posts and points per hour    
counts_by_hour = {}
points_by_hour = {}
for row in result_list:
    date = row[0]
    point = row[1]
    # Extracting the hour as a zero-padded string, e.g. "09"
    hour = dt.datetime.strptime(date, "%m/%d/%Y %H:%M").strftime("%H")
    counts_by_hour[hour] = counts_by_hour.get(hour, 0) + 1
    points_by_hour[hour] = points_by_hour.get(hour, 0) + point

In [18]:
counts_by_hour

{'00': 31,
 '01': 28,
 '02': 30,
 '03': 27,
 '04': 26,
 '05': 19,
 '06': 16,
 '07': 26,
 '08': 34,
 '09': 30,
 '10': 36,
 '11': 44,
 '12': 61,
 '13': 99,
 '14': 86,
 '15': 78,
 '16': 93,
 '17': 93,
 '18': 61,
 '19': 55,
 '20': 60,
 '21': 47,
 '22': 46,
 '23': 36}

In [19]:
points_by_hour 

{'00': 1173,
 '01': 700,
 '02': 340,
 '03': 679,
 '04': 386,
 '05': 104,
 '06': 375,
 '07': 494,
 '08': 519,
 '09': 553,
 '10': 681,
 '11': 1480,
 '12': 2543,
 '13': 2438,
 '14': 2187,
 '15': 2228,
 '16': 2634,
 '17': 2521,
 '18': 2215,
 '19': 1702,
 '20': 1819,
 '21': 866,
 '22': 1856,
 '23': 1526}

### 4.3. Calculating the Average Number of Points for Show HN Posts by Hour

In [20]:
avg_by_hour = []

for hour in points_by_hour:
    avg_by_hour.append([hour, points_by_hour[hour] / counts_by_hour[hour]])
    
avg_by_hour

[['07', 19.0],
 ['21', 18.425531914893618],
 ['18', 36.31147540983606],
 ['20', 30.316666666666666],
 ['22', 40.34782608695652],
 ['15', 28.564102564102566],
 ['08', 15.264705882352942],
 ['09', 18.433333333333334],
 ['23', 42.388888888888886],
 ['06', 23.4375],
 ['05', 5.473684210526316],
 ['17', 27.107526881720432],
 ['13', 24.626262626262626],
 ['14', 25.430232558139537],
 ['03', 25.14814814814815],
 ['12', 41.68852459016394],
 ['01', 25.0],
 ['04', 14.846153846153847],
 ['19', 30.945454545454545],
 ['16', 28.322580645161292],
 ['10', 18.916666666666668],
 ['02', 11.333333333333334],
 ['00', 37.83870967741935],
 ['11', 33.63636363636363]]

Let's sort the list of lists and print the 5 highest values.

In [21]:
# Swapping each pair so that the average comes first and sorted() orders by it
swap_avg_by_hour = []
for row in avg_by_hour:
    swap_avg_by_hour.append([row[1], row[0]])
    
# Sorting  the list    
sorted_swap = sorted(swap_avg_by_hour, reverse = True)
print(sorted_swap)

[[42.388888888888886, '23'], [41.68852459016394, '12'], [40.34782608695652, '22'], [37.83870967741935, '00'], [36.31147540983606, '18'], [33.63636363636363, '11'], [30.945454545454545, '19'], [30.316666666666666, '20'], [28.564102564102566, '15'], [28.322580645161292, '16'], [27.107526881720432, '17'], [25.430232558139537, '14'], [25.14814814814815, '03'], [25.0, '01'], [24.626262626262626, '13'], [23.4375, '06'], [19.0, '07'], [18.916666666666668, '10'], [18.433333333333334, '09'], [18.425531914893618, '21'], [15.264705882352942, '08'], [14.846153846153847, '04'], [11.333333333333334, '02'], [5.473684210526316, '05']]


In [22]:
print("Top 5 Hours for Show HN Points")

# Finding the 5 highest values
for average, hour in sorted_swap[:5]:
    hour = dt.datetime.strptime(hour, "%H").strftime("%H:%M")
    print("{}: {:.2f} average points per post".format(hour, average))

Top 5 Hours for Show HN Points
23:00: 42.39 average points per post
12:00: 41.69 average points per post
22:00: 40.35 average points per post
00:00: 37.84 average points per post
18:00: 36.31 average points per post


We see that among all show posts that received comments, those created in the following time ranges (Eastern Time) got the most points: 22:00-1:00, 12:00-13:00, and 18:00-19:00. The most favorable range is from 23:00 to midnight, which in our time zone (normally 6 hours ahead of Eastern Time) is 5:00-6:00. However, since early morning is not a convenient time for many people to write posts, and since the difference between the highest and the second-highest values is small, the range 18:00-19:00 (12:00-13:00 Eastern Time) is also a good choice for a higher chance of receiving points on our *Show HN* post.

## Conclusions
All in all, ask posts stimulate more discussion and receive on average more comments than show posts, while show posts, which tend to present something new, receive on average more points. To have a higher chance of receiving comments on our ask post, we should submit it between 21:00 and 22:00. For our show post to receive more points, the best time to submit it is from 5:00 to 6:00 or from 18:00 to 19:00. All times are in the Europe/Rome time zone, assuming the usual 6-hour offset from US Eastern Time.