# Exploring Hacker News Posts

In this project, we'll examine a dataset of submissions to the popular technology site, *[Hacker News](https://news.ycombinator.com/). Hacker News is a social news website focusing on computer science and entrepreneurship. It is run by the investment fund and startup incubator Y Combinator. In general, content that can be submitted is defined as "anything that gratifies one's intellectual curiosity."* [<sup>1</sup>](#fn1)

In this analysis, we're specifically interested in posts with titles that begin with either `Ask HN` or `Show HN`. Users submit `Ask HN` posts to ask the Hacker News community a specific question. Similarly, users submit `Show HN` posts to show the community a project, product, or something interesting.

We'll compare these two types of posts to determine the following:
1. Do `Ask HN` or `Show HN` posts receive more comments on average?
2. Do posts created at a certain time receive more comments on average?

1 <span id="fn1">[Wikipedia: Hacker News](https://en.wikipedia.org/wiki/Hacker_News)</span>

## Import Data

The full dataset of Hacker News posts for the twelve-month period ending on September 26, 2016 can be found [here](https://www.kaggle.com/datasets/hacker-news/hacker-news-posts). For this analysis, we have reduced the dataset from almost 300,000 rows to about 20,000 rows by removing all submissions that didn't receive any comments and then randomly sampling from the remaining submissions.
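
The reduction described above could be reproduced roughly as follows. This is only a sketch: the function name `reduce_dataset`, the sample size, and the seed are assumptions for illustration, not the procedure actually used to build `hacker_news.csv`.

```python
import random

def reduce_dataset(rows, sample_size, seed=0):
    """Drop posts with zero comments, then randomly sample the rest.

    `rows` excludes the header; index 4 holds num_comments as a string.
    """
    commented = [row for row in rows if int(row[4]) > 0]
    rng = random.Random(seed)  # fixed seed (an assumption) so the sample is repeatable
    return rng.sample(commented, min(sample_size, len(commented)))

# Tiny made-up rows in the dataset's column order
toy_rows = [
    ['1', 'Ask HN: A?', '', '3', '0', 'alice', '1/1/2016 9:00'],
    ['2', 'Show HN: B', '', '5', '2', 'bob', '1/1/2016 10:00'],
    ['3', 'Some link', 'http://example.com', '8', '7', 'carol', '1/1/2016 11:00'],
]
reduced = reduce_dataset(toy_rows, sample_size=2)
print(len(reduced))
```

Filtering before sampling means every sampled row has at least one comment, which is why the averages later in this analysis exclude zero-comment posts.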

### Open File

First, we open the file and look at the first five rows of the dataset. The first row in the dataset contains descriptions for the columns. They are:

* `id`: the unique identifier from Hacker News for the post
* `title`: the title of the post
* `url`: the URL linked in the post, if the post contains a URL
* `num_points`: the total number of points the post acquired, calculated as the total number of upvotes minus the total number of downvotes
* `num_comments`: the number of comments on the post
* `author`: the username of the person who submitted the post
* `created_at`: the date and time of the post's submission to Hacker News

In [1]:
from csv import reader

# Read every row of the CSV into a list of lists; the file is closed on exit
with open('hacker_news.csv') as opened_file:
    read_file = reader(opened_file)
    hn = list(read_file)


for row in hn[:5]:
    print(row)
    print('\n')

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']


['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52']


['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30']


['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20']


['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']




### Remove Header

To make the data easier to work with, we remove the first row containing the column descriptions and store it separately as `headers`. Then we look at `headers` and the new first five rows of the dataset to verify that the header row was removed correctly.

In [2]:
headers = hn[0]
hn = hn[1:]

print(headers)
for row in hn[:5]:
    print('\n')
    print(row)

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']


['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52']


['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30']


['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20']


['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']


['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']
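
The analysis below refers to columns by position (for example, index 1 for `title` and index 4 for `num_comments`). A small lookup built from the header row can make those positions explicit. The name `col_index` is an illustrative assumption and is not used elsewhere in this notebook.

```python
# Header row as it appears in hacker_news.csv
headers = ['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']

# Map each column name to its position in a row
col_index = {name: position for position, name in enumerate(headers)}

sample_row = ['12224879', 'Interactive Dynamic Video',
              'http://www.interactivedynamicvideo.com/', '386', '52',
              'ne0phyte', '8/4/2016 11:52']

print(col_index['title'])                   # 1
print(col_index['num_comments'])            # 4
print(sample_row[col_index['created_at']])  # 8/4/2016 11:52
```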


## Analyze Data

### Separate `Ask HN` and `Show HN` Posts

Our first goal is to see whether `Ask HN` or `Show HN` posts have more comments on average. To do this, we'll separate out `Ask HN` and `Show HN` posts from all of the posts and see how many of each we have.

In [3]:
ask_posts = []
show_posts = []
other_posts = []

for row in hn:
    title = row[1]
    if title.lower().startswith('ask hn'):
        ask_posts.append(row)
    elif title.lower().startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)

print('\n')
print("Number of Ask HN posts: " + str(len(ask_posts)))
print("Number of Show HN posts: " + str(len(show_posts)))
print("Number of other posts: " + str(len(other_posts)))
print('\n')



Number of Ask HN posts: 1744
Number of Show HN posts: 1162
Number of other posts: 17194
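
The prefix test can also be wrapped in a small helper so the rule is easy to check in isolation. `classify_title` is a hypothetical name introduced here for illustration.

```python
def classify_title(title):
    """Return 'ask', 'show', or 'other' based on the title's prefix."""
    lowered = title.lower()  # lowercase so 'ASK HN' and 'Ask HN' both match
    if lowered.startswith('ask hn'):
        return 'ask'
    if lowered.startswith('show hn'):
        return 'show'
    return 'other'

print(classify_title('Ask HN: How do you learn new languages?'))  # ask
print(classify_title('SHOW HN: My weekend project'))              # show
print(classify_title('Interactive Dynamic Video'))                # other
```

Because the check uses `startswith`, a title that only mentions "Ask HN" somewhere after its first words is counted as an other post.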




### Calculate Average Number of Comments Per Type of Post

Now that we know how many `Ask HN` and `Show HN` posts we are working with and have separated them into their own lists, we need to find how many comments each type of post receives on average. To do this, we'll add up the comments in each list and divide by the number of posts in that list.

In [4]:
total_ask_comments = 0
total_show_comments = 0
total_other_comments = 0

for post in ask_posts:
    total_ask_comments += int(post[4])
avg_ask_comments = total_ask_comments / len(ask_posts)

for post in show_posts:
    total_show_comments += int(post[4])
avg_show_comments = total_show_comments / len(show_posts)

for post in other_posts:
    total_other_comments += int(post[4])
avg_other_comments = total_other_comments / len(other_posts)

print('\n')
print('\033[1m' + "Average number of comments per type of Hacker News post" + '\033[0m')
print("Average number of comments on Ask HN posts: " + str(round(avg_ask_comments, 2)))
print("Average number of comments on Show HN posts: " + str(round(avg_show_comments, 2)))
print("Average number of comments on all other posts: " + str(round(avg_other_comments, 2)))



Average number of comments per type of Hacker News post
Average number of comments on Ask HN posts: 14.04
Average number of comments on Show HN posts: 10.32
Average number of comments on all other posts: 26.87
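
The three loops above follow the same pattern, so the calculation can also be expressed once as a function. This is a minimal sketch, assuming rows keep `num_comments` at index 4; `average_comments` and the toy rows are introduced here for illustration only.

```python
def average_comments(posts):
    """Mean of the num_comments column (index 4) over a list of rows."""
    if not posts:
        return 0.0  # avoid dividing by zero on an empty list
    total = sum(int(post[4]) for post in posts)
    return total / len(posts)

# Made-up rows in the dataset's column order
toy_posts = [
    ['1', 'Ask HN: A?', '', '3', '6', 'alice', '1/1/2016 9:00'],
    ['2', 'Ask HN: B?', '', '5', '10', 'bob', '1/1/2016 10:00'],
]
print(average_comments(toy_posts))  # 8.0
```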


## Do `Ask HN` or `Show HN` Posts Receive More Comments on Average?

As we can see, `Ask HN` posts receive, on average, about 36% more comments than `Show HN` posts. Posts that fall into neither category receive about 91% more comments than `Ask HN` posts and about 160% more than `Show HN` posts. However, since we're only interested in `Ask HN` and `Show HN` posts, and since `Ask HN` posts receive more comments on average than `Show HN` posts, we'll focus the rest of the analysis on `Ask HN` posts.
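
The relative differences can be checked directly from the rounded averages printed above. The helper name `percent_more` is an assumption made for this sketch.

```python
# Rounded averages taken from the output of the previous cell
avg_ask, avg_show, avg_other = 14.04, 10.32, 26.87

def percent_more(a, b):
    """How much larger a is than b, expressed as a percentage of b."""
    return (a / b - 1) * 100

print(round(percent_more(avg_ask, avg_show)))    # 36
print(round(percent_more(avg_other, avg_ask)))   # 91
print(round(percent_more(avg_other, avg_show)))  # 160
```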

### Analyzing `Ask HN` Comments and Posts by Hour of Post Creation

Our next step will be to determine whether `Ask HN` posts created at a certain time are more likely to attract comments. To do this, we'll look at how many ask posts were created in each hour of the day along with the total number of comments these posts received. Note that times are listed in Eastern Standard Time.

In [5]:
import datetime as dt

result_list = []

for post in ask_posts:
    result_list.append([post[6], int(post[4])])
    
counts_by_hour = {}
comments_by_hour = {}

for result in result_list:
    time_created = dt.datetime.strptime(result[0], "%m/%d/%Y %H:%M")
    hour_created = time_created.strftime("%H")
    if hour_created not in counts_by_hour:
        counts_by_hour[hour_created] = 1
        comments_by_hour[hour_created] = result[1]
    else:
        counts_by_hour[hour_created] += 1
        comments_by_hour[hour_created] += result[1]
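
To see what the parsing step does to a single value, here is a short illustration using two `created_at` strings from the rows printed earlier. The variable names are chosen for this example only.

```python
import datetime as dt

sample = '8/4/2016 11:52'  # created_at value from the first data row
parsed = dt.datetime.strptime(sample, "%m/%d/%Y %H:%M")

print(parsed)                 # 2016-08-04 11:52:00
print(parsed.strftime("%H"))  # 11 (zero-padded string, used as the dict key)
print(parsed.hour)            # 11 (integer form)

# A post created just after midnight maps to the key '00'
midnight = dt.datetime.strptime('6/17/2016 0:01', "%m/%d/%Y %H:%M")
print(midnight.strftime("%H"))  # 00
```

Using the zero-padded string from `strftime("%H")` keeps the keys sortable as text, so `'02'` orders before `'15'`.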


### Average Number of Comments for `Ask HN` Posts by Hour

Now that we have frequency tables for the number of posts created in each hour of the day and for the total number of comments those posts received, we can divide each hour's comment total by its post count to get the average number of comments per post for that hour.

In [6]:
avg_by_hour = []

for c in comments_by_hour:
    avg_by_hour.append([c, (comments_by_hour[c] / counts_by_hour[c])])
    
for item in avg_by_hour:
    print(item)
    

['09', 5.5777777777777775]
['13', 14.741176470588234]
['10', 13.440677966101696]
['14', 13.233644859813085]
['16', 16.796296296296298]
['23', 7.985294117647059]
['12', 9.41095890410959]
['17', 11.46]
['15', 38.5948275862069]
['21', 16.009174311926607]
['20', 21.525]
['02', 23.810344827586206]
['18', 13.20183486238532]
['03', 7.796296296296297]
['05', 10.08695652173913]
['19', 10.8]
['01', 11.383333333333333]
['22', 6.746478873239437]
['08', 10.25]
['04', 7.170212765957447]
['00', 8.127272727272727]
['06', 9.022727272727273]
['07', 7.852941176470588]
['11', 11.051724137931034]
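
The counting and averaging steps can also be combined in a single pass. The sketch below uses `collections.defaultdict` on made-up data; `toy_results`, `totals`, and `avg_by_hour_toy` are hypothetical names, and this is an alternative formulation rather than the method used in the cells above.

```python
from collections import defaultdict

# Made-up (hour, num_comments) pairs standing in for the Ask HN posts
toy_results = [('15', 40), ('15', 20), ('02', 9), ('02', 3), ('09', 5)]

totals = defaultdict(lambda: [0, 0])  # hour -> [post count, comment total]
for hour, comments in toy_results:
    totals[hour][0] += 1
    totals[hour][1] += comments

# Divide each hour's comment total by its post count
avg_by_hour_toy = {hour: total / count for hour, (count, total) in totals.items()}
print(avg_by_hour_toy)  # {'15': 30.0, '02': 6.0, '09': 5.0}
```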


### Sorting and Printing the Data for Ease of Use

As we can see, the information above isn't particularly easy to understand, so let's take a bit of time to better organize the data before attempting an analysis. We'll begin by swapping the hour and the average number of comments.

In [7]:
swap_avg_by_hour = []

for hour, comments in avg_by_hour:
    swap_avg_by_hour.append([comments, hour])
    
for item in swap_avg_by_hour:
    print(item)

[5.5777777777777775, '09']
[14.741176470588234, '13']
[13.440677966101696, '10']
[13.233644859813085, '14']
[16.796296296296298, '16']
[7.985294117647059, '23']
[9.41095890410959, '12']
[11.46, '17']
[38.5948275862069, '15']
[16.009174311926607, '21']
[21.525, '20']
[23.810344827586206, '02']
[13.20183486238532, '18']
[7.796296296296297, '03']
[10.08695652173913, '05']
[10.8, '19']
[11.383333333333333, '01']
[6.746478873239437, '22']
[10.25, '08']
[7.170212765957447, '04']
[8.127272727272727, '00']
[9.022727272727273, '06']
[7.852941176470588, '07']
[11.051724137931034, '11']


This is only slightly better, if at all, so let's sort the swapped list in descending order to see which hours received the most comments on average.

In [8]:
sorted_swap = sorted(swap_avg_by_hour, reverse=True)

for item in sorted_swap:
    print(item)

[38.5948275862069, '15']
[23.810344827586206, '02']
[21.525, '20']
[16.796296296296298, '16']
[16.009174311926607, '21']
[14.741176470588234, '13']
[13.440677966101696, '10']
[13.233644859813085, '14']
[13.20183486238532, '18']
[11.46, '17']
[11.383333333333333, '01']
[11.051724137931034, '11']
[10.8, '19']
[10.25, '08']
[10.08695652173913, '05']
[9.41095890410959, '12']
[9.022727272727273, '06']
[8.127272727272727, '00']
[7.985294117647059, '23']
[7.852941176470588, '07']
[7.796296296296297, '03']
[7.170212765957447, '04']
[6.746478873239437, '22']
[5.5777777777777775, '09']


Now that we have the data in good order, let's format the data for easier reading and separate out the five highest values to help us draw some conclusions.

## Do Posts Created at a Certain Time Receive More Comments on Average?

In [9]:
print('\n')
print('\033[1m' + "Top 5 Hours for Ask Posts Comments" + '\033[0m')
for comments, hour in sorted_swap[:5]:
    print("{}: {:.2f} average comments per post".format(
        dt.datetime.strptime(hour, '%H').strftime('%H:%M'), comments))



Top 5 Hours for Ask Posts Comments
15:00: 38.59 average comments per post
02:00: 23.81 average comments per post
20:00: 21.52 average comments per post
16:00: 16.80 average comments per post
21:00: 16.01 average comments per post


# Conclusions

Looking at the data, `Ask HN` posts receive more comments on average than `Show HN` posts. One possible explanation is that asking a question is more likely to elicit a response than simply showing something. A post that shows or demonstrates something could fulfill its purpose by getting lots of views and generating a lot of interest without any comments. A post that asks a question, however, needs at least one answer to fulfill its stated purpose.

As for whether posts created at a certain time receive more comments, the data suggest that afternoon and evening are the best times to post an `Ask HN` question: the top five hours by average comments all fall between 3:00 PM Eastern (12:00 PM Pacific, 8:00 PM BST) and 2:00 AM Eastern (11:00 PM Pacific, 7:00 AM BST), with the 3:00 PM hour well ahead at 38.59 average comments per post.
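
The time-zone equivalents quoted above can be reproduced with fixed UTC offsets. This sketch assumes daylight-time offsets (Eastern UTC-4, Pacific UTC-7, BST UTC+1) and an arbitrary summer date; actual conversions depend on the date of each post, and the names `convert_hour`, `EASTERN`, `PACIFIC`, and `BRITISH` are introduced here for illustration.

```python
import datetime as dt

# Assumed fixed offsets for daylight time; real conversions vary by date
EASTERN = dt.timezone(dt.timedelta(hours=-4), 'EDT')
PACIFIC = dt.timezone(dt.timedelta(hours=-7), 'PDT')
BRITISH = dt.timezone(dt.timedelta(hours=1), 'BST')

def convert_hour(hour, target):
    """Convert an Eastern hour (0-23) on an arbitrary date to `target`."""
    eastern_time = dt.datetime(2016, 8, 4, hour, 0, tzinfo=EASTERN)
    return eastern_time.astimezone(target).strftime('%H:%M')

for hour in (15, 2):
    print(hour, convert_hour(hour, PACIFIC), convert_hour(hour, BRITISH))
```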