# Exploring Hacker News Posts

## 1. Project Introduction

In this project, we will work with a dataset of submissions to the popular technology site [Hacker News](https://news.ycombinator.com/).

Hacker News is a site started by the startup incubator Y Combinator, where user-submitted stories (known as "posts") receive votes and comments, similar to Reddit. Hacker News is extremely popular in technology and startup circles, and posts that make it to the top of the Hacker News listings can get hundreds of thousands of visitors as a result.

Some posts easily attract many views and comments. In this study, we will explore two aspects that may affect the number of comments a post receives.

Post title: when creating a post, users can optionally add Ask HN or Show HN to the title to explicitly ask the Hacker News community something or show it something. We'll analyze whether posts with these tags receive more comments on average.

Post timing: we will also explore whether posts published at certain times receive more comments on average.

## 2. Opening and Exploring Data

The dataset for this project can be found [here](https://www.kaggle.com/hacker-news/hacker-news-posts). The original contains almost 300,000 rows, each row representing a post. The posts shown in this notebook date from 2015 and 2016. For this project, the dataset has been reduced to approximately 20,000 rows by removing all submissions that didn't receive any comments and then randomly sampling from the remaining submissions.

Let's start by opening and exploring the dataset:

In [47]:
from csv import reader

# Open the file in a with-block so it is closed again after reading
with open("hacker_news.csv") as opened_file:
    hn = list(reader(opened_file))

for row in hn[:5]:
    print(row, '\n')

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'] 

['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'] 

['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'] 

['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'] 

['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01'] 



Let's split off the header row into headers and keep the data itself in hn, printing both to check the result.

In [48]:
headers = hn[0]
hn = hn[1:]
print(headers)

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
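As an alternative to slicing the header row off by hand, the standard library's `csv.DictReader` uses the first row as keys, so each post becomes a dictionary. The sketch below is only an illustration: it reads two rows copied from the output above out of an in-memory string (`sample_csv` and `posts` are names made up for this example) rather than from `hacker_news.csv`.

```python
import csv
import io

# Two rows copied from the output above; in the notebook this would be the opened file.
sample_csv = io.StringIO(
    "id,title,url,num_points,num_comments,author,created_at\n"
    "12224879,Interactive Dynamic Video,http://www.interactivedynamicvideo.com/,386,52,ne0phyte,8/4/2016 11:52\n"
    "11919867,Technology ventures: From Idea to Enterprise,https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429,3,1,hswarna,6/17/2016 0:01\n"
)

reader = csv.DictReader(sample_csv)  # the first row becomes the field names
posts = list(reader)

print(reader.fieldnames)
print(posts[0]["title"], posts[0]["num_comments"])
```

With this approach a column is addressed by name (`row["num_comments"]`) instead of by position (`row[4]`). The values are still strings and need `int()` before any arithmetic.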


In [49]:
for row in hn[:3]:
    print(row, '\n')

['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'] 

['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'] 

['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'] 



Next, let us split the data into three new lists:

- ask_posts (titles that start with "Ask HN", ignoring case)
- show_posts (titles that start with "Show HN", ignoring case)
- other_posts (the remainder)

In [50]:
# Create three empty lists

ask_posts = []
show_posts = []
other_posts = []

# Fill the lists by using a for-loop

for row in hn:
    title = row[1].lower()  # lowercase once so the match ignores case
    if title.startswith("ask hn"):
        ask_posts.append(row)
    elif title.startswith("show hn"):
        show_posts.append(row)
    else:
        other_posts.append(row)

Let's check the number of ask posts:

In [51]:
print(len(ask_posts))

1744


Let's check the number of show posts:

In [52]:
print(len(show_posts))

1162


Let's check the number of other posts:

In [53]:
print(len(other_posts))

17194


And now let's print the first five ask posts as a sample:

In [54]:
for row in ask_posts[:5]:
    print(row, '\n')

['12296411', 'Ask HN: How to improve my personal website?', '', '2', '6', 'ahmedbaracat', '8/16/2016 9:55'] 

['10610020', 'Ask HN: Am I the only one outraged by Twitter shutting down share counts?', '', '28', '29', 'tkfx', '11/22/2015 13:43'] 

['11610310', 'Ask HN: Aby recent changes to CSS that broke mobile?', '', '1', '1', 'polskibus', '5/2/2016 10:14'] 

['12210105', 'Ask HN: Looking for Employee #3 How do I do it?', '', '1', '3', 'sph130', '8/2/2016 14:20'] 

['10394168', 'Ask HN: Someone offered to buy my browser extension from me. What now?', '', '28', '17', 'roykolak', '10/15/2015 16:38'] 



Next, let's check whether ask posts or show posts receive more comments on average:

In [55]:
total_ask_comments = 0 # set total_ask_comments to 0

for item in ask_posts:
    value = int(item[4]) # the number of comments is the fifth column; convert it to an integer so it can be summed
    total_ask_comments += value # add value to total_ask_comments

print('Total number of comments for ask posts is {:}'.format(total_ask_comments))

Total number of comments for ask posts is 24483


In [56]:
avg_ask_comments = total_ask_comments / len(ask_posts)

In [57]:
print('Average number of comments for ask posts is {:.2f}'.format(avg_ask_comments))

Average number of comments for ask posts is 14.04


In [58]:
total_show_comments = 0

for item in show_posts:
    value = int(item[4])
    total_show_comments += value

avg_show_comments = total_show_comments / len(show_posts)

print('Total number of comments for show posts is {:}'.format(total_show_comments))
print('Average number of comments for show posts is {:.2f}'.format(avg_show_comments))

Total number of comments for show posts is 11988
Average number of comments for show posts is 10.32


Ask posts receive more comments on average than show posts (14.04 versus 10.32). A likely reason is that ask posts explicitly invite a response from other users.

To analyze whether particular times of the day attract more comments, we will continue with the ask posts only.
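Before moving on: the two averaging loops above repeat the same steps, so they could be folded into one helper. A minimal sketch, assuming the rows keep the column order shown earlier (`average_comments` and `sample_posts` are names made up for this example, and the three rows are copied from the ask-post sample printed above):

```python
def average_comments(posts, comments_index=4):
    """Return the mean number of comments for a list of post rows."""
    if not posts:  # avoid dividing by zero for an empty list
        return 0.0
    total = sum(int(row[comments_index]) for row in posts)
    return total / len(posts)


# Three rows copied from the ask-post sample printed earlier.
sample_posts = [
    ['12296411', 'Ask HN: How to improve my personal website?', '', '2', '6', 'ahmedbaracat', '8/16/2016 9:55'],
    ['10610020', 'Ask HN: Am I the only one outraged by Twitter shutting down share counts?', '', '28', '29', 'tkfx', '11/22/2015 13:43'],
    ['11610310', 'Ask HN: Aby recent changes to CSS that broke mobile?', '', '1', '1', 'polskibus', '5/2/2016 10:14'],
]

# (6 + 29 + 1) / 3
print('{:.2f}'.format(average_comments(sample_posts)))
```

In the notebook, the same helper could be called on `ask_posts`, `show_posts`, and `other_posts` in turn, which would also give a baseline for posts without a tag.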

In [59]:
import datetime as dt

In [60]:
# Create a list that contains the creation times and number of comments (ask-posts only)

result_list = []

for row in ask_posts:
    creation = row[6]
    comments = int(row[4])
    result_list.append([creation,comments])

In [61]:
for row in result_list[:5]:
    print(row, '\n')

['8/16/2016 9:55', 6] 

['11/22/2015 13:43', 29] 

['5/2/2016 10:14', 1] 

['8/2/2016 14:20', 3] 

['10/15/2015 16:38', 17] 
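The creation times are still plain strings at this point. To group them by hour, they first have to be parsed. The short sketch below shows how `datetime.strptime` handles one of the values printed above (the variable names `created_at` and `parsed` are only illustrative):

```python
import datetime as dt

created_at = '8/16/2016 9:55'  # first creation time in the sample above

# %m/%d/%Y %H:%M matches month/day/year followed by a 24-hour time
parsed = dt.datetime.strptime(created_at, "%m/%d/%Y %H:%M")

print(parsed.hour)            # the hour as an integer
print(parsed.strftime("%H"))  # the hour as a zero-padded string
```

Using the zero-padded string as a dictionary key has one practical advantage: the keys sort in chronological order even though they are text.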



In [62]:
# Build frequency tables for the number of posts and for the number of comments, per hour of the day

counts_by_hour = {}
comments_by_hour = {}

for element in result_list:
    hour = dt.datetime.strptime(element[0], "%m/%d/%Y %H:%M").strftime("%H")
    comment = element[1]
    
    if hour not in counts_by_hour:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = comment
    else:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += comment
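The if/else branch above is the classic way to build a frequency table. A possible alternative, sketched here under the assumption that the input has the same `[creation time, comments]` shape, is `collections.defaultdict`, which starts every missing key at zero. The names `sample_results`, `counts`, and `comments` are made up for this example, and the last sample row is hypothetical.

```python
from collections import defaultdict
import datetime as dt

# Three pairs copied from the sample printed above,
# plus one made-up entry so that hour 09 occurs twice.
sample_results = [
    ['8/16/2016 9:55', 6],
    ['11/22/2015 13:43', 29],
    ['5/2/2016 10:14', 1],
    ['1/1/2016 9:05', 4],  # hypothetical row, not from the dataset
]

counts = defaultdict(int)    # posts per hour
comments = defaultdict(int)  # comments per hour

for created_at, n_comments in sample_results:
    hour = dt.datetime.strptime(created_at, "%m/%d/%Y %H:%M").strftime("%H")
    counts[hour] += 1
    comments[hour] += n_comments

print(dict(counts))
print(dict(comments))
```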

In [63]:
print(counts_by_hour)

{'09': 45, '13': 85, '10': 59, '14': 107, '16': 108, '23': 68, '12': 73, '17': 100, '15': 116, '21': 109, '20': 80, '02': 58, '18': 109, '03': 54, '05': 46, '19': 110, '01': 60, '22': 71, '08': 48, '04': 47, '00': 55, '06': 44, '07': 34, '11': 58}


In [64]:
print(comments_by_hour)

{'09': 251, '13': 1253, '10': 793, '14': 1416, '16': 1814, '23': 543, '12': 687, '17': 1146, '15': 4477, '21': 1745, '20': 1722, '02': 1381, '18': 1439, '03': 421, '05': 464, '19': 1188, '01': 683, '22': 479, '08': 492, '04': 337, '00': 447, '06': 397, '07': 267, '11': 641}


In [65]:
# Create a table that contains the hours of day and the average number of comments per post

avg_by_hour = []

for row in counts_by_hour:
    avg_by_hour.append([row, comments_by_hour[row] / counts_by_hour[row]])

In [66]:
# Sort the list (on its first element, being the hour of day)
avg_by_hour.sort()

In [67]:
# Print the result
output = "For hour {:} the average number of comments per post is {:.2f}"
for row in avg_by_hour:
    print(output.format(row[0], row[1]))

For hour 00 the average number of comments per post is 8.13
For hour 01 the average number of comments per post is 11.38
For hour 02 the average number of comments per post is 23.81
For hour 03 the average number of comments per post is 7.80
For hour 04 the average number of comments per post is 7.17
For hour 05 the average number of comments per post is 10.09
For hour 06 the average number of comments per post is 9.02
For hour 07 the average number of comments per post is 7.85
For hour 08 the average number of comments per post is 10.25
For hour 09 the average number of comments per post is 5.58
For hour 10 the average number of comments per post is 13.44
For hour 11 the average number of comments per post is 11.05
For hour 12 the average number of comments per post is 9.41
For hour 13 the average number of comments per post is 14.74
For hour 14 the average number of comments per post is 13.23
For hour 15 the average number of comments per post is 38.59
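For hour 16 the average number of comments per post is 16.80
For hour 17 the average number of comments per post is 11.46
For hour 18 the average number of comments per post is 13.20
For hour 19 the average number of comments per post is 10.80
For hour 20 the average number of comments per post is 21.52
For hour 21 the average number of comments per post is 16.01
For hour 22 the average number of comments per post is 6.75
For hour 23 the average number of comments per post is 7.99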

The averages differ noticeably from hour to hour. To make this easier to read, let's sort the hours by their average number of comments and show the hours of day in which posts attract the most comments.

In [68]:
# Create a list that is sorted on the average number of comments instead
swap_avg_by_hour = []

for row in avg_by_hour:
    swap_avg_by_hour.append([row[1],row[0]])
    
# Create a sorted version of this list (highest average first)
sorted_swap = sorted(swap_avg_by_hour, reverse=True)
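The columns are swapped because `sorted` compares lists by their first element. A possible alternative that avoids the extra list is to pass a `key` function. The sketch below uses a few `[hour, average]` pairs taken from the output above (`sample_avg_by_hour` and `by_average` are illustrative names):

```python
# A few [hour, average] pairs taken from the output above.
sample_avg_by_hour = [['02', 23.81], ['15', 38.59], ['16', 16.80], ['09', 5.58]]

# Sort on the second element (the average), highest first.
by_average = sorted(sample_avg_by_hour, key=lambda row: row[1], reverse=True)

for hour, average in by_average[:3]:
    print("{}:00: {:.2f} average comments per post".format(hour, average))
```

The rows keep their original `[hour, average]` order, so any code that prints them does not need to remember that the columns were swapped.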

In [69]:
# Display the results
print('Top 5 Hours for Ask Posts Comments', '\n')
output = "{}: {:.2f} average comments per post"
for row in sorted_swap[:5]:
    thetime = dt.datetime.strptime(str(row[1]), '%H')
    thetime = thetime.strftime('%H:%M')
    print(output.format(thetime, row[0]))

Top 5 Hours for Ask Posts Comments 

15:00: 38.59 average comments per post
02:00: 23.81 average comments per post
20:00: 21.52 average comments per post
16:00: 16.80 average comments per post
21:00: 16.01 average comments per post


So those are the best times of day to publish an ask post if you want to attract comments. Interestingly, the top five hours are spread across the day rather than clustered together. One possible explanation could be that commenters are located across the globe, and that these different hours represent peak times for different time zones. (That would require further study, though.)
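One thing such a further study could check is how robust the hourly averages are, since a mean can be pulled up by a single heavily commented post. The sketch below illustrates the effect with made-up numbers only (`hypothetical_comments` is not taken from the dataset); it says nothing about whether this actually happens in the Hacker News data.

```python
from statistics import mean, median

# Made-up comment counts for one hour: four quiet posts and one very popular one.
hypothetical_comments = [1, 2, 3, 4, 990]

print(mean(hypothetical_comments))    # pulled up by the single large value
print(median(hypothetical_comments))  # the middle value, unaffected by it
```

Comparing the mean with the median per hour would be one way to see whether an hour's high average reflects typical posts or a few outliers.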

Note that the times above are in US Eastern Time, as per the [dataset documentation](https://www.kaggle.com/hacker-news/hacker-news-posts).

For our time zone (Central European Time), add six hours to these times.
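The six-hour shift can be sketched with the `datetime` module. This example assumes fixed standard-time offsets (UTC-5 for Eastern, UTC+1 for Central European Time) and an arbitrary winter date; it ignores daylight saving time, so it is an approximation rather than a full time zone conversion. The names `eastern`, `central_european`, and `eastern_hour_to_cet` are made up for this example.

```python
import datetime as dt

# Fixed standard-time offsets, assumed for illustration: EST = UTC-5, CET = UTC+1.
eastern = dt.timezone(dt.timedelta(hours=-5), "EST")
central_european = dt.timezone(dt.timedelta(hours=1), "CET")


def eastern_hour_to_cet(hour):
    """Convert an hour of day (0-23) from US Eastern to Central European Time."""
    moment = dt.datetime(2016, 1, 15, hour, 0, tzinfo=eastern)  # arbitrary winter date
    return moment.astimezone(central_european).strftime("%H:%M")


for hour in (15, 2, 20):
    print("{:02d}:00 Eastern -> {} CET".format(hour, eastern_hour_to_cet(hour)))
```

Note that 20:00 Eastern falls on the next calendar day in Central European Time.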

## 3. Conclusion

Referring back to the goal of this study, let's summarize the conclusions.

Post title: adding Ask HN to a post title attracts more comments on average than adding Show HN:

- Ask HN: 14.04 average comments per post
- Show HN: 10.32 average comments per post

(Posts without either tag were not included in this comparison.)

Post timing: the time of day of posting appears to have a noticeable impact on the number of comments a post attracts. Based on an analysis of the Ask HN posts, the top three hours (in Central European Time) are:

- 21:00 - 22:00: 38.59 average comments per post
- 08:00 - 09:00: 23.81 average comments per post
- 02:00 - 03:00: 21.52 average comments per post