# Finding the Best Time to Post on Hacker News

Hacker News is a site started by the startup incubator Y Combinator, where user-submitted stories (known as "posts") receive votes and comments, similar to Reddit. Hacker News is extremely popular in technology and startup circles, and posts that make it to the top of the Hacker News listings can get hundreds of thousands of visitors as a result. The dataset, linked [here](https://www.kaggle.com/datasets/hacker-news/hacker-news-posts), includes 12 months of Hacker News posts ending September 26, 2016.

In this project, we are interested in posts with titles that begin with either "Ask HN" or "Show HN". Users submit Ask HN posts to ask the Hacker News community a specific question. Likewise, users submit Show HN posts to show the Hacker News community a project, product, or just something interesting.

We'll compare these two types of posts to determine the following:

- Do Ask HN or Show HN receive more comments on average?
- Do posts created at a certain time receive more comments on average?

First, let's read in the dataset and look at its header and first few rows.

In [1]:
from csv import reader

# Read the CSV into a list of rows; the with-block closes the file afterwards
with open("C:/Users/admin/Documents/Data Science/hacker_news_posts/HN_posts_year_to_Sep_26_2016.csv", encoding = "utf8") as opened_file:
    hn = list(reader(opened_file))

hn_header = hn[0]  # column names
hn = hn[1:]        # data rows only

In [2]:
#Displays headers (column names)
print(hn_header)

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']


In [3]:
print(hn[:5])

[['12579008', 'You have two days to comment if you want stem cells to be classified as your own', 'http://www.regulations.gov/document?D=FDA-2015-D-3719-0018', '1', '0', 'altstar', '9/26/2016 3:26'], ['12579005', 'SQLAR  the SQLite Archiver', 'https://www.sqlite.org/sqlar/doc/trunk/README.md', '1', '0', 'blacksqr', '9/26/2016 3:24'], ['12578997', 'What if we just printed a flatscreen television on the side of our boxes?', 'https://medium.com/vanmoof/our-secrets-out-f21c1f03fdc8#.ietxmez43', '1', '0', 'pavel_lishin', '9/26/2016 3:19'], ['12578989', 'algorithmic music', 'http://cacm.acm.org/magazines/2011/7/109891-algorithmic-composition/fulltext', '1', '0', 'poindontcare', '9/26/2016 3:16'], ['12578979', 'How the Data Vault Enables the Next-Gen Data Warehouse and Data Lake', 'https://www.talend.com/blog/2016/05/12/talend-and-Â\x93the-data-vaultÂ\x94', '1', '0', 'markgainor1', '9/26/2016 3:14']]


## Extracting "Ask" posts and "Show" posts

In [4]:
ask_posts = []
show_posts = []
other_posts = []
for row in hn:
    title = row[1]
    title = title.lower()
    if title.startswith("ask hn"):
        ask_posts.append(row)
    elif title.startswith("show hn"):
        show_posts.append(row)
    else:
        other_posts.append(row)

In [5]:
#Simple confirmation checking
print(len(ask_posts))
print(len(show_posts))
print(len(other_posts))

9139
10158
273822


In [6]:
total_ask_comments = 0
for row in ask_posts:
    num_comments = int(row[4])
    total_ask_comments += num_comments

avg_ask_comments = total_ask_comments/len(ask_posts)
print("Average number of comments per Ask HN post: ", round(avg_ask_comments, 2))


Average number of comments per Ask HN post:  10.39


In [7]:
total_show_comments = 0
for row in show_posts:
    num_comments = int(row[4])
    total_show_comments += num_comments

avg_show_comments = total_show_comments/len(show_posts)
print("Average number of comments per Show HN post: ", round(avg_show_comments, 2))


Average number of comments per Show HN post:  4.89


On average, Ask HN posts receive more comments per post than Show HN posts (10.39 versus 4.89). This is plausible given their purpose: a question invites answers and opinions, so Ask HN posts naturally drive discussion. Show HN posts may attract less engagement in the comment section but could receive more engagement through points, which we do not examine here.
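
The two averaging loops above share the same logic, so they could be folded into one small helper. The following is only a sketch: the function name `average_comments` and the two sample rows are made up for illustration (they are not rows from the dataset), and the comment count is assumed to sit at index 4, as in the header shown earlier.

```python
def average_comments(posts, comments_index=4):
    """Return the mean number of comments across a list of post rows."""
    if not posts:
        # Avoid dividing by zero when a category has no posts
        return 0.0
    total = sum(int(row[comments_index]) for row in posts)
    return total / len(posts)


# Made-up rows in the dataset's column order:
# id, title, url, num_points, num_comments, author, created_at
sample_posts = [
    ["1", "Ask HN: example question one", "", "5", "7", "user_a", "9/26/2016 2:53"],
    ["2", "Ask HN: example question two", "", "2", "3", "user_b", "9/26/2016 1:17"],
]

print(round(average_comments(sample_posts), 2))
```

With the real data, the same helper could be called once with `ask_posts` and once with `show_posts`.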

In this project, however, we are more focused on the timing of posts and comments, so we will concentrate on that side of the data exploration. Because Ask HN posts receive more comments on average than Show HN posts, we will restrict the rest of the analysis to them.

We can determine if ask posts created at a certain time are more likely to attract comments. We will use the following steps to perform this analysis:

1. Calculate the number of ask posts created in each hour of the day, along with the number of comments received.
2. Calculate the average number of comments ask posts receive by hour created.

Let's begin by importing the `datetime` module and looking at the data more carefully.

In [8]:
import datetime as dt

In [9]:
result_list = []

for post in ask_posts:
    created_at = post[6] 
    num_comments = int(post[4])
    result_list.append([created_at, num_comments])
   

In [10]:
print(result_list[:12])

[['9/26/2016 2:53', 7], ['9/26/2016 1:17', 3], ['9/25/2016 22:57', 0], ['9/25/2016 22:48', 3], ['9/25/2016 21:50', 2], ['9/25/2016 19:30', 1], ['9/25/2016 19:22', 22], ['9/25/2016 17:55', 3], ['9/25/2016 15:48', 0], ['9/25/2016 15:35', 13], ['9/25/2016 15:28', 0], ['9/25/2016 14:43', 0]]
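
To make the next step concrete, here is a minimal sketch of how one of the `created_at` strings above can be parsed and reduced to its hour. The constant name `DATE_FORMAT` is chosen here for illustration; the format string itself matches the month/day/year layout seen in the output.

```python
import datetime as dt

DATE_FORMAT = "%m/%d/%Y %H:%M"  # month/day/year, 24-hour clock

sample = "9/26/2016 2:53"  # first timestamp in the output above
parsed = dt.datetime.strptime(sample, DATE_FORMAT)

# strftime("%H") gives a zero-padded string, which sorts correctly as text
hour_label = parsed.strftime("%H")
print(parsed.hour, hour_label)
```

Keeping the hour as a zero-padded string means that a plain alphabetical sort of the hour labels is also a chronological sort.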


In [11]:
counts_by_hour = {}
comments_by_hour = {}
date_format = "%m/%d/%Y %H:%M"

for row in result_list:
    postdate = row[0]
    comment = row[1]
    posthour = dt.datetime.strptime(postdate, date_format).strftime("%H")
    
    if posthour not in counts_by_hour:
        counts_by_hour[posthour] = 1
        comments_by_hour[posthour] = comment
    else:
        counts_by_hour[posthour] += 1
        comments_by_hour[posthour] += comment   


In [12]:
avg_by_hour = []

for hour in comments_by_hour:
    avg_by_hour.append([hour, (comments_by_hour[hour]/counts_by_hour[hour])])
    
avg_by_hour = sorted(avg_by_hour)
print(avg_by_hour)

[['00', 7.5647840531561465], ['01', 7.407801418439717], ['02', 11.137546468401487], ['03', 7.948339483394834], ['04', 9.7119341563786], ['05', 8.794258373205741], ['06', 6.782051282051282], ['07', 7.013274336283186], ['08', 9.190661478599221], ['09', 6.653153153153153], ['10', 10.684397163120567], ['11', 8.96474358974359], ['12', 12.380116959064328], ['13', 16.31756756756757], ['14', 9.692007797270955], ['15', 28.676470588235293], ['16', 7.713298791018998], ['17', 9.449744463373083], ['18', 7.94299674267101], ['19', 7.163043478260869], ['20', 8.749019607843136], ['21', 8.687258687258687], ['22', 8.804177545691905], ['23', 6.696793002915452]]
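
As an alternative to the explicit `if`/`else` bookkeeping, the per-hour tallies could be built with `collections.defaultdict`. This is a sketch of that variant, not the method used above; it runs on the last four timestamp and comment pairs from the earlier `result_list` output, and the variable names are illustrative.

```python
import datetime as dt
from collections import defaultdict

DATE_FORMAT = "%m/%d/%Y %H:%M"

# Four [created_at, num_comments] pairs taken from the result_list output
sample_results = [
    ["9/25/2016 15:48", 0],
    ["9/25/2016 15:35", 13],
    ["9/25/2016 15:28", 0],
    ["9/25/2016 14:43", 0],
]

counts = defaultdict(int)    # posts per hour; missing keys start at 0
comments = defaultdict(int)  # total comments per hour

for created_at, num_comments in sample_results:
    hour = dt.datetime.strptime(created_at, DATE_FORMAT).strftime("%H")
    counts[hour] += 1
    comments[hour] += num_comments

averages = {hour: comments[hour] / counts[hour] for hour in counts}
print(averages)
```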


Now that we have the average comments by hour, it is still a bit unclear which hours have the highest averages. We will swap the two columns so that the average comes first, then re-sort the list in descending order using the `sorted` function.

In [13]:
swap_avg_by_hour = []

for hour in avg_by_hour:
    hr = hour[0]
    avg = hour[1]
    swap_avg_by_hour.append([avg, hr])

print(swap_avg_by_hour)

[[7.5647840531561465, '00'], [7.407801418439717, '01'], [11.137546468401487, '02'], [7.948339483394834, '03'], [9.7119341563786, '04'], [8.794258373205741, '05'], [6.782051282051282, '06'], [7.013274336283186, '07'], [9.190661478599221, '08'], [6.653153153153153, '09'], [10.684397163120567, '10'], [8.96474358974359, '11'], [12.380116959064328, '12'], [16.31756756756757, '13'], [9.692007797270955, '14'], [28.676470588235293, '15'], [7.713298791018998, '16'], [9.449744463373083, '17'], [7.94299674267101, '18'], [7.163043478260869, '19'], [8.749019607843136, '20'], [8.687258687258687, '21'], [8.804177545691905, '22'], [6.696793002915452, '23']]


In [14]:
sorted_swap = sorted(swap_avg_by_hour, reverse = True)

In [15]:
# Showing the 5 hours with the highest comment averages 

print("Top 5 Hours for Ask HN Comments")

for avg,hour in sorted_swap[:5]:
    hr = dt.datetime.strptime(hour, "%H").strftime("%H:%M")
    output = "{hr}: {avg:.2f} average comments per post".format(hr = hr, avg = avg)
    print(output)

Top 5 Hours for Ask HN Comments
15:00: 28.68 average comments per post
13:00: 16.32 average comments per post
12:00: 12.38 average comments per post
02:00: 11.14 average comments per post
10:00: 10.68 average comments per post
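
The swap-then-sort approach works, but `sorted` also accepts a `key` argument, which would make the swapped list unnecessary. A minimal sketch, using four of the hour and average pairs from the output above (the variable names are illustrative):

```python
# A subset of the [hour, average] pairs computed earlier
sample_avg_by_hour = [
    ["12", 12.380116959064328],
    ["13", 16.31756756756757],
    ["14", 9.692007797270955],
    ["15", 28.676470588235293],
]

# Sort on the second element (the average), highest first
top_hours = sorted(sample_avg_by_hour, key=lambda pair: pair[1], reverse=True)

for hour, avg in top_hours[:3]:
    print("{}:00: {:.2f} average comments per post".format(hour, avg))
```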


The hour that receives the most comments per post on average is 15:00, with almost 29 comments per post, compared with about 16 for the next highest hour, 13:00.

According to the dataset documentation, the timezone used is Eastern Time in the US, so we could also write 15:00 as 3:00 pm EST.

## Conclusion

In this project, we analyzed ask posts and show posts to determine which type of post and which time of day receive the most comments on average. Based on our analysis, to maximize the number of comments a post receives, it should be an Ask HN post created between 15:00 and 16:00 (3:00 pm - 4:00 pm EST).

This depends, of course, on where one lives and which audience one wants to reach, since the most popular times may vary at a local level. For example, 15:00 EST is 05:00 the following day in Japan (04:00 when US daylight saving time is in effect), so a post made then would be unlikely to receive much engagement from a local Japanese-speaking audience.
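
The time-zone arithmetic in the example above can be checked with the `datetime` module. The sketch below assumes fixed offsets (UTC-5 for EST and UTC+9 for Japan Standard Time) and an arbitrary winter date, so it deliberately ignores daylight saving time; the names `EST` and `JST` are defined here and do not come from the dataset.

```python
import datetime as dt

# Fixed-offset time zones (assumption: no daylight saving adjustment)
EST = dt.timezone(dt.timedelta(hours=-5), "EST")
JST = dt.timezone(dt.timedelta(hours=9), "JST")

# 15:00 Eastern on an arbitrary winter date, chosen only for illustration
post_time = dt.datetime(2016, 1, 15, 15, 0, tzinfo=EST)

# Convert the same instant to Japan Standard Time
tokyo_time = post_time.astimezone(JST)
print(tokyo_time.strftime("%Y-%m-%d %H:%M %Z"))
```

For dates when daylight saving time applies, a time-zone database such as the one exposed by the standard `zoneinfo` module would be a better fit than fixed offsets.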

A point of note is that these figures are averages over a single year of data, and, as the sample output above shows, the dataset includes posts that received no comments at all. It is therefore more accurate to say that, within this dataset, ask posts received more comments on average than show posts, and ask posts created between 15:00 and 16:00 (3:00 pm - 4:00 pm EST) received the most comments on average.