# Exploring Hacker News Posts - Finding the Best Time to Post a Question

Hacker News is a site started by the startup incubator Y Combinator, where user-submitted stories (known as "posts") are voted and commented upon. Hacker News is extremely popular in technology and startup circles, and posts that make it to the top of Hacker News' listings can get hundreds of thousands of visitors as a result. In this project, we will analyse some of the patterns of successful posts. 

You can find the data set [here](https://www.kaggle.com/santiagobasulto/all-hacker-news-posts-stories-askshow-hn-polls). It includes all Hacker News posts from 2006 to 2019.

Below are descriptions of the columns:

|Columns             |Description                                      |
|--------------------|-------------------------------------------------|
|Object ID           |The post ID from the API                         |
|Title               |Title of the story                               |
|Post Type           |Story (regular post), ask_hn, show_hn, poll      |
|Author              |HN Username of the author                        |
|Created At          |Datetime created in format YYYY-MM-DD HH:MM:SS   |
|URL                 |URL Posted, can be null for ask_hn, show_hn, etc |
|Points              |Number of points the post received               |
|Number of Comments  |Number of total comments posted                  |

We're specifically interested in post types `Ask HN` and `Show HN`. Users submit `Ask HN` posts to ask the Hacker News community a specific question. Likewise, users submit `Show HN` posts to show the Hacker News community a project, product, or just generally something interesting.

We'll compare these two types of posts to determine the following:

* Do Ask HN or Show HN posts receive more comments on average?
* Do posts created at a certain time receive more comments on average?

Let's start by importing the library we need and reading the data set into a list of lists.

In [1]:
from csv import reader

# read the whole CSV into a list of lists; the `with` block closes the file afterwards
with open('hn.csv', encoding="utf8") as opened_file:
    read_file = reader(opened_file)
    hn = list(read_file)

## Exploring Dataset

In [2]:
# function for exploring data (prints a few chosen rows)
def explore_data(dataset, start, end, rows_and_columns = False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n')

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

In [3]:
explore_data(hn, 0, 5, True)

['Object ID', 'Title', 'Post Type', 'Author', 'Created At', 'URL', 'Points', 'Number of Comments']


['1', 'Y Combinator', 'story', 'pg', '2006-10-09 18:21:51', 'http://ycombinator.com', '61', '18.0']


['2', "A Student's Guide to Startups", 'story', 'phyllis', '2006-10-09 18:30:28', 'http://www.paulgraham.com/mit.html', '16', '1.0']


['3', 'Woz Interview: the early days of Apple', 'story', 'phyllis', '2006-10-09 18:40:33', 'http://www.foundersatwork.com/stevewozniak.html', '7', '1.0']


['4', 'NYC Developer Dilemma', 'story', 'onebeerdave', '2006-10-09 18:47:42', 'http://avc.blogs.com/a_vc/2006/10/the_nyc_develop.html', '5', '1.0']


Number of rows: 2833356
Number of columns: 8


## Removing Headers
The first row consists of headers. Let's extract it into the variable `headers` and remove it from our data set.

In [4]:
headers = hn[0]      # DO NOT RUN MORE THAN ONCE
hn = hn[1:]

print(headers)
print(hn[:5])

['Object ID', 'Title', 'Post Type', 'Author', 'Created At', 'URL', 'Points', 'Number of Comments']
[['1', 'Y Combinator', 'story', 'pg', '2006-10-09 18:21:51', 'http://ycombinator.com', '61', '18.0'], ['2', "A Student's Guide to Startups", 'story', 'phyllis', '2006-10-09 18:30:28', 'http://www.paulgraham.com/mit.html', '16', '1.0'], ['3', 'Woz Interview: the early days of Apple', 'story', 'phyllis', '2006-10-09 18:40:33', 'http://www.foundersatwork.com/stevewozniak.html', '7', '1.0'], ['4', 'NYC Developer Dilemma', 'story', 'onebeerdave', '2006-10-09 18:47:42', 'http://avc.blogs.com/a_vc/2006/10/the_nyc_develop.html', '5', '1.0'], ['5', 'Google, YouTube acquisition announcement could come tonight', 'story', 'perler', '2006-10-09 18:51:04', 'http://www.techcrunch.com/2006/10/09/google-youtube-sign-more-separate-deals/', '7', '1.0']]
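
The cell above must not be run twice, because a second run would discard the first data row as if it were a header. As a minimal sketch of one way to make this step safe to repeat, the hypothetical helper `split_header` below (not part of the original notebook) only strips the first row when it actually looks like the header. It assumes the first header name is `'Object ID'`, as in this data set, and uses a tiny hand-made sample instead of `hn`.

```python
def split_header(rows, first_header_name='Object ID'):
    """Return (headers, data_rows); headers is None if no header row is present."""
    if rows and rows[0] and rows[0][0] == first_header_name:
        return rows[0], rows[1:]
    return None, rows

# Small hand-made sample with the same layout as the first columns of `hn`.
sample = [
    ['Object ID', 'Title', 'Post Type'],
    ['1', 'Y Combinator', 'story'],
    ['2', "A Student's Guide to Startups", 'story'],
]

headers_once, data_once = split_header(sample)
headers_twice, data_twice = split_header(data_once)  # second run finds no header

print(headers_once)
print(len(data_once), len(data_twice))
```

Running the helper a second time leaves the data rows unchanged, which is the property the warning comment in the cell is guarding by hand.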


## Extracting Ask HN and Show HN Posts

Now that we've removed the headers from `hn`, we're ready to filter our data. Since we're only concerned with post types `ask_hn` and `show_hn`, we'll create new lists of lists containing just the rows for those post types.

In [5]:
ask_posts = []
show_posts = []
other_posts = []

for row in hn:
    post_type = row[2]
    if post_type == 'ask_hn':
        ask_posts.append(row) 
    elif post_type == 'show_hn':
        show_posts.append(row)  
    else:
        other_posts.append(row)
        
print("The number of 'ASK' posts:",len(ask_posts))
print("The number of 'SHOW' posts:",len(show_posts))
print("The number of 'OTHER' posts:",len(other_posts))

The number of 'ASK' posts: 107370
The number of 'SHOW' posts: 76726
The number of 'OTHER' posts: 2649259
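
As a quick sanity check on the filtering, every data row should end up in exactly one of the three lists. The sketch below simply re-uses the counts printed in this notebook (it does not re-read the file) and also works out what share of all posts each type represents. The variable names are illustrative only.

```python
# Counts copied from the notebook output above.
total_rows_with_header = 2833356
ask_count = 107370
show_count = 76726
other_count = 2649259

# Subtract the header row that was removed before filtering.
total_data_rows = total_rows_with_header - 1

# The three lists should partition the data set.
assert ask_count + show_count + other_count == total_data_rows

ask_share = ask_count / total_data_rows * 100
show_share = show_count / total_data_rows * 100
print(f"Ask HN share of all posts: {ask_share:.2f}%")
print(f"Show HN share of all posts: {show_share:.2f}%")
```

Both post types make up only a few percent of the data set, so the rest of the analysis works with a much smaller subset than the full file.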


Below are the first five rows in the `ask_posts` list of lists:

In [6]:
explore_data(ask_posts, 0, 5)

['121003', 'Ask HN: The Arc Effect', 'ask_hn', 'tel', '2008-02-22 02:33:40', '', '25', '16.0']


['127952', 'Ask HN: I want to make a webapp. Where do I start?', 'ask_hn', 'subhash', '2008-03-03 09:44:50', '', '14', '34.0']


['128917', 'Ask HN: Where do you stand on privacy?', 'ask_hn', 'h34t', '2008-03-04 16:31:54', '', '4', '11.0']


['131673', 'Ask HN: Could you implement News.YC in your favorite language?', 'ask_hn', 'aswanson', '2008-03-08 01:15:39', '', '4', '9.0']


['133543', 'Ask HN: Does anyone here play golf?', 'ask_hn', 'jgrahamc', '2008-03-10 22:41:44', '', '2', '7.0']




Below are the first five rows in the `show_posts` list of lists:

In [7]:
explore_data(show_posts, 0, 5)

['510264', 'Show HN: Our new online face recognition demo', 'show_hn', 'lbrandy', '2009-03-10 17:33:00', 'http://webdemo.pittpatt.com/recognition_demo/', '50', '25.0']


['512080', 'Show HN: JStartup - Hacker News for news people/journalists', 'show_hn', 'brandnewlow', '2009-03-11 17:25:49', 'http://jstartup.com/', '2', '0.0']


['521135', 'Show HN: Code draws better than you', 'show_hn', 'ktharavaad', '2009-03-18 03:51:27', 'http://blog.kpicturebooth.com/?p=25', '19', '7.0']


['577224', 'Show HN: Hacking is not Cracking', 'show_hn', 'hellweaver666', '2009-04-24 09:24:05', 'http://www.hackingisnotcracking.com/', '3', '15.0']


['870058', 'Show HN: Checkout the app my company helped launch', 'show_hn', 'clistctrl', '2009-10-08 21:53:19', 'http://commonwealthfund.org/Charts-and-Maps/State-Scorecard-2009.aspx', '1', '0.0']




## Calculating the Average Number of Comments for Ask HN and Show HN Posts

Next, let's determine if ask posts or show posts receive more comments on average.

In [8]:
# find the total number of comments in ask posts
total_ask_comments = 0
for row in ask_posts:
    num_comments = float(row[-1])
    total_ask_comments += num_comments

avg_ask_comments = total_ask_comments / len(ask_posts)
print("The average number of comments on ask posts:", round(avg_ask_comments,2))

# find the total number of comments in show posts
total_show_comments = 0
for row in show_posts:
    num_comments = float(row[-1])
    total_show_comments += num_comments

avg_show_comments = total_show_comments / len(show_posts)
print("The average number of comments on show posts:", round(avg_show_comments,2))

The average number of comments on ask posts: 10.09
The average number of comments on show posts: 5.85


On average, ask posts in this data set receive approximately 10 comments, whereas show posts receive approximately 6. This suggests that people are more inclined to answer questions than to comment on someone's project or an interesting post.

Since ask posts are more likely to receive comments, we'll focus our remaining analysis just on these posts.
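
The two loops above repeat the same calculation, so it could be factored into a small helper. The following is only a sketch: `average_comments` is a hypothetical function name, and the sample rows are hand-made in the same column layout as the data set. It also computes the ratio between the two averages reported above.

```python
def average_comments(posts):
    """Return the mean of the last column (number of comments) across posts."""
    if not posts:
        return 0.0
    total = sum(float(row[-1]) for row in posts)
    return total / len(posts)

# Hypothetical rows in the same layout as `ask_posts`.
sample_ask = [
    ['1', 'Ask HN: example A', 'ask_hn', 'user_a', '2008-02-22 02:33:40', '', '25', '16.0'],
    ['2', 'Ask HN: example B', 'ask_hn', 'user_b', '2008-03-03 09:44:50', '', '14', '34.0'],
]
sample_avg = average_comments(sample_ask)  # (16 + 34) / 2
print(sample_avg)

# Ratio of the averages reported in the notebook output.
ratio = 10.09 / 5.85
print(round(ratio, 2))
```

With the real lists, `average_comments(ask_posts)` and `average_comments(show_posts)` would be expected to reproduce the two averages printed above.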

## Finding the Number of Ask Posts and Comments by Hour

Next, we'll determine if ask posts created at a certain time are more likely to attract comments. We'll use the following steps to perform this analysis:

* Calculate the number of ask posts created in each hour of the day, along with the number of comments received.
* Calculate the average number of comments ask posts receive by hour created.

In [9]:
import datetime as dt

result_list = []

# append lists with 2 elements (date when the post was created and number of comments) to `result_list`
for post in ask_posts:
    created_at = post[4]
    n_comments = float(post[-1])
    result_list.append([created_at, n_comments])
    
comments_by_hour = {}
counts_by_hour = {}

# extract the hour from each timestamp and update both dictionaries
for row in result_list:
    dt_object = dt.datetime.strptime(row[0],  "%Y-%m-%d %H:%M:%S")    
    hour = dt_object.strftime("%H")                                  
    
    if hour not in counts_by_hour:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = int(row[1])   
    else:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += int(row[1])
        
print("Posts by hour:", counts_by_hour)
print("\nComments by hour:", comments_by_hour)

Posts by hour: {'02': 3416, '09': 2916, '16': 7075, '01': 3567, '22': 4934, '08': 2812, '17': 7097, '18': 6938, '20': 6067, '07': 2592, '12': 3890, '19': 6551, '21': 5692, '05': 2854, '00': 3804, '11': 3346, '03': 3290, '14': 5806, '15': 6853, '23': 4301, '06': 2755, '13': 4687, '04': 3073, '10': 3054}

Comments by hour: {'02': 33068, '09': 25412, '16': 78156, '01': 32486, '22': 37992, '08': 27861, '17': 63069, '18': 57766, '20': 48257, '07': 22723, '12': 51702, '19': 49986, '21': 40450, '05': 25991, '00': 32528, '11': 41748, '03': 31643, '14': 68115, '15': 118650, '23': 35923, '06': 28070, '13': 69658, '04': 29347, '10': 32760}
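
The same tallies can be built without the `if`/`else` branch by using `collections.defaultdict`, which starts every missing key at zero. This is a sketch of an alternative, not the approach used in this notebook, and `sample_results` is a hypothetical stand-in for `result_list` with three hand-made rows.

```python
import datetime as dt
from collections import defaultdict

# Hypothetical [created_at, n_comments] pairs in the same shape as `result_list`.
sample_results = [
    ['2008-02-22 02:33:40', 16.0],
    ['2008-03-03 09:44:50', 34.0],
    ['2008-03-04 02:31:54', 11.0],
]

sample_counts = defaultdict(int)
sample_comments = defaultdict(int)

for created_at, n_comments in sample_results:
    # Parse the timestamp, then keep only the zero-padded hour as a string.
    hour = dt.datetime.strptime(created_at, "%Y-%m-%d %H:%M:%S").strftime("%H")
    sample_counts[hour] += 1
    sample_comments[hour] += int(n_comments)

print(dict(sample_counts))
print(dict(sample_comments))
```

Two of the sample posts fall in hour `02`, so their comment counts are added together under that key.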


## Calculating the Average Number of Comments for Ask HN Posts by Hour
 
Next, we'll use these two dictionaries to calculate the average number of comments for posts created during each hour of the day.

In [10]:
# create a list of lists of 2 elements: hour and average number of comments 
avg_by_hour = []
# both dictionaries share the same keys, so iterating over one of them is enough
for hour in counts_by_hour:
    avg_num = round((comments_by_hour[hour] / counts_by_hour[hour]), 2)
    avg_by_hour.append([hour, avg_num])

avg_by_hour.sort()

print("Average Number of Comments per Hour: \n")
avg_by_hour

Average Number of Comments per Hour: 



[['00', 8.55],
 ['01', 9.11],
 ['02', 9.68],
 ['03', 9.62],
 ['04', 9.55],
 ['05', 9.11],
 ['06', 10.19],
 ['07', 8.77],
 ['08', 9.91],
 ['09', 8.71],
 ['10', 10.73],
 ['11', 12.48],
 ['12', 13.29],
 ['13', 14.86],
 ['14', 11.73],
 ['15', 17.31],
 ['16', 11.05],
 ['17', 8.89],
 ['18', 8.33],
 ['19', 7.63],
 ['20', 7.95],
 ['21', 7.11],
 ['22', 7.7],
 ['23', 8.35]]

## Sorting and Printing Values from a List of Lists

In [11]:
# swap the columns (average first) so that sorting orders rows by average comments, descending
swap_avg_by_hour = []
for row in avg_by_hour:
    swap_avg_by_hour.append([row[1], row[0]])
swap_avg_by_hour.sort(reverse = True)

# top 5 hours with the highest average number of comments
top_avg_by_hour = swap_avg_by_hour[:5]

# swap the order back (first hour, then comments count)
swapped_top = []
for row in top_avg_by_hour:
    swapped_top.append([row[1], row[0]])

swapped_top.sort()
print("Top 5 Hours for Ask Posts Comments:", swapped_top)

Top 5 Hours for Ask Posts Comments: [['11', 12.48], ['12', 13.29], ['13', 14.86], ['14', 11.73], ['15', 17.31]]


In [12]:
top_5 = []

for row in swapped_top:
    time = row[0]
    comments = row[1]
    top_5.append("{0}:00: {1} average comments per post".format(time, comments))

top_5

['11:00: 12.48 average comments per post',
 '12:00: 13.29 average comments per post',
 '13:00: 14.86 average comments per post',
 '14:00: 11.73 average comments per post',
 '15:00: 17.31 average comments per post']

The five hours with the most comments per post form one continuous block from 11:00 to 15:00, with the highest average (17.31) for posts created in the 15:00 hour (3 PM).
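
The swap, sort, and swap-back steps above can also be written with `sorted` and a `key` function. The sketch below is an alternative, not the method used in this notebook. It copies the hourly averages from the output above into `hourly_averages` (an illustrative name) and lists the top five in descending order of average comments rather than by hour.

```python
# Hourly averages copied from the notebook output above.
hourly_averages = [
    ['00', 8.55], ['01', 9.11], ['02', 9.68], ['03', 9.62],
    ['04', 9.55], ['05', 9.11], ['06', 10.19], ['07', 8.77],
    ['08', 9.91], ['09', 8.71], ['10', 10.73], ['11', 12.48],
    ['12', 13.29], ['13', 14.86], ['14', 11.73], ['15', 17.31],
    ['16', 11.05], ['17', 8.89], ['18', 8.33], ['19', 7.63],
    ['20', 7.95], ['21', 7.11], ['22', 7.7], ['23', 8.35],
]

# Sort by the second element (the average), largest first, and keep five rows.
top_five_sketch = sorted(hourly_averages, key=lambda row: row[1], reverse=True)[:5]

for hour, avg in top_five_sketch:
    print(f"{hour}:00: {avg:.2f} average comments per post")
```

Sorting by a key avoids building the intermediate swapped lists, at the cost of a small lambda.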

## Conclusions

We analysed which type of Hacker News post receives more comments and at what time of day posts attract the most comments.

We compared `Ask HN` posts (questions to the Hacker News community) and `Show HN` posts (projects, products, etc.) and found that `Ask HN` posts receive roughly 1.7 times as many comments on average (10.09 versus 5.85).

We also calculated the average number of comments on `Ask HN` posts by hour and found that, to have a higher chance of receiving comments, we should create posts between 11 AM and 3 PM (among these five hours, the average is highest at 3 PM and lowest at 2 PM). All hours are given in Eastern Standard Time (EST), or UTC/GMT -5.
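
Readers in other time zones need to shift these hours. As a minimal sketch, the hypothetical helper `est_hour_to_utc` below converts an hour to UTC. It takes the fixed UTC-5 offset stated above at face value and ignores daylight saving time. The calendar date used is arbitrary and only needed to build a `datetime` object.

```python
from datetime import datetime, timedelta, timezone

# Fixed offset taken from the statement above (assumption: no daylight saving).
EST = timezone(timedelta(hours=-5), name="EST")

def est_hour_to_utc(hour):
    """Convert an hour of the day in fixed-offset EST to the matching UTC hour."""
    local_time = datetime(2019, 1, 15, hour, 0, tzinfo=EST)  # arbitrary date
    return local_time.astimezone(timezone.utc).hour

for hour in (11, 12, 13, 14, 15):
    print(f"{hour:02d}:00 EST -> {est_hour_to_utc(hour):02d}:00 UTC")
```

To convert to another zone, `timezone.utc` could be replaced with a different `timezone` offset or a `zoneinfo.ZoneInfo` object.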