# Guided Project: Exploring Hacker News Posts

## Introduction

In this project, we'll work with a data set of submissions to the popular technology site Hacker News.

Hacker News is a site started by the startup incubator Y Combinator, where user-submitted stories (known as "posts") are voted on and commented on, similar to Reddit. Hacker News is extremely popular in technology and startup circles, and posts that make it to the top of Hacker News' listings can get hundreds of thousands of visitors as a result.

The goal of this project is to determine the best time to publish a post in order to receive the most comments.

We'll focus our interest here on posts whose titles begin with either _Ask HN_ or _Show HN_. Users submit _Ask HN_ posts to ask the Hacker News community a specific question. Likewise, users submit _Show HN_ posts to show the Hacker News community a project, product, or just generally something interesting. These two types of posts are the ones we expect to generate the most interaction.

### Results Summary

Through the analysis of the data set, we were able to conclude the following:
- The data set contains more ask posts than show posts
- On average, ask posts receive more comments than show posts
- To increase the chances of getting comments, an ask post should be published at 3pm US time

## The Data

The data set can be found [here](https://www.kaggle.com/hacker-news/hacker-news-posts). It has been reduced from almost 300,000 rows to approximately 20,000 rows by removing all submissions that did not receive any comments, and then randomly sampling from the remaining submissions. Below are descriptions of the columns:
- _id_ : The unique identifier from Hacker News for the post
- _title_ : The title of the post
- _url_ : The URL that the post links to, if the post has a URL
- _num_points_ : The number of points the post acquired, calculated as the total number of upvotes minus the total number of downvotes
- _num_comments_ : The number of comments that were made on the post
- _author_ : The username of the person who submitted the post
- _created_at_ : The date and time at which the post was submitted

We start by importing the `reader` function from the `csv` module and reading the data set into a list of lists.

In [8]:
from csv import reader

# Read the CSV file into a list of lists, closing the file when done
with open("hacker_news.csv") as opened_file:
    hn = list(reader(opened_file))

hn_header = hn[0]  # keep the header row in its own list
hn = hn[1:]  # keep only the data rows

We print the header for easy index reference.

In [11]:
print(hn_header)

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
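
As a small aside, the position of a column can also be looked up by name instead of being read off the printed header. The sketch below copies the header shown above into a literal list so that it runs on its own; the names `comments_index` and `created_index` are ours, not the notebook's.

```python
# Header row copied from the output above so the example is self-contained
hn_header = ['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']

# list.index returns the position of the first matching element
comments_index = hn_header.index('num_comments')
created_index = hn_header.index('created_at')

print(comments_index, created_index)  # prints: 4 6
```

These are the two positions (`row[4]` and `row[6]`) used in the rest of the analysis.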


We then have a look at the first five rows of the `hn` list of lists.

In [10]:
print(hn[0:5])

[['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01'], ['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']]


Since we're only concerned with post titles beginning with _Ask HN_ or _Show HN_, we create new lists of lists that contain only the post data for each of these types.

In [60]:
ask_posts = []
show_posts = []
other_posts = []

for row in hn:
    title = row[1]
    title_lowercase = title.lower()
    if title_lowercase.startswith("ask hn"):
        ask_posts.append(row)
    elif title_lowercase.startswith("show hn"):
        show_posts.append(row)
    else:
        other_posts.append(row)
        
hn_length = len(hn)
ask_posts_length = len(ask_posts)
show_posts_length = len(show_posts)
other_posts_length = len(other_posts)

ask_posts_percentage = (ask_posts_length/hn_length)*100
show_posts_percentage = (show_posts_length/hn_length)*100
other_posts_percentage = (other_posts_length/hn_length)*100

ask_posts_string = "Number of ask posts: {} - {:.1f} % of total data set posts".format(ask_posts_length, ask_posts_percentage)
show_posts_string = "Number of show posts: {} - {:.1f} % of total data set posts".format(show_posts_length, show_posts_percentage)
other_posts_string =  "Number of other posts: {} - {:.1f} % of total data set posts".format(other_posts_length, other_posts_percentage)

print(ask_posts_string)
print(show_posts_string)
print(other_posts_string)

Number of ask posts: 1744 - 8.7 % of total data set posts
Number of show posts: 1162 - 5.8 % of total data set posts
Number of other posts: 17194 - 85.5 % of total data set posts


We can see that _Ask HN_ and _Show HN_ posts together represent nearly 15% of all posts in the data set.
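
The same classification rule can be wrapped in a small function, which makes it easy to check on a few titles. This is only a sketch: the function name `classify_title` and the labels it returns are assumptions made for illustration, and the third sample title is made up.

```python
def classify_title(title):
    """Return 'ask', 'show', or 'other' based on how the title starts."""
    title_lowercase = title.lower()
    if title_lowercase.startswith("ask hn"):
        return "ask"
    if title_lowercase.startswith("show hn"):
        return "show"
    return "other"


# Two titles taken from the data set, plus a made-up Show HN title
sample_titles = [
    "Ask HN: How to improve my personal website?",
    "Interactive Dynamic Video",
    "Show HN: A hypothetical side project",
]

labels = [classify_title(title) for title in sample_titles]
print(labels)  # prints: ['ask', 'other', 'show']
```

Lowercasing the title first means that variants such as "ASK HN" or "ask hn" are counted as well.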

Below are the first five rows of the newly created `ask_posts` list of lists.

In [17]:
print(ask_posts[0:5])

[['12296411', 'Ask HN: How to improve my personal website?', '', '2', '6', 'ahmedbaracat', '8/16/2016 9:55'], ['10610020', 'Ask HN: Am I the only one outraged by Twitter shutting down share counts?', '', '28', '29', 'tkfx', '11/22/2015 13:43'], ['11610310', 'Ask HN: Aby recent changes to CSS that broke mobile?', '', '1', '1', 'polskibus', '5/2/2016 10:14'], ['12210105', 'Ask HN: Looking for Employee #3 How do I do it?', '', '1', '3', 'sph130', '8/2/2016 14:20'], ['10394168', 'Ask HN: Someone offered to buy my browser extension from me. What now?', '', '28', '17', 'roykolak', '10/15/2015 16:38']]


Next, we want to know which type of post, ask or show, receives more comments on average.

In the following two blocks of code, we iterate through our `ask_posts` and `show_posts` lists of lists and compute the average number of comments each type of post receives.


In [21]:
total_ask_comments = 0

for row in ask_posts:
    num_comments = int(row[4])
    total_ask_comments += num_comments
    
avg_ask_comments = total_ask_comments/len(ask_posts)
print(round(avg_ask_comments,1))

14.0


In [22]:
total_show_comments = 0

for row in show_posts:
    num_comments = int(row[4])
    total_show_comments += num_comments
    
avg_show_comments = total_show_comments/len(show_posts)
print(round(avg_show_comments,1))

10.3


As we can see, ask posts receive more comments on average than show posts: about 14.0 versus 10.3 comments per post.
Since our goal is to determine the best time to publish a post to get the most comments, we focus on ask posts for the rest of the analysis.
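
Since the two averaging loops differ only in the list they iterate over, they could be folded into one helper. The sketch below uses a hypothetical function name, `average_comments`, and runs on three of the ask posts printed earlier so that it is self-contained.

```python
def average_comments(posts, comments_index=4):
    """Average number of comments over a list of post rows."""
    total_comments = 0
    for row in posts:
        # The CSV reader returns strings, so convert before adding
        total_comments += int(row[comments_index])
    return total_comments / len(posts)


# Three rows copied from the ask_posts output above
sample_ask_posts = [
    ['12296411', 'Ask HN: How to improve my personal website?', '', '2', '6', 'ahmedbaracat', '8/16/2016 9:55'],
    ['10610020', 'Ask HN: Am I the only one outraged by Twitter shutting down share counts?', '', '28', '29', 'tkfx', '11/22/2015 13:43'],
    ['11610310', 'Ask HN: Aby recent changes to CSS that broke mobile?', '', '1', '1', 'polskibus', '5/2/2016 10:14'],
]

sample_avg = average_comments(sample_ask_posts)
print(round(sample_avg, 1))  # prints: 12.0
```

Note that this helper would raise a `ZeroDivisionError` on an empty list, which is not a concern here because both `ask_posts` and `show_posts` contain rows.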

Next, we want to find out whether ask posts created at a certain time are more likely to attract comments. We'll use the following steps to perform this analysis:

1. Calculate the number of ask posts created in each hour of the day, along with the number of comments received.

2. Calculate the average number of comments ask posts receive by hour created.

In the code block below, we start by creating a new list of lists containing only the date and time at which the posts were created and the number of comments they received.

After that, we create two frequency tables: one counting the ask posts published in each hour, and one totaling the comments those posts received.

In [62]:
import datetime as dt

result_list = []

for row in ask_posts:
    created_at = row[6]
    num_comments = int(row[4])
    result_list.append([created_at, num_comments])

counts_by_hour = {}
comments_by_hour = {}

for row in result_list:
    num_comments = row[1]
    datetime_object = dt.datetime.strptime(row[0],"%m/%d/%Y %H:%M")
    hour = datetime_object.strftime("%H")
    if hour in counts_by_hour:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += num_comments
    else:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = num_comments
    
print(counts_by_hour)
print("\n")
print(comments_by_hour)

{'09': 45, '13': 85, '10': 59, '14': 107, '16': 108, '23': 68, '12': 73, '17': 100, '15': 116, '21': 109, '20': 80, '02': 58, '18': 109, '03': 54, '05': 46, '19': 110, '01': 60, '22': 71, '08': 48, '04': 47, '00': 55, '06': 44, '07': 34, '11': 58}


{'09': 251, '13': 1253, '10': 793, '14': 1416, '16': 1814, '23': 543, '12': 687, '17': 1146, '15': 4477, '21': 1745, '20': 1722, '02': 1381, '18': 1439, '03': 421, '05': 464, '19': 1188, '01': 683, '22': 479, '08': 492, '04': 337, '00': 447, '06': 397, '07': 267, '11': 641}
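
The two frequency tables can also be built with `collections.defaultdict`, which removes the need for the `if`/`else` branch. The sketch below is an alternative, not the notebook's method; it runs on the first five ask posts printed earlier, reduced to their timestamp and comment count, and the `sample_` names are ours.

```python
import datetime as dt
from collections import defaultdict

# [created_at, num_comments] pairs taken from the first five ask posts
sample_results = [
    ["8/16/2016 9:55", 6],
    ["11/22/2015 13:43", 29],
    ["5/2/2016 10:14", 1],
    ["8/2/2016 14:20", 3],
    ["10/15/2015 16:38", 17],
]

# Missing keys start at 0, so no explicit membership test is needed
sample_counts = defaultdict(int)
sample_comments = defaultdict(int)

for created_at, num_comments in sample_results:
    # Parse the timestamp, then keep only the zero-padded hour
    hour = dt.datetime.strptime(created_at, "%m/%d/%Y %H:%M").strftime("%H")
    sample_counts[hour] += 1
    sample_comments[hour] += num_comments

print(dict(sample_counts))    # prints: {'09': 1, '13': 1, '10': 1, '14': 1, '16': 1}
print(dict(sample_comments))  # prints: {'09': 6, '13': 29, '10': 1, '14': 3, '16': 17}
```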


Next, we use the `counts_by_hour` and `comments_by_hour` dictionaries to calculate the average number of comments for posts during each hour of the day.

In [35]:
avg_by_hour = []

for hour in counts_by_hour:
    avg = comments_by_hour[hour]/counts_by_hour[hour]
    avg_by_hour.append([hour, round(avg,1)])
    
print(avg_by_hour)

[['09', 5.6], ['13', 14.7], ['10', 13.4], ['14', 13.2], ['16', 16.8], ['23', 8.0], ['12', 9.4], ['17', 11.5], ['15', 38.6], ['21', 16.0], ['20', 21.5], ['02', 23.8], ['18', 13.2], ['03', 7.8], ['05', 10.1], ['19', 10.8], ['01', 11.4], ['22', 6.7], ['08', 10.2], ['04', 7.2], ['00', 8.1], ['06', 9.0], ['07', 7.9], ['11', 11.1]]
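
As a quick check of the arithmetic, three of these averages can be recomputed directly from the totals printed above. The sketch uses a dictionary comprehension and copies only three hours from the two frequency tables; the `sample_` names are ours.

```python
# Values copied from counts_by_hour and comments_by_hour for three hours
sample_counts = {'15': 116, '02': 58, '20': 80}
sample_comments = {'15': 4477, '02': 1381, '20': 1722}

# Average = total comments in the hour / number of posts in the hour
sample_avg = {
    hour: round(sample_comments[hour] / sample_counts[hour], 1)
    for hour in sample_counts
}

print(sample_avg)  # prints: {'15': 38.6, '02': 23.8, '20': 21.5}
```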


This format makes it hard to identify the hours with the highest values. Let's finish by sorting this list of lists and printing the five highest values in a format that's easier to read.

In [38]:
swap_avg_by_hour = []

for row in avg_by_hour:
    swap_avg_by_hour.append([row[1], row[0]])

print(swap_avg_by_hour)

[[5.6, '09'], [14.7, '13'], [13.4, '10'], [13.2, '14'], [16.8, '16'], [8.0, '23'], [9.4, '12'], [11.5, '17'], [38.6, '15'], [16.0, '21'], [21.5, '20'], [23.8, '02'], [13.2, '18'], [7.8, '03'], [10.1, '05'], [10.8, '19'], [11.4, '01'], [6.7, '22'], [10.2, '08'], [7.2, '04'], [8.1, '00'], [9.0, '06'], [7.9, '07'], [11.1, '11']]


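Each inner list now holds the average number of comments first and the hour second. Because `sorted` compares lists element by element, starting with the first, we can sort this list of lists by average and print the top 5 hours for comments on ask posts.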

In [55]:
sorted_swap = sorted(swap_avg_by_hour, reverse=True)

print("Top 5 Hours for Ask Posts Comments")

for row in sorted_swap[:5]:
    avg = row[0]
    datetime_object = dt.datetime.strptime(row[1], "%H")
    hour_formatted = datetime_object.strftime("%H:%M")
    string = "{} : {:.2f} comments per post on average".format(hour_formatted,avg)
    print(string)

Top 5 Hours for Ask Posts Comments
15:00 : 38.60 comments per post on average
02:00 : 23.80 comments per post on average
20:00 : 21.50 comments per post on average
16:00 : 16.80 comments per post on average
21:00 : 16.00 comments per post on average
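
Swapping the columns is one way to sort by average; another is to pass a `key` function to `sorted`, which leaves the `[hour, average]` order untouched. The sketch below is an alternative to the approach above and copies the `avg_by_hour` values printed earlier so that it runs on its own; `sorted_by_avg` and `top_hours` are names introduced here.

```python
# Values copied from the avg_by_hour output above
avg_by_hour = [
    ['09', 5.6], ['13', 14.7], ['10', 13.4], ['14', 13.2], ['16', 16.8], ['23', 8.0],
    ['12', 9.4], ['17', 11.5], ['15', 38.6], ['21', 16.0], ['20', 21.5], ['02', 23.8],
    ['18', 13.2], ['03', 7.8], ['05', 10.1], ['19', 10.8], ['01', 11.4], ['22', 6.7],
    ['08', 10.2], ['04', 7.2], ['00', 8.1], ['06', 9.0], ['07', 7.9], ['11', 11.1],
]

# Sort by the average (second element of each pair), highest first
sorted_by_avg = sorted(avg_by_hour, key=lambda row: row[1], reverse=True)

top_hours = [hour for hour, avg in sorted_by_avg[:5]]
print(top_hours)  # prints: ['15', '02', '20', '16', '21']
```

The two approaches can order tied averages differently (hours 14 and 18 both average 13.2), but this does not affect the top five here.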


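With an average of 38.6 comments per post, the 15:00 hour stands well ahead of the next best hour (02:00, with 23.8). To have a higher chance of receiving comments on an ask post, we should publish it at 3pm US time.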

## Conclusion

The goal of this project was to determine the best time to publish a post on Hacker News to get the most comments.

By analyzing a data set of Hacker News posts, we were able to come up with the following conclusions:
- The data set contains more ask posts than show posts
- On average, ask posts receive more comments than show posts
- To increase the chances of getting comments, an ask post should be published at 3pm US time