# Exploring Hacker News Posts

In this project, I'll work with a data set of submissions to popular technology site [Hacker News](https://news.ycombinator.com/).

Hacker News is a site started by the startup incubator [Y Combinator](https://www.ycombinator.com/), where user-submitted stories (known as "posts") are voted and commented upon, similar to reddit. Hacker News is extremely popular in technology and startup circles, and posts that make it to the top of Hacker News' listings can get hundreds of thousands of visitors as a result.

I'm specifically interested in posts whose titles begin with either **Ask HN** or **Show HN**. Users submit Ask HN posts to ask the Hacker News community a specific question; users submit Show HN posts to show the Hacker News community a project, product, or just generally something interesting.

My goal is to compare these two types of posts to determine the following:

- Do Ask HN or Show HN posts receive more comments on average?


- Do posts created at a certain time receive more comments on average?


- Do Ask HN or Show HN posts receive more points on average?


- Do posts created at a certain time receive more points on average?

## Opening and Exploring the Data

To begin working with the data set, I'll start by opening and reading it. I'll also make sure to separate the header row, which is the very first row, from the data set since it only contains the column headers/names &mdash; not real data.

In [1]:
from csv import reader

opened_file = open("../data_sets/hacker_news.csv")
read_file = reader(opened_file)

hn = list(read_file)
hn_header = hn[0]
hn = hn[1:]

Next, to make the data set easier to explore, I'll create a function named `explore_data()` that I can repeatedly use to print rows of data in a readable way. In addition, when I use this function, I'll have the option to output the number of rows and columns in the data set. After I define this function, I'll print the columns of the data set to understand what the data points represent. Lastly, I'll use this function to investigate a small selection of the data.

In [2]:
def explore_data(data_set, start, end, rows_and_columns=False):
    data_slice = data_set[start:end]
    for row in data_slice:
        print(row)
        
        # Print an extra line of space after each row for readability
        print()
        
    if rows_and_columns:
        print("Number of rows:", len(data_set))
        print("Number of columns:", len(data_set[0]))
        
print(hn_header)

print()

explore_data(hn, 0, 3, True) # First few rows of the data set

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']

['12579008', 'You have two days to comment if you want stem cells to be classified as your own', 'http://www.regulations.gov/document?D=FDA-2015-D-3719-0018', '1', '0', 'altstar', '9/26/2016 3:26']

['12579005', 'SQLAR  the SQLite Archiver', 'https://www.sqlite.org/sqlar/doc/trunk/README.md', '1', '0', 'blacksqr', '9/26/2016 3:24']

['12578997', 'What if we just printed a flatscreen television on the side of our boxes?', 'https://medium.com/vanmoof/our-secrets-out-f21c1f03fdc8#.ietxmez43', '1', '0', 'pavel_lishin', '9/26/2016 3:19']

Number of rows: 293119
Number of columns: 7


You can find the data set [here](https://www.kaggle.com/hacker-news/hacker-news-posts), but note that it has been reduced from almost 300,000 rows to approximately 20,000 rows by removing all submissions that did not receive any comments and then randomly sampling from the remaining submissions. Below are descriptions of the columns:

- `id`: The unique identifier from Hacker News for the post


- `title`: The title of the post


- `url`: The URL that the posts links to (if the post has a URL)


- `num_points`: The number of points the post acquired (the total number of upvotes minus the total number of downvotes)


- `num_comments`: The number of comments that were made on the post


- `author`: The username of the person who submitted the post


- `created_at`: The date and time at which the post was submitted

## Extracting Ask HN and Show HN Posts

Now, I'm ready to filter the data. Since I'm only concerned with post titles beginning with "Ask HN" or "Show HN" (including case variations of the two), I'll create sub data sets containing just the data for those titles.

In [3]:
ask_posts = []
show_posts = []
other_posts = []

for row in hn:
    title = row[1]
    title = title.lower()
    
    if title.startswith("ask hn"):
        ask_posts.append(row)
    elif title.startswith("show hn"):
        show_posts.append(row)
    else:
        other_posts.append(row)

Finally, with the Ask HN, Show HN, and other posts separated, I'll explore the first five rows of each list to get a good sense of them and find out exactly how many posts are in each one.

In [4]:
print("Ask HN Posts:\n")

explore_data(ask_posts, 0, 5, True)

print("\nShow HN Posts:\n")

explore_data(show_posts, 0, 5, True)

print("\nOther Posts:\n")

explore_data(other_posts, 0, 5, True)

Ask HN Posts:

['12578908', 'Ask HN: What TLD do you use for local development?', '', '4', '7', 'Sevrene', '9/26/2016 2:53']

['12578522', 'Ask HN: How do you pass on your work when you die?', '', '6', '3', 'PascLeRasc', '9/26/2016 1:17']

['12577908', 'Ask HN: How a DNS problem can be limited to a geographic region?', '', '1', '0', 'kuon', '9/25/2016 22:57']

['12577870', 'Ask HN: Why join a fund when you can be an angel?', '', '1', '3', 'anthony_james', '9/25/2016 22:48']

['12577647', 'Ask HN: Someone uses stock trading as passive income?', '', '5', '2', '00taffe', '9/25/2016 21:50']

Number of rows: 9139
Number of columns: 7

Show HN Posts:

['12578335', 'Show HN: Finding puns computationally', 'http://puns.samueltaylor.org/', '2', '0', 'saamm', '9/26/2016 0:36']

['12578182', 'Show HN: A simple library for complicated animations', 'https://christinecha.github.io/choreographer-js/', '1', '0', 'christinecha', '9/26/2016 0:01']

['12578098', 'Show HN: WebGL visualization of DNA seque

## Part 1: Analyzing the Number of Comments for Ask HN and Show HN Posts

In this first analysis, I'll complete the first half of my goal by taking a look at the comments of the Ask HN and Show HN posts in the data set to answer these questions:

- Do Ask HN or Show HN posts receive more comments on average?

- Do posts created at a certain time receive more comments on average?

### Calculating the Average Number of Comments

Firstly, I'll find out if ask posts or show posts receive more comments on average.

In [5]:
total_ask_comments = 0
total_show_comments = 0

for row in ask_posts:
    num_comments = int(row[4])
    
    total_ask_comments += num_comments

for row in show_posts:
    num_comments = int(row[4])
    
    total_show_comments += num_comments

avg_ask_comments = total_ask_comments / len(ask_posts)
avg_show_comments = total_show_comments / len(show_posts)

print("Average number of comments per Ask HN post:", round(avg_ask_comments)) # Round average to the ones place for readability
print("Average number of comments per Show HN post:", round(avg_show_comments))

Average number of comments per Ask HN post: 10
Average number of comments per Show HN post: 5


As you can see in the output above, Ask HN posts receive five more comments on average than Show HN posts.

Since ask posts are more likely to receive comments, I'll focus my remaining analysis just on these posts.

### Finding the Amount of Ask Posts and Comments by Hour Created

Next, I'll determine if ask posts created at a certain time are more likely to attract comments. I'll use the following steps to perform this analysis:

1. Calculate the amount of ask posts created in each hour of the day, along with the number of comments received.
2. Calculate the average number of comments ask posts receive by hour created.

In this section, I'll tackle the first step — calculating the amount of ask posts and comments by hour created. I'll work with the data in the `created_at` column; I'll calculate the amount of ask posts created per hour, along with the total amount of comments.

In [6]:
import datetime as dt

result_list = []

for row in ask_posts:
    created_at = row[6]
    n_comments = int(row[4])
    
    result_list.append([created_at, n_comments])
    
counts_by_hour = {}
comments_by_hour = {}

for row in result_list:
    date = row[0]
    date = dt.datetime.strptime(date, "%m/%d/%Y %H:%M")
    hour = date.strftime("%H")
    n_comments = row[1]
    
    if hour not in counts_by_hour:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = n_comments
    else:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += n_comments

### Calculating the Average Number of Comments for Ask HN Posts by Hour

Now that I've gathered the number of ask posts created during each hour of the day and the corresponding number of comments ask posts created at each hour received, I can calculate the average number of comments for posts created during each hour of the day.

After completing this task, I'll display the results.

In [7]:
avg_by_hour = []

for hour in comments_by_hour:
    avg_n_comments = comments_by_hour[hour] / counts_by_hour[hour]
    
    avg_by_hour.append([hour, avg_n_comments])
    
template = "Average number of comments per post at {hour}:00: {avg_n_comments:.2f}" # Round average to two decimal places for readability

for avg in avg_by_hour:
    hour = avg[0]
    avg_n_comments = avg[1]
    
    print(template.format(hour=hour, avg_n_comments=avg_n_comments))

Average number of comments per post at 02:00: 11.14
Average number of comments per post at 01:00: 7.41
Average number of comments per post at 22:00: 8.80
Average number of comments per post at 21:00: 8.69
Average number of comments per post at 19:00: 7.16
Average number of comments per post at 17:00: 9.45
Average number of comments per post at 15:00: 28.68
Average number of comments per post at 14:00: 9.69
Average number of comments per post at 13:00: 16.32
Average number of comments per post at 11:00: 8.96
Average number of comments per post at 10:00: 10.68
Average number of comments per post at 09:00: 6.65
Average number of comments per post at 07:00: 7.01
Average number of comments per post at 03:00: 7.95
Average number of comments per post at 23:00: 6.70
Average number of comments per post at 20:00: 8.75
Average number of comments per post at 16:00: 7.71
Average number of comments per post at 08:00: 9.19
Average number of comments per post at 00:00: 7.56
Average number of comments 

### Sorting and Printing Values from a List of Lists

Although the results I need are now there, this format makes it hard to identify the hours with the highest values. I'll finish the analysis by sorting the list of lists (the collection of each hour and its corresponding average number of comments) and printing the five highest values in a format that's easier to read.

In [8]:
swap_avg_by_hour = []

for avg in avg_by_hour:
    hour = avg[0]
    avg_n_comments = avg[1]
    
    swap_avg_by_hour.append([avg_n_comments, hour])
    
sorted_swap = sorted(swap_avg_by_hour, reverse=True)

print("Top 5 Hours for Ask Posts Comments:\n")

template = "{hour}:00: {avg_n_comments:.2f} average comments per post" # Round average to two decimal places for readability

for avg in sorted_swap[:5]:
    hour = avg[1]
    avg_n_comments = avg[0]
    
    print(template.format(hour=hour, avg_n_comments=avg_n_comments))

Top 5 Hours for Ask Posts Comments:

15:00: 28.68 average comments per post
13:00: 16.32 average comments per post
12:00: 12.38 average comments per post
02:00: 11.14 average comments per post
10:00: 10.68 average comments per post


As the output above presents, the best hour for creating a post and having the highest chance of receiving comments is 3:00 in the afternoon. Of course, posting at 1:00 p.m. or 12:00 p.m. will also give you a higher chance at receiving comments.

However, the time zone for these hours (as well as the rest of the time data in the data set) is Eastern Standard Time; if you happen to live in a region of the globe that does not pertain to this specific time zone, you can find out what these hours are in your time zone using [The Time Zone Converter](https://www.thetimezoneconverter.com/).

For example, my time zone is Mountain Standard Time; Eastern Standard Time is two hours ahead of my time zone, so 1:00 p.m. (not 3:00 p.m.) rather is the best hour of the day for me to create posts in hopes of collecting the most amount of comments.

## Part 2: Analyzing the Number of Points for Ask HN and Show HN Posts

I've finished analyzing the number of comments ask posts and show posts gather. Now, I'll perform the same analysis, but this time I'll be working with the number of points ask posts and show posts receive. Using this data, I'll accomplish the second half of my goal by solving these problems:

- Do Ask HN or Show HN posts receive more points on average?

- Do posts created at a certain time receive more points on average?

### Calculating the Average Number of Points

I'll start this new analysis by using the `num_points` column (fourth column) of the Hacker News data set to find out if ask posts or show posts earn more points on average.

In [9]:
total_ask_points = 0
total_show_points = 0

for row in ask_posts:
    n_points = int(row[3])
    
    total_ask_points += n_points
    
for row in show_posts:
    n_points = int(row[3])
    
    total_show_points += n_points
    
avg_ask_points = total_ask_points / len(ask_posts)
avg_show_points = total_show_points / len(show_posts)

print("Average number of points per ask post:", round(avg_ask_points)) # Round average to the ones place for readability
print("Average number of points per show post:", round(avg_show_points))

Average number of points per ask post: 11
Average number of points per show post: 15


From the result of the code above, you can observe that an ask post gets 11 points on average, while a show post collects 15 points on average.

Because the average amount of points each show post receives is higher than that of each ask post (by 4 to be exact), I'll concentrate on dealing with show posts for the rest of this analysis.

### Finding the Amount of Points in Show Posts by Hour Created

In this step, I'll compute the number of points accumulated during each hour of the day show posts were created. Just like in the previous analysis, I'll be using the data in the `created_at` column to perform this task.

In [10]:
result_list = []

for row in show_posts:
    created_at = row[6]
    n_points = int(row[4])
    
    result_list.append([created_at, n_points])
    
points_by_hour = {}

for row in result_list:
    date = row[0]
    date = dt.datetime.strptime(date, "%m/%d/%Y %H:%M")
    hour = date.strftime("%H")
    n_points = row[1]
    
    if hour not in points_by_hour:
        points_by_hour[hour] = n_points
    else:
        points_by_hour[hour] += n_points

### Calculating the Average Number of Points for Show HN Posts by Hour

With each total amount of points collected by show posts throughout every hour of the day in place, I'm able to solve for the average number of points collected from show posts for each hour during the day (recall that I already have the quantity of posts created per hour of the day).

Once I gather this information, I'll output the results.

In [11]:
avg_by_hour = []

for hour in points_by_hour:
    avg_n_points = points_by_hour[hour] / counts_by_hour[hour]
    
    avg_by_hour.append([hour, avg_n_points])
    
template = "Average number of points per post at {hour}:00: {avg_n_points:.2f}" # Round average to two decimal places for readability
    
for avg in avg_by_hour:
    hour = avg[0]
    avg_n_points = avg[1]
    
    print(template.format(hour=hour, avg_n_points=avg_n_points))

Average number of points per post at 00:00: 4.26
Average number of points per post at 23:00: 4.21
Average number of points per post at 20:00: 4.28
Average number of points per post at 19:00: 5.06
Average number of points per post at 18:00: 5.28
Average number of points per post at 16:00: 6.51
Average number of points per post at 14:00: 7.48
Average number of points per post at 10:00: 4.35
Average number of points per post at 09:00: 6.36
Average number of points per post at 08:00: 6.89
Average number of points per post at 06:00: 3.86
Average number of points per post at 03:00: 3.45
Average number of points per post at 21:00: 3.40
Average number of points per post at 17:00: 5.51
Average number of points per post at 15:00: 5.92
Average number of points per post at 11:00: 7.73
Average number of points per post at 07:00: 6.98
Average number of points per post at 04:00: 4.02
Average number of points per post at 13:00: 7.46
Average number of points per post at 12:00: 10.55
Average number of p

### Sorting and Printing Values from a List of Lists

Once more, this unordered list of lists of hours and their corresponding averages makes it difficult to see the hours with the highest point values. To fix this, I'll sort the list of lists and present the five greatest values in a more readable format.

In [12]:
swap_avg_by_hour = []

for avg in avg_by_hour:
    hour = avg[0]
    avg_n_points = avg[1]
    
    swap_avg_by_hour.append([avg_n_points, hour])
    
sorted_swap = sorted(swap_avg_by_hour, reverse=True)

print("Top 5 Hours for Show Posts Points:\n")

template = "{hour}:00: {avg_n_points:.2f} average points per post" # Round average to two decimal places for readability

for avg in sorted_swap[:5]:
    hour = avg[1]
    avg_n_points = avg[0]
    
    print(template.format(hour=hour, avg_n_points=avg_n_points))

Top 5 Hours for Show Posts Points:

12:00: 10.55 average points per post
11:00: 7.73 average points per post
14:00: 7.48 average points per post
13:00: 7.46 average points per post
07:00: 6.98 average points per post


From the results above, you can see that the finest hour for making a post and likely receiving the largest amount of points is at noon. Yet posting at 11:00 a.m., 2:00 p.m., or 1:00 p.m. also isn't a bad idea if you want to have a higher chance of collecting points.

Again, these hours are specific to Eastern Standard Time, so they may not be the same depending on if your time zone is different or not. Here is the tool for converting times between separate time zones: <a href="https://www.thetimezoneconverter.com/" target="_blank">The Time Zone Converter</a>. For instance, since my time zone is Mountain Standard Time, 10:00 a.m. (not 12:00 p.m.) is instead the golden hour of the day for me to gain the largest number of points for each post I create.

## Conclusion

Throughout the whole of this project, I managed and manipulated a data set with thousands of records of posts to Hacker News.

My goal was to analyze the data to answer a set of questions, and in conclusion, I came to these answers:

- Do Ask HN or Show HN posts receive more comments on average?
    - Ask HN posts receive more comments on average.


- Do posts created at a certain time receive more comments on average?
    - Posts created during the hours of 3:00 p.m., 1:00 p.m., and 12:00 p.m. respectively receive more comments on average. However, these times apply particularly to Eastern Standard Time. In result, these hours may be different depending on what time zone you live in.


- Do Ask HN or Show HN posts receive more points on average?
    - Show HN posts receive more points on average.


- Do posts created at a certain time receive more points on average?
    - Posts created during the hours of 12:00 p.m., 11:00 a.m., and 2:00 p.m. respectively receive more points on average. Once again, these hours aren't exactly the same for everyone.