# "Optimizing Hacker News Posts"

> "Hacker News is a popular tech posting site. I used data analysis in order to determine the best practices for optimizing the number of comments on your post."

- author: Migs Germar
- toc: true
- branch: master
- badges: true
- comments: true
- categories: [python, pandas, numpy, altair, datetime]
- hide: false
- search_exclude: false

---

# Overview

[Hacker News](https://news.ycombinator.com/) is a popular website about technology. Specifically, it is a community choice aggregator of tech content. Users can:

- Submit tech articles that the user found online.
- Submit "Ask" posts to ask the community a question.
- Submit "Show" posts to show the community something that the user made.
- Vote and comment on other people's posts.

In this project, we will analyze and compare Ask posts and Show posts in order to answer the following questions:

- Between Ask posts and Show posts, **which type receives more comments** in terms of average number of comments per post? 
- Can posting at a **certain time of day** result in getting more comments?

This analysis can be helpful for Hacker News users who would like their posts to reach a larger audience on the platform.

> Note: I wrote this notebook for the Dataquest course's [Guided Project: Exploring Hacker News Posts](https://app.dataquest.io/m/356/guided-project%3A-exploring-hacker-news-posts/8/next-steps). The research questions and general project flow came from Dataquest. However, all of the text and code here are written by me unless stated otherwise.

# Package Imports

In [1]:
import pandas as pd
import numpy as np
import altair as alt
import datetime as dt

# Dataset

The dataset for this project is the [Hacker News Posts dataset](https://www.kaggle.com/hacker-news/hacker-news-posts) on Kaggle, uploaded by Hacker News.

The following is quoted from the dataset's Description.

>This data set is Hacker News posts from the last 12 months (up to September 26 2016).
>
>It includes the following columns:

- title: title of the post (self explanatory)
- url: the url of the item being linked to
- num_points: the number of upvotes the post received
- num_comments: the number of comments the post received
- author: the name of the account that made the post
- created_at: the date and time the post was made (the time zone is Eastern Time in the US)

Let us view the first 5 rows of the dataset below.

In [20]:
def csv_from_gdrive(url):
    
    """Takes the URL of a CSV file in Google Drive.
    Returns a new URL that allows the file to be downloaded using `pd.read_csv()`."""
    
    file_id = url.split('/')[-2]
    download_url = 'https://drive.google.com/uc?id=' + file_id
    return download_url

my_url = "https://drive.google.com/file/d/1wlAtoz1TCIOAXJHjJVOfvJ4IzUU1AnkF/view?usp=sharing"
dl = csv_from_gdrive(my_url)

hn = pd.read_csv(dl)
hn.head()

|  | id | title | url | num_points | num_comments | author | created_at |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 12579008 | You have two days to comment if you want stem ... | http://www.regulations.gov/document?D=FDA-2015... | 1 | 0 | altstar | 9/26/2016 3:26 |
| 1 | 12579005 | SQLAR the SQLite Archiver | https://www.sqlite.org/sqlar/doc/trunk/README.md | 1 | 0 | blacksqr | 9/26/2016 3:24 |
| 2 | 12578997 | What if we just printed a flatscreen televisio... | https://medium.com/vanmoof/our-secrets-out-f21... | 1 | 0 | pavel_lishin | 9/26/2016 3:19 |
| 3 | 12578989 | algorithmic music | http://cacm.acm.org/magazines/2011/7/109891-al... | 1 | 0 | poindontcare | 9/26/2016 3:16 |
| 4 | 12578979 | How the Data Vault Enables the Next-Gen Data W... | https://www.talend.com/blog/2016/05/12/talend-... | 1 | 0 | markgainor1 | 9/26/2016 3:14 |
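
To make the URL handling in `csv_from_gdrive` easier to follow, here is a minimal sketch of what the split does. The sharing link below is hypothetical: `FILE_ID_PLACEHOLDER` stands in for a real file ID and does not point to an actual file.

```python
def csv_from_gdrive(url):
    """Turn a Google Drive sharing URL into a direct-download URL."""
    # In ".../file/d/<file_id>/view?usp=sharing", the file ID is the
    # second-to-last piece when the URL is split on "/".
    file_id = url.split("/")[-2]
    return "https://drive.google.com/uc?id=" + file_id


# Hypothetical sharing link (assumption: not a real file).
share_url = "https://drive.google.com/file/d/FILE_ID_PLACEHOLDER/view?usp=sharing"

print(share_url.split("/")[-2])    # the extracted file ID
print(csv_from_gdrive(share_url))  # the direct-download URL
```

The function relies on the file ID being the second-to-last path segment, so a link in a different format would need different handling.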


Below is the shape of the dataset.

In [21]:
print(hn.shape)

(293119, 7)


There are 293,119 rows and 7 columns in the dataset.

Before the data can be analyzed, it must first be cleaned.

# Data Cleaning

## Duplicate Rows

Below, I use pandas to delete duplicate rows except for the first instance of each duplicate.

Rows will be considered as duplicates if they are exactly alike in all features. I decided on this because it is possible for two posts to have the same title and/or url but be posted at different times or by different users. Thus, we cannot identify duplicates based on one or two features alone.

In [22]:
hn = hn.drop_duplicates(keep = "first")
print(hn.shape)

(293119, 7)


No duplicates were found. All rows were kept.
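
To illustrate why I compared all features instead of just the title, here is a small sketch on made-up rows (the titles, authors, and times are assumptions, not taken from the dataset).

```python
import pandas as pd

# Made-up rows: rows 0 and 1 are identical in every column,
# while row 2 shares the title but has a different author and time.
toy = pd.DataFrame({
    "title": ["Same title", "Same title", "Same title"],
    "author": ["ann", "ann", "bob"],
    "created_at": ["9/26/2016 3:26", "9/26/2016 3:26", "9/26/2016 4:00"],
})

# Compare all columns: only the exact copy (row 1) is dropped.
full_match = toy.drop_duplicates(keep="first")

# Compare the title only: row 2 is dropped too, even though it is a different post.
title_only = toy.drop_duplicates(subset=["title"], keep="first")

print(len(full_match), len(title_only))
```

Matching on the title alone would discard a legitimate repost by another user, which is why the stricter comparison is used here.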

## Posts without Comments

Our research questions involve the number of comments on each post. However, there are many posts with 0 comments.

To illustrate this, below is a frequency table of the number of comments on each post.

In [23]:
def freq_comments(df = hn):
    
    """Function to make a frequency table of the number of comments per post
    specifically for the Hacker News dataset."""
    
    freq_dct = {}

    for num in df["num_comments"]:
        freq_dct.setdefault(num, 0)
        freq_dct[num] += 1

    freq_df = pd.DataFrame.from_dict(
        freq_dct,
        orient = "index",
    ).reset_index(
    ).rename(columns = {
        "index": "num_comments",
        0: "frequency",
    }).sort_values(
        by = "num_comments",
    ).reset_index(
        drop = True,
    )

    return freq_df

freq_df1 = freq_comments()

freq_df1

|  | num_comments | frequency |
| --- | --- | --- |
| 0 | 0 | 212718 |
| 1 | 1 | 28055 |
| 2 | 2 | 9731 |
| 3 | 3 | 5016 |
| 4 | 4 | 3272 |
| ... | ... | ... |
| 543 | 1007 | 1 |
| 544 | 1120 | 1 |
| 545 | 1448 | 1 |
| 546 | 1733 | 1 |


The table above shows that posts with 0 comments are most frequent.
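
As a side note, a similar frequency table can probably be built without an explicit loop. The sketch below uses `value_counts()` on a made-up column (the values are assumptions for illustration), and it is an alternative to `freq_comments`, not the code used in this project.

```python
import pandas as pd

# Made-up comment counts, not taken from the dataset.
toy = pd.DataFrame({"num_comments": [0, 0, 0, 1, 1, 4]})

freq = (
    toy["num_comments"]
    .value_counts()                  # count how often each value occurs
    .sort_index()                    # order by number of comments
    .rename_axis("num_comments")     # name the index so it becomes a column
    .reset_index(name="frequency")   # turn the counts into a "frequency" column
)

print(freq)
```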

Let us plot the table on a histogram.

In [24]:
def hist_comments(df, title):

    """Function to make a histogram of the number of comments per post
    specifically for the Hacker News dataset."""
    
    chart = alt.Chart(df).mark_bar().encode(
        x = alt.X(
            "num_comments:Q",
            title = "Number of Comments",
            bin = alt.Bin(step = 1)
        ),
        y = alt.Y(
            "frequency:Q",
            title = "Frequency",
        ),
    ).properties(
        title = title,
        width = 700,
        height = 400,
    )
    
    return chart
    
hist_comments(freq_df1, "Histogram of Number of Comments per Post")

There are so many posts with 0 comments that we cannot see the histogram bins for other numbers of comments.

Considering that the dataset is large and most rows have 0 comments, it would be best to drop all rows with 0 comments. This would make analysis less computationally expensive and allow us to answer our research questions.

In [25]:
with_comments = hn["num_comments"] > 0
hn = hn.loc[with_comments].reset_index(drop = True)

print(hn.shape)

(80401, 7)


Now, the dataset is left with only 80,401 rows. This will be easier to work with.
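
As a quick arithmetic check, the numbers reported above are consistent with each other. The sketch below only reuses the row count from `hn.shape` and the frequency of 0-comment posts from the frequency table.

```python
total_posts = 293_119         # rows before filtering, from hn.shape
zero_comment_posts = 212_718  # frequency of posts with 0 comments

remaining = total_posts - zero_comment_posts
share_zero = zero_comment_posts / total_posts

print(remaining)            # 80401
print(f"{share_zero:.1%}")  # 72.6%
```

In other words, roughly 7 out of every 10 posts in the original dataset had no comments at all.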

Below is the new histogram.

In [26]:
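# Pass the filtered DataFrame explicitly. The default `df = hn` was bound
# when the function was defined, so it still points to the unfiltered data.
freq_df2 = freq_comments(df = hn)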

hist_comments(freq_df2, "Histogram of Number of Comments per Post")

The distribution is still heavily right-skewed since many posts have very few comments. What's important is that unnecessary data has been removed.

## Missing Values

Finally, let us remove rows with missing values. In order to answer our research questions, we only need the following columns:

- title
- num_comments
- created_at

Thus, we will delete rows with missing values in these columns.

In [28]:
hn.dropna(
    subset = ["title", "num_comments", "created_at"],
    inplace = True,
)

print(hn.shape)

(80401, 7)


The number of rows did not change from 80,401. Therefore, no missing values were found in these columns, and no rows were dropped.
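
Another way to confirm this is to count the missing values directly before dropping anything. Below is a minimal sketch on made-up rows (the values are assumptions), showing that only the columns listed in `subset` matter.

```python
import pandas as pd

# Made-up rows: row 1 lacks a title, and rows 0 and 2 lack a url.
toy = pd.DataFrame({
    "title": ["A", None, "C"],
    "url": [None, "http://example.com", None],
    "num_comments": [3, 1, 2],
    "created_at": ["9/26/2016 3:26", "9/26/2016 3:24", "9/26/2016 3:19"],
})

needed = ["title", "num_comments", "created_at"]

# Number of missing values in each column that the analysis needs.
missing_per_column = toy[needed].isna().sum()
print(missing_per_column)

# Rows are dropped only if a needed column is missing; a missing url is ignored.
cleaned = toy.dropna(subset=needed)
print(cleaned.shape)
```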

Data cleaning is now done.

# Filtering Posts

As mentioned earlier, the first research question involves comparing Ask posts to Show posts. In order to do this, we have to group the posts into three types:

- Ask Posts
- Show Posts
- Other Posts

Other posts are usually posts that share a tech article found online.

Ask and Show posts can be identified using the start of the post title. Ask posts start with "Ask HN: ".

In [29]:
# Series.iteritems() was removed in pandas 2.0; items() is the replacement.
ask_mask = [index
            for index, value in hn["title"].items()
            if value.startswith("Ask HN: ")
           ]

hn.loc[ask_mask].head()

|  | id | title | url | num_points | num_comments | author | created_at |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 12578908 | Ask HN: What TLD do you use for local developm... |  | 4 | 7 | Sevrene | 9/26/2016 2:53 |
| 6 | 12578522 | Ask HN: How do you pass on your work when you ... |  | 6 | 3 | PascLeRasc | 9/26/2016 1:17 |
| 18 | 12577870 | Ask HN: Why join a fund when you can be an angel? |  | 1 | 3 | anthony_james | 9/25/2016 22:48 |
| 27 | 12577647 | Ask HN: Someone uses stock trading as passive ... |  | 5 | 2 | 00taffe | 9/25/2016 21:50 |
| 41 | 12576946 | Ask HN: How hard would it be to make a cheap, ... |  | 2 | 1 | hkt | 9/25/2016 19:30 |


On the other hand, Show posts start with "Show HN: ".

In [30]:
# Series.iteritems() was removed in pandas 2.0; items() is the replacement.
show_mask = [index
             for index, value in hn["title"].items()
             if value.startswith("Show HN: ")
            ]

hn.loc[show_mask].head()

|  | id | title | url | num_points | num_comments | author | created_at |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 35 | 12577142 | Show HN: Jumble Essays on the go #PaulInYourP... | https://itunes.apple.com/us/app/jumble-find-st... | 1 | 1 | ryderj | 9/25/2016 20:06 |
| 43 | 12576813 | Show HN: Learn Japanese Vocab via multiple cho... | http://japanese.vul.io/ | 1 | 1 | soulchild37 | 9/25/2016 19:06 |
| 52 | 12576090 | Show HN: Markov chain Twitter bot. Trained on ... | https://twitter.com/botsonasty | 3 | 1 | keepingscore | 9/25/2016 16:50 |
| 68 | 12575471 | Show HN: Project-Okot: Novel, CODE-FREE data-a... | https://studio.nuchwezi.com/ | 3 | 1 | nfixx | 9/25/2016 14:30 |
| 88 | 12574773 | Show HN: Cursor that Screenshot | http://edward.codes/cursor-that-screenshot | 3 | 3 | ed-bit | 9/25/2016 10:50 |


Other posts do not start with any special label.

Below, I create a new column "post_type" and assign the appropriate value to each row.

In [31]:
hn["post_type"] = "Other"

hn.loc[ask_mask, "post_type"] = "Ask"
hn.loc[show_mask, "post_type"] = "Show"

hn[["title", "post_type"]]

|  | title | post_type |
| --- | --- | --- |
| 0 | Saving the Hassle of Shopping | Other |
| 1 | Ask HN: What TLD do you use for local developm... | Ask |
| 2 | Amazons Algorithms Dont Find You the Best Deals | Other |
| 3 | Emergency dose of epinephrine that does not co... | Other |
| 4 | Phone Makers Could Cut Off Drivers. So Why Don... | Other |
| ... | ... | ... |
| 80396 | My Keyboard | Other |
| 80397 | Google's new logo was created by Russian desig... | Other |
| 80398 | Why we aren't tempted to use ACLs on our Unix ... | Other |
| 80399 | Ask HN: What is/are your favorite quote(s)? | Ask |


Each row has now been labeled as a type of post.
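
The same labeling can likely be done with boolean masks instead of lists of index labels. The sketch below is an alternative under that assumption, run on made-up titles, and it is not the code used above.

```python
import pandas as pd

# Made-up titles, not taken from the dataset.
toy = pd.DataFrame({
    "title": [
        "Ask HN: A question",
        "Show HN: A project",
        "An ordinary article",
        "Ask HNers anything",  # lacks the exact "Ask HN: " prefix
    ]
})

# Vectorized prefix checks return one True/False value per row.
is_ask = toy["title"].str.startswith("Ask HN: ")
is_show = toy["title"].str.startswith("Show HN: ")

toy["post_type"] = "Other"
toy.loc[is_ask, "post_type"] = "Ask"
toy.loc[is_show, "post_type"] = "Show"

print(toy)
```

Note that the prefix must match exactly, including the colon and the space, so the last title stays labeled as "Other".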

# Research Question 1: Comparing Ask and Show Posts

The first research question is, "Between Ask posts and Show posts, **which type receives more comments** in terms of average number of comments per post?"

Note that the data is not normally distributed; it is right-skewed. For example, here is the distribution of the number of comments per Ask post.

In [32]:
ask_freq = freq_comments(
    df = hn.loc[hn["post_type"] == "Ask"]
)

ask_freq

|  | num_comments | frequency |
| --- | --- | --- |
| 0 | 1 | 1383 |
| 1 | 2 | 1238 |
| 2 | 3 | 762 |
| 3 | 4 | 592 |
| 4 | 5 | 373 |
| ... | ... | ... |
| 203 | 898 | 1 |
| 204 | 910 | 1 |
| 205 | 937 | 1 |
| 206 | 947 | 1 |


In [33]:
hist_comments(
    ask_freq,
    "Histogram of Number of Comments per Ask Post"
)

The histogram is similar for Show posts.

In [34]:
show_freq = freq_comments(
    df = hn.loc[hn["post_type"] == "Show"],
)

show_freq

|  | num_comments | frequency |
| --- | --- | --- |
| 0 | 1 | 1738 |
| 1 | 2 | 814 |
| 2 | 3 | 504 |
| 3 | 4 | 300 |
| 4 | 5 | 196 |
| ... | ... | ... |
| 137 | 250 | 1 |
| 138 | 257 | 1 |
| 139 | 280 | 1 |
| 140 | 298 | 1 |


In [35]:
hist_comments(
    show_freq,
    "Histogram of Number of Comments per Show Post"
)

Therefore, the mean would not be a good measure of central tendency for the "average number of comments per post." Thus, we will use the median instead.
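
To see why the mean can be misleading for right-skewed data, consider the small sketch below. The comment counts are made up for illustration: most posts have a handful of comments, and one post has 500.

```python
from statistics import mean, median

# Made-up comment counts with one extreme value.
comments = [1, 1, 2, 2, 3, 500]

print(mean(comments))    # pulled far upward by the single large value
print(median(comments))  # stays close to what a typical post receives
```

A single very popular post drags the mean far above what most posts receive, while the median is barely affected.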

In [51]:
dct = {"Ask": None, "Show": None}

for key in dct:
    median = np.median(
        hn["num_comments"].loc[hn["post_type"] == key]
    )
    
    dct[key] = median

table = pd.DataFrame.from_dict(
    dct,
    orient = "index",
).reset_index(
).rename(columns = {
    "index": "post_type",
    0: "median_comments",
})

chart = alt.Chart(table).mark_bar().encode(
    x = alt.X("post_type:N", title = "Post Type"),
    y = alt.Y("median_comments:Q", title = "Median Number of Comments per Post"),
).properties(
    title = "Median Number of Comments for the Two Post Types",
)

chart

The bar graph shows that Show posts have a higher median number of comments per post, compared to Ask posts.

> Important: The results suggest that Show posts get more comments than Ask posts. It may be easier for users to reach a larger audience via Show posts.
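
For reference, the per-type medians can probably also be computed with a single `groupby`. The sketch below runs on made-up rows (the numbers are assumptions and say nothing about the real results), and it is an alternative to the loop above, not the code that produced the chart.

```python
import pandas as pd

# Made-up rows, not taken from the dataset.
toy = pd.DataFrame({
    "post_type": ["Ask", "Ask", "Ask", "Show", "Show", "Show", "Other"],
    "num_comments": [1, 2, 9, 1, 4, 5, 7],
})

medians = (
    toy.loc[toy["post_type"].isin(["Ask", "Show"])]  # keep the two types of interest
    .groupby("post_type")["num_comments"]
    .median()                                        # one median per post type
)

print(medians)
```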

# Research Question 2: Active Times

The second research question is, "Can posting at a certain time of day result in getting more comments?"

For this part of the analysis, we will only be using Show post data for simplicity.

We will divide the day into 24 one-hour periods, and then calculate the number of Show posts created in each period.

## String Template for Time

Before analyzing, we need to inspect the "created_at" column of the dataset.

In [37]:
hn_show = hn.loc[
    hn["post_type"] == "Show"
].reset_index(
    drop = True,
)

hn_show[["created_at"]].head()

|  | created_at |
| --- | --- |
| 0 | 9/25/2016 20:06 |
| 1 | 9/25/2016 19:06 |
| 2 | 9/25/2016 16:50 |
| 3 | 9/25/2016 14:30 |
| 4 | 9/25/2016 10:50 |


The strings in this column appear to follow the following format:

> month/day/year hour:minute


With the datetime module, the following is the equivalent formatting template.

In [38]:
template = "%m/%d/%Y %H:%M"
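
As a quick check of the template, the sketch below parses the first "created_at" value shown above and extracts its hour.

```python
import datetime as dt

template = "%m/%d/%Y %H:%M"

# First value of the "created_at" column shown above.
parsed = dt.datetime.strptime("9/25/2016 20:06", template)

print(parsed)       # 2016-09-25 20:06:00
print(parsed.hour)  # 20
```

Here, `%m` and `%d` are the month and day, `%Y` is the four-digit year, and `%H:%M` is the time in 24-hour format.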

## Parsing Times

The time data can now be parsed and used for analysis.

The `hours_posts` dictionary will count the **number of posts** at certain hours.
The `hours_comments` dictionary will count the **number of comments** received by posts made at certain hours.

In [39]:
hours_posts = {}
hours_comments = {}

for index, row in hn_show.iterrows():
    date_str = row["created_at"]
    num_comments = row["num_comments"]
    
    # datetime object
    date_dt = dt.datetime.strptime(
        date_str,
        template,
    )
    
    # extract hour
    hour = date_dt.hour
    
    # update dictionaries
    hours_posts.setdefault(hour, 0)
    hours_posts[hour] += 1
    
    hours_comments.setdefault(hour, 0)
    hours_comments[hour] += num_comments

The hours were parsed and mapped to their respective counts of posts and comments.
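
The same counts can probably be obtained without a row-by-row loop. The sketch below is an alternative that uses `pd.to_datetime` and a `groupby`, run on made-up rows (the values are assumptions). It is not the code used in this project.

```python
import pandas as pd

# Made-up rows in the same "month/day/year hour:minute" format.
toy = pd.DataFrame({
    "created_at": ["9/25/2016 20:06", "9/25/2016 20:45", "9/26/2016 3:26"],
    "num_comments": [1, 3, 5],
})

# Parse every string at once, then keep only the hour.
hours = pd.to_datetime(toy["created_at"], format="%m/%d/%Y %H:%M").dt.hour

by_hour = (
    toy.assign(hour=hours)
    .groupby("hour")["num_comments"]
    .agg(num_posts="count", num_comments="sum")  # posts and comments per hour
    .reset_index()
)

print(by_hour)
```

This produces the post counts and comment totals in a single table, keyed by hour.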

The code below transforms the dictionaries into DataFrames for ease of use.

In [40]:
def hour_to_df(dct, data_label):
    
    """Make a DataFrame from a dictionary that maps
    an 'hour' column to another column, named by `data_label`."""
    
    result = pd.DataFrame.from_dict(
        dct,
        orient = "index",
    ).reset_index(
    ).rename(columns = {
        "index": "hour",
        0: data_label,
    }).sort_values(
        by = "hour",
    ).reset_index(
        drop = True,
    )
    
    return result
    
hours_posts_df = hour_to_df(
    hours_posts,
    data_label = "num_posts",
)

hours_comments_df = hour_to_df(
    hours_comments,
    data_label = "num_comments",
)

hours_posts_df.head()

|  | hour | num_posts |
| --- | --- | --- |
| 0 | 0 | 141 |
| 1 | 1 | 133 |
| 2 | 2 | 103 |
| 3 | 3 | 97 |
| 4 | 4 | 90 |


In [41]:
hours_comments_df.head()

|  | hour | num_comments |
| --- | --- | --- |
| 0 | 0 | 1283 |
| 1 | 1 | 1001 |
| 2 | 2 | 1074 |
| 3 | 3 | 934 |
| 4 | 4 | 978 |


The hours have been parsed, and the tables have been generated.

Additionally, another DataFrame is created below. It calculates the **average number of comments per post** by the hour posted.

In [44]:
frames = [hours_posts_df, hours_comments_df]
hours_mean = pd.concat(frames, axis = 1).drop_duplicates()

# Drop duplicate columns
hours_mean = hours_mean.loc[:, ~hours_mean.columns.duplicated()]

means = []
for index, row in hours_mean.iterrows():
    hour, posts, comments = row
    means.append(comments / posts)
    
hours_mean["comments_per_post"] = means

hours_mean

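|  | hour | num_posts | num_comments | comments_per_post |
| --- | --- | --- | --- | --- |
| 0 | 0 | 141 | 1283 | 9.099291 |
| 1 | 1 | 133 | 1001 | 7.526316 |
| 2 | 2 | 103 | 1074 | 10.427184 |
| 3 | 3 | 97 | 934 | 9.628866 |
| 4 | 4 | 90 | 978 | 10.866667 |
| 5 | 5 | 75 | 591 | 7.88 |
| 6 | 6 | 95 | 904 | 9.515789 |
| 7 | 7 | 125 | 1572 | 12.576 |
| 8 | 8 | 159 | 1770 | 11.132075 |
| 9 | 9 | 158 | 1411 | 8.93038 |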
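
The "comments_per_post" column is simply the number of comments divided by the number of posts for each hour. Below is a minimal sketch that reproduces the first three rows with a merge and a column division. It rebuilds small versions of the two tables from the values shown above, and it is an alternative to the loop, not the code used in this project.

```python
import pandas as pd

# First three rows of each table, copied from the outputs above.
hours_posts_df = pd.DataFrame({"hour": [0, 1, 2], "num_posts": [141, 133, 103]})
hours_comments_df = pd.DataFrame({"hour": [0, 1, 2], "num_comments": [1283, 1001, 1074]})

# Join the two tables on the shared "hour" column.
hours_mean = hours_posts_df.merge(hours_comments_df, on="hour")

# Divide column by column; no explicit loop is needed.
hours_mean["comments_per_post"] = hours_mean["num_comments"] / hours_mean["num_posts"]

print(hours_mean.round(6))
```

For example, hour 0 gives 1283 / 141, which is about 9.099291, matching the table.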


## Number of Posts by Hour of the Day

Below are the table and graph for the number of posts per hour of the day.

In [45]:
hours_posts_df

|  | hour | num_posts |
| --- | --- | --- |
| 0 | 0 | 141 |
| 1 | 1 | 133 |
| 2 | 2 | 103 |
| 3 | 3 | 97 |
| 4 | 4 | 90 |
| 5 | 5 | 75 |
| 6 | 6 | 95 |
| 7 | 7 | 125 |
| 8 | 8 | 159 |
| 9 | 9 | 158 |


This table is in 24-hour time. Hour 13 refers to 1:00 PM. The table shows how many posts are made for every hour in the day.

Below is a histogram that shows this visually.

In [46]:
chart = alt.Chart(hours_posts_df).mark_bar().encode(
    x = alt.X("hour:Q", title = "Hour of the Day", bin = alt.Bin(step = 1)),
    y = alt.Y("num_posts:Q", title = "Number of Posts"),
).properties(
    title = "Number of Posts by Hour of the Day",
    width = 700,
    height = 400,
)

chart

The histogram clearly shows that Hacker News users most actively make posts between 15:00 and 18:00, or from 3:00 PM to 6:00 PM.

The most active hour for posting is **3:00 PM - 4:00 PM**.

## Number of Comments by Hour Posted

Next, a similar analysis is done for the number of comments received by posts, grouped by the hour that they were created. Below is the table for this data.

In [47]:
hours_comments_df

|  | hour | num_comments |
| --- | --- | --- |
| 0 | 0 | 1283 |
| 1 | 1 | 1001 |
| 2 | 2 | 1074 |
| 3 | 3 | 934 |
| 4 | 4 | 978 |
| 5 | 5 | 591 |
| 6 | 6 | 904 |
| 7 | 7 | 1572 |
| 8 | 8 | 1770 |
| 9 | 9 | 1411 |


Below is the histogram that visualizes the data about the number of comments received by the hour posted.

In [48]:
chart = alt.Chart(hours_comments_df).mark_bar().encode(
    x = alt.X("hour:Q", title = "Hour of the Day", bin = alt.Bin(step = 1)),
    y = alt.Y("num_comments:Q", title = "Number of Comments"),
).properties(
    title = "Number of Comments by Hour Posted",
    width = 700,
    height = 400,
)

chart

The results in this histogram are similar to the previous one. Posts that are made from 14:00 to 17:00, or 2:00 PM - 5:00 PM, receive the most comments.

Posts made at **2:00 PM - 3:00 PM** receive an especially high number of comments.

## Mean Number of Comments per Post, by Hour Posted

The table below shows the mean number of comments per post, by the hour of posting.

In [49]:
hours_mean

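|  | hour | num_posts | num_comments | comments_per_post |
| --- | --- | --- | --- | --- |
| 0 | 0 | 141 | 1283 | 9.099291 |
| 1 | 1 | 133 | 1001 | 7.526316 |
| 2 | 2 | 103 | 1074 | 10.427184 |
| 3 | 3 | 97 | 934 | 9.628866 |
| 4 | 4 | 90 | 978 | 10.866667 |
| 5 | 5 | 75 | 591 | 7.88 |
| 6 | 6 | 95 | 904 | 9.515789 |
| 7 | 7 | 125 | 1572 | 12.576 |
| 8 | 8 | 159 | 1770 | 11.132075 |
| 9 | 9 | 158 | 1411 | 8.93038 |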


This is visualized in the histogram below, which looks quite different from the previous two graphs.

In [50]:
chart = alt.Chart(hours_mean).mark_bar().encode(
    x = alt.X("hour:Q", title = "Hour of the Day", bin = alt.Bin(step = 1)),
    y = alt.Y("comments_per_post:Q", title = "Mean Number of Comments per Post"),
).properties(
    title = "Mean Number of Comments per Post, by Hour Posted",
    width = 700,
    height = 400,
)

chart

This graph shows that the _mean_ number of comments per post is more consistent throughout the day. The statistic is highest at 7:00 AM - 8:00 AM, but the afternoon values are only slightly lower.

This raises a **new question**: Why does the mean number of comments per post not peak in the afternoon?

A possible explanation is that since the site is oversaturated with new posts in the afternoon, only the very best posts receive attention. The rest are lost in the flood of new posts.

> Important: Based on the results, even if users are most active in the afternoon, it may be best to post in the morning.


If you post in the morning, you will receive at least some attention since you won't have to compete with many other new posts. You may get a few upvotes and comments. Then, the many users logging in during the afternoon would see that your post has already received comments. Thus, they would be interested in it and bring more attention to it.

# Conclusion

In this project, we analyzed data about Hacker News posts, specifically regarding the number of comments that they receive. Below are the research questions, and the best answers that we could come up with from our analysis.

<br/>

- Between Ask posts and Show posts, **which type receives more comments** in terms of average number of comments per post?

_Show posts_ receive a higher median number of comments per post, compared to Ask posts. In order to reach a wider audience or converse with more users, it is better to make a Show post.

<br/>

- Can posting at a **certain time of day** result in getting more comments?

Hacker News users are very active in the afternoon, from 2:00 PM to 6:00 PM. However, if you post in the afternoon, your post may get lost in the flood of new posts. It is better to post in the _morning_, like at 7:00 AM, so that some people can notice your post. Then, your post can get more attention in the afternoon.

---

Thanks for reading!