# Analyzing Hacker News Post Interaction

Our goal is to see what kinds of Hacker News posts receive the most comments, and when. We are concerned with two post types: "Ask HN", a text post that asks a specific question to discuss in the comments, and "Show HN", a link to content found elsewhere on the internet.

The dataset includes the posts from the year leading up to September 26, 2016. The documentation can be found [here](https://www.kaggle.com/hacker-news/hacker-news-posts).

First, the data is read from the CSV file and split into the header line and data. The header and first 5 rows of data are shown below.

In [1]:
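from csv import reader

# Read every row into a list; the context manager closes the file afterwards.
with open("datasets/hackernews.csv", newline="") as f:
    hn_full = list(reader(f))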

hn_header, hn = hn_full[0], hn_full[1:]

print(hn_header, *hn[:5], sep="\n\n")

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']

['12579008', 'You have two days to comment if you want stem cells to be classified as your own', 'http://www.regulations.gov/document?D=FDA-2015-D-3719-0018', '1', '0', 'altstar', '9/26/2016 3:26']

['12579005', 'SQLAR  the SQLite Archiver', 'https://www.sqlite.org/sqlar/doc/trunk/README.md', '1', '0', 'blacksqr', '9/26/2016 3:24']

['12578997', 'What if we just printed a flatscreen television on the side of our boxes?', 'https://medium.com/vanmoof/our-secrets-out-f21c1f03fdc8#.ietxmez43', '1', '0', 'pavel_lishin', '9/26/2016 3:19']

['12578989', 'algorithmic music', 'http://cacm.acm.org/magazines/2011/7/109891-algorithmic-composition/fulltext', '1', '0', 'poindontcare', '9/26/2016 3:16']

['12578979', 'How the Data Vault Enables the Next-Gen Data Warehouse and Data Lake', 'https://www.talend.com/blog/2016/05/12/talend-and-Â\x93the-data-vaultÂ\x94', '1', '0', 'markgainor1', '9/26/2016 3:14']
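
Every later step indexes rows by position (for example `row[4]` for `num_comments`), so it can help to see the same kind of rows keyed by column name. The following is a minimal sketch, not part of the original analysis; it assumes the header printed above and uses a small made-up in-memory sample (`SAMPLE_CSV`, a name introduced here) in place of the real file.

```python
import csv
import io

# Hypothetical two-row sample that mirrors the header printed above.
SAMPLE_CSV = """id,title,url,num_points,num_comments,author,created_at
12579008,Example link post,http://example.com,1,0,altstar,9/26/2016 3:26
12578997,Ask HN: Example question?,,3,7,someone,9/26/2016 3:19
"""

# DictReader takes field names from the first line, so columns are read by name.
rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))

for row in rows:
    # Values are still strings and must be converted before doing arithmetic.
    print(row["title"], int(row["num_comments"]))
```

The rest of this notebook keeps the positional lists, which is why the helper functions take column indexes as parameters.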


## Analyzing average comment counts between post types

We then split the data into "Ask HN", "Show HN", and Other posts, then check how many of each are present.

In [2]:
from typing import List, Tuple
def split_posts(dataset: List[List], title_col: int = 1) -> Tuple[List[List], List[List], List[List]]:
    """Split the given dataset into Ask HN, Show HN, and Other posts."""
    ask_posts = list(filter(lambda row: row[title_col].lower().startswith("ask hn"), dataset))
    show_posts = list(filter(lambda row: row[title_col].lower().startswith("show hn"), dataset))
    other_posts = list(filter(lambda row: not(row[title_col].lower().startswith("ask hn")
                                              or row[title_col].lower().startswith("show hn")), dataset))
    return ask_posts, show_posts, other_posts

In [3]:
ask_posts, show_posts, other_posts = split_posts(hn)

print("Number of Ask HN posts:", len(ask_posts))
print("Number of Show HN posts:", len(show_posts))
print("Number of Other posts:", len(other_posts))

Number of Ask HN posts: 9139
Number of Show HN posts: 10158
Number of Other posts: 273822
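
The prefix test inside `split_posts` is case-insensitive and only looks at the start of the title. As a sketch on made-up rows (the titles are illustrative, not taken from the dataset), the same three groups can be formed in a single pass; `split_posts_once` is a hypothetical name used only here.

```python
from typing import List, Tuple

Rows = List[List[str]]

def split_posts_once(dataset: Rows, title_col: int = 1) -> Tuple[Rows, Rows, Rows]:
    """Single-pass variant of split_posts (illustrative helper)."""
    ask, show, other = [], [], []
    for row in dataset:
        title = row[title_col].lower()
        if title.startswith("ask hn"):
            ask.append(row)
        elif title.startswith("show hn"):
            show.append(row)
        else:
            other.append(row)
    return ask, show, other

sample = [
    ["1", "Ask HN: How do you learn?"],
    ["2", "SHOW HN: My side project"],
    ["3", "Why I ask HN for advice"],  # "ask hn" is not a prefix, so this is Other
]
ask, show, other = split_posts_once(sample)
print(len(ask), len(show), len(other))
```

Each row lands in exactly one group, so the three counts always add up to the size of the input.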


In [4]:
from typing import List
def avg_comments(dataset: List[List], comment_col: int = 4):
    return round(sum([ int(row[comment_col]) for row in dataset ]) / len(dataset), 2)

In [5]:
avg_ask_comments = avg_comments(ask_posts)
avg_show_comments = avg_comments(show_posts)

print("Average comments per Ask HN post:", avg_ask_comments)
print("Average comments per Show HN post:", avg_show_comments)

Average comments per Ask HN post: 10.39
Average comments per Show HN post: 4.89


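On average, Ask HN posts receive more than twice as many comments as Show HN posts (10.39 compared to 4.89). The next step is to break the Ask HN average down by the hour each post was created.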

In [6]:
import datetime as dt
from typing import List, Dict
def avg_comments_by_hour(dataset: List[List], date_col: int = 6, comment_col: int = 4) -> Dict[int, float]:
    counts_by_hour = dict()
    comments_by_hour = dict()
    for row in dataset:
        hour = dt.datetime.strptime(row[date_col].split()[1], "%H:%M").hour
        comments = int(row[comment_col])

        try:
            counts_by_hour[hour] += 1
        except KeyError:
            counts_by_hour[hour] = 1

        try:
            comments_by_hour[hour] += comments
        except KeyError:
            comments_by_hour[hour] = comments
    return dict(sorted({ hour: round(comments_by_hour[hour] / count, 2)
                            for hour, count in counts_by_hour.items() }.items(),
                        key=lambda x: x[1], reverse=True))

In [7]:
avg_ask_comments_by_hour = avg_comments_by_hour(ask_posts)

print("Top 5 Hours for Ask Post Comments:")
print(*[ f"{hour:02d}:00 - {amt}"
            for hour, amt in list(avg_ask_comments_by_hour.items())[:5] ], sep="\n")

Top 5 Hours for Ask Post Comments:
15:00 - 28.68
13:00 - 16.32
12:00 - 12.38
02:00 - 11.14
10:00 - 10.68


As described in the dataset documentation, the time zone is US Eastern. From this, we can see that the most-commented post category, Ask HN, receives a moderate number of comments on posts made around 10AM; the average increases from 12PM to 1PM, falls off at 2PM, then peaks at 3PM. 2AM is another high-activity hour, which may indicate a large contingent of night traffic or non-American users: 2AM Eastern is 7AM in the UK and 8AM in Berlin.

Based on this information, Ask HN posts made at 3PM Eastern time receive the most comments.
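
The time zone arithmetic above can be checked with the standard library. This sketch assumes fixed standard-time offsets (UTC-5, UTC+0, UTC+1) and ignores daylight saving; the date is arbitrary.

```python
from datetime import datetime, timedelta, timezone

# Fixed offsets for standard (winter) time; daylight saving is ignored here.
EST = timezone(timedelta(hours=-5), "EST")
GMT = timezone(timedelta(hours=0), "GMT")
CET = timezone(timedelta(hours=1), "CET")

# 2AM Eastern on an arbitrary winter date.
post_time = datetime(2016, 1, 15, 2, 0, tzinfo=EST)

london_hour = post_time.astimezone(GMT).hour
berlin_hour = post_time.astimezone(CET).hour
print(f"London: {london_hour:02d}:00, Berlin: {berlin_hour:02d}:00")
```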

### Determining effects of posts with no comments

To account for potential skew from posts that receive no comments, we repeat the same steps after removing all posts with 0 comments and check whether the results change.

In [8]:
# _c suffix indicates "with comments"
ask_posts_c, show_posts_c, other_posts_c = split_posts(list(filter(lambda row: int(row[4]) > 0, hn)))

In [9]:
avg_ask_comments_c = avg_comments(ask_posts_c)
avg_show_comments_c = avg_comments(show_posts_c)

print("Average comments per commented Ask HN post:", avg_ask_comments_c)
print("Average comments per commented Show HN post:", avg_show_comments_c)

Average comments per commented Ask HN post: 13.74
Average comments per commented Show HN post: 9.81


After this adjustment, the two post types are more similar in average comment count, differing by a factor of 1.4 instead of 2.1, with Ask HN still having the higher average. This suggests that a larger share of Show HN posts receive no comments, which we can quickly verify.
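
The two factors quoted above follow directly from the printed averages. As a quick check, using only values copied from the outputs in this notebook:

```python
# Averages copied from the notebook outputs above.
avg_ask_all, avg_show_all = 10.39, 4.89               # all posts
avg_ask_commented, avg_show_commented = 13.74, 9.81   # posts with at least 1 comment

ratio_all = avg_ask_all / avg_show_all
ratio_commented = avg_ask_commented / avg_show_commented

print(f"All posts: Ask HN averages {ratio_all:.1f}x the comments of Show HN")
print(f"Commented posts only: {ratio_commented:.1f}x")
```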

In [10]:
ask_posts_no_c_count = len(ask_posts) - len(ask_posts_c)
show_posts_no_c_count = len(show_posts) - len(show_posts_c)
print("Number of Ask HN posts with 0 comments:", ask_posts_no_c_count,
            f"({ask_posts_no_c_count / len(ask_posts) * 100:.2f}%)")
print("Number of Show HN posts with 0 comments:", show_posts_no_c_count,
            f"({show_posts_no_c_count / len(show_posts) * 100:.2f}%)")

Number of Ask HN posts with 0 comments: 2228 (24.38%)
Number of Show HN posts with 0 comments: 5099 (50.20%)


As predicted, more than twice as many Show HN posts received no comments. Only half of all Show HN posts received a comment, compared to 3 out of 4 Ask HN posts.
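
The link between the zero-comment share and the shift in averages can be made explicit: removing posts with 0 comments leaves the total number of comments unchanged and only shrinks the denominator. The sketch below reconstructs the filtered averages from the rounded figures printed above, so the last digit may differ slightly from the notebook output; `avg_without_zeros` is a name introduced for this example.

```python
def avg_without_zeros(avg_all: float, n_all: int, n_zero: int) -> float:
    """Average over posts with a non-zero count, from the overall average."""
    total = avg_all * n_all          # zero-comment posts add nothing to the total
    return total / (n_all - n_zero)  # only the denominator shrinks

# Values copied from the outputs above.
ask_est = avg_without_zeros(10.39, 9139, 2228)
show_est = avg_without_zeros(4.89, 10158, 5099)

print(f"Ask HN estimate: {ask_est:.2f} (notebook output: 13.74)")
print(f"Show HN estimate: {show_est:.2f} (notebook output: 9.81)")
```

Because half of the Show HN posts drop out of the denominator, the Show HN average roughly doubles, while the Ask HN average rises by only about a third.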

### Average comments by hour of Ask HN posts with at least 1 comment

After determining that Ask HN posts both receive more comments and are more likely to receive at least 1 comment, we compare the average comments by hour again.

In [11]:
avg_ask_comments_by_hour_c = avg_comments_by_hour(ask_posts_c)

print("Top 5 Hours for Ask Post Comments (excluding posts with 0 comments):")
print(*[ f"{hour:02d}:00 - {amt}"
            for hour, amt in list(avg_ask_comments_by_hour_c.items())[:5] ], sep="\n")

Top 5 Hours for Ask Post Comments (excluding posts with 0 comments):
15:00 - 39.67
13:00 - 22.22
12:00 - 15.45
10:00 - 13.76
17:00 - 13.73


Posts made at 3PM still receive the most comments, with 1PM and 12PM following in the same order and with similar relative gaps. 2AM no longer appears in the top 5; it is replaced by 5PM, which has nearly the same average comments per post as 10AM.

## Analyzing average point counts between post types

Now that we know that Ask HN posts made at 3PM receive the most comments, we want to determine whether these posts also receive more positive interaction in the form of vote totals.

In [12]:
from typing import List
def avg_points(dataset: List[List], points_col: int = 3):
    return round(sum([ int(row[points_col]) for row in dataset ]) / len(dataset), 2)

In [13]:
avg_ask_points = avg_points(ask_posts)
avg_show_points = avg_points(show_posts)

print("Average Ask HN post points:", avg_ask_points)
print("Average Show HN post points:", avg_show_points)

Average Ask HN post points: 11.31
Average Show HN post points: 14.84


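On average, Show HN posts receive more points than Ask HN posts, the reverse of the comment result. This difference suggests that it is easier to click on a link, skim or read the content, and vote than it is to comment on a question. The two post types call for different levels of effort: Ask HN is more likely to retain users who are already engaged, while Show HN can bring new users in or retain users with less engagement.

Two follow-ups to consider: filtering out posts that received no votes (points > 0), and possibly combining the comment and point metrics.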

In [14]:
def avg_points_by_hour(dataset: List[List], date_col: int = 6, point_col: int = 3) -> Dict[int, float]:
    counts_by_hour = dict()
    points_by_hour = dict()
    for row in dataset:
        hour = dt.datetime.strptime(row[date_col].split()[1], "%H:%M").hour
        points = int(row[point_col])

        try:
            counts_by_hour[hour] += 1
        except KeyError:
            counts_by_hour[hour] = 1

        try:
            points_by_hour[hour] += points
        except KeyError:
            points_by_hour[hour] = points
    return dict(sorted({ hour: round(points_by_hour[hour] / count, 2)
                            for hour, count in counts_by_hour.items() }.items(),
                        key=lambda x: x[1], reverse=True))
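
`avg_points_by_hour` repeats `avg_comments_by_hour` with a different column. One possible consolidation, sketched here on made-up rows, is a single helper that takes the value column as a parameter. The name `avg_by_hour` and the sample rows are assumptions introduced for this example, and the full `created_at` string is parsed rather than only its time part.

```python
import datetime as dt
from collections import defaultdict
from typing import Dict, List

def avg_by_hour(dataset: List[List[str]], value_col: int,
                date_col: int = 6) -> Dict[int, float]:
    """Average of value_col per creation hour, sorted from highest to lowest."""
    totals = defaultdict(int)
    counts = defaultdict(int)
    for row in dataset:
        # created_at looks like "9/26/2016 3:26"
        hour = dt.datetime.strptime(row[date_col], "%m/%d/%Y %H:%M").hour
        totals[hour] += int(row[value_col])
        counts[hour] += 1
    averages = {hour: round(totals[hour] / counts[hour], 2) for hour in counts}
    return dict(sorted(averages.items(), key=lambda item: item[1], reverse=True))

# Made-up rows in the dataset's column order.
sample = [
    ["1", "Ask HN: a", "", "4", "10", "u1", "9/26/2016 15:05"],
    ["2", "Ask HN: b", "", "2", "20", "u2", "9/25/2016 15:40"],
    ["3", "Ask HN: c", "", "6", "3", "u3", "9/25/2016 2:10"],
]
comments_by_hour = avg_by_hour(sample, value_col=4)
points_by_hour = avg_by_hour(sample, value_col=3)
print(comments_by_hour)
print(points_by_hour)
```

`defaultdict(int)` starts every missing hour at 0, which replaces the `try`/`except KeyError` blocks without changing the result.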

In [15]:
avg_ask_points_by_hour = avg_points_by_hour(ask_posts)
avg_show_points_by_hour = avg_points_by_hour(show_posts)

print("Top 5 Hours for Ask Post Points:")
print(*[ f"{hour:02d}:00 - {amt}"
            for hour, amt in list(avg_ask_points_by_hour.items())[:5] ], sep="\n", end="\n\n")

print("Top 5 Hours for Show Post Points:")
print(*[ f"{hour:02d}:00 - {amt}"
            for hour, amt in list(avg_show_points_by_hour.items())[:5] ], sep="\n")

Top 5 Hours for Ask Post Points:
15:00 - 21.64
13:00 - 17.93
12:00 - 13.58
10:00 - 13.44
17:00 - 12.19

Top 5 Hours for Show Post Points:
12:00 - 20.91
11:00 - 19.26
13:00 - 17.02
19:00 - 16.06
06:00 - 15.99


The engagement pattern that emerges is remarkably similar to the comment metrics. Posts made in the early afternoon, Eastern time, retain the highest engagement in both comments and points per post.

Notably, Ask HN posts have a much larger drop-off between the second and third highest hourly averages, from 17.93 to 13.58 points per post, while the top 5 hours for Show HN posts span a difference of just under 5 points per post. This indicates that Show HN posts receive points more consistently across their best hours than Ask HN posts, in addition to receiving more overall.
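
The gaps described above can be read off the printed top-5 lists. A short check, using only values copied from the output:

```python
# Top-5 hourly point averages copied from the output above.
ask_top5 = [21.64, 17.93, 13.58, 13.44, 12.19]
show_top5 = [20.91, 19.26, 17.02, 16.06, 15.99]

# Distance between the best and fifth-best hour for each post type.
ask_spread = round(max(ask_top5) - min(ask_top5), 2)
show_spread = round(max(show_top5) - min(show_top5), 2)

# Drop between the second and third highest Ask HN hours.
ask_step = round(ask_top5[1] - ask_top5[2], 2)

print("Ask HN top-5 spread:", ask_spread)
print("Show HN top-5 spread:", show_spread)
print("Ask HN second-to-third drop:", ask_step)
```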

### Accounting for posts without a positive vote total

In this dataset, no post has 0 points. However, as with the comment metrics, a large number of posts have only 1 point, and these may affect the outcome.

In [16]:
# _p suffix indicates "with points"
ask_posts_p, show_posts_p, other_posts_p = split_posts(list(filter(lambda row: int(row[3]) > 1, hn)))

In [17]:
ask_posts_no_p_count = len(ask_posts) - len(ask_posts_p)
show_posts_no_p_count = len(show_posts) - len(show_posts_p)
print("Number of Ask HN posts with only 1 point:", ask_posts_no_p_count,
            f"({ask_posts_no_p_count / len(ask_posts) * 100:.2f}%)")
print("Number of Show HN posts with only 1 point:", show_posts_no_p_count,
            f"({show_posts_no_p_count / len(show_posts) * 100:.2f}%)")

Number of Ask HN posts with only 1 point: 2515 (27.52%)
Number of Show HN posts with only 1 point: 1895 (18.66%)


As seen above, about 28% of Ask HN posts received no additional points, compared to about 19% of Show HN posts.

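In [18]:
# avg_comments is the function that produced the output below, so these are
# average comment counts (not points) for posts with more than 1 point.
avg_ask_comments_p = avg_comments(ask_posts_p)
avg_show_comments_p = avg_comments(show_posts_p)

print("Average comments per Ask HN post with more than 1 point:", avg_ask_comments_p)
print("Average comments per Show HN post with more than 1 point:", avg_show_comments_p)

Average comments per Ask HN post with more than 1 point: 13.9
Average comments per Show HN post with more than 1 point: 5.93


Among posts with more than 1 point, Ask HN posts received more than twice as many comments on average as Show HN posts (13.9 compared to 5.93). This indicates that while an Ask HN post is less likely to receive a vote, when it does it typically draws more discussion than a Show HN post. Because this cell averages comments rather than points, it does not change the earlier finding that Show HN posts receive more points on average.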
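
Because the cell above calls `avg_comments`, a point average for the same subset is not printed anywhere in this notebook. It can be estimated from figures that are printed: posts with exactly 1 point contribute 1 point each to the total, so they can be subtracted out. This is an approximation from rounded values, not a rerun of the notebook, and `avg_points_above_one` is a name introduced for this example.

```python
def avg_points_above_one(avg_all: float, n_all: int, n_one: int) -> float:
    """Estimate average points for posts with more than 1 point.

    Assumes no post has fewer than 1 point, as stated for this dataset.
    """
    total = avg_all * n_all                    # total points across all posts
    return (total - n_one) / (n_all - n_one)   # drop the 1-point posts

# Values copied from the outputs above.
ask_est = avg_points_above_one(11.31, 9139, 2515)
show_est = avg_points_above_one(14.84, 10158, 1895)

print(f"Estimated points per Ask HN post with more than 1 point: {ask_est:.1f}")
print(f"Estimated points per Show HN post with more than 1 point: {show_est:.1f}")
```

Under these assumptions the estimate is roughly 15.2 for Ask HN and 18.0 for Show HN, which would leave Show HN ahead on points even after the 1-point posts are removed.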

## Conclusions

Based on what we have learned from this data, we have a few key points to take away:

* Posts made in the early afternoon, Eastern time, receive the most interaction overall
* Show HN posts are more likely than Ask HN posts to receive points beyond the first, but less likely to receive a comment
* Of posts that do receive interaction, Ask HN posts receive more comments on average

Based on this information, the most popular type of post overall would be an Ask HN post made at 3PM EST.