# Exploring Hacker News Posts

## Introduction

In this project, we will explore a dataset of approximately 293,000 posts that were submitted by users to Hacker News, a social news website that focuses on computer science and entrepreneurship. On this website, users can 'upvote' or 'downvote' a post depending on whether they like it or not. Users can also comment and have discussions under each post.

The goal of this project is to compare two types of posts: those in which users 'ask' the community a question and those in which they 'show' it something interesting. To do this, we will filter the dataset for posts whose titles begin with `Ask HN` or `Show HN`, and then examine the data. Our analysis answers two main questions:

- Do `Ask HN` or `Show HN` posts receive more comments on average?
- Do posts created at a certain time receive more comments on average?

## Importing the Data

First, let's import the data and extract the header row.

In [1]:
import csv

with open('hacker_news.csv', 'r') as f:
    hn = list(csv.reader(f))


# Preview first five rows of the dataset
print(hn[:5])

[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'], ['12579008', 'You have two days to comment if you want stem cells to be classified as your own', 'http://www.regulations.gov/document?D=FDA-2015-D-3719-0018', '1', '0', 'altstar', '9/26/2016 3:26'], ['12579005', 'SQLAR  the SQLite Archiver', 'https://www.sqlite.org/sqlar/doc/trunk/README.md', '1', '0', 'blacksqr', '9/26/2016 3:24'], ['12578997', 'What if we just printed a flatscreen television on the side of our boxes?', 'https://medium.com/vanmoof/our-secrets-out-f21c1f03fdc8#.ietxmez43', '1', '0', 'pavel_lishin', '9/26/2016 3:19'], ['12578989', 'algorithmic music', 'http://cacm.acm.org/magazines/2011/7/109891-algorithmic-composition/fulltext', '1', '0', 'poindontcare', '9/26/2016 3:16']]


In [2]:
# Removing header from dataset
headers = hn[0]
hn = hn[1:]
print(headers)
print('\n')
print(hn[:5])  # Check if header was removed

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']


[['12579008', 'You have two days to comment if you want stem cells to be classified as your own', 'http://www.regulations.gov/document?D=FDA-2015-D-3719-0018', '1', '0', 'altstar', '9/26/2016 3:26'], ['12579005', 'SQLAR  the SQLite Archiver', 'https://www.sqlite.org/sqlar/doc/trunk/README.md', '1', '0', 'blacksqr', '9/26/2016 3:24'], ['12578997', 'What if we just printed a flatscreen television on the side of our boxes?', 'https://medium.com/vanmoof/our-secrets-out-f21c1f03fdc8#.ietxmez43', '1', '0', 'pavel_lishin', '9/26/2016 3:19'], ['12578989', 'algorithmic music', 'http://cacm.acm.org/magazines/2011/7/109891-algorithmic-composition/fulltext', '1', '0', 'poindontcare', '9/26/2016 3:16'], ['12578979', 'How the Data Vault Enables the Next-Gen Data Warehouse and Data Lake', 'https://www.talend.com/blog/2016/05/12/talend-and-Â\x93the-data-vaultÂ\x94', '1', '0', 'markgainor1', '9/26/2016 3:14']]
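As an alternative to slicing, the header can also be separated while the file is being read. The sketch below assumes the same column layout as above and uses a small in-memory stand-in (`sample_csv`, with two shortened rows modeled on the preview) so that it runs without the real file. The names `sample_csv`, `sample_headers`, and `sample_rows` are hypothetical.

```python
import csv
import io

# Stand-in for hacker_news.csv: the header plus two shortened rows
# modeled on the preview above (titles and URLs are abbreviated).
sample_csv = io.StringIO(
    "id,title,url,num_points,num_comments,author,created_at\n"
    "12579008,You have two days to comment,http://www.regulations.gov/,1,0,altstar,9/26/2016 3:26\n"
    "12578989,algorithmic music,http://cacm.acm.org/,1,0,poindontcare,9/26/2016 3:16\n"
)

reader = csv.reader(sample_csv)
sample_headers = next(reader)  # the first row is the header
sample_rows = list(reader)     # every remaining row is a post

print(sample_headers)
print(len(sample_rows))
```

With the real file, the same pattern would apply to the file object `f` in place of `sample_csv`.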


## Extracting Ask HN and Show HN Posts

Before we explore the dataset further, we need to filter for posts whose titles start with `Ask HN` or `Show HN` and store them in two separate lists. All remaining posts go into a third list.

In [3]:
# Create empty lists
ask_posts = []
show_posts = []
other_posts = []

# Filter for 'Ask HN' and 'Show HN' posts
for row in hn:
    title = row[1]
    # Control for different cases
    if title.lower().startswith('ask hn'):
        ask_posts.append(row)
    elif title.lower().startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)

# Check number of posts in each list
print(len(ask_posts))
print(len(show_posts))
print(len(other_posts))

9139
10158
273822
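The case-insensitive matching used above can be checked in isolation. The sketch below wraps the same `lower()` and `startswith()` logic in a hypothetical helper, `classify_title`, and applies it to a few made-up titles that are not taken from the dataset.

```python
def classify_title(title):
    """Return 'ask', 'show', or 'other' based on how the title starts."""
    lowered = title.lower()
    if lowered.startswith('ask hn'):
        return 'ask'
    if lowered.startswith('show hn'):
        return 'show'
    return 'other'


# Made-up titles for illustration only
example_titles = [
    'Ask HN: How do you learn?',
    'SHOW HN: My side project',
    'algorithmic music',
]

labels = [classify_title(title) for title in example_titles]
print(labels)
```

Lowercasing the title first means that variants such as `ASK HN` or `show hn` end up in the same list.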


## Average Number of Comments for Ask HN and Show HN Posts

Here we calculate the average number of comments that each `Ask HN` and `Show HN` post receives, and then compare the two values.

In [4]:
# Calculating average number of comments per 'Ask HN' post
total_ask_comments = 0

for row in ask_posts:
    total_ask_comments += int(row[4])

avg_ask_comments = total_ask_comments / len(ask_posts)
print(avg_ask_comments)

10.393478498741656


In [5]:
# Calculating average number of comments per 'Show HN' post
total_show_comments = 0

for row in show_posts:
    total_show_comments += int(row[4])

avg_show_comments = total_show_comments / len(show_posts)
print(avg_show_comments)

4.886099625910612


On average, each `Ask HN` post receives about 10.4 comments, while each `Show HN` post receives about 4.9. An `Ask HN` post therefore draws roughly 5.5 more comments than a `Show HN` post, or about twice as many. Since `Ask HN` posts attract more comments on average, the rest of the analysis focuses on `Ask HN` posts only.
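The comparison can be made explicit with a short calculation. The sketch below is one possible way to avoid repeating the averaging loop: `average_comments` is a hypothetical helper, and the two averages are copied from the outputs above.

```python
def average_comments(posts, comments_index=4):
    """Mean of the num_comments column (index 4 in this dataset)."""
    total = sum(int(row[comments_index]) for row in posts)
    return total / len(posts)


# Averages printed in the two cells above
avg_ask = 10.393478498741656
avg_show = 4.886099625910612

difference = avg_ask - avg_show  # extra comments per Ask HN post
ratio = avg_ask / avg_show       # how many times as many comments

print(round(difference, 2))
print(round(ratio, 2))
```

In the notebook, `average_comments(ask_posts)` and `average_comments(show_posts)` would stand in for the two separate loops.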

## Finding the Number of Ask Posts and Comments by Hour Created

Now we want to find out whether `Ask HN` posts created at a certain time of day are more likely to attract comments. We start by calculating the total number of `Ask HN` posts and the total number of comments for each hour of the day.

In [6]:
import datetime as dt

result_list = []

# Append the time created and number of comments for each post
for post in ask_posts:
    result_list.append([post[6], int(post[4])])

# Create dictionaries for counts and comments by hour
counts_by_hour = {}
comments_by_hour = {}

# Parse date and extract hour for each row
for row in result_list:
    date_created = row[0]
    date_dt = dt.datetime.strptime(date_created, "%m/%d/%Y %H:%M")
    hour = date_dt.strftime("%H")
    # Adding count and comment values to dictionary
    if hour not in counts_by_hour:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = row[1]
    else:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += row[1]

comments_by_hour

{'02': 2996,
 '01': 2089,
 '22': 3372,
 '21': 4500,
 '19': 3954,
 '17': 5547,
 '15': 18525,
 '14': 4972,
 '13': 7245,
 '11': 2797,
 '10': 3013,
 '09': 1477,
 '07': 1585,
 '03': 2154,
 '23': 2297,
 '20': 4462,
 '16': 4466,
 '08': 2362,
 '00': 2277,
 '18': 4877,
 '12': 4234,
 '04': 2360,
 '06': 1587,
 '05': 1838}
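The same tally can be written without the `if`/`else` branch by using `collections.defaultdict`, which starts every missing key at zero. The sketch below is an alternative, not the method used above, and it runs on made-up `(created_at, num_comments)` pairs that follow the dataset's timestamp format.

```python
import datetime as dt
from collections import defaultdict

# Made-up (created_at, num_comments) pairs in the dataset's timestamp format
sample_posts = [
    ['9/26/2016 3:26', 4],
    ['9/26/2016 3:02', 6],
    ['8/16/2016 15:10', 29],
]

sample_counts = defaultdict(int)    # posts per hour
sample_comments = defaultdict(int)  # comments per hour

for created_at, num_comments in sample_posts:
    # Parse the timestamp, then keep only the zero-padded hour, e.g. '03'
    parsed = dt.datetime.strptime(created_at, "%m/%d/%Y %H:%M")
    hour = parsed.strftime("%H")
    sample_counts[hour] += 1
    sample_comments[hour] += num_comments

print(dict(sample_counts))
print(dict(sample_comments))
```

Note that `strftime("%H")` returns a zero-padded string, which is why the keys in `comments_by_hour` look like `'02'` rather than `2`.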

## Calculating the Average Number of Comments for Ask HN Posts by Hour

Now that we have the total number of `Ask HN` posts and comments for each hour of the day, we can calculate the average number of comments per `Ask HN` post for each hour.

In [7]:
avg_by_hour = []

for hour in comments_by_hour:
    avg_by_hour.append([hour, (comments_by_hour[hour] / counts_by_hour[hour])])

avg_by_hour

[['02', 11.137546468401487],
 ['01', 7.407801418439717],
 ['22', 8.804177545691905],
 ['21', 8.687258687258687],
 ['19', 7.163043478260869],
 ['17', 9.449744463373083],
 ['15', 28.676470588235293],
 ['14', 9.692007797270955],
 ['13', 16.31756756756757],
 ['11', 8.96474358974359],
 ['10', 10.684397163120567],
 ['09', 6.653153153153153],
 ['07', 7.013274336283186],
 ['03', 7.948339483394834],
 ['23', 6.696793002915452],
 ['20', 8.749019607843136],
 ['16', 7.713298791018998],
 ['08', 9.190661478599221],
 ['00', 7.5647840531561465],
 ['18', 7.94299674267101],
 ['12', 12.380116959064328],
 ['04', 9.7119341563786],
 ['06', 6.782051282051282],
 ['05', 8.794258373205741]]
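Each average is simply the hour's comment total divided by the hour's post count. A dictionary comprehension expresses this in one step. The sketch below uses made-up toy totals and counts rather than the real dictionaries, and the names are hypothetical.

```python
# Toy values for illustration only (not taken from the dataset)
toy_comments_by_hour = {'03': 10, '15': 29}
toy_counts_by_hour = {'03': 2, '15': 1}

# Average comments per post = total comments / number of posts, per hour
toy_avg_by_hour = {
    hour: toy_comments_by_hour[hour] / toy_counts_by_hour[hour]
    for hour in toy_comments_by_hour
}

print(toy_avg_by_hour)
```

A dictionary keeps the hour as a lookup key, whereas the list of lists used above is convenient for the sorting step that follows.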

## Sorting and Printing Values from a List of Lists

In [8]:
swap_avg_by_hour = []

for row in avg_by_hour:
    swap_avg_by_hour.append([row[1], row[0]])

print(swap_avg_by_hour)  # Check if value swap is correct

[[11.137546468401487, '02'], [7.407801418439717, '01'], [8.804177545691905, '22'], [8.687258687258687, '21'], [7.163043478260869, '19'], [9.449744463373083, '17'], [28.676470588235293, '15'], [9.692007797270955, '14'], [16.31756756756757, '13'], [8.96474358974359, '11'], [10.684397163120567, '10'], [6.653153153153153, '09'], [7.013274336283186, '07'], [7.948339483394834, '03'], [6.696793002915452, '23'], [8.749019607843136, '20'], [7.713298791018998, '16'], [9.190661478599221, '08'], [7.5647840531561465, '00'], [7.94299674267101, '18'], [12.380116959064328, '12'], [9.7119341563786, '04'], [6.782051282051282, '06'], [8.794258373205741, '05']]


In [9]:
# Sort new list by descending order of average comments per hour
sorted_swap = sorted(swap_avg_by_hour, reverse=True)

sorted_swap

[[28.676470588235293, '15'],
 [16.31756756756757, '13'],
 [12.380116959064328, '12'],
 [11.137546468401487, '02'],
 [10.684397163120567, '10'],
 [9.7119341563786, '04'],
 [9.692007797270955, '14'],
 [9.449744463373083, '17'],
 [9.190661478599221, '08'],
 [8.96474358974359, '11'],
 [8.804177545691905, '22'],
 [8.794258373205741, '05'],
 [8.749019607843136, '20'],
 [8.687258687258687, '21'],
 [7.948339483394834, '03'],
 [7.94299674267101, '18'],
 [7.713298791018998, '16'],
 [7.5647840531561465, '00'],
 [7.407801418439717, '01'],
 [7.163043478260869, '19'],
 [7.013274336283186, '07'],
 [6.782051282051282, '06'],
 [6.696793002915452, '23'],
 [6.653153153153153, '09']]
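Swapping the columns works because `sorted()` compares lists element by element, starting with the first item. An alternative that skips the swap is to pass a `key` function that picks the value to sort on. The sketch below applies it to three `[hour, average]` pairs copied from `avg_by_hour` above.

```python
# Three [hour, average] pairs copied from avg_by_hour above
sample_avg_by_hour = [
    ['02', 11.137546468401487],
    ['15', 28.676470588235293],
    ['13', 16.31756756756757],
]

# Sort on the second element (the average), largest first
sorted_by_avg = sorted(
    sample_avg_by_hour, key=lambda pair: pair[1], reverse=True
)

print(sorted_by_avg)
```

With this approach the pairs keep their original `[hour, average]` order, so a later loop would unpack them as `for hour, avg in ...`.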

In [10]:
# Print top 5 Hours for Ask Posts Comments
print("Top 5 Hours for Ask Posts Comments")

for avg, hour in sorted_swap[:5]:
    print("{}: {:.2f} average comments per post".format(
        dt.datetime.strptime(hour, "%H").strftime("%H:%M"), avg)
    )

Top 5 Hours for Ask Posts Comments
15:00: 28.68 average comments per post
13:00: 16.32 average comments per post
12:00: 12.38 average comments per post
02:00: 11.14 average comments per post
10:00: 10.68 average comments per post


According to our results, `Ask HN` posts created during the 15:00 hour receive the most comments on average. According to the [documentation](https://www.kaggle.com/hacker-news/hacker-news-posts/), the times in this dataset are given in Eastern Time, so this hour corresponds to 03:00 PM ET.

`Ask HN` posts created during the 03:00 PM ET hour receive an average of 28.68 comments each. That is about 76% more than the hour with the next highest average, 01:00 PM ET, at 16.32 comments per post.
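The percentage comparison between the top two hours can be reproduced from the averages in the sorted list. The variable names below are hypothetical, and the values are copied from the output above.

```python
top_avg = 28.676470588235293    # 15:00 hour
second_avg = 16.31756756756757  # 13:00 hour

# Relative increase of the top hour over the runner-up, in percent
percent_more = (top_avg - second_avg) / second_avg * 100

print(round(percent_more))
```

The result rounds to 76, which matches the figure quoted in the text.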

## Conclusion

In this project, we were able to analyze data from `Ask HN` and `Show HN` posts from the Hacker News website to determine which posts received more comments and which hour of the day received the most comments on average. Our results were:

- On average, `Ask HN` posts receive approximately 5 more comments than `Show HN` posts (about 10.4 versus 4.9).
- `Ask HN` posts created between 03:00 PM and 04:00 PM ET receive the most comments on average, at approximately 28.68 comments per post.

We should note that these numbers may not represent `Ask HN` and `Show HN` posts as a whole, since they describe only the posts in this dataset. The averages also include posts that received no comments, as the zero-comment rows in the preview show, and an average can be pulled upward by a small number of heavily discussed posts. Within those limits, the results hold for the posts analyzed here.