# Exploring Hacker News Posts

[Hacker News](https://news.ycombinator.com/) is a popular site where users submit technology-related posts that other users can vote and comment on.

We are going to work with a dataset of submissions to Hacker News, specifically those with titles that begin with either `Ask HN` or `Show HN`. Users submit `Ask HN` posts to ask the community a question, such as "How can I further improve the accessibility of my website?". Users submit `Show HN` posts to show the community a project, product or just generally something interesting.

We'll compare these two types of posts to determine the following:

- Do `Ask HN` or `Show HN` posts receive more comments on average?
- Do posts created at a certain time receive more comments on average?

The dataset can be found [here](https://www.kaggle.com/datasets/hacker-news/hacker-news-posts). Note that it has been reduced from approximately 300,000 rows to approximately 20,000 rows for this project: all submissions that did not receive any comments were removed, and the remaining submissions were then randomly sampled.
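
The reduction described above can be sketched roughly as follows. This is only an illustration: the exact procedure and any random seed used to produce the project file are not documented here, so `reduce_rows`, `sample_size` and `seed` are hypothetical names, and the demo rows are made up.

```python
import random


def reduce_rows(rows, sample_size, seed=0):
    """Drop rows without comments, then draw a random sample (hypothetical helper)."""
    # Index 4 is the num_comments column in this dataset's layout
    commented = [row for row in rows if int(row[4]) > 0]
    # Never ask for more rows than are available
    k = min(sample_size, len(commented))
    # A seeded generator keeps the sample reproducible
    return random.Random(seed).sample(commented, k)


# Tiny made-up rows, for illustration only
demo_rows = [
    ["1", "a", "", "5", "0", "u1", "1/1/2016 0:00"],
    ["2", "b", "", "3", "2", "u2", "1/1/2016 1:00"],
    ["3", "c", "", "9", "7", "u3", "1/1/2016 2:00"],
]
demo_sample = reduce_rows(demo_rows, sample_size=2)
print(len(demo_sample))
```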

## Introduction

First, we'll read in the data, transform it into a list of lists and then display the first few rows. 

In [79]:
from csv import reader

# Read in the data and transform it into a list of lists.
# The with block closes the file once it has been read.
with open("hacker_news.csv") as opened_file:
    read_file = reader(opened_file)
    hn = list(read_file)

hn[:5]


[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'],
 ['12224879',
  'Interactive Dynamic Video',
  'http://www.interactivedynamicvideo.com/',
  '386',
  '52',
  'ne0phyte',
  '8/4/2016 11:52'],
 ['10975351',
  'How to Use Open Source and Shut the Fuck Up at the Same Time',
  'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/',
  '39',
  '10',
  'josep2',
  '1/26/2016 19:30'],
 ['11964716',
  "Florida DJs May Face Felony for April Fools' Water Joke",
  'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/',
  '2',
  '1',
  'vezycash',
  '6/23/2016 22:20'],
 ['11919867',
  'Technology ventures: From Idea to Enterprise',
  'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429',
  '3',
  '1',
  'hswarna',
  '6/17/2016 0:01']]

## Removing headers

We can see that the first row contains our column headers:

- `id`: the unique identifier for the post
- `title`: the title of the post
- `url`: the URL that the post links to, if it has a URL
- `num_points`: the number of points the post has acquired, calculated as the total number of upvotes minus the total number of downvotes
- `num_comments`: the number of comments made on the post
- `author`: the username of the person that submitted the post
- `created_at`: the date and time at which the post was submitted

In order to analyse our data, we will separate the row containing the column headers from the rest of the dataset.

In [80]:
headers = hn[0]
hn = hn[1:]

print(headers, "\n")
hn[:5]

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'] 



[['12224879',
  'Interactive Dynamic Video',
  'http://www.interactivedynamicvideo.com/',
  '386',
  '52',
  'ne0phyte',
  '8/4/2016 11:52'],
 ['10975351',
  'How to Use Open Source and Shut the Fuck Up at the Same Time',
  'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/',
  '39',
  '10',
  'josep2',
  '1/26/2016 19:30'],
 ['11964716',
  "Florida DJs May Face Felony for April Fools' Water Joke",
  'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/',
  '2',
  '1',
  'vezycash',
  '6/23/2016 22:20'],
 ['11919867',
  'Technology ventures: From Idea to Enterprise',
  'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429',
  '3',
  '1',
  'hswarna',
  '6/17/2016 0:01'],
 ['10301696',
  'Note by Note: The Making of Steinway L1037 (2007)',
  'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0',
  '8',
  '2',
  'walterbell',
  '9/30/2015 4:12']]

## Extracting Ask HN and Show HN posts

Next, we'll separate the `Ask HN` and `Show HN` posts into their own lists.

We'll do this by checking whether the `title` of each post starts with `Ask HN` or `Show HN`. To account for case variations, we'll convert the `title` to lowercase before checking for a match.

We'll then display the number of posts in each of our new lists.

In [81]:
ask_posts = []
show_posts = []
other_posts = []

for row in hn:
    title = row[1].lower()
    if title.startswith("ask hn"):
        ask_posts.append(row)
    elif title.startswith("show hn"):
        show_posts.append(row)
    else:
        other_posts.append(row)

print(f"There are {len(ask_posts)} Ask HN posts")
print(f"There are {len(show_posts)} Show HN posts")
print(f"There are {len(other_posts)} other posts")

There are 1744 Ask HN posts
There are 1162 Show HN posts
There are 17194 other posts


## Calculating the average number of comments for Ask HN and Show HN posts

Now that the posts are separated, we'll sum the `num_comments` values for each type of post and divide by the number of posts of that type, starting with `Ask HN`.

In [82]:
total_ask_comments = 0

for post in ask_posts:
    num_comments = int(post[4])
    total_ask_comments += num_comments

avg_ask_comments = total_ask_comments / len(ask_posts)
print(avg_ask_comments)

14.038417431192661


In [83]:
total_show_comments = 0

for post in show_posts:
    num_comments = int(post[4])
    total_show_comments += num_comments

avg_show_comments = total_show_comments / len(show_posts)
print(avg_show_comments)

10.31669535283993


We find that on average, `Ask HN` posts receive more comments *(~14)* than `Show HN` posts *(~10)*.
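
The two cells above repeat the same logic, so it could be factored into a small helper. The following is a minimal sketch, not part of the original analysis: `average_comments` and `comments_index` are hypothetical names, and the demo rows are made up to keep the example self-contained.

```python
def average_comments(posts, comments_index=4):
    """Return the mean number of comments across posts (hypothetical helper)."""
    # Avoid dividing by zero when the list is empty
    if not posts:
        return 0.0
    total = sum(int(post[comments_index]) for post in posts)
    return total / len(posts)


# Made-up rows in the same column layout as the dataset
demo_posts = [
    ["1", "Ask HN: example one", "", "3", "10", "user_a", "1/1/2016 9:00"],
    ["2", "Ask HN: example two", "", "5", "4", "user_b", "1/1/2016 10:00"],
]
demo_average = average_comments(demo_posts)
print(demo_average)  # 7.0
```

With the lists built earlier, `average_comments(ask_posts)` and `average_comments(show_posts)` should give the same values as the cells above.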

## Finding the number of Ask HN posts and comments by hour created

Next, we'll determine if `Ask HN` posts created at a certain time are more likely to attract comments. 

First, we'll count the number of `Ask HN` posts created in each hour of the day, along with the total number of comments those posts received. We will then display our results.

In [84]:
import datetime as dt

result_list = []

for post in ask_posts:
    created_at = post[6]
    num_comments = int(post[4])
    result_list.append([created_at, num_comments])

posts_by_hour = {}
comments_by_hour = {}

for result in result_list:
    created_at = result[0]
    num_comments = result[1]
    hour = dt.datetime.strptime(created_at, "%m/%d/%Y %H:%M").strftime("%H")

    if hour in posts_by_hour:
        posts_by_hour[hour] += 1
        comments_by_hour[hour] += num_comments
    else:
        posts_by_hour[hour] = 1
        comments_by_hour[hour] = num_comments

posts_by_hour

{'09': 45,
 '13': 85,
 '10': 59,
 '14': 107,
 '16': 108,
 '23': 68,
 '12': 73,
 '17': 100,
 '15': 116,
 '21': 109,
 '20': 80,
 '02': 58,
 '18': 109,
 '03': 54,
 '05': 46,
 '19': 110,
 '01': 60,
 '22': 71,
 '08': 48,
 '04': 47,
 '00': 55,
 '06': 44,
 '07': 34,
 '11': 58}

In [85]:
comments_by_hour

{'09': 251,
 '13': 1253,
 '10': 793,
 '14': 1416,
 '16': 1814,
 '23': 543,
 '12': 687,
 '17': 1146,
 '15': 4477,
 '21': 1745,
 '20': 1722,
 '02': 1381,
 '18': 1439,
 '03': 421,
 '05': 464,
 '19': 1188,
 '01': 683,
 '22': 479,
 '08': 492,
 '04': 337,
 '00': 447,
 '06': 397,
 '07': 267,
 '11': 641}
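
As a side note, the tallying loop above could also be written with `collections.defaultdict`, which removes the need for the `if`/`else` branch. This is a sketch of an alternative, not the approach used in this notebook: `tally_by_hour` and `date_format` are hypothetical names, and the demo rows are made up.

```python
import datetime as dt
from collections import defaultdict


def tally_by_hour(posts, date_format="%m/%d/%Y %H:%M"):
    """Count posts and sum comments per creation hour (hypothetical helper)."""
    posts_per_hour = defaultdict(int)
    comments_per_hour = defaultdict(int)
    for post in posts:
        # Parse created_at (index 6) and keep only the zero-padded hour
        hour = dt.datetime.strptime(post[6], date_format).strftime("%H")
        posts_per_hour[hour] += 1
        comments_per_hour[hour] += int(post[4])
    return dict(posts_per_hour), dict(comments_per_hour)


# Made-up rows in the same column layout as the dataset
demo_posts = [
    ["1", "Ask HN: example one", "", "3", "10", "user_a", "8/4/2016 11:52"],
    ["2", "Ask HN: example two", "", "5", "4", "user_b", "8/5/2016 11:05"],
    ["3", "Ask HN: example three", "", "1", "2", "user_c", "6/17/2016 0:01"],
]
demo_post_counts, demo_comment_counts = tally_by_hour(demo_posts)
print(demo_post_counts)     # {'11': 2, '00': 1}
print(demo_comment_counts)  # {'11': 14, '00': 2}
```

A missing key in a `defaultdict(int)` starts at `0`, so each hour can be incremented without first checking whether it has been seen.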