# Exploring Hacker News Posts

[Hacker News](https://news.ycombinator.com/) is a popular site where users submit technology-related posts that other users can vote and comment on.

We are going to work with a dataset of submissions to Hacker News, specifically those with titles that begin with either `Ask HN` or `Show HN`. Users submit `Ask HN` posts to ask the community a question, such as "How can I further improve the accessibility of my website?". Users submit `Show HN` posts to show the community a project, product or just generally something interesting.

We'll compare these two types of posts to determine the following:

- Do `Ask HN` or `Show HN` posts receive more comments on average?
- Do posts created at a certain time receive more comments on average?

The dataset is available on [Kaggle](https://www.kaggle.com/datasets/hacker-news/hacker-news-posts). Note that it has been reduced from approximately 300,000 rows to approximately 20,000 rows for this project, first by removing all submissions that received no comments and then by randomly sampling from the remaining submissions.
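The reduction described above could be reproduced along the following lines. This is only a sketch: the function name `reduce_rows`, the fixed seed, the toy rows, and the assumption that the comment count sits at index 4 of each row are illustrative and do not come from how the dataset was actually prepared.

```python
import random


def reduce_rows(rows, sample_size, seed=0):
    """Drop rows with zero comments, then randomly sample from the rest.

    Hypothetical helper: assumes each row stores its comment count
    as a string at index 4.
    """
    with_comments = [row for row in rows if int(row[4]) > 0]
    rng = random.Random(seed)  # fixed seed so the sample is repeatable
    k = min(sample_size, len(with_comments))
    return rng.sample(with_comments, k)


# Toy rows made up for illustration (not taken from the real dataset)
toy_rows = [
    ["1", "Post A", "", "5", "0", "user_a", "1/1/2016 10:00"],
    ["2", "Post B", "", "7", "3", "user_b", "1/2/2016 11:00"],
    ["3", "Post C", "", "1", "0", "user_c", "1/3/2016 12:00"],
    ["4", "Post D", "", "9", "12", "user_d", "1/4/2016 13:00"],
    ["5", "Post E", "", "2", "1", "user_e", "1/5/2016 14:00"],
]

reduced = reduce_rows(toy_rows, sample_size=2)
print(len(reduced))
```

Filtering before sampling matters here: sampling first and filtering afterwards would yield an unpredictable number of rows.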

## Introduction

First, we'll read in the data, transform it into a list of lists and then display the first few rows. 

In [19]:
from csv import reader

# Read in the data and transform it into a list of lists.
# The with statement closes the file once reading is finished.
with open("hacker_news.csv") as opened_file:
    read_file = reader(opened_file)
    hn = list(read_file)

hn[:5]

[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'],
 ['12224879',
  'Interactive Dynamic Video',
  'http://www.interactivedynamicvideo.com/',
  '386',
  '52',
  'ne0phyte',
  '8/4/2016 11:52'],
 ['10975351',
  'How to Use Open Source and Shut the Fuck Up at the Same Time',
  'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/',
  '39',
  '10',
  'josep2',
  '1/26/2016 19:30'],
 ['11964716',
  "Florida DJs May Face Felony for April Fools' Water Joke",
  'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/',
  '2',
  '1',
  'vezycash',
  '6/23/2016 22:20'],
 ['11919867',
  'Technology ventures: From Idea to Enterprise',
  'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429',
  '3',
  '1',
  'hswarna',
  '6/17/2016 0:01']]

We can see that the first row contains our column headers:

- `id`: the unique identifier for the post
- `title`: the title of the post
- `url`: the URL that the post links to, if it has a URL
- `num_points`: the number of points the post has acquired, calculated as the total number of upvotes minus the total number of downvotes
- `num_comments`: the number of comments made on the post
- `author`: the username of the person who submitted the post
- `created_at`: the date and time at which the post was submitted
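Every value in a row is read in as a string, so numeric and date columns need converting before they can be used in calculations. The sketch below pairs the header with the first data row shown above and converts three fields. The variable name `post` and the format string `"%m/%d/%Y %H:%M"` are assumptions, the latter inferred from sample values such as `6/23/2016 22:20`.

```python
from datetime import datetime

headers = ['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
row = ['12224879', 'Interactive Dynamic Video',
       'http://www.interactivedynamicvideo.com/',
       '386', '52', 'ne0phyte', '8/4/2016 11:52']

# Pair each column name with its value so fields can be looked up by name
post = dict(zip(headers, row))

# Convert the numeric columns from strings to integers
post['num_points'] = int(post['num_points'])
post['num_comments'] = int(post['num_comments'])

# Parse the timestamp (assumed format: month/day/year hour:minute)
post['created_at'] = datetime.strptime(post['created_at'], "%m/%d/%Y %H:%M")

print(post['num_comments'], post['created_at'].hour)
```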

To analyse our data, we will separate the row containing the column headers from the rest of the dataset.

In [20]:
headers = hn[0]
hn = hn[1:]

print(headers, "\n")
hn[:5]

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'] 



[['12224879',
  'Interactive Dynamic Video',
  'http://www.interactivedynamicvideo.com/',
  '386',
  '52',
  'ne0phyte',
  '8/4/2016 11:52'],
 ['10975351',
  'How to Use Open Source and Shut the Fuck Up at the Same Time',
  'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/',
  '39',
  '10',
  'josep2',
  '1/26/2016 19:30'],
 ['11964716',
  "Florida DJs May Face Felony for April Fools' Water Joke",
  'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/',
  '2',
  '1',
  'vezycash',
  '6/23/2016 22:20'],
 ['11919867',
  'Technology ventures: From Idea to Enterprise',
  'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429',
  '3',
  '1',
  'hswarna',
  '6/17/2016 0:01'],
 ['10301696',
  'Note by Note: The Making of Steinway L1037 (2007)',
  'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0',
  '8',
  '2',
  'walterbell',
  '9/30/2015 4:12']]
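With the header row removed, every remaining row is a post, so the rows can be grouped by title prefix. The following is a minimal sketch of one way to do that, not necessarily the approach this project takes: the function name `split_by_prefix`, the case-insensitive matching, and the toy rows are all assumptions made for illustration.

```python
def split_by_prefix(rows):
    """Group rows into Ask HN, Show HN and other posts by title prefix.

    Hypothetical helper: assumes the title is at index 1 of each row.
    """
    ask_posts, show_posts, other_posts = [], [], []
    for row in rows:
        title = row[1].lower()  # compare case-insensitively
        if title.startswith("ask hn"):
            ask_posts.append(row)
        elif title.startswith("show hn"):
            show_posts.append(row)
        else:
            other_posts.append(row)
    return ask_posts, show_posts, other_posts


# Toy rows made up for illustration (not taken from the real dataset)
sample_rows = [
    ["1", "Ask HN: How can I improve my website?", "", "4", "6", "user_a", "1/1/2016 10:00"],
    ["2", "Show HN: A small side project", "", "9", "2", "user_b", "1/2/2016 11:00"],
    ["3", "Interactive Dynamic Video", "", "386", "52", "user_c", "8/4/2016 11:52"],
    ["4", "ask hn: lowercase prefix example", "", "1", "1", "user_d", "1/3/2016 12:00"],
]

ask_posts, show_posts, other_posts = split_by_prefix(sample_rows)
print(len(ask_posts), len(show_posts), len(other_posts))
```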