<div>
    <b>Description:</b> Exploring Hacker News Post<br>
    <b>Author:</b> Maika Carmelle Henry Northrop
</div>
<br>

In [3]:
#import modules
import csv
from csv import reader
import pandas as pd
import pprint

# Exploring Y Combinator's Hacker News Posts

In this project, our goal is to explore and analyze a data set from the Hacker News site.  Hacker News was started by the startup incubator Y Combinator. Users submit tech stories, and other users vote and comment on them.  Hacker News is very popular among the tech community and startups.  Posts that make it to the top of its listings typically get hundreds of thousands of unique visits.

We are specifically interested in posts whose titles begin with **Ask HN** or **Show HN**.  Users submit Ask HN posts to ask the large Hacker News community a specific question, and Show HN posts to show the community a project, product or something interesting.  The following are examples of recent posts (March 2019):

* Ask HN: Startup failed after years of work -- Can I even get a job now?
* Show HN: A simple Prolog Interpreter written in a few lines of Python 3
* Ask HN: Recommended Platform for Programming Interviews?
* Show HN: Is running a VM windows 10 on a Linux OS (Ubuntu) secure?

We will compare these two types of posts to determine:

* On average, which type of post receives more comments?
* On average, do posts submitted at a certain time receive more comments?
* On average, which type of post receives more points?
* How do these results compare to the average number of comments and points that other posts receive?
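Each of these questions comes down to averaging a number (comments or points) within a group (post type or hour of submission). A rough sketch of that calculation is shown here; the function name `average_by_group` is hypothetical and the sample values are made up purely for illustration.

```python
from collections import defaultdict


def average_by_group(pairs):
    """Average the numeric values for each label in (label, value) pairs."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for label, value in pairs:
        totals[label] += value
        counts[label] += 1
    return {label: totals[label] / counts[label] for label in totals}


# Made-up (post type, comment count) pairs, for illustration only.
sample = [('ask', 10), ('ask', 4), ('show', 3), ('show', 5), ('other', 20)]
averages = average_by_group(sample)
print(averages)
```

The same helper would work for the time-of-day question by passing (hour, comment count) pairs instead.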

## Collecting the Data

The data set we will be using can be found on Kaggle's website [here](https://www.kaggle.com/hacker-news/hacker-news-posts).  The full data set has almost 300,000 rows.  It can be reduced to approximately 20,000 rows by removing all submissions that did not receive any comments and then randomly sampling from the remaining submissions; in this notebook we load the full file and do the filtering ourselves.  The columns and descriptions are:

* **id** - The unique identifier from Hacker News for the post
* **title** - The title of the post
* **url** - The URL that the post links to, if it has a URL
* **num_points** - The number of points the post acquired, calculated as the total number of upvotes minus the total number of downvotes
* **num_comments** - The number of comments that the post generated
* **author** - The username of the person who submitted the post
* **created_at** - The date and time at which the post was submitted
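The `csv` reader returns every field as a string, so the numeric and date columns need converting before any arithmetic. Here is a small sketch of one way to do that. It assumes `created_at` uses a month/day/year hour:minute layout such as `9/26/2016 3:24`; the helper name `parse_row` is hypothetical.

```python
from datetime import datetime

# Assumed timestamp layout, inferred from values such as '9/26/2016 3:24'.
DATE_FORMAT = '%m/%d/%Y %H:%M'


def parse_row(row):
    """Convert one raw CSV row (a list of strings) into a typed dictionary."""
    post_id, title, url, num_points, num_comments, author, created_at = row
    return {
        'id': int(post_id),
        'title': title,
        'url': url,
        'num_points': int(num_points),
        'num_comments': int(num_comments),
        'author': author,
        'created_at': datetime.strptime(created_at, DATE_FORMAT),
    }


# A sample row taken from the data set.
row = ['12579005', 'SQLAR  the SQLite Archiver',
       'https://www.sqlite.org/sqlar/doc/trunk/README.md',
       '1', '0', 'blacksqr', '9/26/2016 3:24']
post = parse_row(row)
print(post['num_points'], post['num_comments'], post['created_at'].hour)
```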

## Let's explore the Hacker News data set

The following function follows the DRY (Don't Repeat Yourself) principle so that we can repeatedly print rows in a more readable way. It also takes an optional argument that prints the number of rows and columns of the data set.

In [4]:
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line after each row
        
    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

## Let's begin by opening and reading the data set.

In [5]:
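### Open Hacker News Post data set ###
# Use a context manager so the file is closed once it has been read
with open('datasets/hackernews_post_2016.csv', encoding="utf8") as opened_file:
    read_file = reader(opened_file)
    hacker = list(read_file)
hacker_header = hacker[0]  # first row holds the column names
hacker = hacker[1:]        # remaining rows hold the posts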

In [7]:
### Explore Hacker News Post data set
print(hacker_header)
print('\n')
explore_data(hacker, 0, 5, True)  # print first 5 rows

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']


['12579008', 'You have two days to comment if you want stem cells to be classified as your own', 'http://www.regulations.gov/document?D=FDA-2015-D-3719-0018', '1', '0', 'altstar', '9/26/2016 3:26']


['12579005', 'SQLAR  the SQLite Archiver', 'https://www.sqlite.org/sqlar/doc/trunk/README.md', '1', '0', 'blacksqr', '9/26/2016 3:24']


['12578997', 'What if we just printed a flatscreen television on the side of our boxes?', 'https://medium.com/vanmoof/our-secrets-out-f21c1f03fdc8#.ietxmez43', '1', '0', 'pavel_lishin', '9/26/2016 3:19']


['12578989', 'algorithmic music', 'http://cacm.acm.org/magazines/2011/7/109891-algorithmic-composition/fulltext', '1', '0', 'poindontcare', '9/26/2016 3:16']


['12578979', 'How the Data Vault Enables the Next-Gen Data Warehouse and Data Lake', 'https://www.talend.com/blog/2016/05/12/talend-and-Â\x93the-data-vaultÂ\x94', '1', '0', 'markgainor1', '9/26/2016 3:14']


Number of rows: 293119
Number of columns: 7

## Data Set Summary

There are 293,119 posts and 7 columns in this data set.  Below are the columns along with their descriptions:
* id - The unique identifier from Hacker News for the post
* title - The title of the post
* url - The URL that the post links to, assuming the post has a URL
* num_points - The number of points the post acquired, calculated as the total number of upvotes minus the total number of downvotes
* num_comments - The number of comments that the post generated
* author - The username of the person who submitted the post
* created_at - The date and time at which the post was submitted

## Remove submissions that received no comments.

When we explored the data set, we saw a number of submissions that did not receive any comments.  We are specifically interested in posts that are engaging or, at the very least, have at least one comment from the community. Let's trim our data set by removing submissions without comments.
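A minimal sketch of this filtering step is shown here, assuming `num_comments` sits at index 4 as in the header row. The helper name `remove_uncommented` is hypothetical, and the second sample row is made up for illustration.

```python
NUM_COMMENTS = 4  # position of 'num_comments' in the header row


def remove_uncommented(rows):
    """Keep only the rows whose comment count is at least 1."""
    return [row for row in rows if int(row[NUM_COMMENTS]) > 0]


# Small stand-in for the `hacker` list; the second row is made up.
sample_rows = [
    ['12579005', 'SQLAR  the SQLite Archiver',
     'https://www.sqlite.org/sqlar/doc/trunk/README.md',
     '1', '0', 'blacksqr', '9/26/2016 3:24'],
    ['1', 'Ask HN: Example question?', '',
     '5', '3', 'example_user', '9/26/2016 3:30'],
]
commented = remove_uncommented(sample_rows)
print('Rows before:', len(sample_rows))
print('Rows after:', len(commented))
```

On the real data this would be called as `hacker = remove_uncommented(hacker)`. Building a new list rather than deleting rows in place avoids skipping elements while iterating.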