# Exploring Hacker News Posts

![Image](https://s3.amazonaws.com/dq-content/354/hacker_news.jpg)

This project analyzes the Hacker News Posts (HN) dataset, a collection of submissions to the technology site [Hacker News](http://news.ycombinator.com/). Hacker News is a site where user-submitted stories (known as "posts") receive votes and comments and, depending on the number of positive or negative votes, move up or down in the feed of posts, similar to Reddit.

You can find the Hacker News Posts dataset [here](http://www.kaggle.com/hacker-news/hacker-news-posts), but note that it has been reduced from almost 300,000 rows to approximately 20,000 rows by removing all submissions that didn't receive any comments and then randomly sampling from the remaining submissions. Below are descriptions of the columns:
- `id`: the unique identifier from Hacker News for the post;
- `title`: the title of the post;
- `url`: the URL that the post links to, if the post has a URL;
- `num_points`: the number of points the post acquired, calculated as the total number of upvotes minus the total number of downvotes;
- `num_comments`: the number of comments on the post;
- `author`: the username of the person who submitted the post;
- `created_at`: the date and time of the post's submission.

We are specifically interested in posts with titles that begin with either "Ask HN" or "Show HN". Users submit "Ask HN" posts to ask the Hacker News community a specific question. Likewise, users submit "Show HN" posts to show the Hacker News community a project, product, or just something interesting. We will compare these two types of posts to determine the following:
- Do "Ask HN" or "Show HN" posts receive more comments on average?
- Do posts created at a certain time receive more comments on average?

### Opening and exploring the data
First, to work with the information stored in the dataset, we extract it from a `CSV` file and assign it to the `HN` variable. To do so, we import `reader` from the `csv` module and define the `extract_data` function, which takes a single argument, `directory` (the path to the file). The function returns the contents of the dataset as a list of lists.

In [1]:
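from csv import reader

def extract_data(directory):
    # Open the file in a context manager so it is closed after reading.
    with open(directory, encoding="utf8", newline="") as opened_dataset:
        return list(reader(opened_dataset))

# A raw string keeps the Windows-style backslashes in the path literal.
HN = extract_data(r'..\..\Datasets\P2_Exploring_Hacker_News_Posts\hacker_news.csv')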
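
To make the "list of lists" format concrete, here is a minimal sketch that applies the same `reader` call to a small in-memory CSV instead of the real file. The sample rows and the names `sample_csv` and `rows` are made up for illustration and are not part of the dataset.

```python
from csv import reader
from io import StringIO

# Hypothetical two-line sample standing in for hacker_news.csv.
sample_csv = StringIO(
    'id,title,num_comments\n'
    '1,"Ask HN: How do you learn, really?",12\n'
)

rows = list(reader(sample_csv))

# Each line of the file becomes one inner list of strings.
# The comma inside the quoted title does not split the field.
print(rows[0])  # ['id', 'title', 'num_comments']
print(rows[1])  # ['1', 'Ask HN: How do you learn, really?', '12']
```

Note that every value comes back as a string, including numeric columns such as `num_comments`, so they need to be converted before any arithmetic.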


To take a first look at the data from the Hacker News dataset, we will write the `explore_data` function, which takes 4 arguments:
1. `dataset` - the dataset to explore, as a list of lists.
2. `start` - the index of the first row to display.
3. `end` - the index at which to stop; the row at this index is not displayed.
4. `rows_and_columns` - indicates whether to also print the total number of rows and columns in the dataset. The argument is `False` by default.

In [2]:
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n')
    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))
        print('\n')

Let's use the `explore_data` function to display the header row and the first five rows of data.

In [3]:
explore_data(HN, 0, 6)

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']


['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52']


['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30']


['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20']


['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']


['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']




Notice that the first inner list contains the column headers. In the next step we slice off this first row and assign it to the variable `headers`. Because `HN[:1]` is a slice, `headers` is a list that contains the header row.

In [4]:
headers = HN[:1]
print(headers)

[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']]
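
The double brackets in the output come from slicing rather than indexing. The following sketch shows the difference on a made-up two-row table; `table`, `sliced`, and `indexed` are illustrative names only.

```python
# Hypothetical miniature dataset: a header row followed by one data row.
table = [
    ['id', 'title'],
    ['12224879', 'Interactive Dynamic Video'],
]

sliced = table[:1]   # slice: a list that contains the header row
indexed = table[0]   # index: the header row itself

print(sliced)                  # [['id', 'title']]
print(indexed)                 # ['id', 'title']
print(indexed.index('title'))  # 1
```

Either form works for inspecting the headers, but column lookups such as `index('title')` need the flat list, which would be `headers[0]` with the slice used above.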


In the next step we will remove the header row from the `HN` list and use the `explore_data` function to display the first five rows, verifying that the header row was removed properly.

In [5]:
HN = HN[1:]
explore_data(HN, 0, 5)

['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52']


['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30']


['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20']


['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']


['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']




Since we are only concerned with post titles beginning with "Ask HN" or "Show HN", we will create new lists of lists containing just the data for those titles. To find the posts that begin with either "Ask HN" or "Show HN", we will use the string method `startswith`. The method returns `True` if the string it is called on starts with the substring given as an argument. Notice that capitalization matters, so we will also use the string method `lower`, which returns a lowercase version of the original string. Let's use these methods to separate posts beginning with "Ask HN" and "Show HN" (and case variations) into two different lists.
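
As a quick illustration of why `lower` is needed, this sketch compares a case-sensitive check with a lowercased one. The titles are invented for the example and do not come from the dataset.

```python
# Hypothetical titles chosen to show each case.
titles = [
    'Ask HN: Best way to learn Python?',
    'ASK HN: Anyone using Rust in production?',
    'Show HN: My weekend project',
    'Why I ask HN for advice',
]

results = []
for title in titles:
    exact = title.startswith('Ask HN')               # case-sensitive
    normalized = title.lower().startswith('ask hn')  # case-insensitive
    results.append((exact, normalized))
    print(exact, normalized, title)
```

Only the first title passes the case-sensitive check, while the first two pass the lowercased one. The last title contains "ask HN" but does not start with it, so both checks return `False`.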

In [6]:
ask_posts = []
show_posts = []
other_posts = []
for data in HN:
    title = data[1]
    if title.lower().startswith('ask hn'):
        ask_posts.append(data)
    elif title.lower().startswith('show hn'):
        show_posts.append(data)
    else:
        other_posts.append(data)
print("The number of posts in the list 'ask_posts' is {list}".format(list = len(ask_posts)), "\n")
print("The number of posts in the list 'show_posts' is {list}".format(list = len(show_posts)), "\n")
print("The number of posts in the list 'other_posts' is {list}".format(list = len(other_posts)), "\n")

The number of posts in the list 'ask_posts' is 1744 

The number of posts in the list 'show_posts' is 1162 

The number of posts in the list 'other_posts' is 17194 
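
The three counts above add up to 20,100 (1744 + 1162 + 17194), which should match `len(HN)` because every row lands in exactly one list. A minimal sketch of that sanity check on invented titles follows; `classify_title` is a hypothetical helper that mirrors the `if`/`elif`/`else` logic above and is not part of the original notebook.

```python
def classify_title(title):
    """Return 'ask', 'show', or 'other' for a post title (hypothetical helper)."""
    lowered = title.lower()
    if lowered.startswith('ask hn'):
        return 'ask'
    if lowered.startswith('show hn'):
        return 'show'
    return 'other'

# Invented titles standing in for the title column of HN.
sample_titles = [
    'Ask HN: What are you reading?',
    'show hn: A tiny static site generator',
    'Interactive Dynamic Video',
    'Ask HN: How do you back up your data?',
]

counts = {'ask': 0, 'show': 0, 'other': 0}
for title in sample_titles:
    counts[classify_title(title)] += 1

# Every title falls into exactly one bucket, so the counts cover all rows.
assert sum(counts.values()) == len(sample_titles)
print(counts)  # {'ask': 2, 'show': 1, 'other': 1}
```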

