# Hacker News Pipeline

## Introduction

In this guided project, we will take the pipeline we have been building and apply it to a real-world data pipeline project. Starting from data returned by a JSON API, we will filter, clean, aggregate, and summarize it in a sequence of tasks, each applying one of these transformations.

The data we will use comes from a [Hacker News](https://news.ycombinator.com/) (HN) API that returns JSON data of the top stories in 2014.

To make things easier, we have already downloaded a list of JSON posts to a file called _hn_stories_2014.json_. The JSON file contains a single key **stories**, which contains a list of stories (posts). Each post has a set of keys, but we will deal only with the following keys:
- **created_at**: A timestamp of the story's creation time.
- **created_at_i**: A Unix epoch timestamp.
- **url**: The URL of the story link.
- **objectID**: The ID of the story.
- **author**: The story's author (username on HN).
- **points**: The number of upvotes the story had.
- **title**: The headline of the post.
- **num_comments**: The number of comments the post has.
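
For orientation, a single story might look like the dictionary below. This is only a sketch: the keys match the list above, but every value is invented for illustration, and it assumes that **created_at** and **created_at_i** describe the same instant in UTC.

```python
import datetime as dt

# Hypothetical story record: keys follow the list above, values are made up.
sample_story = {
    "created_at": "2014-05-29T08:25:40Z",
    "created_at_i": 1401351940,
    "url": "https://example.com/post",
    "objectID": "0000001",
    "author": "example_user",
    "points": 120,
    "title": "An Example Headline",
    "num_comments": 45,
}

# Parse the string timestamp (the same format string is used later in the project).
created = dt.datetime.strptime(sample_story["created_at"], "%Y-%m-%dT%H:%M:%SZ")

# Convert the epoch seconds to a timezone-aware UTC datetime.
from_epoch = dt.datetime.fromtimestamp(sample_story["created_at_i"], tz=dt.timezone.utc)
```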

Using this dataset, we will run a sequence of basic natural language processing tasks using **our Pipeline class**. The goal is to find the top 100 keywords of Hacker News posts in 2014. Because Hacker News is one of the most popular technology-focused social news sites, this should give us a sense of the most talked-about tech topics of 2014.

## Importing files and libraries

In [1]:
# Import the Pipeline class and the build_csv helper from the local pipeline.py file
from pipeline import Pipeline, build_csv
from stop_words import stop_words

import json
import io
import datetime as dt
import csv
import string
import re

## Initializing the pipeline

In [2]:
pipeline = Pipeline()
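
The Pipeline class itself lives in pipeline.py and is not shown in this notebook. As a rough mental model only, the sketch below shows one way a class with the same interface could work: a `task()` decorator that records each function and its dependency, and a `run()` method that returns a dictionary keyed by task function. The name `SketchPipeline` and its internals are assumptions for illustration; the real class may be implemented differently.

```python
class SketchPipeline:
    """Hypothetical stand-in for the imported Pipeline class (assumed behavior)."""

    def __init__(self):
        self.tasks = []    # task functions, in registration order
        self.depends = {}  # task function -> the task it depends on (or None)

    def task(self, depends_on=None):
        # Returns a decorator that registers the function and leaves it unchanged.
        def decorator(func):
            self.tasks.append(func)
            self.depends[func] = depends_on
            return func
        return decorator

    def run(self):
        # Assumes every task is registered after the task it depends on,
        # so registration order is already a valid execution order.
        outputs = {}
        for func in self.tasks:
            parent = self.depends[func]
            if parent is None:
                outputs[func] = func()
            else:
                outputs[func] = func(outputs[parent])
        return outputs


demo = SketchPipeline()

@demo.task()
def numbers():
    return [1, 2, 3]

@demo.task(depends_on=numbers)
def doubled(values):
    return [v * 2 for v in values]

demo_results = demo.run()
```

Here `demo_results[doubled]` holds the output of the last task, which mirrors how the results of this project are read at the end.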

## Loading the JSON Data

We'll start the project by loading the JSON file data into Python. Because JSON files resemble a key-value dictionary, the goal is to parse the JSON file into a Python dict object. We can accomplish this using the **json module**.

To load a file, the json module exposes a function called json.load(), which takes a Python file object as its first argument. Using json.load(), we'll load the hn_stories_2014.json file as a Python dict.

In [3]:
@pipeline.task()
def file_to_json():
    with open("my_datasets/hn_stories_2014.json","r") as f:
        data = json.load(f)
    return data["stories"]
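
To try json.load() without the dataset, the sketch below feeds it an in-memory file object. The JSON content is a made-up miniature of the layout described earlier (a single **stories** key holding a list), not real data.

```python
import io
import json

# Hypothetical miniature of the file layout: one "stories" key holding a list.
fake_file = io.StringIO('{"stories": [{"title": "Example", "points": 10}]}')

# json.load() accepts any file-like object with a read() method.
parsed = json.load(fake_file)
stories = parsed["stories"]
```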

## Filtering stories

Let's start by filtering the list of stories to get the most popular stories of the year.

As on any social link aggregator, users can post whatever content they want. We want the most popular stories so that we select those that were most talked about during the year. We can filter for popular stories by keeping only those that are links (not Ask HN posts), have a good number of points (more than 50), and have some comments (more than one).

In [4]:
@pipeline.task(depends_on=file_to_json)
def filter_stories(stories):
    def is_interesting(story):
        # Keep popular link posts: more than 50 points, more than one comment,
        # and a title that does not start with "Ask HN".
        return (
            story["points"] > 50
            and story["num_comments"] > 1
            and not story["title"].startswith("Ask HN")
        )
    return (story for story in stories if is_interesting(story))
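
To see the filter rule in isolation, here is a small sketch that applies the same three conditions to a handful of invented stories. The sample records are hypothetical and carry only the keys the rule needs.

```python
def is_interesting(story):
    # Same conditions as the task above: points, comments, and not an Ask HN post.
    return (
        story["points"] > 50
        and story["num_comments"] > 1
        and not story["title"].startswith("Ask HN")
    )

# Hypothetical sample stories, made up for illustration.
samples = [
    {"title": "Example launch", "points": 120, "num_comments": 30},
    {"title": "Ask HN: Example question?", "points": 200, "num_comments": 80},
    {"title": "Quiet post", "points": 12, "num_comments": 0},
    {"title": "Popular but undiscussed", "points": 90, "num_comments": 1},
]

kept = [s["title"] for s in samples if is_interesting(s)]
```

Only the first sample passes all three checks. Note that the task returns a generator expression rather than a list, so stories are filtered lazily and the result can be iterated only once.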

## Convert to CSV

With a reduced set of stories, it's time to write these dict objects to a CSV file (here, an in-memory io.StringIO buffer). Translating the dictionaries to CSV gives us a consistent data format for the later summarization tasks. Keeping data formats consistent makes each pipeline task easier to adapt to future requirements.

In [5]:
@pipeline.task(depends_on=filter_stories)
def json_to_csv(stories):
    header = ['objectID', 'created_at', 'url', 'points', 'title']
    
    def get_fields(stories, header):
        for story in stories:
            new_story = []
            for field in header:
                if field != "created_at":
                    new_story.append(story[field])
                else:
                    # Parse the timestamp string into a datetime object.
                    date = dt.datetime.strptime(story[field], "%Y-%m-%dT%H:%M:%SZ")
                    new_story.append(date)
            yield new_story
            
    result = build_csv(get_fields(stories, header), header=header, file=io.StringIO())
    return result
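
The build_csv helper is imported from pipeline.py and its source is not shown here. The sketch below is one plausible shape for such a helper, inferred only from how it is called above (rows, a `header` keyword, and a `file` keyword). The name `build_csv_sketch` and its behavior, including rewinding the file before returning it, are assumptions rather than the actual implementation.

```python
import csv
import io

def build_csv_sketch(lines, header=None, file=None):
    """Hypothetical version of build_csv; the real helper may differ."""
    if file is None:
        file = io.StringIO()
    writer = csv.writer(file, delimiter=",")
    if header:
        writer.writerow(header)
    # writerows() accepts any iterable of rows, including a generator.
    writer.writerows(lines)
    # Rewind so that the next task can read from the beginning.
    file.seek(0)
    return file

demo_csv = build_csv_sketch([["1", "Example title"]], header=["objectID", "title"])
```

Rewinding matters in a design like this: without `file.seek(0)`, a reader in the next task would start at the end of the buffer and see no rows.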

## Extract Title Column

Using the CSV file format we created in the previous task, we can now extract the title column. Once we have extracted the titles of each popular post, we can then run the next word frequency task.

The steps are:
1. Import csv, and create a csv.reader() object from the file object.
2. Find the index of the title in the header.
3. Iterate through the reader, and return the item at the title index position from each row.

In [6]:
@pipeline.task(depends_on=json_to_csv)
def extract_titles(csv_file):
    read = csv.reader(csv_file, delimiter=",")
    header = next(read)
    idx = header.index("title")
    return (row[idx] for row in read)
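
The same three steps can be tried on a small in-memory CSV. The rows below are invented for illustration; the second title contains a comma to show that csv.reader handles quoted fields.

```python
import csv
import io

# Hypothetical CSV content with a quoted field that contains a comma.
csv_text = io.StringIO(
    'objectID,title\r\n1,First example\r\n2,"Second, with comma"\r\n'
)

reader = csv.reader(csv_text)
header = next(reader)              # the first row is the header
title_idx = header.index("title")  # position of the title column
titles = [row[title_idx] for row in reader]
```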

## Clean the Titles

Because we're trying to create a word frequency model of words from Hacker News titles, we need a way to create a consistent set of words to use. For example, words like _Google_, _google_, _GooGle?_, and _google._ all refer to the same keyword: **google**. If we simply split the titles into words, however, each variant would be counted as a different word.

To clean the titles, we lowercase them and remove the punctuation. An easy way to rid a string of punctuation is to check each character and keep it only if it is not a punctuation character.

The **string** module provides a handy constant, string.punctuation, that contains all the ASCII punctuation characters.

In [7]:
@pipeline.task(depends_on=extract_titles)
def clean_titles(titles):
    for title in titles:
        title = title.lower()
        title = ''.join(c for c in title if c not in string.punctuation)
        yield title
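
As a quick check of the cleaning rule, the sketch below applies it to the variants mentioned above. It also shows an alternative using str.translate(), which is a swapped-in technique and not what the task itself uses. The helper name `clean` is chosen here for illustration.

```python
import string

def clean(title):
    # Lowercase, then drop every ASCII punctuation character.
    title = title.lower()
    return "".join(c for c in title if c not in string.punctuation)

variants = ["Google", "google", "GooGle?", "google."]
cleaned = [clean(v) for v in variants]

# Alternative: build a translation table that deletes punctuation.
table = str.maketrans("", "", string.punctuation)
translated = [v.lower().translate(table) for v in variants]

# Apostrophes count as punctuation, so contractions are merged into one token.
contraction = clean("Don't")
```

One side effect worth knowing: because the apostrophe is removed, a word such as "Don't" becomes "dont", which is why that token can show up in the final keyword list.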

## Create the Word Frequency Dictionary

With the titles cleaned, we can now build the word frequency dictionary. However, to find actual keywords, the dictionary should exclude **stop words**. Stop words are words that occur frequently in a language, such as "the" and "or", and are commonly excluded from keyword searches.

We have included a module called stop_words with a tuple of the most commonly used stop words in the English language.

In [8]:
@pipeline.task(depends_on=clean_titles)
def build_keyword_dictionary(titles):
    kw_dictionary = {}
    for title in titles:
        word_list = title.split(" ")
        for word in word_list:
            if word not in stop_words and word:
                if word not in kw_dictionary:
                    kw_dictionary[word] = 0
                kw_dictionary[word] += 1
    return kw_dictionary
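
The same counting can also be expressed with collections.Counter from the standard library. The sketch below is an alternative formulation, not the task's own code; `demo_stop_words` is a tiny made-up tuple standing in for the real stop_words module, and `count_keywords` is a name chosen for illustration.

```python
from collections import Counter

# Hypothetical stand-in for the stop_words tuple used in the project.
demo_stop_words = ("the", "a", "of", "is")

def count_keywords(titles, stop_words):
    counts = Counter()
    for title in titles:
        # split() with no argument splits on any whitespace and skips empty strings.
        counts.update(w for w in title.split() if w not in stop_words)
    return counts

demo_counts = count_keywords(
    ["the future of python", "python is a language"], demo_stop_words
)
```

Calling split() without an argument removes the need for the explicit empty-string check, although it also splits on tabs and newlines, which split(" ") does not.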

## Sort the Top Words

Finally, we're ready to sort the words used across all the titles. The goal is to output the **Top 100** as a list of (word, frequency) tuples, sorted from most used to least used.

In [9]:
@pipeline.task(depends_on=build_keyword_dictionary)
def sort_words(kw_dict):
    # Sort the (word, count) pairs by count, highest first.
    top_words = sorted(kw_dict.items(), key=lambda item: item[1], reverse=True)
    return top_words[:100]
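
When only the top few entries are needed, heapq.nlargest() from the standard library is an alternative to sorting the whole list. The sketch below uses a small invented dictionary; it is a different technique from the full sort used in the task, shown only for comparison.

```python
import heapq

# Hypothetical word counts, made up for illustration.
demo_freq = {"python": 5, "data": 3, "web": 4, "go": 1}

# Take the two pairs with the highest counts, in descending order.
top_two = heapq.nlargest(2, demo_freq.items(), key=lambda item: item[1])

# For comparison: the same result from a full sort followed by a slice.
by_sort = sorted(demo_freq.items(), key=lambda item: item[1], reverse=True)[:2]
```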

## Executing the pipeline

Let's run the pipeline using pipeline.run() and check the Top 100 list.

In [10]:
results = pipeline.run()
results[sort_words]

[('new', 185),
 ('google', 167),
 ('bitcoin', 101),
 ('open', 92),
 ('programming', 90),
 ('web', 88),
 ('data', 85),
 ('video', 79),
 ('python', 75),
 ('code', 72),
 ('facebook', 71),
 ('released', 71),
 ('using', 70),
 ('2013', 65),
 ('javascript', 65),
 ('free', 64),
 ('source', 64),
 ('game', 63),
 ('internet', 62),
 ('microsoft', 59),
 ('c', 59),
 ('linux', 58),
 ('app', 57),
 ('pdf', 55),
 ('work', 54),
 ('language', 54),
 ('software', 52),
 ('2014', 52),
 ('startup', 51),
 ('apple', 50),
 ('use', 50),
 ('make', 50),
 ('time', 48),
 ('yc', 48),
 ('security', 48),
 ('nsa', 45),
 ('github', 45),
 ('windows', 44),
 ('world', 41),
 ('way', 41),
 ('like', 41),
 ('1', 40),
 ('project', 40),
 ('computer', 40),
 ('heartbleed', 40),
 ('git', 37),
 ('users', 37),
 ('dont', 37),
 ('design', 37),
 ('ios', 37),
 ('developer', 36),
 ('os', 36),
 ('twitter', 36),
 ('ceo', 36),
 ('vs', 36),
 ('life', 36),
 ('big', 35),
 ('day', 35),
 ('android', 34),
 ('online', 34),
 ('years', 33),
 ('simple', 