# Hacker News Pipeline

## Introduction

In this project, I will use the pipeline which I have worked on and apply it to a real world data pipeline project. From a JSON API, I will filter, clean, aggregate, and summarize data in a sequence of tasks that will apply these transformations for me.

The data I will use comes from a [Hacker News](https://news.ycombinator.com/) (HN) API that returns [JSON data](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/JSON) of the top stories in 2014. If you're unfamiliar with Hacker News, it's a link aggregator website that users vote up stories that are interesting to the community. It is similar to [Reddit](https://www.reddit.com/), but the community only revolves around on computer science and entrepreneurship posts.

The file hn_stories_2014.json consists of already downloaded list of JSON posts. The JSON file contains a single key stories, which contains a list of stories (posts). Each post has a set of keys, but I will deal only with the following keys:

* created_at: A timestamp of the story's creation time.

* created_at_i: A unix epoch timestamp.

* url: The URL of the story link.

* objectID: The ID of the story.

* author: The story's author (username on HN).

* points: The number of upvotes the story had.

* title: The headline of the post.

* num_comments: The number of a comments a post has.

Using this dataset, I will run a sequence of basic natural language processing tasks using the Pipeline class. 

The goal will be to find the top 100 keywords of Hacker News posts in 2014. Because Hacker News is the most popular technology social media site, this will give us an understanding of the most talked about tech topics in 2014!

## Introduction to the Data

In [1]:
from datetime import datetime
from pipeline import build_csv, Pipeline
from stop_words import stop_words

pipeline = Pipeline()

import json
import io
import csv
import string

## Loading the JSON Data

I will load the JSON file data into Python. The goal is to parse the JSON file into a Python dict object. This can be accomplished using the json module.

To load in a file, json exposes a method called json.load() which takes in a python object as the first argument. Using this json.load() method, I'll load the hn_stories_2014.json file as a Python dict.

In [2]:
@pipeline.task()
def file_to_json():
    with open('hn_stories_2014.json', 'r') as f:
        data = json.load(f)
        stories = data['stories']
    return stories

## Filtering the Stories

After loading the stories as a list of dict objects, I can now operate on them. Let us start by filtering the list of stories to get the most popular stories of the year. 

Like any social link aggregator site, individual users can post whatever content they want. The reason we want the most popular stories is to ensure that we select stories that were the most talked about during the year. We can filter for popular stories by ensuring they are links (not Ask HN posts), have a good number of points, and have some comments.

In [3]:
# pipeline.task() funtion that depends on the file_to_json()
@pipeline.task(depends_on=file_to_json)

# function that filters stories with more than 50 points, more than 
# 1 comment and do not begin with Ask HN.
def filter_stories(stories):
    def is_popular(story):
        return story['points'] > 50 and story['num_comments'] > 1 and not story['title'].startswith('Ask HN')
    
    return (
        story for story in stories
        if is_popular(story)
    )

## Convert to CSV

With a reduced set of stories, it's time to write these dict objects to a CSV file. The purpose of translating the dictionaries to a CSV is to have a consistent data format when running the later summarizations. By keeping consistent data formats, each of the pipeline tasks will be adaptable with future task requirements.

In [4]:
# pipeline.task() function that depends on the filter_stories() function
@pipeline.task(depends_on=filter_stories)

# call function that writes the filtered JSON stories to a CSV file
def json_to_csv(stories):
    lines = []
    for story in stories:
        lines.append(
            (story['objectID'], datetime.strptime(story['created_at'], "%Y-%m-%dT%H:%M:%SZ"), story['url'], story['points'], story['title'])
        )
    return build_csv(lines, header=['objectID', 'created_at', 'url', 'points', 'title'], file=io.StringIO())

## Extract Title Column

Using the CSV file format I created in the previous task, I can now extract the title column. Once the titles of each popular post have been extracted, I can then run the next word frequency task. To extract the titles, I'll follow the following steps:

1. Import csv, and create a csv.reader() object from the file object. 

2. Find the index of the title in the header. 

3. Iterate through the reader, and return each item from the reader in the corresponding title index position.

In [5]:
# pipeline.task() function that depends on the json_to_csv()
@pipeline.task(depends_on=json_to_csv)

# extract_titles() function that returns a generator of every Hacker
# News story title.
def extract_titles(csv_file):
    reader = csv.reader(csv_file)
    header = next(reader)
    idx = header.index('title')
    
    return (line[idx] for line in reader) 

## Clean the Titles

To clean the titles, I will mke the titles lower case and remove the punctuation. An easy way to rid a string of punctuation is to check each character, determine if it is a letter or punctuation, and only keep the letter.

In [6]:
# pipeline.task() that depends on extract_titles()
@pipeline.task(depends_on=extract_titles)

# clean_title function returns of a generator of cleaned titles 
def clean_title(titles):
    for title in titles:
        title = title.lower()
        title = ''.join(c for c in title if c not in string.punctuation)
        yield title

## Create the Word Frequency Dictionary

With a cleaned title, I can now build the word frequency dictionary. A word frequency dictionary are key value pairs that connects a word to the number of times it is used in a text.

Furthermore, to find actual keywords, I will enforce the word frequency dictionary to not include stop words. Stop words are words that occur frequently in language like "the", "or", etc., and are commonly rejected in keyword searches.

In [7]:
# pipeline.task() function that depends on the clean_titles() function.
@pipeline.task(depends_on=clean_title)

# function that returns a dictionary of the word frequency of all
# HN titles.
def build_keyword_dictionary(titles):
    word_freq = {}
    for title in titles:
        for word in title.split(' '):
            if word and word not in stop_words:
                if word not in word_freq:
                    word_freq[word] = 1
                word_freq[word] += 1
    return word_freq

## Sort the Top Words

The goal is to output a list of tuples with (word, frequency) as the entries sorted from most used, to least most used.

In [9]:
# pipeline.task() function that depends on the build_keyword_dictionary() function
@pipeline.task(depends_on=build_keyword_dictionary)

# function that returns a list of the top 100 tuples
def top_words(word_freq):
    freq_tuple = [
        (word, word_freq[word])
        for word in sorted(word_freq, key=word_freq.get, reverse=True)
    ]
    return freq_tuple[:100]

# print the output of the new task function
ran = pipeline.run()
print(ran[top_words])

[('new', 186), ('google', 168), ('bitcoin', 102), ('open', 93), ('programming', 91), ('web', 89), ('data', 86), ('video', 80), ('python', 76), ('code', 73), ('facebook', 72), ('released', 72), ('using', 71), ('2013', 66), ('javascript', 66), ('free', 65), ('source', 65), ('game', 64), ('internet', 63), ('microsoft', 60), ('c', 60), ('linux', 59), ('app', 58), ('pdf', 56), ('work', 55), ('language', 55), ('software', 53), ('2014', 53), ('startup', 52), ('apple', 51), ('use', 51), ('make', 51), ('time', 49), ('yc', 49), ('security', 49), ('nsa', 46), ('github', 46), ('windows', 45), ('world', 42), ('way', 42), ('like', 42), ('1', 41), ('project', 41), ('computer', 41), ('heartbleed', 41), ('git', 38), ('users', 38), ('dont', 38), ('design', 38), ('ios', 38), ('developer', 37), ('os', 37), ('twitter', 37), ('ceo', 37), ('vs', 37), ('life', 37), ('big', 36), ('day', 36), ('android', 35), ('online', 35), ('years', 34), ('simple', 34), ('court', 34), ('guide', 33), ('learning', 33), ('mt', 3

## Conclusion

The function returned a list of the top 100 tuples as intended. The output is a list of tuples with (word, frequency) and is sorted from most used to least most used, with the most used word, 'new', occuring 186 times and the least most used word, 'inside', occuring 28 times.