# Hacker News Pipeline

I'll be building a pipeline with the goal of finding the top 100 keywords of Hacker News posts in 2014. 

## Data Source
The data we will use comes from a [Hacker News](https://news.ycombinator.com/) (HN) API that returns JSON data of the top stories in 2014. 

## Pipeline Steps
- Fetch data from the HN API, which returns JSON data of the top stories in 2014.
- Filter, clean, and aggregate the data.
- Implement basic Natural Language Processing (NLP) tasks to analyze the text content of the posts.
- Create a Pipeline class to orchestrate the sequence of tasks mentioned above.
- Using NLP techniques, extract the top 100 keywords from the processed text data.

## Data Description
To make things easier, we have already downloaded a list of JSON posts to a file called `hn_stories_2014.json`. The JSON file contains a single key stories, which contains a list of stories (posts). Each post has a set of keys, but we will deal only with the following keys and here are what they represent:
- `created_at`: A timestamp of the story's creation time.
- `created_at_i`: A Unix epoch timestamp.
- `url`: The URL of the story link.
- `objectID`: The ID of the story.
- `author`: The story's author (username on HN).
- `points`: The number of upvotes the story had.
- `title`: The headline of the post.
- `num_comments`: The number of comments a post has.

## Summary of Results
We have been able to find the Top 100 most talked about Topic/story in 2014.


## Loading the JSON Data
We'll start by loading the JSON file data into Python Dictionary

In [1]:
from datetime import datetime
import json
import io
import csv
import string

from pipeline import build_csv, Pipeline
from stop_words import stop_words

pipeline = Pipeline()

@pipeline.task()
def file_to_json():
    with open('hn_stories_2014.json', 'r') as f:
        data = json.load(f)
        stories = data['stories']
    return stories

Great! Now that we have loaded in all the stories as a list of dict objects, printed the first story, so we can now operate on them. Let's start by filtering the list of stories to get the most popular stories of the year.

## Filtering the Stories
Similar to other social link aggregator platforms, users on our site have the freedom to post various types of content. Our goal in identifying the most popular stories is to highlight those that garnered significant attention throughout the year. To achieve this, we prioritize stories that are link-based (rather than Ask HN posts), possess a substantial number of points, and have generated some level of engagement through comments.

In [2]:
@pipeline.task(depends_on=file_to_json)
def filter_stories(stories):
    def is_popular(story):
        return story['points'] > 50 and story['num_comments'] > 1 and not story['title'].startswith('Ask HN')
    
    return (
        story for story in stories
        if is_popular(story)
    )

Just to be sure we printed 5 out. We succedded in filtering stories that have more than 50 points, 1 comment and do not begin with `ASK HN`. 

## Convert to CSV
With a reduced set of stories, it's time to write these dict objects to a CSV file. The purpose of translating the dictionaries to a CSV is that we want to have a consistent data format when running the later summarizations. By keeping consistent data formats, each of the pipeline tasks will be adaptable with future task requirements.

In [3]:
@pipeline.task(depends_on=filter_stories)
def json_to_csv(stories):
    lines = []
    for story in stories:
        lines.append(
            (story['objectID'], datetime.strptime(story['created_at'], "%Y-%m-%dT%H:%M:%SZ"), story['url'], story['points'], story['title'])
        )
    return build_csv(lines, header=['objectID', 'created_at', 'url', 'points', 'title'], file=io.StringIO())

## Extract Title Column
Using the CSV file format we created in the previous task, we can now extract the title column.

In [4]:
@pipeline.task(depends_on=json_to_csv)
def extract_titles(csv_file):
    reader = csv.reader(csv_file)
    header = next(reader)
    idx = header.index('title')
    
    return (line[idx] for line in reader)

## Cleaning the Titles
Because we're trying to create a word frequency model of words from Hacker News titles, we need a way to create a consistent set of words to use. For example, words like Google, google, GooGle?, and google., all mean the same keyword: google

In [5]:
@pipeline.task(depends_on=extract_titles)
def clean_title(titles):
    for title in titles:
        title = title.lower()
        title = ''.join(c for c in title if c not in string.punctuation)
        yield title

## Create the Word Frequency Dictionary
With a cleaned title, we can now build the word frequency dictionary.

In [6]:
@pipeline.task(depends_on=clean_title)
def build_keyword_dictionary(titles):
    word_freq = {}
    for title in titles:
        for word in title.split(' '):
            if word and word not in stop_words:
                if word not in word_freq:
                    word_freq[word] = 1
                word_freq[word] += 1
    return word_freq

## Sort the Top Words
Finally, we're ready to sort the top words used in all the titles. In this final task, it's up to you to decide how you want to sort the top words. The goal is to output a list of tuples with (word, frequency) as the entries sorted from most used, to least most used.

In [7]:
@pipeline.task(depends_on=build_keyword_dictionary)
def top_keywords(word_freq):
    freq_tuple = [
        (word, word_freq[word])
        for word in sorted(word_freq, key=word_freq.get, reverse=True)
    ]
    return freq_tuple[:100]

ran = pipeline.run()
print(ran[top_keywords])

[('new', 186), ('google', 168), ('bitcoin', 102), ('open', 93), ('programming', 91), ('web', 89), ('data', 86), ('video', 80), ('python', 76), ('code', 73), ('facebook', 72), ('released', 72), ('using', 71), ('2013', 66), ('javascript', 66), ('free', 65), ('source', 65), ('game', 64), ('internet', 63), ('microsoft', 60), ('c', 60), ('linux', 59), ('app', 58), ('pdf', 56), ('work', 55), ('language', 55), ('software', 53), ('2014', 53), ('startup', 52), ('apple', 51), ('use', 51), ('make', 51), ('time', 49), ('yc', 49), ('security', 49), ('nsa', 46), ('github', 46), ('windows', 45), ('world', 42), ('way', 42), ('like', 42), ('1', 41), ('project', 41), ('computer', 41), ('heartbleed', 41), ('git', 38), ('users', 38), ('dont', 38), ('design', 38), ('ios', 38), ('developer', 37), ('os', 37), ('twitter', 37), ('ceo', 37), ('vs', 37), ('life', 37), ('big', 36), ('day', 36), ('android', 35), ('online', 35), ('years', 34), ('simple', 34), ('court', 34), ('guide', 33), ('learning', 33), ('mt', 3