---
__Hacker News Pipeline__

We have built a data pipeline that schedules our tasks.

The data we will use comes from a Hacker News (HN) API, returning JSON data of the top stories in 2014.

Each post has a set of keys, but we will deal only with the following keys:
- created_at: A timestamp of the story's creation time.
- created_at_i: A unix epoch timestamp.
- url: The URL of the story link.
- objectID: The ID of the story.
- author: The story's author (username on HN).
- points: The number of upvotes the story had.
- title: The headline of the post.
- num_comments: The number of a comments a post has.

In [1]:
# Set up pipeline
import json, csv, io, string, datetime as dt

from pipeline import Pipeline, build_csv
from stop_words import stop_words

pipeline = Pipeline()

In [2]:
# Extract Data
#  Load data from JSON file
@pipeline.task()
def file_to_json():
    with open('hn_stories_2014.json', 'r') as f:
        raw = json.load(f)
        data = raw['stories']
    return data

#  Filter data
@pipeline.task(depends_on = file_to_json)
def filter_data(data):
    def popular(item):
        return (item['points'] > 50 
                and item['num_comments'] > 1 
                and not item['title'].startswith('ASK HN')
               )
    return (item for item in data if popular(item))

In [3]:
# Transform Data
#  Convert JSON data to CSV
@pipeline.task(depends_on = filter_data)
def json_to_csv(data):
    lines = list()
    for item in data:
        lines.append((item['objectID'], 
                      dt.datetime.strptime(item['created_at'], '%Y-%m-%dT%H:%M:%SZ'),
                      item['url'], item['points'], item['title']
                     ))
        
    file = build_csv(lines, 
                     header = ['objectID', 'created_at', 'url', 'points', 'title'],
                     file = io.StringIO())
    return file

In [4]:
#  Isolate title data
@pipeline.task(depends_on = json_to_csv)
def extract_titles(file):
    reader = csv.reader(file)
    header = next(reader)
    id_num = header.index('title')
    return (i[id_num] for i in reader)

#  Standardise title data
@pipeline.task(depends_on = extract_titles)
def clean_titles(titles):
    titles = [t.lower() for t in titles]
    for p in string.punctuation:
        titles = [t.replace(p, '') for t in titles]
    return titles

In [5]:
#  Build key - value store of word frequencies
@pipeline.task(depends_on = clean_titles)
def build_dictionary(titles):
    word_freq = {}
    for t in titles:
        for i in t.split(' '):
            if len(i) == 0 or i in stop_words:
                pass            
            else:
                if i not in word_freq.keys():
                    word_freq[i] = 1

                word_freq[i] += 1
    
    return word_freq

In [6]:
# Arrange frequency table
@pipeline.task(depends_on = build_dictionary)
def top_entries(word_freq, no_entries = 100):
    sorted_items = sorted(word_freq.items(),
                          key=lambda x:x[1], 
                          reverse=True)
    
    return sorted_items[:no_entries]

In [7]:
# Test implementation
test = pipeline.run()
print('Top Entries')
for i in test[top_entries]:
    print(i[0], '-', i[1])

Top Entries
new - 186
google - 168
ask - 127
bitcoin - 103
open - 96
programming - 93
web - 90
data - 87
video - 80
python - 76
code - 75
facebook - 72
released - 72
using - 71
source - 69
2013 - 66
2014 - 66
free - 66
javascript - 66
game - 65
internet - 63
c - 61
work - 60
microsoft - 60
linux - 59
app - 58
pdf - 56
language - 55
software - 55
use - 54
startup - 53
make - 52
apple - 51
time - 50
yc - 49
security - 49
nsa - 46
github - 46
windows - 45
like - 45
project - 43
way - 43
world - 42
users - 41
developer - 41
1 - 41
computer - 41
heartbleed - 41
dont - 39
git - 38
design - 38
ios - 38
os - 37
twitter - 37
ceo - 37
online - 37
vs - 37
big - 37
life - 37
day - 36
android - 35
years - 35
apps - 35
best - 35
simple - 34
mt - 34
court - 34
firefox - 33
guide - 33
learning - 33
gox - 33
site - 33
api - 33
says - 33
browser - 33
server - 32
fast - 32
problem - 32
mozilla - 32
engine - 32
introducing - 31
does - 31
amazon - 31
better - 31
year - 31
text - 31
support - 30
stop - 30
t

__Closing remarks__

1. The data on HackerNews posts has been proceessed using the pipeline for task scheduling.
2. The data has been cleaned to standardise the word format, as well as skipping stop words and blank entries. 
3. The frequency of each word in the post titles has been extracted into a key - value store.
4. The top 100 words in the key - value store have been extracted and displayed in a readable format. 

The final result has some interesting keywords. There were terms like bitcoin, heartbleed (the 2014 hack), and many others. 

Now that we have created the pipeline, there are additional tasks we could perform with the data:

- Rewrite the Pipeline class' output to save a file of the output for each task. This will allow you to "checkpoint" tasks so they don't have to be run twice.
- Use the nltk package for more advanced natural language processing tasks.
- Convert to a CSV before filtering, so you can keep all the stories from 2014 in a raw file.
- Fetch the data from Hacker News directly from a JSON API. Instead of reading from the file we gave, you can perform additional data processing using newer data.