# Natural Language Processing with Python

## Objective

Predict the number of upvotes a Hacker News submission receives from its headline, and learn some of the basic building blocks of natural language processing along the way.

## Introduction

![alt url](https://tctechcrunch2011.files.wordpress.com/2013/05/hacker-news1.jpg?w=400)

[Hacker News](http://news.ycombinator.com/) is a community where users can submit articles, and other users can upvote those articles. The articles with the most upvotes make it to the front page, where they're more visible to the community.

## Data Set

The data set consists of submissions users made to Hacker News from 2006 to 2015. Developer Arnaud Drizard used the Hacker News API to scrape the data, which you can find in one of [his GitHub repositories](https://github.com/arnauddri/hn).

## Reading In The Data

We read in the data, remove all of the extraneous columns, and randomly sample 3000 rows.

The resulting data has only four columns:

    •	submission_time - When the article was submitted
    •	upvotes - The number of upvotes the article received
    •	url - The base URL of the article
    •	headline - The article's headline

In [1]:
import pandas as pd

submissions = pd.read_csv("C:/Users/i7/csv/stories.csv", header=None)
submissions = submissions.drop([0, 2, 3, 6], axis=1)
submissions.columns = ['submission_time', 'upvotes', 'url', 'headline']
submissions.head(5)

Unnamed: 0,submission_time,upvotes,url,headline
0,2015-02-20T11:29:58.000Z,2,,Ask HN: Simple SaaS as first Golang web app?
1,2015-02-20T11:34:22.000Z,1,startupjuncture.com,24sessions: live business advice over video-chat
2,2015-02-20T11:35:32.000Z,3,blog.erratasec.com,Some notes on SuperFish
3,2015-02-20T11:36:18.000Z,1,twitter.com,Apple Watch models could contain 29.16g of gold
4,2015-02-20T11:41:06.000Z,1,phpconference.co.uk,PHP UK Conference Diversity Scholarship Programme


In [2]:
submissions.shape

(1553934, 4)

In [3]:
import numpy as np

np.random.seed(1)
shuffled_idx = np.random.permutation(submissions.index)
submissions = submissions.loc[shuffled_idx]

In [4]:
submissions = submissions.drop(submissions.index[3000:])
submissions.head(5)

Unnamed: 0,submission_time,upvotes,url,headline
404493,2013-09-26T12:11:48Z,4,youtube.com,Bill Gates interview at Harvard (2013)
1435421,2009-05-08T14:36:28Z,2,news.bbc.co.uk,Google boss won't quit Apple job
1207322,2011-02-15T18:44:58Z,1,n-rhman.com,
756837,2012-07-23T03:07:44Z,93,spectrum.ieee.org,Why Bad Jobs-or No Jobs-Happen to Good Workers
639885,2012-12-19T19:50:57Z,2,charlespetzold.com,First-Person Shooter


In [5]:
submissions = submissions.dropna()
submissions.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2797 entries, 404493 to 584256
Data columns (total 4 columns):
submission_time    2797 non-null object
upvotes            2797 non-null int64
url                2797 non-null object
headline           2797 non-null object
dtypes: int64(1), object(3)
memory usage: 109.3+ KB


## Bag of Words Model to Tokenize the Data

#### Convert each headline to numerical representation

In [6]:
# Step 1: Tokenization
# Split each headline into individual words on the space character (" ").

tokenized_headlines = []
for item in submissions["headline"]:
    tokenized_headlines.append(item.split(" "))
    
print(tokenized_headlines[:2])

[['Bill', 'Gates', 'interview', 'at', 'Harvard', '(2013)'], ['Google', 'boss', "won't", 'quit', 'Apple', 'job']]


In [7]:
# Step 2: Lowercasing and removing punctuation

punctuations_list = [",", ":", ";", ".", "'", '"', "’", "?", "/", "-", "+", "&", "(", ")"]
clean_tokenized = []
for item in tokenized_headlines:
    tokens = []
    for token in item:
        token = token.lower()
        for punc in punctuations_list:
            token = token.replace(punc, "")
        tokens.append(token)
    clean_tokenized.append(tokens)

print(clean_tokenized[:2])

[['bill', 'gates', 'interview', 'at', 'harvard', '2013'], ['google', 'boss', 'wont', 'quit', 'apple', 'job']]


In [8]:
# Step 3: Retrieve all of the unique words from all of the headlines
# unique_tokens contains any tokens that occur more than once across all of the headlines.

unique_tokens = []
single_tokens = []
for tokens in clean_tokenized:
    for token in tokens:
        if token not in single_tokens:
            single_tokens.append(token)
        elif token in single_tokens and token not in unique_tokens:
            unique_tokens.append(token)

counts = pd.DataFrame(0, index=np.arange(len(clean_tokenized)), columns=unique_tokens)

In [9]:
# Step 4: Counting Token Occurrences

for i, item in enumerate(clean_tokenized):
    for token in item:
        if token in unique_tokens:
            counts.loc[i, token] += 1
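
The four steps above can be condensed with `collections.Counter`. The following is a minimal sketch on three made-up headlines (they are not taken from the data set), and it sorts the vocabulary alphabetically rather than keeping first-seen order as the cells above do.

```python
from collections import Counter

# Toy headlines, invented for illustration only.
headlines = ["Show HN: My new app", "Ask HN: Is HN down?", "My app is down"]
punctuation = ":?"


def tokenize(headline):
    """Lowercase, split on spaces, and strip the punctuation listed above."""
    tokens = []
    for token in headline.lower().split(" "):
        for char in punctuation:
            token = token.replace(char, "")
        tokens.append(token)
    return tokens


tokenized = [tokenize(h) for h in headlines]

# Vocabulary: tokens that occur more than once across all headlines.
totals = Counter(token for tokens in tokenized for token in tokens)
vocabulary = sorted(token for token, n in totals.items() if n > 1)

# One row per headline, one column per vocabulary token.
rows = [[Counter(tokens)[token] for token in vocabulary] for tokens in tokenized]

print(vocabulary)
print(rows)
```

Each row is the numerical representation of one headline: the second headline mentions "hn" twice, so its "hn" column holds 2.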

## Removing Columns To Increase Accuracy

Too many columns will cause the model to fit to noise instead of the signal in the data.

Remove any words that occur fewer than 5 times or more than 100 times.
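
As a small illustration of this kind of frequency filter, the sketch below keeps only tokens whose total count falls inside a range. The token lists are hypothetical, and the thresholds are scaled down from 5 and 100 so the effect is visible on a tiny example.

```python
from collections import Counter

# Hypothetical tokenized headlines, invented for illustration.
tokenized = [
    ["the", "new", "python", "release"],
    ["the", "python", "bug"],
    ["the", "new", "the", "startup"],
]

# Scaled-down stand-ins for the thresholds of 5 and 100 used in the text.
min_count, max_count = 2, 3

totals = Counter(token for tokens in tokenized for token in tokens)

# Keep tokens that are neither too rare nor too common.
kept = sorted(t for t, n in totals.items() if min_count <= n <= max_count)
print(kept)
```

Here "the" is dropped for being too common, while "release", "bug", and "startup" are dropped for being too rare.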

In [10]:
counts.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2797 entries, 0 to 2796
Columns: 2317 entries,  to stranger
dtypes: int64(2317)
memory usage: 49.5 MB


In [11]:
word_counts = counts.sum(axis=0)
counts = counts.loc[:,(word_counts >= 5) & (word_counts <= 100)]

In [12]:
counts.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2797 entries, 0 to 2796
Columns: 662 entries, bill to bug
dtypes: int64(662)
memory usage: 14.1 MB


## Linear Regression

In [13]:
# Train-test Split
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(counts, submissions["upvotes"], test_size=0.2, random_state=1)

# Linear Regression
from sklearn.linear_model import LinearRegression

# instantiate an instance
clf = LinearRegression()

# Fit the training data
clf.fit(X_train, y_train)

# Make predictions
y_predict = clf.predict(X_test)

## Calculating Prediction Error

In [14]:
mse = sum((y_predict - y_test) ** 2) / len(y_predict)
rmse = (mse)**0.5
print(rmse)

52.6671083116


##### The model has a high error (RMSE of about 53 upvotes) when predicting upvotes, because it was trained on a very small data set. With a larger training set, the error should decrease.
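
It may also help to keep in mind how RMSE behaves. Because the errors are squared before averaging, a single badly missed post can dominate the result. The sketch below uses made-up numbers, on the assumption that upvotes are heavy-tailed (most posts get a handful, a few get many), and compares RMSE with the mean absolute error on the same predictions.

```python
import math

# Made-up upvote counts: most posts get a few upvotes, one gets many.
actual = [1, 2, 3, 2, 93]
predicted = [2, 2, 2, 2, 3]

errors = [p - a for p, a in zip(predicted, actual)]

# RMSE squares each error first, so the single miss of 90 dominates.
rmse = math.sqrt(sum(e ** 2 for e in errors) / len(errors))

# Mean absolute error weights every miss linearly.
mae = sum(abs(e) for e in errors) / len(errors)

print("rmse:", round(rmse, 2))
print("mae:", round(mae, 2))
```

On these numbers the RMSE is more than twice the mean absolute error, even though four of the five predictions are off by at most one upvote.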

# Prediction Using Scikit-Learn

In [15]:
submissions_data = pd.read_csv("C:/Users/i7/csv/stories.csv", header=None)
submissions_data = submissions_data.drop([0, 2, 3, 6], axis=1)
submissions_data.columns = ['submission_time', 'upvotes', 'url', 'headline']

np.random.seed(1)
shuffled_idx = np.random.permutation(submissions_data.index)
submissions_data = submissions_data.loc[shuffled_idx]

submissions_data = submissions_data.drop(submissions_data.index[10000:])

submissions_data = submissions_data.dropna()
submissions_data.head()

Unnamed: 0,submission_time,upvotes,url,headline
404493,2013-09-26T12:11:48Z,4,youtube.com,Bill Gates interview at Harvard (2013)
1435421,2009-05-08T14:36:28Z,2,news.bbc.co.uk,Google boss won't quit Apple job
1207322,2011-02-15T18:44:58Z,1,n-rhman.com,
756837,2012-07-23T03:07:44Z,93,spectrum.ieee.org,Why Bad Jobs-or No Jobs-Happen to Good Workers
639885,2012-12-19T19:50:57Z,2,charlespetzold.com,First-Person Shooter


In [16]:
submissions_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 9373 entries, 404493 to 1019647
Data columns (total 4 columns):
submission_time    9373 non-null object
upvotes            9373 non-null int64
url                9373 non-null object
headline           9373 non-null object
dtypes: int64(1), object(3)
memory usage: 366.1+ KB


## Generating a Matrix for all the Headlines

In [19]:
from sklearn.feature_extraction.text import CountVectorizer

# Construct a bag of words matrix.
# This will lowercase everything, and ignore all punctuation by default.
# It will also remove stop words.
vectorizer = CountVectorizer(lowercase=True, stop_words="english")

matrix = vectorizer.fit_transform(submissions_data["headline"])
# Created bag of words matrix with far fewer commands.
print("matrix.todense():\n", matrix.todense())

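# Apply the same method to all of the sampled headlines (10000 sampled rows, 9373 after dropping missing values).
# Build a combined headline-plus-URL column so the URL can be taken into account.
submissions_data['full_test'] = submissions_data["headline"] + " " + submissions_data["url"]
# Note: the matrix is still built from "headline" alone; pass submissions_data["full_test"]
# instead to include the URL (the shape printed below reflects headlines only).
full_matrix = vectorizer.fit_transform(submissions_data["headline"])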
print("full_matrix.shape:\n", full_matrix.shape)

matrix.todense():
 [[0 0 0 ..., 0 0 0]
 [0 0 0 ..., 0 0 0]
 [0 0 0 ..., 0 0 0]
 ..., 
 [0 0 0 ..., 0 0 0]
 [0 0 0 ..., 0 0 0]
 [0 0 0 ..., 0 0 0]]
full_matrix.shape:
 (9373, 13873)


## Reducing Dimensionality

Pick a subset of the columns that are the most informative -- that is, the columns that differentiate between good and bad headlines the best. A good way to figure out the most informative columns is to use something called a chi-squared test.

A chi-squared test finds the words that discriminate the most between highly upvoted posts and posts that weren't upvoted. This can be words that occur a lot in highly upvoted posts, and not at all in posts without upvotes, or words that occur a lot in posts that aren't upvoted, but don't occur in posts that are upvoted.
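
To make the idea concrete, here is a pure-Python sketch of a chi-squared score per word. It compares the observed total count of a word in each class with the count expected if the word were spread evenly across classes. The word counts and labels are invented, and the helper `chi2_scores` is a hypothetical name for illustration, not part of scikit-learn.

```python
# Toy word counts per headline and a binary label (1 = above-average upvotes).
# All values are invented for illustration.
words = ["show", "ask", "the"]
toy_counts = [
    [3, 0, 1],
    [2, 0, 1],
    [0, 3, 1],
    [0, 2, 1],
]
labels = [1, 1, 0, 0]


def chi2_scores(counts, labels):
    """Chi-squared statistic per column: observed vs. expected per-class totals."""
    n_rows = len(counts)
    classes = sorted(set(labels))
    scores = []
    for col in range(len(counts[0])):
        column_total = sum(row[col] for row in counts)
        score = 0.0
        for cls in classes:
            observed = sum(row[col] for row, y in zip(counts, labels) if y == cls)
            # Expected count if the word were independent of the class.
            expected = column_total * labels.count(cls) / n_rows
            if expected > 0:
                score += (observed - expected) ** 2 / expected
        scores.append(score)
    return scores


scores = chi2_scores(toy_counts, labels)
ranked = sorted(zip(words, scores), key=lambda pair: pair[1], reverse=True)
print(ranked)
```

"show" and "ask" each appear in only one class, so they score highly, while "the" appears equally in both classes and scores zero. A selector that keeps the top two words would therefore drop "the".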


In [22]:
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2

# Convert the upvotes variable to binary so it works with a chi-squared test:
# 1 for posts with more upvotes than the mean, 0 otherwise.
col_mean = submissions_data["upvotes"].mean()
col = (submissions_data["upvotes"] > col_mean).astype(int)

# Find the 1000 most informative columns
selector = SelectKBest(chi2, k=1000)
selector.fit(full_matrix, col)
top_words = selector.get_support().nonzero()

# Pick only the most informative columns in the data.
chi_matrix = full_matrix[:,top_words[0]]

SelectKBest(k=1000, score_func=<function chi2 at 0x000000000F842840>)

## Adding Meta Features

If we ignore the "meta" features of the headlines, we miss out on a lot of good information. These features are things like length, amount of punctuation, average word length, and other sentence-specific features.

Adding these in can improve prediction accuracy.

In [25]:
import re

# List of functions to apply.
transform_functions = [
    lambda x: len(x),
    lambda x: x.count(" "),
    lambda x: x.count("."),
    lambda x: x.count("!"),
    lambda x: x.count("?"),
    lambda x: len(x) / (x.count(" ") + 1),
    lambda x: x.count(" ") / (x.count(".") + 1),
    lambda x: len(re.findall(r"\d", x)),
    lambda x: len(re.findall("[A-Z]", x)),
]

# Apply each function and put the results into a list.
columns = []
for func in transform_functions:
    columns.append(submissions_data["headline"].apply(func))
    
# Convert the meta features to a numpy array.
meta = np.asarray(columns).T

## Adding in more Features

The submission_time column tells us when a story was submitted, so features derived from it could add more information.

In [28]:
columns = []

# Convert the submission dates column to datetime.
submissions_data_dates = pd.to_datetime(submissions_data["submission_time"])

# Transform functions for the datetime column.
transform_functions = [
    lambda x: x.year,
    lambda x: x.month,
    lambda x: x.day,
    lambda x: x.hour,
    lambda x: x.minute,
]

# Apply all functions to the datetime column.
for func in transform_functions:
    columns.append(submissions_data_dates.apply(func))

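# Convert the date features to a numpy array.
non_nlp = np.asarray(columns).T

# Concatenate the features together.
# toarray() returns a plain ndarray; todense() returns np.matrix, which newer scikit-learn versions reject.
features = np.hstack([non_nlp, meta, chi_matrix.toarray()])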

## Making Predictions

We use ridge regression to make predictions. Ridge regression introduces a penalty on the coefficients, which prevents them from becoming too large. This can help it work with large numbers of predictors (columns) that are correlated with each other.
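
The effect of the penalty can be seen in the simplest possible case. The sketch below assumes a single feature and no intercept, where minimizing the squared error plus `alpha` times the squared coefficient has a closed-form solution. The helper `ridge_slope` and the data are hypothetical, and scikit-learn's `Ridge` handles the general multi-feature case with an intercept.

```python
def ridge_slope(x, y, alpha):
    """Closed-form ridge coefficient for one feature and no intercept.

    Minimizes sum((y_i - w * x_i) ** 2) + alpha * w ** 2 over w.
    """
    numerator = sum(a * b for a, b in zip(x, y))
    denominator = sum(a * a for a in x) + alpha
    return numerator / denominator


# Made-up data that follows y = 2x exactly.
x = [1.0, 2.0, 3.0]
y = [2.0, 4.0, 6.0]

# A larger alpha pulls the coefficient further toward zero.
for alpha in (0.0, 14.0, 140.0):
    print(alpha, ridge_slope(x, y, alpha))
```

With `alpha=0` the result is the ordinary least-squares slope of 2.0. Increasing `alpha` shrinks the coefficient, which is the behavior that keeps coefficients from becoming too large.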

In [31]:
from sklearn.linear_model import Ridge
import random

train_rows = 7500
# Set a seed to get the same "random" shuffle every time.
random.seed(1)

# Shuffle the indices for the matrix.
indices = list(range(features.shape[0]))
random.shuffle(indices)

# Create train and test sets.
train = features[indices[:train_rows], :]
test = features[indices[train_rows:], :]
train_upvotes = submissions_data["upvotes"].iloc[indices[:train_rows]]
test_upvotes = submissions_data["upvotes"].iloc[indices[train_rows:]]

# Run the regression and generate predictions for the test set.
reg = Ridge(alpha=.1)
reg.fit(train, train_upvotes)
predictions = reg.predict(test)

Ridge(alpha=0.1, copy_X=True, fit_intercept=True, max_iter=None,
   normalize=False, random_state=None, solver='auto', tol=0.001)

## Evaluating Error

In [32]:
# Use mean absolute error (MAE) as an error metric.
mae = sum(abs(predictions - test_upvotes)) / len(predictions)
print("mae:", mae)

# As a baseline, always predict the average number of upvotes in the test set.
average_upvotes = sum(test_upvotes)/len(test_upvotes)

baseline_mae = sum(abs(average_upvotes - test_upvotes)) / len(predictions)
print("baseline mae:", baseline_mae)

mae: 12.2504280089
baseline mae: 14.4016602582


The error is about 12.25 upvotes, which means that, on average, the prediction is 12.25 upvotes away from the actual number of upvotes.
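
A model's mean absolute error is easier to judge next to a naive baseline. The sketch below uses made-up numbers to compare a set of predictions against always guessing the average. The helper name `mean_absolute_error` is defined here for illustration.

```python
def mean_absolute_error(predicted, actual):
    """Average absolute difference between predictions and true values."""
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)


# Made-up upvote counts and model predictions.
actual = [1, 2, 4, 9]
model_predictions = [2, 2, 3, 7]

# Baseline: predict the average upvote count for every post.
average = sum(actual) / len(actual)
baseline_predictions = [average] * len(actual)

model_mae = mean_absolute_error(model_predictions, actual)
baseline_mae = mean_absolute_error(baseline_predictions, actual)

print("model mae:", model_mae)
print("baseline mae:", baseline_mae)
```

A model is only useful if its error is lower than the baseline's, as it is in this toy example.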

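This is better than the baseline of always predicting the average, which is off by about 14.40 upvotes. It is not directly comparable to the RMSE of about 52.67 from the hand-built bag-of-words model, because RMSE and mean absolute error are different metrics.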