# Intro to NLP

In this mission, we'll learn some of the basic building blocks of natural language processing (NLP). When we feed a computer written text, it has no idea what that text means. Before a computer can begin making inferences from text, we need to convert the text to a numerical representation. This process lets the computer intuit patterns such as grammatical rules from examples, which is more akin to how people learn a first language.

We'll explore how to get from written text to a numerical representation, and how we can use that representation to make predictions.

## The Data

Our data set consists of submissions users made to Hacker News from 2006 to 2015. Developer Arnaud Drizard used the Hacker News API to scrape the data, which you can find in one of his GitHub repositories. We've randomly sampled 3000 rows from the data and removed all of the extraneous columns. Our data has only four columns:

 - submission_time - When the article was submitted
 - upvotes - The number of upvotes the article received
 - url - The base URL of the article
 - headline - The article's headline

In this mission, we'll be predicting the number of upvotes the articles received, based on their headlines. Because upvotes are an indicator of popularity, we'll discover which types of articles tend to be the most popular.

In [1]:
import pandas as pd
submissions = pd.read_csv("sel_hn_stories.csv")
submissions.columns = ["submission_time", "upvotes", "url", "headline"]
submissions = submissions.dropna()

## Tokenizing

Our goal is to train a linear regression algorithm that predicts the number of upvotes a headline would receive. To do this, we'll need to convert each headline to a numerical representation.

While there are several ways to accomplish this, we'll use a bag of words model. A bag of words model represents each piece of text as a numerical vector of word counts, ignoring the order in which the words appear.
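
As a rough illustration, here's a minimal sketch of the idea using two made-up headlines (they are not taken from the data set). The `bag_of_words` helper is hypothetical and exists only for this example. Each headline becomes a row of counts, with one position per word in the shared vocabulary:

```python
from collections import Counter


def bag_of_words(texts):
    """Return (vocabulary, vectors) for a list of strings.

    Hypothetical helper for illustration only; not part of the mission code.
    """
    tokenized = [text.lower().split() for text in texts]
    # Sort the vocabulary so the column order is deterministic.
    vocabulary = sorted({token for tokens in tokenized for token in tokens})
    vectors = []
    for tokens in tokenized:
        token_counts = Counter(tokens)
        # A Counter returns 0 for words that don't appear in this text.
        vectors.append([token_counts[word] for word in vocabulary])
    return vocabulary, vectors


# Made-up headlines, for illustration only.
examples = ["show hn my new app", "my app is my life"]
vocabulary, vectors = bag_of_words(examples)

print(vocabulary)  # ['app', 'hn', 'is', 'life', 'my', 'new', 'show']
for vector in vectors:
    print(vector)
# [1, 1, 0, 0, 1, 1, 1]
# [1, 0, 1, 1, 2, 0, 0]
```

Note that the second vector records "my" twice but keeps no trace of where in the headline it appeared.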

In [2]:
tokenized_headlines = []
for item in submissions["headline"]:
    tokenized_headlines.append(item.split())

## Preprocessing Tokens to Increase Accuracy

In [3]:
punctuation = [",", ":", ";", ".", "'", '"', "’", "?", "/", "-", "+", "&", "(", ")"]
clean_tokenized = []
for item in tokenized_headlines:
    tokens = []
    for token in item:
        token = token.lower()
        for punc in punctuation:
            token = token.replace(punc, "")
        tokens.append(token)
    clean_tokenized.append(tokens)
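
To see why this cleanup can help, here's a small sketch with made-up tokens and the same punctuation list as the cell above. The `clean_token` helper is hypothetical and simply wraps the inner loop. Without lowercasing and punctuation removal, each variant below would get its own column in the matrix:

```python
punctuation = [",", ":", ";", ".", "'", '"', "’", "?", "/", "-", "+", "&", "(", ")"]


def clean_token(token):
    """Lowercase a token and strip the punctuation listed above.

    Hypothetical helper for illustration only.
    """
    token = token.lower()
    for punc in punctuation:
        token = token.replace(punc, "")
    return token


# Made-up variants of the same word.
raw_tokens = ["Berlin", "berlin.", "(Berlin)", "BERLIN?"]
cleaned = [clean_token(token) for token in raw_tokens]

print(cleaned)  # ['berlin', 'berlin', 'berlin', 'berlin']
print(len(set(raw_tokens)), "->", len(set(cleaned)))  # 4 -> 1
```

One side effect worth knowing about: a token made up entirely of punctuation, such as `-`, becomes an empty string after cleaning.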

## Assembling a Matrix of Unique Words

In [4]:

import numpy as np
unique_tokens = []  # tokens seen at least twice; these become the columns
single_tokens = []  # every token seen at least once
for tokens in clean_tokenized:
    for token in tokens:
        if token not in single_tokens:
            single_tokens.append(token)
        elif token not in unique_tokens:
            # Second sighting of this token: keep it as a feature.
            unique_tokens.append(token)

counts = pd.DataFrame(0, index=np.arange(len(clean_tokenized)), columns=unique_tokens)

In [5]:
# Counting Token Occurrences

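for i, item in enumerate(clean_tokenized):
    for token in item:
        if token in unique_tokens:
            # Use a single .loc call. Chained indexing such as
            # counts.iloc[i][token] can update a temporary copy
            # instead of the DataFrame itself.
            counts.loc[i, token] += 1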

## Removing Columns to Increase Accuracy

We have over 2000 columns in our matrix. This can make it very hard for a linear regression model to make good predictions. Too many columns will cause the model to fit to noise instead of the signal in the data.

There are two kinds of features that will reduce prediction accuracy. Features that occur only a few times will cause overfitting, because the model doesn't have enough information to accurately decide whether they're important. These features will probably correlate differently with upvotes in the test set and the training set.

Features that occur too many times can also cause issues. These are words like "and" and "to", which occur in nearly every headline. These words add little information, because they appear regardless of how many upvotes a headline receives. These types of words are sometimes called stopwords.

To reduce the number of features and enable the linear regression model to make better predictions, we'll remove any words that occur fewer than 5 times or more than 100 times.
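
The thresholds work like a simple band-pass filter on word totals. Here's a minimal sketch in plain Python rather than pandas, using made-up totals (the words and numbers are assumptions for illustration, not values from the data set). Both boundaries are inclusive, so a word that occurs exactly 5 or exactly 100 times is kept:

```python
from collections import Counter

MIN_COUNT = 5    # drop words that occur fewer than 5 times
MAX_COUNT = 100  # drop words that occur more than 100 times

# Made-up word totals, for illustration only.
word_totals = Counter({
    "the": 412,
    "and": 150,
    "google": 100,
    "python": 37,
    "startup": 5,
    "kerning": 4,
    "zeppelin": 1,
})

kept = sorted(
    word for word, total in word_totals.items()
    if MIN_COUNT <= total <= MAX_COUNT
)
dropped = sorted(set(word_totals) - set(kept))

print(kept)     # ['google', 'python', 'startup']
print(dropped)  # ['and', 'kerning', 'the', 'zeppelin']
```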

In [6]:
word_counts = counts.sum(axis=0)

counts = counts.loc[:,(word_counts >= 5) & (word_counts <= 100)]

In [7]:
# Splitting the Data into train and test

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(counts, submissions["upvotes"], test_size=0.2, random_state=1)

In [8]:
# making predictions with fit

from sklearn.linear_model import LinearRegression

clf = LinearRegression()
clf.fit(X_train, y_train)

predictions = clf.predict(X_test)

In [10]:
# calculating prediction error

mse = sum((predictions - y_test) ** 2) / len(predictions)
print(mse)

2651.145705668969


## Next steps

Our MSE is 2651, which is a fairly large value. There's no hard and fast rule about what a "good" error is, because it depends on the problem we're solving and our error tolerance.

In this case, the mean number of upvotes is 10, and the standard deviation is 39.5. Taking the square root of our MSE gives the root mean squared error (RMSE), which is expressed in upvotes: 51.5. In other words, our predictions are typically off by about 51.5 upvotes. This is higher than the standard deviation, which is roughly the error we'd get by always predicting the mean, so our predictions are often far off-base.
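
The arithmetic can be checked with a few lines of Python. This sketch hard-codes the MSE printed above and the standard deviation quoted in the text, so it doesn't need the data set:

```python
import math

mse = 2651.145705668969  # value printed by the error calculation above
upvote_std = 39.5        # standard deviation of upvotes quoted in the text

# The square root puts the error back in the original unit (upvotes).
rmse = math.sqrt(mse)

print(round(rmse, 1))     # 51.5
print(rmse > upvote_std)  # True
```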

We can take several steps to reduce the error and explore natural language processing further. Here are some ideas for your next steps:

 - Use the entire data set. We used a 3000-row sample in this mission, but you could download the entire data set (https://github.com/arnauddri/hn). This should reduce the error considerably. Natural language data has a very large number of features, and most words are rare. With more data, the model is more likely to see the same features in both the training and test sets, which should help it make better predictions.
 - Add "meta" features like headline length and average word length.
 - Use a random forest, or another more powerful machine learning technique.
 - Explore different thresholds for removing extraneous columns.
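
As a starting point for the "meta" features idea above, here's a minimal sketch. The `meta_features` function and its key names are hypothetical, and the headline is made up. It assumes that words are separated by whitespace, as in the tokenizing step:

```python
def meta_features(headline):
    """Compute simple length-based features for one headline.

    Hypothetical helper for illustration only.
    """
    words = headline.split()
    total_word_chars = sum(len(word) for word in words)
    return {
        "headline_length": len(headline),
        "word_count": len(words),
        # Guard against empty headlines to avoid dividing by zero.
        "average_word_length": total_word_chars / len(words) if words else 0.0,
    }


# Made-up headline, for illustration only.
features = meta_features("Show HN: A tiny NLP demo")
print(features["headline_length"])  # 24
print(features["word_count"])       # 6
print(round(features["average_word_length"], 2))  # 3.17
```

Values like these could be added as extra columns next to the word counts before splitting the data into training and test sets.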