# The Data

Our data set consists of submissions users made to **Hacker News** from 2006 to 2015.

In this project, we'll be predicting the number of upvotes the articles received, based on their headlines. Because upvotes are an indicator of popularity, we'll discover which types of articles tend to be the most popular.

Our goal is to train a linear regression algorithm that predicts the number of upvotes a headline would receive. To do this, we'll need to convert each headline to a numerical representation.

We'll use a **bag of words** model. A bag of words model represents each piece of text as a numerical vector.

The first step in creating a bag of words model is **tokenization**. In **tokenization**, we break a sentence up into disconnected words.
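As a minimal sketch of both ideas, the snippet below tokenizes two made-up headlines and turns each one into a count vector. The headlines and the variable names are illustrative assumptions, not part of the data set.

```python
from collections import Counter

# Two made-up headlines, used only for illustration.
headlines = ["Show HN: my new app", "My app is new"]

# Tokenization: lowercase each headline and split it on whitespace.
tokenized = [h.lower().split() for h in headlines]

# Vocabulary: every distinct token, sorted so the column order is stable.
vocabulary = sorted({token for tokens in tokenized for token in tokens})

# Bag of words: one count per vocabulary entry for each headline.
vectors = []
for tokens in tokenized:
    token_counts = Counter(tokens)
    vectors.append([token_counts[word] for word in vocabulary])

print(vocabulary)  # ['app', 'hn:', 'is', 'my', 'new', 'show']
print(vectors)     # [[1, 1, 0, 1, 1, 1], [1, 0, 1, 1, 1, 0]]
```

Note that `hn:` keeps its colon here, which is why a punctuation cleanup step is useful before counting.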

In [1]:
import pandas as pd
submissions = pd.read_csv("stories.csv")
submissions = submissions.iloc[:,[1,4,5,7]]


In [2]:
submissions.columns = ["submission_time", "upvotes", "url", "headline"]
submissions = submissions.dropna()
submissions.head()

| | submission_time | upvotes | url | headline |
|---|---|---|---|---|
| 0 | 2015-02-20T11:34:22.000Z | 1 | startupjuncture.com | 24sessions: live business advice over video-chat |
| 1 | 2015-02-20T11:35:32.000Z | 3 | blog.erratasec.com | Some notes on SuperFish |
| 2 | 2015-02-20T11:36:18.000Z | 1 | twitter.com | Apple Watch models could contain 29.16g of gold |
| 3 | 2015-02-20T11:41:06.000Z | 1 | phpconference.co.uk | PHP UK Conference Diversity Scholarship Programme |
| 4 | 2015-02-20T11:43:04.000Z | 2 | preview.onedrive.com | Microsoft giving away 100GB free OneDrive stor... |


In [3]:
submissions.shape

(1455868, 4)

Since the data set is quite large, let's randomly sample 4,000 rows to reduce computation time.

In [24]:
submissions = submissions.sample(n=4000)

## Tokenization of Headlines

In [5]:
tokenized_headlines = []
for item in submissions["headline"]:
    tokenized_headlines.append(item.split())

In [6]:
#tokenized_headlines

We now have tokens. Let's preprocess them a bit so that our predictions are more accurate.

In [7]:
#Loop through each item in tokenized_headlines, which is a list of lists.

punctuation = [",", ":", ";", ".", "'", '"', "’", "?", "/", "-", "+", "&", "(", ")"]
clean_tokenized_headlines = []
for item in tokenized_headlines: #For each list of tokens
    tokens = []
    for token in item:
        token = token.lower() #Convert each individual token to lowercase
        for punc in punctuation:
            token = token.replace(punc, "")
        tokens.append(token)
    clean_tokenized_headlines.append(tokens)
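The same cleanup could also be written with `str.translate`, which removes all the listed characters in one pass instead of calling `replace` once per character. This is only a sketch of an alternative; the helper name `clean_token` and the sample tokens are assumptions made for illustration.

```python
# Same punctuation characters as in the cell above.
punctuation = [",", ":", ";", ".", "'", '"', "’", "?", "/", "-", "+", "&", "(", ")"]

# With a third argument, str.maketrans maps each listed character to None (deleted).
strip_table = str.maketrans("", "", "".join(punctuation))

def clean_token(token):
    """Lowercase a token and drop the listed punctuation characters."""
    return token.lower().translate(strip_table)

# Made-up tokens for illustration.
sample_tokens = ["Show", "HN:", "What's", "(new)", "C++"]
cleaned = [clean_token(t) for t in sample_tokens]
print(cleaned)  # ['show', 'hn', 'whats', 'new', 'c']
```

As the last token shows, stripping `+` turns `C++` into `c`, so the punctuation list is worth reviewing for a technology-heavy data set.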

## Assembling a Matrix of Unique Words

Now that we have our tokens, we can begin converting the sentences to their numerical representations.

First, we'll retrieve every word that occurs more than once across all of the headlines (words that appear only a single time are left out). Then, we'll create a matrix, and assign those words as the column headers. We'll initialize all of the values in the matrix to 0.

In [8]:
import numpy as np
unique_tokens = []
single_tokens = []
for tokens in clean_tokenized_headlines:
    for token in tokens:
        if token not in single_tokens:
            single_tokens.append(token)
        elif token in single_tokens and token not in unique_tokens:
            unique_tokens.append(token)

counts = pd.DataFrame(0, index=np.arange(len(clean_tokenized_headlines)), columns=unique_tokens)
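Checking `token not in single_tokens` scans a list each time, which gets slow as the vocabulary grows. A possible alternative, sketched here on made-up tokens, is to count everything once with `collections.Counter` and keep the tokens seen at least twice. The names `toy_headlines` and `repeated_tokens` are assumptions; the resulting order is first-seen order, which differs slightly from the loop above.

```python
from collections import Counter

# Made-up cleaned tokens standing in for clean_tokenized_headlines.
toy_headlines = [
    ["new", "iphone", "apps"],
    ["iphone", "apps", "review"],
    ["new", "bitcoin"],
]

# Count every token once across all headlines.
token_totals = Counter(token for tokens in toy_headlines for token in tokens)

# Keep tokens seen at least twice; a Counter preserves first-seen order.
repeated_tokens = [token for token, total in token_totals.items() if total >= 2]
print(repeated_tokens)  # ['new', 'iphone', 'apps']
```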

In [9]:
counts.head()

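| | iphone | apps | facebook | of | a | – | for | sketch | bitcoin | to | ... | engagement | z | badass | imessage | gurgaon | doj | fox | unemployment | owe | molecular |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |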


## Counting Token Occurrences

Now that we have a matrix where all values are 0, we need to fill in the correct counts for each cell.

In [10]:
#Use the enumerate() function when writing the loop to get an index along with the list of tokens.
for i, item in enumerate(clean_tokenized_headlines):
    for token in item: #Loop through each token in the list of tokens
        if token in unique_tokens: #Check whether the token is in unique_tokens
            counts.loc[i, token] += 1 #Increment the appropriate cell (one .loc call avoids chained indexing)

In [11]:
counts.head()

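| | iphone | apps | facebook | of | a | – | for | sketch | bitcoin | to | ... | engagement | z | badass | imessage | gurgaon | doj | fox | unemployment | owe | molecular |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 2 | 2 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |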


We have over 3000 columns in our matrix. This can make it very hard for a linear regression model to make good predictions. Too many columns will cause the model to fit to noise instead of the signal in the data.

To reduce the number of features and enable the linear regression model to make better predictions, we'll remove any words that occur fewer than 6 times or more than 110 times.

In [18]:
word_counts = counts.sum(axis=0) # Generate a vector that contains the sum of each column in counts

counts = counts.loc[:,(word_counts >= 6) & (word_counts <= 110)]

In [19]:
word_counts

iphone      53
apps        33
facebook    56
–           62
bitcoin     24
            ..
minutes      6
clojure      7
reader       6
yet          6
away         6
Length: 761, dtype: int64
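To make the filtering rule concrete, here is a small sketch on made-up totals. Both bounds are inclusive, matching the `>=` and `<=` comparisons used above. The dictionary and the constant names `MIN_COUNT` and `MAX_COUNT` are assumptions for illustration only.

```python
# Made-up total counts per word, standing in for word_counts.
toy_word_counts = {"the": 250, "iphone": 53, "bitcoin": 24, "gurgaon": 2, "yet": 6}

# Hypothetical names for the two thresholds described in the text.
MIN_COUNT = 6
MAX_COUNT = 110

# Keep a word only if its total falls inside the inclusive range.
kept_words = [
    word for word, total in toy_word_counts.items()
    if MIN_COUNT <= total <= MAX_COUNT
]
print(kept_words)  # ['iphone', 'bitcoin', 'yet']
```

Very common words such as "the" carry little information about a specific headline, and very rare words give the model too few examples to learn from, which is the intuition behind dropping both ends.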

## Splitting The Data Into Train And Test Sets

In [20]:
from sklearn.model_selection import train_test_split
#We'll train our algorithm on a training set, then test its performance on a test set.
#The train_test_split() function from scikit-learn will help us accomplish this.

X_train, X_test, y_train, y_test = train_test_split(counts, submissions["upvotes"], test_size=0.2, random_state=1)
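Conceptually, a random split shuffles the row positions and holds out a fraction of them for testing, keeping each row paired with its target. The sketch below illustrates that idea in plain Python. It is not scikit-learn's implementation, and the helper name `simple_train_test_split` is hypothetical.

```python
import random

def simple_train_test_split(rows, targets, test_size=0.2, seed=1):
    """Shuffle row positions and hold out a fraction as the test set."""
    positions = list(range(len(rows)))
    random.Random(seed).shuffle(positions)  # seeded, so the split is repeatable
    n_test = round(len(rows) * test_size)
    test_pos, train_pos = positions[:n_test], positions[n_test:]

    def pick(seq, idx):
        return [seq[i] for i in idx]

    return (pick(rows, train_pos), pick(rows, test_pos),
            pick(targets, train_pos), pick(targets, test_pos))

# Made-up data: ten "rows" and a target derived from each row.
toy_rows = list(range(10))
toy_targets = [value * 10 for value in toy_rows]
train_rows, test_rows, train_targets, test_targets = simple_train_test_split(
    toy_rows, toy_targets
)
print(len(train_rows), len(test_rows))  # 8 2
```

Fixing the seed plays the same role as `random_state=1` in the cell above: the same rows end up in the test set on every run.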

## Making Predictions

Now that we have a training set and a test set, let's train a model and make test predictions.

In [21]:
from sklearn.linear_model import LinearRegression

clf = LinearRegression()

#Train clf using the fit() method.
clf.fit(X_train, y_train)

#Use the predict() method on clf to make predictions on X_test
predictions = clf.predict(X_test)

## Computing The Prediction Error

We'll use mean squared error (MSE) as our error metric.

In [23]:
mse = sum((predictions - y_test) ** 2) / len(predictions)
mse

1989.0367535887601

Our MSE is about 1989, which is a fairly large value (the exact number changes from run to run because the 4,000-row sample is drawn randomly). There's no hard and fast rule about what a "good" error rate is, because it depends on the problem we're solving and our error tolerance.
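Because MSE is measured in squared upvotes, taking its square root (RMSE) gives a number in plain upvotes that is easier to interpret. The sketch below computes both on made-up values and then takes the square root of the MSE printed in the output above. The function name and toy values are assumptions for illustration.

```python
import math

def mean_squared_error(predicted, actual):
    """Average of the squared differences between predictions and true values."""
    return sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(predicted)

# Made-up predictions and upvote counts.
toy_predictions = [3.0, 10.0, 1.0]
toy_upvotes = [1, 12, 1]

toy_mse = mean_squared_error(toy_predictions, toy_upvotes)  # (4 + 4 + 0) / 3
toy_rmse = math.sqrt(toy_mse)

# Square root of the MSE value printed in the notebook output.
reported_rmse = math.sqrt(1989.0367535887601)
print(round(reported_rmse, 1))  # 44.6
```

An RMSE of roughly 44.6 upvotes is large for a data set where the rows shown earlier have only 1 to 3 upvotes each.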

## Considerations For Future Exploration

We can take several steps to reduce the error and explore natural language processing further. Here are some ideas for your next steps:

- Use the entire data set. We used a sample in this mission, but you could download the entire data set from this GitHub repository. This should reduce the error considerably: natural language data has many features, and more data means the model sees more occurrences of the same words in both the training and test sets, which helps it make better predictions.
- Add "meta" features like headline length and average word length.
- Use a random forest, or another more powerful machine learning technique.
- Explore different thresholds for removing extraneous columns.
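As a starting point for the "meta" features idea, here is a minimal sketch that computes headline length and average word length for a single headline. The function name and the returned keys are hypothetical. In practice these values would be added as extra columns alongside the word counts.

```python
def headline_meta_features(headline):
    """Return simple length-based features for one headline."""
    words = headline.split()
    word_count = len(words)
    # Guard against empty headlines to avoid dividing by zero.
    avg_word_length = sum(len(w) for w in words) / word_count if words else 0.0
    return {
        "headline_length": len(headline),   # characters, including spaces
        "word_count": word_count,
        "avg_word_length": avg_word_length,
    }

features = headline_meta_features("Some notes on SuperFish")
print(features)
# {'headline_length': 23, 'word_count': 4, 'avg_word_length': 5.0}
```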
