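Our goal is to train a linear regression model that predicts the number of upvotes a headline will receive.

To turn the headlines into numeric features, we'll use a bag-of-words model, in which each column counts how often a given word appears in a headline.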

To reduce the number of features and enable the linear regression model to make better predictions, we'll remove any words that occur fewer than 5 times or more than 100 times.

The model determines which words correlate with more upvotes and which with fewer. By finding these correlations, the model should be able to predict which headlines will be highly upvoted in the future.
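
As a quick illustration of the bag-of-words idea, here is a minimal sketch on three made-up headlines (they are not taken from the data set, and the variable names are only for illustration). Each row is a headline and each column is a word count:

```python
from collections import Counter

import pandas as pd

# Made-up headlines, used only to illustrate the representation.
headlines = ["Python is fast", "Is Python slow", "Fast cars"]

# Lowercase and split on whitespace.
tokenized = [headline.lower().split() for headline in headlines]

# The vocabulary is every distinct token, sorted for a stable column order.
vocabulary = sorted({token for tokens in tokenized for token in tokens})

# One row per headline, one column per word, each cell a count.
bag = pd.DataFrame(
    [[Counter(tokens)[word] for word in vocabulary] for tokens in tokenized],
    columns=vocabulary,
)
print(bag)
```

Word order is discarded: "Python is fast" and "fast is Python" would produce identical rows.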

In [1]:
import pandas as pd
submissions = pd.read_csv("sel_hn_stories.csv")
submissions.columns = ["submission_time", "upvotes", "url", "headline"]
submissions = submissions.dropna()

In [2]:
submissions

Unnamed: 0,submission_time,upvotes,url,headline
0,2010-02-17T16:57:59Z,1,blog.jonasbandi.net,Software: Sadly we did adopt from the construc...
1,2014-02-04T02:36:30Z,1,blogs.wsj.com,Google’s Stock Split Means More Control for L...
2,2011-10-26T07:11:29Z,1,threatpost.com,SSL DOS attack tool released exploiting negoti...
3,2011-04-03T15:43:44Z,67,algorithm.com.au,Immutability and Blocks Lambdas and Closures
4,2013-01-13T16:49:20Z,1,winmacsofts.com,Comment optimiser la vitesse de Wordpress?
...,...,...,...,...
2994,2015-03-23T18:46:53.000Z,1,ondras.github.io,Rot.js: ROguelike Toolkit in JavaScript
2995,2010-03-11T19:52:37Z,40,economist.com,Amazon auctions computing power: Clouds under ...
2996,2015-04-03T18:07:13.000Z,2,computerworld.com,Nissan CEO: We will have an autonomous vehicle...
2997,2013-07-17T21:54:41Z,2,blog.pythonlibrary.org,Connecting to Dropbox with Python


In [8]:
tokenized_headlines = []
for item in submissions["headline"]:
    tokenized_headlines.append(item.split())

In [9]:
punctuation = [",", ":", ";", ".", "'", '"', "’", "?", "/", "-", "+", "&", "(", ")"]
clean_tokenized = []
for item in tokenized_headlines:
    tokens = []
    for token in item:
        token = token.lower()
        for punc in punctuation:
            token = token.replace(punc, "")
        tokens.append(token)
    clean_tokenized.append(tokens)

In [10]:
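import numpy as np
unique_tokens = []  # tokens that occur at least twice across all headlines
single_tokens = []  # tokens seen at least once so far

for tokens in clean_tokenized:
    for token in tokens:
        if token not in single_tokens:
            single_tokens.append(token)
        elif token not in unique_tokens:
            # Second sighting: promote the token to a feature column
            unique_tokens.append(token)

# One row per headline, one zero-initialized column per repeated token
counts = pd.DataFrame(0, index=np.arange(len(clean_tokenized)), columns=unique_tokens)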

In [11]:
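for i, item in enumerate(clean_tokenized):
    for token in item:
        if token in unique_tokens:
            # Single-step .at assignment; chained indexing such as
            # counts.iloc[i][token] may update a temporary copy instead
            counts.at[i, token] += 1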
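
The nested loops above search a Python list for every token, which gets slow as the vocabulary grows. One possible alternative is sketched below using `collections.Counter`. The helper name `build_counts` and its `min_total` parameter are hypothetical, not part of the original notebook, and the column order may differ from the loop-based version:

```python
from collections import Counter

import pandas as pd


def build_counts(token_lists, min_total=2):
    """Build a headline-by-token count matrix (hypothetical helper).

    Only tokens whose total count across all headlines is at least
    min_total become columns; min_total=2 mirrors keeping repeated tokens.
    """
    # Total occurrences of each token across every headline.
    totals = Counter(token for tokens in token_lists for token in tokens)
    vocab = [token for token, n in totals.items() if n >= min_total]

    # Per-headline counts; a Counter returns 0 for tokens it has not seen.
    rows = [Counter(tokens) for tokens in token_lists]
    return pd.DataFrame([[row[token] for token in vocab] for row in rows],
                        columns=vocab)


# Toy input, not the real headlines.
demo_counts = build_counts([["a", "b", "a"], ["b", "c"]])
print(demo_counts)
```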

In [14]:
# Total number of times each token appears across all headlines
word_counts = counts.sum(axis=0)

# Keep only tokens that occur at least 5 and at most 100 times
counts = counts.loc[:,(word_counts >= 5) & (word_counts <= 100)]

In [16]:
word_counts

as              47
you            100
good            13
what            62
de               9
              ... 
sale             5
competition      5
diet             6
reasons          5
nike             7
Length: 661, dtype: int64

In [17]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(counts, submissions["upvotes"], test_size=0.2, random_state=1)

In [18]:
from sklearn.linear_model import LinearRegression

clf = LinearRegression()
clf.fit(X_train, y_train)

predictions = clf.predict(X_test)

In [21]:
predictions

array([ 2.17690686e+01,  6.35049729e+01, -1.67007237e+01,  1.67866575e+01,
       -1.97586441e+00,  3.44558067e+01, -4.49860607e+01,  1.41788903e+01,
        1.53594595e+01,  4.82887218e+00,  2.25350723e+00,  4.98527927e+01,
        1.10696859e+01,  3.78096656e+01,  1.10326030e+01, -1.90095575e-01,
        1.10326030e+01,  3.72920816e+00, -1.40047322e+01,  3.48050765e+01,
        6.43508350e+01,  1.10326030e+01,  2.44084956e+01,  1.10326030e+01,
        2.02609640e+01,  2.36476055e+00,  1.10326030e+01,  2.26720526e+00,
        2.22436673e+01,  2.66568210e+00, -3.47492521e+00, -4.72847975e+01,
        3.67933060e+00,  1.09959656e+02,  9.91904416e+00,  4.43886626e+01,
        9.00963982e+00, -2.17246247e+01,  2.92874561e+01, -7.08448438e+00,
        5.38368177e+01, -2.67775578e+00,  3.52360958e+01,  2.15580590e+01,
        1.10326030e+01,  2.07073523e+01, -1.06418175e+01,  1.10326030e+01,
        1.72869227e+01, -1.39319454e+01, -1.55296118e+01,  1.23604698e+01,
       -5.10036138e+00,  

We can now calculate our prediction error as the mean squared error (MSE).

In [22]:
mse = sum((predictions - y_test) ** 2) / len(predictions)

In [23]:
mse

2651.1457056689633

There's no hard and fast rule about what a "good" error is, because it depends on the problem we're solving and our error tolerance.

In this case, the mean number of upvotes is 10, and the standard deviation is 39.5. Taking the square root of our MSE gives a root mean squared error (RMSE) of 51.5 upvotes, which measures the typical size of a prediction error. Always predicting the mean would give an RMSE of roughly one standard deviation, so an RMSE above 39.5 suggests the model does no better than that trivial baseline, and our predictions are often far off-base.
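
The arithmetic can be checked with a short sketch. The helpers `rmse` and `mean_baseline_rmse` are hypothetical names introduced here for illustration. The second one computes the error of a baseline that always predicts the training mean, which is a useful yardstick for any regression model:

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root mean squared error, in the same units as the target."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


def mean_baseline_rmse(y_train, y_test):
    """RMSE of a baseline that always predicts the training mean."""
    baseline = np.full(len(y_test), np.mean(y_train))
    return rmse(y_test, baseline)


# Square root of the MSE reported above.
reported_rmse = float(np.sqrt(2651.1457056689633))
print(round(reported_rmse, 1))  # 51.5
```

With the notebook's variables, `mean_baseline_rmse(y_train, y_test)` could be compared directly against `rmse(y_test, predictions)`.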

Possible improvements:
    Use the entire data set.
    Use a random forest or another more powerful machine learning technique.
    Explore different thresholds for removing extraneous columns.
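
As a rough sketch of the random forest suggestion, the snippet below fits scikit-learn's `RandomForestRegressor` on synthetic stand-in data. The data is randomly generated and is not the Hacker News counts, and the hyperparameters are arbitrary assumptions rather than tuned values. In the notebook, `counts` and `submissions["upvotes"]` would take the place of `X` and `y`:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic stand-in for word counts (X) and upvotes (y).
rng = np.random.default_rng(1)
X = rng.integers(0, 3, size=(200, 10))
y = 5 * X[:, 0] + rng.poisson(2, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1
)

# Hyperparameters here are illustrative guesses, not recommendations.
forest = RandomForestRegressor(n_estimators=100, min_samples_leaf=5, random_state=1)
forest.fit(X_train, y_train)

forest_predictions = forest.predict(X_test)
forest_rmse = float(np.sqrt(np.mean((forest_predictions - y_test) ** 2)))
print(forest_rmse)
```

One design point in favor of tree ensembles here: their predictions are averages of training targets, so they cannot go negative when every training target is non-negative, unlike the linear regression output shown earlier.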