# Overview of the Data

- consists of submissions users made to _Hacker News_ from 2006 to 2015
- Dataquest sampled 3000 rows from the data randomly
- Data has four columns:
    - `submission_time`: when the article was submitted
    - `upvotes`: number of upvotes the article received
    - `url`: base URL of the article
    - `headline`: article's headline
    
- In this mission, we'll be predicting the number of upvotes the articles received, based on their headlines
    - upvotes are an indicator of popularity, so we'll discover which types of articles tend to be the most popular

In [16]:
import pandas as pd
import numpy as np

submissions = pd.read_csv('sel_hn_stories.csv')
submissions.columns = ['submission_time', 'upvotes', 'url', 'headline']

submissions = submissions.dropna()
submissions.head(5)

Unnamed: 0,submission_time,upvotes,url,headline
0,2010-02-17T16:57:59Z,1,blog.jonasbandi.net,Software: Sadly we did adopt from the construc...
1,2014-02-04T02:36:30Z,1,blogs.wsj.com,Google’s Stock Split Means More Control for L...
2,2011-10-26T07:11:29Z,1,threatpost.com,SSL DOS attack tool released exploiting negoti...
3,2011-04-03T15:43:44Z,67,algorithm.com.au,Immutability and Blocks Lambdas and Closures
4,2013-01-13T16:49:20Z,1,winmacsofts.com,Comment optimiser la vitesse de Wordpress?


In [99]:
pd.set_option('display.width', 500)
pd.set_option('display.max_columns', 100)

# Tokenizing the Headlines

- goal: train a linear regression algorithm that predicts the number of upvotes a headline would receive
    - to do this, need to convert each headline to a numerical representation
- several ways to do this, but we'll use a `bag of words` model
    - represents each piece of text as a numerical vector
- first step in creating a bag of words model is tokenization
    - in tokenization, we break up a sentence into disconnected words
    - all we're doing is splitting each sentence into a list of individual words
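
As a rough illustration of what a bag of words vector looks like, here is a minimal sketch on two made-up sentences (the sentences and the variable names `sentences`, `vocabulary`, and `vectors` are assumptions for this example, not part of the mission data):

```python
# Toy illustration of a bag-of-words representation (example sentences are made up).
sentences = ["the cat sat", "the dog sat on the cat"]

# Tokenize by splitting each sentence on whitespace.
tokenized = [s.split() for s in sentences]

# Build a sorted vocabulary containing every distinct word.
vocabulary = sorted({word for tokens in tokenized for word in tokens})

# Each sentence becomes a vector of word counts, one slot per vocabulary word.
vectors = [[tokens.count(word) for word in vocabulary] for tokens in tokenized]

print(vocabulary)  # ['cat', 'dog', 'on', 'sat', 'the']
print(vectors)     # [[1, 0, 0, 1, 1], [1, 1, 1, 1, 2]]
```

Word order is discarded; only the counts survive, which is why the model is called a "bag" of words.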

In [40]:
tokenized_headlines = []

# Split each headline into individual words on whitespace (split() with no arguments), and append the
# resulting list to tokenized_headlines.
for headline in submissions['headline']:
    tokenized_headlines.append(headline.split())

tokenized_headlines[:5]
# tokenized_headlines = submissions['headline'].str.strip().str.split(pat=' ').tolist()

[['Software:',
  'Sadly',
  'we',
  'did',
  'adopt',
  'from',
  'the',
  'construction',
  'analogy'],
 ['Google’s',
  'Stock',
  'Split',
  'Means',
  'More',
  'Control',
  'for',
  'Larry',
  'and',
  'Sergey'],
 ['SSL',
  'DOS',
  'attack',
  'tool',
  'released',
  'exploiting',
  'negotiation',
  'overhead'],
 ['Immutability', 'and', 'Blocks', 'Lambdas', 'and', 'Closures'],
 ['Comment', 'optimiser', 'la', 'vitesse', 'de', 'Wordpress?']]

In [39]:
type(tokenized_headlines)

list

# Preprocessing Tokens to Increase Accuracy

- need to process tokens to make predictions more accurate
- need to convert variations so they're consistent (ex. `Berlin`, `Berlin.` and `berlin`)
    - can do this by lowercasing and removing punctuation
- doesn't have to be perfect, but the more we can help the computer group variations of the same word together, the higher our prediction accuracy will be

In [57]:
punctuation = [",", ":", ";", ".", "'", '"', "’", "?", "/", "-", "+", "&", "(", ")"]
clean_tokenized = []

for item in tokenized_headlines: # loops through each item of list
    tokens = [] # create list to store each token
    for token in item: # for each token (i.e. word) in each entry
        token = token.lower() # lowercase all letters in token
        for punc in punctuation: # for each punctuation in list of punctuation
            token = token.replace(punc, '') # remove that specific punctuation mark 
        tokens.append(token) # append cleaned token (i.e. word) to tokens list
    clean_tokenized.append(tokens) # append the cleaned tokens list to clean_tokenized
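
To check that the cleaning step really groups the `Berlin` variations together, the inner loop can be pulled out into a small helper. This is only a sketch; the function name `clean_token` is hypothetical and not part of the original notebook:

```python
punctuation = [",", ":", ";", ".", "'", '"', "’", "?", "/", "-", "+", "&", "(", ")"]

def clean_token(token):
    """Lowercase a token and strip the punctuation characters listed above."""
    token = token.lower()
    for punc in punctuation:
        token = token.replace(punc, "")
    return token

# The three variations from the notes should collapse to a single form.
variants = ["Berlin", "Berlin.", "berlin"]
cleaned = [clean_token(t) for t in variants]
print(cleaned)  # ['berlin', 'berlin', 'berlin']
```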

# Assembling a Matrix of Unique Words

- we can now begin to convert sentences to numerical representations
- first, we'll retrieve all unique words from all the headlines
- then, we'll create a matrix
    - and assign those words as the column headers
    - we'll initialize all the values in the matrix to `0`

In [105]:
import numpy as np
unique_tokens = []
single_tokens = []

for tokens in clean_tokenized: # loops through each list within clean_tokenized
    for token in tokens: # loops through each word in list
        if token not in single_tokens: # first time we've seen this word...
            single_tokens.append(token) # record it in single_tokens
        elif token not in unique_tokens: # seen before, but not yet in unique_tokens...
            unique_tokens.append(token) # so it occurs at least twice: add it to unique_tokens

# create a dataframe, initialized with 0 for every entry, with one row per headline in clean_tokenized and columns made
# up of words from the unique_tokens list (i.e. words that appear more than once)
counts = pd.DataFrame(0, index=np.arange(len(clean_tokenized)), columns=unique_tokens)
counts.head(5)

Unnamed: 0,and,for,as,you,is,the,split,good,how,what,Unnamed: 11,of,de,in,a,with,amazon,cloud,at,google,to,status,back,raises,faster,an,on,2014,out,show,dont,style,from,video,facebook,via,startups,threat,testing,releases,into,russia,job,released,or,it,icrosoft,programming,new,using,...,hell,django,cnet,infochimps,percent,peter,emails,widget,checklist,$12,usic,twilio,sleep,calculator,succeed,sources,eteor,basic,photographs,fundraiser,adapter,diversity,asking,link,deploying,plate,healthcare,term,gist,saving,devops,improved,practical,celebrate,thomas,sabo,club,breaking,macbook,contracts,frameworks,animated,walks,auctions,clouds,hammer,autonomous,vehicle,crowdsourcing,disaster
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
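
The two-list bookkeeping above can also be expressed with `collections.Counter`. The following is a sketch on made-up tokens (`toy_tokenized`, `token_totals`, and `repeated_tokens` are names assumed for this example); note that it orders words by first appearance, so the column order may differ from the loop above:

```python
from collections import Counter

# Made-up tokenized headlines standing in for clean_tokenized.
toy_tokenized = [["show", "hn", "a", "new", "tool"], ["a", "new", "startup"], ["hello"]]

# Count how many times each token occurs across all headlines.
token_totals = Counter(token for tokens in toy_tokenized for token in tokens)

# Keep only the tokens that occur more than once, in order of first appearance.
repeated_tokens = [token for token, total in token_totals.items() if total > 1]

print(repeated_tokens)  # ['a', 'new']
```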


# Counting Token Occurrences

- loop through each list of tokens in `clean_tokenized`
    - use `enumerate()` function when writing the loop to get an index along with the list of tokens
- loop through each token in list of tokens
    - check whether the token is in `unique_tokens`
    - if not, it isn't a column in the dataframe and you should ignore it
- increment the appropriate cell by indexing the row of `counts` and finding the right column for the token
    - add `1` to the cell to indicate that you found the token once
    

In [106]:
# loop through each list of tokens in clean_tokenized
for idx, tokens in enumerate(clean_tokenized):
    for word in tokens: # loop through each token in list of tokens
        if word in unique_tokens: # check whether token is in unique_tokens
            counts.loc[idx, word] += 1 # if it is add 1 to cell according to entries index and column

In [107]:
counts.head(5)

Unnamed: 0,and,for,as,you,is,the,split,good,how,what,Unnamed: 11,of,de,in,a,with,amazon,cloud,at,google,to,status,back,raises,faster,an,on,2014,out,show,dont,style,from,video,facebook,via,startups,threat,testing,releases,into,russia,job,released,or,it,icrosoft,programming,new,using,...,hell,django,cnet,infochimps,percent,peter,emails,widget,checklist,$12,usic,twilio,sleep,calculator,succeed,sources,eteor,basic,photographs,fundraiser,adapter,diversity,asking,link,deploying,plate,healthcare,term,gist,saving,devops,improved,practical,celebrate,thomas,sabo,club,breaking,macbook,contracts,frameworks,animated,walks,auctions,clouds,hammer,autonomous,vehicle,crowdsourcing,disaster
0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,1,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
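
The counting loop is easier to verify on a tiny input. This sketch uses made-up tokens and a two-column dataframe (all names here are assumptions for illustration). It also checks membership against a `set`, which is usually faster than searching a list when there are many columns:

```python
import numpy as np
import pandas as pd

# Made-up tokenized headlines and the columns we want to count.
toy_tokenized = [["a", "new", "tool"], ["a", "new", "new", "startup"]]
columns = ["a", "new"]
column_set = set(columns)  # set membership tests avoid scanning a list each time

# One row per headline, one column per tracked word, all starting at 0.
toy_counts = pd.DataFrame(0, index=np.arange(len(toy_tokenized)), columns=columns)

for idx, tokens in enumerate(toy_tokenized):
    for word in tokens:
        if word in column_set:  # words without a column are ignored
            toy_counts.loc[idx, word] += 1

print(toy_counts)
```

The second row should show `new` counted twice, while `tool` and `startup` are skipped because they have no column.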


# Removing Columns to Increase Accuracy

- Have over `2000` columns in matrix
    - hard to make predictions with this much info
    - too many columns can induce model to fit noise instead of the signal in the data
- Two kinds of features that will reduce prediction accuracy
    - features that occur only a few times --> overfitting, not enough info to accurately decide whether they're important
    - features that occur too many times like `and` and `to` --> don't add info and don't necessarily correlate with upvotes
        - sometimes called stopwords
- to reduce number of features and enable better predictions
    - remove any words occurring fewer than 5 times or more than 100 times

In [125]:
# Sum each column in counts. This indicates how many times each word occurs in the headlines.
word_counts = counts.sum()

# Build a boolean mask that is True for words occurring at least 5 and at most 100 times, then use it
# to remove any columns whose words occur fewer than 5 times or more than 100 times.
keep_columns = (word_counts >= 5) & (word_counts <= 100)
counts = counts.loc[:, keep_columns]

In [126]:
counts.shape

(2800, 661)
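
To see the thresholds at work, here is a small sketch with three made-up columns: one too rare, one within range, and one too common (the dataframe and its values are invented for illustration):

```python
import pandas as pd

# Made-up word counts: "rare" totals 1, "useful" totals 5, "the" totals 105.
toy_counts = pd.DataFrame({"rare": [1, 0, 0], "useful": [2, 1, 2], "the": [40, 35, 30]})

# Total occurrences of each word across all rows.
column_totals = toy_counts.sum()

# Keep words that occur at least 5 times and at most 100 times.
keep = (column_totals >= 5) & (column_totals <= 100)
filtered = toy_counts.loc[:, keep]

print(list(filtered.columns))  # ['useful']
```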

# Splitting the Data Into Train and Test Sets

- need to split the data into two sets so that we can evaluate the algorithm effectively: train on one set, then measure error on data the model hasn't seen
- `train_test_split()` will help us accomplish this

In [127]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(counts, submissions['upvotes'], test_size = 0.2, random_state = 1)

# Making Predictions With fit()

In [128]:
from sklearn.linear_model import LinearRegression

clf = LinearRegression()

# train clf using the fit method
clf.fit(X_train, y_train)

# use predict() to make predictions on X_test
predictions = clf.predict(X_test)

# Calculating Prediction Error

- we'll use MSE --> mean squared error
    - penalizes errors that are further away from the actual value more heavily, because each error is squared
    - we want our predictions to be relatively close to the actual values

In [138]:
# calculate mse associated with predictions: the mean of the squared differences
mse = sum((predictions - y_test) ** 2) / len(predictions)

mse

2651.1457056689683
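
A small worked example shows how squaring lets a single large miss dominate the result. The numbers below are made up for illustration:

```python
# Made-up actual and predicted upvote counts.
actual = [1, 10, 3]
predicted = [2, 4, 3]

# Squared error for each prediction: 1, 36, and 0.
squared_errors = [(p - a) ** 2 for p, a in zip(predicted, actual)]

# MSE is the mean of the squared errors: (1 + 36 + 0) / 3.
mse = sum(squared_errors) / len(squared_errors)

print(squared_errors)  # [1, 36, 0]
print(mse)
```

The prediction that missed by 6 contributes 36 of the 37 total, even though the other miss was only 1 upvote.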

# Next Steps

Our MSE is about 2651, which is a fairly large value. There's no hard and fast rule about what a "good" error rate is, because it depends on the problem we're solving and our error tolerance.

In this case, the mean number of upvotes is 10, and the standard deviation is 39.5. If we take the square root of our MSE to calculate error in terms of upvotes, we get roughly 51.5. This means that our predictions are typically off by around 51.5 upvotes from the true value. This is higher than the standard deviation, so our predictions are often far off-base.
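
The conversion from MSE to an error in upvotes is a one-line calculation. Here it is applied to the value printed by the `mse` cell earlier in this notebook:

```python
import math

# MSE value as printed by the notebook's mse cell.
mse = 2651.1457056689683

# The root mean squared error is in the same units as the target (upvotes).
rmse = math.sqrt(mse)

print(round(rmse, 1))  # 51.5
```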

We can take several steps to reduce the error and explore natural language processing further. Here are some ideas for your next steps:

- Use the entire data set. While we used a sample in this mission, you could download the [entire data set from this GitHub repository](https://github.com/arnauddri/hn). Natural language processing produces many features, so using more data means the model sees more occurrences of the same features in both the training and test sets, which should reduce the error and help it make better predictions.
- Add "meta" features like headline length and average word length.
- Use a random forest, or another more powerful machine learning technique.
- Explore different thresholds for removing extraneous columns.
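
As a starting point for the "meta" features idea, the sketch below computes headline length and average word length for a single headline. The helper name `meta_features` and the choice to measure length in characters are assumptions, not something the mission specifies:

```python
def meta_features(headline):
    """Return (headline length in characters, average word length) for one headline."""
    words = headline.split()
    headline_length = len(headline)
    # Guard against empty headlines to avoid dividing by zero.
    avg_word_length = sum(len(w) for w in words) / len(words) if words else 0.0
    return headline_length, avg_word_length

print(meta_features("Show HN: A new tool"))  # (19, 3.0)
```

Values like these could be added as extra columns alongside the word counts before splitting the data.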