Hacker News is a social news website focusing on computer science and entrepreneurship. Users submit articles, and other users can upvote them. As on similar websites, the articles with the most upvotes make it to the front page.

This data set consists of submissions users made to Hacker News from 2006 to 2015. Developer Arnaud Drizard used the Hacker News API to scrape the data, which can be found in one of his GitHub repositories: https://github.com/arnauddri/hn

hn_stories is a random sample of 3000 rows from that data, and it has only four columns:

* submission_time - When the article was submitted
* upvotes - The number of upvotes the article received
* url - The base URL of the article
* headline - The article's headline

I'll be predicting the number of upvotes the articles received, based on their headlines.
Because upvotes are an indicator of popularity, I will try to discover which types of articles tend to be the most popular in this community.

-------


In [38]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import mean_squared_error

In [39]:

# Reading the file into a dataframe
articals = pd.read_csv("hn_stories.csv")
articals.columns = ["submission_time", "upvotes", "url", "headline"]


In [40]:
# Exploring the data
articals.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2999 entries, 0 to 2998
Data columns (total 4 columns):
submission_time    2999 non-null object
upvotes            2999 non-null int64
url                2810 non-null object
headline           2989 non-null object
dtypes: int64(1), object(3)
memory usage: 93.8+ KB


As we can see, there are four columns: three with an object data type and one numerical (int64).

The url and headline columns have some missing values (189 and 10 rows, respectively), so I am going to remove those rows first.

In [41]:
articals = articals.dropna()
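
Before dropping rows, it can help to see how many values are missing in each column. The sketch below uses a small made-up dataframe (`toy` is a stand-in, not the real data). The `reset_index(drop=True)` call is an optional extra that is not part of the cell above; it renumbers the rows from 0 so that row labels match row positions afterwards.

```python
import pandas as pd

# Toy frame standing in for the real data (values are made up for illustration)
toy = pd.DataFrame({
    "upvotes": [1, 5, 3],
    "url": ["example.com", None, "example.org"],
    "headline": ["First post", "Second post", None],
})

# Count missing values in each column
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Drop rows with any missing value and renumber the index from 0
toy_clean = toy.dropna().reset_index(drop=True)
print(len(toy), "->", len(toy_clean))
```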

-----
Since I am going to train a linear regression model to predict the number of upvotes a headline would receive, I will need to convert each headline to a numerical representation.
There are several ways to do this. I will use the bag of words model, where each piece of text is represented as a numerical vector of word counts.
The first step in creating a bag of words model is tokenization. I am going to split each headline into a list of individual words on whitespace.

In [42]:
# Creating a list for the tokens
headline_tokens = []

# Looping through the dataframe to split headlines and add words as a list 
for row in articals['headline']:
    headline_tokens.append(row.split())
    

In [43]:
headline_tokens[:5]

[['Software:',
  'Sadly',
  'we',
  'did',
  'adopt',
  'from',
  'the',
  'construction',
  'analogy'],
 ['Google’s',
  'Stock',
  'Split',
  'Means',
  'More',
  'Control',
  'for',
  'Larry',
  'and',
  'Sergey'],
 ['SSL',
  'DOS',
  'attack',
  'tool',
  'released',
  'exploiting',
  'negotiation',
  'overhead'],
 ['Immutability', 'and', 'Blocks', 'Lambdas', 'and', 'Closures'],
 ['Comment', 'optimiser', 'la', 'vitesse', 'de', 'Wordpress?']]
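
To make the bag of words idea concrete, here is a minimal sketch on two made-up headlines. The names `docs`, `vocab`, and `vectors` are illustrative only and do not appear elsewhere in this notebook. Each headline becomes one row, each vocabulary word becomes one column, and each cell holds how often the word occurs in that headline.

```python
from collections import Counter

# Two made-up headlines, already split on whitespace
docs = [
    "google buys startup".split(),
    "startup raises money for startup".split(),
]

# Vocabulary: every distinct token, sorted so the column order is stable
vocab = sorted({token for doc in docs for token in doc})

# One row per headline, one column per vocabulary word, cell = word count
vectors = []
for doc in docs:
    counts = Counter(doc)
    vectors.append([counts[word] for word in vocab])

print(vocab)    # ['buys', 'for', 'google', 'money', 'raises', 'startup']
print(vectors)  # [[1, 0, 1, 0, 0, 1], [0, 1, 0, 1, 1, 2]]
```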

----
In the next step, I am going to remove punctuation and lowercase all words.

In [44]:
# A list of punctuation to be removed from tokens
punctuation = [ "/", "-", "+", "&", "(", ")", ",", ":", ";", ".", "'", '"', "’", "?"]

# New list for the processed tokens
clean_tokens = []

# A loop to go through tokens and lower case each word and remove punctuation 
for tokens in headline_tokens:
    # List of cleaned tokens for the current headline
    tokens_list = []
    for token in tokens:
        token = token.lower()
        for p in punctuation:
            token = token.replace(p, '')
        tokens_list.append(token)
    clean_tokens.append(tokens_list)
    
clean_tokens[:5]

[['software',
  'sadly',
  'we',
  'did',
  'adopt',
  'from',
  'the',
  'construction',
  'analogy'],
 ['googles',
  'stock',
  'split',
  'means',
  'more',
  'control',
  'for',
  'larry',
  'and',
  'sergey'],
 ['ssl',
  'dos',
  'attack',
  'tool',
  'released',
  'exploiting',
  'negotiation',
  'overhead'],
 ['immutability', 'and', 'blocks', 'lambdas', 'and', 'closures'],
 ['comment', 'optimiser', 'la', 'vitesse', 'de', 'wordpress']]
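
As an alternative to calling `replace` once per punctuation character, the same cleaning can be sketched with `str.translate`, which deletes all listed characters in one pass. The helper `clean_token` and the table `remove_punct` are hypothetical names introduced for this example; the punctuation characters are the same ones used in the cell above.

```python
# Same punctuation characters as the list used in the cell above
punctuation = "/-+&(),:;.'\"’?"

# Translation table that maps each punctuation character to None (delete)
remove_punct = str.maketrans("", "", punctuation)

def clean_token(token):
    """Lowercase a token and delete the punctuation characters in one pass."""
    return token.lower().translate(remove_punct)

cleaned = [clean_token(t) for t in ["Software:", "Google’s", "Wordpress?"]]
print(cleaned)  # ['software', 'googles', 'wordpress']
```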

-----
Now I am going to create a dataframe with one column per unique word, so that the headlines can be converted to their numerical representations.
I will only keep tokens that occurred more than once, because tokens that occurred only once don't add to the model's predictive power.

In [45]:

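# Every token seen so far (at least one occurrence)
single_tokens = []
# Tokens seen at least twice; these become the dataframe columns
unique_tokens = []

# Loop through the clean tokens: a token's first sighting goes to single_tokens,
# and a later sighting adds it to unique_tokens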
for tokens in clean_tokens:
    for token in tokens:
        if token not in single_tokens:
            single_tokens.append(token)
        elif token not in unique_tokens:
            unique_tokens.append(token)

# Creating a dataframe with the unique tokens as column names and one row per headline, initialized with 0
tokens_df = pd.DataFrame(0, index=np.arange(len(clean_tokens)), columns = unique_tokens)


In [46]:
tokens_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2800 entries, 0 to 2799
Columns: 2310 entries, and to disaster
dtypes: int64(2310)
memory usage: 49.4 MB
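
The repeated `in` checks on Python lists above scan the whole list each time. A possible alternative, sketched here on made-up tokens, counts everything in one pass with `collections.Counter` and then keeps the tokens seen at least twice. The names `toy_tokens` and `repeated_tokens` are illustrative. Note that this version orders tokens by first appearance, so the resulting column order may differ from the loop above.

```python
from collections import Counter

# Made-up cleaned headlines for illustration
toy_tokens = [["google", "stock", "split"], ["google", "and", "apple"], ["and", "more"]]

# Count every token across all headlines in a single pass
counts = Counter(token for tokens in toy_tokens for token in tokens)

# Keep tokens seen at least twice, in order of first appearance
repeated_tokens = [token for token, n in counts.items() if n >= 2]
print(repeated_tokens)  # ['google', 'and']
```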


-----
I will loop through the tokenized headlines and, each time a token is in the list of unique tokens, add 1 to the matching column in that headline's row of the dataframe.



In [47]:
for index, tokens in enumerate(clean_tokens):
    for token in tokens:
        if token in unique_tokens:
            # .loc updates the cell in place; chained indexing (iloc[index][token]) may write to a copy
            tokens_df.loc[index, token] += 1
            

In [48]:
tokens_df.iloc[:10,:20]

Unnamed: 0,and,for,as,you,is,the,split,good,how,what,Unnamed: 11,of,de,in,a,with,amazon,cloud,at,google
0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,1,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0
5,0,1,2,2,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0
6,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0
7,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
8,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
9,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0
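
Updating a dataframe one cell at a time can be slow on larger data. One possible alternative, shown here as a sketch on made-up tokens, is to fill a plain NumPy array and wrap it in a dataframe once at the end. The names `toy_tokens`, `vocab`, and `column_of` are assumptions for this example, not part of the notebook.

```python
import numpy as np
import pandas as pd

# Made-up cleaned headlines and vocabulary for illustration
toy_tokens = [["google", "and", "apple"], ["and", "and", "more"]]
vocab = ["and", "google"]

# Map each vocabulary word to its column position
column_of = {word: j for j, word in enumerate(vocab)}

# Fill a plain integer array, then wrap it in a DataFrame once at the end
counts = np.zeros((len(toy_tokens), len(vocab)), dtype=int)
for i, tokens in enumerate(toy_tokens):
    for token in tokens:
        j = column_of.get(token)  # None for words outside the vocabulary
        if j is not None:
            counts[i, j] += 1

toy_df = pd.DataFrame(counts, columns=vocab)
print(toy_df)
```

The dictionary lookup also replaces the `token in unique_tokens` list scan with a constant-time check.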


To further enhance my predictions, I am going to remove words that occurred more than 100 times, which should remove stopwords like 'and' and 'for' that appear in many headlines, as well as words that occurred fewer than 5 times.



In [49]:
# get the sum of each token occurrence 
word_counts = tokens_df.sum(axis = 0)

# keep only tokens that occur at least 5 and at most 100 times
tokens_df = tokens_df.loc[:, (word_counts >= 5) & (word_counts <= 100)]

In [50]:
tokens_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2800 entries, 0 to 2799
Columns: 661 entries, as to nike
dtypes: int64(661)
memory usage: 14.1 MB


This reduced the number of tokens from 2310 to 661.

I will split the data into two sets, an 80% training set and a 20% test set,
using the train_test_split function from scikit-learn.


In [51]:

x_train, x_test, y_train, y_test = train_test_split(tokens_df, articals['upvotes'], test_size = 0.2)

In [63]:

# initializing the linear regression model
lr = LinearRegression()

# Fitting the model
lr.fit(x_train, y_train)

# Calculating predictions on the test data
predictions = lr.predict(x_test)



In [66]:

mean = np.mean(predictions)
std  = np.std(predictions)
mse = mean_squared_error(y_test, predictions)

print('Mean = ', mean)
print('Standard Deviation = ', std)
print('Mean Squared Error = ', mse)


Mean =  11.254356849122622
Standard Deviation =  27.118933312035175
Mean Squared Error =  2358.5493277253954


The mean squared error is 2358.55, which is large: it corresponds to a root mean squared error of about 48.6 upvotes. For comparison, the predictions have a mean of 11.25 and a standard deviation of 27.12.
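
One way to put an MSE like this in context is to compare it with a trivial baseline that always predicts the mean of the training targets. The sketch below uses made-up upvote counts (`y_train_toy` and `y_test_toy` are hypothetical); on the real data the same idea would use `y_train` and `y_test`. A model is only adding value if its error is below this baseline.

```python
import numpy as np

# Made-up upvote counts for illustration (deliberately skewed)
y_train_toy = np.array([1, 2, 1, 3, 50, 2, 1])
y_test_toy = np.array([1, 4, 120, 2])

# Baseline: predict the training mean for every test headline
baseline_pred = np.full(len(y_test_toy), y_train_toy.mean())
baseline_mse = np.mean((y_test_toy - baseline_pred) ** 2)

# RMSE is in the same units as the target (upvotes)
baseline_rmse = np.sqrt(baseline_mse)
print(baseline_mse, baseline_rmse)
```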
I will try a random forest model to see whether I can get better results.

In [68]:
# initializing the model (a classifier, so each distinct upvote count is treated as its own class)
rf = RandomForestClassifier(random_state = 1)

#fitting the model
rf.fit(x_train, y_train)

# Calculating predictions
predictions = rf.predict(x_test)

mean = np.mean(predictions)
std  = np.std(predictions)

mse = mean_squared_error(y_test, predictions)

print('Mean = ', mean)
print('Standard deviation = ', std)
print('Mean squared error = ', mse)

Mean =  6.764285714285714
Standard deviation =  28.57224195270791
Mean squared error =  2570.6464285714287


The random forest's MSE is 2570.65, which is higher than what I got from the linear regression model.
I will leave it for now and try more feature engineering and hyperparameter optimization in the future to get a better result.
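
One possible next step, sketched here under stated assumptions, is to swap the `RandomForestClassifier` used above for scikit-learn's `RandomForestRegressor`, since the target is a count rather than a category. The data below is synthetic (`X` and `y` are stand-ins for the token matrix and upvotes), and the hyperparameter values are arbitrary examples rather than tuned choices, so the printed error says nothing about the Hacker News data.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the token matrix and upvotes (not the real data)
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 10))
y = 5 * X[:, 0] + 3 * X[:, 1] + rng.poisson(1.0, size=200)

# Fixing random_state makes the split and the forest reproducible
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1
)

# A regressor averages numeric predictions across trees,
# rather than voting for one class per distinct upvote count
rf = RandomForestRegressor(n_estimators=50, min_samples_leaf=5, random_state=1)
rf.fit(X_train, y_train)

predictions = rf.predict(X_test)
mse = mean_squared_error(y_test, predictions)
print("MSE:", mse)
```

Passing `random_state` to `train_test_split` also makes it possible to compare models on the same split from run to run.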