# Vectorisation

In [1]:
import numpy as np
import pandas as pd

In [2]:
# Import the pre-processed dataset
df_fake_news = pd.read_csv('data/preprocessed_fake_news.csv')

In [3]:
# Check the dataset
df_fake_news.head()

Unnamed: 0.1,Unnamed: 0,Headline,Processed Headline,Body,Processed Body,Label
0,0,Four ways Bob Corker skewered Donald Trump,way Bob Corker skewer Donald Trump,Image copyright Getty Images\nOn Sunday mornin...,image copyright Getty Images \n Sunday morning...,1
1,1,Linklater's war veteran comedy speaks to moder...,Linklater war veteran comedy speak modern Amer...,"LONDON (Reuters) - “Last Flag Flying”, a comed...",LONDON Reuters flag Flying comedy drama Vietna...,1
2,2,Trump’s Fight With Corker Jeopardizes His Legi...,Trump ’s Fight Corker Jeopardizes legislative ...,The feud broke into public view last week when...,feud break public view week Mr. Corker say Mr....,1
3,3,Egypt's Cheiron wins tie-up with Pemex for Mex...,Egypt Cheiron win tie Pemex mexican onshore oi...,MEXICO CITY (Reuters) - Egypt’s Cheiron Holdin...,MEXICO CITY Reuters Egypt ’s Cheiron Holdings ...,1
4,4,Jason Aldean opens 'SNL' with Vegas tribute,Jason Aldean open SNL Vegas tribute,"Country singer Jason Aldean, who was performin...",country singer Jason Aldean perform Las Vegas ...,1


In [4]:
# Drop the added Unnamed: 0 column
df_fake_news.drop(['Unnamed: 0'], axis=1, inplace=True)

In [5]:
# Check to see if there are any missing values
df_fake_news.isna().sum()

Headline              0
Processed Headline    6
Body                  0
Processed Body        0
Label                 0
dtype: int64

In [6]:
# Which rows have missing values? 
df_fake_news[df_fake_news['Processed Headline'].isna()]

Unnamed: 0,Headline,Processed Headline,Body,Processed Body,Label
222,Nowhere to Go but Up?,,Nowhere to Go but Up?\n(Before It's News)\nThe...,\n News \n player coach Baltimore Ravens not m...,0
2599,Nowhere to Go but Up?,,Nowhere to Go but Up?\n(Before It's News)\nThe...,\n News \n player coach Baltimore Ravens not m...,0
4182,Who had to go :-),,Leave a Reply Click here to get more info on f...,leave Reply Click info format 1 leave field wa...,0
7223,What Is To Be Done?,,What Is To Be Done? How to build a new anti-in...,build new anti interventionist movement Share ...,0
7415,‘Are We Next?’,,When the news broke that our French colleagues...,news break french colleague kill deep feeling ...,1
7444,:,,We the People Against Tyranny: Seven Princi...,we People Tyranny seven Principles free Go...,0


The Processed Headline column has six missing entries, even though there were none when we saved the dataset previously. The likely reason is that the affected headlines consist entirely of stopwords and punctuation, so the pre-processing returns an empty string for them. This is fine until we save the data as a .csv: empty strings are written as empty fields, and pandas reads those back as NaN when the file is reloaded.
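
The round trip described above can be reproduced on a toy frame. This is only an illustrative sketch: the two example rows and the in-memory buffer are made up, and `keep_default_na=False` is shown as one possible alternative to dropping rows, not something this notebook relies on.

```python
import io

import pandas as pd

# Toy frame: the first processed headline is an empty string
df_toy = pd.DataFrame({
    "Headline": ["Who had to go :-)", "Trump wins"],
    "Processed Headline": ["", "Trump win"],
})

# Write to an in-memory CSV and read it back with the default settings
buffer = io.StringIO()
df_toy.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)
n_missing = int(reloaded["Processed Headline"].isna().sum())

# Reading with keep_default_na=False keeps the empty string instead
buffer.seek(0)
reloaded_keep = pd.read_csv(buffer, keep_default_na=False)
first_value = reloaded_keep["Processed Headline"].iloc[0]

print(n_missing, repr(first_value))
```

Even with the empty strings preserved, those rows would carry no headline features, so dropping them remains a reasonable choice.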

To fix this I drop all the rows containing NaNs, as I cannot fill them in meaningfully and an empty headline is not useful for my model.

In [7]:
df_fake_news.dropna(inplace=True)

In [8]:
# Check to see if there are any missing values
df_fake_news.isna().sum()

Headline              0
Processed Headline    0
Body                  0
Processed Body        0
Label                 0
dtype: int64

The NaN values are now gone, so we can proceed with the vectorising.

# Train/Test Split

Before performing any vectorisation, I need to split my dataset into Train and Test sets. If I make the split AFTER vectorising, I will have leaked information about my test set into my training set. This occurs because TF-IDF uses information about all the data it is fit on (the vocabulary and the document frequencies) to create features. As a result, if we split AFTER, we are creating features using information from the test set, which can make the test scores look better than they should.
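
To make the leakage concrete, here is a small sketch on made-up documents (the toy `train_docs` and `test_docs` are assumptions for illustration, not rows from the dataset). Fitting on everything lets test-only terms into the vocabulary and shifts the IDF weight of shared terms.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up documents standing in for the train and test articles
train_docs = ["trump win vote", "clinton campaign vote", "vegas attack video"]
test_docs = ["senate vote delay", "senate campaign week"]

# Leaky: the vectoriser sees the test documents while fitting
leaky = TfidfVectorizer().fit(train_docs + test_docs)

# Clean: the vectoriser is fit on the training documents only
clean = TfidfVectorizer().fit(train_docs)

# Terms that only exist because the test set was seen during fitting
leaky_only_terms = sorted(set(leaky.vocabulary_) - set(clean.vocabulary_))

# The IDF weight of a shared term also changes once test documents are included
idf_vote_leaky = leaky.idf_[leaky.vocabulary_["vote"]]
idf_vote_clean = clean.idf_[clean.vocabulary_["vote"]]

# The clean vectoriser can still transform the test set; unseen terms are ignored
test_matrix = clean.transform(test_docs)

print(leaky_only_terms, idf_vote_leaky, idf_vote_clean, test_matrix.shape)
```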



In [9]:
# Splitting the data into train and test sets with an 80:20 split
from sklearn.model_selection import train_test_split

X = df_fake_news[df_fake_news.columns[:-1]]
y = df_fake_news[df_fake_news.columns[-1]]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

# TF-IDF

When performing TF-IDF on this dataset, I will first set min_df=100, as this is ~1% of the number of articles in my dataset. I think this is a reasonable threshold, but I will see how many columns I end up with and then vary min_df to see how this impacts the number of columns.
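
One way to explore the effect of min_df is to loop over a few thresholds and count the resulting columns. The sketch below uses a made-up four-document corpus and a hypothetical helper `vocabulary_size`, so the thresholds are tiny; on the real data the same loop could be run over values such as 50, 100 and 200.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up corpus; each string stands in for one processed article
docs = [
    "trump win vote",
    "trump campaign vote",
    "trump attack video",
    "clinton campaign vote",
]


def vocabulary_size(texts, min_df):
    """Number of TF-IDF columns produced for a given min_df (hypothetical helper)."""
    vectoriser = TfidfVectorizer(min_df=min_df, ngram_range=(1, 3))
    vectoriser.fit(texts)
    return len(vectoriser.vocabulary_)


# A higher min_df keeps only terms that appear in more documents
sizes = {min_df: vocabulary_size(docs, min_df) for min_df in (1, 2, 3)}
print(sizes)
```

TfidfVectorizer also accepts a float for min_df, interpreted as a proportion of documents, so min_df=0.01 expresses the ~1% threshold directly.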

In [10]:
df_fake_news.shape

(10317, 5)

In [11]:
df_fake_news.columns

Index(['Headline', 'Processed Headline', 'Body', 'Processed Body', 'Label'], dtype='object')

In [12]:
# Using TF-IDF to vectorise the Body and Headline data
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_body = TfidfVectorizer(min_df=100, ngram_range = (1,3))

body_tfidf = tfidf_body.fit(X_train["Processed Body"])
body_train_vec = body_tfidf.transform(X_train["Processed Body"])
body_test_vec = body_tfidf.transform(X_test["Processed Body"])


tfidf_headline = TfidfVectorizer(min_df=100, ngram_range = (1,3))

headline_tfidf = tfidf_headline.fit(X_train["Processed Headline"])
headline_train_vec = headline_tfidf.transform(X_train["Processed Headline"])
headline_test_vec = headline_tfidf.transform(X_test["Processed Headline"])


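# Note: get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out() on newer versions
df_body_train = pd.DataFrame(columns=body_tfidf.get_feature_names(), data=body_train_vec.toarray())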
df_body_test = pd.DataFrame(columns=body_tfidf.get_feature_names(), data=body_test_vec.toarray())
df_headline_train = pd.DataFrame(columns=headline_tfidf.get_feature_names(), data=headline_train_vec.toarray())
df_headline_test = pd.DataFrame(columns=headline_tfidf.get_feature_names(), data=headline_test_vec.toarray())

display(df_body_train)
display(df_body_test)
display(df_headline_train)
display(df_headline_test)

Unnamed: 0,00,000,000 campaign,000 people,09,10,10 000,10 percent,10 year,100,...,york city,york times,york times newsletter,york times product,young,young people,youth,youtube,zero,zone
0,0.0,0.0,0.0,0.0,0.0,0.029240,0.0,0.0,0.000000,0.00000,...,0.0,0.0,0.0,0.0,0.000000,0.0,0.000000,0.0,0.0,0.00000
1,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.000000,0.00000,...,0.0,0.0,0.0,0.0,0.000000,0.0,0.000000,0.0,0.0,0.00000
2,0.0,0.0,0.0,0.0,0.0,0.025104,0.0,0.0,0.000000,0.00000,...,0.0,0.0,0.0,0.0,0.000000,0.0,0.050542,0.0,0.0,0.00000
3,0.0,0.0,0.0,0.0,0.0,0.009016,0.0,0.0,0.000000,0.00603,...,0.0,0.0,0.0,0.0,0.017135,0.0,0.000000,0.0,0.0,0.00000
4,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.000000,0.00000,...,0.0,0.0,0.0,0.0,0.022956,0.0,0.000000,0.0,0.0,0.00000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8248,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.000000,0.10794,...,0.0,0.0,0.0,0.0,0.000000,0.0,0.000000,0.0,0.0,0.00000
8249,0.0,0.0,0.0,0.0,0.0,0.027022,0.0,0.0,0.052973,0.00000,...,0.0,0.0,0.0,0.0,0.000000,0.0,0.000000,0.0,0.0,0.09298
8250,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.000000,0.00000,...,0.0,0.0,0.0,0.0,0.000000,0.0,0.000000,0.0,0.0,0.00000
8251,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.000000,0.00000,...,0.0,0.0,0.0,0.0,0.000000,0.0,0.000000,0.0,0.0,0.00000


Unnamed: 0,00,000,000 campaign,000 people,09,10,10 000,10 percent,10 year,100,...,york city,york times,york times newsletter,york times product,young,young people,youth,youtube,zero,zone
0,0.0,0.000000,0.0,0.0,0.0,0.000000,0.000000,0.0,0.0,0.0,...,0.000000,0.000000,0.0,0.0,0.0,0.0,0.044258,0.040274,0.000000,0.0
1,0.0,0.000000,0.0,0.0,0.0,0.022844,0.000000,0.0,0.0,0.0,...,0.000000,0.000000,0.0,0.0,0.0,0.0,0.000000,0.167409,0.000000,0.0
2,0.0,0.000000,0.0,0.0,0.0,0.000000,0.000000,0.0,0.0,0.0,...,0.000000,0.000000,0.0,0.0,0.0,0.0,0.000000,0.000000,0.000000,0.0
3,0.0,0.000000,0.0,0.0,0.0,0.000000,0.000000,0.0,0.0,0.0,...,0.000000,0.000000,0.0,0.0,0.0,0.0,0.000000,0.000000,0.000000,0.0
4,0.0,0.000000,0.0,0.0,0.0,0.000000,0.000000,0.0,0.0,0.0,...,0.000000,0.000000,0.0,0.0,0.0,0.0,0.000000,0.000000,0.000000,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2059,0.0,0.000000,0.0,0.0,0.0,0.000000,0.000000,0.0,0.0,0.0,...,0.000000,0.000000,0.0,0.0,0.0,0.0,0.000000,0.000000,0.000000,0.0
2060,0.0,0.000000,0.0,0.0,0.0,0.000000,0.000000,0.0,0.0,0.0,...,0.000000,0.052013,0.0,0.0,0.0,0.0,0.000000,0.000000,0.000000,0.0
2061,0.0,0.000000,0.0,0.0,0.0,0.028296,0.000000,0.0,0.0,0.0,...,0.027539,0.040217,0.0,0.0,0.0,0.0,0.000000,0.000000,0.025193,0.0
2062,0.0,0.000000,0.0,0.0,0.0,0.000000,0.000000,0.0,0.0,0.0,...,0.000000,0.000000,0.0,0.0,0.0,0.0,0.000000,0.000000,0.000000,0.0


Unnamed: 0,10,2016,2017,america,attack,big,black,campaign,clinton,comment,...,trump,vegas,video,vote,war,week,white,win,world,year
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8248,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
8249,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8250,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8251,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


Unnamed: 0,10,2016,2017,america,attack,big,black,campaign,clinton,comment,...,trump,vegas,video,vote,war,week,white,win,world,year
0,0.0,0.00000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.72661,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.00000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.00000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.00000,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2059,0.0,0.00000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2060,0.0,0.00000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2061,0.0,0.00000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2062,0.0,0.00000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0


To identify whether the word comes from the body or the headline I will add a prefix to each column for both the train and test tables. For the words from the body I will add "b_" and for the headline I will add "h_".
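
As an aside, pandas has a built-in `DataFrame.add_prefix` method that achieves the same renaming in one step. The sketch below uses made-up toy frames (`df_body_toy` and `df_headline_toy` are illustrative names, not the notebook's variables).

```python
import pandas as pd

# Toy frames that share column names, as the body and headline tables do
df_body_toy = pd.DataFrame({"trump": [0.1, 0.0], "vote": [0.0, 0.3]})
df_headline_toy = pd.DataFrame({"trump": [1.0, 0.0], "vote": [0.0, 1.0]})

# add_prefix returns a new frame with every column label prefixed
df_body_prefixed = df_body_toy.add_prefix("b_")
df_headline_prefixed = df_headline_toy.add_prefix("h_")

# After prefixing, the two frames no longer share any column names
shared = set(df_body_prefixed.columns) & set(df_headline_prefixed.columns)
print(list(df_body_prefixed.columns), list(df_headline_prefixed.columns), shared)
```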

In [13]:
# Add the prefixes to the column indexes

body_train_columns = "b_" + df_body_train.columns
body_test_columns = "b_" + df_body_test.columns

headline_train_columns = "h_" + df_headline_train.columns
headline_test_columns = "h_" + df_headline_test.columns

In [14]:
# Update the column names for both dataframes
df_body_train.columns = body_train_columns
df_body_test.columns = body_test_columns
df_headline_train.columns = headline_train_columns
df_headline_test.columns = headline_test_columns

In [15]:
df_body_train

Unnamed: 0,b_00,b_000,b_000 campaign,b_000 people,b_09,b_10,b_10 000,b_10 percent,b_10 year,b_100,...,b_york city,b_york times,b_york times newsletter,b_york times product,b_young,b_young people,b_youth,b_youtube,b_zero,b_zone
0,0.0,0.0,0.0,0.0,0.0,0.029240,0.0,0.0,0.000000,0.00000,...,0.0,0.0,0.0,0.0,0.000000,0.0,0.000000,0.0,0.0,0.00000
1,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.000000,0.00000,...,0.0,0.0,0.0,0.0,0.000000,0.0,0.000000,0.0,0.0,0.00000
2,0.0,0.0,0.0,0.0,0.0,0.025104,0.0,0.0,0.000000,0.00000,...,0.0,0.0,0.0,0.0,0.000000,0.0,0.050542,0.0,0.0,0.00000
3,0.0,0.0,0.0,0.0,0.0,0.009016,0.0,0.0,0.000000,0.00603,...,0.0,0.0,0.0,0.0,0.017135,0.0,0.000000,0.0,0.0,0.00000
4,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.000000,0.00000,...,0.0,0.0,0.0,0.0,0.022956,0.0,0.000000,0.0,0.0,0.00000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8248,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.000000,0.10794,...,0.0,0.0,0.0,0.0,0.000000,0.0,0.000000,0.0,0.0,0.00000
8249,0.0,0.0,0.0,0.0,0.0,0.027022,0.0,0.0,0.052973,0.00000,...,0.0,0.0,0.0,0.0,0.000000,0.0,0.000000,0.0,0.0,0.09298
8250,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.000000,0.00000,...,0.0,0.0,0.0,0.0,0.000000,0.0,0.000000,0.0,0.0,0.00000
8251,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.000000,0.00000,...,0.0,0.0,0.0,0.0,0.000000,0.0,0.000000,0.0,0.0,0.00000


In [16]:
df_headline_train

Unnamed: 0,h_10,h_2016,h_2017,h_america,h_attack,h_big,h_black,h_campaign,h_clinton,h_comment,...,h_trump,h_vegas,h_video,h_vote,h_war,h_week,h_white,h_win,h_world,h_year
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8248,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
8249,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8250,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8251,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


Now I will create a new dataframe for both the train and test sets that will be the result of joining the new columns from df_body_train and df_headline_train, and similarly for the X_test set. Note that we will no longer need the original Body, Headline, Processed Body and Processed Headline columns as we have now vectorised these into machine readable formats.
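
Note that `DataFrame.join` aligns rows on the index. This works here because both vectorised frames were built with a fresh 0..n-1 index. The sketch below, on made-up toy frames, shows what would happen if the indexes did not match.

```python
import pandas as pd

# Toy vectorised frames, both with the default 0..n-1 index
headline = pd.DataFrame({"h_trump": [1.0, 0.0, 0.5]})
body = pd.DataFrame({"b_vote": [0.2, 0.0, 0.7]})

# Matching indexes: rows line up one to one
joined = headline.join(body)

# Mismatched indexes (e.g. labels left over from a shuffled split):
# join finds no matching labels and fills the body columns with NaN
body_other_index = body.copy()
body_other_index.index = [7, 3, 9]
misaligned = headline.join(body_other_index)

print(joined)
print(misaligned)
```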

In [17]:
# Join the two dataframes together

# Create X_train from df_body_train and df_headline_train
X_tfidf_train = df_headline_train.join(df_body_train)

# Create X_test from df_body_test and df_headline_test
X_tfidf_test = df_headline_test.join(df_body_test)


In [18]:
X_tfidf_train

Unnamed: 0,h_10,h_2016,h_2017,h_america,h_attack,h_big,h_black,h_campaign,h_clinton,h_comment,...,b_york city,b_york times,b_york times newsletter,b_york times product,b_young,b_young people,b_youth,b_youtube,b_zero,b_zone
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.000000,0.0,0.000000,0.0,0.0,0.00000
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.000000,0.0,0.000000,0.0,0.0,0.00000
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.000000,0.0,0.050542,0.0,0.0,0.00000
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.017135,0.0,0.000000,0.0,0.0,0.00000
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.022956,0.0,0.000000,0.0,0.0,0.00000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8248,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.000000,0.0,0.000000,0.0,0.0,0.00000
8249,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.000000,0.0,0.000000,0.0,0.0,0.09298
8250,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.000000,0.0,0.000000,0.0,0.0,0.00000
8251,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.000000,0.0,0.000000,0.0,0.0,0.00000


In [19]:
X_tfidf_test

Unnamed: 0,h_10,h_2016,h_2017,h_america,h_attack,h_big,h_black,h_campaign,h_clinton,h_comment,...,b_york city,b_york times,b_york times newsletter,b_york times product,b_young,b_young people,b_youth,b_youtube,b_zero,b_zone
0,0.0,0.00000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.000000,0.000000,0.0,0.0,0.0,0.0,0.044258,0.040274,0.000000,0.0
1,0.0,0.72661,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.000000,0.000000,0.0,0.0,0.0,0.0,0.000000,0.167409,0.000000,0.0
2,0.0,0.00000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.000000,0.000000,0.0,0.0,0.0,0.0,0.000000,0.000000,0.000000,0.0
3,0.0,0.00000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.000000,0.000000,0.0,0.0,0.0,0.0,0.000000,0.000000,0.000000,0.0
4,0.0,0.00000,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,...,0.000000,0.000000,0.0,0.0,0.0,0.0,0.000000,0.000000,0.000000,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2059,0.0,0.00000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.000000,0.000000,0.0,0.0,0.0,0.0,0.000000,0.000000,0.000000,0.0
2060,0.0,0.00000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.000000,0.052013,0.0,0.0,0.0,0.0,0.000000,0.000000,0.000000,0.0
2061,0.0,0.00000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.027539,0.040217,0.0,0.0,0.0,0.0,0.000000,0.000000,0.025193,0.0
2062,0.0,0.00000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.000000,0.000000,0.0,0.0,0.0,0.0,0.000000,0.000000,0.000000,0.0


Now we have a TF-IDF vectorised dataset so I will save the datasets to be loaded into future models.

In [20]:
# Save and export the train and test TF-IDF dataset
X_tfidf_train.to_csv('data/X_tfidf_train.csv', index=False)
X_tfidf_test.to_csv('data/X_tfidf_test.csv', index=False)

# Rename the targets to be consistent
y_tfidf_train = y_train
y_tfidf_test = y_test

y_tfidf_train.to_csv('data/y_tfidf_train.csv', header=True, index=False)
y_tfidf_test.to_csv('data/y_tfidf_test.csv', header=True, index=False)

Now that we have a vectorised dataset, let's run some models, see what our accuracy looks like and how we can optimise them. I will start with a Logistic Regression, then a Random Forest and lastly a Recurrent Neural Network. For the RNN, I will utilise the Word2Vec vectorisation, which is carried out below.

# Word2Vec

For Word2Vec, I will use the same Train/Test split as for my TF-IDF so that I can compare the performance directly. As before, it is important to split the data before vectorising so that there is no information leakage from our Test set.
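
Reusing the split is straightforward because `train_test_split` is deterministic for a fixed `random_state`. A minimal check on made-up arrays (the toy `X` and `y` below are placeholders, not the real features):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data: 10 samples with 2 features each
X = np.arange(20).reshape(10, 2)
y = np.array([0, 1] * 5)

# Splitting twice with the same random_state gives identical partitions
split_a = train_test_split(X, y, test_size=0.2, random_state=1)
split_b = train_test_split(X, y, test_size=0.2, random_state=1)

same_split = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))
print(same_split)
```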

In [62]:
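from gensim.models import Word2Vec
from gensim.test.utils import common_texts, get_tmpfile

path = get_tmpfile("word2vec.model")

# Toy example on gensim's built-in common_texts corpus
# (the size parameter is called vector_size in gensim >= 4.0)
model = Word2Vec(common_texts, size=100, window=5, min_count=1, workers=4)
model.save("word2vec.model")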

In [63]:
common_texts

[['human', 'interface', 'computer'],
 ['survey', 'user', 'computer', 'system', 'response', 'time'],
 ['eps', 'user', 'interface', 'system'],
 ['system', 'human', 'system', 'eps'],
 ['user', 'response', 'time'],
 ['trees'],
 ['graph', 'trees'],
 ['graph', 'minors', 'trees'],
 ['graph', 'minors', 'survey']]

In [65]:
vector = model.wv['computer']  # numpy vector of a word

In [66]:
vector

array([ 3.7709232e-03, -2.0128882e-03, -2.8544404e-03,  4.8924531e-03,
       -3.6285343e-03,  1.5042886e-03,  8.9837326e-04, -8.7219829e-05,
       -1.3023480e-04, -4.4162860e-03, -2.7416125e-03, -2.8429360e-03,
        1.7961244e-03, -1.7936210e-03,  2.0466514e-03,  4.8784162e-03,
        1.6520530e-03, -3.8253467e-03,  4.3302607e-03, -2.8046889e-03,
       -4.5312219e-03,  2.2350946e-03, -2.5617364e-03,  1.8639783e-04,
        2.7125530e-04,  4.4598025e-03,  3.7664878e-03, -4.9925996e-03,
       -2.9477694e-03,  2.7240401e-03, -1.8468520e-03,  4.4552861e-03,
        4.0721814e-03,  5.6969147e-04, -3.8623230e-03, -2.3506044e-03,
       -2.4827714e-03, -3.4109810e-03, -1.2334787e-03,  3.3013400e-04,
        1.6140186e-03, -4.4121626e-03, -2.8376293e-03, -4.7903419e-03,
       -1.3234533e-03, -4.3969881e-03, -3.5370886e-03,  2.4954362e-03,
       -1.3133053e-03, -1.4366278e-03,  4.0780185e-03,  3.6046782e-03,
        3.8377086e-03, -3.5743979e-03, -2.1184981e-03,  3.4882859e-03,
      

In [72]:
from gensim.models import Word2Vec

# Word2Vec expects an iterable of token lists, so split each headline on whitespace
w2c_model = Word2Vec(X_train['Processed Headline'].str.split(), min_count = 100, size=10, workers=2)

In [None]:
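# Not yet run: accuracy() needs the path to an analogy questions file as its argument,
# and on gensim >= 4.0 the equivalent method is model.wv.evaluate_word_analogies()
model.accuracy()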