# Fake News Machine Learning Prediction Model

## Imports

In [1]:
# Generic
import pandas as pd
import re
import os
# In a notebook, `from tqdm.notebook import tqdm` gives a widget-based progress bar instead
from tqdm import tqdm

# Create new `pandas` methods which use `tqdm` progress
# (can use tqdm_gui, optional kwargs, etc.)
tqdm.pandas()

# Natural Language Processing
import nltk
nltk.download('stopwords')
from nltk.stem.porter import PorterStemmer
from nltk.corpus import stopwords

# Machine learning
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer


[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/thomas/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


## Loading and filtering data

In [2]:
# First we load the data
train = pd.read_csv('./fake-news/train.csv')
test = pd.read_csv('./fake-news/test.csv')
# Then we check for any missing values in the data
print("Empty Training data:")
print(train.isnull().sum())

print("Empty Testing data:")
print(test.isnull().sum())

Empty Training data:
id           0
title      558
author    1957
text        39
label        0
dtype: int64
Empty Testing data:
id          0
title     122
author    503
text        7
dtype: int64


In [3]:
# Since some values are missing, we have to fill them with something.
# We are working with text, so we fill them with empty strings:
train = train.fillna("")
test = test.fillna("")

In [4]:
# Inspecting the data (a cell only displays its last expression, here train.head())
test.head()
train.head()

| | id | title | author | text | label |
|---|---|---|---|---|---|
| 0 | 0 | House Dem Aide: We Didn’t Even See Comey’s Let... | Darrell Lucus | House Dem Aide: We Didn’t Even See Comey’s Let... | 1 |
| 1 | 1 | FLYNN: Hillary Clinton, Big Woman on Campus - ... | Daniel J. Flynn | Ever get the feeling your life circles the rou... | 0 |
| 2 | 2 | Why the Truth Might Get You Fired | Consortiumnews.com | Why the Truth Might Get You Fired October 29, ... | 1 |
| 3 | 3 | 15 Civilians Killed In Single US Airstrike Hav... | Jessica Purkiss | Videos 15 Civilians Killed In Single US Airstr... | 1 |
| 4 | 4 | Iranian woman jailed for fictional unpublished... | Howard Portnoy | Print \nAn Iranian woman has been sentenced to... | 1 |


In [5]:
# To make an accurate prediction we want to include all the relevant factors when passing data to the model.
# In our case the title, the author and the text can all be an indication of fake news.
test['content'] = test['author'] + ': ' + test['title'] + '\n' + test['text']
train['content'] = train['author'] + ': ' + train['title'] + '\n' + train['text']

**Stemming** <br>
To determine which words are important in the fake news articles, we first have to "stem" them.
In other words, we reduce each word to its root so that different forms of the same word are counted as one token.
Example:
* waited, waiting, waits -> wait

To do this we use the Python package **N**atural **L**anguage **T**ool**k**it (nltk).
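
As a quick illustration of the example above, the following sketch runs NLTK's `PorterStemmer` on the three word forms. The variable names are illustrative only and are not used elsewhere in the notebook.

```python
from nltk.stem.porter import PorterStemmer

# The Porter stemmer needs no downloaded corpus, only the nltk package itself
stemmer = PorterStemmer()

# The three word forms from the example above
words = ["waited", "waiting", "waits"]
stems = [stemmer.stem(word) for word in words]

print(stems)  # all three forms reduce to the same stem
```

Note that a stem is not necessarily a dictionary word: in the IDF table further down, "people" shows up as `peopl`.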

In [6]:
# Cache the stemmed DataFrame as a pickle so the slow stemming step can be skipped on later runs
useCachedStem = True
picklePath = './stemmedData.pkl'
# First we create a Porter stemmer to stem the words from the article content
port_stem = PorterStemmer()
# Build the stop word set once; calling stopwords.words('english') for every single word is very slow
stop_words = set(stopwords.words('english'))
# Next we define a function that cleans the content and applies the Porter stemming algorithm
def stemContent(content):
    content = re.sub('[^a-zA-Z]', ' ', content) # Replace everything that is not a letter (#, !, €, digits etc.) with a space
    content = content.lower() # Change all to lowercase
    content = content.split() # Split into a list of words so the stemmer can be applied to each word
    # Stem each word that is not a stop word (words commonly used in a language that don't provide
    # any value for the machine learning categorization task (for, an, nor, but, or, yet, so etc.)).
    # This allows for faster processing later on
    content = [port_stem.stem(word) for word in content if word not in stop_words]
    content = ' '.join(content) # Join the list of stemmed words back into one string
    return content

if not useCachedStem:
    # Apply the stemming function to each element in the dataset:
    train['content'] = train['content'].progress_apply(stemContent)
    print(train['content'])
    train.to_pickle(picklePath)
else:
    if os.path.exists(picklePath):
        train = pd.read_pickle(picklePath)
    else:
        raise Exception("Error: no cached pickle file. Run this cell with useCachedStem = False to recalculate the DataFrame and create the pickle file '" + os.path.basename(picklePath) + "'")

    

In [7]:
# We divide the data into text (X) and label (y), as well as into training (80%) and test data (20%)
# NOTE: random_state is set to get the same division of the data each time the code is run
X_train, X_test, y_train, y_test = train_test_split(train['content'], train['label'], test_size=0.20, random_state=0) 

## Vectorization
Next we have to **vectorize** our text, i.e. convert it into numbers. <br>
To figure out which words, or sequences of words, give an indication of fake news we have to look at <em><strong>word frequencies</strong></em> one way or another. To reduce the impact of words or <em><strong>tokens</strong></em> that occur frequently across a given <em><strong>corpus</strong></em> (set of documents) we use **T**erm **F**requency **I**nverse **D**ocument **F**requency, or TF-IDF for short. This is a weighting scheme that transforms text into a meaningful numerical representation that our future machine learning algorithm(s) can use for prediction. With TF-IDF, the impact of frequently occurring tokens, which are empirically less informative, is reduced compared to tokens that occur in only a small fraction of the training corpus. <br>
Put simply, TF-IDF is a measure of how <em>original</em> a word is: it compares the number of times a word appears in a document with the number of documents the word appears in.<br>
More formally:<br>
<p align="center">
<img src="https://latex.codecogs.com/svg.image?TF-IDF&space;=&space;TF(t,d)*IDF(t)"/><br>
</p>
Where:
<ul>
<li>t = a term (token)</li>
<li>d = a document</li>
<li>TF(t,d) = term frequency, i.e. the number of times term t appears in document d</li>
<li>IDF(t) = inverse document frequency of term t</li>
</ul>
Where IDF is defined as (this is the smoothed form that sklearn uses by default, with the 1 added outside the logarithm):<br>
<p align="center">
<img src="https://latex.codecogs.com/svg.image?IDF(t)&space;=&space;\log\left(\frac{1&plus;n}{1&plus;df(t)}\right)&plus;1" /><br>
</p>
Where:
<ul>
<li>n = total number of documents in the corpus</li>
<li>df(t) = document frequency of the term t, i.e. how many documents the term t appears in</li>
</ul>
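
To make the formula concrete, here is a minimal sketch in plain Python that computes TF-IDF by hand, assuming the smoothed IDF that sklearn uses by default, ln((1 + n) / (1 + df(t))) + 1. The three-document corpus and the helper names `idf` and `tf_idf` are made up for illustration.

```python
import math

# Tiny, made-up corpus of already tokenized documents (illustration only)
docs = [
    ["time", "said", "new"],
    ["time", "peopl", "time"],
    ["guatemalan", "citizen", "said"],
]

def idf(term, docs):
    """Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1."""
    n = len(docs)                               # total number of documents
    df = sum(1 for doc in docs if term in doc)  # documents containing the term
    return math.log((1 + n) / (1 + df)) + 1

def tf_idf(term, doc, docs):
    """Raw term count in the document multiplied by the term's IDF."""
    tf = doc.count(term)
    return tf * idf(term, docs)

print(idf("time", docs))         # appears in 2 of 3 documents: lower weight
print(idf("guatemalan", docs))   # appears in 1 of 3 documents: higher weight
print(tf_idf("time", docs[1], docs))
```

This sketch stops at the raw product. sklearn's `TfidfTransformer` additionally normalizes each document vector to unit length by default (`norm='l2'`), so its output values will differ from these.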

In Python, the sklearn library provides the `TfidfVectorizer` for this, which is equivalent to a `CountVectorizer` followed by a `TfidfTransformer`. In this notebook we run those two steps separately.

When instantiating the vectorizer we pass in a list of stop words (i.e. words that don't add any meaning or value to what a given piece of text is about) and define which sizes of <em>**n-grams**</em> we want. <br>
An <em>**n-gram**</em> is simply a sequence of <em>**N**</em> words. N-grams are often used in NLP applications such as sentence autocompletion: after being trained on a huge <em>**corpus**</em> of data, a model can predict which word has the highest probability of following a given sequence of words. <br>
If you, for example, wrote "Thank you so much for your", most humans would easily deduce that the next word in that sentence is likely <em>"time"</em> or <em>"help"</em>. For a machine to do this, you tell it to group words into <em>**n-grams**</em>, for example:
<ul>
<li>San Francisco (is a 2-gram)</li>
<li>The Three Musketeers (is a 3-gram)</li>
<li>She stood up slowly (is a 4-gram)</li>
</ul>
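
The idea can be sketched in a few lines of plain Python. The helper `ngrams` below is a hypothetical illustration and not part of sklearn or nltk; it simply slides a window of size n over a list of tokens.

```python
def ngrams(tokens, n):
    """Return all contiguous sequences of n tokens as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "She stood up slowly".split()

bigrams = ngrams(tokens, 2)   # every pair of neighbouring words
fourgrams = ngrams(tokens, 4) # the whole sentence as a single 4-gram

print(bigrams)
print(fourgrams)
```

A four-word sentence yields three bigrams but only one 4-gram, which is why higher values of n produce sparser and more specific features.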

N-grams are used in our vectorizer to capture the <em><strong>context</strong></em> in which words are used together.
In other words, the vectorizer doesn't just look at single words and produce a matrix with a column for each word; it also produces columns for sequences of words that occur together.<br>
In our case we will start by looking at single words together with <em><strong>bigrams</strong></em> (`ngram_range=(1, 2)`).


### Count vectorization
To calculate the IDF scores we first have to instantiate a <em><strong>Count Vectorizer</strong></em>, whose output is then passed on to the TF-IDF transformer. <br>
The sklearn `CountVectorizer` converts a collection of text documents into a matrix of term/token counts. <br>
In other words, it creates a matrix that indicates how often each word occurs in each document, as seen below:
<img src="https://res.cloudinary.com/practicaldev/image/fetch/s--qveZ_g7d--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://raw.githubusercontent.com/cassieview/intro-nlp-wine-reviews/master/imgs/vectorchart.PNG"/>
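
The same kind of matrix can be built by hand. The following is a conceptual sketch in plain Python with a made-up three-document corpus. It is not sklearn's implementation: `CountVectorizer` additionally lowercases the text, tokenizes with a regular expression and returns a sparse matrix.

```python
from collections import Counter

# Made-up mini corpus (illustration only)
docs = ["fake news spread fast", "news spread", "real news"]
tokenized = [doc.split() for doc in docs]

# The vocabulary is the sorted set of all distinct tokens: one column per token
vocabulary = sorted({token for doc in tokenized for token in doc})

# One row per document, holding the count of every vocabulary term in that document
count_matrix = []
for doc in tokenized:
    counts = Counter(doc)
    count_matrix.append([counts[term] for term in vocabulary])

print(vocabulary)
for row in count_matrix:
    print(row)
```

Every row has the same length as the vocabulary, which is what lets a model treat each document as a fixed-size numeric vector.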


In [8]:
# Initialize the CountVectorizer (single words and bigrams, English stop words removed)
count_vectorizer = CountVectorizer(ngram_range=(1, 2), stop_words='english')
# Fit and transform the training data.
# In other words, learn the vocabulary from the corpus (an iterable of documents) and count each term per document
count_train = count_vectorizer.fit_transform(X_train)

# Next we transform the test set.
# Simply put, we count the terms of the test data using the vocabulary learned from the training data,
# so that the number of features in the test data remains the same as in the training data
count_test = count_vectorizer.transform(X_test)

# Initialize the TF-IDF transformer.
# Fitting it on the training counts makes it learn the IDF weight of every term in the vocabulary.
# NOTE: fit() only learns the weights; call tfidf_transformer.transform(count_train) and
# tfidf_transformer.transform(count_test) afterwards to obtain the actual TF-IDF matrices
tfidf_transformer = TfidfTransformer()
tfidf_transformer.fit(count_train)

# print idf values 
df_idf = pd.DataFrame(tfidf_transformer.idf_, index=count_vectorizer.get_feature_names_out(),columns=["idf_weights"]) 
# sort ascending 
df_idf = df_idf.sort_values(by=['idf_weights'])
print(df_idf)


                             idf_weights
time                            1.471122
said                            1.475560
new                             1.532451
peopl                           1.619434
like                            1.621334
...                                  ...
guatemalan citizen             10.026478
guatemalan dictat              10.026478
guatemalan environmentalist    10.026478
guatemala write                10.026478
zzzz emerg                     10.026478

[3083822 rows x 1 columns]


### Interpreting IDF scores
Glancing at the IDF scores above, we can see that words like <em>time</em> have a low IDF score while bigrams containing <em>guatemalan</em> have a high IDF score. The lower the IDF value of a term, the more documents it appears in and the less unique it is to any particular document. <br>
In other words, bigrams containing <em>guatemalan</em> are more indicative of a text's content than common words like <em>time</em>, <em>said</em> or <em>like</em>.
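
An IDF weight can also be translated back into a document frequency. The sketch below inverts the smoothed IDF formula ln((1 + n) / (1 + df)) + 1. The corpus size `n_docs = 16640` is an assumption made for illustration and is not stated anywhere in this notebook; under that assumption, a term that occurs in exactly one document gets a weight of about 10.026, which is in line with the largest values printed above.

```python
import math

def idf(n_docs, doc_freq):
    """Smoothed IDF: ln((1 + n) / (1 + df)) + 1."""
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1

def doc_freq_from_idf(n_docs, idf_value):
    """Invert the smoothed IDF formula to recover the document frequency."""
    return (1 + n_docs) / math.exp(idf_value - 1) - 1

n_docs = 16640  # ASSUMPTION: size of the training split, not given in the notebook

# Weight of a term that appears in a single document
print(idf(n_docs, 1))

# Approximate number of documents containing "time" (idf_weights = 1.471122 above)
print(round(doc_freq_from_idf(n_docs, 1.471122)))
```

Read this way, a weight near the bottom of the table corresponds to a term present in a large share of all documents, while the maximum weight marks terms seen only once.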