**This is part 2 of the fake news classifier and runs only the grid search with cross validation and provides the final verdict.**

# Imports (same as the previous notebook)

This notebook continues the previous one. It has been separated out so that it runs only the grid search with 5-fold cross-validation.

In [22]:
# First cell: all imports used throughout the notebook
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

# For saving the models later. 
import joblib

# Importing wordcloud and its necessary stopwords package
#import wordcloud
#from wordcloud import STOPWORDS

# Natural Language Processing Tool Kit Imports
# Importing Natural Language ToolKit and its essential packages
import nltk

# For seeing and removing stopwords
from nltk.corpus import stopwords 

# For lemmatizing our words 
from nltk.stem import WordNetLemmatizer

# For stemming our words 
from nltk.stem import PorterStemmer

# Cleaning tools imports 
# Importing string for cleaning punctuations
import string 

# Importing Regex
import re

# Vectorizing Imports
# Importing CountVectorizer to tokenize our articles
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

# For making training and testing splits prior to modelling
from sklearn.model_selection import train_test_split

# Importing Scaler for Scaling Data
from sklearn.preprocessing import StandardScaler

# Importing the different models for modelling purposes
from sklearn.linear_model import LogisticRegression 
from sklearn.neighbors import KNeighborsClassifier
from sklearn.decomposition import PCA 
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.naive_bayes import MultinomialNB 
from sklearn.ensemble import AdaBoostClassifier
from sklearn.svm import LinearSVC

# Importing metrics to evaluate our model
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_curve, roc_auc_score
from sklearn.metrics.pairwise import cosine_similarity

# For building up a pipeline
from sklearn.pipeline import Pipeline

# For a cross-validated grid search
from sklearn.model_selection import GridSearchCV

# Stopping warnings
import warnings
warnings.filterwarnings('ignore')

## Pipelines & GridSearchCV

Given the baseline model scores from the previous notebook, we're going to run a GridSearchCV on the models that scored over 85%, a threshold chosen arbitrarily. We could have chosen 80% or 90%, but 80% would mean more computation, while 90% would be a very high bar even though it would put fewer models into the grid search. We will run the grid search for the following models:

- Logistic Regression
- Decision Tree Classifier
- Random Forest Classifier
- Neural Network MLP Classifier
- Naive Bayes
- AdaBoost Classifier
- Support Vector Machines Classifier


First, we're going to import the data frame we exported after preprocessing. Then we'll define the list of stopwords we created, and finally load all the models before running the grid search.

**Note:** Running a cross-validated grid search across all these models takes a lot of computation power and time, so we're only going to run it on 10% of the data.

In [3]:
# Importing file for gridsearch
combined_df_full = pd.read_csv('combined_df.csv')
combined_df_full.head()

Unnamed: 0,title,text,subject,date,Title Word Count,Text Word Count,label
0,after trump disclosures uks may says will cont...,london reuters british prime minister theresa...,Political News,"May 17, 2017",13,335,1
1,senators close to finishing encryption penalti...,washington reuters technology companies could...,Political News,"March 9, 2016",8,409,1
2,honduran opposition candidate nasralla says i ...,mexico city reuters honduran opposition candi...,World News,"December 23, 2017",20,100,1
3,polands pm szydlo to reshuffle cabinet soon,warsaw reuters poland s prime minister beata ...,World News,"October 24, 2017",7,397,1
4,uber agrees to settle us lawsuit filed by indi...,san francisco reuters uber technologies inc a...,World News,"December 9, 2017",11,481,1


In [4]:
# For running the gridsearch, we will work with 10% of the data only as it is computationally very, very heavy
combined_df = combined_df_full.sample(frac = 0.10).copy()
combined_df.reset_index(drop = True, inplace = True)
combined_df.head()

Unnamed: 0,title,text,subject,date,Title Word Count,Text Word Count,label
0,china criticises india over crashed drone on b...,beijing reuters china expressed strong dissa...,World News,"December 7, 2017",8,451,1
1,the new york times just fired back at trump w...,donald trump pissed off the new york times and...,World News,"March 29, 2017",15,493,0
2,carrier to employees donald trump lied we’re ...,while donald trump may have bribed carrier int...,World News,"December 3, 2016",14,463,0
3,a better future britains may tries to rally h...,london reuters prime minister theresa may wil...,World News,"September 29, 2017",11,695,1
4,deadline nears for catalan leader to clarify i...,madrid reuters catalan leader carles puigdemo...,World News,"October 15, 2017",9,482,1
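
The 10% sample above is drawn without a seed, so each run produces a different subset. As a minimal sketch of the idea (not the notebook's method), a seeded, class-balanced subsample could look like the following; the helper name `stratified_sample_indices` and the toy labels are assumptions made for illustration.

```python
import random
from collections import defaultdict

def stratified_sample_indices(labels, frac, seed=42):
    """Return sorted row indices for a seeded sample that keeps `frac` of each class."""
    rng = random.Random(seed)

    # Group row positions by their class label
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    # Draw the same fraction from every class (at least one row per class)
    chosen = []
    for indices in by_label.values():
        k = max(1, round(len(indices) * frac))
        chosen.extend(rng.sample(indices, k))
    return sorted(chosen)

# Toy labels standing in for the `label` column: 60 real (1) and 40 fake (0)
labels = [1] * 60 + [0] * 40
picked = stratified_sample_indices(labels, frac=0.10)
print(len(picked), sum(labels[i] for i in picked))
```

In the notebook itself, passing `random_state` to `sample` and to `train_test_split` would make the subsample and the split repeatable in the same spirit.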


In [5]:
# Splitting our data again
X = combined_df['text']
y = combined_df['label']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, stratify = y)

In [23]:
# Let's load all our saved models for our GridSearch
# Loading the Logistic Regression model
LR_model = joblib.load('LR_model.pkl')

# Loading the scaled Logistic Regression model
LRSS_model = joblib.load('LRSS_model.pkl')

# Loading the Decision Tree model
DT_model = joblib.load('DT_model.pkl')

# Loading the Random Forest model
RF_model = joblib.load('RF_model.pkl')

# Loading the Neural Network model
NN_model = joblib.load('NN_model.pkl')

# Loading the Naive Bayes model
NB_model = joblib.load('NB_model.pkl')

# Loading the AdaBoost model
ADB_model = joblib.load('ADB_model.pkl')

# Loading the Support Vector Machines model
SVC_model = joblib.load('SVC_model.pkl')

In [7]:
# Adding more words to the listofstopwords
# We load the NLTK English stopwords into listofstopwords so that we can extend it with
# frequent words from the earlier analysis, including the ones pulled out from the wordcloud.
nltk.download('stopwords')
# Reading the list directly avoids shadowing the `stopwords` module imported in the first cell
listofstopwords = list(nltk.corpus.stopwords.words('english'))
listofstopwords.extend(('said','trump','reuters','president','state','government','states','new','house','united',
                       'clinton','obama','donald','like','news','just', 'campaign', 'washington', 'election',
                        'party', 'republican'))

listofstopwords.extend(('say','obama','(reuters)','govern','news','united', 'states', '-', 'said', 'arent', 'couldnt',
                        'didnt', 'doesnt', 'dont', 'hadnt', 'hasnt', 'havent','isnt', 'mightnt', 'mustnt', 'neednt',
                        'shant', 'shes', 'shouldnt', 'shouldve','thatll', 'wasnt', 'werent', 'wont', 'wouldnt',
                        'youd','youll', 'youre', 'youve', 'trump', 'democrat', 'white', 'black', 'reuter', 'monday',
                        'tuesday','wednesday','thursday', 'friday','saturday','sunday'))

[nltk_data] Downloading package stopwords to
[nltk_data]     /home/ec2-user/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [8]:
nltk.download('wordnet')
#stopwords = stopwords.words('english')
lemmatizer = WordNetLemmatizer()

def my_lemmatization_tokenizer(text):
    # Split the article on single spaces
    listofwords = text.split(' ')

    listoflemmatized_words = []

    # Keep only non-empty words that are not stopwords, and lemmatize them
    for word in listofwords:
        if (word not in listofstopwords) and (word != ''):
            lemmatized_word = lemmatizer.lemmatize(word)
            listoflemmatized_words.append(lemmatized_word)

    return listoflemmatized_words

[nltk_data] Downloading package wordnet to /home/ec2-user/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.
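
The tokenizer above boils down to three steps: split on spaces, drop empty strings and stopwords, then normalize each remaining word. The sketch below shows that logic in isolation, without NLTK. The names `STOP_SET` and `split_and_filter` are hypothetical, and the identity `normalize` function is only a stand-in for `lemmatizer.lemmatize`.

```python
# Toy stand-in for the notebook's listofstopwords (an assumption for illustration)
STOP_SET = frozenset(["the", "said"])

def split_and_filter(text, stop_set=STOP_SET, normalize=lambda word: word):
    """Split on single spaces, drop empty strings and stop words, then normalize."""
    return [normalize(word) for word in text.split(" ") if word and word not in stop_set]

# The double space produces an empty string, which is filtered out
tokens = split_and_filter("the  cats said hello")
print(tokens)
```

Holding the stopwords in a set rather than a list makes each membership check constant time on average, which may matter when the tokenizer is called on every article in every fit.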


In [15]:
# Below we create our pipeline, whose steps act as placeholders and are swapped out inside the GSCV
# Let's instantiate this below
pipe = Pipeline([('vectorizer', CountVectorizer()),
                 ('model', LogisticRegression())])

# Below we create the lists of parameters we will be using for our GridSearchCV
c_values = [.01, .1, 1, 10, 100, 100]  # note: 100 is listed twice, so that setting is fitted twice
vectorizers = [CountVectorizer(stop_words = listofstopwords, tokenizer = my_lemmatization_tokenizer),
               TfidfVectorizer(stop_words = listofstopwords, tokenizer = my_lemmatization_tokenizer)]
ngram_range = [(1,2), (1,3)]
min_df = [100, 200, 300, 400, 500] # this has been reduced as we're only taking a limited sample 
max_depths = [20, 40, 60, 80]
n_estimators = [100, 150, 200, 250]
NB_alphas = [0, 0.5, 1.0, 10]
NN_alphas = [0.01, 0.1, 1, 10, 100]


param_gridLR = {'vectorizer' : vectorizers,
                'vectorizer__min_df' : min_df,
               'vectorizer__ngram_range' : ngram_range,
                'model' : [LogisticRegression()],
                'model__C' : c_values}

param_gridTrees = {'vectorizer' : vectorizers,
                   'vectorizer__min_df' : min_df,
                   'vectorizer__ngram_range' : ngram_range,
                   'model' : [DecisionTreeClassifier(), RandomForestClassifier()],
                   'model__max_depth' : max_depths}


param_gridNN = {'vectorizer' : vectorizers,
                'vectorizer__min_df' : min_df,
                'vectorizer__ngram_range' : ngram_range,
                'model': [MLPClassifier()],
                'model__alpha' : NN_alphas}

param_gridNB = {'vectorizer' : vectorizers,
                'vectorizer__min_df' : min_df,
                'vectorizer__ngram_range' : ngram_range,
                'model': [MultinomialNB()],
                'model__alpha' : NB_alphas}

param_gridADA = {'vectorizer' : vectorizers,
                 'vectorizer__min_df' : min_df,
                 'vectorizer__ngram_range' : ngram_range,
                 'model' : [AdaBoostClassifier()],
                 'model__n_estimators' : n_estimators}

param_gridSVM = {'vectorizer' : vectorizers,
                 'vectorizer__min_df' : min_df,
                 'vectorizer__ngram_range' : ngram_range,
                 'model' : [LinearSVC()]}

# Creating a list of parameter grids before we put them in our GridSearch
param_grids = [param_gridLR, param_gridTrees, param_gridNN , param_gridNB, param_gridADA, param_gridSVM]

In [16]:
%%time
# Instantiating our GridSearchCV over all the parameter grids (not just logistic regression)
grid = GridSearchCV(pipe, param_grid = param_grids, cv = 5, n_jobs = -1, verbose = 1)

# Fitting the GridSearchCV on X_train, y_train
grid_fitted = grid.fit(X_train, y_train)

Fitting 5 folds for each of 560 candidates, totalling 2800 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 16 concurrent workers.
[Parallel(n_jobs=-1)]: Done  18 tasks      | elapsed:  3.5min
[Parallel(n_jobs=-1)]: Done 168 tasks      | elapsed: 19.4min
[Parallel(n_jobs=-1)]: Done 418 tasks      | elapsed: 45.6min
[Parallel(n_jobs=-1)]: Done 768 tasks      | elapsed: 81.8min
[Parallel(n_jobs=-1)]: Done 1218 tasks      | elapsed: 128.1min
[Parallel(n_jobs=-1)]: Done 1768 tasks      | elapsed: 184.9min
[Parallel(n_jobs=-1)]: Done 2418 tasks      | elapsed: 251.5min
[Parallel(n_jobs=-1)]: Done 2800 out of 2800 | elapsed: 289.5min finished


CPU times: user 1min 27s, sys: 1.32 s, total: 1min 29s
Wall time: 4h 50min 33s
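
The "560 candidates, totalling 2800 fits" in the log follows directly from the sizes of the option lists: within one grid the sizes multiply, across grids the products add up, and each candidate is fitted once per fold. A small sketch of that arithmetic, with the list lengths copied by hand from the cell that defines the grids (the dictionary layout and the helper `count_candidates` are only for illustration):

```python
from math import prod

# Sizes of the option lists shared by every grid in the notebook
shared = {"vectorizer": 2, "vectorizer__min_df": 5, "vectorizer__ngram_range": 2}

# Per-grid extras: number of models and number of values for the tuned hyperparameter
extras = {
    "LR": {"model": 1, "model__C": 6},
    "Trees": {"model": 2, "model__max_depth": 4},
    "NN": {"model": 1, "model__alpha": 5},
    "NB": {"model": 1, "model__alpha": 4},
    "ADA": {"model": 1, "model__n_estimators": 4},
    "SVM": {"model": 1},
}

def count_candidates(sizes):
    """Number of combinations in one grid: the product of its option-list sizes."""
    return prod(sizes.values())

per_grid = {name: count_candidates({**shared, **extra}) for name, extra in extras.items()}
total_candidates = sum(per_grid.values())
total_fits = total_candidates * 5  # cv = 5 folds

print(per_grid)
print(total_candidates, total_fits)
```

Running this kind of count before launching the search gives a rough idea of the cost, since every extra value in a shared list multiplies the work in all six grids.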


In [17]:
# Checking for best parameters
print()
print("Best parameters set found on development set:")
print(grid_fitted.best_params_)
print()

# Checking for best score
print()
print("Grid best score:")
print (grid_fitted.best_score_)
print()

#Checking for best model
print()
print("Best model is:")
print(grid_fitted.best_estimator_)


Best parameters set found on development set:
{'model': RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=60, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=100,
                       n_jobs=None, oob_score=False, random_state=None,
                       verbose=0, warm_start=False), 'model__max_depth': 60, 'vectorizer': TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
                dtype=<class 'numpy.float64'>, encoding='utf-8',
                input='content', lowercase=True, max_df=1.0, max_features=None,
                min_df=100, ngram_range=(1, 3), norm='l2', preprocessor=None,
                smooth_idf=True,
                stop_words=['i', 

In [18]:
# Scoring our test set on the gridsearch model.
grid_fitted.score(X_test, y_test)

0.9213197969543148

In [19]:
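# Saving the fitted grid search
joblib.dump(grid_fitted, 'GridSearchCVTrained.pkl')
# Note: joblib writes the same pickle format here; the .h5 extension does not make this an HDF5 file
joblib.dump(grid_fitted, 'GridSearchCVTrained.h5')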

['GridSearchCVTrained.h5']

In [21]:
# Printing some scores
grid_fitted.cv_results_

{'mean_fit_time': array([90.45750632, 91.29571242, 90.40830789, 89.4404407 , 88.14379263,
        89.25659184, 89.3679153 , 88.83359513, 87.81310701, 88.55097466,
        87.72677383, 88.96520467, 88.18350916, 89.00383449, 89.2412158 ,
        89.16892381, 88.84260058, 89.31360545, 87.86541018, 88.9003655 ,
        88.00577431, 89.17222548, 92.36915326, 89.95662417, 88.92281857,
        92.63250299, 88.89723063, 89.31614094, 88.17030215, 89.48371658,
        88.27223921, 89.58349872, 90.15021462, 90.20257087, 88.26545601,
        89.66300287, 86.07591739, 87.2058136 , 85.7158505 , 86.82307038,
        85.57852058, 88.56091185, 85.7468667 , 87.27533422, 86.85980635,
        87.99741273, 85.74789133, 87.14988618, 85.28563037, 86.49766688,
        85.32767806, 86.97160063, 86.53415394, 87.12137418, 86.20897713,
        87.54973469, 85.78815961, 87.46439109, 86.22321739, 87.49697909,
        86.76735811, 87.66110516, 86.69479766, 88.2879015 , 87.52166638,
        89.39716887, 88.73308377, 
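
The raw `cv_results_` dictionary is hard to read as printed. It is a dictionary of parallel arrays, so one way to inspect it is to zip the columns together and sort by rank. The sketch below uses the standard keys `rank_test_score`, `mean_test_score`, `std_test_score` and `params`; the helper name `top_candidates` is hypothetical, and the numbers in `toy_results` are made up purely to show the shape of the data, not taken from this run.

```python
def top_candidates(cv_results, n=3):
    """Return the n best (rank, mean score, std, params) rows from a cv_results_-style dict."""
    rows = zip(
        cv_results["rank_test_score"],
        cv_results["mean_test_score"],
        cv_results["std_test_score"],
        cv_results["params"],
    )
    # Sort on rank only, so the params dictionaries are never compared
    return sorted(rows, key=lambda row: row[0])[:n]

# Toy dictionary mimicking the layout of grid_fitted.cv_results_ (values are illustrative)
toy_results = {
    "rank_test_score": [2, 1, 3],
    "mean_test_score": [0.50, 0.75, 0.25],
    "std_test_score": [0.1, 0.2, 0.1],
    "params": [{"model__C": 1}, {"model__max_depth": 60}, {"model__alpha": 0.5}],
}

best_two = top_candidates(toy_results, n=2)
for rank, mean, std, params in best_two:
    print(rank, mean, std, params)
```

Calling `top_candidates(grid_fitted.cv_results_)` on the fitted search should work the same way, since NumPy arrays can be zipped like lists.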

In [24]:
# Pickling the best estimator while we're at it
joblib.dump(grid_fitted.best_estimator_, 'GridSearchCVBestEstimator.pkl')
joblib.dump(grid_fitted.best_estimator_, 'GridSearchCVBestEstimator.h5')


['GridSearchCVBestEstimator.pkl']

In [26]:
%%time

# Printing GridSearch results with a cross validation of 5 folds 
print(f"The best Random Forest Classifier's accuracy on the training set:{grid_fitted.score(X_train, y_train)}")
print(f"The best Random Forest Classifier's accuracy on the testing set:{grid_fitted.score(X_test, y_test)}")

The best Random Forest Classifier's accuracy on the training set:1.0
The best Random Forest Classifier's accuracy on the testing set:0.9213197969543148
CPU times: user 1min 27s, sys: 0 ns, total: 1min 27s
Wall time: 1min 27s


# Final Verdict

Our best performing model is the Random Forest Classifier with the optimized parameters. However, when comparing this score to the baseline models, it is a few points lower than what we got before. What we must keep in mind is that the grid search was only run on 10% of the data, while the other models were run on 30% of the data.

To wrap up, we ran nine different models to classify articles based on how they are written. We ran the following models:

- Logistic Regression
- Logistic Regression with scaling
- Decision Tree Classifier
- Random Forest Classifier
- K Nearest Neighbors Classifier
- Naive Bayes Multinomial Classifier
- Neural Network MLP Classifier
- Ada Boost Classifier
- Support Vector Machines Classifier

We could have selected one of the baseline models without doing the grid search, but I wanted to find the best model out of all the baseline models, and the grid search returned the Random Forest Classifier. However, we can see that the AdaBoost Classifier performed very well, with an accuracy of almost 90% without any hyperparameter optimization.

One note: applying dimensionality reduction and scaling to our data was not necessary, as we're only dealing with one column here, the `text` column from both dataframes merged into the combined dataframe. Hence, it does not make sense to reduce the dimensionality or scale our data.

Based on this analysis, we would narrow the options down to either the AdaBoost Classifier or the Random Forest Classifier. As a next step, we could run a grid search that includes only AdaBoost and Random Forest to see which one performs best.


Please also feel free to run the application that I've created using [Streamlit](https://www.streamlit.io/).

# End of Project