<h1>TECHNICAL REPORT - ANALYSING FAKE NEWS<span class="tocSkip"></span></h1>

Author: Amir Yunus<br>
GitHub: https://github.com/AmirYunus/GA_DSI_Project_2
***

# PREFACE

## Problem Statement

Suppose that a government agency wants to tackle the rise of fake news and scams in the community.

* What is the probability that an article would be considered fake?
* Which word is most likely to come from fake content?
* Which word has the highest probability of being predicted wrongly?

## Executive Summary

For this project, we will consider two classes: Fake News and Real News.

There are many definitions of what constitutes fake news. For the purposes of our analysis, we will treat posts from the following subreddits as fake news:
* `r/conspiracy` - A subreddit with posts on conspiracy theories
* `r/Alternativefacts` - A subreddit with posts of 'alternative facts' as coined by U.S. Counselor to the President Kellyanne Conway
* `r/scambaiting` - A subreddit with posts related to scams and scambaiting
* `r/satire` - A subreddit with posts of parody and satire of current events

As we have taken fake news from four sources, we will also take four factual sources to balance our training data:
* `r/worldnews` - A subreddit with news from around the world, no more than 1 week old
* `r/politics` - A subreddit with political news, tending towards American politics
* `r/business` - A subreddit with financial, economic and business trends
* `r/technology` - A subreddit with gaming, data and technological advancements

As reddit limits each scrape to about 1,000 posts, we scrape repeatedly and append the new results to the existing DataFrame.

After scraping, we will look at the most common words for each category, fake and real news. We will then run a few models to determine whether raw, pre-processed or lemmatized data yields the highest `F1 Score`. Once the type of data is determined, we will run a grid search for the parameters that give the best results.

The final model is then tested on user-supplied content. The user may input the title or body of a real or fake news item of their choosing and run it against the model.

## Contents

<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#PREFACE" data-toc-modified-id="PREFACE-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>PREFACE</a></span><ul class="toc-item"><li><span><a href="#Problem-Statement" data-toc-modified-id="Problem-Statement-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>Problem Statement</a></span></li><li><span><a href="#Executive-Summary" data-toc-modified-id="Executive-Summary-1.2"><span class="toc-item-num">1.2&nbsp;&nbsp;</span>Executive Summary</a></span></li><li><span><a href="#Contents" data-toc-modified-id="Contents-1.3"><span class="toc-item-num">1.3&nbsp;&nbsp;</span>Contents</a></span></li><li><span><a href="#Data-Dictionary" data-toc-modified-id="Data-Dictionary-1.4"><span class="toc-item-num">1.4&nbsp;&nbsp;</span>Data Dictionary</a></span></li><li><span><a href="#Libraries" data-toc-modified-id="Libraries-1.5"><span class="toc-item-num">1.5&nbsp;&nbsp;</span>Libraries</a></span></li></ul></li><li><span><a href="#DATA-GATHERING" data-toc-modified-id="DATA-GATHERING-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>DATA GATHERING</a></span><ul class="toc-item"><li><span><a href="#Scraping-Reddit-for-Fake-News" data-toc-modified-id="Scraping-Reddit-for-Fake-News-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span>Scraping Reddit for Fake News</a></span></li><li><span><a href="#Scraping-Reddit-for-Real-News" data-toc-modified-id="Scraping-Reddit-for-Real-News-2.2"><span class="toc-item-num">2.2&nbsp;&nbsp;</span>Scraping Reddit for Real News</a></span></li></ul></li><li><span><a href="#EXPLORATORY-DATA-ANALYSIS" data-toc-modified-id="EXPLORATORY-DATA-ANALYSIS-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>EXPLORATORY DATA ANALYSIS</a></span><ul class="toc-item"><li><span><a href="#Data-Source" data-toc-modified-id="Data-Source-3.1"><span class="toc-item-num">3.1&nbsp;&nbsp;</span>Data Source</a></span></li><li><span><a href="#Data-Size" data-toc-modified-id="Data-Size-3.2"><span 
class="toc-item-num">3.2&nbsp;&nbsp;</span>Data Size</a></span></li></ul></li><li><span><a href="#DATA-VISUALISATION" data-toc-modified-id="DATA-VISUALISATION-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>DATA VISUALISATION</a></span><ul class="toc-item"><li><span><a href="#Top-Common-Words-for-Fake-News" data-toc-modified-id="Top-Common-Words-for-Fake-News-4.1"><span class="toc-item-num">4.1&nbsp;&nbsp;</span>Top Common Words for Fake News</a></span></li><li><span><a href="#Top-Common-Words-for-Real-News" data-toc-modified-id="Top-Common-Words-for-Real-News-4.2"><span class="toc-item-num">4.2&nbsp;&nbsp;</span>Top Common Words for Real News</a></span></li><li><span><a href="#Top-Common-Words-for-Training-Data" data-toc-modified-id="Top-Common-Words-for-Training-Data-4.3"><span class="toc-item-num">4.3&nbsp;&nbsp;</span>Top Common Words for Training Data</a></span></li></ul></li><li><span><a href="#DATA-ANALYSIS" data-toc-modified-id="DATA-ANALYSIS-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>DATA ANALYSIS</a></span><ul class="toc-item"><li><span><a href="#Baseline-Score" data-toc-modified-id="Baseline-Score-5.1"><span class="toc-item-num">5.1&nbsp;&nbsp;</span>Baseline Score</a></span></li><li><span><a href="#$M_0$---Raw-Data" data-toc-modified-id="$M_0$---Raw-Data-5.2"><span class="toc-item-num">5.2&nbsp;&nbsp;</span>$M_0$ - Raw Data</a></span></li><li><span><a href="#$M_1$---Processed-Data" data-toc-modified-id="$M_1$---Processed-Data-5.3"><span class="toc-item-num">5.3&nbsp;&nbsp;</span>$M_1$ - Processed Data</a></span></li><li><span><a href="#$M_2$---Data-with-Lemmatization" data-toc-modified-id="$M_2$---Data-with-Lemmatization-5.4"><span class="toc-item-num">5.4&nbsp;&nbsp;</span>$M_2$ - Data with Lemmatization</a></span></li><li><span><a href="#$M_3$---Final-Model" data-toc-modified-id="$M_3$---Final-Model-5.5"><span class="toc-item-num">5.5&nbsp;&nbsp;</span>$M_3$ - Final Model</a></span></li><li><span><a href="#Evaluating-our-Model" 
data-toc-modified-id="Evaluating-our-Model-5.6"><span class="toc-item-num">5.6&nbsp;&nbsp;</span>Evaluating our Model</a></span></li></ul></li><li><span><a href="#USER-TESTING" data-toc-modified-id="USER-TESTING-6"><span class="toc-item-num">6&nbsp;&nbsp;</span>USER TESTING</a></span><ul class="toc-item"><li><span><a href="#$T_0$---US-troops-cross-into-Iraq-as-part-of-withdrawal-from-Syria" data-toc-modified-id="$T_0$---US-troops-cross-into-Iraq-as-part-of-withdrawal-from-Syria-6.1"><span class="toc-item-num">6.1&nbsp;&nbsp;</span>$T_0$ - US troops cross into Iraq as part of withdrawal from Syria</a></span></li><li><span><a href="#$T_1$---Petrol-bombs-and-tear-gas-scar-Hong-Kong-streets-as-police,-protesters-clash" data-toc-modified-id="$T_1$---Petrol-bombs-and-tear-gas-scar-Hong-Kong-streets-as-police,-protesters-clash-6.2"><span class="toc-item-num">6.2&nbsp;&nbsp;</span>$T_1$ - Petrol bombs and tear gas scar Hong Kong streets as police, protesters clash</a></span></li><li><span><a href="#$T_2$---Indonesia's-Widodo-faces-test-on-reform-credentials-in-second-term" data-toc-modified-id="$T_2$---Indonesia's-Widodo-faces-test-on-reform-credentials-in-second-term-6.3"><span class="toc-item-num">6.3&nbsp;&nbsp;</span>$T_2$ - Indonesia's Widodo faces test on reform credentials in second term</a></span></li><li><span><a href="#$T_3$---Boris-Johnson-‘has-the-numbers’-to-win-Brexit-vote-TODAY-but-‘poor-man’s-Cromwell’-Speaker-Bercow-may-block-it" data-toc-modified-id="$T_3$---Boris-Johnson-‘has-the-numbers’-to-win-Brexit-vote-TODAY-but-‘poor-man’s-Cromwell’-Speaker-Bercow-may-block-it-6.4"><span class="toc-item-num">6.4&nbsp;&nbsp;</span>$T_3$ - Boris Johnson ‘has the numbers’ to win Brexit vote TODAY but ‘poor man’s Cromwell’ Speaker Bercow may block it</a></span></li><li><span><a href="#$T_4$---Cristiano-Ronaldo’s-DNA-matched-evidence-in-case-of-rape-accuser-Kathryn-Mayorga-and-he-told-lawyer-she-said-‘stop’" 
data-toc-modified-id="$T_4$---Cristiano-Ronaldo’s-DNA-matched-evidence-in-case-of-rape-accuser-Kathryn-Mayorga-and-he-told-lawyer-she-said-‘stop’-6.5"><span class="toc-item-num">6.5&nbsp;&nbsp;</span>$T_4$ - Cristiano Ronaldo’s DNA matched evidence in case of rape accuser Kathryn Mayorga and he told lawyer she said ‘stop’</a></span></li><li><span><a href="#$T_5$---Following-a-ban-on-face-masks,-protesters-in-Hong-Kong-use-wearable-face-projectors-that-trick-the-facial-recognition-system-used-by-the-government" data-toc-modified-id="$T_5$---Following-a-ban-on-face-masks,-protesters-in-Hong-Kong-use-wearable-face-projectors-that-trick-the-facial-recognition-system-used-by-the-government-6.6"><span class="toc-item-num">6.6&nbsp;&nbsp;</span>$T_5$ - Following a ban on face masks, protesters in Hong Kong use wearable face projectors that trick the facial recognition system used by the government</a></span></li><li><span><a href="#$T_6$---Rep.-Ilhan-Omar-protested-outside-Trump’s-Oct.-10-campaign-rally-in-Minneapolis" data-toc-modified-id="$T_6$---Rep.-Ilhan-Omar-protested-outside-Trump’s-Oct.-10-campaign-rally-in-Minneapolis-6.7"><span class="toc-item-num">6.7&nbsp;&nbsp;</span>$T_6$ - Rep. Ilhan Omar protested outside Trump’s Oct. 
10 campaign rally in Minneapolis</a></span></li><li><span><a href="#$T_7$---Photo-shows-former-President-Ronald-Reagan-meeting-with-Taliban-leaders-during-his-presidency" data-toc-modified-id="$T_7$---Photo-shows-former-President-Ronald-Reagan-meeting-with-Taliban-leaders-during-his-presidency-6.8"><span class="toc-item-num">6.8&nbsp;&nbsp;</span>$T_7$ - Photo shows former President Ronald Reagan meeting with Taliban leaders during his presidency</a></span></li><li><span><a href="#$T_8$---Photo-shows-U.S.-soldiers-on-the-ground-in-Syria-“crying-and-visibly-shaken-saying-they-could-stop-this-in-10-minutes-but-Trump-won’t-let-them" data-toc-modified-id="$T_8$---Photo-shows-U.S.-soldiers-on-the-ground-in-Syria-“crying-and-visibly-shaken-saying-they-could-stop-this-in-10-minutes-but-Trump-won’t-let-them-6.9"><span class="toc-item-num">6.9&nbsp;&nbsp;</span>$T_8$ - Photo shows U.S. soldiers on the ground in Syria “crying and visibly shaken saying they could stop this in 10 minutes but Trump won’t let them</a></span></li><li><span><a href="#$T_9$---So-called-“climate-change”-is-mostly-driven-by-factors-unrelated-to-human-activity-NASA-scientists-say" data-toc-modified-id="$T_9$---So-called-“climate-change”-is-mostly-driven-by-factors-unrelated-to-human-activity-NASA-scientists-say-6.10"><span class="toc-item-num">6.10&nbsp;&nbsp;</span>$T_9$ - So-called “climate change” is mostly driven by factors unrelated to human activity NASA scientists say</a></span></li><li><span><a href="#$T_u$---Custom-Input" data-toc-modified-id="$T_u$---Custom-Input-6.11"><span class="toc-item-num">6.11&nbsp;&nbsp;</span>$T_u$ - Custom Input</a></span></li><li><span><a href="#Testing-Against-r/nottheonion" data-toc-modified-id="Testing-Against-r/nottheonion-6.12"><span class="toc-item-num">6.12&nbsp;&nbsp;</span>Testing Against r/nottheonion</a></span></li><li><span><a href="#Summary-of-Results" data-toc-modified-id="Summary-of-Results-6.13"><span 
class="toc-item-num">6.13&nbsp;&nbsp;</span>Summary of Results</a></span><ul class="toc-item"><li><span><a href="#True-Negative" data-toc-modified-id="True-Negative-6.13.1"><span class="toc-item-num">6.13.1&nbsp;&nbsp;</span>True Negative</a></span></li><li><span><a href="#False-Negative-(Type-II-Error)" data-toc-modified-id="False-Negative-(Type-II-Error)-6.13.2"><span class="toc-item-num">6.13.2&nbsp;&nbsp;</span>False Negative (Type II Error)</a></span></li><li><span><a href="#False-Positive-(Type-I-Error)" data-toc-modified-id="False-Positive-(Type-I-Error)-6.13.3"><span class="toc-item-num">6.13.3&nbsp;&nbsp;</span>False Positive (Type I Error)</a></span></li><li><span><a href="#True-Positive" data-toc-modified-id="True-Positive-6.13.4"><span class="toc-item-num">6.13.4&nbsp;&nbsp;</span>True Positive</a></span></li></ul></li></ul></li><li><span><a href="#CONCLUSION" data-toc-modified-id="CONCLUSION-7"><span class="toc-item-num">7&nbsp;&nbsp;</span>CONCLUSION</a></span><ul class="toc-item"><li><span><a href="#Areas-for-Improvement" data-toc-modified-id="Areas-for-Improvement-7.1"><span class="toc-item-num">7.1&nbsp;&nbsp;</span>Areas for Improvement</a></span></li></ul></li></ul></div>

## Data Dictionary

| Feature 	| Type 	| Dataset 	| Description 	|
|------------------	|--------	|--------------------------------------------------------	|-------------------------------------	|
| df_fake.content 	| string 	| r/conspiracy <br>r/Alternativefacts <br>r/scambaiting <br>r/satire 	| Fake news scraped from reddit 	|
| df_news.content	| string 	| r/worldnews <br>r/politics <br>r/business <br>r/technology 	| Real news scraped from reddit 	|
| df_onion.content	| string 	| r/nottheonion 	| Real news that appears fake scraped from reddit 	|
| is_fake 	| int 	| (All above) 	| 1: source is fake<br>0: source is real 	|
| pred_fake 	| int 	| model 	| 1: source is predicted to be fake<br>0: source is predicted to be real 	|

## Libraries

To keep this notebook clean, all functions are compiled into a single class in `functions.py`, which is part of this technical report. We import the class, `project_3`, and alias it as `p3` for brevity.

In [None]:
# Allows Jupyter to reload custom module without restarting kernel
%load_ext autoreload
%autoreload 2

In [None]:
# Import custom class project_3 as p3 from functions.py
from functions import project_3 as p3

# DATA GATHERING

## Scraping Reddit for Fake News

We will scrape the following four subreddits for fake news:
* `r/conspiracy`
* `r/Alternativefacts`
* `r/scambaiting`
* `r/satire`

Scraping is turned off (i.e. set to `False`) so that we do not run it accidentally.

When scraping, we will get posts for `hot`, `new`, `all-time controversial`, `all-time top`, and `rising` for each topic. After scraping we will assign a new feature called `is_fake` and set it to 1.

In [None]:
# Scraping set to False. To scrape, set to True
p3.fake('conspiracy','Alternativefacts','scambaiting','satire',False)
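Because each scrape returns a limited batch of posts, repeated runs have to be merged into the stored data without keeping the same post twice. The following is a minimal sketch of that append step, assuming the posts live in a pandas DataFrame with a `content` column; `append_scrape` is a hypothetical helper written for illustration and is not taken from `functions.py`.

```python
import pandas as pd

def append_scrape(old, new, key="content"):
    """Append newly scraped rows to the stored ones, keeping one copy of each post."""
    combined = pd.concat([old, new], ignore_index=True)
    # Posts that show up in several scrapes (e.g. in both `hot` and `new`) are kept once
    return combined.drop_duplicates(subset=key).reset_index(drop=True)

# Toy data standing in for an earlier scrape and a fresh one
old = pd.DataFrame({"content": ["post a", "post b"], "is_fake": [1, 1]})
new = pd.DataFrame({"content": ["post b", "post c"], "is_fake": [1, 1]})

df_fake_demo = append_scrape(old, new)
print(len(df_fake_demo))
```

Deduplicating on `content` is one possible choice; a post identifier would serve equally well if the scraper stores one.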

If the data has already been scraped, we load it into this report as `df_fake`.

In [None]:
# Loading fake.csv as df_fake
df_fake = p3.read('fake')

## Scraping Reddit for Real News

We will scrape the following four subreddits for real news:
* `r/worldnews`
* `r/politics`
* `r/business`
* `r/technology`

Scraping is turned off (i.e. set to `False`) so that we do not run it accidentally.

When scraping, we will get posts for `hot`, `new`, `all-time controversial`, `all-time top`, and `rising` for each topic. After scraping we will assign a new feature called `is_fake` and set it to 0.

In [None]:
# Scraping set to False. To scrape, set to True
p3.news('worldnews','politics','business','technology',False)

If the data has already been scraped, we load it into this report as `df_news`.

In [None]:
# Loading news.csv as df_news
df_news = p3.read('news')

# EXPLORATORY DATA ANALYSIS

## Data Source

Our data is from reddit, one of the most active social media platforms at the time of writing. As with all social media outlets, it is prone to false information and fake news. It relies heavily on users posting new information onto the platform, and some may post a controversial comment or article simply to earn points.

Nevertheless, reddit has a few channels that are controlled, and data from these channels is reliable. Sources such as `r/worldnews` and `r/business` were selected for their credibility and scraped as real news.

On the other hand, subreddits that are moderated by the users themselves tend to be less reliable, especially when they are devoted to topics such as `r/conspiracy` and `r/Alternativefacts`. These subreddits were scraped and labelled as fake news.

## Data Size

Before we perform any form of modelling, we need to ensure that we have a good sample size for our training model. Let us consider the size of our `df_fake`.

In [None]:
# Check the size of df_fake
p3.check(f'Size of df_fake is {df_fake.shape[0]}')

We have at least 50,000 posts of fake news at the time of writing. Now consider `df_news`.

In [None]:
# Check the size of df_news
p3.check(f'Size of df_news is {df_news.shape[0]}')

Real news has more unique posts, at least 65,000 at the time of writing. We do not want our training data to be unbalanced, so let us keep only the first 50,000 rows from each class.

In [None]:
df_fake = df_fake[:50000]
df_fake.shape

In [None]:
df_news = df_news[:50000]
df_news.shape

We merge both datasets together as `df`.

In [None]:
# Merge the DataFrames, then separate the target (is_fake) from the features
df = p3.merge(df_news,df_fake).reset_index(drop=True)
y = df.is_fake
df.drop('is_fake', inplace=True, axis=1)

In [None]:
# Check the classification. 1 = fake news, 0 = real news
y.value_counts(normalize=True)

The distribution of 0.5 is expected as we have selected the exact number of observations per class.
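The balancing and merging steps above can be sketched on toy data. This is an illustration only, assuming plain pandas; the small frames stand in for `df_fake` and `df_news`, and `pd.concat` is used here in place of `p3.merge`, whose internals are not shown in this report.

```python
import pandas as pd

# Toy stand-ins for df_fake and df_news (the real frames hold tens of thousands of posts)
fake = pd.DataFrame({"content": ["f1", "f2", "f3"], "is_fake": 1})
news = pd.DataFrame({"content": ["n1", "n2", "n3", "n4", "n5"], "is_fake": 0})

# Keep the same number of rows from each class, then stack them
n = min(len(fake), len(news))
balanced = pd.concat([news[:n], fake[:n]], ignore_index=True)

# Share of each class in the merged data
shares = balanced["is_fake"].value_counts(normalize=True)
print(shares.to_dict())
```

Slicing keeps the first `n` rows of each frame, as in the cells above; drawing a random sample with a fixed seed would be an alternative if the row order carries any pattern.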

# DATA VISUALISATION

## Top Common Words for Fake News

What were the common words to appear in fake news articles?

In [None]:
# Display wordcloud for df_fake
p3.wordcloud(df_fake.content,'fake')

In [None]:
# Display top 10 words for df_fake
words = p3.features(df_fake,df_fake.is_fake)

There are a few interesting keywords from the wordcloud above, such as:
* Trump; and
* conspiracy

Let us consider the common words for real news.

## Top Common Words for Real News

In [None]:
# Display wordcloud for df_news
p3.wordcloud(df_news.content,'news')

In [None]:
# Display top 10 words for df_news
words = p3.features(df_news,df_news.is_fake)

There are a few interesting keywords from `df_news`, such as:

* Trump
* mueller
* russia; and
* cohen

Surprisingly, `Trump` appears as the most common word in both real and fake news, so our model may have difficulty using it to separate the classes. Note that we have not yet removed stopwords from this data.

Let us consider the total size of our word features.

## Top Common Words for Training Data

In [None]:
# Display wordcloud for df
p3.wordcloud(df.content,'default')

In [None]:
# Display top 10 words for training data
words = p3.features(df,y)

We have at least 42,000 words at the time of writing. After combining the unique words from both `df_fake` and `df_news`, we can see that `trump` is still the most common word across real and fake news.
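As a rough sketch of how the vocabulary size and the top words could be obtained, the example below counts words with scikit-learn's `CountVectorizer` on three toy documents. It is not the implementation of `p3.features`, which is defined in `functions.py`; the documents are made up for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Toy documents standing in for the scraped posts
docs = ["trump russia probe", "trump conspiracy theory", "russia sanctions trump"]

cv = CountVectorizer()
counts = cv.fit_transform(docs)              # sparse matrix: documents x words

# Total occurrences of each word across all documents
totals = np.asarray(counts.sum(axis=0)).ravel()
index_to_word = {idx: word for word, idx in cv.vocabulary_.items()}

# Rank words by total count, highest first
top = sorted(
    ((int(totals[i]), index_to_word[i]) for i in range(len(totals))),
    reverse=True,
)[:3]

print(len(cv.vocabulary_), top[:2])
```

The number of entries in `vocabulary_` is the number of word features, which is the figure quoted in the text.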

# DATA ANALYSIS

Let us find which data source gives us the best `F1 Score`, using `GridSearchCV` to search for it. The following vectorizers will be considered:

* `cv`: `CountVectorizer()`
* `tv`: `TfidfVectorizer()`

And the following classifiers:

* `lr`: `LogisticRegression()`
* `bnb`: `BernoulliNB()`
* `mnb`: `MultinomialNB()`
* `gnb`: `GaussianNB()`
* `knn`: `KNeighborsClassifier()`

Other parameters:

* `use_params`: Set whether to use grid search to find the best parameters
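To make the search concrete, here is a minimal sketch of one vectorizer and classifier pair wrapped in a scikit-learn `Pipeline` and tuned with `GridSearchCV` on the F1 score. The toy texts, the step names and the parameter grid are assumptions made for this example; the actual grids used by `p3.search` are defined in `functions.py` and are not reproduced here.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Toy corpus: 1 = fake, 0 = real
texts = [
    "secret plot hidden by elites", "aliens built the pyramids secret",
    "hidden truth they hide from you", "secret cure hidden by doctors",
    "central bank raises interest rates", "parliament passes budget bill",
    "company reports quarterly earnings", "government signs trade agreement",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# One candidate combination: TfidfVectorizer followed by LogisticRegression
pipe = Pipeline([("tv", TfidfVectorizer()), ("lr", LogisticRegression())])

# Hypothetical grid; keys follow the <step name>__<parameter> convention
grid = {"tv__ngram_range": [(1, 1), (1, 2)], "lr__C": [0.1, 1.0]}

search = GridSearchCV(pipe, grid, scoring="f1", cv=2)
search.fit(texts, labels)

print(search.best_params_, round(search.best_score_, 3))
```

Repeating this for each vectorizer and classifier pair and keeping the best cross-validated score is one way to build a comparison of the kind shown in the sections below.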

## Baseline Score

Before we start modelling, we need to determine the baseline score. In our situation, since we selected the same number of observations from both classes, our baseline score is 0.5.

In [None]:
# Check the classification. 1 = fake news, 0 = real news
y.value_counts(normalize=True)

## $M_0$ - Raw Data

This model reports the best `F1 Score` obtained on raw, unprocessed data.

In [None]:
# Enter what vectorizers and classifiers we want to use and find the best.

vectorizer = ['cv','tv']
classifier = ['lr','bnb','mnb','knn']
use_params = False

df_solns, X_train, X_test, y_train, train_score, test_score, y_test, y_test_hat, best_params, f1_score = p3.search(vectorizer,classifier,df,y,use_params)

Let us see what is the best model.

In [None]:
# Display the best model
df_solns = df_solns.sort_values(ascending=False, by='f1_score',axis=1)
df_solns.iloc[:,0:1]

For our benchmark ($M_0$) model, we will use `TfidfVectorizer` with `LogisticRegression` which yields an `F1 Score` of `0.912374`.
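The ranking cell sorts with `axis=1`, which orders the columns of a DataFrame by the values found in one row. The sketch below shows that behaviour on a small frame, assuming each column holds one model and the `f1_score` row holds its score; the column names and scores are hypothetical and chosen only for illustration.

```python
import pandas as pd

# Hypothetical results: one column per model, one row per metric
results = pd.DataFrame(
    {"cv+mnb": [0.80, 0.85], "tv+lr": [0.90, 0.93], "cv+knn": [0.60, 0.70]},
    index=["f1_score", "train_score"],
)

# Order the columns by the values in the 'f1_score' row, best first
ranked = results.sort_values(by="f1_score", axis=1, ascending=False)

# The first column is now the best-scoring model
best = ranked.iloc[:, 0:1]
print(best)
```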

## $M_1$ - Processed Data

In this model, we will use `BeautifulSoup` and `RegEx` to clean our data. However, we will not use `Lemmatization` for this model. Stopwords are not removed either.

In [None]:
# Clean the training data and assign it to df_processed
df_processed = p3.clean(df,lemma=False)
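The exact rules applied by `p3.clean` are not shown in this report. As a rough illustration of this kind of cleaning, the sketch below strips HTML markup and then keeps only letters with a regular expression. It swaps in the standard library's `html.parser` for `BeautifulSoup` so that it runs without extra packages, and `clean_text` is a hypothetical helper.

```python
import re
from html.parser import HTMLParser

class TextOnly(HTMLParser):
    """Collect only the text nodes of an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

def clean_text(raw):
    """Drop HTML tags, keep letters only, and lower-case the result."""
    parser = TextOnly()
    parser.feed(raw)
    parser.close()
    text = " ".join(parser.parts)
    # Replace every run of non-letter characters with a single space
    text = re.sub(r"[^a-zA-Z]+", " ", text)
    return text.lower().strip()

cleaned = clean_text("<p>Trump &amp; Russia: <b>BREAKING</b> news!!</p>")
print(cleaned)
```

Removing digits, punctuation and case distinctions merges tokens that would otherwise be counted as separate features, which is one way the number of features can fall after cleaning.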

Let us see if the number of words dropped after cleaning.

In [None]:
# Display wordcloud for df_processed
p3.wordcloud(df_processed,'default')

In [None]:
# Display top 10 words for processed data
words_processed = p3.features(df_processed,y)

We can see that the number of features dropped from `42901` to `35559`. The top common words did not change though, with `trump`, `russia` and `mueller` in the top 10.

Let us find out what is the best model for our $M_1$.

In [None]:
# Enter what vectorizers and classifiers we want to use and find the best.

vectorizer = ['cv','tv']
classifier = ['lr','bnb','mnb','knn']
use_params = False

df_solns, X_train, X_test, y_train, train_score, test_score, y_test, y_test_hat, best_params, f1_score = p3.search(vectorizer,classifier,df,y,use_params)

In [None]:
# Display the best model
df_solns = df_solns.sort_values(ascending=False, by='f1_score',axis=1)
df_solns.iloc[:,0:1]

For $M_1$, the best model is `TfidfVectorizer` with `LogisticRegression`, with an `F1 Score` of `0.909728`, which is lower than our benchmark score of `0.912374`.

## $M_2$ - Data with Lemmatization

In this model, we will use `BeautifulSoup` and `RegEx` to clean our data. Lemmatization will be used to reduce words to their common base forms. Stopwords are not removed for this model.

In [None]:
# Cleaning our data with lemmatization
df_lemma = p3.clean(df,lemma=True)

Let us see the number of words after cleaning.

In [None]:
# Display wordcloud for df_lemma
p3.wordcloud(df_lemma,'default')

In [None]:
# Display top 10 words for processed data with lemmatization
words_lemma = p3.features(df_lemma,y)

The number of words dropped further from `35559` to `29626`. The top common words have changed. `trump` is still leading, but `russia` is nowhere near the top now.

Let us find the best model for $M_2$.

In [None]:
# Enter what vectorizers and classifiers we want to use and find the best.

vectorizer = ['cv','tv']
classifier = ['lr','bnb','mnb','knn']
use_params = False

df_solns, X_train, X_test, y_train, train_score, test_score, y_test, y_test_hat, best_params, f1_score = p3.search(vectorizer,classifier,df,y,use_params)

In [None]:
# Display the best model
df_solns = df_solns.sort_values(ascending=False, by='f1_score',axis=1)
df_solns.iloc[:,0:1]

For our $M_2$ model, the best combination is again `TfidfVectorizer` with `LogisticRegression`. However, the `F1 Score` dropped below the benchmark ($M_0$) score of `0.912374` to `0.900371`.

## $M_3$ - Final Model

From the previous three models, we can see that raw, unprocessed data without lemmatization gives the highest `F1 Score`. We will now fit the same data, but introduce new parameters to find the best model.

Note that we will not consider `BernoulliNB` and `KNeighborsClassifier`, as they give a poor `F1 Score`. We have also set `cv = 2`, as we do not see an increase in the `F1 Score` above `cv = 2`.

In [None]:
# Enter what vectorizers and classifiers we want to use and find the best.

vectorizer = ['cv','tv']
classifier = ['lr','mnb']
use_params = True

df_solns, X_train, X_test, y_train, train_score, test_score, y_test, y_test_hat, best_params, f1_score = p3.search(vectorizer,classifier,df,y,use_params)

In [None]:
# Display the best model
df_solns = df_solns.sort_values(ascending=False, by='f1_score',axis=1)
df_solns.iloc[:,0:1]

Given the result above, our best model is `TfidfVectorizer` with `MultinomialNB`, with an `F1 Score` of `0.989823`, which is higher than our benchmark ($M_0$) score of `0.912374` and our baseline score of `0.5`.

Let us see what are the best parameters for our model.

In [None]:
# Display best parameters
df_solns.iloc[0,0]

We will set these parameters for our model when testing user input data.

In [None]:
# # Saving our best parameters for later use
# best_max_df = list(df_solns.iloc[0,0].values())[0]
# best_max_features = list(df_solns.iloc[0,0].values())[1]
# best_min_df = list(df_solns.iloc[0,0].values())[2]
# best_ngram_range = list(df_solns.iloc[0,0].values())[3]
# best_stop = list(df_solns.iloc[0,0].values())[4]
# best_alpha = list(df_solns.iloc[0,0].values())[5]
# best_penalty = list(df_solns.iloc[0,0].values())[6]
# best_vect = df_solns.iloc[5,0]
# best_model = df_solns.iloc[2,0]

In [None]:
# Saving our best parameters for later use
params = list(df_solns.iloc[0,0].values())
best_alpha = params[0]
best_max_df = params[1]
best_max_features = params[2]
best_min_df = params[3]
best_ngram_range = params[4]
best_stop = params[5]  # named best_stop to match the later check_user calls
# Assumption: MultinomialNB has no penalty parameter; defined only because check_user is called with it
best_penalty = None
best_model = df_solns.iloc[2,0]
best_vect = df_solns.iloc[5,0]

## Evaluating our Model

Let us consider the confusion matrix of our model.

|  | Actual Fake | Actual Real | 0.50<br>prevalence | 0.98<br>accuracy |
|----------------|------------------|--------------------|---------------------------|--------------------------|
| Predicted Fake | 7,621 | 254<br>(Type I) | 0.97<br>precision | 0.03<br>false discovery |
| Predicted Real | 53<br>(Type II) | 7,421 | 0.01<br>false omission | 0.99<br>negative predictive |
|  | 0.99<br>sensitivity | 0.03<br>fall out rate | 30.01<br>positive likelihood | 4201.12<br>diagnostic odds |
|  | 0.01<br>miss rate | 0.97<br>specificity | 0.01<br>negative likelihood | 0.98<br>F1 score |

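Our Type I error count (254) is higher than our Type II error count (53). We consider this acceptable: a false positive flags a real article as fake, which at worst makes readers more wary of it, whereas a false negative lets a fake article pass as real.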
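The derived rates in the confusion matrix follow directly from its four counts. As a quick check, the sketch below recomputes the main ones from the counts reported above, using the standard definitions and plain Python.

```python
# Counts taken from the confusion matrix above ("fake" is the positive class)
tp, fp, fn, tn = 7621, 254, 53, 7421

total = tp + fp + fn + tn
accuracy = (tp + tn) / total
precision = tp / (tp + fp)        # share of predicted-fake posts that are fake
sensitivity = tp / (tp + fn)      # share of fake posts that are caught (recall)
specificity = tn / (tn + fp)      # share of real posts that are left alone
f1 = 2 * precision * sensitivity / (precision + sensitivity)

for name, value in [
    ("accuracy", accuracy),
    ("precision", precision),
    ("sensitivity", sensitivity),
    ("specificity", specificity),
    ("F1 score", f1),
]:
    print(f"{name}: {value:.2f}")
```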

Let us display which of the content we predicted correctly.

In [None]:
X_test['is_fake'] = y_test.values
X_test['pred_fake'] = y_test_hat
X_test_true = X_test[X_test["is_fake"] == X_test['pred_fake']]
X_test_false = X_test[X_test["is_fake"] != X_test['pred_fake']]

In [None]:
X_test_true

Now consider the false predictions.

In [None]:
X_test_false

A few of the wrongly predicted posts are interesting. For example:

    " Facebook pays teens to install VPN that spies on them "

Many readers would take this for a fake post. Nevertheless, it is true and was reported by multiple reputable sources. (https://techcrunch.com/2019/01/29/facebook-project-atlas/)


    " Manafort indicted as part of Trump Russia probe "

This looks like a real post, and it is, although it was rephrased from the original title "Ex-Trump campaign chair Paul Manafort pleads not guilty to charges in Russia probe". The model predicted real news, which is correct in substance, but the post came from a subreddit labelled as fake news, so it is counted as a wrong prediction. (https://www.cnbc.com/2017/10/30/former-trump-campaign-chairman-paul-manafort-indicted-as-part-of-russia-election-probe-nyt.html)

# USER TESTING

We will now test our model with 10 different pieces of content: 5 real and 5 fake news items.

Real News (3 regular news, 2 tabloid news):
* US troops cross into Iraq as part of withdrawal from Syria (https://www.channelnewsasia.com/news/world/us-troops-cross-into-iraq-as-part-of-withdrawal-from-syria-12020614)
* Petrol bombs and tear gas scar Hong Kong streets as police, protesters clash (https://www.channelnewsasia.com/news/asia/hong-kong-protests-tear-gas-water-cannon-tsim-sha-tsui-12018546)
* Indonesia's Widodo faces test on reform credentials in second term (https://www.reuters.com/article/us-indonesia-politics-president/indonesias-widodo-faces-test-on-reform-credentials-in-second-term-idUSKBN1WZ090)
* Boris Johnson ‘has the numbers’ to win Brexit vote TODAY but ‘poor man’s Cromwell’ Speaker Bercow may block it (https://www.thesun.co.uk/news/brexit/10179746/brexit-vote-boris-johnson-speaker-bercow/)
* Cristiano Ronaldo’s DNA matched evidence in case of rape accuser Kathryn Mayorga and he told lawyer she said ‘stop’ (https://www.thesun.co.uk/news/10177398/cristiano-ronaldo-dna-rape-accuser/)

Fake News:
* Following a ban on face masks, protesters in Hong Kong use wearable face projectors that trick the facial recognition system used by the government (https://www.apnews.com/e74ac1065b8b45fca1097eebbcc527be)
* Rep. Ilhan Omar protested outside Trump’s Oct. 10 campaign rally in Minneapolis (https://www.apnews.com/89193675e3e246cc8251e78ca4fa7612)
* Photo shows a Turkish soldier helping a child drink from a water bottle with the implication it was taken during the military offensive launched by Turkey last week in northern Syria (https://www.apnews.com/89193675e3e246cc8251e78ca4fa7612)
* Photo shows U.S. soldiers on the ground in Syria “crying and visibly shaken saying they could stop this in 10 minutes but Trump won’t let them. (https://www.apnews.com/89193675e3e246cc8251e78ca4fa7612)
* Photo shows a massive crowd in Baghdad demonstrating in early October against corruption. (https://www.apnews.com/e74ac1065b8b45fca1097eebbcc527be)
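Each test in this section fits the chosen vectorizer and classifier on the training data and then scores a single headline. The sketch below shows the general idea with a `TfidfVectorizer` and `MultinomialNB` pipeline trained on a toy corpus. It is not the implementation of `p3.check_user`; the training texts are made up, and default parameters are used in place of the tuned ones.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy training data: 1 = fake, 0 = real
train_texts = [
    "secret plot hidden by elites", "photo shows aliens meeting president",
    "hidden truth they hide from you", "miracle cure doctors hate",
    "troops withdraw from border region", "parliament passes budget bill",
    "company reports quarterly earnings", "police and protesters clash in city",
]
train_labels = [1, 1, 1, 1, 0, 0, 0, 0]

model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(train_texts, train_labels)

headline = ["US troops cross into Iraq as part of withdrawal from Syria"]

# Column of predict_proba that corresponds to the fake class (label 1)
classes = list(model.named_steps["multinomialnb"].classes_)
proba_fake = model.predict_proba(headline)[0][classes.index(1)]
pred = int(model.predict(headline)[0])

print(f"P(fake) = {proba_fake:.2f}, predicted is_fake = {pred}")
```

The probability returned by `predict_proba` is the quantity asked for in the problem statement; the predicted label is that probability thresholded at 0.5.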

## $T_0$ - US troops cross into Iraq as part of withdrawal from Syria

In [None]:
text = ["US troops cross into Iraq as part of withdrawal from Syria"]
is_fake = 'N'
df_user = p3.test_input(text,is_fake)

In [None]:
df_user = p3.check_user(best_vect,best_max_df, best_min_df, best_ngram_range, best_max_features, best_model, best_alpha, best_stop, df,df_user,y,best_penalty)

Predicted correctly: <b>True Negative</b>

## $T_1$ - Petrol bombs and tear gas scar Hong Kong streets as police, protesters clash

In [None]:
text = ["Petrol bombs and tear gas scar Hong Kong streets as police protesters clash"]
is_fake = 'N'
df_user = p3.test_input(text,is_fake)

In [None]:
df_user = p3.check_user(best_vect,best_max_df, best_min_df, best_ngram_range, best_max_features, best_model, best_alpha, best_stop, df,df_user,y,best_penalty)

Predicted correctly: <b>True Negative</b>

## $T_2$ - Indonesia's Widodo faces test on reform credentials in second term

In [None]:
text = ["Indonesia's Widodo faces test on reform credentials in second term"]
is_fake = 'N'
df_user = p3.test_input(text,is_fake)

In [None]:
df_user = p3.check_user(best_vect,best_max_df, best_min_df, best_ngram_range, best_max_features, best_model, best_alpha, best_stop, df,df_user,y,best_penalty)

Predicted correctly: <b>True Negative</b>

## $T_3$ - Boris Johnson ‘has the numbers’ to win Brexit vote TODAY but ‘poor man’s Cromwell’ Speaker Bercow may block it

In [None]:
text = ["Boris Johnson ‘has the numbers’ to win Brexit vote TODAY but ‘poor man’s Cromwell’ Speaker Bercow may block it"]
is_fake = 'N'
df_user = p3.test_input(text,is_fake)

In [None]:
df_user = p3.check_user(best_vect,best_max_df, best_min_df, best_ngram_range, best_max_features, best_model, best_alpha, best_stop, df,df_user,y,best_penalty)

Predicted correctly: <b>True Negative</b>

## $T_4$ - Cristiano Ronaldo’s DNA matched evidence in case of rape accuser Kathryn Mayorga and he told lawyer she said ‘stop’

In [None]:
text = ["Cristiano Ronaldo’s DNA matched evidence in case of rape accuser Kathryn Mayorga and he told lawyer she said ‘stop’"]
is_fake = 'N'
df_user = p3.test_input(text,is_fake)

In [None]:
df_user = p3.check_user(best_vect,best_max_df, best_min_df, best_ngram_range, best_max_features, best_model, best_alpha, best_stop, df,df_user,y,best_penalty)

Predicted correctly: <b>True Negative</b>

## $T_5$ - Following a ban on face masks, protesters in Hong Kong use wearable face projectors that trick the facial recognition system used by the government

In [None]:
text = ["Following a ban on face masks protesters in Hong Kong use wearable face projectors that trick the facial recognition system used by the government"]
is_fake = 'Y'
df_user = p3.test_input(text,is_fake)

In [None]:
df_user = p3.check_user(best_vect,best_max_df, best_min_df, best_ngram_range, best_max_features, best_model, best_alpha, best_stop, df,df_user,y,best_penalty)

Predicted incorrectly: <b>False Negative</b>

## $T_6$ - Rep. Ilhan Omar protested outside Trump’s Oct. 10 campaign rally in Minneapolis

In [None]:
text = ["Rep. Ilhan Omar protested outside Trump’s Oct. 10 campaign rally in Minneapolis"]
is_fake = 'Y'
df_user = p3.test_input(text,is_fake)

In [None]:
df_user = p3.check_user(best_vect,best_max_df, best_min_df, best_ngram_range, best_max_features, best_model, best_alpha, best_stop, df,df_user,y,best_penalty)

Predicted incorrectly: <b>False Negative</b>

## $T_7$ - Photo shows former President Ronald Reagan meeting with Taliban leaders during his presidency

In [None]:
text = ["Photo shows former President Ronald Reagan meeting with Taliban leaders during his presidency"]
is_fake = 'Y'
df_user = p3.test_input(text,is_fake)

In [None]:
df_user = p3.check_user(best_vect,best_max_df, best_min_df, best_ngram_range, best_max_features, best_model, best_alpha, best_stop, df,df_user,y,best_penalty)

Predicted correctly: <b>True Positive</b>

## $T_8$ - Photo shows U.S. soldiers on the ground in Syria “crying and visibly shaken saying they could stop this in 10 minutes but Trump won’t let them

In [None]:
text = ["Photo shows U.S. soldiers on the ground in Syria “crying and visibly shaken saying they could stop this in 10 minutes but Trump won’t let them"]
is_fake = 'Y'
df_user = p3.test_input(text,is_fake)

In [None]:
df_user = p3.check_user(best_vect,best_max_df, best_min_df, best_ngram_range, best_max_features, best_model, best_alpha, best_stop, df,df_user,y,best_penalty)

Predicted correctly: <b>True Positive</b>

## $T_9$ - So-called “climate change” is mostly driven by factors unrelated to human activity NASA scientists say

In [None]:
text = ["So-called “climate change” is mostly driven by factors unrelated to human activity NASA scientists say"]
is_fake = 'Y'
df_user = p3.test_input(text,is_fake)

In [None]:
df_user = p3.check_user(best_vect,best_max_df, best_min_df, best_ngram_range, best_max_features, best_model, best_alpha, best_stop, df,df_user,y,best_penalty)

Predicted incorrectly: <b>False Negative</b>

## $T_u$ - Custom Input

The code below is commented out. Uncomment it to enter your own string and let the model predict.

In [None]:
# df_user = p3.get_input()

In [None]:
# df_user = p3.check_user(best_vect,best_max_df, best_min_df, best_ngram_range, best_max_features, best_model, best_alpha, best_stop, df,df_user,y,best_penalty)

## Testing Against r/nottheonion

`r/nottheonion` is a subreddit for real news stories so ridiculous that people might think they are fake. It would not be surprising if our model predicted more false positives than usual, as the wording tends to resemble fake news.

In [None]:
# Scraping set to False. To scrape, set to True
p3.onion('nottheonion',False)

In [None]:
# Read onion.csv as df_onion
df_onion = p3.read("onion")

In [None]:
# Select only content as feature and check shape
df_onion = df_onion[["content"]]
df_onion.shape

In [None]:
# Run the model against df_onion and display results
df_onion = p3.check_user(best_vect,best_max_df, best_min_df, best_ngram_range, best_max_features, best_model, best_alpha, best_stop, df,df_onion,y,best_penalty)

In [None]:
# See what is the distribution of pred_fake
df_onion.pred_fake.value_counts(normalize=True)

Surprisingly, our model did well, correctly identifying about 93% of posts as real, with about 6% false positives (Type I errors). This could be due to the large training set we provided to the model.
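Because every post in `r/nottheonion` is real news, any post predicted as fake is a false positive, so the share of fake predictions is the false positive rate. The sketch below illustrates that calculation on a small made-up DataFrame rather than the scraped data. It assumes `pred_fake` is coded as 1 for fake and 0 for real; the actual coding returned by `p3.check_user` is not shown here and may differ.

```python
import pandas as pd

# Hypothetical predictions for 10 posts that are all truly real news.
# Assumption: pred_fake is 1 for "fake" and 0 for "real".
df_demo = pd.DataFrame({"pred_fake": [0, 0, 0, 1, 0, 0, 0, 0, 0, 0]})

# Normalised counts give the share of each predicted class
shares = df_demo["pred_fake"].value_counts(normalize=True)

# Every post is real, so the share predicted fake is the false positive rate
false_positive_rate = shares.get(1, 0.0)
true_negative_rate = shares.get(0, 0.0)

print(f"False positive rate: {false_positive_rate:.0%}")
print(f"True negative rate: {true_negative_rate:.0%}")
```

Using `.get` with a default of `0.0` keeps the calculation from failing when the model predicts only one class.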

## Summary of Results

The model reaches 70% accuracy (7 of 10 correct) on the 10 articles sourced from outside Reddit. Results are shown below:

### True Negative

* US troops cross into Iraq as part of withdrawal from Syria
* Petrol bombs and tear gas scar Hong Kong streets as police protesters clash
* Indonesia's Widodo faces test on reform credentials in second term
* Boris Johnson ‘has the numbers’ to win Brexit vote TODAY but ‘poor man’s Cromwell’ Speaker Bercow may block it
* Cristiano Ronaldo’s DNA matched evidence in case of rape accuser Kathryn Mayorga and he told lawyer she said ‘stop’

### False Negative (Type II Error)

* Following a ban on face masks protesters in Hong Kong use wearable face projectors that trick the facial recognition system used by the government
* Rep. Ilhan Omar protested outside Trump’s Oct. 10 campaign rally in Minneapolis
* So-called “climate change” is mostly driven by factors unrelated to human activity NASA scientists say

### False Positive (Type I Error)

* None

### True Positive

* Photo shows former President Ronald Reagan meeting with Taliban leaders during his presidency
* Photo shows U.S. soldiers on the ground in Syria “crying and visibly shaken saying they could stop this in 10 minutes but Trump won’t let them
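Taken together, the four groups above form a confusion matrix, from which metrics other than accuracy can be derived. The sketch below is a plain-Python illustration using the counts from the lists above; the variable names are chosen for this example and are not part of the `p3` module.

```python
# Counts taken from the four groups listed above
true_negative = 5
false_negative = 3
false_positive = 0
true_positive = 2

total = true_negative + false_negative + false_positive + true_positive

# Share of all articles that were classified correctly
accuracy = (true_positive + true_negative) / total

# Share of fake articles that the model caught (sensitivity)
recall = true_positive / (true_positive + false_negative)

# Share of articles flagged as fake that really were fake
precision = true_positive / (true_positive + false_positive)

# Share of real articles that were correctly left unflagged
specificity = true_negative / (true_negative + false_positive)

print(f"Accuracy:    {accuracy:.0%}")
print(f"Recall:      {recall:.0%}")
print(f"Precision:   {precision:.0%}")
print(f"Specificity: {specificity:.0%}")
```

On these 10 articles the model never flags a real article as fake, but it catches only 2 of the 5 fake ones, so the 70% accuracy sits alongside a recall of 40%. With a sample this small, the figures are indicative only.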

# CONCLUSION

* What is the probability that an article would be considered fake?
* Which word is most likely from a fake content?
* Which word has the highest probability to be predicted wrongly?

## Areas for Improvement

* Increase the sample size by scraping more topics and more content
* Include user comments