This is the main notebook for the capstone. A few things still to do:

- Build a function to clean your data - rows of text
- Make the classification binary through renaming columns and feature engineering
- Fill in the date for the article instead of the link

# Fake News Classifier 

### Produced by: Aly Boolani

***Data source:***
The data was collected from https://www.kaggle.com/clmentbisaillon/fake-and-real-news-dataset, where it can also be downloaded.


***Citations:***

1. Ahmed H, Traore I, Saad S. “Detecting opinion spams and fake news using text classification”, Journal of Security and Privacy, Volume 1, Issue 1, Wiley, January/February 2018.

2. Ahmed H, Traore I, Saad S. (2017) “Detection of Online Fake News Using N-Gram Analysis and Machine Learning Techniques”. In: Traore I., Woungang I., Awad A. (eds) Intelligent, Secure, and Dependable Systems in Distributed and Cloud Environments. ISDDC 2017. Lecture Notes in Computer Science, vol 10618. Springer, Cham (pp. 127-138).

This dataset contains two types of articles: fake and real news. It was collected from real-world sources; the truthful articles were obtained by crawling articles from Reuters.com (a legitimate news website). The fake news articles were collected from various unreliable websites that were flagged by Politifact (a fact-checking organization in the USA) and Wikipedia. The dataset contains different types of articles on different topics; however, the majority of articles focus on political and world news topics. 

The dataset consists of two CSV files. The file ***True.csv*** contains more than 12,600 articles from Reuters.com, while the second file, ***Fake.csv***, contains more than 12,600 articles from different fake news outlets. Each article (data point) contains the following information: 
- Article Title
- Article Text
- Article Subject
- Date the article was published on

The data was cleaned before we downloaded it and contains articles from 2016 to 2017. However, the punctuation and mistakes that existed in ***Fake.csv*** were kept as is.


| News      | Size   |      Subject     | Articles per subject |
|-----------|--------|:----------------:|----------------------|
| Real-News | 21,417 | World News       | 10,145               |
|           |        | Political News   | 11,272               |
|           |        |                  |                      |
| Fake-News | 23,481 | Government News  | 1,570                |
|           |        | Middle-east News | 778                  |
|           |        | US News          | 783                  |
|           |        | Left-News        | 4,459                |
|           |        | Politics         | 6,841                |
|           |        | News             | 9,050                |

In [27]:
# This first cell holds all of the imports used throughout the notebook
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns


# Importing Natural Language ToolKit
import nltk

# Importing NLP essentials 
import string
from nltk.corpus import stopwords 

# Importing CountVectorizer to tokenize our articles
from sklearn.feature_extraction.text import CountVectorizer


# Splitting our data using train-test-split
from sklearn.model_selection import train_test_split


# For applying NLP techniques


# Modelling
#from sklearn.linear_model import LogisticRegression
#from sklearn.neighbors import KNeighborsClassifier
#from sklearn.tree import DecisionTreeClassifier
#from sklearn.neural_network import MLPClassifier 


# Metrics
#from sklearn.metrics import confusion_matrix
#from sklearn.metrics import classification_report




We've now covered our imports, so let's move on to importing the data into the notebook:

In [28]:
# Importing our True Articles 
tdf = pd.read_csv('News/True.csv')
tdf.head()

Unnamed: 0,title,text,subject,date
0,"As U.S. budget fight looms, Republicans flip t...",WASHINGTON (Reuters) - The head of a conservat...,politicsNews,"December 31, 2017"
1,U.S. military to accept transgender recruits o...,WASHINGTON (Reuters) - Transgender people will...,politicsNews,"December 29, 2017"
2,Senior U.S. Republican senator: 'Let Mr. Muell...,WASHINGTON (Reuters) - The special counsel inv...,politicsNews,"December 31, 2017"
3,FBI Russia probe helped by Australian diplomat...,WASHINGTON (Reuters) - Trump campaign adviser ...,politicsNews,"December 30, 2017"
4,Trump wants Postal Service to charge 'much mor...,SEATTLE/WASHINGTON (Reuters) - President Donal...,politicsNews,"December 29, 2017"


In [29]:
# Importing our Fake Articles 
fdf = pd.read_csv('News/Fake.csv')
fdf.head()

Unnamed: 0,title,text,subject,date
0,Donald Trump Sends Out Embarrassing New Year’...,Donald Trump just couldn t wish all Americans ...,News,"December 31, 2017"
1,Drunk Bragging Trump Staffer Started Russian ...,House Intelligence Committee Chairman Devin Nu...,News,"December 31, 2017"
2,Sheriff David Clarke Becomes An Internet Joke...,"On Friday, it was revealed that former Milwauk...",News,"December 30, 2017"
3,Trump Is So Obsessed He Even Has Obama’s Name...,"On Christmas day, Donald Trump announced that ...",News,"December 29, 2017"
4,Pope Francis Just Called Out Donald Trump Dur...,Pope Francis used his annual Christmas Day mes...,News,"December 25, 2017"


In [55]:
# Let's make copies of our DataFrames as we don't want to make changes to the original one
true_df = tdf.copy() # copying our true article dataframe
fake_df = fdf.copy() # copying our fake articles dataframe

## Exploratory Data Analysis

In [19]:
# Resetting the column width display option to its default (reset_option only needs the option name)
pd.reset_option('display.max_colwidth')

In [34]:
true_df.head(1)

Unnamed: 0,title,text,subject,date
0,"As U.S. budget fight looms, Republicans flip t...",WASHINGTON (Reuters) - The head of a conservat...,politicsNews,"December 31, 2017"


In [None]:
# Reviewing our data
true_df.head()

In [None]:
# Reviewing our data
fake_df.head()

In [None]:
# Checking our True article columns
true_df.info()

In [None]:
# Checking our Fake article columns
fake_df.info()

In [None]:
# Checking our datatypes for the True CSV
true_df.dtypes

In [None]:
# Checking our datatypes for the Fake CSV
fake_df.dtypes

We can see that all of our columns are of type object. Leaving the first three columns as is, we move on to the ```date``` column, as it could give us some valuable information. We will first convert the ```date``` column from object to datetime.

Once the column has been converted to datetime, we split it into the following:
- Year
- Month
- Day


In [35]:
# Converting date column data type from object to datetime
true_df['date'] = pd.to_datetime(true_df['date'])

# Converting true_df date columns into year, month and day
# Extracting the year of publishing
true_df['Year'] = true_df['date'].dt.year

# Extracting the month of the year
true_df['Month'] = true_df['date'].dt.month

# Extracting the day of the month
true_df['Day of the Month'] = true_df['date'].dt.day


In [None]:
# Converting date column data type from object to datetime
fake_df['date'] = pd.to_datetime(fake_df['date'])

# Converting fake_df date columns into year, month and day
# Extracting the year of publishing
fake_df['Year'] = fake_df['date'].dt.year

# Extracting the month of the year
fake_df['Month'] = fake_df['date'].dt.month

# Extracting the day of the month
fake_df['Day of the Month'] = fake_df['date'].dt.day
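
The to-do list at the top of this notebook notes that some ```date``` entries hold a link instead of a date. If that is the case, a strict `pd.to_datetime` call would raise an error on those rows. Below is a minimal sketch of one way to handle this, assuming such rows exist. The `toy_df` frame and its values are made up for illustration and are not taken from ***Fake.csv***.

```python
import pandas as pd

# Toy stand-in for fake_df; the second row mimics a link stored in the date column (assumed)
toy_df = pd.DataFrame({
    'title': ['a', 'b', 'c'],
    'date': ['December 31, 2017',
             'https://example.com/some-article',
             'December 25, 2017'],
})

# errors='coerce' turns entries that cannot be parsed into NaT instead of raising
toy_df['date'] = pd.to_datetime(toy_df['date'], errors='coerce')

# Counting the rows that could not be parsed
n_bad_dates = int(toy_df['date'].isna().sum())
print(f'Rows with unparseable dates: {n_bad_dates}')

# Dropping those rows before extracting the date parts keeps the new columns as integers
toy_df = toy_df.dropna(subset=['date']).copy()
toy_df['Year'] = toy_df['date'].dt.year
toy_df['Month'] = toy_df['date'].dt.month
toy_df['Day of the Month'] = toy_df['date'].dt.day
```

Whether to drop the affected rows or fill the dates in by hand is a judgement call that depends on how many rows are affected.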

In [36]:
# Checking if the above has run correctly
true_df.dtypes

title                       object
text                        object
subject                     object
date                datetime64[ns]
Year                         int64
Month                        int64
Day of the Month             int64
dtype: object

In [None]:
# Checking if the above has run correctly
fake_df.dtypes

In [None]:
# Checking for splitting of columns after date time split
true_df.head()

In [None]:
# Checking for splitting of columns after date time split
fake_df.head()

In [None]:
# Checking for nulls in our True DataFrame
true_df.isna().sum()

In [None]:
# Checking for nulls in our Fake DataFrame
fake_df.isna().sum()

In [None]:
# Checking for duplicates in True DataFrame
true_df.duplicated()

In [None]:
# Checking for duplicates in Fake DataFrame
fake_df.duplicated()

In [None]:
# Checking for the sum of duplication in our True DataFrame
true_df.duplicated().sum()

In [None]:
# Checking for the sum of duplication in our Fake DataFrame
fake_df.duplicated().sum()

In [None]:
# Counting the non-null titles in the True DataFrame - a sanity check for the number stated earlier
# (use .nunique() instead to count the unique titles)
true_df['title'].value_counts().sum()


In [None]:
# Counting the non-null titles in the Fake DataFrame - a sanity check for the number stated earlier
# (use .nunique() instead to count the unique titles)
fake_df['title'].value_counts().sum()

In [None]:
# Describing our True data
true_df.describe()

In [None]:
# Describing our Fake data
fake_df.describe()

In [6]:
# Checking the different classes within the True DataFrame - # one hot encoding required 
true_df['subject'].value_counts()

politicsNews    11272
worldnews       10145
Name: subject, dtype: int64

In [7]:
# Checking the different classes within the Fake DataFrame - # one hot encoding required 
fake_df['subject'].value_counts()

News               9050
politics           6841
left-news          4459
Government News    1570
US_News             783
Middle-east         778
Name: subject, dtype: int64

In [None]:
# Here, we're going to remove all duplicated rows for both True and Fake news CSVs
# As per our check previously, there were 206 duplicated rows in the True DataFrame while 3 in the Fake DataFrame
# Let's remove these now

# Removing duplicates from True DataFrame
true_df.drop_duplicates(inplace = True)


# Removing duplicates from Fake DataFrame
fake_df.drop_duplicates(inplace = True)


In [None]:
# Doing a sanity check again for seeing if these values have been dropped in the True DataFrame
true_df.duplicated().sum()

In [None]:
# Doing a sanity check again for seeing if these values have been dropped in the Fake DataFrame
fake_df.duplicated().sum()

Some of the observations we've made up till now are as follows:
- 206 duplicated rows in our True (Real-News) DataFrame
- 3 duplicated rows in our Fake (Fake-News) DataFrame
- No null values in either of the two DataFrames
- There were 21,417 articles for Real-News and 23,481 articles for Fake-News - after ***dropping our duplicated rows***, this has gone to 21,211 (dropped by 206) articles for Real-News and 23,478 (dropped by 3) articles for Fake-News.

We've also now done some basic cleaning of the data. Let's now look at cleaning the text data itself.
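
As a starting point for the text cleaning, here is a minimal sketch of a function that cleans one row of text. The name `clean_text` and the small `STOP_WORDS` set are assumptions made for illustration; the nltk stop-word list loaded later in this notebook could be passed in instead.

```python
import string

# Small illustrative stop-word set (assumed); a fuller list such as nltk's could be used instead
STOP_WORDS = {'the', 'a', 'an', 'is', 'to', 'of', 'and', 'in', 'on'}

def clean_text(text, stop_words=STOP_WORDS):
    """Lowercase, strip ASCII punctuation and drop stop words from one row of text."""
    # Removing every character listed in string.punctuation
    no_punct = text.lower().translate(str.maketrans('', '', string.punctuation))
    # Splitting on whitespace and keeping the tokens that are not stop words
    tokens = [tok for tok in no_punct.split() if tok not in stop_words]
    return ' '.join(tokens)

example = clean_text('WASHINGTON (Reuters) - The head of a committee is to speak.')
print(example)
```

It could then be applied to a whole column with something like `true_df['text'].apply(clean_text)`. Note that `string.punctuation` only covers ASCII characters, so curly quotes such as ’ would need separate handling.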

## Feature Engineering 

In order to ensure that our data is ready to be put into a model, we must do some Feature Engineering. This would include the following: 

- Labeling our dataset
- One Hot Encoding our ```subject``` 
- One of the previous things we did

Before we move on to combining our two DataFrames, we must do some feature engineering. This includes:
- Creating a binary classification for the different subjects we have in both true and fake articles. To keep it simple, we're going to create two classes as follows:
    - Political News as 1 
    - World News / any other news as 0 
    
Given that our ```true_df``` has 2 unique subjects, ***```politicsNews```*** and ***```worldnews```***, we will encode this column using the labels mentioned above (Political News as 1 and World News as 0). 


To simplify this for our ```fake_df```, we will also encode its subjects and reduce the number of classes from 6 to 2 (Political News as 1 and World News as 0). 
- The following will be grouped as Political News (value of 1) in the ```fake_df``` DataFrame:
    - Politics
    - Government News
    
- The following will be grouped as World News (value of 0) in the ```fake_df``` DataFrame:
    - News
    - left-news
    - US_News
    - Middle-east
   
This can be seen below:

##### Labeling our datasets

In [None]:
# Adding a column of ones in the True Data Frames to identify True as class 1
true_df['label'] = '1'

In [None]:
# Adding a column of zeros in the Fake DataFrame to identify Fake as class 0
fake_df['label'] = '0'

##### Renaming and reducing the multi-class classification to binary classification

In [37]:
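# Attempting to rename the subjects for true_df to Political News and World News
# Note: rename(columns=...) only changes column labels, so the subject values are unchanged in the output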
true_df.rename(columns = {'politicsNews' : 'Political News' , 'worldnews' : 'World News'})

Unnamed: 0,title,text,subject,date,Year,Month,Day of the Month
0,"As U.S. budget fight looms, Republicans flip t...",WASHINGTON (Reuters) - The head of a conservat...,politicsNews,2017-12-31,2017,12,31
1,U.S. military to accept transgender recruits o...,WASHINGTON (Reuters) - Transgender people will...,politicsNews,2017-12-29,2017,12,29
2,Senior U.S. Republican senator: 'Let Mr. Muell...,WASHINGTON (Reuters) - The special counsel inv...,politicsNews,2017-12-31,2017,12,31
3,FBI Russia probe helped by Australian diplomat...,WASHINGTON (Reuters) - Trump campaign adviser ...,politicsNews,2017-12-30,2017,12,30
4,Trump wants Postal Service to charge 'much mor...,SEATTLE/WASHINGTON (Reuters) - President Donal...,politicsNews,2017-12-29,2017,12,29
...,...,...,...,...,...,...,...
21412,'Fully committed' NATO backs new U.S. approach...,BRUSSELS (Reuters) - NATO allies on Tuesday we...,worldnews,2017-08-22,2017,8,22
21413,LexisNexis withdrew two products from Chinese ...,"LONDON (Reuters) - LexisNexis, a provider of l...",worldnews,2017-08-22,2017,8,22
21414,Minsk cultural hub becomes haven from authorities,MINSK (Reuters) - In the shadow of disused Sov...,worldnews,2017-08-22,2017,8,22
21415,Vatican upbeat on possibility of Pope Francis ...,MOSCOW (Reuters) - Vatican Secretary of State ...,worldnews,2017-08-22,2017,8,22


In [56]:
# Renaming the subject values to create a binary classification for true_df
conversion_dict_true = {'politicsNews' : 'Political News',
                        'worldnews' : 'World News'}

true_df['subject'] = true_df['subject'].map(conversion_dict_true)

In [57]:
# Renaming the subject values and grouping them together to create a binary classification for fake_df
conversion_dict_fake = {'politics' : 'Political News',
                        'Government News' : 'Political News',
                        'News' : 'World News',
                        'left-news' : 'World News',
                        'US_News' : 'World News',
                        'Middle-east' : 'World News'}

fake_df['subject'] = fake_df['subject'].map(conversion_dict_fake)

In [58]:
# Let's look at what true_df looks like now and check the class proportions
true_df['subject'].value_counts()

Political News    11272
World News        10145
Name: subject, dtype: int64

In [60]:
# Let's look at what fake_df looks like now and check the class proportions
fake_df['subject'].value_counts()

World News        15070
Political News     8411
Name: subject, dtype: int64
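
The two subject groups are still stored as strings at this point. Here is a minimal sketch of turning them into the 1 / 0 values used in the grouping lists above (Political News as 1, World News as 0). The `toy_df` frame and the `subject_binary` column name are assumptions made for illustration.

```python
import pandas as pd

# Toy subjects as they look after the grouping step (values assumed for illustration)
toy_df = pd.DataFrame({'subject': ['Political News', 'World News', 'World News']})

# The comparison gives True / False, which astype(int) turns into 1 / 0
toy_df['subject_binary'] = (toy_df['subject'] == 'Political News').astype(int)

print(toy_df['subject_binary'].tolist())
```

A boolean comparison is used here rather than a second mapping dictionary because only two classes remain.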

#### Stemming and Lemmatization

In [None]:
stemmer = nltk.stem.PorterStemmer()

In [None]:
nltk.download('stopwords')
from nltk.corpus import stopwords 
ENGLISH_STOP_WORDS = stopwords.words('english')

#### Applying N-grams
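
A minimal sketch of what applying n-grams could look like with the `CountVectorizer` imported earlier. The two toy sentences and the `(1, 2)` range are assumptions made for illustration, not settings chosen for this project.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two toy documents (made up for illustration)
docs = ['fake news spreads fast', 'real news spreads slowly']

# ngram_range=(1, 2) keeps single words as well as pairs of adjacent words
ngram_vectorizer = CountVectorizer(ngram_range=(1, 2))
ngram_matrix = ngram_vectorizer.fit_transform(docs)

# The vocabulary now holds unigrams such as 'news' and bigrams such as 'fake news'
print(sorted(ngram_vectorizer.vocabulary_))
print(ngram_matrix.shape)
```

Adding bigrams grows the vocabulary quickly on real articles, so a `min_df` threshold is often set alongside it.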

Now that we've created the two labels, with True articles being 1 and Fake articles being 0, let's go ahead and combine them.

In [5]:
# Concatenating the two DataFrames with labels 
combined_df = pd.concat([true_df, fake_df], axis = 0)

In [None]:
combined_df.duplicated().sum()

In [None]:
combined_df['label'].value_counts()

In [None]:
print(f'The shape of the new combined DataFrame (combined_df) is {combined_df.shape}')

In [None]:
combined_df.head(10)

As we've gone through the process of labelling our articles and combining them into one DataFrame, ```combined_df```, let's separate our data into our target (the label column) and our feature variables (the rest of the columns). The label column holds the number that corresponds to the article being true (1) or fake (0).

In [None]:
# Setting our feature variables
X = combined_df.drop(columns = ['label'])

# Setting our target variables
y = combined_df['label']

Let's take a look at our X features and y variable shape below:

In [None]:
print(f'The shape for our features is {X.shape}')
print(f'The shape for our target is {y.shape}')

Now that we've defined our feature and target variables, it's time to split the data into training, validation and test sets. The model will be trained on 40% of the data, validated on 30% and tested on the remaining 30%, with the split stratified across all three sets. 

However, before we move on to splitting our data, it is important that we tokenize our data:

In [None]:
# Lemmatize 


In [None]:
# Instantiating the CountVectorizer
Bagofwords = CountVectorizer(stop_words = 'english') # insert min_df

# Fitting the CountVectorizer 
Bagofwords.fit(combined_df['text'])

# Transforming all of the data (transform needs the documents as its argument)
X_bow = Bagofwords.transform(combined_df['text'])

In [None]:
# Let's check the shape of the new sparse matrix we just created
X_bow.shape


### Train Test Split

When doing a train-test-split, we must be careful of three different things in terms of our data:

1. Stratification for class imbalances
2. Preprocessing of our data
3. Splitting by indices. 


Splitting the data into different sets may cause us to lose some of the value of our tokens, since a token may appear in only one of the sets, and this can bias our models.
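
One common way to handle the preprocessing point (item 2 above) is to fit the vectorizer on the training texts only and then reuse it on the test texts. This is an alternative to vectorizing everything up front, shown here as a sketch on made-up sentences; all names in the block are assumptions made for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

# Toy corpus (made up): 1 = real, 0 = fake
texts = ['senate passes budget bill', 'president signs trade deal',
         'court rules on election law', 'minister visits border region',
         'shocking secret they hide', 'you will not believe this',
         'celebrity destroys politician online', 'viral post exposes everything']
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Splitting the raw texts first, stratified on the label
text_train, text_test, y_tr, y_te = train_test_split(
    texts, labels, test_size=0.25, stratify=labels, random_state=1)

# The vocabulary is learned from the training texts only
vectorizer = CountVectorizer(stop_words='english')
X_tr_bow = vectorizer.fit_transform(text_train)

# The test texts are transformed with that same vocabulary; unseen words are ignored
X_te_bow = vectorizer.transform(text_test)

print(X_tr_bow.shape, X_te_bow.shape)
```

Both matrices end up with the same number of columns, so a model fitted on the training matrix can score the test matrix directly.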

As we can see, up till now, we were only working with the two dataframes without having to split our data. Let's go ahead and split our data now. 

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    test_size = 0.3,
                                                    stratify = y)

# Note: train_test_split already shuffles the data by default (shuffle=True)

In [None]:
# Let's take a look at how the values have been split
y_train.value_counts()
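
The split above produces only a training and a test set. A training, validation and test split of 40% / 30% / 30% can be sketched by calling `train_test_split` twice. The synthetic arrays and the `_toy` names below are assumptions made for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 100 rows with a balanced binary label
X_toy = np.arange(200).reshape(100, 2)
y_toy = np.array([0] * 50 + [1] * 50)

# First split: holding out 30% of all rows as the test set
X_rem, X_test_toy, y_rem, y_test_toy = train_test_split(
    X_toy, y_toy, test_size=0.3, stratify=y_toy, random_state=1)

# Second split: 30% of the total is 0.3 / 0.7 of the remaining 70%
X_train_toy, X_val_toy, y_train_toy, y_val_toy = train_test_split(
    X_rem, y_rem, test_size=0.3 / 0.7, stratify=y_rem, random_state=1)

print(len(X_train_toy), len(X_val_toy), len(X_test_toy))
```

Because the sizes are rounded, the validation set can differ from an exact 30% by a row.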

### Logistic Regression

In [None]:
# Instantiating the Logistic Regression model

# Fitting the Model

# Scoring the model on training data 

# Scoring the model on validation data

# Scoring the model on test data

# Placeholders until the scores above are computed (an empty {} in an f-string is a SyntaxError)
train_score = val_score = test_score = None

print(f'The training score using Logistic Regression is: {train_score}')
print('\n')
print(f'The validation score using Logistic Regression is: {val_score}')
print('\n')
print(f'The testing score using Logistic Regression is: {test_score}')
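
A minimal sketch of how the fit-and-score steps for this section could be filled in. It runs on synthetic data from `make_classification` as a stand-in for the vectorized articles; the variable names and the `max_iter` value are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the vectorized articles
X_syn, y_syn = make_classification(n_samples=300, n_features=20, random_state=1)

# Train / validation / test sets, stratified on the label
X_rem, X_te, y_rem, y_te = train_test_split(
    X_syn, y_syn, test_size=0.3, stratify=y_syn, random_state=1)
X_tr, X_val, y_tr, y_val = train_test_split(
    X_rem, y_rem, test_size=0.3, stratify=y_rem, random_state=1)

# Instantiating and fitting the model on the training set only
logreg = LogisticRegression(max_iter=1000)
logreg.fit(X_tr, y_tr)

# score() returns the mean accuracy on the given data
train_score = logreg.score(X_tr, y_tr)
val_score = logreg.score(X_val, y_val)
test_score = logreg.score(X_te, y_te)

print(f'The training score using Logistic Regression is: {train_score}')
print(f'The validation score using Logistic Regression is: {val_score}')
print(f'The testing score using Logistic Regression is: {test_score}')
```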

#### Logistic Regression using PCA

In [None]:
# Scaling for PCA

In [None]:
# Finding the number of components
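
A minimal sketch of the two steps above, scaling and then finding the number of components, on dense synthetic data. The 90% variance threshold is an assumption made for illustration, not a value chosen for this project.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Dense synthetic stand-in data
X_syn, _ = make_classification(n_samples=200, n_features=20, random_state=1)

# Scaling so that every feature has zero mean and unit variance
X_scaled = StandardScaler().fit_transform(X_syn)

# Fitting PCA with all components to inspect the explained variance
pca = PCA().fit(X_scaled)
cumulative = np.cumsum(pca.explained_variance_ratio_)

# Smallest number of components whose cumulative explained variance reaches 90%
n_components_90 = int(np.argmax(cumulative >= 0.90)) + 1
print(f'Components needed for 90% of the variance: {n_components_90}')
```

A bag-of-words matrix is sparse, so it may need to be converted to a dense array first, or a sparse-friendly alternative such as `TruncatedSVD` could be considered.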

### Decision Tree Classifier

In [None]:
# Instantiating the Decision Tree Classifier model

# Fitting the Model

# Scoring the model on training data 

# Scoring the model on validation data

# Scoring the model on test data

# Placeholders until the scores above are computed (an empty {} in an f-string is a SyntaxError)
train_score = val_score = test_score = None

print(f'The training score using a Decision Tree is: {train_score}')
print('\n')
print(f'The validation score using a Decision Tree is: {val_score}')
print('\n')
print(f'The testing score using a Decision Tree is: {test_score}')
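
For the decision tree, comparing a few depths on the training and validation sets is one way to spot overfitting. The sketch below uses synthetic data, and the depths tried are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the vectorized articles
X_syn, y_syn = make_classification(n_samples=300, n_features=20, random_state=1)
X_tr, X_val, y_tr, y_val = train_test_split(
    X_syn, y_syn, test_size=0.3, stratify=y_syn, random_state=1)

# Storing (training accuracy, validation accuracy) for each depth tried
depth_scores = {}
for depth in (2, 5, None):  # None lets the tree grow until its leaves are pure
    tree = DecisionTreeClassifier(max_depth=depth, random_state=1)
    tree.fit(X_tr, y_tr)
    depth_scores[depth] = (tree.score(X_tr, y_tr), tree.score(X_val, y_val))

for depth, (train_acc, val_acc) in depth_scores.items():
    print(f'max_depth={depth}: train={train_acc:.3f}, validation={val_acc:.3f}')
```

A large gap between the training and validation accuracy for the unrestricted tree would suggest that a smaller `max_depth` is worth trying.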

### K Nearest Neighbors

In [None]:
# Instantiating the Scaler

In [None]:
# Instantiating the K Nearest Neighbors model

# Fitting the Model

# Scoring the model on training data 

# Scoring the model on validation data

# Scoring the model on test data

# Placeholders until the scores above are computed (an empty {} in an f-string is a SyntaxError)
train_score = val_score = test_score = None

print(f'The training score using K Nearest Neighbors is: {train_score}')
print('\n')
print(f'The validation score using K Nearest Neighbors is: {val_score}')
print('\n')
print(f'The testing score using K Nearest Neighbors is: {test_score}')
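
K Nearest Neighbors relies on distances, which is why a scaler comes before it. A minimal sketch that chains the two in a pipeline, on synthetic data; the choice of 5 neighbors is an assumption made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the vectorized articles
X_syn, y_syn = make_classification(n_samples=300, n_features=20, random_state=1)
X_tr, X_val, y_tr, y_val = train_test_split(
    X_syn, y_syn, test_size=0.3, stratify=y_syn, random_state=1)

# The pipeline fits the scaler on the training data only, then fits KNN on the scaled data
knn_pipeline = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
knn_pipeline.fit(X_tr, y_tr)

# At scoring time the same fitted scaler is applied before predicting
knn_train_score = knn_pipeline.score(X_tr, y_tr)
knn_val_score = knn_pipeline.score(X_val, y_val)
print(f'KNN training score: {knn_train_score}, validation score: {knn_val_score}')
```

For a sparse bag-of-words matrix, `StandardScaler` would need `with_mean=False`, since centering would make the matrix dense.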

### 5-Fold Cross Validation

In [None]:
# Instantiating the Cross Validation model with 5 folds

# Fitting the Model

# Scoring the model on training data 

# Scoring the model on validation data

# Scoring the model on test data

# Placeholders until the scores above are computed (an empty {} in an f-string is a SyntaxError)
train_score = val_score = test_score = None

print(f'The training score using 5-fold cross validation is: {train_score}')
print('\n')
print(f'The validation score using 5-fold cross validation is: {val_score}')
print('\n')
print(f'The testing score using 5-fold cross validation is: {test_score}')
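
A minimal sketch of 5-fold cross validation using `cross_val_score`, on synthetic data. Logistic regression is used as the estimator purely as an example; any of the models in this notebook could take its place.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the vectorized training data
X_syn, y_syn = make_classification(n_samples=300, n_features=20, random_state=1)

# cv=5 splits the data into 5 folds; each fold is used once for validation
# For a classifier with an integer cv, the folds are stratified on the label
cv_scores = cross_val_score(LogisticRegression(max_iter=1000), X_syn, y_syn, cv=5)

print(f'Accuracy per fold: {cv_scores}')
print(f'Mean accuracy across the folds: {cv_scores.mean()}')
```

Cross validation is normally run on the training portion only, so that the test set stays untouched until the final evaluation.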

### Neural Network 

In [None]:
# Instantiating the Neural Network model

# Fitting the Model

# Scoring the model on training data 

# Scoring the model on validation data

# Scoring the model on test data

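# Placeholders until the scores above are computed (an empty {} in an f-string is a SyntaxError)
train_score = val_score = test_score = None

print(f'The training score using a Neural Network is: {train_score}')
print('\n')
print(f'The validation score using a Neural Network is: {val_score}')
print('\n')
print(f'The testing score using a Neural Network is: {test_score}')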
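
A minimal sketch of the neural network step using scikit-learn's `MLPClassifier`, on synthetic data. The single hidden layer of 16 units and the `max_iter` value are assumptions made for illustration, not tuned settings.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the vectorized articles
X_syn, y_syn = make_classification(n_samples=300, n_features=20, random_state=1)
X_tr, X_val, y_tr, y_val = train_test_split(
    X_syn, y_syn, test_size=0.3, stratify=y_syn, random_state=1)

# Scaling the inputs first, then fitting a small one-hidden-layer network
mlp_pipeline = make_pipeline(
    StandardScaler(),
    MLPClassifier(hidden_layer_sizes=(16,), max_iter=500, random_state=1))
mlp_pipeline.fit(X_tr, y_tr)

mlp_train_score = mlp_pipeline.score(X_tr, y_tr)
mlp_val_score = mlp_pipeline.score(X_val, y_val)
print(f'Neural Network training score: {mlp_train_score}, validation score: {mlp_val_score}')
```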

In [None]:
#Models to try
# Decision Tree Classifier
# KNN 
# Logistic Regression
# Neural Network 