# Fake News Detection

This notebook implements a fake news detection pipeline in the following steps:

1. Extracting the Dataset ([Download Here](https://drive.google.com/drive/folders/1ByadNwMrPyds53cA6SDCHLelTAvIdoF_)).
2. Labelling the two datasets and cleaning the text using Regular Expressions.
3. Training/Fitting the dataset on two models: **Logistic Regression** and **Decision Tree**.
4. Evaluating the two Models.
5. Providing a simple interface for testing the models.

### Importing Libraries

In [2]:
import pandas as pd
# DataFrames: tabular data structures used to load and manipulate the CSV files.

import numpy as np
# NumPy is used to manipulate arrays.

import seaborn as sns
# Statistical data visualization built on top of matplotlib.

import matplotlib.pyplot as plt
# Plotting charts.

# scikit-learn includes a variety of classification, regression, and clustering methods.

from sklearn.model_selection import train_test_split
# Splits the data into random training and test subsets.

from sklearn.metrics import accuracy_score
# Fraction of predicted labels that exactly match the true labels.

from sklearn.metrics import classification_report
# Per-class precision, recall, and F1-score for a classifier's predictions.

import re
# Regular expressions for matching and substituting text patterns.

import string
# String constants such as string.punctuation.

### Importing the Dataset

In [3]:
data_fake = pd.read_csv('Fake.csv')
#DataFrame is read from a comma-separated values (csv) file.

data_true = pd.read_csv('True.csv')

print("Samples of Fake Data")
data_fake.head()
#The top five rows of the dataframe

Samples of Fake Data


| | title | text | subject | date |
|---|---|---|---|---|
| 0 | Donald Trump Sends Out Embarrassing New Year’... | Donald Trump just couldn t wish all Americans ... | News | December 31, 2017 |
| 1 | Drunk Bragging Trump Staffer Started Russian ... | House Intelligence Committee Chairman Devin Nu... | News | December 31, 2017 |
| 2 | Sheriff David Clarke Becomes An Internet Joke... | On Friday, it was revealed that former Milwauk... | News | December 30, 2017 |
| 3 | Trump Is So Obsessed He Even Has Obama’s Name... | On Christmas day, Donald Trump announced that ... | News | December 29, 2017 |
| 4 | Pope Francis Just Called Out Donald Trump Dur... | Pope Francis used his annual Christmas Day mes... | News | December 25, 2017 |


In [4]:
print("Samples of True Data")
data_true.head()

Samples of True Data


| | title | text | subject | date |
|---|---|---|---|---|
| 0 | As U.S. budget fight looms, Republicans flip t... | WASHINGTON (Reuters) - The head of a conservat... | politicsNews | December 31, 2017 |
| 1 | U.S. military to accept transgender recruits o... | WASHINGTON (Reuters) - Transgender people will... | politicsNews | December 29, 2017 |
| 2 | Senior U.S. Republican senator: 'Let Mr. Muell... | WASHINGTON (Reuters) - The special counsel inv... | politicsNews | December 31, 2017 |
| 3 | FBI Russia probe helped by Australian diplomat... | WASHINGTON (Reuters) - Trump campaign adviser ... | politicsNews | December 30, 2017 |
| 4 | Trump wants Postal Service to charge 'much mor... | SEATTLE/WASHINGTON (Reuters) - President Donal... | politicsNews | December 29, 2017 |


### Assigning Classes to the Dataset

In [5]:
data_fake["class"] = 0
data_true["class"] = 1

### Checking the number of rows and columns in the Dataset
There are **23481 Fake Rows** and **21417 True Rows**. Each dataset has 5 columns once the `class` label is added.

In [6]:
print("Data_Fake shape: ", data_fake.shape)
print("Data_True shape: ", data_true.shape)

Data_Fake shape:  (23481, 5)
Data_True shape:  (21417, 5)


### Merging both the Datasets using concat

In [7]:
data_merge = pd.concat([data_fake, data_true], axis = 0)
data_merge.head(10)

| | title | text | subject | date | class |
|---|---|---|---|---|---|
| 0 | Donald Trump Sends Out Embarrassing New Year’... | Donald Trump just couldn t wish all Americans ... | News | December 31, 2017 | 0 |
| 1 | Drunk Bragging Trump Staffer Started Russian ... | House Intelligence Committee Chairman Devin Nu... | News | December 31, 2017 | 0 |
| 2 | Sheriff David Clarke Becomes An Internet Joke... | On Friday, it was revealed that former Milwauk... | News | December 30, 2017 | 0 |
| 3 | Trump Is So Obsessed He Even Has Obama’s Name... | On Christmas day, Donald Trump announced that ... | News | December 29, 2017 | 0 |
| 4 | Pope Francis Just Called Out Donald Trump Dur... | Pope Francis used his annual Christmas Day mes... | News | December 25, 2017 | 0 |
| 5 | Racist Alabama Cops Brutalize Black Boy While... | The number of cases of cops brutalizing and ki... | News | December 25, 2017 | 0 |
| 6 | Fresh Off The Golf Course, Trump Lashes Out A... | Donald Trump spent a good portion of his day a... | News | December 23, 2017 | 0 |
| 7 | Trump Said Some INSANELY Racist Stuff Inside ... | In the wake of yet another court decision that... | News | December 23, 2017 | 0 |
| 8 | Former CIA Director Slams Trump Over UN Bully... | Many people have raised the alarm regarding th... | News | December 22, 2017 | 0 |
| 9 | WATCH: Brand-New Pro-Trump Ad Features So Muc... | Just when you might have thought we d get a br... | News | December 21, 2017 | 0 |


### Drop unwanted columns

In [8]:
data = data_merge.drop(['title', 'subject', 'date'], axis = 1)

### Function for Data Cleaning

In [9]:
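def wordopt(text):
    # Lowercase the text.
    text = text.lower()

    # Remove square-bracketed spans such as "[video]".
    text = re.sub(r'\[.*?\]', '', text)

    # Replace every non-word character (punctuation, whitespace, newlines) with a space.
    text = re.sub(r'\W', ' ', text)

    # NOTE: the URL, HTML-tag, and newline patterns below look for characters
    # (':', '/', '.', '<', '>', '\n') that the \W substitution above has already
    # turned into spaces, so in this order they find nothing left to match.
    text = re.sub(r'https?://\S+|www\.\S+', '', text)

    text = re.sub(r'<.*?>+', '', text)

    # string.punctuation is a constant holding the ASCII punctuation characters;
    # re.escape backslash-escapes the ones that are special in a regex.
    # At this point only '_' can remain, because \W treats it as a word character.
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text)

    # re.sub returns the string with every match of the pattern replaced.
    text = re.sub(r'\n', '', text)

    # Remove any token that contains a digit.
    text = re.sub(r'\w*\d\w*', '', text)
    return text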
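
The order of the substitutions matters. The sketch below is a minimal illustration, not the notebook's own code: `clean_as_ordered` mirrors the sequence used in the cleaning function, while `clean_urls_first` is a hypothetical reordering that strips URLs and tags before the `\W` substitution. The sample sentence is made up for the demonstration.

```python
import re

def clean_as_ordered(text):
    # Same sequence as the cleaning function: lowercase, drop [...] spans,
    # then replace every non-word character with a space.
    text = text.lower()
    text = re.sub(r'\[.*?\]', '', text)
    text = re.sub(r'\W', ' ', text)
    # The URL pattern runs after '://' and '.' have already become spaces.
    text = re.sub(r'https?://\S+|www\.\S+', '', text)
    text = re.sub(r'\w*\d\w*', '', text)
    return text

def clean_urls_first(text):
    # Hypothetical reordering: remove URLs and tags while their
    # punctuation is still present, then apply the \W substitution.
    text = text.lower()
    text = re.sub(r'\[.*?\]', '', text)
    text = re.sub(r'https?://\S+|www\.\S+', '', text)
    text = re.sub(r'<.*?>+', '', text)
    text = re.sub(r'\W', ' ', text)
    text = re.sub(r'\w*\d\w*', '', text)
    return text

sample = "Breaking [VIDEO]: Visit https://example.com NOW!! 2017 was gr8"
tokens_as_ordered = clean_as_ordered(sample).split()
tokens_urls_first = clean_urls_first(sample).split()
print(tokens_as_ordered)
print(tokens_urls_first)
```

With the original order, the URL survives as the separate tokens `https`, `example`, and `com`. Switching to the reordered version would change the vocabulary, so the matrix shapes and scores reported in this notebook would no longer apply as-is.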

### Applying the function to column text

In [10]:
data['text'] = data['text'].apply(wordopt)
# Series.apply calls wordopt on every value in the 'text' column.

In [11]:
x = data['text']
y = data['class']

### Defining Training and Testing Data
The dataset is split into 75% training data and 25% testing data (`test_size=0.25`).

In [12]:
x_train, x_test, y_train, y_test = train_test_split(x,y, test_size= 0.25)
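
As a rough check on what this split produces, the sketch below works out the row counts and shows a toy shuffle-then-slice split in plain Python. It assumes the test count is rounded up and the remainder goes to training; `shuffled_split` and its `seed` argument are hypothetical helpers for illustration, not part of scikit-learn.

```python
import math
import random

n_fake, n_true = 23481, 21417      # row counts printed earlier in the notebook
n_samples = n_fake + n_true        # rows after merging the two datasets
test_size = 0.25

# Assumption: the test count is rounded up and the rest is used for training.
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test
print(n_train, n_test)

def shuffled_split(items, test_fraction, seed):
    """Toy split: shuffle the indices, then slice off the test portion."""
    indices = list(range(len(items)))
    random.Random(seed).shuffle(indices)   # a fixed seed makes the shuffle repeatable
    cut = math.ceil(test_fraction * len(items))
    test = [items[i] for i in indices[:cut]]
    train = [items[i] for i in indices[cut:]]
    return train, test

train, test = shuffled_split(list(range(100)), 0.25, seed=42)
print(len(train), len(test))
```

The computed counts agree with the row counts of the vectorized matrices shown further down. `train_test_split` accepts a `random_state` argument that plays the same role as `seed` here; without it, each run of the notebook produces a different split.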

### Converting Raw Data Into Matrix

In [13]:
from sklearn.feature_extraction.text import TfidfVectorizer
# The TfidfVectorizer turns a set of raw documents into a TF-IDF feature matrix.

vectorization = TfidfVectorizer()

xv_train = vectorization.fit_transform(x_train)
# fit_transform learns the vocabulary and IDF weights from the training text, then vectorizes it.

xv_test = vectorization.transform(x_test)

In [14]:
xv_train.shape

(33673, 95453)

In [15]:
xv_test.shape

(11225, 95453)
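
Both matrices have the same number of columns because the test text is transformed with the vocabulary learned from the training text. The sketch below computes TF-IDF by hand on a made-up three-document corpus to show the idea. It assumes the smoothed IDF formula `ln((1 + n) / (1 + df)) + 1` followed by L2 normalization, which is how the scikit-learn defaults are documented; the tokenization here is a plain whitespace split, which is simpler than what `TfidfVectorizer` does.

```python
import math
from collections import Counter

# Made-up training corpus for illustration only.
train_docs = [
    "trump said fake news",
    "reuters said the senate voted",
    "fake fake story",
]

# The vocabulary is learned from the training documents only.
vocab = sorted({word for doc in train_docs for word in doc.split()})
n_docs = len(train_docs)
doc_freq = Counter(word for doc in train_docs for word in set(doc.split()))

# Assumption: smoothed IDF, idf = ln((1 + n) / (1 + df)) + 1.
idf = {word: math.log((1 + n_docs) / (1 + doc_freq[word])) + 1 for word in vocab}

def tfidf_vector(doc):
    counts = Counter(doc.split())
    # Words outside the learned vocabulary contribute nothing.
    raw = [counts[word] * idf[word] for word in vocab]
    norm = math.sqrt(sum(value * value for value in raw))
    return [value / norm for value in raw] if norm else raw

train_matrix = [tfidf_vector(doc) for doc in train_docs]
test_vector = tfidf_vector("senate said unseen words")
print(len(vocab), len(test_vector))
```

Words that appear in many documents, such as `said` in this toy corpus, receive a lower IDF weight than words that appear in only one.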

### Load the models from local storage
Load the pre-saved models from local disk. If the saved models are present, running this cell loads them, so there is no need to fit the models again and you can skip directly to the evaluation and testing phase.

**NOTE:** The model files must be in the same directory as the notebook.

In [16]:
import pickle  # serializes Python objects to a byte stream and deserializes them back
import os

if os.path.isfile('./Logisitc_Regressor.pkl'):
    with open('Logisitc_Regressor.pkl', 'rb') as f:
        LR = pickle.load(f)
        print('Logistic Regressor Loaded!')

if os.path.isfile('./Decision_Tree.pkl'):
    with open('Decision_Tree.pkl', 'rb') as f:
        DT = pickle.load(f)
        print('Decision Tree Loaded!')


Logistic Regressor Loaded!
Decision Tree Loaded!


### Creating a Logistic Regression Model
The sklearn library is used to implement the logistic regression model

In [17]:
from sklearn.linear_model import LogisticRegression
# Based on a collection of independent variables,
# logistic regression estimates the probability of an event occurring.

LR = LogisticRegression()
# Logistic regression fits a linear decision boundary that separates the two classes.
LR.fit(xv_train, y_train)
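
To make the idea of a linear decision boundary concrete, here is a minimal sketch of how a fitted binary logistic regression turns a feature vector into a class. The weights, bias, and features are hypothetical numbers chosen for the example, not values taken from the trained `LR` model.

```python
import math

def sigmoid(z):
    # Maps any real-valued score to a probability between 0 and 1.
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba_true(features, weights, bias):
    # Linear score w . x + b, squashed into a probability for class 1.
    z = sum(w * x for w, x in zip(weights, features)) + bias
    return sigmoid(z)

# Hypothetical two-feature model and input.
weights = [2.0, -3.0]
bias = 0.5
features = [0.6, 0.1]

p_true = predict_proba_true(features, weights, bias)
label = 1 if p_true > 0.5 else 0   # class 1 when the linear score is positive
print(round(p_true, 3), label)
```

In the notebook the feature vector has one entry per vocabulary term, so the model learns one weight per term.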

### Checking the Model Accuracy and Classification Report
This is for the Logistic Regressor Model

In [18]:
pred_1r = LR.predict(xv_test)
print("Accuracy of the Logistic Regression Model: ", LR.score(xv_test, y_test))
print("\nClassification Report:\n", classification_report(y_test, pred_1r))

Accuracy of the Logistic Regression Model:  0.986369710467706

Classification Report:
               precision    recall  f1-score   support

           0       0.99      0.99      0.99      5833
           1       0.98      0.99      0.99      5392

    accuracy                           0.99     11225
   macro avg       0.99      0.99      0.99     11225
weighted avg       0.99      0.99      0.99     11225
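
The columns of the report can be reproduced from simple counts. The sketch below computes precision, recall, F1-score, and support for one class, plus overall accuracy, on a small made-up set of labels; `binary_report` is a hypothetical helper written for this illustration.

```python
def binary_report(y_true, y_pred, positive):
    """Precision, recall, F1 and support for one class treated as positive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "support": tp + fn}

# Made-up labels: 0 = fake, 1 = true, as in the notebook.
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1, 1, 1]

accuracy = sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)
report_fake = binary_report(y_true, y_pred, positive=0)
report_true = binary_report(y_true, y_pred, positive=1)
print(accuracy, report_fake, report_true)
```

Support is the number of test rows that truly belong to each class, which is why the two support values in the report add up to the size of the test set.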



### Creating a Decision Tree Model
We have used the sklearn library for the Decision Tree Classifier as well

In [19]:
from sklearn.tree import DecisionTreeClassifier
DT = DecisionTreeClassifier(criterion="entropy")
# DecisionTreeClassifier supports multi-class classification. If several classes share the
# same, highest probability, the classifier predicts the class with the lowest index among them.

DT.fit(xv_train, y_train)
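
With `criterion="entropy"`, candidate splits are scored by how much they reduce label entropy. The sketch below is a simplified illustration of that measure on made-up labels, not scikit-learn's implementation; `entropy` and `information_gain` are hypothetical helper names.

```python
import math
from collections import Counter

def entropy(labels):
    # Shannon entropy in bits of the class distribution in `labels`.
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(parent, left, right):
    # Entropy of the parent minus the size-weighted entropy of the two children.
    n = len(parent)
    weighted = len(left) / n * entropy(left) + len(right) / n * entropy(right)
    return entropy(parent) - weighted

parent = [0, 0, 0, 1, 1, 1]
perfect_gain = information_gain(parent, [0, 0, 0], [1, 1, 1])   # children are pure
poor_gain = information_gain(parent, [0, 0, 1], [0, 1, 1])      # children stay mixed
print(perfect_gain, poor_gain)
```

A split that separates the classes completely recovers all of the parent's entropy, while a split that leaves both children mixed gains very little.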

In [20]:
pred_dt = DT.predict(xv_test)
print("Accuracy of the Decision Tree Model: ", DT.score(xv_test, y_test))
print("\nClassification Report:\n", classification_report(y_test, pred_dt))

Accuracy of the Decision Tree Model:  0.9959910913140312

Classification Report:
               precision    recall  f1-score   support

           0       1.00      1.00      1.00      5833
           1       1.00      1.00      1.00      5392

    accuracy                           1.00     11225
   macro avg       1.00      1.00      1.00     11225
weighted avg       1.00      1.00      1.00     11225



### Save the models in local disk
Run this cell to save the models in your local storage

In [21]:
import pickle

with open('Decision_Tree.pkl', 'wb') as f:
    pickle.dump(DT, f)

with open('Logisitc_Regressor.pkl', 'wb') as f:
    pickle.dump(LR, f)
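
The save and load steps can be wrapped in small helpers. The sketch below is one possible arrangement, using a plain dictionary as a stand-in for a fitted model and a temporary directory instead of the notebook folder; `save_model`, `load_if_present`, and the file name `demo_model.pkl` are hypothetical.

```python
import os
import pickle
import tempfile

def save_model(model, path):
    # Serialize the object to a binary file.
    with open(path, 'wb') as f:
        pickle.dump(model, f)

def load_if_present(path):
    """Return the unpickled object, or None when the file does not exist."""
    if not os.path.isfile(path):
        return None
    with open(path, 'rb') as f:
        return pickle.load(f)

# A plain dict stands in for a fitted model in this illustration.
stand_in_model = {"name": "demo", "weights": [0.1, 0.2]}

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "demo_model.pkl")
    missing = load_if_present(model_path)      # nothing saved yet
    save_model(stand_in_model, model_path)
    restored = load_if_present(model_path)     # equal copy of the original

print(missing, restored)
```

Only unpickle files you trust, since loading a pickle can execute arbitrary code. Note also that the notebook refits the vectorizer on a fresh random split each run, so a model loaded from disk may expect a different vocabulary than the current `vectorization` object; pickling the fitted vectorizer alongside the models is one way to keep them aligned.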

### Creating Functions for Inference

In [22]:
def output_lable(n):
    # Map the numeric class back to a readable label.
    if n == 0:
        return "Fake News"
    elif n == 1:
        return "Not A Fake News"

def manual_testing(news):
    # Clean the input with the same function used on the training text.
    testing_news = {"text": [news]}
    new_def_test = pd.DataFrame(testing_news)
    new_def_test["text"] = new_def_test["text"].apply(wordopt)
    new_x_test = new_def_test["text"]
    # Reuse the fitted vectorizer so the features match what the models were trained on.
    new_xv_test = vectorization.transform(new_x_test)
    pred_LR = LR.predict(new_xv_test)
    pred_DT = DT.predict(new_xv_test)

    print("\n\nLR Prediction: {} \nDT Prediction: {}".format(output_lable(pred_LR[0]), output_lable(pred_DT[0])))

### Run this cell to test your input

In [23]:
news = input()  # input() already returns a string
manual_testing(news)



LR Prediction: Fake News 
DT Prediction: Fake News
