<a href="https://colab.research.google.com/github/SubhashiniDB/Fake_news_detection/blob/main/Fake_news_detection.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**IMPORTING NECESSARY LIBRARIES**

In [None]:
import numpy as np
import pandas as pd
import itertools
from sklearn.model_selection import train_test_split
from sklearn.linear_model import PassiveAggressiveClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
import string as st
import re
import nltk
from nltk import PorterStemmer, WordNetLemmatizer
import matplotlib.pyplot as plt
import os

In [None]:
nltk.download('stopwords')
nltk.download('wordnet')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

**IMPORTING DATASET**

In [None]:
data = pd.read_csv('fake_or_real_news.csv')
data.shape

(6335, 4)

In [None]:
data.head()

Unnamed: 0.1,Unnamed: 0,title,text,label
0,8476,You Can Smell Hillary’s Fear,"Daniel Greenfield, a Shillman Journalism Fello...",FAKE
1,10294,Watch The Exact Moment Paul Ryan Committed Pol...,Google Pinterest Digg Linkedin Reddit Stumbleu...,FAKE
2,3608,Kerry to go to Paris in gesture of sympathy,U.S. Secretary of State John F. Kerry said Mon...,REAL
3,10142,Bernie supporters on Twitter erupt in anger ag...,"— Kaydee King (@KaydeeKing) November 9, 2016 T...",FAKE
4,875,The Battle of New York: Why This Primary Matters,It's primary day in New York and front-runners...,REAL


In [None]:
# Distribution of label
print(np.unique(data['label']))
# np.unique sorts the counts, so the class sizes print in ascending order, not in label order
print(np.unique(data['label'].value_counts()))

['FAKE' 'REAL']
[3164 3171]
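
Because the counts above are sorted, the printout does not say which class has which size. A minimal sketch of a count that keeps each label attached to its size, using a short made-up list in place of data['label'] (on the real column, data['label'].value_counts() on its own gives the same pairing):

```python
from collections import Counter

# Hypothetical stand-in for data['label']; the real column has 6335 entries.
labels = ["FAKE", "REAL", "FAKE", "REAL", "REAL"]

# Counter maps each distinct label to the number of times it occurs.
counts = Counter(labels)
for label, count in sorted(counts.items()):
    print(label, count)
```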


**DATA CLEANING AND PRE-PROCESSING**

*   Removing punctuation
*   Converting text to lower case
*   Tokenization
*   Removing tokens of length 3 or less
*   Remove stopwords
*   Lemmatization
*   Converting label into binary

**REMOVING PUNCTUATION**

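Punctuation gives a sentence grammatical context that helps a human reader. Our vectorizer, however, counts words and ignores context, so punctuation adds no value and we remove it. Note that string.punctuation covers only ASCII characters, so non-ASCII symbols (for example, the leading dash in row 3 of the output) are left in place at this step.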

In [None]:
# Remove punctuation
def remove_punct(text):
    return ("".join([ch for ch in text if ch not in st.punctuation]))
data['removed_punc'] = data['text'].apply(lambda x: remove_punct(x))
data.head()

Unnamed: 0.1,Unnamed: 0,title,text,label,removed_punc
0,8476,You Can Smell Hillary’s Fear,"Daniel Greenfield, a Shillman Journalism Fello...",FAKE,Daniel Greenfield a Shillman Journalism Fellow...
1,10294,Watch The Exact Moment Paul Ryan Committed Pol...,Google Pinterest Digg Linkedin Reddit Stumbleu...,FAKE,Google Pinterest Digg Linkedin Reddit Stumbleu...
2,3608,Kerry to go to Paris in gesture of sympathy,U.S. Secretary of State John F. Kerry said Mon...,REAL,US Secretary of State John F Kerry said Monday...
3,10142,Bernie supporters on Twitter erupt in anger ag...,"— Kaydee King (@KaydeeKing) November 9, 2016 T...",FAKE,— Kaydee King KaydeeKing November 9 2016 The l...
4,875,The Battle of New York: Why This Primary Matters,It's primary day in New York and front-runners...,REAL,Its primary day in New York and frontrunners H...
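
The character-by-character filter above can also be written with str.translate, which does the deletion in a single pass. This is a sketch, not the notebook's code: the name remove_punct_fast and the sample sentence are made up for illustration.

```python
import string

# Translation table that deletes every ASCII punctuation character.
PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def remove_punct_fast(text):
    """Drop ASCII punctuation; non-ASCII symbols are left untouched."""
    return text.translate(PUNCT_TABLE)

sample = "U.S. front-runners — it's primary day!"
cleaned = remove_punct_fast(sample)
print(cleaned)
```

Only the characters in string.punctuation are removed, so the long dash in the sample is kept, just as it is in row 3 of the table above.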


**TOKENIZATION & CONVERTING TEXT TO LOWERCASE**

We convert all text to lowercase and tokenize it, that is, split it into units such as sentences or words (here, whitespace-separated words). Tokenization gives structure to previously unstructured text.

In [None]:
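# Tokenization: split on whitespace, then convert each token to lowercase
def tokenize(text):
    text = re.split(r'\s+', text)  # raw string avoids an invalid-escape warning
    return [x.lower() for x in text]

# Apply tokenization to the punctuation-free text
data['tokens'] = data['removed_punc'].apply(lambda msg : tokenize(msg))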
data.head()

Unnamed: 0.1,Unnamed: 0,title,text,label,removed_punc,tokens
0,8476,You Can Smell Hillary’s Fear,"Daniel Greenfield, a Shillman Journalism Fello...",FAKE,Daniel Greenfield a Shillman Journalism Fellow...,"[daniel, greenfield, a, shillman, journalism, ..."
1,10294,Watch The Exact Moment Paul Ryan Committed Pol...,Google Pinterest Digg Linkedin Reddit Stumbleu...,FAKE,Google Pinterest Digg Linkedin Reddit Stumbleu...,"[google, pinterest, digg, linkedin, reddit, st..."
2,3608,Kerry to go to Paris in gesture of sympathy,U.S. Secretary of State John F. Kerry said Mon...,REAL,US Secretary of State John F Kerry said Monday...,"[us, secretary, of, state, john, f, kerry, sai..."
3,10142,Bernie supporters on Twitter erupt in anger ag...,"— Kaydee King (@KaydeeKing) November 9, 2016 T...",FAKE,— Kaydee King KaydeeKing November 9 2016 The l...,"[—, kaydee, king, kaydeeking, november, 9, 201..."
4,875,The Battle of New York: Why This Primary Matters,It's primary day in New York and front-runners...,REAL,Its primary day in New York and frontrunners H...,"[its, primary, day, in, new, york, and, frontr..."
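
One detail of splitting with a regular expression is that leading or trailing whitespace produces empty-string tokens. The sketch below compares that approach with str.split(); both function names and the sample string are illustrative.

```python
import re

def tokenize_regex(text):
    # Mirrors the notebook's approach: split on whitespace runs, then lowercase.
    return [tok.lower() for tok in re.split(r"\s+", text)]

def tokenize_split(text):
    # str.split() with no argument also splits on whitespace runs,
    # but never returns empty strings for leading/trailing whitespace.
    return text.lower().split()

sample = "  Kerry to go  to Paris\n"
regex_tokens = tokenize_regex(sample)
split_tokens = tokenize_split(sample)
print(regex_tokens)
print(split_tokens)
```

In this notebook the empty strings do no harm, because the length filter in the next step discards them.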


**REMOVING TOKENS OF LENGTH 3 OR LESS**

Tokens of 3 characters or fewer (such as "of", "its" or "and") don't provide any extra information for detecting fake news, so we keep only tokens with more than 3 characters.

In [None]:
# Remove tokens of length 3 or less
def remove_small_words(text):
    return [x for x in text if len(x) > 3 ]

data['filtered_tokens'] = data['tokens'].apply(lambda x : remove_small_words(x))
data.head()

Unnamed: 0.1,Unnamed: 0,title,text,label,removed_punc,tokens,filtered_tokens
0,8476,You Can Smell Hillary’s Fear,"Daniel Greenfield, a Shillman Journalism Fello...",FAKE,Daniel Greenfield a Shillman Journalism Fellow...,"[daniel, greenfield, a, shillman, journalism, ...","[daniel, greenfield, shillman, journalism, fel..."
1,10294,Watch The Exact Moment Paul Ryan Committed Pol...,Google Pinterest Digg Linkedin Reddit Stumbleu...,FAKE,Google Pinterest Digg Linkedin Reddit Stumbleu...,"[google, pinterest, digg, linkedin, reddit, st...","[google, pinterest, digg, linkedin, reddit, st..."
2,3608,Kerry to go to Paris in gesture of sympathy,U.S. Secretary of State John F. Kerry said Mon...,REAL,US Secretary of State John F Kerry said Monday...,"[us, secretary, of, state, john, f, kerry, sai...","[secretary, state, john, kerry, said, monday, ..."
3,10142,Bernie supporters on Twitter erupt in anger ag...,"— Kaydee King (@KaydeeKing) November 9, 2016 T...",FAKE,— Kaydee King KaydeeKing November 9 2016 The l...,"[—, kaydee, king, kaydeeking, november, 9, 201...","[kaydee, king, kaydeeking, november, 2016, les..."
4,875,The Battle of New York: Why This Primary Matters,It's primary day in New York and front-runners...,REAL,Its primary day in New York and frontrunners H...,"[its, primary, day, in, new, york, and, frontr...","[primary, york, frontrunners, hillary, clinton..."
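
The exact comparison operator decides whether three-letter tokens survive. The sketch below applies both conditions to a made-up token list: len(x) > 3 is the condition used in the cell above, and len(x) >= 3 is shown only for contrast.

```python
tokens = ["us", "secretary", "of", "state", "john", "f", "kerry", "new", "day"]

# Condition used in the notebook: keeps tokens with 4 or more characters.
kept_gt3 = [t for t in tokens if len(t) > 3]

# Alternative condition: keeps tokens with 3 or more characters.
kept_ge3 = [t for t in tokens if len(t) >= 3]

print(kept_gt3)
print(kept_ge3)
```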


**REMOVING STOPWORDS**

Stopwords are common words that are likely to appear in any text. They don't tell us much about our data, so we remove them.

In [None]:
# Remove stopwords
# Build the stopword set once; reloading the list for every token is very slow.
# The stopwords and wordnet corpora were already downloaded after the imports.
stop_words = set(nltk.corpus.stopwords.words('english'))
def remove_stopwords(text):
    return [word for word in text if word not in stop_words]
data['clean_tokens'] = data['filtered_tokens'].apply(lambda x : remove_stopwords(x))
data.head()

Unnamed: 0.1,Unnamed: 0,title,text,label,removed_punc,tokens,filtered_tokens,clean_tokens
0,8476,You Can Smell Hillary’s Fear,"Daniel Greenfield, a Shillman Journalism Fello...",FAKE,Daniel Greenfield a Shillman Journalism Fellow...,"[daniel, greenfield, a, shillman, journalism, ...","[daniel, greenfield, shillman, journalism, fel...","[daniel, greenfield, shillman, journalism, fel..."
1,10294,Watch The Exact Moment Paul Ryan Committed Pol...,Google Pinterest Digg Linkedin Reddit Stumbleu...,FAKE,Google Pinterest Digg Linkedin Reddit Stumbleu...,"[google, pinterest, digg, linkedin, reddit, st...","[google, pinterest, digg, linkedin, reddit, st...","[google, pinterest, digg, linkedin, reddit, st..."
2,3608,Kerry to go to Paris in gesture of sympathy,U.S. Secretary of State John F. Kerry said Mon...,REAL,US Secretary of State John F Kerry said Monday...,"[us, secretary, of, state, john, f, kerry, sai...","[secretary, state, john, kerry, said, monday, ...","[secretary, state, john, kerry, said, monday, ..."
3,10142,Bernie supporters on Twitter erupt in anger ag...,"— Kaydee King (@KaydeeKing) November 9, 2016 T...",FAKE,— Kaydee King KaydeeKing November 9 2016 The l...,"[—, kaydee, king, kaydeeking, november, 9, 201...","[kaydee, king, kaydeeking, november, 2016, les...","[kaydee, king, kaydeeking, november, 2016, les..."
4,875,The Battle of New York: Why This Primary Matters,It's primary day in New York and front-runners...,REAL,Its primary day in New York and frontrunners H...,"[its, primary, day, in, new, york, and, frontr...","[primary, york, frontrunners, hillary, clinton...","[primary, york, frontrunners, hillary, clinton..."


**LEMMATIZATION**

Lemmatization reduces each word to its dictionary form (lemma) using morphological analysis. This requires detailed dictionaries, such as WordNet, which the algorithm looks through to link each word form back to its lemma.

In [None]:
# Apply lemmatization on tokens
def lemmatize(text):
    word_net = WordNetLemmatizer()
    return [word_net.lemmatize(word) for word in text]
data['lemma_words'] = data['clean_tokens'].apply(lambda x : lemmatize(x))
data.head()

Unnamed: 0.1,Unnamed: 0,title,text,label,removed_punc,tokens,filtered_tokens,clean_tokens,lemma_words
0,8476,You Can Smell Hillary’s Fear,"Daniel Greenfield, a Shillman Journalism Fello...",FAKE,Daniel Greenfield a Shillman Journalism Fellow...,"[daniel, greenfield, a, shillman, journalism, ...","[daniel, greenfield, shillman, journalism, fel...","[daniel, greenfield, shillman, journalism, fel...","[daniel, greenfield, shillman, journalism, fel..."
1,10294,Watch The Exact Moment Paul Ryan Committed Pol...,Google Pinterest Digg Linkedin Reddit Stumbleu...,FAKE,Google Pinterest Digg Linkedin Reddit Stumbleu...,"[google, pinterest, digg, linkedin, reddit, st...","[google, pinterest, digg, linkedin, reddit, st...","[google, pinterest, digg, linkedin, reddit, st...","[google, pinterest, digg, linkedin, reddit, st..."
2,3608,Kerry to go to Paris in gesture of sympathy,U.S. Secretary of State John F. Kerry said Mon...,REAL,US Secretary of State John F Kerry said Monday...,"[us, secretary, of, state, john, f, kerry, sai...","[secretary, state, john, kerry, said, monday, ...","[secretary, state, john, kerry, said, monday, ...","[secretary, state, john, kerry, said, monday, ..."
3,10142,Bernie supporters on Twitter erupt in anger ag...,"— Kaydee King (@KaydeeKing) November 9, 2016 T...",FAKE,— Kaydee King KaydeeKing November 9 2016 The l...,"[—, kaydee, king, kaydeeking, november, 9, 201...","[kaydee, king, kaydeeking, november, 2016, les...","[kaydee, king, kaydeeking, november, 2016, les...","[kaydee, king, kaydeeking, november, 2016, les..."
4,875,The Battle of New York: Why This Primary Matters,It's primary day in New York and front-runners...,REAL,Its primary day in New York and frontrunners H...,"[its, primary, day, in, new, york, and, frontr...","[primary, york, frontrunners, hillary, clinton...","[primary, york, frontrunners, hillary, clinton...","[primary, york, frontrunners, hillary, clinton..."
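
The dictionary-lookup idea behind lemmatization can be sketched with a toy table. This is not how WordNetLemmatizer is implemented: the three-entry dictionary and the function name are invented for illustration, and the real lemmatizer also uses suffix rules and part-of-speech information.

```python
# Toy lemma dictionary, invented for illustration only.
TOY_LEMMAS = {"wolves": "wolf", "children": "child", "stories": "story"}

def toy_lemmatize(tokens, lemmas=TOY_LEMMAS):
    # Words missing from the dictionary are returned unchanged.
    return [lemmas.get(tok, tok) for tok in tokens]

lemmatized = toy_lemmatize(["children", "told", "stories", "wolves"])
print(lemmatized)
```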


In [None]:
# Create sentences to get clean text as input for vectors
def return_sentences(tokens):
    return " ".join(tokens)
data['clean_text'] = data['lemma_words'].apply(lambda x : return_sentences(x))
data.head()

Unnamed: 0.1,Unnamed: 0,title,text,label,removed_punc,tokens,filtered_tokens,clean_tokens,lemma_words,clean_text
0,8476,You Can Smell Hillary’s Fear,"Daniel Greenfield, a Shillman Journalism Fello...",1,Daniel Greenfield a Shillman Journalism Fellow...,"[daniel, greenfield, a, shillman, journalism, ...","[daniel, greenfield, shillman, journalism, fel...","[daniel, greenfield, shillman, journalism, fel...","[daniel, greenfield, shillman, journalism, fel...",daniel greenfield shillman journalism fellow f...
1,10294,Watch The Exact Moment Paul Ryan Committed Pol...,Google Pinterest Digg Linkedin Reddit Stumbleu...,1,Google Pinterest Digg Linkedin Reddit Stumbleu...,"[google, pinterest, digg, linkedin, reddit, st...","[google, pinterest, digg, linkedin, reddit, st...","[google, pinterest, digg, linkedin, reddit, st...","[google, pinterest, digg, linkedin, reddit, st...",google pinterest digg linkedin reddit stumbleu...
2,3608,Kerry to go to Paris in gesture of sympathy,U.S. Secretary of State John F. Kerry said Mon...,0,US Secretary of State John F Kerry said Monday...,"[us, secretary, of, state, john, f, kerry, sai...","[secretary, state, john, kerry, said, monday, ...","[secretary, state, john, kerry, said, monday, ...","[secretary, state, john, kerry, said, monday, ...",secretary state john kerry said monday stop pa...
3,10142,Bernie supporters on Twitter erupt in anger ag...,"— Kaydee King (@KaydeeKing) November 9, 2016 T...",1,— Kaydee King KaydeeKing November 9 2016 The l...,"[—, kaydee, king, kaydeeking, november, 9, 201...","[kaydee, king, kaydeeking, november, 2016, les...","[kaydee, king, kaydeeking, november, 2016, les...","[kaydee, king, kaydeeking, november, 2016, les...",kaydee king kaydeeking november 2016 lesson to...
4,875,The Battle of New York: Why This Primary Matters,It's primary day in New York and front-runners...,0,Its primary day in New York and frontrunners H...,"[its, primary, day, in, new, york, and, frontr...","[primary, york, frontrunners, hillary, clinton...","[primary, york, frontrunners, hillary, clinton...","[primary, york, frontrunners, hillary, clinton...",primary york frontrunners hillary clinton dona...


**CONVERTING LABEL INTO BINARY**

In [None]:
# Binary value - 1 for fake, 0 for real

data['label'] = [1 if x == 'FAKE' else 0 for x in data['label']]
data.head()

Unnamed: 0.1,Unnamed: 0,title,text,label,removed_punc,tokens,filtered_tokens,clean_tokens,lemma_words
0,8476,You Can Smell Hillary’s Fear,"Daniel Greenfield, a Shillman Journalism Fello...",1,Daniel Greenfield a Shillman Journalism Fellow...,"[daniel, greenfield, a, shillman, journalism, ...","[daniel, greenfield, shillman, journalism, fel...","[daniel, greenfield, shillman, journalism, fel...","[daniel, greenfield, shillman, journalism, fel..."
1,10294,Watch The Exact Moment Paul Ryan Committed Pol...,Google Pinterest Digg Linkedin Reddit Stumbleu...,1,Google Pinterest Digg Linkedin Reddit Stumbleu...,"[google, pinterest, digg, linkedin, reddit, st...","[google, pinterest, digg, linkedin, reddit, st...","[google, pinterest, digg, linkedin, reddit, st...","[google, pinterest, digg, linkedin, reddit, st..."
2,3608,Kerry to go to Paris in gesture of sympathy,U.S. Secretary of State John F. Kerry said Mon...,0,US Secretary of State John F Kerry said Monday...,"[us, secretary, of, state, john, f, kerry, sai...","[secretary, state, john, kerry, said, monday, ...","[secretary, state, john, kerry, said, monday, ...","[secretary, state, john, kerry, said, monday, ..."
3,10142,Bernie supporters on Twitter erupt in anger ag...,"— Kaydee King (@KaydeeKing) November 9, 2016 T...",1,— Kaydee King KaydeeKing November 9 2016 The l...,"[—, kaydee, king, kaydeeking, november, 9, 201...","[kaydee, king, kaydeeking, november, 2016, les...","[kaydee, king, kaydeeking, november, 2016, les...","[kaydee, king, kaydeeking, november, 2016, les..."
4,875,The Battle of New York: Why This Primary Matters,It's primary day in New York and front-runners...,0,Its primary day in New York and frontrunners H...,"[its, primary, day, in, new, york, and, frontr...","[primary, york, frontrunners, hillary, clinton...","[primary, york, frontrunners, hillary, clinton...","[primary, york, frontrunners, hillary, clinton..."


**SPLITTING THE DATASET**

In [None]:
# Split the dataset into train set & test set
X_train,X_test,y_train,y_test = train_test_split(data['clean_text'], data['label'], test_size=0.2, random_state = 5)

print(X_train.shape)
print(X_test.shape)

(5068,)
(1267,)


**TF-IDF VECTORIZATION**

Term frequency - inverse document frequency: term frequency is the number of times a term occurs in a document. Inverse document frequency is a decreasing function of the number of documents in which the term occurs, so terms that appear in many documents receive a lower weight.
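
The weighting can be sketched in plain Python. The example assumes the smoothed IDF that scikit-learn documents for its default settings, ln((1 + n) / (1 + df)) + 1, followed by L2 normalization of each document vector. The three short documents are made up.

```python
import math
from collections import Counter

docs = [
    "kerry paris visit",
    "clinton trump primary",
    "kerry clinton debate",
]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: number of documents that contain each term.
df = Counter(term for doc in tokenized for term in set(doc))

def idf(term):
    # Smoothed inverse document frequency (assumed default variant).
    return math.log((1 + n_docs) / (1 + df[term])) + 1

def tfidf_vector(doc):
    # Raw term counts times IDF, then scaled to unit Euclidean length.
    tf = Counter(doc)
    weights = {term: count * idf(term) for term, count in tf.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}

vec = tfidf_vector(tokenized[0])
print(vec)
```

In the first document, "kerry" also occurs in another document, so it ends up with a lower weight than "paris" and "visit", which occur only once in the collection.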

In [None]:
# Feature extraction
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer()
tfidf_train = tfidf.fit_transform(X_train)
tfidf_test = tfidf.transform(X_test)

print(tfidf_train.toarray())
print(tfidf_train.shape)
print(tfidf_test.toarray())
print(tfidf_test.shape)

[[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 ...
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]
(5068, 68134)
[[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 ...
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]
(1267, 68134)


**MODELS USED**

*   Logistic regression
*   XGBoost classifier
*   Passive aggressive classifier

**LOGISTIC REGRESSION**

Logistic regression is a classification algorithm that predicts a discrete outcome from a given set of independent variables. It estimates the probability of an event by fitting the data to a logistic (sigmoid) function.
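
A minimal sketch of the prediction step: a weighted sum of the features is passed through the sigmoid to obtain a probability, which is then thresholded at 0.5. The weights, bias, and feature values are hypothetical; in the notebook they are learned by LogisticRegression from the TF-IDF matrix.

```python
import math

def sigmoid(z):
    # Maps any real number to the interval (0, 1).
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(weights, bias, features):
    # Linear score followed by the logistic function.
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)

weights = [2.0, -1.0]   # hypothetical learned coefficients
bias = 0.0              # hypothetical intercept
p = predict_proba(weights, bias, [1.0, 0.5])   # linear score z = 1.5
label = 1 if p >= 0.5 else 0
print(p, label)
```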

In [None]:
from sklearn.linear_model import LogisticRegression

lr = LogisticRegression(max_iter = 500)
lr.fit(tfidf_train, y_train)
print('Logistic Regression model fitted..')

Logistic Regression model fitted..


In [None]:
from sklearn import metrics
test_pred = lr.predict(tfidf_test)
print(metrics.classification_report(y_test, test_pred))
print('Accuracy  :',metrics.accuracy_score(y_test,test_pred))
print('Precision :',metrics.precision_score(y_test,test_pred))
print('Recall    :',metrics.recall_score(y_test,test_pred))
print('F1-score  :',metrics.f1_score(y_test,test_pred))
print("Confusion matrix : \n {}".format(confusion_matrix(y_test,test_pred)))

              precision    recall  f1-score   support

           0       0.94      0.90      0.92       630
           1       0.90      0.94      0.92       637

    accuracy                           0.92      1267
   macro avg       0.92      0.92      0.92      1267
weighted avg       0.92      0.92      0.92      1267

Accuracy  : 0.9179163378058406
Precision : 0.9019607843137255
Recall    : 0.9387755102040817
F1-score  : 0.92
Confusion matrix : 
 [[565  65]
 [ 39 598]]
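
The scalar metrics above follow directly from the confusion matrix, whose rows are true labels and whose columns are predicted labels. The sketch below recomputes them from the four printed cells, treating FAKE (1) as the positive class.

```python
# Cells of the confusion matrix printed above.
tn, fp = 565, 65    # true REAL (0): predicted REAL, predicted FAKE
fn, tp = 39, 598    # true FAKE (1): predicted REAL, predicted FAKE

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)    # share of predicted FAKE that is truly FAKE
recall = tp / (tp + fn)       # share of true FAKE that was found
f1 = 2 * precision * recall / (precision + recall)

print(accuracy, precision, recall, f1)
```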


**XGBOOST CLASSIFIER**

XGBoost builds decision trees sequentially (gradient boosting). Each new tree is fitted to the errors of the trees built so far, more precisely to the gradient of the loss with respect to the current predictions, so the examples that the ensemble currently predicts poorly drive the next tree. The outputs of all trees are then summed to give a stronger and more precise model than any single tree.
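
The sequential idea can be sketched under strong simplifications: squared-error loss, a single feature, and depth-1 trees (stumps). The function names and toy data are invented for illustration. XGBClassifier itself uses a logistic loss, second-order gradient information, and regularization, none of which appears here.

```python
def fit_stump(xs, residuals):
    """Find the threshold split that best fits the residuals (squared error)."""
    best = None
    for t in sorted(set(xs)):
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        if not left or not right:
            continue  # a split must put points on both sides
        lv = sum(left) / len(left)
        rv = sum(right) / len(right)
        err = sum((r - lv) ** 2 for r in left) + sum((r - rv) ** 2 for r in right)
        if best is None or err < best[0]:
            best = (err, t, lv, rv)
    return best[1:]

def boost(xs, ys, n_rounds=20, learning_rate=0.5):
    """Add stumps one at a time, each fitted to the current residuals."""
    base = sum(ys) / len(ys)
    preds = [base] * len(xs)
    stumps = []
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, preds)]
        t, lv, rv = fit_stump(xs, residuals)
        stumps.append((t, lv, rv))
        preds = [p + learning_rate * (lv if x <= t else rv)
                 for x, p in zip(xs, preds)]
    return base, stumps, preds

xs = [1, 2, 3, 4, 5, 6]
ys = [0, 0, 0, 1, 1, 1]
base, stumps, preds = boost(xs, ys)

mse_start = sum((y - base) ** 2 for y in ys) / len(ys)
mse_end = sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)
print(mse_start, mse_end)
```

Each round shrinks the remaining residuals, so the training error after boosting is far below the error of the constant starting prediction.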

In [None]:
import xgboost
from xgboost import XGBClassifier

xgb = XGBClassifier()
xgb.fit(tfidf_train, y_train)

print('XGBoost Classifier model fitted..')

XGBoost Classifier model fitted..


In [None]:
from sklearn import metrics
test_pred = xgb.predict(tfidf_test)
print(metrics.classification_report(y_test, test_pred))
print('Accuracy  :',metrics.accuracy_score(y_test,test_pred))
print('Precision :',metrics.precision_score(y_test,test_pred))
print('Recall    :',metrics.recall_score(y_test,test_pred))
print('F1-score  :',metrics.f1_score(y_test,test_pred))
print("Confusion matrix : \n {}".format(confusion_matrix(y_test, test_pred)))

              precision    recall  f1-score   support

           0       0.92      0.86      0.89       630
           1       0.87      0.93      0.90       637

    accuracy                           0.90      1267
   macro avg       0.90      0.90      0.90      1267
weighted avg       0.90      0.90      0.90      1267

Accuracy  : 0.8958168902920284
Precision : 0.8729689807976366
Recall    : 0.9277864992150706
F1-score  : 0.8995433789954338
Confusion matrix : 
 [[544  86]
 [ 46 591]]


**PASSIVE AGGRESSIVE CLASSIFIER**

The Passive-Aggressive classifier is an online learning algorithm. In online learning, the input data arrives sequentially and the model is updated one example at a time, as opposed to batch learning, where the entire training dataset is used at once. The name describes the update: the model stays passive when an example is classified correctly with a sufficient margin, and makes an aggressive correction when it is not.
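
A single online update can be sketched as follows, assuming the PA-I variant with hinge loss and labels encoded as -1 and +1 (the notebook's 0/1 labels would need mapping first). The function name, the aggressiveness parameter C, and the toy examples are illustrative and not taken from scikit-learn's source.

```python
def pa_update(w, x, y, C=1.0):
    """One Passive-Aggressive (PA-I) step for a label y in {-1, +1}."""
    margin = y * sum(wi * xi for wi, xi in zip(w, x))
    loss = max(0.0, 1.0 - margin)           # hinge loss
    norm_sq = sum(xi * xi for xi in x)
    if loss == 0.0 or norm_sq == 0.0:
        return w                            # passive: no change needed
    tau = min(C, loss / norm_sq)            # aggressive: capped step size
    return [wi + tau * y * xi for wi, xi in zip(w, x)]

w0 = [0.0, 0.0]
w1 = pa_update(w0, [1.0, 0.0], +1)   # misclassified margin: weights move
w2 = pa_update(w1, [1.0, 0.0], +1)   # margin already 1: weights unchanged
w3 = pa_update(w2, [0.0, 2.0], -1)   # new violation: weights move again
print(w1, w2, w3)
```

The step size is just large enough to fix the margin on the current example, unless the cap C limits it.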

In [None]:
pac = PassiveAggressiveClassifier(max_iter=50)
pac.fit(tfidf_train,y_train)

PassiveAggressiveClassifier(max_iter=50)

In [None]:
from sklearn import metrics
test_pred =  pac.predict(tfidf_test)
print(metrics.classification_report(y_test, test_pred))
print('Accuracy  :',metrics.accuracy_score(y_test,test_pred))
print('Precision :',metrics.precision_score(y_test,test_pred))
print('Recall    :',metrics.recall_score(y_test,test_pred))
print('F1-score  :',metrics.f1_score(y_test,test_pred))
print("Confusion matrix : \n {}".format(confusion_matrix(y_test, test_pred)))

              precision    recall  f1-score   support

           0       0.94      0.94      0.94       630
           1       0.94      0.94      0.94       637

    accuracy                           0.94      1267
   macro avg       0.94      0.94      0.94      1267
weighted avg       0.94      0.94      0.94      1267

Accuracy  : 0.9376479873717443
Precision : 0.9400630914826499
Recall    : 0.9356357927786499
F1-score  : 0.9378442171518491
Confusion matrix : 
 [[592  38]
 [ 41 596]]


**CONCLUSION**

We evaluated each model on the held-out test set using accuracy, precision, recall, and F1-score. The best model was the Passive Aggressive Classifier with an accuracy of 93.8%, ahead of Logistic Regression (91.8%) and XGBoost (89.6%).