# Context 
How will you detect fake news?

**Fake news**: A type of yellow journalism, fake news consists of news items that may be hoaxes and that spread mainly through social media and other online media. It is often published to promote or impose certain ideas, frequently in service of a political agenda. Such items may contain false or exaggerated claims, may be amplified by recommendation algorithms, and can leave users trapped in a filter bubble.

This project detects fake news by learning to distinguish real news from fake news.

## Mission 
Build a model to accurately classify a piece of news as REAL or FAKE.

## Dataset
The first column identifies the news, the second and third are the title and text, and the fourth column has labels denoting whether the news is REAL or FAKE.

The dataset can be downloaded [here](https://drive.google.com/file/d/1er9NJTLUA3qnRuyhfzuN0XUsoIC4a-_q/view).

# Setting environment 

In [1]:
#import environment 
import pandas as pd
import numpy as np
''' Data visualisation'''
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
''' Scikit-Learn'''
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, RobustScaler
from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import cross_validate
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RandomizedSearchCV
from sklearn import set_config

set_config(display='diagram')
from sklearn.compose import ColumnTransformer
from sklearn.compose import make_column_selector
from sklearn.metrics import confusion_matrix
''' Imbalanced Classes'''
import imblearn
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
from imblearn.pipeline import Pipeline
''' Tensorflow Keras'''
from tensorflow import keras
from tensorflow.keras import models
from tensorflow.keras import Sequential, layers
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras import regularizers
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.optimizers.schedules import ExponentialDecay

Init Plugin
Init Graph Optimizer
Init Kernel


# Loading data & EDA

In [2]:
#load data 
data = pd.read_csv('../FakeNewsNLP/data/news.csv')
data.head()

Unnamed: 0.1,Unnamed: 0,title,text,label
0,8476,You Can Smell Hillary’s Fear,"Daniel Greenfield, a Shillman Journalism Fello...",FAKE
1,10294,Watch The Exact Moment Paul Ryan Committed Pol...,Google Pinterest Digg Linkedin Reddit Stumbleu...,FAKE
2,3608,Kerry to go to Paris in gesture of sympathy,U.S. Secretary of State John F. Kerry said Mon...,REAL
3,10142,Bernie supporters on Twitter erupt in anger ag...,"— Kaydee King (@KaydeeKing) November 9, 2016 T...",FAKE
4,875,The Battle of New York: Why This Primary Matters,It's primary day in New York and front-runners...,REAL


In [3]:
# data information
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6335 entries, 0 to 6334
Data columns (total 4 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   Unnamed: 0  6335 non-null   int64 
 1   title       6335 non-null   object
 2   text        6335 non-null   object
 3   label       6335 non-null   object
dtypes: int64(1), object(3)
memory usage: 198.1+ KB


In [4]:
#Data description
data.describe()

Unnamed: 0.1,Unnamed: 0
count,6335.0
mean,5280.415627
std,3038.503953
min,2.0
25%,2674.5
50%,5271.0
75%,7901.0
max,10557.0


In [5]:
#duplicates
print('number of rows before removing duplicates :', len(data))
duplicates= data.duplicated()
print('number of duplicated rows:', duplicates.sum())
data.drop_duplicates(inplace=True)
print('number of rows after removing duplicates :', len(data))

number of rows before removing duplicates : 6335
number of duplicated rows: 0
number of rows after removing duplicates : 6335


In [6]:
#missing values
print('number of missing values:', data.isnull().sum())

number of missing values: Unnamed: 0    0
title         0
text          0
label         0
dtype: int64


# Feature engineering & text mining

In [7]:
!pip install nltk



In [8]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/oumniasadaouni/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [9]:
nltk.download('punkt')

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/oumniasadaouni/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [10]:
import re
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import string


#lowercasing text 
data['text_proc']= [x.lower() for x in data.text]

#Removing punctuation
for punctuation in string.punctuation:
    data['text_proc']= [x.replace(punctuation, '') for x in data.text_proc]

#removing stopwords
stop_words= set(stopwords.words('english'))
# word_tokens holds the NLTK tokenization but is not used afterwards;
# the stop-word filter below splits on whitespace instead
word_tokens= data.apply(lambda row: word_tokenize(row['text_proc']), axis=1)
stop_free= data.text_proc.apply(lambda x: ' '.join([word for word in x.split() if word not in (stop_words)]))
data['text_proc']= stop_free


In [11]:
data.head()

Unnamed: 0.1,Unnamed: 0,title,text,label,text_proc
0,8476,You Can Smell Hillary’s Fear,"Daniel Greenfield, a Shillman Journalism Fello...",FAKE,daniel greenfield shillman journalism fellow f...
1,10294,Watch The Exact Moment Paul Ryan Committed Pol...,Google Pinterest Digg Linkedin Reddit Stumbleu...,FAKE,google pinterest digg linkedin reddit stumbleu...
2,3608,Kerry to go to Paris in gesture of sympathy,U.S. Secretary of State John F. Kerry said Mon...,REAL,us secretary state john f kerry said monday st...
3,10142,Bernie supporters on Twitter erupt in anger ag...,"— Kaydee King (@KaydeeKing) November 9, 2016 T...",FAKE,— kaydee king kaydeeking november 9 2016 lesso...
4,875,The Battle of New York: Why This Primary Matters,It's primary day in New York and front-runners...,REAL,primary day new york frontrunners hillary clin...
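
The cleaning steps above (lowercasing, punctuation removal, stop-word filtering) can also be expressed as one reusable function. The following is a minimal sketch, not the notebook's own code: `clean_text` and `DEMO_STOP_WORDS` are hypothetical names, and the tiny stop-word set is only a stand-in for NLTK's English list so that the example runs without a download.

```python
import string

# Hypothetical stand-in for NLTK's English stop-word list, kept tiny so the
# example runs without any download.
DEMO_STOP_WORDS = {"a", "the", "of", "to", "in", "and", "is"}

# Translation table that deletes every ASCII punctuation character in one pass.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def clean_text(text, stop_words=DEMO_STOP_WORDS):
    """Lowercase, strip ASCII punctuation, and drop stop words."""
    lowered = text.lower()
    no_punct = lowered.translate(_PUNCT_TABLE)
    kept = [word for word in no_punct.split() if word not in stop_words]
    return " ".join(kept)


print(clean_text("Kerry to go to Paris in gesture of sympathy"))
# kerry go paris gesture sympathy
```

`string.punctuation` only covers ASCII characters, which is why non-ASCII marks such as the long dash at the start of row 3 survive in `text_proc`.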


In [12]:
#stemming (stem() is called once per article string here, not once per word)
from nltk import PorterStemmer

stemmer= PorterStemmer()
data['text_proc']= [stemmer.stem(s) for s in data.text_proc]
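
`PorterStemmer.stem` is designed to take a single word, whereas the list comprehension above passes it a whole article at a time. A per-word variant might look like the sketch below. `stem_document` is a hypothetical helper, and the scores reported later in this notebook were produced with the original line, so switching to this version would likely change them.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()


def stem_document(text):
    """Stem each whitespace-separated token and rejoin the result."""
    return " ".join(stemmer.stem(token) for token in text.split())


print(stem_document("running supporters"))  # run support
```

Applied to the notebook's column, this would read `data['text_proc'] = data.text_proc.apply(stem_document)`.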

# Model

In [13]:
#train test split
X= data.text_proc
y= data.label
X_train, X_test, y_train, y_test= train_test_split(X,y, test_size=0.2,
                                                   random_state= 42)

In [14]:
# shape of our splits
print( X_train.shape, X_test.shape, y_train.shape , y_test.shape)

(5068,) (1267,) (5068,) (1267,)


In [15]:
#TFidf text Vectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf= TfidfVectorizer(max_df=0.7)
X_train_tfidf= tfidf.fit_transform(X_train)
X_test_tdidf= tfidf.transform(X_test)
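
To make the two calls above concrete, here is a small sketch on a made-up corpus. It illustrates two points: `max_df=0.7` drops any term that appears in more than 70% of the training documents, and `transform` reuses the vocabulary learned by `fit_transform`, so words seen only at test time are ignored. The documents and variable names are invented for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus (made up): "news" appears in 3 of 4 training documents (75%).
train_docs = ["news alpha", "news beta", "news gamma", "delta epsilon"]
test_docs = ["news alpha zeta"]  # "zeta" never appears in training

vectorizer = TfidfVectorizer(max_df=0.7)
train_matrix = vectorizer.fit_transform(train_docs)  # learns vocabulary and IDF
test_matrix = vectorizer.transform(test_docs)        # reuses them unchanged

print(sorted(vectorizer.vocabulary_))
# ['alpha', 'beta', 'delta', 'epsilon', 'gamma']
print(train_matrix.shape, test_matrix.shape)
# (4, 5) (1, 5)
```

Fitting on the training split only, as the notebook does, keeps document frequencies from the test set out of the features.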

## Naive Bayes

In [16]:
# Naive Bayes modelling implementation

from sklearn.naive_bayes import MultinomialNB
nb_model= MultinomialNB()
nb_model.fit(X_train_tfidf, y_train)
nb_model.score(X_train_tfidf, y_train)

0.8975927387529598

In [17]:
from sklearn.metrics import accuracy_score
y_pred = nb_model.predict(X_test_tdidf)
score = accuracy_score(y_test,y_pred)
print(f'Accuracy NB: {round(score*100,2)}%')

Accuracy NB: 83.35%


## Passive Aggressive Classifier

In [18]:
from sklearn.linear_model import PassiveAggressiveClassifier
pac = PassiveAggressiveClassifier()
pac.fit(X_train_tfidf, y_train)
pac.score(X_train_tfidf, y_train)

#accuracy score 
y_pred = pac.predict(X_test_tdidf)
score = accuracy_score(y_test,y_pred)
print(f'Accuracy PAC: {round(score*100,2)}%')

Accuracy PAC: 94.63%


In [19]:
confusion_matrix(y_test,y_pred, labels=['FAKE','REAL'])

array([[597,  31],
       [ 37, 602]])

## Random Forest Classifier

In [20]:
from sklearn.ensemble import RandomForestClassifier
rfc = RandomForestClassifier()
rfc.fit(X_train_tfidf, y_train)
print(rfc.score(X_train_tfidf, y_train))

#accuracy score 
y_pred = rfc.predict(X_test_tdidf)
score = accuracy_score(y_test,y_pred)
print(f'Accuracy Random Forest: {round(score*100,2)}%')

1.0
Accuracy Random Forest: 92.11%


# Model tuning & hyperparameter tuning

In [21]:
#Pac params
pac.get_params()

{'C': 1.0,
 'average': False,
 'class_weight': None,
 'early_stopping': False,
 'fit_intercept': True,
 'loss': 'hinge',
 'max_iter': 1000,
 'n_iter_no_change': 5,
 'n_jobs': None,
 'random_state': None,
 'shuffle': True,
 'tol': 0.001,
 'validation_fraction': 0.1,
 'verbose': 0,
 'warm_start': False}

In [22]:
#Grid search of pac params
params= {'C' : [0.03, 0.1, 1, 10 ],
        'loss': ['hinge', 'squared_hinge'],
        'n_iter_no_change': [5, 10, 30, 100, 300]}

search = GridSearchCV(pac, 
                      param_grid= params,
                      n_jobs=-1,
                      verbose=1,
                      scoring="accuracy",
                      refit=True,
                      cv=5 
                     )

search.fit(X_train_tfidf, y_train)

Fitting 5 folds for each of 40 candidates, totalling 200 fits
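
The "40 candidates, totalling 200 fits" in the log follows directly from the grid: 4 values of `C`, 2 losses, and 5 values of `n_iter_no_change`, each evaluated on 5 folds. A quick way to check the size of a grid before launching a search is sketched below with scikit-learn's `ParameterGrid`; the variable names are illustrative.

```python
from sklearn.model_selection import ParameterGrid

params = {
    "C": [0.03, 0.1, 1, 10],
    "loss": ["hinge", "squared_hinge"],
    "n_iter_no_change": [5, 10, 30, 100, 300],
}
n_folds = 5

n_candidates = len(ParameterGrid(params))  # 4 * 2 * 5 combinations
n_fits = n_candidates * n_folds            # one fit per candidate per fold

print(n_candidates, n_fits)  # 40 200
```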


In [23]:
# grid search results
print('best params:', search.best_params_)
print('best score:', search.best_score_)


best params: {'C': 10, 'loss': 'hinge', 'n_iter_no_change': 10}
best score: 0.9394251456898581


In [24]:
#scoring best model
best_model = search.best_estimator_

#accuracy score 
y_pred = best_model.predict(X_test_tdidf)
score = accuracy_score(y_test,y_pred)
print(f'Accuracy best_model: {round(score*100,2)}%')

Accuracy best_model: 94.48%


In [25]:
from sklearn.metrics import confusion_matrix
confusion_matrix(y_test, y_pred, labels=["FAKE", "REAL"])

array([[596,  32],
       [ 38, 601]])

With our best model, treating FAKE as the positive class, we have 596 true positives, 601 true negatives, 38 false positives, and 32 false negatives.
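
As a cross-check, the counts can be read straight off the confusion matrix printed above, where rows are true labels and columns are predictions, both in the order `["FAKE", "REAL"]`. The sketch below assumes FAKE is the positive class and derives accuracy, precision, and recall from the four cells.

```python
# Confusion matrix reported above; rows are true labels, columns are
# predictions, both ordered as ["FAKE", "REAL"].
matrix = [[596, 32],
          [38, 601]]

# Assumption: FAKE is the positive class.
tp, fn = matrix[0]  # true FAKE: predicted FAKE, predicted REAL
fp, tn = matrix[1]  # true REAL: predicted FAKE, predicted REAL

total = tp + fn + fp + tn
accuracy = (tp + tn) / total
precision = tp / (tp + fp)
recall = tp / (tp + fn)

print(f"accuracy:  {accuracy:.4f}")
print(f"precision: {precision:.4f}")
print(f"recall:    {recall:.4f}")
```

The accuracy computed this way, 1197 correct out of 1267, matches the 94.48% reported for `best_model`.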

# Saving model 


In [27]:
import joblib 

joblib.dump(best_model, filename="PAC_tuned.pkl")

['PAC_tuned.pkl']
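
The file saved above contains the classifier only, so scoring raw text later would also require the fitted `tfidf` vectorizer. One option, sketched here as an assumption and not as what this notebook does, is to wrap the vectorizer and the classifier in a single scikit-learn pipeline and dump that instead. The toy texts, labels, and the file name `pac_pipeline.pkl` are invented for illustration.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import PassiveAggressiveClassifier
from sklearn.pipeline import make_pipeline

# Toy data standing in for X_train / y_train (made up for illustration).
texts = [
    "shocking secret cure doctors hate",
    "aliens endorse candidate in secret",
    "senate passes budget bill",
    "court rules on budget appeal",
]
labels = ["FAKE", "FAKE", "REAL", "REAL"]

# Vectorizer and classifier are fitted and saved as one object.
pipeline = make_pipeline(
    TfidfVectorizer(),
    PassiveAggressiveClassifier(random_state=0),
)
pipeline.fit(texts, labels)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "pac_pipeline.pkl")  # hypothetical file name
    joblib.dump(pipeline, path)
    restored = joblib.load(path)

# The restored pipeline accepts raw strings directly.
new_texts = ["secret cure", "budget bill"]
print(restored.predict(new_texts))
```

An alternative that keeps the current structure is to dump `tfidf` to a second file next to `PAC_tuned.pkl` and load both at prediction time.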