# NLP - Fake News Classification

Fake news has become commonplace. Even established media houses have spread false stories and lost credibility as a result. So how can we tell whether a news article is real or fake?


The training dataset has the following attributes:

* id: unique id for a news article.

* title: the title of a news article.

* author: author of the news article.

* text: the text of the article; could be incomplete.

* label: a label that marks the article as potentially unreliable. Where 1: unreliable and 0: reliable.

## Importing Libraries

In [1]:
import numpy as np
import pandas as pd
import itertools
import matplotlib.pyplot as plt
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import PassiveAggressiveClassifier
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer
import nltk
import re
from nltk.corpus import stopwords  # needs a one-time nltk.download('stopwords')
from nltk.stem.porter import PorterStemmer
from sklearn import metrics
from sklearn.metrics import confusion_matrix,accuracy_score,classification_report

## Loading Dataset

In [2]:
# Forward slashes avoid the invalid "\F" escape sequence and work on every OS
df = pd.read_csv("Datasets/FakeNews_train.csv")

In [3]:
df.head()

| | id | title | author | text | label |
|---|---|---|---|---|---|
| 0 | 0 | House Dem Aide: We Didn’t Even See Comey’s Let... | Darrell Lucus | House Dem Aide: We Didn’t Even See Comey’s Let... | 1 |
| 1 | 1 | FLYNN: Hillary Clinton, Big Woman on Campus - ... | Daniel J. Flynn | Ever get the feeling your life circles the rou... | 0 |
| 2 | 2 | Why the Truth Might Get You Fired | Consortiumnews.com | Why the Truth Might Get You Fired October 29, ... | 1 |
| 3 | 3 | 15 Civilians Killed In Single US Airstrike Hav... | Jessica Purkiss | Videos 15 Civilians Killed In Single US Airstr... | 1 |
| 4 | 4 | Iranian woman jailed for fictional unpublished... | Howard Portnoy | Print \nAn Iranian woman has been sentenced to... | 1 |


In [4]:
df.shape

(20800, 5)

## Data Preprocessing

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20800 entries, 0 to 20799
Data columns (total 5 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   id      20800 non-null  int64 
 1   title   20242 non-null  object
 2   author  18843 non-null  object
 3   text    20761 non-null  object
 4   label   20800 non-null  int64 
dtypes: int64(2), object(3)
memory usage: 812.6+ KB


In [6]:
df.isnull().sum()

id           0
title      558
author    1957
text        39
label        0
dtype: int64

### Handle missing values

In [8]:
# The dataset has many null values (up to 1957, in the author column).
# Since the dataset is large (over 20k rows), dropping them should not affect the model much.
df = df.dropna()

In [9]:
df.shape

(18285, 5)
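
The cost of `dropna()` can be quantified from the two shapes printed so far. This is a small arithmetic check that uses only the row counts reported by `df.shape` before and after the drop:

```python
# Row counts taken from the df.shape outputs before and after dropna()
rows_before = 20800
rows_after = 18285

rows_dropped = rows_before - rows_after
fraction_dropped = rows_dropped / rows_before

print(f"Rows dropped: {rows_dropped}")
print(f"Fraction dropped: {fraction_dropped:.1%}")
```

About one row in eight is discarded. If that loss matters, filling missing `author` or `title` values with an empty string is an alternative worth testing.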

In [10]:
# Reset the index after dropping rows (the old index is kept as a new 'index' column)
df.reset_index(inplace=True)

In [11]:
# Combine author and title into a single text column
df['total']=df['author']+' '+df['title']

In [12]:
df['total']

0        Darrell Lucus House Dem Aide: We Didn’t Even S...
1        Daniel J. Flynn FLYNN: Hillary Clinton, Big Wo...
2        Consortiumnews.com Why the Truth Might Get You...
3        Jessica Purkiss 15 Civilians Killed In Single ...
4        Howard Portnoy Iranian woman jailed for fictio...
                               ...                        
18280    Jerome Hudson Rapper T.I.: Trump a ’Poster Chi...
18281    Benjamin Hoffman N.F.L. Playoffs: Schedule, Ma...
18282    Michael J. de la Merced and Rachel Abrams Macy...
18283    Alex Ansary NATO, Russia To Hold Parallel Exer...
18284              David Swanson What Keeps the F-35 Alive
Name: total, Length: 18285, dtype: object

In [13]:
df.info()
df.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18285 entries, 0 to 18284
Data columns (total 7 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   index   18285 non-null  int64 
 1   id      18285 non-null  int64 
 2   title   18285 non-null  object
 3   author  18285 non-null  object
 4   text    18285 non-null  object
 5   label   18285 non-null  int64 
 6   total   18285 non-null  object
dtypes: int64(3), object(4)
memory usage: 1000.1+ KB


| | index | id | title | author | text | label | total |
|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | House Dem Aide: We Didn’t Even See Comey’s Let... | Darrell Lucus | House Dem Aide: We Didn’t Even See Comey’s Let... | 1 | Darrell Lucus House Dem Aide: We Didn’t Even S... |
| 1 | 1 | 1 | FLYNN: Hillary Clinton, Big Woman on Campus - ... | Daniel J. Flynn | Ever get the feeling your life circles the rou... | 0 | Daniel J. Flynn FLYNN: Hillary Clinton, Big Wo... |
| 2 | 2 | 2 | Why the Truth Might Get You Fired | Consortiumnews.com | Why the Truth Might Get You Fired October 29, ... | 1 | Consortiumnews.com Why the Truth Might Get You... |
| 3 | 3 | 3 | 15 Civilians Killed In Single US Airstrike Hav... | Jessica Purkiss | Videos 15 Civilians Killed In Single US Airstr... | 1 | Jessica Purkiss 15 Civilians Killed In Single ... |
| 4 | 4 | 4 | Iranian woman jailed for fictional unpublished... | Howard Portnoy | Print \nAn Iranian woman has been sentenced to... | 1 | Howard Portnoy Iranian woman jailed for fictio... |


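### Stemming

Stemming is a form of linguistic normalisation in which the variant forms of a word are reduced to a common form.
In simple terms, stemming reduces a word to its root.

Idealised examples:

* actor, actress, acting --> act
* eating, eats, eaten --> eat

A rule-based stemmer such as the Porter stemmer used in this notebook only strips suffixes, so it may leave some of these forms (for example "actress" or "eaten") unchanged.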
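
To see what a real stemmer does, here is a minimal sketch using NLTK's `PorterStemmer`, the same class this notebook instantiates in the next cell. The word list is chosen for illustration. The first three words appear in the article titles shown earlier.

```python
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()

# Illustrative words; "house", "killed" and "single" occur in the titles shown earlier
words = ["house", "killed", "single", "acting", "eating", "eats"]

# Map each word to its Porter stem
stems = {word: stemmer.stem(word) for word in words}
for word, stem in stems.items():
    print(f"{word} -> {stem}")
```

The stems need not be dictionary words (`hous`, `singl`), because Porter stemming trims suffixes by rule rather than looking words up.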

In [14]:
port_stem = PorterStemmer()

In [15]:
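# Build the stopword set once; calling stopwords.words() for every token is very slow
stop_words = set(stopwords.words('english'))

def stemming(content):
    # Keep letters only, lowercase, and split into tokens
    review = re.sub('[^a-zA-Z]', ' ', content)
    review = review.lower()
    review = review.split()
    # Drop stopwords and stem the remaining tokens
    review = [port_stem.stem(word) for word in review if word not in stop_words]
    review = ' '.join(review)
    return review

df['total'] = df['total'].apply(stemming)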

In [16]:
print(df['total'])

0        darrel lucu hous dem aid even see comey letter...
1        daniel j flynn flynn hillari clinton big woman...
2                   consortiumnew com truth might get fire
3        jessica purkiss civilian kill singl us airstri...
4        howard portnoy iranian woman jail fiction unpu...
                               ...                        
18280    jerom hudson rapper trump poster child white s...
18281    benjamin hoffman n f l playoff schedul matchup...
18282    michael j de la merc rachel abram maci said re...
18283    alex ansari nato russia hold parallel exercis ...
18284                            david swanson keep f aliv
Name: total, Length: 18285, dtype: object
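
The cleaning that happens before stemming can be traced on a single row. This sketch repeats the regex, lowercasing and stopword filter from `stemming` on the third article. `STOP_WORDS` is a hypothetical three-word stand-in for NLTK's English list, and `clean_tokens` is a name introduced only for this illustration.

```python
import re

# Hypothetical mini stopword list for illustration; the notebook uses NLTK's English list
STOP_WORDS = {"why", "the", "you"}

def clean_tokens(text):
    # Replace every non-letter with a space, lowercase, and split on whitespace
    letters_only = re.sub("[^a-zA-Z]", " ", text)
    tokens = letters_only.lower().split()
    # Drop stopwords; stemming is deliberately left out of this sketch
    return [token for token in tokens if token not in STOP_WORDS]

tokens = clean_tokens("Consortiumnews.com Why the Truth Might Get You Fired")
print(tokens)
# ['consortiumnews', 'com', 'truth', 'might', 'get', 'fired']
```

Stemming these tokens gives the `consortiumnew com truth might get fire` entry in the printed output. The same regex explains why `N.F.L.` becomes `n f l` and `F-35` becomes `f`: punctuation and digits are replaced by spaces.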


In [17]:
X=df['total'].values
y=df['label'].values

In [18]:
print(X)

['darrel lucu hous dem aid even see comey letter jason chaffetz tweet'
 'daniel j flynn flynn hillari clinton big woman campu breitbart'
 'consortiumnew com truth might get fire' ...
 'michael j de la merc rachel abram maci said receiv takeov approach hudson bay new york time'
 'alex ansari nato russia hold parallel exercis balkan'
 'david swanson keep f aliv']


## Vectorizing our Data

Vectorization is the process of converting text into numbers. Word vectorization (closely related to word embeddings) maps words or phrases from the vocabulary to vectors of real numbers that a model can use, for example to predict words or to measure word similarity.

### TF-IDF Vectorizer

TF-IDF stands for Term Frequency-Inverse Document Frequency. It is a very common way to turn text into a numerical representation that can be used to fit a machine learning algorithm. A term's weight grows with how often it occurs in a document and shrinks with the number of documents that contain it.
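
The weighting can be worked through by hand on a toy corpus. The sketch below assumes the smoothed IDF, ln((1 + n) / (1 + df)) + 1, and the L2 normalisation that scikit-learn documents as the `TfidfVectorizer` defaults. The three documents are illustrative strings built from stemmed tokens seen in this notebook, and `tfidf` is a helper introduced only for this example.

```python
import math
from collections import Counter

# Toy corpus built from stemmed tokens seen in this notebook (illustrative only)
docs = [
    "truth might get fire",
    "flynn hillari clinton big woman",
    "iranian woman jail",
]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: number of documents containing each term
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

def tfidf(tokens):
    counts = Counter(tokens)
    # Term count times smoothed IDF: ln((1 + n) / (1 + df)) + 1
    weights = {
        term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
        for term, count in counts.items()
    }
    # L2-normalise so every document vector has unit length
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}

vectors = [tfidf(tokens) for tokens in tokenized]
for doc, vec in zip(docs, vectors):
    print(doc, {term: round(w, 3) for term, w in vec.items()})
```

In the third document, `woman` appears in two of the three documents and so receives a lower weight than `iranian` or `jail`, which appear in only one.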

In [19]:
# converting the textual data to numerical data
vectorizer = TfidfVectorizer()
vectorizer.fit(X)
X = vectorizer.transform(X)

In [20]:
# Split the data into training and test sets (80/20), stratified by label
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, stratify=y, random_state=2)
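
In the cells above, the vectorizer is fitted on every article before the split, so the vocabulary and IDF weights also reflect the test documents. A common alternative is to split first and fit the vectorizer on the training texts only. The sketch below shows that order on hypothetical toy data (the `texts` and `labels` lists are made up for illustration). It is not the procedure that produced the scores reported in this notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

# Hypothetical toy data: stemmed strings in the style of the 'total' column
texts = [
    "truth might get fire", "civilian kill singl us airstrik",
    "iranian woman jail fiction", "nato russia hold parallel exercis",
    "flynn hillari clinton big woman", "playoff schedul matchup",
    "maci said receiv takeov approach", "rapper trump poster child",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]  # made-up labels, for illustration only

# Split the raw text first, keeping the class balance with stratify
text_train, text_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, stratify=labels, random_state=2
)

# Learn vocabulary and IDF weights from the training texts only
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(text_train)
# Reuse the fitted vectorizer; words unseen during fitting are ignored
X_test = vectorizer.transform(text_test)

print(X_train.shape, X_test.shape)
```

Both matrices share the same columns because the test texts are transformed with the vocabulary learned from the training texts.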

## Model Training

### Logistic Regression

In [21]:
model = LogisticRegression()
model.fit(X_train, y_train)

In [22]:
# accuracy score on the training data
X_train_prediction = model.predict(X_train)
training_data_accuracy = accuracy_score(y_train, X_train_prediction)

In [23]:
training_data_accuracy

0.9901558654634947

In [24]:
# accuracy score on the test data
X_test_prediction = model.predict(X_test)
test_data_accuracy = accuracy_score(y_test, X_test_prediction)

In [25]:
test_data_accuracy

0.9827727645611156

In [26]:
# Pass the true labels first so that rows are true labels and columns are predictions
confusion_matrix(y_test, X_test_prediction)

array([[2019,   53],
       [  10, 1575]], dtype=int64)

### Multinomial Naive Bayes

In [27]:
classifier=MultinomialNB()
classifier.fit(X_train, y_train)

In [28]:
# accuracy score on the training data
X_train_prediction = classifier.predict(X_train)
training_data_accuracy = accuracy_score(y_train, X_train_prediction)

In [29]:
training_data_accuracy

0.9740907847962811

In [30]:
# accuracy score on the test data
X_test_prediction = classifier.predict(X_test)
test_data_accuracy = accuracy_score(y_test, X_test_prediction)

In [31]:
test_data_accuracy

0.949685534591195

In [32]:
# Pass the true labels first so that rows are true labels and columns are predictions
confusion_matrix(y_test, X_test_prediction)

array([[2070,    2],
       [ 182, 1403]], dtype=int64)
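
Accuracy alone hides how the two models differ. The sketch below derives precision and recall from the four counts in each test-set confusion matrix, treating label 1 (unreliable) as the positive class. The assignment of the off-diagonal counts relies on the true-class totals being the same for both models (2072 reliable and 1585 unreliable test articles), which identifies 10 and 182 as unreliable articles predicted reliable.

```python
# Counts read from the two test-set confusion matrices, with label 1 (unreliable) as positive
results = {
    "LogisticRegression": {"tn": 2019, "fp": 53, "fn": 10, "tp": 1575},
    "MultinomialNB": {"tn": 2070, "fp": 2, "fn": 182, "tp": 1403},
}

summary = {}
for name, c in results.items():
    total = c["tn"] + c["fp"] + c["fn"] + c["tp"]
    summary[name] = {
        "accuracy": (c["tp"] + c["tn"]) / total,
        # Precision: share of articles flagged unreliable that really are unreliable
        "precision": c["tp"] / (c["tp"] + c["fp"]),
        # Recall: share of unreliable articles that the model catches
        "recall": c["tp"] / (c["tp"] + c["fn"]),
    }
    print(name, {k: round(v, 4) for k, v in summary[name].items()})
```

Naive Bayes misses 182 unreliable articles against 10 for logistic regression, so its lower accuracy comes almost entirely from lower recall on the unreliable class. Its precision is slightly higher.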