# Fake News Detection using a simple LSTM

Build a system to identify unreliable news articles.

Dataset: https://www.kaggle.com/c/fake-news/data


*   id: unique id for a news article
*   title: the title of a news article
*   author: author of the news article
*   text: the text of the article; could be incomplete
*   label: a label that marks the article as potentially unreliable (target data)
    *   1: unreliable
    *   0: reliable





**Import the dataset and view the data**

In [1]:
import pandas as pd
data = pd.read_csv("/content/drive/MyDrive/NLP/train.csv")

In [2]:
data.head()

| | id | title | author | text | label |
|---|---|---|---|---|---|
| 0 | 0 | House Dem Aide: We Didn’t Even See Comey’s Let... | Darrell Lucus | House Dem Aide: We Didn’t Even See Comey’s Let... | 1 |
| 1 | 1 | FLYNN: Hillary Clinton, Big Woman on Campus - ... | Daniel J. Flynn | Ever get the feeling your life circles the rou... | 0 |
| 2 | 2 | Why the Truth Might Get You Fired | Consortiumnews.com | Why the Truth Might Get You Fired October 29, ... | 1 |
| 3 | 3 | 15 Civilians Killed In Single US Airstrike Hav... | Jessica Purkiss | Videos 15 Civilians Killed In Single US Airstr... | 1 |
| 4 | 4 | Iranian woman jailed for fictional unpublished... | Howard Portnoy | Print \nAn Iranian woman has been sentenced to... | 1 |


**Remove null values**

In [5]:
data = data.dropna()

**Obtain the independent features X and the dependent (target) feature Y**

In [7]:
X = data.drop('label', axis=1)
X.head()

| | id | title | author | text |
|---|---|---|---|---|
| 0 | 0 | House Dem Aide: We Didn’t Even See Comey’s Let... | Darrell Lucus | House Dem Aide: We Didn’t Even See Comey’s Let... |
| 1 | 1 | FLYNN: Hillary Clinton, Big Woman on Campus - ... | Daniel J. Flynn | Ever get the feeling your life circles the rou... |
| 2 | 2 | Why the Truth Might Get You Fired | Consortiumnews.com | Why the Truth Might Get You Fired October 29, ... |
| 3 | 3 | 15 Civilians Killed In Single US Airstrike Hav... | Jessica Purkiss | Videos 15 Civilians Killed In Single US Airstr... |
| 4 | 4 | Iranian woman jailed for fictional unpublished... | Howard Portnoy | Print \nAn Iranian woman has been sentenced to... |


In [9]:
Y = data['label']
Y.head()

0    1
1    0
2    1
3    1
4    1
Name: label, dtype: int64

**Define vocab_size and perform data pre-processing**

*Stemming, stop-word removal, and corpus creation*

In [10]:
import tensorflow as tf
tf.__version__

'2.7.0'

In [18]:
vocab_size = 5000  # number of distinct word indices available (the dictionary size)

In [11]:
messages = X.copy()
messages.reset_index(inplace=True)  # dropna() left gaps in the row labels, so renumber them from 0
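Why the reset matters can be seen on a toy frame. The sketch below uses a hypothetical three-row DataFrame (not the Kaggle data) to show that `dropna()` keeps the original row labels, so a later loop that looks rows up by position number, such as `messages['title'][i]`, would hit a missing label unless the index is renumbered. It passes `drop=True` for brevity, which is an assumption; the notebook's call keeps the old labels in an extra `index` column instead.

```python
import pandas as pd

# Hypothetical toy data: the second row has a missing title
toy = pd.DataFrame({"title": ["first", None, "third"], "label": [1, 0, 1]})

cleaned = toy.dropna()
kept_labels = list(cleaned.index)       # the gap left by the dropped row is still visible

# Renumber the rows so that labels 0..n-1 exist again
reset = cleaned.reset_index(drop=True)
new_labels = list(reset.index)

print(kept_labels, new_labels)
print(reset["title"][1])                # label-based lookup now works for every i in range(len(reset))
```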

In [12]:
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [13]:
import re

# See the stemming and lemmatization notebook for background on this step

from nltk.stem.porter import PorterStemmer
ps = PorterStemmer()
corpus = []

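# Build the stop-word set once; calling stopwords.words('english') for every word is slow
stop_words = set(stopwords.words('english'))

for i in range(len(messages)):
    # Keep letters only, lowercase, and split into tokens
    review = re.sub('[^a-zA-Z]', ' ', messages['title'][i])
    review = review.lower().split()

    # Drop stop words and stem the remaining tokens
    review = [ps.stem(word) for word in review if word not in stop_words]
    corpus.append(' '.join(review))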
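To make the cleaning steps concrete, here is a minimal sketch applied to a single title from the dataset. It covers only the regex cleaning, lowercasing, and stop-word filtering; stemming is deliberately left out, and `STOP_WORDS` is a tiny hand-written set used as a stand-in for NLTK's English list, so the result differs from the notebook's corpus (which contains the stem `fire` rather than `fired`).

```python
import re

# Hypothetical, hand-written stop-word set for illustration only;
# the notebook uses the full NLTK English list
STOP_WORDS = {"why", "the", "you", "a", "an", "in", "of", "to"}

def clean_title(title, stop_words):
    """Replace non-letters with spaces, lowercase, split, and drop stop words."""
    letters_only = re.sub(r"[^a-zA-Z]", " ", title)
    tokens = letters_only.lower().split()
    return [token for token in tokens if token not in stop_words]

tokens = clean_title("Why the Truth Might Get You Fired", STOP_WORDS)
print(tokens)
```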

In [14]:
#view the corpus 

corpus[:10]

['hous dem aid even see comey letter jason chaffetz tweet',
 'flynn hillari clinton big woman campu breitbart',
 'truth might get fire',
 'civilian kill singl us airstrik identifi',
 'iranian woman jail fiction unpublish stori woman stone death adulteri',
 'jacki mason hollywood would love trump bomb north korea lack tran bathroom exclus video breitbart',
 'beno hamon win french socialist parti presidenti nomin new york time',
 'back channel plan ukrain russia courtesi trump associ new york time',
 'obama organ action partner soro link indivis disrupt trump agenda',
 'bbc comedi sketch real housew isi caus outrag']

**One-hot representation: map each word to an integer index below vocab_size**

In [19]:
from tensorflow.keras.preprocessing.text import one_hot
onehot_repr = [one_hot(words, vocab_size) for words in corpus]
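Despite its name, Keras's `one_hot` does not return one-hot vectors: it hashes each word to an integer index, so two different words can collide on the same index. The sketch below illustrates that idea in plain Python. The helper names `word_to_index` and `encode` are hypothetical, and MD5 is used here only to make the result reproducible; it is not claimed to match the indices Keras produces.

```python
import hashlib

def word_to_index(word, vocab_size):
    """Hash a word to a stable integer in [1, vocab_size - 1]; 0 stays free for padding."""
    digest = hashlib.md5(word.encode("utf-8")).hexdigest()
    return int(digest, 16) % (vocab_size - 1) + 1

def encode(sentence, vocab_size):
    """Turn a whitespace-separated sentence into a list of word indices."""
    return [word_to_index(word, vocab_size) for word in sentence.split()]

vocab_size = 5000
encoded = encode("truth might get fire", vocab_size)
repeated = encode("woman jail woman", vocab_size)

print(encoded)
print(repeated)  # the same word always maps to the same index
```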

**Pad the sequences: pre-pad with zeros so every sentence has the same length before it is passed to the embedding and LSTM layers**

In [20]:
from tensorflow.keras.preprocessing.sequence import pad_sequences

sent_length=20
embedded_docs=pad_sequences(onehot_repr,padding='pre',maxlen=sent_length)
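What pre-padding does to a single sequence can be sketched in plain Python. The hypothetical helper `pad_pre` below mimics `pad_sequences` with `padding='pre'` and its default `truncating='pre'`: short sequences get zeros on the left, and long ones keep only their last `maxlen` items. It assumes `maxlen` is at least 1.

```python
def pad_pre(sequence, maxlen, value=0):
    """Left-pad with `value`; if the sequence is too long, keep only the last `maxlen` items."""
    trimmed = sequence[-maxlen:]
    return [value] * (maxlen - len(trimmed)) + trimmed

short_seq = pad_pre([11, 22, 33], maxlen=5)
long_seq = pad_pre([1, 2, 3, 4, 5, 6, 7], maxlen=5)

print(short_seq)
print(long_seq)
```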


**Define the feature size and create the model**

In [22]:
from tensorflow.keras.layers import Embedding
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM
from tensorflow.keras.layers import Dense

In [27]:
## Creating model
features = 40  # embedding dimension: each word index maps to a 40-dimensional vector

#define the model
model=Sequential()

#define the embedding layer
model.add(Embedding(vocab_size,features,input_length=sent_length))

# add an LSTM layer with 100 units
model.add(LSTM(100))

# the output is binary (unreliable or reliable), so finish with a single dense unit and a sigmoid activation
model.add(Dense(1,activation='sigmoid'))


# binary cross-entropy is the usual loss for two-class classification
model.compile(loss='binary_crossentropy',optimizer='adam',metrics=['accuracy'])
print(model.summary())

Model: "sequential_4"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_3 (Embedding)     (None, 20, 40)            200000    
                                                                 
 lstm_2 (LSTM)               (None, 100)               56400     
                                                                 
 dense_2 (Dense)             (None, 1)                 101       
                                                                 
Total params: 256,501
Trainable params: 256,501
Non-trainable params: 0
_________________________________________________________________
None
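The parameter counts in the summary can be reproduced by hand, which is a useful check that the layer sizes are what was intended. The sketch below uses the standard formulas for an embedding table, an LSTM (four gates, each with input weights, recurrent weights, and a bias), and a dense layer; the variable names are chosen here for illustration.

```python
vocab_size = 5000   # from the notebook
features = 40       # embedding dimension, from the notebook
lstm_units = 100    # from model.add(LSTM(100))

# One 40-dimensional vector per word index
embedding_params = vocab_size * features

# 4 gates x (input weights + recurrent weights + bias)
lstm_params = 4 * ((features + lstm_units) * lstm_units + lstm_units)

# One weight per LSTM output plus one bias
dense_params = lstm_units * 1 + 1

total_params = embedding_params + lstm_params + dense_params
print(embedding_params, lstm_params, dense_params, total_params)
```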


**Define the train and test data**

In [30]:
import numpy as np
X_final=np.array(embedded_docs)
y_final=np.array(Y)

In [31]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X_final, y_final, test_size=0.30, random_state=42)

**Train the model**

In [32]:
model.fit(X_train,y_train,validation_data=(X_test,y_test),epochs=10,batch_size=64)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x7ff83094ac50>

**Accuracy and confusion matrix**

In [34]:
y_pred=model.predict(X_test)

In [37]:
y_pred

array([[9.9999976e-01],
       [6.2656403e-04],
       [4.5303914e-05],
       ...,
       [9.9999714e-01],
       [2.1365953e-09],
       [9.9992836e-01]], dtype=float32)

In [38]:
y_test

array([1, 0, 0, ..., 1, 0, 1])
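The predictions above are probabilities, so they need to be thresholded before they can be compared with `y_test`. The sketch below shows one way to do that and to compute the accuracy and a 2x2 confusion matrix with NumPy. The arrays are hypothetical values chosen for illustration, not results from this model, and the 0.5 threshold is an assumption. `sklearn.metrics.accuracy_score` and `sklearn.metrics.confusion_matrix` compute the same quantities, with the same rows-are-true, columns-are-predicted layout.

```python
import numpy as np

# Hypothetical probabilities and labels for illustration; not output from the notebook's model
y_prob = np.array([[0.98], [0.03], [0.61], [0.45], [0.10], [0.92]], dtype=np.float32)
y_true = np.array([1, 0, 0, 1, 0, 1])

# model.predict returns shape (n, 1): flatten it, then threshold at 0.5 (assumed cut-off)
y_label = (y_prob.ravel() > 0.5).astype(int)

# Fraction of predictions that match the true labels
accuracy = float((y_label == y_true).mean())

# Confusion matrix: rows are the true class, columns are the predicted class
cm = np.zeros((2, 2), dtype=int)
for true_class, predicted_class in zip(y_true, y_label):
    cm[true_class, predicted_class] += 1

print(accuracy)
print(cm)
```

With the notebook's variables, `y_prob` would be `y_pred` and `y_true` would be `y_test`.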