<a href="https://www.kaggle.com/code/sohanamitarathod/fake-news-classifier-using-bi-lstm?scriptVersionId=141460954" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

In [3]:
import warnings

# Ignore specific category of warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)

In [4]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import confusion_matrix, accuracy_score
#NLP Libraries
import nltk
import re
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
#Deep Learning Libraries
import tensorflow as tf
from tensorflow.keras.layers import Embedding
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM
from tensorflow.keras.layers import Dense
from tensorflow.keras.preprocessing.text import one_hot
from tensorflow.keras.layers import Bidirectional
from tensorflow.keras.layers import Dropout

#Vocabulary Size
vocab_size=5000

In [5]:
df= pd.read_csv('/kaggle/input/fake-news-identification-using-bi-lstm/fake_or_real_news.csv')
df.head()

,Unnamed: 0,title,text,label
0,8476,You Can Smell Hillary’s Fear,"Daniel Greenfield, a Shillman Journalism Fello...",FAKE
1,10294,Watch The Exact Moment Paul Ryan Committed Pol...,Google Pinterest Digg Linkedin Reddit Stumbleu...,FAKE
2,3608,Kerry to go to Paris in gesture of sympathy,U.S. Secretary of State John F. Kerry said Mon...,REAL
3,10142,Bernie supporters on Twitter erupt in anger ag...,"— Kaydee King (@KaydeeKing) November 9, 2016 T...",FAKE
4,875,The Battle of New York: Why This Primary Matters,It's primary day in New York and front-runners...,REAL


In [7]:
df.isnull().sum()

Unnamed: 0    0
title         0
text          0
label         0
dtype: int64

In [8]:
df= df.dropna()

In [9]:
x= df.drop('label',axis=1)

In [10]:
df['label'] = df['label'].map({'FAKE': 0, 'REAL': 1})  # encode labels: FAKE -> 0, REAL -> 1

In [11]:
y=df['label']

In [12]:
y

0       0
1       0
2       1
3       0
4       1
       ..
6330    1
6331    0
6332    0
6333    1
6334    1
Name: label, Length: 6335, dtype: int64

In [13]:
var= x.copy()

In [14]:
var.reset_index(inplace=True)

In [15]:
# NLTK (Natural Language Toolkit) stopwords.
# Stopwords are common words ("and", "the", "in", "is", ...) that carry little
# meaning on their own, so they are usually filtered out before NLP tasks.
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /usr/share/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [17]:
stemmer = PorterStemmer()
stop_words = set(stopwords.words('english'))  # build once; set lookups are fast
corpus = []
for i in range(len(var)):
    # Keep letters only, lowercase, and split the title into tokens
    tokens = re.sub('[^a-zA-Z]', ' ', var['title'][i]).lower().split()
    # Drop stopwords and stem the remaining tokens
    tokens = [stemmer.stem(word) for word in tokens if word not in stop_words]
    corpus.append(' '.join(tokens))


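To feed words into the network, we first have to encode them as integers. In this implementation we use the one_hot function available in TensorFlow (Keras). Despite its name, it does not build one-hot vectors: it hashes each word to an integer index below the vocab size, and the embedding layer later turns those indices into vectors. At the beginning, we set the vocab size to 5000.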
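As a rough illustration of this encoding step, the sketch below maps each word to an integer index by hashing, so no vocabulary has to be built in advance. It is not the Keras implementation: the helper name `hash_encode` and the choice of MD5 as the hash are assumptions made here so that the result is reproducible. Index 0 is left unused so that it can serve as the padding value.

```python
import hashlib

def hash_encode(text, vocab_size):
    """Map each whitespace-separated word to an index in [1, vocab_size - 1].

    Hypothetical helper for illustration; MD5 is an assumed hash choice.
    """
    indices = []
    for word in text.lower().split():
        digest = int(hashlib.md5(word.encode("utf-8")).hexdigest(), 16)
        # Shift by one so that index 0 stays free for padding.
        indices.append(digest % (vocab_size - 1) + 1)
    return indices

encoded = hash_encode("smell hillari fear", 5000)
print(encoded)
```

Because different words can hash to the same index, a larger vocab size lowers the chance of collisions at the cost of a larger embedding table.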


After encoding all the words in the corpus, we need to perform padding to ensure all the sequences in the corpus have the same length.

To pad the sequences in the corpus, we can use the pad_sequences function from Keras. We set the maximum sequence length to 20.
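A minimal pure-Python sketch of what pre-padding does to sequences of different lengths. The helper `pad_pre` is hypothetical and only mimics the behaviour described here: zeros are added on the left, and a sequence longer than the maximum length keeps its last items (which corresponds to truncating from the front).

```python
def pad_pre(sequences, maxlen, value=0):
    """Left-pad (or left-truncate) each sequence to exactly maxlen items."""
    padded = []
    for seq in sequences:
        seq = list(seq)[-maxlen:]  # keep at most the last maxlen items
        padded.append([value] * (maxlen - len(seq)) + seq)
    return padded

example = pad_pre([[7, 8], [1, 2, 3, 4, 5, 6]], maxlen=4)
print(example)  # [[0, 0, 7, 8], [3, 4, 5, 6]]
```

Padding on the left keeps the real tokens at the end of the sequence, closest to the final LSTM step.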


In [18]:
encoding = [one_hot(words,vocab_size) for words in corpus]


In [19]:
emb_docs=pad_sequences(encoding,padding='pre',maxlen=20)
print(emb_docs)


[[   0    0    0 ... 4101 4723 1829]
 [   0    0    0 ... 1330 2572   27]
 [   0    0    0 ... 3587 4471 3080]
 ...
 [   0    0    0 ... 1433   50 4954]
 [   0    0    0 ... 2483 3882 2291]
 [   0    0    0 ... 1910 1330 4528]]


A Sequential model is sufficient here because the baseline is a plain stack of layers: an embedding layer, a BiLSTM layer, and a dense layer.



The layers I have used are as follows:

**1st Layer**: Embedding layer. Maps each integer word index to a dense vector of the given size (40 here).

**2nd Layer**: Bidirectional LSTM layer. Wraps an LSTM with 100 units and runs it over the sequence in both directions, giving 200 outputs.

**3rd Layer**: Dense layer. Connects all 200 outputs of the previous layer to a single output neuron.

**Activation Function**: Sigmoid. Squashes the output to a value between 0 and 1, which is read as the probability of the REAL class (label 1).

**Loss Function**: Binary cross-entropy. Measures how far the predicted probability is from the true 0/1 label.
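To make the last two items concrete, here is a small sketch in plain Python of the sigmoid and the binary cross-entropy for a single example. The raw score of 2.0 is an arbitrary value chosen for illustration, not a number taken from the trained model.

```python
import math

def sigmoid(z):
    """Squash a raw score into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def binary_cross_entropy(y_true, p):
    """Loss for one example with true label y_true (0 or 1) and predicted probability p."""
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1 - p))

p = sigmoid(2.0)                            # predicted probability of label 1
loss_if_real = binary_cross_entropy(1, p)   # small loss: prediction agrees with the label
loss_if_fake = binary_cross_entropy(0, p)   # large loss: prediction contradicts the label
print(round(p, 3), round(loss_if_real, 3), round(loss_if_fake, 3))
```

The loss grows quickly when the model is confident and wrong, which is what pushes the predicted probabilities toward the correct label during training.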

In [20]:
model = Sequential()
model.add(Embedding(vocab_size,40,input_length=20)) # 40-dimensional embedding for each of the 20 token positions
model.add(Bidirectional(LSTM(100)))  # 100 LSTM units per direction, 200 outputs in total
# model.add(Dropout(0.2))
model.add(Dense(1,activation='sigmoid'))
model.compile(loss='binary_crossentropy',optimizer='adam',metrics=['accuracy'])

print(model.summary())


Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, 20, 40)            200000    
                                                                 
 bidirectional (Bidirectiona  (None, 200)              112800    
 l)                                                              
                                                                 
 dense (Dense)               (None, 1)                 201       
                                                                 
Total params: 313,001
Trainable params: 313,001
Non-trainable params: 0
_________________________________________________________________
None
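The parameter counts in the summary can be checked by hand. The sketch below redoes the arithmetic, assuming the standard LSTM layout of four weight blocks (three gates plus the candidate cell state), each with input weights, recurrent weights, and a bias.

```python
vocab_size, embed_dim, units = 5000, 40, 100

# One 40-dimensional vector per vocabulary index.
embedding_params = vocab_size * embed_dim

# Four weight blocks, each with input weights, recurrent weights, and a bias.
lstm_params = 4 * (units * (embed_dim + units) + units)
bilstm_params = 2 * lstm_params      # one LSTM per direction

# 200 inputs (forward + backward outputs) to 1 neuron, plus its bias.
dense_params = 2 * units * 1 + 1

total = embedding_params + bilstm_params + dense_params
print(embedding_params, bilstm_params, dense_params, total)
# 200000 112800 201 313001
```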


In [21]:
x_final= np.array(emb_docs)
y_final= np.array(y)


In [22]:
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test=train_test_split(x_final,y_final,test_size=0.2,random_state=0)


In [23]:
x_train

array([[   0,    0,    0, ..., 4505, 1330, 4333],
       [   0,    0,    0, ..., 4071, 4763, 3989],
       [   0,    0,    0, ..., 3300,  508, 1253],
       ...,
       [   0,    0,    0, ...,  600, 1330,   61],
       [   0,    0,    0, ..., 2220, 1127, 4613],
       [   0,    0,    0, ..., 3259,  760,  673]], dtype=int32)

In [24]:
y_train

array([1, 1, 0, ..., 0, 0, 1])

In [25]:
y_test

array([1, 0, 0, ..., 0, 1, 1])

In [26]:
model.fit(x_train, y_train,
          batch_size=64,
          epochs=10,
          validation_data=(x_test, y_test))  # Keras expects a tuple here

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x7f9fa423d690>

In [27]:
y_pred= model.predict(x_test)



In [28]:
print(confusion_matrix(y_test,y_pred.round()))


[[489 126]
 [160 492]]


In [29]:
print(accuracy_score(y_test,y_pred.round()))

0.7742699289660616
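Accuracy alone hides how the errors are split between the two classes. As a sketch, the confusion matrix printed above can be unpacked into precision, recall, and F1, assuming the scikit-learn convention (rows are true labels, columns are predictions) and treating REAL (label 1) as the positive class.

```python
cm = [[489, 126],
      [160, 492]]
(tn, fp), (fn, tp) = cm

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # of the titles predicted REAL, how many were REAL
recall = tp / (tp + fn)      # of the REAL titles, how many were found
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.4f} precision={precision:.4f} "
      f"recall={recall:.4f} f1={f1:.4f}")
```

The accuracy computed this way matches the value returned by accuracy_score.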


The Dropout layer is commented out in the model above. The purpose of adding a dropout layer is to make the model more robust (less prone to overfitting) by stopping neurons from relying too heavily on one another.
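A minimal NumPy sketch of the idea behind dropout during training ("inverted dropout"): each activation is zeroed with probability equal to the rate, and the survivors are scaled up so the expected value stays the same. The helper `dropout` is hypothetical and written for illustration; it is not the Keras layer itself. At inference time no units are dropped.

```python
import numpy as np

def dropout(x, rate, rng):
    """Zero each element with probability `rate` and rescale the rest."""
    keep_mask = rng.random(x.shape) >= rate
    return x * keep_mask / (1.0 - rate)

rng = np.random.default_rng(0)
activations = np.ones((2, 5))
dropped = dropout(activations, rate=0.2, rng=rng)
print(dropped)  # every entry is either 0.0 or 1.25
```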

### Reference:

For tf.keras:
https://www.tensorflow.org/api_docs/python/tf/keras/layers/Bidirectional

For more depth on BiLSTM:
https://medium.com/@raghavaggarwal0089/bi-lstm-bc3d68da8bd0