<a href="https://colab.research.google.com/github/Shubhankar9934/Deep_Learning/blob/main/FakeNewsClassiferUsingLSTM_Project.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Fake News Classifier Using LSTM**


Dataset: https://www.kaggle.com/c/fake-news/data

In [1]:
#Install the Kaggle library
!pip install kaggle

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [2]:
# Make a directory named “.kaggle”

! mkdir ~/.kaggle

In [3]:
# Copy the “kaggle.json” into this new directory
! cp kaggle.json ~/.kaggle/

In [4]:
# Allocate the required permission for this file.
! chmod 600 ~/.kaggle/kaggle.json

In [5]:
! kaggle competitions download -c fake-news

Downloading fake-news.zip to /content
 97% 45.0M/46.5M [00:00<00:00, 155MB/s]
100% 46.5M/46.5M [00:00<00:00, 151MB/s]


In [6]:
! unzip fake-news

Archive:  fake-news.zip
  inflating: submit.csv              
  inflating: test.csv                
  inflating: train.csv               


In [7]:
import pandas as pd


In [8]:
df = pd.read_csv('train.csv')

In [9]:
df.head()

Unnamed: 0,id,title,author,text,label
0,0,House Dem Aide: We Didn’t Even See Comey’s Let...,Darrell Lucus,House Dem Aide: We Didn’t Even See Comey’s Let...,1
1,1,"FLYNN: Hillary Clinton, Big Woman on Campus - ...",Daniel J. Flynn,Ever get the feeling your life circles the rou...,0
2,2,Why the Truth Might Get You Fired,Consortiumnews.com,"Why the Truth Might Get You Fired October 29, ...",1
3,3,15 Civilians Killed In Single US Airstrike Hav...,Jessica Purkiss,Videos 15 Civilians Killed In Single US Airstr...,1
4,4,Iranian woman jailed for fictional unpublished...,Howard Portnoy,Print \nAn Iranian woman has been sentenced to...,1


In [10]:
df.tail()

Unnamed: 0,id,title,author,text,label
20795,20795,Rapper T.I.: Trump a ’Poster Child For White S...,Jerome Hudson,Rapper T. I. unloaded on black celebrities who...,0
20796,20796,"N.F.L. Playoffs: Schedule, Matchups and Odds -...",Benjamin Hoffman,When the Green Bay Packers lost to the Washing...,0
20797,20797,Macy’s Is Said to Receive Takeover Approach by...,Michael J. de la Merced and Rachel Abrams,The Macy’s of today grew from the union of sev...,0
20798,20798,"NATO, Russia To Hold Parallel Exercises In Bal...",Alex Ansary,"NATO, Russia To Hold Parallel Exercises In Bal...",1
20799,20799,What Keeps the F-35 Alive,David Swanson,"David Swanson is an author, activist, journa...",1


In [11]:
###Drop Nan Values
df=df.dropna()

In [12]:
## Get the Independent Features

X=df.drop('label',axis=1)

In [13]:
X

Unnamed: 0,id,title,author,text
0,0,House Dem Aide: We Didn’t Even See Comey’s Let...,Darrell Lucus,House Dem Aide: We Didn’t Even See Comey’s Let...
1,1,"FLYNN: Hillary Clinton, Big Woman on Campus - ...",Daniel J. Flynn,Ever get the feeling your life circles the rou...
2,2,Why the Truth Might Get You Fired,Consortiumnews.com,"Why the Truth Might Get You Fired October 29, ..."
3,3,15 Civilians Killed In Single US Airstrike Hav...,Jessica Purkiss,Videos 15 Civilians Killed In Single US Airstr...
4,4,Iranian woman jailed for fictional unpublished...,Howard Portnoy,Print \nAn Iranian woman has been sentenced to...
...,...,...,...,...
20795,20795,Rapper T.I.: Trump a ’Poster Child For White S...,Jerome Hudson,Rapper T. I. unloaded on black celebrities who...
20796,20796,"N.F.L. Playoffs: Schedule, Matchups and Odds -...",Benjamin Hoffman,When the Green Bay Packers lost to the Washing...
20797,20797,Macy’s Is Said to Receive Takeover Approach by...,Michael J. de la Merced and Rachel Abrams,The Macy’s of today grew from the union of sev...
20798,20798,"NATO, Russia To Hold Parallel Exercises In Bal...",Alex Ansary,"NATO, Russia To Hold Parallel Exercises In Bal..."


In [14]:
y=df['label']

In [15]:
y

0        1
1        0
2        1
3        1
4        1
        ..
20795    0
20796    0
20797    0
20798    1
20799    1
Name: label, Length: 18285, dtype: int64

In [16]:
X.shape

(18285, 4)

In [17]:
y.shape

(18285,)

In [18]:
import tensorflow as tf

In [19]:
tf.__version__

'2.8.2'

In [20]:
from tensorflow.keras.layers import Embedding
from tensorflow.keras.preprocessing.sequence import pad_sequences
# Whenever you are giving input to the embeding layer make sure that we have 
# to keep our Size input length fixed that's why we need to use pad on bodh side 
# either on 'pre' side or 'post' side.
# we Add '0' Padded on pre or post so that we make the sentences equal in length.
from tensorflow.keras.models import Sequential
from tensorflow.keras.preprocessing.text import one_hot
# one_hot helps us to convert the sentences over given vocabulary size
from tensorflow.keras.layers import LSTM
from tensorflow.keras.layers import Dense
# Since this is a classification problem so that my final layer
# must have one node so that it identify the news is fake or not.

In [21]:
### Vocabulary size
voc_size=5000

** Onehot Representation **

In [22]:
messages=X.copy()

In [23]:
messages['title'][1]

'FLYNN: Hillary Clinton, Big Woman on Campus - Breitbart'

In [24]:
messages.reset_index(inplace=True)
# Reason of doing reset_index because we drop NaN value from this data set.

In [25]:
messages

Unnamed: 0,index,id,title,author,text
0,0,0,House Dem Aide: We Didn’t Even See Comey’s Let...,Darrell Lucus,House Dem Aide: We Didn’t Even See Comey’s Let...
1,1,1,"FLYNN: Hillary Clinton, Big Woman on Campus - ...",Daniel J. Flynn,Ever get the feeling your life circles the rou...
2,2,2,Why the Truth Might Get You Fired,Consortiumnews.com,"Why the Truth Might Get You Fired October 29, ..."
3,3,3,15 Civilians Killed In Single US Airstrike Hav...,Jessica Purkiss,Videos 15 Civilians Killed In Single US Airstr...
4,4,4,Iranian woman jailed for fictional unpublished...,Howard Portnoy,Print \nAn Iranian woman has been sentenced to...
...,...,...,...,...,...
18280,20795,20795,Rapper T.I.: Trump a ’Poster Child For White S...,Jerome Hudson,Rapper T. I. unloaded on black celebrities who...
18281,20796,20796,"N.F.L. Playoffs: Schedule, Matchups and Odds -...",Benjamin Hoffman,When the Green Bay Packers lost to the Washing...
18282,20797,20797,Macy’s Is Said to Receive Takeover Approach by...,Michael J. de la Merced and Rachel Abrams,The Macy’s of today grew from the union of sev...
18283,20798,20798,"NATO, Russia To Hold Parallel Exercises In Bal...",Alex Ansary,"NATO, Russia To Hold Parallel Exercises In Bal..."


**Data Preprocessing using nltk**

In [26]:
import nltk # importing Natural Language Toolkit
import re   # re stands for regular Expression
from nltk.corpus import stopwords #(stopwords:words that you do not want to use to describe the topic of your content.)

In [27]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

Why we need preprocessing?

Because we need to clean the data, data may be having various special character,it will having spaces,it will having words
that are not important for that we can remove it by using stopwords.

In [28]:
### Dataset Preprocessing
from nltk.stem.porter import PorterStemmer
# PorterStemmer : we need PorterStemmer because we need to Stemming Sentences
# which means sees the entire sentence as a word, so it returns it as it is.
ps = PorterStemmer()# Intialization of PorterStemmer.
corpus = [] # we are creating a list.
for i in range(0, len(messages)):
    print(i)
    review = re.sub('[^a-zA-Z]', ' ', messages['title'][i]) # using regular-expression
    # and substituting apart from this->'[^a-zA-Z]' character and substituting
    # everythings with blank. simply means skipping the character from the sentences.
    review = review.lower()
    #lowering the words in that sentences reason is that if any capital letter 
    # thanks and Thanks to make both are same, if we do not lower them that would 
    # be treated as two different words.
    review = review.split()
    # we are using split because we need to apply stopwords.
    review = [ps.stem(word) for word in review if not word in stopwords.words('english')]
    review = ' '.join(review)
    corpus.append(review)

[1;30;43mStreaming output truncated to the last 5000 lines.[0m
13285
13286
13287
13288
13289
13290
13291
13292
13293
13294
13295
13296
13297
13298
13299
13300
13301
13302
13303
13304
13305
13306
13307
13308
13309
13310
13311
13312
13313
13314
13315
13316
13317
13318
13319
13320
13321
13322
13323
13324
13325
13326
13327
13328
13329
13330
13331
13332
13333
13334
13335
13336
13337
13338
13339
13340
13341
13342
13343
13344
13345
13346
13347
13348
13349
13350
13351
13352
13353
13354
13355
13356
13357
13358
13359
13360
13361
13362
13363
13364
13365
13366
13367
13368
13369
13370
13371
13372
13373
13374
13375
13376
13377
13378
13379
13380
13381
13382
13383
13384
13385
13386
13387
13388
13389
13390
13391
13392
13393
13394
13395
13396
13397
13398
13399
13400
13401
13402
13403
13404
13405
13406
13407
13408
13409
13410
13411
13412
13413
13414
13415
13416
13417
13418
13419
13420
13421
13422
13423
13424
13425
13426
13427
13428
13429
13430
13431
13432
13433
13434
13435
13436
13437
13438
13439
13440

In [29]:
corpus

['hous dem aid even see comey letter jason chaffetz tweet',
 'flynn hillari clinton big woman campu breitbart',
 'truth might get fire',
 'civilian kill singl us airstrik identifi',
 'iranian woman jail fiction unpublish stori woman stone death adulteri',
 'jacki mason hollywood would love trump bomb north korea lack tran bathroom exclus video breitbart',
 'beno hamon win french socialist parti presidenti nomin new york time',
 'back channel plan ukrain russia courtesi trump associ new york time',
 'obama organ action partner soro link indivis disrupt trump agenda',
 'bbc comedi sketch real housew isi caus outrag',
 'russian research discov secret nazi militari base treasur hunter arctic photo',
 'us offici see link trump russia',
 'ye paid govern troll social media blog forum websit',
 'major leagu soccer argentin find home success new york time',
 'well fargo chief abruptli step new york time',
 'anonym donor pay million releas everyon arrest dakota access pipelin',
 'fbi close hilla

**One Hot Representation**

In [30]:
onehot_repr=[one_hot(words,voc_size)for words in corpus] # for words in corpus 
# we are appying onehot encoding by passing words and vocabulary size.
onehot_repr

[[1966, 4126, 4773, 2890, 4427, 1412, 4353, 1571, 1516, 1485],
 [2900, 1056, 3486, 2914, 4260, 4162, 4991],
 [466, 3242, 1604, 398],
 [1081, 1055, 1176, 2650, 2169, 4730],
 [3950, 4260, 1096, 987, 4261, 1336, 4260, 3097, 4055, 1936],
 [2530,
  1529,
  1506,
  358,
  757,
  2112,
  3808,
  2692,
  4213,
  3068,
  2695,
  3077,
  1111,
  2084,
  4991],
 [1640, 3836, 2230, 3601, 3131, 1908, 407, 4091, 4227, 1008, 4842],
 [774, 3500, 1095, 3819, 4727, 1446, 2112, 2605, 4227, 1008, 4842],
 [127, 2696, 56, 3005, 2789, 2736, 2011, 473, 2112, 4290],
 [1466, 3056, 4924, 1077, 3395, 2223, 2999, 922],
 [1972, 4515, 1436, 3118, 300, 4353, 1742, 4896, 3887, 4819, 113],
 [2650, 1085, 4427, 2736, 2112, 4727],
 [1812, 4564, 946, 2185, 177, 2525, 1188, 212, 4464],
 [4610, 4427, 2048, 4257, 2157, 2279, 3536, 4227, 1008, 4842],
 [1190, 4085, 4458, 2457, 2426, 4227, 1008, 4842],
 [4671, 3681, 2633, 4355, 2002, 2085, 2701, 596, 1952, 2159],
 [1114, 948, 1056],
 [4791, 3369, 628, 1963, 2112, 238, 3830, 4991

**Embedding Representation**

In [31]:
sent_length=20
embedded_docs=pad_sequences(onehot_repr,padding='pre',maxlen=sent_length)
print(embedded_docs)

[[   0    0    0 ... 1571 1516 1485]
 [   0    0    0 ... 4260 4162 4991]
 [   0    0    0 ... 3242 1604  398]
 ...
 [   0    0    0 ... 4227 1008 4842]
 [   0    0    0 ... 3987 4320 1020]
 [   0    0    0 ... 2933 3557 1680]]


In [32]:
embedded_docs[0]

array([   0,    0,    0,    0,    0,    0,    0,    0,    0,    0, 1966,
       4126, 4773, 2890, 4427, 1412, 4353, 1571, 1516, 1485], dtype=int32)

In [33]:
## Creating model
embedding_vector_features=40
model=Sequential()
model.add(Embedding(voc_size,embedding_vector_features,input_length=sent_length))
model.add(LSTM(100))
model.add(Dense(1,activation='sigmoid'))
model.compile(loss='binary_crossentropy',optimizer='adam',metrics=['accuracy'])
print(model.summary())

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, 20, 40)            200000    
                                                                 
 lstm (LSTM)                 (None, 100)               56400     
                                                                 
 dense (Dense)               (None, 1)                 101       
                                                                 
Total params: 256,501
Trainable params: 256,501
Non-trainable params: 0
_________________________________________________________________
None


In [34]:
len(embedded_docs),y.shape

(18285, (18285,))

In [35]:
import numpy as np
X_final=np.array(embedded_docs)
y_final=np.array(y)

In [36]:
X_final.shape,y_final.shape

((18285, 20), (18285,))

In [37]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X_final, y_final, test_size=0.3,random_state = 42)

**Model Training**


In [38]:
### Finally Training
model.fit(X_train,y_train,validation_data=(X_test,y_test),epochs=10,batch_size=64)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x7f3990499250>

**Adding Dropout**


In [39]:
from tensorflow.keras.layers import Dropout
## Creating model
embedding_vector_features=40
model=Sequential()
model.add(Embedding(voc_size,embedding_vector_features,input_length=sent_length))
model.add(Dropout(0.3))
model.add(LSTM(100))
model.add(Dropout(0.3))
model.add(Dense(1,activation='sigmoid'))
model.compile(loss='binary_crossentropy',optimizer='adam',metrics=['accuracy'])

**Performance Metrics And Accuracy**

In [40]:
y_pred = (model.predict(X_test) > 0.5)

In [41]:
from sklearn.metrics import confusion_matrix

In [42]:
confusion_matrix(y_test,y_pred)

array([[2109,  998],
       [1249, 1130]])

In [43]:
from sklearn.metrics import accuracy_score
accuracy_score(y_test,y_pred)

0.5904119577105359