### Fake News Classifier Using RNN

Dataset: https://www.kaggle.com/c/fake-news/data#

Background reading on LSTMs: https://colah.github.io/posts/2015-08-Understanding-LSTMs/

In [1]:
import pandas as pd

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [3]:
df=pd.read_csv("/content/drive/MyDrive/COLLEGE DOCUMENTS/fake.csv")

In [4]:
df.head()

Unnamed: 0,id,title,author,text,label
0,0,House Dem Aide: We Didn’t Even See Comey’s Let...,Darrell Lucus,House Dem Aide: We Didn’t Even See Comey’s Let...,1
1,1,"FLYNN: Hillary Clinton, Big Woman on Campus - ...",Daniel J. Flynn,Ever get the feeling your life circles the rou...,0
2,2,Why the Truth Might Get You Fired,Consortiumnews.com,"Why the Truth Might Get You Fired October 29, ...",1
3,3,15 Civilians Killed In Single US Airstrike Hav...,Jessica Purkiss,Videos 15 Civilians Killed In Single US Airstr...,1
4,4,Iranian woman jailed for fictional unpublished...,Howard Portnoy,Print \nAn Iranian woman has been sentenced to...,1


In [5]:
df.shape

(20800, 5)

In [6]:
df.isnull().sum() # to check how many null values are there

id           0
title      558
author    1957
text        39
label        0
dtype: int64

In [7]:
df=df.dropna() # to drop null values

In [8]:
df.isnull().sum()

id        0
title     0
author    0
text      0
label     0
dtype: int64

In [9]:
X=df.drop("label",axis=1)

In [10]:
X

Unnamed: 0,id,title,author,text
0,0,House Dem Aide: We Didn’t Even See Comey’s Let...,Darrell Lucus,House Dem Aide: We Didn’t Even See Comey’s Let...
1,1,"FLYNN: Hillary Clinton, Big Woman on Campus - ...",Daniel J. Flynn,Ever get the feeling your life circles the rou...
2,2,Why the Truth Might Get You Fired,Consortiumnews.com,"Why the Truth Might Get You Fired October 29, ..."
3,3,15 Civilians Killed In Single US Airstrike Hav...,Jessica Purkiss,Videos 15 Civilians Killed In Single US Airstr...
4,4,Iranian woman jailed for fictional unpublished...,Howard Portnoy,Print \nAn Iranian woman has been sentenced to...
...,...,...,...,...
20795,20795,Rapper T.I.: Trump a ’Poster Child For White S...,Jerome Hudson,Rapper T. I. unloaded on black celebrities who...
20796,20796,"N.F.L. Playoffs: Schedule, Matchups and Odds -...",Benjamin Hoffman,When the Green Bay Packers lost to the Washing...
20797,20797,Macy’s Is Said to Receive Takeover Approach by...,Michael J. de la Merced and Rachel Abrams,The Macy’s of today grew from the union of sev...
20798,20798,"NATO, Russia To Hold Parallel Exercises In Bal...",Alex Ansary,"NATO, Russia To Hold Parallel Exercises In Bal..."


In [12]:
y=df["label"]
y

0        1
1        0
2        1
3        1
4        1
        ..
20795    0
20796    0
20797    0
20798    1
20799    1
Name: label, Length: 18285, dtype: int64

In [13]:
X.shape,y.shape

((18285, 4), (18285,))

In [14]:
import tensorflow as tf

In [15]:
from tensorflow.keras.layers import Embedding
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.preprocessing.text import one_hot
from tensorflow.keras.layers import SimpleRNN,LSTM,GRU,Dense


### Copy()
Creating a copy of the data before exploratory data analysis (EDA) or preprocessing is good practice for several reasons:

Data Integrity: Transformations applied to the copy leave the original dataset unchanged and available for later reference or analysis.

Reproducibility: Every analysis can start from the same untouched data, so results can be reproduced and team members working in parallel do not end up with conflicting findings.

Error Recovery: If rows are deleted or values overwritten by accident, the original dataset is still there to fall back on.

Safe Experimentation: Transformations, calculations, and filters can be tried freely on the copy. The copy itself costs extra memory and time proportional to the size of the data, which is worth keeping in mind for large datasets.

In short, copying first guards against unintended changes, supports reproducibility, and makes errors easy to recover from.
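
A minimal sketch of the difference between a copy and a second name for the same frame. The frame `original` and its values are made up for illustration and are not the Kaggle data.

```python
import pandas as pd

# Hypothetical toy data, not the Kaggle dataset.
original = pd.DataFrame({"title": ["a", "b"], "label": [0, 1]})

alias = original               # a second name for the same object
independent = original.copy()  # a separate frame (deep copy by default)

# Editing the copy leaves the original untouched.
independent.loc[0, "title"] = "changed"
unchanged_title = original.loc[0, "title"]

# Editing through the alias edits the original, because they are one object.
alias.loc[1, "title"] = "changed too"
aliased_title = original.loc[1, "title"]
```

Here `unchanged_title` is still `"a"`, while `aliased_title` reflects the edit made through `alias`.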

In [16]:
messages=X.copy()
messages

Unnamed: 0,id,title,author,text
0,0,House Dem Aide: We Didn’t Even See Comey’s Let...,Darrell Lucus,House Dem Aide: We Didn’t Even See Comey’s Let...
1,1,"FLYNN: Hillary Clinton, Big Woman on Campus - ...",Daniel J. Flynn,Ever get the feeling your life circles the rou...
2,2,Why the Truth Might Get You Fired,Consortiumnews.com,"Why the Truth Might Get You Fired October 29, ..."
3,3,15 Civilians Killed In Single US Airstrike Hav...,Jessica Purkiss,Videos 15 Civilians Killed In Single US Airstr...
4,4,Iranian woman jailed for fictional unpublished...,Howard Portnoy,Print \nAn Iranian woman has been sentenced to...
...,...,...,...,...
20795,20795,Rapper T.I.: Trump a ’Poster Child For White S...,Jerome Hudson,Rapper T. I. unloaded on black celebrities who...
20796,20796,"N.F.L. Playoffs: Schedule, Matchups and Odds -...",Benjamin Hoffman,When the Green Bay Packers lost to the Washing...
20797,20797,Macy’s Is Said to Receive Takeover Approach by...,Michael J. de la Merced and Rachel Abrams,The Macy’s of today grew from the union of sev...
20798,20798,"NATO, Russia To Hold Parallel Exercises In Bal...",Alex Ansary,"NATO, Russia To Hold Parallel Exercises In Bal..."


In [17]:
messages["title"][0]

'House Dem Aide: We Didn’t Even See Comey’s Letter Until Jason Chaffetz Tweeted It'

In [18]:
messages.reset_index(inplace=True)
messages

Unnamed: 0,index,id,title,author,text
0,0,0,House Dem Aide: We Didn’t Even See Comey’s Let...,Darrell Lucus,House Dem Aide: We Didn’t Even See Comey’s Let...
1,1,1,"FLYNN: Hillary Clinton, Big Woman on Campus - ...",Daniel J. Flynn,Ever get the feeling your life circles the rou...
2,2,2,Why the Truth Might Get You Fired,Consortiumnews.com,"Why the Truth Might Get You Fired October 29, ..."
3,3,3,15 Civilians Killed In Single US Airstrike Hav...,Jessica Purkiss,Videos 15 Civilians Killed In Single US Airstr...
4,4,4,Iranian woman jailed for fictional unpublished...,Howard Portnoy,Print \nAn Iranian woman has been sentenced to...
...,...,...,...,...,...
18280,20795,20795,Rapper T.I.: Trump a ’Poster Child For White S...,Jerome Hudson,Rapper T. I. unloaded on black celebrities who...
18281,20796,20796,"N.F.L. Playoffs: Schedule, Matchups and Odds -...",Benjamin Hoffman,When the Green Bay Packers lost to the Washing...
18282,20797,20797,Macy’s Is Said to Receive Takeover Approach by...,Michael J. de la Merced and Rachel Abrams,The Macy’s of today grew from the union of sev...
18283,20798,20798,"NATO, Russia To Hold Parallel Exercises In Bal...",Alex Ansary,"NATO, Russia To Hold Parallel Exercises In Bal..."


In [19]:
messages=messages.drop(["index"],axis=1)
messages

Unnamed: 0,id,title,author,text
0,0,House Dem Aide: We Didn’t Even See Comey’s Let...,Darrell Lucus,House Dem Aide: We Didn’t Even See Comey’s Let...
1,1,"FLYNN: Hillary Clinton, Big Woman on Campus - ...",Daniel J. Flynn,Ever get the feeling your life circles the rou...
2,2,Why the Truth Might Get You Fired,Consortiumnews.com,"Why the Truth Might Get You Fired October 29, ..."
3,3,15 Civilians Killed In Single US Airstrike Hav...,Jessica Purkiss,Videos 15 Civilians Killed In Single US Airstr...
4,4,Iranian woman jailed for fictional unpublished...,Howard Portnoy,Print \nAn Iranian woman has been sentenced to...
...,...,...,...,...
18280,20795,Rapper T.I.: Trump a ’Poster Child For White S...,Jerome Hudson,Rapper T. I. unloaded on black celebrities who...
18281,20796,"N.F.L. Playoffs: Schedule, Matchups and Odds -...",Benjamin Hoffman,When the Green Bay Packers lost to the Washing...
18282,20797,Macy’s Is Said to Receive Takeover Approach by...,Michael J. de la Merced and Rachel Abrams,The Macy’s of today grew from the union of sev...
18283,20798,"NATO, Russia To Hold Parallel Exercises In Bal...",Alex Ansary,"NATO, Russia To Hold Parallel Exercises In Bal..."
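
Why the index is reset: `dropna()` keeps the original row labels, so the index has gaps, and a loop over `range(len(messages))` that reads `messages['title'][i]` would hit a missing label. A minimal sketch on hypothetical toy data follows; it uses `reset_index(drop=True)`, which resets the index and discards the old one in a single step.

```python
import pandas as pd

# Hypothetical toy frame with one missing title.
toy = pd.DataFrame({"title": ["first", None, "third"], "label": [1, 0, 1]})

cleaned = toy.dropna()
labels_before = list(cleaned.index)   # the label of the dropped row is missing

cleaned = cleaned.reset_index(drop=True)
labels_after = list(cleaned.index)    # consecutive labels again
second_title = cleaned["title"][1]    # safe to look up by position-like label
```

With `drop=True` there is no leftover `index` column to remove afterwards.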


In [20]:
import nltk
import re
from nltk.corpus import stopwords

In [21]:
nltk.download("stopwords")

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [22]:
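### Dataset Preprocessing
from nltk.stem.porter import PorterStemmer  # reduces each word to its stem
ps = PorterStemmer()
stop_words = set(stopwords.words('english'))  # build the set once instead of reloading the list for every word
corpus = []
for i in range(len(messages)):
    # keep letters only, lowercase, and split into tokens
    review = re.sub('[^a-zA-Z]', ' ', messages['title'][i])
    review = review.lower()
    review = review.split()

    # drop stopwords and stem the remaining tokens
    review = [ps.stem(word) for word in review if word not in stop_words]
    review = ' '.join(review)
    corpus.append(review)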
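
The loop above strips non-letters, lowercases, removes stopwords, and stems each title. The sketch below isolates the first three steps on one title so the effect is easy to see. It is an illustration under assumptions: `clean_title` is a hypothetical helper, `STOP_WORDS` is a small hand-picked set standing in for NLTK's English list, and stemming is left out so the snippet runs without NLTK.

```python
import re

# Hand-picked stand-in for NLTK's English stopword list (assumption).
STOP_WORDS = {"we", "didn", "t", "s", "it", "until", "the", "a"}

def clean_title(title, stop_words=STOP_WORDS):
    """Keep letters only, lowercase, and drop stopwords (no stemming)."""
    letters_only = re.sub("[^a-zA-Z]", " ", title)
    tokens = letters_only.lower().split()
    return " ".join(tok for tok in tokens if tok not in stop_words)

cleaned = clean_title("House Dem Aide: We Didn’t Even See Comey’s Letter")
```

The apostrophes become spaces, which is why fragments such as `didn`, `t`, and `s` appear as separate tokens before the stopword filter removes them.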

In [23]:
corpus

['hous dem aid even see comey letter jason chaffetz tweet',
 'flynn hillari clinton big woman campu breitbart',
 'truth might get fire',
 'civilian kill singl us airstrik identifi',
 'iranian woman jail fiction unpublish stori woman stone death adulteri',
 'jacki mason hollywood would love trump bomb north korea lack tran bathroom exclus video breitbart',
 'beno hamon win french socialist parti presidenti nomin new york time',
 'back channel plan ukrain russia courtesi trump associ new york time',
 'obama organ action partner soro link indivis disrupt trump agenda',
 'bbc comedi sketch real housew isi caus outrag',
 'russian research discov secret nazi militari base treasur hunter arctic photo',
 'us offici see link trump russia',
 'ye paid govern troll social media blog forum websit',
 'major leagu soccer argentin find home success new york time',
 'well fargo chief abruptli step new york time',
 'anonym donor pay million releas everyon arrest dakota access pipelin',
 'fbi close hilla

In [24]:
voc_size=10000

### Onehot Representation

In [26]:
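# Despite its name, one_hot() returns one integer index per word (hashed into the range 1..voc_size-1),
# not a one-hot vector; two different words can land on the same index.
onehot_repr=[one_hot(words,voc_size) for words in corpus]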
onehot_repr

[[492, 7733, 2908, 1556, 4908, 5419, 2636, 9736, 964, 321],
 [5740, 1892, 7866, 1746, 8382, 1012, 4342],
 [6257, 1600, 4304, 4747],
 [9967, 9343, 2941, 4234, 2463, 9860],
 [2744, 8382, 5863, 5244, 8157, 6993, 8382, 4621, 9364, 1719],
 [6337,
  4022,
  8851,
  8402,
  1771,
  4820,
  3914,
  6319,
  3196,
  7150,
  7454,
  5761,
  48,
  9856,
  4342],
 [8174, 6101, 4831, 3675, 4657, 4894, 2248, 3841, 3634, 5381, 692],
 [9980, 8802, 447, 6963, 3486, 939, 4820, 2552, 3634, 5381, 692],
 [4748, 8107, 2754, 8367, 3111, 826, 5820, 9916, 4820, 3228],
 [6258, 6116, 8518, 7361, 7716, 3859, 1337, 9593],
 [1040, 1430, 1758, 6885, 692, 3296, 1803, 4063, 7616, 5680, 3233],
 [4234, 8301, 4908, 826, 4820, 3486],
 [3131, 1371, 212, 9684, 8177, 251, 5231, 8987, 6041],
 [4868, 3936, 1468, 5770, 5421, 1170, 8750, 3634, 5381, 692],
 [3952, 2195, 3216, 4733, 2971, 3634, 5381, 692],
 [1338, 1244, 6315, 9139, 5242, 3848, 6222, 7461, 4996, 5544],
 [5510, 5969, 1892],
 [6570, 8891, 5705, 6331, 4820, 2742, 8802,

In [27]:
corpus[1]

'flynn hillari clinton big woman campu breitbart'

In [28]:
onehot_repr[1]

[5740, 1892, 7866, 1746, 8382, 1012, 4342]
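
The encoding above can be pictured as a hashing trick: each word is hashed and reduced modulo the vocabulary size, with index 0 left free for padding. The sketch below is an illustration under assumptions, not Keras's implementation: `hash_encode` is a hypothetical helper, and MD5 is used only because it yields the same indices on every run. Its indices will not match the output above.

```python
import hashlib

def hash_encode(text, voc_size):
    """Map each whitespace-separated word to an index in [1, voc_size - 1]."""
    indices = []
    for word in text.split():
        digest = hashlib.md5(word.encode("utf-8")).hexdigest()
        indices.append(int(digest, 16) % (voc_size - 1) + 1)
    return indices

encoded = hash_encode("iranian woman jail fiction unpublish stori woman", 10000)
```

The repeated word `woman` receives the same index both times, just as it does in the notebook output. Because the vocabulary is squeezed into a fixed number of buckets, two different words can also share an index; a larger `voc_size` makes that less likely.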

### Embedding Representation

In [29]:
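# Pad every encoded title to the same length so the batch forms a rectangular matrix.
# With sent_length=20 and padding='post', a 3-word title gets 17 zeros appended;
# titles longer than 20 words are truncated (from the front, the pad_sequences default).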
sent_length=20
embedded_docs=pad_sequences(onehot_repr,padding='post',maxlen=sent_length)
print(embedded_docs)

[[ 492 7733 2908 ...    0    0    0]
 [5740 1892 7866 ...    0    0    0]
 [6257 1600 4304 ...    0    0    0]
 ...
 [7752 1514 2876 ...    0    0    0]
 [1041 3486 2265 ...    0    0    0]
 [2016 2426 1442 ...    0    0    0]]


In [30]:
embedded_docs[1]

array([5740, 1892, 7866, 1746, 8382, 1012, 4342,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0], dtype=int32)

In [31]:
embedded_docs[0]

array([ 492, 7733, 2908, 1556, 4908, 5419, 2636, 9736,  964,  321,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0], dtype=int32)
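
A small NumPy sketch of what the padding step does. `pad_post` is a hypothetical helper written for illustration; it mimics `padding='post'` together with the default front truncation, and is not the Keras function itself.

```python
import numpy as np

def pad_post(sequences, maxlen):
    """Right-pad with zeros; keep the last `maxlen` items of longer sequences."""
    out = np.zeros((len(sequences), maxlen), dtype=np.int32)
    for row, seq in enumerate(sequences):
        kept = seq[-maxlen:]           # drop items from the front if too long
        out[row, :len(kept)] = kept    # remaining positions stay zero
    return out

padded = pad_post([[6257, 1600, 4304, 4747], [1, 2, 3, 4, 5, 6, 7]], maxlen=5)
```

The first row gains one trailing zero; the second loses its first two items.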

### RNN

In [32]:
## Creating model

# One recurrent layer: at every time step SimpleRNN feeds its hidden state back into itself
embedding_vector_features=40 # size of each word's embedding vector
RNN_model=Sequential()
RNN_model.add(Embedding(voc_size,embedding_vector_features,input_length=sent_length))
RNN_model.add(SimpleRNN(100))
RNN_model.add(Dense(1,activation='sigmoid'))
RNN_model.compile(loss='binary_crossentropy',optimizer='adam',metrics=['accuracy'])
print(RNN_model.summary())

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, 20, 40)            400000    
                                                                 
 simple_rnn (SimpleRNN)      (None, 100)               14100     
                                                                 
 dense (Dense)               (None, 1)                 101       
                                                                 
Total params: 414,201
Trainable params: 414,201
Non-trainable params: 0
_________________________________________________________________
None
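
The parameter counts in the summary can be checked by hand. The sketch below uses the standard formulas for an embedding table, a simple recurrent layer, and a dense layer; the function names are hypothetical helpers, not Keras API.

```python
def embedding_params(vocab_size, embed_dim):
    # one embedding vector per vocabulary index
    return vocab_size * embed_dim

def simple_rnn_params(input_dim, units):
    # input weights + recurrent weights + bias
    return units * input_dim + units * units + units

def dense_params(input_dim, units):
    # weights + bias
    return input_dim * units + units

emb = embedding_params(10000, 40)
rnn = simple_rnn_params(40, 100)
out = dense_params(100, 1)
total = emb + rnn + out
```

The three values reproduce the 400,000, 14,100, and 101 parameters listed above, for a total of 414,201.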


In [33]:
len(embedded_docs)

18285

In [34]:
import numpy as np

In [35]:
X_final=np.array(embedded_docs)

In [36]:
X_final.shape

(18285, 20)

In [37]:
Y_final=np.array(y)
Y_final.shape

(18285,)

In [38]:
from sklearn.model_selection import train_test_split

In [39]:
X_train,X_test,y_train,y_test=train_test_split(X_final,Y_final,test_size=0.33,random_state=42)

### Model Training

In [40]:
%%time
RNN_model.fit(X_train,y_train,validation_data=(X_test,y_test),epochs=10,batch_size=64)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
CPU times: user 57.1 s, sys: 5.31 s, total: 1min 2s
Wall time: 1min 24s


<keras.callbacks.History at 0x79beb04a35e0>

In [41]:
X_test

array([[1040, 2610, 4371, ...,    0,    0,    0],
       [5809, 8866, 4577, ...,    0,    0,    0],
       [2351, 2660,  687, ...,    0,    0,    0],
       ...,
       [1697, 4220, 2766, ...,    0,    0,    0],
       [4820,    0,    0, ...,    0,    0,    0],
       [7468, 6557, 6888, ...,    0,    0,    0]], dtype=int32)

In [42]:
y_pred=RNN_model.predict(X_test)



In [43]:
y_pred

array([[9.9993211e-01],
       [4.1125553e-05],
       [2.5750423e-05],
       ...,
       [2.7423870e-05],
       [9.9993742e-01],
       [9.9131072e-01]], dtype=float32)

In [44]:
# Predict class 1 only when the probability exceeds 0.6, a stricter cut-off than the common 0.5
y_pred=np.where(y_pred > 0.6, 1,0)

In [45]:
y_pred

array([[1],
       [0],
       [0],
       ...,
       [0],
       [1],
       [1]])
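
A short sketch of how the cut-off turns sigmoid outputs into labels. The probabilities in `probs` are hypothetical values chosen to show a case where the 0.6 threshold and a 0.5 threshold disagree.

```python
import numpy as np

# Hypothetical sigmoid outputs with the same (n, 1) shape that predict() returns.
probs = np.array([[0.9999], [0.55], [0.00004]])

labels = np.where(probs > 0.6, 1, 0).ravel()         # cut-off used in this notebook
labels_at_half = (probs > 0.5).astype(int).ravel()   # common default, for comparison
```

The middle sample (0.55) is labelled 0 at the 0.6 cut-off but 1 at 0.5, so raising the threshold trades some recall on class 1 for fewer false positives. `ravel()` flattens the column into a 1-D array of labels.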

In [46]:
from sklearn.metrics import confusion_matrix

In [47]:
confusion_matrix(y_test,y_pred)

array([[3121,  298],
       [ 262, 2354]])

In [48]:
from sklearn.metrics import accuracy_score
accuracy_score(y_test,y_pred)

0.9072079536039768

In [49]:
from sklearn.metrics import classification_report
print(classification_report(y_test,y_pred))

              precision    recall  f1-score   support

           0       0.92      0.91      0.92      3419
           1       0.89      0.90      0.89      2616

    accuracy                           0.91      6035
   macro avg       0.91      0.91      0.91      6035
weighted avg       0.91      0.91      0.91      6035
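
The accuracy and the class-1 row of the report follow directly from the confusion matrix above, where rows are true labels and columns are predicted labels. A sketch of the arithmetic:

```python
import numpy as np

# Confusion matrix reported for the SimpleRNN model.
cm = np.array([[3121, 298],
               [262, 2354]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()
precision_1 = tp / (tp + fp)   # of the titles predicted 1, how many are truly 1
recall_1 = tp / (tp + fn)      # of the titles that are truly 1, how many were found
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)
```

Rounded to two decimals these give 0.91, 0.89, 0.90, and 0.89, in line with the report.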



In [None]:
# Assignment: write inference code for the RNN, LSTM, and GRU models

In [None]:
# Assignment: add a Dropout layer

### LSTM

In [50]:
## Creating model
embedding_vector_features=40 # size of each word's embedding vector
LSTM_model=Sequential()
LSTM_model.add(Embedding(voc_size,embedding_vector_features,input_length=sent_length))
LSTM_model.add(LSTM(100)) # 100 units: the size of the hidden state and the cell state
LSTM_model.add(Dense(1,activation='sigmoid'))
LSTM_model.compile(loss='binary_crossentropy',optimizer='adam',metrics=['accuracy'])
print(LSTM_model.summary())

Model: "sequential_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_1 (Embedding)     (None, 20, 40)            400000    
                                                                 
 lstm (LSTM)                 (None, 100)               56400     
                                                                 
 dense_1 (Dense)             (None, 1)                 101       
                                                                 
Total params: 456,501
Trainable params: 456,501
Non-trainable params: 0
_________________________________________________________________
None
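
The LSTM layer's 56,400 parameters can be checked by hand: an LSTM holds four weight sets (input, forget, and output gates plus the candidate cell state), each the size of one simple recurrent layer. `lstm_params` is a hypothetical helper for the arithmetic, not Keras API.

```python
def lstm_params(input_dim, units):
    # one weight set = input weights + recurrent weights + bias
    per_set = units * input_dim + units * units + units
    return 4 * per_set   # three gates plus the candidate cell state

lstm = lstm_params(40, 100)
total = 10000 * 40 + lstm + (100 + 1)   # embedding + LSTM + dense
```

This reproduces both the layer count and the model total of 456,501.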


In [51]:
import numpy as np
X_final=np.array(embedded_docs)
y_final=np.array(y)

In [52]:
X_final.shape,y_final.shape

((18285, 20), (18285,))

In [53]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X_final, y_final, test_size=0.33, random_state=42)

### Model Training

In [54]:
%%time
### Finally Training
LSTM_model.fit(X_train,y_train,validation_data=(X_test,y_test),epochs=10,batch_size=64)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
CPU times: user 2min 11s, sys: 7.26 s, total: 2min 18s
Wall time: 1min 46s


<keras.callbacks.History at 0x79bea2d8ee90>

In [55]:

X_test

array([[1040, 2610, 4371, ...,    0,    0,    0],
       [5809, 8866, 4577, ...,    0,    0,    0],
       [2351, 2660,  687, ...,    0,    0,    0],
       ...,
       [1697, 4220, 2766, ...,    0,    0,    0],
       [4820,    0,    0, ...,    0,    0,    0],
       [7468, 6557, 6888, ...,    0,    0,    0]], dtype=int32)

In [56]:
y_pred=LSTM_model.predict(X_test)



In [57]:
y_pred=np.where(y_pred > 0.6, 1,0)

In [58]:
y_pred

array([[1],
       [0],
       [0],
       ...,
       [0],
       [1],
       [1]])

In [59]:
from sklearn.metrics import confusion_matrix

In [60]:
confusion_matrix(y_test,y_pred)

array([[3113,  306],
       [ 248, 2368]])

In [61]:
from sklearn.metrics import accuracy_score
accuracy_score(y_test,y_pred)

0.9082021541010771

In [62]:
from sklearn.metrics import classification_report
print(classification_report(y_test,y_pred))

              precision    recall  f1-score   support

           0       0.93      0.91      0.92      3419
           1       0.89      0.91      0.90      2616

    accuracy                           0.91      6035
   macro avg       0.91      0.91      0.91      6035
weighted avg       0.91      0.91      0.91      6035



### GRU

In [63]:
## Creating model
embedding_vector_features=40 # size of each word's embedding vector
GRU_model=Sequential()
GRU_model.add(Embedding(voc_size,embedding_vector_features,input_length=sent_length))
GRU_model.add(GRU(100))
GRU_model.add(Dense(1,activation='sigmoid'))
GRU_model.compile(loss='binary_crossentropy',optimizer='adam',metrics=['accuracy'])
print(GRU_model.summary())

Model: "sequential_2"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_2 (Embedding)     (None, 20, 40)            400000    
                                                                 
 gru (GRU)                   (None, 100)               42600     
                                                                 
 dense_2 (Dense)             (None, 1)                 101       
                                                                 
Total params: 442,701
Trainable params: 442,701
Non-trainable params: 0
_________________________________________________________________
None
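
The GRU layer's 42,600 parameters are consistent with three weight sets (update gate, reset gate, candidate state), each carrying two bias vectors, one for the input and one for the recurrent connection. The two-bias layout is an assumption inferred from the count in the summary; `gru_params` is a hypothetical helper for the arithmetic.

```python
def gru_params(input_dim, units):
    # one weight set = input weights + recurrent weights + two bias vectors (assumed layout)
    per_set = units * input_dim + units * units + 2 * units
    return 3 * per_set   # update gate, reset gate, candidate state

gru = gru_params(40, 100)
total = 10000 * 40 + gru + (100 + 1)   # embedding + GRU + dense
```

This reproduces the layer count and the model total of 442,701.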


In [64]:
import numpy as np
X_final=np.array(embedded_docs)
y_final=np.array(y)

In [65]:
X_final.shape,y_final.shape

((18285, 20), (18285,))

In [66]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X_final, y_final, test_size=0.33, random_state=42)

### Model Training

In [67]:
%%time
### Finally Training
GRU_model.fit(X_train,y_train,validation_data=(X_test,y_test),epochs=10,batch_size=64)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
CPU times: user 2min 7s, sys: 7.95 s, total: 2min 15s
Wall time: 2min 24s


<keras.callbacks.History at 0x79bea07afa90>

In [68]:

X_test

array([[1040, 2610, 4371, ...,    0,    0,    0],
       [5809, 8866, 4577, ...,    0,    0,    0],
       [2351, 2660,  687, ...,    0,    0,    0],
       ...,
       [1697, 4220, 2766, ...,    0,    0,    0],
       [4820,    0,    0, ...,    0,    0,    0],
       [7468, 6557, 6888, ...,    0,    0,    0]], dtype=int32)

In [69]:
y_pred=GRU_model.predict(X_test)



In [70]:
y_pred=np.where(y_pred > 0.6, 1,0)

In [71]:
from sklearn.metrics import confusion_matrix

In [72]:
confusion_matrix(y_test,y_pred)

array([[3145,  274],
       [ 265, 2351]])

In [73]:
from sklearn.metrics import accuracy_score
accuracy_score(y_test,y_pred)

0.9106876553438277

In [74]:
from sklearn.metrics import classification_report
print(classification_report(y_test,y_pred))

              precision    recall  f1-score   support

           0       0.92      0.92      0.92      3419
           1       0.90      0.90      0.90      2616

    accuracy                           0.91      6035
   macro avg       0.91      0.91      0.91      6035
weighted avg       0.91      0.91      0.91      6035

