## Fake News Classifier Using LSTM

Author - Sagnick Bhar  
Dataset: https://www.kaggle.com/datasets/hassanamin/textdb3  
Accuracy = 76.23%

### Importing Libraries

In [1]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import os

import tensorflow as tf
from tensorflow.keras.layers import Embedding
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.preprocessing.text import one_hot
from tensorflow.keras.layers import LSTM
from tensorflow.keras.layers import Dense

import nltk
import re
from nltk.corpus import stopwords
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /usr/share/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

### Importing Dataset

In [2]:
#Importing Dataset
df=pd.read_csv('../input/textdb3/fake_or_real_news.csv')

In [3]:
df.head()

| | Unnamed: 0 | title | text | label |
|---|---|---|---|---|
| 0 | 8476 | You Can Smell Hillary’s Fear | Daniel Greenfield, a Shillman Journalism Fello... | FAKE |
| 1 | 10294 | Watch The Exact Moment Paul Ryan Committed Pol... | Google Pinterest Digg Linkedin Reddit Stumbleu... | FAKE |
| 2 | 3608 | Kerry to go to Paris in gesture of sympathy | U.S. Secretary of State John F. Kerry said Mon... | REAL |
| 3 | 10142 | Bernie supporters on Twitter erupt in anger ag... | — Kaydee King (@KaydeeKing) November 9, 2016 T... | FAKE |
| 4 | 875 | The Battle of New York: Why This Primary Matters | It's primary day in New York and front-runners... | REAL |


In [4]:
## Drop NaN values
df=df.dropna()

In [5]:
## Get the Independent Features
X=df.drop('label',axis=1)

In [6]:
## Get the Dependent features
y=df['label']

In [7]:
X.shape

(6335, 3)

In [8]:
y.shape

(6335,)

### Data Preprocessing

In [9]:
### Vocabulary size
voc_size=6256

In [10]:
messages=X.copy()

In [11]:
messages['title'][1]

'Watch The Exact Moment Paul Ryan Committed Political Suicide At A Trump Rally (VIDEO)'

In [12]:
messages.reset_index(inplace=True)

In [13]:
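from nltk.stem.porter import PorterStemmer
ps = PorterStemmer()
# Build the stop-word set once; calling stopwords.words() for every word is very slow
stop_words = set(stopwords.words('english'))
corpus = []
for i in range(len(messages)):
    # Keep letters only, lowercase, and split into tokens
    review = re.sub('[^a-zA-Z]', ' ', messages['title'][i])
    review = review.lower()
    review = review.split()

    # Drop stop words and stem what remains
    review = [ps.stem(word) for word in review if word not in stop_words]
    review = ' '.join(review)
    corpus.append(review)
print("Process Completed")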

Process Completed


In [14]:
corpus[1]

'watch exact moment paul ryan commit polit suicid trump ralli video'
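
The cleaning steps in the loop above can also be wrapped in a function, so the same transformation can be reused on new headlines at prediction time. This is a minimal sketch: `clean_title` is a hypothetical helper, the tiny stop-word set and the identity stemmer are stand-ins so the example runs without NLTK data, and in the notebook they would be `set(stopwords.words('english'))` and `ps.stem`.

```python
import re

def clean_title(title, stop_words, stem):
    """Keep letters only, lowercase, drop stop words, stem, and re-join."""
    tokens = re.sub('[^a-zA-Z]', ' ', title).lower().split()
    return ' '.join(stem(tok) for tok in tokens if tok not in stop_words)

# Hypothetical stand-ins so the sketch runs without NLTK:
# in the notebook these would be set(stopwords.words('english')) and ps.stem.
demo_stop_words = {"the", "at", "a"}
identity_stem = lambda word: word

cleaned = clean_title(
    "Watch The Exact Moment Paul Ryan Committed Political Suicide At A Trump Rally (VIDEO)",
    demo_stop_words,
    identity_stem,
)
print(cleaned)
```

Because no stemming is applied here, the words stay unstemmed ("committed" rather than "commit"); passing `ps.stem` should give the stemmed form shown in `corpus[1]`.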

In [15]:
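# Integer-encode each title: Keras `one_hot` hashes every word to an index in [1, voc_size),
# so distinct words can collide; it does not build one-hot vectors
onehot_repr = [one_hot(words, voc_size) for words in corpus]

# Encode labels: REAL -> 1, FAKE -> 0 (element-wise assignment triggers the pandas warning below)
for i in range(len(y)):
    if y[i] == "REAL":
        y[i] = 1
    else:
        y[i] = 0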

onehot_repr[1]

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy


[6175, 5219, 5251, 2905, 1518, 868, 504, 4971, 3917, 661, 92]
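
Despite its name, Keras's `one_hot` returns one integer index per word rather than a one-hot vector. It is a hashing trick, so unrelated words can land on the same index, and with the default hash the exact numbers may differ between sessions. A minimal sketch of the idea, assuming an MD5-based hash for reproducibility; `hash_encode` is a hypothetical helper, not the Keras implementation, and its indices will not match the output above.

```python
import hashlib

def hash_encode(text, n):
    """Map each whitespace-separated word to an integer in [1, n - 1]."""
    indices = []
    for word in text.split():
        digest = int(hashlib.md5(word.encode("utf-8")).hexdigest(), 16)
        indices.append(digest % (n - 1) + 1)  # index 0 stays free for padding
    return indices

voc_size = 6256
encoded = hash_encode(
    "watch exact moment paul ryan commit polit suicid trump ralli video", voc_size
)
print(len(encoded), min(encoded) >= 1, max(encoded) < voc_size)
```

Reserving index 0 matters because the padding step fills short titles with zeros.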

In [16]:
y

0       0
1       0
2       1
3       0
4       1
       ..
6330    1
6331    0
6332    0
6333    1
6334    1
Name: label, Length: 6335, dtype: object
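
The `dtype: object` above comes from overwriting string labels one element at a time. A vectorized mapping is one way to get an integer column without the pandas warning. This is a sketch on a small hypothetical Series holding the first five labels of the dataset; `label_to_int` is an illustrative name.

```python
import pandas as pd

# First five labels from df.head(), used here as a stand-in for df['label']
labels = pd.Series(["FAKE", "FAKE", "REAL", "FAKE", "REAL"], name="label")

# Map each string label to an integer in one vectorized step
label_to_int = {"REAL": 1, "FAKE": 0}
encoded_labels = labels.map(label_to_int).astype("int64")

print(encoded_labels.tolist(), encoded_labels.dtype)
```

With an integer dtype, `np.array(y)` would already be numeric before it is converted to a tensor.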

In [17]:
list_len = [len(i) for i in onehot_repr]
print(max(list_len))

26


### Embedding Representation

In [18]:
sent_length=26
embedded_docs=pad_sequences(onehot_repr,padding='pre',maxlen=sent_length)

print(embedded_docs)

[[   0    0    0 ... 1827 2980 3132]
 [   0    0    0 ... 3917  661   92]
 [   0    0    0 ...  405 3345 4753]
 ...
 [   0    0    0 ... 1635 4537 4220]
 [   0    0    0 ... 5478 5439 3291]
 [   0    0    0 ... 1961 3917   36]]


In [19]:
embedded_docs[0]

array([   0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0, 1827, 2980, 3132], dtype=int32)
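
With `padding='pre'`, `pad_sequences` adds zeros on the left until every row has `maxlen` entries. A pure-Python sketch of that behaviour, assuming the default truncation, which also cuts from the front; `pad_pre` is a hypothetical helper, not part of Keras.

```python
def pad_pre(seq, maxlen, value=0):
    """Left-pad seq with value up to maxlen; keep only the last maxlen items if longer."""
    trimmed = list(seq)[-maxlen:]
    return [value] * (maxlen - len(trimmed)) + trimmed

# The three encoded words of the first title, padded to sent_length = 26
padded = pad_pre([1827, 2980, 3132], 26)
print(padded)
```

Pre-padding keeps the real tokens at the end of the sequence, closest to the LSTM's final state.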

In [20]:
## Creating model
embedding_vector_features=40
model=Sequential()
model.add(Embedding(voc_size,embedding_vector_features,input_length=sent_length))
model.add(LSTM(100))
model.add(Dense(1,activation='sigmoid'))
model.compile(loss='binary_crossentropy',optimizer='adam',metrics=['accuracy'])
print(model.summary())

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, 26, 40)            250240    
_________________________________________________________________
lstm (LSTM)                  (None, 100)               56400     
_________________________________________________________________
dense (Dense)                (None, 1)                 101       
Total params: 306,741
Trainable params: 306,741
Non-trainable params: 0
_________________________________________________________________
None
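
The parameter counts in the summary can be checked by hand. The sketch below uses the standard formulas for an embedding table, an LSTM layer (four gates, each with input weights, recurrent weights, and a bias), and a dense layer. The variable names are illustrative.

```python
voc_size = 6256        # vocabulary size used by the Embedding layer
embedding_dim = 40     # embedding_vector_features
lstm_units = 100

# Embedding: one embedding_dim-sized vector per vocabulary index
embedding_params = voc_size * embedding_dim

# LSTM: 4 gates, each with input weights, recurrent weights, and a bias
lstm_params = 4 * (lstm_units * (embedding_dim + lstm_units) + lstm_units)

# Dense(1): one weight per LSTM unit plus a bias
dense_params = lstm_units * 1 + 1

total_params = embedding_params + lstm_params + dense_params
print(embedding_params, lstm_params, dense_params, total_params)
# 250240 56400 101 306741
```

Over 80% of the weights sit in the embedding table, so `voc_size` largely determines the model size.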


In [21]:
len(embedded_docs),y.shape

(6335, (6335,))

In [22]:
X_final=np.array(embedded_docs)
y_final=np.array(y)

In [23]:
X_final.shape,y_final.shape

((6335, 26), (6335,))

### Splitting Dataset into Train and Test 

In [24]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X_final, y_final, test_size=0.33, random_state=0)
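
The split sizes follow from `test_size=0.33`. The sketch below assumes the test count is rounded up, which agrees with the 2,091 test samples in the confusion matrix. It also works out how many batches one epoch takes at the batch size of 64 used for training.

```python
import math

n_samples = 6335
test_size = 0.33
batch_size = 64

n_test = math.ceil(test_size * n_samples)   # assumed rounding: test count rounded up
n_train = n_samples - n_test
steps_per_epoch = math.ceil(n_train / batch_size)

print(n_train, n_test, steps_per_epoch)
# 4244 2091 67
```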

In [25]:
y_train1 =tf.convert_to_tensor(y_train, dtype=tf.int64)
y_test1 =tf.convert_to_tensor(y_test, dtype=tf.int64)

### Model Training

In [26]:
# Training
model.fit(X_train,y_train1 ,validation_data=(X_test,y_test1),epochs=50,batch_size=64)

Epoch 1/50
Epoch 2/50
Epoch 3/50
Epoch 4/50
Epoch 5/50
Epoch 6/50
Epoch 7/50
Epoch 8/50
Epoch 9/50
Epoch 10/50
Epoch 11/50
Epoch 12/50
Epoch 13/50
Epoch 14/50
Epoch 15/50
Epoch 16/50
Epoch 17/50
Epoch 18/50
Epoch 19/50
Epoch 20/50
Epoch 21/50
Epoch 22/50
Epoch 23/50
Epoch 24/50
Epoch 25/50
Epoch 26/50
Epoch 27/50
Epoch 28/50
Epoch 29/50
Epoch 30/50
Epoch 31/50
Epoch 32/50
Epoch 33/50
Epoch 34/50
Epoch 35/50
Epoch 36/50
Epoch 37/50
Epoch 38/50
Epoch 39/50
Epoch 40/50
Epoch 41/50
Epoch 42/50
Epoch 43/50
Epoch 44/50
Epoch 45/50
Epoch 46/50
Epoch 47/50
Epoch 48/50
Epoch 49/50
Epoch 50/50


<keras.callbacks.History at 0x7f3cebe00450>

### Performance Metrics And Accuracy

In [27]:
y_pred=(model.predict(X_test) > 0.5).astype("int32")

In [28]:
from sklearn.metrics import confusion_matrix

In [29]:
# Rows are true labels, columns are predicted labels (0 = FAKE, 1 = REAL)
confusion_matrix(y_test1, y_pred)

array([[748, 279],
       [218, 846]])

In [30]:
from sklearn.metrics import accuracy_score
accuracy_score(y_test1,y_pred)

0.7623146819703491
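
Accuracy alone hides how the errors are distributed between the two classes. Precision, recall, and F1 for the REAL class can be derived from the confusion matrix above, assuming scikit-learn's layout (rows are true labels, columns are predictions) and the encoding REAL = 1, FAKE = 0.

```python
# Confusion matrix from the cell above: rows are true labels, columns are predictions
tn, fp = 748, 279   # true FAKE (0)
fn, tp = 218, 846   # true REAL (1)

total = tn + fp + fn + tp
accuracy = (tn + tp) / total
precision = tp / (tp + fp)   # of titles predicted REAL, the share that are REAL
recall = tp / (tp + fn)      # of REAL titles, the share that were found
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.4f} precision={precision:.4f} recall={recall:.4f} f1={f1:.4f}")
```

The accuracy computed this way matches the `accuracy_score` output.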

**The End**