# LSTM FAKE NEWS CLASSIFIER
## Dataset Description
- train.csv: A full training dataset with the following attributes:
    - id: unique id for a news article
    - title: the title of a news article
    - author: author of the news article
    - text: the text of the article; could be incomplete
    - label: a label that marks the article as potentially unreliable
        - 1: unreliable (fake news)
        - 0: reliable (not fake news)

- test.csv: A testing dataset with all the same attributes as train.csv, without the label.

## Process (Step-by-step analysis)
1. Importing the dataset
2. Independent and dependent features
3. Cleaning the data
    - Stemming
    - Stopword removal
4. One-hot representation
5. Padding (fix the sentence length so that the input size is fixed)
6. Model creation with Embedding layer
7. LSTM neural network
8. Final training and test labels
9. Train-test split of the data
10. Fit the data
11. Prediction on the test split
12. Performance metrics and accuracy score

In [186]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

/kaggle/input/fake-news/submit.csv
/kaggle/input/fake-news/train.csv
/kaggle/input/fake-news/test.csv


In [187]:
#Import the dataset
df = pd.read_csv("/kaggle/input/fake-news/train.csv")
df.head()

Unnamed: 0,id,title,author,text,label
0,0,House Dem Aide: We Didn’t Even See Comey’s Let...,Darrell Lucus,House Dem Aide: We Didn’t Even See Comey’s Let...,1
1,1,"FLYNN: Hillary Clinton, Big Woman on Campus - ...",Daniel J. Flynn,Ever get the feeling your life circles the rou...,0
2,2,Why the Truth Might Get You Fired,Consortiumnews.com,"Why the Truth Might Get You Fired October 29, ...",1
3,3,15 Civilians Killed In Single US Airstrike Hav...,Jessica Purkiss,Videos 15 Civilians Killed In Single US Airstr...,1
4,4,Iranian woman jailed for fictional unpublished...,Howard Portnoy,Print \nAn Iranian woman has been sentenced to...,1


In [188]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20800 entries, 0 to 20799
Data columns (total 5 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   id      20800 non-null  int64 
 1   title   20242 non-null  object
 2   author  18843 non-null  object
 3   text    20761 non-null  object
 4   label   20800 non-null  int64 
dtypes: int64(2), object(3)
memory usage: 812.6+ KB


In [189]:
#Find the null values
df.isnull().sum()

id           0
title      558
author    1957
text        39
label        0
dtype: int64

Now we have to decide what to do with the null values. Since this is a text dataset, we can't meaningfully fill in a missing title, author, or text, so we drop every row that contains a null value using df.dropna(). Note that this is not a negligible cut: it removes 2,515 of the 20,800 rows (about 12%), mostly because of missing authors, and leaves 18,285 rows for training.

In [190]:
df = df.dropna()

In [191]:
df.info()
df.isnull().sum()
#Now we have no null values

<class 'pandas.core.frame.DataFrame'>
Index: 18285 entries, 0 to 20799
Data columns (total 5 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   id      18285 non-null  int64 
 1   title   18285 non-null  object
 2   author  18285 non-null  object
 3   text    18285 non-null  object
 4   label   18285 non-null  int64 
dtypes: int64(2), object(3)
memory usage: 857.1+ KB


id        0
title     0
author    0
text      0
label     0
dtype: int64
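
The two `df.info()` outputs make it easy to quantify what `dropna()` removed. The snippet below is only an arithmetic check on the counts printed above; the variable names are illustrative and not part of the notebook.

```python
# Row counts copied from the df.info() outputs before and after df.dropna()
rows_before = 20800
rows_after = 18285

# Null counts per column copied from df.isnull().sum()
nulls_per_column = {"title": 558, "author": 1957, "text": 39}

rows_dropped = rows_before - rows_after
share_dropped = rows_dropped / rows_before
total_null_cells = sum(nulls_per_column.values())

print(f"dropped {rows_dropped} rows ({share_dropped:.1%} of the data)")
# More null cells than dropped rows means some rows were missing several fields at once
print(f"null cells: {total_null_cells}, rows dropped: {rows_dropped}")
```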

In [192]:
#Independent (input) Features
X = df.drop("label", axis=1)
X.head()

Unnamed: 0,id,title,author,text
0,0,House Dem Aide: We Didn’t Even See Comey’s Let...,Darrell Lucus,House Dem Aide: We Didn’t Even See Comey’s Let...
1,1,"FLYNN: Hillary Clinton, Big Woman on Campus - ...",Daniel J. Flynn,Ever get the feeling your life circles the rou...
2,2,Why the Truth Might Get You Fired,Consortiumnews.com,"Why the Truth Might Get You Fired October 29, ..."
3,3,15 Civilians Killed In Single US Airstrike Hav...,Jessica Purkiss,Videos 15 Civilians Killed In Single US Airstr...
4,4,Iranian woman jailed for fictional unpublished...,Howard Portnoy,Print \nAn Iranian woman has been sentenced to...


In [193]:
#Output Labels
y = df['label']
y.head()

0    1
1    0
2    1
3    1
4    1
Name: label, dtype: int64

In [194]:
X.shape,y.shape

((18285, 4), (18285,))

#### Importing the libraries needed for word embedding and the LSTM
- We are going to train the LSTM on the title column of X only

In [195]:
import tensorflow as tf
print(tf.__version__)

2.13.0


In [196]:
#For Word Embedding
from tensorflow.keras.preprocessing.text import one_hot
from tensorflow.keras.layers import Embedding
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
#For LSTM training
from tensorflow.keras.layers import LSTM, Dense

In [197]:
#Vocabulary Size
voc_size = 10000

In [198]:
messages = X.copy()
#dropna() left gaps in the index; reset it so that messages['title'][i] works for i = 0..len(messages)-1
messages.reset_index(inplace=True)
messages

Unnamed: 0,index,id,title,author,text
0,0,0,House Dem Aide: We Didn’t Even See Comey’s Let...,Darrell Lucus,House Dem Aide: We Didn’t Even See Comey’s Let...
1,1,1,"FLYNN: Hillary Clinton, Big Woman on Campus - ...",Daniel J. Flynn,Ever get the feeling your life circles the rou...
2,2,2,Why the Truth Might Get You Fired,Consortiumnews.com,"Why the Truth Might Get You Fired October 29, ..."
3,3,3,15 Civilians Killed In Single US Airstrike Hav...,Jessica Purkiss,Videos 15 Civilians Killed In Single US Airstr...
4,4,4,Iranian woman jailed for fictional unpublished...,Howard Portnoy,Print \nAn Iranian woman has been sentenced to...
...,...,...,...,...,...
18280,20795,20795,Rapper T.I.: Trump a ’Poster Child For White S...,Jerome Hudson,Rapper T. I. unloaded on black celebrities who...
18281,20796,20796,"N.F.L. Playoffs: Schedule, Matchups and Odds -...",Benjamin Hoffman,When the Green Bay Packers lost to the Washing...
18282,20797,20797,Macy’s Is Said to Receive Takeover Approach by...,Michael J. de la Merced and Rachel Abrams,The Macy’s of today grew from the union of sev...
18283,20798,20798,"NATO, Russia To Hold Parallel Exercises In Bal...",Alex Ansary,"NATO, Russia To Hold Parallel Exercises In Bal..."


Before applying the one-hot encoding we have to preprocess the text: removing stopwords and applying stemming or lemmatization. We use the nltk module for this.

## Text Preprocessing

In [199]:
import re #regular expressions, used to replace punctuation and other unnecessary symbols
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

In [200]:
from nltk.stem.porter import PorterStemmer ##stemming purpose
ps = PorterStemmer()
# lemm = WordNetLemmatizer()
# Build the stopword set once; calling stopwords.words('english') for every word is very slow.
# Requires the NLTK stopwords corpus (nltk.download('stopwords')) if it is not already installed.
stop_words = set(stopwords.words('english'))

corpus = []
for i in range(len(messages)):
    # Keep letters only, then lowercase and split into tokens
    review = re.sub('[^a-zA-Z]', ' ', messages['title'][i])
    review = review.lower().split()
    # Drop stopwords and stem the remaining tokens
    review = [ps.stem(word) for word in review if word not in stop_words]
#     review = [lemm.lemmatize(word) for word in review if word not in stop_words]
    corpus.append(' '.join(review))
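
To see what this clean-up does to a single title, here is a minimal sketch of the same steps in plain Python. It assumes a tiny hand-written stopword set (`TOY_STOPWORDS`, hypothetical) in place of NLTK's English list and skips stemming, so the result only approximates what ends up in `corpus`.

```python
import re

# Hypothetical stand-in for stopwords.words('english'); the real list is much longer
TOY_STOPWORDS = {"why", "the", "you", "a", "of", "to", "in"}

def clean_title(title):
    """Keep letters only, lowercase, split into tokens, and drop stopwords."""
    letters_only = re.sub('[^a-zA-Z]', ' ', title)
    tokens = letters_only.lower().split()
    return ' '.join(tok for tok in tokens if tok not in TOY_STOPWORDS)

print(clean_title("Why the Truth Might Get You Fired"))  # truth might get fired
```

With the Porter stemmer applied as in the notebook, each surviving token would additionally be reduced to its stem.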

In [201]:
#One Hot Representation
# print(corpus)
onehot_encode = [one_hot(title, voc_size) for title in corpus]  # one list of word indices per cleaned title
onehot_encode 

[[9961, 747, 3751, 5524, 6824, 1296, 9763, 5454, 1027, 1500],
 [5546, 3698, 6570, 1064, 2395, 6971, 5273],
 [1226, 8247, 2472, 4339],
 [1701, 3594, 1076, 142, 4788, 8056],
 [6137, 2395, 4784, 7442, 316, 3314, 2395, 1610, 5974, 7840],
 [8191,
  286,
  7090,
  547,
  6623,
  3375,
  473,
  9807,
  2057,
  9802,
  3807,
  1430,
  4267,
  2326,
  5273],
 [3437, 5404, 7007, 9083, 5472, 9642, 3140, 5647, 8733, 5649, 1740],
 [4379, 4088, 6771, 2710, 7558, 502, 3375, 9258, 8733, 5649, 1740],
 [9586, 802, 3205, 1074, 4456, 6863, 5868, 3064, 3375, 6170],
 [3117, 1319, 7167, 7898, 6533, 5757, 7497, 6369],
 [9261, 6350, 2172, 243, 449, 9621, 7385, 1995, 357, 7732, 5769],
 [142, 3994, 6824, 6863, 3375, 7558],
 [294, 7348, 2421, 9489, 8717, 4722, 2829, 2783, 4482],
 [8679, 3595, 2243, 9495, 1380, 4068, 9352, 8733, 5649, 1740],
 [386, 8598, 4926, 1662, 6904, 8733, 5649, 1740],
 [8583, 544, 92, 4484, 1528, 7040, 111, 8972, 1272, 6389],
 [3564, 3197, 3698],
 [7354, 74, 3068, 643, 3375, 7211, 5878, 5273

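What this one-hot encoding (OHE) means is:
the ---> 6930
Out of a vector of size 10000, index 6930 is 1 and the rest are zeros, and this vector represents the word "the".
the = [0,0,0,0,0,......,1,0,0,0,......]
index position of 1: 6930

The same holds for every other word. Note that Keras `one_hot` returns only the index (here 6930), not the full vector, and it assigns indices by hashing each word, so two different words can occasionally share the same index. (The word "the" is used purely as an illustration; as a stopword it was already removed from the corpus.)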
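
The index-per-word idea can be sketched without Keras. The helper below, `toy_one_hot`, is hypothetical: it imitates hashing each word into the range 1 to `vocab_size - 1` using `hashlib.md5` so the result is repeatable, and its indices will not match the ones Keras printed above.

```python
import hashlib

def toy_one_hot(text, vocab_size):
    """Map each whitespace-separated word to an index in [1, vocab_size - 1]."""
    indices = []
    for word in text.split():
        digest = hashlib.md5(word.encode("utf-8")).hexdigest()
        # Index 0 is never produced, so it stays free to act as the padding value
        indices.append(int(digest, 16) % (vocab_size - 1) + 1)
    return indices

encoded = toy_one_hot("truth might get fire", 10000)
print(encoded)  # four integers, one per word
```

Because the mapping is a hash rather than a learned vocabulary, a larger `vocab_size` lowers the chance that two words collide on the same index.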

In [202]:
#Calculating the length of each document in the corpus
for i in range(len(corpus)):
    len_ = corpus[i].split(" ")
    print(len(len_))

10
7
4
6
10
15
11
11
10
8
...

Now we can see that the documents in the corpus have different lengths. To put them on an equal footing we use padding, so that every document in the corpus is represented by a sequence of the same length.
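
Rather than reading the lengths off a long printout, they can be summarised in a couple of lines. A minimal sketch on a made-up three-title corpus (`toy_corpus` is hypothetical; in the notebook you would use `corpus` instead):

```python
from collections import Counter

# Hypothetical cleaned titles standing in for the notebook's corpus
toy_corpus = [
    "truth might get fire",
    "civilian kill singl us airstrik identifi",
    "flynn hillari clinton",
]

# split() with no argument ignores repeated spaces and returns [] for an empty title
lengths = [len(doc.split()) for doc in toy_corpus]

print("longest title:", max(lengths))
print("length counts:", Counter(lengths))
```

One detail worth knowing: `"".split(" ")` returns `['']`, so a title that became empty after cleaning is reported with length 1 when splitting on an explicit space, whereas `split()` reports 0.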

In [203]:
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.layers import Embedding, LSTM, Dense
from tensorflow.keras.models import Sequential

### Padding

Since the title lengths computed above reach a maximum of 17 words, I have selected 20 as the maximum padding length.
So, if a title has 14 words, then after padding it has a length of 20: 6 leading zeros followed by the 14 word indices.
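
The effect of `padding='pre'` can be illustrated with a few lines of plain Python. `pad_pre` is a hypothetical helper that mimics the behaviour for a single sequence, assuming that sequences longer than `maxlen` keep their last `maxlen` items, which is how I understand the `pad_sequences` default.

```python
def pad_pre(sequence, maxlen, value=0):
    """Left-pad with `value` up to maxlen; keep only the last maxlen items if longer."""
    trimmed = sequence[-maxlen:]
    return [value] * (maxlen - len(trimmed)) + trimmed

padded = pad_pre([9961, 747, 3751], maxlen=5)
print(padded)  # [0, 0, 9961, 747, 3751]
```

Padding at the front keeps the real words next to the end of the sequence, which is the part the LSTM processes last.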

In [204]:
sent_length = 20
pre_pad = pad_sequences(onehot_encode,maxlen=sent_length,padding='pre')
pre_pad

array([[   0,    0,    0, ..., 5454, 1027, 1500],
       [   0,    0,    0, ..., 2395, 6971, 5273],
       [   0,    0,    0, ..., 8247, 2472, 4339],
       ...,
       [   0,    0,    0, ..., 8733, 5649, 1740],
       [   0,    0,    0, ..., 3480, 5537, 7230],
       [   0,    0,    0, ..., 9738, 8914, 4454]], dtype=int32)

### Model Creation
#### Embedding Representation

In [205]:
#Feature Representation
ndim = 40 #Hyperparameter: tune it according to the dataset
#This is the output dimension of the embedding, so each word index is mapped to a dense vector of 40 values

model = Sequential()
model.add(Embedding(input_dim=voc_size, output_dim=ndim, input_length=sent_length))
model.add(LSTM(units=100)) #Number of LSTM units; a hyperparameter that can be tuned to improve accuracy
model.add(Dense(1, activation='sigmoid')) #The output is binary, so a single sigmoid unit is used

model.compile(loss='binary_crossentropy',optimizer="adam",metrics=['accuracy'])

print(model.summary())

Model: "sequential_7"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_7 (Embedding)     (None, 20, 40)            400000    
                                                                 
 lstm_7 (LSTM)               (None, 100)               56400     
                                                                 
 dense_7 (Dense)             (None, 1)                 101       
                                                                 
Total params: 456501 (1.74 MB)
Trainable params: 456501 (1.74 MB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________
None
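
The parameter counts in the summary can be checked by hand. A small sketch, assuming the standard LSTM layout of four gates, each with input weights, recurrent weights, and a bias:

```python
voc_size, ndim, units = 10000, 40, 100

# Embedding: one ndim-dimensional vector per vocabulary index
embedding_params = voc_size * ndim
# LSTM: 4 gates, each with (input + recurrent) weights and a bias vector
lstm_params = 4 * ((ndim + units) * units + units)
# Dense: one weight per LSTM unit plus a bias for the single sigmoid output
dense_params = units * 1 + 1

total_params = embedding_params + lstm_params + dense_params
print(embedding_params, lstm_params, dense_params, total_params)
```

Most of the model's capacity sits in the embedding table, so `voc_size` and `ndim` dominate the model size.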


In [206]:
len(pre_pad), y.shape

(18285, (18285,))

### Final training and test labels

In [207]:
import numpy as np
X_final = np.array(pre_pad)
y_final = np.array(y)

X_final.shape, y_final.shape

((18285, 20), (18285,))

### Train-test Split


In [208]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X_final, y_final, test_size=0.25, random_state=68)
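
The resulting split sizes can be worked out in advance. The sketch below assumes that `train_test_split` rounds a fractional test size up to the next whole sample, which is my understanding of its behaviour and agrees with the support of 4572 shown in the classification report further down.

```python
import math

n_samples = 18285   # rows in X_final
test_size = 0.25

# Assumption: the test share is rounded up and the remainder goes to training
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(n_train, n_test)  # 13713 4572
```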

In [209]:
model.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=10)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.src.callbacks.History at 0x7faec552e980>

### Performance Metrics and Accuracy Score

In [210]:
y_pred = model.predict(X_test)



In [211]:
y_pred = np.where(y_pred > 0.6, 1, 0) #Classification threshold of 0.6; an ROC curve can be used to choose this value
#Elements of y_pred greater than 0.6 become 1; elements less than or equal to 0.6 become 0.
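
To make the thresholding rule concrete, here is a plain-Python sketch on made-up probabilities. `apply_threshold` and `toy_probs` are hypothetical and only illustrate the strict `>` comparison used above.

```python
def apply_threshold(probabilities, threshold=0.6):
    """Return 1 where the probability is strictly greater than the threshold, else 0."""
    return [1 if p > threshold else 0 for p in probabilities]

toy_probs = [0.10, 0.55, 0.60, 0.61, 0.95]  # made-up sigmoid outputs

print(apply_threshold(toy_probs))        # [0, 0, 0, 1, 1]
print(apply_threshold(toy_probs, 0.5))   # [0, 1, 1, 1, 1]
```

Raising the threshold makes the model more reluctant to flag an article as fake, which typically trades recall on class 1 for precision.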

In [212]:
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report
confusion_matrix(y_test,y_pred)

array([[2356,  218],
       [ 181, 1817]])

In [213]:
accuracy_score(y_test,y_pred)

0.9127296587926509

In [214]:
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.93      0.92      0.92      2574
           1       0.89      0.91      0.90      1998

    accuracy                           0.91      4572
   macro avg       0.91      0.91      0.91      4572
weighted avg       0.91      0.91      0.91      4572
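
The figures in the report can be re-derived from the confusion matrix alone. A short check, assuming the usual scikit-learn layout where rows are true labels and columns are predicted labels:

```python
# Confusion matrix from above: rows are true labels (0, 1), columns are predictions (0, 1)
tn, fp = 2356, 218
fn, tp = 181, 1817

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision_fake = tp / (tp + fp)   # of the articles flagged as fake, how many really are
recall_fake = tp / (tp + fn)      # of the truly fake articles, how many were flagged
f1_fake = 2 * precision_fake * recall_fake / (precision_fake + recall_fake)

print(f"{accuracy:.4f} {precision_fake:.2f} {recall_fake:.2f} {f1_fake:.2f}")
# 0.9127 0.89 0.91 0.90
```

These match the accuracy score and the class 1 row of the classification report.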


