# (1) LSTM Model - Lifferth (2018) Dataset

Implementing an LSTM model
> REF https://www.kaggle.com/code/jsvishnuj/fakenews-detection-using-lstm-neural-network 

> Lifferth's (2018) dataset: https://kaggle.com/competitions/fake-news 

This model is run as an initial trial to gauge how well my laptop copes with training and how long a training run takes, as a baseline for future trials and testing with other datasets and model types.

This code is essentially a trial version of an LSTM model, written to demonstrate an understanding of how the model is implemented in code. The referenced code is edited where necessary.

The LSTM is implemented alongside a GRU (Gated Recurrent Unit) model, a type of RNN that has advantages over LSTMs in certain cases. A GRU typically uses less memory and trains faster than an LSTM; however, LSTMs tend to be more accurate on datasets with longer sequences.
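The memory difference can be made concrete by counting trainable weights. The sketch below uses the standard parameter-count formulas for the two layer types; the helper names `lstm_params` and `gru_params` are hypothetical, and `reset_after=True` is assumed because it is the Keras default in TensorFlow 2. The sizes (120 input features, 120 units) are the ones used for the models later in this notebook.

```python
def lstm_params(input_dim, units):
    # An LSTM has 4 gate blocks, each with input weights,
    # recurrent weights and one bias vector.
    return 4 * (units * (input_dim + units) + units)


def gru_params(input_dim, units, reset_after=True):
    # A GRU has 3 gate blocks. With reset_after=True, Keras keeps
    # separate input and recurrent biases, hence 2 * units per block.
    bias = 2 * units if reset_after else units
    return 3 * (units * (input_dim + units) + bias)


lstm_count = lstm_params(120, 120)
gru_count = gru_params(120, 120)
print(lstm_count, gru_count)  # 115680 87120
```

The GRU figure agrees with the 87120 reported for the GRU layer by `gru_model.summary()` further down, so under these assumptions the GRU layer carries roughly three quarters of the weights of the equivalent LSTM layer.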

In [1]:
pip install scikit-learn

Note: you may need to restart the kernel to use updated packages.


In [2]:
pip install matplotlib

Note: you may need to restart the kernel to use updated packages.


In [3]:
# importing necessary libraries 
import pandas as pd
import tensorflow as tf
import os
import re
import numpy as np
from string import punctuation
from zipfile import ZipFile
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt

In [4]:
# importing neural network libraries
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout, Embedding, GRU, LSTM, RNN, SpatialDropout1D

In [5]:
# loading the train and test data, then working on copies of each.

train = pd.read_csv(r'C:\Users\luoco\Documents\Dissertation-Project\fake-news-data\LSTM-data\train.csv\train.csv')
test = pd.read_csv(r'C:\Users\luoco\Documents\Dissertation-Project\fake-news-data\LSTM-data\test.csv\test.csv')
train_data = train.copy()
test_data = test.copy()

In [6]:
# using the id column as the index, which removes it from the data columns
train_data = train_data.set_index('id', drop = True)

In [7]:
#printing shape and showing df headings of the training set
print(train_data.shape)
train_data.head()

(20800, 4)


id,title,author,text,label
0,House Dem Aide: We Didn’t Even See Comey’s Let...,Darrell Lucus,House Dem Aide: We Didn’t Even See Comey’s Let...,1
1,"FLYNN: Hillary Clinton, Big Woman on Campus - ...",Daniel J. Flynn,Ever get the feeling your life circles the rou...,0
2,Why the Truth Might Get You Fired,Consortiumnews.com,"Why the Truth Might Get You Fired October 29, ...",1
3,15 Civilians Killed In Single US Airstrike Hav...,Jessica Purkiss,Videos 15 Civilians Killed In Single US Airstr...,1
4,Iranian woman jailed for fictional unpublished...,Howard Portnoy,Print \nAn Iranian woman has been sentenced to...,1


In [8]:
# printing shape and showing df headings of the test set, to check that they differ from the training set.
print(test_data.shape)
test_data.head()

(5200, 4)


,id,title,author,text
0,20800,"Specter of Trump Loosens Tongues, if Not Purse...",David Streitfeld,"PALO ALTO, Calif. — After years of scorning..."
1,20801,Russian warships ready to strike terrorists ne...,,Russian warships ready to strike terrorists ne...
2,20802,#NoDAPL: Native American Leaders Vow to Stay A...,Common Dreams,Videos #NoDAPL: Native American Leaders Vow to...
3,20803,"Tim Tebow Will Attempt Another Comeback, This ...",Daniel Victor,"If at first you don’t succeed, try a different..."
4,20804,Keiser Report: Meme Wars (E995),Truth Broadcast Network,42 mins ago 1 Views 0 Comments 0 Likes 'For th...


In [9]:
#checking for missing values
train_data.isnull().sum()

title      558
author    1957
text        39
label        0
dtype: int64

In [10]:
# filling missing title and author values with 'Missing', then dropping the remaining rows with missing values (those with no text).
train_data[['title', 'author']] = train_data[['title', 'author']].fillna(value = 'Missing')
train_data = train_data.dropna()
train_data.isnull().sum()

title     0
author    0
text      0
label     0
dtype: int64

In [11]:
# adding a 'length' column holding the number of characters in each text.
train_data['length'] = train_data['text'].astype(str).str.len()
train_data.head()

id,title,author,text,label,length
0,House Dem Aide: We Didn’t Even See Comey’s Let...,Darrell Lucus,House Dem Aide: We Didn’t Even See Comey’s Let...,1,4930
1,"FLYNN: Hillary Clinton, Big Woman on Campus - ...",Daniel J. Flynn,Ever get the feeling your life circles the rou...,0,4160
2,Why the Truth Might Get You Fired,Consortiumnews.com,"Why the Truth Might Get You Fired October 29, ...",1,7692
3,15 Civilians Killed In Single US Airstrike Hav...,Jessica Purkiss,Videos 15 Civilians Killed In Single US Airstr...,1,3237
4,Iranian woman jailed for fictional unpublished...,Howard Portnoy,Print \nAn Iranian woman has been sentenced to...,1,938


In [12]:
min(train_data['length']), max(train_data['length']), round(sum(train_data['length'])/len(train_data['length']))

(1, 142961, 4553)

In [13]:
len(train_data[train_data['length'] < 50])

207

In [14]:
train_data['text'][train_data['length'] < 50]

id
82                                                   
169                                                  
173                                   Guest   Guest  
196            They got the heater turned up on high.
295                                                  
                             ...                     
20350                         I hope nobody got hurt!
20418                                 Guest   Guest  
20431    \nOctober 28, 2016 The Mothers by stclair by
20513                                                
20636                              Trump all the way!
Name: text, Length: 207, dtype: object

In [15]:
# dropping the very short texts (fewer than 50 characters), which are mostly empty or noise
train_data = train_data.drop(train_data['text'][train_data['length'] < 50].index, axis = 0)

In [16]:
min(train_data['length']), max(train_data['length']), round(sum(train_data['length'])/len(train_data['length']))

(50, 142961, 4598)

In [17]:
max_features = 4500

In [18]:
# Tokenizing the text, i.e. converting each word into an integer index.
# We don't need to remove punctuation explicitly: the Tokenizer's filters option does this.
tokenizer = Tokenizer(num_words = max_features, filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n', lower = True, split = ' ')
tokenizer.fit_on_texts(texts = train_data['text'])
X = tokenizer.texts_to_sequences(texts = train_data['text'])
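As a rough illustration of what the tokenizer does, the sketch below builds a frequency-ranked word index and maps texts to integer sequences in plain Python. It is a simplified stand-in, not the Keras implementation: the helper names `split_words`, `fit_word_index` and `to_sequences` are hypothetical, and the example sentences are made up.

```python
from collections import Counter

# Same characters as the filters argument used above.
FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'


def split_words(text):
    # Lowercase, turn every filtered character into a space, then split.
    table = str.maketrans({ch: ' ' for ch in FILTERS})
    return text.lower().translate(table).split()


def fit_word_index(texts):
    # Most frequent word gets index 1; index 0 is left free for padding.
    counts = Counter(w for t in texts for w in split_words(t))
    return {w: i for i, (w, _) in enumerate(counts.most_common(), start=1)}


def to_sequences(texts, word_index, num_words):
    # Words whose index is num_words or higher are dropped from the output.
    return [[word_index[w] for w in split_words(t)
             if w in word_index and word_index[w] < num_words]
            for t in texts]


texts = ["The cat sat.", "The cat ran!", "A dog"]
word_index = fit_word_index(texts)
sequences = to_sequences(texts, word_index, num_words=4)
print(sequences)  # [[1, 2, 3], [1, 2], []]
```

With `num_words=4` only the three most frequent words survive, which mirrors how `max_features` caps the vocabulary in the cell above.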

In [19]:
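# now padding (or truncating) the sequences so that they all have the same length.
# Note: maxlen reuses max_features, so every sequence ends up 4500 tokens long.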
X = pad_sequences(sequences = X, maxlen = max_features, padding = 'pre')
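The padding step can be sketched in plain Python as follows. This is a minimal illustration that assumes the `pad_sequences` defaults for everything except `padding`, in particular that over-long sequences lose tokens from the front; `pad_pre` is a hypothetical helper name. Sequence length and vocabulary size are independent quantities, so a separate, smaller length could be chosen; the long training time reported below may be partly due to every article being padded to 4500 tokens.

```python
def pad_pre(sequences, maxlen, value=0):
    # Assumes maxlen > 0.
    padded = []
    for seq in sequences:
        # Keep only the last maxlen tokens of an over-long sequence.
        seq = list(seq)[-maxlen:]
        # Fill the front with the padding value up to maxlen.
        padded.append([value] * (maxlen - len(seq)) + seq)
    return padded


padded = pad_pre([[1, 2, 3], [4]], maxlen=2)
print(padded)  # [[2, 3], [0, 4]]
```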

In [20]:
print(X.shape)
y = train_data['label'].values
print(y.shape)

(20554, 4500)
(20554,)


In [21]:
# splitting the training data into training and validation sets.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 101)

In [22]:
# LSTM Neural Network
lstm_model = Sequential(name = 'lstm_nn_model')
lstm_model.add(layer = Embedding(input_dim = max_features, output_dim = 120, name = '1st_layer'))
lstm_model.add(layer = LSTM(units = 120, dropout = 0.2, recurrent_dropout = 0.2, name = '2nd_layer'))
lstm_model.add(layer = Dropout(rate = 0.5, name = '3rd_layer'))
lstm_model.add(layer = Dense(units = 120,  activation = 'relu', name = '4th_layer'))
lstm_model.add(layer = Dropout(rate = 0.5, name = '5th_layer'))
lstm_model.add(layer = Dense(units = len(set(y)),  activation = 'sigmoid', name = 'output_layer'))
# compiling the model
lstm_model.compile(optimizer = 'adam', loss = 'sparse_categorical_crossentropy', metrics = ['accuracy'])

In [23]:
lstm_model_fit = lstm_model.fit(X_train, y_train, epochs = 1)



#### One epoch (cycle) took around 48hrs.
#### loss: 0.4574 
#### accuracy: 0.8016 
Both the accuracy and the loss leave room for improvement.

# ___________________________________________________________________

### GRU model implementation

In [38]:
# GRU neural Network
gru_model = Sequential(name = 'gru_nn_model')
gru_model.add(layer = Embedding(input_dim = max_features, output_dim = 120, name = '1st_layer'))
gru_model.add(layer = GRU(units = 120, dropout = 0.2, 
                          recurrent_dropout = 0.2, recurrent_activation = 'relu', 
                          activation = 'relu', name = '2nd_layer'))
gru_model.add(layer = Dropout(rate = 0.4, name = '3rd_layer'))
gru_model.add(layer = Dense(units = 120, activation = 'relu', name = '4th_layer'))
gru_model.add(layer = Dropout(rate = 0.2, name = '5th_layer'))
gru_model.add(layer = Dense(units = len(set(y)), activation = 'softmax', name = 'output_layer'))
# compiling the model
gru_model.compile(optimizer = 'adam', loss = 'sparse_categorical_crossentropy', metrics = ['accuracy'])

In [39]:
gru_model.summary()

Model: "gru_nn_model"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 1st_layer (Embedding)       (None, None, 120)         540000    
                                                                 
 2nd_layer (GRU)             (None, 120)               87120     
                                                                 
 3rd_layer (Dropout)         (None, 120)               0         
                                                                 
 4th_layer (Dense)           (None, 120)               14520     
                                                                 
 5th_layer (Dropout)         (None, 120)               0         
                                                                 
 output_layer (Dense)        (None, 2)                 242       
                                                                 
Total params: 641882 (2.45 MB)
Trainable params: 64188

In [40]:
gru_model_fit = gru_model.fit(X_train, y_train, epochs = 1)



#### Again, took around 36-48hrs 
(Finished overnight, so this is an estimate rather than an exact time for one epoch of training.)
#### loss: nan 
#### accuracy: 0.5576
Accuracy is a lot worse on the same dataset as before. A loss of nan indicates that training diverged numerically, so this accuracy figure says little about what a GRU can achieve here; the unbounded relu activations used in the GRU layer are a likely cause. More epochs alone are therefore unlikely to improve the result, and for reasons of time and relevance to the remaining work, I will not run the GRU model again.
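One plausible explanation for the nan loss is that relu is unbounded, so values fed back through a recurrent layer can grow without limit, whereas tanh and sigmoid keep them in a fixed range. The toy below illustrates only that effect. It is a scalar recurrence with made-up numbers, not the GRU equations, and `run_recurrence` is a hypothetical helper.

```python
import math


def run_recurrence(activation, weight=1.5, steps=2000):
    # Repeatedly feed the state back through the activation,
    # as a recurrent layer does across time steps.
    h = 1.0
    for _ in range(steps):
        h = activation(weight * h + 1.0)
    return h


def relu(v):
    return max(0.0, v)


h_relu = run_recurrence(relu)       # grows until it overflows to inf
h_tanh = run_recurrence(math.tanh)  # stays inside (-1, 1)

print(math.isinf(h_relu), abs(h_tanh) < 1.0)  # True True
# Arithmetic on the overflowed value then produces nan.
print(math.isnan(h_relu - h_relu))  # True
```

If the GRU were rerun, the default activations (tanh and sigmoid) would be the first thing to try, though this notebook does not test that.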

In [41]:
print(test.shape)
test_data = test.copy()
print(test_data.shape)

(5200, 4)
(5200, 4)


In [42]:
test_data = test_data.set_index('id', drop = True)
test_data.shape

(5200, 3)

In [43]:
test_data = test_data.fillna(' ')
print(test_data.shape)
test_data.isnull().sum()

(5200, 3)


title     0
author    0
text      0
dtype: int64

In [44]:
# reusing the tokenizer fitted on the training text: refitting it on the test text would change the word indices the model was trained on.
test_text = tokenizer.texts_to_sequences(texts = test_data['text'])
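One thing to watch when tokenizing the test set is that the integer indices only mean the same thing to the model if the vocabulary fitted on the training text is reused unchanged. The sketch below shows how a frequency-ranked index shifts when further text is added to the counts; it is a simplified stand-in for the tokenizer with made-up sentences, and `rank_words` is a hypothetical helper.

```python
from collections import Counter


def rank_words(counts):
    # Most frequent word gets index 1, the next gets 2, and so on.
    return {w: i for i, (w, _) in enumerate(counts.most_common(), start=1)}


# Index built from the "training" text only.
counts = Counter("fake news fake story".split())
train_index = rank_words(counts)

# Adding "test" text to the same counts and re-ranking moves the indices.
counts.update("story story story report".split())
refit_index = rank_words(counts)

print(train_index["fake"], refit_index["fake"])  # 1 2
```

After the refit, the index 1 that the model learned to associate with "fake" now points at "story", so predictions on the re-indexed text would no longer be reliable.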

In [45]:
test_text = pad_sequences(sequences = test_text, maxlen = max_features, padding = 'pre')

In [46]:
#original code
# lstm_prediction = lstm_model.predict_classes(test_text)
# issue fix ref: https://github.com/keras-team/keras/issues/15838


lstm_prediction = lstm_model.predict(test_text)



In [47]:
lstm_prediction.shape

(5200, 2)

In [48]:
test_data.index.shape

(5200,)

In [50]:
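# The original run used 'label': len(lstm_prediction), which put the row count (5200) in every row, as the output of the next cell shows.
# Taking the argmax over the two class probabilities gives the predicted label instead.
submission = pd.DataFrame({'id': test_data.index, 'label': lstm_prediction.argmax(axis = 1)})
#submission.shape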

In [51]:
submission.head()

,id,label
0,20800,5200
1,20801,5200
2,20802,5200
3,20803,5200
4,20804,5200


In [52]:
submission.to_csv('submission.csv', index = False)

In [63]:
from sklearn.metrics import classification_report, accuracy_score
#https://github.com/Shaon2221/Fake-News-Detection-using-LSTM/blob/master/Fake%20News%20Detection%20using%20LSTM.ipynb

In [69]:
# the model outputs one probability per class, so the predicted label is the index of the larger one
y_pred = np.argmax(lstm_model.predict(X_test), axis = 1)
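Because the model returns two probabilities per row, the predictions need to be reduced to a single label per row before they can be scored against `y_test`. The sketch below shows that reduction and an accuracy calculation in plain Python; the probabilities and labels are made-up values, and `to_labels` and `accuracy` are hypothetical helpers standing in for `np.argmax` and `accuracy_score`.

```python
def to_labels(probabilities):
    # Index of the largest probability in each row is the predicted class.
    return [max(range(len(row)), key=row.__getitem__) for row in probabilities]


def accuracy(y_true, y_pred):
    # Fraction of rows where the predicted label matches the true label.
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


probs = [[0.9, 0.1], [0.2, 0.8], [0.6, 0.4], [0.3, 0.7]]
true_labels = [0, 1, 1, 1]

pred_labels = to_labels(probs)
score = accuracy(true_labels, pred_labels)
print(pred_labels, score)  # [0, 1, 0, 1] 0.75
```

The same one-dimensional label array is the shape that `classification_report` and `accuracy_score` expect.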

