<a href="https://colab.research.google.com/github/Apratimx/Fake-News-Detector-LSTM/blob/main/Fake_News_Detector_LSTM.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# <font color='#badc58'> Fake News Detector with LSTM<br> </font>  
# <font color='#3dc1d3'>  
1.  Preprocess the data
2.  Encode the titles with one_hot
3.  Create the LSTM model
4.  Observe how the shape changes (flattening and re-shaping) from the embedding layer to the dense layer <br>
5.  Observe the total weight matrix size of the LSTM and verify it mathematically

## <font color='#f9ca24'> LSTM 
 <font color='#00BFEB'>`fit` trains on the training data. `fit_transform` joins the fit and transform steps: it is used for the initial fitting of parameters on the training set 𝑥, while also returning the transformed 𝑥′. Internally, the transformer object simply calls fit() and then transform() on the same data.<br>An output shape of (20000, 5) means that each document has 5 columns, where each column holds the probability of a particular topic. 

In [1]:
import re
import pandas as pd
import nltk
import numpy as np
import tensorflow as tf
from tensorflow.keras.layers import Embedding, Dense, LSTM, Dropout 
from tensorflow.keras.models import Sequential
from tensorflow.keras.utils import plot_model
from tensorflow.keras.preprocessing.text import one_hot
from tensorflow.keras.preprocessing.sequence import pad_sequences 
from sklearn.metrics import classification_report 
from nltk.stem import WordNetLemmatizer 
from nltk.corpus import stopwords
nltk.download('stopwords')
nltk.download('wordnet')


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.


True

In [None]:
from google.colab import files
upload = files.upload()

Saving fakenews.zip to fakenews.zip


In [None]:
!unzip fakenews.zip

In [None]:
df = pd.read_csv("train.csv")
df.head()

In [None]:
df['title'][0]

In [None]:
df['label'][100]

In [None]:
df.shape

In [None]:
df.isnull().sum()

<font color='#badc58'>drop missing data</font>  <br/> 

In [None]:
df = df.dropna(subset=['title','text'])
df.isnull().sum()

<font color='#badc58'>Preparing to create the model


In [None]:
x = df.drop(columns='label')
y = df['label']

In [None]:
x.shape, y.shape  # array dimensions

<font color='#7ed6df'>Data Preprocessing

In [None]:
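# DataFrame.copy() makes a deep copy by default, so edits to copy leave x untouched
copy = x.copy()
# dropna left gaps in the index; reset it so copy['title'][i] works for every i in range(len(copy))
copy.reset_index(inplace=True)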

In [None]:
ws = WordNetLemmatizer()
stop_words = set(stopwords.words("english"))  # build the stop-word set once, not once per word
list_titles = []
for i in range(0, len(copy)):
  # [^a-zA-Z] matches any character that is not a letter; each match is replaced with a space.
  # A ^ inside [] negates the set, whereas a ^ outside [] anchors the match to the start of the string.
  headline = re.sub('[^a-zA-Z]', ' ', copy['title'][i])
  headline = headline.lower()
  headline = headline.split()
  headline = [ws.lemmatize(word) for word in headline if word not in stop_words]
  headline = ' '.join(headline)
  list_titles.append(headline)  # append the cleaned title to the list

In [None]:
list_titles[:4]
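
The regex step in the cleaning loop can be tried on its own. The sketch below uses a made-up headline (not taken from the dataset) and leaves out lemmatization and stop-word removal so that it runs without NLTK data.

```python
import re

# Hypothetical headline, invented for this illustration only.
raw = "Breaking: 5 Senators' votes changed in 2016!"

# Inside [], a leading ^ negates the set: every non-letter becomes a space.
letters_only = re.sub('[^a-zA-Z]', ' ', raw)

# Outside [], ^ anchors the pattern to the start of the string instead.
starts_with_letter = re.match('^[a-zA-Z]', raw) is not None

# Lower-case and split on whitespace, as the loop above does.
tokens = letters_only.lower().split()
print(tokens)
print(starts_with_letter)
```

Digits and punctuation disappear because they fall outside `a-zA-Z`, which is why titles that consist mostly of numbers can end up very short after cleaning.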

In [None]:
#for i in range(0, len(messages)):
    #print('\n', messages['title'][i])

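Map each word to an integer index in a vocabulary of size vocab. one_hot hashes the words, so two different words can end up with the same index.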

In [None]:
vocab = 10000
hot_title = [one_hot(i, vocab) for i in list_titles]
hot_title[:4]
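
Despite its name, Keras `one_hot` returns a list of integer word indices produced by hashing, not one-hot vectors. The sketch below imitates that idea in plain Python. It is an illustration under assumptions: `hashed_index` and `one_hot_sketch` are hypothetical helpers, and MD5 is swapped in for reproducibility, so the indices will not match what `one_hot` prints in the notebook.

```python
import hashlib

def hashed_index(word, n):
    """Map a word to a stable integer in [1, n - 1]; 0 stays free for padding."""
    digest = hashlib.md5(word.encode("utf-8")).hexdigest()
    return int(digest, 16) % (n - 1) + 1

def one_hot_sketch(text, n):
    """Encode a whitespace-separated text as a list of hashed word indices."""
    return [hashed_index(word, n) for word in text.lower().split()]

vocab = 10000
encoded = one_hot_sketch("senator vote senator", vocab)
print(encoded)  # the first and third index are equal because the word repeats
```

Because the mapping is a hash modulo the vocabulary size, unrelated words may collide; a larger `vocab` makes collisions less likely at the cost of a bigger embedding matrix.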

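<font color='#7ed6df'>Longest title<br>

In [None]:
# len() on a string counts characters, so this is the longest cleaned title in characters,
# which is an upper bound on the number of words in any title
longest = len(max(list_titles, key = len))
longest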

<font color='#7ed6df'>Making every sequence the same length<br> <font color='#f9ca24'>pad_sequences </font>ensures that all sequences in a list have the same length. By default it pads with 0 at the beginning of each sequence until every sequence is as long as the longest one; when maxlen is given, sequences are padded or truncated to that length instead.

In [None]:
max_length = 356
embed_input = pad_sequences(hot_title, maxlen = max_length, padding='pre')
print(embed_input)
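
What 'pre' padding does can be seen on a toy input. The helper below, `pad_pre`, is a hypothetical plain-Python stand-in written for this illustration; it mirrors the default behaviour of `pad_sequences` (pad at the front, truncate from the front) but is not the Keras implementation.

```python
def pad_pre(sequences, maxlen, value=0):
    """Left-pad, and if necessary left-truncate, each sequence to maxlen items."""
    padded = []
    for seq in sequences:
        seq = list(seq)[-maxlen:]  # keep only the last maxlen items
        padded.append([value] * (maxlen - len(seq)) + seq)
    return padded

rows = pad_pre([[7, 3], [5, 1, 9, 4, 2]], maxlen=4)
print(rows)  # [[0, 0, 7, 3], [1, 9, 4, 2]]
```

Padding at the front keeps the real words next to the final time step, which is the state the LSTM passes on to the dense layer.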

<font color='#7ed6df'>Arguments: <br>
<font color='#f9ca24'>sequences </font>
list of lists, where each element is a sequence<br>
<font color='#f9ca24'>maxlen </font>
int, maximum length of all sequences<br>
<font color='#f9ca24'>dtype </font><font color='#7ed6df'>
type of the output sequences<br>
<font color='#f9ca24'>padding </font><font color='#7ed6df'>
'pre' or 'post': pad either before or after each sequence



<font color='#7ed6df'>Input shape of the Embedding layer<br>2D tensor with shape: (batch_size, input_length).

<font color='#f9ca24'>Creating the Model

In [None]:
model = Sequential()  # build the model incrementally via the add() method
model.add(Embedding(input_dim=vocab, output_dim=40, input_length=356))
model.add(LSTM(150))
model.add(Dense(1, activation='sigmoid'))  # sigmoid squashes the output to a value between 0 and 1
# configure the model with a loss, an optimizer and metrics
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.summary()  # summary() prints the table itself; wrapping it in print() also prints None

<font color='#7ed6df'>Sequential() is a plain stack of layers where each layer has exactly one input tensor and one output tensor; the model is built incrementally via the add() method.<br>The input of the LSTM is always a 3D array (batch_size, time_steps, features).<br>
The output of the LSTM is a 2D or a 3D array, depending on the return_sequences argument.
If return_sequences is False, the output is a 2D array (batch_size, units).
If return_sequences is True, the output is a 3D array (batch_size, time_steps, units).<br>In this model return_sequences keeps its default of False, so the LSTM output is 2D.

The Embedding layer outputs a 3D tensor with shape (batch_size, input_length, output_dim).
Change in shape through the model: (batch_size, 356) goes into the Embedding layer, (batch_size, 356, 40) comes out of it, the LSTM reduces that to (batch_size, 150), and the Dense layer returns (batch_size, 1).
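
The weight counts can be verified by hand. The sketch below assumes the standard LSTM layout with biases enabled (the Keras default): four gates, each with an input kernel, a recurrent kernel and a bias vector. The helper names are hypothetical; the layer sizes are the ones used in the model above.

```python
def embedding_params(vocab_size, embed_dim):
    # one embed_dim-sized vector per vocabulary index
    return vocab_size * embed_dim

def lstm_params(input_dim, units):
    # 4 gates x (input kernel + recurrent kernel + bias)
    return 4 * (input_dim * units + units * units + units)

def dense_params(input_dim, units):
    # weight matrix plus one bias per output unit
    return input_dim * units + units

emb = embedding_params(10000, 40)  # 10000 * 40 = 400000
lstm = lstm_params(40, 150)        # 4 * (6000 + 22500 + 150) = 114600
dense = dense_params(150, 1)       # 150 + 1 = 151
total = emb + lstm + dense
print(emb, lstm, dense, total)
```

Under these assumptions the per-layer figures should agree with the "Param #" column printed by model.summary(). Note that the sequence length of 356 does not appear anywhere in the formula, because the LSTM reuses the same weights at every time step.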


<font color='#f9ca24'>Describe model

In [None]:
plot_model(model)

In [None]:
len(embed_input),y.shape

In [None]:
x_final = np.array(embed_input)
y_final = np.array(y)
x_final.shape, y_final.shape 

In [None]:
from sklearn.model_selection import train_test_split 
x_train, x_test, y_train, y_test = train_test_split(x_final, y_final, test_size=0.33, random_state=42)

<font color='#7ed6df'>fitting the model

In [None]:
model.fit(x_train, y_train, validation_data=(x_test, y_test), epochs=10, batch_size=64) 

<font color='#7ed6df'>Describe the performance of the classification model<br>Tune the model so that the training accuracy ('acc'), the validation accuracy ('val_acc') and the final accuracy end up close to each other. It is normal for validation accuracy to be lower than training accuracy, but ideally the two stay in a similar range. If validation accuracy is much lower than training accuracy, be wary of overfitting.<br>'acc' is the accuracy on the data the model was trained on.<br>'val_acc' is the accuracy on the validation set, that is, samples that were not shown to the network during training, so it indicates how well the model generalises to cases outside the training set.
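
One way to watch for overfitting is to look at the per-epoch gap between the two accuracies. The sketch below assumes the return value of model.fit() was captured, for example as `history = model.fit(...)`, so that `history.history` is a dict of per-epoch lists. The key names 'accuracy' and 'val_accuracy' are an assumption (older Keras versions use 'acc' and 'val_acc'), `accuracy_gaps` is a hypothetical helper, and the numbers are made up for illustration; they are not results from this notebook.

```python
def accuracy_gaps(history_dict):
    """Return training accuracy minus validation accuracy for each epoch."""
    pairs = zip(history_dict["accuracy"], history_dict["val_accuracy"])
    return [round(train - val, 4) for train, val in pairs]

# Made-up values, for illustration only.
example_history = {
    "accuracy": [0.90, 0.97, 0.99],
    "val_accuracy": [0.89, 0.91, 0.90],
}
gaps = accuracy_gaps(example_history)
print(gaps)  # a gap that keeps growing suggests overfitting
```

If the gap widens from epoch to epoch while validation accuracy stalls, training for fewer epochs or adding the already imported Dropout layer are common things to try.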

In [None]:
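# predict_classes() is no longer available in recent TensorFlow releases;
# threshold the sigmoid output at 0.5 to get class labels instead
y_pred = (model.predict(x_test) > 0.5).astype("int32").ravel()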

In [None]:
from sklearn.metrics import accuracy_score
accuracy_score(y_test, y_pred)

<font color='#7ed6df'>Evaluate Performance with Confusion Matrix and Classification Report

In [None]:
from sklearn.metrics import confusion_matrix
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
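
To read the matrix, it helps to know its layout for binary labels: rows are the true classes, columns are the predicted classes, so the cells are [[TN, FP], [FN, TP]]. The sketch below rebuilds that layout in plain Python on made-up labels (not predictions from this model) and derives precision and recall for class 1; `confusion_counts` is a hypothetical helper.

```python
def confusion_counts(y_true, y_pred):
    """Return [[TN, FP], [FN, TP]] for binary labels 0 and 1."""
    tn = fp = fn = tp = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == 1 and pred == 1:
            tp += 1
        elif truth == 1 and pred == 0:
            fn += 1
        elif truth == 0 and pred == 1:
            fp += 1
        else:
            tn += 1
    return [[tn, fp], [fn, tp]]

# Made-up labels, for illustration only.
matrix = confusion_counts([0, 0, 1, 1, 1], [0, 1, 1, 1, 0])
(tn, fp), (fn, tp) = matrix
precision = tp / (tp + fp)  # share of predicted 1s that are truly 1
recall = tp / (tp + fn)     # share of true 1s that were found
print(matrix, precision, recall)
```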

In [None]:
# evaluate on the held-out split; the full dataset would include samples the model was trained on
model.evaluate(x_test, y_test)