---

<h1 style="text-align: center;font-size: 40px;color: magenta">Fake News Classifier using LSTM</h1>

---

In [1]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score,confusion_matrix


# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 5GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [2]:
df = pd.read_csv("../input/newscsv/news.csv")
df.head()

In [3]:
df['label'] = pd.factorize(df['label'])[0]
df.head()

In [4]:
df.isnull().sum()

>There are many null values, so let's drop the rows that contain them

In [5]:
df = df.dropna()

> <h4>We will use the "title" column to classify news as fake or real, so "title" is our independent variable. Our goal is to detect whether a news item is fake, so the dependent (target) variable is "label". Let's separate the independent and dependent variables.</h4>

In [6]:
x = df.drop(['label','Unnamed: 0'],axis = 1)
y = df['label']

> <h3>Now let's import the necessary libraries for the LSTM</h3>

In [7]:
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Flatten,Dense,Embedding,LSTM,Dropout
from tensorflow.keras.preprocessing.text import one_hot
from tensorflow.keras.preprocessing.sequence import pad_sequences

>Data Preprocessing

In [8]:
messages = x.copy()
# We dropped the rows with null values, so the index now has gaps; reset it so that messages['title'][i] works for every i
messages.reset_index(inplace=True)

>Import necessary libraries for data preprocessing

In [10]:
import nltk
import re
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

In [11]:
nltk.download("stopwords")

> <h3> Some important information: </h3>

- re.sub()
  - Replaces every match of a regular expression with a given string
  - Can replace several different substrings with the same string
  - Can reuse the matched part in the replacement (backreferences)
  - The related re.subn() also returns the number of replacements made
  - Here we replace every character that is not a letter (a to z or A to Z) with a space.

- result.lower()
 - Converts the whole title to lower case, so that words such as "News" and "news" are treated as the same token.

- ps.stem()
 - Stemming reduces the inflected or derived forms of a word to a common stem. Programs that do this are
   called stemming algorithms or stemmers. The Porter stemmer used here strips suffixes by rule, so the
   stem is not always a dictionary word: "retrieved" and "retrieves" both reduce to the stem "retriev".
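
The regular-expression and lower-casing steps described above can be tried on a single string with only the standard library. This is a minimal sketch: the headline is made up for illustration and is not taken from the dataset, and stemming is left out because it needs NLTK.

```python
import re

# Hypothetical headline, made up for illustration (not from the dataset).
title = "BREAKING: 3 Senators Vote on New Tax-Bill!"

# 1. Replace every character that is not a letter with a space.
letters_only = re.sub(r"[^a-zA-Z]", " ", title)

# 2. re.subn performs the same substitution and also returns the count.
_, n_replaced = re.subn(r"[^a-zA-Z]", " ", title)

# 3. A replacement can reuse the matched part through a backreference.
tagged = re.sub(r"(\d+)", r"<\1>", title)

# 4. Lowercase and split on whitespace to get the tokens.
tokens = letters_only.lower().split()

print(tokens)  # ['breaking', 'senators', 'vote', 'on', 'new', 'tax', 'bill']
print(n_replaced)
print(tagged)
```

Note that spaces also match `[^a-zA-Z]`, so they are counted as replacements even though the result looks unchanged at those positions.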

In [12]:
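ps = PorterStemmer()
# Build the stopword set once; calling stopwords.words() for every word is very slow
stop_words = set(stopwords.words("english"))
corpus = []
for i in range(len(messages)):
    # Keep letters only, convert to lower case, and split into words
    result = re.sub('[^a-zA-Z]', ' ', messages['title'][i])
    result = result.lower()
    result = result.split()

    # Drop stopwords and stem the remaining words
    result = [ps.stem(word) for word in result if word not in stop_words]
    result = " ".join(result)
    corpus.append(result)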

In [13]:
corpus

> <h3> One Hot representation</h3>

In [14]:
#Vocabulary Size
voc_size = 5000
onehot_repr = [one_hot(words,voc_size) for words in corpus]
onehot_repr
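
Despite its name, Keras' `one_hot` does not return 0/1 vectors: it returns a list of integer word indices, obtained by hashing each word into the range 1 to `voc_size - 1`. The sketch below illustrates that idea in plain Python. The helper `hash_encode` is hypothetical, and MD5 is an assumption chosen only to make the output stable; the hash Keras uses by default may give different numbers.

```python
import hashlib

def hash_encode(text, voc_size):
    """Map each word to an integer in [1, voc_size - 1] by hashing it.

    Hypothetical helper that illustrates the hashing idea; it is not
    the Keras implementation.
    """
    indices = []
    for word in text.lower().split():
        digest = hashlib.md5(word.encode("utf-8")).hexdigest()
        # Index 0 is never produced, so it stays free for padding.
        indices.append(int(digest, 16) % (voc_size - 1) + 1)
    return indices

encoded = hash_encode("senat vote new tax bill senat", voc_size=5000)
print(encoded)
```

The same word always gets the same index (the first and last entries above are equal), but two different words can collide on one index when the vocabulary size is small.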

>Embedding Representation

- pad_sequences:
 - The pad_sequences() function in the Keras deep learning library pads variable-length sequences to a common length.
   The default padding value is 0.0, which is suitable for most applications; it can be changed by specifying the preferred value via the "value" argument.
 - We use it to make all the sentences the same length. There are 2 types of padding, "pre" and
   "post": "pre" adds 0s at the front of a sequence and "post" adds 0s at the back. Sequences longer than maxlen are truncated.

In [15]:
sent_length = 20
embeded_docs = pad_sequences(onehot_repr,padding= 'pre',maxlen = sent_length)
embeded_docs

In [16]:
embeded_docs[0]
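
The effect of "pre" and "post" padding can be sketched on a single list in plain Python. The helper `pad_sequence` below is hypothetical and only mimics the behaviour described above for one sequence (including dropping items from the front when a sequence is too long, which is Keras' default truncation); it is not the Keras implementation and assumes `maxlen` is a positive integer.

```python
def pad_sequence(seq, maxlen, padding="pre", value=0):
    """Pad (or truncate) one list of integers to exactly maxlen items."""
    # Keep only the last maxlen items, mirroring the default truncating="pre".
    seq = list(seq)[-maxlen:]
    filler = [value] * (maxlen - len(seq))
    return filler + seq if padding == "pre" else seq + filler

print(pad_sequence([7, 42, 3], maxlen=5, padding="pre"))   # [0, 0, 7, 42, 3]
print(pad_sequence([7, 42, 3], maxlen=5, padding="post"))  # [7, 42, 3, 0, 0]
print(pad_sequence([1, 2, 3, 4, 5, 6], maxlen=5))          # [2, 3, 4, 5, 6]
```

With "pre" padding the real words sit at the end of the sequence, so they are the last inputs the LSTM sees before it produces its output.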

> <h3>Creating Model</h3>

In [17]:
embedding_vector_features = 40
model = Sequential()
model.add(Embedding(voc_size,embedding_vector_features,input_length = sent_length))
model.add(Dropout(0.3))
model.add(LSTM(100))
model.add(Dropout(0.3))
model.add(Dense(1,activation='sigmoid'))
model.compile(optimizer = 'adam',loss ='binary_crossentropy',metrics=['accuracy'])

In [31]:
x_final = np.array(embeded_docs)
y_final = np.array(y)
x_train,x_test,y_train,y_test = train_test_split(x_final,y_final,test_size=0.3,random_state=40)

In [32]:
model.summary()

In [33]:
history = model.fit(x_train,y_train,validation_data =(x_test,y_test),batch_size=64,epochs=10)

In [34]:
import matplotlib.pyplot as plt
def plot_learning_curve(history,epochs):
    #Accuracy
    epoch_range = range(1,epochs+1)
    plt.plot(epoch_range,history.history['accuracy'])
    plt.plot(epoch_range,history.history["val_accuracy"])
    plt.xlabel("Epochs")
    plt.ylabel("Accuracy")
    plt.legend(["Train","Val"],loc ="upper left")
    plt.show()

In [35]:
plot_learning_curve(history,10)

- So we should train the model for only about 2 epochs: by epoch 2 it already reaches its best accuracy on both the training and validation data, so additional epochs are unnecessary.
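
The epoch can also be read off programmatically instead of by eye. A minimal sketch, assuming a list of per-epoch validation accuracies such as `history.history["val_accuracy"]`; the helper `best_epoch` is hypothetical and the numbers are illustrative only, not results from this notebook.

```python
def best_epoch(val_accuracy):
    """Return the 1-based epoch with the highest validation accuracy.

    On ties, the earliest epoch wins.
    """
    best_index = max(range(len(val_accuracy)), key=lambda i: val_accuracy[i])
    return best_index + 1

# Illustrative numbers only, not results from this notebook.
example_val_accuracy = [0.78, 0.81, 0.80, 0.79, 0.79]
print(best_epoch(example_val_accuracy))  # 2
```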

>Performance Metrics & Accuracy

In [36]:
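# predict_classes() was removed in TensorFlow 2.6; threshold the sigmoid output at 0.5 instead
y_pred = (model.predict(x_test) > 0.5).astype("int32").ravel()
accuracy_score(y_test,y_pred)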

In [37]:
from mlxtend.plotting import plot_confusion_matrix
from sklearn.metrics import confusion_matrix
mat = confusion_matrix(y_test,y_pred)
plot_confusion_matrix(mat,figsize=(6,6),show_normed=True)
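
The confusion matrix contains everything needed to recompute accuracy and related metrics by hand. For binary labels, scikit-learn lays the matrix out as `[[TN, FP], [FN, TP]]` (rows are true labels, columns are predictions). A minimal sketch; the helper `metrics_from_confusion` is hypothetical and the counts are illustrative, not this model's results.

```python
def metrics_from_confusion(mat):
    """Compute accuracy, precision and recall from a 2x2 confusion matrix.

    Expects the scikit-learn layout [[TN, FP], [FN, TP]].
    """
    (tn, fp), (fn, tp) = mat
    total = tn + fp + fn + tp
    accuracy = (tp + tn) / total
    # Guard against division by zero when a class is never predicted or absent.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall

# Illustrative counts only, not results from this notebook.
example_mat = [[50, 10], [5, 35]]
accuracy, precision, recall = metrics_from_confusion(example_mat)
print(accuracy, precision, recall)
```

The same function accepts the `mat` array computed above, since a 2x2 NumPy array unpacks the same way as a nested list.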

---

<h1 style="text-align: center;font-size: 20px;color: magenta">Thanks for reading the Notebook</h1>

---