## Importing Important Libraries

## News

The term "news" refers to information about current events. It can be delivered in a variety of ways, including word of mouth, writing, postal services, broadcasting, electronic communication, and the testimony of observers and witnesses to events. War, government, politics, education, health, the environment, economy, industry, fashion, and entertainment, as well as sporting events and quirky or unusual events, are all common topics for news coverage. Technological and social advancements, often driven by government communication and espionage networks, have accelerated the spread of news and influenced its content.

## Fake News

Fake news is inaccurate or misleading content presented as news. It is sometimes used to damage the reputation of a person or entity, or to profit from advertising revenue. Fake news, once common in print, has become more prevalent with the rise of social media, especially the Facebook News Feed. The dissemination of fake news has been linked to political polarization, post-truth politics, confirmation bias, and social media algorithms. It is sometimes created and spread by hostile foreign actors, particularly during elections. The use of anonymously hosted fake news websites has made prosecuting sources of fake news for libel more difficult.

By competing with real news, fake news can reduce its impact; a BuzzFeed analysis showed that the top fake news stories about the 2016 US presidential election generated more engagement on Facebook than the top stories from major media outlets. Fake news can also erode public confidence in serious news coverage. This makes classifying fake news at an early stage an important need.

We worked with the fake-and-real-news-dataset provided by Clément Bisaillon to build deep learning models that make the classification of fake news easier.

The final analyses and deep learning models we designed are:

**1. Detection of the topics that emerge most often in the analysis of fake news.**

**2. Fake news detection using only news titles with CNN + LSTM.**

**3. Fake news classification using an RNN (LSTM) on the whole text.**

**4. Fake news classification using a CNN.**





In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

!pip install gensim # Gensim is an open-source library for unsupervised topic modeling and natural language processing
import nltk
nltk.download('punkt')
nltk.download('stopwords') # needed by stopwords.words('english') in the cleaning step

import re
import matplotlib.pyplot as plt
import seaborn as sns
from wordcloud import WordCloud
from nltk.corpus import stopwords
import gensim
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS

import plotly.express as px
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
import tensorflow as tf

import time
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score, recall_score
import matplotlib.pyplot as plt

import warnings
warnings.filterwarnings('ignore')

## Reading the Dataset

In [None]:
true_news = pd.read_csv('../input/fake-and-real-news-dataset/True.csv')
fake_news = pd.read_csv('../input/fake-and-real-news-dataset/Fake.csv')

**Creating a target variable and merging the datasets for true and false news**

In [None]:
true_news['target'] = 1
fake_news['target'] = 0
df = pd.concat([true_news, fake_news]).reset_index(drop = True)
df['complete'] = df['title'] + ' ' + df['text']
df.head()

**Checking for the null values in the data**

In [None]:
df.isnull().sum()

## Data Cleaning

In [None]:
stop_words = stopwords.words('english')
stop_words.extend(['from', 'subject', 're', 'edu', 'use','says'])
def preprocess(text):
    result = []
    for token in gensim.utils.simple_preprocess(text):
        if token not in gensim.parsing.preprocessing.STOPWORDS and len(token) > 2 and token not in stop_words:
            result.append(token)
            
    return result

In [None]:
# Transforming the unmatching subjects to the same notation
df.subject=df.subject.replace({'politics':'PoliticsNews','politicsNews':'PoliticsNews'})

**Distribution of True and Fake News**

In [None]:
# Count the articles in each class, then give the classes readable names
sub_tf_df = df.groupby('target')['title'].count().reset_index(name='Counts')
sub_tf_df['target'] = sub_tf_df['target'].replace({0: 'False', 1: 'True'})
sub_tf_df

In [None]:
fig = px.bar(sub_tf_df, x="target", y="Counts",
             color='Counts', barmode='group',
             height=400)
fig.show()

**Observation: The two classes are close to evenly represented, so no rebalancing is needed and we can treat this as a balanced dataset.**
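One way to put a number on "balanced" is to compare each class's share of the rows. The sketch below uses made-up labels (`toy_labels`) and a hypothetical helper `class_shares`, not the real dataset; on the actual frame, `df['target'].value_counts(normalize=True)` should give the same kind of summary.

```python
from collections import Counter

def class_shares(labels):
    """Return the fraction of samples that fall into each class."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}

# Hypothetical labels using the notebook's encoding: 1 = true news, 0 = fake news.
toy_labels = [1] * 48 + [0] * 52
shares = class_shares(toy_labels)
print(shares)

# Ratio of the smaller class to the larger one; values close to 1.0 mean balanced.
balance_ratio = min(shares.values()) / max(shares.values())
print(round(balance_ratio, 2))
```

A ratio well below 1.0 would be a signal to consider resampling or class weights before training.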

## Detection of the topics that are emerging most in the analysis of Fake news.

### Subjects receiving the most News Coverage

In [None]:
sub_check=df.groupby('subject').apply(lambda x:x['title'].count()).reset_index(name='Counts')
fig=px.bar(sub_check,x='subject',y='Counts',color='Counts',title='Count of News Articles by Subject')
fig.show()

**Observation: Political news and world news account for the largest share of articles in this dataset.**

## Analysis of how well news headlines predict whether an article is fake

In [None]:
df['clean_title'] = df['title'].apply(preprocess)
df['clean_title'][0]

In [None]:
df['clean_joined_title']=df['clean_title'].apply(lambda x:" ".join(x))
df.head()

In [None]:
#wordcloud for true news 
plt.figure(figsize = (20,20)) 
wc = WordCloud(max_words = 2000 , width = 1600 , height = 800 , stopwords = stop_words).generate(" ".join(df[df.target == 1].clean_joined_title))
plt.imshow(wc, interpolation = 'bilinear')

**"Official", "White House", "Trump", "China" and "North Korea" are some of the most prominent words in the real news titles.**

In [None]:
#wordcloud for fake news 
plt.figure(figsize = (20,20)) 
wc = WordCloud(max_words = 2000 , width = 1600 , height = 800 , stopwords = stop_words).generate(" ".join(df[df.target == 0].clean_joined_title))
plt.imshow(wc, interpolation = 'bilinear')

**"Video", "Obama", "Trump" and "Hillary" are some of the most prominent words in the fake news titles.**

## Let's Look at the Distribution of Word Counts in the Titles

In [None]:
maxlen = -1
for doc in df.clean_joined_title:
    tokens = nltk.word_tokenize(doc)
    if(maxlen<len(tokens)):
        maxlen = len(tokens)
print("The maximum number of words in a title is =", maxlen)
fig = px.histogram(x = [len(nltk.word_tokenize(x)) for x in df.clean_joined_title], nbins = 50)
fig.show()

**Observation: After cleaning, most titles are about 7 to 8 words long. It will be difficult to determine whether the news is real or fake from so few words, so we do not expect very high accuracy from the title alone. Let us continue with the prediction.**

## Model 1: Fake news detection using only news titles with CNN + LSTM

In [None]:
X_train, X_test, y_train, y_test = train_test_split(df.clean_joined_title, df.target, test_size = 0.2,random_state=2, stratify = df.target)

In [None]:
print(X_test.head())
print(y_test.head())

In [None]:
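# Measure each title in tokens (words) rather than characters,
# because the padded sequences are sequences of tokens
lengths = [len(tokens) for tokens in df.clean_title]
max_length = max(lengths)
print(max_length)
trunc_type = 'post'
padding_type = 'post'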



**Tokenizing and padding the titles so that all have the same length, which is the length of the longest title**
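To make the tokenize-then-pad idea concrete, here is a minimal pure-Python sketch of what the step amounts to. It is a simplified stand-in, not the Keras implementation: the helper names `build_word_index` and `to_padded_sequence` are hypothetical, ties between equally frequent words are broken alphabetically (an assumption made for determinism), and the two sample titles are made up.

```python
def build_word_index(texts, oov_token="<OOV>"):
    """Give every word an integer id, most frequent first; id 1 is reserved for unknown words."""
    counts = {}
    for text in texts:
        for word in text.split():
            counts[word] = counts.get(word, 0) + 1
    # Descending frequency, alphabetical tie-break (assumption for this sketch).
    ordered = sorted(counts, key=lambda w: (-counts[w], w))
    index = {oov_token: 1}
    for word in ordered:
        index[word] = len(index) + 1
    return index

def to_padded_sequence(text, index, max_length, oov_token="<OOV>"):
    """Map words to ids, cut off the end if too long, and fill the end with zeros if too short."""
    ids = [index.get(word, index[oov_token]) for word in text.split()]
    ids = ids[:max_length]                      # corresponds to truncating='post'
    return ids + [0] * (max_length - len(ids))  # corresponds to padding='post'

train_titles = ["trump white house", "white house video"]  # made-up cleaned titles
word_index = build_word_index(train_titles)
print(word_index)

# "obama" was never seen during fitting, so it maps to the OOV id 1.
print(to_padded_sequence("white house obama", word_index, max_length=5))
```

Id 0 is never assigned to a word, which is why it is free to act as the padding value.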

In [None]:
embedding_dim = 100
oov_tok = "<OOV>"

tokenizer = Tokenizer(oov_token=oov_tok)
tokenizer.fit_on_texts(X_train)

word_index = tokenizer.word_index
vocab_size=len(word_index)


In [None]:
sequences = tokenizer.texts_to_sequences(X_train)
padded = pad_sequences(sequences, maxlen=max_length, padding=padding_type, truncating=trunc_type)



In [None]:
test_sequences = pad_sequences(tokenizer.texts_to_sequences(X_test), maxlen=max_length, padding=padding_type, truncating=trunc_type)


In [None]:
# Use positional indexing: after the split, index label 0 may not be in the training set
print(X_train.iloc[0], y_train.iloc[0])


**We will be using CNN + LSTM, a combination generally used for tasks such as generating textual descriptions of images. In our model, the CNN is used as a feature extractor on the textual input and passes its features to the LSTM, which feeds the final classification layer.**

In [None]:
from tensorflow.keras.callbacks import EarlyStopping
overfitCallback = EarlyStopping(monitor='val_loss',
                              min_delta=0,
                              patience=5,
                              verbose=0, mode='auto')

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size+1, 15, input_length=max_length),
    tf.keras.layers.Dropout(0.3),
    tf.keras.layers.Conv1D(64, 5, activation='relu'),
    tf.keras.layers.MaxPooling1D(pool_size=4),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64)),
    tf.keras.layers.Dense(1, activation='sigmoid')
])
model.compile(loss='binary_crossentropy',optimizer='adam',metrics=['accuracy'])
model.summary()

num_epochs = 50
history = model.fit(padded, y_train, epochs=num_epochs, validation_data=(test_sequences, y_test), verbose=2, callbacks=[overfitCallback])

print("Training Complete")

**To limit overfitting, an early stopping callback was defined. The model is made up of six layers stacked in sequence. Training stopped early after 7 epochs, with a validation accuracy of 95 percent.**
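The patience rule behind early stopping can be sketched in a few lines. This is a simplified reading of the rule, not the Keras source; the function name `early_stopping_epoch` and the loss values are hypothetical.

```python
def early_stopping_epoch(val_losses, patience=5, min_delta=0.0):
    """Return the zero-based index of the last epoch that runs under a patience rule."""
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best - min_delta:
            # Validation loss improved: remember it and reset the counter.
            best = loss
            wait = 0
        else:
            # No improvement: stop once this has happened `patience` times in a row.
            wait += 1
            if wait >= patience:
                return epoch
    return len(val_losses) - 1

# Made-up validation losses: the best value is reached at epoch index 2.
toy_val_losses = [0.30, 0.20, 0.18, 0.19, 0.21, 0.20, 0.22, 0.25, 0.30]
stopped_at = early_stopping_epoch(toy_val_losses, patience=5)
print(stopped_at)
```

With `patience=5`, training continues for five epochs after the best one before giving up, so the weights at the stopping point are not necessarily the best ones unless they are restored explicitly.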

In [None]:
history_dict = history.history

acc = history_dict['accuracy']
val_acc = history_dict['val_accuracy']
loss = history_dict['loss']
val_loss = history_dict['val_loss']
epochs = history.epoch

plt.figure(figsize=(12,9))
plt.plot(epochs, loss, 'r', label='Training loss')
plt.plot(epochs, val_loss, 'b', label='Validation loss')
plt.title('Training and validation loss', size=20)
plt.xlabel('Epochs', size=20)
plt.ylabel('Loss', size=20)
plt.legend(prop={'size': 20})
plt.show()

plt.figure(figsize=(12,9))
plt.plot(epochs, acc, 'g', label='Training acc')
plt.plot(epochs, val_acc, 'b', label='Validation acc')
plt.title('Training and validation accuracy', size=20)
plt.xlabel('Epochs', size=20)
plt.ylabel('Accuracy', size=20)
plt.legend(prop={'size': 20})
plt.ylim((0.5,1))
plt.show()


## Model 2: Text Classification with RNN on whole text

In [None]:
df.head()

In [None]:
df_whole = df.loc[:,["complete","target"]]
df_whole.head()

In [None]:
df_whole['clean_text'] = df['complete'].apply(preprocess)
df_whole['clean_text'][0]

**Now we'll tokenize our data using Tensorflow's tokenizer**

In [None]:

tokenizer = Tokenizer(num_words=5000)
tokenizer.fit_on_texts(df_whole['complete'])
x_tokenized = tokenizer.texts_to_sequences(df_whole['complete'])


In [None]:
x = df_whole["complete"]
y = df_whole["target"]

In [None]:
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.20, random_state=18)

**Normalizing our data: changing it to lower case and getting rid of URLs, non-word characters, and extra spaces.**

In [None]:
def normalize(data):
    normalized = []
    for i in data:
        i = i.lower()
        # get rid of urls (raw strings avoid invalid-escape warnings)
        i = re.sub(r'https?://\S+|www\.\S+', '', i)
        # replace non-word characters (including newlines) with spaces
        i = re.sub(r'\W', ' ', i)
        i = re.sub(r'\n', '', i)
        # collapse repeated spaces and trim both ends
        i = re.sub(r' +', ' ', i)
        i = re.sub(r'^ ', '', i)
        i = re.sub(r' $', '', i)
        normalized.append(i)
    return normalized

X_train = normalize(X_train)
X_test = normalize(X_test)
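As a quick illustration of what these normalization steps do to a single string, here is a compact single-text variant. The helper name `normalize_text` and the sample sentence are made up for this sketch; the regular expressions follow the same order as in `normalize`.

```python
import re

def normalize_text(text):
    """Lowercase, drop URLs, replace non-word characters with spaces, collapse spaces."""
    text = text.lower()
    text = re.sub(r'https?://\S+|www\.\S+', '', text)  # remove URLs
    text = re.sub(r'\W', ' ', text)                    # punctuation and newlines become spaces
    text = re.sub(r' +', ' ', text)                    # collapse runs of spaces
    return text.strip(' ')

sample = "Visit https://example.com NOW!!  Breaking:\nnews"
cleaned_sample = normalize_text(sample)
print(cleaned_sample)  # visit now breaking news
```

Note that the URL is removed before punctuation is stripped; in the opposite order the URL would be broken into ordinary words and survive.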

In [None]:
#Convert text to vectors, our classifier only takes numerical data. 
max_vocab = 10000
tokenizer = Tokenizer(num_words=max_vocab)
tokenizer.fit_on_texts(X_train)

In [None]:
# tokenize the text into vectors 
X_train = tokenizer.texts_to_sequences(X_train)
X_test = tokenizer.texts_to_sequences(X_test)

**Apply padding so we have the same length for each article**

In [None]:
X_train = tf.keras.preprocessing.sequence.pad_sequences(X_train, padding='post', maxlen=256)
X_test = tf.keras.preprocessing.sequence.pad_sequences(X_test, padding='post', maxlen=256)

### Building the RNN.

**RNNs are a form of Neural Network in which the output from the previous step is used as input in the current step.**

**Here we build a Sequential model that processes sequences of tokens: it embeds each token into a 32-dimensional vector, then processes the sequence of vectors using 2 Bidirectional LSTM layers of 64 units and 16 units respectively. The layers are bidirectional because, when working with text, it is important to take into account the context on both sides of a word.**
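The recurrence and the bidirectional idea can be illustrated with a toy scalar cell. This is a plain RNN sketch under simplifying assumptions, not an LSTM (which adds gates and a cell state); the function name `run_simple_rnn`, the weights, and the input values are all made up.

```python
import math

def run_simple_rnn(inputs, w_in=0.5, w_rec=0.8):
    """Scalar recurrent cell: each state depends on the current input and the previous state."""
    hidden = 0.0
    states = []
    for x in inputs:
        hidden = math.tanh(w_in * x + w_rec * hidden)
        states.append(hidden)
    return states

sequence = [1.0, 0.0, -1.0]  # hypothetical embedded tokens, one number each

# A bidirectional layer reads the sequence left-to-right and right-to-left.
forward_states = run_simple_rnn(sequence)
backward_states = run_simple_rnn(sequence[::-1])

# The two final states summarize the sequence from both directions.
bidirectional_summary = (forward_states[-1], backward_states[-1])
print(bidirectional_summary)
```

The two directions end in different states for the same input, which is the extra context a bidirectional layer hands to the next layer.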

In [None]:
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(max_vocab, 32),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64,  return_sequences=True)),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(16)),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(1)
])

model.summary()

In [None]:
early_stop = tf.keras.callbacks.EarlyStopping(monitor='val_loss', patience=2, restore_best_weights=True)
model.compile(loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
              optimizer=tf.keras.optimizers.Adam(),
              metrics=['accuracy'])

history = model.fit(X_train, y_train, epochs=10,validation_split=0.1, batch_size=30, shuffle=True, callbacks=[early_stop])

**To limit overfitting, an early stopping callback was defined. The model is made up of six layers stacked in sequence. Training stopped early after 4 epochs, with a validation accuracy of 98 percent.**

**Visualizing our training over time**

In [None]:
history_dict = history.history

acc = history_dict['accuracy']
val_acc = history_dict['val_accuracy']
loss = history_dict['loss']
val_loss = history_dict['val_loss']
epochs = history.epoch

plt.figure(figsize=(12,9))
plt.plot(epochs, loss, 'r', label='Training loss')
plt.plot(epochs, val_loss, 'b', label='Validation loss')
plt.title('Training and validation loss', size=20)
plt.xlabel('Epochs', size=20)
plt.ylabel('Loss', size=20)
plt.legend(prop={'size': 20})
plt.show()

plt.figure(figsize=(12,9))
plt.plot(epochs, acc, 'g', label='Training acc')
plt.plot(epochs, val_acc, 'b', label='Validation acc')
plt.title('Training and validation accuracy', size=20)
plt.xlabel('Epochs', size=20)
plt.ylabel('Accuracy', size=20)
plt.legend(prop={'size': 20})
plt.ylim((0.5,1))
plt.show()

In [None]:
model.evaluate(X_test, y_test)

**Evaluation on test data gave loss: 0.0512 - accuracy: 0.9860**

In [None]:
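pred = model.predict(X_test)

# The last Dense layer has no activation and the loss uses from_logits=True,
# so `pred` holds logits. Convert them to probabilities before thresholding at 0.5.
probabilities = tf.sigmoid(pred).numpy().ravel()
binary_predictions = (probabilities >= 0.5).astype(int)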
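Since the final `Dense(1)` layer has no activation and the loss is built with `from_logits=True`, `model.predict` returns logits rather than probabilities. The sketch below, with made-up logit values, shows why the threshold matters: a probability cutoff of 0.5 corresponds to a logit cutoff of 0, not 0.5.

```python
import math

def sigmoid(z):
    """Map a logit to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

toy_logits = [-2.0, 0.3, 0.0, 1.5]  # hypothetical raw model outputs

# Thresholding probabilities at 0.5 ...
labels_from_probs = [int(sigmoid(z) >= 0.5) for z in toy_logits]
# ... is the same as thresholding logits at 0.
labels_from_logits = [int(z >= 0.0) for z in toy_logits]
# Thresholding logits at 0.5 is stricter and flips borderline cases.
labels_logits_at_half = [int(z >= 0.5) for z in toy_logits]

print(labels_from_probs)
print(labels_from_logits)
print(labels_logits_at_half)
```

In this toy case the logits 0.3 and 0.0 are labelled 1 under the probability rule but 0 when the 0.5 cutoff is applied to raw logits.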

In [None]:
# scikit-learn metrics expect the arguments in the order (y_true, y_pred)
print('Accuracy on testing set:', accuracy_score(y_test, binary_predictions))
print('Precision on testing set:', precision_score(y_test, binary_predictions))
print('Recall on testing set:', recall_score(y_test, binary_predictions))
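For reference, precision and recall can be computed by hand from true positives, false positives, and false negatives. The sketch below uses a hypothetical helper `precision_recall` and made-up labels; it also shows that swapping the two arguments exchanges the two metrics, which is why argument order matters.

```python
def precision_recall(y_true, y_pred, positive=1):
    """Compute precision and recall for the positive class from two label lists."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

toy_true = [1, 1, 1, 1, 0, 0]  # made-up ground truth
toy_pred = [1, 1, 0, 0, 0, 1]  # made-up predictions

precision, recall = precision_recall(toy_true, toy_pred)
print(precision, recall)

# Passing the arguments in the wrong order swaps the two numbers.
swapped_precision, swapped_recall = precision_recall(toy_pred, toy_true)
print(swapped_precision, swapped_recall)
```

Accuracy is unaffected by the order because it only counts matching pairs.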

In [None]:
# confusion_matrix(y_true, y_pred): rows are true labels, columns are predicted labels
matrix = confusion_matrix(y_test, binary_predictions, normalize='all')
plt.figure(figsize=(16, 10))
ax= plt.subplot()
sns.heatmap(matrix, annot=True, ax = ax)

# labels, title and ticks
ax.set_xlabel('Predicted Labels', size=20)
ax.set_ylabel('True Labels', size=20)
ax.set_title('Confusion Matrix', size=20) 
ax.xaxis.set_ticklabels([0,1], size=15)
ax.yaxis.set_ticklabels([0,1], size=15)

## Model 3:  Fake news classification on whole text Using CNN

In [None]:
# X_train was already padded to maxlen=256 above, so every sequence here has length 256
length_array = [len(s) for s in X_train]
SEQUENCE_LENGTH = int(np.quantile(length_array,0.75))
print(SEQUENCE_LENGTH)

**Building and training our convolutional neural network using Keras' Sequential API.**

In [None]:
# We add 1 because the word index runs from 1 to len(tokenizer.word_index),
# while index 0 is used for the padding token, so the vocabulary has len(tokenizer.word_index) + 1 entries
VOCAB_LENGTH = len(tokenizer.word_index) + 1
VECTOR_SIZE = 100

def getModel():
    """
    Returns a trainable Sigmoid Convolutional Neural Network
    """
    model = keras.Sequential()
    model.add(layers.Embedding(input_dim= VOCAB_LENGTH, output_dim=VECTOR_SIZE, input_length=SEQUENCE_LENGTH))
    
    model.add(layers.Conv1D(128,kernel_size=4))
    model.add(layers.BatchNormalization())
    model.add(layers.Activation("relu"))
    model.add(layers.MaxPooling1D(2))
    
    model.add(layers.Conv1D(256,kernel_size=4))
    model.add(layers.BatchNormalization())
    model.add(layers.Activation("relu"))
    model.add(layers.MaxPooling1D(2))
    
    model.add(layers.Conv1D(512,kernel_size=4))
    model.add(layers.BatchNormalization())
    model.add(layers.Activation("relu"))
    model.add(layers.MaxPooling1D(2))
    
    model.add(layers.Flatten())
    model.add(layers.Dense(1,activation="sigmoid"))
    
    model.compile(loss="binary_crossentropy",optimizer="adam",metrics=["accuracy"])
    
    return model


In [None]:
model = getModel()
model.summary()

In [None]:
history = model.fit(X_train,y_train,validation_data=(X_test,y_test),epochs=1)

**Convolutional neural network with**:
* 3 Conv1D layers with 128, 256 and 512 filters and kernel size 4, meaning each output is computed from a window of 4 consecutive time steps. Each is followed by batch normalization and a ReLU activation.
* 3 MaxPooling1D layers to downsample the input representation by taking the maximum value over a window of size 2.
* 1 Flatten layer to flatten the output of the convolutional layers into a single long feature vector.
* 1 Dense layer with 1 neuron and sigmoid activation.

**Gives 97% validation accuracy with 1 epoch.**
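The layer list above can be checked with a little shape arithmetic. This sketch assumes an input length of 256 tokens (the `maxlen` used when padding) and the Keras defaults of `padding='valid'` for Conv1D and a pooling stride equal to the pool size; the helper names are hypothetical.

```python
def conv1d_valid_length(length, kernel_size):
    """Output length of a stride-1 convolution without padding."""
    return length - kernel_size + 1

def maxpool1d_length(length, pool_size):
    """Output length of max pooling whose stride equals the pool size."""
    return length // pool_size

length = 256                      # assumed input sequence length
filters_per_block = [128, 256, 512]
trace = []
for filters in filters_per_block:
    length = conv1d_valid_length(length, kernel_size=4)
    length = maxpool1d_length(length, pool_size=2)
    trace.append((length, filters))

print(trace)

# Flatten turns the last (time steps, filters) grid into one vector.
final_steps, final_filters = trace[-1]
flattened_size = final_steps * final_filters
print(flattened_size)
```

Under these assumptions the final Dense layer sees one long vector per article, which `model.summary()` should confirm.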

## Creating a deployable model

**Saving the weights of our model and pickling our tokenizer.**

In [None]:
model.save_weights("trained_model.h5")

In [None]:
import pickle
with open("tokenizer.pickle",mode="wb") as F:
    pickle.dump(tokenizer,F)

**Saving our label map using json library.**

In [None]:
import json
label_map = {0:"Fake",
             1:"Real"
            }

# Use a context manager so the file is flushed and closed
with open("label_map.json", mode="w") as f:
    json.dump(label_map, f)
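One detail worth keeping in mind when the label map is loaded again: JSON object keys are always strings, so the integer keys 0 and 1 come back as "0" and "1". A minimal illustration using an in-memory round trip instead of the `label_map.json` file:

```python
import json

label_map = {0: "Fake", 1: "Real"}

# Serialize and parse again, as happens when the file is written and later read.
restored = json.loads(json.dumps(label_map))
print(restored)  # {'0': 'Fake', '1': 'Real'}

# Integer lookups no longer work; convert the predicted class to a string first.
predicted_class = 1
print(restored[str(predicted_class)])  # Real
```

This is why a predicted class has to be converted with `str(...)` before it is used as a key into the reloaded map.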

**Making sure our text data is clean.**

In [None]:
def cleanText(text):
    cleaned = re.sub("[^'a-zA-Z0-9]"," ",text)
    lowered = cleaned.lower().strip()
    return lowered

In [None]:
x_cleaned = [cleanText(t) for t in x]

In [None]:
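class DeployModel():
    """Bundles the trained CNN, the tokenizer and the label map for single-text predictions."""

    def __init__(self, weights_path, tokenizer_path, seq_length, label_map_path):
        # Rebuild the architecture, then load the trained weights into it
        self.model = getModel()
        self.model.load_weights(weights_path)
        with open(tokenizer_path, mode="rb") as f:
            self.tokenizer = pickle.load(f)
        self.seq_len = seq_length
        # JSON stores the label keys as strings ("0", "1")
        with open(label_map_path) as f:
            self.label_map = json.load(f)

    def _prepare_data(self, text):
        cleaned = cleanText(text)
        tokenized = self.tokenizer.texts_to_sequences([cleaned])
        # Pad at the end, matching padding='post' used for the training data
        padded = pad_sequences(tokenized, maxlen=self.seq_len, padding='post')
        return padded

    def _predict(self, text):
        text = self._prepare_data(text)
        # predict_classes is not available in recent TensorFlow releases;
        # threshold the sigmoid output at 0.5 instead
        probability = float(self.model.predict(text)[0][0])
        pred = int(probability > 0.5)
        return str(pred)

    def result(self, text):
        pred = self._predict(text)
        return self.label_map[pred]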

In [None]:
deploy_model = DeployModel(weights_path="./trained_model.h5",
                           tokenizer_path="./tokenizer.pickle",
                           seq_length=SEQUENCE_LENGTH,
                           label_map_path="./label_map.json"
                          )

In [None]:
test_text_real = x_cleaned[1000]

In [None]:
print(test_text_real)
print("\n\n===========================")
print("Results: ",deploy_model.result(test_text_real))

In [None]:
test_text_fake = x_cleaned[30000]

In [None]:
print(test_text_fake)
print("\n\n===========================")
print("Results: ",deploy_model.result(test_text_fake))