# Topic: Kaggle Campaign - Fake News Detection Challenge KDD 2020
Description: Using LSTM and RNN model to detect the fake news. Please refer to the dataset: https://www.kaggle.com/c/fakenewskdd2020

## Step1: Import the needed tools and datasets
Firstly, import all tools and dataset which are needed, including the the tools for manage dataframe, array, and the tools for characters transform to vectors, and the tools for model construct from scikit-learn. Also, import the score to measure the model performance. 
____
After loading the data, we can see there are only two columns, one is "Text", which is the Variable X. And the other column is "Label", when 1 representing Fake, and 0 representing True, which is the Target Variable, Value Y.

In [1]:
#import library
import pandas as pd
import numpy as np
from sklearn.feature_extraction import text
from sklearn.feature_extraction.text import TfidfVectorizer
#RNN Model
from keras.models import Sequential
from keras.layers.core import Dense, Dropout, Activation, Flatten
from keras.layers.embeddings import Embedding
from keras.layers.recurrent import SimpleRNN
#LSTM Model
from keras.layers.recurrent import LSTM

In [2]:
#load data
df_train = pd.read_csv("train.csv", "\t",encoding='utf-8',header=(0)) 
df_test = pd.read_csv("test.csv", "\t",encoding='utf-8',header=(0))
df_sub = pd.read_csv("sample_submission.csv",encoding='utf-8',header=(0))

## Step2: Data Pre-processing
Firstly, checking the correction of datasets. Then seperate X and Y, Train and Test. After setting Stop words to get rid of the meanless words, we can transfer the text to vectors to caculate their features. Take the 1800 features only with the largest TF-IDF value among all features, which are Variable X. 

In [3]:
#check and remove of df_train row#1615
df_train.drop(axis=0, index=1615, inplace= True)
df_train = df_train.reset_index(drop=True)

In [4]:
df_train[1614:1617]

Unnamed: 0,text,label
1614,Justin Bieber has had his fair share of accide...,1
1615,Mel B has filed for divorce from her husband o...,0
1616,Recognizing The Potential And Perils Of A 'Lim...,1


In [5]:
#set x&y train and test
x_train = df_train['text']
y_train_r = df_train['label']
x_test = df_test['text']
y_test=pd.to_numeric(df_sub['label'])

In [6]:
print(x_train.shape)
print(y_train_r.shape)
print(x_test.shape)
print(y_test.shape)

(4986,)
(4986,)
(1247,)
(1247,)


In [7]:
y_train = []
def strtoint(y_r, y_n):
    for i in range(len(y_r)):
        if y_r[i] =='0':
            y_n.append(0)
        if y_r[i] =='1':
            y_n.append(1)

strtoint(y_train_r, y_train)
y_train = np.array(y_train)

In [8]:
#set stop words
stopwords= text.ENGLISH_STOP_WORDS

In [9]:
#transform text to vector by Tfidf
vectorizer = TfidfVectorizer(
            norm='l2',                      
            stop_words=stopwords,
            max_features=1800               
            )

X_train = vectorizer.fit_transform(x_train).toarray()
X_test = vectorizer.fit_transform(x_test).toarray()

## Step 3: Construct Model
### RNN Model
The steps will be : Creating the sequential model for Keras > Creating the layers of RNN model > Training Model
___
Creating the sequential model for Keras
1. Output dimension is 32 bits
2. Input dimension is 2800 (After several trying)
3. Input length is 1800 which we adapt in the previous step
___
Creating the layers of RNN model
1. Layer of SimpleRNN outputs 16 bits
2. Dense layer uses relu as the activation function
3. Add another doupout layer to avoid overfitting
___
Training Model
1. Iterate 10 times, using 100 data per iteration, showing the process of each iteration

In [10]:
#set keras sequential model
modelRNN = Sequential()
modelRNN.add(Embedding(output_dim = 32,
                      input_dim = 2800,
                      input_length = 1800))
modelRNN.add(Dropout(0.2))

In [11]:
#set keras NN model
modelRNN.add(SimpleRNN(units = 16, return_sequences=True))
modelRNN.add(Dense(units = 256, activation = 'relu'))
modelRNN.add(Dropout(0.35))

modelRNN.add(Dense(units = 1, activation = 'sigmoid'))

In [12]:
#model summary
modelRNN.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, 1800, 32)          89600     
_________________________________________________________________
dropout (Dropout)            (None, 1800, 32)          0         
_________________________________________________________________
simple_rnn (SimpleRNN)       (None, 1800, 16)          784       
_________________________________________________________________
dense (Dense)                (None, 1800, 256)         4352      
_________________________________________________________________
dropout_1 (Dropout)          (None, 1800, 256)         0         
_________________________________________________________________
dense_1 (Dense)              (None, 1800, 1)           257       
Total params: 94,993
Trainable params: 94,993
Non-trainable params: 0
____________________________________________________

In [13]:
#define training model
modelRNN.compile(loss = 'binary_crossentropy',
                optimizer = 'adam',
                metrics = ['accuracy'])

train_history = modelRNN.fit(X_train, y_train,
                            epochs = 10,
                            batch_size = 100,
                            verbose = 2)

Epoch 1/10
50/50 - 59s - loss: 0.6771 - accuracy: 0.5944
Epoch 2/10
50/50 - 56s - loss: 0.6755 - accuracy: 0.5961
Epoch 3/10
50/50 - 58s - loss: 0.6752 - accuracy: 0.5961
Epoch 4/10
50/50 - 63s - loss: 0.6756 - accuracy: 0.5961
Epoch 5/10
50/50 - 62s - loss: 0.6756 - accuracy: 0.5961
Epoch 6/10
50/50 - 57s - loss: 0.6752 - accuracy: 0.5961
Epoch 7/10
50/50 - 54s - loss: 0.6752 - accuracy: 0.5961
Epoch 8/10
50/50 - 50s - loss: 0.6754 - accuracy: 0.5961
Epoch 9/10
50/50 - 50s - loss: 0.6757 - accuracy: 0.5961
Epoch 10/10


KeyboardInterrupt: 

In [14]:
#evaluate model with test data
scores = modelRNN.evaluate(X_test, y_test, verbose = 1)



LSTM Model

In [18]:
#set keras sequential model
modelLSTM = Sequential()
modelLSTM.add(Embedding(output_dim = 32,
                      input_dim = 2000,
                      input_length = 1800))
modelRNN.add(Dropout(0.2))

In [19]:
#set keras NN model
modelLSTM.add(LSTM(units = 32, return_sequences=True))
modelLSTM.add(Dense(units = 256, activation = 'relu'))
modelLSTM.add(Dropout(0.35))
#output layer
modelLSTM.add(Dense(units = 1, activation = 'sigmoid'))

In [20]:
#model summary
modelLSTM.summary()

Model: "sequential_2"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_2 (Embedding)      (None, 1800, 32)          64000     
_________________________________________________________________
lstm_1 (LSTM)                (None, 1800, 32)          8320      
_________________________________________________________________
dense_4 (Dense)              (None, 1800, 256)         8448      
_________________________________________________________________
dropout_5 (Dropout)          (None, 1800, 256)         0         
_________________________________________________________________
dense_5 (Dense)              (None, 1800, 1)           257       
Total params: 81,025
Trainable params: 81,025
Non-trainable params: 0
_________________________________________________________________


In [21]:
#define training model
modelLSTM.compile(loss = 'binary_crossentropy',
                optimizer = 'adam',
                metrics = ['accuracy'])

train_history = modelLSTM.fit(X_train, y_train,
                            epochs = 10,
                            batch_size = 100,
                            verbose = 2)

Epoch 1/10
50/50 - 77s - loss: 0.6786 - accuracy: 0.5906
Epoch 2/10
50/50 - 70s - loss: 0.6744 - accuracy: 0.5961
Epoch 3/10
50/50 - 69s - loss: 0.6757 - accuracy: 0.5961
Epoch 4/10
50/50 - 69s - loss: 0.6751 - accuracy: 0.5961
Epoch 5/10
50/50 - 73s - loss: 0.6749 - accuracy: 0.5961
Epoch 6/10
50/50 - 70s - loss: 0.6752 - accuracy: 0.5961
Epoch 7/10
50/50 - 74s - loss: 0.6751 - accuracy: 0.5961
Epoch 8/10
50/50 - 78s - loss: 0.6749 - accuracy: 0.5961
Epoch 9/10


KeyboardInterrupt: 

In [22]:
#evaluate model with test data
scores = modelLSTM.evaluate(X_test, y_test, verbose = 1)

