# Experiment 6: MLP network with Bag of Words features

This is our first deep learning experiment following the proof of concept in the parent directory. In this notebook we experiment with the most basic form of neural network: a multilayer feedforward network (multilayer perceptron) in which every node in one layer is connected to every node in the next. This time we use a much bigger dataset, because the accuracy figures in the previous notebook are deceptive: its dataset was so small and homogeneous.

First, let's import our data and split it into review texts (X) and labels (y). We'll use 20,000 reviews.

In [4]:
import tensorflow as tf
import numpy as np
from tensorflow import keras
from keras.preprocessing import text
import matplotlib.pyplot as plt
from sklearn.model_selection import StratifiedKFold
from exp4_data_feature_extraction import get_balanced_dataset
from scripts import training_helpers as th

reviews_set, fake_reviews, genuine_reviews, unused_genuine_reviews = get_balanced_dataset()
reviews = reviews_set[:20000]
X = [x.review_content for x in reviews]
y = np.array([x.label for x in reviews])

Next, let's limit the number of words we use from our reviews, to filter out some of the noise. The tokenizer keeps only the most frequent words; 10,000 is as good a number as any.

In [5]:
NUM_WORDS = 10000
tokenizer = text.Tokenizer(num_words=NUM_WORDS)
tokenizer.fit_on_texts(X)
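
To make the effect of `num_words` concrete, here is a minimal pure-Python sketch of frequency-based vocabulary limiting. The `build_vocab` helper and the toy reviews are made up for illustration, and the sketch only splits on whitespace, whereas the Keras tokenizer also strips punctuation. It assumes, as the Keras docs describe, that index 0 is reserved and only the `num_words - 1` most common words are kept.

```python
from collections import Counter

def build_vocab(texts, num_words):
    """Map the most frequent words to indices 1..num_words-1 (index 0 is reserved)."""
    counts = Counter(word for t in texts for word in t.lower().split())
    # most_common returns (word, count) pairs sorted by descending frequency.
    most_common = counts.most_common(num_words - 1)
    return {word: i + 1 for i, (word, _) in enumerate(most_common)}

# Hypothetical toy reviews, not taken from the dataset.
toy_reviews = [
    "great food great service",
    "terrible food",
    "great place",
]
vocab = build_vocab(toy_reviews, num_words=3)
print(vocab)
```

With `num_words=3` only "great" and "food" survive; rarer words such as "service" are simply ignored when the reviews are vectorized.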

Now, let's define a function that takes a list of reviews and returns them as TF-IDF-weighted bag-of-words vectors of size 10,000.

In [6]:
def tokenize(data):
    return tokenizer.texts_to_matrix(data, mode='tfidf')

X = np.array(tokenize(X))

Let's take a look at the resulting matrix (one row per review) to make sure it looks OK.

In [7]:
print(X)

[[0.         1.28969329 1.32839609 ... 0.         0.         0.        ]
 [0.         1.5985416  1.32839609 ... 0.         0.         0.        ]
 [0.         2.34565261 2.95986798 ... 0.         0.         0.        ]
 ...
 [0.         1.98764444 1.64651272 ... 0.         0.         0.        ]
 [0.         1.5985416  0.         ... 0.         0.         0.        ]
 [0.         0.         1.32839609 ... 0.         0.         0.        ]]
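
For intuition about where values like these come from, here is a rough sketch of a smoothed TF-IDF weight. The formula is my understanding of what `texts_to_matrix(mode='tfidf')` computes, so treat it as an assumption rather than the library's definition; `tfidf_weight` and the document-frequency numbers are hypothetical.

```python
import math

def tfidf_weight(count, docs_with_word, total_docs):
    """Smoothed TF-IDF weight for one word in one document (assumed formula)."""
    if count == 0:
        return 0.0  # words absent from the review stay at zero
    tf = 1.0 + math.log(count)                            # dampened term frequency
    idf = math.log(1.0 + total_docs / (1.0 + docs_with_word))  # rarer words score higher
    return tf * idf

total_docs = 20000  # number of reviews used in this notebook
# Made-up document frequencies: one very common word, one rare word.
common = tfidf_weight(count=2, docs_with_word=15000, total_docs=total_docs)
rare = tfidf_weight(count=2, docs_with_word=50, total_docs=total_docs)
print(round(common, 3), round(rare, 3))
```

The same in-review count produces a much larger weight for the rare word, which is why the columns for very frequent words (the low indices) hold fairly small values.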


## Initial fully connected FF network

Now let's train and validate a model to see what our accuracies look like.
This time, to get a more reliable accuracy estimate, we're going to use stratified k-fold cross-validation with k=5, so that our results aren't biased towards a particular chunk of the dataset.
We won't use any regularization methods this time around, so we can see how they affect the model when we add them to our layers later on.
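
As a rough illustration of what stratified k-fold splitting does, the sketch below deals the indices of each class round-robin into k folds, so every fold keeps the overall class balance. `stratified_folds` is a hypothetical helper written for this note; the notebook itself relies on scikit-learn's `StratifiedKFold`, whose internals may differ.

```python
import random
from collections import defaultdict

def stratified_folds(labels, k, seed=42):
    """Return k lists of indices, each with roughly the same label mix."""
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for indices in by_label.values():
        rng.shuffle(indices)
        # Deal each class's indices round-robin so every fold gets its share.
        for position, idx in enumerate(indices):
            folds[position % k].append(idx)
    return folds

labels = [0] * 10 + [1] * 10  # toy balanced dataset
folds = stratified_folds(labels, k=5)
for test_idx in folds:
    held_out = set(test_idx)
    train_idx = [i for i in range(len(labels)) if i not in held_out]
    n_positive = sum(labels[i] for i in test_idx)
    print(len(train_idx), len(test_idx), n_positive)
```

Each sample is held out exactly once, and each model is evaluated on data it never saw during training.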

In [None]:
from keras import regularizers

kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
cvscores = []
for train, test in kfold.split(X, y):
    model = keras.Sequential([
        keras.layers.Dense(8, activation=tf.nn.relu, input_shape=(NUM_WORDS,)),
        keras.layers.Dense(8, activation=tf.nn.relu),
        keras.layers.Dense(1, activation=tf.nn.sigmoid)
    ])
    model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])
    model.fit(X[train], y[train], epochs=6, batch_size=2048, validation_split=0.3, verbose=0)
    scores = model.evaluate(X[test], y[test], verbose=1)
    print("%s: %.2f%%" % (model.metrics_names[1], scores[1]*100))
    cvscores.append(scores[1] * 100)
    
print("%.2f%% (+/- %.2f%%)" % (np.mean(cvscores), np.std(cvscores)))

Awesome. With only two small hidden layers of 8 nodes each, simple TF-IDF bag-of-words features, and no dropout or regularization, we get 64.3% accuracy. That's not as good as our POC, but with over 30x the data it's definitely a good first step.

During training, validation loss initially decreases but then starts to increase, while training loss keeps falling and training accuracy keeps rising.
This indicates that the model is overfitting. It continues to get better and better at fitting the data that it sees (training data) while getting worse and worse at fitting the data that it does not see (validation data).


## Tackling overfitting

There are several methods of tackling overfitting. We can look at:

- Regularization functions (L1, L2)
- Dropout layers (set a random fraction of inputs to 0 during training)
- Early stopping callbacks, to stop training when validation loss is not improving

We'll try all of these, as well as a couple of different architectures.

First, let's define our early stopping callback, so we don't waste time training with diminishing returns. It stops training once validation loss has failed to improve for 3 consecutive epochs.


In [22]:
from keras.callbacks import EarlyStopping
early_stop = EarlyStopping(monitor='val_loss', patience=3)
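
The patience logic can be sketched in a few lines of plain Python. This is a simplified model of the callback's behaviour, not the Keras implementation: it ignores options such as `min_delta` and `restore_best_weights`, and both `stopping_epoch` and the loss values are made up for illustration.

```python
def stopping_epoch(val_losses, patience=3):
    """Return the 1-based epoch at which training would stop."""
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best = loss
            epochs_without_improvement = 0  # any improvement resets the counter
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return len(val_losses)  # ran out of epochs before patience was exhausted

val_losses = [0.69, 0.66, 0.64, 0.65, 0.66, 0.67, 0.70]  # made-up values
print(stopping_epoch(val_losses))
```

Here the best loss occurs at epoch 3, and training stops at epoch 6 after three epochs without improvement, so the seventh epoch is never run.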

Now let's train our previous model, with early stopping and regularization.

Keras supports three kinds of layer regularization: kernel, bias, and activity.

- Kernel: applies to the layer's weights, the W in Wx + b for a Dense layer.
- Bias: applies to the bias vector, the b in Wx + b, so it can be given a different regularizer from the weights.
- Activity: applies to the layer's output, the y in y = f(Wx + b).

To reduce overfitting we want to penalize large weights, so we apply the regularizer to the kernel.
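
To see what an L2 kernel regularizer adds to the loss, here is a small sketch. It assumes the usual definition, the regularization factor times the sum of squared weights; the weight matrices and the `data_loss` value are invented for illustration.

```python
def l2_penalty(weights, factor=0.01):
    """L2 penalty: factor times the sum of squared weights."""
    return factor * sum(w * w for row in weights for w in row)

# Hypothetical 2x2 kernels: one with small weights, one with large weights.
small_weights = [[0.1, -0.2], [0.05, 0.0]]
large_weights = [[2.0, -3.0], [1.5, 0.0]]

data_loss = 0.60  # made-up cross-entropy value
print(data_loss + l2_penalty(small_weights))
print(data_loss + l2_penalty(large_weights))
```

The optimizer minimizes the sum of both terms, so large weights are only worth keeping if they reduce the data loss by more than the penalty they incur.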

Let's start with adding L2 regularization with early stopping.

In [24]:
kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
cvscores = []
for train, test in kfold.split(X, y):
    model = keras.Sequential([
        keras.layers.Dense(8, activation=tf.nn.relu, input_shape=(NUM_WORDS,), kernel_regularizer=regularizers.l2(0.01)),
        keras.layers.Dense(8, activation=tf.nn.relu, kernel_regularizer=regularizers.l2(0.01)),
        keras.layers.Dense(1, activation=tf.nn.sigmoid)
    ])
    model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])
    model.fit(X[train], y[train], epochs=30, batch_size=1024, validation_split=0.3, verbose=0, callbacks=[early_stop])
    scores = model.evaluate(X[test], y[test], verbose=2)
    print("%s: %.2f%%" % (model.metrics_names[1], scores[1]*100))
    cvscores.append(scores[1] * 100)
    
print("%.2f%% (+/- %.2f%%)" % (np.mean(cvscores), np.std(cvscores)))

acc: 63.38%
acc: 65.20%
acc: 64.05%
acc: 65.12%
acc: 65.69%
64.69% (+/- 0.84%)
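
A quick note on the summary line: `np.std` divides by n by default (population standard deviation), which gives a smaller spread than the sample standard deviation on only five folds. The sketch below recomputes both from the rounded per-fold scores printed above, using only the standard library.

```python
import statistics

fold_scores = [63.38, 65.20, 64.05, 65.12, 65.69]  # rounded values printed above

mean = statistics.mean(fold_scores)
population_std = statistics.pstdev(fold_scores)  # divides by n, like np.std's default
sample_std = statistics.stdev(fold_scores)       # divides by n - 1

print(f"mean={mean:.2f} population_std={population_std:.2f} sample_std={sample_std:.2f}")
```

On the rounded scores the population figure comes out at roughly 0.85 against roughly 0.94 for the sample version; the 0.84 reported by the notebook presumably reflects the unrounded scores.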


Already a slight improvement, and the model's validation loss decreases far more consistently. Let's add a dropout layer to the mix before we change up the architecture.

In [28]:
cvscores = []
for train, test in kfold.split(X, y):
    model = keras.Sequential([
        keras.layers.Dense(8, activation=tf.nn.relu, input_shape=(NUM_WORDS,), kernel_regularizer=regularizers.l2(0.01)),
        keras.layers.Dropout(0.2),
        keras.layers.Dense(8, activation=tf.nn.relu, kernel_regularizer=regularizers.l2(0.01)),
        keras.layers.Dense(1, activation=tf.nn.sigmoid)
    ])
    model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])
    model.fit(X[train], y[train], epochs=6, batch_size=512, validation_split=0.3, verbose=0, callbacks=[early_stop])
    scores = model.evaluate(X[test], y[test], verbose=1)
    print("%s: %.2f%%" % (model.metrics_names[1], scores[1]*100))
    cvscores.append(scores[1] * 100)
print("%.2f%% (+/- %.2f%%)" % (np.mean(cvscores), np.std(cvscores)))

acc: 63.06%
acc: 65.33%
acc: 64.83%
acc: 64.78%
acc: 65.14%
64.63% (+/- 0.81%)
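
For intuition about what the `Dropout(0.2)` layer does during training, here is a minimal sketch of inverted dropout: each input is zeroed with probability `rate`, and the survivors are scaled by 1 / (1 - rate) so the expected sum stays the same. The `dropout` function below is an illustrative stand-in, not the Keras layer.

```python
import random

def dropout(values, rate, rng):
    """Inverted dropout: zero each value with probability `rate`, rescale the rest."""
    keep_scale = 1.0 / (1.0 - rate)
    return [0.0 if rng.random() < rate else v * keep_scale for v in values]

rng = random.Random(0)
activations = [1.0] * 10000  # toy activations, all equal to 1
dropped = dropout(activations, rate=0.2, rng=rng)

zero_fraction = dropped.count(0.0) / len(dropped)
print(round(zero_fraction, 2), round(sum(dropped) / len(dropped), 2))
```

About a fifth of the activations are zeroed and the mean stays close to 1. At inference time the layer passes its inputs through unchanged.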


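Not much of a difference. Let's widen the first hidden layer to 16 nodes and see if accuracy improves. Note that this run uses 3 folds rather than 5, so each model trains on a smaller share of the data and the scores are not directly comparable with the runs above.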

In [32]:
kfold = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
cvscores = []
for train, test in kfold.split(X, y):
    model = keras.Sequential([
        keras.layers.Dense(16, activation=tf.nn.relu, input_shape=(NUM_WORDS,), kernel_regularizer=regularizers.l2(0.01)),
        keras.layers.Dropout(0.2),
        keras.layers.Dense(8, activation=tf.nn.relu, kernel_regularizer=regularizers.l2(0.01)),
        keras.layers.Dense(1, activation=tf.nn.sigmoid)
    ])
    model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])
    model.fit(X[train], y[train], epochs=7, batch_size=2048, validation_split=0.3, verbose=0, callbacks=[early_stop])
    scores = model.evaluate(X[test], y[test], verbose=1)
    print("%s: %.2f%%" % (model.metrics_names[1], scores[1]*100))
    cvscores.append(scores[1] * 100)
    
print("%.2f%% (+/- %.2f%%)" % (np.mean(cvscores), np.std(cvscores)))

acc: 64.57%
acc: 65.20%
acc: 61.54%
63.77% (+/- 1.60%)
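
To put the change in layer width in perspective, a rough parameter count shows that almost all trainable parameters sit in the first layer, because of the 10,000-dimensional input. `dense_param_count` is a hypothetical helper; it counts one weight per connection plus one bias per unit, and dropout layers add no parameters.

```python
def dense_param_count(layer_sizes):
    """Count weights and biases in a stack of Dense layers."""
    total = 0
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        total += n_in * n_out + n_out  # weight matrix plus one bias per unit
    return total

NUM_WORDS = 10000
narrow = dense_param_count([NUM_WORDS, 8, 8, 1])   # the 8-8-1 network
wide = dense_param_count([NUM_WORDS, 16, 8, 1])    # the 16-8-1 network
print(narrow, wide)
```

Doubling the first layer roughly doubles the parameter count (about 80,000 to about 160,000), while the amount of training data stays the same.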


It seems we've hit a plateau with BOW features on this type of neural network. In the next experiment, we'll use a richer way of representing words, known as word embeddings, which should help.