# Experiment 6: MLP network with Bag of Words features

This is our first deep learning experiment after the proof of concept in the parent directory. This notebook tries the most basic form of neural network: a multilayer feedforward network (multilayer perceptron) in which every node in one layer is connected to every node in the next. This time we will use a much bigger dataset, because the accuracy figures in the previous notebook are deceptive: its dataset was small and its reviews were very similar to one another.

First, let's import and split our data. We'll use 50,000 reviews.

In [1]:
import tensorflow as tf
import numpy as np
from tensorflow import keras
from keras.preprocessing import text
import matplotlib.pyplot as plt
from sklearn.model_selection import StratifiedKFold
from exp4_data_feature_extraction import get_balanced_dataset
from scripts import training_helpers as th

reviews_set, fake_reviews, genuine_reviews, unused_genuine_reviews = get_balanced_dataset()
reviews = reviews_set[:50000]
X = [x.review_content for x in reviews]
y = np.array([x.label for x in reviews])

Using TensorFlow backend.


Next, let's limit the vocabulary to the most frequent words in our reviews, to filter out rare tokens and other nonsense. 10,000 is as good a number as any.

In [2]:
NUM_WORDS = 10000
tokenizer = text.Tokenizer(num_words=NUM_WORDS)
tokenizer.fit_on_texts(X)

Now, let's define a function that takes a bunch of reviews and returns them as word count vectors of size 10,000.

In [3]:
def tokenize(data):
    return tokenizer.texts_to_matrix(data, mode='count')

X = np.array(tokenize(X))

Let's take a look at a review to make sure it looks ok.

In [4]:
print(X[0])

[0. 0. 0. ... 0. 0. 0.]
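
To make the count representation concrete, here's a minimal plain-Python sketch of how a review becomes a word count vector. This is not the Keras implementation: `build_vocab` and `count_vector` are hypothetical helpers, tokenization is a simple whitespace split, and we assume (as the Keras tokenizer does, to the best of our understanding) that index 0 is reserved, so only the `num_words - 1` most frequent words get a slot.

```python
from collections import Counter


def build_vocab(texts, num_words):
    """Map the most frequent words to indices 1..num_words-1 (index 0 stays unused)."""
    freq = Counter(word for t in texts for word in t.lower().split())
    # Sort by descending frequency, then alphabetically so ties are deterministic.
    ranked = sorted(freq, key=lambda w: (-freq[w], w))
    return {w: i + 1 for i, w in enumerate(ranked[:num_words - 1])}


def count_vector(review, vocab, num_words):
    """Return a vector whose entry i counts occurrences of the word with index i."""
    vec = [0.0] * num_words
    for word in review.lower().split():
        idx = vocab.get(word)
        if idx is not None:  # words outside the vocabulary are simply dropped
            vec[idx] += 1.0
    return vec


# Toy reviews, invented for illustration only.
docs = ["great food great service", "terrible food"]
vocab = build_vocab(docs, num_words=4)
vectors = [count_vector(d, vocab, 4) for d in docs]
print(vectors)
```

With a 10,000-word vocabulary and short reviews, almost every entry is zero, which is why the printed review above shows nothing but `0.` values at both ends.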


## Initial fully connected FF network.

Now let's train and validate a model to see what our accuracies look like.
This time, to get a more trustworthy accuracy estimate, we're going to use k-fold cross-validation with k=10, so that our results aren't biased toward a particular chunk of the dataset.
We won't use any regularization methods this time around, so that we can see how they affect the model when we add them to our layers later on.
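
For intuition, here's a rough sketch of what stratified k-fold splitting does: each fold keeps roughly the same class balance as the whole dataset, and every sample is used for testing exactly once. This is a simplified stand-in written in plain Python, not scikit-learn's algorithm; `stratified_folds` is a hypothetical helper and the labels are toy data.

```python
import random
from collections import defaultdict


def stratified_folds(labels, k, seed=42):
    """Yield (train_indices, test_indices) pairs with per-class balance in each fold."""
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)

    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal indices round-robin so each fold gets a near-equal share of the class.
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)

    for i in range(k):
        test = sorted(folds[i])
        train = sorted(idx for j, fold in enumerate(folds) if j != i for idx in fold)
        yield train, test


# Toy labels: 10 genuine (0) and 10 fake (1) reviews.
labels = [0] * 10 + [1] * 10
splits = list(stratified_folds(labels, k=5))
for train, test in splits:
    print(len(train), len(test), sum(labels[i] for i in test))
```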

In [5]:
from keras.callbacks import EarlyStopping
from keras import regularizers

# 10 stratified folds, matching the k=10 described above.
kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
cvscores = []
for train, test in kfold.split(X, y):
    # Build a fresh model for each fold so no weights carry over between folds.
    model = keras.Sequential([
        keras.layers.Dense(16, activation=tf.nn.relu, input_shape=(NUM_WORDS,)),
        keras.layers.Dense(16, activation=tf.nn.relu),
        keras.layers.Dense(1, activation=tf.nn.sigmoid)
    ])
    model.compile(optimizer='adam',
                  loss='binary_crossentropy',
                  metrics=['accuracy'])
    model.fit(X[train], y[train], epochs=10, batch_size=2048, validation_split=0.3, verbose=1)
    scores = model.evaluate(X[test], y[test], verbose=2)
    print("%s: %.2f%%" % (model.metrics_names[1], scores[1]*100))
    cvscores.append(scores[1] * 100)

print("%.2f%% (+/- %.2f%%)" % (np.mean(cvscores), np.std(cvscores)))

Train on 31499 samples, validate on 13500 samples
Epoch 1/10
 - 15s - loss: 0.6676 - acc: 0.5913 - val_loss: 0.6390 - val_acc: 0.6581
Epoch 2/10
 - 7s - loss: 0.6079 - acc: 0.6834 - val_loss: 0.6196 - val_acc: 0.6719
Epoch 3/10
 - 6s - loss: 0.5725 - acc: 0.7050 - val_loss: 0.6158 - val_acc: 0.6741
Epoch 4/10
 - 7s - loss: 0.5457 - acc: 0.7224 - val_loss: 0.6182 - val_acc: 0.6729
Epoch 5/10
 - 7s - loss: 0.5223 - acc: 0.7393 - val_loss: 0.6282 - val_acc: 0.6690
Epoch 6/10
 - 7s - loss: 0.5005 - acc: 0.7534 - val_loss: 0.6374 - val_acc: 0.6656
Epoch 7/10
 - 7s - loss: 0.4796 - acc: 0.7675 - val_loss: 0.6505 - val_acc: 0.6603
Epoch 8/10
 - 6s - loss: 0.4597 - acc: 0.7818 - val_loss: 0.6662 - val_acc: 0.6584
Epoch 9/10
 - 6s - loss: 0.4428 - acc: 0.7924 - val_loss: 0.6807 - val_acc: 0.6556
Epoch 10/10
 - 7s - loss: 0.4232 - acc: 0.8061 - val_loss: 0.7014 - val_acc: 0.6526
acc: 65.49%
Train on 31499 samples, validate on 13500 samples
Epoch 1/10
 - 15s - loss: 0.6731 - acc: 0.5609 - val_los

 - 7s - loss: 0.6293 - acc: 0.6724 - val_loss: 0.6247 - val_acc: 0.6648
Epoch 3/10
 - 5s - loss: 0.5866 - acc: 0.6990 - val_loss: 0.6156 - val_acc: 0.6697
Epoch 4/10
 - 5s - loss: 0.5570 - acc: 0.7197 - val_loss: 0.6154 - val_acc: 0.6685
Epoch 5/10
 - 6s - loss: 0.5304 - acc: 0.7378 - val_loss: 0.6211 - val_acc: 0.6682
Epoch 6/10
 - 6s - loss: 0.5052 - acc: 0.7539 - val_loss: 0.6304 - val_acc: 0.6646
Epoch 7/10
 - 7s - loss: 0.4791 - acc: 0.7718 - val_loss: 0.6457 - val_acc: 0.6608
Epoch 8/10
 - 7s - loss: 0.4525 - acc: 0.7890 - val_loss: 0.6603 - val_acc: 0.6577
Epoch 9/10
 - 5s - loss: 0.4252 - acc: 0.8079 - val_loss: 0.6836 - val_acc: 0.6544
Epoch 10/10
 - 4s - loss: 0.3997 - acc: 0.8249 - val_loss: 0.7010 - val_acc: 0.6536
acc: 63.21%
65.06% (+/- 0.86%)


Awesome. With only two small hidden layers of 16 nodes each, simple word count vectors as features, and no dropout or regularization, we get about 65% accuracy. That's not as good as our POC, but with over 30x the data, it's definitely a good first step.

We can see above that validation loss initially decreases but then starts to increase, while training loss continues to decrease and training accuracy continues to increase.
This indicates that the model is overfitting. It keeps getting better at fitting the data that it sees (training data) while getting worse at fitting the data that it does not see (validation data).
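
To put numbers on this, here's a small sketch that takes the losses logged for the first fold above and finds the epoch where validation loss bottoms out, along with the growing gap between validation and training loss. The values are copied from the training log; the variable names are ours.

```python
# Per-epoch losses copied from the first fold's training log above.
train_loss = [0.6676, 0.6079, 0.5725, 0.5457, 0.5223,
              0.5005, 0.4796, 0.4597, 0.4428, 0.4232]
val_loss = [0.6390, 0.6196, 0.6158, 0.6182, 0.6282,
            0.6374, 0.6505, 0.6662, 0.6807, 0.7014]

# Epoch (1-based) with the lowest validation loss.
best_epoch = min(range(len(val_loss)), key=val_loss.__getitem__) + 1

# Generalization gap: how much worse the model does on unseen data, per epoch.
gap = [round(v - t, 4) for t, v in zip(train_loss, val_loss)]

print("best epoch:", best_epoch)
print("gap per epoch:", gap)
```

For this fold the best validation loss comes at epoch 3, and the gap widens on every epoch after that, which is the overfitting pattern described above.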


## Tackling overfitting

There are multiple methods of tackling overfitting. We can look at:

- Regularization functions (L1, L2)
- Dropout layers (set a random fraction of inputs to 0 during training)
- Early stopping callbacks, to stop training when validation loss is not improving

We'll try all of these, as well as a couple of different architectures.

First, let's define our early stopping callback, so we don't waste time training with diminishing returns. This will stop training once validation loss has failed to improve for 3 consecutive epochs.


In [14]:
early_stop = keras.callbacks.EarlyStopping(monitor='val_loss', patience=3)
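
As a rough illustration of the patience rule, here's a plain-Python sketch applied to the validation losses logged for the first fold of the unregularized model. It mirrors our understanding of how patience works and is not the Keras source; `early_stopping_epoch` is a hypothetical helper.

```python
def early_stopping_epoch(val_losses, patience):
    """Return the 1-based epoch at which training would stop."""
    best = float("inf")
    wait = 0  # epochs since the last improvement
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best, wait = loss, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return len(val_losses)  # never triggered: ran all epochs


# Validation losses copied from the first fold's training log above.
val_loss = [0.6390, 0.6196, 0.6158, 0.6182, 0.6282,
            0.6374, 0.6505, 0.6662, 0.6807, 0.7014]
stopped_at = early_stopping_epoch(val_loss, patience=3)
print("would stop after epoch", stopped_at)
```

Under these assumptions, that run would have stopped after epoch 6 instead of running all 10 epochs, since epochs 4 to 6 all fail to beat the loss from epoch 3.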

Now let's train our previous model, with early stopping and regularization.

There are three types of layer regularization in Keras: kernel, bias, and activity.

Kernel: applies to the layer's weights; in Dense, this is the W of Wx+b.

Bias: applies to the layer's bias vector, the b in Wx+b, so it can be given its own regularizer.

Activity: applies to the layer's output vector, the y in y = f(Wx + b).

To use regularization to prevent overfitting, we want to apply it to the kernel weights.
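
As a minimal sketch of what an L2 kernel regularizer adds to the loss: the penalty is the regularization factor times the sum of the squared weights, so large weights make the total loss worse. The weight matrix and the data loss below are made up for illustration, and `l2_penalty` is a hypothetical helper, not a Keras function.

```python
def l2_penalty(weights, lam):
    """L2 penalty: lam times the sum of squared weights."""
    return lam * sum(w * w for row in weights for w in row)


# Hypothetical 2x2 kernel and data loss, chosen only to make the arithmetic easy.
W = [[0.5, -1.0],
     [2.0, 0.0]]
data_loss = 0.60

penalty = l2_penalty(W, lam=0.01)  # 0.01 * (0.25 + 1.0 + 4.0 + 0.0)
total_loss = data_loss + penalty

print("penalty:", penalty)
print("total loss:", total_loss)
```

Because the optimizer minimizes the total, it is pushed toward smaller weights unless a large weight pays for itself with a lower data loss.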

Let's start by adding L2 regularization along with early stopping.

In [15]:
kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
cvscores = []
for train, test in kfold.split(X, y):
    model = keras.Sequential([
        keras.layers.Dense(16, activation=tf.nn.relu, input_shape=(NUM_WORDS,), kernel_regularizer=regularizers.l2(0.01)),
        keras.layers.Dense(16, activation=tf.nn.relu, kernel_regularizer=regularizers.l2(0.01)),
        keras.layers.Dense(1, activation=tf.nn.sigmoid)
    ])
    model.compile(optimizer='adam',
                  loss='binary_crossentropy',
                  metrics=['accuracy'])
    model.fit(X[train], y[train], epochs=30, batch_size=1024, validation_split=0.3, verbose=0, callbacks=[early_stop])
    scores = model.evaluate(X[test], y[test], verbose=2)
    print("%s: %.2f%%" % (model.metrics_names[1], scores[1]*100))
    cvscores.append(scores[1] * 100)
    
print("%.2f%% (+/- %.2f%%)" % (np.mean(cvscores), np.std(cvscores)))

Train on 27999 samples, validate on 12000 samples
Epoch 1/30
Epoch 2/30
Epoch 3/30
Epoch 4/30
Epoch 5/30
Epoch 6/30
Epoch 7/30
Epoch 8/30
Epoch 9/30
Epoch 10/30
Epoch 11/30
Epoch 12/30
Epoch 13/30
Epoch 14/30
Epoch 15/30
Epoch 16/30
Epoch 17/30
Epoch 18/30
Epoch 19/30
Epoch 20/30
Epoch 21/30
Epoch 22/30
Epoch 23/30
Epoch 24/30
Epoch 25/30
Epoch 26/30
Epoch 27/30
acc: 66.66%
Train on 27999 samples, validate on 12000 samples
Epoch 1/30
Epoch 2/30
Epoch 3/30
Epoch 4/30
Epoch 5/30
Epoch 6/30
Epoch 7/30
Epoch 8/30
Epoch 9/30
Epoch 10/30
Epoch 11/30
Epoch 12/30
Epoch 13/30
Epoch 14/30
Epoch 15/30
Epoch 16/30
Epoch 17/30
Epoch 18/30
Epoch 19/30
Epoch 20/30
Epoch 21/30
Epoch 22/30
Epoch 23/30
Epoch 24/30
Epoch 25/30
Epoch 26/30
acc: 68.02%
Train on 28000 samples, validate on 12000 samples
Epoch 1/30
Epoch 2/30
Epoch 3/30
Epoch 4/30
Epoch 5/30
Epoch 6/30
Epoch 7/30
Epoch 8/30
Epoch 9/30
Epoch 10/30
Epoch 11/30
Epoch 12/30
Epoch 13/30
Epoch 14/30
Epoch 15/30
Epoch 16/30
Epoch 17/30
Epoch 18/30
Epoch 19/30
Epoch 20/30
acc: 66.38%
Train on 28000 samples, validate on 12001 samples
Epoch 1/30
Epoch 2/30
Epoch 3/30
Epoch 4/30
Epoch 5/30
Epoch 6/30
Epoch 7/30
Epoch 8/30
Epoch 9/30
Epoch 10/30
Epoch 11/30
Epoch 12/30
Epoch 13/30
Epoch 14/30
Epoch 15/30
Epoch 16/30
Epoch 17/30
Epoch 18/30
Epoch 19/30
Epoch 20/30
Epoch 21/30
acc: 66.77%
Train on 28000 samples, validate on 12001 samples
Epoch 1/30
Epoch 2/30
Epoch 3/30
Epoch 4/30
Epoch 5/30
Epoch 6/30
Epoch 7/30
Epoch 8/30
Epoch 9/30
Epoch 10/30
Epoch 11/30
Epoch 12/30
Epoch 13/30
Epoch 14/30
Epoch 15/30
Epoch 16/30
Epoch 17/30
Epoch 18/30
Epoch 19/30
Epoch 20/30
Epoch 21/30
Epoch 22/30
Epoch 23/30
acc: 66.29%
66.82% (+/- 0.62%)
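
As a quick check of how that summary line is computed, here's a sketch that reproduces it from the five fold accuracies printed above, using only the standard library. `np.std` defaults to the population standard deviation, which corresponds to `statistics.pstdev`; the sample standard deviation (`statistics.stdev`) is shown for comparison.

```python
from statistics import mean, pstdev, stdev

# Fold accuracies copied from the output above.
scores = [66.66, 68.02, 66.38, 66.77, 66.29]

summary = "%.2f%% (+/- %.2f%%)" % (mean(scores), pstdev(scores))
sample_std = round(stdev(scores), 2)

print(summary)
print("sample std:", sample_std)
```

With only five folds the two estimates differ noticeably, so it's worth keeping in mind which one the +/- figure refers to.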


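Already a slight improvement (66.82% versus 65.06%), and the model's validation loss now keeps improving for far longer: each fold trains for 20 to 27 epochs before early stopping kicks in. Let's add a dropout layer to the mix before adding more dense layers.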
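
Before we do, here's a small sketch of what a dropout layer does to its inputs during training. It uses the common "inverted dropout" formulation, in which the surviving values are scaled up by 1 / (1 - rate) so the expected activation stays the same; to the best of our knowledge this matches how Keras behaves, but treat it as an illustration rather than the library's code. `dropout` is a hypothetical helper.

```python
import random


def dropout(activations, rate, rng, training=True):
    """Zero each activation with probability `rate`; scale survivors by 1/(1-rate)."""
    if not training:
        # At inference time dropout is a no-op.
        return list(activations)
    keep = 1.0 - rate
    return [a / keep if rng.random() >= rate else 0.0 for a in activations]


rng = random.Random(0)
x = [1.0] * 10000  # toy activations, all equal to 1

dropped = dropout(x, rate=0.25, rng=rng)
zero_fraction = dropped.count(0.0) / len(dropped)
mean_activation = sum(dropped) / len(dropped)

print("fraction zeroed:", zero_fraction)
print("mean activation:", mean_activation)
```

Roughly a quarter of the values are zeroed on each pass, while the mean stays close to 1, so the next layer sees inputs on the same scale whether or not dropout is active.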

In [19]:
cvscores = []
for train, test in kfold.split(X, y):
    model = keras.Sequential([
        keras.layers.Dense(16, activation=tf.nn.relu, input_shape=(NUM_WORDS,), kernel_regularizer=regularizers.l2(0.01)),
        keras.layers.Dropout(0.25),
        keras.layers.Dense(16, activation=tf.nn.relu, kernel_regularizer=regularizers.l2(0.01)),
        keras.layers.Dense(1, activation=tf.nn.sigmoid)
    ])
    model.compile(optimizer='adam',
                  loss='binary_crossentropy',
                  metrics=['accuracy'])
    model.fit(X[train], y[train], epochs=30, batch_size=1024, validation_split=0.3, verbose=1, callbacks=[early_stop])
    scores = model.evaluate(X[test], y[test], verbose=2)
    print("%s: %.2f%%" % (model.metrics_names[1], scores[1]*100))
    cvscores.append(scores[1] * 100)
    
print("%.2f%% (+/- %.2f%%)" % (np.mean(cvscores), np.std(cvscores)))

Train on 27999 samples, validate on 12000 samples
Epoch 1/30
Epoch 2/30
Epoch 3/30
Epoch 4/30
Epoch 5/30
Epoch 6/30
Epoch 7/30
Epoch 8/30
Epoch 9/30
Epoch 10/30
Epoch 11/30
Epoch 12/30
Epoch 13/30
Epoch 14/30
Epoch 15/30
Epoch 16/30
Epoch 17/30
Epoch 18/30
Epoch 19/30
Epoch 20/30
Epoch 21/30
Epoch 22/30
Epoch 23/30
acc: 66.63%
Train on 27999 samples, validate on 12000 samples
Epoch 1/30
Epoch 2/30
Epoch 3/30
Epoch 4/30
Epoch 5/30
Epoch 6/30
Epoch 7/30
Epoch 8/30
Epoch 9/30
Epoch 10/30
Epoch 11/30
Epoch 12/30
Epoch 13/30
Epoch 14/30
Epoch 15/30
Epoch 16/30
Epoch 17/30
Epoch 18/30
Epoch 19/30
Epoch 20/30
acc: 67.80%
Train on 28000 samples, validate on 12000 samples
Epoch 1/30
Epoch 2/30
Epoch 3/30
Epoch 4/30
Epoch 5/30
Epoch 6/30
Epoch 7/30
Epoch 8/30
Epoch 9/30
Epoch 10/30
Epoch 11/30
Epoch 12/30
Epoch 13/30
Epoch 14/30
Epoch 15/30
Epoch 16/30
Epoch 17/30
Epoch 18/30
acc: 66.52%
Train on 28000 samples, validate on 12001 samples
Epoch 1/30
Epoch 2/30
Epoch 3/30
Epoch 4/30
Epoch 5/30
Epoch 6/30
Epoch 7/30
Epoch 8/30
Epoch 9/30
Epoch 10/30
Epoch 11/30
Epoch 12/30
Epoch 13/30
Epoch 14/30
Epoch 15/30
Epoch 16/30
Epoch 17/30
Epoch 18/30
Epoch 19/30
Epoch 20/30
Epoch 21/30
Epoch 22/30
acc: 66.84%
Train on 28000 samples, validate on 12001 samples
Epoch 1/30
Epoch 2/30
Epoch 3/30
Epoch 4/30
Epoch 5/30
Epoch 6/30
Epoch 7/30
Epoch 8/30
Epoch 9/30
Epoch 10/30
Epoch 11/30
Epoch 12/30
Epoch 13/30
Epoch 14/30
Epoch 15/30
Epoch 16/30
Epoch 17/30
Epoch 18/30
Epoch 19/30
Epoch 20/30
Epoch 21/30
Epoch 22/30
Epoch 23/30
Epoch 24/30
acc: 66.64%
66.89% (+/- 0.47%)


Not much of a difference, but with such a small network that's not overly surprising. Let's try a variation on the architecture: keep 16 nodes in the first hidden layer with the dropout layer after it, shrink the second hidden layer to 8 nodes, halve the batch size to 512, and see what happens. Also, let's reduce cross-validation to 3 folds to save some time.

In [25]:
kfold = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
cvscores = []
for train, test in kfold.split(X, y):
    model = keras.Sequential([
        keras.layers.Dense(16, activation=tf.nn.relu, input_shape=(NUM_WORDS,), kernel_regularizer=regularizers.l2(0.01)),
        keras.layers.Dropout(0.25),
        keras.layers.Dense(8, activation=tf.nn.relu, kernel_regularizer=regularizers.l2(0.01)),
        keras.layers.Dense(1, activation=tf.nn.sigmoid)
    ])
    model.compile(optimizer='adam',
                  loss='binary_crossentropy',
                  metrics=['accuracy'])
    model.fit(X[train], y[train], epochs=30, batch_size=512, validation_split=0.3, verbose=1, callbacks=[early_stop])
    scores = model.evaluate(X[test], y[test], verbose=2)
    print("%s: %.2f%%" % (model.metrics_names[1], scores[1]*100))
    cvscores.append(scores[1] * 100)
    
print("%.2f%% (+/- %.2f%%)" % (np.mean(cvscores), np.std(cvscores)))

Train on 23332 samples, validate on 10000 samples
Epoch 1/30
Epoch 2/30
Epoch 3/30
Epoch 4/30
Epoch 5/30
Epoch 6/30
Epoch 7/30
Epoch 8/30
Epoch 9/30
Epoch 10/30
Epoch 11/30
Epoch 12/30
Epoch 13/30
Epoch 14/30
Epoch 15/30
Epoch 16/30
Epoch 17/30
Epoch 18/30
Epoch 19/30
Epoch 20/30
Epoch 21/30
Epoch 22/30
Epoch 23/30
acc: 66.43%
Train on 23333 samples, validate on 10001 samples
Epoch 1/30
Epoch 2/30
Epoch 3/30
Epoch 4/30
Epoch 5/30
Epoch 6/30
Epoch 7/30
Epoch 8/30
Epoch 9/30
Epoch 10/30
Epoch 11/30
Epoch 12/30
Epoch 13/30
Epoch 14/30
Epoch 15/30
Epoch 16/30
Epoch 17/30
Epoch 18/30
acc: 66.08%
Train on 23333 samples, validate on 10001 samples
Epoch 1/30
Epoch 2/30
Epoch 3/30
Epoch 4/30
Epoch 5/30
Epoch 6/30
Epoch 7/30
Epoch 8/30
Epoch 9/30
Epoch 10/30
Epoch 11/30
Epoch 12/30
Epoch 13/30
Epoch 14/30
Epoch 15/30
Epoch 16/30
Epoch 17/30
Epoch 18/30
acc: 66.60%
66.37% (+/- 0.22%)
