# Experiment 6: Fully connected feedforward neural network with Bag of Words features

This is our first deep learning experiment after the proof of concept in the parent directory. This notebook tries the most basic form of neural network: a fully connected feedforward multi-layer perceptron, in which every node in one layer is connected to every node in the next. This time we will use a much bigger dataset, because the accuracy figures in the previous notebook are deceptive: its dataset is so small and homogeneous.

First, let's import and split our data. We'll use 50,000 reviews.

In [1]:
import tensorflow as tf
import numpy as np
from tensorflow import keras
from keras.preprocessing import text
import matplotlib.pyplot as plt
from sklearn.model_selection import StratifiedKFold
from exp4_data_feature_extraction import get_balanced_dataset
from scripts import training_helpers as th

reviews_set, fake_reviews, genuine_reviews, unused_genuine_reviews = get_balanced_dataset()
reviews = reviews_set[:50000]
X = [x.review_content for x in reviews]
y = np.array([x.label for x in reviews])

Using TensorFlow backend.


Next, let's limit the number of words we use from our reviews, keeping only the most frequent ones, to filter out some of the nonsense. 10,000 is as good a number as any.

In [2]:
NUM_WORDS = 10000
tokenizer = text.Tokenizer(num_words=NUM_WORDS)
tokenizer.fit_on_texts(X)

Now, let's define a function that takes a bunch of reviews and returns them as word count vectors of size 10,000.

In [3]:
def tokenize(data):
    return tokenizer.texts_to_matrix(data, mode='count')

X = np.array(tokenize(X))
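
To make the count vectors concrete, here is a minimal pure-Python sketch of the idea behind `mode='count'`. It is only an illustration, not the Keras implementation: `fit_vocabulary`, `to_count_vector` and the toy reviews are hypothetical names, and it assumes plain whitespace splitting, whereas the real `Tokenizer` also strips punctuation. Like Keras, it leaves index 0 unused and keeps only word indices below `num_words`.

```python
from collections import Counter

def fit_vocabulary(texts, num_words):
    """Rank words by frequency; index 0 stays unused, mirroring Keras."""
    counts = Counter(word for t in texts for word in t.lower().split())
    # The most frequent word gets index 1, the next one index 2, and so on.
    ranked = [w for w, _ in counts.most_common()]
    # Only indices below num_words are kept; rarer words are dropped.
    return {w: i for i, w in enumerate(ranked, start=1) if i < num_words}

def to_count_vector(review, vocab, num_words):
    """Turn one review into a vector of per-word occurrence counts."""
    vec = [0.0] * num_words
    for word in review.lower().split():
        idx = vocab.get(word)
        if idx is not None:
            vec[idx] += 1.0
    return vec

toy_reviews = ["great food great service", "bad food"]
vocab = fit_vocabulary(toy_reviews, num_words=4)
vectors = [to_count_vector(r, vocab, 4) for r in toy_reviews]
print(vectors)
```

With `num_words=4` only the three most frequent words survive, so "bad" is silently dropped from the second review. The same thing happens to rare words in our real data.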

Let's take a look at a review to make sure it looks ok.

In [4]:
print(X[0])

[0. 1. 3. ... 0. 0. 0.]


Looks good. Now let's train and validate a model to see what our accuracies look like.
This time we're going to use stratified k-fold cross-validation, with k=10, so that our accuracy estimate isn't biased towards a particular chunk of the dataset.
We won't use any regularization methods this time around, so we can see how they affect the results when we add them to our layers later on.
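
For intuition, stratified k-fold splitting can be sketched in a few lines of plain Python. This is a hand-rolled illustration under simple assumptions (samples of each class are shuffled and dealt round-robin into folds), not what `StratifiedKFold` does internally; `stratified_folds` and `toy_labels` are hypothetical names.

```python
import random
from collections import defaultdict

def stratified_folds(labels, k, seed=42):
    """Yield (train_idx, test_idx) pairs that keep the class balance per fold."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal each class's samples round-robin across the k folds.
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)
    for f in range(k):
        # Fold f is the test set; every other fold goes into training.
        test_idx = sorted(folds[f])
        train_idx = sorted(i for g in range(k) if g != f for i in folds[g])
        yield train_idx, test_idx

toy_labels = [0] * 10 + [1] * 10
splits = list(stratified_folds(toy_labels, k=5))
for train_idx, test_idx in splits:
    print(len(train_idx), len(test_idx), sum(toy_labels[i] for i in test_idx))
```

Every sample ends up in exactly one test fold, and each test fold has the same fake/genuine ratio as the full dataset, which is the property we want from the real splitter.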

In [5]:
from keras.callbacks import EarlyStopping
from keras import regularizers

kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
cvscores = []
for train, test in kfold.split(X, y):
    # Build a fresh model for each fold so no weights carry over between folds.
    model = keras.Sequential([
        keras.layers.Dense(16, activation=tf.nn.relu, input_shape=(NUM_WORDS,)),
        keras.layers.Dense(16, activation=tf.nn.relu),
        keras.layers.Dense(1, activation=tf.nn.sigmoid)
    ])
    model.compile(optimizer='adam',
                  loss='binary_crossentropy',
                  metrics=['accuracy'])
    # Hold out 30% of the training fold to monitor validation loss during training.
    model.fit(X[train], y[train], epochs=10, batch_size=2048, validation_split=0.3, verbose=2)
    # Score the model on the fold it has never seen.
    scores = model.evaluate(X[test], y[test], verbose=2)
    print("%s: %.2f%%" % (model.metrics_names[1], scores[1]*100))
    cvscores.append(scores[1] * 100)

# Mean and standard deviation of the test accuracy across all folds.
print("%.2f%% (+/- %.2f%%)" % (np.mean(cvscores), np.std(cvscores)))

Train on 31499 samples, validate on 13500 samples
Epoch 1/10
 - 12s - loss: 0.6842 - acc: 0.5746 - val_loss: 0.6617 - val_acc: 0.6341
Epoch 2/10
 - 5s - loss: 0.6348 - acc: 0.6687 - val_loss: 0.6269 - val_acc: 0.6601
Epoch 3/10
 - 4s - loss: 0.5959 - acc: 0.6959 - val_loss: 0.6159 - val_acc: 0.6687
Epoch 4/10
 - 4s - loss: 0.5673 - acc: 0.7144 - val_loss: 0.6139 - val_acc: 0.6694
Epoch 5/10
 - 5s - loss: 0.5424 - acc: 0.7327 - val_loss: 0.6175 - val_acc: 0.6693
Epoch 6/10
 - 5s - loss: 0.5188 - acc: 0.7513 - val_loss: 0.6258 - val_acc: 0.6634
Epoch 7/10
 - 5s - loss: 0.4955 - acc: 0.7660 - val_loss: 0.6362 - val_acc: 0.6621
Epoch 8/10
 - 5s - loss: 0.4729 - acc: 0.7819 - val_loss: 0.6488 - val_acc: 0.6615
Epoch 9/10
 - 5s - loss: 0.4488 - acc: 0.7988 - val_loss: 0.6665 - val_acc: 0.6561
Epoch 10/10
 - 5s - loss: 0.4247 - acc: 0.8144 - val_loss: 0.6823 - val_acc: 0.6547
acc: 65.75%
Train on 31499 samples, validate on 13500 samples
Epoch 1/10
 - 10s - loss: 0.6726 - acc: 0.5759 - val_los

 - 10s - loss: 0.6177 - acc: 0.6792 - val_loss: 0.6214 - val_acc: 0.6682
Epoch 3/10
 - 12s - loss: 0.5831 - acc: 0.7013 - val_loss: 0.6118 - val_acc: 0.6731
Epoch 4/10
 - 7s - loss: 0.5571 - acc: 0.7202 - val_loss: 0.6114 - val_acc: 0.6715
Epoch 5/10
 - 5s - loss: 0.5339 - acc: 0.7373 - val_loss: 0.6185 - val_acc: 0.6671
Epoch 6/10
 - 5s - loss: 0.5121 - acc: 0.7508 - val_loss: 0.6252 - val_acc: 0.6688
Epoch 7/10
 - 5s - loss: 0.4909 - acc: 0.7652 - val_loss: 0.6377 - val_acc: 0.6623
Epoch 8/10
 - 5s - loss: 0.4699 - acc: 0.7794 - val_loss: 0.6521 - val_acc: 0.6600
Epoch 9/10
 - 5s - loss: 0.4472 - acc: 0.7950 - val_loss: 0.6668 - val_acc: 0.6568
Epoch 10/10
 - 7s - loss: 0.4261 - acc: 0.8094 - val_loss: 0.6877 - val_acc: 0.6528
acc: 65.65%
64.83% (+/- 0.83%)


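Awesome. With two hidden layers of 16 nodes each, simple word count vectors as features and no dropout or regularization, we get about 65% accuracy (64.83% averaged over the 10 folds). Not as good as our POC, but with almost 12x the data, definitely a good first step. Note that in the logs above the validation loss bottoms out around epoch 4 and then climbs while the training accuracy keeps rising, so the model is overfitting.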

Next steps: add regularization (L1 and L2), compare; add dropout, compare; add an early stopping callback; try TF-IDF; add some more features; and call it a day.
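
As a preview of what an early stopping callback would buy us, here is a small sketch that replays the validation losses from the first fold's log through a patience-based stopping rule. It is a simplified simulation, not the Keras `EarlyStopping` class: `simulate_early_stopping` is a hypothetical helper, and the patience of 2 is an arbitrary assumption.

```python
def simulate_early_stopping(val_losses, patience):
    """Return (best_epoch, stopped_epoch), both 1-indexed."""
    best_loss = float("inf")
    best_epoch = 0
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            # New best validation loss: remember it and reset the counter.
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            # No improvement: stop once we have waited `patience` epochs.
            wait += 1
            if wait >= patience:
                return best_epoch, epoch
    return best_epoch, len(val_losses)

# val_loss values copied from the first fold's training log above.
fold1_val_loss = [0.6617, 0.6269, 0.6159, 0.6139, 0.6175,
                  0.6258, 0.6362, 0.6488, 0.6665, 0.6823]
best_epoch, stopped_epoch = simulate_early_stopping(fold1_val_loss, patience=2)
print(best_epoch, stopped_epoch)  # 4 6
```

On this fold the rule would have stopped after epoch 6 and pointed at epoch 4 as the best one, skipping the last four epochs that only made the validation loss worse.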