## Keras Sentiment Analysis Part 2: How not to tune your neural network

Manipulating some variables might yield better results, but which changes would work best? Let's experiment and see what happens.

### Setting up the data

In [1]:
%matplotlib inline
import numpy
from keras.datasets import imdb
from matplotlib import pyplot
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Flatten
from keras.layers.convolutional import Convolution1D
from keras.layers.convolutional import MaxPooling1D
from keras.layers.embeddings import Embedding
from keras.preprocessing import sequence

# fix random seed for reproducibility
seed = 7
numpy.random.seed(seed)

# load the dataset but only keep the top n words, zero the rest
top_words = 5000
(X_train, y_train), (X_test, y_test) = imdb.load_data(nb_words=top_words)

# Pad short reviews and truncate long ones to max_words tokens
max_words = 500
X_train = sequence.pad_sequences(X_train, maxlen=max_words)
X_test = sequence.pad_sequences(X_test, maxlen=max_words)

Using Theano backend.


### Implementing the Suggestions
The authors suggest that tweaking the model could improve performance:

>Again, there is a lot of opportunity for further optimization, such as the use of deeper and/or larger convolutional layers. One interesting idea is to set the max pooling layer to use an input length of 500. This would compress each feature map to a single 32 length vector and may boost performance.

Our nets are going to get larger and training is going to take longer. Let's be somewhat tactical about our improvements and start with the suggestion of increasing the max pool length.
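To see why a pool length of 500 would leave a single 32-length vector, it helps to work out the shapes by hand. The sketch below assumes that the convolution keeps the sequence length at 500 (as `border_mode='same'` does) and that the pooling stride equals the pool length. `flattened_size` is a helper name made up for this illustration, not part of Keras.

```python
def flattened_size(seq_len, n_filters, pool_length):
    """Number of features reaching the first Dense layer after pooling and Flatten.

    Assumes the convolution preserves the sequence length and that the
    pooling stride equals pool_length (non-overlapping windows).
    """
    pooled_steps = seq_len // pool_length
    return pooled_steps * n_filters


for p in (2, 10, 250, 500):
    print(p, flattened_size(500, 32, p))
```

Under these assumptions, pool_length=2 hands 8000 features to the dense layer, while pool_length=500 hands it only 32.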

In [None]:
# create the model
model = Sequential()
model.add(Embedding(top_words, 32, input_length=max_words))
model.add(Convolution1D(nb_filter=32, filter_length=3, border_mode='same', activation='relu'))
model.add(MaxPooling1D(pool_length=500))
model.add(Flatten())
model.add(Dense(250, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
print(model.summary())

We've just modified the pool_length and nothing else. Let's train the model:

In [None]:
# Fit the model
model.fit(X_train, y_train, validation_data=(X_test, y_test), nb_epoch=2, batch_size=128, verbose=0)
# Final evaluation of the model
scores = model.evaluate(X_test, y_test, verbose=0)
print("Accuracy: %.2f%%" % (scores[1]*100))

Very interesting: we actually lost some performance, contrary to the authors' suggestion. Our accuracy is now closer to that of the baseline model. We should test a few more values. Let's assume there is some improvement to be found between pool_length=2 and pool_length=500 and look for it by repeatedly halving the range, in the spirit of a binary search. Let's put the model definition, compilation, training and evaluation all in one function so we can tweak the settings more easily. Let's also remove the summary.

In [3]:
def test_pool_length(p_length):
    # create the model
    model = Sequential()
    model.add(Embedding(top_words, 32, input_length=max_words))
    model.add(Convolution1D(nb_filter=32, filter_length=3, border_mode='same', activation='relu'))
    model.add(MaxPooling1D(pool_length=p_length))
    model.add(Flatten())
    model.add(Dense(250, activation='relu'))
    model.add(Dense(1, activation='sigmoid'))
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    # Fit the model
    model.fit(X_train, y_train, validation_data=(X_test, y_test), nb_epoch=2, batch_size=128, verbose=0)
    # Final evaluation of the model
    scores = model.evaluate(X_test, y_test, verbose=0)
    print("Accuracy: %.2f%%" % (scores[1]*100))

In [None]:
test_pool_length(250)

In [None]:
test_pool_length(125)
test_pool_length(375)
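The values 250, 125 and 375 come from halving the range level by level. A minimal sketch of that schedule follows, assuming the range is treated as 0 to 500. `halving_candidates` is a hypothetical helper. Note that this is not a true binary search: a binary search discards half of the range after each comparison, which only works if accuracy changes in a predictable direction with the pool length, whereas here both halves get explored.

```python
def halving_candidates(low, high, depth):
    """Return the midpoints visited, level by level, when an interval is halved."""
    levels = []
    intervals = [(low, high)]
    for _ in range(depth):
        mids = [(a + b) // 2 for a, b in intervals]
        levels.append(mids)
        # Split every interval at its midpoint for the next level.
        next_intervals = []
        for (a, b), m in zip(intervals, mids):
            next_intervals.extend([(a, m), (m, b)])
        intervals = next_intervals
    return levels


print(halving_candidates(0, 500, 2))  # [[250], [125, 375]]
```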

We are still getting poor results! At this point, I am concerned something is missing. Let's try to reproduce our original results:

In [None]:
test_pool_length(2)

Excellent, we are within a reasonable range of our original results. Perhaps we should test a few values closer to this one instead of jumping to a pool length of 500. This is still part of our brute force approach. While these run, we can do some homework and think about better approaches.

In [None]:
test_pool_length(3)
test_pool_length(4)
test_pool_length(5)
test_pool_length(10)

It seems there might be a small but upward trend. Let's try a few more:

In [None]:
test_pool_length(15)
test_pool_length(20)
test_pool_length(25)
test_pool_length(50)
test_pool_length(100)
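With a single run per setting, it is hard to tell a small trend from run-to-run noise. One way to check might be to repeat each setting a few times and compare the mean with the spread. The sketch below assumes a scoring function that returns an accuracy instead of printing it. `repeat_runs` and `placeholder_run` are hypothetical names, and the placeholder returns made-up numbers rather than anything measured in this notebook.

```python
import statistics


def repeat_runs(run_once, settings, n_repeats=3):
    """Call run_once(setting, seed) n_repeats times per setting.

    Returns {setting: (mean, sample standard deviation)}.
    """
    summary = {}
    for setting in settings:
        scores = [run_once(setting, seed) for seed in range(n_repeats)]
        summary[setting] = (statistics.mean(scores), statistics.stdev(scores))
    return summary


def placeholder_run(pool_length, seed):
    # Stand-in for building, training and scoring a model; NOT real results.
    return 0.80 + 0.01 * seed


for setting, (mean, sd) in repeat_runs(placeholder_run, [2, 10]).items():
    print("pool_length=%d: %.3f +/- %.3f" % (setting, mean, sd))
```

If the difference between two settings is smaller than the spread within a setting, it is probably not a trend worth chasing.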

This doesn't really seem like it's helping us. I am curious though: if we make the pool_length 500 as suggested, what is the impact on the other layers? Let's experiment with the next suggestion of wider and/or deeper layers. The next step reverts the pool_length to 2 and adds another fully connected layer.

In [None]:
# create the model
model = Sequential()
model.add(Embedding(top_words, 32, input_length=max_words))
model.add(Convolution1D(nb_filter=32, filter_length=3, border_mode='same', activation='relu'))
model.add(MaxPooling1D(pool_length=2))
model.add(Flatten())
model.add(Dense(250, activation='relu'))
model.add(Dense(250, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
# Fit the model
model.fit(X_train, y_train, validation_data=(X_test, y_test), nb_epoch=2, batch_size=128, verbose=0)
# Final evaluation of the model
scores = model.evaluate(X_test, y_test, verbose=0)
print("Accuracy: %.2f%%" % (scores[1]*100))

It seems that it doesn't hurt. What happens if we widen it?

In [None]:
# create the model
model = Sequential()
model.add(Embedding(top_words, 32, input_length=max_words))
model.add(Convolution1D(nb_filter=32, filter_length=3, border_mode='same', activation='relu'))
model.add(MaxPooling1D(pool_length=2))
model.add(Flatten())
model.add(Dense(500, activation='relu'))
model.add(Dense(250, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
# Fit the model
model.fit(X_train, y_train, validation_data=(X_test, y_test), nb_epoch=2, batch_size=128, verbose=0)
# Final evaluation of the model
scores = model.evaluate(X_test, y_test, verbose=0)
print("Accuracy: %.2f%%" % (scores[1]*100))

Again, no harm there but no serious improvements either. Let's try our pool/wider layer theory:

In [None]:
# create the model
model = Sequential()
model.add(Embedding(top_words, 32, input_length=max_words))
model.add(Convolution1D(nb_filter=32, filter_length=3, border_mode='same', activation='relu'))
model.add(MaxPooling1D(pool_length=500))
model.add(Flatten())
model.add(Dense(500, activation='relu'))
model.add(Dense(250, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
# Fit the model
model.fit(X_train, y_train, validation_data=(X_test, y_test), nb_epoch=2, batch_size=128, verbose=0)
# Final evaluation of the model
scores = model.evaluate(X_test, y_test, verbose=0)
print("Accuracy: %.2f%%" % (scores[1]*100))

Somewhat of a drop? I noticed we are only training for 2 epochs. Perhaps that is not enough for the model to converge. Let's go back to the original architecture and train for 25 epochs to see if that helps:

In [None]:
# create the model
model = Sequential()
model.add(Embedding(top_words, 32, input_length=max_words))
model.add(Convolution1D(nb_filter=32, filter_length=3, border_mode='same', activation='relu'))
model.add(MaxPooling1D(pool_length=2))
model.add(Flatten())
model.add(Dense(250, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
# Fit the model
model.fit(X_train, y_train, validation_data=(X_test, y_test), nb_epoch=25, batch_size=128, verbose=0)
# Final evaluation of the model
scores = model.evaluate(X_test, y_test, verbose=0)
print("Accuracy: %.2f%%" % (scores[1]*100))
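Picking a fixed epoch count is itself a guess. A common alternative is to watch the validation loss and stop once it has not improved for a few epochs. Keras provides an `EarlyStopping` callback in `keras.callbacks` built around this kind of rule. The sketch below only illustrates the rule in plain Python: `stopping_epoch` is a hypothetical helper, and the loss values are illustrative, not measurements from this notebook.

```python
def stopping_epoch(val_losses, patience=2):
    """Return the 1-based epoch at which patience-based early stopping halts.

    Training stops once the validation loss has failed to improve on its
    best value for `patience` consecutive epochs. If the rule never
    triggers, the last epoch is returned.
    """
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return len(val_losses)


# Illustrative values only, not results from this notebook.
example_losses = [0.45, 0.33, 0.31, 0.35, 0.40, 0.48]
print(stopping_epoch(example_losses))  # 5
```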

It's time for me to start looking for other solutions. I am wondering if ensembling is necessary to make the approx. 10% jump in performance I am hoping for. According to a simple [stackexchange answer](http://stats.stackexchange.com/questions/181/how-to-choose-the-number-of-hidden-layers-and-nodes-in-a-feedforward-neural-netw), a rule of thumb is that a hidden layer size somewhere between the input size and the output size often works well. Let's try a narrower net instead:
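For a sense of scale, here is a rough count of the trainable parameters in the dense part of the net for two hidden layer sizes. It assumes the flattened input has 8000 features, which is what pool_length=2 gives with 32 filters over 500 steps when the pooling stride equals the pool length. `dense_params` is a helper written for this illustration.

```python
def dense_params(n_inputs, n_units):
    """Weights plus biases of one fully connected layer."""
    return n_inputs * n_units + n_units


# Assumed input to the first Dense layer: 250 pooled steps times 32 filters.
flattened = (500 // 2) * 32

for units in (250, 32):
    hidden = dense_params(flattened, units)
    output = dense_params(units, 1)
    print(units, hidden + output)
```

Both 250 and 32 sit between the 8000 inputs and the single output, but the narrower layer needs roughly an eighth of the parameters.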

In [None]:
# create the model
model = Sequential()
model.add(Embedding(top_words, 32, input_length=max_words))
model.add(Convolution1D(nb_filter=32, filter_length=3, border_mode='same', activation='relu'))
model.add(MaxPooling1D(pool_length=2))
model.add(Flatten())
model.add(Dense(32, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
# Fit the model
model.fit(X_train, y_train, validation_data=(X_test, y_test), nb_epoch=2, batch_size=128, verbose=0)
# Final evaluation of the model
scores = model.evaluate(X_test, y_test, verbose=0)
print("Accuracy: %.2f%%" % (scores[1]*100))

Interesting results, not really a change. I did see varying, slightly improved results with smaller pool lengths earlier. Let's try setting the pool length to the new size of the hidden layer, 32.

In [None]:
# create the model
model = Sequential()
model.add(Embedding(top_words, 32, input_length=max_words))
model.add(Convolution1D(nb_filter=32, filter_length=3, border_mode='same', activation='relu'))
model.add(MaxPooling1D(pool_length=32))
model.add(Flatten())
model.add(Dense(32, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
# Fit the model
model.fit(X_train, y_train, validation_data=(X_test, y_test), nb_epoch=2, batch_size=128, verbose=0)
# Final evaluation of the model
scores = model.evaluate(X_test, y_test, verbose=0)
print("Accuracy: %.2f%%" % (scores[1]*100))

Definitely a step in the wrong direction, but perhaps we can at least keep our earlier improvement by going back to a pool length of 10.

In [None]:
# create the model
model = Sequential()
model.add(Embedding(top_words, 32, input_length=max_words))
model.add(Convolution1D(nb_filter=32, filter_length=3, border_mode='same', activation='relu'))
model.add(MaxPooling1D(pool_length=10))
model.add(Flatten())
model.add(Dense(32, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
# Fit the model
model.fit(X_train, y_train, validation_data=(X_test, y_test), nb_epoch=2, batch_size=128, verbose=0)
# Final evaluation of the model
scores = model.evaluate(X_test, y_test, verbose=0)
print("Accuracy: %.2f%%" % (scores[1]*100))

The process seems to be largely unproductive. I have been wondering whether I can set these parameters in an automated way, preferably without brute force, or whether I need another technique such as ensembling.
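The simplest automated version of what I did by hand is a grid search: list the candidate values for each parameter and score every combination. Here is a minimal sketch, assuming an `evaluate` function that takes the parameters as keyword arguments and returns a score. `grid_search` and `placeholder_evaluate` are hypothetical, and the placeholder's formula is arbitrary rather than a trained model. This is still brute force, just without the manual copy and paste.

```python
import itertools


def grid_search(evaluate, grid):
    """Score every combination in grid and return (best_score, best_params)."""
    names = sorted(grid)
    best_score, best_params = None, None
    for values in itertools.product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = evaluate(**params)
        if best_score is None or score > best_score:
            best_score, best_params = score, params
    return best_score, best_params


def placeholder_evaluate(pool_length, dense_units):
    # Stand-in for training and scoring a model; the formula is arbitrary.
    return -abs(pool_length - 10) - abs(dense_units - 32)


grid = {"pool_length": [2, 10, 32], "dense_units": [32, 250]}
print(grid_search(placeholder_evaluate, grid))
```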

### Conclusion

It seems our authors have handled this too!
* http://machinelearningmastery.com/machine-learning-model-running/
* http://machinelearningmastery.com/machine-learning-performance-improvement-cheat-sheet/
* http://machinelearningmastery.com/how-to-improve-machine-learning-results/
* http://machinelearningmastery.com/how-to-tune-algorithm-parameters-with-scikit-learn/
* http://machinelearningmastery.com/grid-search-hyperparameters-deep-learning-models-python-keras/
* http://machinelearningmastery.com/ensemble-machine-learning-algorithms-python-scikit-learn/
* http://machinelearningmastery.com/bagging-and-random-forest-ensemble-algorithms-for-machine-learning/