# Sentiment analysis with CNNs and Keras

This task was the objective of a [Kaggle competition](https://www.kaggle.com/c/sentiment-analysis-on-imdb-movie-reviews/leaderboard). At the time, the best-performing submission scored a log loss of 0.24591, so that is the number we're trying to beat.
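
For reference, log loss is the mean negative log-likelihood of the true labels under the predicted probabilities, the same quantity Keras reports as `binary_crossentropy`. Here is a minimal sketch in plain Python, assuming labels in {0, 1}. The function name `log_loss` and the clipping constant `eps` are illustrative choices, not taken from the competition's scoring code.

```python
import math

def log_loss(y_true, y_prob, eps=1e-7):
    """Mean binary log loss (natural log) over paired labels and probabilities."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        # Clip so that log(0) never occurs for overconfident predictions.
        p = min(max(p, eps), 1.0 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)

# Second sample has label 0, so predicting 0.9 is confidently wrong
# and is penalised far more than a hesitant 0.6.
confident = log_loss([1, 0], [0.9, 0.9])
hesitant = log_loss([1, 0], [0.6, 0.6])
print(confident, hesitant)
```

Lower is better, and an accurate model can still score poorly if its probabilities are overconfident, which is why accuracy and log loss can move in opposite directions.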

In [14]:
from keras.preprocessing import sequence
# LSTM was imported originally but is never used below, so it is dropped here.
from keras.layers import Embedding, Convolution1D, Flatten, Dropout, Dense, MaxPooling1D
from keras.models import Sequential

from keras.datasets import imdb

# Load & prep dataset

In [2]:
top_words = 1000
(X_train, y_train), (X_test, y_test) = imdb.load_data(num_words=top_words)
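
To make the effect of `num_words` concrete: with Keras' default settings, any word index at or above `num_words` is replaced by an out-of-vocabulary marker (index 2). The sketch below only illustrates that behaviour and is not the library's code. `cap_vocabulary` and the sample review are made up for the example.

```python
OOV_CHAR = 2  # Keras' default out-of-vocabulary marker for imdb.load_data

def cap_vocabulary(review, num_words, oov_char=OOV_CHAR):
    """Replace every word index >= num_words with the out-of-vocabulary marker."""
    return [w if w < num_words else oov_char for w in review]

sample = [1, 14, 22, 1622, 973, 4468, 65]  # hypothetical encoded review
print(cap_vocabulary(sample, num_words=1000))
# [1, 14, 22, 2, 973, 2, 65]
```

Every remaining index is below `top_words`, which is why `top_words` can double as the input dimension of the `Embedding` layer later on.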

In [3]:
max_review_len = 1600
embedding_vector_len = 300

In [4]:
X_train = sequence.pad_sequences(X_train, maxlen=max_review_len)
X_test = sequence.pad_sequences(X_test, maxlen=max_review_len)
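
With its default arguments, `pad_sequences` pads short reviews with zeros at the front and truncates long reviews from the front, keeping the last `maxlen` tokens. Here is a rough pure-Python equivalent for a single sequence, assuming those defaults. `pad_sequence` is a hypothetical helper, not a Keras function.

```python
def pad_sequence(seq, maxlen, value=0):
    """Left-pad with `value`, or keep only the last `maxlen` items."""
    if len(seq) >= maxlen:
        return list(seq[-maxlen:])  # truncate from the front
    return [value] * (maxlen - len(seq)) + list(seq)

print(pad_sequence([5, 6, 7], maxlen=5))           # [0, 0, 5, 6, 7]
print(pad_sequence([1, 2, 3, 4, 5, 6], maxlen=4))  # [3, 4, 5, 6]
```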

# Build straightforward CNN model

In [5]:
mdl = Sequential()
mdl.add(Embedding(top_words, embedding_vector_len, input_length=max_review_len))

mdl.add(Convolution1D(64, 3, padding='same'))
mdl.add(Convolution1D(32, 3, padding='same'))
mdl.add(Convolution1D(16, 3, padding='same'))
mdl.add(Flatten())
mdl.add(Dropout(0.2))

mdl.add(Dense(180, activation='sigmoid'))
mdl.add(Dropout(0.2))
mdl.add(Dense(1, activation='sigmoid'))
mdl.compile(loss="binary_crossentropy", optimizer="adam", metrics=['accuracy'])

## Train model

In [6]:
mdl.fit(X_train, y_train, epochs=4, batch_size=64)

Epoch 1/4
Epoch 2/4
Epoch 3/4
Epoch 4/4


<keras.callbacks.History at 0x114e69710>

In [8]:
scores = mdl.evaluate(X_test, y_test, verbose=0)

Accuracy: 84.132
[0.43232870185852051, 0.84131999999999996]


In [10]:
nms = mdl.metrics_names
print("{}: {}".format(nms[0], scores[0]))
print("{}: {}".format(nms[1], scores[1]*100))

loss: 0.432328701859
acc: 84.132


In [11]:
mdl.fit(X_train, y_train, epochs=4, batch_size=64)

Epoch 1/4
Epoch 2/4
Epoch 3/4
Epoch 4/4


<keras.callbacks.History at 0x1157b2cd0>

In [12]:
scores = mdl.evaluate(X_test, y_test, verbose=0)

nms = mdl.metrics_names
print("{}: {}".format(nms[0], scores[0]))
print("{}: {}".format(nms[1], scores[1]*100))

loss: 0.660322692337
acc: 82.288


After four more epochs the test loss rose from 0.432 to 0.660 and test accuracy fell from 84.1% to 82.3%, while the training loss and accuracy kept getting better. It's probably overfitting, so moving on to another model.
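
The "stop when the held-out loss gets worse" reasoning can be written down as a small patience rule. This is a sketch of the idea only. `best_checkpoint` is a hypothetical helper, and in practice the losses should come from a validation set rather than the test set.

```python
def best_checkpoint(val_losses, patience=1):
    """Return (index, loss) of the best evaluation, stopping once the loss
    has failed to improve for `patience` consecutive evaluations."""
    best_idx, best_loss, waited = 0, float("inf"), 0
    for i, loss in enumerate(val_losses):
        if loss < best_loss:
            best_idx, best_loss, waited = i, loss, 0
        else:
            waited += 1
            if waited >= patience:
                break
    return best_idx, best_loss

# Held-out losses after the first and second rounds of four epochs.
print(best_checkpoint([0.43233, 0.66032]))  # (0, 0.43233)
```

Keras ships an `EarlyStopping` callback that applies the same kind of rule per epoch when a validation set is passed to `fit`.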

# Slightly more complicated CNN model

Adding a max pooling layer, bumping up the dropout, and doubling the CNN stack.

In [16]:
mdl = Sequential()
mdl.add(Embedding(top_words, embedding_vector_len, input_length=max_review_len))

mdl.add(Convolution1D(64, 3, padding='same'))
mdl.add(Convolution1D(32, 3, padding='same'))
mdl.add(Convolution1D(16, 3, padding='same'))
mdl.add(MaxPooling1D(pool_size=2))
mdl.add(Dropout(0.25))

mdl.add(Convolution1D(64, 3, padding='same'))
mdl.add(Convolution1D(32, 3, padding='same'))
mdl.add(Convolution1D(16, 3, padding='same'))
mdl.add(MaxPooling1D(pool_size=2))
mdl.add(Flatten())
mdl.add(Dropout(0.25))

mdl.add(Dense(180, activation='sigmoid'))
mdl.add(Dropout(0.25))
mdl.add(Dense(1, activation='sigmoid'))
mdl.compile(loss="binary_crossentropy", optimizer="adam", metrics=['accuracy'])
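
One side effect of the pooling layers is a much smaller input to the first dense layer. Here is a back-of-the-envelope shape calculation, assuming `'same'` padding with stride 1 for the convolutions and the default pooling stride (equal to `pool_size`). `flattened_size` is a hypothetical helper, not part of Keras.

```python
def flattened_size(seq_len, layers):
    """Track sequence length and channels through convs and max-pooling.

    `layers` is a list of ("conv", filters) or ("pool", pool_size) tuples.
    """
    channels = None
    for kind, arg in layers:
        if kind == "conv":
            channels = arg   # 'same' padding, stride 1: length unchanged
        elif kind == "pool":
            seq_len //= arg  # default stride equals pool_size
        else:
            raise ValueError(kind)
    return seq_len * channels

stack = [("conv", 64), ("conv", 32), ("conv", 16)]
first_model = flattened_size(1600, stack)
second_model = flattened_size(1600, stack + [("pool", 2)] + stack + [("pool", 2)])
print(first_model, second_model)  # 25600 6400
```

Under these assumptions the `Dense(180)` layer sees 6,400 features here instead of the 25,600 in the first model, so it has a quarter as many weights to overfit with.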

## Train model

In [17]:
mdl.fit(X_train, y_train, epochs=4, batch_size=64)

Epoch 1/4
Epoch 2/4
Epoch 3/4
Epoch 4/4


<keras.callbacks.History at 0x10c7e9990>

In [18]:
scores = mdl.evaluate(X_test, y_test, verbose=0)

nms = mdl.metrics_names
print("{}: {}".format(nms[0], scores[0]))
print("{}: {}".format(nms[1], scores[1]*100))

loss: 0.299845308571
acc: 87.376


At 0.29984 we're not too far from the 0.24591 benchmark, but I'm pretty sure we can do better than this.

# More complicated model

Messing with the convolution stacks after adding another, and tweaking dropout to accommodate the extra layer.

In [19]:
mdl = Sequential()
mdl.add(Embedding(top_words, embedding_vector_len, input_length=max_review_len))

mdl.add(Convolution1D(128, 3, padding='same'))
mdl.add(Convolution1D(64, 3, padding='same'))
mdl.add(MaxPooling1D(pool_size=2))
mdl.add(Dropout(0.25))

mdl.add(Convolution1D(128, 3, padding='same'))
mdl.add(Convolution1D(64, 3, padding='same'))
mdl.add(MaxPooling1D(pool_size=2))
mdl.add(Dropout(0.25))

mdl.add(Convolution1D(64, 3, padding='same'))
mdl.add(Convolution1D(32, 3, padding='same'))
mdl.add(Convolution1D(16, 3, padding='same'))
mdl.add(MaxPooling1D(pool_size=2))
mdl.add(Flatten())
mdl.add(Dropout(0.25))

mdl.add(Dense(256, activation='sigmoid'))
mdl.add(Dropout(0.25))
mdl.add(Dense(1, activation='sigmoid'))
mdl.compile(loss="binary_crossentropy", optimizer="adam", metrics=['accuracy'])

## Train model

In [20]:
mdl.fit(X_train, y_train, epochs=1, batch_size=64)

Epoch 1/1


<keras.callbacks.History at 0x122c18590>

In [21]:
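# Caution: once fit() has run, plain assignment may not reach the compiled
# training function; keras.backend.set_value(mdl.optimizer.lr, 0.01) is the usual route.
mdl.optimizer.lr = 0.01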

In [22]:
mdl.fit(X_train, y_train, epochs=2, batch_size=64)

Epoch 1/2
Epoch 2/2


<keras.callbacks.History at 0x11f03a050>

In [24]:
mdl.optimizer.lr = 0.001

In [25]:
mdl.fit(X_train, y_train, epochs=2, batch_size=64)

Epoch 1/2
Epoch 2/2


<keras.callbacks.History at 0x115694650>

In [26]:
scores = mdl.evaluate(X_test, y_test, verbose=0)

nms = mdl.metrics_names
print("{}: {}".format(nms[0], scores[0]))
print("{}: {}".format(nms[1], scores[1]*100))

loss: 0.336828304973
acc: 87.652


OK, well that did worse. Accuracy is slightly better (87.652 vs. 87.376), but log loss is worse (0.33682 vs. 0.29984).

- Benchmark: 0.24591
- Model 1: 0.29984
- Model 2: 0.33682

# Progressive window CNN architecture

Altering our convolution stack to create progressively larger convolution windows (3, 4, 5, then 6), pooling after each stage.

In [29]:
mdl = Sequential()
mdl.add(Embedding(top_words, embedding_vector_len, input_length=max_review_len))
mdl.add(Dropout(0.25))

mdl.add(Convolution1D(64, 3, padding='same', activation='relu'))
mdl.add(MaxPooling1D(pool_size=2))

mdl.add(Convolution1D(64, 4, padding='same', activation='relu'))
mdl.add(MaxPooling1D(pool_size=2))

mdl.add(Convolution1D(64, 5, padding='same', activation='relu'))
mdl.add(MaxPooling1D(pool_size=2))

mdl.add(Convolution1D(64, 6, padding='same', activation='relu'))
mdl.add(MaxPooling1D(pool_size=2))
mdl.add(Flatten())

mdl.add(Dropout(0.25))
mdl.add(Dense(128, activation='relu'))
mdl.add(Dropout(0.5))
mdl.add(Dense(1, activation='sigmoid'))
mdl.compile(loss="binary_crossentropy", optimizer="adam", metrics=['accuracy'])
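
To get a feel for what the progressively larger windows buy, here is a sketch that computes how many input tokens a single output position of the last pooling layer can see. It uses the standard receptive-field recurrence and assumes stride 1 for the convolutions and stride 2 for each pooling layer. `receptive_field` is a hypothetical helper.

```python
def receptive_field(layers):
    """Receptive field (in input tokens) of one output position.

    `layers` is a list of (kernel_size, stride) pairs, applied in order.
    """
    rf, jump = 1, 1
    for kernel, stride in layers:
        rf += (kernel - 1) * jump  # each layer widens the view by (k-1) steps
        jump *= stride             # pooling doubles the spacing between steps
    return rf

# Conv windows of 3, 4, 5, 6 (stride 1), each followed by pool_size=2 (stride 2).
progressive = [(3, 1), (2, 2), (4, 1), (2, 2), (5, 1), (2, 2), (6, 1), (2, 2)]
print(receptive_field(progressive))  # 80
```

With a window of 3 at every stage, the same calculation gives 46 tokens, so the growing windows let each final feature summarise a noticeably longer stretch of the review.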

## Train model

In [30]:
mdl.fit(X_train, y_train, epochs=1, batch_size=64)

Epoch 1/1


<keras.callbacks.History at 0x12265eed0>

In [31]:
mdl.optimizer.lr = 0.01

In [32]:
mdl.fit(X_train, y_train, epochs=1, batch_size=64)

Epoch 1/1


<keras.callbacks.History at 0x12340bb10>

In [33]:
mdl.optimizer.lr = 0.001

In [34]:
mdl.fit(X_train, y_train, epochs=2, batch_size=64)

Epoch 1/2
Epoch 2/2


<keras.callbacks.History at 0x11f143710>

In [35]:
scores = mdl.evaluate(X_test, y_test, verbose=0)

nms = mdl.metrics_names
print("{}: {}".format(nms[0], scores[0]))
print("{}: {}".format(nms[1], scores[1]*100))

loss: 0.265067278109
acc: 88.836


Better log loss & accuracy. Headed in the right direction.

- Benchmark: 0.24591
- Model 1: 0.29984
- Model 2: 0.33682
- Model 3: 0.26506

This score would have gotten us a rank of 16 had it been entered into the contest.

It doesn't appear to be overfitting, so let's try a few more epochs and see where it goes. _Breaking best practices here by passing the test data as `validation_data` to get per-epoch updates during training. Should have split out a validation set._
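
For what it's worth, holding out a validation set only takes a few lines. This is a sketch of one way to do it, not what was done in this notebook. `train_val_split`, the 20% fraction, and the seed are all arbitrary choices made for illustration.

```python
import random

def train_val_split(n_samples, val_fraction=0.2, seed=0):
    """Return shuffled (train_idx, val_idx) index lists with no overlap."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_val = int(n_samples * val_fraction)
    return indices[n_val:], indices[:n_val]

train_idx, val_idx = train_val_split(25000)
print(len(train_idx), len(val_idx))  # 20000 5000
```

The index lists can then be used to slice `X_train` and `y_train`, with the held-out part passed as `validation_data`, leaving the test set untouched until the final evaluation. Keras' `fit` also accepts a `validation_split` argument, which holds out the last fraction of the training data without shuffling first.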

In [36]:
mdl.fit(X_train, y_train, epochs=4, validation_data=(X_test, y_test), batch_size=64)

Train on 25000 samples, validate on 25000 samples
Epoch 1/4
Epoch 2/4
Epoch 3/4
Epoch 4/4


<keras.callbacks.History at 0x122c29c50>

Now we're starting to overfit. Adding more dropout.
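
As a reminder of what the dropout rate controls: during training, each activation is zeroed with probability `rate` and the survivors are scaled up by `1 / (1 - rate)`, so the expected activation stays the same. Here is a minimal pure-Python sketch of that "inverted dropout" idea. The `dropout` function is illustrative and not the Keras layer.

```python
import random

def dropout(values, rate, rng):
    """Inverted dropout: zero each value with probability `rate`, scale the rest."""
    keep = 1.0 - rate
    return [v / keep if rng.random() >= rate else 0.0 for v in values]

rng = random.Random(0)
out = dropout([1.0] * 10000, rate=0.3, rng=rng)
print(sum(out) / len(out))  # close to 1.0: the expected activation is preserved
```

A higher rate removes more of the signal on each training step, which makes it harder for the network to memorise the training reviews but can also slow learning.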

## Tweaking model 3

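Raising the first two dropout rates from 0.25 to 0.3. Note that the final dropout actually goes down, from 0.5 to 0.3.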

In [41]:
mdl = Sequential()
mdl.add(Embedding(top_words, embedding_vector_len, input_length=max_review_len))
mdl.add(Dropout(0.3))

mdl.add(Convolution1D(64, 3, padding='same', activation='relu'))
mdl.add(MaxPooling1D(pool_size=2))

mdl.add(Convolution1D(64, 4, padding='same', activation='relu'))
mdl.add(MaxPooling1D(pool_size=2))

mdl.add(Convolution1D(64, 5, padding='same', activation='relu'))
mdl.add(MaxPooling1D(pool_size=2))

mdl.add(Convolution1D(64, 6, padding='same', activation='relu'))
mdl.add(MaxPooling1D(pool_size=2))
mdl.add(Flatten())

mdl.add(Dropout(0.3))
mdl.add(Dense(128, activation='relu'))
mdl.add(Dropout(0.3))
mdl.add(Dense(1, activation='sigmoid'))
mdl.compile(loss="binary_crossentropy", optimizer="adam", metrics=['accuracy'])

In [42]:
mdl.fit(X_train, y_train, epochs=1, validation_data=(X_test, y_test), batch_size=64)

Train on 25000 samples, validate on 25000 samples
Epoch 1/1


<keras.callbacks.History at 0x11648f5d0>

In [43]:
mdl.optimizer.lr = 0.01

In [44]:
mdl.fit(X_train, y_train, epochs=2, validation_data=(X_test, y_test), batch_size=64)

Train on 25000 samples, validate on 25000 samples
Epoch 1/2
Epoch 2/2


<keras.callbacks.History at 0x12543c3d0>

In [45]:
mdl.optimizer.lr = 0.001

In [46]:
mdl.fit(X_train, y_train, epochs=2, validation_data=(X_test, y_test), batch_size=64)

Train on 25000 samples, validate on 25000 samples
Epoch 1/2
Epoch 2/2


<keras.callbacks.History at 0x12543c790>

Looks like this model is starting to overfit as well, and it didn't perform better than the original configuration of model 3, so going back to the first dropout settings.