In [None]:
from keras.models import Sequential
from keras.layers import Embedding, Flatten, Dense, Conv1D, MaxPooling1D, GlobalMaxPooling1D
from keras.optimizers import RMSprop, adam
from keras import preprocessing
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.datasets import imdb

This time the dataset is the IMDB review set: how do we evaluate the sentiment expressed by people who bother to write movie reviews?  At some level this is a harder problem than we might expect. If we're not careful with how we build our model, we may end up scoring a review like "this movie is the bomb" the same as "this movie is a bomb", even though they clearly express opposite sentiments.  Our first model will purposefully suffer from this problem as an illustration of what to avoid.

There are a few ways to skin this cat, and you've already seen one of the less efficient ones with the statistical NLP approach.  The other two at this stage are one-hot encoding (creating large, sparse vectors to represent the words in a text) and word embeddings (dense, low-dimensional, and learned from data).  For a scale comparison, you can expect word embeddings with a few hundred to low thousands of dimensions, while one-hot encoding produces vectors with tens of thousands of dimensions(!!!), one per word in the vocabulary.  We'll be working with word embeddings.
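To make the size difference concrete, here is a minimal pure-Python sketch. The vocabulary size and embedding width match the values used later in this notebook; the embedding table is filled with random numbers purely for illustration (a real Embedding layer learns them), and `word_index = 42` is an arbitrary choice.

```python
import random

vocab_size = 10000   # same vocabulary size as max_features in this notebook
embedding_dim = 8    # same width as the first Embedding layer used later

def one_hot(index, size):
    """Return a sparse one-hot vector: all zeros except a single 1."""
    vec = [0] * size
    vec[index] = 1
    return vec

# Hypothetical embedding table: one small dense row per word index.
# A real Embedding layer learns these values; here they are random.
random.seed(0)
embedding_table = [[random.uniform(-0.05, 0.05) for _ in range(embedding_dim)]
                   for _ in range(vocab_size)]

word_index = 42  # arbitrary word index, chosen only for illustration
sparse = one_hot(word_index, vocab_size)
dense = embedding_table[word_index]

print(len(sparse), sum(sparse))  # 10000 1
print(len(dense))                # 8
```

The one-hot vector spends 10,000 slots to say "word 42", while the embedding describes the same word with 8 numbers that training can adjust.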

Broadly speaking there are two ways to learn and use word embeddings: learn them on the fly or use pretrained embeddings.  We'll examine both, but in practice will generally rely on the latter. 

In [None]:
#These are some variables we're setting for easy maneuver later, don't worry about their names now
max_features = 10000
maxlen = 20

(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words = max_features)

#A drawback of a static setup like what we're about to do is that the inputs are expected
#to be of a consistent size - in this case 20 words.  To accommodate this, pad_sequences
#pads shorter reviews with zeros and truncates longer ones (by default it keeps the last 20 words).
x_train = preprocessing.sequence.pad_sequences(x_train, maxlen = maxlen)
x_test = preprocessing.sequence.pad_sequences(x_test, maxlen = maxlen)

Downloading data from https://s3.amazonaws.com/text-datasets/imdb.npz
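As a rough illustration of what `pad_sequences` does to a single review with its default settings (`padding='pre'`, `truncating='pre'`), here is a pure-Python sketch. The helper name `pad_pre` and the toy sequences are made up for this example; the real function works on whole batches and returns a NumPy array.

```python
def pad_pre(seq, maxlen, value=0):
    """Mimic pad_sequences defaults for one sequence (padding='pre', truncating='pre')."""
    seq = list(seq)[-maxlen:]                     # truncate from the front: keep the last maxlen items
    return [value] * (maxlen - len(seq)) + seq    # pad at the front with zeros

short_review = [5, 6, 7]            # toy word indices
long_review = list(range(1, 31))    # 30 toy word indices

print(pad_pre(short_review, 5))   # [0, 0, 5, 6, 7]
print(pad_pre(long_review, 5))    # [26, 27, 28, 29, 30]
```

Note that long sequences lose their beginning, not their end, so with `maxlen = 20` the model sees the last 20 words of each review.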


In [None]:
#Build our model
model = Sequential()
#This is the first time we've used the Embedding layer.  For simplicity, we can
#think of it as a dictionary that maps integer indices to dense vectors.  As
#input it takes a 2D tensor (samples, sequence_length) and returns a 3D tensor
#(samples, sequence_length, embedding_dim)
model.add(Embedding(10000, 8, input_length = maxlen)) #10000-word vocabulary, 8-dim vectors -> (samples, maxlen, 8)
model.add(Flatten()) #2D tensor (samples, maxlen * 8)
model.add(Dense(1, activation = 'sigmoid')) #1D probability output
model.compile(optimizer = 'adam', loss = 'binary_crossentropy', metrics = ['accuracy'])
model.summary()

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 20, 8)             80000     
_________________________________________________________________
flatten_1 (Flatten)          (None, 160)               0         
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 161       
Total params: 80,161
Trainable params: 80,161
Non-trainable params: 0
_________________________________________________________________
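The parameter counts in this summary can be checked by hand. The short sketch below reproduces them from the layer sizes alone; the variable names are only for illustration.

```python
vocab_size, embedding_dim, maxlen = 10000, 8, 20

embedding_params = vocab_size * embedding_dim   # one 8-dim vector per word index
flatten_units = maxlen * embedding_dim          # 20 positions * 8 values each
dense_params = flatten_units * 1 + 1            # one weight per input plus one bias

total = embedding_params + dense_params
print(embedding_params, flatten_units, dense_params, total)  # 80000 160 161 80161
```

Almost all of the parameters live in the embedding table; the classifier on top is a single logistic unit.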


In [None]:
model.fit(x_train, y_train, epochs = 10, batch_size = 32, validation_split = 0.2)

  "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "


Train on 20000 samples, validate on 5000 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.callbacks.History at 0x7f1e1baff400>

In [None]:
test_loss, test_acc = model.evaluate(x_test, y_test)

print(test_acc)

0.735759973526001


So without any real structure to our network and a limited view of the data (no consideration of relationships between words or of context, and only the last 20 words of each review, since pad_sequences truncates from the front by default), we scored around 74% accuracy with our automated review checker.  We can improve on this by using 1D convnets or recurrent layers in our next iteration.  Let's start with 1D convnets.

We looked at convnets last time to deal with picture data, and found them valuable for extracting local features and building modular, efficient data representations.  We can make use of these properties by considering our text sequence to be something like a stream of time flowing forwards, with position in the sequence as the single dimension we convolve over.  We'll find later that RNNs are generally better at text tasks, but small 1D convnets can perform comparably on simple tasks and train more quickly.

Whereas previously we were considering patches of an image with our convnets, now we are looking at subsequences of our sequences.  These layers help to recognize local patterns within a larger sentence.  For example, since our inputs are sequences of word indices, a window of size three should ideally be able to learn phrases of up to three words, and should be able to recognize such a phrase anywhere else in the sequence.
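To picture what a window of size three "sees", the sketch below lists every contiguous three-word subsequence of a toy sentence. This only illustrates the sliding window itself, not the learned filter; the helper `windows` is a made-up name, and a real Conv1D layer operates on embedding vectors rather than raw strings.

```python
def windows(tokens, size):
    """Return every contiguous subsequence of `size` tokens (stride 1, no padding)."""
    return [tokens[i:i + size] for i in range(len(tokens) - size + 1)]

tokens = ["this", "movie", "is", "the", "bomb"]
for w in windows(tokens, 3):
    print(w)
```

A sequence of length n yields n - size + 1 windows, and because the same filter weights are applied to each one, a pattern learned in one position is recognized in any other.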

In [None]:
#We'll implement a new model for comparison down here.
max_features = 10000
max_len = 50

(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words = max_features)

x_train = preprocessing.sequence.pad_sequences(x_train, maxlen = max_len)
x_test = preprocessing.sequence.pad_sequences(x_test, maxlen = max_len)

In [None]:
cnn = Sequential()
cnn.add(Embedding(max_features, 128, input_length = max_len))
cnn.add(Conv1D(32, 7, activation = 'relu')) #Note the different window size here, we can afford a large one now at 7
cnn.add(MaxPooling1D(5))
cnn.add(Conv1D(32, 7, activation = 'relu'))
cnn.add(GlobalMaxPooling1D())
cnn.add(Dense(1, activation = 'sigmoid')) #Sigmoid so the output is a probability, as binary_crossentropy expects

cnn.summary()

Model: "sequential_3"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_3 (Embedding)      (None, 50, 128)           1280000   
_________________________________________________________________
conv1d_3 (Conv1D)            (None, 44, 32)            28704     
_________________________________________________________________
max_pooling1d_2 (MaxPooling1 (None, 8, 32)             0         
_________________________________________________________________
conv1d_4 (Conv1D)            (None, 2, 32)             7200      
_________________________________________________________________
global_max_pooling1d_2 (Glob (None, 32)                0         
_________________________________________________________________
dense_3 (Dense)              (None, 1)                 33        
Total params: 1,315,937
Trainable params: 1,315,937
Non-trainable params: 0
_________________________________________________________________
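The output shapes and parameter counts in this summary follow from a little arithmetic. The sketch below assumes the Keras defaults used above ('valid' padding, stride 1 for Conv1D, and a pooling stride equal to the pool size); the helper names are only for illustration.

```python
def conv1d_len(n, kernel):
    """Output length of a 1D convolution with 'valid' padding and stride 1."""
    return n - kernel + 1

def maxpool1d_len(n, pool):
    """Output length of 1D max pooling with stride equal to the pool size."""
    return n // pool

def conv1d_params(kernel, in_channels, filters):
    """Weights (kernel * in_channels per filter) plus one bias per filter."""
    return kernel * in_channels * filters + filters

n1 = conv1d_len(50, 7)       # after the first Conv1D
n2 = maxpool1d_len(n1, 5)    # after MaxPooling1D(5)
n3 = conv1d_len(n2, 7)       # after the second Conv1D
print(n1, n2, n3)            # 44 8 2

embedding = 10000 * 128
conv_a = conv1d_params(7, 128, 32)
conv_b = conv1d_params(7, 32, 32)
dense = 32 * 1 + 1
print(embedding, conv_a, conv_b, dense)      # 1280000 28704 7200 33
print(embedding + conv_a + conv_b + dense)   # 1315937
```

With only 50 input positions, two windows of size 7 and one pooling step leave just 2 positions before the global max pool, which is part of why this architecture cannot go much deeper without longer inputs.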

In [None]:
cnn.compile(optimizer = adam(lr=1e-4), loss = 'binary_crossentropy', metrics = ['accuracy'])
cnn.fit(x_train, y_train, epochs = 8, batch_size = 128, validation_split = 0.2)

  "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "


Train on 20000 samples, validate on 5000 samples
Epoch 1/8
Epoch 2/8
Epoch 3/8
Epoch 4/8
Epoch 5/8
Epoch 6/8
Epoch 7/8
Epoch 8/8


<keras.callbacks.callbacks.History at 0x7f1dc0cffdd8>

In [None]:
test_loss, test_acc = cnn.evaluate(x_test, y_test)

print(test_acc)

0.7619600296020508


So we got an improvement of about 3 percentage points for what seems like a fairly significant increase in complexity.  What are the lessons here?

1) We can move away from default settings, and often should.  I encourage you to try running this notebook with Adam's default learning rate (1e-3) instead of the 1e-4 used above.  This is also a hard problem, so don't be discouraged by the low accuracy we're getting.

2) CNNs may give us a local sense of information, but we lose this beyond the window.  RNNs will alleviate this to a degree.

3) We did not try regularization, which may have helped us bridge the accuracy gap.