# Sequence classification

In this exercise, you will get familiar with how to build RNNs in Keras. You will build a recurrent model to classify movie reviews as either positive or negative.

In [1]:
%matplotlib inline

import numpy as np
from tensorflow.keras.preprocessing import sequence
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout, Activation, Embedding
from tensorflow.keras.layers import LSTM, SimpleRNN, GRU
from tensorflow.keras.datasets import imdb

## IMDB Sentiment Dataset

The large movie review dataset is a collection of 25k positive and 25k negative movie reviews from [IMDB](http://www.imdb.com). Here are some excerpts from the dataset, both easy and hard, to get a sense of why this dataset is challenging:

> Ah, I loved this movie.

> Quite honestly, The Omega Code is the worst movie I have seen in a very long time.

> The wit and pace and three show stopping Busby Berkley numbers put this ahead of the over-rated 42nd Street. 

> There simply was no suspense, precious little excitement and too many dull spots, most of them trying to show why "Nellie" (Monroe) was so messed up.

The dataset can be found at http://ai.stanford.edu/~amaas/data/sentiment/. Since this is a common dataset for RNNs, Keras has a preprocessed version built-in.

In [29]:
# We limit the vocabulary to the max_features most frequent words (only 100 here);
# every other word is replaced by the out-of-vocabulary index
max_features = 100
(X_train, y_train), (X_test, y_test) = imdb.load_data(num_words=max_features)

The data is preprocessed by replacing words with indexes; see the [Keras docs](http://keras.io/datasets/#imdb-movie-reviews-sentiment-classification). Here's the first review in the training set.

In [31]:
review = X_train[0]
review

[1,
 14,
 22,
 16,
 43,
 2,
 2,
 2,
 2,
 65,
 2,
 2,
 66,
 2,
 4,
 2,
 36,
 2,
 5,
 25,
 2,
 43,
 2,
 2,
 50,
 2,
 2,
 9,
 35,
 2,
 2,
 5,
 2,
 4,
 2,
 2,
 2,
 2,
 2,
 2,
 39,
 4,
 2,
 2,
 2,
 17,
 2,
 38,
 13,
 2,
 4,
 2,
 50,
 16,
 6,
 2,
 2,
 19,
 14,
 22,
 4,
 2,
 2,
 2,
 4,
 22,
 71,
 87,
 12,
 16,
 43,
 2,
 38,
 76,
 15,
 13,
 2,
 4,
 22,
 17,
 2,
 17,
 12,
 16,
 2,
 18,
 2,
 5,
 62,
 2,
 12,
 8,
 2,
 8,
 2,
 5,
 4,
 2,
 2,
 16,
 2,
 66,
 2,
 33,
 4,
 2,
 12,
 16,
 38,
 2,
 5,
 25,
 2,
 51,
 36,
 2,
 48,
 25,
 2,
 33,
 6,
 22,
 12,
 2,
 28,
 77,
 52,
 5,
 14,
 2,
 16,
 82,
 2,
 8,
 4,
 2,
 2,
 2,
 15,
 2,
 4,
 2,
 7,
 2,
 5,
 2,
 36,
 71,
 43,
 2,
 2,
 26,
 2,
 2,
 46,
 7,
 4,
 2,
 2,
 13,
 2,
 88,
 4,
 2,
 15,
 2,
 98,
 32,
 2,
 56,
 26,
 2,
 6,
 2,
 2,
 18,
 4,
 2,
 22,
 21,
 2,
 2,
 26,
 2,
 5,
 2,
 30,
 2,
 18,
 51,
 36,
 28,
 2,
 92,
 25,
 2,
 4,
 2,
 65,
 16,
 38,
 2,
 88,
 12,
 16,
 2,
 5,
 16,
 2,
 2,
 2,
 32,
 15,
 16,
 2,
 19,
 2,
 32]

We can convince ourselves that these are movie reviews by using the vocabulary provided by Keras:

In [32]:
word_index = imdb.get_word_index()

First we create a dictionary from index to word. Note that the indices in the dataset are offset by 3 relative to `word_index`, because indices 0, 1, and 2 are reserved for padding, the start-of-sequence marker, and out-of-vocabulary words:

In [33]:
index_word = {i+3: w for w, i in word_index.items()}
index_word[0]=''
index_word[1]='start_char'
index_word[2]='oov'
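
The same mapping can be wrapped in small helpers. The sketch below is illustrative only: `toy_word_index`, `build_index_word`, and `decode_review` are hypothetical names, and the toy vocabulary stands in for the real one returned by `imdb.get_word_index()` so that the snippet runs without downloading anything.

```python
# Hypothetical toy vocabulary in the same format as imdb.get_word_index():
# word -> frequency rank, starting at 1.
toy_word_index = {"the": 1, "and": 2, "film": 3, "great": 4}


def build_index_word(word_index, index_from=3):
    """Invert word -> rank into index -> word, shifting every rank by index_from."""
    index_word = {rank + index_from: word for word, rank in word_index.items()}
    index_word[0] = ""            # padding
    index_word[1] = "start_char"  # start-of-sequence marker
    index_word[2] = "oov"         # out-of-vocabulary word
    return index_word


def decode_review(encoded, index_word):
    """Turn a list of integer indices back into text; unknown indices become 'oov'."""
    return " ".join(index_word.get(i, "oov") for i in encoded)


toy_index_word = build_index_word(toy_word_index)
decoded = decode_review([1, 4, 6, 2, 7], toy_index_word)
print(decoded)  # start_char the film oov great
```

Using `dict.get` with a default avoids a `KeyError` for any index that is missing from the mapping, such as the unused index 3.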

Then we can convert the first review to text:

In [34]:
' '.join([index_word[i] for i in review])

"start_char this film was just oov oov oov oov story oov oov really oov the oov they oov and you oov just oov oov there oov oov is an oov oov and oov the oov oov oov oov oov oov from the oov oov oov as oov so i oov the oov there was a oov oov with this film the oov oov oov the film were great it was just oov so much that i oov the film as oov as it was oov for oov and would oov it to oov to oov and the oov oov was oov really oov at the oov it was so oov and you oov what they oov if you oov at a film it oov have been good and this oov was also oov to the oov oov oov that oov the oov of oov and oov they were just oov oov are oov oov out of the oov oov i oov because the oov that oov them all oov up are oov a oov oov for the oov film but oov oov are oov and oov be oov for what they have oov don't you oov the oov story was so oov because it was oov and was oov oov oov all that was oov with oov all"

#### Exercise 1 - prepare the data

The reviews have different lengths, but we need to fit them into a matrix to feed to Keras. We will do this by picking a maximum length in words, truncating the examples that exceed it, and padding the shorter examples with 0.

Refer to the [Keras docs](http://keras.io/preprocessing/sequence/#pad_sequences) for the `pad_sequences` function. Use `pad_sequences` to make every sequence in `X_train` and `X_test` exactly `maxlen` long.

In [63]:
maxlen = 80
# Pad and clip the example sequences
X_train_t = sequence.pad_sequences(X_train, maxlen=maxlen)
X_test_t = sequence.pad_sequences(X_test, maxlen=maxlen)

X_train_t.shape

(25000, 80)
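
To make the padding and clipping concrete, here is a minimal pure-Python sketch of what `pad_sequences` does to a single sequence with its default settings (`padding='pre'`, `truncating='pre'`, `value=0`). The helper name `pad_or_clip` is hypothetical and is not part of Keras.

```python
def pad_or_clip(seq, maxlen, value=0):
    """Mimic the pad_sequences defaults for one sequence (maxlen must be positive).

    Longer sequences keep only their last `maxlen` items; shorter ones are
    padded on the left with `value`.
    """
    clipped = list(seq)[-maxlen:]                        # 'pre' truncation drops the start
    return [value] * (maxlen - len(clipped)) + clipped   # 'pre' padding fills the start


short = pad_or_clip([5, 6, 7], maxlen=5)
long = pad_or_clip([1, 2, 3, 4, 5, 6, 7], maxlen=5)
print(short)  # [0, 0, 5, 6, 7]
print(long)   # [3, 4, 5, 6, 7]
```

With these defaults, a review longer than `maxlen` keeps its final words rather than its opening words.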

#### Exercise 2 - build an RNN for classifying reviews as positive or negative

Build a single-layer RNN model and train it. You will need to include these parts:

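* An `Embedding` layer that maps each word index to a dense vector. This is equivalent to one-hot encoding the input and multiplying by a weight matrix, but much more efficient - [docs](http://keras.io/layers/embeddings/)
* A recurrent layer. Keras has a [few variants](http://keras.io/layers/recurrent/) you could use. LSTM layers are a popular choice.
* A `Dense` layer for the hidden to output connection.
* A sigmoid activation on a single output unit to produce the final prediction (a two-unit softmax would also work for this binary task).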
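
One way to see what the `Embedding` layer computes: looking up rows of a weight matrix gives the same result as one-hot encoding the indices and multiplying by that matrix, without ever building the one-hot vectors. The NumPy sketch below uses toy sizes and a random matrix chosen purely for illustration; it is not how Keras implements the layer internally.

```python
import numpy as np

vocab_size, embed_dim = 6, 3  # toy sizes, chosen only for illustration
rng = np.random.default_rng(0)
embedding_matrix = rng.normal(size=(vocab_size, embed_dim))

word_ids = np.array([4, 1, 4, 0])  # a short "review" of word indices

# Embedding lookup: pick one row of the matrix per word index.
looked_up = embedding_matrix[word_ids]

# One-hot route: build a (4, vocab_size) one-hot matrix, then multiply.
one_hot = np.eye(vocab_size)[word_ids]
multiplied = one_hot @ embedding_matrix

print(looked_up.shape)                     # (4, 3)
print(np.allclose(looked_up, multiplied))  # True
```

The lookup touches only the rows it needs, which is why it scales to large vocabularies where a one-hot matrix would be wasteful.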

You will need to decide how large your hidden state will be. You may also consider using some dropout on your recurrent or embedding layers; refer to the docs for how to do this.

Training for longer will generally give better results, but since RNNs are expensive to train, you can use 1 epoch to test. You should be able to get > 70% accuracy with 1 epoch. How high can you get?

In [62]:
# Design a recurrent model
model = Sequential()
model.add(Embedding(input_dim=max_features, output_dim=100, input_length=maxlen))
model.add(LSTM(12))
model.add(Dense(1, activation='sigmoid'))
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
model.summary()

Model: "sequential_15"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_5 (Embedding)      (None, 80, 100)           10000     
_________________________________________________________________
lstm_9 (LSTM)                (None, 12)                5424      
_________________________________________________________________
dense_6 (Dense)              (None, 1)                 13        
Total params: 15,437
Trainable params: 15,437
Non-trainable params: 0
_________________________________________________________________
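
The parameter counts in the summary can be checked by hand. The sketch below assumes the standard layer formulas (an LSTM has four weight blocks, each with input weights, recurrent weights, and a bias); the helper names are hypothetical.

```python
def embedding_params(vocab_size, embed_dim):
    # One embed_dim-sized vector per vocabulary entry.
    return vocab_size * embed_dim


def lstm_params(input_dim, units):
    # Four blocks (input, forget, and output gates plus the cell candidate),
    # each with input weights, recurrent weights, and a bias vector.
    return 4 * (input_dim * units + units * units + units)


def dense_params(input_dim, units):
    # Weight matrix plus one bias per output unit.
    return input_dim * units + units


counts = {
    "embedding": embedding_params(vocab_size=100, embed_dim=100),
    "lstm": lstm_params(input_dim=100, units=12),
    "dense": dense_params(input_dim=12, units=1),
}
total = sum(counts.values())
print(counts)  # {'embedding': 10000, 'lstm': 5424, 'dense': 13}
print(total)   # 15437
```

These values match the `Param #` column and the total of 15,437 reported by `model.summary()`.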


In [65]:
model.fit(X_train_t, y_train, batch_size=32, epochs=1, validation_data=(X_test_t, y_test))
loss, acc = model.evaluate(X_test_t, y_test, batch_size=32)
print('Test loss:', loss)
print('Test accuracy:', acc)

Train on 25000 samples, validate on 25000 samples
Test loss: 0.5781713641166687
Test accuracy: 0.695
