# Sequence Classification with LSTM Recurrent Neural Networks with Keras

Original post by [J. Brownlee](https://machinelearningmastery.com/sequence-classification-lstm-recurrent-neural-networks-python-keras/), edited by Dr. J. Tao.

Sequence classification is a predictive modeling problem where you have some sequence of inputs over space or time and the task is to predict a category for the sequence. For example, a *sentence* can be considered a **sequence of words**.

What makes this problem difficult is that:

- the sequences can vary in length;
    - many traditional ML methods, and CNNs in their usual form, expect fixed-length inputs;
- the sequences can be comprised of a very large vocabulary of input symbols;
    - thus we have to deal with the *curse of dimensionality*;
- the model may need to learn the long-term context or dependencies between symbols in the input sequence;
    - many other ML models struggle to capture such dependencies.

In this tutorial, you will discover how you can develop LSTM recurrent neural network models for sequence classification problems in Python using the Keras deep learning library.

If you are interested in sentence classification with CNN, [here](https://towardsdatascience.com/understanding-how-convolutional-neural-network-cnn-perform-text-classification-with-word-d2ee64b9dd0b) is a good post about it.

Upon completion of this tutorial, you should be able to:

- develop an LSTM model for a sequence classification problem.
- reduce overfitting in your LSTM models through the use of dropout.
- combine LSTM models with Convolutional Neural Networks that excel at learning spatial relationships.

## Analysis Step 1: Framing Your Analytical Problem

The problem that we will use to demonstrate sequence learning in this tutorial is the [IMDB movie review sentiment classification problem](http://ai.stanford.edu/~amaas/data/sentiment/). Each movie review is a variable-length sequence of words, and the sentiment of each movie review must be classified - hence, the reviews are the *unit of analysis* in this study.

The Large Movie Review Dataset (often referred to as the IMDB dataset) contains 25,000 highly polar movie reviews (good or bad) for training and the same amount again for testing. The problem is to determine whether a given movie review has a **positive** or **negative** sentiment. We work with only two sentiments here, although multi-class sentiment analysis (for example positive, negative, neutral, uncertain) is also possible.

The data was collected by [Stanford researchers and was used in a 2011 paper](http://ai.stanford.edu/~amaas/papers/wvSent_acl2011.pdf) where a split of **50-50** of the data was used for training and testing. An accuracy of **88.89%** was achieved.

Keras provides *built-in access to the IMDB dataset*, which means you do not have to download and import the dataset manually. The imdb.load_data() function allows you to load the dataset in a format that is ready for use in neural network and deep learning models.

The words have been replaced by integers that indicate the ordered frequency of each word in the dataset. The sentences in each review are therefore comprised of a sequence of integers.
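To make the integer encoding concrete, here is a minimal pure-Python sketch of frequency-rank encoding on three made-up reviews. The helper names (build_frequency_index, encode) and the toy reviews are my own, not part of Keras, and the real dataset additionally reserves a few low indices for special markers, which this sketch ignores.

```python
from collections import Counter

def build_frequency_index(texts):
    """Map each word to a rank: 1 = most frequent, 2 = next, and so on."""
    counts = Counter(word for text in texts for word in text.lower().split())
    # Sort by descending count; break ties alphabetically so the result is deterministic.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return {word: rank for rank, (word, _) in enumerate(ranked, start=1)}

def encode(text, index):
    """Replace every word in `text` with its frequency rank."""
    return [index[word] for word in text.lower().split()]

# Hypothetical toy corpus, invented for illustration.
toy_reviews = [
    "the movie was great",
    "the movie was bad",
    "the plot was great",
]
word_index = build_frequency_index(toy_reviews)
encoded = [encode(review, word_index) for review in toy_reviews]
print(word_index)
print(encoded)
```

The most frequent word ("the") receives index 1, so small integers correspond to common words and large integers to rare ones.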

## Coding Step 1: Importing Packages

In [2]:
import numpy as np #### Python's numeric function package
from keras.datasets import imdb #### Analysis dataset
from keras.models import Sequential #### container that stacks layers into a model
from keras.layers import Dense #### required layer in our LSTM network
from keras.layers import LSTM #### required layer in our LSTM network
from keras.layers import Embedding #### required layer in our LSTM network
from keras.preprocessing import sequence #### Packaged preprocessing step in Keras
# fix random seed for reproducibility
np.random.seed(7)

In [4]:
#### I am using Keras 2.0.8 and TensorFlow 1.3.0; please check your versions here
#### If your versions are too old, use pip or the Anaconda interface to update your Keras and TF packages
#### Newer versions are usually fine
import keras
import tensorflow as tf
print('Your Keras version is:', keras.__version__)
print('Your TensorFlow version is:', tf.__version__)

Your Keras version is: 2.0.8
Your TensorFlow version is: 1.3.0


### Word Embedding

You should have noticed that we imported a Keras layer called Embedding, which will do the word embedding for us.

**What is Word Embedding?**

We will map each movie review into a real vector domain using word embedding, a popular technique when working with text. In this technique, words are encoded as real-valued vectors in a high-dimensional space, where similarity in meaning between words translates to closeness in the vector space.
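As a rough illustration of "closeness in the vector space", the sketch below compares hand-made 3-dimensional vectors using cosine similarity. The vectors are invented for this example and are not learned embeddings; cosine similarity is one common closeness measure, chosen here as an assumption.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical 3-dimensional embeddings, invented for illustration.
toy_vectors = {
    "good":  np.array([0.9, 0.1, 0.0]),
    "great": np.array([0.8, 0.2, 0.1]),
    "awful": np.array([-0.7, 0.1, 0.2]),
}

sim_good_great = cosine_similarity(toy_vectors["good"], toy_vectors["great"])
sim_good_awful = cosine_similarity(toy_vectors["good"], toy_vectors["awful"])
print("good vs great:", round(sim_good_great, 3))
print("good vs awful:", round(sim_good_awful, 3))
```

Words with similar meaning ("good", "great") point in nearly the same direction, while "awful" points the other way.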

Keras provides a convenient way to convert positive integer representations of words into a word embedding by an Embedding layer.

We will map each word onto a real-valued vector of length 32. We will also limit the total number of words that we are interested in modeling to the 5000 most frequent words; all other words are replaced by a single out-of-vocabulary index. Finally, the sequence length (number of words) in each review varies, so we will constrain each review to be 500 words, truncating longer reviews and padding shorter reviews with zero values.
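Conceptually, an embedding layer is a lookup table: each integer selects one row of a weight matrix. The NumPy sketch below uses the sizes from the text (5000 words, vectors of length 32, reviews of 500 words), with random numbers standing in for the weights that a real Embedding layer would learn during training.

```python
import numpy as np

vocab_size = 5000      # number of distinct word indices, as in the text
embedding_dim = 32     # length of each word vector, as in the text
review_length = 500    # padded review length, as in the text

rng = np.random.default_rng(7)
# Stand-in for the learned weights; a real Embedding layer trains these values.
embedding_matrix = rng.normal(scale=0.05, size=(vocab_size, embedding_dim))

# A fake padded review: 497 zeros followed by three arbitrary word indices.
review = np.zeros(review_length, dtype=int)
review[-3:] = [14, 22, 16]

# The "embedding" is plain row indexing: one 32-long row per word index.
embedded_review = embedding_matrix[review]
print(embedded_review.shape)
```

One review of 500 integers therefore becomes a 500 x 32 array of real numbers, which is what the LSTM layer consumes.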

Now that we have defined our problem and how the data will be prepared and modeled, we are ready to develop an LSTM model to classify the sentiment of movie reviews.

## Coding Step 2: Loading data and Preprocessing

We need to load the IMDB dataset. We are constraining the dataset to the top 5,000 words, which is one way to tame the 'curse of dimensionality'. By default, the dataset comes already split into train (50%) and testing (50%) sets.

In [5]:
# load the dataset but only keep the top n words; the rest are mapped to an out-of-vocabulary index
top_words = 5000
(X_train, y_train), (X_test, y_test) = imdb.load_data(num_words=top_words)

### YOUR TURN HERE

A 50%/50% split leaves relatively little data for training; we may prefer a more lenient 70/30 split for training/testing.

Fill the following code block for that purpose.

In [None]:
from sklearn.model_selection import train_test_split
# fill in your seed here, and un-comment the statement
# seed = 
np.random.seed(seed)
X = np.concatenate((X_train, X_test), axis=0)
y = np.concatenate((y_train, y_test), axis=0)
# complete the following statement to split into 70% for train and 30% for test
X_train1, X_test1, y_train1, y_test1 = train_test_split()

Next, we need to truncate and pad the input sequences so that they are all the **same length** for modeling. The model will learn that the zero values carry no information. The sequences are therefore not the same length in terms of content, **but same-length vectors are required to perform the computation in Keras**.

In [6]:
# truncate and pad input sequences
max_review_length = 500
X_train = sequence.pad_sequences(X_train, maxlen=max_review_length)
X_test = sequence.pad_sequences(X_test, maxlen=max_review_length)
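To see what padding and truncation do to a single review, here is a small pure-Python sketch. It pads on the left and keeps the last items of long sequences, which is my understanding of the default behavior of pad_sequences; the helper name pad_or_truncate is my own and not a Keras function.

```python
def pad_or_truncate(seq, maxlen, value=0):
    """Left-pad short sequences and keep the last `maxlen` items of long ones."""
    if len(seq) >= maxlen:
        # Drop items from the front so the end of the review is preserved.
        return list(seq[len(seq) - maxlen:])
    return [value] * (maxlen - len(seq)) + list(seq)

short_review = [14, 22, 16]
long_review = [1, 2, 3, 4, 5, 6, 7]

padded_short = pad_or_truncate(short_review, maxlen=5)
padded_long = pad_or_truncate(long_review, maxlen=5)
print(padded_short)
print(padded_long)
```

With maxlen=5, the short review becomes [0, 0, 14, 22, 16] and the long one becomes [3, 4, 5, 6, 7].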

In [7]:
# Look at the shape of our data
# summarize size
print("Training data: ")
print(X_train.shape)
print(y_train.shape)
print("Testing data: ")
print(X_test.shape)
print(y_test.shape)

Training data: 
(25000, 500)
(25000,)
Testing data: 
(25000, 500)
(25000,)


In [15]:
#### You can look at what is in your data
print(X_train[0])
print(y_train[0])
#print(len(X_train[0]))

[   0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0  

What happened to sentences? What are these numbers for? Type your answer below:

**Answer**:

### YOUR TURN HERE

You may want to print out the shapes of your newly split (train1, test1) datasets to check whether the split is correct.

Fill in the code block below for that.

We can also print the unique class values.

In [9]:
# Summarize number of classes: 0 - negative, 1 - positive
print("Classes: ")
print(np.unique(y_train))

Classes: 
[0 1]


This is a very clean dataset, so splitting and padding are the only preprocessing steps we need. We can now move to the modeling (training) phase.

*In your actual projects, preprocessing will take a lot of time - please refer to your experience in IS 540 for this.*

## Analysis Step 3: Modeling/Training and Evaluation/Optimization

We can now define, compile and fit our LSTM model.

The first layer is the Embedding layer, which uses vectors of length 32 to represent each word. The next layer is the LSTM layer with 100 memory units (smart neurons). Finally, because this is a classification problem, we use a Dense output layer with a single neuron and a sigmoid activation function to make 0 or 1 predictions for the two classes (good and bad) in the problem.
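The layer sizes above determine how many trainable parameters the model has. The arithmetic below is a sketch using the standard LSTM layout (four weight sets: three gates plus the candidate cell state, each with input weights, recurrent weights and a bias); the results agree with the counts that model.summary() reports for this model.

```python
vocab_size = 5000      # top_words in the tutorial
embedding_dim = 32     # embedding vector length
lstm_units = 100       # LSTM memory units

# Embedding: one vector of length embedding_dim per vocabulary entry.
embedding_params = vocab_size * embedding_dim

# LSTM: four weight sets, each with input weights, recurrent weights and a bias.
lstm_params = 4 * (embedding_dim * lstm_units + lstm_units * lstm_units + lstm_units)

# Dense output: one weight per LSTM unit plus one bias.
dense_params = lstm_units * 1 + 1

total_params = embedding_params + lstm_params + dense_params
print(embedding_params, lstm_params, dense_params, total_params)
# prints: 160000 53200 101 213301
```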

#### Evaluation Metric and Optimization Method
Because it is a binary classification problem, log loss is used as the loss function ([**binary_crossentropy in Keras**](https://keras.io/losses/)). The efficient [ADAM optimization algorithm](https://keras.io/optimizers/) is also used. The model is fit for only 3 epochs because it quickly overfits the problem. A large batch size of 64 reviews is used to space out weight updates.
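For intuition about log loss, here is a minimal NumPy sketch of the binary cross-entropy formula. It is a hand-written illustration rather than the Keras implementation, and the labels and probabilities are made up.

```python
import numpy as np

def binary_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean log loss for labels in {0, 1} and predicted probabilities in (0, 1)."""
    y_true = np.asarray(y_true, dtype=float)
    # Clip so that log(0) never occurs.
    y_pred = np.clip(np.asarray(y_pred, dtype=float), eps, 1.0 - eps)
    per_sample = -(y_true * np.log(y_pred) + (1.0 - y_true) * np.log(1.0 - y_pred))
    return float(per_sample.mean())

labels = [1, 0, 1, 0]
confident_and_right = [0.9, 0.1, 0.8, 0.2]   # made-up predictions
confident_and_wrong = [0.1, 0.9, 0.2, 0.8]   # made-up predictions

loss_right = binary_crossentropy(labels, confident_and_right)
loss_wrong = binary_crossentropy(labels, confident_and_wrong)
print("loss when mostly right:", round(loss_right, 4))
print("loss when mostly wrong:", round(loss_wrong, 4))
```

The loss is small when the predicted probabilities agree with the labels and grows quickly when the model is confidently wrong, which is the signal the optimizer uses to update the weights.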

Please notice that we also look at the classification accuracy at each epoch.

The following step took about 25 minutes on my machine, so it is a good time to take a bio-break or grab a cup of coffee.

In [11]:
# %timeit
# create the model
embedding_vecor_length = 32
model = Sequential()
model.add(Embedding(top_words, embedding_vecor_length, input_length=max_review_length))
model.add(LSTM(100))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
print(model.summary())
model.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=3, batch_size=64)

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_2 (Embedding)      (None, 500, 32)           160000    
_________________________________________________________________
lstm_2 (LSTM)                (None, 100)               53200     
_________________________________________________________________
dense_2 (Dense)              (None, 1)                 101       
Total params: 213,301
Trainable params: 213,301
Non-trainable params: 0
_________________________________________________________________
None
Train on 25000 samples, validate on 25000 samples
Epoch 1/3
Epoch 2/3
Epoch 3/3


<keras.callbacks.History at 0x118a262e8>

### YOUR TURN HERE

We claimed that the previous model would overfit. How did we know? (Type your answer below)

**Answer**:

## Analysis Step 4: Testing/Deployment

Once the model is fit, we estimate the performance of the model on unseen reviews.

In [12]:
# Final evaluation of the model
scores = model.evaluate(X_test, y_test, verbose=0)
print("Accuracy: %.2f%%" % (scores[1]*100))

Accuracy: 86.53%
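The accuracy figure is simply the share of reviews whose predicted class matches the true label. A minimal sketch, assuming the sigmoid output is turned into a class by thresholding at 0.5 (the probabilities below are made up):

```python
import numpy as np

def binary_accuracy(y_true, y_prob, threshold=0.5):
    """Fraction of samples whose thresholded probability matches the label."""
    y_true = np.asarray(y_true)
    y_pred = (np.asarray(y_prob) > threshold).astype(int)
    return float((y_pred == y_true).mean())

labels = [1, 0, 1, 1, 0]
probabilities = [0.92, 0.30, 0.41, 0.77, 0.08]   # made-up sigmoid outputs

accuracy = binary_accuracy(labels, probabilities)
print("Accuracy: %.2f%%" % (accuracy * 100))
```

Four of the five toy predictions are correct (the third review is predicted negative but labeled positive), so this prints an accuracy of 80.00%.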


### YOUR TURN HERE

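Remember that we created our own training and testing datasets? Now let's put them to use. Fill in the following code block for that. If you split the data before padding, remember to pad X_train1 and X_test1 to max_review_length first.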

In [None]:
embedding_vecor_length = 32
model1 = Sequential()
model1.add(Embedding(top_words, embedding_vecor_length, input_length=max_review_length))
model1.add(LSTM(50)) #### fewer units will make training faster
model1.add(Dense(1, activation='sigmoid'))
model1.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
print(model1.summary())
#Fill in here

## Addendum: Fighting Overfitting
### LSTM For Sequence Classification With Dropout

Recurrent neural networks like LSTMs are generally prone to overfitting.

Dropout can be applied between layers using the Dropout Keras layer. We can do this easily by adding new Dropout layers between the Embedding and LSTM layers and the LSTM and Dense output layers.

In [14]:
from keras.layers import Dropout

model2 = Sequential()
model2.add(Embedding(top_words, embedding_vecor_length, input_length=max_review_length))
model2.add(Dropout(0.2))
model2.add(LSTM(100))
model2.add(Dropout(0.2))
model2.add(Dense(1, activation='sigmoid'))
model2.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
print(model2.summary())
model2.fit(X_train, y_train, epochs=3, batch_size=64)
# Final evaluation of the model
scores = model2.evaluate(X_test, y_test, verbose=0)
print("Accuracy: %.2f%%" % (scores[1]*100))

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_4 (Embedding)      (None, 500, 32)           160000    
_________________________________________________________________
dropout_1 (Dropout)          (None, 500, 32)           0         
_________________________________________________________________
lstm_3 (LSTM)                (None, 100)               53200     
_________________________________________________________________
dropout_2 (Dropout)          (None, 100)               0         
_________________________________________________________________
dense_3 (Dense)              (None, 1)                 101       
Total params: 213,301
Trainable params: 213,301
Non-trainable params: 0
_________________________________________________________________
None
Epoch 1/3
Epoch 2/3
Epoch 3/3
Accuracy: 87.38%


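We can see dropout having the desired impact on training, with a slightly slower trend in convergence. In the run recorded above, the final test accuracy (87.38%) is slightly higher than that of the model without dropout (86.53%); results will vary from run to run. The model could probably use a few more epochs of training and may achieve a higher skill (try it and see).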
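To make concrete what a dropout layer does during training, here is a minimal NumPy sketch of "inverted" dropout, which is how I understand the Keras Dropout layer to behave: a random fraction of the values is zeroed and the survivors are scaled up so the average stays the same. The function name and the all-ones input are my own choices for illustration.

```python
import numpy as np

def inverted_dropout(x, rate, rng):
    """Zero each element with probability `rate` and rescale the survivors."""
    keep_prob = 1.0 - rate
    mask = rng.random(x.shape) < keep_prob
    return x * mask / keep_prob

rng = np.random.default_rng(7)
activations = np.ones((1000, 100))   # stand-in for one batch of LSTM outputs

dropped = inverted_dropout(activations, rate=0.2, rng=rng)
fraction_zeroed = float((dropped == 0).mean())
print("fraction zeroed:", round(fraction_zeroed, 3))
print("mean after dropout:", round(float(dropped.mean()), 3))
```

Roughly 20% of the values are zeroed on every call, so the network cannot rely on any single unit. At test time dropout is switched off and all units are used.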

### YOUR TURN HERE

Add more epochs to the above model to see if the result improves - also observe whether the model **overfits**.

In [None]:
model2 = Sequential()
model2.add(Embedding(top_words, embedding_vecor_length, input_length=max_review_length))
model2.add(Dropout(0.2))
model2.add(LSTM(100))
model2.add(Dropout(0.2))
model2.add(Dense(1, activation='sigmoid'))
model2.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
print(model2.summary())
model2.fit(X_train, y_train, epochs=5, batch_size=64) #### number of epochs changed from 3 to 5; try other values too
# Final evaluation of the model
scores = model2.evaluate(X_test, y_test, verbose=0)
print("Accuracy: %.2f%%" % (scores[1]*100))

Alternatively, dropout can be applied separately to the input and recurrent connections of the memory units within the LSTM.

Keras provides this capability with two parameters on the LSTM layer: dropout, for configuring the input dropout, and recurrent_dropout, for configuring the recurrent dropout. For example, we can modify the first example to add dropout to the input and recurrent connections as follows (note that this cell also uses 50 LSTM units rather than 100 and trains for 5 epochs):

In [None]:
#embedding_vecor_length = 32
model3 = Sequential()
model3.add(Embedding(top_words, embedding_vecor_length, input_length=max_review_length))
model3.add(LSTM(50, dropout=0.2, recurrent_dropout=0.2))
model3.add(Dense(1, activation='sigmoid'))
model3.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
print(model3.summary())
model3.fit(X_train, y_train, epochs=5, batch_size=64)
# Final evaluation of the model
scores = model3.evaluate(X_test, y_test, verbose=0)
print("Accuracy: %.2f%%" % (scores[1]*100))

Run the cell and compare: the LSTM-specific dropout is expected to have a more pronounced effect on the convergence of the network than the layer-wise dropout. As above, the number of epochs could be increased to see if the skill of the model can be further lifted.

Dropout is a powerful technique for combating overfitting in your LSTM models, and it is a good idea to try both methods, but you may get better results with the gate-specific dropout provided in Keras.

## Addendum 2: Using LSTM and CNN Together
### LSTM and Convolutional Neural Network For Sequence Classification

Convolutional neural networks excel at learning the spatial structure in input data.

The IMDB review data does have a one-dimensional spatial structure in the sequence of words in reviews, and the CNN may be able to pick out invariant features for good and bad sentiment. These learned spatial features may then be learned as sequences by an LSTM layer.

We can easily add a one-dimensional CNN and max pooling layers after the Embedding layer which then feed the consolidated features to the LSTM. We can use a smallish set of 32 features with a small filter length of 3. The pooling layer can use the standard length of 2 to halve the feature map size.
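The shape changes described above can be traced with a small NumPy sketch: a "same"-padded one-dimensional convolution with 32 filters of length 3 and ReLU activation, followed by max pooling of size 2. This is a hand-written illustration with random stand-in weights, not the Keras implementation.

```python
import numpy as np

def conv1d_same(x, kernels, bias):
    """x: (steps, in_channels); kernels: (kernel_size, in_channels, filters)."""
    kernel_size = kernels.shape[0]
    pad_left = (kernel_size - 1) // 2
    pad_right = kernel_size - 1 - pad_left
    # Zero-pad the time axis so the output has as many steps as the input.
    padded = np.pad(x, ((pad_left, pad_right), (0, 0)), mode="constant")
    steps = x.shape[0]
    out = np.empty((steps, kernels.shape[2]))
    for t in range(steps):
        window = padded[t:t + kernel_size]            # (kernel_size, in_channels)
        out[t] = np.tensordot(window, kernels, axes=([0, 1], [0, 1])) + bias
    return np.maximum(out, 0.0)                       # ReLU activation

def max_pool1d(x, pool_size=2):
    """Keep the maximum of each non-overlapping block of `pool_size` steps."""
    steps = (x.shape[0] // pool_size) * pool_size
    return x[:steps].reshape(-1, pool_size, x.shape[1]).max(axis=1)

rng = np.random.default_rng(7)
embedded = rng.normal(size=(500, 32))                 # one embedded review
kernels = rng.normal(scale=0.1, size=(3, 32, 32))     # random stand-in weights
bias = np.zeros(32)

conv_out = conv1d_same(embedded, kernels, bias)
pooled = max_pool1d(conv_out, pool_size=2)
conv_params = kernels.size + bias.size
print(conv_out.shape, pooled.shape, conv_params)
```

The convolution keeps all 500 time steps, pooling halves them to 250, and the convolution itself contributes 3 x 32 x 32 + 32 = 3,104 weights. The LSTM that follows therefore only has to step through 250 positions per review.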

For example, we would create the model as follows:

In [None]:
from keras.layers import Conv1D, MaxPooling1D

model5 = Sequential()
model5.add(Embedding(top_words, embedding_vecor_length, input_length=max_review_length))
model5.add(Conv1D(filters=32, kernel_size=3, padding='same', activation='relu'))
model5.add(MaxPooling1D(pool_size=2))
####DROPOUT
model5.add(LSTM(100)) ####EMBED DROPOUT
model5.add(Dense(1, activation='sigmoid'))
model5.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
print(model5.summary())
model5.fit(X_train, y_train, epochs=3, batch_size=64)
# Final evaluation of the model
scores = model5.evaluate(X_test, y_test, verbose=0)
print("Accuracy: %.2f%%" % (scores[1]*100))

Run the cell and compare with the first example: the results should be similar. The Conv1D layer adds a few weights (3,104), but training is faster because the LSTM only has to process 250 time steps instead of 500.

I would expect that even better results could be achieved if this example were further extended to use dropout.

### YOUR TURN HERE

Please try adding dropout (as separate layers or embedded in the LSTM) to the above model, and observe the results.

## Summary

In this tutorial you discovered how to develop LSTM network models for sequence classification predictive modeling problems.

Specifically, you learned:

- How to develop a simple single layer LSTM model for the IMDB movie review sentiment classification problem.
- How to extend your LSTM model with layer-wise and LSTM-specific dropout to reduce overfitting.
- How to combine the spatial structure learning properties of a Convolutional Neural Network with the sequence learning of an LSTM.

Please save your tutorial file for submission.