# Deep Sequence Modelling
We will go through some techniques for modelling sequences with deep learning models. The running task is to predict the sentiment of IMDB movie reviews.

### RNN  

<img src="rnn.png" width="20%">

A recurrent neural network (RNN) is a class of artificial neural network in which connections between units form a directed cycle. This gives the network an internal state, which allows it to exhibit dynamic temporal behavior. It is the simplest sequence model, so we start with it. In Keras the layer is defined as:

```python
keras.layers.recurrent.SimpleRNN(units, activation='tanh', use_bias=True, 
                                 kernel_initializer='glorot_uniform', 
                                 recurrent_initializer='orthogonal', 
                                 bias_initializer='zeros', 
                                 kernel_regularizer=None, 
                                 recurrent_regularizer=None, 
                                 bias_regularizer=None, 
                                 activity_regularizer=None, 
                                 kernel_constraint=None, recurrent_constraint=None, 
                                 bias_constraint=None, dropout=0.0, recurrent_dropout=0.0)
```

#### Arguments:

<ul>
<li><strong>units</strong>: Positive integer, dimensionality of the output space.</li>
<li><strong>activation</strong>: Activation function to use
    (see <a href="http://keras.io/activations/">activations</a>).
    If you pass None, no activation is applied
    (i.e. "linear" activation: <code>a(x) = x</code>).</li>
<li><strong>use_bias</strong>: Boolean, whether the layer uses a bias vector.</li>
<li><strong>kernel_initializer</strong>: Initializer for the <code>kernel</code> weights matrix,
    used for the linear transformation of the inputs.
    (see <a href="https://keras.io/initializers/">initializers</a>).</li>
<li><strong>recurrent_initializer</strong>: Initializer for the <code>recurrent_kernel</code>
    weights matrix,
    used for the linear transformation of the recurrent state.
    (see <a href="https://keras.io/initializers/">initializers</a>).</li>
<li><strong>bias_initializer</strong>: Initializer for the bias vector
    (see <a href="https://keras.io/initializers/">initializers</a>).</li>
<li><strong>kernel_regularizer</strong>: Regularizer function applied to
    the <code>kernel</code> weights matrix
    (see <a href="https://keras.io/regularizers/">regularizer</a>).</li>
<li><strong>recurrent_regularizer</strong>: Regularizer function applied to
    the <code>recurrent_kernel</code> weights matrix
    (see <a href="https://keras.io/regularizers/">regularizer</a>).</li>
<li><strong>bias_regularizer</strong>: Regularizer function applied to the bias vector
    (see <a href="https://keras.io/regularizers/">regularizer</a>).</li>
<li><strong>activity_regularizer</strong>: Regularizer function applied to
    the output of the layer (its "activation").
    (see <a href="https://keras.io/regularizers/">regularizer</a>).</li>
<li><strong>kernel_constraint</strong>: Constraint function applied to
    the <code>kernel</code> weights matrix
    (see <a href="https://keras.io/constraints/">constraints</a>).</li>
<li><strong>recurrent_constraint</strong>: Constraint function applied to
    the <code>recurrent_kernel</code> weights matrix
    (see <a href="https://keras.io/constraints/">constraints</a>).</li>
<li><strong>bias_constraint</strong>: Constraint function applied to the bias vector
    (see <a href="https://keras.io/constraints/">constraints</a>).</li>
<li><strong>dropout</strong>: Float between 0 and 1.
    Fraction of the units to drop for
    the linear transformation of the inputs.</li>
<li><strong>recurrent_dropout</strong>: Float between 0 and 1.
    Fraction of the units to drop for
    the linear transformation of the recurrent state.</li>
</ul>
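To make the arguments above concrete, here is a minimal NumPy sketch of the recurrence a tanh RNN layer computes for one sequence. The function name `simple_rnn_forward` and the toy sizes are illustrative assumptions; the argument names mirror the `kernel`, `recurrent_kernel` and bias weights described above. It ignores dropout, regularizers and constraints.

```python
import numpy as np

def simple_rnn_forward(x, kernel, recurrent_kernel, bias):
    """Run a tanh RNN over one sequence and return the final hidden state.

    x: (timesteps, input_dim)
    kernel: (input_dim, units), recurrent_kernel: (units, units), bias: (units,)
    """
    units = recurrent_kernel.shape[0]
    h = np.zeros(units)  # assumed initial state: all zeros
    for x_t in x:
        # The new state mixes the current input with the previous state.
        h = np.tanh(x_t @ kernel + h @ recurrent_kernel + bias)
    return h

# Toy sizes, chosen only for illustration.
rng = np.random.default_rng(0)
timesteps, input_dim, units = 5, 3, 4
x = rng.normal(size=(timesteps, input_dim))
kernel = rng.normal(size=(input_dim, units))
recurrent_kernel = rng.normal(size=(units, units))
bias = np.zeros(units)

h_last = simple_rnn_forward(x, kernel, recurrent_kernel, bias)
print(h_last.shape)
```

The final state has one value per unit, which is why `units` is described as the dimensionality of the output space.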

In [1]:
%matplotlib inline

In [1]:
import numpy as np
import pandas as pd
import keras

import matplotlib.pyplot as plt

from keras.models import Sequential
from keras.layers import Dense, Dropout, Activation, Flatten
from keras.layers import Conv2D, MaxPooling2D
from keras.layers import TimeDistributed, RepeatVector
from keras.layers.embeddings import Embedding
from keras.layers.recurrent import LSTM, GRU, SimpleRNN

from keras.preprocessing import image, sequence
from keras.datasets import imdb, mnist
from keras.utils import np_utils
from keras.callbacks import EarlyStopping, ModelCheckpoint

Using TensorFlow backend.


## IMDB sentiment classification task

This is a dataset for binary sentiment classification containing substantially more data than previous benchmark datasets. 

It provides 25,000 highly polar movie reviews for training and 25,000 for testing.

There is additional unlabeled data for use as well. Raw text and already processed bag of words formats are provided. 

http://ai.stanford.edu/~amaas/data/sentiment/

### Data Preparation - IMDB

In [2]:
max_features = 20000
maxlen = 100  # pad or truncate every review to this many tokens (token ids are limited to the max_features most common words)
batch_size = 32

print("Loading data...")
(X_train, y_train), (X_test, y_test) = imdb.load_data(num_words=max_features)
print(len(X_train), 'train sequences')
print(len(X_test), 'test sequences')

print('Example:')
print(X_train[:1])

print("Pad sequences (samples x time)")
X_train = sequence.pad_sequences(X_train, maxlen=maxlen)
X_test = sequence.pad_sequences(X_test, maxlen=maxlen)
print('X_train shape:', X_train.shape)
print('X_test shape:', X_test.shape)

Loading data...
25000 train sequences
25000 test sequences
Example:
[list([1, 14, 22, 16, 43, 530, 973, 1622, 1385, 65, 458, 4468, 66, 3941, 4, 173, 36, 256, 5, 25, 100, 43, 838, 112, 50, 670, 2, 9, 35, 480, 284, 5, 150, 4, 172, 112, 167, 2, 336, 385, 39, 4, 172, 4536, 1111, 17, 546, 38, 13, 447, 4, 192, 50, 16, 6, 147, 2025, 19, 14, 22, 4, 1920, 4613, 469, 4, 22, 71, 87, 12, 16, 43, 530, 38, 76, 15, 13, 1247, 4, 22, 17, 515, 17, 12, 16, 626, 18, 19193, 5, 62, 386, 12, 8, 316, 8, 106, 5, 4, 2223, 5244, 16, 480, 66, 3785, 33, 4, 130, 12, 16, 38, 619, 5, 25, 124, 51, 36, 135, 48, 25, 1415, 33, 6, 22, 12, 215, 28, 77, 52, 5, 14, 407, 16, 82, 10311, 8, 4, 107, 117, 5952, 15, 256, 4, 2, 7, 3766, 5, 723, 36, 71, 43, 530, 476, 26, 400, 317, 46, 7, 4, 12118, 1029, 13, 104, 88, 4, 381, 15, 297, 98, 32, 2071, 56, 26, 141, 6, 194, 7486, 18, 4, 226, 22, 21, 134, 476, 26, 480, 5, 144, 30, 5535, 18, 51, 36, 28, 224, 92, 25, 104, 4, 226, 65, 16, 38, 1334, 88, 12, 16, 283, 5, 16, 4472, 113, 103, 32, 1
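As a rough illustration of what the padding step does to each review, the sketch below mimics the default behaviour of `pad_sequences` (zeros added at the front, long sequences cut from the front) in plain Python. The helper name `pad_pre` is hypothetical and only covers this default case.

```python
def pad_pre(seq, maxlen, value=0):
    """Keep the last `maxlen` items and left-pad shorter sequences with `value`."""
    truncated = seq[-maxlen:]  # long sequences lose their earliest tokens
    return [value] * (maxlen - len(truncated)) + truncated

short_review = [1, 14, 22]
long_review = [1, 14, 22, 16, 43, 530]

print(pad_pre(short_review, maxlen=5))  # [0, 0, 1, 14, 22]
print(pad_pre(long_review, maxlen=5))   # [14, 22, 16, 43, 530]
```

With these defaults, a review longer than `maxlen` keeps its final `maxlen` tokens, so the model sees the end of the review rather than the beginning.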

#### Model building 

In [5]:
### BUILD VANILLA RNN
### EPOCHS CAN BE SLOW, SO FOR NOW WE TRAIN FOR 1 EPOCH (epochs=1)
hidden_size = 128
output_size = 1
dropout_ratio = 0.2

model = Sequential()
model.add(Embedding(max_features, hidden_size, input_length=maxlen))
model.add(SimpleRNN(hidden_size))  
model.add(Dropout(dropout_ratio))
model.add(Dense(output_size, activation='sigmoid'))

model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
print("Train...")
model.fit(X_train, y_train, batch_size=batch_size, epochs=1, 
          validation_data=(X_test, y_test))

loss, acc = model.evaluate(X_test, y_test, batch_size=batch_size)
print("Test set loss is: %s" % loss)
print("Test set accuracy is: %s" % acc)

Train...
Train on 25000 samples, validate on 25000 samples
Epoch 1/1
Test set loss is: 0.5452150621986389
Test set accuracy is: 0.73208


The results are acceptable but we can probably do better by using more advanced sequence models. Let's look at the LSTM first.

### LSTM  

An LSTM network is an artificial neural network that contains LSTM blocks instead of, or in addition to, regular network units. An LSTM block may be described as a "smart" network unit that can remember a value for an arbitrary length of time.

Unlike traditional RNNs, a long short-term memory (LSTM) network is well suited to learning from experience to classify, process and predict time series when there are very long time lags of unknown size between important events.
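The "remember a value" behaviour can be sketched with a single LSTM step in NumPy. This is a simplified illustration, not the Keras implementation: the function name `lstm_step`, the packing of the four blocks into one matrix, and their order (input, forget, candidate, output) are assumptions made for the example. The gates are set by hand so that the forget gate stays open and the input gate stays closed.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM step. W: (input_dim, 4*units), U: (units, 4*units), b: (4*units,)."""
    z = x_t @ W + h_prev @ U + b
    i, f, g, o = np.split(z, 4)           # assumed block order: input, forget, candidate, output
    i, f, o = sigmoid(i), sigmoid(f), sigmoid(o)
    g = np.tanh(g)
    c = f * c_prev + i * g                # cell state: keep old memory, add gated new content
    h = o * np.tanh(c)                    # hidden state: gated view of the cell state
    return h, c

input_dim, units = 2, 3
W = np.zeros((input_dim, 4 * units))
U = np.zeros((units, 4 * units))
# Hand-set biases: input gate ~0 (closed), forget gate ~1 (open), others neutral.
b = np.concatenate([np.full(units, -20.0), np.full(units, 20.0),
                    np.zeros(units), np.zeros(units)])

c0 = np.array([0.5, -1.0, 2.0])
h, c = np.zeros(units), c0.copy()
rng = np.random.default_rng(0)
for _ in range(50):
    h, c = lstm_step(rng.normal(size=input_dim), h, c, W, U, b)

print(np.allclose(c, c0, atol=1e-6))  # the stored values survive 50 steps
```

In a trained network the gate values are learned from data, so the cell decides per time step how much to keep and how much to overwrite.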

<img src="gru.png" width="60%">

### GRU  

Gated recurrent units are a gating mechanism in recurrent neural networks and an efficient alternative to LSTMs. 

GRUs work much like LSTMs but have fewer parameters, as they lack an output gate. In theory they can still model long-term dependencies. Since we do not have access to a GPU, we will use the GRU in the models below.

<img src="../imgs/gru.png" />

```python
keras.layers.recurrent.GRU(units, activation='tanh', recurrent_activation='hard_sigmoid', use_bias=True, 
                           kernel_initializer='glorot_uniform', recurrent_initializer='orthogonal', 
                           bias_initializer='zeros', kernel_regularizer=None, recurrent_regularizer=None, 
                           bias_regularizer=None, activity_regularizer=None, kernel_constraint=None, 
                           recurrent_constraint=None, bias_constraint=None, 
                           dropout=0.0, recurrent_dropout=0.0)
```
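For intuition, one GRU step can be sketched in NumPy as follows. Several GRU formulations exist, so treat this as one common variant under stated assumptions rather than the exact Keras computation: the name `gru_step`, the packing of update, reset and candidate blocks side by side, and the use of a plain sigmoid instead of `hard_sigmoid` are all choices made for the example.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(x_t, h_prev, W, U, b):
    """One GRU step.

    W: (input_dim, 3*units), U: (units, 3*units), b: (3*units,),
    each holding the update, reset and candidate blocks side by side.
    """
    W_z, W_r, W_h = np.split(W, 3, axis=1)
    U_z, U_r, U_h = np.split(U, 3, axis=1)
    b_z, b_r, b_h = np.split(b, 3)
    z = sigmoid(x_t @ W_z + h_prev @ U_z + b_z)              # update gate
    r = sigmoid(x_t @ W_r + h_prev @ U_r + b_r)              # reset gate
    h_cand = np.tanh(x_t @ W_h + (r * h_prev) @ U_h + b_h)   # candidate state
    return z * h_prev + (1.0 - z) * h_cand                   # blend old and new state

rng = np.random.default_rng(0)
input_dim, units = 3, 4
W = rng.normal(size=(input_dim, 3 * units))
U = rng.normal(size=(units, 3 * units))
b = np.zeros(3 * units)

h = np.zeros(units)
for x_t in rng.normal(size=(6, input_dim)):
    h = gru_step(x_t, h, W, U, b)

n_weights = W.size + U.size + b.size
print(h.shape, n_weights)  # (4,) 96
```

There is no separate cell state and no output gate here, which is where the saving in parameters relative to an LSTM comes from.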

## GRU

In [6]:
### IMPLEMENT GRU
hidden_size = 128
output_size = 1
dropout_ratio = 0.2

model = Sequential()
model.add(Embedding(max_features, hidden_size, input_length=maxlen))

# !!! Play with this layer (e.g. its size) and try to get better results!
model.add(GRU(hidden_size))   

model.add(Dropout(dropout_ratio))
model.add(Dense(output_size, activation='sigmoid'))

model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
print(model.summary())
print("Train...")
model.fit(X_train, y_train, batch_size=batch_size, 
          epochs=1, validation_data=(X_test, y_test))
loss, acc = model.evaluate(X_test, y_test, batch_size=batch_size)
print("Test set loss is: %s" % loss)
print("Test set accuracy is: %s" % acc)

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_2 (Embedding)      (None, 100, 128)          2560000   
_________________________________________________________________
gru_1 (GRU)                  (None, 128)               98688     
_________________________________________________________________
dropout_2 (Dropout)          (None, 128)               0         
_________________________________________________________________
dense_2 (Dense)              (None, 1)                 129       
Total params: 2,658,817
Trainable params: 2,658,817
Non-trainable params: 0
_________________________________________________________________
None
Train...
Train on 25000 samples, validate on 25000 samples
Epoch 1/1
Test set loss is: 0.35703638779640195
Test set accuracy is: 0.84492
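The parameter counts in the summary above can be reproduced by hand. The sketch below assumes each recurrent layer holds a number of weight blocks, each consisting of an input kernel, a recurrent kernel and one bias vector: one block for a simple RNN, three for a GRU and four for an LSTM. The helper name `recurrent_params` is hypothetical, and GRU variants that carry a second bias vector would give a different number.

```python
def recurrent_params(input_dim, units, n_blocks):
    """Weights for n_blocks sets of (input kernel, recurrent kernel, bias)."""
    return n_blocks * (input_dim * units + units * units + units)

max_features, hidden_size = 20000, 128

embedding = max_features * hidden_size                       # one vector per token id
simple_rnn = recurrent_params(hidden_size, hidden_size, 1)
gru = recurrent_params(hidden_size, hidden_size, 3)
lstm = recurrent_params(hidden_size, hidden_size, 4)
dense = hidden_size * 1 + 1                                  # weights plus one bias

print(simple_rnn, gru, lstm)    # 32896 98688 131584
print(embedding + gru + dense)  # 2658817
```

Under these assumptions the GRU figure matches the 98,688 reported for `gru_1`, and an LSTM of the same size would need about a third more recurrent weights.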


What do you notice compared to the results of the vanilla RNN?

## Convolutional LSTM

This section demonstrates the use of a **Convolutional LSTM network**.

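We can put a convolutional layer and a max pooling layer before the recurrent layer. This shortens the sequence the recurrent layer has to step through (from 100 to 24 time steps in the model below), which speeds up training.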
We get the following architecture:

1. Embedding layer
2. Dropout layer
3. Convolutional layer (ReLU)
4. Max Pooling Layer
5. Recurrent cell (LSTM, or GRU as in the code below)
6. Fully Connected layer (Sigmoid)


In [4]:
from keras.layers.convolutional import Conv1D, MaxPooling1D

In [5]:
### IMPLEMENT CONVOLUTIONAL LSTM
### REMEMBER WE ARE WORKING WITH 1D CONVOLUTIONAL LAYERS HERE
hidden_size = 128
output_size = 1
dropout_ratio = 0.2
nr_filters = 64
filter_size = 5
pool_size = 4

model = Sequential()
model.add(Embedding(max_features, hidden_size, input_length=maxlen))
model.add(Dropout(dropout_ratio))
model.add(Conv1D(nr_filters, filter_size, activation='relu'))
model.add(MaxPooling1D(pool_size=pool_size))
model.add(GRU(hidden_size))
model.add(Dense(output_size, activation='sigmoid'))

model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
print(model.summary())
model.fit(X_train, y_train, batch_size=batch_size, 
          epochs=1, validation_data=(X_test, y_test))

loss, acc = model.evaluate(X_test, y_test, batch_size=batch_size)
print("Test set loss is: %s" % loss)
print("Test set accuracy is: %s" % acc)

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_2 (Embedding)      (None, 100, 128)          2560000   
_________________________________________________________________
dropout_2 (Dropout)          (None, 100, 128)          0         
_________________________________________________________________
conv1d_1 (Conv1D)            (None, 96, 64)            41024     
_________________________________________________________________
max_pooling1d_1 (MaxPooling1 (None, 24, 64)            0         
_________________________________________________________________
gru_1 (GRU)                  (None, 128)               74112     
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 129       
Total params: 2,675,265
Trainable params: 2,675,265
Non-trainable params: 0
_________________________________________________________________
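The output shapes and parameter counts in the summary above follow from simple arithmetic. The sketch below assumes a convolution without padding and with stride 1, and pooling whose stride equals the pool size; the helper names are hypothetical.

```python
def conv1d_valid_length(length, kernel_size, stride=1):
    """Number of positions a kernel can take without padding."""
    return (length - kernel_size) // stride + 1

def pool1d_valid_length(length, pool_size):
    """Assumes the pooling stride equals pool_size and no padding."""
    return (length - pool_size) // pool_size + 1

maxlen, embedding_dim = 100, 128
nr_filters, filter_size, pool_size, hidden_size = 64, 5, 4, 128

after_conv = conv1d_valid_length(maxlen, filter_size)    # 96
after_pool = pool1d_valid_length(after_conv, pool_size)  # 24

# Each filter spans filter_size steps across all embedding channels, plus one bias.
conv_params = filter_size * embedding_dim * nr_filters + nr_filters  # 41024

# GRU fed with 64 channels: three blocks of (input kernel, recurrent kernel, bias).
gru_params = 3 * (nr_filters * hidden_size + hidden_size * hidden_size + hidden_size)  # 74112

print(after_conv, after_pool, conv_params, gru_params)
```

The recurrent layer now steps through 24 positions instead of 100 and reads 64 channels instead of 128, which is where the speed-up comes from.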


If everything went correctly, the model trained significantly faster and obtained a higher accuracy. Awesome!

## Convolutional sequence model

We can also fit only a convolutional model to the sequence. We get the following architecture:

1. Embedding layer
2. Dropout layer
3. Convolutional layer (ReLU)
4. Max Pooling Layer
5. Flatten Layer
6. Fully Connected layer (Sigmoid)
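Steps 3 to 5 of the list above can be sketched as a NumPy forward pass for a single embedded review. This is an illustrative sketch with random weights, not the Keras implementation; the function names `conv1d_relu` and `max_pool1d` are hypothetical, and the convolution assumes no padding and stride 1.

```python
import numpy as np

def conv1d_relu(x, kernels, bias):
    """x: (steps, channels); kernels: (width, channels, filters); bias: (filters,)."""
    width = kernels.shape[0]
    steps_out = x.shape[0] - width + 1
    out = np.empty((steps_out, kernels.shape[2]))
    for t in range(steps_out):
        window = x[t:t + width]  # (width, channels) slice of the sequence
        # Dot product of the window with every filter at once.
        out[t] = np.tensordot(window, kernels, axes=([0, 1], [0, 1])) + bias
    return np.maximum(out, 0.0)  # ReLU

def max_pool1d(x, pool_size):
    """Non-overlapping max pooling along the time axis."""
    steps_out = x.shape[0] // pool_size
    trimmed = x[:steps_out * pool_size]
    return trimmed.reshape(steps_out, pool_size, x.shape[1]).max(axis=1)

rng = np.random.default_rng(0)
embedded = rng.normal(size=(100, 128))          # one review after the embedding layer
kernels = rng.normal(size=(5, 128, 64)) * 0.01  # 64 filters of width 5

features = max_pool1d(conv1d_relu(embedded, kernels, np.zeros(64)), 4)
flat = features.reshape(-1)                     # flatten layer
print(features.shape, flat.shape)               # (24, 64) (1536,)
```

The flattened vector has 1,536 entries, so the final sigmoid unit needs 1,536 weights plus one bias.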

In [9]:
### IMPLEMENT CONVOLUTIONAL SEQUENCE MODEL
hidden_size = 128
output_size = 1
dropout_ratio = 0.2
nr_filters = 64
filter_size = 5
pool_size = 4

model = Sequential()
model.add(Embedding(max_features, hidden_size, input_length=maxlen))
model.add(Dropout(dropout_ratio))
model.add(Conv1D(nr_filters, filter_size, activation='relu'))
model.add(MaxPooling1D(pool_size=pool_size))
model.add(Flatten())
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

print(model.summary())
model.fit(X_train, y_train, batch_size=batch_size, 
          epochs=1, validation_data=(X_test, y_test))

loss, acc = model.evaluate(X_test, y_test, batch_size=batch_size)
print("Test set loss is: %s" % loss)
print("Test set accuracy is: %s" % acc)

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_4 (Embedding)      (None, 100, 128)          2560000   
_________________________________________________________________
dropout_4 (Dropout)          (None, 100, 128)          0         
_________________________________________________________________
conv1d_2 (Conv1D)            (None, 96, 64)            41024     
_________________________________________________________________
max_pooling1d_2 (MaxPooling1 (None, 24, 64)            0         
_________________________________________________________________
flatten_1 (Flatten)          (None, 1536)              0         
_________________________________________________________________
dense_4 (Dense)              (None, 1)                 1537      
Total params: 2,602,561
Trainable params: 2,602,561
Non-trainable params: 0
_________________________________________________________________


Removing the recurrent layer (the GRU in the code above) increases the training speed significantly again while preserving performance levels! What else is catching your attention here?