### A Crash Course on Neural Networks with Keras Part 3 - Recurrent Neural Networks

So far we have seen convnets, which are great at dealing with images via the extraction of translation invariant local features.

In the context of sequence modelling, recurrent neural networks (RNNs), and in particular Long Short Term Memory (LSTM) networks, have become very popular, so they are also worth a look.

#### 1a) What are RNNs, and why use them?

This explanation is going to be an extremely brief summary of this truly excellent [blog post](http://colah.github.io/posts/2015-08-Understanding-LSTMs/) (from which I have taken all the images). It is just meant to give a flavour and to point out that these networks exist, so I strongly suggest having a look at the linked blog to get more details and intuition!

Very roughly, recurrent neural networks aim to gain an advantage in sequence modelling by feeding two things into the network at every time step: the new element of the sequence, and the network's own output from the previous time step (i.e. by having recurrent connections).

<center><img src="images/RNN-rolled.png" width="200" height="200"></center>

The idea is that one output of the cell should contain a representation/memory that is useful for cells seeing future information, and another output of the cell should be useful for any further processing at this time step. This is sometimes easier to see in an "unrolled" picture, but remember that the cells are identical (i.e. they share weights, in a way similar to convolutional filters):

<center><img src="images/RNN-unrolled.png" width="600" height="600"></center>
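
To make the weight sharing in the unrolled picture concrete, here is a minimal NumPy sketch of a plain tanh recurrent cell (not an LSTM). The names `rnn_forward`, `W_x`, `W_h` and the toy sizes are my own choices for illustration. Note that in this simplest cell the "memory" passed to the next step and the output at the current step are the same vector, whereas an LSTM keeps them separate.

```python
import numpy as np

def rnn_forward(inputs, W_x, W_h, b, h0):
    """Run a simple tanh recurrent cell over a sequence, reusing the same weights at every step."""
    h = h0
    outputs = []
    for x_t in inputs:                        # one iteration per time step
        h = np.tanh(x_t @ W_x + h @ W_h + b)  # new state from the current input and the previous state
        outputs.append(h)
    return np.stack(outputs), h

rng = np.random.default_rng(0)
seq_len, num_features, hidden = 5, 3, 4       # toy sizes, chosen only for illustration
inputs = rng.normal(size=(seq_len, num_features))
W_x = rng.normal(size=(num_features, hidden)) * 0.1
W_h = rng.normal(size=(hidden, hidden)) * 0.1
b = np.zeros(hidden)

outputs, final_state = rnn_forward(inputs, W_x, W_h, b, np.zeros(hidden))
print(outputs.shape)      # (5, 4): one output per time step
print(final_state.shape)  # (4,): the state after the last time step
```

The loop body is the "cell"; unrolling just means writing out that loop one step at a time, which is why every cell in the picture has identical weights.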

There are many different types of [recurrent cells](https://keras.io/layers/recurrent/), but arguably the most popular is the Long Short Term Memory (LSTM) cell:

<center><img src="images/LSTM3-chain.png" width="600" height="600"></center>

I don't want to go into the details - see [here](http://colah.github.io/posts/2015-08-Understanding-LSTMs/)!

#### 1b) A quick note on some practical considerations:

Historically, LSTMs have been the go-to tool for sequence modelling. However, they are expensive to train and often difficult to tune, and very recently the consensus seems to be shifting towards 1D convnets (which we discussed as an example CNN) being more robust, efficient and effective for sequence modelling:

<center><img src="images/wavenet.gif" width="600" height="600"></center>

DeepMind's ["WaveNet"](https://deepmind.com/blog/wavenet-generative-model-raw-audio/) above is a great example of this, and a very recent discussion (posted to arXiv in March 2018) can be found [here](https://arxiv.org/abs/1803.01271).

So, the moral is, perhaps this section is more for historical interest - things are changing fast :)

#### 2) Let's build an LSTM

For this example we will use the [IMDB sentiment dataset](https://keras.io/datasets/#imdb-movie-reviews-sentiment-classification). It's a collection of movie reviews, each with a binary label representing positive or negative sentiment, and the idea is to predict the sentiment of an unlabelled review.

In [None]:
# ----- Imports --------

from keras.preprocessing import sequence
from keras.models import Sequential
from keras.layers import Dense, Embedding
from keras.layers import LSTM
from keras.datasets import imdb

Each movie review is encoded as a sequence of integers, where each integer represents a word, and words are indexed by their overall frequency in the dataset.

So, as preprocessing we need to:

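   - decide how many words we will keep (with the default settings of imdb.load_data, less frequent words are all replaced by the out-of-vocabulary index 2, while 0 is reserved for padding)
   - decide the maximum number of words we want to keep from each review (i.e. how long our sequences will be)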
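
As a rough illustration of what a frequency-indexed encoding with a vocabulary cap looks like, here is a toy sketch in plain Python. The helpers `build_ranks` and `encode` and the three tiny "reviews" are made up for this example; the reserved indices (0 for padding, 1 for the start of a review, 2 for out-of-vocabulary words, real words starting at rank + 3) follow the defaults of imdb.load_data, but this is not the code Keras actually runs.

```python
from collections import Counter

PAD, START, OOV = 0, 1, 2   # reserved indices, following the imdb.load_data defaults
INDEX_FROM = 3              # a word with frequency rank r is stored as r + INDEX_FROM

def build_ranks(reviews):
    """Map each word to its frequency rank (1 = most frequent)."""
    counts = Counter(word for review in reviews for word in review.split())
    ordered = sorted(counts, key=lambda w: (-counts[w], w))  # ties broken alphabetically
    return {word: rank for rank, word in enumerate(ordered, start=1)}

def encode(review, ranks, num_words):
    """Encode a review as integers, replacing words outside the vocabulary cap by OOV."""
    encoded = [START]
    for word in review.split():
        index = ranks[word] + INDEX_FROM if word in ranks else num_words
        encoded.append(index if index < num_words else OOV)
    return encoded

reviews = ["a great great film", "a dull film", "a great cast"]
ranks = build_ranks(reviews)
print(encode("a great dull cast", ranks, num_words=6))  # [1, 4, 5, 2, 2]
```

With the cap at 6, only the two most frequent words keep their own index; the rarer "dull" and "cast" both collapse to 2.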
   
We choose to retain only the top 20000 words, and to keep only the last 80 words of the review if the review is longer than that:

In [6]:
max_features = 20000
maxlen = 80  
batch_size = 32

Load the data:

In [7]:
print('Loading data...')
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=max_features)
print(len(x_train), 'train sequences')
print(len(x_test), 'test sequences')

Loading data...
25000 train sequences
25000 test sequences


Of course, by construction, LSTMs can be made to work with datasets consisting of sequences of varying length (see [Dynamic RNN](https://www.tensorflow.org/api_docs/python/tf/nn/dynamic_rnn) in TensorFlow).

However, Keras expects all the sequences in a batch to have the same length, so we pad the sequences that are shorter than 80 words and truncate the ones that are longer. Again, Keras has built-in functionality for this:

In [8]:
print('Pad sequences (samples x time)')
x_train = sequence.pad_sequences(x_train, maxlen=maxlen)
x_test = sequence.pad_sequences(x_test, maxlen=maxlen)
print('x_train shape:', x_train.shape)
print('x_test shape:', x_test.shape)

Pad sequences (samples x time)
x_train shape: (25000, 80)
x_test shape: (25000, 80)
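
To see what the padding and truncation do to an individual review, here is a small pure-Python sketch that mimics the default behaviour of pad_sequences (pad at the front with 0, and drop words from the front when a sequence is too long). The function name `pad_or_truncate` is hypothetical and this is only an illustration, not the Keras implementation.

```python
def pad_or_truncate(seq, maxlen, value=0):
    """Keep the last `maxlen` items; left-pad shorter sequences with `value`."""
    seq = list(seq)[-maxlen:]                   # drop items from the front if too long
    return [value] * (maxlen - len(seq)) + seq  # pad at the front if too short

print(pad_or_truncate([5, 6, 7], maxlen=5))              # [0, 0, 5, 6, 7]
print(pad_or_truncate([1, 2, 3, 4, 5, 6, 7], maxlen=5))  # [3, 4, 5, 6, 7]
```

Keeping the end of the review and padding at the front means the last time steps the LSTM sees are always real words.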


At this point, a single sequence (truncated review) looks something like this:

In [9]:
x_train[0]

array([   15,   256,     4,     2,     7,  3766,     5,   723,    36,
          71,    43,   530,   476,    26,   400,   317,    46,     7,
           4, 12118,  1029,    13,   104,    88,     4,   381,    15,
         297,    98,    32,  2071,    56,    26,   141,     6,   194,
        7486,    18,     4,   226,    22,    21,   134,   476,    26,
         480,     5,   144,    30,  5535,    18,    51,    36,    28,
         224,    92,    25,   104,     4,   226,    65,    16,    38,
        1334,    88,    12,    16,   283,     5,    16,  4472,   113,
         103,    32,    15,    16,  5345,    19,   178,    32],
      dtype=int32)

Now we can build the model.

Unfortunately, integer encodings of words would make *extremely* bad features (ideally, we want features to be normalized to mean 0 and variance 1; see the earlier discussion and references). As a result, we need our first layer to be a word [embedding layer](https://keras.io/layers/embeddings/), which learns a vector representation of each word.

In other situations, where each element of the sequence is naturally given as a vector of features, we could start with the LSTM layer, which expects its input in the shape:

[batch_size, sequence_length, num_features]
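
Conceptually, an embedding layer is just a lookup table with one row per word index, and the lookup is what turns a batch of integer sequences into the three-dimensional shape above. Here is a toy NumPy sketch of that idea; the sizes and the random `embedding_matrix` are assumptions for illustration, whereas in the real layer the rows are learned during training.

```python
import numpy as np

vocab_size, embedding_dim = 10, 4   # toy sizes; the model in this section uses 20000 and 128
rng = np.random.default_rng(0)
embedding_matrix = rng.normal(size=(vocab_size, embedding_dim))  # one row per word index

batch = np.array([[0, 0, 3, 7],     # two padded "reviews" of length 4
                  [1, 5, 5, 2]])
embedded = embedding_matrix[batch]  # row lookup for every integer in the batch
print(embedded.shape)               # (2, 4, 4) = (batch_size, sequence_length, num_features)
```

Repeated words (the two 5s in the second row) map to the same vector, which is exactly what we want from a word representation.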

Note also that the model below has only a single LSTM layer, and that by default these layers only return the output from the final time step (indicated by return_sequences=False), which we then push into a feed-forward layer.

In principle, though, we can stack many LSTM layers on top of each other (extracting sequences of successively more abstract features), but to make this work you have to set return_sequences=True on all intermediate LSTM layers!
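
The reason for this is purely about shapes, and can be seen without Keras at all. The sketch below uses a toy tanh recurrent layer written in NumPy (again not an LSTM, and `recurrent_layer` with its `seed` argument is a hypothetical helper of mine) to show what the flag changes: with return_sequences=True the layer hands on one vector per time step, otherwise only the final one.

```python
import numpy as np

def recurrent_layer(x, units, return_sequences, seed):
    """Toy tanh recurrent layer acting on x of shape (batch, time, features)."""
    rng = np.random.default_rng(seed)
    W_x = rng.normal(size=(x.shape[-1], units)) * 0.1
    W_h = rng.normal(size=(units, units)) * 0.1
    h = np.zeros((x.shape[0], units))
    states = []
    for t in range(x.shape[1]):                # iterate over the time axis
        h = np.tanh(x[:, t, :] @ W_x + h @ W_h)
        states.append(h)
    if return_sequences:
        return np.stack(states, axis=1)        # (batch, time, units): one vector per time step
    return h                                   # (batch, units): final time step only

x = np.zeros((32, 80, 128))                    # (batch_size, sequence_length, num_features)
first = recurrent_layer(x, 64, return_sequences=True, seed=0)
second = recurrent_layer(first, 32, return_sequences=False, seed=1)
print(first.shape)   # (32, 80, 64)
print(second.shape)  # (32, 32)
```

If the first layer returned only its final state, the second layer would receive a two-dimensional array with no time axis left to iterate over.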

In [10]:
model = Sequential()
model.add(Embedding(max_features, 128))
model.add(LSTM(128, dropout=0.2, recurrent_dropout=0.2, return_sequences=False))
model.add(Dense(1, activation='sigmoid'))

# try using different optimizers and different optimizer configs
model.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])


In [19]:
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_2 (Embedding)      (None, None, 128)         2560000   
_________________________________________________________________
lstm_2 (LSTM)                (None, 128)               131584    
_________________________________________________________________
dense_2 (Dense)              (None, 1)                 129       
Total params: 2,691,713
Trainable params: 2,691,713
Non-trainable params: 0
_________________________________________________________________
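
As a sanity check, the parameter counts in the summary can be reproduced by hand. The short calculation below assumes the standard LSTM layout of four weight sets (three gates plus the candidate cell update), each with an input kernel, a recurrent kernel and a bias; the variable names are my own.

```python
max_features, embedding_dim, lstm_units = 20000, 128, 128

# Embedding: one 128-dimensional vector per word in the vocabulary.
embedding_params = max_features * embedding_dim

# LSTM: 4 weight sets, each with an input kernel, a recurrent kernel and a bias.
lstm_params = 4 * (embedding_dim * lstm_units + lstm_units * lstm_units + lstm_units)

# Dense: one weight per LSTM unit plus a single bias.
dense_params = lstm_units * 1 + 1

print(embedding_params, lstm_params, dense_params)    # 2560000 131584 129
print(embedding_params + lstm_params + dense_params)  # 2691713
```

Almost all of the parameters sit in the embedding table, not in the LSTM itself.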


And, once again, training is now easy!

In [20]:
print('Train...')
model.fit(x_train, y_train,
          batch_size=batch_size,
          epochs=15,
          validation_data=(x_test, y_test))
score, acc = model.evaluate(x_test, y_test,
                            batch_size=batch_size)
print('Test score:', score)
print('Test accuracy:', acc)

Train...
Train on 25000 samples, validate on 25000 samples
Epoch 1/15
Epoch 2/15
Epoch 3/15
Epoch 4/15
Epoch 5/15
Epoch 6/15
Epoch 7/15
Epoch 8/15
Epoch 9/15
Epoch 10/15
Epoch 11/15
Epoch 12/15
Epoch 13/15
Epoch 14/15
Epoch 15/15
Test score: 1.0828352694833279
Test accuracy: 0.81252
