# RNN Architecture

You will learn about the vanishing and exploding gradient problems, which often occur in RNNs, and how GRU and LSTM cells help deal with them. Furthermore, you'll create embedding layers for language models and revisit the sentiment classification task.

# (1) Vanishing and exploding gradients

## Training RNN models

<p align='center'>
    <img src='image/Screenshot 2021-02-11 151311.png'>
    <img src='image/Screenshot 2021-02-11 151431.png'>
</p>

Example:
$$a_2 = f(W_a , a_1 , x_2)$$ 
$$= f(W_a , f(W_a , a_0 , x_1), x_2)$$

<p align='center'>
    <img src='image/Screenshot 2021-02-11 152644.png'>
</p>

**Remember that**
$$a_T = f(W_a , a_{T-1} , x_T)$$
$a_T$ also depends on $a_{T-1}$, which depends on $a_{T-2}$, and so on!
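
To make the recursion concrete, here is a minimal NumPy sketch. It assumes that $f$ is a `tanh` applied to `W_a @ a + W_x @ x`; the notes do not specify $f$, so the input weights `W_x`, the sizes, and the random data are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden, features, T = 4, 3, 5            # illustrative sizes (assumptions)
W_a = rng.normal(scale=0.5, size=(hidden, hidden))
W_x = rng.normal(scale=0.5, size=(hidden, features))
xs = rng.normal(size=(T, features))      # x_1 ... x_T

def f(W_a, a_prev, x_t):
    # One recurrent step; tanh is an assumed choice of activation
    return np.tanh(W_a @ a_prev + W_x @ x_t)

a = np.zeros(hidden)                     # a_0
states = [a]
for x_t in xs:
    a = f(W_a, a, x_t)                   # a_t depends on a_{t-1}
    states.append(a)

# a_2 written as one nested expression, as in the example above
a_2_nested = f(W_a, f(W_a, states[0], xs[0]), xs[1])
print(np.allclose(states[2], a_2_nested))
```

Every state in `states` is computed from the previous one with the same `W_a`, which is why the derivative with respect to `W_a` has to be propagated back through all earlier time steps.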

## BPTT continuation
**Computing derivatives leads to**
$$\frac{\partial a_t}{\partial W_a} = (W_a)^{t-1} g(X)$$

- $(W_a)^{t-1}$ **can converge to 0**
- **or diverge to** $+\infty$**!**
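
The two cases can be illustrated numerically. The sketch below uses two made-up diagonal matrices (assumptions chosen only for illustration) and prints the spectral norm of $(W_a)^{t-1}$ for a few values of $t$.

```python
import numpy as np

def power_norm(W, t):
    # Spectral norm of W raised to the power t - 1
    return np.linalg.norm(np.linalg.matrix_power(W, t - 1), ord=2)

W_small = 0.5 * np.eye(3)   # all eigenvalues have magnitude below 1
W_large = 1.5 * np.eye(3)   # all eigenvalues have magnitude above 1

for t in (2, 10, 50):
    print(t, power_norm(W_small, t), power_norm(W_large, t))
```

With `W_small` the factor shrinks towards zero as `t` grows (vanishing), while with `W_large` it grows without bound (exploding).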

## Solutions to the gradient problems
Some solutions are known:

### Exploding gradients
- Gradient clipping / scaling
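
As a rough sketch of what clipping and scaling do to a gradient, the NumPy functions below are hypothetical helpers written for illustration, not library code. Keras optimizers offer the same two ideas through the `clipvalue` and `clipnorm` arguments.

```python
import numpy as np

def clip_by_value(grad, clip_value):
    # Clamp every component to [-clip_value, clip_value]
    return np.clip(grad, -clip_value, clip_value)

def clip_by_norm(grad, max_norm):
    # Rescale the whole gradient if its L2 norm exceeds max_norm
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        return grad * (max_norm / norm)
    return grad

grad = np.array([0.5, -12.0, 40.0])   # made-up gradient with large components
clipped = clip_by_value(grad, 3.0)
scaled = clip_by_norm(grad, 3.0)
print(clipped)
print(scaled, np.linalg.norm(scaled))
```

Clipping by value can change the direction of the gradient, whereas scaling by norm keeps the direction and only shortens the step.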

### Vanishing gradients
- Better initialize the matrix W
- Use regularization
- Use ReLU instead of tanh / sigmoid / softmax
- **Use LSTM or GRU cells!**

# Exercise I: Exploding gradient problem

In the video exercise, you learned about two problems that may arise when working with RNN models: the vanishing and exploding gradient problems.

This exercise explores the exploding gradient problem, showing that the derivative of a function can increase exponentially, and how to solve it with a simple technique.

The data is already loaded in the environment as `X_train`, `X_test`, `y_train` and `y_test`.

You will use a **Stochastic Gradient Descent** (SGD) optimizer and **Mean Squared Error** (MSE) as the loss function.

In the first step, you will observe the gradient exploding by computing the MSE on the train and test sets. In step 2, you will change the optimizer using the `clipvalue` parameter to solve the problem.

The Stochastic Gradient Descent in Keras is loaded as `SGD`.

### Instructions 1/2

- Use `SGD()` as optimizer and `(X_test, y_test)` as validation data.
- Evaluate train performance and print all the **MSE** values.

In [None]:
# Create a Keras model with one hidden Dense layer
model = Sequential()
model.add(Dense(25, input_dim=20, activation='relu', kernel_initializer=he_uniform(seed=42)))
model.add(Dense(1, activation='linear'))

# Compile and fit the model
model.compile(loss='mean_squared_error', optimizer=SGD(learning_rate=0.01, momentum=0.9))
history = model.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=100, verbose=0)

# See Mean Square Error for train and test data
train_mse = model.evaluate(X_train, y_train, verbose=0)
test_mse = model.evaluate(X_test, y_test, verbose=0)

# Print the values of MSE
print('Train: %.3f, Test: %.3f' % (train_mse, test_mse))

### Instructions 2/2

- Set the `SGD()` parameter `clipvalue` equal to `3.0`.
- Compute the MSE values and store them on `train_mse` and `test_mse` variables.

In [None]:
# Create a Keras model with one hidden Dense layer
model = Sequential()
model.add(Dense(25, input_dim=20, activation='relu', kernel_initializer=he_uniform(seed=42)))
model.add(Dense(1, activation='linear'))

# Compile and fit the model
model.compile(loss='mean_squared_error', optimizer=SGD(learning_rate=0.01, momentum=0.9, clipvalue=3.0))
history = model.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=100, verbose=0)

# See Mean Square Error for train and test data
train_mse = model.evaluate(X_train, y_train, verbose=0)
test_mse = model.evaluate(X_test, y_test, verbose=0)

# Print the values of MSE
print('Train: %.3f, Test: %.3f' % (train_mse, test_mse))

# Exercise II: Vanishing gradient problem

The other possible gradient problem is when the gradients vanish, or go to zero. This is a much harder problem to solve because it is not as easy to detect. If the loss function does not improve on every step, is it because the gradients went to zero and thus didn't update the weights? Or is it because the model is not able to learn?

This problem occurs more often in RNN models when long memory is required, for example when the sentences are long.

In this exercise you will observe the problem on the IMDB data, with longer sentences selected. The data is loaded in the `X` and `y` variables, as well as the classes `Sequential`, `SimpleRNN`, `Dense` and `matplotlib.pyplot` as `plt`. The model was pre-trained for 100 epochs and its weights are stored in the file `model_weights.h5`.

### Instructions

- Add a `SimpleRNN` layer to the model.
- Load the pre-trained weights on the model using the method `.load_weights()`.
- Add the accuracy of the training data available on the attribute `'acc'` to the plot.
- Display the plot using the method `.show()`.

In [None]:
# Create the model
model = Sequential()
model.add(SimpleRNN(units=600, input_shape=(None, 1)))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='sgd', metrics=['accuracy'])

# Load pre-trained weights
model.load_weights('model_weights.h5')

# Plot the accuracy x epoch graph
plt.plot(history.history['acc'])
plt.plot(history.history['val_acc'])
plt.legend(['train', 'val'], loc='upper left')
plt.show()

<p align='center'>
    <img src='image/[2021-02-11 160543].svg'>
</p>

# (2) GRU and LSTM cells

## SimpleRNN cell
<p align='center'>
    <img src='image/Screenshot 2021-02-12 174743.png'>
</p>

## GRU cell
<p align='center'>
    <img src='image/Screenshot 2021-02-12 174900.png'>
</p>

## LSTM cell

<p align='center'>
    <img src='image/Screenshot 2021-02-12 175037.png'>
</p>

## No more vanishing gradients
- The `SimpleRNN` cell can have gradient problems.
    - The weight matrix raised to the power $t$ multiplies the other terms
- `GRU` and `LSTM` cells are much less prone to vanishing gradient problems
    - Because of their gates
    - Their gradients don't have the weight matrix terms multiplying the rest
    - Exploding gradient problems are easier to solve
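
To show how the gates enter the computation, here is a minimal NumPy sketch of a single GRU step. It follows one common textbook formulation; references differ on whether $z$ or $1 - z$ multiplies the old state, and the parameter names, sizes, and random weights below are illustrative assumptions rather than the Keras implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, h_prev, p):
    # Update gate: how much of the candidate state replaces the old state
    z = sigmoid(p["W_z"] @ x_t + p["U_z"] @ h_prev + p["b_z"])
    # Reset gate: how much of the old state feeds the candidate
    r = sigmoid(p["W_r"] @ x_t + p["U_r"] @ h_prev + p["b_r"])
    # Candidate state
    h_tilde = np.tanh(p["W_h"] @ x_t + p["U_h"] @ (r * h_prev) + p["b_h"])
    # Gated interpolation between the old state and the candidate
    return (1.0 - z) * h_prev + z * h_tilde

rng = np.random.default_rng(0)
n_in, n_hid = 3, 4                     # illustrative sizes
params = {}
for gate in ("z", "r", "h"):
    params[f"W_{gate}"] = rng.normal(scale=0.1, size=(n_hid, n_in))
    params[f"U_{gate}"] = rng.normal(scale=0.1, size=(n_hid, n_hid))
    params[f"b_{gate}"] = np.zeros(n_hid)

h = np.zeros(n_hid)
for x_t in rng.normal(size=(6, n_in)):
    h = gru_step(x_t, h, params)
print(h.shape)
```

When the update gate is close to zero, the old state is copied forward almost unchanged instead of being multiplied by a weight matrix at every step, which is the intuition behind the bullet points above.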

## Usage in keras

In [None]:
# Import the layers
from keras.layers import GRU, LSTM

In [None]:
# Add the layers to a model
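# GRU layer that returns the full sequence, so another recurrent layer can follow
model.add(GRU(units=128, return_sequences=True, name='GRU_layer'))
# LSTM layer that returns only its last output
model.add(LSTM(units=64, return_sequences=False, name='LSTM_layer'))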

# Exercise III: GRU cells are better than simpleRNN

In this exercise you will re-run the same model as in the first chapter of the course to compare the accuracy of the model by simply changing the `SimpleRNN` cell to a `GRU` cell.

The model was already trained for 10 epochs, as in the previous model with a `SimpleRNN` cell. In order to compare the models, a test set `(X_test, y_test)` is already loaded in the environment, as well as the old model `SimpleRNN_model`.

### Instructions

- Import the `GRU` cell.
- Print the models' summaries.
- Print the accuracy of each model.

In [None]:
# Import the modules
from keras.layers import GRU, Dense

# Print the old and new model summaries
SimpleRNN_model.summary()
gru_model.summary()

# Evaluate the models' performance (ignore the loss value)
_, acc_simpleRNN = SimpleRNN_model.evaluate(X_test, y_test, verbose=0)
_, acc_GRU = gru_model.evaluate(X_test, y_test, verbose=0)

# Print the results
print("SimpleRNN model's accuracy:\t{0}".format(acc_simpleRNN))
print("GRU model's accuracy:\t{0}".format(acc_GRU))

# Exercise IV: Stacking RNN layers

Deep RNN models can have tens to hundreds of layers in order to achieve state-of-the-art results.

In this exercise, you will get a glimpse of how to create deep RNN models by stacking layers of LSTM cells one after the other.

To do this, you will set the `return_sequences` argument to `True` on the first two `LSTM` layers and to `False` on the last `LSTM` layer.

To create models with even more layers, you can keep adding them one after the other or create a function that uses the `.add()` method inside a loop to add many layers with few lines of code.
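
The loop idea can be sketched without Keras by first generating the arguments for each layer. The helper `stacked_lstm_configs` below is hypothetical and written only for illustration; it leaves out `input_shape`, which the first layer of a real model would still need.

```python
def stacked_lstm_configs(n_layers, units=128):
    # Every layer except the last must emit the full sequence,
    # because the next recurrent layer expects one vector per time step.
    configs = []
    for i in range(n_layers):
        is_last = i == n_layers - 1
        configs.append({"units": units, "return_sequences": not is_last})
    return configs

for kwargs in stacked_lstm_configs(3):
    print(kwargs)
    # With Keras available, each entry could be applied as:
    #     model.add(LSTM(**kwargs))
```

For three layers this yields `return_sequences` set to `True`, `True`, `False`, matching the pattern described above.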

### Instructions

- Import the `LSTM` layer.
- Return the sequences in the first two layers and don't return the sequences in the last `LSTM` layer.
- Load the pre-trained weights.
- Print the loss and accuracy obtained.

In [None]:
# Import the LSTM layer
from keras.layers import LSTM

# Build model
model = Sequential()
model.add(LSTM(units=128, input_shape=(None, 1), return_sequences=True))
model.add(LSTM(units=128, return_sequences=True))
model.add(LSTM(units=128, return_sequences=False))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

# Load pre-trained weights
model.load_weights('lstm_stack_model_weights.h5')

print("Loss: %0.04f\nAccuracy: %0.04f" % tuple(model.evaluate(X_test, y_test, verbose=0)))

# (3) The Embedding layer

## Why embeddings
Advantages:
- Reduce the dimension
```
    one_hot = np.zeros((N, 100000))  # N words, one column per vocabulary entry
    embedd = np.zeros((N, 300))      # N words, 300 embedding dimensions
```
- Dense representation
    - `king - man + woman = queen`
- Transfer learning

Disadvantages:
- Lots of parameters to train: training takes longer
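
The dimension reduction can be seen in a small NumPy sketch. The vocabulary size, embedding dimension, and word indices are made-up values for illustration. Multiplying a one-hot matrix by an embedding matrix gives the same result as looking up rows by index, which is why an embedding layer never needs to build the sparse one-hot matrix.

```python
import numpy as np

vocab_size, emb_dim = 10, 4              # tiny illustrative sizes
rng = np.random.default_rng(0)
E = rng.normal(size=(vocab_size, emb_dim))   # embedding matrix, one row per word

word_ids = np.array([2, 7, 7, 0])        # a "sentence" of word indices
one_hot = np.eye(vocab_size)[word_ids]   # shape (4, vocab_size), mostly zeros

via_matmul = one_hot @ E                 # dense vectors through a matrix product
via_lookup = E[word_ids]                 # same vectors through plain row lookup

print(one_hot.shape, via_lookup.shape)
print(np.allclose(via_matmul, via_lookup))
```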

## How to use in keras
In keras:

In [None]:
from keras.layers import Embedding
model = Sequential()

# Use as the first layer
model.add(Embedding(input_dim=100000,
                    output_dim=300,
                    trainable=True,
                    embeddings_initializer=None,
                    input_length=120))


## Transfer learning
Transfer learning for language models

- GloVe
- word2vec
- BERT

In keras:

In [None]:
from keras.initializers import Constant
model.add(Embedding(input_dim=vocabulary_size,
                    output_dim=embedding_dim,
                    embeddings_initializer=Constant(pre_trained_vectors)))

## Using GloVe pre-trained vectors
Official site: https://nlp.stanford.edu/projects/glove

In [None]:
import numpy as np

# Get the GloVe vectors
def get_glove_vectors(filename="glove.6B.300d.txt"):
    # Get all word vectors from the pre-trained model
    glove_vector_dict = {}
    with open(filename, encoding="utf-8") as f:
        for line in f:
            # Each line holds a word followed by its vector components
            values = line.split()
            word = values[0]
            coefs = values[1:]
            glove_vector_dict[word] = np.asarray(coefs, dtype='float32')
    return glove_vector_dict

## Using GloVe on a specific task

In [None]:
# Filter GloVe vectors to a specific task
def filter_glove(vocabulary_dict, glove_dict, wordvec_dim=300):
    # Create a matrix to store the vectors
    embedding_matrix = np.zeros((len(vocabulary_dict) + 1, wordvec_dim))
    for word, i in vocabulary_dict.items():
        embedding_vector = glove_dict.get(word)
        # Words not found in glove_dict keep their all-zeros row
        if embedding_vector is not None:
            embedding_matrix[i] = embedding_vector
    return embedding_matrix
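
The two steps can be tried end to end on a toy example. The sketch below is self-contained and uses made-up data: three invented lines in the GloVe text format and a hypothetical task vocabulary with 3-dimensional vectors, so that no download is needed.

```python
import io
import numpy as np

# Three made-up lines in the GloVe text format: a word followed by its vector
fake_glove_file = io.StringIO(
    "movie 0.1 0.2 0.3\n"
    "good 0.4 0.5 0.6\n"
    "bad -0.4 -0.5 -0.6\n"
)

glove_dict = {}
for line in fake_glove_file:
    values = line.split()
    glove_dict[values[0]] = np.asarray(values[1:], dtype="float32")

# Hypothetical task vocabulary; indices start at 1, so row 0 stays unused
vocabulary_dict = {"movie": 1, "good": 2, "boring": 3}

embedding_matrix = np.zeros((len(vocabulary_dict) + 1, 3))
for word, i in vocabulary_dict.items():
    vector = glove_dict.get(word)
    if vector is not None:
        embedding_matrix[i] = vector

print(embedding_matrix)
```

The row for `"boring"` stays at zero because the word is missing from the toy GloVe data, and `"bad"` is dropped because it is not part of the task vocabulary.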

# Exercise V: Number of parameters comparison

You saw that the one-hot representation is not a good representation of words because it is very sparse. Using the `Embedding` layer creates a dense representation of the vectors, but also demands a lot of parameters to be learned.

In this exercise you will compare the number of parameters of two models using `embeddings` and `one-hot` encoding to see the difference.

The model `model_onehot` is already loaded in the environment, as well as `Sequential`, `Dense` and `GRU` from `keras`. Finally, the parameters `vocabulary_size=80000` and `sentence_len=200` are also loaded.
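
Before reading the summaries, the parameter counts can be estimated by hand. The two helper functions below are hypothetical and written for illustration. The value `wordvec_dim = 300` is an assumption, since the exercise does not state it, and the GRU formula assumes the `reset_after=True` variant used by default in recent Keras versions; older versions may count one bias vector per block instead of two.

```python
def embedding_params(input_dim, output_dim):
    # One output_dim-sized vector per row of the lookup table
    return input_dim * output_dim

def gru_params(input_dim, units, reset_after=True):
    # Three blocks (update gate, reset gate, candidate), each with
    # input weights, recurrent weights and one or two bias vectors
    n_bias = 2 if reset_after else 1
    return 3 * (units * (input_dim + units) + n_bias * units)

vocabulary_size = 80000   # from the exercise
wordvec_dim = 300         # assumed value; not given in the exercise

emb = embedding_params(vocabulary_size + 1, wordvec_dim)
gru = gru_params(wordvec_dim, 128)
print(f"Embedding: {emb:,}  GRU(128): {gru:,}")
```

Under these assumptions the embedding layer accounts for the large majority of the parameters in the embedding model.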

### Instructions

- Import the `Embedding` layer from `keras.layers`.
- On the embedding layer, use vocabulary size plus one as input dimension and sentence size as input length.
- Compile the model.
- Print the summary of the model with embedding.

In [None]:
# Import the embedding layer
from keras.layers import Embedding

# Create a model with embeddings
model = Sequential(name="emb_model")
model.add(Embedding(input_dim=vocabulary_size+1, output_dim=wordvec_dim, input_length=sentence_len, trainable=True))
model.add(GRU(128))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

# Print the summary of the one-hot model
model_onehot.summary()

# Print the summary of the model with embeddings
model.summary()

# Exercise VI: Transfer learning

You saw that when training an embedding layer, you need to learn a lot of parameters.

In this exercise, you will see that when using transfer learning it is possible to use the pre-trained weights without updating them, meaning that all the parameters of the embedding layer will be fixed, and the model will only need to learn the parameters of the other layers.

The function `load_glove` is already loaded in the environment and returns the GloVe matrix as a `numpy.ndarray`. It uses the function covered in the lesson's slides to retrieve the GloVe vectors with 200 embedding dimensions for the vocabulary present in this exercise.

### Instructions

- Use the pre-defined function to load the glove vectors.
- Use the initializer `Constant` on the pre-trained vectors.
- Add the output layer as a `Dense` with one unit.
- Print the summary and check the trainable parameters.

In [None]:
# Load the glove pre-trained vectors
glove_matrix = load_glove('glove_200d.zip')

# Create a model with embeddings
model = Sequential(name="emb_model")
model.add(Embedding(input_dim=vocabulary_size + 1, output_dim=wordvec_dim, 
                    embeddings_initializer=Constant(glove_matrix), 
                    input_length=sentence_len, trainable=False))
model.add(GRU(128))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

# Print the summary of the model with embeddings
model.summary()

# Exercise VII: Embeddings improve performance

Does the embedding layer improve the accuracy of the model? Let's check it out on the same IMDB data.

The model was already trained for 10 epochs, as in the previous model with a `SimpleRNN` cell. In order to compare the models, a test set `(X_test, y_test)` is available in the environment, as well as the old model `simpleRNN_model`. The old model's accuracy is loaded in the variable `acc_simpleRNN`.

All required modules and functions are loaded in the environment: `Sequential()` from `keras.models`, `Embedding` and `Dense` from `keras.layers` and `SimpleRNN` from `keras.layers.recurrent`.

### Instructions

- Add the embedding layer to the model.
- Compute the model's accuracy and store on the variable `acc_embeddings`.
- Print the accuracy of the old and new models.

In [None]:
# Create the model with embedding
model = Sequential(name="emb_model")
model.add(Embedding(input_dim=max_vocabulary, output_dim=wordvec_dim, input_length=max_len))
model.add(SimpleRNN(units=128))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

# Load pre-trained weights
model.load_weights('embedding_model_weights.h5')

# Evaluate the models' performance (ignore the loss value)
_, acc_embeddings = model.evaluate(X_test, y_test, verbose=0)

# Print the results
print("SimpleRNN model's accuracy:\t{0}\nEmbeddings model's accuracy:\t{1}".format(acc_simpleRNN, acc_embeddings))