# RNN Architecture

You will learn about the vanishing and exploding gradient problems, which often occur in RNNs, and how to deal with them using GRU and LSTM cells. Furthermore, you'll create embedding layers for language models and revisit the sentiment classification task.

# (1) Vanishing and exploding gradients

## Training RNN models

<p align='center'>
    <img src='image/Screenshot 2021-02-11 151311.png'>
    <img src='image/Screenshot 2021-02-11 151431.png'>
</p>

Example:
$$a_2 = f(W_a , a_1 , x_2)$$ 
$$= f(W_a , f(W_a , a_0 , x_1), x_2)$$

<p align='center'>
    <img src='image/Screenshot 2021-02-11 152644.png'>
</p>

**Remember that**
$$a_T = f(W_a , a_{T-1} , x_T)$$
$a_T$ also depends on $a_{T-1}$, which depends on $a_{T-2}$, and so on!
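
The nested dependency can be sketched in a few lines of plain Python. This is only an illustration under assumptions that are not in the notes: the states are scalars, $f$ is a `tanh` cell, and the input weight `w_x` is a hypothetical extra parameter.

```python
import math

def rnn_step(w_a, w_x, a_prev, x_t):
    """One step of a scalar RNN: a_t = tanh(w_a * a_prev + w_x * x_t).

    tanh and the input weight w_x are assumptions made for this sketch.
    """
    return math.tanh(w_a * a_prev + w_x * x_t)

def unroll(w_a, w_x, a0, xs):
    """Apply the recurrence over the whole sequence and return every state."""
    states = [a0]
    for x_t in xs:
        states.append(rnn_step(w_a, w_x, states[-1], x_t))
    return states

states = unroll(w_a=0.5, w_x=1.0, a0=0.0, xs=[1.0, -1.0, 0.5])

# a_2 written as nested calls, mirroring a_2 = f(W_a, f(W_a, a_0, x_1), x_2)
a2_nested = rnn_step(0.5, 1.0, rnn_step(0.5, 1.0, 0.0, 1.0), -1.0)
print(states[2] == a2_nested)
```

Because every state is computed from the previous one, the same weight `w_a` enters the computation once per time step, which is what makes the derivative with respect to it a long product.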

## BPTT continuation
**Computing derivatives leads to**
$$\frac{\partial a_t}{\partial W_a} = (W_a)^{t-1} g(X)$$

- $(W_a)^{t-1}$ **can converge to 0**
- **or diverge to** $+\infty$**!**
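
A quick numeric check makes both cases concrete. The sketch below assumes a scalar weight so that $(W_a)^{t-1}$ is an ordinary power; for a weight matrix, the magnitudes of its eigenvalues play the analogous role. The function name `repeated_factor` is made up for this illustration.

```python
def repeated_factor(w_a, t):
    """Return the factor (w_a)^(t-1) that appears in the derivative."""
    return w_a ** (t - 1)

# Compare a weight below 1, exactly 1, and above 1 at a few time steps
for w_a in (0.5, 1.0, 1.5):
    factors = [repeated_factor(w_a, t) for t in (1, 10, 50)]
    print('w_a = %.1f -> %s' % (w_a, ['%.3e' % f for f in factors]))
```

A weight slightly below 1 already drives the factor towards zero after a few dozen steps, while a weight slightly above 1 makes it grow without bound.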

## Solutions to the gradient problems
Some solutions are known:

### Exploding gradients
- Gradient clipping / scaling
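
The two variants can be sketched on a plain list of gradient values. This is a minimal illustration, not Keras code: `clip_by_value` and `clip_by_norm` are hypothetical helpers that mimic what the optimizer arguments `clipvalue` and `clipnorm` are understood to do.

```python
import math

def clip_by_value(grads, clip_value):
    """Clip each gradient component to the range [-clip_value, clip_value]."""
    return [max(-clip_value, min(clip_value, g)) for g in grads]

def clip_by_norm(grads, max_norm):
    """Rescale the whole gradient vector if its L2 norm exceeds max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return list(grads)
    scale = max_norm / norm
    return [g * scale for g in grads]

grads = [0.5, -12.0, 40.0]
clipped = clip_by_value(grads, 3.0)  # each component is limited independently
scaled = clip_by_norm(grads, 3.0)    # direction is kept, length is reduced
print(clipped)
print(scaled)
```

Clipping by value can change the direction of the gradient, whereas scaling by norm preserves it and only shortens the step.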

### Vanishing gradients
- Initialize the weight matrix $W$ more carefully
- Use regularization
- Use ReLU instead of tanh / sigmoid / softmax
- **Use LSTM or GRU cells!**

# Exercise I: Exploding gradient problem

In the video exercise, you learned about two problems that may arise when working with RNN models: the vanishing and exploding gradient problems.

This exercise explores the exploding gradient problem, showing that the derivative of a function can increase exponentially, and how to solve it with a simple technique.

The data is already loaded in the environment as `X_train`, `X_test`, `y_train` and `y_test`.

You will use a **Stochastic Gradient Descent** (SGD) optimizer and **Mean Squared Error** (MSE) as the loss function.

In the first step, you will observe the gradient exploding by computing the MSE on the train and test sets. In the second step, you will change the optimizer using the `clipvalue` parameter to solve the problem.

The Stochastic Gradient Descent in Keras is loaded as `SGD`.

### Instructions 1/2

- Use `SGD()` as optimizer and `(X_test, y_test)` as validation data.
- Evaluate train performance and print all the **MSE** values.

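```python
# Imports assumed for this exercise (tf.keras); X_train, X_test, y_train
# and y_test are preloaded in the exercise environment
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.initializers import he_uniform
from tensorflow.keras.optimizers import SGD

# Create a Keras model with one hidden Dense layer
model = Sequential()
model.add(Dense(25, input_dim=20, activation='relu', kernel_initializer=he_uniform(seed=42)))
model.add(Dense(1, activation='linear'))

# Compile and fit the model (`learning_rate` replaces the deprecated `lr` argument)
model.compile(loss='mean_squared_error', optimizer=SGD(learning_rate=0.01, momentum=0.9))
history = model.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=100, verbose=0)

# Evaluate the Mean Squared Error on the train and test data
train_mse = model.evaluate(X_train, y_train, verbose=0)
test_mse = model.evaluate(X_test, y_test, verbose=0)

# Print the values of MSE
print('Train: %.3f, Test: %.3f' % (train_mse, test_mse))
```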

### Instructions 2/2

- Set the `SGD()` parameter `clipvalue` to `3.0`.
- Compute the MSE values and store them in the `train_mse` and `test_mse` variables.

```python
# Create a Keras model with one hidden Dense layer
model = Sequential()
model.add(Dense(25, input_dim=20, activation='relu', kernel_initializer=he_uniform(seed=42)))
model.add(Dense(1, activation='linear'))

# Compile and fit the model; clipvalue=3.0 limits each gradient component to [-3, 3]
# (`learning_rate` replaces the deprecated `lr` argument)
model.compile(loss='mean_squared_error', optimizer=SGD(learning_rate=0.01, momentum=0.9, clipvalue=3.0))
history = model.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=100, verbose=0)

# Evaluate the Mean Squared Error on the train and test data
train_mse = model.evaluate(X_train, y_train, verbose=0)
test_mse = model.evaluate(X_test, y_test, verbose=0)

# Print the values of MSE
print('Train: %.3f, Test: %.3f' % (train_mse, test_mse))
```
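
To see why clipping the gradient helps, it can be useful to reproduce the effect on a toy problem without Keras. The sketch below is an assumption-laden illustration, not the exercise's data or model: it fits a single weight `w` to one sample with plain SGD, and the learning rate is deliberately chosen too large so that the unclipped updates diverge.

```python
def train(learning_rate, steps, clipvalue=None):
    """Fit y = w * x on one sample with plain SGD and return the final weight."""
    x, y, w = 10.0, 20.0, 0.0  # toy sample; the best weight is w = 2
    for _ in range(steps):
        grad = 2.0 * x * (w * x - y)  # derivative of (w * x - y)^2 with respect to w
        if clipvalue is not None:
            # Same idea as clipvalue: limit the gradient to [-clipvalue, clipvalue]
            grad = max(-clipvalue, min(clipvalue, grad))
        w -= learning_rate * grad
    return w

w_plain = train(learning_rate=0.02, steps=100)
w_clipped = train(learning_rate=0.02, steps=100, clipvalue=3.0)
print('no clipping: %.3e, clipvalue=3.0: %.3f' % (w_plain, w_clipped))
```

Without clipping, each update overshoots the optimum by more than the previous error, so the weight grows exponentially. With clipping, every update is bounded, so the weight stays close to the optimum. Clipping bounds the step size; it does not by itself guarantee a good final fit.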

# Exercise II: Vanishing gradient problem

The other possible gradient problem is when the gradients vanish, or go to zero. This is a much harder problem to solve because it is not as easy to detect. If the loss function does not improve on every step, is it because the gradients went to zero and thus didn't update the weights? Or is it because the model is not able to learn?

This problem occurs more often in RNN models when long memory is required, that is, when the input sentences are long.
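
The effect of sequence length can be sketched with the chain rule on a scalar RNN. The cell below is an illustration under stated assumptions, not the IMDB model: the activation is `tanh`, the weights `w_a = 0.9` and `w_x = 1.0` are arbitrary, and the input is a constant sequence.

```python
import math

def grad_last_wrt_first(w_a, w_x, xs, a0=0.0):
    """Return d a_T / d a_0 for a_t = tanh(w_a * a_{t-1} + w_x * x_t)."""
    a, grad = a0, 1.0
    for x_t in xs:
        a = math.tanh(w_a * a + w_x * x_t)
        grad *= w_a * (1.0 - a * a)  # chain rule factor d a_t / d a_{t-1}
    return grad

# How much does the final state still depend on the initial state?
grads_by_length = {}
for length in (5, 50, 200):
    grads_by_length[length] = grad_last_wrt_first(0.9, 1.0, [0.5] * length)
    print('length %3d -> gradient %.3e' % (length, grads_by_length[length]))
```

Each factor in the product is smaller than 1 in magnitude here, so the longer the sequence, the less the early inputs can influence the weight updates.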

In this exercise, you will observe the problem on the IMDB data, using only the longer sentences. The data is loaded in the `X` and `y` variables, and the classes `Sequential`, `SimpleRNN` and `Dense` are available, as is `matplotlib.pyplot` as `plt`. The model was pre-trained for 100 epochs and its weights are stored in the file `model_weights.h5`.

### Instructions

- Add a `SimpleRNN` layer to the model.
- Load the pre-trained weights on the model using the method `.load_weights()`.
- Add the accuracy of the training data available on the attribute `'acc'` to the plot.
- Display the plot using the method `.show()`.

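```python
# Imports assumed for this exercise (tf.keras and matplotlib)
import matplotlib.pyplot as plt
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import SimpleRNN, Dense

# Create the model
model = Sequential()
model.add(SimpleRNN(units=600, input_shape=(None, 1)))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='sgd', metrics=['accuracy'])

# Load pre-trained weights
model.load_weights('model_weights.h5')

# Plot the accuracy x epoch graph
# `history` is assumed to be preloaded from the 100-epoch pre-training run;
# recent Keras versions name these keys 'accuracy' and 'val_accuracy'
plt.plot(history.history['acc'])
plt.plot(history.history['val_acc'])
plt.legend(['train', 'val'], loc='upper left')
plt.show()
```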

<p align='center'>
    <img src='image/[2021-02-11 160543].svg'>
</p>