# Long sequences
By default, RNNs handle short sequences better than long ones.
On a long sequence, the unrolled RNN is a very deep network.
Like any DNN, it can suffer from unstable gradients and long training times.
In addition, RNNs tend to forget the earlier parts of a long sequence.

## Unstable gradient problem
Address this with:
* Good parameter initialization.
* Faster optimizers.
* Dropout.
* Different activations. 
* Use TensorBoard to monitor gradient size.
* Gradient Clipping.
* Layer Normalization (2016).
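
As an illustration of the gradient clipping item above, here is a minimal NumPy sketch of the two usual variants. The helper names `clip_by_norm` and `clip_by_value` and the sample gradient are made up for this note. In Keras the same effect is normally requested through the `clipnorm` or `clipvalue` argument of an optimizer rather than written by hand.

```python
import numpy as np

def clip_by_norm(grad, max_norm):
    """Rescale grad so that its L2 norm is at most max_norm (hypothetical helper)."""
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        return grad * (max_norm / norm)
    return grad

def clip_by_value(grad, limit):
    """Clamp every component of grad to the range [-limit, limit] (hypothetical helper)."""
    return np.clip(grad, -limit, limit)

grad = np.array([30.0, -40.0])        # made-up exploding gradient with norm 50
by_norm = clip_by_norm(grad, 1.0)     # keeps the direction, shrinks the norm to 1
by_value = clip_by_value(grad, 1.0)   # clamps each component, so the direction can change
print(by_norm, by_value)
```

Clipping by norm preserves the direction of the gradient, while clipping by value may not.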

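The default activation is tanh().
Avoid ReLU with RNNs.
ReLU works well in DNNs because it does not saturate,
i.e. its output keeps increasing with its input.
In an RNN, the same weights are used at every time step,
so a small increase in the outputs at one time step can be amplified at every later step until the outputs explode.
A nonsaturating activation such as ReLU does nothing to stop this, whereas tanh() keeps the outputs bounded.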
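
A toy NumPy sketch of this effect, under simplifying assumptions: there are no inputs, only a recurrent state, and the weight matrix `W` and the initial state are made-up values chosen so that each step scales the state by 1.2.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def run_recurrence(activation, n_steps=50):
    """Apply h = activation(W @ h) repeatedly, reusing one weight matrix W."""
    W = np.full((4, 4), 0.3)   # made-up shared recurrent weights
    h = np.ones(4)             # made-up initial state
    for _ in range(n_steps):
        h = activation(W @ h)
    return h

h_relu = run_recurrence(relu)     # grows by a factor of 1.2 at every step
h_tanh = run_recurrence(np.tanh)  # saturates, so it stays below 1
print("ReLU:", h_relu.max())
print("tanh:", h_tanh.max())
```

With ReLU the state grows geometrically over the 50 steps, while with tanh it settles at a value below 1.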

## Layer normalization
Batch normalization does not work well with RNNs.
In a DNN, it normalizes each feature across the instances of a batch.
But an RNN also has time steps within each instance.
Even when adapted to work with an RNN, it does not perform well.

Layer normalization computes its statistics per layer and per instance,
across all the features or units of that layer.
Because it never looks across the batch, it works the same way during training and testing.
In an RNN, it is best applied just after the linear combination of the inputs and the hidden state, before the activation.
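
The difference in axes can be sketched in NumPy. This is a simplified illustration, not the Keras implementation: it leaves out the learned scale and offset, and the function names and random data are made up.

```python
import numpy as np

def layer_norm(x, eps=1e-3):
    """Normalize each instance (row) across its features (last axis)."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def batch_norm_train(x, eps=1e-3):
    """Normalize each feature (column) across the instances of the batch."""
    mean = x.mean(axis=0, keepdims=True)
    var = x.var(axis=0, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

rng = np.random.default_rng(42)
batch = rng.normal(loc=5.0, scale=3.0, size=(8, 20))  # 8 instances, 20 units

# Layer norm: the first instance gets the same result alone or inside a batch.
ln_in_batch = layer_norm(batch)[0]
ln_alone = layer_norm(batch[:1])[0]

# Batch norm: the result for the first instance depends on its batch mates.
bn_full = batch_norm_train(batch)[0]
bn_half = batch_norm_train(batch[:4])[0]

print(np.allclose(ln_in_batch, ln_alone))
print(np.allclose(bn_full, bn_half))
```

This independence from the rest of the batch is why layer normalization needs no separate behavior at test time.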

Geron writes his own subclass of Keras Layer,
a custom cell built on Keras SimpleRNNCell.
Keras now has a subclass of Layer called [LayerNormalization](https://www.tensorflow.org/api_docs/python/tf/keras/layers/LayerNormalization).
There is a simple demo at [Keras](https://keras.io/api/layers/normalization_layers/layer_normalization/).
This layer may not have existed when the book came out.
(Layer normalization was published in 2016. The book came out in 2018.)
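
For reference, a custom cell of this kind can be sketched as follows. This is an approximation written for these notes, not a copy of the book's code: the class name `LayerNormRNNCell` is made up, and the explicit `build()` method is an assumption added so the sub-layers create their weights up front.

```python
import tensorflow as tf
from tensorflow import keras

class LayerNormRNNCell(keras.layers.Layer):
    """SimpleRNNCell with layer normalization before the activation (made-up name)."""

    def __init__(self, units, activation="tanh", **kwargs):
        super().__init__(**kwargs)
        self.units = units
        self.state_size = units    # required by keras.layers.RNN
        self.output_size = units
        # activation=None: the inner cell only computes the linear combination.
        self.simple_rnn_cell = keras.layers.SimpleRNNCell(units, activation=None)
        self.layer_norm = keras.layers.LayerNormalization()
        self.activation = keras.activations.get(activation)

    def build(self, input_shape):
        # Create the weights of both sub-layers before the first call.
        self.simple_rnn_cell.build(input_shape)
        self.layer_norm.build((None, self.units))
        super().build(input_shape)

    def call(self, inputs, states):
        # Linear combination of inputs and hidden state, then normalize, then activate.
        linear, _ = self.simple_rnn_cell(inputs, states)
        outputs = self.activation(self.layer_norm(linear))
        return outputs, [outputs]

# Wrap the cell in a generic RNN layer and run it on a dummy batch:
# 2 series, 5 time steps, 1 feature.
layer = keras.layers.RNN(LayerNormRNNCell(20), return_sequences=True)
outputs = layer(tf.random.normal([2, 5, 1]))
print(outputs.shape)
```

Here the normalization happens inside the cell at every time step, which is different from placing a LayerNormalization layer in front of the recurrent layers.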

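LayerNormalization has many options, such as the axis to normalize over and the epsilon added to the variance.
Its scale (gamma) and offset (beta) parameters can be set to fixed values or learned during training.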
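
A short sketch of both choices, assuming the standard constructor arguments of `keras.layers.LayerNormalization` (`axis`, `epsilon`, `center`, `scale`, `gamma_initializer`) and the generic `trainable` argument of Keras layers. The input values are made up.

```python
import numpy as np
import tensorflow as tf
from tensorflow import keras

# Learned parameters (the default): gamma starts at 1 and beta at 0, both trainable.
ln = keras.layers.LayerNormalization(axis=-1, epsilon=1e-3, center=True, scale=True)

# Set parameters: gamma fixed at 2, beta fixed at 0, neither one trained.
ln_set = keras.layers.LayerNormalization(
    gamma_initializer=keras.initializers.Constant(2.0),
    trainable=False,
)

x = tf.constant(np.arange(12, dtype=np.float32).reshape(3, 4))  # made-up input
y = np.asarray(ln(x))          # each row standardized across its 4 features
y_set = np.asarray(ln_set(x))  # same, then scaled by the fixed gamma

print(len(ln.trainable_weights), len(ln_set.trainable_weights))
print(y[0], y_set[0])
```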

In [13]:
import os
import sys
from pathlib import Path

import numpy as np
import sklearn
import tensorflow as tf
from tensorflow import keras

# Fix the random seeds so that the runs are repeatable.
np.random.seed(42)
tf.random.set_seed(42)

def generate_time_series(batch_size, n_steps):
    """Return noisy sums of two sine waves, with shape [batch_size, n_steps, 1]."""
    freq1, freq2, offset1, offset2 = np.random.rand(4, batch_size, 1)
    time = np.linspace(0, 1, n_steps)
    series = 0.5 * np.sin((time - offset1) * (freq1 * 10 + 10))   # wave 1
    series += 0.2 * np.sin((time - offset2) * (freq2 * 20 + 20))  # wave 2
    series += 0.1 * (np.random.rand(batch_size, n_steps) - 0.5)   # noise
    return series[..., np.newaxis].astype(np.float32)

In [14]:
n_steps = 50
# Each series has n_steps inputs plus 10 future values to forecast.
series = generate_time_series(10000, n_steps + 10)
X_train = series[:7000, :n_steps]
X_valid = series[7000:9000, :n_steps]
X_test = series[9000:, :n_steps]

# Sequence-to-sequence targets: at every time step, the next 10 values.
Y = np.empty((10000, n_steps, 10))
for step_ahead in range(1, 10 + 1):
    Y[:, :, step_ahead - 1] = series[:, step_ahead:step_ahead + n_steps, 0]
y_train = Y[:7000]
y_valid = Y[7000:9000]
y_test = Y[9000:]

In [9]:
# First, run the model with no normalization.
rnn1 = keras.models.Sequential([
    keras.layers.SimpleRNN(20, return_sequences=True, input_shape=[None, 1]),
    keras.layers.SimpleRNN(20, return_sequences=True),
    keras.layers.TimeDistributed(keras.layers.Dense(10))
])

rnn1.compile(loss="mse", optimizer="adam")
history = rnn1.fit(X_train, y_train, epochs=5,
                   validation_data=(X_valid, y_valid))


In [10]:
# Second, repeat the run with LayerNormalization in front.
# axis=1 normalizes each input series across its time steps.
# No input_shape is given: the model is built on the first call to fit(),
# when the number of time steps is known.
rnn2 = keras.models.Sequential([
    keras.layers.LayerNormalization(axis=1),
    keras.layers.SimpleRNN(20, return_sequences=True),
    keras.layers.SimpleRNN(20, return_sequences=True),
    keras.layers.TimeDistributed(keras.layers.Dense(10))
])

rnn2.compile(loss="mse", optimizer="adam")
history = rnn2.fit(X_train, y_train, epochs=5,
                   validation_data=(X_valid, y_valid))


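Layer normalization did not help, but we did not expect it to.
It would only help if we were having an exploding gradient.
Note that this LayerNormalization layer standardizes each input series across its time steps before the first recurrent layer.
It does not normalize inside the recurrent cell at each time step, as Geron's custom cell does.
This demo just shows that adding it did no harm.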