# Long-term memory
The simple RNN has only short-term memory.
It rarely remembers more than about 10 time steps back.
This is because the data is transformed at every time step, so some information is lost each time.
The simple RNN is rarely used on its own anymore.
Most models use a cell with some form of enhanced long-term memory.
These cells can remember about 100 time steps.
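
The loss of information can be illustrated with a small NumPy sketch. It is not a trained RNN: the recurrent weights `W` are hypothetical random values, rescaled (an assumption made for this illustration) so that each step shrinks differences between states. Two hidden states that differ only at step 0 are pushed through the same tanh transformation, and the gap between them is tracked.

```python
import numpy as np

rng = np.random.default_rng(0)
n_units = 20

# Hypothetical recurrent weights, rescaled so the largest singular value is 0.9.
W = rng.normal(size=(n_units, n_units))
W *= 0.9 / np.linalg.norm(W, 2)

# Two hidden states that differ only because of what was seen at step 0.
h_a = rng.normal(size=n_units)
h_b = h_a + rng.normal(size=n_units)

gaps = []
for step in range(50):
    # The same transformation is applied at every time step.
    # Later inputs are identical for both states, so they are omitted here.
    h_a = np.tanh(W @ h_a)
    h_b = np.tanh(W @ h_b)
    gaps.append(np.linalg.norm(h_a - h_b))

print(f"gap after  1 step : {gaps[0]:.3g}")
print(f"gap after 10 steps: {gaps[9]:.3g}")
print(f"gap after 50 steps: {gaps[49]:.3g}")
```

Under these assumptions the gap shrinks at every step, so the trace of what happened at step 0 fades quickly.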

## LSTM cells
Long short-term memory has been improved since its introduction in 1997.

The Keras LSTM converges faster and remembers more than the simple RNN.
There are two ways to use it.
* Pass an LSTMCell() instance to a Keras RNN() layer.
* Use the LSTM() layer. This is optimized for GPU.

In [1]:
import os
import sys
from pathlib import Path

import numpy as np
import sklearn
import tensorflow as tf
from tensorflow import keras

# Fix the random seeds so the runs are repeatable.
np.random.seed(42)
tf.random.set_seed(42)

def generate_time_series(batch_size, n_steps):
    # Each series is the sum of two sine waves plus a little noise.
    freq1, freq2, offset1, offset2 = np.random.rand(4, batch_size, 1)
    time = np.linspace(0, 1, n_steps)
    series = 0.5 * np.sin((time - offset1) * (freq1 * 10 + 10))
    series += 0.2 * np.sin((time - offset2) * (freq2 * 20 + 20))
    series += 0.1 * (np.random.rand(batch_size, n_steps) - 0.5)
    return series[..., np.newaxis].astype(np.float32)

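n_steps = 50
# Each series has n_steps inputs plus 10 extra steps to forecast.
series = generate_time_series(10000, n_steps + 10)
# Sequence-to-vector targets: the 10 steps after the input window.
# These y values are replaced by the sequence-to-sequence targets below.
X_train, y_train = series[:7000, :n_steps], series[:7000, -10:, 0]
X_valid, y_valid = series[7000:9000, :n_steps], series[7000:9000, -10:, 0]
X_test, y_test = series[9000:, :n_steps], series[9000:, -10:, 0]

# Sequence-to-sequence targets: at every time step, the next 10 values.
Y = np.empty((10000, n_steps, 10))
for step_ahead in range(1, 10 + 1):
    Y[:, :, step_ahead - 1] = series[:, step_ahead:step_ahead + n_steps, 0]
y_train = Y[:7000]
y_valid = Y[7000:9000]
y_test = Y[9000:]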
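
The target loop is easier to follow on a toy case. The sketch below repeats the same indexing with hypothetical small sizes (5 input steps, 3 steps ahead) on a series whose values are simply 0, 1, 2, ..., so each target row can be read off directly.

```python
import numpy as np

toy_steps, toy_horizon = 5, 3   # hypothetical small sizes, for illustration only
toy_series = np.arange(toy_steps + toy_horizon, dtype=float).reshape(1, -1, 1)

toy_Y = np.empty((1, toy_steps, toy_horizon))
for step_ahead in range(1, toy_horizon + 1):
    # Column step_ahead-1 holds the series shifted left by step_ahead.
    toy_Y[:, :, step_ahead - 1] = toy_series[:, step_ahead:step_ahead + toy_steps, 0]

print(toy_Y[0, 0])    # targets at time step 0: the values at steps 1, 2, 3
print(toy_Y[0, -1])   # targets at the last input step: the values at steps 5, 6, 7
```

At every input time step the model is asked for the next `toy_horizon` values, not just a forecast at the end of the window.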

In [2]:
# First, train the simple RNN.
rnn1 = keras.models.Sequential([
    keras.layers.SimpleRNN(20,return_sequences=True,input_shape=[None,1]),
    keras.layers.SimpleRNN(20,return_sequences=True),
    keras.layers.TimeDistributed(keras.layers.Dense(10))
])

rnn1.compile(loss="mse", optimizer="adam")
history = rnn1.fit(X_train, y_train, epochs=5,
                    validation_data=(X_valid, y_valid))  

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


In [3]:
# Second, substitute LSTM for simple RNN.
rnn2 = keras.models.Sequential([
    keras.layers.LSTM(20,return_sequences=True,input_shape=[None,1]),
    keras.layers.LSTM(20,return_sequences=True),
    keras.layers.TimeDistributed(keras.layers.Dense(10))
])

rnn2.compile(loss="mse", optimizer="adam")
history = rnn2.fit(X_train, y_train, epochs=5,
                    validation_data=(X_valid, y_valid))  

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


The LSTM is much slower than the simple RNN (each epoch took about twice as long),
but the loss went down faster.

Geron has a great figure of the LSTM cell.
See Figure 15.9 on page 516.
It is also online [png](https://learning.oreilly.com/library/view/neural-networks-and/9781492037354/assets/mlst_1413.png) 
in [ch4](https://learning.oreilly.com/library/view/neural-networks-and/9781492037354/ch04.html).
The equations for each layer are on page 517.
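
As a rough companion to those equations, here is a minimal NumPy sketch of one LSTM time step, using the standard formulation (input, forget and output gates plus a candidate layer). The function name `lstm_step`, the random weights, and the order in which the four layers are packed into one matrix are assumptions for this illustration, not the Keras internals.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W_x, W_h, b):
    """One LSTM time step for a batch.

    x: (batch, n_in); h_prev and c_prev: (batch, n_units)
    W_x: (n_in, 4*n_units); W_h: (n_units, 4*n_units); b: (4*n_units,)
    """
    z = x @ W_x + h_prev @ W_h + b
    i, f, g, o = np.split(z, 4, axis=1)   # packing order chosen for this sketch
    i, f, o = sigmoid(i), sigmoid(f), sigmoid(o)
    g = np.tanh(g)                        # candidate values
    c = f * c_prev + i * g                # forget part of c, then add new memories
    h = o * np.tanh(c)                    # output gate filters what leaves the cell
    return h, c

rng = np.random.default_rng(42)
batch, n_steps, n_in, n_units = 3, 50, 1, 20
W_x = rng.normal(scale=0.1, size=(n_in, 4 * n_units))
W_h = rng.normal(scale=0.1, size=(n_units, 4 * n_units))
b = np.zeros(4 * n_units)

X = rng.normal(size=(batch, n_steps, n_in))
h = np.zeros((batch, n_units))
c = np.zeros((batch, n_units))
for t in range(n_steps):
    h, c = lstm_step(X[:, t], h, c, W_x, W_h, b)

print(h.shape, c.shape)
```

Two vectors are carried from step to step: `c` is only scaled and added to, while `h` is recomputed from `c` each time.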

## Peephole LSTM
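This LSTM variant was proposed in 2000 by Gers and Schmidhuber.
It adds peephole connections that let the gate controllers look at the cell state.
The book says it often improves performance, but not always, and there is no clear pattern for when it helps.
It is marked experimental in Keras: [PeepholeLSTMCell](https://www.tensorflow.org/api_docs/python/tf/keras/experimental/PeepholeLSTMCell)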


## GRU cells
Simplified LSTM that seems to perform just as well.

LSTM has 2 vectors carried forward at each time step. GRU merges c and h into just h. 
* LSTM c = cell state for long term memory
* LSTM h = output of this hidden layer at previous time step.


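LSTM has separate forget and input gates. GRU merges these into one gate controller, z.
* LSTM input gate decides which parts of the new candidate values are added to c.
* LSTM forget gate decides which parts of c are erased.

Because z controls both, GRU erases a memory location whenever it writes to it: where z is 1 the old state is kept, and where z is 0 it is replaced by the new candidate.

GRU has no output gate, so the full state h is output at every time step. Instead, an extra gate controller, r, decides which part of the previous state is shown to the layer that computes the candidate.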

See figure on page 519. See equations on page 520.
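
A minimal NumPy sketch of one GRU time step may help here. It follows the convention where the gate controller z = 1 keeps the old state. The function name `gru_step`, the random weights, and the large bias values used to force z open or closed are assumptions for this illustration only.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x, h_prev, W_x, W_h, b):
    """One GRU time step. Weight columns are packed as [z | r | g]."""
    n = h_prev.shape[1]
    zr = sigmoid(x @ W_x[:, :2 * n] + h_prev @ W_h[:, :2 * n] + b[:2 * n])
    z, r = np.split(zr, 2, axis=1)
    # r decides which part of the previous state the candidate layer sees.
    g = np.tanh(x @ W_x[:, 2 * n:] + (r * h_prev) @ W_h[:, 2 * n:] + b[2 * n:])
    # One controller does both jobs: keep where z is near 1, overwrite where near 0.
    return z * h_prev + (1.0 - z) * g

rng = np.random.default_rng(0)
n_in, n = 1, 4
W_x = rng.normal(scale=0.1, size=(n_in, 3 * n))
W_h = rng.normal(scale=0.1, size=(n, 3 * n))
x = rng.normal(size=(2, n_in))
h_prev = rng.uniform(-1, 1, size=(2, n))

b_keep = np.zeros(3 * n)
b_keep[:n] = 50.0     # pushes z toward 1: keep the old memory
b_write = np.zeros(3 * n)
b_write[:n] = -50.0   # pushes z toward 0: overwrite with the candidate

h_keep = gru_step(x, h_prev, W_x, W_h, b_keep)
h_write = gru_step(x, h_prev, W_x, W_h, b_write)

print(np.allclose(h_keep, h_prev))    # state is carried over unchanged
print(np.allclose(h_write, h_prev))   # state is replaced by the candidate
```

Storing a new value and erasing the old one are the same action, which is why there is no separate cell state c.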

In [4]:
# Third, substitute GRU for LSTM.
rnn3 = keras.models.Sequential([
    keras.layers.GRU(20,return_sequences=True,input_shape=[None,1]),
    keras.layers.GRU(20,return_sequences=True),
    keras.layers.TimeDistributed(keras.layers.Dense(10))
])

rnn3.compile(loss="mse", optimizer="adam")
history = rnn3.fit(X_train, y_train, epochs=5,
                    validation_data=(X_valid, y_valid))  

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


## 1D convolution
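To tackle sequences longer than 100 time steps,
shorten the sequences before they reach the recurrent layers.
A 1D convolutional layer with a stride greater than 1 does this by downsampling.
This is similar to 2D convolution for a CNN.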

Slide kernels (filters) over the sequence (or image).
Each kernel learns to detect one short pattern whose length equals the kernel size.

Geron says this is so effective that you can
even omit the RNN/LSTM/GRU layers!
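
The sliding can be sketched in plain NumPy. The helper `conv1d_valid` is a hypothetical name for this illustration. It applies one kernel with no padding and no kernel flip, which is how deep-learning convolution layers slide their filters. The sizes match the Conv1D model in this section: 50 input steps, kernel size 4, stride 2.

```python
import numpy as np

def conv1d_valid(series, kernel, stride):
    """Slide one kernel over a 1D series with 'valid' padding (no padding)."""
    k = len(kernel)
    starts = range(0, len(series) - k + 1, stride)
    return np.array([np.dot(series[s:s + k], kernel) for s in starts])

n_steps = 50
series = np.arange(n_steps, dtype=float)    # values equal their own time index
kernel = np.array([0.0, 0.0, 0.0, 1.0])     # toy kernel: copy the last step of each window

out = conv1d_valid(series, kernel, stride=2)
print(len(out))    # 24 output time steps from 50 input steps
print(out[:3])     # time index of the last input seen by the first three windows
```

The first window ends at step 3 and each later window ends 2 steps further on, so the outputs line up with input steps 3, 5, ..., 49. That is why the targets for the Conv1D model are sliced with `[:, 3::2]`.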

In [10]:
# Fourth, add Conv1D.
# This resembles sliding over K-mers with K=4 and keeping every other one (stride 2).
rnn4 = keras.models.Sequential([
    keras.layers.Conv1D(filters=20,
                       kernel_size=4,  
                       strides=2,
                       padding="valid",
                       input_shape=[None,1]),
    keras.layers.GRU(20,return_sequences=True), 
    keras.layers.GRU(20,return_sequences=True),
    keras.layers.TimeDistributed(keras.layers.Dense(10))
])
# Specify input shape in first layer only.

rnn4.compile(loss="mse", optimizer="adam")
history = rnn4.fit(X_train, y_train[:,3::2], epochs=5,
                    validation_data=(X_valid, y_valid[:,3::2]))
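# Note the new shape of the targets: the first Conv1D output covers steps 0-3,
# and the stride of 2 keeps every other step after that (24 steps in total).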

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


## WaveNet
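Introduced in 2016 by DeepMind for generating raw audio, including human speech.
It stacks 1D convolutional layers and doubles the dilation rate at each layer.
With kernel size 2, the first layer looks at adjacent time steps,
the next at time steps 2 apart, then 4 apart, and so on.
The paper used three stacked groups of 10 layers with dilation rates 1, 2, 4, 8, ..., 512.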
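
The effect of doubling the dilation rate can be checked with a little arithmetic. Each layer with kernel size k and dilation rate d lets the output see (k - 1) * d more time steps into the past. The helper `receptive_field` below is a hypothetical name, and the sketch assumes kernel size 2 throughout.

```python
def receptive_field(dilation_rates, kernel_size=2):
    """Number of input time steps that can influence one output time step."""
    return 1 + sum((kernel_size - 1) * rate for rate in dilation_rates)

notebook_rates = (1, 2, 4, 8) * 2                  # the rates used in this notebook
paper_group = tuple(2 ** i for i in range(10))     # 1, 2, 4, ..., 512

print(receptive_field(notebook_rates))     # 31
print(receptive_field(paper_group))        # 1024
print(receptive_field(paper_group * 3))    # 3070
```

One group of 10 layers already covers 1024 time steps, which is how a stack of small kernels can handle very long sequences.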

In [11]:
# Fifth, do WaveNet i.e. stacked 1D conv.
rnn5 = keras.models.Sequential()
rnn5.add(keras.layers.InputLayer(input_shape=[None,1]))
for rate in (1,2,4,8)*2:
    rnn5.add(keras.layers.Conv1D(filters=20,kernel_size=2,  
            padding="causal",activation="relu",dilation_rate=rate))
rnn5.add(keras.layers.Conv1D(filters=10,kernel_size=1))  

rnn5.compile(loss="mse", optimizer="adam")
history = rnn5.fit(X_train, y_train, epochs=5,
                    validation_data=(X_valid, y_valid))

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5
