# Recurrent Neural Networks

In this chapter, we will discuss RNNs, which "can analyze time series data" and, more generally, can "work on sequences of arbitrary lengths, rather than on fixed-sized inputs like all the nets we have discussed so far."

"In this chapter, we will look at the fundamental concepts underlying RNNs, the main problem they face (namely, vanishing/exploding gradients), and the solutions widely used to fight it: LSTM and GRU cells. Along the way, as always, we will show how to implement RNNs using TensorFlow. Finally, we will take a look at the architecture of a machine translation system."

### Recurrent Neurons

"Up to now, we have mostly looked at feedforward neural networks, where the activations flow only in one direction, from the input layer to the output layer. A recurrent neural network loos very much like a feedforward neural network, except it also has connections pointing backward. Let's look at the simplest possible RNN, composed of just one neuron receiving inputs, producing an output, and sending that output back to itself."

"At each time step t, this recurrent neuron receives the inputs **x**(t) as well as its own output from the previous time step y(t-1)."

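Now, instead of applying the activation to the (transposed) weights times the inputs plus the bias, a recurrent neuron computes

$$y_{(t)} = \phi\left(\mathbf{w}_x^\top \mathbf{x}_{(t)} + \mathbf{w}_y^\top \mathbf{y}_{(t-1)} + b\right)$$

where $\phi$ is the activation function and there are two separate weight vectors: $\mathbf{w}_x$ for the inputs and $\mathbf{w}_y$ for the outputs of the previous time step.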
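
The recurrence can be sketched in a few lines of NumPy. This is only an illustration for a small layer of recurrent neurons, not the TensorFlow implementation: the function name `recurrent_layer_forward`, the random weights, and the choice of `tanh` are assumptions made for the example. Each column of `W_x` and `W_y` holds the weight vectors of one neuron.

```python
import numpy as np

def recurrent_layer_forward(x_seq, W_x, W_y, b, activation=np.tanh):
    """Run a layer of recurrent neurons over one input sequence.

    x_seq: (n_steps, n_inputs), W_x: (n_inputs, n_neurons),
    W_y: (n_neurons, n_neurons), b: (n_neurons,)
    """
    y_prev = np.zeros(W_y.shape[0])  # no previous output at the first time step
    outputs = []
    for x_t in x_seq:
        # Same formula at every step: current inputs plus previous outputs.
        y_prev = activation(x_t @ W_x + y_prev @ W_y + b)
        outputs.append(y_prev)
    return np.stack(outputs)  # (n_steps, n_neurons)

rng = np.random.default_rng(0)
x_seq = rng.normal(size=(3, 2))   # 3 time steps, 2 input features
W_x = rng.normal(size=(2, 4))     # 4 recurrent neurons (illustrative sizes)
W_y = rng.normal(size=(4, 4))
b = np.zeros(4)
outputs = recurrent_layer_forward(x_seq, W_x, W_y, b)
```

Note that the same weights are reused at every time step; only `y_prev` changes as the sequence is consumed.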

### Memory Cells

"Since the output of a recurrent neuron at time step t is a function of all the inputs from previous time steps, you could say that it has a form of *memory*. A part of a neural network that preserves some state across time steps is called a *memory cell* (or simply a *cell*). A single recurrent neuron, or a layer of recurrent neurons, is a very basic cell, but later in this chapter we will look at some more complex and powerful types of cells."

### Input and Output Sequences

There are 4 possible approaches that RNNs can take in terms of their inputs and outputs.

- **Sequence-Sequence**: you input a sequence of data into the net, and it outputs another sequence. "This is useful for predicting time series such as stock prices: you feed it the prices over the last N days, and it must output the prices shifted by one day into the future (i.e., from N-1 days ago to tomorrow)."

- **Sequence-Vector**: the idea here is that you input a sequence and then simply ignore all the outputs except the last one. This will give you one vector (or scalar if it's only one output neuron). "For example: you could feed the network a sequence of words corresponding to a movie review, and the network would output a sentiment score."

- **Vector-Sequence**: "You could feed the network a single input at the first time step (and zeros for all other time steps), and let it output a sequence... For example, the input could be an image, and the output could be a caption for that image."

- **Encoder-Decoder**: This approach actually combines a sequence-to-vector encoder followed by a vector-to-sequence decoder. "For example, this can be used for translating a sentence from one language to another. You would feed the network a sentence in one language, the encoder would convert this sentence into a single vector representation, and then the decoder would decode this vector into a sentence in another language. This two-step model... works much better than trying to translate on the fly with a single sequence-to-sequence RNN... since the last words of a sentence can affect the first words of the translation, so you need to wait until you have heard the whole sentence before translating it."

![inputoutput](./inputoutput.jpg)
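
The four approaches differ mainly in which inputs are fed and which outputs are kept. The NumPy sketch below illustrates this with a toy recurrent layer; `run_rnn`, the sizes, and the random weights are hypothetical, and for brevity the "decoder" reuses the encoder's weights, whereas a real encoder-decoder would use separate ones.

```python
import numpy as np

rng = np.random.default_rng(0)
n_inputs, n_neurons, n_steps = 3, 5, 4
W_x = rng.normal(size=(n_inputs, n_neurons))
W_y = rng.normal(size=(n_neurons, n_neurons))

def run_rnn(x_seq, h0=None):
    """Return the layer output at every time step, shape (n_steps, n_neurons)."""
    h = np.zeros(n_neurons) if h0 is None else h0
    outputs = []
    for x_t in x_seq:
        h = np.tanh(x_t @ W_x + h @ W_y)
        outputs.append(h)
    return np.stack(outputs)

x_seq = rng.normal(size=(n_steps, n_inputs))

# Sequence-to-sequence: keep the output at every time step.
seq_to_seq = run_rnn(x_seq)

# Sequence-to-vector: ignore every output except the last one.
seq_to_vec = run_rnn(x_seq)[-1]

# Vector-to-sequence: a real input at the first step, zeros afterwards.
vec_input = np.zeros((n_steps, n_inputs))
vec_input[0] = x_seq[0]
vec_to_seq = run_rnn(vec_input)

# Encoder-decoder: the encoder's final vector becomes the decoder's initial state.
encoded = run_rnn(x_seq)[-1]
decoded = run_rnn(np.zeros((n_steps, n_inputs)), h0=encoded)
```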

In [27]:
import tensorflow as tf
import math
import numpy as np
from sklearn.model_selection import train_test_split

In [7]:
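# Treat each 28x28 image as a sequence of 28 time steps (rows) of 28 features (pixels).
input_layer = tf.keras.layers.Input((28, 28))
# By default SimpleRNN returns only its last output: a sequence-to-vector model.
rec = tf.keras.layers.SimpleRNN(150, activation='relu')(input_layer)
# One softmax unit per digit class.
dense = tf.keras.layers.Dense(10, activation='softmax')(rec)
rnn = tf.keras.models.Model(inputs=input_layer, outputs=dense)
rnn.summary()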

Model: "model_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_6 (InputLayer)         [(None, 28, 28)]          0         
_________________________________________________________________
simple_rnn_4 (SimpleRNN)     (None, 150)               26850     
_________________________________________________________________
dense_1 (Dense)              (None, 10)                1510      
Total params: 28,360
Trainable params: 28,360
Non-trainable params: 0
_________________________________________________________________
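
The parameter counts in the summary can be checked by hand. A recurrent layer has one weight per (input, neuron) pair, one weight per (previous output, neuron) pair, and one bias per neuron. The helper names below are made up for this check.

```python
def simple_rnn_params(n_inputs, n_units):
    # input weights + recurrent weights + biases
    return n_inputs * n_units + n_units * n_units + n_units

def dense_params(n_inputs, n_units):
    # weights + biases
    return n_inputs * n_units + n_units

rnn_count = simple_rnn_params(28, 150)   # 28*150 + 150*150 + 150
dense_count = dense_params(150, 10)      # 150*10 + 10
total_count = rnn_count + dense_count
print(rnn_count, dense_count, total_count)  # 26850 1510 28360
```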


In [8]:
rnn.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

In [9]:
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
(x_train.shape, x_test.shape)

((60000, 28, 28), (10000, 28, 28))

In [10]:
x_train, x_test = x_train/255., x_test/255.

In [11]:
rnn.fit(x_train, y_train, epochs=10, validation_data=(x_test, y_test))

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<tensorflow.python.keras.callbacks.History at 0x1d257c10d30>

### Training to Predict Time Series

"Each training instance is a randomly selected sequence of 20 consecutive values from the time series, and the target sequence is the same as the input sequence, except it is shifted by one time step into the future."

In [102]:
# We will use the function f(t) = sin(t)/3 + 2*sin(5t) for our time series.
num_values = 800
timeSeries = [math.sin(t)/3 + 2*math.sin(5*t) for t in np.linspace(0,30,num=num_values)]
timeSeries[:5]

[0.0,
 0.3857803980708031,
 0.7584261902904045,
 1.1052637379071897,
 1.4145251384022537]

In [103]:
n_steps = 20
n_inputs = 1
n_neurons = 100
n_outputs = 1

In [104]:
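x, y = [], []
# Cut the series into non-overlapping windows of n_steps values. The last window
# is skipped (-1) because its target would run past the end of the series.
for i in range(num_values // n_steps - 1):
    x_batch = timeSeries[n_steps*i:n_steps*(i+1)]
    # The target is the same window shifted one step into the future.
    y_batch = timeSeries[n_steps*i+1:n_steps*(i+1)+1]
    x.append(x_batch)
    y.append(y_batch)
np.array(y).shape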

(39, 20)
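
The quoted description draws each training instance at a *random* position in the series, whereas the cell above cuts fixed, non-overlapping windows. A minimal sketch of the random variant follows; `sample_batch` and its parameters are hypothetical names introduced for this example.

```python
import math
import numpy as np

def sample_batch(series, batch_size, n_steps, rng):
    """Draw random windows of n_steps values and their one-step-ahead targets."""
    series = np.asarray(series)
    # The latest valid start leaves room for n_steps inputs plus one extra target value.
    starts = rng.integers(0, len(series) - n_steps, size=batch_size)
    x = np.stack([series[s:s + n_steps] for s in starts])
    y = np.stack([series[s + 1:s + n_steps + 1] for s in starts])
    # Add a trailing feature axis: (batch_size, n_steps, 1).
    return x[..., np.newaxis], y[..., np.newaxis]

series = [math.sin(t) / 3 + 2 * math.sin(5 * t) for t in np.linspace(0, 30, num=800)]
rng = np.random.default_rng(42)
x_batch, y_batch = sample_batch(series, batch_size=8, n_steps=20, rng=rng)
```

Random windows may overlap, which gives the network many more distinct training instances than the 39 fixed windows.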

"At each timestep, we now have an output vector of size 100. But what we actually want is a single output value at each time step." We can solve this by adding a Dense layer of 1 neuron with no activation.

In [96]:
rnn = tf.keras.models.Sequential()
# return_sequences=True makes the layer emit its output at every time step,
# so the model predicts one value per step instead of one value per sequence.
rnn.add(tf.keras.layers.SimpleRNN(n_neurons, activation='relu', return_sequences=True, input_shape=(n_steps, n_inputs)))
# On a 3D input, Dense acts on the last axis, i.e. independently at each time step.
rnn.add(tf.keras.layers.Dense(n_outputs, activation=None))
rnn.summary()

Model: "sequential_10"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
simple_rnn_17 (SimpleRNN)    (None, 20, 100)           10200     
_________________________________________________________________
dense_9 (Dense)              (None, 20, 1)             101       
Total params: 10,301
Trainable params: 10,301
Non-trainable params: 0
_________________________________________________________________


In [97]:
rnn.compile(optimizer='adam',loss='mse')

In [105]:
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=.2)

In [109]:
# Inputs and targets both get the shape (batch, n_steps, n_features).
x_train = np.array(x_train).reshape(-1, n_steps, n_inputs)
x_test = np.array(x_test).reshape(-1, n_steps, n_inputs)
y_train = np.array(y_train).reshape(-1, n_steps, n_outputs)
y_test = np.array(y_test).reshape(-1, n_steps, n_outputs)

In [110]:
x_train.shape

(31, 20, 1)

In [112]:
rnn.fit(x_train, y_train, epochs=30, validation_data=(x_test,y_test), shuffle=False)

Epoch 1/30
Epoch 2/30
Epoch 3/30
Epoch 4/30
Epoch 5/30
Epoch 6/30
Epoch 7/30
Epoch 8/30
Epoch 9/30
Epoch 10/30
Epoch 11/30
Epoch 12/30
Epoch 13/30
Epoch 14/30
Epoch 15/30
Epoch 16/30
Epoch 17/30
Epoch 18/30
Epoch 19/30
Epoch 20/30
Epoch 21/30
Epoch 22/30
Epoch 23/30
Epoch 24/30
Epoch 25/30
Epoch 26/30
Epoch 27/30
Epoch 28/30
Epoch 29/30
Epoch 30/30


<tensorflow.python.keras.callbacks.History at 0x1d2618e3370>

In [61]:
rnn.output.shape

TensorShape([None, 20, 1])

### The Difficulty of Training Over Many Time Steps

As you increase the number of time steps, the unrolled RNN becomes a very deep network, with one layer per time step. Like any deep network, it can suffer from the vanishing/exploding gradients problem that we saw before. You can still apply all the tricks that we discussed before: non-saturating activation functions, good parameter initialization, batch normalization, etc., but they may not be enough, and training can still be very slow.
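
A toy calculation shows why depth in time is a problem. Assume a scalar, linear recurrence h(t) = w · h(t-1), a deliberate simplification that ignores inputs and activation functions. The gradient of the last state with respect to the first is then w raised to the number of time steps, so it shrinks or grows exponentially. The function name is hypothetical.

```python
def gradient_factor(recurrent_weight, n_steps):
    """d h(n_steps) / d h(0) for the toy recurrence h(t) = w * h(t-1)."""
    return recurrent_weight ** n_steps

shrinking = gradient_factor(0.9, 100)  # weight slightly below 1: gradient vanishes
growing = gradient_factor(1.1, 100)    # weight slightly above 1: gradient explodes
print(f"{shrinking:.2e}  {growing:.2e}")
```

Over 100 time steps, a recurrent weight of 0.9 scales the gradient down by more than four orders of magnitude, and a weight of 1.1 scales it up by about as much.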

"The simplest and most common solution to this problem is to unroll the RNN only over a limited number of time steps during training. This is called **truncated backpropogation through time**... for example, in the time series prediction problem, you would simply reduce *n_steps* during training."

"A second problem faced by long-running RNNs is the fact that the memory of the first inputs gradually fades away... To solve this problem, various types of cells with long-term memory have been introduced."

### LSTM Cell

LSTM stands for *long short-term memory*. The LSTM cell was first proposed in 1997, and "if you consider the LSTM cell as a black box, it can be used very much like a basic cell, except it will perform much better; training will converge faster and it will detect long-term dependencies in the data."

![LSTM Cell](./lstm.jpg)

"The key idea is that the network can learn what to store in the long-term state, what to throw away, and what to read from it. As the long-term state **c**(t-1) traverses the network from left to right, you can see that it first goes through a *forget gate*, dropping some memories, and then it adds some new memories via the addition operation (which adds the memories that were selected by an *input gate*). The result **c**(t) is sent straight out, without any further transformation. So at each time step, some memories are dropped and some memories are added. Moreover, after the addition operation, the long-term state is copied and passed through the tanh function, and then the result is filtered by the *output gate*. This produces the short-ter, state **h**(t) (which is equal to the cell's output for this timestep **y**(t))."

But as you can see, there are 4 fully connected layers: **f**, **g**, **i**, and **o**. Let's go over what each of these layers is for.

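- **g** is the main layer. It takes in the inputs **x**(t) and the previous short-term state **h**(t-1) and passes them through a fully connected layer. If this were an ordinary basic cell, this would be the whole operation, and its output would go straight out as **h**(t). In an LSTM cell, its output **g**(t) is instead partially stored in the long-term state.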

- **f** is the *forget gate*. It takes **x**(t) and **h**(t-1) and applies a logistic function, so it outputs a vector of values between 0 and 1. This vector is multiplied elementwise by the long-term state **c**(t-1): values near 0 erase a memory, and values near 1 keep it.

- **i** is the *input gate*. It takes the same inputs, applies a logistic function, and then multiplies the result elementwise by **g**(t), thus determining which parts of **g**(t) will be added to the long-term state.

- **o** is the *output gate*. It determines which parts of the long-term state (after it has passed through tanh) are output at this time step, both as **h**(t) and **y**(t).

"In short, an LSTM cell can learn to recognize an important input (that's the role of the input gate), store it in the long-term state, learn to preserve it for as long as it is needed (that's the role of the forget gate), and learn to extract it whenever it is needed. This explains why they have been amazingly successful at capturing long-term patterns in time series, long texts, audio recordings, and more."