Recurrent Neural Networks are extremely useful in natural language processing, time series analysis and even autonomous driving systems.

An RNN looks very similar to a feedforward neural network (FNN), but unlike an FNN it also has connections pointing backward.

The simplest RNN is composed of a single neuron that receives the input at time step t, $\bf{x}_{(t)}$, as well as its own output from the previous time step, $y_{(t-1)}$.

In a layer of recurrent neurons, at each time step t every neuron receives both the input vector $\bf{x}_{(t)}$ and the output vector from the previous time step, $\bf{y}_{(t-1)}$.

Each neuron has two sets of weights: $\bf{w}_x$ for the inputs $\bf{x}_{(t)}$ and $\bf{w}_y$ for the previous outputs $\bf{y}_{(t - 1)}$. For the whole layer, these weight vectors can be placed in two weight matrices, $\bf{W}_x$ and $\bf{W}_y$. The output vector for the entire layer can be computed as $\newline$

$\Large \bf{y}_{(t)} = \phi\left(\bf{W}_{x}^T\bf{x}_{(t)} + \bf{W}_{y}^T\bf{y}_{(t - 1)} + \bf{b} \right)$

This is the output for a single instance, where $\phi$ is the activation function (for example tanh). For a mini-batch we have

$\Large \bf{Y}_{(t)} = \phi\left(\bf{X}_{(t)}\bf{W}_{x} + \bf{Y}_{(t - 1)}\bf{W}_{y} + \bf{b} \right)$ $\newline$
                     $ \, \, \, \, \, \, \, \, \, \, \, \, \, \, \, \,\Large = \phi\left(\bigl[\bf{X}_{(t)} \bf{Y}_{(t - 1)}\bigr]\bf{W} + \bf{b} \right)$
                     
Where $\Large \bf{W} = \begin{bmatrix}
W_x \\
W_y
\end{bmatrix}$

$\newline$


1. $\bf{Y}_{(t)}$ is an $m \times n_{neurons}$ matrix containing the layer's outputs at time step t for each instance in the mini-batch (m is the number of instances in the mini-batch and $n_{neurons}$ is the number of neurons).

2. $\bf{X}_{(t)}$ is an $m \times n_{inputs}$ matrix containing the inputs for all instances ($n_{inputs}$ is the number of input features).

3. $\bf{W}_x$ is an $n_{inputs} \times n_{neurons}$ matrix containing the connection weights for the inputs of the current time step.

4. $\bf{W}_y$ is an $n_{neurons} \times n_{neurons}$ matrix containing the connection weights for the outputs of the previous time step.

5. $\bf{b}$ is a vector of size $n_{neurons}$ containing each neuron's bias term.

6. The weight matrices $\bf{W}_x$ and $\bf{W}_y$ are often concatenated vertically into a single weight matrix $\bf{W}$ of shape $(n_{inputs} + n_{neurons}) \times n_{neurons}$.

7. The notation $[\bf{X}_{(t)} \; \bf{Y}_{(t-1)}]$ represents the horizontal concatenation of the matrices $\bf{X}_{(t)}$ and $\bf{Y}_{(t-1)}$.
 


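Note that $Y_{(t)}$ is a function of $X_{(t)}$ and $Y_{(t-1)}$, which is itself a function of $X_{(t-1)}$ and $Y_{(t-2)}$, which is a function of $X_{(t-2)}$ and $Y_{(t-3)}$, and so on. This makes $Y_{(t)}$ a function of all the inputs since the first time step. At the first time step there are no previous outputs, so they are typically assumed to be all zeros.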
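The two forms of the mini-batch equation can be checked numerically. The following is a minimal NumPy sketch, not part of the original notebook: the sizes, the random values, and the choice of tanh for $\phi$ are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)

m, n_inputs, n_neurons = 4, 3, 5                # hypothetical sizes

X_t = rng.normal(size=(m, n_inputs))            # inputs at time step t
Y_prev = rng.normal(size=(m, n_neurons))        # stand-in for the outputs at t - 1
W_x = rng.normal(size=(n_inputs, n_neurons))
W_y = rng.normal(size=(n_neurons, n_neurons))
b = np.zeros(n_neurons)

# Form 1: two separate matrix products.
Y_t = np.tanh(X_t @ W_x + Y_prev @ W_y + b)

# Form 2: a single product with the concatenated matrices.
W = np.vstack([W_x, W_y])                       # (n_inputs + n_neurons, n_neurons)
XY = np.hstack([X_t, Y_prev])                   # (m, n_inputs + n_neurons)
Y_t_concat = np.tanh(XY @ W + b)

print(Y_t.shape)
print(np.allclose(Y_t, Y_t_concat))
```

Both forms give the same $m \times n_{neurons}$ output; the concatenated version simply replaces two matrix products with one.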

# Memory Cells

We could say that a recurrent neuron has a form of memory, since its output at time step t is a function of all the inputs from previous time steps. A single recurrent neuron, or a layer of recurrent neurons, is an example of a basic memory cell, capable of learning only short patterns (typically about 10 steps long, although this varies with the task).

In general, a cell's state at time step t, $h_{(t)}$, is a function of some inputs at that time step and its state at the previous time step: $h_{(t)} = f(h_{(t-1)}, x_{(t)})$. Its output at time step t, $y_{(t)}$, is also a function of the previous state and the current inputs. In the basic cells discussed so far the output is simply equal to the state, but in more complex cells this is not always the case.

# Input and Output Sequences

An RNN can take a sequence of inputs and produce a sequence of outputs. For example, it can take stock prices over the last N days and output the prices shifted by one day into the future, that is, from N-1 days ago to tomorrow. This type of network is called a sequence-to-sequence network and is useful for predicting time series data.

We can also feed the network a sequence of inputs and ignore all outputs except for the last one. For example, we could feed the network a sequence of words corresponding to a movie review, and the network could output a sentiment score. This type of network is called a sequence-to-vector network.

We also have vector-to-sequence networks. Here we feed the network the same input vector over and over again at each time step and let it output a sequence. For example, the input could be an image (or the output of a CNN), and the output could be a caption for that image.

Another possibility is to use a sequence-to-vector network, called an encoder, followed by a vector-to-sequence network, called a decoder. For example, we can feed the encoder a sentence in one language, which it converts to a single vector representation, and the decoder then decodes this vector into a sentence in another language.
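The difference between sequence-to-sequence and sequence-to-vector behaviour comes down to which outputs of the unrolled network are kept. Here is a rough NumPy sketch, assuming a basic tanh recurrent layer; the helper `run_rnn` and its `return_sequences` flag are hypothetical names, chosen to mirror the Keras argument used later in this notebook.

```python
import numpy as np

def run_rnn(X, W_x, W_y, b, return_sequences=False):
    """Unroll a basic recurrent layer over X of shape (m, n_steps, n_inputs)."""
    m, n_steps, _ = X.shape
    Y = np.zeros((m, W_y.shape[0]))             # previous outputs start at zero
    outputs = []
    for t in range(n_steps):
        Y = np.tanh(X[:, t] @ W_x + Y @ W_y + b)
        outputs.append(Y)
    if return_sequences:
        return np.stack(outputs, axis=1)        # (m, n_steps, n_neurons)
    return Y                                    # (m, n_neurons): last step only

rng = np.random.default_rng(0)
m, n_steps, n_inputs, n_neurons = 2, 6, 1, 3    # hypothetical sizes
X = rng.normal(size=(m, n_steps, n_inputs))
W_x = rng.normal(size=(n_inputs, n_neurons))
W_y = rng.normal(size=(n_neurons, n_neurons))
b = np.zeros(n_neurons)

seq_to_seq = run_rnn(X, W_x, W_y, b, return_sequences=True)
seq_to_vec = run_rnn(X, W_x, W_y, b)
print(seq_to_seq.shape, seq_to_vec.shape)
```

The sequence-to-vector result is just the last slice of the sequence-to-sequence result; an encoder would pass that final vector on to a decoder.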

# Training RNNs

To train an RNN we unroll it through time and then apply regular backpropagation, a strategy known as backpropagation through time (BPTT). As in regular backpropagation, there is first a forward pass through the unrolled network: at each time step the current inputs, as well as the outputs from the previous time step, are fed into the network. The output sequence is then evaluated using a cost function, and the gradients of this cost function are propagated backward through the unrolled network. Because the same parameters are used at every time step, their gradients are summed over all time steps before the model parameters are updated.
In certain cases the cost function may ignore some outputs; the gradients then flow backward only through the outputs that were used to compute the cost.
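To make the backward pass concrete, here is a small pure-Python sketch of backpropagation through time for a one-neuron tanh RNN whose cost uses only the last output. The inputs, target, and parameter values are made up for illustration, and the analytic gradients are compared against finite differences as a sanity check.

```python
import math

def forward(xs, w_x, w_y, b):
    """Return [y_0, y_1, ..., y_T] for a one-neuron RNN, with y_0 = 0."""
    ys = [0.0]
    for x in xs:
        ys.append(math.tanh(w_x * x + w_y * ys[-1] + b))
    return ys

def loss_and_grads(xs, target, w_x, w_y, b):
    """Squared error on the last output only, with gradients via BPTT."""
    ys = forward(xs, w_x, w_y, b)
    loss = (ys[-1] - target) ** 2
    d_wx = d_wy = d_b = 0.0
    d_y = 2.0 * (ys[-1] - target)               # gradient w.r.t. the last output
    for t in range(len(xs), 0, -1):             # walk the unrolled network backward
        d_z = d_y * (1.0 - ys[t] ** 2)          # back through tanh
        d_wx += d_z * xs[t - 1]                 # shared weights: sum over time steps
        d_wy += d_z * ys[t - 1]
        d_b += d_z
        d_y = d_z * w_y                         # hand the gradient to step t - 1
    return loss, (d_wx, d_wy, d_b)

xs, target = [0.5, -0.1, 0.3, 0.8], 0.2         # illustrative data
params = (0.7, -0.4, 0.1)                       # illustrative w_x, w_y, b
loss, grads = loss_and_grads(xs, target, *params)

# Finite-difference check of the analytic gradients.
eps = 1e-6
numeric = []
for i in range(3):
    plus, minus = list(params), list(params)
    plus[i] += eps
    minus[i] -= eps
    loss_plus = loss_and_grads(xs, target, *plus)[0]
    loss_minus = loss_and_grads(xs, target, *minus)[0]
    numeric.append((loss_plus - loss_minus) / (2 * eps))

max_diff = max(abs(a - n) for a, n in zip(grads, numeric))
print(max_diff < 1e-6)
```

Since only the last output enters the cost, the gradient enters the unrolled network at the final time step and reaches earlier steps solely through the recurrent connection.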

# Forecasting a Time Series

In [1]:
import numpy as np
import tensorflow as tf
from tensorflow import keras



We will create a function that produces as many series as requested.

In [3]:
def generate_time_series(batch_size, n_steps):
    """Return batch_size univariate series of n_steps values: two sine waves plus noise."""
    freq1, freq2, offsets1, offsets2 = np.random.rand(4, batch_size, 1)
    time = np.linspace(0, 1, n_steps)
    series = 0.5 * np.sin((time - offsets1) * (freq1 * 10 + 10))   # wave 1
    series += 0.2 * np.sin((time - offsets2) * (freq2 * 20 + 20))  # + wave 2
    series += 0.1 * (np.random.rand(batch_size, n_steps) - 0.5)    # + noise
    return series[..., np.newaxis].astype(np.float32)  # shape (batch_size, n_steps, 1)

In [4]:
n_steps = 50
series = generate_time_series(10000, n_steps + 1) # 10,000 series, each with n_steps + 1 = 51 time steps (50 inputs + 1 target)

## Creating Training, Test and Validation sets

In [5]:
X_train, y_train = series[:7000, :n_steps], series[:7000, -1]
X_val, y_val = series[7000:9000, :n_steps], series[7000:9000, -1]
X_test, y_test = series[9000:, :n_steps], series[9000:, -1]

In [6]:
y_train.shape

(7000, 1)

In X_train we have 7,000 series, each with 50 time steps, and y_train is the target vector containing the last (51st) time step of each original series. So essentially we will try to predict this value from the previous 50 time steps.

# Baseline Metrics

## Naive Forecasting

This basic model simply predicts that the next value will equal the last observed one: given the 50 input time steps, it returns the value at the 50th step as its forecast for the 51st. It gives us a baseline for judging whether our RNN models do better than a trivial approach.

In [7]:
y_pred = X_val[:, -1]
np.mean(keras.losses.mse(y_val, y_pred))

0.019714875

## Simple Linear Regression

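In this case we fit a simple linear regression model to the time series and see how its performance compares to that of an RNN. The Flatten layer turns each 50-step series into a 50-dimensional vector, so each prediction is a linear combination of the 50 input values.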

In [8]:
model = keras.models.Sequential([
 keras.layers.Flatten(input_shape = [50, 1]),
 keras.layers.Dense(1)
])

In [9]:
model.compile(loss='mse', optimizer='Adam')

In [10]:
model.fit(X_train, y_train, epochs=20)



<keras.callbacks.History at 0x1ec0e3d9190>

In [11]:
model.evaluate(X_val, y_val)



0.003515490097925067

In [153]:
model.get_weights() # 50 weights (one per time step) plus 1 bias = 51 params

[array([[-0.0335925 ],
        [ 0.23121265],
        [-0.12124686],
        [-0.18946454],
        [ 0.3157627 ],
        [-0.09711541],
        [-0.09314709],
        [ 0.10129108],
        [ 0.15074727],
        [-0.11804308],
        [ 0.11568407],
        [-0.13561237],
        [ 0.04642106],
        [-0.04041718],
        [ 0.21244538],
        [-0.04088664],
        [ 0.07193008],
        [-0.07880671],
        [-0.10231889],
        [-0.02343206],
        [ 0.19717428],
        [-0.12191384],
        [ 0.07610077],
        [-0.02141514],
        [-0.00764074],
        [-0.08880735],
        [-0.20378035],
        [-0.09500337],
        [ 0.12794864],
        [ 0.16298828],
        [-0.10310844],
        [-0.23570614],
        [-0.18816598],
        [-0.16300747],
        [ 0.12291079],
        [ 0.06325466],
        [-0.15044206],
        [ 0.02334505],
        [-0.14727363],
        [-0.16614589],
        [-0.28066596],
        [ 0.02584621],
        [-0.02880346],
        [ 0

# Implementing a Simple RNN

In [154]:
model = keras.models.Sequential([
    keras.layers.SimpleRNN(1, input_shape=[None, 1])
])

Here we have a simple RNN with just one neuron. We do not need to specify the length of the input sequences (hence the None in input_shape), since an RNN can process any number of time steps. The SimpleRNN layer uses the hyperbolic tangent activation function by default. The initial state $h_{(init)}$ is set to 0, and it is passed to the single recurrent neuron along with the value of the first time step, $x_{(0)}$. The neuron computes a weighted sum of these values and applies the activation function to the result, giving the first output $y_{(0)}$. For a SimpleRNN this output is also the new state $h_{(0)}$. This new state is passed to the same recurrent neuron along with the next input value $x_{(1)}$, and the process is repeated until the last time step. Because return_sequences is False by default, the layer outputs only the last value, $y_{(49)}$.

In [155]:
model.compile(loss='mse', optimizer='adam')

In [156]:
model.fit(X_train, y_train, epochs=20)

Epoch 1/20
[1m219/219[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 3ms/step - loss: 0.0632
Epoch 2/20
[1m219/219[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 3ms/step - loss: 0.0409
Epoch 3/20
[1m219/219[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 3ms/step - loss: 0.0282
Epoch 4/20
[1m219/219[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 3ms/step - loss: 0.0186
Epoch 5/20
[1m219/219[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 3ms/step - loss: 0.0141
Epoch 6/20
[1m219/219[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 3ms/step - loss: 0.0122
Epoch 7/20
[1m219/219[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 3ms/step - loss: 0.0114
Epoch 8/20
[1m219/219[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 3ms/step - loss: 0.0117
Epoch 9/20
[1m219/219[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 3ms/step - loss: 0.0112
Epoch 10/20
[1m219/219[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 3ms/step - lo

<keras.src.callbacks.history.History at 0x1e243e7cc10>

In [157]:
model.evaluate(X_val, y_val)

63/63 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step - loss: 0.0103


0.010736643336713314

We can see that this simple RNN outperforms the naive approach (validation MSE of about 0.011 versus 0.020) but does not beat the linear regression model (about 0.0035). The linear regression model has one parameter for each of the 50 time steps plus a bias term, giving it 51 parameters in total.
The SimpleRNN layer has, for each neuron, one weight per input dimension and one per hidden state dimension, plus a bias term. With a single neuron and a univariate series, that is just 3 parameters in total.

In [158]:
model.get_weights() # 3 params

[array([[1.6609677]], dtype=float32),
 array([[-0.66444606]], dtype=float32),
 array([0.01183371], dtype=float32)]
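The parameter counts quoted above follow from simple formulas. The helper names below are hypothetical and only illustrate the arithmetic; they are not Keras functions.

```python
def simple_rnn_params(n_inputs, n_units):
    """Input weights + recurrent weights + biases of one SimpleRNN layer."""
    return n_units * (n_inputs + n_units + 1)

def dense_params(n_inputs, n_units):
    """Weights + biases of one Dense layer."""
    return n_units * (n_inputs + 1)

linear_model = dense_params(50, 1)       # Flatten over 50 time steps, then Dense(1)
single_rnn = simple_rnn_params(1, 1)     # SimpleRNN(1) on a univariate series
print(linear_model, single_rnn)          # 51 3
```

The three arrays returned by `get_weights()` above match this breakdown: one input weight, one recurrent weight, and one bias.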

# Deep RNNs

In [159]:
model = keras.models.Sequential([
    keras.layers.SimpleRNN(20, return_sequences=True, input_shape=[None, 1]),
    keras.layers.SimpleRNN(20, return_sequences=True),
    keras.layers.SimpleRNN(1)
])

In [160]:
model.summary()

In [161]:
model.compile(loss='mse', optimizer='adam')

In [162]:
model.fit(X_train, y_train, epochs=20)

Epoch 1/20
[1m219/219[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m4s[0m 9ms/step - loss: 0.0441
Epoch 2/20
[1m219/219[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m2s[0m 9ms/step - loss: 0.0048
Epoch 3/20
[1m219/219[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m2s[0m 9ms/step - loss: 0.0036
Epoch 4/20
[1m219/219[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m2s[0m 9ms/step - loss: 0.0033
Epoch 5/20
[1m219/219[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m2s[0m 9ms/step - loss: 0.0032
Epoch 6/20
[1m219/219[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m2s[0m 9ms/step - loss: 0.0030
Epoch 7/20
[1m219/219[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m2s[0m 9ms/step - loss: 0.0032
Epoch 8/20
[1m219/219[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m2s[0m 9ms/step - loss: 0.0030
Epoch 9/20
[1m219/219[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m2s[0m 9ms/step - loss: 0.0030
Epoch 10/20
[1m219/219[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m2s[0m 9ms/step - lo

<keras.src.callbacks.history.History at 0x1e253ef8ac0>

In [163]:
model.evaluate(X_val, y_val)

63/63 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step - loss: 0.0028


0.0027923425659537315

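With a validation MSE of about 0.0028, this deep RNN outperforms all the models we have considered so far, including the linear regression model (about 0.0035).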

# Forecasting Several Time Steps Ahead

In [164]:
series = generate_time_series(10000, n_steps + 10)

In [165]:
X_train, y_train = series[:7000, :n_steps], series[:7000, -10:, 0]
X_val, y_val = series[7000:9000, :n_steps], series[7000:9000, -10:, 0]
X_test, y_test = series[9000:, :n_steps], series[9000:, -10:, 0]

In [166]:
model = keras.models.Sequential([
    keras.layers.SimpleRNN(20, return_sequences=True, input_shape=[None, 1]),
    keras.layers.SimpleRNN(20),
    keras.layers.Dense(10)
])

In [167]:
model.compile(loss='mse', optimizer='adam')

In [168]:
model.fit(X_train, y_train, epochs=20)

Epoch 1/20
[1m219/219[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m3s[0m 6ms/step - loss: 0.1288
Epoch 2/20
[1m219/219[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 6ms/step - loss: 0.0310
Epoch 3/20
[1m219/219[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 6ms/step - loss: 0.0199
Epoch 4/20
[1m219/219[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 6ms/step - loss: 0.0156
Epoch 5/20
[1m219/219[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 6ms/step - loss: 0.0132
Epoch 6/20
[1m219/219[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 6ms/step - loss: 0.0127
Epoch 7/20
[1m219/219[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 6ms/step - loss: 0.0121
Epoch 8/20
[1m219/219[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 6ms/step - loss: 0.0108
Epoch 9/20
[1m219/219[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 6ms/step - loss: 0.0116
Epoch 10/20
[1m219/219[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 6ms/step - lo

<keras.src.callbacks.history.History at 0x1e25513bf10>

In [169]:
series2 = generate_time_series(1, n_steps + 10)

In [170]:
X_new, Y_new = series2[:, :n_steps], series2[:, n_steps:, 0]  # drop the feature axis so Y_new has shape (1, 10), like Y_pred

In [171]:
Y_pred = model.predict(X_new)

1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 158ms/step


In [172]:
model.evaluate(X_val, y_val)

63/63 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - loss: 0.0086


0.008478774689137936

In [173]:
np.mean(keras.losses.mse(Y_new, Y_pred))

0.0028288208

In [174]:
Y = np.empty((10000, n_steps, 10)) # each target is a sequence of 10D vectors
for step_ahead in range(1, 10 + 1):
    Y[:, :, step_ahead - 1] = series[:, step_ahead:step_ahead + n_steps, 0]
Y_train = Y[:7000]
Y_valid = Y[7000:9000]
Y_test = Y[9000:]
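The loop above builds targets so that, at every time step, the model is asked to forecast the next 10 values rather than only forecasting once at the end. The same indexing can be inspected on a toy series whose values are simply their own positions; the sizes here are assumptions chosen to keep the output readable.

```python
import numpy as np

n_steps, horizon = 5, 3                  # toy sizes, not the notebook's 50 and 10
series = np.arange(n_steps + horizon, dtype=np.float32).reshape(1, -1, 1)

Y = np.empty((1, n_steps, horizon), dtype=np.float32)
for step_ahead in range(1, horizon + 1):
    Y[:, :, step_ahead - 1] = series[:, step_ahead:step_ahead + n_steps, 0]

print(Y[0, 0])     # targets at time step 0: the values at positions 1, 2, 3
print(Y[0, -1])    # targets at the last time step: positions 5, 6, 7
```

Entry `Y[i, t, k]` holds the value of series `i` at time `t + k + 1`, so the target at each step is the window of values that immediately follows it.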

In [175]:
model = keras.models.Sequential([
    keras.layers.SimpleRNN(20, return_sequences=True, input_shape=[None, 1]),
    keras.layers.SimpleRNN(20, return_sequences=True),
    keras.layers.TimeDistributed(keras.layers.Dense(10))
])

In [179]:
model.compile(loss='mse', optimizer=keras.optimizers.Adam(learning_rate=0.01))

In [180]:
model.fit(X_train, Y_train, epochs=20)

Epoch 1/20
[1m219/219[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m4s[0m 7ms/step - loss: 0.0473
Epoch 2/20
[1m219/219[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m2s[0m 7ms/step - loss: 0.0284
Epoch 3/20
[1m219/219[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m2s[0m 7ms/step - loss: 0.0265
Epoch 4/20
[1m219/219[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m2s[0m 7ms/step - loss: 0.0260
Epoch 5/20
[1m219/219[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m2s[0m 7ms/step - loss: 0.0241
Epoch 6/20
[1m219/219[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m2s[0m 7ms/step - loss: 0.0227
Epoch 7/20
[1m219/219[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m2s[0m 8ms/step - loss: 0.0213
Epoch 8/20
[1m219/219[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m2s[0m 7ms/step - loss: 0.0208
Epoch 9/20
[1m219/219[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m2s[0m 8ms/step - loss: 0.0203
Epoch 10/20
[1m219/219[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m2s[0m 7ms/step - lo

<keras.src.callbacks.history.History at 0x1e255126df0>

In [181]:
model.evaluate(X_val, Y_valid)

63/63 ━━━━━━━━━━━━━━━━━━━━ 1s 3ms/step - loss: 0.0181


0.018122989684343338
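The validation loss above (about 0.018) is averaged over all 50 time steps and all 10 forecast horizons. It is therefore not directly comparable to the 0.0085 obtained by the previous model, which was scored only on the forecast made after the last time step. One way to get a comparable figure is to compute the MSE on the last time step alone. The function name `last_time_step_mse` and the toy arrays below are assumptions for illustration; in the notebook, `Y_pred` would come from `model.predict(X_val)`.

```python
import numpy as np

def last_time_step_mse(Y_true, Y_pred):
    """MSE restricted to the final time step of each sequence."""
    return float(np.mean((Y_true[:, -1] - Y_pred[:, -1]) ** 2))

# Toy arrays of shape (batch, n_steps, horizon); the values are illustrative only.
Y_true = np.zeros((2, 4, 3))
Y_pred = np.zeros((2, 4, 3))
Y_pred[:, :-1] = 1.0                     # large errors at every step but the last
Y_pred[:, -1, 0] = 0.3                   # one small error at the last step

all_steps_mse = float(np.mean((Y_true - Y_pred) ** 2))
final_step_mse = last_time_step_mse(Y_true, Y_pred)
print(all_steps_mse, final_step_mse)
```

In this toy case the all-steps average is dominated by the early time steps, while the last-step metric reflects only the forecast that would actually be used.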

# The Unstable Gradients Problem

We can use many of the same tricks we used in deep neural networks to alleviate the vanishing/exploding gradients problem in RNNs as well: good parameter initialization, faster optimizers, dropout, and so on. However, non-saturating activation functions like ReLU may not help as much here. Because the same weights are applied at every time step, an update that slightly increases the outputs at one step can compound across steps until the outputs explode, and a non-saturating activation does nothing to prevent this. This is why saturating activation functions like tanh are commonly used instead. The risk of exploding gradients can also be reduced by using a smaller learning rate or gradient clipping.
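As an illustration of gradient clipping, the sketch below rescales a list of gradient arrays so that their joint L2 norm does not exceed a threshold. The function name and the example values are assumptions; Keras optimizers expose comparable behaviour through arguments such as `clipnorm` and `clipvalue`.

```python
import numpy as np

def clip_by_global_norm(grads, max_norm):
    """Rescale gradient arrays so that their joint L2 norm is at most max_norm."""
    global_norm = float(np.sqrt(sum(np.sum(g ** 2) for g in grads)))
    if global_norm <= max_norm:
        return grads, global_norm
    scale = max_norm / global_norm
    return [g * scale for g in grads], global_norm

grads = [np.array([3.0, 4.0]), np.array([12.0])]    # joint norm is 13
clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)
norm_after = float(np.sqrt(sum(np.sum(g ** 2) for g in clipped)))
print(norm_before, norm_after)
```

Scaling all gradients by the same factor shrinks the step size but preserves the direction of the update.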

One form of normalization that works well with RNNs is layer normalization. It is very similar to Batch Normalization, but instead of normalizing across the batch dimension, it normalizes across the features dimension. One advantage is that it can compute the required statistics on the fly, at each time step, independently for each instance. 
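The difference between the two normalization axes can be shown in a few lines of NumPy. This is a simplified sketch that leaves out the learnable scale and offset parameters that real layers include, and the function names are hypothetical.

```python
import numpy as np

def layer_norm(X, eps=1e-5):
    """Normalize each instance across its features (last axis)."""
    mean = X.mean(axis=-1, keepdims=True)
    var = X.var(axis=-1, keepdims=True)
    return (X - mean) / np.sqrt(var + eps)

def batch_norm(X, eps=1e-5):
    """Normalize each feature across the batch (first axis), using batch statistics."""
    mean = X.mean(axis=0, keepdims=True)
    var = X.var(axis=0, keepdims=True)
    return (X - mean) / np.sqrt(var + eps)

rng = np.random.default_rng(1)
X = rng.normal(loc=3.0, scale=2.0, size=(4, 8))     # (batch, features)

ln = layer_norm(X)
bn = batch_norm(X)
print(np.allclose(ln.mean(axis=1), 0.0))    # every instance has zero mean
print(np.allclose(bn.mean(axis=0), 0.0))    # every feature has zero mean
# A single instance can be normalized on its own, without batch statistics.
print(np.allclose(layer_norm(X[:1]), ln[:1]))
```

Because layer normalization needs no statistics from other instances, it behaves the same way at every time step and for any batch size, which is what makes it convenient inside a recurrent cell.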

## Implementing Layer Normalization

In [16]:
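class LNSimpleRNNCell(keras.layers.Layer):
    """SimpleRNN cell that applies layer normalization before the activation."""

    def __init__(self, units, activation='tanh', **kwargs):
        super().__init__(**kwargs)
        self.state_size = units
        self.output_size = units
        # No activation here: normalization must come after the linear step.
        self.simple_rnn_cell = keras.layers.SimpleRNNCell(units, activation=None)
        self.layer_norm = keras.layers.LayerNormalization()
        self.activation = keras.activations.get(activation)

    def call(self, inputs, states):
        # For a SimpleRNNCell, outputs and new_states[0] are the same tensor.
        outputs, new_states = self.simple_rnn_cell(inputs, states)
        norm_outputs = self.activation(self.layer_norm(outputs))
        # Return the result twice: once as the output, once as the new state.
        return norm_outputs, [norm_outputs]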

The LNSimpleRNNCell inherits from the keras.layers.Layer class, just like any custom layer. The constructor takes the number of units and the desired activation function, and sets the state_size and output_size attributes; for a SimpleRNN cell both are equal to the number of units. We then create a SimpleRNNCell with no activation function, because we want to perform layer normalization after the linear operation but before the activation function. Finally, we create the LayerNormalization layer and fetch the activation function.

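The call() method starts by applying the SimpleRNNCell, which computes a linear combination of the current inputs and the previous hidden states and returns the result twice: in a SimpleRNNCell the outputs are equal to the hidden states, so outputs = new_states[0] and new_states can safely be ignored. Next, call() applies layer normalization followed by the activation function. Finally, it returns the normalized outputs twice, once as the outputs and once as the new hidden states.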

Similarly, you could create a custom cell to apply dropout between each time step. But there is a simpler way: all recurrent layers (except for keras.layers.RNN) and all cells provided by Keras have a dropout hyperparameter and a recurrent_dropout hyperparameter. The former defines the dropout rate to apply to the inputs (at each time step), and the latter defines the dropout rate for the hidden states (also at each time step).

# The Short-Term Memory Problem

Because of the transformations the data goes through when traversing an RNN, some information is lost at each time step. After a while, the RNN's state contains virtually no trace of the first inputs. To tackle this problem, various types of cells with long-term memory have been developed, and they have proved successful enough that basic cells such as SimpleRNN are now rarely used. The most popular of these is the Long Short-Term Memory (LSTM) cell.

## LSTM Cells

The LSTM cell can be used just like a basic cell, but it generally performs better: training tends to converge faster, and it can detect long-term dependencies in the data.

In [4]:
model = keras.models.Sequential([
    keras.layers.LSTM(20, return_sequences=True, input_shape=[None, 1]),
    keras.layers.LSTM(20, return_sequences=True),
    keras.layers.TimeDistributed(keras.layers.Dense(10))
])

The state of an LSTM cell is split into two vectors, $\bf{h}_{(t)}$ and $\bf{c}_{(t)}$, which are the short-term state and the long-term state respectively.

The long-term state $\bf{c}_{(t-1)}$ first goes through a forget gate, dropping some memories, and then goes through the addition operation, which adds some new memories (those selected by the input gate). The result, $\bf{c}_{(t)}$, is sent straight out without any further transformation. In addition, after the addition operation the long-term state is copied and passed through tanh, and the result is filtered by the output gate. This produces the short-term state $\bf{h}_{(t)}$.

The current input vector $\bf{x}_{(t)}$ and the previous short-term state $\bf{h}_{(t-1)}$ are fed to four different fully connected layers.

1. The main layer outputs $\bf{g}_{(t)}$ which has the usual role of analysing the current inputs and the previous short-term state. The most important parts of this output are stored in the long-term state and the rest is dropped.

2. The other three layers are gate controllers. They use the logistic activation function with output range from 0 to 1. The outputs are fed to element-wise multiplication operations. If they output 0s they close the gate and if they output 1s they open it.

- The forget gate controls which parts of the long-term state should be erased.
- The input gate controls which parts of $\bf{g}_{(t)}$ should be added to the long-term state.
- The output gate controls which parts of the long-term state should be read and output at this time step, both to $\bf{h}_{(t)}$ and to $\bf{y}_{(t)}$.

## LSTM computations

$ \Large i_{(t)} = \sigma\left(W_{xi}^{T} x_{(t)} + W_{hi}^{T}h_{(t-1)} + b_i\right)$ $\newline$
$ \Large f_{(t)} = \sigma\left(W_{xf}^{T} x_{(t)} + W_{hf}^{T}h_{(t-1)} + b_f\right)$ $\newline$
$ \Large o_{(t)} = \sigma\left(W_{xo}^{T} x_{(t)} + W_{ho}^{T}h_{(t-1)} + b_o\right)$ $\newline$
$ \Large g_{(t)} = \tanh\left(W_{xg}^{T} x_{(t)} + W_{hg}^{T}h_{(t-1)} + b_g\right)$ $\newline$
$ \Large c_{(t)} = f_{(t)} \otimes c_{(t-1)} + i_{(t)} \otimes g_{(t)}$ $\newline$
$ \Large y_{(t)} = h_{(t)} = o_{(t)} \otimes \tanh\left(c_{(t)}\right)$

In these equations:

- $W_{xi}$, $W_{xf}$, $W_{xo}$, and $W_{xg}$ are the weight matrices of each of the four layers for their connection to the input vector $x_{(t)}$.

- $W_{hi}$, $W_{hf}$, $W_{ho}$, and $W_{hg}$ are the weight matrices of each of the four layers for their connection to the previous short-term state $h_{(t-1)}$.

- $b_i$, $b_f$, $b_o$, and $b_g$ are the bias terms for each of the four layers. Note that TensorFlow initializes $b_f$ to a vector full of 1s instead of 0s. This prevents forgetting everything at the beginning of training.
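The LSTM equations translate almost line by line into NumPy. The following is a rough sketch of a single-instance LSTM step, not the Keras implementation: the dictionary layout of the weights, the sizes, and the random initialization are assumptions made for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W_x, W_h, b):
    """One LSTM time step for a single instance.

    W_x, W_h and b are dicts keyed by 'i', 'f', 'o', 'g' (one entry per layer).
    """
    i = sigmoid(W_x["i"].T @ x + W_h["i"].T @ h_prev + b["i"])   # input gate
    f = sigmoid(W_x["f"].T @ x + W_h["f"].T @ h_prev + b["f"])   # forget gate
    o = sigmoid(W_x["o"].T @ x + W_h["o"].T @ h_prev + b["o"])   # output gate
    g = np.tanh(W_x["g"].T @ x + W_h["g"].T @ h_prev + b["g"])   # main layer
    c = f * c_prev + i * g               # new long-term state
    h = o * np.tanh(c)                   # new short-term state (also the output)
    return h, c

rng = np.random.default_rng(7)
n_inputs, n_units = 3, 4                 # hypothetical sizes
W_x = {k: rng.normal(size=(n_inputs, n_units)) for k in "ifog"}
W_h = {k: rng.normal(size=(n_units, n_units)) for k in "ifog"}
b = {k: np.zeros(n_units) for k in "ifog"}
b["f"] = np.ones(n_units)                # forget-gate bias starts at 1

h, c = np.zeros(n_units), np.zeros(n_units)
for x in rng.normal(size=(5, n_inputs)): # five time steps of made-up inputs
    h, c = lstm_step(x, h, c, W_x, W_h, b)
print(h.shape, c.shape)
```

Since the output gate lies between 0 and 1 and tanh lies between -1 and 1, every component of the short-term state stays within that range, while the long-term state can grow beyond it.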



## Peephole Connections

In a regular LSTM, the gate controllers can only look at the current input $x_{(t)}$ and the previous short-term state $h_{(t-1)}$. With peephole connections we give them some extra context by also letting them look at the long-term state: the previous long-term state $c_{(t-1)}$ is added as an input to the controllers of the forget gate and the input gate, and the current long-term state $c_{(t)}$ is added as an input to the controller of the output gate. This often improves performance, but not always.

## GRU Cells

The GRU cell is a simplified version of the LSTM cell, and it performs similarly. It works in the following way:

- Both state vectors are merged into a single vector $h_{(t)}$.
- A single gate controller $z_{(t)}$ controls both the forget gate and the input gate. If the gate controller outputs a 1, the forget gate is open and the input gate is closed, and vice versa.
- There is no output gate; the full state vector is output at every time step. However, there is a new gate controller $r_{(t)}$ that controls which part of the previous state will be shown to the main layer $g_{(t)}$.
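
The description above corresponds to the following computations, given here in the commonly used GRU formulation. The weight and bias names ($W_{xz}$, $W_{hz}$, $b_z$, and so on) are notation assumed here by analogy with the LSTM equations.

$ \Large z_{(t)} = \sigma\left(W_{xz}^{T} x_{(t)} + W_{hz}^{T}h_{(t-1)} + b_z\right)$ $\newline$
$ \Large r_{(t)} = \sigma\left(W_{xr}^{T} x_{(t)} + W_{hr}^{T}h_{(t-1)} + b_r\right)$ $\newline$
$ \Large g_{(t)} = \tanh\left(W_{xg}^{T} x_{(t)} + W_{hg}^{T}\left(r_{(t)} \otimes h_{(t-1)}\right) + b_g\right)$ $\newline$
$ \Large h_{(t)} = z_{(t)} \otimes h_{(t-1)} + \left(1 - z_{(t)}\right) \otimes g_{(t)}$

When a component of $z_{(t)}$ equals 1, the corresponding component of the previous state is copied through unchanged (forget gate open, input gate closed). When it equals 0, that component is replaced entirely by the candidate value from $g_{(t)}$.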
