# Processing Sequences Using RNNs & CNNs

The batter hits the ball. The outfielder immediately starts running, anticipating the ball's trajectory. He tracks it, adapts his movements, & finally catches it (under a thunder of applause). Predicting the future is something you do all the time, whether you are finishing a friend's sentence or anticipating the smell of coffee at breakfast. In this lesson, we will discuss recurrent neural networks (RNNs), a class of nets that can predict the future (up to a point). They can analyse time series data such as stock prices, & tell you when to buy or sell. In autonomous driving systems, they can anticipate car trajectories & help avoid accidents. More generally, they can work on sequences of arbitrary lengths, rather than on fixed-size inputs like all the nets we have considered so far. For example, they can take sentences, documents, or audio samples as input, making them extremely useful for natural language processing applications such as automatic translation or speech-to-text.

In this lesson, we will first look at the fundamental concepts underlying RNNs & how to train them using backpropagation through time, then we will use them to forecast a time series. After that we'll explore the two main difficulties that RNNs face:

* Unstable gradients, which can be alleviated using various techniques, including recurrent dropout & recurrent layer normalisation
* A limited short-term memory, which can be extended using LSTM & GRU cells

RNNs are not the only types of neural networks capable of handling sequential data: for small sequences, a regular dense network can do the trick; & for very long sequences, such as audio samples or text, convolutional neural networks can actually work quite well too. We will discuss both of these possibilities & finish the lesson by implementing a *WaveNet*: this is a CNN architecture capable of handling sequences of tens of thousands of time steps.

---

# Recurrent Neurons & Layers

Up to now, we have been focused on feedforward neural networks, where the activations flow only in one direction, from the input layer to the output layer. A recurrent neural network looks very much like a feedforward neural network, except it also has connections pointing backward. Let's look at the simplest possible RNN, composed of one neuron receiving inputs, producing an output, & sending that output back to itself, as shown in the below figure (left).

<img src = "Images/Recurrent Neuron.png" width = "600" style = "margin:auto"/>

At each *time step t* (also called a *frame*), this *recurrent neuron* receives the inputs $x_{(t)}$ as well as its own output from the previous time step, $y_{(t - 1)}$. Since there is no previous output at the first time step, it is generally set to 0. We can represent this tiny network against the time axis, as shown in the figure (right). This is called *unrolling the network through time* (it's the same recurrent neuron represented once per time step).

You can easily create a layer of recurrent neurons. At each time step *t*, every neuron receives both the input vector $x_{(t)}$ & the output vector from the previous time step $y_{(t - 1)}$, as shown in the figure below.

<img src = "Images/Layer of Recurrent Neurons.png" width = "600" style = "margin:auto"/>

Note that both the inputs & outputs are vectors now (when there was just a single neuron, the output was a scalar).

Each recurrent neuron has two sets of weights: one for the inputs $x_{(t)}$ & the other for the outputs of the previous time step, $y_{(t - 1)}$. Let's call these weight vectors $w_x$ & $w_y$. If we consider the whole recurrent layer instead of just one recurrent neuron, we place all the weight vectors in two weight matrices, $W_x$ & $W_y$. The output vector of the whole recurrent layer can then be computed pretty much as you might expect, as shown in the below equation ($b$ is the bias vector & $\phi(.)$ is the activation function (e.g., ReLU)):

$$y_{(t)} = \phi(W_x^{\intercal} x_{(t)} + W_y^{\intercal}y_{(t - 1)} + b)$$

Just as with feedforward neural networks, we can compute a recurrent layer's output in one shot for a whole mini-batch by placing all the inputs at time step *t* in an input matrix $X_{(t)}$.

$$\begin{aligned}
Y_{(t)} &= \phi(X_{(t)}W_x + Y_{(t - 1)}W_y + b) \\
&= \phi([X_{(t)} \; Y_{(t - 1)}]W + b) \text{ with } W = \begin{bmatrix}
W_x \\
W_y
\end{bmatrix}
\end{aligned}$$

In this equation:

* $Y_{(t)}$ is an $m \times n_{neurons}$ matrix containing the layer's outputs at time step *t* for each instance in the mini-batch ($m$ is the number of instances in the mini-batch & $n_{neurons}$ is the number of neurons).
* $X_{(t)}$ is an $m \times n_{inputs}$ matrix containing the inputs for all instances ($n_{inputs}$ is the number of input features).
* $W_x$ is an $n_{inputs} \times n_{neurons}$ matrix containing the connection weights for the inputs of the current time step.
* $W_y$ is an $n_{neurons} \times n_{neurons}$ matrix containing the connection weights for the outputs of the previous time step.
* $b$ is a vector of size $n_{neurons}$ containing each neuron's bias term.
* The weight matrices $W_x$ & $W_y$ are often concatenated vertically into a single weight matrix $W$ of shape $(n_{inputs} + n_{neurons}) \times n_{neurons}$.
* The notation $[X_{(t)} \; Y_{(t - 1)}]$ represents the horizontal concatenation of the matrices $X_{(t)}$ & $Y_{(t - 1)}$.

Notice that $Y_{(t)}$ is a function of $X_{(t)}$ & $Y_{(t - 1)}$, which is a function of $X_{(t - 1)}$ & $Y_{(t - 2)}$, which is a function of $X_{(t - 2)}$ & $Y_{(t - 3)}$, & so on. This makes $Y_{(t)}$ a function of all the inputs since time *t = 0* (that is, $X_{(0)}, X_{(1)}, ..., X_{(t)}$). At the first time step, *t = 0*, there are no previous outputs, so they are typically assumed to be all zeros.
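
To make the mini-batch equation concrete, here is a minimal NumPy sketch of a recurrent layer's forward pass. The sizes, the random (untrained) weights, & the choice of tanh for $\phi$ are all illustrative assumptions. It also checks that the concatenated form with $W$ gives the same result as the form with separate $W_x$ & $W_y$.

```python
import numpy as np

rng = np.random.default_rng(42)

# Illustrative sizes (assumptions, not taken from the lesson)
m, n_inputs, n_neurons, n_steps = 4, 3, 5, 6

W_x = rng.normal(size=(n_inputs, n_neurons))    # weights for the current inputs
W_y = rng.normal(size=(n_neurons, n_neurons))   # weights for the previous outputs
b = np.zeros(n_neurons)                         # bias vector
W = np.concatenate([W_x, W_y], axis=0)          # W_x stacked on top of W_y

X = rng.normal(size=(n_steps, m, n_inputs))     # X[t] plays the role of X_(t)

def step_separate(X_t, Y_prev):
    # Y_(t) = phi(X_(t) W_x + Y_(t-1) W_y + b), with phi = tanh
    return np.tanh(X_t @ W_x + Y_prev @ W_y + b)

def step_concatenated(X_t, Y_prev):
    # Y_(t) = phi([X_(t) Y_(t-1)] W + b)
    return np.tanh(np.concatenate([X_t, Y_prev], axis=1) @ W + b)

Y_prev = np.zeros((m, n_neurons))   # no previous outputs at t = 0
outputs = []
for t in range(n_steps):
    Y_t = step_separate(X[t], Y_prev)
    assert np.allclose(Y_t, step_concatenated(X[t], Y_prev))
    outputs.append(Y_t)
    Y_prev = Y_t                    # fed back in at the next time step

Y = np.stack(outputs)
print(Y.shape)  # (6, 4, 5): one m x n_neurons output matrix per time step
```

Because `Y_prev` is carried from one iteration to the next, the last output matrix depends on every `X[t]` that came before it.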

## Memory Cells

Since the output of a recurrent neuron at time step *t* is a function of all the inputs from previous time steps, you could say it has a form of *memory*. A part of a neural network that preserves some state across time steps is called a *memory cell* (or simply a *cell*). A single recurrent neuron, or a layer of recurrent neurons, is a very basic cell, capable of learning only short patterns (typically about 10 steps long, but this varies depending on the task). Later in this lesson, we will look at some more complex & powerful types of cells capable of learning longer patterns (roughly 10 times longer, but again, this depends on the task).

In general a cell's state at time step *t*, denoted $h_{(t)}$ (the "h" stands for "hidden"), is a function of some inputs at that time step & its state at the previous time step: $h_{(t)} = f(h_{(t - 1)}, x_{(t)})$. Its output at time step *t*, denoted $y_{(t)}$, is also a function of the previous state & the current inputs. In the case of the basic cells we have discussed so far, the output is simply equal to the state, but in more complex cells, this is not always the case, as shown in the figure below.

<img src = "Images/Hidden State.png" width = "500" style = "margin:auto"/>
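
The distinction between state & output can be sketched in a few lines of Python. Both cells below are hypothetical toy cells with hand-picked scalar weights: `basic_cell` returns its state as its output, while `projected_cell` (an invented example, not a standard cell type) outputs a scaled copy of its state.

```python
import numpy as np

def basic_cell(h_prev, x_t, w_h=0.5, w_x=1.0):
    """Basic cell: the output is simply equal to the new state."""
    h_t = np.tanh(w_h * h_prev + w_x * x_t)
    return h_t, h_t

def projected_cell(h_prev, x_t, w_h=0.5, w_x=1.0, w_out=2.0):
    """Hypothetical cell whose output differs from its state."""
    h_t = np.tanh(w_h * h_prev + w_x * x_t)   # h_(t) = f(h_(t-1), x_(t))
    y_t = w_out * h_t                         # output derived from the state
    return h_t, y_t

def run(cell, inputs, h_init=0.0):
    """Apply a cell over a sequence, collecting states & outputs."""
    h = h_init
    states, outputs = [], []
    for x_t in inputs:
        h, y = cell(h, x_t)
        states.append(h)
        outputs.append(y)
    return np.array(states), np.array(outputs)

inputs = np.array([0.5, -1.0, 0.25])
states_b, outputs_b = run(basic_cell, inputs)
states_p, outputs_p = run(projected_cell, inputs)
```

Only the state `h` is carried from one time step to the next; the output is whatever the cell chooses to expose.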

## Input & Output Sequences

An RNN can simultaneously take a sequence of inputs & produce a sequence of outputs. This type of *sequence-to-sequence network* is useful for predicting time series such as stock prices: you feed it the prices over the last N days, & it must output the prices shifted by one day into the future (i.e., from N - 1 days ago to tomorrow).

Alternatively, you could feed the network a sequence of inputs & ignore all outputs except for the last one. In other words, this is a *sequence-to-vector network*. For example, you could feed the network a sequence of words corresponding to a movie review, & the network would output a sentiment score (e.g., from -1 [hate] to +1 [love]).

Conversely, you could feed the network the same input vector over & over again at each time step & let it output a sequence. This is a *vector-to-sequence network*. For example, the input could be an image (or the output of a CNN), & the output could be a caption for that image.

Lastly, you could have a sequence-to-vector network, called an *encoder*, followed by a vector-to-sequence network, called a *decoder*. For example, this could be used for translating a sentence from one language to another. You could feed the network a sentence in one language, the encoder would convert this sentence into a single vector representation, & then the decoder would decode this vector into a sentence in another language. This two-step model, called an *Encoder-Decoder*, works much better than trying to translate on the fly with a single sequence-to-sequence RNN (like the one represented at the top left): the last words of a sentence can affect the first words of the translation, so you need to wait until you have seen the whole sentence before translating it.

<img src = "Images/Seq-to-Seq, Seq-to-Vector, Vector-to-Seq, Encoder-Decoder.png" width = "500" style = "margin:auto"/>
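
The four configurations differ mainly in which inputs are fed & which outputs are kept. The NumPy sketch below illustrates this with random, untrained weights; the helper names `make_rnn` & `run_rnn` & all the sizes are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_rnn(n_inputs, n_neurons):
    """Random, untrained weights for a basic recurrent layer."""
    return (rng.normal(scale=0.5, size=(n_inputs, n_neurons)),
            rng.normal(scale=0.5, size=(n_neurons, n_neurons)),
            np.zeros(n_neurons))

def run_rnn(params, inputs):
    """Run over inputs of shape [time steps, n_inputs]; return every output."""
    W_x, W_y, b = params
    y = np.zeros(W_y.shape[0])
    outputs = []
    for x_t in inputs:
        y = np.tanh(x_t @ W_x + y @ W_y + b)
        outputs.append(y)
    return np.stack(outputs)

sequence = rng.normal(size=(7, 3))   # 7 time steps, 3 features per step

# Sequence-to-sequence: keep one output per input time step.
seq_to_seq = run_rnn(make_rnn(3, 4), sequence)

# Sequence-to-vector (the encoder): ignore all outputs except the last one.
vector = run_rnn(make_rnn(3, 4), sequence)[-1]

# Vector-to-sequence (the decoder): feed the same vector at every time step.
decoded = run_rnn(make_rnn(4, 2), np.tile(vector, (5, 1)))

print(seq_to_seq.shape, vector.shape, decoded.shape)  # (7, 4) (4,) (5, 2)
```

Chaining the last two steps gives the shape of an Encoder-Decoder: the input sequence has 7 steps, the decoded sequence has 5, & the only thing passed between them is a single vector.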

Sounds promising, but how do you train a recurrent neural network?

---

# Training RNNs

To train an RNN, the trick is to unroll it through time (like we just did) & then simply use regular backpropagation. This strategy is called *backpropagation through time* (BPTT).

Just like in regular backpropagation, there is a first forward pass through the unrolled network (represented by the dashed arrows). Then the output sequence is evaluated using a cost function $C(Y_{(0)}, Y_{(1)}, ..., Y_{(T)})$ (where *T* is the maximum time step). Note that this cost function may ignore some outputs (for example, in a sequence-to-vector RNN, all outputs are ignored except for the very last one). The gradients of that cost function are then propagated backward through the unrolled network (represented by the solid arrows). Finally, the model parameters are updated using the gradients computed during BPTT. Note that the gradients flow backward through all the outputs used by the cost function, not just through the final output (for example, in the below figure, the cost function is computed using the last three outputs of the network, $Y_{(2)}$, $Y_{(3)}$, $Y_{(4)}$, so gradients flow through these three outputs, but not through $Y_{(0)}$ & $Y_{(1)}$). Moreover, since the same parameters $W$ & $b$ are used at each time step, backpropagation will do the right thing & sum over all time steps.

<img src = "Images/Backpropagation Through Time (BPTT).png" width = "600" style = "margin:auto"/>
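
As a rough illustration of BPTT, the sketch below hand-codes the backward pass for a single tanh recurrent neuron & compares it with finite differences. The squared-error cost, the parameter values, & the function names are assumptions for this example; as in the figure, the cost uses only the outputs at time steps 2, 3, & 4.

```python
import numpy as np

def forward(params, x):
    """Unrolled forward pass of one tanh recurrent neuron."""
    w_x, w_y, b = params
    y_prev, ys = 0.0, []
    for x_t in x:
        y_prev = np.tanh(w_x * x_t + w_y * y_prev + b)
        ys.append(y_prev)
    return np.array(ys)

def cost(params, x, targets, used):
    """Squared error summed over the time steps listed in `used`."""
    ys = forward(params, x)
    return 0.5 * sum((ys[t] - targets[t]) ** 2 for t in used)

def bptt(params, x, targets, used):
    """Gradients w.r.t. (w_x, w_y, b), accumulated over all time steps."""
    w_x, w_y, b = params
    ys = forward(params, x)
    grads = np.zeros(3)
    d_next = 0.0                                   # gradient arriving from step t + 1
    for t in reversed(range(len(x))):
        d_cost = (ys[t] - targets[t]) if t in used else 0.0
        d_y = d_next + d_cost                      # via the cost & via later steps
        d_z = d_y * (1.0 - ys[t] ** 2)             # back through tanh
        y_prev = ys[t - 1] if t > 0 else 0.0
        grads += d_z * np.array([x[t], y_prev, 1.0])   # shared parameters: sum
        d_next = d_z * w_y                         # back through the recurrent link
    return grads

rng = np.random.default_rng(0)
x, targets = rng.normal(size=5), rng.normal(size=5)
params = np.array([0.7, -0.4, 0.1])
used = {2, 3, 4}

analytic = bptt(params, x, targets, used)

# Central finite differences as an independent check
eps = 1e-6
numeric = np.zeros(3)
for i in range(3):
    step = np.zeros(3)
    step[i] = eps
    numeric[i] = (cost(params + step, x, targets, used)
                  - cost(params - step, x, targets, used)) / (2 * eps)
```

Time steps 0 & 1 receive no gradient directly from the cost, but they still contribute to the parameter gradients through `d_next`, since their outputs influence the later ones.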

Fortunately, `tf.keras` takes care of all of this complexity for you -- so let's start coding.

---

# Forecasting a Time Series

Suppose you are studying the number of active users per hour on your website, or the daily temperature in your city, or your company's financial health, measured quarterly using multiple metrics. In all these cases, the data will be a sequence of one or more values per time step. This is called a *time series*. In the first two examples, there is a single value per time step, so these are *univariate time series*, while in the financial example, there are multiple values per time step (e.g., the company's revenue, debt, & so on), so it is a *multivariate time series*. A typical task is to predict future values, which is called *forecasting*. Another common task is to fill in the blanks: to predict (or rather "postdict") missing values from the past. This is called *imputation*. For example, the below figure shows 3 univariate time series, each of them 50 time steps long, & the goal here is to forecast the value at the next time step (represented by the X) for each of them.

<img src = "Images/Time Series Forecasting.png" width = "600" style = "margin:auto"/>

For simplicity, we are using a time series generated by the `generate_time_series()` function, shown here:

In [2]:
import numpy as np

def generate_time_series(batch_size, n_steps):
    # Four random draws in [0, 1) per series: two frequencies & two phase offsets
    freq1, freq2, offsets1, offsets2 = np.random.rand(4, batch_size, 1)
    time = np.linspace(0, 1, n_steps)
    series = 0.5 * np.sin((time - offsets1) * (freq1 * 10 + 10))   # first sine wave
    series += 0.2 * np.sin((time - offsets2) * (freq2 * 20 + 20))  # second sine wave
    series += 0.1 * (np.random.rand(batch_size, n_steps) - 0.5)    # uniform noise
    return series[..., np.newaxis].astype(np.float32)              # [batch_size, n_steps, 1]

This function creates as many time series as requested (via the `batch_size` argument), each of length `n_steps`. There is just one value per time step in each series (i.e., all series are univariate). The function returns a NumPy array of shape [*batch size*, *time steps*, 1], where each series is the sum of two sine waves of fixed amplitudes but random frequencies & phases, plus a bit of noise.
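
Written as a formula, each series produced by the function is as follows. The symbols are introduced here only for illustration: $f_1, f_2, o_1, o_2$ are the random frequencies & offsets, each drawn uniformly from $[0, 1)$ once per series, $u(t)$ is uniform noise in $[0, 1)$ drawn at every time step, & $t$ takes `n_steps` evenly spaced values between 0 & 1.

$$s(t) = 0.5 \sin\big((t - o_1)(10 f_1 + 10)\big) + 0.2 \sin\big((t - o_2)(20 f_2 + 20)\big) + 0.1\,\big(u(t) - 0.5\big)$$

Since each sine is bounded by 1 in absolute value & the noise term by 0.05, every value satisfies $|s(t)| \le 0.5 + 0.2 + 0.05 = 0.75$.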

Now, let's create a training set, a validation set, & a test set using this function:

In [3]:
n_steps = 50
series = generate_time_series(10000, n_steps + 1)
X_train, y_train = series[:7000, :n_steps], series[:7000, -1]
X_val, y_val = series[7000:9000, :n_steps], series[7000:9000, -1]
X_test, y_test = series[9000:, :n_steps], series[9000:, -1]

`X_train` contains 7,000 time series (i.e., its shape is [7000, 50, 1]), while `X_val` contains 2,000 (from the 7,000th time series to the 8,999th) & `X_test` contains 1,000 (from the 9,000th to the 9,999th). Since we want to forecast a single value for each series, the targets are column vectors (e.g., `y_train` has a shape of [7000, 1]).

## Baseline Metrics

Before we start using RNNs, it is often a good idea to have a few baseline metrics, or else we may end up thinking our model works great when in fact it is doing worse than basic models. For example, the simplest approach is to predict the last value in each series. This is called *naive forecasting*, & it is sometimes surprisingly difficult to outperform. In this case, it gives us a mean squared error of about 0.02:

In [4]:
import tensorflow as tf
from tensorflow import keras

y_pred = X_val[:, -1]
loss = np.mean(np.square(y_val - y_pred))
loss

0.021324897

Another simple approach is to use a fully connected network. Since it expects a flat list of features for each input, we need to add a `Flatten` layer. Let's use a simple linear regression model so that each prediction will be a linear combination of the values in the time series:

In [11]:
keras.backend.clear_session()

model = keras.models.Sequential([
    keras.layers.Flatten(input_shape = [50, 1]),
    keras.layers.Dense(1)
])
model.compile(loss = "mean_squared_error",
              optimizer = "adam",
              metrics = ["mae"])
model.fit(X_train, y_train, epochs = 20,
         validation_data = (X_val, y_val))

Epoch 1/20
219/219 ━━━━━━━━━━━━━━━━━━━━ 1s 2ms/step - loss: 0.0822 - mae: 0.2232 - val_loss: 0.0259 - val_mae: 0.1290
Epoch 2/20
219/219 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step - loss: 0.0217 - mae: 0.1177 - val_loss: 0.0147 - val_mae: 0.0972
Epoch 3/20
219/219 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step - loss: 0.0133 - mae: 0.0934 - val_loss: 0.0107 - val_mae: 0.0830
Epoch 4/20
219/219 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step - loss: 0.0099 - mae: 0.0802 - val_loss: 0.0088 - val_mae: 0.0752
Epoch 5/20
219/219 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step - loss: 0.0085 - mae: 0.0746 - val_loss: 0.0077 - val_mae: 0.0701
Epoch 6/20
219/219 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step - loss: 0.0074 - mae: 0.0692 - val_loss: 0.0068 - val_mae: 0.0659
Epoch 7/20
219/219 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step -

<keras.src.callbacks.history.History at 0x14c21c860>

If we compile this model using the MSE loss & the default Adam optimizer, then fit it on the training set for 20 epochs & evaluate it on the validation set, we get an MSE of about 0.004. That's much better than the naive approach.
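
Because this model is just a linear combination of the 50 input values plus a bias, a similar baseline can be sketched without Keras using ordinary least squares. This is a stand-in for the `Dense(1)` model, not the training procedure used above: the generator is the same recipe as in the text but takes an explicit random generator, & the smaller dataset size is an assumption chosen to keep the example quick.

```python
import numpy as np

def generate_time_series(batch_size, n_steps, rng):
    """Same recipe as the text's generator, with an explicit random generator."""
    freq1, freq2, offsets1, offsets2 = rng.random((4, batch_size, 1))
    time = np.linspace(0, 1, n_steps)
    series = 0.5 * np.sin((time - offsets1) * (freq1 * 10 + 10))
    series += 0.2 * np.sin((time - offsets2) * (freq2 * 20 + 20))
    series += 0.1 * (rng.random((batch_size, n_steps)) - 0.5)
    return series[..., np.newaxis].astype(np.float32)

rng = np.random.default_rng(42)
n_steps = 50
series = generate_time_series(4000, n_steps + 1, rng)
X_train, y_train = series[:3000, :n_steps, 0], series[:3000, -1, 0]
X_val, y_val = series[3000:, :n_steps, 0], series[3000:, -1, 0]

def add_bias(X):
    """Append a column of ones so the bias is fitted like any other weight."""
    X = X.astype(np.float64)
    return np.concatenate([X, np.ones((len(X), 1))], axis=1)

# 50 weights + 1 bias, fitted in closed form rather than by gradient descent
coef, *_ = np.linalg.lstsq(add_bias(X_train), y_train.astype(np.float64), rcond=None)

linear_mse = np.mean((add_bias(X_val) @ coef - y_val) ** 2)
naive_mse = np.mean((X_val[:, -1] - y_val) ** 2)   # predict the last observed value
print(f"linear: {linear_mse:.4f}  naive: {naive_mse:.4f}")
```

The exact numbers depend on the random draw, but the linear fit should come out well below the naive forecast, in line with the comparison made in the text.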

## Implementing a Simple RNN

Let's see if we can beat that with a simple RNN:

In [14]:
keras.backend.clear_session()

model = keras.models.Sequential([
    keras.layers.SimpleRNN(1, input_shape = [None, 1])
])

That's really the simplest RNN you can build. It just contains a single layer with a single neuron. We do not need to specify the length of the input sequences (unlike in the previous model), since a recurrent neural network can process any number of time steps (this is why we set the first input dimension to `None`). By default, the `SimpleRNN` layer uses the hyperbolic tangent activation function. It works exactly as we saw earlier: the initial state $h_{(init)}$ is set to 0, & it is passed to a single recurrent neuron, along with the value of the first time step, $x_{(0)}$. The neuron computes a weighted sum of these values & applies the hyperbolic tangent activation function to the result, & this gives the first output, $y_{(0)}$. In a simple RNN, this output is also the new state $h_{(0)}$. This new state is passed to the same recurrent neuron along with the next input value, $x_{(1)}$, & the process is repeated until the last time step. Then the layer just outputs the last value, $y_{(49)}$. All of this is performed simultaneously for every time series.

If you compile, fit, & evaluate this model (just like earlier, we train for 20 epochs using Adam), you will find that its MSE reaches only 0.015, so it is better than the naive approach but it does not beat a simple linear model. Note that for each neuron, a linear model has one parameter per input & per time step, plus a bias term (in the simple linear model we used, that's a total of 51 parameters). In contrast, for each recurrent neuron in a simple RNN, there is just one parameter per input & per hidden state dimension (in a simple RNN, that's just the number of recurrent neurons in the layer), plus a bias term. In this simple RNN, that's a total of just three parameters.
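
The parameter counts above follow from simple arithmetic, sketched here with two hypothetical helper functions (they are not part of Keras):

```python
def dense_params(n_inputs, n_units):
    """One weight per input for each unit, plus one bias per unit."""
    return n_inputs * n_units + n_units

def simple_rnn_params(n_inputs, n_units):
    """Per unit: one weight per input, one per state dimension, plus a bias."""
    return n_units * (n_inputs + n_units + 1)

# Linear model on a flattened series of 50 time steps x 1 feature
print(dense_params(50 * 1, 1))    # 51

# Single recurrent neuron fed one feature per time step
print(simple_rnn_params(1, 1))    # 3
```

The recurrent count does not depend on the sequence length, which is why the same layer can process any number of time steps. By the same formula, a 20-unit simple recurrent layer fed one feature per step has 20 × (1 + 20 + 1) = 440 parameters.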

In [15]:
model.compile(loss = "mean_squared_error",
              optimizer = "adam",
              metrics = ["mae"])
model.fit(X_train, y_train, epochs = 20,
          validation_data = (X_val, y_val))

Epoch 1/20
219/219 ━━━━━━━━━━━━━━━━━━━━ 3s 7ms/step - loss: 0.2478 - mae: 0.4383 - val_loss: 0.1302 - val_mae: 0.3082
Epoch 2/20
219/219 ━━━━━━━━━━━━━━━━━━━━ 1s 5ms/step - loss: 0.1122 - mae: 0.2854 - val_loss: 0.0881 - val_mae: 0.2511
Epoch 3/20
219/219 ━━━━━━━━━━━━━━━━━━━━ 1s 5ms/step - loss: 0.0798 - mae: 0.2378 - val_loss: 0.0707 - val_mae: 0.2227
Epoch 4/20
219/219 ━━━━━━━━━━━━━━━━━━━━ 1s 5ms/step - loss: 0.0651 - mae: 0.2128 - val_loss: 0.0588 - val_mae: 0.2015
Epoch 5/20
219/219 ━━━━━━━━━━━━━━━━━━━━ 1s 5ms/step - loss: 0.0538 - mae: 0.1922 - val_loss: 0.0500 - val_mae: 0.1850
Epoch 6/20
219/219 ━━━━━━━━━━━━━━━━━━━━ 1s 6ms/step - loss: 0.0462 - mae: 0.1773 - val_loss: 0.0434 - val_mae: 0.1718
Epoch 7/20
219/219 ━━━━━━━━━━━━━━━━━━━━ 1s 5ms/step -

<keras.src.callbacks.history.History at 0x14c505760>

Apparently, our simple RNN was too simple to get good performance. So let's try to add more recurrent layers!

## Deep RNNs

It is quite common to stack multiple layers of cells, as shown in the below figure. This gives you a *deep RNN*.

In [16]:
keras.backend.clear_session()

model = keras.models.Sequential([
    keras.layers.SimpleRNN(20, return_sequences = True, input_shape = [None, 1]),
    keras.layers.SimpleRNN(20, return_sequences = True),
    keras.layers.SimpleRNN(1)
])

If you compile, fit, & evaluate this model, you will find that it reaches an MSE of 0.003. We finally managed to beat the linear model!

In [17]:
model.compile(loss = "mean_squared_error", 
              optimizer = "adam",
              metrics = ["mae"])
model.fit(X_train, y_train, epochs = 20,
          validation_data = (X_val, y_val))

Epoch 1/20
219/219 ━━━━━━━━━━━━━━━━━━━━ 8s 20ms/step - loss: 0.0908 - mae: 0.2185 - val_loss: 0.0098 - val_mae: 0.0802
Epoch 2/20
219/219 ━━━━━━━━━━━━━━━━━━━━ 4s 20ms/step - loss: 0.0075 - mae: 0.0694 - val_loss: 0.0047 - val_mae: 0.0548
Epoch 3/20
219/219 ━━━━━━━━━━━━━━━━━━━━ 4s 20ms/step - loss: 0.0051 - mae: 0.0573 - val_loss: 0.0039 - val_mae: 0.0503
Epoch 4/20
219/219 ━━━━━━━━━━━━━━━━━━━━ 4s 18ms/step - loss: 0.0046 - mae: 0.0552 - val_loss: 0.0049 - val_mae: 0.0564
Epoch 5/20
219/219 ━━━━━━━━━━━━━━━━━━━━ 4s 18ms/step - loss: 0.0043 - mae: 0.0528 - val_loss: 0.0043 - val_mae: 0.0525
Epoch 6/20
219/219 ━━━━━━━━━━━━━━━━━━━━ 4s 18ms/step - loss: 0.0037 - mae: 0.0489 - val_loss: 0.0037 - val_mae: 0.0492
Epoch 7/20
219/219 ━━━━━━━━━━━━━━━━━━━━ 4s 19ms/

<keras.src.callbacks.history.History at 0x14c6ef770>

Note that the last layer is not ideal: it must have a single unit because we want to forecast a univariate time series, & this means we must have a single output value per time step. However, having a single unit means that the hidden state is just a single number. That's really not much, & it's probably not that useful; presumably, the RNN will mostly use the hidden states of the other recurrent layers to carry over all the information it needs from time step to time step, & it will not use the final layer's hidden state very much. Moreover, since a `SimpleRNN` layer uses the tanh activation function by default, the predicted value must lie within the range -1 to 1. But what if you want to use another activation function? For both these reasons, it might be preferable to replace the output layer with a `Dense` layer: it would run slightly faster, the accuracy would be roughly the same, & it would allow us to choose any output activation function we want. If you make this change, also make sure to remove `return_sequences = True` from the second (now last) recurrent layer:

In [19]:
keras.backend.clear_session()

model = keras.models.Sequential([
    keras.layers.SimpleRNN(20, return_sequences = True, input_shape = [None, 1]),
    keras.layers.SimpleRNN(20),
    keras.layers.Dense(1)
])
model.compile(loss = "mean_squared_error", 
              optimizer = "adam",
              metrics = ["mae"])
model.fit(X_train, y_train, epochs = 20,
          validation_data = (X_val, y_val))

Epoch 1/20
219/219 ━━━━━━━━━━━━━━━━━━━━ 6s 14ms/step - loss: 0.0683 - mae: 0.1727 - val_loss: 0.0048 - val_mae: 0.0548
Epoch 2/20
219/219 ━━━━━━━━━━━━━━━━━━━━ 3s 13ms/step - loss: 0.0044 - mae: 0.0529 - val_loss: 0.0040 - val_mae: 0.0506
Epoch 3/20
219/219 ━━━━━━━━━━━━━━━━━━━━ 5s 13ms/step - loss: 0.0037 - mae: 0.0488 - val_loss: 0.0032 - val_mae: 0.0457
Epoch 4/20
219/219 ━━━━━━━━━━━━━━━━━━━━ 3s 12ms/step - loss: 0.0032 - mae: 0.0459 - val_loss: 0.0032 - val_mae: 0.0455
Epoch 5/20
219/219 ━━━━━━━━━━━━━━━━━━━━ 3s 12ms/step - loss: 0.0032 - mae: 0.0453 - val_loss: 0.0030 - val_mae: 0.0443
Epoch 6/20
219/219 ━━━━━━━━━━━━━━━━━━━━ 3s 12ms/step - loss: 0.0032 - mae: 0.0457 - val_loss: 0.0034 - val_mae: 0.0464
Epoch 7/20
219/219 ━━━━━━━━━━━━━━━━━━━━ 3s 13ms/

<keras.src.callbacks.history.History at 0x14f641a30>

If you train this model, you will see that it converges faster & performs just as well. Plus, you could change the output activation function if you wanted.

## Forecasting Several Time Steps Ahead

So far we have only predicted the value at the next time step, but we could just as easily have predicted the value several steps ahead by changing the targets appropriately (e.g., to predict 10 steps ahead, just change the targets to be the value 10 steps ahead instead of 1 step ahead). But what if you want to predict the next 10 values?

The first option is to use the model we already trained, make it predict the next value, then add that value to the inputs (acting as if this predicted value had actually occurred), & use the model again to predict the following value, & so on, as in the following code:

In [21]:
series = generate_time_series(1, n_steps + 10)
X_new, Y_new = series[:, :n_steps], series[:, n_steps:]
X = X_new
for step_ahead in range(10):
    y_pred_one = model.predict(X[:, step_ahead:])[:, np.newaxis, :]
    X = np.concatenate([X, y_pred_one], axis = 1)

Y_pred = X[:, n_steps:]

[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 218ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 24ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 19ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 20ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 20ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 20ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 21ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 21ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 23ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 22ms/step


As you might expect, the prediction for the next step will usually be more accurate than the predictions for later time steps, since the errors might accumulate, as you can see in the below figure.

<img src = "Images/Forecasting 10 Steps Ahead.png" width = "500" style = "margin:auto"/>

If you evaluate this approach on the validation set, you will find an MSE of about 0.029. This is much higher than the previous models, but it's also a much harder task, so the comparison doesn't mean much. It's much more meaningful to compare this performance with naive predictions (just forecasting that the time series will remain constant for 10 time steps) or with a simple linear model. The naive approach is terrible (it gives an MSE of about 0.223), but the linear model gives an MSE of about 0.0188: it's much better than using our RNN to forecast the future one step at a time, & also much faster to train & run. Still, if you only want to forecast a few time steps ahead, on more complex tasks, this approach may work well.
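
The iterative procedure can be wrapped in a small function that accepts any one-step predictor. In the sketch below, `forecast_iteratively`, `naive_predict_next`, & `mse_per_step` are hypothetical helpers, & the sine-wave data is made up for the example. Plugging in the naive predictor reproduces the "remain constant for 10 time steps" baseline mentioned above.

```python
import numpy as np

def forecast_iteratively(predict_next, X, n_ahead):
    """X: [batch, time steps, 1]; predict_next maps such an array to [batch, 1]."""
    n_steps = X.shape[1]
    for step_ahead in range(n_ahead):
        window = X[:, step_ahead:]                       # the latest n_steps values
        y_next = predict_next(window)[:, np.newaxis, :]  # [batch, 1, 1]
        X = np.concatenate([X, y_next], axis=1)          # treat the forecast as observed
    return X[:, n_steps:, 0]                             # [batch, n_ahead]

def naive_predict_next(window):
    """Stand-in for a trained model: repeat the last value of each series."""
    return window[:, -1]

def mse_per_step(Y_true, Y_pred):
    """MSE computed separately for each forecast horizon."""
    return np.mean((Y_true - Y_pred) ** 2, axis=0)

# Two made-up sine waves, 60 time steps each, with different phases
t = np.linspace(0, 1, 60)
phases = np.array([[0.0], [0.25]])
series = np.sin(2 * np.pi * (t[np.newaxis, :] + phases))[..., np.newaxis]
X, Y_true = series[:, :50], series[:, 50:, 0]

Y_pred = forecast_iteratively(naive_predict_next, X, 10)
errors = mse_per_step(Y_true, Y_pred)
```

With a trained Keras model, `predict_next` could simply be `model.predict`. Looking at `errors` per horizon rather than a single average makes it easier to see how quickly the forecasts degrade.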

The second option is to train an RNN to predict all 10 next values at once. We can still use a sequence-to-vector model, but it will output 10 values instead of 1. However, we first need to change the targets to be vectors containing the next 10 values:

In [22]:
series = generate_time_series(10000, n_steps + 10)
X_train, y_train = series[:7000, :n_steps], series[:7000, -10:, 0]
X_val, y_val = series[7000:9000, :n_steps], series[7000:9000, -10:, 0]
X_test, y_test = series[9000:, :n_steps], series[9000:, -10:, 0]

Now we just need the output layer to have 10 units instead of 1:

In [25]:
keras.backend.clear_session()

model = keras.models.Sequential([
    keras.layers.SimpleRNN(20, return_sequences = True, input_shape = [None, 1]),
    keras.layers.SimpleRNN(20),
    keras.layers.Dense(10)
])

After training this model, you can predict the next 10 values at once very easily:

In [26]:
Y_pred = model.predict(X_new)
Y_pred

[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 215ms/step


array([[-0.60554725, -0.3288696 , -0.2098759 , -0.17849247,  0.23301774,
         0.00085084, -0.12446003,  0.01948992, -0.42078346, -0.39807707]],
      dtype=float32)

This model works nicely: the MSE for the next 10 time steps is about 0.008. That's much better than the linear model. But we can still do better: indeed, instead of training the model to forecast the next 10 values only at the very last time step, we can train it to forecast the next 10 values at each & every time step. In other words, we can turn this sequence-to-vector RNN into a sequence-to-sequence RNN. The advantage of this technique is that the loss will contain a term for the output of the RNN at each & every time step, not just the output at the last time step. This means there will be many more error gradients flowing through the model, & they won't have to flow only through time; they will also flow from the output of each time step. This will both stabilise & speed up training.

To be clear, at time step 0 the model will output a vector containing the forecasts for time steps 1 to 10, then at time step 1 the model will forecast time steps 2 to 11, & so on. So each target must be a sequence of the same length as the input sequence, containing a 10-dimensional vector at each step. Let's prepare these target sequences:

In [27]:
Y = np.empty((10000, n_steps, 10))
for step_ahead in range(1, 10 + 1):
    Y[:, :, step_ahead - 1] = series[:, step_ahead:step_ahead + n_steps, 0]
Y_train = Y[:7000]
Y_val = Y[7000:9000]
Y_test = Y[9000:]
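
To check the target layout, it helps to run the same loop on a tiny series whose values are just their own time indices. The sizes below (5 input steps, 3 steps ahead) are assumptions chosen to keep the arrays readable; the sketch also shows an equivalent construction with NumPy's `sliding_window_view`.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

n_steps, n_ahead = 5, 3

# One series whose value at each time step is its index: 0, 1, ..., 7
series = np.arange(n_steps + n_ahead, dtype=np.float32).reshape(1, -1, 1)

# Same loop as in the text, with smaller sizes
Y = np.empty((1, n_steps, n_ahead), dtype=np.float32)
for step_ahead in range(1, n_ahead + 1):
    Y[:, :, step_ahead - 1] = series[:, step_ahead:step_ahead + n_steps, 0]

print(Y[0, 0])   # [1. 2. 3.]: at time step 0, the targets are steps 1 to 3
print(Y[0, 1])   # [2. 3. 4.]: at time step 1, the targets are steps 2 to 4

# Equivalent construction: every window of n_ahead values, starting at step 1
Y_windows = sliding_window_view(series[:, 1:, 0], n_ahead, axis=1)
```

Each row of `Y[0]` is the window of future values that the model should output at that input time step, so consecutive rows overlap in all but one value.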

To turn the model into a sequence-to-sequence model, we must set `return_sequences = True` in all recurrent layers (even the last one), & we must apply the output `Dense` layer at every time step. Keras offers a `TimeDistributed` layer for this very purpose: it wraps any layer (e.g., a `Dense` layer) & applies it at every time step of its input sequence. It does this efficiently, by reshaping the inputs so that each time step is treated as a separate instance (i.e., it reshapes the inputs from [*batch size*, *time steps*, *input dimensions*] to [*batch size* x *time steps*, *input dimensions*]; in this example, the number of input dimensions is 20 because the previous `SimpleRNN` layer has 20 units), then it runs the `Dense` layer, & finally it reshapes the outputs back to sequences (i.e., it reshapes the outputs from [*batch size* x *time steps*, *output dimensions*] to [*batch size*, *time steps*, *output dimensions*]; in this example, the number of output dimensions is 10, since the `Dense` layer has 10 units). Here is the updated model:

In [28]:
keras.backend.clear_session()

model = keras.models.Sequential([
    keras.layers.SimpleRNN(20, return_sequences = True, input_shape = [None, 1]),
    keras.layers.SimpleRNN(20, return_sequences = True),
    keras.layers.TimeDistributed(keras.layers.Dense(10))
])

The `Dense` layer actually supports sequences as inputs (& even higher-dimensional inputs): it handles them just like `TimeDistributed(Dense(...))`, meaning it is applied to the last input dimension only (independently across all time steps). Thus, we could replace the last layer with just `Dense(10)`. For the sake of clarity, however, we will keep using `TimeDistributed(Dense(10))` because it makes it clear that the `Dense` layer is applied independently at each time step & that the model will output a sequence, not just a single vector.
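
The reshaping described above, & its equivalence with applying the dense transformation to the last dimension directly, can be checked in NumPy. This is only a sketch of the idea with random weights & no activation function; it does not reproduce Keras's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
batch_size, n_steps, n_in, n_out = 4, 50, 20, 10

inputs = rng.normal(size=(batch_size, n_steps, n_in))   # output of the last recurrent layer
kernel = rng.normal(size=(n_in, n_out))                 # dense weights
bias = rng.normal(size=n_out)                           # dense biases

# TimeDistributed-style: treat every time step as a separate instance
flat = inputs.reshape(batch_size * n_steps, n_in)
flat_out = flat @ kernel + bias
time_distributed = flat_out.reshape(batch_size, n_steps, n_out)

# Dense-style: apply the same weights along the last dimension directly
direct = inputs @ kernel + bias

print(time_distributed.shape)                  # (4, 50, 10)
print(np.allclose(time_distributed, direct))   # True
```

Either way, the same `kernel` & `bias` are shared by all time steps, so no information is mixed across time in this layer.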

All outputs are needed during training, but only the output at the last time step is useful for predictions & for evaluation. So although we will rely on the MSE over all the outputs for training, we will use a custom metric for evaluation, to only compute the MSE over the output at the last time step:

In [34]:
def last_time_step_mse(Y_true, Y_pred):
    return keras.metrics.mean_squared_error(Y_true[:, -1], Y_pred[:, -1])

optimizer = keras.optimizers.Adam(learning_rate = 0.01)
model.compile(loss = "mse", optimizer = optimizer, metrics = [last_time_step_mse])

We get a validation MSE of about 0.006, which is 25% better than the previous model. You can combine this approach with the first one: just predict the next 10 values using this RNN, then concatenate these values to the input time series & use the model again to predict the next 10 values, & repeat the process as many times as needed. With this approach, you can generate arbitrarily long sequences. It may not be very accurate for long-term predictions, but it may be just fine if your goal is to generate original music or text.

Simple RNNs can be quite good at forecasting time series or handling other kinds of sequences, but they do not perform as well on long time series or sequences. Let's discuss why & see what we can do about it.

---

# Handling Long Sequences

To train an RNN on long sequences, we must run it over many time steps, making the unrolled RNN a very deep network. Just like any deep neural network, it may suffer from the unstable gradients problem: it may take forever to train, or training may be unstable. Moreover, when an RNN processes a long sequence, it will gradually forget the first inputs in the sequence. Let's look at both these problems, starting with the unstable gradients problem.

## Fighting the Unstable Gradients Problem

Many of the tricks we used in deep nets to alleviate the unstable gradients problem can also be used for RNNs: good parameter initialisation, faster optimizers, dropout, & so on. However, nonsaturating activation functions (e.g., ReLU) may not help as much here; in fact, they may actually lead the RNN to be even more unstable during training. Why? Well, suppose gradient descent updates the weights in a way that increases the outputs slightly at the first time step. Because the same weights are used at every time step, the outputs at the second time step may also be slightly increased, & those at the third, & so on until the outputs explode -- & a nonsaturating activation function does not prevent that. You can reduce this risk by using a smaller learning rate, but you can also simply use a saturating activation function like the hyperbolic tangent (this explains why it is the default). In much the same way, the gradients themselves can explode. If you notice that training is unstable, you may want to monitor the size of the gradients (e.g., using TensorBoard) & perhaps use gradient clipping.
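A toy scalar recurrence makes the argument above concrete. This is a deliberately simplified sketch, not a real RNN: there is a single unit, no input after the first step, & the recurrent weight of 1.2 is an arbitrary value chosen for illustration.

```python
import math

def run_recurrence(activation, weight, x0, n_steps):
    """Apply h = activation(weight * h) repeatedly, reusing the same weight."""
    h = x0
    for _ in range(n_steps):
        h = activation(weight * h)
    return h

def relu(z):
    return max(0.0, z)

# A recurrent weight slightly above 1, shared across all 100 time steps.
relu_out = run_recurrence(relu, weight=1.2, x0=1.0, n_steps=100)
tanh_out = run_recurrence(math.tanh, weight=1.2, x0=1.0, n_steps=100)

print(f"ReLU: {relu_out:.3e}")  # grows like 1.2**100
print(f"tanh: {tanh_out:.3f}")  # stays within (-1, 1)
```

The ReLU version multiplies the state by 1.2 at every step, so it grows geometrically, while tanh keeps the state bounded no matter how many steps are taken.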

Moreover, batch normalisation cannot be used as efficiently with RNNs as with deep feedforward nets. In fact, you cannot use it between time steps, only between recurrent layers. To be more precise, it is technically possible to add a BN layer to a memory cell so that it will be applied at each time step (both on the inputs for that time step & on the hidden state from the previous step). However, the same BN layer will be used at each time step, with the same parameters, regardless of the actual scale & offset of the inputs & hidden state. In practice, this does not yield good results, as was demonstrated by César Laurent et al. in a 2015 paper: the authors found that BN was slightly beneficial only when it was applied to the inputs, not to the hidden states. In other words, it was slightly better than nothing when applied between recurrent layers, but not within recurrent layers. In Keras, this can be done simply by adding a `BatchNormalization` layer before each recurrent layer, but don't expect too much from it.

Another form of normalisation often works better with RNNs: *layer normalisation*. This idea was introduced by Jimmy Lei Ba et al. in a 2016 paper: it is very similar to batch normalisation, but instead of normalising across the batch dimension, it normalises across the features dimension. One advantage is that it can compute the required statistics on the fly, at each time step, independently for each instance. This also means that it behaves the same way during training & testing (as opposed to BN), & it does not need to use exponential moving averages to estimate the feature statistics across all instances in the training set. Like BN, layer normalisation learns a scale & an offset parameter for each input. In an RNN, it is typically used right after the linear combination of the inputs & the hidden states.
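To make the "independently for each instance" point concrete, here is a minimal NumPy sketch of the layer normalisation arithmetic. The function name `layer_norm`, the small constant `eps`, & the fixed `gamma` & `beta` values are assumptions for illustration; in a real layer the scale & offset are learned.

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-3):
    """Normalise each instance across its features (last axis), then scale and shift."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

rng = np.random.default_rng(42)
x = rng.normal(loc=5.0, scale=3.0, size=(4, 20))  # 4 instances, 20 features
gamma, beta = np.ones(20), np.zeros(20)           # learned parameters in a real layer

normed = layer_norm(x, gamma, beta)

# Each instance is normalised on its own, so the result for the first
# instance does not change when the rest of the batch is dropped.
alone = layer_norm(x[:1], gamma, beta)
print(np.allclose(normed[:1], alone))          # True
print(np.allclose(normed.mean(axis=-1), 0.0))  # True
```

Because no statistic depends on the other instances in the batch, the same computation can be used unchanged at training & at test time.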

Let's use tf.keras to implement layer normalisation within a simple memory cell. For this, we need to define a custom memory cell. It is just like a regular layer, except its `call()` method takes two arguments: the `inputs` at the current time step & the hidden `states` from the previous time step. Note that the `states` argument is a list containing one or more tensors. In the case of a simple RNN cell, it contains a single tensor equal to the outputs of the previous time step, but other cells may have multiple state tensors (e.g., an `LSTMCell` has a long-term state & a short-term state). A cell must also have a `state_size` attribute & an `output_size` attribute. In a simple RNN, both are simply equal to the number of units. The following code implements a custom memory cell which will behave like a `SimpleRNNCell`, except it will also apply layer normalisation at each time step:

In [38]:
class LNSimpleRNNCell(keras.layers.Layer):
    def __init__(self, units, activation = "tanh", **kwargs):
        super().__init__(**kwargs)
        self.state_size = units
        self.output_size = units
        self.simple_rnn_cell = keras.layers.SimpleRNNCell(units, activation = None)
        self.layer_norm = keras.layers.LayerNormalization()
        self.activation = keras.activations.get(activation)
    def get_initial_state(self, inputs = None, batch_size = None, dtype = None):
        if inputs is not None:
            batch_size = tf.shape(inputs)[0]
            dtype = inputs.dtype
        return [tf.zeros([batch_size, self.state_size], dtype = dtype)]
    def call(self, inputs, states):
        outputs, new_states = self.simple_rnn_cell(inputs, states)
        norm_outputs = self.activation(self.layer_norm(outputs))
        return norm_outputs, [norm_outputs]

The code is quite straightforward. Our `LNSimpleRNNCell` class inherits from the `keras.layers.Layer` class, just like any custom layer. The constructor takes the number of units & the desired activation function, & it sets the `state_size` & `output_size` attributes, then creates a `SimpleRNNCell` with no activation function (because we want to perform layer normalisation after the linear operation but before the activation function). Then the constructor creates the `LayerNormalization` layer, & finally it fetches the desired activation function. The `call()` method starts by applying the simple RNN cell, which computes a linear combination of the current inputs & the previous hidden states, & it returns the result twice (indeed, in a `SimpleRNNCell`, the outputs are just equal to the hidden states: in other words, `new_states[0]` is equal to `outputs`, so we can safely ignore `new_states` in the rest of the `call()` method). Next, the `call()` method applies layer normalisation, followed by the activation function. Finally, it returns the outputs twice (once as the outputs & once as the new hidden states). To use this custom cell, all we need to do is create a `keras.layers.RNN` layer, passing it a cell instance:

In [39]:
model = keras.models.Sequential([
    keras.layers.RNN(LNSimpleRNNCell(20), return_sequences = True, input_shape = [None, 1]),
    keras.layers.RNN(LNSimpleRNNCell(20), return_sequences = True),
    keras.layers.TimeDistributed(keras.layers.Dense(10))
])



Similarly, you could create a custom cell to apply dropout between each time step. But there's a simpler way: all recurrent layers (except for `keras.layers.RNN`) & all cells provided by Keras have a `dropout` hyperparameter & a `recurrent_dropout` hyperparameter: the former defines the dropout rate to apply to the inputs (at each time step) & the latter defines the dropout rate for the hidden states (also at each time step). No need to create a custom cell to apply dropout at each time step in an RNN.

With these techniques, you can alleviate the unstable gradients problem & train an RNN much more efficiently. Now let's look at how to deal with the short-term memory problem.

## Tackling the Short-Term Memory Problem

Due to the transformation that the data goes through when traversing an RNN, some information is lost at each time step. After a while, the RNN's state contains virtually no trace of the first inputs. This can be a showstopper. Imagine Dory the fish trying to translate a long sentence; by the time she's finished reading it, she has no clue how it started. To tackle this problem, various types of cells with long-term memory have been introduced. They have proven so successful that the basic cells are not used much anymore. Let's first look at the most popular of these long-term memory cells: the LSTM cell.

### LSTM Cells

The *long short-term memory* (LSTM) cell was proposed in 1997 by Sepp Hochreiter & Jürgen Schmidhuber & gradually improved over the years by several researchers such as Alex Graves, Haşim Sak, & Wojciech Zaremba. If you consider the LSTM cell as a black box, it can be used very much like a basic cell, except it will perform much better: training will converge faster, & it will detect long-term dependencies in the data. In Keras, you can simply use the `LSTM` layer instead of the `SimpleRNN` layer:

In [40]:
model = keras.models.Sequential([
    keras.layers.LSTM(20, return_sequences = True, input_shape = [None, 1]),
    keras.layers.LSTM(20, return_sequences = True),
    keras.layers.TimeDistributed(keras.layers.Dense(10))
])

Alternatively, you could use the general-purpose `keras.layers.RNN` layer, giving it an `LSTMCell` as an argument:

In [41]:
model = keras.models.Sequential([
    keras.layers.RNN(keras.layers.LSTMCell(20), return_sequences = True, input_shape = [None, 1]),
    keras.layers.RNN(keras.layers.LSTMCell(20), return_sequences = True),
    keras.layers.TimeDistributed(keras.layers.Dense(10))
])

However, the `LSTM` layer uses an optimised implementation when running on a GPU, so in general, it is preferable to use it (the `RNN` layer is mostly useful when you define custom cells).

So how does an LSTM cell work? Its architecture is shown below:

<img src = "Images/LSTM Cell.png" width = "600" style = "margin:auto"/>

If you don't look at what's inside the box, the LSTM cell looks exactly like a regular cell, except that its state is split into two vectors: $h_{(t)}$ & $c_{(t)}$ ("c" stands for "cell"). You can think of $h_{(t)}$ as the short-term state & $c_{(t)}$ as the long-term state.

Now let's open the box! The key idea is that the network can learn what to store in the long-term state, what to throw away, & what to read from it. As the long-term state $c_{(t - 1)}$ traverses the network from left to right, you can see that it first goes through a *forget gate*, dropping some memories, & then it adds some new memories via the addition operation (which adds the memories that were selected by an *input gate*). The result $c_{(t)}$ is sent straight out, without any further transformation. So, at each time step, some memories are dropped & some memories are added. Moreover, after the addition operation, the long-term state is copied & passed through the tanh function, & then the result is filtered by the *output gate*. This produces the short-term state $h_{(t)}$ (which is equal to the cell's output for this time step, $y_{(t)}$). Now, let's look at where new memories come from & how the gates work.

First, the current input vector $x_{(t)}$ & the previous short-term state $h_{(t - 1)}$ are fed to four different fully connected layers. They all serve a different purpose:

* The main layer is the one that outputs $g_{(t)}$. It has the usual role of analysing the current inputs $x_{(t)}$ & the previous (short-term) state $h_{(t - 1)}$. In a basic cell, there is nothing other than this layer, & its output goes straight out to $y_{(t)}$ & $h_{(t)}$. In contrast, in an LSTM cell this layer's output does not go straight out, but instead its most important parts are stored in the long-term state (& the rest is dropped).
* The three other layers are *gate controllers*. Since they use the logistic activation function, their outputs range from 0 to 1. As you can see, their outputs are fed to element-wise multiplication operations, so if they output 0s they close the gate, & if they output 1s they open it. Specifically:
   - The *forget gate* (controlled by $f_{(t)}$) controls which parts of the long-term state should be erased.
   - The *input gate* (controlled by $i_{(t)}$) controls which parts of $g_{(t)}$ should be added to the long-term state.
   - Finally, the *output gate* (controlled by $o_{(t)}$) controls which parts of the long-term state should be read & output at this time step, both to $h_{(t)}$ & to $y_{(t)}$.

In short, an LSTM cell can learn to recognise an important input (that's the role of the input gate), & store it in a long-term state, preserve it for as long as it is needed (that's the role of the forget gate), & extract it whenever it is needed. This explains why these cells have been amazingly successful at capturing long-term patterns in time series, long texts, audio recordings, & more.

The equations below summarise how to compute the cell's long-term state, its short-term state, & its output at each time step for a single instance (the equations for a whole mini-batch are very similar).

$$\begin{split}
i_{(t)} = \sigma(W^{\intercal}_{xi} x_{(t)} + W^{\intercal}_{hi} h_{(t - 1)} + b_i) \\
f_{(t)} = \sigma(W^{\intercal}_{xf} x_{(t)} + W^{\intercal}_{hf} h_{(t - 1)} + b_f) \\
o_{(t)} = \sigma(W^{\intercal}_{xo} x_{(t)} + W^{\intercal}_{ho} h_{(t - 1)} + b_o) \\
g_{(t)} = \tanh(W^{\intercal}_{xg} x_{(t)} + W^{\intercal}_{hg} h_{(t - 1)} + b_g) \\
c_{(t)} = f_{(t)} \otimes c_{(t - 1)} + i_{(t)} \otimes g_{(t)} \\
y_{(t)} = h_{(t)} = o_{(t)} \otimes \tanh(c_{(t)})
\end{split}$$

In these equations:

* $W_{xi}$, $W_{xf}$, $W_{xo}$, $W_{xg}$ are the weight matrices of each of the four layers for their connection to the input vector $x_{(t)}$.
* $W_{hi}$, $W_{hf}$, $W_{ho}$, $W_{hg}$ are the weight matrices of each of the four layers for their connection to the previous short-term state $h_{(t - 1)}$.
* $b_i$, $b_f$, $b_o$, & $b_g$ are the bias terms for each of the four layers. Note that TensorFlow initialises $b_f$ to a vector full of 1s instead of 0s. This prevents forgetting everything at the beginning of training.
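The equations above translate almost line for line into NumPy. The following is a minimal sketch for a single instance, not the Keras implementation: the function `lstm_step`, the dictionary layout of the weights, & the random initial values are all assumptions made for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W_x, W_h, b):
    """One LSTM time step for a single instance.

    W_x, W_h and b are dicts keyed by "i", "f", "o", "g" (one entry per layer).
    """
    i = sigmoid(W_x["i"].T @ x + W_h["i"].T @ h_prev + b["i"])  # input gate
    f = sigmoid(W_x["f"].T @ x + W_h["f"].T @ h_prev + b["f"])  # forget gate
    o = sigmoid(W_x["o"].T @ x + W_h["o"].T @ h_prev + b["o"])  # output gate
    g = np.tanh(W_x["g"].T @ x + W_h["g"].T @ h_prev + b["g"])  # main layer
    c = f * c_prev + i * g  # drop some memories, add new ones
    h = o * np.tanh(c)      # short-term state, also the output y
    return h, c

n_inputs, n_units = 3, 4
rng = np.random.default_rng(0)
W_x = {k: rng.normal(scale=0.5, size=(n_inputs, n_units)) for k in "ifog"}
W_h = {k: rng.normal(scale=0.5, size=(n_units, n_units)) for k in "ifog"}
b = {k: np.zeros(n_units) for k in "ifog"}
b["f"] = np.ones(n_units)  # forget-gate bias starts at 1, as noted above

h, c = np.zeros(n_units), np.zeros(n_units)
for x in rng.normal(size=(5, n_inputs)):  # a sequence of 5 time steps
    h, c = lstm_step(x, h, c, W_x, W_h, b)
print(h.shape, c.shape)  # (4,) (4,)
```

Note how the long-term state `c` is only ever rescaled & added to, never passed through a weight matrix, which is what lets information survive across many time steps.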

### Peephole Connections

In a regular LSTM cell, the gate controllers can look only at the input $x_{(t)}$ & the previous short-term state $h_{(t - 1)}$. It may be a good idea to give them a bit more context by letting them peek at the long-term state as well. This idea was proposed by Felix Gers & Jürgen Schmidhuber in 2000. They proposed an LSTM variant with extra connections called *peephole connections*: the previous long-term state $c_{(t - 1)}$ is added as an input to the controllers of the forget gate & the input gate, & the current long-term state $c_{(t)}$ is added as input to the controller of the output gate. This often improves performance, but not always, & there is no clear pattern for which tasks are better off with or without them: you will have to try it on your task & see if it helps.
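One common way to write the peephole variant is to add an element-wise term to each gate controller. The peephole weight vectors $p_i$, $p_f$, & $p_o$ are notation introduced here for illustration (they do not appear elsewhere in this lesson), & other formulations exist:

$$\begin{split}
i_{(t)} = \sigma(W^{\intercal}_{xi} x_{(t)} + W^{\intercal}_{hi} h_{(t - 1)} + p_i \otimes c_{(t - 1)} + b_i) \\
f_{(t)} = \sigma(W^{\intercal}_{xf} x_{(t)} + W^{\intercal}_{hf} h_{(t - 1)} + p_f \otimes c_{(t - 1)} + b_f) \\
o_{(t)} = \sigma(W^{\intercal}_{xo} x_{(t)} + W^{\intercal}_{ho} h_{(t - 1)} + p_o \otimes c_{(t)} + b_o)
\end{split}$$

The computations of $g_{(t)}$, $c_{(t)}$, & $h_{(t)}$ are unchanged; since the output gate uses $c_{(t)}$, it must be evaluated after the long-term state has been updated.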

In Keras, the `LSTM` layer is based on the `keras.layers.LSTMCell` cell, which does not support peepholes. The experimental `tf.keras.experimental.PeepholeLSTMCell` does, however (in the TensorFlow 2.x releases that include it), so you can create a `keras.layers.RNN` layer & pass a `PeepholeLSTMCell` to its constructor.

There are many other variants of the LSTM cell. One particularly popular variant is the GRU cell, which we will look at now.

### GRU Cells

The *Gated Recurrent Unit* (GRU) cell was proposed by Kyunghyun Cho et al. in a 2014 paper that also introduced the encoder-decoder network we discussed.

<img src = "Images/GRU Cell.png" width = "500" style = "margin:auto"/>

The GRU cell is a simplified version of the LSTM cell, & it seems to perform just as well (which explains its growing popularity). These are the main simplifications:

* Both state vectors are merged into a single vector $h_{(t)}$.
* A single gate controller $z_{(t)}$ controls both the forget gate & the input gate. If the gate controller outputs a 1, the forget gate is open (= 1) & the input gate is closed (1 - 1 = 0). If it outputs a 0, the opposite happens. In other words, whenever a memory must be stored, the location where it will be stored is erased first. This is actually a frequent variant of the LSTM cell in & of itself.
* There is no output gate; the full state vector is output at every time step. However, there is a new gate controller $r_{(t)}$ that controls which part of the previous state will be shown to the main layer ($g_{(t)}$).

The equations below summarise how to compute the cell's state at each time step for a single instance.

$$\begin{split}
z_{(t)} = \sigma(W^{\intercal}_{xz} x_{(t)} + W^{\intercal}_{hz} h_{(t - 1)} + b_z) \\
r_{(t)} = \sigma(W^{\intercal}_{xr} x_{(t)} + W^{\intercal}_{hr} h_{(t - 1)} + b_r) \\
g_{(t)} = \tanh(W^{\intercal}_{xg} x_{(t)} + W^{\intercal}_{hg} (r_{(t)} \otimes h_{(t - 1)}) + b_g) \\
h_{(t)} = z_{(t)} \otimes h_{(t - 1)} + (1 - z_{(t)}) \otimes g_{(t)}
\end{split}$$
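As a minimal sketch, the GRU equations above can be written in NumPy as follows. This follows the equations as stated in this lesson rather than any particular library's implementation; the function `gru_step`, the dictionary layout of the weights, & the random values are assumptions for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x, h_prev, W_x, W_h, b):
    """One GRU time step for a single instance (dicts keyed by "z", "r", "g")."""
    z = sigmoid(W_x["z"].T @ x + W_h["z"].T @ h_prev + b["z"])        # update gate
    r = sigmoid(W_x["r"].T @ x + W_h["r"].T @ h_prev + b["r"])        # reset gate
    g = np.tanh(W_x["g"].T @ x + W_h["g"].T @ (r * h_prev) + b["g"])  # main layer
    return z * h_prev + (1.0 - z) * g

n_inputs, n_units = 3, 4
rng = np.random.default_rng(1)
W_x = {k: rng.normal(scale=0.5, size=(n_inputs, n_units)) for k in "zrg"}
W_h = {k: rng.normal(scale=0.5, size=(n_units, n_units)) for k in "zrg"}
b = {k: np.zeros(n_units) for k in "zrg"}

x = rng.normal(size=n_inputs)
h_prev = rng.uniform(-1, 1, size=n_units)
h_new = gru_step(x, h_prev, W_x, W_h, b)

# With a very large bias on z, the gate saturates near 1 and the
# previous state is kept almost unchanged.
b_keep = dict(b, z=np.full(n_units, 50.0))
h_kept = gru_step(x, h_prev, W_x, W_h, b_keep)
print(np.allclose(h_kept, h_prev))  # True
```

The last check illustrates the coupling described in the list above: when $z_{(t)}$ is 1, the old state is kept & nothing new is written.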

Keras provides a `keras.layers.GRU` layer (based on the `keras.layers.GRUCell` memory cell); using it is just a matter of replacing `SimpleRNN` or `LSTM` with `GRU`.

LSTM & GRU cells are one of the main reasons behind the success of RNNs. Yet while they can tackle much longer sequences than simple RNNs, they still have a fairly limited short-term memory, & they have a hard time learning long-term patterns in sequences of 100 time steps or more, such as audio samples, long time series, or long sentences. One way to solve this is to shorten the input sequences, for example using 1D convolutional layers.

### Using 1D Convolutional Layers to Preprocess Sequences

In the previous lesson, we saw that a 2D convolutional layer works by sliding several fairly small kernels (or filters) across an image, producing multiple 2D feature maps (one per kernel). Similarly, a 1D convolutional layer slides several kernels across a sequence, producing a 1D feature map per kernel. Each kernel will learn to detect a single very short sequential pattern (no longer than the kernel size). If you use 10 kernels, then the layer's output will be composed of 10 1-dimensional sequences (all of the same length), or equivalently you can view this output as a single 10-dimensional sequence. This means that you can build a neural network composed of a mix of recurrent layers & 1D convolutional layers (or even 1D pooling layers).

If you use a 1D convolutional layer with a stride of 1 & `"same"` padding, then the output sequence will have the same length as the input sequence. But if you use `"valid"` padding or a stride greater than 1, then the output sequence will be shorter than the input sequence, so make sure you adjust the targets accordingly.

For example, the following model is the same as earlier, except it starts with a 1D convolutional layer that downsamples the input sequence by a factor of 2, using a stride of 2. The kernel size is larger than the stride, so all inputs will be used to compute the layer's output, & therefore the model can learn to preserve the useful information, dropping only the unimportant details. By shortening the sequences, the convolutional layer may help the `GRU` layers detect longer patterns. Note that we must also crop off the first three time steps in the targets (since the kernel's size is 4, the first output of the convolutional layer will be based on the input time steps 0 to 3), & downsample the targets by a factor of 2:

In [44]:
keras.backend.clear_session()

model = keras.models.Sequential([
    keras.layers.Conv1D(filters = 20, kernel_size = 4, strides = 2, padding = "valid",
                        input_shape = [None, 1]),
    keras.layers.GRU(20, return_sequences = True),
    keras.layers.GRU(20, return_sequences = True),
    keras.layers.TimeDistributed(keras.layers.Dense(10))
])

model.compile(loss = "mse", optimizer = "adam")
model.fit(X_train, Y_train[:, 3::2], epochs = 20,
          validation_data = (X_val, Y_val[:, 3::2]))

Epoch 1/20
219/219 ━━━━━━━━━━━━━━━━━━━━ 7s 16ms/step - loss: 0.0909 - val_loss: 0.0479
Epoch 2/20
219/219 ━━━━━━━━━━━━━━━━━━━━ 3s 15ms/step - loss: 0.0420 - val_loss: 0.0346
Epoch 3/20
219/219 ━━━━━━━━━━━━━━━━━━━━ 3s 15ms/step - loss: 0.0332 - val_loss: 0.0306
Epoch 4/20
219/219 ━━━━━━━━━━━━━━━━━━━━ 3s 15ms/step - loss: 0.0293 - val_loss: 0.0272
Epoch 5/20
219/219 ━━━━━━━━━━━━━━━━━━━━ 3s 16ms/step - loss: 0.0269 - val_loss: 0.0258
Epoch 6/20
219/219 ━━━━━━━━━━━━━━━━━━━━ 3s 16ms/step - loss: 0.0248 - val_loss: 0.0242
Epoch 7/20
219/219 ━━━━━━━━━━━━━━━━━━━━ 4s 18ms/step - loss: 0.0237 - val_loss: 0.0249
Epoch 8/20
219/219 ━━━━━━━━━━━━━━━━━━━━ 3s 15ms/step - loss: 0.0229 - val_loss: 0.0225
Epoch 9/20
219/219 ...


If you train & evaluate this model, you will find that it is the best model so far. The convolutional layer really helps. In fact, it is actually possible to use only 1D convolutional layers & drop the recurrent layers entirely!
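The target slicing `[:, 3::2]` used above can be sanity-checked with a little index arithmetic. This sketch assumes input sequences of 50 time steps (the length is not stated in this section) & introduces a hypothetical helper, `conv1d_output_length`.

```python
def conv1d_output_length(n_steps, kernel_size, stride):
    """Number of outputs of a 1D convolution with "valid" padding."""
    return (n_steps - kernel_size) // stride + 1

n_steps, kernel_size, stride = 50, 4, 2  # n_steps = 50 is an assumed sequence length

n_outputs = conv1d_output_length(n_steps, kernel_size, stride)

# Output j covers input steps j * stride to j * stride + kernel_size - 1,
# so the last input step it sees is j * stride + kernel_size - 1.
last_input_steps = [j * stride + kernel_size - 1 for j in range(n_outputs)]

# Target indices selected by the slice [3::2] along the time axis.
target_steps = list(range(n_steps))[kernel_size - 1::stride]

print(n_outputs)                         # 24
print(last_input_steps == target_steps)  # True
```

Each output of the convolutional layer is thus paired with the target of the last input time step it has seen, which is what the crop & the downsampling achieve together.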

### WaveNet

In a 2016 paper, Aäron van den Oord & other DeepMind researchers introduced an architecture called *WaveNet*. They stacked 1D convolutional layers, doubling the dilation rate (how spread apart each neuron's inputs are) at every layer: the first convolutional layer gets a glimpse of just two time steps at a time, while the next one sees four time steps (its receptive field is four time steps long), the next one sees eight time steps, & so on. This way the lower layers learn short-term patterns, while the higher layers learn long-term patterns. Thanks to the doubling dilation rate, the network can process extremely large sequences very efficiently.

<img src = "Images/WaveNet Architecture.png" width = "550" style = "margin:auto"/>
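The growth of the receptive field can be verified with a short calculation. This is a sketch assuming a kernel size of 2 at every layer (consistent with the first layer seeing two time steps); `receptive_field` is a hypothetical helper name.

```python
def receptive_field(kernel_size, dilation_rates):
    """Time steps seen by one output of a stack of dilated 1D convolutions."""
    field = 1
    for rate in dilation_rates:
        # Each layer extends the field by (kernel_size - 1) * dilation rate.
        field += (kernel_size - 1) * rate
    return field

print(receptive_field(2, [1]))                        # 2
print(receptive_field(2, [1, 2]))                     # 4
print(receptive_field(2, [1, 2, 4]))                  # 8
print(receptive_field(2, [2**i for i in range(10)]))  # 1024
```

Ten layers with dilation rates 1, 2, 4, ..., 512 are enough to cover 1,024 time steps, whereas without dilation the same ten layers would cover only 11.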

In the WaveNet paper, the authors actually stacked 10 convolutional layers with dilation rates of 1, 2, 4, 8, ..., 256, 512, then they stacked another group of 10 identical layers (also with dilation rates 1, 2, 4, 8, ..., 256, 512), then again another identical group of 10 layers. They justified this architecture by pointing out that a single stack of 10 convolutional layers with these dilation rates will act like a super-efficient convolutional layer with a kernel of size 1,024 (except way faster, more powerful, & using significantly fewer parameters), which is why they stacked 3 such blocks. They also left-padded the input sequences with a number of zeros equal to the dilation rate before every layer, to preserve the same sequence length throughout the network. Here is how to implement a simplified WaveNet to tackle the same sequences as earlier.

In [45]:
model = keras.models.Sequential()
model.add(keras.layers.InputLayer(input_shape = [None, 1]))
for rate in (1, 2, 4, 8) * 2:
    model.add(keras.layers.Conv1D(filters = 20, kernel_size = 2, padding = "causal",
                                  activation = "relu", dilation_rate = rate))
model.add(keras.layers.Conv1D(filters = 10, kernel_size = 1))
model.compile(loss = "mse", optimizer = "adam")
history = model.fit(X_train, Y_train, epochs = 20,
                    validation_data = (X_val, Y_val))

Epoch 1/20
219/219 ━━━━━━━━━━━━━━━━━━━━ 6s 14ms/step - loss: 0.1003 - val_loss: 0.0378
Epoch 2/20
219/219 ━━━━━━━━━━━━━━━━━━━━ 3s 11ms/step - loss: 0.0349 - val_loss: 0.0303
Epoch 3/20
219/219 ━━━━━━━━━━━━━━━━━━━━ 2s 11ms/step - loss: 0.0293 - val_loss: 0.0273
Epoch 4/20
219/219 ━━━━━━━━━━━━━━━━━━━━ 2s 11ms/step - loss: 0.0264 - val_loss: 0.0260
Epoch 5/20
219/219 ━━━━━━━━━━━━━━━━━━━━ 3s 11ms/step - loss: 0.0252 - val_loss: 0.0242
Epoch 6/20
219/219 ━━━━━━━━━━━━━━━━━━━━ 3s 14ms/step - loss: 0.0242 - val_loss: 0.0238
Epoch 7/20
219/219 ━━━━━━━━━━━━━━━━━━━━ 3s 13ms/step - loss: 0.0233 - val_loss: 0.0231
Epoch 8/20
219/219 ━━━━━━━━━━━━━━━━━━━━ 3s 13ms/step - loss: 0.0224 - val_loss: 0.0224
Epoch 9/20
219/219 ...

This `Sequential` model starts with an explicit input layer (this is simpler than trying to set `input_shape` only on the first layer), then continues with a 1D convolutional layer using `"causal"` padding: this ensures that the convolutional layer does not peek into the future when making predictions (it is equivalent to padding the inputs with the right amount of zeros on the left & using `"valid"` padding). We then add similar layers using growing dilation rates: 1, 2, 4, 8, & again 1, 2, 4, 8. Finally, we add the output layer: a convolutional layer with 10 filters of size 1 & without any activation function. Thanks to the causal padding, every convolutional layer outputs a sequence of the same length as the input sequences, so the targets we use during training can be the full sequences: no need to crop them or downsample them.
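To see what causal padding does, here is a minimal single-channel sketch in NumPy. It is not the Keras `Conv1D` implementation; the function `causal_conv1d` & the toy kernel are assumptions for illustration.

```python
import numpy as np

def causal_conv1d(x, kernel, dilation_rate=1):
    """Single-channel dilated causal convolution (cross-correlation, as in deep learning).

    Left-pads x with zeros so that output[t] depends only on
    x[t], x[t - dilation_rate], x[t - 2 * dilation_rate], ...
    """
    k = len(kernel)
    pad = (k - 1) * dilation_rate
    padded = np.concatenate([np.zeros(pad), x])
    out = np.zeros(len(x))
    for t in range(len(x)):
        for j in range(k):
            out[t] += kernel[j] * padded[t + j * dilation_rate]
    return out

x = np.arange(1.0, 9.0)  # 8 time steps: 1, 2, ..., 8
kernel = np.array([0.5, 0.5])
y = causal_conv1d(x, kernel, dilation_rate=2)

# Changing a later value must not affect earlier outputs.
x_changed = x.copy()
x_changed[5] = 100.0
y_changed = causal_conv1d(x_changed, kernel, dilation_rate=2)

print(len(y) == len(x))                      # True
print(np.array_equal(y[:5], y_changed[:5]))  # True
```

The output has the same length as the input, & modifying time step 5 leaves outputs 0 to 4 untouched, which is exactly the "no peeking into the future" property.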

The last two models offer the best performance so far in forecasting our time series! In the WaveNet paper, the authors achieved state-of-the-art performance on various audio tasks (hence the name of the architecture), including text-to-speech tasks, producing incredibly realistic voices across several languages. They also used the model to generate music, one audio sample at a time. This feat is all the more impressive when you realise that a single second of audio can contain tens of thousands of time steps -- even LSTMs & GRUs cannot handle such long sequences.