# Chapter 15: Processing Sequences Using RNNs and CNNs

## 15.1 Recurrent Neurons and Layers

A **recurrent neural network (RNN)** looks very much like a feedforward neural network, except it also has connections pointing backward. 

**Unrolling the network through time** - The same recurrent neuron represented once per time step.

Extending to a layer of recurrent neurons, each neuron receives the input vector $\mathbf{x}_{(t)}$ and the output vector from the previous time step $\mathbf{y}_{(t-1)}$.

Considering the whole recurrent layer, the weight vectors can be placed in two weight matrices $\mathbf{W}_x$ and $\mathbf{W}_y$.

*Equation 15-1. Output of a recurrent layer for a single instance:*
$$ \mathbf{y}_{(t)} = \phi(\mathbf{W}_x^T \mathbf{x}_{(t)} 
+ \mathbf{W}_y^T \mathbf{y}_{(t-1)} + \mathbf{b})$$

*Equation 15-2. Outputs of a layer of recurrent neurons for all instances in a mini-batch:*
$$
\mathbf{Y}_{(t)} = \phi(\mathbf{X}_{(t)} \mathbf{W}_x  
+ \mathbf{Y}_{(t-1)} \mathbf{W}_y + \mathbf{b}) $$

$$ 
\mathbf{Y}_{(t)} = \phi([\mathbf{X}_{(t)} \mathbf{Y}_{(t-1)}] \mathbf{W} + \mathbf{b}) \text{, with } \mathbf{W} = 
\begin{bmatrix}
\mathbf{W}_x \\
\mathbf{W}_y
\end{bmatrix}
$$

In this equation:
- $\mathbf{Y}_{(t)}$ is an $m \times n_{neurons}$ matrix containing the layer's outputs at time step $t$ for each instance in the mini-batch ($m$ is the number of instances in the mini-batch).
- $\mathbf{X}_{(t)}$ is an $m \times n_{inputs}$ matrix containing the inputs for all instances.
- $\mathbf{W}_x$ is an $n_{inputs} \times n_{neurons}$ matrix containing the connection weights for the inputs of the current time step.
- $\mathbf{W}_y$ is an $n_{neurons} \times n_{neurons}$ matrix containing the connection weights for the outputs of the previous time step.
- $\mathbf{b}$ is a vector of size $n_{neurons}$ containing each neuron's bias term.
- The weight matrices $\mathbf{W}_x$ and $\mathbf{W}_y$ are concatenated vertically into a single weight matrix $\mathbf{W}$ of shape $(n_{inputs} + n_{neurons}) \times n_{neurons}$.
- The notation $[\mathbf{X}_{(t)} \mathbf{Y}_{(t-1)}]$ represents the horizontal concatenation of the matrices $\mathbf{X}_{(t)}$ and $\mathbf{Y}_{(t-1)}$.

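> Note: $\mathbf{Y}_{(t)}$ is a function of $\mathbf{X}_{(t)}$ and $\mathbf{Y}_{(t-1)}$, which is in turn a function of $\mathbf{X}_{(t-1)}$ and $\mathbf{Y}_{(t-2)}$, and so on. This makes $\mathbf{Y}_{(t)}$ a function of all the inputs since time $t=0$ (i.e., $\mathbf{X}_{(0)}, \mathbf{X}_{(1)}, \dots, \mathbf{X}_{(t)}$).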
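
To make the shapes in Equation 15-2 concrete, here is a minimal NumPy sketch of a single time step. The sizes, the random weights, and the choice of `tanh` for $\phi$ are assumptions made for illustration; the notes do not fix any of them.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical sizes, chosen only for illustration
m, n_inputs, n_neurons = 4, 3, 5

X_t = rng.normal(size=(m, n_inputs))           # inputs at time step t
Y_prev = rng.normal(size=(m, n_neurons))       # outputs at time step t-1
W_x = rng.normal(size=(n_inputs, n_neurons))   # input weights
W_y = rng.normal(size=(n_neurons, n_neurons))  # recurrent weights
b = rng.normal(size=n_neurons)                 # bias, broadcast over the batch

phi = np.tanh  # assumed activation function

# First form: two separate matrix products
Y_t = phi(X_t @ W_x + Y_prev @ W_y + b)

# Second form: concatenate inputs horizontally and weights vertically
XY = np.hstack([X_t, Y_prev])  # shape (m, n_inputs + n_neurons)
W = np.vstack([W_x, W_y])      # shape (n_inputs + n_neurons, n_neurons)
Y_t_concat = phi(XY @ W + b)

print(Y_t.shape, W.shape, np.allclose(Y_t, Y_t_concat))
```

Both forms compute the same $m \times n_{neurons}$ matrix; the concatenated form simply replaces two matrix products with one.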

### 15.1.1 Memory Cells

Since the output of a recurrent neuron at time step $t$ is a function of all the inputs from the previous time steps, it has a form of *memory*.

**Memory cell (a cell)** - A part of a neural network that preserves some state across time steps.

A single recurrent neuron, or a layer of recurrent neurons, is a very basic cell.

In general, a cell's state at time step $t$, denoted $\mathbf{h}_{(t)}$ (the "h" stands for "hidden"), is a function of some inputs at that time step and its state at the previous time step: $\mathbf{h}_{(t)} = f(\mathbf{h}_{(t-1)}, \mathbf{x}_{(t)})$. Its output at time step $t$, denoted $\mathbf{y}_{(t)}$, is also a function of the previous state and the current inputs. For the basic cells discussed so far, the output is simply equal to the state, but this is not the case for more complex cells.
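
The recurrence $\mathbf{h}_{(t)} = f(\mathbf{h}_{(t-1)}, \mathbf{x}_{(t)})$ can be sketched as a step function that takes the previous state and returns both an output and a new state. The function names `simple_cell` and `run_cell`, the `tanh` activation, and the sizes below are hypothetical, picked only for this example.

```python
import numpy as np

def simple_cell(h_prev, x_t, W_x, W_h, b):
    """One step of a basic recurrent cell: returns (output, new_state)."""
    h_t = np.tanh(x_t @ W_x + h_prev @ W_h + b)
    return h_t, h_t  # for this basic cell, the output equals the state

def run_cell(xs, W_x, W_h, b):
    """Apply the cell to each time step in turn, carrying the state along."""
    h = np.zeros(W_h.shape[0])  # initial state: all zeros
    outputs = []
    for x_t in xs:
        y_t, h = simple_cell(h, x_t, W_x, W_h, b)
        outputs.append(y_t)
    return np.stack(outputs), h

rng = np.random.default_rng(0)
n_inputs, n_units, n_steps = 2, 3, 5  # illustrative sizes
W_x = 0.5 * rng.normal(size=(n_inputs, n_units))
W_h = 0.5 * rng.normal(size=(n_units, n_units))
b = np.zeros(n_units)
xs = rng.normal(size=(n_steps, n_inputs))

outputs, final_state = run_cell(xs, W_x, W_h, b)

# Perturb only the first input and rerun: the last output changes too,
# which is the "memory" carried by the state.
xs_changed = xs.copy()
xs_changed[0] += 1.0
outputs_changed, _ = run_cell(xs_changed, W_x, W_h, b)
```

Returning the output and the state separately, even though they coincide here, leaves room for cells whose output differs from their state.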

### 15.1.2 Input and Output Sequences

**Sequence-to-sequence network** - An RNN that can simultaneously take a sequence of inputs and produce a sequence of outputs (**see top-left Figure 15-4 in book**). 

- Predicting time series such as stock prices: feed it prices over the last *N* days, and it must output the prices shifted by one day into the future.
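
As a small illustration of "shifted by one day into the future", the inputs and targets of such a network are two overlapping windows of the same series. The price values below are made up for the example.

```python
import numpy as np

# Made-up daily prices, for illustration only
prices = np.array([10.0, 11.0, 12.5, 12.0, 13.0, 14.0])
N = 5

inputs = prices[:N]        # prices for days 0 .. N-1
targets = prices[1:N + 1]  # the same window shifted one day into the future

for x, y in zip(inputs, targets):
    print(f"input {x:5.1f} -> target {y:5.1f}")
```

At every time step the target is the next day's price, so the network produces one prediction per input rather than a single prediction at the end.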

**Sequence-to-vector network** - Feed the network a sequence of inputs and ignore all outputs except for the last one (**see top-right Figure 15-4 in book**).

- You can feed the network a sequence of words corresponding to a movie review, and the network would output a sentiment score.

**Vector-to-sequence network** - Feed the network the same input vector over and over at each time step and let it output a sequence (**see bottom-left Figure 15-4 in book**).

- The input can be an image (or the output of a CNN) and the output could be a caption for that image.
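
The three network types above differ mainly in which inputs are fed in and which outputs are kept. The sketch below uses one hypothetical helper, `run_rnn`, with random untrained weights and a `tanh` activation (all assumptions), just to show the shapes involved.

```python
import numpy as np

def run_rnn(inputs, W_x, W_y, b):
    """Unrolled simple RNN; returns the output at every time step."""
    y = np.zeros(W_y.shape[0])
    outputs = []
    for x_t in inputs:
        y = np.tanh(x_t @ W_x + y @ W_y + b)
        outputs.append(y)
    return np.stack(outputs)  # shape (n_steps, n_neurons)

rng = np.random.default_rng(1)
n_steps, n_inputs, n_neurons = 6, 2, 4  # illustrative sizes
W_x = rng.normal(size=(n_inputs, n_neurons))
W_y = rng.normal(size=(n_neurons, n_neurons))
b = np.zeros(n_neurons)

sequence = rng.normal(size=(n_steps, n_inputs))

# Sequence-to-sequence: keep the output at every time step
seq_to_seq = run_rnn(sequence, W_x, W_y, b)

# Sequence-to-vector: ignore all outputs except the last one
seq_to_vec = seq_to_seq[-1]

# Vector-to-sequence: feed the same input vector at each time step
vector = rng.normal(size=n_inputs)
repeated = np.tile(vector, (n_steps, 1))  # shape (n_steps, n_inputs)
vec_to_seq = run_rnn(repeated, W_x, W_y, b)

print(seq_to_seq.shape, seq_to_vec.shape, vec_to_seq.shape)
```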

**Encoder-Decoder** - A two-step model consisting of a sequence-to-vector network (**encoder**) followed by a vector-to-sequence network (**decoder**) (**see bottom-right Figure 15-4 in book**).

- Translating a sentence from one language to another: feed the network a sentence in one language; the encoder converts it into a single vector representation, and the decoder then decodes this vector into a sentence in the other language.
- Works much better than translating with a single sequence-to-sequence RNN: the last words of a sentence can affect the first words of the translation, so you need to wait until you have seen the whole sentence before translating it.
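
A rough sketch of the two-step structure, assuming simple `tanh` cells with random, untrained weights: the encoder reads the whole source sequence and keeps only its final state, and the decoder receives that vector at every step. The names `encode`, `decode`, `W_c`, and `W_out` are hypothetical, and a real translation model would be trained and would work on word representations rather than random vectors.

```python
import numpy as np

def encode(source, W_x, W_h, b):
    """Sequence-to-vector: read the whole source, return only the final state."""
    h = np.zeros(W_h.shape[0])
    for x_t in source:
        h = np.tanh(x_t @ W_x + h @ W_h + b)
    return h

def decode(context, n_steps, W_c, W_h, b, W_out):
    """Vector-to-sequence: feed the same context vector at every step."""
    h = np.zeros(W_h.shape[0])
    outputs = []
    for _ in range(n_steps):
        h = np.tanh(context @ W_c + h @ W_h + b)
        outputs.append(h @ W_out)
    return np.stack(outputs)

rng = np.random.default_rng(7)
n_in, n_hidden, n_out = 3, 8, 2  # illustrative sizes

# Separate (untrained) parameters for the encoder and the decoder
enc = (0.5 * rng.normal(size=(n_in, n_hidden)),
       0.5 * rng.normal(size=(n_hidden, n_hidden)),
       np.zeros(n_hidden))
dec = (0.5 * rng.normal(size=(n_hidden, n_hidden)),
       0.5 * rng.normal(size=(n_hidden, n_hidden)),
       np.zeros(n_hidden),
       0.5 * rng.normal(size=(n_hidden, n_out)))

source = rng.normal(size=(5, n_in))  # a 5-step input sequence
context = encode(source, *enc)
translation = decode(context, 4, *dec)  # a 4-step output sequence

# Changing only the LAST source step changes the FIRST decoded step,
# because decoding starts only after the whole source has been read.
source_changed = source.copy()
source_changed[-1] += 1.0
translation_changed = decode(encode(source_changed, *enc), 4, *dec)
```

Note that the input and output sequences can have different lengths, since the only link between them is the context vector.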

## 15.2 Training RNNs

**Backpropagation through time (BPTT)** - To train an RNN, unroll it through time and then use regular backpropagation (**see Figure 15-5 in book**).

General outline to train RNN:
1. Compute forward pass through the unrolled network.
2. Evaluate the output sequence using a cost function $C(\mathbf{Y}_{(0)}, \mathbf{Y}_{(1)},..., \mathbf{Y}_{(T)})$, where $T$ is the max time step.
3. Propagate the gradients of that cost function backward through the unrolled network.
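    - Note: The gradients flow backward through all the outputs used by the cost function, not just through the final output. Because the same parameters are used at every time step, their gradients are summed over the time steps.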
4. Update model parameters using the gradients computed during BPTT.
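
The four steps above can be sketched for a single recurrent neuron with scalar weights. This is a hand-written illustration under several assumptions (a `tanh` activation, a mean squared error cost over all time steps, a made-up sine series, plain gradient descent, and an arbitrary learning rate); in practice a deep learning framework computes these gradients automatically.

```python
import numpy as np

def forward(params, xs):
    """Step 1: forward pass through the unrolled network."""
    w_x, w_y, b = params
    ys = np.zeros(len(xs))
    y_prev = 0.0
    for t, x_t in enumerate(xs):
        ys[t] = np.tanh(w_x * x_t + w_y * y_prev + b)
        y_prev = ys[t]
    return ys

def cost(ys, targets):
    """Step 2: cost over the whole output sequence (MSE is an assumption)."""
    return np.mean((ys - targets) ** 2)

def bptt(params, xs, targets):
    """Step 3: propagate gradients backward through the unrolled network."""
    w_x, w_y, b = params
    ys = forward(params, xs)
    T = len(xs)
    grads = np.zeros(3)
    dy_next = 0.0  # gradient reaching y_t from time step t+1
    for t in reversed(range(T)):
        # y_t affects the cost directly and through the next time step
        dy = 2.0 * (ys[t] - targets[t]) / T + dy_next
        dz = dy * (1.0 - ys[t] ** 2)  # back through tanh
        y_prev = ys[t - 1] if t > 0 else 0.0
        # Shared parameters: sum the contributions of every time step
        grads += dz * np.array([xs[t], y_prev, 1.0])
        dy_next = dz * w_y
    return grads

def numerical_grads(params, xs, targets, eps=1e-6):
    """Central finite differences, used only to check the BPTT gradients."""
    grads = np.zeros_like(params)
    for i in range(len(params)):
        step = np.zeros_like(params)
        step[i] = eps
        grads[i] = (cost(forward(params + step, xs), targets)
                    - cost(forward(params - step, xs), targets)) / (2 * eps)
    return grads

series = np.sin(np.linspace(0, 3, 9))  # made-up series
xs, targets = series[:-1], series[1:]  # targets shifted by one step
params = np.array([0.5, 0.5, 0.0])     # w_x, w_y, b

analytic = bptt(params, xs, targets)
numeric = numerical_grads(params, xs, targets)

# Step 4: update the parameters (plain gradient descent, assumed rate)
learning_rate = 0.05
cost_before = cost(forward(params, xs), targets)
params_new = params - learning_rate * analytic
cost_after = cost(forward(params_new, xs), targets)
```

Comparing `analytic` with `numeric` is a common sanity check for hand-written gradients; the two should agree closely if the backward pass is correct.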

## 15.3 Forecasting a Time Series

### 15.3.1 Baseline Metrics

### 15.3.2 Implementing a Simple RNN

### 15.3.3 Deep RNNs

### 15.3.4 Forecasting Several Time Steps Ahead

## 15.4 Handling Long Sequences

### 15.4.1 Fighting the Unstable Gradients Problem

### 15.4.2 Tackling the Short-Term Memory Problem