# Chapter 14: Recurrent Neural Networks

Recurrent neural networks (RNNs) are a type of artificial neural network designed to make predictions about sequences of data. They are useful for _natural language processing_ (NLP) tasks such as automatic translation, speech-to-text, and _sentiment analysis_.

Some RNNs, such as those in [Google's Magenta project](https://cdn2.vox-cdn.com/uploads/chorus_asset/file/6577761/Google_-_Magenta_music_sample.0.mp3), can write their own melodies. RNNs can also [generate sentences](https://karpathy.github.io/2015/05/21/rnn-effectiveness/), [image captions](https://arxiv.org/pdf/1411.4555v2.pdf), and more.

## Recurrent Neurons

Up until now, we have looked mainly at feed-forward neural networks, where data flows in only one direction, from the inputs to the outputs. Neurons in an RNN depend not only on the input data but also on their own outputs from previous steps of the sequence. At each _time step_ (or _frame_) $t$, a _recurrent neuron_ receives the input $\mathbf{x}_{(t)}$ as well as its own output from the previous frame, $y_{(t-1)}$. For a layer of recurrent neurons, the previous outputs form a vector $\mathbf{y}_{(t-1)}$, and the output of the layer is given by

$$ \mathbf{y}_{(t)} = \phi\left( \mathbf{W}_x^{\,T} \cdot \mathbf{x}_{(t)} + \mathbf{W}_y^{\,T} \cdot \mathbf{y}_{(t-1)} + \mathbf{b} \right) $$

where

- $\mathbf{W}_x$ and $\mathbf{W}_y$ are the weight matrices of the recurrent neuron layer.

- $\mathbf{y}_{(t-1)}$ is the output of the layer at the previous frame.

- $\mathbf{b}$ is the bias vector.

- $\phi$ is the layer's activation function.
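
As a rough illustration, the single-instance equation above can be written as one NumPy function. The sizes, the random weights, and the choice of `tanh` for $\phi$ are assumptions made for this sketch, and `recurrent_step` is a hypothetical helper name.

```python
import numpy as np

rng = np.random.default_rng(0)
n_inputs, n_neurons = 3, 5  # illustrative sizes, not from the text

# Randomly initialised parameters (assumption: no training happens here).
W_x = rng.normal(size=(n_inputs, n_neurons))
W_y = rng.normal(size=(n_neurons, n_neurons))
b = np.zeros(n_neurons)

def recurrent_step(x_t, y_prev, W_x, W_y, b, phi=np.tanh):
    """Compute y_(t) = phi(W_x^T x_(t) + W_y^T y_(t-1) + b) for one instance."""
    return phi(W_x.T @ x_t + W_y.T @ y_prev + b)

x_t = rng.normal(size=n_inputs)   # input vector at frame t
y_prev = np.zeros(n_neurons)      # no previous output at the first frame
y_t = recurrent_step(x_t, y_prev, W_x, W_y, b)
```

Starting from an all-zero previous output is a common convention for the first frame; the equation itself does not require it.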

You can compute a layer's output for a whole mini-batch at once by placing all the inputs at frame $t$ into a matrix $\mathbf{X}_{(t)}$. The output is then given by

$$ \begin{aligned} \mathbf{Y}_{(t)} &= \phi\left( \mathbf{X}_{(t)} \cdot \mathbf{W}_x + \mathbf{Y}_{(t-1)} \cdot \mathbf{W}_y + \mathbf{b} \right) \\
&= \phi\left( \left[ \mathbf{X}_{(t)} \; \mathbf{Y}_{(t-1)} \right] \cdot \mathbf{W} + \mathbf{b} \right) \;\; \text{with} \;\; \mathbf{W} = \left[ \begin{matrix} \mathbf{W}_x \\ \mathbf{W}_y \end{matrix} \right] \end{aligned} $$

where

- $\mathbf{Y}_{(t)}$ is an $m \times n_\text{neurons}$ matrix containing the layer's outputs at frame $t$ for each of the $m$ instances in the mini-batch.

- $\mathbf{X}_{(t)}$ is an $m \times n_\text{inputs}$ matrix containing the inputs of all instances in the mini-batch at frame $t$.

- $\mathbf{W}_x$ is an $n_\text{inputs} \times n_\text{neurons}$ matrix containing the connection weights for the inputs at the current frame.

- $\mathbf{W}_y$ is an $n_\text{neurons} \times n_\text{neurons}$ matrix containing the connection weights for the outputs of the previous frame.

- $\mathbf{b}$ is the bias vector of size $n_\text{neurons}$.

- The weight matrices $\mathbf{W}_x$ and $\mathbf{W}_y$ are often concatenated vertically into a single $\left( n_\text{inputs} + n_\text{neurons} \right) \times n_\text{neurons}$ matrix, $\mathbf{W}$.

- The notation $\left[ \mathbf{X}_{(t)} \; \mathbf{Y}_{(t-1)} \right]$ represents the horizontal concatenation of the matrices $\mathbf{X}_{(t)}$ and $\mathbf{Y}_{(t-1)}$.
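
The two forms of the mini-batch equation can be checked numerically. The sketch below assumes made-up sizes, random values, and `tanh` as the activation; it only verifies that the concatenated form gives the same result as the two-term form.

```python
import numpy as np

rng = np.random.default_rng(42)
m, n_inputs, n_neurons = 4, 3, 5  # illustrative sizes

X_t = rng.normal(size=(m, n_inputs))      # inputs at frame t
Y_prev = rng.normal(size=(m, n_neurons))  # outputs at frame t-1
W_x = rng.normal(size=(n_inputs, n_neurons))
W_y = rng.normal(size=(n_neurons, n_neurons))
b = rng.normal(size=n_neurons)            # broadcast across the m rows

# Two-term form: X W_x + Y_prev W_y + b
Y_two_terms = np.tanh(X_t @ W_x + Y_prev @ W_y + b)

# Concatenated form: [X Y_prev] W + b, with W_x stacked on top of W_y
W = np.vstack([W_x, W_y])        # (n_inputs + n_neurons, n_neurons)
XY = np.hstack([X_t, Y_prev])    # (m, n_inputs + n_neurons)
Y_concat = np.tanh(XY @ W + b)

forms_agree = np.allclose(Y_two_terms, Y_concat)
```

The concatenated form replaces two matrix products with one, which is one reason it is commonly used in implementations.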

### Memory Cells

Since the output of a recurrent neuron depends on its previous outputs, one can say it develops a form of _memory_. A part of a neural network which retains state across time steps is called a _memory cell_. A single recurrent neuron or a layer of recurrent neurons is a very _basic cell_.

In general, a cell's state at frame $t$, denoted $\mathbf{h}_{(t)}$, is a function of the inputs at frame $t$ as well as the state at the previous time step. In other words,

$$ \mathbf{h}_{(t)} = f \kern-0.15em \left(\mathbf{h}_{(t-1)}, \mathbf{x}_{(t)}\right). $$

The cell's output at frame $t$ is also a function of the previous state and the current inputs. In the case of the basic cell, the output is simply equal to the state, $\mathbf{y}_{(t)} = \mathbf{h}_{(t)}$, but for other types of memory cells this is not the case.
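
The recurrence $\mathbf{h}_{(t)} = f(\mathbf{h}_{(t-1)}, \mathbf{x}_{(t)})$ can be sketched as a loop that threads the state through every frame. The helper names `basic_cell` and `unroll`, the sizes, the weight scale, and the `tanh` activation are all assumptions for illustration.

```python
import numpy as np

def basic_cell(h_prev, x_t, W_x, W_h, b):
    """One step of a basic cell: returns (output, new_state), which coincide."""
    h_t = np.tanh(x_t @ W_x + h_prev @ W_h + b)
    return h_t, h_t

def unroll(cell, xs, h0, *params):
    """Apply h_(t) = f(h_(t-1), x_(t)) to every frame in xs, in order."""
    h, outputs = h0, []
    for x_t in xs:
        y_t, h = cell(h, x_t, *params)
        outputs.append(y_t)
    return np.stack(outputs), h

rng = np.random.default_rng(0)
n_steps, n_inputs, n_neurons = 5, 3, 4  # illustrative sizes
W_x = 0.5 * rng.normal(size=(n_inputs, n_neurons))
W_h = 0.5 * rng.normal(size=(n_neurons, n_neurons))
b = np.zeros(n_neurons)

xs = rng.normal(size=(n_steps, n_inputs))
outputs, h_final = unroll(basic_cell, xs, np.zeros(n_neurons), W_x, W_h, b)

# Perturb only the first input and rerun: the last output is affected too,
# because the state carries information forward across frames.
xs_alt = xs.copy()
xs_alt[0] += 1.0
outputs_alt, _ = unroll(basic_cell, xs_alt, np.zeros(n_neurons), W_x, W_h, b)
```

Returning the output and the state separately keeps `unroll` usable with cells whose output differs from their state.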

### Input and Output Sequences

An RNN can take a sequence of inputs and produce a sequence of outputs; this is called a sequence-to-sequence network. It is useful for predicting the next terms of a time series, such as future stock prices based on previous prices. For example, you can feed it the prices for the last $N$ days and have it output the same prices shifted one day into the future, i.e. from $N-1$ days ago up to tomorrow.
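
The shifted-by-one-day target can be made concrete by building training pairs from a series. In this sketch, `make_windows` is a hypothetical helper and the series is a placeholder ramp, not real price data.

```python
import numpy as np

def make_windows(series, n_steps):
    """Pair each window of n_steps values with the same window shifted by one step."""
    n_windows = len(series) - n_steps
    X = np.stack([series[i : i + n_steps] for i in range(n_windows)])
    Y = np.stack([series[i + 1 : i + n_steps + 1] for i in range(n_windows)])
    return X, Y

prices = np.arange(10.0)  # placeholder values standing in for daily prices
N = 4                     # number of days fed to the network
X, Y = make_windows(prices, N)

# Each target row overlaps its input row except for one new value at the end,
# which plays the role of "tomorrow".
overlap_matches = np.array_equal(Y[:, :-1], X[:, 1:])
```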

You can also feed the network a sequence of inputs and ignore all outputs except the last one; this is called a sequence-to-vector network. For example, the input could be a movie review and the output could be a measure of the sentiment of the review (positive or negative).
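
A sequence-to-vector pass can be sketched by overwriting the output at every frame and keeping only the final one. The random "review" vectors, the untrained weights, and the logistic read-out that maps the last output to a score are assumptions for this example, not a trained sentiment model.

```python
import numpy as np

rng = np.random.default_rng(1)
n_steps, n_inputs, n_neurons = 6, 8, 4  # illustrative sizes

# Stand-in for a review: one made-up feature vector per word.
review = rng.normal(size=(n_steps, n_inputs))

W_x = 0.5 * rng.normal(size=(n_inputs, n_neurons))
W_y = 0.5 * rng.normal(size=(n_neurons, n_neurons))
b = np.zeros(n_neurons)
w_out = rng.normal(size=n_neurons)  # hypothetical read-out weights

y = np.zeros(n_neurons)
for x_t in review:
    # Intermediate outputs are overwritten: only the last one is kept.
    y = np.tanh(x_t @ W_x + y @ W_y + b)

# Squash the last output into (0, 1) as a sentiment-like score.
score = 1.0 / (1.0 + np.exp(-(y @ w_out)))
```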

You can also have a vector-to-sequence network, which takes a single input vector and outputs a sequence, e.g. a model which generates captions for images.

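Finally, you can chain a sequence-to-vector network, called an _encoder_, with a vector-to-sequence network, called a _decoder_. This two-step model, called an Encoder-Decoder, works well for tasks like translating a sentence from one language to another. In fact, it outperforms a single sequence-to-sequence RNN on that task, because the last words of a sentence can affect the first words of its translation, so it helps to read the whole sentence before translating it.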
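
The Encoder-Decoder idea can be sketched with two basic cells: one compresses the input sequence into a vector, the other expands that vector into an output sequence of a different length. Everything specific here is an assumption for illustration: the sizes, the untrained random weights, the zero vector used as a start signal, and the choice to feed each decoder output back in as the next decoder input.

```python
import numpy as np

rng = np.random.default_rng(7)
n_in, n_hidden, n_out = 3, 5, 2  # illustrative sizes

# Encoder parameters (sequence-to-vector).
E_x = 0.5 * rng.normal(size=(n_in, n_hidden))
E_h = 0.5 * rng.normal(size=(n_hidden, n_hidden))

# Decoder parameters (vector-to-sequence).
D_y = 0.5 * rng.normal(size=(n_out, n_hidden))
D_h = 0.5 * rng.normal(size=(n_hidden, n_hidden))
D_o = 0.5 * rng.normal(size=(n_hidden, n_out))

def encode(xs):
    """Read the whole input sequence and return the final state as a vector."""
    h = np.zeros(n_hidden)
    for x_t in xs:
        h = np.tanh(x_t @ E_x + h @ E_h)
    return h

def decode(h, n_steps):
    """Generate n_steps outputs starting from the encoder's vector."""
    y = np.zeros(n_out)  # assumed start signal
    ys = []
    for _ in range(n_steps):
        h = np.tanh(y @ D_y + h @ D_h)
        y = h @ D_o       # previous output becomes the next input
        ys.append(y)
    return np.stack(ys)

source = rng.normal(size=(7, n_in))   # input sequence of 7 frames
context = encode(source)              # single vector summarising the input
target = decode(context, n_steps=4)   # output sequence of 4 frames
```

Because the decoder starts only after the encoder has consumed every input frame, the input and output sequences are free to have different lengths.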