# What are Recurrent Neural Networks (RNNs)?

## Motivation

> Much of the data found in the real world is sequential data - its order matters

Examples:
- X-Y-Z coordinates of an object over time
- Price of an artwork over time
- Text
- Audio waveforms

To process sequential data, we first need to represent it numerically:

![](./images/Sequential%20Features.png)

You can think of each input as a vector representing a single timestep:

![](./images/Sequential%20Data%20Single%20Example%20Detailed.png)

Sequential data can appear in many datasets that we want to model.
These datasets typically contain complex input-output relationships, so we want to use neural networks to model them.

The time dimension makes the features of each example at least 2-D: the first axis indexes the timestep, and the second axis indexes the feature. The problem is that feedforward neural networks can only process fixed-size vectors. The naive solution would be to flatten all of this data into a single vector, but that causes issues of its own.

The problems with flattening sequential data:
- Different input sequences can have different lengths, so the flattened vectors vary in size, but feedforward networks can only process fixed-size vectors.
    - They cannot be resized like images can without losing potentially important information.
- The model should look for the same patterns in different parts of the input sequence, but a feedforward network applied to a flattened vector learns separate weights for every position.
    - Reusing the same weights at every timestep is known as parameter sharing across time.
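
To make the first problem concrete, here is a minimal sketch using two made-up sequences of X-Y-Z coordinates. The values and the `flatten` helper are illustrative assumptions, not part of any library.

```python
# Two hypothetical sequences of (x, y, z) coordinates with different lengths.
short_seq = [[0.0, 0.1, 0.2], [0.3, 0.4, 0.5]]                   # 2 timesteps
long_seq = [[0.0, 0.1, 0.2], [0.3, 0.4, 0.5], [0.6, 0.7, 0.8]]   # 3 timesteps


def flatten(sequence):
    """Concatenate every timestep's features into one flat vector."""
    return [value for timestep in sequence for value in timestep]


flat_short = flatten(short_seq)
flat_long = flatten(long_seq)

print(len(flat_short))  # 6
print(len(flat_long))   # 9
```

A feedforward layer built for 6 input features has no way to accept the 9-feature vector, even though both examples describe the same kind of data.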

Recurrent Neural Networks (RNNs) can tackle these problems.
- They process the data sequentially.
- They use the same weights at different time steps.
- They keep an internal hidden state that contains information from the part of the sequence that has been processed so far.

![](./images/RNN%20Hidden%20State%20Equation.png)

RNNs work as follows:
- The hidden state of each recurrent layer is initialised, typically as a vector of zeros
- At each timestep, an input vector (data from a single timestep, such as a word embedding or an X-Y-Z coordinate) is fed to the RNN
- The new hidden state of the first recurrent layer is computed by summing the outputs of the following two linear layers and passing the result through a non-linearity (typically tanh):
    - A linear layer that computes weighted combinations of the input vector's raw features
    - A linear layer that computes weighted combinations of the previous hidden state
- Deeper recurrent layers do the same, combining their own previous hidden state with the new hidden state of the layer below
- A model head (such as a classification head) maps the hidden state of the last recurrent layer to a prediction
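
The steps above can be sketched for a single recurrent layer in plain Python. This is a minimal illustration, not a reference implementation: it assumes the common formulation in which the two linear outputs are summed and passed through tanh, and the names `matvec`, `rnn_step`, `rnn_forward`, `W_xh`, `W_hh` and `b_h` are invented for this example. The weights are random rather than trained.

```python
import math
import random


def matvec(matrix, vector):
    """Multiply a matrix (a list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def rnn_step(x_t, h_prev, W_xh, W_hh, b_h):
    """One timestep: h_t = tanh(W_xh @ x_t + W_hh @ h_prev + b_h)."""
    from_input = matvec(W_xh, x_t)       # weighted combinations of the input features
    from_hidden = matvec(W_hh, h_prev)   # weighted combinations of the previous hidden state
    return [math.tanh(a + b + c) for a, b, c in zip(from_input, from_hidden, b_h)]


def rnn_forward(sequence, W_xh, W_hh, b_h):
    """Run the cell over a whole sequence and return every hidden state."""
    h = [0.0] * len(b_h)                 # initial hidden state: a vector of zeros
    hidden_states = []
    for x_t in sequence:                 # the same weights are reused at every timestep
        h = rnn_step(x_t, h, W_xh, W_hh, b_h)
        hidden_states.append(h)
    return hidden_states


random.seed(0)
input_size, hidden_size = 3, 4
W_xh = [[random.uniform(-0.5, 0.5) for _ in range(input_size)] for _ in range(hidden_size)]
W_hh = [[random.uniform(-0.5, 0.5) for _ in range(hidden_size)] for _ in range(hidden_size)]
b_h = [0.0] * hidden_size

# A hypothetical sequence of three X-Y-Z coordinates.
sequence = [[0.0, 0.1, 0.2], [0.3, 0.4, 0.5], [0.6, 0.7, 0.8]]
hidden_states = rnn_forward(sequence, W_xh, W_hh, b_h)
final_state = hidden_states[-1]          # what a model head would typically consume
print(len(hidden_states), len(final_state))  # 3 4
```

In practice you would use a framework layer rather than hand-written loops, but the structure is the same: one loop over timesteps and one set of weights.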

![](./images/RNN%20Predictions.png)

> Recurrent networks can process sequences of varying lengths

> Recurrent networks can handle many different input-output configurations, such as one-to-many, many-to-one and many-to-many

![](./images/X-to-X%20RNNs.png)

> Recurrent networks look for the same things at different points in time, by using the same weights across different timesteps

> Recurrent networks process inputs sequentially
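
These properties can be seen in a toy scalar RNN. The sketch below is an illustration under assumed values: the function `run_rnn` and the weights `w_x` and `w_h` are hypothetical and untrained.

```python
import math


def run_rnn(sequence, w_x=0.8, w_h=0.5):
    """Scalar RNN: the same two weights are applied at every timestep."""
    h = 0.0
    states = []
    for x_t in sequence:
        h = math.tanh(w_x * x_t + w_h * h)
        states.append(h)
    return states


# Sequences of different lengths go through the same function unchanged.
short_states = run_rnn([1.0, 0.0])
long_states = run_rnn([1.0, 0.0, 0.0, 1.0, 0.5])

# Many-to-many: use one state per timestep. Many-to-one: use only the last state.
per_step_outputs = long_states
final_output = long_states[-1]

# Because inputs are processed in order, reordering them changes the result.
forward = run_rnn([1.0, 0.0])[-1]
reverse = run_rnn([0.0, 1.0])[-1]
print(forward != reverse)  # True
```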

## Limitations of RNNs

1. Difficulty in training: RNNs can be difficult to train, especially for long sequences. This is due to the vanishing and exploding gradient problem, where the gradients either become very small or very large as they are backpropagated through the network. This can make it difficult for the network to learn long-term dependencies.

1. Limited ability to capture long-term dependencies: RNNs can capture some long-term dependencies, but they struggle when the related elements are separated by many timesteps. Every item processed in between modifies the hidden state, so information about the earlier element is gradually overwritten.

1. Sensitivity to initialisation: Like other neural networks, RNNs can be sensitive to the initial values chosen for their parameters and hidden states. It is not yet well understood how parameter initialisation affects optimisation or generalisation.

1. Computational cost: RNNs can be slow on long sequences because they cannot be parallelised along the time dimension: each timestep depends on the hidden state of the previous one. For a sequence of length $T$, each parameter update therefore requires $O(T)$ sequential steps. This can be a limitation when working with large datasets or when real-time processing is required.

1. Difficulty in interpreting results: RNNs can be difficult to interpret, as it is not clear exactly what is represented by their hidden states just by looking at them. This can make it difficult to understand the decisions made by the network and how it is using the input data.
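
The vanishing and exploding gradient problem from the first limitation can be illustrated with a deliberately simplified model. The sketch below assumes a linear scalar RNN, $h_t = w \cdot h_{t-1}$, for which the gradient of the final state with respect to the initial state is $w^T$. Real RNNs involve matrices and non-linearities, so this only conveys the intuition. The function name `gradient_through_time` is made up for this example.

```python
def gradient_through_time(recurrent_weight, num_steps):
    """Gradient of h_T with respect to h_0 for the linear scalar RNN h_t = w * h_{t-1}."""
    gradient = 1.0
    for _ in range(num_steps):
        gradient *= recurrent_weight   # chain rule: one factor of w per timestep
    return gradient


for steps in (10, 50, 100):
    vanishing = gradient_through_time(0.9, steps)
    exploding = gradient_through_time(1.1, steps)
    print(f"T={steps:3d}  w=0.9 -> {vanishing:.2e}   w=1.1 -> {exploding:.2e}")
```

With a weight slightly below 1 the gradient shrinks towards zero as the sequence grows, and with a weight slightly above 1 it blows up. The loop also shows why the computation is sequential: each step needs the result of the previous one.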

## Training RNNs

### Teacher Forcing

"Teacher forcing" is a technique used when training a sequence-to-sequence model, such as a recurrent neural network (RNN) with attention. At each time step, the decoder receives the true target token from the previous time step as its input, rather than the model's own prediction from the previous time step.

In other words, when teacher forcing is used, the decoder is "forced" to generate the next output based on the true target sequence, rather than its own predicted output at the previous time step. This can help the model learn faster and more accurately. However, the model never sees its own mistakes as inputs during training, so it may generate poor output at inference time, when the true targets are not available and it has to rely on its own predictions.
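
The difference between the two modes comes down to one line in the decoding loop. The sketch below uses a made-up `toy_decoder_step` function in place of a trained decoder; it is deliberately inaccurate and exists only to show which token is fed back in. All names and token values are assumptions for illustration.

```python
def toy_decoder_step(previous_token, hidden):
    """Hypothetical stand-in for a decoder: returns (prediction, new_hidden)."""
    new_hidden = hidden + previous_token
    prediction = new_hidden % 5          # deliberately imperfect "model"
    return prediction, new_hidden


def decode(target, start_token, teacher_forcing):
    """Decode one step per target token, recording what the decoder was fed."""
    hidden = 0
    previous_token = start_token
    decoder_inputs, predictions = [], []
    for true_token in target:
        decoder_inputs.append(previous_token)
        prediction, hidden = toy_decoder_step(previous_token, hidden)
        predictions.append(prediction)
        # Teacher forcing feeds the ground truth; otherwise feed the model's own output.
        previous_token = true_token if teacher_forcing else prediction
    return decoder_inputs, predictions


target = [1, 2, 3, 4]
forced_inputs, forced_predictions = decode(target, start_token=0, teacher_forcing=True)
free_inputs, free_predictions = decode(target, start_token=0, teacher_forcing=False)

print(forced_inputs)  # [0, 1, 2, 3]: the start token followed by the shifted target
print(free_inputs)    # the start token followed by the model's own predictions
```

With teacher forcing, the decoder inputs are always the target sequence shifted by one position, regardless of what the model predicted. Without it, an early wrong prediction is fed back in and can affect every later step.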