# Recurrent Networks:


<img src="images/recurrent-network.png" width="50%">


Regular feed-forward networks aren't well suited to processing sequential input data, and they can't effectively learn long-range dependencies.
- A fixed-size input layer is problematic when the training samples are sequences of varying length
- With some training samples, long-range dependencies have to be learnt, which would be difficult for a deep feed-forward network to identify and generalise to
    - Eg. Suppose we need a model to predict the next word in the sentence "*France is where I grew up, but I now live in Boston. I speak fluent* \_\_\_\_\_". The answer depends on "France", which appears many words earlier, and regular feed-forward architectures simply can't generalise well enough to learn this long-range dependency

Recurrent neural networks, on the other hand, are well suited to processing sequential data such as audio signals and English text. A recurrent architecture addresses the following:
1. Handling variable-length sequential inputs
2. Tracking long-range dependencies
3. Paying attention to the order of values in the sequence
4. Sharing parameters across the sequence (similar to what convolutional neural networks do with their kernels)


For example, a recurrent neural network can produce a single output for a whole variable-length sequence, or it can produce an output at each step of the sequence.
<img src="images/recurrent-network-sequence-modeling.png" width="50%">


A regular feedforward network proceeds like this: 
$$\texttt{input} \to \texttt{hidden layers} \to \texttt{output}.$$
It's effectively a closed system: no *context* or *memory of the past* is factored into the prediction. A recurrent neural network introduces *memory* by keeping track of the hidden layer activations from previous timesteps. The predictive model then looks like this:
$$\texttt{input + previous hidden layer activations} \to \texttt{hidden layers} \to \texttt{output}.$$

With this model, the output is still deterministic given the hidden layer, but a specific set of hidden layer values is now only reachable *with the right sequence of inputs*.
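
Written as equations, one common way to express this update is the following sketch. The $\tanh$ activation and the bias terms $b_h$, $b_y$ are assumptions (other choices are possible); the weight names follow the $W_{xh}, W_{hh}, W_{hy}$ used with the unrolled-network figure in these notes:

$$h_t = \tanh(W_{xh}\, x_t + W_{hh}\, h_{t-1} + b_h), \qquad \hat{y}_t = W_{hy}\, h_t + b_y.$$

Here $h_{t-1}$ plays the role of the "previous hidden layer activations", and $h_0$ is typically taken to be a zero vector since the first timestep has no past.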

From <a href="http://colah.github.io/posts/2015-08-Understanding-LSTMs/">here</a>: "Humans don’t start their thinking from scratch every second. As you read this essay, you understand each word based on your understanding of previous words. You don’t throw everything away and start thinking from scratch again. Your thoughts have persistence.

Traditional neural networks can’t do this, and it seems like a major shortcoming. For example, imagine you want to classify what kind of event is happening at every point in a movie. It’s unclear how a traditional neural network could use its reasoning about previous events in the film to inform later ones."

### Recurrent Neural Networks:




<table>
    <tr>
        <td>    
            <p style="text-align: center;">
                <strong>Elman Recurrent Network</strong>
            </p>
            <img src="images/elman-recurrent-network.png">
        </td>
        <td>
            <p style="text-align: center;">
                <strong>Recurrent Network with shortcuts</strong>
            </p>
            <img src="images/recurent-network-shortcut.png">
        </td>
    </tr>
</table>


<img src="images/recurrent-network-hidden-state-transfer.gif" width="75%" />
<p style="text-align: center;" width="75%">The transfer of hidden layer activations across 4 timesteps. The first timestep has no context layer. The colour of the hidden layer nodes represents the memory of previous hidden layer values</p>

- __Recurrence of information:__ 
With each timestep, the input follows the standard feed-forward procedure to the output layer, but the hidden layer activations are also copied to the "context" layer (also called the "hidden state"), which is then fed in as additional input when subsequent timesteps go through the feed-forward procedure.
    - Eg. Once `l1` is fed through the network and the output layer is computed, the intermediate hidden layer values are copied over to nodes `s1`, `s2` and `s3` as additional inputs for when `l2` is passed through the network 
    - This allows information to persist across timesteps and be factored into future predictions

    
- Sometimes shortcut connections between the input layer and output layer are added. TODO: why is this helpful sometimes?
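
The recurrence described above can be sketched in a few lines of plain Python. This is a minimal illustration rather than a reference implementation: it assumes a bias-free Elman network with a $\tanh$ hidden layer, and the toy weight values and the helper names `matvec` and `rnn_forward` are made up for the example.

```python
import math

def matvec(W, v):
    """Multiply matrix W (a list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def rnn_forward(xs, W_xh, W_hh, W_hy):
    """Run a bias-free Elman RNN over a sequence of input vectors.

    Returns the output at every timestep and the final hidden state.
    """
    h = [0.0] * len(W_hh)   # the first timestep has no context, so start at zero
    ys = []
    for x in xs:            # the same three weight matrices are used at every timestep
        pre = [a + b for a, b in zip(matvec(W_xh, x), matvec(W_hh, h))]
        h = [math.tanh(p) for p in pre]   # new activations become the next context
        ys.append(matvec(W_hy, h))
    return ys, h

# Hypothetical toy weights: 2 inputs, 3 hidden units, 1 output.
W_xh = [[0.5, -0.3], [0.1, 0.8], [-0.6, 0.2]]
W_hh = [[0.2, 0.0, -0.1], [0.4, 0.3, 0.0], [0.0, -0.5, 0.1]]
W_hy = [[1.0, -1.0, 0.5]]

seq = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
ys, h_last = rnn_forward(seq, W_xh, W_hh, W_hy)
ys_rev, h_rev = rnn_forward(seq[::-1], W_xh, W_hh, W_hy)

print(len(ys))           # one output per timestep, whatever the sequence length
print(h_last != h_rev)   # the same inputs in a different order leave a different memory
```

Taking only `ys[-1]` gives the "one output per sequence" usage, while keeping all of `ys` gives an output for each step.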


<table>
    <tr>
        <td width="50%">  
            <p>
                A recurrent network can be 'unrolled' into an equivalent feed-forward network. 
            </p>
            <img src="images/recurrent-unrolled.png">
        </td>
        <td>
            <img src="images/recurrent-network-unrolled.png">
            <p>
               All the weight matrices $W_{xh}, W_{hh}, W_{hy}$ are reused at every timestep.
               $W_{hh}$ carries the previous hidden state into the computation of the next hidden state. The total loss $L$ is simply the sum of the individual losses computed at each timestep: $L = L_1 + L_2 + \dots$
            </p>
        </td>
    </tr>
</table>


See 21:00 in the MIT Recurrent Neural Networks video (linked under Resources).

### Backpropagation Through Time:
*Backpropagation through time* is backpropagation applied to the unrolled chain of network copies.

<table>
    <tr>
        <td>            
            <img src="images/backprop-through-time.png" width="100%">
            <p>
               At each timestep $t$, we produce a prediction $\hat{y}_t$, which results in a loss value $L_t$. Summing these over all $T$ timesteps gives $L = L_1 + L_2 + \dots + L_T$.
            </p>
        </td>
        <td>
            <img src="images/backpropagation-through-time.gif" width="100%">
            <p>
                The black node represents the prediction, the yellow node represents the error function value, and the mustard colour represents the derivatives.
            </p>
        </td>
    </tr>
</table>


Keep in mind that only 3 weight matrices are involved. Backpropagation through time can be viewed as regular backpropagation on the unrolled network, where every timestep reuses the same 3 weight matrices, so each matrix's gradient is the sum of its contributions from all timesteps.
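
To make the bookkeeping concrete, here is a small sketch of backpropagation through time for a deliberately tiny case: a scalar RNN where each of the three "matrices" is a single number (`w_x`, `w_h`, `w_y`). The squared-error loss, the $\tanh$ activation and all function names are assumptions chosen for illustration. The result is checked against finite differences.

```python
import math

def forward(xs, targets, w_x, w_h, w_y):
    """Scalar RNN: h_t = tanh(w_x*x_t + w_h*h_{t-1}), y_t = w_y*h_t.

    Returns the hidden states (with h_0 = 0 prepended) and the summed loss.
    """
    hs, loss = [0.0], 0.0
    for x, target in zip(xs, targets):
        hs.append(math.tanh(w_x * x + w_h * hs[-1]))
        loss += 0.5 * (w_y * hs[-1] - target) ** 2   # L = L_1 + L_2 + ...
    return hs, loss

def bptt(xs, targets, w_x, w_h, w_y):
    """Gradients of the summed loss with respect to the three shared weights."""
    hs, _ = forward(xs, targets, w_x, w_h, w_y)
    g_x = g_h = g_y = 0.0
    dh_next = 0.0                        # gradient arriving from timestep t+1
    for t in range(len(xs), 0, -1):      # walk the unrolled chain backwards
        err = w_y * hs[t] - targets[t - 1]
        g_y += err * hs[t]
        dh = err * w_y + dh_next         # this step's loss plus all later losses
        dpre = dh * (1.0 - hs[t] ** 2)   # back through tanh
        g_x += dpre * xs[t - 1]          # shared weight: contributions accumulate
        g_h += dpre * hs[t - 1]
        dh_next = dpre * w_h             # pass the gradient to timestep t-1
    return g_x, g_h, g_y

xs, targets = [1.0, -0.5, 0.25, 0.8], [0.2, -0.1, 0.4, 0.3]
w = (0.7, -0.4, 1.2)
grads = bptt(xs, targets, *w)

# Sanity check against central finite differences.
eps = 1e-6
numeric_grads = []
for i in range(3):
    plus, minus = list(w), list(w)
    plus[i] += eps
    minus[i] -= eps
    diff = forward(xs, targets, *plus)[1] - forward(xs, targets, *minus)[1]
    numeric_grads.append(diff / (2 * eps))

print(all(abs(g - n) < 1e-5 for g, n in zip(grads, numeric_grads)))
```

The `+=` lines are where weight sharing shows up: each timestep adds its own contribution to the gradient of the same weight.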

## LSTM and GRU Architecture

#### Short Term Memory Problem:
<img src="images/short-term-memory-problem.gif" width="25%">
The short-term memory problem is caused by the vanishing gradient problem: the influence of information from much earlier timesteps diminishes, so long-range dependencies aren't effectively learned by normal RNN architectures.
- In the above example, the first two 'layers' for "what" and "time" are not considered much in the final prediction
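
The fading influence of early timesteps can be seen numerically. The sketch below assumes a scalar $\tanh$ RNN with made-up weights (`w_x`, `w_h`) and a hypothetical helper name; it tracks how strongly the hidden state at step $t$ depends on the hidden state at step 1. By the chain rule this is a product of factors $w_h (1 - h_k^2)$, each smaller than 1 in magnitude here.

```python
import math

def influence_of_first_step(xs, w_x=1.0, w_h=0.9):
    """Return |d h_t / d h_1| for every t in a scalar tanh RNN.

    The derivative is the product of w_h * (1 - h_k^2) over k = 2..t.
    """
    h = math.tanh(w_x * xs[0])
    grad, grads = 1.0, [1.0]
    for x in xs[1:]:
        h = math.tanh(w_x * x + w_h * h)
        grad *= w_h * (1.0 - h * h)   # one more chain-rule factor per timestep
        grads.append(abs(grad))
    return grads

grads = influence_of_first_step([0.5] * 30)
print(grads[1], grads[10], grads[29])   # shrinks towards zero as the gap grows
```

The same product appears in backpropagation through time, which is why the weights receive almost no learning signal from dependencies that span many timesteps.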
    
LSTM (*long short-term memory*) and GRU (*gated recurrent unit*) are two kinds of recurrent neural network architecture designed to combat the short-term memory problem in normal recurrent neural networks.
- The use of *gates* allows for better learning of long-range dependencies. These gates are learned tensor operations that decide what information to add to or remove from the hidden state.


### LSTMs:
A basic recurrent neural network can be thought of as a chain of modules, represented in the following diagram:

<img src="images/rnn-internal-unrolled.png" width="50%">


An LSTM would still have the same overall structure, but it would introduce further layers acting as gates for allowing or inhibiting information flow into the *cell state*:


<table>
    <tr>
        <td width="70%">            
            <img src="images/lstm-internal-unrolled.png" width="100%">
            <img src="images/lstm-internal-unrolled-labels.png" width="75%">
        </td>
        <td>
            <p style="text-align: center;">Cell state of a module</p>
            <img src="images/lstm-cell-state.png" width="100%">
        </td>
    </tr>
</table>


The cell state is the horizontal line running through the top of the module as shown above. It can be thought of as a conveyor belt, allowing information to flow along between modules, with the gates adding or removing information along the way.

Each gate is just a sigmoid layer followed by an elementwise multiplication. Sigmoid is chosen because it outputs values between 0 and 1, indicating what proportion of each component should be 'let through the gate'.
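
As a small illustration of the gating idea, the sketch below applies a sigmoid to some gate pre-activations and multiplies the result elementwise with a vector. The function name `apply_gate` and the numbers are hypothetical; in a real LSTM the pre-activations would come from a learned layer.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def apply_gate(gate_logits, values):
    """Scale each component of `values` by a sigmoid 'openness' between 0 and 1."""
    return [sigmoid(g) * v for g, v in zip(gate_logits, values)]

cell_state = [2.0, -1.0, 0.5]
# Hypothetical gate pre-activations: nearly closed, half open, nearly open.
gated = apply_gate([-10.0, 0.0, 10.0], cell_state)
print([round(v, 3) for v in gated])   # [0.0, -0.5, 0.5]
```

The first component is almost entirely blocked, the second is halved, and the third passes through nearly unchanged.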

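The first step of the LSTM is to decide what information to forget, ie. what to exclude from the cell state. This is done by the *forget gate*, which is a sigmoid layer that looks at the previous hidden state and the current input and outputs a value between 0 and 1 for each component of the cell state.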
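
In the usual LSTM formulation (for instance the one in the "Understanding LSTMs" post listed under Resources), the forget gate can be written as below. The symbols $W_f$, $b_f$, $f_t$ and $C_{t-1}$ come from that standard notation and are not defined elsewhere in these notes, so treat them as assumptions:

$$f_t = \sigma\left(W_f \cdot [h_{t-1}, x_t] + b_f\right)$$

Here $[h_{t-1}, x_t]$ is the previous hidden state concatenated with the current input. The previous cell state is then scaled elementwise, $f_t \odot C_{t-1}$, so a component of $f_t$ near 0 erases the matching component of the cell state and a component near 1 keeps it.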

### Gated Recurrent Unit:

<img src="images/rnn-lstm-gru.png" width="50%">


Plain RNNs are computationally lighter per timestep than LSTM and GRU architectures, since they have no gates to compute, so they train faster.
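
One rough way to see the difference in cost is to count parameters per layer. The sketch below assumes the textbook formulations with a single bias vector per transformation; library implementations may count slightly differently (for example by using extra bias terms), and the function name is made up for this example.

```python
def recurrent_param_count(input_size, hidden_size, kind="rnn"):
    """Count the weights and biases in one recurrent layer.

    Assumes 1 transformation for a plain RNN, 3 for a GRU (update gate,
    reset gate, candidate state) and 4 for an LSTM (forget, input and
    output gates plus the candidate cell state).
    """
    blocks = {"rnn": 1, "gru": 3, "lstm": 4}[kind]
    per_block = hidden_size * (input_size + hidden_size) + hidden_size
    return blocks * per_block

for kind in ("rnn", "gru", "lstm"):
    print(kind, recurrent_param_count(input_size=128, hidden_size=256, kind=kind))
```

Under these assumptions a GRU layer has three times, and an LSTM layer four times, the parameters of a plain RNN layer of the same size, with a similar ratio in per-timestep arithmetic.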

### Second Order Networks:


A Reber grammar, which is defined by a non-deterministic finite state machine, can be learnt by a simple recurrent network.
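
For experimenting with this, the sketch below generates and checks Reber strings. The transition table is the commonly cited version of the grammar, written out from memory, so treat it as an assumption and compare it against a trusted diagram before relying on it. The names `REBER`, `generate` and `is_valid` are hypothetical.

```python
import random

# state -> list of (symbol, next_state); None marks the end of the string.
REBER = {
    0: [("B", 1)],
    1: [("T", 2), ("P", 3)],
    2: [("S", 2), ("X", 4)],
    3: [("T", 3), ("V", 5)],
    4: [("X", 3), ("S", 6)],
    5: [("P", 4), ("V", 6)],
    6: [("E", None)],
}

def generate(rng):
    """Walk the state machine, picking a random outgoing arc at each state."""
    state, out = 0, []
    while state is not None:
        symbol, state = rng.choice(REBER[state])
        out.append(symbol)
    return "".join(out)

def is_valid(string):
    """Check whether a string traces a complete path through the machine."""
    state = 0
    for ch in string:
        if state is None:          # characters after the final 'E'
            return False
        arcs = dict(REBER[state])
        if ch not in arcs:
            return False
        state = arcs[ch]
    return state is None

rng = random.Random(0)
samples = [generate(rng) for _ in range(5)]
print(samples)
print(all(is_valid(s) for s in samples), is_valid("BTXSE"), is_valid("BTXE"))
```

A typical experiment trains the network to predict the next symbol at each position and then checks whether its predictions stay within the legal transitions.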

### Resources:
- <a href="https://towardsdatascience.com/illustrated-guide-to-lstms-and-gru-s-a-step-by-step-explanation-44e9eb85bf21">RNNs, LSTMs, GRUs</a>
- <a href="http://colah.github.io/posts/2015-08-Understanding-LSTMs/">Understanding LSTMs</a>
- <a href="https://iamtrask.github.io/2015/11/15/anyone-can-code-lstm/">LSTM code explanation</a>
- <a href="https://www.youtube.com/watch?v=SEnXr6v2ifU&ab_channel=AlexanderAmini">MIT Recurrent Neural Networks</a>
