# Recurrent Networks:


<img src="images/recurrent-network.png" width="50%">


Regular feed-forward networks aren't well suited to processing sequential input data, and they can't effectively learn long-range dependencies.
- With a sequential training sample, having a fixed-size input layer is problematic because sequences vary in length
- Some training samples contain long-range dependencies, which are difficult for a deep feed-forward network to identify and generalise to
    - E.g. suppose we need a model to predict the next word in the sentence "*France is where I grew up, but I now live in Boston. I speak fluent* \_\_\_\_\_". The answer depends on "France", which appears many words earlier, and a regular feed-forward architecture struggles to learn this long-range dependency

Recurrent neural networks, on the other hand, are well suited to processing sequential data such as audio signals and English text. A recurrent architecture addresses the following:
1. Handling variable-length sequential inputs
2. Tracking long-range dependencies
3. Paying attention to the order of values in the sequence
4. Sharing parameters across the sequence (similar to what convolutional neural networks do with their kernels)


For example, a recurrent neural network can produce a single output for a variable-length sequential input, or it can produce an output for each step in the sequence.
<img src="images/recurrent-network-sequence-modeling.png" width="50%">
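
The two output modes above can be sketched in a few lines of plain Python. This is a minimal illustration, not a trained model: the weight values, the `tanh` nonlinearity and the helper names `rnn_step` and `run_rnn` are all assumptions made for the example. The same weights are applied at every step, so sequences of any length can be processed.

```python
import math

def rnn_step(x, h, W_xh, W_hh):
    """One recurrent step: h_new = tanh(W_xh @ x + W_hh @ h), using plain lists."""
    return [
        math.tanh(
            sum(w * xi for w, xi in zip(W_xh[j], x))
            + sum(w * hi for w, hi in zip(W_hh[j], h))
        )
        for j in range(len(h))
    ]

def run_rnn(sequence, W_xh, W_hh, hidden_size):
    """Apply the same weights at every step and return every hidden state."""
    h = [0.0] * hidden_size
    states = []
    for x in sequence:
        h = rnn_step(x, h, W_xh, W_hh)
        states.append(h)
    return states

# Hypothetical fixed weights: 2 input features, 3 hidden units.
W_xh = [[0.5, -0.3], [0.1, 0.8], [-0.6, 0.2]]
W_hh = [[0.2, 0.0, -0.1], [0.0, 0.3, 0.1], [0.1, -0.2, 0.4]]

short_seq = [[1.0, 0.0], [0.0, 1.0]]
long_seq = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5], [0.0, 0.0]]

for seq in (short_seq, long_seq):
    states = run_rnn(seq, W_xh, W_hh, hidden_size=3)
    per_step = states   # one state per input step (an output for each step)
    final = states[-1]  # a single summary of the whole sequence (one output)
    print(len(seq), len(per_step), len(final))
```

Because the hidden state is carried forward, feeding the same inputs in a different order generally gives a different final state, which is how the order of the sequence is taken into account.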


### Recurrent Neural Networks:




<table>
    <tr>
        <td>    
            <p style="text-align: center;">
                <strong>Elman Recurrent Network</strong>
            </p>
            <img src="images/elman-recurrent-network.png">
        </td>
        <td>
            <p style="text-align: center;">
                <strong>Recurrent Network with shortcuts</strong>
            </p>
            <img src="images/recurent-network-shortcut.png">
        </td>
    </tr>
</table>


- __Recurrence of information:__
At each time step, the input follows the standard feed-forward procedure to the output layer, but the hidden layer activations are also copied to the "context" layer (also called the "hidden state"). The context layer is then fed in as additional input when the next time step goes through the feed-forward procedure.
    - E.g. once `l1` is fed through the network and the output layer is computed, the hidden layer values are copied over to nodes `s1`, `s2` and `s3`, which act as additional inputs when `l2` is passed through the network

    
- Sometimes shortcut connections between the input layer and output layer are added. TODO: why is this sometimes helpful?
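
The copy into the context layer can be sketched as follows. This is an illustrative sketch only: the weights, the `tanh` activation, the linear output layer and the function name `elman_forward` are assumptions, and the shortcut connections are left out.

```python
import math

def elman_forward(x, context, W_in, W_ctx, W_out):
    """Feed one input through a small Elman-style network.

    The hidden layer sees the current input plus the context units, which
    hold a copy of the previous step's hidden activations.
    """
    hidden = [
        math.tanh(
            sum(w * xi for w, xi in zip(W_in[j], x))
            + sum(w * ci for w, ci in zip(W_ctx[j], context))
        )
        for j in range(len(W_in))
    ]
    output = [sum(w * hj for w, hj in zip(row, hidden)) for row in W_out]
    new_context = list(hidden)  # copy hidden activations into the context layer
    return output, new_context

# Hypothetical weights: 2 inputs, 3 hidden/context units (s1, s2, s3), 1 output.
W_in = [[0.4, -0.2], [0.3, 0.7], [-0.5, 0.1]]
W_ctx = [[0.1, 0.0, 0.2], [-0.3, 0.2, 0.0], [0.0, 0.4, -0.1]]
W_out = [[0.6, -0.4, 0.3]]

l1, l2 = [1.0, 0.0], [0.0, 1.0]
context = [0.0, 0.0, 0.0]  # s1, s2, s3 start empty

y1, context = elman_forward(l1, context, W_in, W_ctx, W_out)
y2, context = elman_forward(l2, context, W_in, W_ctx, W_out)

# Feeding l2 with an empty context gives a different output, which shows
# that y2 depends on l1 through the context units.
y2_no_memory, _ = elman_forward(l2, [0.0, 0.0, 0.0], W_in, W_ctx, W_out)
print(y2, y2_no_memory)
```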


<table>
    <tr>
        <td width="50%">  
            <p>
                A recurrent network can be 'unrolled' into an equivalent feed-forward network. 
            </p>
            <img src="images/recurrent-unrolled.png">
        </td>
        <td>
            <img src="images/recurrent-network-unrolled.png">
            <p>
               All the weight matrices $W_{xh}, W_{hh}, W_{hy}$ are reused at every time step.
               $W_{hh}$ maps the current hidden state to the next hidden state. The total loss $L$ is simply the sum of the individual losses computed at each time step: $L = L_1 + L_2 + \dots$
            </p>
        </td>
    </tr>
</table>
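
Written out as equations, the unrolled computation might look like the following. This is a sketch under assumptions that the notes do not state: a $\tanh$ hidden nonlinearity, no bias terms, $T$ time steps in total, and $x_t$, $h_t$, $\hat{y}_t$ as names for the input, hidden state and output at step $t$.

$$
\begin{aligned}
h_t &= \tanh(W_{xh} x_t + W_{hh} h_{t-1}) \\
\hat{y}_t &= W_{hy} h_t \\
L &= \sum_{t=1}^{T} L_t
\end{aligned}
$$

The same three matrices appear in every row of the unrolled network, which is what sharing parameters across the sequence means in practice.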


See around 21:00 in the MIT Recurrent Neural Networks video (linked under Resources).

### Backpropagation Through Time:
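*Backpropagation through time* (BPTT) is ordinary backpropagation applied to the unrolled chain of network copies: gradients flow backwards from each time step's loss through all earlier time steps.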

<img src="images/backprop-through-time.png" width="50%">
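
As a hedged sketch of what this means for the recurrent weights, write $h_t$ for the hidden state and $L_t$ for the loss at step $t$ (these symbols and the total number of steps $T$ are notation assumed here). Because $W_{hh}$ is used at every step, its gradient collects a contribution from every earlier step $k$ for every loss $L_t$:

$$
\frac{\partial L}{\partial W_{hh}}
= \sum_{t=1}^{T} \sum_{k=1}^{t}
\frac{\partial L_t}{\partial h_t}
\left( \prod_{j=k+1}^{t} \frac{\partial h_j}{\partial h_{j-1}} \right)
\frac{\partial^{+} h_k}{\partial W_{hh}}
$$

Here $\partial^{+} h_k / \partial W_{hh}$ denotes the immediate derivative at step $k$, treating $h_{k-1}$ as a constant. The product of step-to-step derivatives is the part that grows or shrinks as the distance $t - k$ increases.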


#### Short Term Memory Problem:
<img src="images/recurrent-network-text-example.png" width="50%">

The short term memory problem is caused by the vanishing gradient problem: the influence of information from much earlier time steps diminishes, so long-range dependencies aren't effectively learned.
- In the above example, the earliest steps, for "what" and "time", contribute very little to the final prediction
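
The shrinking influence of early steps can be made concrete with a one-unit recurrent network. This is a minimal sketch: the scalar weights `w` and `u`, the alternating input sequence and the helper names are all hypothetical, chosen only so that the effect is easy to see.

```python
import math

def hidden_states(h0, inputs, w, u):
    """Scalar RNN: h_t = tanh(w * h_{t-1} + u * x_t)."""
    states = [h0]
    for x in inputs:
        states.append(math.tanh(w * states[-1] + u * x))
    return states

def grad_last_wrt_earlier(states, w):
    """d h_T / d h_k for every k, via the chain rule applied backwards.

    Each step back multiplies by d h_t / d h_{t-1} = w * (1 - h_t ** 2).
    """
    grads = [1.0]  # d h_T / d h_T
    for h_t in reversed(states[1:]):
        grads.append(grads[-1] * w * (1.0 - h_t ** 2))
    return list(reversed(grads))  # index k holds d h_T / d h_k

w, u = 0.9, 0.5            # hypothetical weights
inputs = [1.0, -1.0] * 10  # 20 time steps
states = hidden_states(0.0, inputs, w, u)
grads = grad_last_wrt_earlier(states, w)

for k in (19, 15, 10, 5, 0):
    print(f"d h_20 / d h_{k} = {grads[k]:.2e}")
```

With `|w| < 1` every factor in the product is smaller than one in magnitude, so the printed derivatives shrink as `k` moves further from the final step. That is the vanishing gradient in miniature.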
    
LSTM (*long short-term memory*) and GRU (*gated recurrent unit*) are two recurrent neural network architectures that were created to combat the short term memory problem in standard recurrent neural networks.

The use of *gates* allows for better learning of long-range dependencies.

These gates are learned tensor operations that decide what information to add to or remove from the hidden state.
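
A single gate can be sketched as an element-wise blend between the old state and a new candidate. The example below is a simplified, GRU-style update and not a full LSTM or GRU cell: in a real cell the gate logits and the candidate are computed from the input and the previous hidden state with learned weights, whereas here they are given directly as hypothetical values.

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def gated_update(h_prev, candidate, gate_logits):
    """Blend old state and new candidate, element by element.

    A gate value near 0 keeps the old entry; a value near 1 overwrites it.
    """
    gates = [sigmoid(g) for g in gate_logits]
    return [(1.0 - z) * h + z * c for z, h, c in zip(gates, h_prev, candidate)]

h_prev = [0.9, -0.4, 0.1]
candidate = [-0.2, 0.7, 0.5]
gate_logits = [-6.0, 6.0, 0.0]  # hypothetical: keep, overwrite, half-and-half

h_new = gated_update(h_prev, candidate, gate_logits)
print([round(v, 3) for v in h_new])
```

When a gate stays close to 0, the corresponding entry of the hidden state is carried forward almost unchanged, which gives information from early steps a way to survive over many time steps.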


<img src="images/rnn-lstm-gru.png" width="50%">


Standard (ungated) RNNs train faster per step, since they have fewer parameters and operations than LSTM and GRU architectures.
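
A rough parameter count shows where the extra cost comes from. The sketch assumes the textbook formulations: one weight block for a standard RNN, three for a GRU (update gate, reset gate, candidate) and four for an LSTM (input, forget and output gates plus candidate), each with a single bias vector. Library implementations may count biases differently, and the layer sizes are hypothetical.

```python
def recurrent_param_count(input_size, hidden_size, gate_blocks):
    """Weights plus one bias vector for each block of the cell."""
    per_block = hidden_size * (input_size + hidden_size) + hidden_size
    return gate_blocks * per_block

input_size, hidden_size = 128, 256  # hypothetical layer sizes
for name, blocks in (("RNN", 1), ("GRU", 3), ("LSTM", 4)):
    print(name, recurrent_param_count(input_size, hidden_size, blocks))
```

Under these assumptions a GRU has three times and an LSTM four times the parameters of a standard RNN with the same layer sizes.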

### Second Order Networks:


The Reber grammar, which is generated by a non-deterministic finite state machine, can be learnt by a simple recurrent network.
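
For reference, Reber strings can be generated and checked with a small state machine. The transition table below follows the way the grammar is commonly drawn and is reproduced here as an assumption, so check it against the course material before relying on it. The state numbering and function names are made up for this sketch.

```python
import random

# Assumed transition table: state -> list of (symbol, next_state).
REBER = {
    0: [("B", 1)],
    1: [("T", 2), ("P", 3)],
    2: [("S", 2), ("X", 4)],
    3: [("T", 3), ("V", 5)],
    4: [("X", 3), ("S", 6)],
    5: [("P", 4), ("V", 6)],
    6: [("E", 7)],
}
END_STATE = 7

def generate_reber(rng):
    """Walk the machine, picking a random outgoing transition at each state."""
    state, symbols = 0, []
    while state != END_STATE:
        symbol, state = rng.choice(REBER[state])
        symbols.append(symbol)
    return "".join(symbols)

def is_reber(string):
    """Return True if the string follows the transitions and ends at END_STATE."""
    state = 0
    for symbol in string:
        moves = dict(REBER.get(state, []))
        if symbol not in moves:
            return False
        state = moves[symbol]
    return state == END_STATE

rng = random.Random(0)
samples = [generate_reber(rng) for _ in range(5)]
print(samples)
print(is_reber("BTSSXXTVVE"), is_reber("BTXXE"))
```

Strings like these are typically used as a next-symbol prediction task: at each position the network has to learn which symbols may legally follow.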

### Resources:
- <a href="https://towardsdatascience.com/illustrated-guide-to-lstms-and-gru-s-a-step-by-step-explanation-44e9eb85bf21">RNNs, LSTMs, GRUs</a>
- <a href="http://colah.github.io/posts/2015-08-Understanding-LSTMs/">Understanding LSTMs</a>
- <a href="https://iamtrask.github.io/2015/11/15/anyone-can-code-lstm/">LSTM code explanation</a>
- <a href="https://www.youtube.com/watch?v=SEnXr6v2ifU&ab_channel=AlexanderAmini">MIT Recurrent Neural Networks</a>
