## Week 7: Recurrent Neural Networks I

### Outline
- Motivation & flavours: uses in ML vs neuroscience, rate vs spiking, vanilla vs LSTM/GRU
- Architectures: one-to-one, one-to-many, many-to-many, etc
- Math: 
    - forward pass (w/ numpy examples)
    - backprop
- Training: methods and challenges
- Hands-on: RNN implemented in PyTorch and trained to integrate noise
- Bonus: integrators as line attractors and the oculomotor system

## RNNs: ML vs neuroscience
### Why are RNNs useful for ML?
<img src='./img/nn.png'>

- Feedforward networks are constrained in their operations:
    - accept inputs (vectors) of a fixed size
    - perform a pre-determined number of computational steps
    - produce outputs (vectors) of a fixed size

<img src='./img/rnn_2.png'>

- Sometimes we want to process sequences of data
    - Ex: audio, text, time-dependent signals
- FF networks aren't great for this
    - length of sequence can be variable
    - temporal order of sequence can be very important
- RNNs have *recurrence*:
    - connections "within layers"
    - the computation at each timestep depends not only on the current input, but also on the previous state (and therefore on all previous states and inputs)
    - "state/context-dependent" computation, "memory"
    - temporal component = dynamical system

### RNNs in practice (ML)
- mainly used for natural language processing (NLP), translation, transcription, etc
- "Vanilla" RNNs are very difficult to train
- Long Short-Term Memory (LSTM) networks are far more common, but are still difficult to train
- Due to sequential operation, RNNs are difficult to parallelize
    
<img src='./img/lstm.png'>

## RNNs for neuroscience
### Why?
- recurrence is a canonical property of brain circuitry
- the brain is a dynamical system

### How?
- "recurrent neural network" can have many meanings in a neuroscience context
    - biophysically detailed models of a few interconnected neurons (ex oculomotor system) are recurrent neural networks
    - cortical microcircuit models with E/I balance via hundreds/thousands of pyramidal cells and inhibitory interneurons are recurrent neural networks
    - some "population" coding models with idealized tuning curves and Poisson spiking are recurrent neural networks

### Spiking vs Rate-based Networks
- the RNNs used in machine learning are typically referred to as "rate" networks in neuroscience
    - each unit at each timestep has an "activation"
    - "activation" is typically interpreted as analogous to the firing rate of a neuron
- Until recently, "spiking" networks were far more common in neuroscience
    - result of simulating dynamics of individual neurons (ex LIF) and connecting them
    - spiking networks... spike
- Common arguments for/against:
    - Spiking networks have more biophysical detail
    - Spiking networks preserve spike timing information
    - Spiking networks have (inter-spike/inter-trial) variability
    - Rate networks are differentiable (and therefore far easier to train)
    - Rate networks capture population dynamics accurately enough, more detail is unnecessary
- Big problem with both? Training!
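
To make the spiking side of the contrast concrete, here is a minimal sketch of a single leaky integrate-and-fire (LIF) neuron driven by a constant input. All parameter values and the helper name `simulate_lif` are assumptions chosen for illustration, not values from the course. Note the hard threshold and reset: that discontinuity is what makes spiking models non-differentiable, whereas a rate unit would just report a smooth activation.

```python
import numpy as np

def simulate_lif(drive, n_steps=200, dt=1.0, tau=20.0,
                 v_rest=0.0, v_thresh=1.0, v_reset=0.0):
    """Simulate one LIF neuron with forward Euler; return spike times.

    All default parameters are hypothetical, in arbitrary units.
    """
    v = v_rest
    spike_times = []
    for step in range(n_steps):
        # Leaky integration: dv/dt = (-(v - v_rest) + drive) / tau
        v += dt * (-(v - v_rest) + drive) / tau
        if v >= v_thresh:
            # Threshold crossing: record a spike and reset the voltage
            spike_times.append(step * dt)
            v = v_reset
    return np.array(spike_times)

spike_times = simulate_lif(drive=1.5)
print(len(spike_times), "spikes in 200 steps")
```

With a drive below threshold (for example `simulate_lif(0.5)`) the voltage settles without ever spiking, so the output carries no information about the input at all; a rate unit has no such dead zone unless the nonlinearity builds one in.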

## Vanilla RNN (VRNN) Architectures
<img src='./img/rsz_inout.jpg'>

<img src='./img/diags.jpg'>

Red = inputs, green = state, blue = output

Examples (left to right): image classification, image captioning, sentiment analysis, translation, video classification 

## RNN Math
<img src='./img/rsz_many-to-many.png'>

Vanilla RNNs use the same weights for every step $t = 1:n$. We need 3 weight matrices:
- $W_{xh}$ for all $x_t$ -> $h_t$ (red arrows)
- $W_{hh}$ for all $h_{t-1}$ -> $h_t$ (green arrows)
- $W_{hy}$ for all $h_t$ -> $y_t$ (blue arrows)

We also need bias vectors $b_x, b_h, b_y$. Think of these as intercepts or tonic activity.

The state $h$ at time $t$ can be expressed as:

$$h_t = \sigma(W_{xh}x_t + b_x + W_{hh}h_{t-1} + b_h)$$

where $\sigma$ is a nonlinearity. $\textrm{tanh}$ is the most common for VRNNs.

The output at time $t$ is a function of the current state:
$$ y_t = W_{hy}h_t + b_y $$

### Notes:
- Here, the output step is linear and outputs are unbounded
- Weight matrix sparsity and symmetry have a large impact on dynamics
- "Training" the network = finding weights + biases that work
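
As a rough illustration of the symmetry note above, the sketch below compares the eigenvalues of a symmetric and an antisymmetric recurrent matrix. The matrix size and the random matrix `A` are assumptions for illustration. For the linearized dynamics, real eigenvalues give modes that scale along fixed directions, while complex eigenvalues give rotational modes.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(6, 6))   # hypothetical 6-unit recurrent matrix

W_sym = (A + A.T) / 2         # symmetric part: W == W.T
W_anti = (A - A.T) / 2        # antisymmetric part: W == -W.T

eig_sym = np.linalg.eigvals(W_sym)
eig_anti = np.linalg.eigvals(W_anti)

# Symmetric matrices have purely real eigenvalues
print("max |imag| (symmetric):    ", np.max(np.abs(np.imag(eig_sym))))
# Antisymmetric matrices have purely imaginary eigenvalues
print("max |real| (antisymmetric):", np.max(np.abs(np.real(eig_anti))))
```

Both printed values should be zero up to floating-point error.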

In ML, computing the internal state and the output is known as performing a "forward pass." The code in this case is ridiculously simple:

`h[t] = np.tanh(np.dot(W_xh, x[t]) + b_x + np.dot(W_hh, h[t-1]) + b_h)`

`y[t] = np.dot(W_hy, h[t]) + b_y`
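
For completeness, here is a self-contained sketch that wraps those two lines in a loop over a whole sequence. The layer sizes, the random initialization, and the zero initial state `h0` are assumptions for illustration. Passing `h0` explicitly also avoids indexing `h[t-1]` at the first timestep.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes, chosen only for illustration
n_in, n_hidden, n_out, n_steps = 3, 5, 2, 10

# Small random weights and zero biases (an assumed initialization)
W_xh = rng.normal(scale=0.1, size=(n_hidden, n_in))
W_hh = rng.normal(scale=0.1, size=(n_hidden, n_hidden))
W_hy = rng.normal(scale=0.1, size=(n_out, n_hidden))
b_x = np.zeros(n_hidden)
b_h = np.zeros(n_hidden)
b_y = np.zeros(n_out)

def forward(x, h0):
    """Run the vanilla RNN over a sequence x of shape (n_steps, n_in)."""
    h_prev = h0
    hs, ys = [], []
    for x_t in x:
        # State update: same weights at every timestep
        h_t = np.tanh(np.dot(W_xh, x_t) + b_x + np.dot(W_hh, h_prev) + b_h)
        # Linear readout of the current state
        y_t = np.dot(W_hy, h_t) + b_y
        hs.append(h_t)
        ys.append(y_t)
        h_prev = h_t
    return np.array(hs), np.array(ys)

x = rng.normal(size=(n_steps, n_in))
h, y = forward(x, h0=np.zeros(n_hidden))
print(h.shape, y.shape)  # (10, 5) (10, 2)
```

Because the loop simply iterates over `x`, the same function handles sequences of any length, which is the property that fixed-size feedforward networks lack.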

### The hard part: training
In order to train this network, we have to follow a familiar pattern in ML: 
    
   1) Define a loss function $\mathcal{L}$ and 
   
   2) Find parameter values $\theta = \{W_{xh}, b_x, W_{hh}, b_h, W_{hy}, b_y\}$ that minimize $\mathcal{L}$
   
To define the loss, we also need "ground truth" or target outputs $t$ that we want our network to produce for a given input $x$ (here $t$ denotes a target, not the time index). For example, we could use mean squared error (MSE) for our loss $\mathcal{L}$:

$$ \mathcal{L} = \frac{1}{k} \sum_{i=1}^{k}(t_i - y_i)^2$$

Note: $i$ indexes over $k$ input sequences, not time!
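
A minimal NumPy sketch of this loss, assuming one scalar output per sequence; the function name `mse_loss` and the example numbers are made up for illustration:

```python
import numpy as np

def mse_loss(targets, outputs):
    """Mean squared error over k sequences: (1/k) * sum_i (t_i - y_i)^2."""
    targets = np.asarray(targets, dtype=float)
    outputs = np.asarray(outputs, dtype=float)
    return np.mean((targets - outputs) ** 2)

# Hypothetical example with k = 4 sequences
targets = np.array([1.0, 0.0, -1.0, 2.0])
outputs = np.array([0.5, 0.0, -1.0, 1.0])
loss = mse_loss(targets, outputs)
print(loss)  # 0.3125
```

The squared errors here are 0.25, 0, 0 and 1, so the mean is 1.25 / 4 = 0.3125.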

Thanks to our nonlinearity, we cannot find $\theta$ analytically, so we must compute gradients of $\mathcal{L}$ with respect to the parameters and use gradient descent to optimize them. We compute these gradients using the chain rule.

Some are simple: for example, the gradients for the linear output parameters require only the current $h_t$:

$$ \frac{\partial \mathcal{L}}{\partial W_{hy}} = \frac{\partial \mathcal{L}}{\partial y} \cdot \frac{\partial y}{\partial W_{hy}}$$

$$ \frac{\partial \mathcal{L}}{\partial b_y} = \frac{\partial \mathcal{L}}{\partial y} \cdot \frac{\partial y}{\partial b_y}$$

If you know the derivative of your loss function, these are simple to compute on a backward pass.
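
As a sanity check, the sketch below computes these two output-layer gradients for the MSE loss and compares one entry against a finite-difference estimate. Several things here are assumptions for illustration: the sizes, the random stand-in states `H` (one row per sequence), and the extension of the scalar MSE to vector outputs by summing squared errors over output components.

```python
import numpy as np

rng = np.random.default_rng(1)
k, n_hidden, n_out = 6, 4, 2            # hypothetical sizes

H = rng.normal(size=(k, n_hidden))      # stand-in states, one row per sequence
T = rng.normal(size=(k, n_out))         # target outputs
W_hy = rng.normal(size=(n_out, n_hidden))
b_y = rng.normal(size=n_out)

def loss_fn(W_hy, b_y):
    """MSE over k sequences for the linear readout y = W_hy h + b_y."""
    Y = H @ W_hy.T + b_y
    return np.sum((T - Y) ** 2) / k

# Chain rule: dL/dW_hy = dL/dy * dy/dW_hy, and dL/db_y = dL/dy * dy/db_y
Y = H @ W_hy.T + b_y
dL_dY = -2.0 * (T - Y) / k              # shape (k, n_out)
dL_dW_hy = dL_dY.T @ H                  # shape (n_out, n_hidden)
dL_db_y = dL_dY.sum(axis=0)             # shape (n_out,)

# Central finite difference on a single weight as a check
eps = 1e-6
W_plus = W_hy.copy()
W_plus[0, 1] += eps
W_minus = W_hy.copy()
W_minus[0, 1] -= eps
numeric = (loss_fn(W_plus, b_y) - loss_fn(W_minus, b_y)) / (2 * eps)
print("analytic:", dL_dW_hy[0, 1], "numeric:", numeric)
```

The two printed values should agree to several decimal places.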

The recurrent parameters are harder, because they require backpropagating through every timestep. For example:

$$ \frac{\partial \mathcal{L}}{\partial W_{xh}} = \frac{\partial \mathcal{L}}{\partial y} \sum_t \frac{\partial y}{\partial h_t} \cdot \frac{\partial h_t}{\partial W_{xh}}$$

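This is the main reason that training RNNs is difficult: each $h_t$ depends on $h_{t-1}$, so the chain rule picks up one factor of $\frac{\partial h_t}{\partial h_{t-1}}$ per timestep, and the resulting product can explode (or vanish)!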
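
A minimal numerical sketch of that effect, under simplifying assumptions: the network is linearized (the $\textrm{tanh}$ derivative is taken to be 1, so each Jacobian $\frac{\partial h_t}{\partial h_{t-1}}$ is just $W_{hh}$), and $W_{hh}$ is a scaled random orthogonal matrix so that all of its singular values equal a gain `g`. The sizes, the gain values, and the helper name `backprop_factor_norm` are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_hidden, n_steps = 20, 50              # hypothetical sizes

# Random orthogonal matrix Q, so W_hh = g * Q has all singular values equal to g
Q, _ = np.linalg.qr(rng.normal(size=(n_hidden, n_hidden)))

def backprop_factor_norm(g, n_steps):
    """Spectral norm of the product of n_steps Jacobians dh_t/dh_{t-1}
    for a linearized network, where every Jacobian equals W_hh = g * Q."""
    W_hh = g * Q
    J = np.eye(n_hidden)
    for _ in range(n_steps):
        J = W_hh @ J
    return np.linalg.norm(J, 2)

norms = {g: backprop_factor_norm(g, n_steps) for g in (0.9, 1.0, 1.1)}
for g, norm in norms.items():
    print(f"g = {g}: norm of Jacobian product = {norm:.3g}")
```

In this idealized setting the norm is exactly `g ** n_steps`, so a gain only 10% above 1 grows the gradient by roughly two orders of magnitude over 50 steps, and a gain 10% below 1 shrinks it by a similar factor. With the $\textrm{tanh}$ derivative included, each factor is further scaled by values in $(0, 1]$, which pushes toward vanishing rather than exploding.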