# sequence-oriented tasks

- sequential data: time series, DNA sequence, weather, text, audio, video



inputs, outputs, and the amount of computation can all vary in length

common idea: the neural net remembers information from earlier in the sequence

![image.png](attachment:image.png)

## backpropagation through time (BPTT)

- unroll the self-loop into many time steps, then backpropagate through the unrolled steps as if they were the layers of an MLP (with the same weights shared across all steps)

![image.png](attachment:image.png)
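
As a minimal sketch of the unrolling idea, the snippet below runs BPTT on a scalar RNN and checks the result against a finite-difference estimate. The recurrence `h_t = tanh(w * h_{t-1} + u * x_t)`, the toy loss on the final state, and all names and values are assumptions chosen for illustration, not something taken from these notes.

```python
import numpy as np

def forward(w, u, xs, h0=0.0):
    """Unroll h_t = tanh(w * h_{t-1} + u * x_t) over the whole sequence."""
    hs = [h0]
    for x in xs:
        hs.append(np.tanh(w * hs[-1] + u * x))
    return hs  # hs[t] is the hidden state after t steps

def loss(w, u, xs):
    # Toy loss on the final hidden state only (assumption for illustration).
    return 0.5 * forward(w, u, xs)[-1] ** 2

def bptt_grad_w(w, u, xs):
    """Gradient of the loss w.r.t. the shared weight w, walking back through time."""
    hs = forward(w, u, xs)
    dh = hs[-1]                        # dL/dh_T for L = 0.5 * h_T^2
    dw = 0.0
    for t in range(len(xs), 0, -1):
        da = dh * (1.0 - hs[t] ** 2)   # back through tanh at step t
        dw += da * hs[t - 1]           # w is shared, so contributions add up
        dh = da * w                    # pass the gradient on to h_{t-1}
    return dw

xs = [0.5, -1.0, 0.3, 0.8]
w, u = 0.7, 1.2
eps = 1e-6
numeric = (loss(w + eps, u, xs) - loss(w - eps, u, xs)) / (2 * eps)
analytic = bptt_grad_w(w, u, xs)
print(analytic, numeric)  # the two estimates should agree closely
```

Because `w` is reused at every step, its gradient is the sum of one contribution per unrolled step, which is the main difference from backpropagation in an ordinary MLP.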

# recurrent neural net (RNN)

- takes arbitrarily sized inputs and remembers information in a hidden state

- in feedforward nets, activations of later layers are completely determined by **input**


- in RNN, hidden layers are allowed to be affected by activations at **earlier times**, i.e., self-loops
    

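- vanishing gradient problem: training is difficult for **long-term** dependencies, because gradients shrink as they are propagated back through many time steps, so the network forgets information that lies too far in the past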

![image.png](attachment:image.png)
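
A minimal NumPy sketch of the self-loop and of the vanishing gradient follows. It assumes a vanilla `tanh` RNN and a recurrent matrix `W_h` rescaled to spectral norm 0.5; both are illustrative choices, and the function and variable names are hypothetical. The sensitivity of the last hidden state to the first is measured as the norm of the product of per-step Jacobians.

```python
import numpy as np

def rnn_forward(xs, W_h, W_x, b):
    """Run a vanilla RNN over a sequence of any length; return all hidden states."""
    h = np.zeros(W_h.shape[0])
    hs = []
    for x in xs:                              # one step per sequence element
        h = np.tanh(W_h @ h + W_x @ x + b)    # self-loop: h depends on the previous h
        hs.append(h)
    return np.array(hs)

def sensitivity_to_first_state(hs, W_h):
    """Spectral norm of d h_T / d h_1, a product of Jacobians diag(1 - h_t^2) W_h."""
    J = np.eye(W_h.shape[0])
    for h in hs[1:]:
        J = (np.diag(1.0 - h ** 2) @ W_h) @ J
    return np.linalg.norm(J, 2)

rng = np.random.default_rng(0)
hidden, inp = 4, 3
M = rng.standard_normal((hidden, hidden))
W_h = 0.5 * M / np.linalg.norm(M, 2)          # assumption: spectral norm fixed at 0.5
W_x = rng.standard_normal((hidden, inp))
b = np.zeros(hidden)

for T in (2, 10, 30):                         # the same weights handle any length T
    xs = rng.standard_normal((T, inp))
    hs = rnn_forward(xs, W_h, W_x, b)
    print(T, hs.shape, sensitivity_to_first_state(hs, W_h))
```

With this scaling each Jacobian has norm at most 0.5, so the sensitivity is bounded by `0.5 ** (T - 1)` and decays quickly as the sequence gets longer.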

# LSTM

- LSTM is good at handling long-term dependencies


- the main component of an LSTM is the running cell state

![image.png](attachment:image.png)

## forget gate

- forget gate: how much of the previous cell state $C_{t-1}$ should be forgotten?

$$
f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)
$$

![image.png](attachment:image.png)

## running state

running state $C_t$ is a combination of the previous cell state $C_{t-1}$, scaled by the forget gate $f_t$, and the new candidate state $\tilde{C_t}$, scaled by the input gate $i_t$

$$
C_t = f_t * C_{t-1} + i_t * \tilde{C_t}
$$

$$
i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i)
$$

$$
\tilde{C_t} = \tanh (W_C \cdot [h_{t-1}, x_t] + b_C)
$$

## output 

output $h_t$ is the running state $C_t$ squashed by $\tanh$ and filtered by the output gate $o_t$, which is computed from the previous hidden state $h_{t-1}$ and the current input $x_t$

$$
h_t = o_t * \tanh (C_t)
$$

$$
o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o)
$$
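
The gate equations of this section can be put together into a single time step. The sketch below is one possible NumPy transcription; the parameter dictionary, the sizes, and the random initialization are assumptions made only so the example runs.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, C_prev, params):
    """One LSTM time step: forget gate, input gate, candidate, cell state, output."""
    z = np.concatenate([h_prev, x_t])                      # [h_{t-1}, x_t]
    f_t = sigmoid(params["W_f"] @ z + params["b_f"])       # forget gate
    i_t = sigmoid(params["W_i"] @ z + params["b_i"])       # input gate
    C_tilde = np.tanh(params["W_C"] @ z + params["b_C"])   # candidate state
    C_t = f_t * C_prev + i_t * C_tilde                     # running cell state
    o_t = sigmoid(params["W_o"] @ z + params["b_o"])       # output gate
    h_t = o_t * np.tanh(C_t)                               # new hidden state
    return h_t, C_t

rng = np.random.default_rng(0)
hidden, inp = 3, 2                                         # hypothetical sizes
params = {}
for name in ("f", "i", "C", "o"):
    params[f"W_{name}"] = 0.1 * rng.standard_normal((hidden, hidden + inp))
    params[f"b_{name}"] = np.zeros(hidden)

h, C = np.zeros(hidden), np.zeros(hidden)
for x_t in rng.standard_normal((5, inp)):                  # a sequence of 5 inputs
    h, C = lstm_step(x_t, h, C, params)
print(h, C)
```

Note that the cell state is updated only by elementwise scaling and addition: when $f_t$ is close to 1 and $i_t$ is close to 0, $C_{t-1}$ passes through almost unchanged, which is what lets information survive over many steps.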



## problem of LSTM encoder-decoder

![image.png](attachment:image.png)

intermediate states of the encoder are not visible to the decoder

decoder: has to reconstruct everything from the bottleneck layer, as in an autoencoder

# transformer: encoder-decoder + attention

- each decoder layer pays its own, different attention to the output of the last encoder layer

### attention

![image.png](attachment:image.png)

- expose all the encoder hidden states to each decoder state, to make its decoding task easier

- attention allows the decoder to adaptively weight the encoder states based on its current state

![image.png](attachment:image.png)

- e.g. English to German translation

    the verb is often the 2nd word in English, while in German it can be the last word,
    
    so decoder state $C_4$ may pay more attention to encoder state $h_2$
    
    $\Sigma|x$ is the weighted sum of the encoder hidden states, customized for each decoder state,
    
    softmax makes the weights over the hidden states sum to 1
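
The weighted sum described above can be sketched in a few lines of NumPy. Dot-product scoring is assumed here (one of the score functions discussed in Luong et al.); the toy vectors and the function names are hypothetical.

```python
import numpy as np

def softmax(scores):
    e = np.exp(scores - scores.max())   # subtract the max for numerical stability
    return e / e.sum()

def attend(decoder_state, encoder_states):
    """Return attention weights and the context vector for one decoder state."""
    scores = encoder_states @ decoder_state   # one dot-product score per encoder state
    weights = softmax(scores)                 # non-negative, sums to 1
    context = weights @ encoder_states        # weighted sum of encoder hidden states
    return weights, context

# Toy encoder hidden states h_1, h_2, h_3 (one per row) and one decoder state.
encoder_states = np.array([[1.0, 0.0],
                           [0.0, 2.0],
                           [1.0, 1.0]])
decoder_state = np.array([0.0, 1.0])

weights, context = attend(decoder_state, encoder_states)
print(weights, weights.sum())   # the largest weight falls on h_2
print(context)
```

Each decoder state gets its own set of weights, so the context vector changes from one decoding step to the next instead of being a single fixed bottleneck.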

# unsupervised sequential models

- unsupervised training in language models frequently uses **masking**

    it is used both in pretraining and in language generation
    
    GPT is trained with masking: the rest of the sequence is hidden and the model predicts the next token


- e.g.

    - mask parts of the input and fill them in

    - mask the end of a sequence and complete it


- application: text generation, music generation
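
The two masking schemes listed above can be sketched as data-preparation helpers. This is only an illustration of how training pairs could be built without labels; the `[MASK]` placeholder, the masking probability, and the function names are assumptions, and no actual model is involved.

```python
import numpy as np

MASK = "[MASK]"  # hypothetical placeholder token

def mask_fill_in(tokens, mask_prob, rng):
    """Hide random tokens; the training target is the set of hidden tokens."""
    hide = rng.random(len(tokens)) < mask_prob
    inputs = [MASK if m else tok for tok, m in zip(tokens, hide)]
    targets = {i: tok for i, (tok, m) in enumerate(zip(tokens, hide)) if m}
    return inputs, targets

def mask_the_end(tokens):
    """Hide everything after each prefix; the training target is the next token."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

tokens = "the cat sat on the mat".split()
rng = np.random.default_rng(0)

inputs, targets = mask_fill_in(tokens, 0.3, rng)
print(inputs, targets)

for prefix, next_token in mask_the_end(tokens):
    print(prefix, "->", next_token)
```

In both cases the targets come from the sequence itself, which is why no manual labels are needed.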

# reading

- Luong et al., Effective Approaches to Attention-based Neural Machine Translation, 2015


- Hochreiter & Schmidhuber, Long Short-Term Memory, 1997


- https://medium.com/datadriveninvestor/attention-in-rnns-321fbcd64f05


- http://colah.github.io/posts/2015-08-Understanding-LSTMs/
