# Sequence Model
A sequence model is a model in which the input, the output, or both are sequences.

## Notations
Input: $x: x^{<1>}, x^{<2>}, ..., x^{<t>}, ..., x^{<T_x>}$\
Output: $y: y^{<1>}, y^{<2>}, ..., y^{<t>}, ..., y^{<T_y>}$\
Number of elements (sequence length) in $x$: $T_x$\
Number of elements (sequence length) in $y$: $T_y$\
Element $t$ of the $i$th training example: $x^{(i)<t>}$\
Element $t$ of the $i$th output: $y^{(i)<t>}$\
Sequence length of the $i$th training example: $T_x^{(i)}$\
Sequence length of the $i$th output: $T_y^{(i)}$

# Word Representation
## One-hot Method
Build a dictionary of, say, 10,000 words.\
Represent each input word as a vector of dictionary length whose elements are all 0 except for a 1 at the word's index in the dictionary.
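
As a minimal sketch of the one-hot method, the snippet below uses a hypothetical five-word dictionary (the names `vocab`, `word_to_index`, and `one_hot` are illustrative, not from any library); a real dictionary would be far larger.

```python
import numpy as np

# Hypothetical toy dictionary; a real one would hold around 10,000 words.
vocab = ["a", "and", "harry", "potter", "zulu"]
word_to_index = {word: i for i, word in enumerate(vocab)}

def one_hot(word, word_to_index):
    """Return a vector of zeros with a single 1 at the word's dictionary index."""
    vec = np.zeros(len(word_to_index))
    vec[word_to_index[word]] = 1.0
    return vec

x = one_hot("harry", word_to_index)  # 1 at index 2, zeros elsewhere
```

The vector length always equals the dictionary size, which is why one-hot inputs become very wide for realistic vocabularies.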

# Recurrent Neural Network
## Why not a traditional neural network
__Problems__
1. The lengths of the input and output vary from example to example.
2. Features learned at one position in the text are not shared with other positions.
3. One-hot inputs make the first layer very large, so there are too many parameters to train.

## Structure

$\ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \hat{y}^{<1>}$
$\ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \hat{y}^{<2>}$
$\ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \hat{y}^{<3>}$
$\ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \hat{y}^{<T_y>}$\
$\ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \uparrow$
$\ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \uparrow$
$\ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \uparrow$
$\ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \uparrow$\
$a^{<0>}\ \to\ $
$
\begin{bmatrix}
    \Box \\
    \Box \\
    \Box \\
    \Box
\end{bmatrix}
$
$\ \to\ a^{<1>}\ \to\ $
$
\begin{bmatrix}
    \Box \\
    \Box \\
    \Box \\
    \Box
\end{bmatrix}
$
$\ \to\ a^{<2>}\ \to\ $
$
\begin{bmatrix}
    \Box \\
    \Box \\
    \Box \\
    \Box
\end{bmatrix}
$
$\ \to\ a^{<3>}\ \to\ ...\ \to\ $
$
\begin{bmatrix}
    \Box \\
    \Box \\
    \Box \\
    \Box
\end{bmatrix}
$\
$\ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \uparrow$
$\ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \uparrow$
$\ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \uparrow$
$\ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \uparrow$\
$\ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \hat{x}^{<1>}$
$\ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \hat{x}^{<2>}$
$\ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \hat{x}^{<3>}$
$\ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \hat{x}^{<T_x>}$

The network produces a prediction after reading each word, and the activations from earlier words influence later predictions.\
The parameters are $W_{ax}$ for inputs, $W_{aa}$ for activations, and $W_{ya}$ for predictions. The same parameters are reused at every time step, which is why the network is called ___Recurrent___.

Downside: only the words before the target word are considered. Solution: Bidirectional RNN (BRNN).

## Forward Propagation
__Note:__ parameter $W_{ij}$ is the matrix that multiplies a $j$-type quantity to compute an $i$-type quantity; for example, $W_{ax}$ multiplies $x$ to compute $a$.

$a^{<0>} = \vec{0}$\
$a^{<1>} = g(W_{aa}a^{<0>} + W_{ax}x^{<1>} + b_a)$, where $g$ is often $\tanh$\
$\hat{y}^{<1>} = g_2(W_{ya}a^{<1>} + b_y)$, where $g_2$ depends on the type of prediction (softmax, sigmoid, or others)

Generally\
$a^{<t>} = g(W_{aa}a^{<t-1>} + W_{ax}x^{<t>} + b_a)$\
$\hat{y}^{<t>} = g_2(W_{ya}a^{<t>} + b_y)$

Simplified\
$a^{<t>} = g(W_a[a^{<t-1>}, x^{<t>}] + b_a)$\
$\hat{y}^{<t>} = g_2(W_ya^{<t>} + b_y)$\
Where $W_a = [W_{aa}|W_{ax}]$; for example, with 100 hidden units and a 10,000-word dictionary, $W_{aa}: (100,100), W_{ax}: (100,10000), W_a: (100,10100)$\
$[a^{<t-1>}, x^{<t>}] = $
$
\begin{bmatrix}
    a^{<t-1>} \\
    x^{<t>}
\end{bmatrix}
$
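
The forward pass above can be sketched in NumPy as follows. This is an illustration under assumed toy sizes (4 hidden units, a 6-word dictionary) with random, untrained weights; `rnn_forward` and `softmax` are hypothetical helper names, and softmax is assumed for $g_2$. The last lines check that the simplified form with $W_a = [W_{aa}|W_{ax}]$ gives the same result as the two separate products.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D vector."""
    e = np.exp(z - np.max(z))
    return e / e.sum()

def rnn_forward(xs, W_aa, W_ax, W_ya, b_a, b_y):
    """Run the RNN over a list of input vectors; return activations and predictions."""
    a = np.zeros(W_aa.shape[0])             # a^{<0>} = 0
    activations, predictions = [], []
    for x in xs:                            # t = 1 .. T_x
        a = np.tanh(W_aa @ a + W_ax @ x + b_a)
        predictions.append(softmax(W_ya @ a + b_y))
        activations.append(a)
    return activations, predictions

# Toy sizes (assumed): 4 hidden units, 6-word dictionary.
n_a, n_x = 4, 6
rng = np.random.default_rng(0)
W_aa = rng.normal(size=(n_a, n_a))
W_ax = rng.normal(size=(n_a, n_x))
W_ya = rng.normal(size=(n_x, n_a))
b_a, b_y = np.zeros(n_a), np.zeros(n_x)

xs = [np.eye(n_x)[i] for i in (1, 3, 5)]    # three one-hot words
activations, predictions = rnn_forward(xs, W_aa, W_ax, W_ya, b_a, b_y)

# Simplified form: W_a [a; x] equals W_aa a + W_ax x.
W_a = np.hstack([W_aa, W_ax])               # shape (n_a, n_a + n_x)
stacked = np.concatenate([activations[0], xs[1]])
same = np.allclose(W_a @ stacked, W_aa @ activations[0] + W_ax @ xs[1])
```

Note that the same five parameter arrays are used at every time step; only the activation `a` changes as the loop advances.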

## Back Propagation Through Time
Through time: Forward Propagation runs from left to right (increasing $t$), so Back Propagation runs from right to left (decreasing $t$).

Each prediction has a loss $L^{<t>}(\hat{y}^{<t>}, y^{<t>}) = -y^{<t>}\log\hat{y}^{<t>}-(1-y^{<t>})\log(1-\hat{y}^{<t>})$, and the overall loss is $L(\hat{y}, y) = \sum_{t=1}^{T_y}{L^{<t>}(\hat{y}^{<t>}, y^{<t>})}$
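
To make the loss concrete, here is a small sketch that evaluates the per-step loss and sums it over a sequence. The predictions and labels are made-up numbers for a binary output at each time step; `step_loss` and `sequence_loss` are hypothetical names.

```python
import numpy as np

def step_loss(y_hat, y):
    """Cross-entropy loss L^{<t>} for one binary prediction."""
    return -y * np.log(y_hat) - (1 - y) * np.log(1 - y_hat)

def sequence_loss(y_hats, ys):
    """Overall loss: the sum of the per-step losses over t = 1 .. T_y."""
    return sum(step_loss(y_hat, y) for y_hat, y in zip(y_hats, ys))

y_hats = [0.9, 0.2, 0.7]   # made-up predictions
ys = [1, 0, 1]             # made-up labels
total = sequence_loss(y_hats, ys)
```

Here the total equals $-\log(0.9) - \log(0.8) - \log(0.7)$, since each term keeps only the log-probability assigned to the correct label.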

# Different Types of RNN Architectures
Many-to-one (outputs $\hat{y}$ only at the last time step)\
One-to-many (takes an input only at the first time step, then feeds the previous output $\hat{y}^{<t-1>}$ in as $x^{<t>}$)

## Many-to-many ($T_x \neq T_y$)
An encoder of length $T_x$ followed by a decoder of length $T_y$

$\ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \hat{y}^{<1>}$
$\ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \hat{y}^{<T_y>}$\
$\ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \uparrow$
$\ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \uparrow$\
$
\begin{bmatrix}
    \Box \\
    \Box \\
    \Box \\
    \Box
\end{bmatrix}
$
$\ \to\ \dots\ \to\ $
$
\begin{bmatrix}
    \Box \\
    \Box \\
    \Box \\
    \Box
\end{bmatrix}
$
$\ \to\ $
$
\begin{bmatrix}
    \Box \\
    \Box \\
    \Box \\
    \Box
\end{bmatrix}
$
$\ \to\ \dots\ \to\ $
$
\begin{bmatrix}
    \Box \\
    \Box \\
    \Box \\
    \Box
\end{bmatrix}
$\
$\ \ \uparrow$
$\ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \uparrow$\
$\ \hat{x}^{<1>}$
$\ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \hat{x}^{<T_x>}$


# Language Modelling
## Tokenization
Create one-hot vectors for words.\
Add \<EOS> to the end of each sentence as an additional token.\
Mark words outside the dictionary as \<UNK>.

## Training
Give all the words before position $t$ as input, and predict the probability of the word at position $t$.\
This means $x^{<1>} = \vec{0}, x^{<t>} = y^{<t-1>}$ until the last token \<EOS>.\
The total loss is the sum of the softmax losses over all outputs in the sequence.\
Training this way teaches the model to predict the next word given the previous words.
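
The tokenization and input shifting can be sketched as below. The six-word dictionary and the helpers `tokenize` and `make_training_pair` are hypothetical; the point is only to show the targets $y^{<t>}$ next to the inputs $x^{<t>} = y^{<t-1>}$.

```python
import numpy as np

# Hypothetical toy dictionary including the two special tokens.
vocab = ["cats", "sleep", "a", "lot", "<EOS>", "<UNK>"]
word_to_index = {w: i for i, w in enumerate(vocab)}

def tokenize(sentence):
    """Map words to dictionary indices, append <EOS>, and use <UNK> for unknown words."""
    words = sentence.lower().split() + ["<EOS>"]
    return [word_to_index.get(w, word_to_index["<UNK>"]) for w in words]

def make_training_pair(indices, vocab_size):
    """Build one-hot targets y and inputs x with x^{<1>} = 0 and x^{<t>} = y^{<t-1>}."""
    targets = np.eye(vocab_size)[indices]
    inputs = np.vstack([np.zeros(vocab_size), targets[:-1]])
    return inputs, targets

indices = tokenize("Cats nap a lot")       # "nap" is not in the dictionary
inputs, targets = make_training_pair(indices, len(vocab))
```

Each row of `inputs` is the previous row of `targets`, so at every step the model sees the true preceding word and is asked for the next one.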

## Novel Sequence Sampling
Randomly choose a word according to the softmax probabilities of the first output. Then feed that word to the model as the next input, and randomly choose another word according to the new softmax probabilities. Repeat the process until \<EOS> is sampled.
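
A minimal sketch of this sampling loop is shown below. It assumes a hypothetical four-word dictionary and random, untrained weights, so the sampled words are meaningless; the `max_len` cap is an added safeguard (not part of the description above) in case \<EOS> is never drawn.

```python
import numpy as np

vocab = ["the", "cat", "sat", "<EOS>"]     # hypothetical toy dictionary
n_a, n_x = 5, len(vocab)
rng = np.random.default_rng(0)
# Untrained random parameters, for illustration only.
W_aa = rng.normal(size=(n_a, n_a))
W_ax = rng.normal(size=(n_a, n_x))
W_ya = rng.normal(size=(n_x, n_a))
b_a, b_y = np.zeros(n_a), np.zeros(n_x)

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def sample_sequence(rng, max_len=20):
    """Sample words one at a time, feeding each sampled word back in as the next input."""
    a = np.zeros(n_a)                      # a^{<0>} = 0
    x = np.zeros(n_x)                      # x^{<1>} = 0
    words = []
    for _ in range(max_len):
        a = np.tanh(W_aa @ a + W_ax @ x + b_a)
        probs = softmax(W_ya @ a + b_y)
        idx = rng.choice(n_x, p=probs)     # random draw, not argmax
        words.append(vocab[idx])
        if vocab[idx] == "<EOS>":
            break
        x = np.eye(n_x)[idx]               # sampled word becomes the next input
    return words

sampled = sample_sequence(rng)
```

Drawing from the distribution instead of taking the most likely word is what makes each generated sequence different.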

Besides word-level modelling, character-level modelling is sometimes used, though it is more computationally expensive and worse at capturing long-range dependencies.

# Vanishing Gradients with RNN
Traditional RNNs are not good at capturing long-term dependencies between words, such as whether a verb far from its subject should be singular or plural, due to vanishing gradients.\
When gradients vanish, an output is influenced mostly by nearby inputs, and information from far earlier in the sequence is effectively lost.
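
The effect can be illustrated numerically. The sketch below is a contrived setup, not a trained model: the recurrent matrix is built so that all its singular values are 0.5, random vectors stand in for $W_{ax}x^{<t>} + b_a$, and the gradient arriving at the last activation is assumed to be all ones. It then backpropagates through 50 $\tanh$ steps and records the gradient norm.

```python
import numpy as np

n_a, T = 8, 50
rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.normal(size=(n_a, n_a)))
W_aa = 0.5 * Q                             # assumed: every singular value is 0.5

# Forward pass; the random vector stands in for W_ax x^{<t>} + b_a.
a = np.zeros(n_a)
activations = []
for _ in range(T):
    a = np.tanh(W_aa @ a + rng.normal(size=n_a))
    activations.append(a)

# Backward pass: dL/da^{<t-1>} = W_aa^T ((1 - a^{<t>}^2) * dL/da^{<t>}).
grad = np.ones(n_a)                        # assumed gradient at the last step
norms = []
for a_t in reversed(activations):
    grad = W_aa.T @ ((1.0 - a_t ** 2) * grad)
    norms.append(np.linalg.norm(grad))
```

In this setup each step shrinks the norm by at least half, so `norms[-1]` is many orders of magnitude smaller than `norms[0]`: the earliest inputs receive almost no learning signal.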

## Solution: Gated Recurrent Unit (GRU)
Intrinsically, a GRU maintains a memory cell that can remain unchanged (and therefore influential) over many time steps, and be overwritten at a later point.\
Whether to keep the stored value or replace it with the current candidate value is decided by the update gate $\Gamma_u$, which is the output of a sigmoid function and is therefore approximately 0 or 1 most of the time.

__Note:__
1. For GRU, the memory cell is the same as the hidden state, whereas they are different in the LSTM model.
2. The network still outputs predictions as usual; the gate only controls whether the value passed between time steps is remembered or overwritten.
3. To remember more than one feature, the memory cell $c^{<t>}$ is usually __a vector__, and each time step updates (or keeps) every element of $c^{<t>}$ independently.

__Algorithm__ \
$$c^{<t>} = a^{<t>}$$
$$\tilde{c}^{<t>} = \tanh(W_c[c^{<t-1>}, x^{<t>}] + b_c)$$
$$\Gamma_u = \sigma(W_u[c^{<t-1>}, x^{<t>}] + b_u)$$
$$c^{<t>} = \Gamma_u \odot \tilde{c}^{<t>} + (1-\Gamma_u) \odot c^{<t-1>}$$

_The above $\odot$ means element-wise product._ \
For intuition, $\Gamma_u$ is treated as exactly 0 or 1, but in practice that is only an approximation.\
So $c^{<t>}$ is really a weighted mean of the stored memory and the candidate value, with $\Gamma_u$ setting how much weight the candidate value gets.

It turns out that another gate, $\Gamma_r$ (r for relevance), is used in most GRU models. Why? No deep reason; it was found to work well in practice. It scales $c^{<t-1>}$ before the candidate value is computed:
$$\tilde{c}^{<t>} = \tanh(W_c[\Gamma_r \odot c^{<t-1>}, x^{<t>}] + b_c)$$
$$\Gamma_r = \sigma(W_r[c^{<t-1>}, x^{<t>}] + b_r)$$
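
One GRU time step can be sketched as follows. This assumes the common formulation in which $\Gamma_r$ multiplies $c^{<t-1>}$ element-wise inside the concatenation; the parameter dictionary `p`, the toy sizes, and the extreme bias values are all illustrative choices. The biases of the update gate are pushed to $-50$ and $+50$ to force $\Gamma_u \approx 0$ and $\Gamma_u \approx 1$.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(c_prev, x, p):
    """One GRU step; the returned c^{<t>} is also the hidden state a^{<t>}."""
    concat = np.concatenate([c_prev, x])
    gamma_u = sigmoid(p["W_u"] @ concat + p["b_u"])        # update gate
    gamma_r = sigmoid(p["W_r"] @ concat + p["b_r"])        # relevance gate
    gated = np.concatenate([gamma_r * c_prev, x])
    c_tilde = np.tanh(p["W_c"] @ gated + p["b_c"])         # candidate value
    return gamma_u * c_tilde + (1.0 - gamma_u) * c_prev

n_c, n_x = 3, 4                                            # assumed toy sizes
rng = np.random.default_rng(0)
p = {name: rng.normal(size=(n_c, n_c + n_x)) for name in ("W_c", "W_u", "W_r")}
p.update({name: np.zeros(n_c) for name in ("b_c", "b_r")})

c_prev = np.array([0.5, -0.3, 0.8])
x = np.eye(n_x)[1]

p["b_u"] = np.full(n_c, -50.0)      # gate closed: memory is kept
c_kept = gru_step(c_prev, x, p)

p["b_u"] = np.full(n_c, 50.0)       # gate open: memory is replaced by the candidate
c_replaced = gru_step(c_prev, x, p)
```

With the gate closed, `c_kept` matches `c_prev` to within floating-point precision, which is how a GRU carries information across many time steps.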



## LSTM Model
In LSTM, the memory cell and the hidden state are stored separately.\
An LSTM has 3 controlling gates: update, forget, and output (UFO).\
Update: how much of the new candidate memory is added.\
Forget: how much of the old memory is preserved.\
Output: how much of the memory is exposed (transferred to the hidden state). ___(What's the point?)___

__Algorithm__
$$\tilde{c}^{<t>} = \tanh(W_c[a^{<t-1>}, x^{<t>}] + b_c)$$
$$\Gamma_u = \sigma(W_u[a^{<t-1>}, x^{<t>}] + b_u)$$
$$\Gamma_f = \sigma(W_f[a^{<t-1>}, x^{<t>}] + b_f)$$
$$\Gamma_o = \sigma(W_o[a^{<t-1>}, x^{<t>}] + b_o)$$
$$c^{<t>} = \Gamma_u \odot \tilde{c}^{<t>} + \Gamma_f \odot c^{<t-1>}$$
$$a^{<t>} = \Gamma_o \odot \tanh(c^{<t>})$$
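
One LSTM time step can be sketched in the same way. This assumes the standard formulation in which the candidate value and all three gates are computed from $[a^{<t-1>}, x^{<t>}]$; the parameter dictionary `p`, the toy sizes, and the extreme bias values are illustrative. Setting the forget-gate bias to $+50$ and the update-gate bias to $-50$ forces the cell to keep its old memory.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(a_prev, c_prev, x, p):
    """One LSTM step; returns the new hidden state a^{<t>} and memory cell c^{<t>}."""
    concat = np.concatenate([a_prev, x])
    c_tilde = np.tanh(p["W_c"] @ concat + p["b_c"])        # candidate memory
    gamma_u = sigmoid(p["W_u"] @ concat + p["b_u"])        # update gate
    gamma_f = sigmoid(p["W_f"] @ concat + p["b_f"])        # forget gate
    gamma_o = sigmoid(p["W_o"] @ concat + p["b_o"])        # output gate
    c = gamma_u * c_tilde + gamma_f * c_prev
    a = gamma_o * np.tanh(c)
    return a, c

n_a, n_x = 3, 4                                            # assumed toy sizes
rng = np.random.default_rng(0)
p = {w: rng.normal(size=(n_a, n_a + n_x)) for w in ("W_c", "W_u", "W_f", "W_o")}
p.update({b: np.zeros(n_a) for b in ("b_c", "b_o")})
p["b_u"] = np.full(n_a, -50.0)      # add (almost) nothing new
p["b_f"] = np.full(n_a, 50.0)       # keep (almost) all of the old memory

a_prev = np.zeros(n_a)
c_prev = np.array([0.5, -0.3, 0.8])
x = np.eye(n_x)[1]
a, c = lstm_step(a_prev, c_prev, x, p)
```

Unlike the GRU, the update and forget gates are independent here, so the cell can both keep the old memory and add new memory in the same step.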

Although LSTM is the older model, it is often the more powerful of the two because its three gates make it more flexible. GRU is simpler and faster to compute, so it is sometimes favored in large-scale projects.

# Bidirectional RNNs
A unidirectional RNN has trouble when the words before the target are identical and only later words disambiguate it.\
Eg.\
He said, "Teddy bears."\
He said, "Teddy Roosevelt."

A bidirectional RNN uses two RNNs, one reading left to right and one reading right to left, to combine information from the words before and after the target word.\
Both RNNs contribute their activation at position $t$ to the prediction:
$$\hat{y}^{<t>} = g(W_y[\overleftarrow{a}^{<t>}, \overrightarrow{a}^{<t>}] + b_y)$$
Whereas the original RNN is $$\hat{y}^{<t>} = g(W_y\overrightarrow{a}^{<t>} + b_y)$$
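
The sketch below illustrates the idea with two plain $\tanh$ RNNs and random, untrained weights. The word indices, sizes, and helper names (`run_rnn`, `brnn_predict`) are hypothetical. It encodes the two example sentences and compares what each direction knows at the position of "Teddy".

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def run_rnn(xs, W_aa, W_ax, b_a):
    """Return the activation at every position for one reading direction."""
    a = np.zeros(W_aa.shape[0])
    out = []
    for x in xs:
        a = np.tanh(W_aa @ a + W_ax @ x + b_a)
        out.append(a)
    return out

n_a, n_x, n_y = 3, 5, 2                    # assumed toy sizes
rng = np.random.default_rng(0)
fwd = (rng.normal(size=(n_a, n_a)), rng.normal(size=(n_a, n_x)), np.zeros(n_a))
bwd = (rng.normal(size=(n_a, n_a)), rng.normal(size=(n_a, n_x)), np.zeros(n_a))
W_y, b_y = rng.normal(size=(n_y, 2 * n_a)), np.zeros(n_y)

def brnn_predict(xs):
    """Combine left-to-right and right-to-left activations at every position."""
    a_fwd = run_rnn(xs, *fwd)
    a_bwd = run_rnn(xs[::-1], *bwd)[::-1]  # read reversed, then realign positions
    y_hats = [softmax(W_y @ np.concatenate([b, f]) + b_y)
              for f, b in zip(a_fwd, a_bwd)]
    return y_hats, a_fwd

# Hypothetical indices: he=0, said=1, teddy=2, bears=3, roosevelt=4.
eye = np.eye(n_x)
bears = [eye[i] for i in (0, 1, 2, 3)]
roosevelt = [eye[i] for i in (0, 1, 2, 4)]
y_bears, fwd_bears = brnn_predict(bears)
y_roosevelt, fwd_roosevelt = brnn_predict(roosevelt)
```

At position 2 ("Teddy") the forward activations of the two sentences are identical, but the bidirectional predictions differ, because the backward RNN has already read the following word.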

__Problem with BRNN:__ It needs the full input sequence before it can make any prediction, so it can't be applied directly to tasks like real-time speech recognition.

# Deep RNNs
Deep RNNs are RNNs stacked on top of one another, passing hidden states both horizontally (across time steps) and vertically (across layers). Each RNN is a layer.

For a hidden state $a^{[l]<t>}$ in layer $l$ at time step $t$ (with $a^{[0]<t>} = x^{<t>}$),
$$a^{[l]<t>} = g(W_a^{[l]}[a^{[l]<t-1>}, a^{[l-1]<t>}]+b_a^{[l]})$$
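
As a sketch of this update, the function below advances a stack of layers by one time step, with each layer having its own weights and $\tanh$ assumed for $g$. The sizes, the three-layer depth, and the name `deep_rnn_step` are illustrative assumptions.

```python
import numpy as np

def deep_rnn_step(a_prev_layers, x, params):
    """Advance every layer by one time step.

    a_prev_layers[l] holds that layer's activation from time t-1;
    params[l] is the (W_a, b_a) pair of that layer.
    """
    below = x                              # the input plays the role of a^{[0]<t>}
    a_new_layers = []
    for a_prev, (W_a, b_a) in zip(a_prev_layers, params):
        a = np.tanh(W_a @ np.concatenate([a_prev, below]) + b_a)
        a_new_layers.append(a)
        below = a                          # passed vertically to the next layer
    return a_new_layers

n_x, n_a, n_layers = 5, 4, 3               # assumed toy sizes
rng = np.random.default_rng(0)
params = []
for l in range(n_layers):
    n_below = n_x if l == 0 else n_a       # the first layer reads x, the others read a
    params.append((rng.normal(size=(n_a, n_a + n_below)), np.zeros(n_a)))

a_layers = [np.zeros(n_a) for _ in range(n_layers)]
for x in np.eye(n_x)[[0, 2, 4]]:           # three one-hot inputs
    a_layers = deep_rnn_step(a_layers, x, params)
```

Each layer receives two inputs at every step: its own previous activation (horizontal) and the current activation of the layer below (vertical).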

Deep RNNs are computationally expensive, so 3 recurrent layers is usually the ceiling (__but who knows?__).\
However, in place of more horizontally connected layers, a deep network with only vertical connections can be stacked on top at each time step to compute $\hat{y}^{<t>}$.