## Representing Sequences

### Tokenizing

Split the sequence into tokens. A token may be a single character, a word, or a span of several words.
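As a minimal sketch of the two simplest granularities, the snippet below splits a string into word-level or character-level tokens. The function name `tokenize`, the `level` parameter, and the example sentence are illustrative assumptions; splitting on whitespace is a simplification that ignores punctuation.

```python
def tokenize(text, level="word"):
    """Split text into tokens at the character or word level."""
    if level == "char":
        return list(text)      # every character (including spaces) is a token
    if level == "word":
        return text.split()    # naive whitespace split; punctuation is not handled
    raise ValueError(f"unknown level: {level}")


text = "the cat sat on the mat"
word_tokens = tokenize(text, "word")
char_tokens = tokenize(text, "char")

print(word_tokens)       # ['the', 'cat', 'sat', 'on', 'the', 'mat']
print(len(char_tokens))  # 22
```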

### Embedding

Construct a vocabulary (size $v$) containing all unique tokens in the sequence, and represent each token either by its integer index in the vocabulary or by a one-hot vector of size $v$.
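The following sketch builds such a vocabulary and both representations with NumPy. The example sentence and the choice to sort tokens alphabetically before assigning indices are assumptions made for reproducibility, not part of the method.

```python
import numpy as np

tokens = "the cat sat on the mat".split()

# Vocabulary: every unique token gets an integer index.
# Sorting is an arbitrary choice that keeps the mapping deterministic.
vocab = {tok: idx for idx, tok in enumerate(sorted(set(tokens)))}
v = len(vocab)

# Integer representation: one index per token.
indices = np.array([vocab[tok] for tok in tokens])

# One-hot representation: row i is all zeros except a 1 at column indices[i].
one_hot = np.eye(v)[indices]

print(v)              # 5
print(indices)        # [4 0 3 2 4 1]
print(one_hot.shape)  # (6, 5)
```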

## Language Models

Within a sequence, language models try to predict the next token from previously seen tokens.

An n-gram model assumes that only the previous $n-1$ tokens influence the current token, i.e. the sequence satisfies a Markov property of order $n-1$:

$$P(x_t | x_{t-1}, \dots, x_1) = P(x_t | x_{t-1}, \dots, x_{t-n+1})$$

There are different approaches to estimating this probability.

### Word Frequency Approach

The word frequency approach estimates the probability simply by counting how many times certain word combinations appear in the sequence. For example, with $n = 3$:

$$P(x_3 | x_2, x_1) = \frac{n(x_3, x_2, x_1)}{n(x_2, x_1)}$$

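But this approach performs poorly when $n$ gets large: the number of possible word combinations grows exponentially with $n$, so most combinations never appear in the training sequence and their estimated probability is 0.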
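The counting estimate can be sketched with `collections.Counter`. The helper name `trigram_prob` and the toy sentence are assumptions for illustration; returning 0 for an unseen context is one possible convention, and no smoothing is applied.

```python
from collections import Counter


def trigram_prob(tokens, x1, x2, x3):
    """Estimate P(x3 | x2, x1) as n(x1, x2, x3) / n(x1, x2)."""
    trigrams = Counter(zip(tokens, tokens[1:], tokens[2:]))
    bigrams = Counter(zip(tokens, tokens[1:]))
    if bigrams[(x1, x2)] == 0:
        return 0.0  # context never seen (assumed convention, no smoothing)
    return trigrams[(x1, x2, x3)] / bigrams[(x1, x2)]


tokens = "the cat sat on the mat the cat ran".split()

print(trigram_prob(tokens, "the", "cat", "sat"))  # 0.5
print(trigram_prob(tokens, "on", "the", "mat"))   # 1.0
print(trigram_prob(tokens, "the", "cat", "mat"))  # 0.0 (novel combination)
```

The last call shows the weakness described above: a perfectly plausible combination gets probability 0 only because it never occurred in the data.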

### Neural Network Approach

The neural network approach summarizes information from previous tokens into a hidden variable:

$$h_{t-1} = f_{summary}(x_{t-1}, \dots, x_{t-n+1})$$

With this hidden variable, the probability can be written as:

$$P(x_t | x_{t-1}, \dots, x_{t-n+1}) = P(x_t | h_{t-1})$$

At each time step, the neural network performs two calculations:

1. Update the hidden variable: $h_t = f_{summary}(x_t, h_{t-1})$.

2. Predict the next token: $o_t = P(x_{t+1} | h_t) = f_{predict}(h_t)$. Note that $h_t$ contains information from both $x_t$ and $h_{t-1}$, so it is used to predict $x_{t+1}$ rather than $x_t$ itself.

These two functions are implemented as MLP-style layers (written in matrix form for batch computation):

$$H_t = \phi(X_tW_{xh} + H_{t-1}W_{hh} + \mathbf{b}_h)$$

$$O_t = H_tW_{hp} + \mathbf{b}_p$$

Here $\phi$ is a nonlinear activation function, and the parameters $W, \mathbf{b}$ are learned through training.
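A minimal NumPy sketch of one such time step is shown below. The sizes, the random initialization, the choice of $\tanh$ for $\phi$, and the `softmax` helper that turns $O_t$ into probabilities are all assumptions for illustration; the notes do not fix any of them.

```python
import numpy as np

rng = np.random.default_rng(0)
n, v, h = 2, 5, 4  # batch size, vocabulary size, hidden size (arbitrary)

# Randomly initialized parameters; in practice these are learned.
W_xh = rng.normal(scale=0.1, size=(v, h))
W_hh = rng.normal(scale=0.1, size=(h, h))
b_h = np.zeros(h)
W_hp = rng.normal(scale=0.1, size=(h, v))
b_p = np.zeros(v)


def rnn_step(X_t, H_prev):
    """One time step: update the hidden state, then compute the output."""
    H_t = np.tanh(X_t @ W_xh + H_prev @ W_hh + b_h)  # phi assumed to be tanh
    O_t = H_t @ W_hp + b_p                           # unnormalized scores
    return H_t, O_t


def softmax(Z):
    """Row-wise softmax (assumed way to turn scores into probabilities)."""
    Z = Z - Z.max(axis=-1, keepdims=True)  # subtract max for stability
    E = np.exp(Z)
    return E / E.sum(axis=-1, keepdims=True)


X_t = np.eye(v)[[0, 3]]     # one-hot inputs for a batch of two tokens
H_prev = np.zeros((n, h))   # initial hidden state
H_t, O_t = rnn_step(X_t, H_prev)
probs = softmax(O_t)

print(H_t.shape, O_t.shape)  # (2, 4) (2, 5)
```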

### Measuring Model Performance

Perplexity is defined as:

$$\exp \left( -\frac{1}{n} \sum_{t=1}^n \log P(x_t | x_{t-1}, \dots, x_1) \right)$$

where $x_t$ is the actual label, and $P(x_t | x_{t-1}, \dots, x_1)$ is the probability the model assigns to $x_t$ being the current token.

In the best case, the model predicts a probability of 1 for every actual $x_t$, and perplexity equals 1.

In the worst case, the model predicts a probability of 0 for every actual $x_t$, and perplexity equals positive infinity.
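The definition and its two extremes can be checked numerically with a short sketch. The function name `perplexity` is an assumption; its input is the list of probabilities the model assigned to the actual tokens.

```python
import numpy as np


def perplexity(probs):
    """exp of the average negative log-probability of the actual tokens."""
    probs = np.asarray(probs, dtype=float)
    # log(0) = -inf is intended here, so silence the divide-by-zero warning.
    with np.errstate(divide="ignore"):
        return float(np.exp(-np.mean(np.log(probs))))


print(perplexity([1.0, 1.0, 1.0]))  # 1.0, the best case
print(perplexity([0.25] * 4))       # about 4: as uncertain as a uniform guess over 4 tokens
print(perplexity([0.5, 0.0]))       # inf: one actual token was given probability 0
```

The middle case illustrates a common reading of perplexity: a model that spreads its probability uniformly over $k$ tokens has perplexity $k$.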

## Encoder-Decoder

The encoder-decoder is a general structure for sequence-to-sequence (seq2seq) tasks.

The encoder takes an input sequence of variable length $T$ and computes a state $\mathbf{c}$ of fixed length.

The decoder takes the state from the encoder together with the previously predicted tokens, and predicts the current token.

### Implementation with RNN

#### Encoder

RNN, but without output $o_t$.

At each time step, the encoder only calculates its hidden state $h_t = f_{summary}(x_t, h_{t-1}) \in \mathbf{R}^h$ from the embedded input token $x_t \in \mathbf{R}^v$.

In batch form (batch size $n$):

$$H_t = \phi(X_tW_{xh} + H_{t-1}W_{hh} + \mathbf{b}_h) \in \mathbf{R}^{n \times h}$$

The generated state is just $\mathbf{c} = H_T$.
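A minimal NumPy sketch of this encoder follows. The sizes, the random parameters, the zero initial state, and the use of $\tanh$ for $\phi$ are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, T, v, h = 2, 3, 5, 4  # batch size, sequence length, vocab size, hidden size

# Randomly initialized parameters; in practice these are learned.
W_xh = rng.normal(scale=0.1, size=(v, h))
W_hh = rng.normal(scale=0.1, size=(h, h))
b_h = np.zeros(h)


def encode(X):
    """X has shape (T, n, v). Returns the final hidden state c = H_T, shape (n, h)."""
    H = np.zeros((X.shape[1], h))  # H_0 assumed to be all zeros
    for X_t in X:                  # iterate over time steps
        H = np.tanh(X_t @ W_xh + H @ W_hh + b_h)
    return H                       # no per-step output o_t is computed


token_ids = rng.integers(0, v, size=(T, n))  # random token indices as dummy input
X = np.eye(v)[token_ids]                     # one-hot inputs, shape (T, n, v)
c = encode(X)

print(c.shape)  # (2, 4)
```

The state `c` has the same shape regardless of `T`, which is what lets the decoder consume inputs of any length.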

#### Decoder

RNN, but with 2 modifications, each of which has 2 design choices:

1. At each time step $t$, use the actual embedded target token $y_{t-1}$ as input ("teacher forcing", only possible during training), or use the model's own prediction $o_{t-1}$.

2. Concatenate $\mathbf{c}$ to the input at every time step, or only use $\mathbf{c}$ as the initial hidden state $h_0$.

Choosing the former option in both cases, our modified RNN calculates the decoder hidden state $s_t = f_{summary}(y_{t-1}, \mathbf{c}, s_{t-1}) \in \mathbf{R}^s$ and the output $o_t = f_{predict}(s_t) \in \mathbf{R}^v$.

In batch form (batch size $n$):

$$\tilde{Y}_{t-1} = \begin{bmatrix} Y_{t-1} & H_T \end{bmatrix} \in \mathbf{R}^{n \times (v+h)}$$

$$S_t = \phi(\tilde{Y}_{t-1}W_{xs} + S_{t-1}W_{ss} + \mathbf{b}_s) \in \mathbf{R}^{n \times s}$$

$$O_t = S_tW_{sp} + \mathbf{b}_p \in \mathbf{R}^{n \times v}$$

During testing, the target tokens are unavailable and teacher forcing is not possible, so the previous output $o_{t-1}$ has to be used as input (in practice, usually the one-hot vector of its most likely token):

$$\tilde{O}_{t-1} = \begin{bmatrix} O_{t-1} & H_T \end{bmatrix} \in \mathbf{R}^{n \times (v+h)}$$

$$S_t = \phi(\tilde{O}_{t-1}W_{xs} + S_{t-1}W_{ss} + \mathbf{b}_s) \in \mathbf{R}^{n \times s}$$

$$O_t = S_tW_{sp} + \mathbf{b}_p \in \mathbf{R}^{n \times v}$$
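The decoder equations above can be sketched in NumPy as follows, covering both the teacher-forced and the free-running case. Several details are assumptions not stated in the notes: $\tanh$ for $\phi$, a zero initial decoder state, a hypothetical start-of-sequence token with index `bos_id = 0`, and feeding back the one-hot vector of the most likely token (greedy decoding) instead of the raw output $o_{t-1}$. A random matrix stands in for the encoder state $H_T$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, v, h, s = 2, 5, 4, 6  # batch, vocab, encoder hidden, decoder hidden sizes

# Randomly initialized parameters; in practice these are learned.
W_xs = rng.normal(scale=0.1, size=(v + h, s))
W_ss = rng.normal(scale=0.1, size=(s, s))
b_s = np.zeros(s)
W_sp = rng.normal(scale=0.1, size=(s, v))
b_p = np.zeros(v)


def decoder_step(prev_input, c, S_prev):
    """One decoder step: concatenate the previous token with c, then update."""
    inp = np.concatenate([prev_input, c], axis=1)    # shape (n, v + h)
    S_t = np.tanh(inp @ W_xs + S_prev @ W_ss + b_s)  # phi assumed to be tanh
    O_t = S_t @ W_sp + b_p                           # shape (n, v)
    return S_t, O_t


def decode(c, steps, targets=None, bos_id=0):
    """Run the decoder; teacher forcing is used when targets are given."""
    S = np.zeros((c.shape[0], s))                  # assumed zero initial state
    prev = np.eye(v)[np.full(c.shape[0], bos_id)]  # assumed start token
    outputs = []
    for t in range(steps):
        S, O = decoder_step(prev, c, S)
        outputs.append(O)
        if targets is not None:
            prev = np.eye(v)[targets[t]]           # teacher forcing (training)
        else:
            prev = np.eye(v)[O.argmax(axis=1)]     # greedy feedback (testing)
    return np.stack(outputs)                       # shape (steps, n, v)


c = rng.normal(size=(n, h))                # stand-in for the encoder state H_T
targets = rng.integers(0, v, size=(3, n))  # dummy target token indices
train_out = decode(c, steps=3, targets=targets)
test_out = decode(c, steps=3)

print(train_out.shape, test_out.shape)  # (3, 2, 5) (3, 2, 5)
```

Both modes share the same parameters and the same first step; they differ only in which token is fed back at later steps.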

