# Transformers



## Sources:

1) Deep Learning - Foundations and Concepts (Bishop and Bishop)

## Attention

The key concept that underpins transformers is **attention**.

For example, consider the following sentences:

1) The loud bark came from the direction of the dog.
2) Deep in the forest, the loud sound came from the bark falling from that tree over there.

The word 'bark' has a different meaning in each sentence, which can be seen by considering the words surrounding it. In the first sentence, 'dog' is the important context word; in the second, 'forest' and 'tree' are especially important.

When processing these sentences, or sequences of tokens, an effective neural network should 'attend' to (i.e. rely more heavily on) specific words in the sentence to varying degrees. 

Transformers achieve this using the attention mechanism, in which the weighting factors take values that depend on the specific input data. This is unlike traditional neural networks, whose weights are fixed after training.

A transformer can be viewed as a rich form of embedding in which a given vector is mapped to a location that depends on the other vectors in the sequence. (So a vector representing 'bark' in the above sentences can be mapped to different places in a new embedding space for the two different sentences. Open question: are these still close to each other in the embedding space?)

## Transformer processing 

The input data to a transformer is a set of vectors $\{ x_n \}$ where $x_n \in \mathbb{R}^D$ and $n \in \{ 1,\dots, N \}$. The vectors $x_n$ are referred to as 'tokens' (for example words in a sentence, patches within an image or amino acids in a protein). The elements of $x_n$ are referred to as features.

A powerful property of transformers is that a new neural network does not need to be defined to handle different data types: transformers can handle arbitrary data types by combining them into a joint set of tokens.

The data vectors containing the features are combined into a matrix $\textbf{X}$ of dimensions $N \times D$ where the $n^{th}$ row comprises the single token vector $\textbf{x}^T_n$ and where $n \in \{ 1,..., N \}$.

The fundamental building block of a transformer is a function that takes the data matrix $\textbf{X}$ and creates a transformed matrix $\widetilde{\textbf{X}}$:

\begin{equation}
\widetilde{\textbf{X}} = TransformerLayer[\textbf{X}]
\end{equation}

Applying multiple transformer layers in succession, each layer having its own weights and biases, allows the creation of deep networks capable of learning rich internal representations.

A single transformer layer comprises two stages: the attention mechanism, which mixes together the corresponding features from different token vectors across the columns of the data matrix, and a transformation of the features within each token vector, which acts on each row independently.

### Attention coefficients

The aim is to map the input tokens $\{ x_1,\dots,x_N \}$ to another set $\{y_1,\dots, y_N \}$ in a new embedding space with richer semantic structure. The value of $y_n$ should depend not just on the corresponding input vector $x_n$ but on all the vectors in the set (the context) $\{ x_1,\dots,x_N \}$. With attention, this dependence should be stronger for those inputs $x_m$ that are particularly important for determining the modified representation $y_n$. This can be described using the following linear combination of the input vectors:

\begin{equation}
y_n = \sum_{m=1}^{N} a_{nm}x_m
\end{equation}

where $a_{nm}$ are called attention weights. These coefficients should be close to zero for input tokens that have little influence on the output $y_n$ and largest for the most influential inputs in the set. Two constraints are applied to the attention weights:

1) $a_{nm} \geq 0$
2) $\sum_{m=1}^{N} a_{nm} = 1$ (coefficients define a 'partition of unity')

Note: there is a different set of coefficients for each output vector $y_n$, and constraints 1) and 2) apply to each set separately. The coefficients $a_{nm}$ depend on the input data.

#### Self-attention

Self-attention can be described in terms of information retrieval. Imagine a movie recommendation site: the attributes of each movie on the site can be encoded in a vector known as a key, and the movie itself is the corresponding value. The user provides their own desired attributes as a vector known as a query. The service compares the query vector with all the key vectors and sends the user the value whose key matches best. The user can be thought of as 'attending' to the particular movie whose key most closely matches their query. Returning a single value in this way is known as hard attention. Transformers instead use continuous variables to measure the degree of match between the queries and keys, and use these to weight the influence of the value vectors on the outputs. This is known as soft attention. It also ensures the transformer function is differentiable and can therefore be trained by gradient descent.

Each input vector $x_n$ can be used directly as the 'value' vector from which the output tokens are created, and also as the 'key' vector for input token $n$. Similarly, $x_m$ is used as the query vector for output $y_m$, which is then compared with every key vector.

To see how much the token $x_n$ should attend to $x_m$, the similarity between the two vectors needs to be calculated. This can be done using the dot product, $x_n^Tx_m$. To impose the constraints on the attention weights (i.e. 'partition of unity' and greater than or equal to 0), the softmax function can be used to normalise the attention weights:

\begin{equation}
a_{nm} = \frac{\exp(x_n^T x_m)}{\sum^{N}_{m'= 1} \exp(x_n^T x_{m'})}
\end{equation}

This can be written in matrix notation as:

\begin{equation}
Y = Softmax[XX^T]X
\end{equation}

This process is called self-attention because the same sequence is being used to determine the queries, keys and values. More completely, this is known as dot-product self-attention.
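As a minimal sketch of the matrix expression above, the following NumPy snippet computes $Y = Softmax[XX^T]X$ for a small random data matrix. The helper names (`softmax_rows`, `dot_product_self_attention`) and the toy sizes are illustrative assumptions, not part of the source.

```python
import numpy as np

def softmax_rows(A):
    """Row-wise softmax; subtracting the row maximum avoids overflow in exp()."""
    A = A - A.max(axis=1, keepdims=True)
    E = np.exp(A)
    return E / E.sum(axis=1, keepdims=True)

def dot_product_self_attention(X):
    """Parameter-free self-attention: Y = Softmax[X X^T] X for an N x D matrix X."""
    A = softmax_rows(X @ X.T)   # N x N matrix of attention weights a_nm
    return A @ X, A

# Toy data (assumed sizes): N = 4 tokens, D = 3 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 3))
Y, A = dot_product_self_attention(X)
```

Each row of `A` is non-negative and sums to one, so the two constraints on the attention weights hold by construction.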

It is evident from the above expression that the transformation from input vectors $\{x_n \}$ to output vectors $\{ y_n \}$ is fixed: it has no adjustable parameters and therefore cannot learn from the data. Additionally, each of the features in a token vector $x_n$ plays an equal role in determining the attention coefficients, whereas it would be more suitable if the network could learn which features are most pertinent when determining token similarity.

Both issues can be addressed by using modified feature vectors given by a linear transformation of the original vectors:

\begin{equation}
\widetilde{\textbf{X}} = \textbf{X}\textbf{U}
\end{equation}

where $\textbf{U}$ is a $D \times D$ matrix of learnable weight parameters, similar to a 'layer' in a neural network. The modified transformation is then:

\begin{equation}
Y = Softmax[\textbf{X} \textbf{U} \textbf{U}^T \textbf{X}^T] \textbf{X} \textbf{U}
\end{equation}

Although this approach has much more flexibility, the matrix $\textbf{X} \textbf{U} \textbf{U}^T \textbf{X}^T$ is symmetric, whereas it is important that the attention mechanism can support significant asymmetry.
(i.e. "For example, we might expect that ‘chisel’ should be strongly associated with ‘tool’ since every chisel is a tool, whereas ‘tool’ should only be weakly associated with ‘chisel’ because there are many other kinds of tools besides chisels." Bishop pg. 364)

This asymmetry can be introduced and a much more flexible model can be created by allowing the queries, keys and values to each have their own independent, learnable parameters (i.e. $W^{(q)}$, $W^{(k)}$ and $W^{(v)}$): 

\begin{equation}
Q = XW^{(q)}
\end{equation}

\begin{equation}
K = XW^{(k)}
\end{equation}

\begin{equation}
V = XW^{(v)}
\end{equation}
*Note: bias terms can be included in the above equations by absorbing them into the weight matrices: the data matrix $X$ is augmented with an additional column of ones and each weight matrix with an additional row of parameters.*

The matrix $W^{(k)}$ has dimensionality $D \times D_k$ where $D_k$ is the dimensionality of the key vector. $W^{(q)}$ must have the same dimensionality as $W^{(k)}$ so that the dot product between query and key vectors can be formed. (Usually $D_k = D$). $W^{(v)}$ has dimensionality $D \times D_v$ where $D_v$ is the dimensionality of the value vector and of the output of the attention mechanism. (Usually $D_v = D$ to help facilitate the inclusion of residual connections and to allow multiple attention layers to be stacked on top of each other).

Now the attention mechanism can be written as:

\begin{equation}
Y = Softmax[QK^T]V
\end{equation}

where $QK^T$ is a $N \times N$ matrix and $\textbf{Y}$ has dimensionality $N \times D_v$.
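The dimensionalities and the symmetry argument can be checked numerically. The sketch below uses random matrices as stand-ins for learned weights (an assumption for illustration only) and compares the shared-projection scores $XUU^TX^T$ with the separate-projection scores $QK^T$.

```python
import numpy as np

def softmax_rows(A):
    """Row-wise softmax with the usual max-subtraction for numerical stability."""
    A = A - A.max(axis=1, keepdims=True)
    E = np.exp(A)
    return E / E.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
N, D, D_k, D_v = 5, 8, 8, 8                 # assumed toy sizes
X = rng.normal(size=(N, D))

# Random stand-ins for the learnable matrices U, W^(q), W^(k), W^(v).
U = rng.normal(size=(D, D))
W_q = rng.normal(size=(D, D_k))
W_k = rng.normal(size=(D, D_k))
W_v = rng.normal(size=(D, D_v))

Q, K, V = X @ W_q, X @ W_k, X @ W_v
scores = Q @ K.T                            # N x N
Y = softmax_rows(scores) @ V                # N x D_v

XU = X @ U
shared_scores = XU @ XU.T                   # X U U^T X^T
shared_is_symmetric = np.allclose(shared_scores, shared_scores.T)
separate_is_symmetric = np.allclose(scores, scores.T)
```

With generic weights, `shared_is_symmetric` is true while `separate_is_symmetric` is false, which is the extra freedom that separate query and key projections provide.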

In the transformer layer, the signal paths have multiplicative relations between activation values: the activations are multiplied by the data-dependent attention coefficients, whereas standard neural networks multiply activations by fixed parameters. If a standard neural network learns to ignore a particular input or hidden-unit variable, it does so for all input vectors. In contrast, if one of the attention coefficients in a transformer is close to zero for a particular input, the resulting signal path ignores the corresponding incoming signal, which then has no effect on the network output for that input only.

One further refinement is to scale the argument of the softmax function:

\begin{equation}
Y = Softmax[\frac{QK^T}{\sqrt{D_k}}]V
\end{equation}

This prevents the dot products from becoming too large, which would push the softmax into a saturated region where the gradients become very small. The choice of $\sqrt{D_k}$ follows from this reasoning: if the elements of the query and key vectors are independently sampled from a zero-mean, unit-variance Gaussian distribution, then the variance of the dot product of two such vectors is $D_k$. The argument of the softmax is therefore normalized by the standard deviation, $\sqrt{D_k}$. This is known as 'scaled dot-product attention' and is the final form of the attention mechanism.
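The variance argument can be checked empirically. The following sketch draws many pairs of Gaussian query and key vectors (the sample count and $D_k = 64$ are arbitrary choices for illustration) and compares the variance of their dot products before and after dividing by $\sqrt{D_k}$.

```python
import numpy as np

rng = np.random.default_rng(0)
D_k = 64                                   # assumed key dimensionality
num_pairs = 20_000                         # arbitrary sample count

# Independent zero-mean, unit-variance Gaussian elements, as in the argument above.
q = rng.normal(size=(num_pairs, D_k))
k = rng.normal(size=(num_pairs, D_k))

dots = np.sum(q * k, axis=1)               # one dot product per (query, key) pair
var_raw = dots.var()                       # expected to be close to D_k
var_scaled = (dots / np.sqrt(D_k)).var()   # expected to be close to 1
```

Under these assumptions `var_raw` comes out near 64 and `var_scaled` near 1, so the scaled scores stay in a range where the softmax is not saturated regardless of $D_k$.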

## Multi-head attention

A single head of attention allows the output vectors to attend to data-dependent patterns of input vectors. However, there may be multiple patterns of attention that are equally relevant. Using a single attention head may lead to averaging over different patterns of attention that are important for the output. 

To rectify this, multiple heads of self-attention can be deployed in parallel, each with its own set of learnable parameters that control the calculation of the query, key and value matrices. (This is analogous to using multiple filters in a convolutional layer.)

If each head of attention is indexed by $h$ where $h \in \{ 1, \dots, H \}$ then a single layer of multi-head attention can be written as:

\begin{equation}
H_h = Attention(Q_h, K_h, V_h)
\end{equation}

where 

\begin{equation}
Q_h = XW_h^{(q)}
\end{equation}

\begin{equation}
K_h = XW_h^{(k)}
\end{equation}

\begin{equation}
V_h = XW_h^{(v)}
\end{equation}  

These heads are then concatenated together to form the output matrix $\textbf{Y}$:

\begin{equation}
Y(X) = Concat[H_1, \dots, H_H]W^O
\end{equation}

where $W^O$ is a matrix of learnable parameters which performs a linear transformation on the concatenated output of the heads. As each attention head has the dimensionality $N \times D_v$, the concatenated matrix has the dimensionality $N \times HD_v$. The linear matrix then has the dimensionality $HD_v \times D$ so that the output has the dimensionality $N \times D$ - the same dimensionality as the input.

*Note: $D_v$ is typically chosen to be equal to $D/H$ so that the resulting concatenation has dimension of $N \times D$.*
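A minimal NumPy sketch of the multi-head computation above, assuming $D_k = D_v = D/H$ and random matrices in place of trained weights. The function names are illustrative, not taken from the source.

```python
import numpy as np

def softmax_rows(A):
    A = A - A.max(axis=1, keepdims=True)
    E = np.exp(A)
    return E / E.sum(axis=1, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: Softmax[Q K^T / sqrt(D_k)] V."""
    D_k = Q.shape[1]
    return softmax_rows(Q @ K.T / np.sqrt(D_k)) @ V

def multi_head_attention(X, heads, W_o):
    """heads is a list of (W_q, W_k, W_v) tuples, one tuple per head h."""
    outputs = [attention(X @ W_q, X @ W_k, X @ W_v) for W_q, W_k, W_v in heads]
    return np.concatenate(outputs, axis=1) @ W_o   # (N x H*D_v) @ (H*D_v x D)

rng = np.random.default_rng(0)
N, D, H = 6, 8, 2                                  # assumed toy sizes
D_k = D_v = D // H
heads = [
    (rng.normal(size=(D, D_k)), rng.normal(size=(D, D_k)), rng.normal(size=(D, D_v)))
    for _ in range(H)
]
W_o = rng.normal(size=(H * D_v, D))
X = rng.normal(size=(N, D))
Y = multi_head_attention(X, heads, W_o)
```

The output has the same $N \times D$ shape as the input, which is what makes the residual connection in the next step possible.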

Multi-head self-attention forms the core building block of the transformer. As neural networks benefit greatly from depth, it makes sense to stack multiple self-attention layers on top of each other. Residual connections can be used to bypass the multi-head structure, which requires that the output dimensionality is the same as the input dimensionality ($N \times D$). The residual sum is followed by layer normalisation (post-normalisation) or, in more recent applications, the layer normalisation is applied before the attention (pre-normalisation), which can make optimisation more effective:

\begin{equation}
Z = LayerNorm[Y(X) + X] \quad \text{(post-normalisation)}
\end{equation}

\begin{equation}
Z = Y(X') + X \quad \text{(pre-normalisation)}
\end{equation}
where $X' = LayerNorm[X]$ 

So far, the outputs of the attention layer are linear combinations of the value vectors, which are then linearly combined by $W^O$ to produce the output vectors. Additionally, the value vectors are linear functions of the input vectors and thus the outputs of an attention layer are constrained to be linear combinations of the inputs. Non-linearity does enter through the softmax function used to calculate the attention weights; however, 'the output vectors are still constrained to lie in the subspace spanned by the input vectors and this limits the expressive capabilities of the attention layer' (Bishop pg. 369).
The flexibility of the transformer can be increased by post-processing the output of each layer using a standard non-linear neural network with $D$ inputs and $D$ outputs. For example, this can be a two-layer fully-connected network with ReLU hidden units. In order to preserve the ability of the transformer to process variable-length input sequences, the same shared network is applied to each of the output vectors, corresponding to the rows of $Z$. This can also be improved by using a residual connection:

\begin{equation}
\widetilde{X} = LayerNorm[MLP[Z] + Z]
\end{equation}

Pre-norm can be used instead so that:

\begin{equation}
\widetilde{X} = MLP[Z'] + Z
\end{equation}

where $Z' = LayerNorm[Z]$
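Putting the two sub-layers together, a complete transformer layer can be sketched as below, in both pre-norm and post-norm form. For brevity this sketch uses a single attention head in place of $Y(X)$, a layer normalisation without learnable gain and shift, and a hidden width of $4D$; all three are simplifying assumptions, and the weights are random stand-ins.

```python
import numpy as np

def layer_norm(X, eps=1e-5):
    """Normalise each row (token) to zero mean and unit variance."""
    mu = X.mean(axis=1, keepdims=True)
    var = X.var(axis=1, keepdims=True)
    return (X - mu) / np.sqrt(var + eps)

def self_attention(X, W_q, W_k, W_v):
    """Single-head scaled dot-product self-attention, standing in for Y(X)."""
    S = (X @ W_q) @ (X @ W_k).T / np.sqrt(W_k.shape[1])
    S = S - S.max(axis=1, keepdims=True)
    A = np.exp(S)
    A = A / A.sum(axis=1, keepdims=True)
    return A @ (X @ W_v)

def mlp(Z, W1, b1, W2, b2):
    """Two-layer ReLU network applied to every row of Z with shared weights."""
    return np.maximum(Z @ W1 + b1, 0.0) @ W2 + b2

def layer_pre_norm(X, attn_w, mlp_w):
    Z = self_attention(layer_norm(X), *attn_w) + X      # Z = Y(X') + X
    return mlp(layer_norm(Z), *mlp_w) + Z               # X~ = MLP[Z'] + Z

def layer_post_norm(X, attn_w, mlp_w):
    Z = layer_norm(self_attention(X, *attn_w) + X)      # Z = LayerNorm[Y(X) + X]
    return layer_norm(mlp(Z, *mlp_w) + Z)               # X~ = LayerNorm[MLP[Z] + Z]

rng = np.random.default_rng(0)
N, D, D_hidden = 5, 8, 32                               # assumed toy sizes
X = rng.normal(size=(N, D))
attn_w = tuple(rng.normal(size=(D, D)) * 0.1 for _ in range(3))
mlp_w = (rng.normal(size=(D, D_hidden)) * 0.1, np.zeros(D_hidden),
         rng.normal(size=(D_hidden, D)) * 0.1, np.zeros(D))
X_pre = layer_pre_norm(X, attn_w, mlp_w)
X_post = layer_post_norm(X, attn_w, mlp_w)
```

Because the MLP and the normalisation act on each row separately, the same functions accept a data matrix with any number of rows $N$.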

In a typical transformer there are multiple such layers stacked on top of each other. Each layer has the same structure, although there is no sharing of parameters between these layers.

## Positional encoding

In the transformer, the weight matrices $W^{(q)}$, $W^{(k)}$ and $W^{(v)}$ are shared across the input tokens. Consequently, the transformer is equivariant with respect to input permutations (i.e. permuting the order of the input tokens, the rows of $X$, permutes the rows of the output in exactly the same way, so the model has no notion of token order).
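Permutation equivariance can be verified directly: permuting the rows of $X$ and then applying self-attention gives the same result as applying self-attention and then permuting the rows of the output. The sketch below uses a single head with random weights and a fixed, arbitrarily chosen permutation.

```python
import numpy as np

def self_attention(X, W_q, W_k, W_v):
    """Single-head scaled dot-product self-attention."""
    S = (X @ W_q) @ (X @ W_k).T / np.sqrt(W_k.shape[1])
    S = S - S.max(axis=1, keepdims=True)
    A = np.exp(S)
    A = A / A.sum(axis=1, keepdims=True)
    return A @ (X @ W_v)

rng = np.random.default_rng(0)
N, D = 5, 4                                   # assumed toy sizes
X = rng.normal(size=(N, D))
W_q, W_k, W_v = (rng.normal(size=(D, D)) for _ in range(3))

perm = np.array([2, 0, 4, 1, 3])              # an arbitrary reordering of the tokens
Y = self_attention(X, W_q, W_k, W_v)
Y_perm = self_attention(X[perm], W_q, W_k, W_v)

is_equivariant = np.allclose(Y_perm, Y[perm]) # outputs are permuted the same way
is_invariant = np.allclose(Y_perm, Y)         # outputs are NOT left unchanged
```

Here `is_equivariant` holds while `is_invariant` does not: each token receives the same output vector wherever it appears in the sequence, which is why position has to be supplied separately.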

Although the sharing of the parameters facilitates the massively parallel processing of the transformer and allows the network to learn long-range dependencies as effectively as short-range dependencies, the lack of positional information is a major limitation for sequential data. (i.e. consider the sentences: "The food was bad, not good at all." vs. "The food was good, not bad at all.")

Positional information can be injected using positional encoding. A positional encoding vector, $r_n$, is associated with each input position $n$ and then combined with the input vector $x_n$. The combination is made by adding the position vectors to the token vectors (concatenation would increase the dimensionality of the input vectors):

\begin{equation}
\widetilde{x}_n = x_n + r_n
\end{equation}

This works because two randomly chosen uncorrelated vectors tend to be nearly orthogonal in spaces of high dimensionality, which indicates that the network can process the tokens and positions relatively independently. Furthermore, the residual connections across every layer ensure the positional information does not get lost going from one transformer layer to the next.
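The near-orthogonality claim is easy to probe numerically. This sketch measures the average absolute cosine similarity between pairs of random Gaussian vectors in a low- and a high-dimensional space; the dimensions and the number of pairs are arbitrary illustrative choices.

```python
import numpy as np

def mean_abs_cosine(D, rng, pairs=2000):
    """Average |cos(angle)| between `pairs` random Gaussian vector pairs in R^D."""
    a = rng.normal(size=(pairs, D))
    b = rng.normal(size=(pairs, D))
    cos = np.sum(a * b, axis=1) / (np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1))
    return np.abs(cos).mean()

rng = np.random.default_rng(0)
low_dim = mean_abs_cosine(4, rng)       # noticeably far from zero
high_dim = mean_abs_cosine(1024, rng)   # close to zero: nearly orthogonal
```

The high-dimensional value is much closer to zero, which is consistent with token and position information occupying nearly independent directions after addition.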


A simple choice for positional encoding is to assign each token position in the sequence a number in the range (0,1), which keeps the representation bounded. One drawback is that this representation is not unique for a given position as it depends on the overall sequence length. An ideal positional encoding scheme should provide 'a unique representation for each position, it should be bounded, it should generalize to longer sequences, and it should have a consistent way to express the number of steps between any two input vectors regardless of their absolute position because the relative position of tokens is more important than the absolute position' (Bishop pg.372).

One approach is to use sinusoidal functions:

\begin{equation}
r_{ni} =\begin{cases}
    \sin \left(\frac{n}{L^{i/D}}\right), & \text{if $i$ is even}.\\
    \cos \left(\frac{n}{L^{(i-1)/D}}\right), & \text{if $i$ is odd}.
  \end{cases}
\end{equation}

The elements of the positional encoding vector $r_n$, indexed by $i$, are given by a series of sine and cosine functions of steadily increasing wavelength, so every element lies in the range $[-1,1]$. Here $L$ is a constant that sets the scale of the wavelengths (Vaswani et al. 2017 use $L = 10000$). One property of the sinusoidal representation is that for any fixed offset $k$, the encoding at position $n+k$ can be represented as a linear combination of the elements of the encoding at position $n$, in which the coefficients don't depend on the absolute position but only on the value of $k$. The network should therefore be able to attend to relative positions.
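The sinusoidal encoding and its relative-position property can be sketched as follows. The snippet assumes the feature index $i$ runs from $0$ to $D-1$ and uses $L = 10000$; both are assumptions for illustration. The shifted encoding is rebuilt from the encoding at position $n$ using the angle-addition identities, with coefficients that depend only on the offset $k$.

```python
import numpy as np

def sinusoidal_encoding(num_positions, D, L=10_000.0):
    """Return a (num_positions x D) matrix whose n-th row is r_n (i is 0-based here)."""
    n = np.arange(num_positions)[:, None]     # positions, shape (N, 1)
    i = np.arange(D)[None, :]                 # feature indices, shape (1, D)
    even_i = i - (i % 2)                      # i if i is even, i - 1 if i is odd
    angles = n / L ** (even_i / D)
    return np.where(i % 2 == 0, np.sin(angles), np.cos(angles))

D, L, k = 16, 10_000.0, 3
R = sinusoidal_encoding(50, D, L)

# One angular frequency per (sin, cos) pair of features.
omega = 1.0 / L ** (np.arange(0, D, 2) / D)
sin_n, cos_n = R[:-k, 0::2], R[:-k, 1::2]

# Encoding at n + k as a linear combination of the encoding at n;
# the coefficients cos(omega k) and sin(omega k) do not depend on n.
sin_shifted = sin_n * np.cos(omega * k) + cos_n * np.sin(omega * k)
cos_shifted = cos_n * np.cos(omega * k) - sin_n * np.sin(omega * k)
```

Comparing `sin_shifted` and `cos_shifted` with rows `k` onwards of `R` confirms the property for every position at once.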

Positional encodings can also be learned. A vector of weights is associated with each token position and learnt jointly with the rest of the model parameters during training, which avoids using handcrafted positional encodings. This does not generalize to longer input sequences, as the encodings for positions not seen during training will be untrained. This approach is therefore generally most suitable for sequences in which the input length is relatively constant during training and inference.

## Decoder transformers

Decoder transformers can be used as generative models that create output sequences of tokens. The goal is to construct an autoregressive model of the form:

\begin{equation}
p(x_1,\dots,x_N) = \prod^{N}_{n=1} p(x_n |x_1,\dots, x_{n-1})
\end{equation}

where $p(x_1,\dots,x_N)$ is the joint distribution over the sequence and each factor $p(x_n | x_1,\dots, x_{n-1})$ is the conditional distribution of the token $x_n$ given the preceding tokens $x_1,\dots, x_{n-1}$, learnt by the model from the training data.

A sample can be drawn from this conditional distribution, appended to the sequence and fed back into the model to give a distribution over the $(n+1)$th token, and so on.

The model can be trained using a self-supervised approach. 
Efficiency can be achieved by processing an entire sequence at once so that each token acts as both a target value for the sequence of previous tokens and as an input for the next tokens. 

In order to hide the future tokens from the model, the input sequence is shifted to the right by one step and masked (or causal) attention is used. This sets to zero every attention weight that corresponds to a token attending to a later token in the sequence, and normalises the remaining weights so that each row sums to 1. In practice this can be done by setting the corresponding pre-softmax values to $-\infty$, so that the softmax returns zero for them and takes care of the normalisation.
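A minimal sketch of causal masking, assuming the $N \times N$ matrix of pre-softmax scores has already been computed (random numbers are used here as a stand-in). Entries above the diagonal are set to $-\infty$ before the row-wise softmax.

```python
import numpy as np

def causal_attention_weights(scores):
    """Row-wise softmax over `scores` with all future positions (m > n) masked out."""
    N = scores.shape[0]
    future = np.triu(np.ones((N, N), dtype=bool), k=1)   # True where column m > row n
    masked = np.where(future, -np.inf, scores)
    masked = masked - masked.max(axis=1, keepdims=True)  # diagonal is never masked
    E = np.exp(masked)                                   # exp(-inf) evaluates to 0
    return E / E.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
scores = rng.normal(size=(4, 4))          # stand-in for Q K^T / sqrt(D_k)
A = causal_attention_weights(scores)
```

The result is lower triangular with rows summing to one; the first token can only attend to itself, so its weight is exactly 1.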

*Note: padding tokens can be used to ensure that the input sequences are of equal length for efficient processing in the transformer. An additional mask is then used in the attention weights to ensure that the output vectors do not pay attention to any padding tokens.* 

The generative process of making tokens can be repeated indefinitely or until an end-of-sequence token is generated. 

Also note: Due to masked attention, the embedding learned for a particular token depends only on that token itself and on earlier tokens, and so does not change when a later token is generated. Consequently, much of the computation can be recycled when processing a new token.

### Sampling strategies

• Greedy search: select the token with the highest probability at each step (Complexity - $O(KN)$ where $K$ is the number of tokens in the vocabulary and $N$ is the length of the sequence). The model is then deterministic in its output predictions. Note carefully that choosing the most likely token at each step is not the same as selecting the most likely sequence of tokens. The most likely sequence would instead require maximising the joint distribution over all tokens (Complexity - $O(K^{N})$, which is infeasible for long sequences):

\begin{equation}
p(y_1, \dots, y_N) = \prod^{N}_{n=1} p(y_n | y_1, \dots, y_{n-1})
\end{equation}

• Beam search: a set of $B$ hypotheses is maintained at each step, each consisting of the sequence of tokens up to step $n$. Here $B$ is called the beam width. Each of these sequences is fed into the network, and for each sequence the $B$ most probable next tokens are retained, creating $B^2$ possible hypotheses for the extended sequence. This list is then pruned by keeping the $B$ most probable hypotheses according to the total probability of the extended sequence.

The beam search algorithm therefore maintains $B$ alternative sequences, keeps track of their probabilities, and finally selects the most probable sequence from this set. (Because the probability of a sequence is a product of per-token probabilities, longer sequences are penalised; sequence probabilities are therefore generally normalised by length so that longer sequences have a fair chance of being selected.) (Complexity - $O(BKN)$ i.e. linear in sequence length.)
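The difference between greedy search, beam search and exhaustive search can be seen on a tiny hand-built example. The lookup table below is a hypothetical toy 'model' with a two-token vocabulary and two-step sequences, invented purely for illustration. For simplicity the sketch scores all $K$ extensions of every hypothesis before pruning to $B$, which selects the same survivors as keeping the top $B$ per hypothesis, and it omits length normalisation because all hypotheses have equal length.

```python
import itertools
import math

VOCAB_SIZE = 2  # K

def next_token_probs(prefix):
    """Hypothetical toy model: p(y_n | y_1, ..., y_{n-1}) read from a fixed table."""
    table = {
        (): [0.6, 0.4],
        (0,): [0.55, 0.45],
        (1,): [0.9, 0.1],
    }
    return table[tuple(prefix)]

def log_prob(seq):
    """Log of the joint probability, i.e. the sum of conditional log-probabilities."""
    return sum(math.log(next_token_probs(seq[:n])[tok]) for n, tok in enumerate(seq))

def beam_search(length, beam_width):
    beams = [()]
    for _ in range(length):
        candidates = [b + (tok,) for b in beams for tok in range(VOCAB_SIZE)]
        beams = sorted(candidates, key=log_prob, reverse=True)[:beam_width]
    return beams[0]

greedy = beam_search(length=2, beam_width=1)       # beam width 1 is greedy search
beam = beam_search(length=2, beam_width=2)
exhaustive = max(itertools.product(range(VOCAB_SIZE), repeat=2), key=log_prob)
```

In this toy table greedy search commits to token 0 first and ends with joint probability 0.33, whereas a beam of width 2 keeps the alternative first token alive and recovers the sequence with probability 0.36 that exhaustive search also finds.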

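Instead of trying to find the most likely sequence, tokens can be generated by sampling from the softmax distribution over tokens at each step, but this can lead to nonsensical sequences, especially for vocabularies with large numbers of tokens, because the many low-probability tokens together carry significant probability mass. Alternatively, only the states with the top $K$ probabilities can be considered, for some choice of $K$, and a token sampled from them according to their renormalised probabilities. (A variant called 'top-p' or nucleus sampling instead keeps the smallest set of most probable tokens whose cumulative probability reaches a threshold $p$.)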
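Top-K and top-p filtering can both be sketched as 'zero out the tail, then renormalise'. The probability vector and the function names below are made up for illustration.

```python
import numpy as np

def top_k_filter(probs, k):
    """Keep the k most probable tokens and renormalise."""
    keep = np.argsort(probs)[-k:]
    out = np.zeros_like(probs)
    out[keep] = probs[keep]
    return out / out.sum()

def top_p_filter(probs, p):
    """Keep the smallest set of most probable tokens whose cumulative probability reaches p."""
    order = np.argsort(probs)[::-1]               # token indices, most probable first
    cumulative = np.cumsum(probs[order])
    num_kept = int(np.searchsorted(cumulative, p)) + 1
    keep = order[:num_kept]
    out = np.zeros_like(probs)
    out[keep] = probs[keep]
    return out / out.sum()

probs = np.array([0.5, 0.3, 0.1, 0.06, 0.04])     # toy next-token distribution
rng = np.random.default_rng(0)

p_top_k = top_k_filter(probs, k=2)                # only tokens 0 and 1 survive
p_top_p = top_p_filter(probs, p=0.85)             # tokens 0, 1 and 2 survive
token = rng.choice(len(probs), p=p_top_k)         # sample from the restricted set
```

Top-K always keeps a fixed number of tokens, whereas top-p keeps more tokens when the distribution is flat and fewer when it is peaked.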

A 'softer' version of top-K sampling is to introduce a parameter called temperature (T) into the softmax function: 

\begin{equation}
y_i = \frac{exp(a_i/T)}{\sum_j exp(a_j/T)}
\end{equation}

The tokens are sampled from the modified distribution. In the limit $T \rightarrow 0$, the probability mass is concentrated on the most probable state with all other states having zero probability (this reverts to greedy search). When $T=1$, the unmodified softmax distribution is recovered. By choosing values in the range $0 < T < 1$, the probability is concentrated towards the more probable tokens.

As $T \rightarrow \infty$, the distribution becomes uniform across all tokens. 
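The effect of temperature can be seen on a small set of pre-activations $a_i$. The values below are arbitrary, and the three temperatures are chosen only to illustrate the cold, unmodified and hot regimes.

```python
import numpy as np

def softmax_with_temperature(logits, T):
    """y_i = exp(a_i / T) / sum_j exp(a_j / T), computed stably for T > 0."""
    z = np.asarray(logits, dtype=float) / T
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

logits = [2.0, 1.0, 0.0]                               # toy pre-activations a_i
p_cold = softmax_with_temperature(logits, 0.1)         # sharply peaked, close to greedy
p_unit = softmax_with_temperature(logits, 1.0)         # the unmodified softmax
p_hot = softmax_with_temperature(logits, 100.0)        # close to uniform
```

Lowering $T$ moves probability onto the top token, while raising it flattens the distribution towards $1/K$ for each of the $K$ tokens.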

Careful: when the model is generating sequences, it can drift away from the distribution of sequences seen during training.

## Encoder transformers

Encoders are transformers that take sequences as input and produce fixed-length vectors as output. 

![title](figures/decoder_architecture.png)
Bishop pg. 389


A randomly chosen subset of the tokens is replaced with mask tokens. The model is then trained to predict the missing tokens at the corresponding output nodes. (The model is bidirectional, since the network sees tokens both before and after the mask(s).)

Compared to decoder transformers, encoders are less efficient to train as they use only a fraction of the sequence tokens as training labels. Encoders also cannot generate sequences.

The procedure of replacing random tokens with a mask token can also introduce a mismatch between the training and fine-tuning distributions, since mask tokens do not appear at fine-tuning or test time. (To mitigate this, Devlin et al. 2018 introduced in BERT the replacement of 80% of the selected tokens with < mask >, 10% with random tokens, and 10% left as the original tokens.)
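The 80/10/10 corruption scheme can be sketched as below. The 15% selection rate follows the BERT paper; the vocabulary size, the id used for the mask token and the function name `mask_tokens` are hypothetical choices made for this example.

```python
import numpy as np

def mask_tokens(token_ids, vocab_size, mask_id, rng, select_prob=0.15):
    """Select tokens for prediction, then corrupt them 80% mask / 10% random / 10% unchanged."""
    token_ids = np.asarray(token_ids)
    selected = rng.random(token_ids.shape) < select_prob   # tokens the model must predict
    action = rng.random(token_ids.shape)                   # decides how each one is corrupted
    use_mask = selected & (action < 0.8)
    use_random = selected & (action >= 0.8) & (action < 0.9)
    corrupted = token_ids.copy()
    corrupted[use_mask] = mask_id
    corrupted[use_random] = rng.integers(0, vocab_size, size=int(use_random.sum()))
    # The remaining 10% of selected tokens keep their original id.
    return corrupted, selected

rng = np.random.default_rng(0)
vocab_size, mask_id = 100, 100            # hypothetical: mask id sits outside the normal vocabulary
tokens = rng.integers(0, vocab_size, size=10_000)
corrupted, selected = mask_tokens(tokens, vocab_size, mask_id, rng)
```

The training loss would be evaluated only at the positions where `selected` is true, with the original `tokens` as labels.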

Once the encoder is trained, it can be fine-tuned for a variety of downstream tasks. To do this, a new output layer is constructed whose form is specific to the task being solved.

For example, for classification tasks, all model parameters including the new output matrix are learnt during fine-tuning using the log probability of the correct class label.

## Sequence-to-sequence models

Sequence-to-sequence models combine encoder and decoder transformers (Vaswani et al. 2017). An encoder transformer can be used to map the input sequence into a suitable internal representation denoted by $Z$. To incorporate $Z$ into the generative process for the output sequence, a modified version of self-attention called cross-attention is used. This is the same as self-attention except that the query vectors come from the sequence being generated while the key and value vectors come from $Z$, the representation produced by the encoder.

This results in the following architecture:

![title](figures/seq-to-seq.png)
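Cross-attention differs from self-attention only in where the queries, keys and values come from, which the following sketch makes explicit. The sequence lengths and the random weights are assumptions for illustration; `X_dec` stands for the decoder-side token matrix and `Z_enc` for the encoder representation $Z$.

```python
import numpy as np

def cross_attention(X_dec, Z_enc, W_q, W_k, W_v):
    """Queries from the sequence being generated; keys and values from the encoder output."""
    Q = X_dec @ W_q
    K = Z_enc @ W_k
    V = Z_enc @ W_v
    S = Q @ K.T / np.sqrt(Q.shape[1])      # (N_dec x N_enc) scaled scores
    S = S - S.max(axis=1, keepdims=True)
    A = np.exp(S)
    A = A / A.sum(axis=1, keepdims=True)
    return A @ V, A

rng = np.random.default_rng(0)
N_enc, N_dec, D = 7, 3, 8                  # input and output lengths may differ
Z_enc = rng.normal(size=(N_enc, D))
X_dec = rng.normal(size=(N_dec, D))
W_q, W_k, W_v = (rng.normal(size=(D, D)) for _ in range(3))
Y, A = cross_attention(X_dec, Z_enc, W_q, W_k, W_v)
```

Each row of `A` distributes one output position's attention over all encoder positions, so no causal mask is needed in this sub-layer.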
