# Transformer Taxonomy

Previously we've covered fine-tuning and evaluating a transformer.

In this notebook, the following topics are covered:
1. Implementing the attention mechanism
2. Implementing an attention decoder
3. Architectural differences between the encoder and the decoder (for a decoder-only transformer, refer to [nanogpt](https://github.com/JpChii/nanogpt))
4. Taxonomy of transformers

## The Transformer architecture

The Transformer is based on the encoder-decoder architecture, which is widely used for machine translation tasks.

***Encoder*** - Converts a sequence of input tokens into a sequence of embeddings, also called the *hidden state* or *context*.

***Decoder*** - Uses the encoder's *hidden state* to iteratively generate an output sequence of tokens, one token at a time.

*Encoder-Decoder architecture*
![Architecture](../notes/images/3-transformer-taxonomy/encoder-decoder.png)

* The encoder combines token embeddings and positional embeddings and passes them through a stack of encoder layers to produce the hidden state.
* The encoder's output is fed to each decoder layer. The decoder then predicts the most probable next token.
* Suppose *Die* and *Zeit* have already been predicted. The decoder takes these two tokens together with the encoder's output to predict the next token, *fliegt*.
* This process is repeated until the decoder predicts an end-of-sequence (EOS) token.
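The generation loop described above can be sketched in plain Python. This is a toy illustration, not a real model: `toy_next_token` is a hypothetical stand-in for the decoder stack, and the lookup table, the `<bos>`/`<eos>` token names, the `max_len` cutoff, and the placeholder encoder output are all assumptions made for the example.

```python
# Toy stand-in for a trained decoder: maps the tokens generated so far
# to the next token. A real decoder would also attend to the encoder output.
TOY_TABLE = {
    ("<bos>",): "Die",
    ("<bos>", "Die"): "Zeit",
    ("<bos>", "Die", "Zeit"): "fliegt",
    ("<bos>", "Die", "Zeit", "fliegt"): "<eos>",  # artificial stop for the demo
}


def toy_next_token(encoder_output, generated):
    """Hypothetical decoder step: return the most probable next token."""
    return TOY_TABLE.get(tuple(generated), "<eos>")


def greedy_decode(encoder_output, max_len=10):
    """Generate one token at a time until <eos> or max_len steps."""
    generated = ["<bos>"]
    for _ in range(max_len):
        token = toy_next_token(encoder_output, generated)
        if token == "<eos>":
            break
        generated.append(token)
    return generated[1:]  # drop the start token


hidden_state = [[0.1, 0.2], [0.3, 0.4]]  # placeholder encoder output
print(greedy_decode(hidden_state))
```

The point to notice is that each step sees everything generated so far plus the (fixed) encoder output, which is what makes decoding iterative.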

### The Encoder

In the encoder, the token embeddings are followed by a stack of encoder layers. Each encoder layer is made up of the sublayers below:
* multi-head self-attention layer
* fully connected feed-forward layer that is applied to each input embedding

The output embeddings of each encoder layer have the same size as the inputs. The main role of the encoder stack is to **update** the input embeddings with contextual information.

Ex: the embedding of *apple* will be updated to be more *company-like* and less *fruit-like* if the words *keynote* and *phone* are close to it.

*Multi-head attention*
![Multi-head attention](../notes/images/3-transformer-taxonomy/multi-head-attention.png)

Each of these sublayers uses skip connections and layer normalization, which help train deep neural networks effectively.
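As a rough sketch of what these two pieces do, the plain-Python snippet below wraps a sublayer with a skip connection followed by layer normalization. The ordering (add, then normalize), the `eps` value, the omission of learned scale and shift parameters, and the `toy_sublayer` function are all assumptions for illustration; the text above does not specify the exact arrangement.

```python
import math


def layer_norm(x, eps=1e-5):
    """Normalize one embedding vector to zero mean and unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def residual_block(x, sublayer):
    """Skip connection: add the sublayer output to its input, then normalize."""
    out = sublayer(x)
    added = [a + b for a, b in zip(x, out)]
    return layer_norm(added)


def toy_sublayer(x):
    """Hypothetical stand-in for attention or the feed-forward layer."""
    return [2.0 * v for v in x]


embedding = [1.0, 2.0, 3.0, 4.0]
normalized = residual_block(embedding, toy_sublayer)
print(normalized)
```

Note that the output has the same length as the input, which is what allows encoder layers to be stacked.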

Let's start with the self-attention layer:

#### Self-Attention

Self-attention computes weights for all hidden states in the same set (for example, all the hidden states of the encoder, or all those of the decoder).

***Main idea of self-attention***: Instead of using a fixed embedding for each token, we can use the whole sequence to compute a *weighted average* of the embeddings. Mathematically, this can be defined as below.

For a sequence of token embeddings x_1, x_2, ..., x_n, self-attention produces a sequence of new embeddings x'_1, x'_2, ..., x'_n, where each x'_i is a linear combination of all the x_j.

*Expression*

![expression](../notes/images/3-transformer-taxonomy/self-attention-expression.png)
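Written out in symbols, with $x'_i$ denoting the new embedding of token $i$, the linear combination is usually stated as follows. This is the standard form and is assumed to match the figure above:

```latex
x'_i = \sum_{j=1}^{n} w_{ji}\, x_j,
\qquad
\sum_{j=1}^{n} w_{ji} = 1
```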

The coefficients w_ji are called *attention weights* and are normalized so that they sum to 1 over j. How does this averaging work?

Consider *time flies like an arrow*, where *flies* is a verb. By assigning larger weights to *arrow* and *time*, we come up with a representation of *flies* that incorporates this context. Embeddings generated this way are called *contextualized embeddings*.
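To make the weighted average concrete, here is a small plain-Python sketch. The two-dimensional embeddings and the attention weights are made-up numbers chosen for illustration, not values from a trained model.

```python
tokens = ["time", "flies", "like", "an", "arrow"]

# Hypothetical 2-d token embeddings (one row per token).
embeddings = [
    [1.0, 0.0],  # time
    [0.0, 1.0],  # flies
    [0.5, 0.5],  # like
    [0.0, 0.0],  # an
    [1.0, 1.0],  # arrow
]

# Hypothetical attention weights for "flies" over all five tokens.
# Most of the weight goes to "time" and "arrow", and the weights sum to 1.
weights_for_flies = [0.3, 0.2, 0.1, 0.0, 0.4]


def weighted_average(weights, vectors):
    """Return sum_j weights[j] * vectors[j], computed per dimension."""
    dim = len(vectors[0])
    return [
        sum(w * vec[d] for w, vec in zip(weights, vectors))
        for d in range(dim)
    ]


contextual_flies = weighted_average(weights_for_flies, embeddings)
print(contextual_flies)
```

The new vector for *flies* is no longer its original embedding; it has been pulled toward the embeddings of the tokens that received the largest weights.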

Let's check how these attention weights are calculated.

#### Scaled dot-product attention

The most common way to implement self-attention is scaled dot-product attention, from the paper that introduced the Transformer.

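There are four main steps:
* Project each token embedding into three vectors: query, key, and value
  * The projections are learned linear layers (weight matrices)
* Take the dot product of the queries and keys to get the attention scores
  * The query of the current token is multiplied with the keys of all tokens. The dot product is large when a query and a key are similar
* Scale the scores and normalize them with a softmax to get the attention weights. This keeps large dot products from blowing up and makes each token's weights sum to 1
* Finally, update the token embeddings by multiplying the attention weights with the values, i.e. taking a weighted sum of the value vectors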
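The four steps above can be sketched without any framework, which keeps the arithmetic visible. This is a minimal illustration under stated assumptions: the helper names are hypothetical, the projection matrices are identity matrices rather than learned weights, the scale factor is the square root of the key dimension, and there is a single head with no masking.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(x, w_q, w_k, w_v):
    # Step 1: project each embedding into query, key, and value vectors.
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    # Step 2: dot product of every query with every key -> (n x n) scores.
    k_t = [list(col) for col in zip(*k)]
    scores = matmul(q, k_t)
    # Step 3: scale by sqrt(key dimension), then softmax each row.
    scale = math.sqrt(len(k[0]))
    weights = [softmax([s / scale for s in row]) for row in scores]
    # Step 4: weighted sum of the value vectors.
    return matmul(weights, v), weights


# Three toy 2-d token embeddings; identity projections are an assumption.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
output, weights = scaled_dot_product_attention(x, identity, identity, identity)
print(weights)
print(output)
```

Each row of `weights` sums to 1, and `output` has the same shape as `x`, so the result can be passed on to the next layer.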

### Visualizing attention

Attention can be visualized with [BertViz for Jupyter](https://oreil.ly/eQK3I).

To visualize attention weights, the `neuron_view` module traces the computation of the weights to show how the query and key vectors are combined to produce the final weight.

In [11]:
from transformers import AutoTokenizer
from bertviz.transformers_neuron_view import BertModel
from bertviz.neuron_view import show

# Load the tokenizer and BertViz's BERT model from the same checkpoint
model_ckpt = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_ckpt)
model = BertModel.from_pretrained(model_ckpt)

# Render the neuron view for one attention head (layer 0, head 8)
text = "time flies like an arrow"
show(model=model, model_type="bert", tokenizer=tokenizer, sentence_a=text,
     display_mode="light", layer=0, head=8)


In the resulting visualization, the attention weight from *flies* is strongest toward *arrow*.