<a name='0'></a>

# Attention and Transformer

In the previous notebook, we learned about Recurrent Neural Networks and their applications in computer vision. Toward the end, we saw that Recurrent Neural Networks are rarely used today due to the advent of the Transformer. In this notebook, we will take a step back to look at the downsides of RNNs and how they motivate the Transformer, and then dive deep into the Transformer architecture.

***Outline***:

- [1. The Downsides of Recurrent Networks and Motivation of Transformers](#1)
- [2. Transformer Architecture](#2)
  - [2.1 Attention](#2-1)
    - [2.1.1 Attention Inputs: Queries, Keys, Values](#2-1-1)
    - [2.1.2 Multi-Head Attention](#2-1-2)
  - [2.2 Embedding and Positional Encoding Layers](#2-2)
  - [2.3 Residual Connections, Layer Normalization, and Dropout](#2-3)
  - [2.4 Linear and Softmax Layers](#2-4)
  - [2.5 Encoder and Decoder](#2-5)
- [3. Advantages and Disadvantages of Self-Attention](#3)
- [4. Implementations of Transformer](#4)
- [5. Evolution of Large Language Transformer Models](#5)
- [6. Transformers Beyond NLP](#6)
- [7. Final Notes](#7)
- [8. References and Further Learning](#8)

<a name='1'></a>

## 1. The Downsides of Recurrent Networks and Motivation of Transformers

As we saw in the previous notebook, recurrent networks are neural networks used to process sequential data such as text and video. Unlike feed-forward networks, recurrent networks have a feedback loop in their design, which is the key element behind their ability to model sequential data. But there is a problem in the way RNNs are wired.

Because of this inherently sequential design, there is a computation bottleneck: we can't process the elements of a training sequence in parallel. We have to process one word or character (or simply a token) of a sentence, or one frame of a video, at a time, and as you might imagine, this becomes even harder for longer sequences.

So the issue is not only the unstable gradient problems that make it hard to learn long-range dependencies, but also the fact that RNNs don't work well with modern GPUs (Graphics Processing Units). GPUs are designed for fast parallel computation and sit largely idle during sequential operations.

The Transformer is a neural network architecture that can process sequential data in parallel. It doesn't use any recurrent or convolution layers; rather, it is based purely on attention. We will learn more about attention in the next sections. The Transformer was introduced in 2017 in the famous paper [Attention is all you need](https://arxiv.org/abs/1706.03762). Before its advent, researchers used to combine recurrent networks or convolutions with attention (examples [here](https://arxiv.org/pdf/1508.04025.pdf), [here](https://arxiv.org/pdf/1409.0473.pdf), and [here](https://arxiv.org/abs/1502.03044)). Since the Transformer, attention has been literally all you need, not only in natural language processing but also in other modalities such as computer vision and visual language models.



<a name='2'></a>

## 2. Transformer Architecture

The Transformer is made of 2 main types of layers: self-attention layers and fully connected layers (called dense layers in [Keras/TensorFlow](https://keras.io/api/layers/core_layers/dense/) and linear layers in [PyTorch](https://pytorch.org/docs/stable/generated/torch.nn.Linear.html)). A stack of multiple fully connected layers is typically called a Multi-Layer Perceptron (MLP). So, we can simply say that the Transformer is made of self-attention and MLP networks.

In addition to self-attention and fully connected layers, the Transformer also contains other layers such as layer normalization, embedding layers (at the inputs and outputs), and a positional encoding layer for maintaining the order of tokens. We will discuss them in detail.


![image](https://drive.google.com/uc?export=view&id=1bbcftxyRuYgpBr2Tj0fgxkcPp98G2LwT)

<a name='2-1'></a>

### 2.1 Attention

Attention is the main layer of the Transformer architecture. Attention can sound complicated, but in essence it is a mechanism that allows the neural network to pay more attention to the parts of the input data that contain meaningful information and less attention to the rest. In image captioning, for example, that could mean paying attention to the parts of the image that contain key features, or features that are useful for generating the desired caption.

![image](https://drive.google.com/uc?export=view&id=1kd8PKpZzjK2bQhlEqfkKQdCbr7kA9LsB)

Another example: in machine translation, or sequence-to-sequence translation, attention could mean attending to the words that have associated meaning. A nice example of this is found in the image below. Notice the order in which the words `European Economic Area` were translated to `zone économique européenne`.


![image](https://drive.google.com/uc?export=view&id=14UtDLC9vkFvrYg_dC9jTPXNHnGJR-_4f)

What's going on there? Do you see something? The order of the words was reversed in the translated sequence because it makes sense that way. So, when translating a word, attention lets the model not only translate the word correctly, but also place it in the right order by attending to the other words that correlate with that particular word. In short, attention can identify and preserve the context when performing translations between different languages.

The machine translation example we used above to convey the idea of attention does not come from the Transformer itself, but the idea of attention is the same: attend to, or focus on, the parts of the input that contain meaningful information and focus less on the rest of the input data.

<a name='2-1-1'></a>

#### 2.1.1 Attention Inputs: Queries, Keys, and Values


Technically speaking, attention is a function built on measuring the similarity between vectors. The attention function takes 3 inputs: `query`, `keys`, and `values`. These 3 terms might sound complicated now, but a real example will help you understand the gist of them.

Let's say that you are searching for the *attention is all you need paper* on [ArXiv](https://arxiv.org). The title of the paper, or whatever you enter in the search field, is a `query`. Internally, ArXiv will find papers that match your query based on `keys` such as titles, authors, fields, journal, abstract, etc. ArXiv will then compute the similarity between your `query` and the fixed `keys`, and display the top papers that match your `query`, or simply, the papers that have high similarity scores. The papers in the ArXiv database that we are querying based on some set of keys can be referred to as `values`. The similarity scores can also be referred to as relevancy scores or attention scores.


![image](https://drive.google.com/uc?export=view&id=1KAuNB7qXCHvA1sFmMVhJ-qb6AnA1S83X)

So, let's use the above analogy to understand what the attention function is doing in the Transformer. But before we do that, let's look at the attention function itself.

$$
\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V
$$

In the above function:
* $Q$ denotes the queries matrix or, in our ArXiv analogy, the paper that we want to search for.
* $K$ denotes the keys matrix, or the sets of things (such as title, authors, journal, etc.) that describe papers in our analogy. In the attention function, the $K$ matrix is transposed (i.e. $K^T$).
* $V$ is the values matrix, or all the papers that we are querying in our ArXiv analogy.
* $\sqrt{d_k}$ is the scaling factor, where $d_k$ is the dimension of the keys. Dividing by it keeps the dot products from growing large as $d_k$ grows, which would otherwise push the softmax into regions with very small gradients. It has no natural counterpart in the ArXiv analogy; it simply keeps the computation well-behaved.
* $\text{softmax}()$ is a function that transforms the scaled dot-product $\frac{QK^T}{\sqrt{d_k}}$ into probability distributions. Applying the softmax output to the values $V$ produces weighted values.

![image](https://drive.google.com/uc?export=view&id=18WVSE7jD87tXO0ufDVwHfwW8BnGWFmLo)

Putting things together: the attention function takes the queries, keys, and values matrices, computes the dot-product (a matrix multiplication) between queries and keys as a measure of similarity, scales the result $QK^T$ by $\sqrt{d_k}$, passes it through the softmax function to get attention weights, and applies those weights to the values matrix to get a weighted sum of the values. In brief, attention computes similarity scores between queries and keys and uses those scores to weight the values. Everything else is there to smooth the computation.

As the queries, keys, and values are all matrices, computing the attention function is pretty fast and can be done in a single pass, since it is merely a few matrix multiplications (no loops or recurrence involved). And as we alluded to in the beginning, that's where the extreme parallelism of the Transformer comes from.
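To make the formula concrete, here is a minimal NumPy sketch of scaled dot-product attention. The function names, the random inputs, and the small shapes are assumptions chosen for illustration; they are not taken from the paper's code.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row-wise max for numerical stability before exponentiating
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Q: (n_q, d_k), K: (n_k, d_k), V: (n_k, d_v).

    Returns the attended output (n_q, d_v) and the attention weights (n_q, n_k).
    """
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # similarity of every query with every key
    weights = softmax(scores, axis=-1)  # each row becomes a probability distribution
    return weights @ V, weights         # weighted sum of the values

# Hypothetical toy inputs: 3 queries, 5 key/value pairs
rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 4))
K = rng.normal(size=(5, 4))
V = rng.normal(size=(5, 2))

output, weights = scaled_dot_product_attention(Q, K, V)
print(output.shape, weights.shape)  # (3, 2) (3, 5)
```

Note that there is no loop over tokens anywhere: all queries are compared with all keys in one matrix product.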


<a name='2-1-2'></a>

#### 2.1.2 Multi-Head Attention

Multi-head attention is nothing other than multiple attention layers (heads) that run in parallel and whose outputs are concatenated to provide a single output. Running attention heads in parallel helps the model learn much better representations than using a single attention layer.

Before the query, key, and value matrices are fed to the attention layers, they first pass through independent linear (dense) layers. After that, they are fed to the attention layers, whose concatenated output is fed to a final linear/dense layer. The image below illustrates multi-head attention.


![image](https://drive.google.com/uc?export=view&id=1kbsAj4VyIakJFXioZ-PgyByVqFaT1HfW)

In the original Transformer paper, the number of parallel attention heads ($h$) was 8, but this is a hyperparameter that you can change. You can use any number of attention heads you want depending on the task.

You would think that having parallel heads increases the computation cost, but it actually doesn't, because the dimension of each head is one eighth of the model dimension. The dimension of each head is 64 in the Transformer paper, which comes from 512/8 = 64, where 512 is the dimension of the model and of the encoder output, denoted as $d_{model}$ in the paper.
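The head-splitting arithmetic can be sketched in NumPy as follows. This is a simplified illustration under stated assumptions: biases are omitted, the weights are random rather than learned, and the function and variable names are made up for this example. Only the sizes ($d_{model}=512$, 8 heads, 64 dimensions per head) come from the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def multi_head_self_attention(x, W_q, W_k, W_v, W_o, num_heads):
    """x: (seq_len, d_model); every weight matrix: (d_model, d_model)."""
    seq_len, d_model = x.shape
    d_head = d_model // num_heads  # 512 // 8 = 64 in the paper

    def split_heads(t):
        # (seq_len, d_model) -> (num_heads, seq_len, d_head)
        return t.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)

    # Independent linear projections of the same input (self-attention)
    Q, K, V = split_heads(x @ W_q), split_heads(x @ W_k), split_heads(x @ W_v)

    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)  # (num_heads, seq_len, seq_len)
    heads = softmax(scores, axis=-1) @ V                 # (num_heads, seq_len, d_head)

    # Concatenate the heads back into (seq_len, d_model), then apply the final linear layer
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ W_o

rng = np.random.default_rng(0)
d_model, num_heads, seq_len = 512, 8, 10
x = rng.normal(size=(seq_len, d_model))
W_q, W_k, W_v, W_o = (
    rng.normal(scale=d_model ** -0.5, size=(d_model, d_model)) for _ in range(4)
)

out = multi_head_self_attention(x, W_q, W_k, W_v, W_o, num_heads)
print(out.shape)  # (10, 512)
```

Because each head works on a 64-dimensional slice, the 8 heads together cost about as much as one attention layer over all 512 dimensions.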

Multi-head attention can be compared to depthwise separable convolutions in Convolutional Neural Networks. A depthwise separable convolution is a special type of convolution that splits the input tensor into its channels, operates on each channel independently (usually with 3x3 convolutions), concatenates the outputs, and feeds them to a pointwise (1x1) convolution. It was popularized by the [Xception paper](https://arxiv.org/abs/1610.02357) by Francois Chollet.

That's it for self-attention and multi-head attention. One last thing to note is that in the encoder, the query, key, and value matrices are all drawn from the same input sequence.

<a name='2-2'></a>

### 2.2 Input Embedding and Positional Encoding Layers

In language modelling, it is common to use learned or pre-trained word embeddings to add a little bit of semantics to the language model. The embedding layer converts the tokens of the input sequence (words) into dense, high-dimensional vectors in which words that have similar meanings tend to point in the same direction (you can visualize learned embeddings such as Word2Vec on the [TensorFlow Embedding Projector](http://projector.tensorflow.org)). So, the input sentences to a Transformer only need to be split into tokens; they don't need to be embedded beforehand, since the embedding layer is learned together with the rest of the model.


Also, the Transformer does not use any recurrent networks. Avoiding recurrence speeds up the computation, since self-attention can operate on each token while looking at all other tokens at the same time, but the problem is that the order of tokens is lost in the process.

Recurrent networks are inherently wired to process sequences while preserving the order, or the positions, of tokens in an input sequence. So, if we want to avoid them altogether, we have to find a way to keep the order of tokens in the Transformer.

The designers of the Transformer proposed adding a positional encoding to the output of the embedding layer, before the encoder or decoder, to maintain the order of tokens, and this worked pretty well. They also experimented with learned positional embeddings, which produced nearly identical results.
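As an illustration, the fixed positional encoding used in the paper is built from sine and cosine functions of different frequencies: $PE_{(pos, 2i)} = \sin(pos / 10000^{2i/d_{model}})$ and $PE_{(pos, 2i+1)} = \cos(pos / 10000^{2i/d_{model}})$. Below is a minimal NumPy sketch of it; the function name and the toy sizes are assumptions, and the sketch assumes an even $d_{model}$.

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    """Return a (seq_len, d_model) matrix of sine/cosine position encodings.

    Assumes d_model is even.
    """
    positions = np.arange(seq_len)[:, None]        # (seq_len, 1)
    even_dims = np.arange(0, d_model, 2)[None, :]  # the "2i" indices, (1, d_model / 2)
    angles = positions / np.power(10000.0, even_dims / d_model)

    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)  # even dimensions use sine
    pe[:, 1::2] = np.cos(angles)  # odd dimensions use cosine
    return pe

# Hypothetical toy sizes: 50 tokens, 16-dimensional embeddings
pe = sinusoidal_positional_encoding(seq_len=50, d_model=16)
token_embeddings = np.random.default_rng(0).normal(size=(50, 16))

# The encoding is simply added to the embeddings before the first layer
encoder_input = token_embeddings + pe
print(encoder_input.shape)  # (50, 16)
```

Every position gets a distinct pattern of values, which is what lets attention layers tell tokens apart by position even though they look at all tokens at once.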

<a name='2-3'></a>

### 2.3 Residual Connections, Layer Normalization, and Dropout

Residual connections (or skip connections) and layer normalization are popular ingredients in neural network design. Ever since Deep Residual Networks [(He et al., 2015)](https://arxiv.org/abs/1512.03385) showed that residual connections mitigate vanishing gradient problems and help the model converge faster, it has been almost impossible to find a model that doesn't have some sort of skip connection.

Quoting Ashish Vaswani, one of the authors of the Attention Is All You Need paper, "*residuals(connections) carry positional information to higher layers, among other information.*"

![image](https://drive.google.com/uc?export=view&id=1fLddMPGOMWFHb9jVLo3t52Vm8w2841s7)


In conjunction with skip connections, various normalization methods have also proved useful since they help the model converge. To date, layer normalization (or layer norm for short) is one of the normalization techniques most often used in new papers. Also, as the layer norm paper noted, unlike batch normalization, which behaves differently at training and test time, layer normalization performs exactly the same computation at training and test time [(Ba et al., 2016)](https://arxiv.org/pdf/1607.06450.pdf).

In the Transformer architecture, layer normalization follows each multi-head attention and fully-connected sub-layer, while residual connections wrap around each of those sub-layers (see the architecture for clarity).

Lastly, the Transformer uses [dropout](https://www.cs.toronto.edu/~rsalakhu/papers/srivastava14a.pdf) layers to avoid overfitting. Dropout is applied to the output of each multi-head attention and fully-connected sub-layer, before that output is added to the residual connection and normalized. Dropout is also applied to the sum of the embeddings and positional encodings (for both encoder and decoder).


Residual connections, layer normalization (or other normalization techniques), and dropout (or other regularization techniques) are the norm in deep learning. They are not guaranteed to work every time, but they usually make a big difference.
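The way these three pieces fit around a sub-layer can be sketched as `LayerNorm(x + Dropout(Sublayer(x)))`. The NumPy code below is a simplified illustration of that pattern: the learnable gain and bias of layer normalization are omitted, and all function names and the stand-in sub-layer are assumptions made for this example.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # Normalize each token vector (last axis) to zero mean and unit variance.
    # The learnable gain and bias of real layer norm are omitted here.
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def dropout(x, rate, rng, training):
    # Inverted dropout: zero out a fraction `rate` of values and rescale the rest
    if not training or rate == 0.0:
        return x
    keep = rng.random(x.shape) >= rate
    return x * keep / (1.0 - rate)

def sublayer_connection(x, sublayer, rate=0.1, rng=None, training=False):
    # LayerNorm(x + Dropout(Sublayer(x)))
    if rng is None:
        rng = np.random.default_rng()
    return layer_norm(x + dropout(sublayer(x), rate, rng, training))

# Hypothetical stand-in for an attention or feed-forward sub-layer
def toy_sublayer(t):
    return 2.0 * t

x = np.random.default_rng(0).normal(size=(4, 8))
y = sublayer_connection(x, toy_sublayer)
print(y.shape)  # (4, 8)
```

The sub-layer must return a tensor with the same shape as its input, otherwise the residual addition would not be possible. This is why all sub-layers in the Transformer keep the dimension $d_{model}$.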

<a name='2-4'></a>

### 2.4 Linear and Softmax Layers

After self-attention layers, linear layers are the second most dominant layer in the Transformer architecture. Linear (or dense) layers perform ordinary linear transformations (mathematically speaking, $wx+b$). In each fully-connected sub-layer, two linear layers are separated by a ReLU non-linearity.

The softmax layer is used for converting the decoder output into probabilities. The softmax layer used at the output of the decoder is not to be confused with the softmax used in dot-product attention, although they perform the same computation. The latter is used for producing the weights of the weighted sum of the values from the query-key scores, while the former is for decoding the output of the Transformer. The output softmax layer is simply the output activation function.
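As a rough sketch of both pieces, the code below implements a fully-connected sub-layer as two linear transformations with a ReLU in between, followed by a linear projection and softmax over a vocabulary. The paper uses $d_{model}=512$ and an inner dimension of 2048; the much smaller sizes, the random weights, and all names here are assumptions for illustration.

```python
import numpy as np

def feed_forward(x, W1, b1, W2, b2):
    # Two linear transformations with a ReLU in between, applied to each position
    hidden = np.maximum(0.0, x @ W1 + b1)
    return hidden @ W2 + b2

def output_probabilities(decoder_out, W_vocab):
    # Final linear layer followed by softmax over the vocabulary
    logits = decoder_out @ W_vocab
    logits = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(logits)
    return e / e.sum(axis=-1, keepdims=True)

# Hypothetical toy sizes (the paper uses d_model=512 and d_ff=2048)
rng = np.random.default_rng(0)
seq_len, d_model, d_ff, vocab_size = 6, 16, 64, 100

x = rng.normal(size=(seq_len, d_model))
W1, b1 = rng.normal(scale=0.1, size=(d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(scale=0.1, size=(d_ff, d_model)), np.zeros(d_model)
W_vocab = rng.normal(scale=0.1, size=(d_model, vocab_size))

ffn_out = feed_forward(x, W1, b1, W2, b2)
probs = output_probabilities(ffn_out, W_vocab)
print(ffn_out.shape, probs.shape)  # (6, 16) (6, 100)
```

Each row of `probs` is a probability distribution over the vocabulary for one output position.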

<a name='2-5'></a>

### 2.5 Encoder and Decoder

The Transformer has two parts: the encoder and the decoder. Let's review them in brief.

Both the encoder and the decoder have almost the same main layers, which are multi-head attention and fully-connected layers. The paper refers to those layers as sub-layers since each encoder or decoder repetition is itself a layer (as we will see later, both the encoder and the decoder are repeated multiple times).

The encoder is the first part of the Transformer. It takes the output of the positional encoding and input embedding layers and forwards it to stacked self-attention and fully-connected layers. The encoder is repeated 6 times in the original Transformer architecture, but this can be changed depending on the task. It's merely another hyperparameter to tune.


The decoder looks almost like the encoder, except that each decoder layer has two multi-head attention sub-layers. The first is masked self-attention: the mask stops each position from attending to later positions, so a prediction can depend only on the tokens already generated. The second attends over the encoder output. The decoder therefore takes two inputs, one from the encoder and the other from the positional encoding and output embedding layers. The decoder is also repeated 6 times.
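The masking in the decoder's first attention sub-layer can be sketched as follows: scores for future positions are set to negative infinity before the softmax, so their attention weights become exactly zero. The function names and the toy scores are assumptions made for this illustration.

```python
import numpy as np

def causal_mask(seq_len):
    # True where attention is allowed: position i may look at positions 0..i
    return np.tril(np.ones((seq_len, seq_len), dtype=bool))

def masked_attention_weights(scores, mask):
    # Disallowed positions get -inf so that they receive zero weight after softmax
    scores = np.where(mask, scores, -np.inf)
    scores = scores - scores.max(axis=-1, keepdims=True)
    e = np.exp(scores)
    return e / e.sum(axis=-1, keepdims=True)

# Hypothetical attention scores for a sequence of 4 tokens
scores = np.random.default_rng(0).normal(size=(4, 4))
mask = causal_mask(4)
weights = masked_attention_weights(scores, mask)

print(mask.astype(int))
# [[1 0 0 0]
#  [1 1 0 0]
#  [1 1 1 0]
#  [1 1 1 1]]
```

The first token can only attend to itself, so the first row of `weights` is `[1, 0, 0, 0]`, while the last token can attend to the whole sequence.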


![image](https://drive.google.com/uc?export=view&id=14Yvdy4Q-tz8cbU4lcm8zRLhO6cYB7XeF)

The Transformer uses both an encoder and a decoder because, when performing machine translation, you have two sequences (for example, an English input sentence and a French output sentence) that you have to process jointly. For simpler tasks like image classification and sentiment analysis, either the encoder or the decoder alone is enough, and most people tend to use the encoder.


<a name='3'></a>

## 3. Advantages and Disadvantages of Self-Attention

Self-attention is the main ingredient of the Transformer architecture. Compared to other sequence processing methods such as recurrent networks and 1D convolutions, self-attention has a number of advantages. One of them is handling long sequences and long-range dependencies. Also, as self-attention can attend to all words in a sequence, it can preserve the general context of the sequence. The image below shows how self-attention and convolution transform a sequence: while a convolution can only process the tokens that fall within its filter or window size, self-attention can attend to all tokens in parallel.

![image](https://drive.google.com/uc?export=view&id=1q7lAD9h2gcN6x978UfDHgwptZ9PUiiQs)

By stacking multiple attention layers, the Transformer architecture can push the context of the sequence even further. The image below visualizes multiple attention layers operating on a sequence.

![image](https://drive.google.com/uc?export=view&id=1zGT7Bm5eXxCKF538LPph1_xrJq1j0n2a)


To visualize attention output for your own sequences, you can use the [exBERT](https://huggingface.co/exbert/) tool hosted by Hugging Face. [BertViz](https://github.com/jessevig/bertviz) also provides an interactive way to visualize attention in Transformers (BertViz was introduced in this [paper](https://arxiv.org/pdf/1904.02679.pdf)).



That said, self-attention has computational bottlenecks. Although self-attention can be parallelized, its time complexity is quadratic in the sequence length ($O(N^2D)$, where $N$ is the length of the sequence and $D$ is the dimension of each word embedding). Such complexity is negligible for short sequences, but for long sequences it can be a big issue. Fortunately, the Transformer architecture was introduced for machine translation, where $N$ is typically small, and that makes self-attention efficient for machine translation. Łukasz Kaiser, one of the authors of the Transformer paper, discussed how they came to the realization that such time complexity is not an issue in this fantastic [talk](https://www.youtube.com/watch?v=rBCqOTEfxvg).
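To get a feel for the quadratic growth, the short sketch below counts the entries of the $N \times N$ attention score matrix for a few sequence lengths. The function name and the chosen lengths are arbitrary assumptions, and the memory figure covers a single score matrix stored as 32-bit floats (one head of one layer), not a whole model.

```python
def attention_score_entries(seq_len):
    # Every token attends to every token, so the score matrix is seq_len x seq_len
    return seq_len * seq_len

for n in (128, 1024, 8192):
    entries = attention_score_entries(n)
    mib = entries * 4 / 2**20  # 4 bytes per float32 value
    print(f"N={n:>5}: {entries:>11,} scores, about {mib:,.1f} MiB as float32")
```

Doubling the sequence length quadruples the number of scores, which is why long documents, audio, and raw pixels are much harder for plain self-attention than short sentences.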

<a name='4'></a>

## 4. Implementations of Transformer

The purpose of this notebook is not to provide a step-by-step implementation of the Transformer, but it's important to give some pointers to tools that make it easier to use or implement Transformers.

The original Transformer paper was implemented in [Tensor2Tensor](https://github.com/tensorflow/tensor2tensor), a library for Transformer models in TensorFlow. A good way to get started with Tensor2Tensor is to go over this [Colab notebook](https://colab.research.google.com/github/tensorflow/tensor2tensor/blob/master/tensor2tensor/notebooks/hello_t2t.ipynb#scrollTo=odi2vIMHC3Rm).

[PyTorch text (torchtext)](https://github.com/pytorch/text) also contains Transformer-based models. [PyTorch Lightning](https://github.com/Lightning-AI/lightning) also offers some Transformer models.

The most popular collection of implementations of various Transformer models is the [Hugging Face 🤗 Transformers library](https://github.com/huggingface/transformers). The Hugging Face Transformers library is probably all you need if you want to use Transformers. It is so easy to use! As a sneak peek, below is how to perform sentiment analysis with the Transformers library:

```python
from transformers import pipeline

classifier = pipeline("sentiment-analysis")
classifier("There has never been something so easier to use than Hugging Face Transformers library in open-source community")
```
```
[{'label': 'POSITIVE', 'score': 0.7931336164474487}]
```


Also, [KerasNLP](https://keras.io/api/keras_nlp/) has cool building blocks for implementing Transformers. Below is an example of building a Transformer encoder in just a few lines of code:

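```python
import keras_nlp
from tensorflow import keras

# Illustrative sizes (assumptions, not tied to any particular dataset)
input_seq_len = 128   # number of tokens per input sequence
vocab_size = 10000    # number of distinct token IDs
embed_dim = 64        # dimension of each token embedding

inputs = keras.Input((input_seq_len,))
# Map integer token IDs to dense vectors
emb = keras.layers.Embedding(input_dim=vocab_size, output_dim=embed_dim)(inputs)
# The layer is called on the embeddings and returns encodings of the same shape
pos_enc = keras_nlp.layers.SinePositionEncoding()(emb)
x = emb + pos_enc
# One encoder block: feed-forward hidden size 64, 8 attention heads
outputs = keras_nlp.layers.TransformerEncoder(intermediate_dim=64, num_heads=8)(x)
transformer_encoder = keras.Model(inputs, outputs)
```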



Finally, all popular deep learning frameworks have started to incorporate Transformer building blocks as layers that you can use easily. Check the [Transformer layers in PyTorch](https://pytorch.org/docs/stable/nn.html#transformer-layers) and the [Attention layers in Keras/TensorFlow](https://keras.io/api/layers/attention_layers/).

<a name='5'></a>

## 5. Evolution of Large Language Transformer Models (LLMs)

Since the introduction of the Transformer architecture, language models have gotten bigger and bigger. The largest model in the original Transformer paper (the architecture we have been learning about) had 213 million parameters. A few years later, the number of parameters went up to billions. And what's amusing about those models is that they are all basically using the same Transformer architecture (with minimal modifications in some models).

The common theme of almost all language models (LMs) is that they utilize unsupervised pre-training. They are trained on large amounts of unlabelled text data from the internet, and they can then be fine-tuned or prompted (with a few samples or no samples at all) on downstream NLP tasks such as text classification, question answering, summarization, etc.

One of the earliest large language models (LLMs) based on the Transformer is BERT, short for Bidirectional Encoder Representations from Transformers, which was introduced in [BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding](https://arxiv.org/abs/1810.04805v2). The large BERT model had 340M parameters. There are many variants of BERT such as [RoBERTa](https://arxiv.org/abs/1907.11692v1) and [DistilBERT](https://arxiv.org/abs/1910.01108v4), among others.

The [GPT or Generative Pretrained Transformer](https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf) model series pushed the scale of language models further. [GPT-2](https://d4mucfpksywv.cloudfront.net/better-language-models/language-models.pdf) and [GPT-3](https://arxiv.org/pdf/2005.14165v4.pdf), for example, had 1.5 and 175 billion parameters respectively. There are many other large language models, such as [Megatron-LM](https://arxiv.org/pdf/1909.08053.pdf), which had 8.3 billion parameters.

Recent large language models rely on few-shot and zero-shot learning. Few-shot learning refers to prompting a model with just a few samples at inference time, without fine-tuning (that is, without updating its weights). If a model is given exactly one demonstration or sample before it makes predictions, it is called one-shot learning. Zero-shot learning refers to using a model to make predictions from the description of the task alone. A zero-shot model does not need any sample before it can work. It just needs the description of the task, and it can start translating languages or generating text right away (an example of a task description: complete the missing information in a sentence).

Examples of recent models, most of which are few-shot and zero-shot learners, are [GPT-3](https://arxiv.org/pdf/2005.14165v4.pdf), Google [PaLM](https://arxiv.org/abs/2204.02311), which has 540 billion parameters, DeepMind [Gopher](https://arxiv.org/pdf/2112.11446.pdf), which has 280 billion parameters, and DeepMind [Chinchilla](https://arxiv.org/pdf/2203.15556.pdf).


<a name='6'></a>

## 6. Transformers Beyond NLP

The Transformer was introduced in the Natural Language Processing (NLP) domain. More precisely, it was first introduced for neural machine translation, a task that involves translating one language to another. Since the Transformer showed excellent performance in pretty much all NLP tasks, people started to apply it to other modalities. Below, we briefly discuss the emergence of Transformers in computer vision and other modalities.

Let's start with computer vision. The first thing we should ask ourselves is why we should care about using Transformers in computer vision when we have ConvNets. Well, ConvNets work great, but they introduce [spatial inductive biases](https://samiraabnar.github.io/articles/2020-05/indist): the assumption that nearby pixels (in a local window) are correlated and more important than distant pixels.

Some of the same authors who invented the Transformer architecture tried to use it for image generation as a substitute for Generative Adversarial Networks (GANs). The architecture was dubbed Image Transformer, and it worked just like a standard language Transformer except that it operated on a sequence of pixels. So, given a sequence of input pixels, Image Transformer would predict the next pixel and build up the whole image. But this did not scale well since it is computationally expensive to apply self-attention to all pixels (recall that the time complexity of self-attention is $O(N^2D)$, where $N$ is the length of the input pixel sequence and $D$ is the channel dimension, which is 3 for color images). Thus, Image Transformer was applied to datasets of small images (like CIFAR-10, which has 32x32 images). But at the time (2018), this was state-of-the-art in image generation.

To overcome the computational bottleneck of applying self-attention to images, rather than attending to all pixels in an image, we can instead divide the image into a sequence of small parts (typically called patches) and apply a Transformer encoder or decoder to those patches. There may have been a number of works that attempted this, but the most popular and important one to date is ViT, which was introduced in [An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale](https://arxiv.org/abs/2010.11929v2). ViT is essentially a Transformer encoder operating on a sequence of image patches.
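The patching step can be sketched in a few lines of NumPy: the image is cut into non-overlapping square patches and each patch is flattened into one vector, giving a sequence that a Transformer encoder can consume. The function name and the 224x224 input size are assumptions for illustration, and the sketch stops before the linear projection and position embeddings that a full ViT applies to the patches.

```python
import numpy as np

def image_to_patches(image, patch_size):
    """Split an (H, W, C) image into a sequence of flattened patches.

    Returns an array of shape (num_patches, patch_size * patch_size * C).
    """
    h, w, c = image.shape
    assert h % patch_size == 0 and w % patch_size == 0, "image must divide evenly"
    gh, gw = h // patch_size, w // patch_size  # patch grid height and width

    patches = image.reshape(gh, patch_size, gw, patch_size, c)
    patches = patches.transpose(0, 2, 1, 3, 4)  # (gh, gw, patch_size, patch_size, c)
    return patches.reshape(gh * gw, patch_size * patch_size * c)

# Hypothetical 224x224 RGB image split into 16x16 patches
image = np.zeros((224, 224, 3))
patches = image_to_patches(image, patch_size=16)
print(patches.shape)  # (196, 768)
```

Instead of a sequence of 224 x 224 = 50,176 pixels, self-attention now sees only 196 tokens, which makes the quadratic cost manageable.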

![image](https://drive.google.com/uc?export=view&id=1uW5J4yvcpTeqs00mtZN5oZMwT0dkk_bB)


The downside of Transformers in computer vision is that they require a lot of training images (way more than public datasets such as ImageNet provide). ConvNets work great on small datasets due to the inductive biases that we alluded to previously. To overcome this extreme need for training data, people have tried combining convolutions in the early layers of the model with self-attention in the middle or later layers. A remarkable example of a ConvNet-Transformer topology is [DETR: End-to-End Object Detection With Transformers](https://alcinos.github.io/detr_page/), which uses a ResNet backbone network and a full Transformer (both encoder and decoder) to perform object detection. [Augmenting convolutional networks with attention-based aggregation](https://arxiv.org/abs/2112.13692) also uses convolutions in the early layers and uses attention as a substitute for the usual global pooling layer. Other remarkable papers that combine convolutions and self-attention are [Convolutional vision Transformer (CvT)](https://arxiv.org/pdf/2103.15808v1.pdf) and Mobile-Friendly Vision Transformer [(MobileViT)](https://arxiv.org/pdf/2110.02178v2.pdf).

The Transformer is of course used in other computer vision tasks such as [semantic segmentation](https://arxiv.org/abs/2105.05633v3), [human pose estimation](https://arxiv.org/abs/2204.12484v2), and other [dense prediction tasks](https://arxiv.org/abs/2205.08534v2).

Finally, Transformers are used in visual language tasks, or multimodal tasks, such as image captioning and visual question answering. They are also used in reinforcement learning. A recent example of this is [Gato, a generalist agent](https://www.deepmind.com/publications/a-generalist-agent) that can solve over 600 tasks using a single Transformer model.

<a name='7'></a>

## 7. Final Notes

In this notebook, we have been learning about the Transformer, one of the most important architectures and greatest inventions in the history of deep learning.

The main element of the Transformer architecture is self-attention, whose core idea is to attend to all parts of the input sequence. The one downside of the Transformer is the quadratic complexity of self-attention.

There have been modifications of the Transformer architecture, but they all look almost the same. You can see various Transformer-based models [here](https://paperswithcode.com/methods/category/transformers).

<a name='8'></a>

## 8. References and Further Learning

The Transformer was invented in 2017, but it's still new, and it is by far one of the most difficult topics to get your head around. Below I have put some videos and articles which might be useful. Note that you don't have to go over all of them. Pick a few that suit your taste!

### Videos

* [Attention, Lecture 13, Justin Johnson](https://www.youtube.com/watch?v=YAgjfMR9R_M&t=1888s)

* [Deep Learning for Natural Language Processing, DeepMind x UCL](https://www.youtube.com/watch?v=8zAP2qWAsKg&t=3310s)

* [Transformers and Self-Attention - Stanford CS224N Lecture 14](https://www.youtube.com/watch?v=5vcj8kSwBCY&t=2255s)

* [Attention is all you need; Attentional Neural Network Models | Łukasz Kaiser](https://www.youtube.com/watch?v=rBCqOTEfxvg&t=2224s)

* [LSTM is dead. Long Live Transformers!](https://www.youtube.com/watch?v=S27pHKBEp30&t=1094s)


### Articles

* [The Annotated Transformer](https://nlp.seas.harvard.edu/2018/04/03/attention.html)
* [The Illustrated Transformer](https://jalammar.github.io/illustrated-transformer/)
* [Transformers from Scratch](http://peterbloem.nl/blog/transformers)
* [Attention? Attention!](https://lilianweng.github.io/posts/2018-06-24-attention/#full-architecture)

### [BACK TO TOP](#0)