How do Transformers work?

In this section, we will take a high-level look at the architecture of Transformer models.

🗣 A bit of Transformer history

Transformer models have a short but busy history, marked by a steady stream of new architectures building on the original paper. Broadly, they can be grouped into three categories:

  • GPT-like (also called auto-regressive Transformer models)
  • BERT-like (also called auto-encoding Transformer models)
  • BART/T5-like (also called sequence-to-sequence Transformer models)

🈵 Transformers are language models

Transformer models have been trained on large amounts of raw text in a self-supervised fashion.
Self-supervised learning is a type of training in which the objective is automatically computed from the inputs of the model (no human-annotated labels are needed).

This type of model develops a statistical understanding of the language it has been trained on, but it’s not very useful for specific practical tasks.

Examples of such tasks (both are illustrated in the sketch below):
  • Causal Language Modeling: predicting the next word in a sentence after having read the n previous words.

  • Masked Language Modeling: predicting a masked word in a sentence.
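
Both objectives can be tried out directly through the transformers pipeline API. Here is a minimal sketch, assuming the transformers library is installed; the gpt2 and bert-base-uncased checkpoints are just convenient example models, not the only possible choices:

```python
from transformers import pipeline

# Causal language modeling: predict how a prompt continues, one word at a time.
generator = pipeline("text-generation", model="gpt2")
print(generator("In this course, we will teach you how to", max_new_tokens=15))

# Masked language modeling: predict the word hidden behind the [MASK] token.
unmasker = pipeline("fill-mask", model="bert-base-uncased")
print(unmasker("Transformer models are trained on large amounts of [MASK] text."))
```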

🔝 Transformers are big models

Apart from a few outliers (like DistilBERT), the general strategy to achieve better performance is to increase the models’ sizes as well as the amount of data they are pretrained on.

Unfortunately, training a model, especially a large one, requires a large amount of data and is costly in terms of time and compute resources. It even translates into an environmental impact.

➡️ Transfer Learning

  • Pretraining: act of training a model from scratch on very large amounts of data. The weights are randomly initialized, and the training starts without any prior knowledge. Therefore, it requires a very large corpus of data, and training can take up to several weeks.
  • Fine-tuning: additional training with a dataset specific to your task, starting from a pretrained language model. The fine-tuning process takes advantage of the knowledge acquired by the initial model during pretraining (the initial model’s knowledge is “transferred”, hence the term transfer learning).

Advantages of transfer learning (a minimal fine-tuning sketch follows below):
  • achieves better results than training from scratch (unless you have lots of data),
  • requires far less data to get decent results,
  • the amount of time and resources (data, hardware, money, etc.) needed to get good results is much lower,
  • lower environmental impact.
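
As a rough illustration of fine-tuning, here is a minimal sketch assuming the transformers and datasets libraries are installed, using the GLUE MRPC paraphrase dataset as a stand-in task and bert-base-cased as the pretrained checkpoint; the hyperparameters are placeholders, not recommendations:

```python
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          DataCollatorWithPadding, Trainer, TrainingArguments)

checkpoint = "bert-base-cased"   # pretrained weights: the expensive pretraining is already done
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

# A small labeled dataset for our specific task (here: paraphrase detection).
raw_datasets = load_dataset("glue", "mrpc")
tokenized = raw_datasets.map(
    lambda ex: tokenizer(ex["sentence1"], ex["sentence2"], truncation=True),
    batched=True,
)

# The body of the model reuses the pretrained weights; only the small
# classification head on top is randomly initialized.
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

trainer = Trainer(
    model=model,
    args=TrainingArguments("mrpc-finetuned", num_train_epochs=3),
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["validation"],
    data_collator=DataCollatorWithPadding(tokenizer),
)
trainer.train()   # hours at most on a single GPU, instead of weeks of pretraining
```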

🧩 General architecture

In this section, we’ll go over the general architecture of the Transformer model and then detail each of the components in the next sections.

🕸 Introduction 🕸

The model is primarily composed of two blocks:

  • Encoder: The encoder receives an input and builds a representation of it (its features). This means that the model is optimized to acquire understanding from the input.
  • Decoder: The decoder uses the encoder’s representation (features) along with other inputs to generate a target sequence. This means that the model is optimized for generating outputs.

Each of these parts can be used independently, depending on the task:

  • Encoder-only models: Good for tasks that require understanding of the input, such as sentence classification and NER (named entity recognition).
  • Decoder-only models: Good for generative tasks such as text generation.
  • Encoder-decoder models or sequence-to-sequence models: Good for generative tasks that require an input, such as translation or summarization.
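
A quick way to see the three families in action is, again, the pipeline API. This is only a sketch; the checkpoints named below are illustrative picks, assuming the transformers library is installed:

```python
from transformers import pipeline

# Encoder-only (BERT-like): understanding tasks such as sentence classification.
classifier = pipeline("text-classification",
                      model="distilbert-base-uncased-finetuned-sst-2-english")
print(classifier("This course is remarkably clear."))

# Decoder-only (GPT-like): open-ended text generation.
generator = pipeline("text-generation", model="gpt2")
print(generator("Transformers are", max_new_tokens=10))

# Encoder-decoder (T5-like): generation conditioned on an input, e.g. translation.
translator = pipeline("translation_en_to_fr", model="t5-small")
print(translator("Transformers are very useful for translation."))
```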

👁 Attention layers 👁

A key feature of Transformer models is that they are built with special layers called attention layers. In fact, the title of the paper introducing the Transformer architecture was “Attention Is All You Need”! For now, all you need to know is that this layer will tell the model to pay specific attention to certain words in the sentence you passed it (and more or less ignore the others) when dealing with the representation of each word.

Indeed, a word by itself has a meaning, but that meaning is deeply affected by the context, which can be any other word (or words) before or after the word being studied.
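
At its core, an attention layer computes, for each word, a weighted combination of all the words’ representations, where the weights say how much attention to pay to each of them. The sketch below shows the basic scaled dot-product computation for a single attention head, without the learned projection matrices a real Transformer adds; NumPy is assumed only for brevity:

```python
import numpy as np

def scaled_dot_product_attention(queries, keys, values):
    """Each query row attends to every key row; the resulting weights mix the value rows."""
    d_k = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d_k)         # pairwise similarities, shape (seq, seq)
    scores -= scores.max(axis=-1, keepdims=True)     # numerical stability for the softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax: how much attention each word pays
    return weights @ values                          # context-aware representation of each word

# 4 "words", each represented by an 8-dimensional vector (random, for illustration only).
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
print(scaled_dot_product_attention(x, x, x).shape)   # (4, 8)
```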

🧠 The original architecture 🧠

The Transformer architecture was originally designed for translation. During training, the encoder receives inputs (sentences) in a certain language, while the decoder receives the same sentences in the desired target language.

  • In the encoder, the attention layers can use all the words in a sentence.
  • The decoder, however, works sequentially and can only pay attention to the words in the sentence that it has already translated (so, only the words before the word currently being generated).

To speed things up during training (when the model has access to target sentences), the decoder is fed the whole target, but it is not allowed to use future words (if it had access to the word at position 2 when trying to predict the word at position 2, the problem would not be very hard!). For instance, when trying to predict the fourth word, the attention layer will only have access to the words in positions 1 to 3.
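
This “no peeking at the future” rule is usually implemented with a causal (lower-triangular) mask applied to the attention scores. A minimal sketch of such a mask, assuming NumPy just for display:

```python
import numpy as np

seq_len = 5
# True (1) where attention is allowed: position i may only look at positions 0..i.
causal_mask = np.tril(np.ones((seq_len, seq_len), dtype=bool))
print(causal_mask.astype(int))
# [[1 0 0 0 0]
#  [1 1 0 0 0]
#  [1 1 1 0 0]
#  [1 1 1 1 0]
#  [1 1 1 1 1]]
# Scores at the masked (0) positions are set to -inf before the softmax,
# so a word being predicted cannot attend to the words that come after it.
```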

In the original Transformer architecture diagram, the encoder sits on the left and the decoder on the right.

Note that the first attention layer in a decoder block pays attention to all (past) inputs to the decoder, but the second attention layer uses the output of the encoder. It can thus access the whole input sentence to best predict the current word. This is very useful as different languages can have grammatical rules that put the words in different orders, or some context provided later in the sentence may be helpful to determine the best translation of a given word.

The attention mask can also be used in the encoder/decoder to prevent the model from paying attention to some special words — for instance, the special padding word used to make all the inputs the same length when batching together sentences.
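
In the transformers library, this padding mask is what the tokenizer returns as attention_mask when you batch sentences together. A small sketch, assuming the bert-base-cased tokenizer as an example:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
batch = tokenizer(
    ["A short sentence.", "A noticeably longer sentence that forces the first one to be padded."],
    padding=True,
    return_tensors="pt",
)
print(batch["input_ids"].shape)   # both sequences padded to the same length
print(batch["attention_mask"])    # 1 = real token, 0 = padding the model should ignore
```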

🆚 Architectures vs. checkpoints 🆚

The terms architecture, checkpoint, and model have slightly different meanings:

  • Architecture: the skeleton of the model — the definition of each layer and each operation that happens within the model.
  • Checkpoint: the weights that will be loaded into a given architecture.
  • Model: this is an umbrella term that isn’t as precise as “architecture” or “checkpoint”: it can mean both. This course will specify architecture or checkpoint when it matters to reduce ambiguity. For example, BERT is an architecture while bert-base-cased, a set of weights trained by the Google team for the first release of BERT, is a checkpoint. However, one can say “the BERT model” and “the bert-base-cased model.”
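
The distinction is visible directly in code: instantiating an architecture from a config gives random weights, while loading a checkpoint fills that same architecture with trained weights. A short sketch, assuming the transformers library:

```python
from transformers import BertConfig, BertModel

# Architecture only: the BERT skeleton, with randomly initialized weights.
config = BertConfig()
random_bert = BertModel(config)

# Architecture + checkpoint: the same skeleton, loaded with the bert-base-cased weights.
pretrained_bert = BertModel.from_pretrained("bert-base-cased")
```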