Encoders, Decoders, and Encoder-Decoder Models

➡️ Encoder models

Encoder models use only the encoder of a Transformer model. At each stage, the attention layers can access all the words in the initial sentence. These models are often characterized as having “bi-directional” attention, and are often called auto-encoding models.

Pretraining method

The pretraining of these models usually revolves around somehow corrupting a given sentence (for instance, by masking random words in it) and tasking the model with finding or reconstructing the initial sentence.
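
To make the objective concrete, here is a minimal sketch (assuming the Hugging Face `transformers` library and the `bert-base-uncased` checkpoint, both illustrative choices) in which an encoder model reconstructs a masked word:

```python
# Minimal sketch of the masked-word objective; the checkpoint is an illustrative choice.
from transformers import pipeline

unmasker = pipeline("fill-mask", model="bert-base-uncased")

# The model must reconstruct the word hidden behind the [MASK] token.
predictions = unmasker("The capital of France is [MASK].")
for p in predictions:
    print(p["token_str"], round(p["score"], 3))
```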

Tasks & examples

Encoder models are best suited for tasks requiring an understanding of the full sentence, such as sentence classification, named entity recognition (and more generally word classification), and extractive question answering.
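
As an example of one such task, extractive question answering can be sketched with the question-answering pipeline; the checkpoint below is an assumption picked for illustration, not the only option:

```python
# Hedged sketch of extractive question answering with an encoder model.
from transformers import pipeline

qa = pipeline("question-answering", model="distilbert-base-cased-distilled-squad")

result = qa(
    question="Which models are often called auto-encoding models?",
    context="Encoder models use only the encoder of a Transformer model "
            "and are often called auto-encoding models.",
)
# The answer is extracted as a span of the context, not generated from scratch.
print(result["answer"], result["score"])
```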

Representatives of this family of models include:

  • ALBERT
  • BERT
  • DistilBERT
  • ELECTRA
  • RoBERTa


⬅️ Decoder models

Decoder models use only the decoder of a Transformer model. At each stage, for a given word the attention layers can only access the words positioned before it in the sentence. These models are often called auto-regressive models.

Pretraining method

The pretraining of decoder models usually revolves around predicting the next word in the sentence.
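
A minimal sketch of this auto-regressive behaviour, assuming the `gpt2` checkpoint as an illustrative choice:

```python
# Sketch of next-word (auto-regressive) generation; the checkpoint is illustrative.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

# The model repeatedly predicts the next token given everything before it.
outputs = generator(
    "Transformers are",
    max_new_tokens=20,
    do_sample=True,
    num_return_sequences=2,
)
for out in outputs:
    print(out["generated_text"])
```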

Tasks & examples

These models are best suited for tasks involving text generation.

Representatives of this family of models include:

  • CTRL
  • GPT
  • GPT-2
  • Transformer XL


↔️ Sequence-to-sequence models

Encoder-decoder models (also called sequence-to-sequence models) use both parts of the Transformer architecture. At each stage, the attention layers of the encoder can access all the words in the initial sentence, whereas the attention layers of the decoder can only access the words positioned before a given word in the input.

Pretraining method

The pretraining of these models can be done using the objectives of encoder or decoder models, but usually involves something a bit more complex. For instance, T5 is pretrained by replacing random spans of text (that can contain several words) with a single mask special word, and the objective is then to predict the text that this mask word replaces.
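
A rough sketch of what this span-masking setup looks like at inference time, assuming the `t5-small` checkpoint; T5's `<extra_id_0>`, `<extra_id_1>`, … sentinel tokens play the role of the mask words:

```python
# Illustration of T5-style span corruption, assuming the "t5-small" checkpoint.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

# Two spans of the sentence are replaced by single sentinel tokens;
# the model generates the missing text for each sentinel.
inputs = tokenizer("The <extra_id_0> walks in <extra_id_1> park.", return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(output_ids[0], skip_special_tokens=False))
```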

Tasks & examples

Sequence-to-sequence models are best suited for tasks revolving around generating new sentences depending on a given input, such as summarization, translation, or generative question answering.
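
For instance, translation can be sketched with a pipeline; the Marian checkpoint below is an illustrative assumption for English-to-German:

```python
# Short sketch of conditional generation with a sequence-to-sequence model.
from transformers import pipeline

translator = pipeline("translation", model="Helsinki-NLP/opus-mt-en-de")

# The encoder reads the full English sentence; the decoder generates German
# one token at a time, attending only to what it has produced so far.
result = translator("Encoder-decoder models are well suited for translation.")
print(result[0]["translation_text"])
```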

Representatives of this family of models include:

  • BART
  • mBART
  • Marian
  • T5


Summary

The following table summarizes what we've seen above: