# Transformers and Hugging Face

In this tutorial, we will learn what Transformer models are and how to use them with Hugging Face.

We will focus mainly on natural language processing tasks. We will use the Transformers library from Hugging Face, which provides a simple and efficient way to use pre-trained models for various NLP tasks.

## Installation

To install the Transformers library, run the following command:

```bash
pip install transformers
```

Alternatively, install it from the `requirements.txt` file:

```bash
pip install -r requirements.txt
```

## Transformers

In this section, we will look at what Transformer models can do and use our first tool from the 🤗 Transformers library: the `pipeline()` function.

## Pipelines

The most basic object in the 🤗 Transformers library is the `pipeline()` function. It connects a model with its necessary preprocessing and postprocessing steps, allowing us to directly input any text and get an intelligible answer:

```python
from transformers import pipeline

classifier = pipeline("sentiment-analysis")
classifier("I've been waiting for a HuggingFace course my whole life.")
```

We can even pass several sentences! The pipeline will return a list of dictionaries, one for each sentence:

```python
classifier([
    "I've been waiting for a HuggingFace course my whole life.",
    "I hate this so much!"
])
```

By default, this pipeline selects a particular pretrained model that has been fine-tuned for **sentiment analysis** in **English**. The model is downloaded and cached when you create the ``classifier`` object. If you rerun the command, the cached model will be used instead and there is no need to download the model again.

There are three main steps involved when you pass some text to a pipeline:

1. The text is preprocessed into a format the model can understand.
2. The preprocessed inputs are passed to the model.
3. The predictions of the model are post-processed, so you can make sense of them.
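
To make these three steps concrete, here is a minimal toy sketch in plain Python. It is not how the library implements `pipeline()`: the vocabulary, the weights, and the helper names (`preprocess`, `model`, `postprocess`, `toy_pipeline`) are all hypothetical and exist only to show the shape of each step. Only the output format, a dictionary with a `label` and a `score`, mirrors what the sentiment pipeline returns.

```python
import math

# Hypothetical toy vocabulary and weights, invented for illustration only.
VOCAB = {"love": 0, "great": 1, "hate": 2, "awful": 3}
LABELS = ["NEGATIVE", "POSITIVE"]
# One row per label, one column per vocabulary word.
WEIGHTS = [
    [-1.0, -1.0, 2.0, 2.0],  # NEGATIVE
    [2.0, 2.0, -1.0, -1.0],  # POSITIVE
]


def preprocess(text):
    """Step 1: turn raw text into numbers (here, counts of known words)."""
    counts = [0] * len(VOCAB)
    for word in text.lower().split():
        word = word.strip(".,!?")
        if word in VOCAB:
            counts[VOCAB[word]] += 1
    return counts


def model(features):
    """Step 2: produce one raw score (logit) per label."""
    return [sum(w * x for w, x in zip(row, features)) for row in WEIGHTS]


def postprocess(logits):
    """Step 3: convert logits to probabilities and report the best label."""
    top = max(logits)
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return {"label": LABELS[best], "score": probs[best]}


def toy_pipeline(text):
    return postprocess(model(preprocess(text)))


print(toy_pipeline("I hate this so much!"))
```

In the real library, step 1 is handled by a tokenizer and step 2 by a neural network with millions of parameters, but the flow of data is the same.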

Some of the currently available pipelines are:

- ``feature-extraction`` (**get the vector representation of a text**): This is useful when you need fixed-dimensional feature vectors as inputs to other models.
- ``fill-mask``: This will fill in the masked part of the sentence. The model needs to have a masked language model head.
- ``ner`` (**named entity recognition**): This will recognize the entities in the text (like names of people, organizations, locations, etc.)
- ``question-answering``: This will extract the answer to a question from the provided text.
- ``sentiment-analysis``: This will analyze the sentiment of a text.
- ``summarization``: This will generate a summary of a long text.
- ``text-generation``: This will generate a text based on a prompt.
- ``translation``: This will translate a text into another language.
- ``zero-shot-classification``: This allows you to specify which labels to use for the classification, without having to fine-tune the model on your data.

Let’s have a look at a few of these!

## Zero-shot classification

We’ll start by tackling a more challenging task where we need to classify texts that haven’t been labelled. This is a common scenario in real-world projects because annotating text is usually time-consuming and requires domain expertise. For this use case, the ``zero-shot-classification`` pipeline is very powerful: it allows you to specify which labels to use for the classification, so you don’t have to rely on the labels of the pretrained model. You’ve already seen how the model can classify a sentence as positive or negative using those two labels — but it can also classify the text using any other set of labels you like.

```python
from transformers import pipeline

classifier = pipeline("zero-shot-classification")
classifier(
    "This is a course about the Transformers library",
    candidate_labels=["education", "politics", "business"],
)
```

This pipeline is called ``zero-shot`` because you don’t need to fine-tune the model on your data to use it. It can directly return probability scores for any list of labels you want!
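
One common way to get scores for arbitrary labels is to reuse a natural language inference (NLI) model. Each candidate label is turned into a hypothesis sentence, the model scores how strongly the input text entails each hypothesis, and the scores are normalized across labels. The sketch below shows only that bookkeeping. The template string, the helper names, and the entailment scores are assumptions made for illustration; no real model is called.

```python
import math


def build_hypotheses(candidate_labels, template="This example is {}."):
    """Turn each label into a sentence an NLI model could score."""
    return [template.format(label) for label in candidate_labels]


def scores_from_entailment(entailment_logits):
    """Softmax over the per-label entailment scores."""
    top = max(entailment_logits)
    exps = [math.exp(z - top) for z in entailment_logits]
    total = sum(exps)
    return [e / total for e in exps]


labels = ["education", "politics", "business"]
hypotheses = build_hypotheses(labels)

# Hypothetical entailment logits standing in for the output of an NLI model.
toy_logits = [3.0, -1.0, 0.5]
scores = scores_from_entailment(toy_logits)

for hypothesis, score in zip(hypotheses, scores):
    print(f"{hypothesis!r}: {score:.3f}")
```

Because the labels only appear inside the hypothesis sentences, you can swap in any list of labels without retraining anything.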


## Text Generation

Now let’s see how to use a pipeline to generate some text. The main idea here is that you provide a prompt and the model will auto-complete it by generating the remaining text. This is similar to the predictive text feature that is found on many phones. Text generation involves randomness, so it’s normal if you don’t always get the same results.


```python
from transformers import pipeline

generator = pipeline("text-generation")
generator("In this course, we will teach you how to")
```

You can control how many different sequences are generated with the argument ``num_return_sequences`` and the total length of the output text with the argument ``max_length``.


```python
generator(
    "In this course, we will teach you how to",
    max_length=30,
    num_return_sequences=2,
)
```
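
To see where the randomness comes from and what the two arguments control, here is a toy sketch that samples one word at a time from a hand-written table of next-word probabilities. The table and the `generate` helper are hypothetical. A real model computes these probabilities with a neural network and counts tokens rather than words, but as in the library, `max_length` here includes the prompt.

```python
import random

# Hypothetical next-word probabilities, invented for illustration only.
NEXT_WORD = {
    "to": [("build", 0.5), ("train", 0.3), ("use", 0.2)],
    "build": [("models", 0.6), ("pipelines", 0.4)],
    "train": [("models", 0.7), ("tokenizers", 0.3)],
    "use": [("pipelines", 0.5), ("models", 0.5)],
}


def generate(prompt, max_length, num_return_sequences, seed=0):
    """Sample several continuations of the prompt, word by word."""
    rng = random.Random(seed)
    outputs = []
    for _ in range(num_return_sequences):
        words = prompt.split()
        # Stop at max_length words (prompt included) or when no
        # continuation is known for the last word.
        while len(words) < max_length and words[-1] in NEXT_WORD:
            choices, weights = zip(*NEXT_WORD[words[-1]])
            words.append(rng.choices(choices, weights=weights)[0])
        outputs.append(" ".join(words))
    return outputs


for text in generate("we will teach you how to", max_length=8, num_return_sequences=2):
    print(text)
```

Changing `seed` changes which words are drawn, which is the toy counterpart of getting different completions each time you call the real pipeline.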

## Using any model from the Hub in a pipeline

The previous examples used the default model for the task at hand, but you can also choose a particular model from the Hub to use in a pipeline for a specific task — say, text generation. Go to the Model Hub and click on the corresponding tag on the left to display only the supported models for that task.

Let’s try the ``distilgpt2`` model! Here’s how to load it in the same pipeline as before:


```python
from transformers import pipeline

generator = pipeline("text-generation", model="distilgpt2")
generator(
    "In this course, we will teach you how to",
    max_length=30,
    num_return_sequences=2,
)
```

You can refine your search for a model by clicking on the language tags, and pick a model that will generate text in another language. The Model Hub even contains checkpoints for multilingual models that support several languages.

Once you select a model by clicking on it, you’ll see that there is a widget enabling you to try it directly online. This way you can quickly test the model’s capabilities before downloading it.


## Mask filling

The next pipeline you’ll try is ``fill-mask``. The idea of this task is to fill in the blanks in a given text:

```python
from transformers import pipeline

unmasker = pipeline("fill-mask")
unmasker("This course will teach you all about <mask> models.", top_k=2)
```

The ``top_k`` argument controls how many possibilities you want to be displayed. Note that here the model fills in the special ``<mask>`` word, which is often referred to as a mask token. Other mask-filling models might have different mask tokens, so it’s always good to verify the proper mask word when exploring other models. One way to check it is by looking at the mask word used in the widget.


## Named Entity Recognition (NER)

Named entity recognition (``NER``) is a task where the model has to find which parts of the input text correspond to entities such as persons, locations, or organizations. Let’s look at an example:


```python
from transformers import pipeline

ner = pipeline("ner", grouped_entities=True)
ner("My name is Sylvain and I work at Hugging Face in Brooklyn.")
```

Here the model correctly identified that ``Sylvain`` is a person (**PER**), ``Hugging Face`` an organization (**ORG**), and ``Brooklyn`` a location (**LOC**).

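We pass the option ``grouped_entities=True`` in the pipeline creation function to tell the pipeline to group together the parts of the sentence that correspond to the same entity: here the model correctly grouped “Hugging” and “Face” as a single organization, even though the name consists of multiple words. Note that recent releases of the library deprecate ``grouped_entities`` in favor of the ``aggregation_strategy`` argument, so you may see a warning when running this cell.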
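
The grouping idea can be sketched in a few lines of plain Python. The word-level tags below are hypothetical stand-ins for model predictions, and `group_entities` is an illustrative helper, not the library's code. It simply merges neighbouring words that share an entity type. A real model typically uses richer tags to tell apart two different entities of the same type that sit next to each other.

```python
def group_entities(tagged_words):
    """Merge adjacent words that carry the same entity type."""
    groups = []
    previous = "O"  # "O" marks a word outside any entity
    for word, entity in tagged_words:
        if entity != "O":
            if entity == previous:
                # Same type as the previous word: extend the current group.
                groups[-1]["word"] += " " + word
            else:
                groups.append({"entity_group": entity, "word": word})
        previous = entity
    return groups


# Hypothetical word-level predictions for the example sentence.
tagged = [
    ("My", "O"), ("name", "O"), ("is", "O"), ("Sylvain", "PER"),
    ("and", "O"), ("I", "O"), ("work", "O"), ("at", "O"),
    ("Hugging", "ORG"), ("Face", "ORG"), ("in", "O"),
    ("Brooklyn", "LOC"), (".", "O"),
]

for group in group_entities(tagged):
    print(group)
```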


## Exercises:

### Create a pipeline for `question-answering`:

```python
# code here
```

### Create a pipeline for `summarization`:

```python
# code here
```

### Create a pipeline for `translation`:

```python
# code here
```

## How do Transformers work?

### A bit of Transformer history

Here are some reference points in the (short) history of Transformer models:

<img src="img/transformers.png">

The Transformer architecture was introduced in **June 2017**. The focus of the original research was on translation tasks. This was followed by the introduction of several influential models, including:

- **June 2018**: ``GPT``, the first pretrained Transformer model, which was fine-tuned on various NLP tasks and obtained state-of-the-art results.
- **October 2018**: ``BERT``, another large pretrained model, this one designed to produce better summaries of sentences (more on this in the next chapter!).
- **February 2019**: ``GPT-2``, an improved (and bigger) version of GPT that was not immediately publicly released due to ethical concerns.
- **October 2019**: ``DistilBERT``, a distilled version of BERT that is 60% faster, 40% lighter in memory, and still retains 97% of BERT’s performance.
- **October 2019**: ``BART`` and ``T5``, two large pretrained models using the same architecture as the original Transformer model (the first to do so).
- **May 2020**: ``GPT-3``, an even bigger version of GPT-2 that is able to perform well on a variety of tasks without the need for fine-tuning (called **zero-shot learning**).

This list is far from comprehensive and is just meant to highlight a few of the different kinds of Transformer models. Broadly, they can be grouped into three categories:

1. **GPT-like** (also called auto-regressive Transformer models)
2. **BERT-like** (also called auto-encoding Transformer models)
3. **BART/T5-like** (also called sequence-to-sequence Transformer models)




## Transformers are language models 

All the Transformer models mentioned above (``GPT``, ``BERT``, ``BART``, ``T5``, etc.) have been trained as **language models**. This means they have been trained on large amounts of raw text in a **self-supervised** fashion. **Self-supervised learning** is a type of training in which the objective is automatically computed from the inputs of the model. That means that humans are not needed to label the data!

This type of model develops a statistical understanding of the language it has been trained on, but it’s not very useful for specific practical tasks. Because of this, the general pretrained model then goes through a process called **transfer learning**. During this process, the model is fine-tuned in a **supervised** way — that is, using human-annotated labels — on a given task.

An example of a self-supervised pretraining task is predicting the next word in a sentence after reading the ``n`` previous words. This is called **causal language modeling** because the output depends on the past and present inputs, but not the future ones.
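
The sketch below shows why no human labels are needed for this objective: every training target is simply the next word of the raw text. The helper name `causal_lm_examples` is hypothetical, and splitting on spaces stands in for a real tokenizer.

```python
def causal_lm_examples(sentence):
    """Build (context, next word) pairs from raw text, with no human labels."""
    words = sentence.split()
    return [(words[:i], words[i]) for i in range(1, len(words))]


for context, target in causal_lm_examples("My name is Sylvain"):
    print(context, "->", target)
```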

<img src="img/next-word.png">

Another example is **masked language modeling**, in which the model predicts a masked word in the sentence.
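
Here too the labels come from the text itself. In the toy sketch below, some words are hidden at random and the hidden words become the targets. The helper `mask_words` and its parameters are assumptions for illustration; real pretraining recipes work on tokens and apply additional rules.

```python
import random


def mask_words(sentence, mask_token="[MASK]", mask_prob=0.15, seed=0):
    """Hide random words; the hidden words become the training labels."""
    rng = random.Random(seed)
    inputs, labels = [], []
    for word in sentence.split():
        if rng.random() < mask_prob:
            inputs.append(mask_token)
            labels.append(word)  # the model must recover this word
        else:
            inputs.append(word)
            labels.append(None)  # nothing to predict at this position
    return inputs, labels


inputs, labels = mask_words(
    "This course will teach you all about Transformer models", mask_prob=0.3
)
print(inputs)
print(labels)
```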

<img src="img/masked.png">




## Transformers are big models

Apart from a few outliers (like `DistilBERT`), the general strategy to achieve better performance is to increase the models’ sizes as well as the amount of data they are pretrained on.

<img src="img/llms.png">

Unfortunately, training a model, especially a large one, requires a large amount of data. This becomes very costly in terms of time and compute resources. It even translates to environmental impact, as can be seen in the following graph.

<img src="img/co2.png">



## Transfer Learning

**Pretraining** is the act of training a model from scratch: the weights are randomly initialized, and the training starts without any prior knowledge.

<img src="img/llm-training.png">

This pretraining is usually done on a very large corpus of data, and training can take up to several weeks.

**Fine-tuning**, on the other hand, is the training done after a model has been pretrained. To perform fine-tuning, you first acquire a pretrained language model, then perform additional training with a dataset specific to your task. But why not simply train the model for your final use case from scratch? There are a few reasons:

1. The pretrained model was already trained on a dataset that has some similarities with the fine-tuning dataset. The fine-tuning process is thus able to take advantage of knowledge acquired by the initial model during pretraining (for instance, with NLP problems, the pretrained model will have some kind of statistical understanding of the language you are using for your task).

2. Since the pretrained model was already trained on lots of data, the fine-tuning requires far less data to get decent results.

3. For the same reason, the amount of time and resources needed to get good results is much lower.

For example, one could leverage a pretrained model trained on the English language and then fine-tune it on an ``arXiv`` corpus, resulting in a science/research-based model. The fine-tuning will only require a limited amount of data: the knowledge the pretrained model has acquired is “transferred,” hence the term **transfer learning**.

<img src="img/fine-tuning.png">

Fine-tuning a model therefore has lower **time**, **data**, **financial**, and **environmental** costs. It is also quicker and easier to iterate over different fine-tuning schemes, as the training is less constraining than a full pretraining.

This process will also achieve better results than training from scratch (unless you have lots of data), which is why you should always try to leverage a **pretrained model** — one as close as possible to the task you have at hand — and fine-tune it.
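
The benefit of starting from pretrained weights can be illustrated with a deliberately tiny example: a one-parameter model `y = w * x` trained by gradient descent. Everything here (the data, the `train` helper, the learning rate) is a hypothetical toy with nothing Transformer-specific about it. It only shows that when the new task is close to the pretraining task, starting from the pretrained weight needs fewer steps than starting from zero.

```python
def train(w, data, lr=0.1, tol=1e-4, max_steps=10_000):
    """Fit y = w * x by gradient descent; return (w, steps taken)."""
    for step in range(max_steps):
        loss = sum((w * x - y) ** 2 for x, y in data) / len(data)
        if loss < tol:
            return w, step
        grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
        w -= lr * grad
    return w, max_steps


xs = [0.5, 1.0, 1.5, 2.0]
pretrain_data = [(x, 2.0 * x) for x in xs]  # "general" task: w = 2.0
finetune_data = [(x, 2.2 * x) for x in xs]  # related task: w = 2.2

# Pretraining: start from an uninformed weight.
pretrained_w, _ = train(0.0, pretrain_data)

# Same target task, two different starting points.
_, steps_scratch = train(0.0, finetune_data)
_, steps_finetune = train(pretrained_w, finetune_data)

print("steps from scratch:   ", steps_scratch)
print("steps from pretrained:", steps_finetune)
```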





## General Architecture of a Transformer model

The model is primarily composed of two blocks:

- **Encoder** (left): The encoder receives an input and builds a representation of it (its features). This means that the model is optimized to acquire understanding from the input.

- **Decoder** (right): The decoder uses the encoder’s representation (features) along with other inputs to generate a target sequence. This means that the model is optimized for generating outputs.

<img src="img/general-architecture.png">

Each of these parts can be used independently, depending on the task:

- **Encoder-only models**: Good for tasks that require understanding of the input, such as sentence classification and named entity recognition.

- **Decoder-only models**: Good for generative tasks such as text generation.

- **Encoder-decoder models** or **sequence-to-sequence models**: Good for generative tasks that require an input, such as translation or summarization.





## Attention layers

A key feature of **Transformer models** is that they are built with special layers called **attention layers**. In fact, the title of the paper introducing the Transformer architecture was *“Attention Is All You Need”*! In a nutshell, attention layers will tell the model to pay specific attention to certain words in the sentence you passed it (and more or less ignore the others) when dealing with the representation of each word.

To put this into context, consider the task of translating text from English to French. Given the input *“You like this course”*, a translation model will need to also attend to the adjacent word *“You”* to get the proper translation for the word *“like”*, because in French the verb *“like”* is conjugated differently depending on the subject. The rest of the sentence, however, is not useful for the translation of that word. In the same vein, when translating *“this”*, the model will also need to pay attention to the word *“course”*, because *“this”* translates differently depending on whether the associated noun is masculine or feminine. Again, the other words in the sentence will not matter for the translation of *“this”*. With more complex sentences (and more complex grammar rules), the model would need to pay special attention to words that might appear farther away in the sentence to properly translate each word.

The same concept applies to any task associated with **natural language**: a word by itself has a meaning, but that meaning is deeply affected by the context, which can be any other word (or words) before or after the word being studied.
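
As a rough sketch of the mechanics, the code below implements scaled dot-product attention, the formulation used in *“Attention Is All You Need”*, in plain Python. The two-dimensional word vectors are made up for illustration, and a real layer would first pass them through learned projections to obtain separate queries, keys, and values. The point is only that each word's new representation is a weighted average of all the words, with weights that sum to 1.

```python
import math


def softmax(xs):
    top = max(xs)
    exps = [math.exp(x - top) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors."""
    d = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # How well does this query match each key?
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        # Weighted average of the value vectors.
        mixed = [sum(w * v[j] for w, v in zip(weights, values))
                 for j in range(len(values[0]))]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights


words = ["You", "like", "this", "course"]
# Hypothetical 2-d word vectors, invented for illustration only.
vectors = [[1.0, 0.0], [0.8, 0.6], [0.0, 1.0], [0.2, 0.9]]

outputs, weights = attention(vectors, vectors, vectors)
for word, row in zip(words, weights):
    print(word, [round(w, 2) for w in row])
```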

Now that you have an idea of what **attention layers** are all about, let’s take a closer look at the **Transformer architecture**.


## The original architecture

The **Transformer architecture** was originally designed for translation. During training, the **encoder** receives inputs (sentences) in a certain language, while the **decoder** receives the same sentences in the desired target language. In the encoder, the attention layers can use all the words in a sentence (since, as we just saw, the translation of a given word can depend on what comes after as well as what comes before it in the sentence). The decoder, however, works sequentially and can only pay attention to the words in the sentence that it has already translated (i.e., only the words before the word currently being generated). For example, when we have predicted the first three words of the translated target, we give them to the decoder, which then uses all the inputs from the encoder to try to predict the fourth word.

To speed things up during training (when the model has access to target sentences), the decoder is fed the whole target, but it is not allowed to use future words. If it had access to the word at position 2 when trying to predict the word at position 2, the problem would not be very hard! For instance, when trying to predict the fourth word, the attention layer will only have access to the words in positions 1 to 3.
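
This restriction is usually implemented with a causal mask applied before the softmax. The sketch below is a minimal illustration in plain Python, with hypothetical helper names and made-up scores. For simplicity it ignores any start-of-sequence token, so row 3 of the mask is the one used when the first three words are known and the fourth is being predicted.

```python
import math


def causal_mask(n):
    """mask[i][j] is True when position i may attend to position j."""
    return [[j <= i for j in range(n)] for i in range(n)]


def masked_softmax(scores, allowed):
    """Softmax that gives zero weight to disallowed positions."""
    masked = [s if ok else -math.inf for s, ok in zip(scores, allowed)]
    top = max(masked)
    exps = [math.exp(s - top) for s in masked]
    total = sum(exps)
    return [e / total for e in exps]


mask = causal_mask(4)
for row in mask:
    print(["x" if ok else "." for ok in row])

# Hypothetical raw attention scores for the row that predicts word 4.
weights = masked_softmax([0.1, 0.2, 0.3, 0.4], mask[2])
print(weights)
```

The last weight is exactly zero: even though the whole target sentence is fed in at once, the fourth word cannot influence its own prediction.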

The original **Transformer architecture** looked like this, with the encoder on the left and the decoder on the right:

<img src="img/original_transformer.png">

Note that the **first attention layer** in a decoder block pays attention to all (past) inputs to the decoder, while the **second attention layer** uses the output of the encoder. This allows the decoder to access the entire input sentence to best predict the current word. This is especially useful because different languages often have grammatical rules that rearrange word order, or context provided later in a sentence may influence the translation of a particular word.

An **attention mask** can also be used in the encoder/decoder to prevent the model from focusing on certain special tokens — for example, the **padding tokens** that are added to standardize the length of inputs when batching sentences together.
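
Here is a minimal sketch of how such a mask is built when batching. The token ids are hypothetical and `pad_batch` is an illustrative helper; in practice the tokenizer produces both lists for you, under the same names `input_ids` and `attention_mask`.

```python
def pad_batch(sequences, pad_id=0):
    """Pad sequences to a common length and mark real tokens with 1."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(list(seq) + [pad_id] * n_pad)
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return input_ids, attention_mask


# Hypothetical token ids for two sentences of different lengths.
batch = [[101, 7592, 2088, 102], [101, 7592, 102]]
input_ids, attention_mask = pad_batch(batch)
print(input_ids)
print(attention_mask)
```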


## Encoder Models

Encoder models use only the **encoder** of a Transformer model. At each stage, the attention layers can access **all the words** in the initial sentence, making them suitable for tasks that require an understanding of the full input. These models are often characterized as having **bi-directional** attention, meaning each word can attend to the words on both its left and its right. They are also often called **auto-encoding models**.

The **pretraining** of these models typically involves **corrupting** a given sentence (e.g., by masking random words) and then tasking the model with **reconstructing** the original sentence. This self-supervised approach helps the model acquire a broad understanding of language.

### Best-Suited Tasks:
- Sentence classification
- Named entity recognition (NER)
- Word classification
- Extractive question answering

### Representative Models:
- **ALBERT**
- **BERT**
- **DistilBERT**
- **ELECTRA**
- **RoBERTa**

<img src="img/bert.png">


## Decoder Models

Decoder models use only the **decoder** of a Transformer model. At each stage, for a given word, the attention layers can only access the words positioned **before** it in the sentence. This makes them suitable for tasks where predicting the next word is necessary. These models are often called **auto-regressive models**.

The **pretraining** of decoder models typically involves **predicting the next word** in a sentence, which helps the model learn language generation.

### Best-Suited Tasks:
- Text generation
- Language modeling

### Representative Models:
- **CTRL**
- **GPT**
- **GPT-2**
- **Transformer XL**

<img src="img/gpt_arch.png">


## Sequence-to-sequence models (Encoder-Decoder models)

Encoder-decoder models, also known as **sequence-to-sequence models**, use both parts of the Transformer architecture. 

- **Encoder**: At each stage, the attention layers can access all the words in the initial sentence.
- **Decoder**: The attention layers can only access the words positioned before a given word in the input.

The **pretraining** of these models can involve the objectives of either encoder or decoder models, but it often includes more complex objectives. For instance, **T5** is pretrained by replacing random spans of text (which can contain several words) with a single **mask** word, and the model's objective is to predict the text that the mask replaces.
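
The span-replacement objective can be sketched as follows. The spans are chosen by hand here rather than at random, `corrupt_spans` is a hypothetical helper, and the `<extra_id_N>` names follow the convention used for T5's mask (sentinel) tokens in the library. Each masked span gets its own sentinel so the target can say which text belongs where.

```python
def corrupt_spans(words, spans):
    """Replace each (start, end) word span with a sentinel token.

    `spans` must be sorted and non-overlapping; `end` is exclusive.
    Returns the corrupted input and the target the model should produce.
    """
    inputs, targets = [], []
    cursor = 0
    for i, (start, end) in enumerate(spans):
        sentinel = f"<extra_id_{i}>"
        inputs.extend(words[cursor:start])  # keep the text before the span
        inputs.append(sentinel)             # one sentinel replaces the span
        targets.append(sentinel)
        targets.extend(words[start:end])    # the hidden text goes in the target
        cursor = end
    inputs.extend(words[cursor:])
    targets.append(f"<extra_id_{len(spans)}>")  # closing sentinel
    return " ".join(inputs), " ".join(targets)


words = "Thank you for inviting me to your party last week".split()
model_input, model_target = corrupt_spans(words, [(2, 4), (8, 9)])
print(model_input)
print(model_target)
```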

### Best-Suited Tasks:
- Summarization
- Translation
- Generative question answering

### Representative Models:
- **BART**
- **mBART**
- **Marian**
- **T5**

<img src="img/all_arch.png">


## Bias and Limitations

### Limitations of Pretrained Models in Production

While pretrained models or their fine-tuned versions can be powerful tools for various tasks, it's important to be aware of their limitations, especially when using them in production.

The **biggest limitation** arises from the nature of pretraining. To enable pretraining on large datasets, researchers often scrape vast amounts of data from the internet. This means that the model might be exposed to **both high-quality and low-quality content**. As a result, some models may unintentionally learn biases, inaccuracies, or undesirable patterns from the internet's vast and unfiltered content.


#### Example: Fill-Mask Pipeline with BERT

Consider a **fill-mask** pipeline using BERT. When tasked with filling in a masked word, the model might sometimes produce results based on content it learned during pretraining, which could be biased or inappropriate, especially if the dataset includes noisy or problematic sources.

```python
from transformers import pipeline

unmasker = pipeline("fill-mask", model="bert-base-uncased")
result = unmasker("This man works as a [MASK].")
print([r["token_str"] for r in result])

result = unmasker("This woman works as a [MASK].")
print([r["token_str"] for r in result])
```

### Gender Bias in Pretrained Models

When asked to fill in the masked word in the two sentences above, BERT can still produce biased results, even though it was pretrained on seemingly neutral datasets (English Wikipedia and BookCorpus). For instance, for "This woman works as a [MASK]." the model might suggest occupations stereotypically associated with one gender, such as "waitress" or even "prostitute."

In some cases, it may offer a more gender-neutral response like "waiter/waitress," but this doesn't solve the deeper issue: the model inherits the biases present in the data it was trained on.

This illustrates that pretrained models can inadvertently generate **sexist, racist, or homophobic content**, even when their training data is carefully selected. Fine-tuning the model on your specific data does not necessarily eliminate these intrinsic biases.

### Takeaway
When using pretrained models, always be aware that the model might generate biased or harmful content, and make sure to evaluate and mitigate these risks before deploying it in production.


## Summary

We explored the following topics:

- **Using the `pipeline()` function** from 🤗 Transformers for various NLP tasks.
- **Searching for and using models** from the Hugging Face Model Hub.

We also discussed the core architecture of Transformer models, highlighting the importance of **transfer learning** and **fine-tuning**. Depending on the task you need to solve, you can use either the full architecture or only the encoder or decoder. Here's a summary of these different approaches:

| **Model Type**              | **Description**                                                                                       | **Typical Use Cases**                                                                   |
|-----------------------------|-------------------------------------------------------------------------------------------------------|-----------------------------------------------------------------------------------------| 
| **Encoder-only models**     | Use only the encoder of the Transformer model. Understands input sentences.                           | Sentence classification, named entity recognition (NER), extractive question answering. | 
| **Decoder-only models**     | Use only the decoder. Focus on generating text from prior context.                                    | Text generation, autoregressive tasks (e.g., GPT, GPT-2).                               | 
| **Encoder-decoder models**  | Use both encoder and decoder. Suitable for tasks that involve both understanding and generating text. | Translation, summarization, generative question answering (e.g., T5, BART).             |

### Key Takeaways:
- **Transfer learning**: Pretraining on a large dataset and then fine-tuning on a task-specific dataset allows models to learn more efficiently.
- **Encoder vs. Decoder**: Choose based on whether the task involves understanding input (encoder) or generating output (decoder).
- **Fine-tuning**: Adapt pretrained models to your specific needs with relatively small amounts of task-specific data.

By leveraging these models effectively, you can solve a wide range of NLP problems with state-of-the-art performance.


# Quiz

### 1. Explore the Hub and look for the roberta-large-mnli checkpoint. What task does it perform?

<ol type="a">
  <li>Summarization</li>
  <li>Text Classification</li>
  <li>Text Generation</li>
</ol>

### 2. What will the following code return?

```python
from transformers import pipeline

ner = pipeline("ner", grouped_entities=True)
ner("My name is Sylvain and I work at Hugging Face in Brooklyn.")
```

<ol type="a">
  <li>It will return classification scores for this sentence, with labels "positive" or "negative".</li>
  <li>It will return a generated text completing this sentence.</li>
  <li>It will return the words representing persons, organizations or locations.</li>
</ol>



### 3. What should replace "…" in this code sample?

```python
from transformers import pipeline

filler = pipeline("fill-mask", model="bert-base-cased")
result = filler("...")
```

<ol type="a">
  <li>This &lt;mask&gt; has been waiting for you.</li>
  <li>This [MASK] has been waiting for you.</li>
  <li>This man has been waiting for you.</li>
</ol>

### 4. Why will this code fail?

```python
from transformers import pipeline

classifier = pipeline("zero-shot-classification")
result = classifier("This is a course about the Transformers library")
```

<ol type="a">
  <li>This pipeline requires that labels be given to classify this text.</li>
  <li>This pipeline requires several sentences, not just one.</li>
  <li>The 🤗 Transformers library is broken, as usual.</li>
  <li>This pipeline requires longer inputs; this one is too short.</li>
</ol>

### 5. What does “transfer learning” mean?

<ol type="a">
  <li>Transferring the knowledge of a pretrained model to a new model by training it on the same dataset.</li>
  <li>Transferring the knowledge of a pretrained model to a new model by initializing the second model with the first model's weights.</li>
  <li>Transferring the knowledge of a pretrained model to a new model by building the second model with the same architecture as the first model.</li>
</ol>


### 6. True or false? A language model usually does not need labels for its pretraining.

<ol type="a">
  <li>True</li>
  <li>False</li>
</ol>


### 7. Which of these types of models would you use for completing prompts with generated text?

<ol type="a">
  <li>An encoder model</li>
  <li>A decoder model</li>
  <li>A sequence-to-sequence model</li>
</ol>

### 8. Which of these types of models would you use for summarizing texts?

<ol type="a">
  <li>An encoder model</li>
  <li>A decoder model</li>
  <li>A sequence-to-sequence model</li>
</ol>

### 9. Which of these types of models would you use for classifying text inputs according to certain labels?

<ol type="a">
  <li>An encoder model</li>
  <li>A decoder model</li>
  <li>A sequence-to-sequence model</li>
</ol>

### 10. What possible source can the bias observed in a model have?

<ol type="a">
  <li>The model is a fine-tuned version of a pretrained model and it picked up its bias from it.</li>
  <li>The data the model was trained on is biased.</li>
  <li>The metric the model was optimizing for is biased.</li>
</ol>