# Using 🤗 Transformers

## Introduction

### Overview of the 🤗 Transformers Library

As you saw in [Part 1](transformers_and_hugging_face_part1.ipynb), Transformer models are typically very large, with millions to tens of billions of parameters. Training and deploying these models is a complex process. Additionally, new models are released almost daily, each with its own implementation, making it challenging to experiment with all of them.

The 🤗 Transformers library was created to address these challenges by providing a unified API to load, train, and save any Transformer model. Its main features include:

- **Ease of use**: Downloading, loading, and using a state-of-the-art NLP model for inference can be done in just two lines of code.
- **Flexibility**: At their core, all models are simple PyTorch `nn.Module` or TensorFlow `tf.keras.Model` classes and can be treated like any other model in their respective machine learning frameworks.
- **Simplicity**: The library minimizes abstractions. Each model’s forward pass is fully defined in a single file, following the “All in one file” principle. This makes the code understandable and hackable.

This last feature distinguishes 🤗 Transformers from other ML libraries. Instead of relying on shared modules across files, each model has its own dedicated layers. This approach not only makes the models more approachable and easier to understand but also allows experimentation with one model without impacting others.

---

### Highlights

In this part, we will:

1. Begin with an **end-to-end example** that combines a model and a tokenizer to replicate the `pipeline()` function introduced in Part 1.
2. Explore the **model API**, including:
   - The `model` and `configuration` classes.
   - How to load a model and process numerical inputs to generate predictions.
3. Dive into the **tokenizer API**, covering:
   - Text-to-numerical conversion for neural networks.
   - Numerical-to-text conversion for output interpretation.
4. Learn to process **multiple sentences in a prepared batch** efficiently.
5. Wrap up with a closer look at the high-level `tokenizer()` function.

By the end of this part, you’ll have a deeper understanding of how to leverage the 🤗 Transformers library for various NLP tasks.


### Behind the pipeline

Let’s start with a complete example, taking a look at what happened behind the scenes when we executed the following code in Part 1:

In [None]:
from transformers import pipeline

classifier = pipeline("sentiment-analysis")
classifier(
    [
        "I've been waiting for a HuggingFace course my whole life.",
        "I hate this so much!",
    ]
)

and obtained:

In [1]:
[{'label': 'POSITIVE', 'score': 0.9598047137260437},
 {'label': 'NEGATIVE', 'score': 0.9994558095932007}]

As we saw in Part 1, this pipeline groups together three steps: preprocessing, passing the inputs through the model, and postprocessing:

<img src="img/pipeline.png">

Let’s break down the steps in the pipeline:

### Preprocessing with a tokenizer

Like other neural networks, Transformer models cannot process raw text directly. Therefore, the first step in any pipeline is to convert text inputs into numerical representations that the model can understand. This is done using a **tokenizer**, which handles the following tasks:

1. **Splitting the input** into tokens, which could be words, subwords, or symbols (such as punctuation).
2. **Mapping each token** to a unique integer.
3. **Adding additional inputs** that the model may require (e.g., special tokens or padding).

To ensure compatibility, this preprocessing must match exactly how the model was pretrained. This requires downloading the tokenizer configuration associated with the model. The `AutoTokenizer` class in 🤗 Transformers simplifies this process through its `from_pretrained()` method. 

Using the model’s checkpoint name, `from_pretrained()` fetches the necessary tokenizer information from the Model Hub and caches it locally, ensuring that the data is downloaded only on the first run.

For example, the default checkpoint for the sentiment-analysis pipeline is **`distilbert-base-uncased-finetuned-sst-2-english`** (you can view its [model card](https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english)). To use it, run the following code:




In [None]:
from transformers import AutoTokenizer

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

Once we have the tokenizer, we can pass our sentences directly to it and get back a dictionary that is almost ready to be fed into our model. The only thing left to do is to convert the lists of input IDs into **tensors**, since Transformer models only accept tensors as input.

🤗 Transformers is framework-agnostic, meaning you don’t need to worry about the backend—whether it’s PyTorch, TensorFlow, or Flax. However, since tensors are the expected input for these models, here’s a quick explanation: 

Tensors are similar to NumPy arrays. A tensor can be:
- A **scalar** (0D),
- A **vector** (1D),
- A **matrix** (2D), or 
- Have higher dimensions.

ML frameworks' tensors behave similarly to NumPy arrays and are just as simple to create. 
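
As a rough illustration that needs no ML framework, nested Python lists can stand in for tensors of different ranks. The `shape` helper below is hypothetical (it is not part of any library) and assumes the nesting is rectangular:

```python
def shape(data):
    """Return the shape of a rectangular nested list as a tuple."""
    dims = []
    while isinstance(data, list):
        dims.append(len(data))
        data = data[0] if data else None
    return tuple(dims)

scalar = 3.0                      # 0D: a single number
vector = [1.0, 2.0, 3.0]          # 1D: a list of numbers
matrix = [[1, 2, 3], [4, 5, 6]]   # 2D: a list of rows
cube = [[[0] * 4 for _ in range(3)] for _ in range(2)]  # 3D

for name, value in [("scalar", scalar), ("vector", vector),
                    ("matrix", matrix), ("cube", cube)]:
    print(name, shape(value))
```

Framework tensors expose the same information through their own shape attribute, as you will see when we print model outputs.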

To specify the type of tensors to return (PyTorch, TensorFlow, or plain NumPy), we use the `return_tensors` argument. For example:


In [None]:
raw_inputs = [
    "I've been waiting for a HuggingFace course my whole life.",
    "I hate this so much!",
]
inputs = tokenizer(raw_inputs, padding=True, truncation=True, return_tensors="pt")
print(inputs)

The main things to remember here are that you can pass one sentence or a list of sentences, and that you can specify the type of tensors you want to get back (if no type is passed, you will get a list of lists as a result).

Here’s what the results look like as PyTorch tensors:

``` python
{
    'input_ids': tensor([
        [  101,  1045,  1005,  2310,  2042,  3403,  2005,  1037, 17662, 12172, 2607,  2026,  2878,  2166,  1012,   102],
        [  101,  1045,  5223,  2023,  2061,  2172,   999,   102,     0,     0,     0,     0,     0,     0,     0,     0]
    ]), 
    'attention_mask': tensor([
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
    ])
}
```

The output from the tokenizer is a dictionary containing two keys: `input_ids` and `attention_mask`.

- **`input_ids`**: This contains rows of integers, where each row represents a sentence. The integers are unique identifiers for the tokens in the corresponding sentence.
- **`attention_mask`**: We'll explain the purpose of this key later.
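
One thing can already be read off the printed output: the `attention_mask` holds a 0 exactly where `input_ids` was padded. The sketch below rebuilds the mask in plain Python, assuming the padding ID is 0 as it appears in the output above. It is only an illustration; in practice the tokenizer returns the mask for you.

```python
PAD_ID = 0  # assumption: the padding ID, as seen in the printed tensor

input_ids = [
    [101, 1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172, 2607, 2026, 2878, 2166, 1012, 102],
    [101, 1045, 5223, 2023, 2061, 2172, 999, 102, 0, 0, 0, 0, 0, 0, 0, 0],
]

# 1 marks a real token, 0 marks a padding position
attention_mask = [
    [0 if token_id == PAD_ID else 1 for token_id in row]
    for row in input_ids
]

for row in attention_mask:
    print(row)
```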


## Going through the model

We can download our pretrained model the same way we did with our tokenizer. 🤗 Transformers provides an `AutoModel` class which also has a `from_pretrained()` method:

In [None]:
from transformers import AutoModel

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModel.from_pretrained(checkpoint)

In this code snippet, we downloaded the same checkpoint we used in our pipeline earlier (it should already be cached) and instantiated a model with it.

This architecture includes only the base Transformer module. Given some inputs, it outputs what we call **hidden states** (also referred to as **features**). For each model input, the Transformer model generates a high-dimensional vector that represents a contextual understanding of that input.

Although these hidden states can be valuable on their own, they are typically passed as inputs to another part of the model called the **head**. In Part 1, we saw that different tasks could use the same Transformer architecture, but each task has its own specialized head associated with it.


#### A high-dimensional vector?

The vector output by the Transformer module is typically large and has three dimensions:

- **Batch size**: The number of sequences processed simultaneously (e.g., 2 in our example).
- **Sequence length**: The length of the numerical representation of the sequence (e.g., 16 in our example).
- **Hidden size**: The dimensionality of the vector for each model input. 

The term "high dimensional" refers to the **hidden size**, which can be quite large. For smaller models, this is often 768, while larger models may reach 3072 or more.

We can observe this if we feed the preprocessed inputs to our model:


In [None]:
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)

torch.Size([2, 16, 768])
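
To make the three dimensions concrete, the arithmetic behind this shape can be spelled out. This is just a back-of-the-envelope sketch using the numbers from our example:

```python
batch_size = 2        # two sentences
sequence_length = 16  # tokens per sentence after padding
hidden_size = 768     # size of the vector produced for each token

shape = (batch_size, sequence_length, hidden_size)
total_values = batch_size * sequence_length * hidden_size

print(shape)         # (2, 16, 768)
print(total_values)  # 24576
```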

### Model heads: Making sense out of numbers

The model heads take the high-dimensional vectors of hidden states as input and project them onto a different dimension. They are usually composed of one or a few linear layers:

<img src="img/head.svg">

The output of the Transformer model is sent directly to the **model head** for further processing.
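
To give a feel for what "projecting onto a different dimension" means, here is a toy linear head in plain Python. The weights, the bias, and the hidden size of 4 are made up for illustration; a real head works on much larger vectors with learned parameters.

```python
def linear_head(hidden, weights, bias):
    """Project one hidden vector onto len(weights) output scores."""
    return [
        sum(w * h for w, h in zip(row, hidden)) + b
        for row, b in zip(weights, bias)
    ]

hidden_state = [0.5, -1.0, 2.0, 0.0]   # toy hidden vector (size 4)
weights = [
    [1.0, 0.0, -1.0, 2.0],   # made-up weights for label 0
    [0.0, 1.0, 1.0, -1.0],   # made-up weights for label 1
]
bias = [0.1, -0.1]

logits = linear_head(hidden_state, weights, bias)
print(logits)  # two scores, one per label
```

The input has four values and the output has two: that reduction is the whole job of the head in this sketch.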

#### Model Structure
In the diagram, the model consists of:
- **Embeddings layer**: Converts each input ID from the tokenized input into a vector representing the associated token.
- **Subsequent layers**: Process these vectors using the attention mechanism to generate the final representation of the sentences.

#### Model Architectures
There are various architectures available in 🤗 Transformers, each tailored for specific tasks. Some examples include:

- `AutoModel` (retrieve the hidden states)
- `AutoModelForCausalLM`
- `AutoModelForMaskedLM`
- `AutoModelForMultipleChoice`
- `AutoModelForQuestionAnswering`
- `AutoModelForSequenceClassification`
- `AutoModelForTokenClassification`
- And more 🤗

#### Selecting the Right Model
For our example, we require a model with a **sequence classification head** to classify sentences as positive or negative. Therefore, instead of using the `AutoModel` class, we’ll use `AutoModelForSequenceClassification`:


In [None]:
from transformers import AutoModelForSequenceClassification

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
outputs = model(**inputs)

Now if we look at the shape of our outputs, the dimensionality will be much lower: the model head takes as input the high-dimensional vectors we saw before, and outputs vectors containing two values (one per label):

In [None]:
print(outputs.logits.shape)

torch.Size([2, 2])

Since we have just two sentences and two labels, the result we get from our model is of shape 2 x 2.

### Postprocessing the output

The values we get as output from our model don’t necessarily make sense by themselves. Let’s take a look:

In [None]:
print(outputs.logits)

``` python
tensor([[-1.5607,  1.6123],
        [ 4.1692, -3.3464]], grad_fn=<AddmmBackward>)
```

Our model predicted the following logits:

- For the first sentence: `[-1.5607, 1.6123]`
- For the second sentence: `[4.1692, -3.3464]`

These are raw, unnormalized scores output by the last layer of the model. To convert logits into probabilities, they must pass through a **SoftMax** layer.

All 🤗 Transformers models output logits because the loss function used during training often combines the final activation function (e.g., SoftMax) with the actual loss function (e.g., cross entropy). This approach allows for better numerical stability and flexibility during training.


In [None]:
import torch

predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
print(predictions)

``` python
tensor([[4.0195e-02, 9.5980e-01],
        [9.9946e-01, 5.4418e-04]], grad_fn=<SoftmaxBackward>)
```

Now we can see that the model predicted:

- `[0.0402, 0.9598]` for the first sentence
- `[0.9995, 0.0005]` for the second sentence

These are recognizable probability scores.
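
If you are curious what SoftMax actually computes, here is a minimal sketch in plain Python, applied to the logits printed earlier. The `softmax` function is our own illustrative helper, not the PyTorch implementation.

```python
import math

def softmax(logits):
    """Turn a list of raw scores into probabilities that sum to 1."""
    largest = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - largest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

logits = [[-1.5607, 1.6123], [4.1692, -3.3464]]  # values printed above
for row in logits:
    print([round(p, 4) for p in softmax(row)])
```

Each row is exponentiated and divided by its own sum, which is why every row of the result adds up to 1.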

To identify the labels corresponding to each position, we can inspect the `id2label` attribute of the model's configuration (we’ll discuss this in more detail in the next section).


In [None]:
print(model.config.id2label)

``` python
{0: 'NEGATIVE', 1: 'POSITIVE'}
```

Now we can conclude that the model predicted the following:

- **First sentence**: NEGATIVE: 0.0402, POSITIVE: 0.9598  
- **Second sentence**: NEGATIVE: 0.9995, POSITIVE: 0.0005  
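
Putting the last two steps together, here is a rough sketch of how the probabilities and the `id2label` mapping turn into the pipeline-style output we saw at the start. The probabilities are the rounded values from above, so the scores differ slightly from the full-precision ones the pipeline returns.

```python
id2label = {0: "NEGATIVE", 1: "POSITIVE"}   # as printed by model.config.id2label
probabilities = [
    [0.0402, 0.9598],   # first sentence (rounded values from above)
    [0.9995, 0.0005],   # second sentence
]

results = []
for row in probabilities:
    best = max(range(len(row)), key=lambda i: row[i])  # index of the top score
    results.append({"label": id2label[best], "score": row[best]})

print(results)
```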

We have successfully reproduced the three steps of the pipeline: 

1. **Preprocessing** with tokenizers  
2. **Passing the inputs through the model**  
3. **Postprocessing**  



## Models

In this section we’ll take a closer look at creating and using a model. We’ll use the `AutoModel` class, which is handy when you want to instantiate any model from a checkpoint.

The `AutoModel` class and all of its relatives are actually simple wrappers over the wide variety of models available in the library. They are clever wrappers: they can automatically guess the appropriate model architecture for your checkpoint and then instantiate a model with this architecture.

However, if you know the type of model you want to use, you can use the class that defines its architecture directly. Let’s take a look at how this works with a BERT model.


### Creating a Transformer model

The first thing we’ll need to do to initialize a BERT model is load a configuration object:

In [None]:
from transformers import BertConfig, BertModel

# Building the config
config = BertConfig()

# Building the model from the config
model = BertModel(config)

The configuration contains many attributes that are used to build the model:

In [None]:
print(config)

``` python
BertConfig {
  [...]
  "hidden_size": 768,
  "intermediate_size": 3072,
  "max_position_embeddings": 512,
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  [...]
}
```

While you haven’t seen what all of these attributes do yet, you should recognize some of them: the `hidden_size` attribute defines the size of the `hidden_states` vector, and `num_hidden_layers` defines the number of layers the Transformer model has.
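
A couple of these numbers are related to each other. The sketch below copies the values from the printed configuration into a plain dictionary (so it runs without the library) and derives two quantities from them. The interpretation in the comments reflects how BERT is usually described; treat it as an illustration rather than a reference.

```python
config = {
    "hidden_size": 768,
    "intermediate_size": 3072,
    "num_attention_heads": 12,
    "num_hidden_layers": 12,
}

# Each attention head works on an equal slice of the hidden vector
head_size = config["hidden_size"] // config["num_attention_heads"]

# The feed-forward block expands the hidden vector by this factor
expansion = config["intermediate_size"] // config["hidden_size"]

print(head_size)  # 64
print(expansion)  # 4
```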


#### Different loading methods

Creating a model from the default configuration initializes it with random values:

In [None]:
from transformers import BertConfig, BertModel

config = BertConfig()
model = BertModel(config)

# Model is randomly initialized!

The model can be used in this state, but it will output gibberish; it needs to be trained first. We could train the model from scratch on the task at hand, but as you saw in Part 1, this would require a long time and a lot of data, and it would have a non-negligible environmental impact. To avoid unnecessary and duplicated effort, it’s imperative to be able to share and reuse models that have already been trained.

Loading a Transformer model that is already trained is simple — we can do this using the `from_pretrained()` method:

In [None]:
from transformers import BertModel

model = BertModel.from_pretrained("bert-base-cased")

As you saw earlier, we could replace `BertModel` with the equivalent `AutoModel` class. We’ll do this from now on as this produces checkpoint-agnostic code; if your code works for one checkpoint, it should work seamlessly with another. This applies even if the architecture is different, as long as the checkpoint was trained for a similar task (for example, a sentiment analysis task).

In the code sample above we didn’t use `BertConfig`, and instead loaded a pretrained model via the `bert-base-cased` identifier. This is a model checkpoint that was trained by the authors of BERT themselves; you can find more details about it in its model card.

This model is now initialized with all the weights of the checkpoint. It can be used directly for inference on the tasks it was trained on, and it can also be fine-tuned on a new task. By training with pretrained weights rather than from scratch, we can quickly achieve good results.

The weights have been downloaded and cached in the cache folder, so future calls to the `from_pretrained()` method won’t re-download them. In recent versions of the library the cache defaults to `~/.cache/huggingface/hub` (older releases used `~/.cache/huggingface/transformers`). You can customize your cache folder by setting the `HF_HOME` environment variable.

The identifier used to load the model can be the identifier of any model on the Model Hub, as long as it is compatible with the BERT architecture. The entire list of available BERT checkpoints can be found [here](https://huggingface.co/models?other=bert).


#### Saving methods

Saving a model is as easy as loading one — we use the `save_pretrained()` method, which is analogous to the `from_pretrained()` method:

In [None]:
model.save_pretrained("directory_on_my_computer")

This saves two files to your disk:

- `pytorch_model.bin`: The weights of the model. Recent versions of the library write a `model.safetensors` file instead by default.
- `config.json`: The configuration of the model.

If you take a look at the `config.json` file, you’ll recognize the attributes necessary to build the model architecture. This file also contains some metadata, such as where the checkpoint originated and what 🤗 Transformers version you were using when you last saved the checkpoint.

The weights file (`pytorch_model.bin`) holds what is known as the state dictionary; it contains all your model’s weights. The two files go hand in hand: the configuration is necessary to know your model’s architecture, while the model weights are your model’s parameters.


### Using a Transformer model for inference

Now that you know how to load and save a model, let’s try using it to make some predictions. Transformer models can only process numbers — numbers that the tokenizer generates. But before we discuss tokenizers, let’s explore what inputs the model accepts.

Tokenizers can take care of casting the inputs to the appropriate framework’s tensors, but to help you understand what’s going on, we’ll take a quick look at what must be done before sending the inputs to the model.

Let’s say we have a couple of sequences:

In [None]:
sequences = ["Hello!", "Cool.", "Nice!"]

The tokenizer converts these to vocabulary indices which are typically called input IDs. Each sequence is now a list of numbers! The resulting output is:

In [4]:
encoded_sequences = [
    [101, 7592, 999, 102],
    [101, 4658, 1012, 102],
    [101, 3835, 999, 102],
]

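This is a list of encoded sequences: a list of lists. Tensors only accept rectangular shapes (think matrices). This list is already rectangular (every sequence has four IDs), so converting it to a tensor is easy: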

In [None]:
import torch

model_inputs = torch.tensor(encoded_sequences)

#### Using the tensors as inputs to the model

Making use of the tensors with the model is extremely simple — we just call the model with the inputs:

In [None]:
output = model(model_inputs)

While the model accepts a lot of different arguments, only the input IDs are necessary. We’ll explain what the other arguments do and when they are required later, but first we need to take a closer look at the tokenizers that build the inputs that a Transformer model can understand.


## Tokenizers

Tokenizers are one of the core components of the NLP pipeline. They serve one purpose: to translate text into data that can be processed by the model. Models can only process numbers, so tokenizers need to convert our text inputs to numerical data. In this section, we’ll explore exactly what happens in the tokenization pipeline.

In NLP tasks, the data that is generally processed is raw text. Here’s an example of such text:

``` text
Jim Henson was a puppeteer
```

However, models can only process numbers, so we need to find a way to convert the raw text to numbers. That’s what the tokenizers do, and there are a lot of ways to go about this. The goal is to find the most meaningful representation — that is, the one that makes the most sense to the model — and, if possible, the smallest representation.

Let’s take a look at some examples of tokenization algorithms, and try to answer some of the questions you may have about tokenization.



### Word-based

The first type of tokenizer that comes to mind is word-based. It’s generally very easy to set up and use with only a few rules, and it often yields decent results. For example, in the image below, the goal is to split the raw text into words and find a numerical representation for each of them:

<img src="img/word_based_tokenization.svg">

There are different ways to split the text. For example, we could use whitespace to tokenize the text into words by applying Python’s `split()` function:

In [None]:
tokenized_text = "Jim Henson was a puppeteer".split()
print(tokenized_text)

``` python
['Jim', 'Henson', 'was', 'a', 'puppeteer']
```

There are also variations of word tokenizers that have extra rules for punctuation. With this kind of tokenizer, we can end up with some pretty large “vocabularies,” where a vocabulary is defined by the total number of independent tokens that we have in our corpus.

Each word gets assigned an ID, starting from 0 and going up to the size of the vocabulary. The model uses these IDs to identify each word.

If we want to completely cover a language with a word-based tokenizer, we’ll need to have an identifier for each word in the language, which will generate a huge number of tokens. For example, there are over 500,000 words in the English language, so to build a map from each word to an input ID we’d need to keep track of that many IDs. Furthermore, words like “dog” are represented differently from words like “dogs”, and the model will initially have no way of knowing that “dog” and “dogs” are similar: it will identify the two words as unrelated. The same applies to other similar words, like “run” and “running”, which the model will not see as being similar initially.

Finally, we need a custom token to represent words that are not in our vocabulary. This is known as the “unknown” token, often represented as “[UNK]” or “<unk>”. It’s generally a bad sign if you see that the tokenizer is producing a lot of these tokens, as it wasn’t able to retrieve a sensible representation of a word and you’re losing information along the way. The goal when crafting the vocabulary is to do it in such a way that the tokenizer tokenizes as few words as possible into the unknown token.
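
Here is a minimal sketch of a word-based vocabulary with an unknown token, in plain Python. The helpers `build_vocab` and `encode` are hypothetical names written for this example, and the "corpus" is just our single sentence.

```python
def build_vocab(corpus, unk_token="[UNK]"):
    """Assign an ID to every distinct whitespace-separated word."""
    vocab = {unk_token: 0}
    for sentence in corpus:
        for word in sentence.split():
            if word not in vocab:
                vocab[word] = len(vocab)
    return vocab

def encode(text, vocab, unk_token="[UNK]"):
    """Map each word to its ID, falling back to the unknown token."""
    return [vocab.get(word, vocab[unk_token]) for word in text.split()]

corpus = ["Jim Henson was a puppeteer"]
vocab = build_vocab(corpus)

print(vocab)
print(encode("Jim was a dog", vocab))  # "dog" is not in the vocabulary
```

The word "dog" never appeared in the corpus, so it is mapped to the ID of the unknown token and its meaning is lost.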

One way to reduce the number of unknown tokens is to go one level deeper, using a character-based tokenizer.


### Character-based

Character-based tokenizers split the text into characters, rather than words. This has two primary benefits:

1. **The vocabulary is much smaller.**
2. **There are far fewer out-of-vocabulary (unknown) tokens, since every word can be built from characters.**

But here too some questions arise concerning spaces and punctuation:

<img src="img/character_based_tokenization.svg">

This approach isn’t perfect either. Since the representation is now based on characters rather than words, one could argue that, intuitively, it’s less meaningful: a single character doesn’t mean much on its own, whereas a word does. However, this again differs according to the language; in Chinese, for example, each character carries more information than a character in a Latin-script language.

Another thing to consider is that we’ll end up with a very large number of tokens to be processed by our model: whereas a word would only be a single token with a word-based tokenizer, it can easily turn into 10 or more tokens when converted into characters.
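
The trade-off is easy to see on our example sentence. This quick sketch compares the two approaches in plain Python, counting spaces as characters (one of several possible choices):

```python
text = "Jim Henson was a puppeteer"

word_tokens = text.split()   # word-based: split on whitespace
char_tokens = list(text)     # character-based: every character, spaces included

print(len(word_tokens), "word tokens")
print(len(char_tokens), "character tokens")
print(len(set(char_tokens)), "distinct characters")
```

The sequence gets several times longer, while the set of distinct symbols stays small.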

To get the best of both worlds, we can use a third technique that combines the two approaches: **subword tokenization**.



### Subword tokenization

Subword tokenization algorithms rely on the principle that frequently used words should not be split into smaller subwords, but rare words should be decomposed into meaningful subwords.

For instance, “annoyingly” might be considered a rare word and could be decomposed into “annoying” and “ly”. These are both likely to appear more frequently as standalone subwords, while at the same time the meaning of “annoyingly” is kept by the composite meaning of “annoying” and “ly”.

Here is an example showing how a subword tokenization algorithm would tokenize the sequence “Let’s do tokenization!”:


<img src="img/subword.svg">

These subwords end up providing a lot of semantic meaning: for instance, in the example above “tokenization” was split into “token” and “ization”, two tokens that have a semantic meaning while being space-efficient (only two tokens are needed to represent a long word). This allows us to have relatively good coverage with small vocabularies, and close to no unknown tokens.

This approach is especially useful in agglutinative languages such as Turkish, where you can form (almost) arbitrarily long complex words by stringing together subwords.
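
To make the idea tangible, here is a simplified sketch of one way to split a word into subwords: greedy longest-match-first lookup in a fixed vocabulary. The vocabulary is hypothetical and chosen only to reproduce the examples in the text, and the `##` marker for continuation pieces is borrowed from BERT-style tokenizers. Real tokenizers learn their vocabulary from data and handle many more cases.

```python
def subword_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of a single word into subwords."""
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # mark a continuation piece
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk_token]  # no piece fits: the whole word is unknown
        tokens.append(piece)
        start = end
    return tokens

# Hypothetical vocabulary, chosen only to reproduce the examples in the text
vocab = {"token", "##ization", "annoying", "##ly", "do", "let"}

print(subword_tokenize("tokenization", vocab))
print(subword_tokenize("annoyingly", vocab))
print(subword_tokenize("puppeteer", vocab))
```

With this toy vocabulary the first two words split into two pieces each, while the last one cannot be covered at all and falls back to the unknown token.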

#### And more!
Unsurprisingly, there are many more techniques out there. To name a few:
- Byte-level BPE, as used in GPT-2
- WordPiece, as used in BERT
- SentencePiece or Unigram, as used in several multilingual models

You should now have sufficient knowledge of how tokenizers work to get started with the API.


#### Loading and saving

Loading and saving tokenizers is as simple as it is with models. Actually, it’s based on the same two methods: `from_pretrained()` and `save_pretrained()`. These methods will load or save the algorithm used by the tokenizer (a bit like the architecture of the model) as well as its vocabulary (a bit like the weights of the model).

Loading the BERT tokenizer trained with the same checkpoint as BERT is done the same way as loading the model, except we use the `BertTokenizer` class:


In [None]:
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-cased")

Similar to `AutoModel`, the `AutoTokenizer` class will grab the proper tokenizer class in the library based on the checkpoint name, and can be used directly with any checkpoint:

In [None]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

We can now use the tokenizer as shown in the previous section:

In [None]:
tokenizer("Using a Transformer network is simple")

``` python
{'input_ids': [101, 7993, 170, 11303, 1200, 2443, 1110, 3014, 102],
 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0],
 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1]}
```

Saving a tokenizer is identical to saving a model:

In [None]:
tokenizer.save_pretrained("directory_on_my_computer")

`token_type_ids` and `attention_mask` are explained in the next section.

### Encoding

Translating text to numbers is known as **encoding**. Encoding is done in a two-step process: the tokenization, followed by the conversion to input IDs.

As we’ve seen, the first step is to split the text into words (or parts of words, punctuation symbols, etc.), usually called tokens. There are multiple rules that can govern that process, which is why we need to instantiate the tokenizer using the name of the model, to make sure we use the same rules that were used when the model was pretrained.

The second step is to convert those tokens into numbers, so we can build a tensor out of them and feed them to the model. To do this, the tokenizer has a **vocabulary**, which is the part we download when we instantiate it with the `from_pretrained()` method. Again, we need to use the same vocabulary used when the model was pretrained.

To get a better understanding of the two steps, we’ll explore them separately. Note that we will use some methods that perform parts of the tokenization pipeline separately to show you the intermediate results of those steps, but in practice, you should call the tokenizer directly on your inputs.


#### Tokenization

The tokenization process is done by the `tokenize()` method of the tokenizer:

In [None]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

sequence = "Using a Transformer network is simple"
tokens = tokenizer.tokenize(sequence)

print(tokens)

The output of this method is a list of strings, or tokens:

``` python
['Using', 'a', 'transform', '##er', 'network', 'is', 'simple']
```

This tokenizer is a subword tokenizer: it splits the words until it obtains tokens that can be represented by its vocabulary. That’s the case here with `transformer`, which is split into two tokens: `transform` and `##er`. The `##` prefix indicates that the token is a continuation of the previous one.

#### From tokens to input IDs

The conversion to input IDs is handled by the `convert_tokens_to_ids()` tokenizer method:

In [None]:
ids = tokenizer.convert_tokens_to_ids(tokens)

print(ids)

``` python
[7993, 170, 11303, 1200, 2443, 1110, 3014]
```

These outputs, once converted to the appropriate framework tensor, can then be used as inputs to a model as seen earlier.
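
Conceptually, this conversion is a dictionary lookup. The sketch below uses a tiny excerpt of the vocabulary built from the IDs printed in this section. The names `[CLS]` and `[SEP]` for IDs 101 and 102 are an assumption based on BERT's usual special tokens; they explain why calling the tokenizer directly returned two more IDs than `convert_tokens_to_ids()`.

```python
# Tiny excerpt of the vocabulary, using the IDs printed in this section
vocab = {
    "[CLS]": 101, "[SEP]": 102,   # assumption: BERT's usual special tokens
    "Using": 7993, "a": 170, "transform": 11303, "##er": 1200,
    "network": 2443, "is": 1110, "simple": 3014,
}

tokens = ["Using", "a", "transform", "##er", "network", "is", "simple"]

ids = [vocab[token] for token in tokens]
print(ids)

# Calling the tokenizer directly also wraps the sequence in special tokens
ids_with_special = [vocab["[CLS]"]] + ids + [vocab["[SEP]"]]
print(ids_with_special)
```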

#### Decoding

Decoding is going the other way around: from vocabulary indices, we want to get a string. This can be done with the `decode()` method as follows:

In [None]:
decoded_string = tokenizer.decode([7993, 170, 11303, 1200, 2443, 1110, 3014])
print(decoded_string)

``` python
'Using a Transformer network is simple'
```

Note that the `decode` method not only converts the indices back to tokens, but also groups together the tokens that were part of the same words to produce a readable sentence. This behavior will be extremely useful when we use models that predict new text (either text generated from a prompt, or for sequence-to-sequence problems like translation or summarization).
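
The grouping behavior can be approximated in a few lines of plain Python. This is not how `decode()` is implemented; it is a simplified sketch that assumes BERT-style `##` continuation markers and ignores punctuation and special tokens.

```python
def merge_wordpieces(tokens):
    """Join tokens into a string, gluing '##' pieces to the previous token."""
    words = []
    for token in tokens:
        if token.startswith("##") and words:
            words[-1] += token[2:]   # continuation: append without a space
        else:
            words.append(token)
    return " ".join(words)

tokens = ["Using", "a", "transform", "##er", "network", "is", "simple"]
print(merge_wordpieces(tokens))
```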

By now you should understand the atomic operations a tokenizer can handle: tokenization, conversion to IDs, and converting IDs back to a string.



### Handling multiple sequences

In the previous section, we explored the simplest of use cases: doing inference on a single, short sequence. However, some questions emerge already:

- **How do we handle multiple sequences?**  
- **How do we handle multiple sequences of different lengths?**  
- **Are vocabulary indices the only inputs that allow a model to work well?**  
- **Is there such a thing as too long of a sequence?**

Let’s see what kinds of problems these questions pose, and how we can solve them using the 🤗 Transformers API.


### Models expect a batch of inputs

In the previous exercise we saw how sequences get translated into lists of numbers. Let’s convert this list of numbers to a tensor and send it to the model:


In [None]:
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

sequence = "I've been waiting for a HuggingFace course my whole life."

tokens = tokenizer.tokenize(sequence)
ids = tokenizer.convert_tokens_to_ids(tokens)
input_ids = torch.tensor(ids)
# This line will fail.
model(input_ids)

``` python
IndexError: Dimension out of range (expected to be in range of [-1, 0], but got 1)
```

Oh no! Why did this fail? 

“We followed the steps from the pipeline in the previous section,” you might say. “We tokenized the sequence, converted it to IDs, and then converted it to a tensor. What went wrong?”

The problem is that we sent a single sequence to the model, whereas 🤗 Transformers models expect multiple sentences by default. Here we tried to do everything the tokenizer did behind the scenes when we applied it to a sequence. But if you look closely, you’ll see that the tokenizer didn’t just convert the list of input IDs into a tensor — it added a dimension on top of it:


In [None]:
tokenized_inputs = tokenizer(sequence, return_tensors="pt")
print(tokenized_inputs["input_ids"])

``` python
tensor([[  101,  1045,  1005,  2310,  2042,  3403,  2005,  1037, 17662, 12172,
          2607,  2026,  2878,  2166,  1012,   102]])
```

Let’s try again and add a new dimension:

In [None]:
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

sequence = "I've been waiting for a HuggingFace course my whole life."

tokens = tokenizer.tokenize(sequence)
ids = tokenizer.convert_tokens_to_ids(tokens)

input_ids = torch.tensor([ids])
print("Input IDs:", input_ids)

output = model(input_ids)
print("Logits:", output.logits)

We print the input IDs as well as the resulting logits — here’s the output:

``` python
Input IDs: [[ 1045,  1005,  2310,  2042,  3403,  2005,  1037, 17662, 12172,  2607, 2026,  2878,  2166,  1012]]
Logits: [[-2.7276,  2.8789]]
```

*Batching* is the act of sending multiple sentences through the model all at once. If you only have one sentence, you can build a batch containing a single sequence, as we did above.

Feeding the model several sentences is just as simple: add more sequences to the same outer list.

There’s a second issue, though. When you’re trying to batch together two (or more) sentences, they might be of different lengths. If you’ve ever worked with tensors before, you know that they need to be of a rectangular shape, so you won’t be able to convert the list of input IDs into a tensor directly. 

To work around this problem, we usually **pad the inputs**.


### Padding the Inputs

The following list of lists cannot be converted to a tensor:

``` python
batched_ids = [
    [200, 200, 200],
    [200, 200]
]
```

To work around this, we’ll use **padding** to ensure our tensors have a rectangular shape. Padding ensures that all sentences in a batch have the same length by adding a special word called the **padding token** to the shorter sentences. 

For example, if you have 10 sentences with 10 words each and 1 sentence with 20 words, padding will make all sentences 20 words long by appending padding tokens to the shorter ones. 

In our example, the resulting tensor looks like this:


In [None]:
padding_id = 100

batched_ids = [
    [200, 200, 200],
    [200, 200, padding_id],
]
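
Padding by hand quickly becomes tedious. A minimal sketch of a helper that does it for any ragged batch is shown below; `pad_batch` is a hypothetical function written for this example, and it assumes padding is added on the right, as in the cell above:

```python
def pad_batch(batch, pad_id):
    """Right-pad every list of IDs in `batch` to the length of the longest one."""
    max_len = max(len(seq) for seq in batch)
    return [seq + [pad_id] * (max_len - len(seq)) for seq in batch]


padding_id = 100  # same illustrative value as in the cell above
ragged = [
    [200, 200, 200],
    [200, 200],
]

padded = pad_batch(ragged, padding_id)
print(padded)
```

Every row of `padded` now has the same length, so the nested list can be converted to a tensor.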

The padding token ID can be found in `tokenizer.pad_token_id`. Let’s use it to pad our sentences and then send them through the model, both individually and as a batch:


In [None]:
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

sequence1_ids = [[200, 200, 200]]
sequence2_ids = [[200, 200]]
batched_ids = [
    [200, 200, 200],
    [200, 200, tokenizer.pad_token_id],
]

print(model(torch.tensor(sequence1_ids)).logits)
print(model(torch.tensor(sequence2_ids)).logits)
print(model(torch.tensor(batched_ids)).logits)

``` python
tensor([[ 1.5694, -1.3895]], grad_fn=<AddmmBackward>)
tensor([[ 0.5803, -0.4125]], grad_fn=<AddmmBackward>)
tensor([[ 1.5694, -1.3895],
        [ 1.3373, -1.2163]], grad_fn=<AddmmBackward>)
```

There’s something wrong with the logits in our batched predictions: the second row should match the logits for the second sentence, but we’ve got completely different values!

This happens because the key feature of Transformer models is their attention layers, which contextualize each token by attending to all tokens in the sequence. Without additional instructions, these layers will also attend to the padding tokens, causing inconsistencies in the output.

To ensure the same results when processing individual sentences of different lengths or a padded batch, we need to inform the model to ignore the padding tokens. This is achieved by using an **attention mask**.


### Attention Masks

**Attention masks** are tensors with the same shape as the input IDs tensor, consisting of 0s and 1s:  
- **1s** indicate tokens that should be attended to.  
- **0s** indicate tokens that should be ignored by the model’s attention layers.  

This ensures the model only considers meaningful tokens and ignores padding tokens during computations.
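
One common way for an attention implementation to honor such a mask is to push the scores of masked positions to negative infinity (or a very large negative number) before the softmax, so that they receive zero weight. The sketch below is a simplified, pure-Python illustration of that idea with made-up scores; it is not the library's actual code, and `masked_softmax` is a hypothetical helper:

```python
import math


def masked_softmax(scores, mask):
    """Softmax over `scores`, giving zero weight to positions where `mask` is 0."""
    # Masked positions get a score of -inf, so exp() maps them to exactly 0.
    masked_scores = [s if m == 1 else float("-inf") for s, m in zip(scores, mask)]
    peak = max(masked_scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in masked_scores]
    total = sum(exps)
    return [e / total for e in exps]


# Toy attention scores from one token to three positions; the last position is padding.
scores = [2.0, 1.0, 3.0]

without_mask = masked_softmax(scores, [1, 1, 1])
with_mask = masked_softmax(scores, [1, 1, 0])
two_tokens_only = masked_softmax(scores[:2], [1, 1])

print(without_mask)
print(with_mask)
print(two_tokens_only)
```

With the mask applied, the weights over the two real tokens match what the unpadded two-token sequence would produce, which is why masking restores consistent results.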

Let’s revisit the previous example and include an attention mask:


In [None]:
batched_ids = [
    [200, 200, 200],
    [200, 200, tokenizer.pad_token_id],
]

attention_mask = [
    [1, 1, 1],
    [1, 1, 0],
]

outputs = model(torch.tensor(batched_ids), attention_mask=torch.tensor(attention_mask))
print(outputs.logits)

``` python
tensor([[ 1.5694, -1.3895],
        [ 0.5803, -0.4125]], grad_fn=<AddmmBackward>)
```

Now we get the same logits for the second sentence in the batch.

Notice that the last value of the second sequence is a padding ID, which corresponds to a 0 in the attention mask.
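
Writing the mask by hand does not scale to real batches. As a sketch, the mask can be derived from the original (unpadded) sequence lengths; `attention_mask_from_lengths` is a hypothetical helper and assumes right padding:

```python
def attention_mask_from_lengths(lengths):
    """Build a 0/1 mask for right-padded sequences from their original lengths."""
    max_len = max(lengths)
    return [[1] * n + [0] * (max_len - n) for n in lengths]


# Original lengths of the two sequences in the example: 3 tokens and 2 tokens
attention_mask = attention_mask_from_lengths([3, 2])
print(attention_mask)
```

Working from lengths rather than comparing against the padding ID avoids accidentally masking a real token that happens to share that ID.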

### Longer sequences

With Transformer models, there is a **limit to the sequence lengths** they can process. Most models support sequences of up to **512 or 1024 tokens**, and processing longer sequences will result in errors. To address this, you have two options:

1. **Use a model designed for longer sequences:**  
   Some models, like **Longformer** or **LED**, specialize in handling very long sequences. If your task involves extensive sequences, consider exploring these models.

2. **Truncate your sequences:**  
   Specify a `max_sequence_length` to ensure that sequences exceeding the limit are truncated.

Here’s how you can apply truncation:


In [None]:
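max_sequence_length = 512  # example limit; use the maximum your model supports
ids = ids[:max_sequence_length]  # truncate the list of token IDs, not the raw text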

## Putting it all together

In the previous sections, we manually handled various steps of the tokenization process, including tokenization, converting to input IDs, padding, truncation, and adding attention masks.

However, as we already saw, the 🤗 Transformers API provides a high-level function that automates these tasks for us. By calling the tokenizer directly on the sentence, we can get inputs that are already prepared and ready to be passed through the model.

Here's how it works:


In [None]:
from transformers import AutoTokenizer

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

sequence = "I've been waiting for a HuggingFace course my whole life."

model_inputs = tokenizer(sequence)

Here, the `model_inputs` variable contains everything that’s necessary for a model to operate well. For DistilBERT, this includes the input IDs as well as the attention mask. Other models that require additional inputs will also output them through the tokenizer object.

This method is very powerful. For example, it can tokenize a single sequence:


In [None]:
sequence = "I've been waiting for a HuggingFace course my whole life."

model_inputs = tokenizer(sequence)

It also handles multiple sequences at a time, with no change in the API:

In [None]:
sequences = ["I've been waiting for a HuggingFace course my whole life.", "So have I!"]

model_inputs = tokenizer(sequences)

It can pad according to several objectives:

In [None]:
# Will pad the sequences up to the length of the longest sequence in the batch
model_inputs = tokenizer(sequences, padding="longest")

# Will pad the sequences up to the model max length
# (512 for BERT or DistilBERT)
model_inputs = tokenizer(sequences, padding="max_length")

# Will pad the sequences up to the specified max length
model_inputs = tokenizer(sequences, padding="max_length", max_length=8)

It can also truncate sequences:

In [None]:
sequences = ["I've been waiting for a HuggingFace course my whole life.", "So have I!"]

# Will truncate the sequences that are longer than the model max length
# (512 for BERT or DistilBERT)
model_inputs = tokenizer(sequences, truncation=True)

# Will truncate the sequences that are longer than the specified max length
model_inputs = tokenizer(sequences, max_length=8, truncation=True)

The tokenizer object can also convert its output to framework-specific tensors, which can then be sent directly to the model. In the following code sample, we ask the tokenizer to return tensors for different frameworks: `"pt"` returns PyTorch tensors, `"tf"` returns TensorFlow tensors, and `"np"` returns NumPy arrays:


In [None]:
sequences = ["I've been waiting for a HuggingFace course my whole life.", "So have I!"]

# Returns PyTorch tensors
model_inputs = tokenizer(sequences, padding=True, return_tensors="pt")

# Returns TensorFlow tensors
model_inputs = tokenizer(sequences, padding=True, return_tensors="tf")

# Returns NumPy arrays
model_inputs = tokenizer(sequences, padding=True, return_tensors="np")

### Special tokens

If we take a look at the input IDs returned by the tokenizer, we will see they are a tiny bit different from what we had earlier:

In [None]:
sequence = "I've been waiting for a HuggingFace course my whole life."

model_inputs = tokenizer(sequence)
print(model_inputs["input_ids"])

tokens = tokenizer.tokenize(sequence)
ids = tokenizer.convert_tokens_to_ids(tokens)
print(ids)

``` python
[101, 1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172, 2607, 2026, 2878, 2166, 1012, 102]
[1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172, 2607, 2026, 2878, 2166, 1012]
```

One token ID was added at the beginning, and one at the end. Let’s decode the two sequences of IDs above to see what this is about:

In [None]:
print(tokenizer.decode(model_inputs["input_ids"]))
print(tokenizer.decode(ids))

``` python
"[CLS] i've been waiting for a huggingface course my whole life. [SEP]"
"i've been waiting for a huggingface course my whole life."
```

The tokenizer ensures that the correct special tokens are added according to the specific model’s requirements. For example:

- **[CLS]** (classification token) is added at the beginning of the input for models like BERT; its final representation is commonly used for tasks such as classification.
- **[SEP]** (separator token) is added at the end to indicate the boundary between sentences or segments in tasks like question answering or sentence pair classification.

Different models might require different special tokens, or even none at all, depending on the task and model architecture. The tokenizer handles this seamlessly, allowing you to focus on the data and task rather than worrying about specific model requirements.

This automatic handling of special tokens helps ensure consistency and compatibility between your input data and the model during both training and inference.
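
Judging from the two printed lists and their decoded forms, ID 101 corresponds to `[CLS]` and ID 102 to `[SEP]` for this checkpoint. A minimal sketch of what the tokenizer adds on top of the manual steps follows; `wrap_with_special_tokens` is a hypothetical helper, not a library call, and the IDs are specific to this vocabulary:

```python
CLS_ID = 101  # ID that decodes to [CLS] in the output shown above
SEP_ID = 102  # ID that decodes to [SEP] in the output shown above


def wrap_with_special_tokens(ids, cls_id=CLS_ID, sep_id=SEP_ID):
    """Surround a list of token IDs with the start and end special tokens."""
    return [cls_id] + ids + [sep_id]


# IDs produced by tokenize() + convert_tokens_to_ids() for the example sentence
ids = [1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172, 2607, 2026, 2878, 2166, 1012]

with_special = wrap_with_special_tokens(ids)
print(with_special)
```

In practice, calling the tokenizer directly is preferable, since it knows which special tokens (if any) the checkpoint expects.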


## Wrapping up: From tokenization to model

Now that we’ve seen all the individual steps the tokenizer object performs when applied to text, let’s see one final time how it can handle multiple sequences (padding!), very long sequences (truncation!), and multiple types of tensors with its main API:

In [None]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
sequences = ["I've been waiting for a HuggingFace course my whole life.", "So have I!"]

tokens = tokenizer(sequences, padding=True, truncation=True, return_tensors="pt")
output = model(**tokens)
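
The model returns raw logits, not probabilities. As a small, library-free sketch, a softmax can turn a row of logits into values that sum to 1. The numbers below are the logits printed earlier in this section for the single-sentence example, and `softmax` here is a hand-written helper rather than a library function:

```python
import math


def softmax(logits):
    """Convert a list of logits into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


logits = [-2.7276, 2.8789]  # logits printed earlier for the example sentence
probabilities = softmax(logits)
print(probabilities)
```

With a PyTorch output, `torch.softmax(output.logits, dim=-1)` performs the same computation for every row of the batch.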

# Summary

To recap, in this chapter you:

- Learned the basic building blocks of a Transformer model.
- Learned what makes up a tokenization pipeline.
- Saw how to use a Transformer model in practice.
- Learned how to leverage a tokenizer to convert text to tensors that are understandable by the model.
- Set up a tokenizer and a model together to get from text to predictions.
- Learned the limitations of input IDs, and learned about attention masks.
- Played around with versatile and configurable tokenizer methods.

From now on, you should be able to freely navigate the 🤗 Transformers docs: the vocabulary will sound familiar, and you’ve already seen the methods that you’ll use the majority of the time.


# Quiz

### 1. What is the order of the language modeling pipeline?

<ol type="a">
  <li>First, the model, which handles text and returns raw predictions. The tokenizer then makes sense of these predictions and converts them back to text when needed.</li>
  <li>First, the tokenizer, which handles text and returns IDs. The model handles these IDs and outputs a prediction, which can be some text.</li>
  <li>The tokenizer handles text and returns IDs. The model handles these IDs and outputs a prediction. The tokenizer can then be used once again to convert these predictions back to some text.</li>
</ol>

### 2. How many dimensions does the tensor output by the base Transformer model have, and what are they?

<ol type="a">
  <li>2: The sequence length and the batch size</li>
  <li>2: The sequence length and the hidden size</li>
  <li>3: The sequence length, the batch size, and the hidden size</li>
</ol>

### 3. What is a model head?

<ol type="a">
  <li>A component of the base Transformer network that redirects tensors to their correct layers</li>
  <li>Also known as the self-attention mechanism, it adapts the representation of a token according to the other tokens of the sequence</li>
  <li>An additional component, usually made up of one or a few layers, to convert the transformer predictions to a task-specific output</li>
</ol>

### 4. What is an AutoModel?

<ol type="a">
  <li>A model that automatically trains on your data</li>
  <li>An object that returns the correct architecture based on the checkpoint</li>
  <li>A model that automatically detects the language used for its inputs to load the correct weights</li>
</ol>

### 5. What are the techniques to be aware of when batching sequences of different lengths together?

<ol type="a">
  <li>Truncating</li>
  <li>Returning tensors</li>
  <li>Padding</li>
  <li>Attention masking</li>
</ol>

### 6. What is the point of applying a SoftMax function to the logits output by a sequence classification model?

<ol type="a">
  <li>It softens the logits so that they're more reliable.</li>
  <li>It applies a lower and upper bound so that they're understandable.</li>
  <li>The total sum of the output is then 1, resulting in a possible probabilistic interpretation.</li>
</ol>

### 7. What does the result variable contain in this code sample?

``` python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
result = tokenizer.tokenize("Hello!")
```

<ol type="a">
  <li>A list of strings, each string being a token</li>
  <li>A list of IDs</li>
  <li>A string containing all of the tokens</li>
</ol>

### 8. Is there something wrong with the following code?

``` python
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
model = AutoModel.from_pretrained("gpt2")

encoded = tokenizer("Hey!", return_tensors="pt")
result = model(**encoded)
```

<ol type="a">
  <li>No, it seems correct.</li>
  <li>The tokenizer and model should always be from the same checkpoint.</li>
  <li>It's good practice to pad and truncate with the tokenizer as every input is a batch.</li>
</ol>