# **Using Transformers 🤗:**

## **Introduction**

🤗 Transformers, often referred to as Hugging Face Transformers, is a popular library that provides easy access to state-of-the-art natural language processing (NLP) models. This chapter introduces you to the key concepts and tools within the 🤗 Transformers library, allowing you to leverage powerful NLP models for various tasks.

Behind the Pipeline:

Learn how the 🤗 Transformers library simplifies NLP tasks with the concept of pipelines. This section delves into the inner workings of pipelines, which automate common NLP workflows, and explores how to set up and configure pipelines for various tasks.

Models:

Discover the wide array of pre-trained NLP models available in the 🤗 Transformers library. These models, based on transformer architectures, are capable of tasks like text classification, question answering, translation, summarization, and more. You'll gain insights into selecting and using the right model for your specific needs.

Tokenizers:

Tokenization is a crucial aspect of NLP, and 🤗 Transformers offers various tokenization tools to preprocess text data effectively. This section explores tokenizers, their configurations, and how to tokenize text for model input.

Handling Multiple Sequences:

NLP tasks often require handling multiple sequences of data, such as translating text from one language to another. This section guides you on working with multiple sequences in the 🤗 Transformers library, ensuring you can manage complex tasks seamlessly.

Putting It All Together:

In this section, you'll put your knowledge to the test by applying the concepts learned in the previous sections. You'll walk through practical examples of using the 🤗 Transformers library to accomplish specific NLP tasks, demonstrating how to build end-to-end solutions.

Basic Usage Completed:

By the end of this chapter, you will have a strong foundation in using the 🤗 Transformers library for NLP tasks. You'll be prepared to tackle various real-world NLP challenges with the help of state-of-the-art models and tools.

End-of-Chapter Quiz:

Test your understanding of the key concepts covered in this chapter with an end-of-chapter quiz. This quiz will help reinforce your knowledge and ensure you're ready to apply what you've learned in practice.

## **Behind the Pipeline**

To harness the power of transformer models for natural language processing (NLP) tasks, it's crucial to understand the mechanics behind the pipeline. The pipeline is a high-level interface that streamlines the process of applying transformers to text data. It automates many common NLP tasks, making it easier to work with these models effectively. Let's dive into the key components of the pipeline:

1. **Tokenization**: The first step in the pipeline is tokenization. This process takes raw text and breaks it down into smaller units called tokens. Tokens are typically words or subwords, and they are the basic units of text that the model processes. Tokenization ensures that the text is structured for the model's input.

2. **Model Loading**: The pipeline loads a pre-trained transformer model. In practice this happens once, when the pipeline is created, rather than on every call. These models have been trained on large text corpora and have learned to capture the context and relationships between words and phrases. Examples of popular transformer models include BERT, GPT-2, and RoBERTa.

3. **Inference**: The pipeline uses the loaded model to make inferences on the tokenized input text. Depending on the specific NLP task, the model may generate predictions, classifications, or other outputs. This step leverages the model's understanding of context to process and analyze the text effectively.

4. **Output**: The pipeline returns the model's output, which can vary depending on the task. For instance, if the task is text classification, the output may be a label or category. If the task is text generation, the output may be a generated text sequence. The output is designed to be easily accessible and ready for further processing or analysis.

5. **Post-Processing**: In many cases, the pipeline also includes post-processing steps. This can involve converting model outputs into human-readable formats, such as decoding generated tokens or mapping prediction scores to class labels.

The pipeline encapsulates these steps, allowing you to work with transformer models in a more user-friendly and efficient way. This simplification is particularly valuable when you need to perform various NLP tasks without delving into the low-level details of model loading, tokenization, and post-processing.

By understanding the pipeline, you can leverage transformer models for a wide range of NLP tasks, from text classification and sentiment analysis to question answering and text generation. The pipeline provides a powerful and accessible interface for applying these state-of-the-art models to real-world text data.
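The steps above can be sketched by hand with the library's lower-level classes. This is a minimal sketch rather than the pipeline's actual implementation: the helper name `classify` is hypothetical, the checkpoint `distilbert-base-uncased-finetuned-sst-2-english` is an assumed sentiment-analysis checkpoint, and the code expects `torch` and `transformers` to be installed (the first call downloads the weights).

```python
def classify(sentences, checkpoint="distilbert-base-uncased-finetuned-sst-2-english"):
    """Tokenize, run the model, and post-process, one step at a time."""
    # Imported here so that defining the function needs neither the libraries nor a download
    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    # Tokenization and model loading
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

    # Inference: the model returns raw scores (logits)
    inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits

    # Post-processing: logits -> probabilities -> label names
    probabilities = torch.softmax(logits, dim=-1)
    results = []
    for row in probabilities:
        best = int(row.argmax())
        results.append({"label": model.config.id2label[best], "score": float(row[best])})
    return results
```

Calling `classify(["I love this!", "I hate this."])` should return one dictionary per sentence, each with a `label` and a `score`. The sections below look at each of these steps in turn.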

![](images/pipeline1.png)

### Preprocessing with a tokenizer

![](images/preprocessing.png)

### Going through the model

![](images/model.png)

### A high-dimensional vector?

The vector output by the Transformer module is usually large. It generally has three dimensions:

1. **Batch size**: The number of sequences processed at a time (2 in our example).
2. **Sequence length**: The length of the numerical representation of the sequence (16 in our example).
3. **Hidden size**: The vector dimension of each model input.

It is said to be "high dimensional" because of the last value. The hidden size can be very large (768 is common for smaller models, and in larger models this can reach 3072 or more).
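To make the three dimensions concrete, here is a small sketch in plain Python, without any deep learning framework. It builds a nested list with the batch size and sequence length from the example and an assumed hidden size of 768, then reads the shape back. The `shape` helper is hypothetical and exists only for this illustration.

```python
def shape(nested):
    """Return the dimensions of a rectangular nested list as a tuple."""
    dims = []
    while isinstance(nested, list):
        dims.append(len(nested))
        nested = nested[0]
    return tuple(dims)

batch_size, sequence_length, hidden_size = 2, 16, 768  # 768 is an assumed hidden size

# One vector of `hidden_size` numbers per token, for every sequence in the batch
hidden_states = [
    [[0.0] * hidden_size for _ in range(sequence_length)]
    for _ in range(batch_size)
]

print(shape(hidden_states))  # (2, 16, 768)
```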

We can see this if we feed the inputs we preprocessed to our model:

![](images/model1.png)

### Model heads: Making sense out of numbers

The model heads take the high-dimensional vector of hidden states as input and project them onto a different dimension. They are usually composed of one or a few linear layers.
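As a rough illustration of what projecting onto a different dimension means, the sketch below applies a single linear layer to one hidden-state vector in plain Python. The sizes (a hidden size of 4 and 2 output labels) and all weights are made up; a real head has learned weights and a much larger hidden size.

```python
def linear(vector, weights, bias):
    """Compute weights @ vector + bias for a single input vector."""
    return [
        sum(w * x for w, x in zip(row, vector)) + b
        for row, b in zip(weights, bias)
    ]

hidden_state = [0.5, -1.0, 2.0, 0.0]   # toy hidden size of 4
weights = [
    [1.0, 0.0, -1.0, 0.5],             # one row of weights per output label
    [0.0, 2.0, 1.0, 0.5],
]
bias = [0.1, -0.1]

logits = linear(hidden_state, weights, bias)
print(len(hidden_state), "->", len(logits))  # 4 -> 2
print(logits)                                # two raw scores, one per label
```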

For our example, we will need a model with a sequence classification head (to be able to classify the sentences as positive or negative). So, we won’t actually use the AutoModel class, but AutoModelForSequenceClassification:

![](images/auto_model.png)

### Postprocessing the output

The values we get as output from our model don’t necessarily make sense by themselves. Let’s take a look:

![](images/output.png)

Our model predicted [-4.2829, 4.5600] for the first sentence and [-4.1897, 4.5334] for the second one. Those are not probabilities but logits, the raw, unnormalized scores output by the last layer of the model. To be converted to probabilities, they need to go through a softmax layer. All 🤗 Transformers models output logits, because the loss function used for training generally fuses the last activation function (such as softmax) with the actual loss function (such as cross entropy):

![](images/predict.png)

Now we can conclude that the model predicted the following:

1. First sentence: NEGATIVE: 0.00014439, POSITIVE: 0.99986
2. Second sentence: NEGATIVE: 0.00016277, POSITIVE: 0.99984

We have successfully reproduced the three steps of the pipeline: preprocessing with tokenizers, passing the inputs through the model, and postprocessing!
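The conversion from logits to probabilities can be checked by hand. The sketch below implements softmax in plain Python and applies it to the two pairs of logits quoted above. The label order (NEGATIVE first, POSITIVE second) is taken from the text; in the library it would come from the model's configuration.

```python
import math

def softmax(logits):
    """Turn a list of raw scores into probabilities that sum to 1."""
    highest = max(logits)  # subtracting the max keeps exp() numerically stable
    exps = [math.exp(x - highest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

labels = ["NEGATIVE", "POSITIVE"]  # label order as given in the text
batch_logits = [[-4.2829, 4.5600], [-4.1897, 4.5334]]

for logits in batch_logits:
    probabilities = softmax(logits)
    print({label: round(p, 5) for label, p in zip(labels, probabilities)})
```

Both sentences come out with a POSITIVE probability very close to 1, in line with the values listed above.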

## **Models**

In this section, we'll delve deeper into the process of creating and utilizing a model. We'll be making use of the AutoModel class, which is quite convenient when you need to create an instance of any model from a checkpoint.

The AutoModel class, along with its related classes, serves as a straightforward wrapper for the diverse array of models available in the library. It's a clever wrapper because it can automatically deduce the appropriate model architecture for your checkpoint and then instantiate a model with that specific architecture.

However, if you happen to know the exact model type you wish to use, you can opt to use the class that explicitly defines its architecture. Let's explore this further with a BERT model as an example.

### Creating a Transformer

The first thing we’ll need to do to initialize a BERT model is load a configuration object:

![](images/transformer1.png)

The configuration contains many attributes that are used to build the model:

![](images/transformer2.png)

Although we haven't explored the specific functionality of all these attributes in detail yet, you might already be familiar with some of them. For instance, the `hidden_size` attribute specifies the dimension of the `hidden_states` vector, and `num_hidden_layers` defines the number of layers present in the Transformer model. These are crucial architectural parameters that significantly influence the model's behavior and capacity. As you continue to work with Transformers, you'll gain a deeper understanding of how these attributes impact the model's performance and behavior.

### Different loading methods

Creating a model from the default configuration initializes it with random values:

![](images/model2.png)

Using the model in its initial state is possible, but it would produce unintelligible results since it needs to be trained for a specific task. Training the model from scratch on a particular task would indeed be a time-consuming process, requiring a substantial amount of data and having a significant environmental impact, as discussed in Chapter 1.

To circumvent the need for such extensive and duplicated efforts, it's crucial to have the capability to share and reuse models that have already undergone training. Loading a pretrained Transformer model is a straightforward process, and you can achieve this by using the `from_pretrained()` method. This method allows you to access and utilize models that have already been trained on various tasks, saving both time and resources.

![](images/model3.png)

As demonstrated earlier, we were able to replace the use of BertModel with the equivalent AutoModel class. We'll continue using this approach, as it leads to code that is agnostic to the specific checkpoint, ensuring that your code can seamlessly work with different checkpoints. This flexibility extends even to cases where the architecture might differ, as long as the checkpoint was trained for a similar task, such as sentiment analysis.

In the code sample above, we didn't use BertConfig. Instead, we loaded a pretrained model using the identifier `bert-base-cased`. This checkpoint was trained by the authors of BERT themselves, and you can find more details about it in its model card.

The model is initialized with all the weights from the checkpoint, making it suitable for immediate inference on the tasks it was originally trained for. Additionally, it can be fine-tuned for new tasks. Leveraging pretrained weights in this manner allows you to achieve good results more rapidly compared to training a model from scratch.

Furthermore, the weights are downloaded and cached during this process, so subsequent calls to `from_pretrained()` won't download them again. By default the cache lives under `~/.cache/huggingface` (the exact subfolder, `transformers` or `hub`, depends on the library version), and you can change this location by setting the `HF_HOME` environment variable.

The identifier used to load the model can be that of any model on the Model Hub, as long as it is compatible with the BERT architecture. The full list of available BERT checkpoints can be found on the Hugging Face Model Hub.

### Saving methods

Saving a model is as easy as loading one. We use the `save_pretrained()` method, which is analogous to the `from_pretrained()` method:

![](images/save_model.png)

Saving writes two files to the target directory: `config.json` and `pytorch_model.bin`. The `config.json` file contains the attributes needed to build the model's architecture. It also includes metadata, such as the source of the checkpoint and the version of 🤗 Transformers used when the checkpoint was last saved.

The `pytorch_model.bin` file is known as the state dictionary and contains all the weights of your model (recent releases of the library may write the weights to `model.safetensors` instead). The two files go hand in hand: the configuration describes the model's architecture, while the weights are the model's parameters. Both are needed to load and use a pretrained checkpoint.

### Using a Transformer model for inference

Now that you've learned how to load and save a model, let's explore using it to make predictions. Transformer models can only process numerical data, which is generated by tokenizers. Before delving into tokenizers, let's first understand what types of inputs the model can accept.

Tokenizers play a crucial role in converting inputs into the appropriate tensors for the underlying framework. However, to provide you with a better understanding of the process, let's take a quick look at what needs to be done before sending inputs to the model.

For instance, suppose we have a couple of sequences:

![](images/seq1.png)

The tokenizer's role is to convert these sequences into vocabulary indices, which are commonly referred to as input IDs. Consequently, each sequence is transformed into a list of numbers. The resulting output is as follows:

![](images/seq2.png)

Indeed, what we have now is a list of encoded sequences, essentially a list of lists. However, it's important to note that tensors, which are used in most deep learning frameworks, require rectangular shapes, similar to matrices. Fortunately, this "array" is already in a rectangular shape, making it straightforward to convert it into a tensor.

![](images/tensor.png)

### Using the tensors as inputs to the model

Using the tensors with the model is simple. We just call the model with the inputs:

![](images/model4.png)

The Transformer model can accept various arguments, but for the basic usage, only the input IDs are essential. We will explore the functionality and use of the other arguments later on. However, before delving into those details, it's important to examine the tokenizers responsible for constructing inputs that a Transformer model can comprehend. These tokenizers play a critical role in preparing the input data for the model.

## **Tokenizers**

Tokenization is a fundamental step in natural language processing (NLP) that involves breaking down text into smaller units, called tokens. Tokens are typically words or subwords, and they serve as the basic building blocks for NLP models. Tokenizers are essential for preparing text data for processing by transformer models, and here's why they are important:

1. **Handling Text Data**: Text data is inherently unstructured, and tokenization helps convert it into a format that can be processed effectively by NLP models. It breaks text into manageable chunks, making it suitable for analysis.

2. **Word and Subword Splitting**: Tokenization splits text into individual words or subwords, which allows models to understand and generate text at a granular level. Subword tokenization is particularly useful for handling languages with complex word formations.

3. **Vocabulary Mapping**: Tokenizers maintain a vocabulary or dictionary that maps tokens to numerical IDs. This mapping is crucial because NLP models operate on numerical input. Token IDs are used as inputs to the model, and the model's output is often in the form of token IDs, which are then mapped back to text.

4. **Special Tokens**: Tokenizers often include special tokens, such as [CLS] and [SEP], which are used for specific purposes. For example, [CLS] might be used to indicate the start of a text sequence in a classification task, and [SEP] can be used to separate segments of text in various tasks.

5. **Segmentation**: Tokenizers can handle inputs made of multiple sequences or segments of text. For instance, in a sentence-pair task the tokenizer joins the two sentences with a separator token and records which segment each token belongs to.

6. **Subword Tokenization**: Subword tokenization is useful for handling languages with a vast vocabulary or for splitting long words into meaningful parts. This technique is particularly helpful for languages like Chinese or for handling domain-specific terminology.

7. **Pre-processing and Post-processing**: Tokenizers can handle pre-processing tasks like lowercasing, stripping accents, and handling special characters. They can also perform post-processing, such as converting token IDs back to text.

8. **Fine-Tuning**: When fine-tuning transformer models for specific tasks, tokenizers ensure that the text data used for fine-tuning is processed consistently with the pre-trained model's tokenization.

9. **Open-Source Tokenizers**: Open-source tokenizer implementations are available for popular transformer models. The 🤗 Transformers library, for example, provides tokenizers that match its pre-trained models.

In summary, tokenization is a crucial step in NLP that prepares text data for processing by transformer models. It enables the handling of text at the word or subword level, maintains token-to-ID mappings, and ensures that text data is structured for effective analysis and model input. Different languages and tasks may require specific tokenization strategies to handle their unique characteristics.

Let's explore some examples of tokenization algorithms and address any questions you might have regarding the tokenization process.

### Word-based

Word-based tokenization is one of the most straightforward and commonly used tokenization approaches. It is easy to set up and typically involves a few rules, making it a good choice for many applications. The main objective of word-based tokenization is to break down raw text into individual words and assign a numerical representation to each word. This approach is exemplified in the image below:

![](images/word_base1.png)

There are various ways to split the text into words. One simple approach is to use whitespace as a delimiter, which can be done with Python's `str.split()` method. It splits the text wherever it encounters whitespace, creating a list of words. This basic word-based tokenization can be quite effective for many tasks.

![](images/word_base2.png)
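A minimal sketch of the two steps, splitting on whitespace and then assigning a number to each word, might look like this. The sentence is chosen only for illustration, and the IDs are assigned in order of first appearance, which is one simple convention among many.

```python
text = "Jim Henson was a puppeteer"  # example sentence chosen for illustration

# Step 1: split on whitespace
tokens = text.split()
print(tokens)  # ['Jim', 'Henson', 'was', 'a', 'puppeteer']

# Step 2: give each distinct word an ID, in order of first appearance
vocab = {}
for token in tokens:
    vocab.setdefault(token, len(vocab))

input_ids = [vocab[token] for token in tokens]
print(input_ids)  # [0, 1, 2, 3, 4]
```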

Word-based tokenizers are useful but can have limitations, especially when dealing with extensive vocabularies in languages with numerous words and variations. In such cases, the vocabulary can become very large, with each word being assigned an ID ranging from 0 to the size of the vocabulary. However, this approach has some drawbacks:

1. Variations between words: Similar words with slight differences, such as "dog" and "dogs," or "run" and "running," may be assigned different IDs initially. The model won't inherently recognize the similarity between such words.

2. Handling out-of-vocabulary words: To cover an entire language with a word-based tokenizer, you would need an identifier for every word in the language, resulting in an enormous number of tokens. To handle words not in the vocabulary, a special "unknown" token, often represented as "[UNK]" or something similar, is used. The presence of many unknown tokens can indicate a limitation of the tokenizer, as it signifies a loss of information.
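Both drawbacks show up even in a tiny example. The sketch below builds a word-level vocabulary from a two-sentence toy corpus and then encodes a sentence containing unseen word forms. The helper names (`build_vocab`, `encode`) and the choice of ID 0 for `[UNK]` are assumptions made for this illustration.

```python
UNK = "[UNK]"

def build_vocab(corpus):
    """Map every whitespace-separated word to an ID; ID 0 is reserved for [UNK]."""
    vocab = {UNK: 0}
    for sentence in corpus:
        for word in sentence.split():
            vocab.setdefault(word, len(vocab))
    return vocab

def encode(sentence, vocab):
    """Replace each word with its ID, falling back to the [UNK] ID."""
    return [vocab.get(word, vocab[UNK]) for word in sentence.split()]

vocab = build_vocab(["the dog runs", "the cat sleeps"])

# "dog" is known, but the closely related "dogs" is not
print(vocab["dog"], "dogs" in vocab)   # 2 False

# Unseen words all collapse to the same [UNK] ID, losing information
print(encode("the dogs run", vocab))   # [1, 0, 0]
```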

To mitigate these issues and reduce the reliance on the unknown token, you can employ a character-based tokenizer, which delves one level deeper into text processing.

### Character-based

Character-based tokenizers operate by splitting the text into individual characters, as opposed to words. This approach offers two key advantages:

1. Smaller Vocabulary: Character-based tokenizers result in significantly smaller vocabularies compared to word-based tokenizers. The vocabulary size is determined by the number of distinct characters in the language, which is typically much smaller than the number of words.

2. Fewer Out-of-Vocabulary Tokens: With character-based tokenization, there are far fewer out-of-vocabulary (unknown) tokens. Since every word can be constructed from characters, the model can represent any word using the characters in its vocabulary.

However, character-based tokenization introduces some questions and challenges related to spaces and punctuation. For example:

![](images/character_based.png)

Character-based tokenization has its merits, but it also has limitations. One argument against it is that, on an intuitive level, individual characters may not carry as much meaning on their own as whole words. However, this varies with the language. For instance, in Chinese each character conveys more information than a character in a language written in the Latin alphabet.

Additionally, character-based tokenization can lead to a substantial increase in the number of tokens to be processed by the model. While a word is typically represented by a single token in word-based tokenization, it can translate to 10 or more tokens when broken down into characters.
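The trade-off between sequence length and vocabulary size is easy to see on a short sentence. In this sketch the sentence is only an example, and the character-level split simply treats every character, including spaces and punctuation, as a token.

```python
text = "Let's do tokenization!"

word_tokens = text.split()
char_tokens = list(text)  # spaces and punctuation become tokens too

print(len(word_tokens))        # 3 tokens at the word level
print(len(char_tokens))        # 22 tokens at the character level
print(sorted(set(char_tokens)))  # the (small) set of distinct characters needed
```

The character-level sequence is several times longer, while the number of distinct symbols needed to represent it stays small.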

To strike a balance between these two approaches, a third technique called subword tokenization is often used. This approach combines the advantages of both word-based and character-based tokenization, making it a versatile choice for handling different languages and text types.

### Subword tokenization

Subword tokenization algorithms are designed with the principle that commonly used words should remain intact and not be split into smaller subwords, while rare words should be decomposed into meaningful subword units. This approach preserves the meaning of rare words while utilizing frequently occurring subwords, making the tokenization process more efficient and informative.

For instance, consider the word "annoyingly," which might be considered a rare word. Using subword tokenization, it could be decomposed into "annoying" and "ly." Both "annoying" and "ly" are likely to appear more frequently as standalone subwords in various contexts. At the same time, the composite meaning of "annoyingly" is retained through the combination of "annoying" and "ly."

Here's an example to illustrate how a subword tokenization algorithm would tokenize the sequence "Let’s do tokenization!":

![](images/subword.png)

Subword tokenization techniques indeed offer the advantage of preserving semantic meaning while remaining space-efficient. For example, in the given example, "tokenization" was split into "token" and "ization," both of which have semantic meaning, and only two tokens are needed to represent a long word. This approach allows for relatively good coverage with smaller vocabularies and minimizes the need for unknown tokens.

Subword tokenization is particularly valuable in languages with agglutinative features, like Turkish, where complex words can be constructed by concatenating subwords.

Several subword tokenization algorithms are in common use, including byte-level BPE (used in GPT-2), WordPiece (used in BERT), and SentencePiece or Unigram (used in several multilingual models). These techniques provide flexibility in handling different languages, text types, and use cases.

With this foundational knowledge of how tokenizers work, you should be well-equipped to start using the API and effectively preprocess text data for various natural language processing tasks.

### Loading and saving

Loading and saving tokenizers is straightforward and follows a similar pattern as models, making use of the same two methods: `from_pretrained()` and `save_pretrained()`. These methods handle the loading and saving of the tokenizer's algorithm (akin to the model's architecture) as well as its vocabulary (similar to the model's weights).

When loading a tokenizer that matches the checkpoint, the process is much like loading the corresponding model. For example, to load the BERT tokenizer trained with the same checkpoint as BERT, you can use the `BertTokenizer` class:

![](images/loading.png)

This ensures that the tokenizer and the model are in sync, and you can perform tokenization and preprocessing in a consistent manner.

Just like the `AutoModel` class, the `AutoTokenizer` class is designed to automatically select the appropriate tokenizer class from the library based on the checkpoint name. This allows you to use it directly with any checkpoint, making it a convenient way to handle tokenization for various models without needing to explicitly specify the tokenizer class.

![](images/loading1.png)

With the tokenizer loaded, you can now use it as demonstrated in the previous section to tokenize and preprocess text data for various natural language processing tasks. This involves converting text into numerical inputs that can be fed into your Transformer model for inference or training.

![](images/tokenize.png)

We will explore token_type_ids in more detail in Chapter 3, and we will provide an explanation of the attention_mask key a bit later in this discussion. Before diving into those aspects, let's focus on understanding how the input_ids are generated, which requires looking at the intermediate methods of the tokenizer. This will give you a more comprehensive insight into the tokenization process and the generation of input IDs.

### Encoding

The process of translating text into numerical representations is known as encoding, and it typically involves a two-step procedure: tokenization, followed by the conversion to input IDs.

As we've already seen, the first step involves breaking the text into words (or parts of words, punctuation symbols, etc.), known as tokens. This tokenization process can be subject to various rules, which is why it's crucial to instantiate the tokenizer using the model's name to ensure consistency with the rules used during pretraining.

The second step is to convert these tokens into numerical values, which allows us to construct a tensor for feeding into the model. This conversion is achieved using the tokenizer's vocabulary, which is downloaded when the tokenizer is instantiated with the `from_pretrained()` method. It's essential to use the same vocabulary that was employed during the model's pretraining.

To provide a better understanding of these two steps, we'll explore them separately. Please note that in practice, you would typically call the tokenizer directly on your inputs, and it would handle both tokenization and conversion to input IDs seamlessly. The separation is for illustrative purposes to show the intermediate results of these processes.

### Tokenization

The tokenization process is done by the tokenize() method of the tokenizer:

![](images/tokenize1.png)

The output, `['Using', 'a', 'Trans', '##former', 'network', 'is', 'simple']`, shows a subword tokenizer at work: it splits each word into smaller pieces until every piece can be found in its vocabulary.

For instance, the word "Transformer" is divided into two tokens: "Trans" and "##former". The "##" prefix indicates that the token is a continuation of the previous one, and together they represent the original word. This approach allows the tokenizer to efficiently handle a wide range of words and subwords while keeping the vocabulary size manageable.
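To give a feel for how such a split can be produced, here is a toy greedy longest-match tokenizer in the spirit of WordPiece. It is a sketch under stated assumptions, not the library's implementation: the function name, the seven-entry vocabulary, and the input sentence (reconstructed from the token list above) are all chosen for illustration, and it only splits on whitespace.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Split one word into the longest known pieces, scanning left to right."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # mark a continuation piece
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk_token]  # no known piece fits: the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces

# Hypothetical vocabulary; real checkpoints ship tens of thousands of entries
toy_vocab = {"Using", "a", "Trans", "##former", "network", "is", "simple"}

sentence = "Using a Transformer network is simple"
tokens = [piece for word in sentence.split()
          for piece in wordpiece_tokenize(word, toy_vocab)]
print(tokens)
# ['Using', 'a', 'Trans', '##former', 'network', 'is', 'simple']
```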

### From tokens to input IDs

The conversion to input IDs is handled by the convert_tokens_to_ids() tokenizer method:

![](images/tokenize2.png)
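Conceptually, this conversion is a dictionary lookup. The sketch below uses a made-up vocabulary with small IDs; the helper name `tokens_to_ids` is hypothetical, and the IDs a real tokenizer returns depend on the vocabulary shipped with the checkpoint, so they will differ from these.

```python
tokens = ["Using", "a", "Trans", "##former", "network", "is", "simple"]

# Hypothetical vocabulary, for illustration only
toy_vocab = {
    "[UNK]": 0, "Using": 1, "a": 2, "Trans": 3,
    "##former": 4, "network": 5, "is": 6, "simple": 7,
}

def tokens_to_ids(tokens, vocab):
    """Look up each token's ID, using the [UNK] ID for tokens not in the vocabulary."""
    return [vocab.get(token, vocab["[UNK]"]) for token in tokens]

ids = tokens_to_ids(tokens, toy_vocab)
print(ids)  # [1, 2, 3, 4, 5, 6, 7]
```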

**Exercise:** Replicate Tokenization and Conversion to Input IDs

Replicate the tokenization and conversion to input IDs for the following input sentences:

1. "HuggingFace provides amazing NLP resources."
2. "I love learning about natural language processing!"

Using the appropriate tokenizer, tokenize these sentences and convert them into input IDs. Ensure that you obtain the same input IDs as seen earlier in this chapter.

### Decoding

Decoding is the process of converting vocabulary indices back into a human-readable text string. This can be accomplished using the `decode()` method, which allows you to reverse the tokenization and obtain the original text from the numerical representations.

![](images/decoding.png)

It's worth noting that the `decode()` method not only translates the indices back into tokens but also intelligently groups together the tokens that were originally part of the same words, resulting in a coherent and readable sentence. This behavior becomes particularly valuable when working with models that generate new text, such as text generation from a prompt or for sequence-to-sequence tasks like translation or summarization.
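The grouping behavior can be sketched in a few lines of plain Python. This is a simplified illustration, not the library's `decode()`: the ID-to-token table is made up, and only the "##" continuation convention is handled.

```python
def decode(ids, id_to_token):
    """Map IDs back to tokens and glue '##' continuation pieces onto the previous token."""
    words = []
    for token in (id_to_token[i] for i in ids):
        if token.startswith("##") and words:
            words[-1] += token[2:]  # continuation: append without a space
        else:
            words.append(token)
    return " ".join(words)

# Hypothetical ID-to-token table, for illustration only
id_to_token = {
    1: "Using", 2: "a", 3: "Trans", 4: "##former",
    5: "network", 6: "is", 7: "simple",
}

print(decode([1, 2, 3, 4, 5, 6, 7], id_to_token))
# Using a Transformer network is simple
```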

So far, we've covered the fundamental operations that a tokenizer can perform: tokenization, conversion to IDs, and decoding IDs back into a human-readable string. However, this is just the beginning, and there's much more to explore. In the upcoming section, we'll push our understanding of tokenization to its limits and explore strategies for overcoming challenges in text processing.

## **Handling multiple sequences**

In the previous section, we covered the basic use case of performing inference on a single sequence of relatively small length. However, as we delve deeper into natural language processing tasks, several questions arise:

1. How do we handle multiple sequences?
2. How do we handle multiple sequences of different lengths?
3. Are vocabulary indices the only inputs that allow a model to work well?
4. Is there a practical limit to the length of a sequence that a model can handle effectively?

These questions highlight some of the challenges and complexities that can arise when working with text data. In the following sections, we will explore the solutions to these questions using the 🤗 Transformers API, which offers tools and techniques to address these issues and make it easier to work with a wide range of natural language processing tasks.

### Models expect a batch of inputs

In the previous section, we observed how sequences are translated into lists of numbers. Now, let's take this list of numbers and convert it into a tensor, which can then be sent to the model for processing. This transformation is a crucial step in preparing the data for feeding into the model.

![](images/batch_input.png)

The issue here is that we attempted to send a single sequence to the model, but 🤗 Transformers models expect input in the form of multiple sentences by default. When using the tokenizer, it doesn't merely convert the list of input IDs into a tensor; it adds an extra dimension on top of it. This extra dimension is critical for correctly formatting the input data for the model. To resolve this issue, we need to ensure that our input is structured correctly to match the model's expectations.

![](images/batch_input1.png)

To correctly format the input data for the model, we should add a new dimension to our input. Let's try the conversion again with the extra dimension added to the tensor. This ensures that the input data aligns with the model's expectations for processing multiple sequences.

![](images/batch_input2.png)

Batching involves sending multiple sentences through the model simultaneously. If you have only one sentence, you can still create a batch with a single sequence. This is a common practice, and it allows for consistent handling of both single sentences and batches in the model.

![](images/batch_input3.png)
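In terms of plain Python lists, adding the batch dimension just means wrapping the list of IDs in another list. The IDs below are made up for illustration. With PyTorch, the same idea is expressed by passing the wrapped list to `torch.tensor`.

```python
ids = [1045, 1005, 2310, 2042]  # made-up input IDs for a single sequence

batch = [ids]  # wrap in an outer list: a batch containing one sequence

print(len(ids))                   # 4, i.e. shape (4,)
print(len(batch), len(batch[0]))  # 1 4, i.e. shape (1, 4)
```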

Batching enables the model to process multiple sentences simultaneously. However, when working with multiple sequences, a new challenge arises: different sentences may have different lengths. Tensors are rectangular data structures and require consistent dimensions, so a list of input IDs cannot be converted directly into a tensor if the sequences have different lengths. To overcome this, we commonly use padding. Padding adds a special padding token to the shorter sequences so that all sequences within a batch have the same length. This makes it possible to convert the input data into a tensor and feed it into the model.

### Padding the inputs

The following list of lists cannot be converted to a tensor:

![](images/padding1.png)

To work around varying sentence lengths, we pad the inputs so that the tensor is rectangular. Padding adds a special token, called the padding token, to the sentences with fewer values. For instance, if you have 10 sentences with 10 words each and 1 sentence with 20 words, padding extends the shorter sentences so that all of them have 20 words. The resulting tensor is rectangular, as illustrated below:

![](images/padding2.png)
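The padding step can be sketched in plain Python as follows. This is a minimal illustration, not the library's implementation: `pad_batch` is a hypothetical helper, the IDs are made up, and `PAD_ID = 0` merely stands in for whatever `tokenizer.pad_token_id` returns for your tokenizer.

```python
def pad_batch(sequences, pad_token_id):
    """Right-pad every sequence to the length of the longest one."""
    max_len = max(len(seq) for seq in sequences)
    return [seq + [pad_token_id] * (max_len - len(seq)) for seq in sequences]


# Assumption: 0 stands in for tokenizer.pad_token_id; the real value
# depends on the tokenizer you load.
PAD_ID = 0

# Illustrative IDs: two sequences of different lengths.
batch = [[200, 200, 200], [200, 200]]

padded = pad_batch(batch, PAD_ID)
print(padded)  # [[200, 200, 200], [200, 200, 0]]
```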

The padding token ID is accessible through `tokenizer.pad_token_id`. Let's use it to send our two sentences through the model, first individually and then together as a batch, and compare the results:

![](images/padding3.png)

Something is wrong with the logits in the batched predictions: the second row should match the logits for the second sentence, but it contains different values. The cause is the attention layers, the key feature of Transformer models. They contextualize each token by attending to all tokens in the sequence, and that includes the padding tokens. To get the same result whether we pass sentences of different lengths individually or as a padded batch, we need to tell the attention layers to ignore the padding tokens. This is done with an attention mask.

### Attention masks

Attention masks are tensors with the same shape as the input IDs tensor, filled with 0s and 1s. In this context, 1s indicate that the corresponding tokens should be attended to, while 0s indicate that the corresponding tokens should not be attended to (i.e., they should be ignored by the attention layers of the model).
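To make the shape relationship concrete, here is a small sketch that builds the padded input IDs and the matching attention mask together. The helper name `pad_with_mask`, the IDs, and the pad ID of 0 are assumptions for illustration only.

```python
def pad_with_mask(sequences, pad_token_id):
    """Pad sequences on the right and build the matching attention mask."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [pad_token_id] * n_pad)
        # 1 marks a real token, 0 marks a padding position to ignore.
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return input_ids, attention_mask


# Illustrative IDs; 0 is an assumed padding token ID.
input_ids, attention_mask = pad_with_mask([[200, 200, 200], [200, 200]], 0)
print(input_ids)       # [[200, 200, 200], [200, 200, 0]]
print(attention_mask)  # [[1, 1, 1], [1, 1, 0]]
```

Building the mask from the original lengths, rather than by comparing each ID to the padding ID, avoids mislabelling a real token that happens to share the padding ID.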

Let's extend the previous example by incorporating an attention mask:

![](images/attention_mask.png)

Now, with the inclusion of the attention mask, we obtain consistent logits for the second sentence in the batch. It's important to observe that the last value of the second sequence corresponds to a padding ID, and this is correctly reflected by the 0 value in the attention mask. This ensures that the attention layers appropriately handle padding tokens during processing.
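Why the mask restores the original result can be seen in a toy, single-query version of dot-product attention. This is a simplified sketch and not the model's actual attention layer: the vectors, the `attend` function, and the padding key and value are all invented for illustration.

```python
import math


def attend(query, keys, values, mask=None):
    """Softmax-weighted average of `values` using dot-product scores.

    Positions whose mask entry is 0 receive a score of -inf, so their
    softmax weight is exactly 0.
    """
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    if mask is not None:
        scores = [s if m == 1 else float("-inf") for s, m in zip(scores, mask)]
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return sum(w / total * v for w, v in zip(weights, values))


query = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0]]
values = [1.0, 3.0]

# A made-up padding position appended to the sequence.
pad_key, pad_value = [0.0, 0.0], 10.0

alone = attend(query, keys, values)
unmasked = attend(query, keys + [pad_key], values + [pad_value])
masked = attend(query, keys + [pad_key], values + [pad_value], mask=[1, 1, 0])

print(math.isclose(alone, unmasked))  # False: padding changed the output
print(math.isclose(alone, masked))    # True: the mask removes its influence
```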

### Longer sequences

Transformer models come with a limitation on the maximum sequence length they can handle. Most models support sequences of up to 512 or 1024 tokens, and attempting to process longer sequences may result in crashes. To address this limitation, you have two main solutions:

1. **Use a model with a longer supported sequence length:** Some models are specifically designed to handle very long sequences, such as [Longformer](https://huggingface.co/transformers/model_doc/longformer.html) or [LED](https://huggingface.co/transformers/model_doc/led.html). If your task requires processing extended sequences, consider exploring these models.

2. **Truncate your sequences:** For models with standard sequence length limits, it's common practice to truncate longer sequences. You can do this by slicing the list of input IDs to a maximum length of your choosing, for example `sequence[:max_sequence_length]`, where `max_sequence_length` is a value you define. Alternatively, you can ask the tokenizer to truncate for you by passing `truncation=True`, optionally with `max_length`, when tokenizing your input data.

![](images/longer_seq.png)
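A minimal sketch of manual truncation is shown below. The limit of 512 is only an example value taken from the typical range mentioned above, and `truncate` is a hypothetical helper; check the limit of the model you actually use.

```python
def truncate(ids, max_sequence_length):
    """Keep at most `max_sequence_length` IDs from the start of the sequence."""
    return ids[:max_sequence_length]


# Assumption: a 512-token limit, used here purely as an example.
max_sequence_length = 512

long_ids = list(range(1000))   # stand-in for a very long tokenized text
short_ids = [5, 6, 7]

print(len(truncate(long_ids, max_sequence_length)))  # 512
print(truncate(short_ids, max_sequence_length))      # [5, 6, 7]
```

Sequences that are already short enough pass through unchanged, so the same call can be applied to every input.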

## **Putting it all together**

In the previous sections, we delved into the details of various text preprocessing steps, including tokenization, conversion to input IDs, padding, truncation, and attention masks. We aimed to understand the underlying mechanisms and challenges involved in preparing text data for Transformer models.

However, it's important to highlight that the 🤗 Transformers API provides a high-level function that automates most of these processes for us. When you directly invoke your tokenizer on a sentence, you receive inputs that are ready to be fed into your model. This simplifies the workflow and abstracts away many of the manual steps we explored earlier, making it more convenient to work with Transformer models in real-world applications.

![](images/all_together1.png)

In the code snippet provided, the `model_inputs` variable encompasses all the necessary components for the model to function effectively. For DistilBERT, this includes the input IDs and the attention mask. For models that accept additional inputs, the tokenizer object will also output those.

This approach proves to be highly versatile, as demonstrated in the examples below. Firstly, it efficiently tokenizes a single sequence:

```python
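# `tokenizer` is the tokenizer object created earlier in this chapter
sequence = "I've been waiting for a HuggingFace course my whole life."

# Calling the tokenizer directly returns model-ready inputs
model_inputs = tokenizer(sequence)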
```

Furthermore, it seamlessly manages multiple sequences simultaneously, without any alteration in the API:

![](images/all_together2.png)

The tokenizer can also convert its output to tensors for a specific framework, ready to be fed directly into the model. This is controlled by the `return_tensors` argument: `"pt"` returns PyTorch tensors, `"tf"` returns TensorFlow tensors, and `"np"` returns NumPy arrays:

![](images/all_together3.png)

### Special tokens

Examining the input IDs returned by the tokenizer reveals a slight variation from what we observed earlier:

![](images/all_together4.png)

Two additional token IDs, one at the beginning and one at the end, have been included in the returned input IDs. To understand the purpose of these additions, let's decode the two sequences of IDs:

![](images/all_together5.png)

The tokenizer added the special token `[CLS]` at the beginning and the special token `[SEP]` at the end. It does so because the model was pretrained with them, so we need to include them as well to get the same results at inference time. Note that different models may use different special tokens or place them differently: some add a special token only at the beginning, others only at the end. In every case, the tokenizer knows which special tokens its model expects and handles them for you.
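The effect can be sketched by hand as follows. The IDs 101 and 102 are the values for `[CLS]` and `[SEP]` in BERT-style uncased vocabularies and are used here as an assumption; other tokenizers use different IDs, and `add_special_tokens` is a hypothetical helper rather than a library function.

```python
# Assumption: BERT-style IDs for [CLS] and [SEP]; other models differ.
CLS_ID, SEP_ID = 101, 102


def add_special_tokens(ids):
    """Wrap a sequence of IDs with a start and an end marker."""
    return [CLS_ID] + ids + [SEP_ID]


# Illustrative IDs (the values are made up).
ids = [1045, 2310, 2042]

with_special = add_special_tokens(ids)
print(with_special)                   # [101, 1045, 2310, 2042, 102]
print(len(with_special) - len(ids))   # 2
```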

### Summing up: Transitioning from Tokenizer to Model

Now that we've thoroughly examined each step the tokenizer object performs when applied to texts, let's revisit how it proficiently manages multiple sequences (including padding!), handles very long sequences (employing truncation!), and caters to various tensor types using its primary API:

![](images/all_together6.png)
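For intuition, the plain-Python sketch below strings the individual steps together: truncation, special tokens, padding, and the attention mask. It only approximates what happens when you call the tokenizer with options such as `padding=True` and `truncation=True`. The function `prepare_batch` and all ID values are assumptions made for this example.

```python
def prepare_batch(sequences, max_length, pad_id=0, cls_id=101, sep_id=102):
    """Truncate, add special tokens, pad, and build the attention mask.

    Assumes max_length >= 2 so that both special tokens always fit.
    """
    # Truncate the content so that it fits alongside the two special tokens.
    encoded = [[cls_id] + seq[: max_length - 2] + [sep_id] for seq in sequences]
    longest = max(len(seq) for seq in encoded)
    input_ids = [seq + [pad_id] * (longest - len(seq)) for seq in encoded]
    attention_mask = [
        [1] * len(seq) + [0] * (longest - len(seq)) for seq in encoded
    ]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Illustrative IDs: one sequence that is too long and one short sequence.
batch = prepare_batch([[7, 8, 9, 10, 11, 12], [7, 8]], max_length=6)

print(batch["input_ids"])
# [[101, 7, 8, 9, 10, 102], [101, 7, 8, 102, 0, 0]]
print(batch["attention_mask"])
# [[1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 0, 0]]
```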

### Basic usage completed!

Fantastic job in progressing through this chapter! To summarize, you have:

1. Gained an understanding of the fundamental components of a Transformer model.
2. Explored the key elements of a tokenization pipeline.
3. Used a Transformer model in practice.
4. Mastered the art of employing a tokenizer to convert text into tensors suitable for the model.
5. Successfully configured a tokenizer and a model to seamlessly transition from text to predictions.
6. Acquired insights into the limitations of input IDs and delved into the concept of attention masks.
7. Experimented with the versatile and customizable methods provided by the tokenizer.

With this knowledge, you should feel confident navigating the 🤗 Transformers documentation. The terminology is now familiar, and the methods you've encountered will be your go-to tools in various applications.