# **Using 🤗 Transformers**

## **Introduction**

🤗 Transformers, often referred to as Hugging Face Transformers, is a popular library that provides easy access to state-of-the-art natural language processing (NLP) models. This chapter introduces you to the key concepts and tools within the 🤗 Transformers library, allowing you to leverage powerful NLP models for various tasks.

Behind the Pipeline:

Learn how the 🤗 Transformers library simplifies NLP tasks with the concept of pipelines. This section delves into the inner workings of pipelines, which automate common NLP workflows, and explores how to set up and configure pipelines for various tasks.

Models:

Discover the wide array of pre-trained NLP models available in the 🤗 Transformers library. These models, based on transformer architectures, are capable of tasks like text classification, question answering, translation, summarization, and more. You'll gain insights into selecting and using the right model for your specific needs.

Tokenizers:

Tokenization is a crucial aspect of NLP, and 🤗 Transformers offers various tokenization tools to preprocess text data effectively. This section explores tokenizers, their configurations, and how to tokenize text for model input.

Handling Multiple Sequences:

NLP tasks often require handling multiple sequences at once, for example a batch of sentences of different lengths. This section guides you on working with multiple sequences in the 🤗 Transformers library, ensuring you can manage complex tasks seamlessly.

Putting It All Together:

In this section, you'll put your knowledge to the test by applying the concepts learned in the previous sections. You'll walk through practical examples of using the 🤗 Transformers library to accomplish specific NLP tasks, demonstrating how to build end-to-end solutions.

Basic Usage Completed:

By the end of this chapter, you will have a strong foundation in using the 🤗 Transformers library for NLP tasks. You'll be prepared to tackle various real-world NLP challenges with the help of state-of-the-art models and tools.

End-of-Chapter Quiz:

Test your understanding of the key concepts covered in this chapter with an end-of-chapter quiz. This quiz will help reinforce your knowledge and ensure you're ready to apply what you've learned in practice.

## **Behind the Pipeline**

To harness the power of transformer models for natural language processing (NLP) tasks, it's crucial to understand the mechanics behind the pipeline. The pipeline is a high-level interface that streamlines the process of applying transformers to text data. It automates many common NLP tasks, making it easier to work with these models effectively. Let's dive into the key components of the pipeline:

1. **Tokenization**: The first step in the pipeline is tokenization. This process takes raw text and breaks it down into smaller units called tokens. Tokens are typically words or subwords, and they are the basic units of text that the model processes. Tokenization ensures that the text is structured for the model's input.

2. **Model Loading**: After tokenization, the pipeline loads a pre-trained transformer model. These models have been trained on large text corpora and have learned to understand the context and relationships between words and phrases. Examples of popular transformer models include BERT, GPT-2, and RoBERTa.

3. **Inference**: The pipeline uses the loaded model to make inferences on the tokenized input text. Depending on the specific NLP task, the model may generate predictions, classifications, or other outputs. This step leverages the model's understanding of context to process and analyze the text effectively.

4. **Output**: The pipeline returns the model's output, which can vary depending on the task. For instance, if the task is text classification, the output may be a label or category. If the task is text generation, the output may be a generated text sequence. The output is designed to be easily accessible and ready for further processing or analysis.

5. **Post-Processing**: In many cases, the pipeline also includes post-processing steps. This can involve converting model outputs into human-readable formats, such as decoding generated tokens or mapping prediction scores to class labels.

The pipeline encapsulates these steps, allowing you to work with transformer models in a more user-friendly and efficient way. This simplification is particularly valuable when you need to perform various NLP tasks without delving into the low-level details of model loading, tokenization, and post-processing.

By understanding the pipeline, you can leverage transformer models for a wide range of NLP tasks, from text classification and sentiment analysis to question answering and text generation. The pipeline provides a powerful and accessible interface for applying these state-of-the-art models to real-world text data.

In [None]:
from transformers import pipeline

classifier = pipeline("sentiment-analysis")
classifier(
    [
        "I've been waiting for a HuggingFace course my whole life.",
        "I hate this so much!",
    ]
)

No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


[{'label': 'POSITIVE', 'score': 0.9598049521446228},
 {'label': 'NEGATIVE', 'score': 0.9994558691978455}]

### Preprocessing with a tokenizer

In [None]:
from transformers import AutoTokenizer

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

In [None]:
raw_inputs = [
    "I've been waiting for a HuggingFace course my whole life.",
    "I hate this so much!",
]
inputs = tokenizer(raw_inputs, padding=True, truncation=True, return_tensors="pt")
print(inputs)

{'input_ids': tensor([[  101,  1045,  1005,  2310,  2042,  3403,  2005,  1037, 17662, 12172,
          2607,  2026,  2878,  2166,  1012,   102],
        [  101,  1045,  5223,  2023,  2061,  2172,   999,   102,     0,     0,
             0,     0,     0,     0,     0,     0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]])}
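
To make the padding and the attention mask in this output more concrete, here is a minimal pure-Python sketch of what `padding=True` amounts to for this batch. The helper name `pad_batch` is hypothetical and is not part of the library. The real tokenizer does this work internally and returns tensors rather than lists.

```python
PAD_ID = 0  # pad token ID, matching the zeros in the output above


def pad_batch(sequences, pad_id=PAD_ID):
    """Pad every sequence to the longest length and build attention masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore.
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Token IDs copied from the tokenizer output above, with padding removed.
first = [101, 1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172,
         2607, 2026, 2878, 2166, 1012, 102]
second = [101, 1045, 5223, 2023, 2061, 2172, 999, 102]

batch = pad_batch([first, second])
print(batch["input_ids"][1])
print(batch["attention_mask"][1])
```

The shorter sentence is padded to the length of the longer one, and its attention mask switches to 0 exactly where the padding starts.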


### Going through the model

In [None]:
from transformers import AutoModel

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModel.from_pretrained(checkpoint)

### A high-dimensional vector?

The vector output by the Transformer module is usually large. It generally has three dimensions:

1. **Batch size**: The number of sequences processed at a time (2 in our example).
2. **Sequence length**: The length of the numerical representation of the sequence (16 in our example).
3. **Hidden size**: The vector dimension of each model input.

It is said to be “high dimensional” because of the last value. The hidden size can be very large (768 is common for smaller models, and in larger models this can reach 3072 or more).

We can see this if we feed the inputs we preprocessed to our model:

In [None]:
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)

torch.Size([2, 16, 768])


### Model heads: Making sense out of numbers

The model heads take the high-dimensional vector of hidden states as input and project them onto a different dimension. They are usually composed of one or a few linear layers:
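
As a rough illustration of such a projection, the sketch below applies a single linear layer to one hidden vector using plain Python lists. The weights are random stand-ins rather than trained parameters, and the function name `linear` is only illustrative. Real heads may add further layers, activations, or dropout.

```python
import random


def linear(hidden, weights, bias):
    """Apply y = W h + b to one hidden vector, using plain Python lists."""
    return [sum(w * h for w, h in zip(row, hidden)) + b
            for row, b in zip(weights, bias)]


random.seed(0)
HIDDEN_SIZE = 768  # hidden size from the example above
NUM_LABELS = 2     # e.g. negative / positive

# Randomly initialized stand-ins for a trained head's parameters.
hidden_state = [random.gauss(0, 1) for _ in range(HIDDEN_SIZE)]
weights = [[random.gauss(0, 0.02) for _ in range(HIDDEN_SIZE)]
           for _ in range(NUM_LABELS)]
bias = [0.0] * NUM_LABELS

logits = linear(hidden_state, weights, bias)
print(len(hidden_state), "->", len(logits))  # 768 -> 2
```

The head reduces each 768-dimensional hidden vector to one score per label.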

For our example, we will need a model with a sequence classification head (to be able to classify the sentences as positive or negative). So, we won’t actually use the AutoModel class, but AutoModelForSequenceClassification:

In [None]:
from transformers import AutoModelForSequenceClassification

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
outputs = model(**inputs)

In [None]:
print(outputs.logits.shape)

torch.Size([2, 2])


### Postprocessing the output

The values we get as output from our model don’t necessarily make sense by themselves. Let’s take a look:

In [None]:
print(outputs.logits)

tensor([[-1.5607,  1.6123],
        [ 4.1692, -3.3464]], grad_fn=<AddmmBackward0>)


Our model predicted [-1.5607, 1.6123] for the first sentence and [ 4.1692, -3.3464] for the second one. Those are not probabilities but logits, the raw, unnormalized scores outputted by the last layer of the model. To be converted to probabilities, they need to go through a SoftMax layer (all 🤗 Transformers models output the logits, as the loss function for training will generally fuse the last activation function, such as SoftMax, with the actual loss function, such as cross entropy):

In [None]:
import torch

predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
print(predictions)

tensor([[4.0195e-02, 9.5980e-01],
        [9.9946e-01, 5.4418e-04]], grad_fn=<SoftmaxBackward0>)
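
To show what the softmax step computes, here is a minimal standard-library sketch applied to the logits printed above. It illustrates the formula only and is not how the library implements it. The helper name `softmax` is defined here for the example.

```python
import math


def softmax(logits):
    """Convert a list of logits into probabilities that sum to 1."""
    shift = max(logits)  # subtracting the max avoids overflow in exp()
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Logits copied from the model output above.
batch_logits = [[-1.5607, 1.6123], [4.1692, -3.3464]]
probabilities = [softmax(row) for row in batch_logits]
for row in probabilities:
    print([round(p, 4) for p in row])
```

Up to rounding, the results should agree with the tensor printed by `torch.nn.functional.softmax`.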


In [None]:
model.config.id2label

{0: 'NEGATIVE', 1: 'POSITIVE'}

Now we can conclude that the model predicted the following:

1. First sentence: NEGATIVE: 0.0402, POSITIVE: 0.9598
2. Second sentence: NEGATIVE: 0.9995, POSITIVE: 0.0005

We have successfully reproduced the three steps of the pipeline: preprocessing with tokenizers, passing the inputs through the model, and postprocessing!
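
The final mapping from probabilities to labels can be sketched in a few lines. The values below are copied from the outputs above. The result format imitates what the pipeline returned at the start of this section, but this is an illustration and not the pipeline's actual code.

```python
# Copied from model.config.id2label above.
id2label = {0: "NEGATIVE", 1: "POSITIVE"}

# Probabilities copied from the softmax output above.
probabilities = [[4.0195e-02, 9.5980e-01], [9.9946e-01, 5.4418e-04]]

results = []
for row in probabilities:
    best = max(range(len(row)), key=lambda i: row[i])  # index of the largest probability
    results.append({"label": id2label[best], "score": row[best]})

print(results)
```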

## **Models**

In this section, we'll delve deeper into the process of creating and utilizing a model. We'll be making use of the AutoModel class, which is quite convenient when you need to create an instance of any model from a checkpoint.

The AutoModel class, along with its related classes, serves as a straightforward wrapper for the diverse array of models available in the library. It's a clever wrapper because it can automatically deduce the appropriate model architecture for your checkpoint and then instantiate a model with that specific architecture.

However, if you happen to know the exact model type you wish to use, you can opt to use the class that explicitly defines its architecture. Let's explore this further with a BERT model as an example.

### Creating a Transformer

The first thing we’ll need to do to initialize a BERT model is load a configuration object:

In [None]:
from transformers import BertConfig, BertModel

# Building the config
config = BertConfig()

# Building the model from the config
model = BertModel(config)

The configuration contains many attributes that are used to build the model:

In [None]:
print(config)

BertConfig {
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "transformers_version": "4.34.1",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 30522
}



Although we haven't explored the specific functionality of all these attributes in detail yet, you might already be familiar with some of them. For instance, the `hidden_size` attribute specifies the dimension of the `hidden_states` vector, and `num_hidden_layers` defines the number of layers present in the Transformer model. These are crucial architectural parameters that significantly influence the model's behavior and capacity. As you continue to work with Transformers, you'll gain a deeper understanding of how these attributes impact the model's performance and behavior.
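
As a small worked example of how these attributes relate to each other, the sketch below derives two quantities from the values printed above. The dictionary is a hand-copied subset of the configuration, not a `BertConfig` object, and the variable names are illustrative.

```python
# Values copied from the printed BertConfig above.
config_values = {
    "hidden_size": 768,
    "num_attention_heads": 12,
    "num_hidden_layers": 12,
    "intermediate_size": 3072,
}

# Each attention head works on an equal slice of the hidden vector.
head_dim = config_values["hidden_size"] // config_values["num_attention_heads"]

# The feed-forward layer expands the hidden vector by this factor.
ffn_ratio = config_values["intermediate_size"] // config_values["hidden_size"]

print(head_dim, ffn_ratio)  # 64 4
```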

### Different loading methods

Creating a model from the default configuration initializes it with random values:

In [None]:
from transformers import BertConfig, BertModel

config = BertConfig()
model = BertModel(config)

# Model is randomly initialized!

Using the model in its initial state is possible, but it would produce unintelligible results since it needs to be trained for a specific task. Training the model from scratch on a particular task would indeed be a time-consuming process, requiring a substantial amount of data and having a significant environmental impact, as discussed in Chapter 1.

To circumvent the need for such extensive and duplicated efforts, it's crucial to have the capability to share and reuse models that have already undergone training. Loading a pretrained Transformer model is a straightforward process, and you can achieve this by using the `from_pretrained()` method. This method allows you to access and utilize models that have already been trained on various tasks, saving both time and resources.

In [1]:
from transformers import BertModel

model = BertModel.from_pretrained("bert-base-cased")

As we saw earlier, BertModel can be replaced with the equivalent AutoModel class. We'll use AutoModel from now on, as it leads to code that is agnostic to the specific checkpoint: if your code works with one checkpoint, it should work with another. This flexibility extends even to cases where the architecture differs, as long as the checkpoint was trained for a similar task, such as sentiment analysis.

In the provided code sample, we didn't utilize BertConfig. Instead, we loaded a pretrained model using the identifier "bert-base-cased." This particular model checkpoint was trained by the creators of BERT themselves, and you can find more detailed information about it in its model card.

The model is initialized with all the weights from the checkpoint, making it suitable for immediate inference on the tasks it was originally trained for. Additionally, it can be fine-tuned for new tasks. Leveraging pretrained weights in this manner allows you to achieve good results more rapidly compared to training a model from scratch.

Furthermore, the weights have been downloaded and cached during this process, so subsequent calls to the from_pretrained() method won't download them again. In recent versions of 🤗 Transformers the default cache folder is ~/.cache/huggingface/hub (older releases used ~/.cache/huggingface/transformers). You can change the base cache directory, which is ~/.cache/huggingface by default, by setting the HF_HOME environment variable.

The identifier used to load the model can be the identifier of any model available on the Hugging Face Model Hub, provided that it is compatible with the BERT architecture. You can browse the available BERT checkpoints on the Model Hub.

### Saving methods

Saving a model is as easy as loading one — we use the save_pretrained() method, which is analogous to the from_pretrained() method:

In [None]:
model.save_pretrained("directory_on_my_computer")

This saves two files to your disk:

In [None]:
ls directory_on_my_computer

config.json  pytorch_model.bin


When you inspect the `config.json` file, you will find the attributes essential for constructing the model's architecture. This file also includes metadata, such as the source of the checkpoint and the version of 🤗 Transformers used when the checkpoint was last saved.

On the other hand, the `pytorch_model.bin` file is referred to as the state dictionary, and it contains all the weights of your model. (Recent releases of 🤗 Transformers save the weights as `model.safetensors` by default instead; the role of the file is the same.) These two files are tightly interrelated: the configuration file is critical for understanding your model's architecture, while the model weights stored in the state dictionary represent the parameters that define your model. Together, they are fundamental components for loading and using a pretrained model checkpoint.

### Using a Transformer model for inference

Now that you've learned how to load and save a model, let's explore using it to make predictions. Transformer models can only process numerical data, which is generated by tokenizers. Before delving into tokenizers, let's first understand what types of inputs the model can accept.

Tokenizers play a crucial role in converting inputs into the appropriate tensors for the underlying framework. However, to provide you with a better understanding of the process, let's take a quick look at what needs to be done before sending inputs to the model.

For instance, suppose we have a couple of sequences:



In [None]:
sequences = ["Hello!", "Cool.", "Nice!"]

The tokenizer's role is to convert these sequences into vocabulary indices, which are commonly referred to as input IDs. Consequently, each sequence is transformed into a list of numbers. The resulting output is as follows:

In [None]:
encoded_sequences = [
    [101, 7592, 999, 102],
    [101, 4658, 1012, 102],
    [101, 3835, 999, 102],
]

Indeed, what we have now is a list of encoded sequences, essentially a list of lists. However, it's important to note that tensors, which are used in most deep learning frameworks, require rectangular shapes, similar to matrices. Fortunately, this "array" is already in a rectangular shape, making it straightforward to convert it into a tensor.

In [None]:
import torch

model_inputs = torch.tensor(encoded_sequences)
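
The rectangular-shape requirement can be checked with a few lines of plain Python. The helper `is_rectangular` and the `ragged` batch are hypothetical and only serve as illustration. A ragged batch like the second one would need padding first, as done earlier with `padding=True`.

```python
def is_rectangular(rows):
    """Return True if every inner list has the same length."""
    return len({len(row) for row in rows}) <= 1


encoded_sequences = [
    [101, 7592, 999, 102],
    [101, 4658, 1012, 102],
    [101, 3835, 999, 102],
]
ragged = [[101, 7592, 999, 102], [101, 102]]  # hypothetical ragged batch

print(is_rectangular(encoded_sequences))  # True
print(is_rectangular(ragged))             # False
```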

### Using the tensors as inputs to the model

Making use of the tensors with the model is extremely simple — we just call the model with the inputs:

In [None]:
output = model(model_inputs)

The Transformer model can accept various arguments, but for the basic usage, only the input IDs are essential. We will explore the functionality and use of the other arguments later on. However, before delving into those details, it's important to examine the tokenizers responsible for constructing inputs that a Transformer model can comprehend. These tokenizers play a critical role in preparing the input data for the model.

## **Tokenizers**

Tokenization is a fundamental step in natural language processing (NLP) that involves breaking down text into smaller units, called tokens. Tokens are typically words or subwords, and they serve as the basic building blocks for NLP models. Tokenizers are essential for preparing text data for processing by transformer models, and here's why they are important:

1. **Handling Text Data**: Text data is inherently unstructured, and tokenization helps convert it into a format that can be processed effectively by NLP models. It breaks text into manageable chunks, making it suitable for analysis.

2. **Word and Subword Splitting**: Tokenization splits text into individual words or subwords, which allows models to understand and generate text at a granular level. Subword tokenization is particularly useful for handling languages with complex word formations.

3. **Vocabulary Mapping**: Tokenizers maintain a vocabulary or dictionary that maps tokens to numerical IDs. This mapping is crucial because NLP models operate on numerical input. Token IDs are used as inputs to the model, and the model's output is often in the form of token IDs, which are then mapped back to text.

4. **Special Tokens**: Tokenizers often include special tokens, such as [CLS] and [SEP], which are used for specific purposes. For example, [CLS] might be used to indicate the start of a text sequence in a classification task, and [SEP] can be used to separate segments of text in various tasks.

5. **Segmentation**: Tokenization can handle tasks involving multiple sequences or segments of text. For instance, in machine translation, a tokenizer can segment text into source and target languages.

6. **Subword Tokenization**: Subword tokenization is useful for handling languages with a vast vocabulary or for splitting long words into meaningful parts. This technique is particularly helpful for languages like Chinese or for handling domain-specific terminology.

7. **Pre-processing and Post-processing**: Tokenizers can handle pre-processing tasks like lowercasing, removing punctuation, and handling special characters. They can also perform post-processing, such as converting token IDs back to text.

8. **Fine-Tuning**: When fine-tuning transformer models for specific tasks, tokenizers ensure that the text data used for fine-tuning is processed consistently with the pre-trained model's tokenization.

9. **Open-Source Tokenizers**: Many open-source tokenizer implementations are available for popular transformer models. The Hugging Face Transformers library, for example, provides access to a variety of tokenizers alongside its pre-trained models.

In summary, tokenization is a crucial step in NLP that prepares text data for processing by transformer models. It enables the handling of text at the word or subword level, maintains token-to-ID mappings, and ensures that text data is structured for effective analysis and model input. Different languages and tasks may require specific tokenization strategies to handle their unique characteristics.
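
The vocabulary mapping and the special tokens described above can be sketched with a toy dictionary. The IDs come from the encoded sequences shown earlier; the token strings attached to them are inferred from their positions in that example, and the names `toy_vocab`, `encode`, and `decode` are illustrative. A real vocabulary holds tens of thousands of entries.

```python
# Toy vocabulary using the IDs from the encoded sequences shown earlier.
toy_vocab = {"[CLS]": 101, "[SEP]": 102, "hello": 7592, "cool": 4658,
             "nice": 3835, "!": 999, ".": 1012}
id_to_token = {idx: tok for tok, idx in toy_vocab.items()}


def encode(tokens):
    """Map tokens to IDs and wrap them in [CLS] ... [SEP]."""
    return [toy_vocab["[CLS]"]] + [toy_vocab[t] for t in tokens] + [toy_vocab["[SEP]"]]


def decode(ids):
    """Map IDs back to their tokens."""
    return [id_to_token[i] for i in ids]


ids = encode(["hello", "!"])
print(ids)          # [101, 7592, 999, 102]
print(decode(ids))  # ['[CLS]', 'hello', '!', '[SEP]']
```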

Let's explore some examples of tokenization algorithms and address any questions you might have regarding the tokenization process.

### Word-based

Word-based tokenization is one of the most straightforward and commonly used tokenization approaches. It is easy to set up and typically involves a few rules, making it a good choice for many applications. The main objective of word-based tokenization is to break down raw text into individual words and assign a numerical representation to each word. This approach is exemplified in the image below:

![](word_base.png)

There are various ways to split text into words. One simple approach is to use whitespace as a delimiter, which can be done with Python's `str.split()` method. It splits the text wherever it encounters whitespace and returns a list of words. This basic word-based tokenization can be quite effective for many tasks.

In [None]:
tokenized_text = "Jim Henson was a puppeteer".split()
print(tokenized_text)

['Jim', 'Henson', 'was', 'a', 'puppeteer']

Word-based tokenizers are useful but can have limitations, especially when dealing with extensive vocabularies in languages with numerous words and variations. In such cases, the vocabulary can become very large, with each word being assigned an ID ranging from 0 to the size of the vocabulary. However, this approach has some drawbacks:

1. Variations between words: Closely related words, such as "dog" and "dogs," or "run" and "running," are assigned unrelated IDs. The model has no built-in way to recognize their similarity and must learn it from data.

2. Handling out-of-vocabulary words: To cover an entire language with a word-based tokenizer, you would need an identifier for every word in the language, resulting in an enormous number of tokens. To handle words not in the vocabulary, a special "unknown" token, often represented as "[UNK]" or something similar, is used. The presence of many unknown tokens can indicate a limitation of the tokenizer, as it signifies a loss of information.
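
Both drawbacks can be seen in a minimal sketch of a word-based tokenizer with an unknown token. The tiny vocabulary is hypothetical and built from a single sentence; the names `word_vocab` and `encode_words` are illustrative.

```python
UNK = "[UNK]"

# Hypothetical tiny vocabulary built from one training sentence.
word_vocab = {UNK: 0}
for word in "Jim Henson was a puppeteer".split():
    word_vocab.setdefault(word, len(word_vocab))


def encode_words(text):
    """Whitespace-split the text and map unseen words to the [UNK] ID."""
    return [word_vocab.get(word, word_vocab[UNK]) for word in text.split()]


print(encode_words("Jim Henson was a puppeteer"))  # [1, 2, 3, 4, 5]
print(encode_words("Henson loved puppeteers"))     # [2, 0, 0]
```

Note that "puppeteers" maps to the unknown ID even though "puppeteer" is in the vocabulary, so the information carried by that word is lost.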

To mitigate these issues and reduce the reliance on the unknown token, you can employ a character-based tokenizer, which delves one level deeper into text processing.

### Character-based

Character-based tokenizers operate by splitting the text into individual characters, as opposed to words. This approach offers two key advantages:

1. Smaller Vocabulary: Character-based tokenizers result in significantly smaller vocabularies compared to word-based tokenizers. The vocabulary size is determined by the number of distinct characters in the language, which is typically much smaller than the number of words.

2. Fewer Out-of-Vocabulary Tokens: With character-based tokenization, there are far fewer out-of-vocabulary (unknown) tokens. Since every word can be constructed from characters, the model can represent any word using the characters in its vocabulary.
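
A minimal sketch of character-based tokenization, assuming a vocabulary built from a single sentence, illustrates both points. The names `char_vocab` and `encode_chars` are illustrative.

```python
text = "Jim Henson was a puppeteer"

# Character-level vocabulary: one ID per distinct character (spaces included).
char_vocab = {ch: idx for idx, ch in enumerate(sorted(set(text)))}


def encode_chars(s):
    """Map each character to its ID; raises KeyError for unseen characters."""
    return [char_vocab[ch] for ch in s]


print(len(char_vocab))  # number of distinct characters in the sentence

# "parrot" never appears in the sentence, but all of its characters do,
# so it can still be encoded without an unknown token.
print(encode_chars("parrot"))
```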

However, character-based tokenization introduces some questions and challenges related to spaces and punctuation. For example:

![](https://github.com/TelRich/Gen-AI-Open-AI/blob/main/image/word_base.png)