# Behind the pipeline (PyTorch)

Install the Transformers, Datasets, and Evaluate libraries to run this notebook. 

In [1]:
!pip install datasets evaluate transformers[sentencepiece]


In [2]:
from transformers import pipeline

classifier = pipeline("sentiment-analysis")
classifier(
    [
        "I've been waiting for a HuggingFace course my whole life.",
        "I hate this so much!",
    ]
)

No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.



[{'label': 'POSITIVE', 'score': 0.9598048329353333},
 {'label': 'NEGATIVE', 'score': 0.9994558691978455}]

## Preprocessing with a Tokenizer

Transformer models, like other neural networks, cannot directly process raw text. Hence, the initial step in our pipeline involves converting text inputs into numerical representations that the model can comprehend. This process is handled by a tokenizer, responsible for:

1. Breaking the input into tokens (words, subwords, or symbols like punctuation)
2. Mapping each token to an integer
3. Adding supplementary inputs beneficial to the model

This preprocessing must mirror the same process applied during the model's pretraining. Therefore, we need to retrieve this information from the Model Hub. The `AutoTokenizer` class and its `from_pretrained()` method help in this task. By using the checkpoint name of our model, it automatically fetches and caches the associated data.


In [3]:
from transformers import AutoTokenizer

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

## Tokenizing and Preparing Input

Once we have the tokenizer, we can directly pass our sentences to it and receive a dictionary ready to feed into our model! The final step is converting the list of input IDs to tensors.

While 🤗 Transformers abstracts the underlying ML framework (PyTorch, TensorFlow, or Flax) from the user, Transformer models only accept tensors as input. You can think of a tensor as a NumPy array: it can be a scalar (0D), a vector (1D), a matrix (2D), or have more dimensions. Tensors in the other ML frameworks behave similarly and are usually as simple to create as NumPy arrays.

To specify the type of tensors we wish to obtain (PyTorch, TensorFlow, or plain NumPy), we utilize the `return_tensors` argument:


In [4]:
raw_inputs = [
    "I've been waiting for a HuggingFace course my whole life.",
    "I hate this so much!",
]
inputs = tokenizer(raw_inputs, padding=True, truncation=True, return_tensors="pt")
print(inputs)

{'input_ids': tensor([[  101,  1045,  1005,  2310,  2042,  3403,  2005,  1037, 17662, 12172,
          2607,  2026,  2878,  2166,  1012,   102],
        [  101,  1045,  5223,  2023,  2061,  2172,   999,   102,     0,     0,
             0,     0,     0,     0,     0,     0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]])}


## Understanding the Base Transformer Module

The `AutoModel` class loads an architecture that contains only the base Transformer module. Given some inputs, it outputs what we refer to as hidden states, also known as features. For each model input, we retrieve a high-dimensional vector that represents the Transformer model's contextual understanding of that input.


In [7]:
from transformers import AutoModel

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModel.from_pretrained(checkpoint)

## Understanding High-Dimensional Vectors from the Transformer Module

The vector output from the Transformer module tends to be large and encompasses three primary dimensions:

1. **Batch size:** Represents the number of sequences processed simultaneously (e.g., 2 in our example).
2. **Sequence length:** Denotes the length of the numerical representation of the sequence (e.g., 16 in our example).
3. **Hidden size:** Signifies the vector dimension of each model input, contributing to the high-dimensional aspect of the vector.

The "high dimensional" attribute arises primarily from the hidden size, which can be notably large. For instance, smaller models commonly have a hidden size of 768, while larger models might extend to 3072 or even higher.

This structure becomes evident when we feed our preprocessed inputs to the model. The outputs of 🤗 Transformers models behave like namedtuples or dictionaries, so their elements can be accessed in three ways:

1. **Attributes:** Elements can be accessed using attributes.
   - Example: `outputs.last_hidden_state`

2. **Keys:** Access elements using keys as in dictionaries.
   - Example: `outputs["last_hidden_state"]`

3. **Indices:** If you know the position of the element you need, you can access it by index.
   - Example: `outputs[0]`


In [8]:
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)

torch.Size([2, 16, 768])


## Understanding Model Heads: Transforming Hidden States

Model heads receive the high-dimensional vector of hidden states as input and project them onto a different dimension. Typically, these heads consist of one or a few linear layers, and they process the output of the Transformer model directly.

The embeddings layer converts each tokenized input ID into a corresponding vector, while subsequent layers manipulate these vectors using the attention mechanism to generate the final sentence representations.

🤗 Transformers offers various architectures tailored for specific tasks, including but not limited to:

- `Model` (retrieves hidden states)
- `ForCausalLM`
- `ForMaskedLM`
- `ForMultipleChoice`
- `ForQuestionAnswering`
- `ForSequenceClassification`
- `ForTokenClassification`
- and others 🤗

For our example, we require a model equipped with a sequence classification head (for classifying sentences as positive or negative). Thus, we'll utilize not the `AutoModel` class, but `AutoModelForSequenceClassification`.


In [9]:
from transformers import AutoModelForSequenceClassification

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
outputs = model(**inputs)

When examining the shape of our outputs, we'll notice a significant reduction in dimensionality. The model head, operating on the previously mentioned high-dimensional vectors, outputs vectors containing only two values, representing each label. As we have two sentences and two labels, the resulting shape from our model is 2 x 2.


In [10]:
print(outputs.logits.shape)

torch.Size([2, 2])


## Postprocessing Model Output

The values obtained from our model are not probabilities but logits: the raw, unnormalized scores output by the last layer. To convert these logits into probabilities, they must pass through a SoftMax layer. All 🤗 Transformers models output logits, because the loss function used for training generally fuses the last activation function (such as SoftMax) with the actual loss function (such as cross-entropy).

In [11]:
print(outputs.logits)

tensor([[-1.5607,  1.6123],
        [ 4.1692, -3.3464]], grad_fn=<AddmmBackward0>)


In [12]:
import torch

predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
print(predictions)

tensor([[4.0195e-02, 9.5980e-01],
        [9.9946e-01, 5.4418e-04]], grad_fn=<SoftmaxBackward0>)


Before SoftMax, the model output the following logits:

- First sentence: [-1.5607, 1.6123]
- Second sentence: [4.1692, -3.3464]

After applying SoftMax, the model predicted:

- First sentence: [0.0402, 0.9598]
- Second sentence: [0.9995, 0.0005]

These are recognizable probability scores: each row is non-negative and sums to 1.
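To make the SoftMax step concrete, here is a minimal pure-Python sketch that recomputes the probabilities from the logits printed above. The `softmax` helper is written for this illustration only; in the notebook itself, `torch.nn.functional.softmax` does the same job on tensors.

```python
import math

def softmax(logits):
    """Turn a list of raw scores into probabilities that sum to 1."""
    # Subtracting the maximum keeps exp() from overflowing and does not
    # change the result.
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Logits printed by the model above, one row per sentence
logits = [[-1.5607, 1.6123], [4.1692, -3.3464]]
probabilities = [softmax(row) for row in logits]

for row in probabilities:
    print([round(p, 4) for p in row])
```

Because the logits were copied with four decimals, the recomputed values agree with the tensor output only up to rounding.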




In [13]:
model.config.id2label

{0: 'NEGATIVE', 1: 'POSITIVE'}

To map each position to a label, we can inspect the `id2label` attribute of the model config, shown above (more on this in the next section). We can now conclude that the model predicted:

- First sentence: NEGATIVE: 0.0402, POSITIVE: 0.9598
- Second sentence: NEGATIVE: 0.9995, POSITIVE: 0.0005
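As a rough sketch of this last postprocessing step, the snippet below picks the most probable label for each sentence. The probabilities and the mapping are copied from the outputs above; `top_label` is a hypothetical helper, not part of the library.

```python
# Probabilities and label mapping copied from the outputs above
probabilities = [[0.0402, 0.9598], [0.9995, 0.0005]]
id2label = {0: "NEGATIVE", 1: "POSITIVE"}

def top_label(row, id2label):
    """Return the most probable label for one sentence and its score."""
    best_index = max(range(len(row)), key=lambda i: row[i])
    return {"label": id2label[best_index], "score": row[best_index]}

results = [top_label(row, id2label) for row in probabilities]
print(results)
```

The resulting list has the same shape as what the `pipeline` call returned at the start of the notebook: one dictionary per sentence with a `label` and a `score`.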

We have now reproduced the three steps of the pipeline: preprocessing with a tokenizer, passing the inputs through the model, and postprocessing. Let's look at each of these steps in more detail.

# Models (PyTorch)

## Understanding Model Creation and Usage

To initialize a BERT model, the initial step is loading a configuration object, which includes various attributes used in building the model:

- `hidden_size`: Defines the size of the hidden_states vector.
- `num_hidden_layers`: Specifies the number of layers within the Transformer model.

Different loading methods are available for initializing Transformer models. Creating a model from the default configuration initializes it with random values, requiring training before meaningful inference.



In [14]:
from transformers import BertConfig, BertModel

# Building the config
config = BertConfig()

# Building the model from the config
model = BertModel(config)

In [15]:
print(config)

BertConfig {
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "transformers_version": "4.35.2",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 30522
}
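To get a feel for what these configuration values imply, the sketch below estimates how many weights a model built from this config contains. It assumes the usual BERT layout (three embedding tables, four attention projections and a two-layer feed-forward block per encoder layer, and a final pooler), so treat the breakdown as an approximation of the architecture rather than a description of the library's internals.

```python
# Values copied from the printed BertConfig above
hidden_size = 768
intermediate_size = 3072
num_hidden_layers = 12
vocab_size = 30522
max_position_embeddings = 512
type_vocab_size = 2

def linear(n_in, n_out):
    """Parameters of a dense layer: weight matrix plus bias vector."""
    return n_in * n_out + n_out

layer_norm = 2 * hidden_size  # one scale and one shift vector

# Word, position, and token-type embedding tables, then a layer norm
embeddings = (
    (vocab_size + max_position_embeddings + type_vocab_size) * hidden_size
    + layer_norm
)
encoder_layer = (
    4 * linear(hidden_size, hidden_size)      # query, key, value, output projections
    + layer_norm
    + linear(hidden_size, intermediate_size)  # feed-forward expansion
    + linear(intermediate_size, hidden_size)  # feed-forward projection
    + layer_norm
)
pooler = linear(hidden_size, hidden_size)

total_parameters = embeddings + num_hidden_layers * encoder_layer + pooler
print(f"{total_parameters:,}")
```

Under these assumptions the total is roughly 110 million, most of it in the twelve encoder layers. Note that `num_attention_heads` does not appear: splitting the projections into heads changes how they are used, not how many weights they hold.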



In [16]:
from transformers import BertConfig, BertModel

config = BertConfig()
model = BertModel(config)

# Model is randomly initialized!

Loading a pre-trained Transformer model is achieved through the `from_pretrained()` method, simplifying the process of sharing and reusing trained models. Utilizing the `AutoModel` class ensures checkpoint-agnostic code, facilitating compatibility across different architectures trained for similar tasks.

The `bert-base-cased` identifier, for instance, represents a BERT checkpoint trained by the authors of BERT, initializing the model with pre-existing weights for inference or fine-tuning on new tasks.




In [17]:
from transformers import BertModel

model = BertModel.from_pretrained("bert-base-cased")


Saving a model can be done using the `save_pretrained()` method, which writes two files: `config.json` (the model architecture and metadata) and a weights file. Recent versions of 🤗 Transformers write the weights as `model.safetensors` by default, while older versions write `pytorch_model.bin`.



In [18]:
model.save_pretrained("directory_on_my_computer")

## Using a Transformer model for inference
Using a Transformer model for inference involves tokenizing inputs into vocabulary indices, termed as input IDs, before processing them through the model. Tokenizers handle input conversion to the appropriate tensors, crucial for the model's understanding.

For instance, sequences are converted to input IDs by tokenizers, resulting in lists of encoded sequences, which can be converted to tensors, enabling compatibility with the model's input requirements.



In [19]:
sequences = ["Hello!", "Cool.", "Nice!"]

In [20]:
encoded_sequences = [
    [101, 7592, 999, 102],
    [101, 4658, 1012, 102],
    [101, 3835, 999, 102],
]

In [21]:
import torch

model_inputs = torch.tensor(encoded_sequences)

### Using the tensors as inputs to the model
The process of using tensors as inputs to the model is straightforward: call the model with the inputs. While the model accepts many arguments, only the input IDs are required. What the other arguments do, and when they are needed, is explained later, after a closer look at the tokenizers that build the inputs a Transformer model understands.

In [22]:
output = model(model_inputs)

# Tokenizers (PyTorch)

## Understanding Tokenization in NLP

Tokenizers play a crucial role in the NLP pipeline by translating text into data that can be processed by models. Since models exclusively process numerical data, tokenizers are pivotal in converting text inputs into numerical formats. In typical NLP tasks, the data processed is raw text.

However, as models solely handle numbers, text-to-number conversion becomes imperative. Tokenizers step in to achieve this conversion, employing various algorithms to transform raw text into numerical representations. The primary aim is to identify the most meaningful and compact representation that aligns with the model's comprehension.



## Word-based Tokenization

In NLP, word-based tokenization is a straightforward approach, usually simple to implement and effective in yielding reasonable results. The objective is to segment raw text into words and assign a numerical representation to each word. Here's a basic breakdown:

### Tokenization Process:
- **Using Whitespace:** We can tokenize text into words by applying Python's `split()` function, using whitespace as a delimiter.
  
### Vocabulary and Word IDs:
- Each word is assigned an ID, typically starting from 0 and spanning the size of the vocabulary. These IDs are utilized by the model to differentiate words.
- **Vocabulary Size:** Covering an entire language with a word-based tokenizer demands a vast number of unique tokens. For instance, English has over 500,000 words, necessitating the management of a substantial number of IDs.
- **Word Variations:** Words like "dog" and "dogs" might be represented differently, leading the model to initially treat them as distinct. Similar variations such as "run" and "running" also pose similar challenges initially.
  
### Handling Unknown Tokens:
- **Unknown Token ([UNK]):** Words not in the vocabulary are represented using a custom token such as `"[UNK]"` or `"<unk>"`. An abundance of these tokens during tokenization suggests the tokenizer struggles to represent certain words adequately, resulting in information loss.
- **Optimizing Vocabulary:** The goal is to construct the vocabulary in a manner that minimizes the usage of the unknown token, thereby retaining as much information as possible.



In [24]:
tokenized_text = "Jim Henson was a puppeteer".split()
print(tokenized_text)

['Jim', 'Henson', 'was', 'a', 'puppeteer']
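Building on the `split()` call, the following sketch shows how a word-based tokenizer could assign IDs and fall back to an unknown token. The two-sentence corpus and the helper names `build_vocab` and `encode` are made up for this illustration.

```python
def build_vocab(corpus):
    """Assign an ID to every distinct whitespace-separated word."""
    vocab = {"[UNK]": 0}  # reserve ID 0 for out-of-vocabulary words
    for sentence in corpus:
        for word in sentence.split():
            if word not in vocab:
                vocab[word] = len(vocab)
    return vocab

def encode(text, vocab):
    """Map each word to its ID, falling back to the [UNK] ID."""
    return [vocab.get(word, vocab["[UNK]"]) for word in text.split()]

corpus = ["Jim Henson was a puppeteer", "a dog was barking"]
vocab = build_vocab(corpus)

known = encode("Jim was a puppeteer", vocab)
unknown = encode("Jim had dogs", vocab)
print(known)
print(unknown)
```

In the second sentence both "had" and "dogs" collapse onto the same unknown ID, even though "dog" is in the vocabulary. This is the information loss described above.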


## Character-based Tokenization

Character-based tokenizers segment text into individual characters rather than words. This method presents several advantages:

### Advantages:
1. **Smaller Vocabulary:** The vocabulary size is significantly reduced.
2. **Minimized Unknown Tokens:** As every word can be constructed from characters, there are fewer out-of-vocabulary tokens.

However, some considerations arise regarding spaces and punctuation:

### Challenges:
- **Representation:** Character-based tokenization may seem less meaningful as each character lacks the context and semantic value that words possess. However, this varies across languages; for instance, Chinese characters often carry more inherent meaning.
- **Token Count:** The number of tokens increases considerably compared to word-based tokenization. A single word could result in multiple tokens when represented by characters.
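The trade-off can be seen on a tiny example. In this sketch the character vocabulary is built from a single sentence, and the word "parent" is an arbitrary choice that never occurs in it as a word.

```python
corpus = "Jim Henson was a puppeteer"

# Character vocabulary: one ID per distinct character seen in the corpus
char_vocab = {ch: i for i, ch in enumerate(sorted(set(corpus)))}
word_vocab = set(corpus.split())

new_word = "parent"  # never appears in the corpus as a word

in_word_vocab = new_word in word_vocab
char_ids = [char_vocab[ch] for ch in new_word]

print("known as a word:", in_word_vocab)
print("character IDs:", char_ids)
print("tokens needed:", len(char_ids), "instead of 1")
```

A word-based tokenizer trained on this corpus would map "parent" to the unknown token, whereas the character vocabulary can spell it out, at the cost of six tokens for a single word.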

### Balancing Approaches: Subword Tokenization
To address these limitations and capitalize on the benefits of both word-based and character-based approaches, a hybrid technique called subword tokenization comes into play.


## Subword Tokenization

Subword tokenization algorithms function based on the principle that commonly used words should remain intact while infrequently used words are divided into meaningful subwords.

For example, "annoyingly" might be considered rare and split into "annoying" and "ly", as these subwords appear more frequently independently. Yet, the composite meaning of "annoyingly" remains through the combination of "annoying" and "ly".

Subwords contribute rich semantic meaning; for instance, "tokenization" is segmented into "token" and "ization", providing meaningful representation with a reduced number of tokens. This allows for comprehensive coverage with smaller vocabularies and minimizes unknown tokens.

Subword tokenization is particularly beneficial in agglutinative languages like Turkish, where complex words can be formed by stringing together subwords almost limitlessly.
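One common way to apply a subword vocabulary is greedy longest-match splitting, in the spirit of WordPiece. The sketch below shows only this splitting step on a hypothetical toy vocabulary. It does not show how such a vocabulary is learned, and real tokenizers add normalization and other details.

```python
def subword_tokenize(word, vocab, unk_token="[UNK]"):
    """Split one word greedily into the longest pieces found in `vocab`.

    Pieces that continue a word carry a "##" prefix, as in BERT's vocabulary.
    """
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest candidate first, then shrink it one character at a time
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk_token]  # no piece fits: the whole word is unknown
        pieces.append(match)
        start = end
    return pieces

# Hypothetical toy vocabulary, chosen only for this illustration
toy_vocab = {"annoying", "##ly", "token", "##ization", "dog", "##s"}

for word in ["annoyingly", "tokenization", "dogs", "cat"]:
    print(word, "->", subword_tokenize(word, toy_vocab))
```

With six vocabulary entries the sketch covers "annoyingly", "tokenization", and "dogs", and only "cat", which shares no piece with the vocabulary, falls back to the unknown token.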

### Additional Techniques
Other notable techniques include:

- **Byte-level BPE:** Employed in models like GPT-2
- **WordPiece:** Utilized in BERT
- **SentencePiece or Unigram:** Found in various multilingual models

These algorithms differ in their details, but the overview above is enough to start working with the tokenizer API.


## Loading and Saving Tokenizers

The process of loading and saving tokenizers mirrors that of models. In essence, it's reliant on two familiar methods: `from_pretrained()` and `save_pretrained()`. These methods handle loading or saving the tokenizer's underlying algorithm (similar to the model's architecture) along with its vocabulary (akin to the model's weights).

For instance, loading the BERT tokenizer that belongs to the `bert-base-cased` checkpoint works the same way as loading the model, except that we use the `BertTokenizer` class. The `from_pretrained()` method instantiates the tokenizer from the checkpoint name, so the tokenizer always matches the model it was trained with.




In [25]:
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-cased")


## AutoTokenizer for Efficient Tokenization

Similar to `AutoModel`, the `AutoTokenizer` class efficiently selects the appropriate tokenizer class from the library based on the checkpoint name. It seamlessly pairs with any checkpoint:


In [26]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

In [27]:
tokenizer("Using a Transformer network is simple")

{'input_ids': [101, 7993, 170, 13809, 23763, 2443, 1110, 3014, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1]}

The AutoTokenizer offers a flexible approach to work with various checkpoints directly. Saving a tokenizer follows the same pattern as saving a model:

In [28]:
tokenizer.save_pretrained("directory_on_my_computer")

('directory_on_my_computer/tokenizer_config.json',
 'directory_on_my_computer/special_tokens_map.json',
 'directory_on_my_computer/vocab.txt',
 'directory_on_my_computer/added_tokens.json',
 'directory_on_my_computer/tokenizer.json')

## Encoding

Translating text into numerical representations involves a two-step process: tokenization followed by conversion to input IDs.

Tokenization breaks text into tokens (words, parts of words, or punctuation symbols) according to rules that vary between tokenizers. It is essential to instantiate the tokenizer using the model's name to ensure consistency with the rules used during the model's pretraining.

The second step involves converting these tokens into numbers, allowing the creation of a tensor to be fed into the model. This process relies on the tokenizer's vocabulary, which is obtained during instantiation with the `from_pretrained()` method. Consistency in using the same vocabulary as the one used during model pretraining is crucial.

Note that in practice you should call the tokenizer directly on your inputs, as demonstrated earlier. The two steps are shown separately here only to explain what happens in between.


Tokenization is performed using the `tokenize()` method of the tokenizer. The output from this method is a list of strings or tokens.

In [29]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

sequence = "Using a Transformer network is simple"
tokens = tokenizer.tokenize(sequence)

print(tokens)

['Using', 'a', 'Trans', '##former', 'network', 'is', 'simple']


This specific tokenizer is a subword tokenizer: it splits words until it obtains tokens that are present in its vocabulary. Here, "Transformer" is split into two tokens: "Trans" and "##former".

**From Tokens to Input IDs**

The conversion to input IDs is handled by the `convert_tokens_to_ids()` tokenizer method. Once these outputs are converted to the appropriate framework tensor, they can be used as inputs for a model, as demonstrated earlier.

In [30]:
ids = tokenizer.convert_tokens_to_ids(tokens)

print(ids)

[7993, 170, 13809, 23763, 2443, 1110, 3014]


## Decoding
Decoding involves the reverse operation: starting from vocabulary indices, we aim to obtain a string. This can be achieved using the `decode()` method.

The decode method not only converts the indices back to tokens but also groups together tokens that were part of the same words to generate a readable sentence. This behavior becomes particularly valuable when working with models that predict new text, be it generated text from a prompt or for sequence-to-sequence tasks like translation or summarization.


In [32]:
decoded_string = tokenizer.decode([7993, 170, 11303, 1200, 2443, 1110, 3014])
print(decoded_string)

Using a transformer network is simple
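The grouping of tokens into words can be sketched in a few lines. This is a simplified illustration for WordPiece-style tokens only. The real `decode()` method also maps IDs back to tokens and handles special tokens and spacing around punctuation, none of which is modelled here.

```python
def merge_wordpieces(tokens):
    """Join WordPiece-style tokens back into a string.

    A token starting with "##" is glued to the previous token; any other
    token starts a new word.
    """
    words = []
    for token in tokens:
        if token.startswith("##") and words:
            words[-1] += token[2:]
        else:
            words.append(token)
    return " ".join(words)

# Tokens produced by tokenizer.tokenize() earlier in this section
tokens = ["Using", "a", "Trans", "##former", "network", "is", "simple"]
decoded = merge_wordpieces(tokens)
print(decoded)
```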


# Handling multiple sequences (PyTorch)

So far we have run inference on a single, short sequence. This raises several questions:

- How do we manage multiple sequences?
- How do we handle sequences of different lengths?
- Are vocabulary indices the only inputs a model needs to work well?
- Can a sequence be too long for a model?

Let's see what problems these questions pose and how to solve them using the 🤗 Transformers API.

### Batch Inputs for Models

We witnessed the transformation of sequences into numerical lists in the previous exercise. Now, let's take these lists of numbers, convert them into a tensor, and feed them into the model:


In [35]:
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

sequence = "I've been waiting for a HuggingFace course my whole life."

tokens = tokenizer.tokenize(sequence)
ids = tokenizer.convert_tokens_to_ids(tokens)
input_ids = torch.tensor(ids)
# This line will fail.
model(input_ids)

IndexError: ignored

## Unforeseen Obstacle: Failure in Process

Oops! What went wrong here? We followed the steps of the pipeline in section 2, yet the call failed.

The problem is that we sent a single sequence to the model, whereas 🤗 Transformers models expect multiple sentences by default. We tried to reproduce by hand what the tokenizer does behind the scenes when applied to a sequence. On closer inspection, however, the tokenizer did not just convert the list of input IDs into a tensor: it added a dimension on top of it.

In [36]:
tokenized_inputs = tokenizer(sequence, return_tensors="pt")
print(tokenized_inputs["input_ids"])

tensor([[  101,  1045,  1005,  2310,  2042,  3403,  2005,  1037, 17662, 12172,
          2607,  2026,  2878,  2166,  1012,   102]])


## Second Attempt: Introducing a New Dimension

Let's give this another shot and introduce an additional dimension to the data:

In [37]:
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

sequence = "I've been waiting for a HuggingFace course my whole life."

tokens = tokenizer.tokenize(sequence)
ids = tokenizer.convert_tokens_to_ids(tokens)

input_ids = torch.tensor([ids])
print("Input IDs:", input_ids)

output = model(input_ids)
print("Logits:", output.logits)

Input IDs: tensor([[ 1045,  1005,  2310,  2042,  3403,  2005,  1037, 17662, 12172,  2607,
          2026,  2878,  2166,  1012]])
Logits: tensor([[-2.7276,  2.8789]], grad_fn=<AddmmBackward0>)


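Batching is the act of sending multiple sentences through the model at once. If we have only one sentence, we can build a batch containing a single sequence, as `torch.tensor([ids])` did above. Sentences of different lengths are harder: the following list of lists cannot be converted into a tensor, because its rows do not have the same length: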

In [38]:
batched_ids = [
    [200, 200, 200],
    [200, 200]
]

By batching, the model becomes equipped to handle multiple sentences fed into it. Incorporating multiple sequences is as straightforward as forming a batch with a single sequence. However, a subsequent concern arises. When attempting to batch together two or more sentences, the lengths might differ. Given tensors necessitate a rectangular shape, we can't directly convert the input IDs list into a tensor. To circumvent this obstacle, the typical approach involves padding the inputs.


## Applying Padding to Inputs

A list of lists with rows of different lengths, like the one above, cannot be converted into a tensor directly. Padding solves this by making the tensor rectangular: it adds a special token, the padding token, to the shorter sentences until every sentence has the same length. For instance, if there are 10 sentences with 10 words each and 1 sentence with 20 words, padding extends all of them to 20 words.

In [39]:
padding_id = 100

batched_ids = [
    [200, 200, 200],
    [200, 200, padding_id],
]

The ID for the padding token can be accessed using `tokenizer.pad_token_id`. Let's use this token and pass both individual sentences and a batched sequence through the model.


In [41]:
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

sequence1_ids = [[200, 200, 200]]
sequence2_ids = [[200, 200]]
batched_ids = [
    [200, 200, 200],
    [200, 200, tokenizer.pad_token_id],
]

print(model(torch.tensor(sequence1_ids)).logits)
print(model(torch.tensor(sequence2_ids)).logits)
print(model(torch.tensor(batched_ids)).logits)

We strongly recommend passing in an `attention_mask` since your input_ids may be padded. See https://huggingface.co/docs/transformers/troubleshooting#incorrect-output-when-padding-tokens-arent-masked.


tensor([[ 1.5694, -1.3895]], grad_fn=<AddmmBackward0>)
tensor([[ 0.5803, -0.4125]], grad_fn=<AddmmBackward0>)
tensor([[ 1.5694, -1.3895],
        [ 1.3374, -1.2163]], grad_fn=<AddmmBackward0>)



However, an issue arises with the logits in our batched predictions: the second row doesn't match the logits for the second sentence; instead, we have entirely different values!

This discrepancy emerges because Transformer models utilize attention layers that contextualize each token. These layers consider padding tokens, as they attend to every token within a sequence. To ensure consistent results between passing individual sentences of varying lengths through the model and using a padded batch with the same sentences, we need to instruct these attention layers to disregard the padding tokens. This is accomplished using an attention mask.
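A toy calculation shows why an unmasked padding token changes the result. The sketch below applies a softmax to made-up attention scores for one token, once over all three positions and once with the padding position masked out. The scores are hypothetical, and a real attention layer involves much more than this single step.

```python
import math

def attention_weights(scores, mask=None):
    """Softmax over attention scores; positions with mask 0 get weight 0."""
    if mask is None:
        mask = [1] * len(scores)
    # Masked positions are excluded by giving them a score of -infinity
    masked = [s if m == 1 else float("-inf") for s, m in zip(scores, mask)]
    peak = max(masked)
    exps = [math.exp(s - peak) for s in masked]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical scores of one token against [token, token, padding]
scores = [2.0, 1.0, 0.5]

unmasked = attention_weights(scores)
masked = attention_weights(scores, mask=[1, 1, 0])
alone = attention_weights(scores[:2])  # the same sentence without padding

print("unmasked:", [round(w, 3) for w in unmasked])
print("masked:  ", [round(w, 3) for w in masked])
print("alone:   ", [round(w, 3) for w in alone])
```

Without a mask, the padding position takes a share of the attention and shifts the weights of the real tokens. With the mask, the weights on the real tokens match those of the unpadded sentence.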


## Understanding Attention Masks

Attention masks are tensors shaped identically to the input IDs tensor, containing 0s and 1s. In this context, 1s denote tokens to be attended to, while 0s indicate tokens that should be ignored by the model's attention layers.

To illustrate, let's enhance the prior example by including an attention mask:

In [42]:
batched_ids = [
    [200, 200, 200],
    [200, 200, tokenizer.pad_token_id],
]

attention_mask = [
    [1, 1, 1],
    [1, 1, 0],
]

outputs = model(torch.tensor(batched_ids), attention_mask=torch.tensor(attention_mask))
print(outputs.logits)

tensor([[ 1.5694, -1.3895],
        [ 0.5803, -0.4125]], grad_fn=<AddmmBackward0>)


By incorporating the attention mask, we now observe consistent logits for the second sentence within the batch.

Notice that the last value of the second sequence is a padding ID, and that the corresponding value in the attention mask is 0.
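Padded IDs and their attention mask are always built together, so it can help to see both produced by one function. The helper below is a minimal sketch with a hypothetical name, `pad_batch`. The tokenizer does this for you when called with `padding=True`.

```python
def pad_batch(batch, pad_token_id):
    """Pad every sequence to the longest one and build the matching mask."""
    longest = max(len(ids) for ids in batch)
    padded, attention_mask = [], []
    for ids in batch:
        n_pad = longest - len(ids)
        padded.append(ids + [pad_token_id] * n_pad)
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return padded, attention_mask

# pad_token_id=0 matches the padding ID visible in the tokenizer output earlier
batched_ids, attention_mask = pad_batch([[200, 200, 200], [200, 200]], pad_token_id=0)
print(batched_ids)
print(attention_mask)
```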

## Dealing with Lengthy Sequences

Transformer models typically come with restrictions on the lengths of sequences they can process. These limits usually range from 512 to 1024 tokens. Exceeding these lengths might cause the models to crash. To tackle this issue, consider two strategies:

1. **Opt for a Model Supporting Longer Sequences:**
   Certain models specialize in handling longer sequences, such as Longformer or LED. They are designed to manage extended inputs beyond the standard limitations. If your task necessitates processing very long sequences, exploring these models might be beneficial.

2. **Sequence Truncation:**
   An alternative is to truncate sequences by slicing them to a chosen `max_sequence_length`:



In [44]:
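max_sequence_length = 300
# Slice the list of token IDs rather than the raw string, so that the
# limit counts tokens and not characters
ids = ids[:max_sequence_length]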

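Here `max_sequence_length` is an ordinary Python variable, not a library parameter. Slicing to it keeps the input within the model's accepted size. The slice should be applied to a list of token IDs, since slicing a raw string would count characters rather than tokens.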
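A plain slice also cuts off the closing special token. The sketch below truncates a list of input IDs while keeping its last element. It assumes the list starts and ends with special tokens (101 and 102 in the outputs above) and that `max_length` is at least 2. `truncate_ids` is a hypothetical helper; in practice the tokenizer's `truncation=True` option takes care of this.

```python
def truncate_ids(ids, max_length):
    """Keep at most `max_length` IDs, preserving the final special token.

    Assumes `ids` begins and ends with special tokens and max_length >= 2.
    """
    if len(ids) <= max_length:
        return ids
    return ids[: max_length - 1] + ids[-1:]

# Input IDs of the first example sentence, copied from the tokenizer output
ids = [101, 1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172,
       2607, 2026, 2878, 2166, 1012, 102]
truncated = truncate_ids(ids, max_length=8)
print(truncated)
```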

# Putting it all together (PyTorch)

## Leveraging High-Level Functions in 🤗 Transformers API

In the preceding sections, we delved into the intricacies of tokenizers, covering tokenization, input ID conversion, padding, truncation, and attention masks manually.

Nevertheless, as showcased in Section 2, the 🤗 Transformers API streamlines these processes with a high-level function. When we call our tokenizer directly on a sentence, the outputs are ready to be passed to our model.

In [45]:
from transformers import AutoTokenizer

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

sequence = "I've been waiting for a HuggingFace course my whole life."

model_inputs = tokenizer(sequence)


This high-level functionality abstracts away the complexities we previously managed manually, enabling a more straightforward and efficient workflow.

## Tokenizer's Power in Preparing Model Inputs

The `model_inputs` variable aggregates all essential elements necessary for a model's optimal functionality. For instance, in DistilBERT, this includes input IDs along with the attention mask. Other models accommodating additional inputs will have these components provided by the tokenizer object.

Demonstrating its versatility, this method efficiently tokenizes a single sequence:

In [48]:
sequence = "I've been waiting for a HuggingFace course my whole life."

model_inputs = tokenizer(sequence)

Moreover, it seamlessly manages multiple sequences without altering the API:

In [49]:
sequences = ["I've been waiting for a HuggingFace course my whole life.", "So have I!"]

model_inputs = tokenizer(sequences)

It can pad the sequences according to several strategies:


In [50]:
# Will pad the sequences up to the length of the longest sequence in the batch
model_inputs = tokenizer(sequences, padding="longest")

# Will pad the sequences up to the model max length
# (512 for BERT or DistilBERT)
model_inputs = tokenizer(sequences, padding="max_length")

# Will pad the sequences up to the specified max length
model_inputs = tokenizer(sequences, padding="max_length", max_length=8)
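To make the padding strategies above concrete, here is a minimal pure-Python sketch of what padding does to a batch of ID lists. It is an illustration under stated assumptions, not the library's implementation: `pad_batch` is a hypothetical helper, the token IDs are made up, and the padding ID of 0 is assumed to match BERT-style vocabularies.

```python
PAD_ID = 0  # assumption: ID of the [PAD] token in BERT-style vocabularies


def pad_batch(batch_ids, target_length=None, pad_id=PAD_ID):
    """Pad each ID list on the right and build the matching attention mask.

    target_length=None mimics padding="longest"; an integer mimics
    padding="max_length" with that max_length. Lists longer than the
    target are left untouched, because padding alone never truncates.
    """
    if target_length is None:
        target_length = max(len(ids) for ids in batch_ids)
    input_ids, attention_mask = [], []
    for ids in batch_ids:
        n_pad = max(0, target_length - len(ids))
        input_ids.append(ids + [pad_id] * n_pad)
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Illustrative (made-up) token IDs for a longer and a shorter sequence.
batch = [[101, 11, 12, 13, 14, 102], [101, 21, 102]]

longest = pad_batch(batch)
print(longest["input_ids"])
print(longest["attention_mask"])

# Padding to a fixed length of 4 does not shorten the 6-token sequence.
fixed = pad_batch(batch, target_length=4)
print([len(ids) for ids in fixed["input_ids"]])
```

Note that padding to a fixed length shorter than a sequence leaves that sequence as it is, so a fixed `max_length` is usually combined with truncation when every row must have the same size.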


Additionally, it adeptly truncates sequences:




In [51]:
sequences = ["I've been waiting for a HuggingFace course my whole life.", "So have I!"]

# Will truncate the sequences that are longer than the model max length
# (512 for BERT or DistilBERT)
model_inputs = tokenizer(sequences, truncation=True)

# Will truncate the sequences that are longer than the specified max length
model_inputs = tokenizer(sequences, max_length=8, truncation=True)


The tokenizer object can also convert its output to framework-specific tensors. The `return_tensors` argument selects PyTorch tensors (`"pt"`), TensorFlow tensors (`"tf"`), or NumPy arrays (`"np"`):

In [52]:
sequences = ["I've been waiting for a HuggingFace course my whole life.", "So have I!"]

# Returns PyTorch tensors
model_inputs = tokenizer(sequences, padding=True, return_tensors="pt")

# Returns TensorFlow tensors
model_inputs = tokenizer(sequences, padding=True, return_tensors="tf")

# Returns NumPy arrays
model_inputs = tokenizer(sequences, padding=True, return_tensors="np")
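The calls above pass `padding=True` because a tensor must be rectangular: every row needs the same number of elements. The sketch below illustrates that requirement on plain lists; `is_rectangular` is a hypothetical helper and the IDs are made up for illustration.

```python
def is_rectangular(batch_ids):
    """Return True when every row of the batch has the same length."""
    return len({len(ids) for ids in batch_ids}) <= 1


# Made-up IDs: the same two sequences before and after padding with 0.
ragged = [[101, 11, 12, 102], [101, 21, 102]]
padded = [[101, 11, 12, 102], [101, 21, 102, 0]]

print(is_rectangular(ragged))  # rows differ in length, cannot form a tensor
print(is_rectangular(padded))  # rows match, can be stacked into a tensor
```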

## Special Tokens in Tokenized Sequences

If we inspect the input IDs returned by the tokenizer, we see that they differ slightly from the IDs we get by tokenizing and converting manually:

In [53]:
sequence = "I've been waiting for a HuggingFace course my whole life."

model_inputs = tokenizer(sequence)
print(model_inputs["input_ids"])

tokens = tokenizer.tokenize(sequence)
ids = tokenizer.convert_tokens_to_ids(tokens)
print(ids)

[101, 1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172, 2607, 2026, 2878, 2166, 1012, 102]
[1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172, 2607, 2026, 2878, 2166, 1012]


Upon decoding the two sequences of IDs mentioned above, the presence of special tokens becomes apparent:

In [54]:
print(tokenizer.decode(model_inputs["input_ids"]))
print(tokenizer.decode(ids))

[CLS] i've been waiting for a huggingface course my whole life. [SEP]
i've been waiting for a huggingface course my whole life.


The tokenizer has inserted the special token `[CLS]` at the start of the sequence and `[SEP]` at the end. The model was pretrained with these tokens, so they must also be present at inference time to get consistent results. Note that some models add no special tokens, or add different ones, and some add them only at the start or only at the end. In every case, the tokenizer knows which tokens are expected and handles them automatically.
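The relationship between the two ID lists printed above can be reproduced by hand. This is a minimal sketch, assuming the `[CLS]` and `[SEP]` IDs of 101 and 102 seen in the output above; `add_special_tokens` is a hypothetical helper, and its `max_length` handling only approximates how truncation leaves room for the special tokens.

```python
CLS_ID, SEP_ID = 101, 102  # IDs of [CLS] and [SEP] seen in the output above


def add_special_tokens(ids, max_length=None):
    """Wrap ids as [CLS] ... [SEP].

    When max_length is given, the content is cut first so that the total
    length, special tokens included, stays within max_length.
    """
    if max_length is not None:
        ids = ids[: max(0, max_length - 2)]
    return [CLS_ID] + ids + [SEP_ID]


# IDs obtained from tokenize() followed by convert_tokens_to_ids() above.
ids = [1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172, 2607, 2026, 2878, 2166, 1012]

full = add_special_tokens(ids)
print(full)

short = add_special_tokens(ids, max_length=8)
print(short)
```

In practice there is no need for such a helper, since calling the tokenizer directly already adds the right tokens for the checkpoint in use.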


## Concluding the Tokenizer's Functionality

To wrap up, the example below shows the tokenizer's main API handling multiple sequences (padding), long sequences (truncation), and tensor conversion in a single call:

In [None]:
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
sequences = ["I've been waiting for a HuggingFace course my whole life.", "So have I!"]

tokens = tokenizer(sequences, padding=True, truncation=True, return_tensors="pt")
output = model(**tokens)

This example shows the tokenizer handling several sequences in one call: it pads them to a common length, truncates any that are too long, and returns PyTorch tensors that can be passed straight to the model.
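The model returns raw logits in `output.logits` rather than probabilities, so a softmax is still needed to obtain scores like those reported by the pipeline. The sketch below shows that step in plain Python. The logit values are made up for illustration, and the label mapping is an assumption about this checkpoint's configuration; with the real model, it can be read from `model.config.id2label`.

```python
import math

# Assumption: label order used by this checkpoint's configuration.
ID2LABEL = {0: "NEGATIVE", 1: "POSITIVE"}


def softmax(logits):
    """Convert a list of raw scores into probabilities that sum to 1."""
    largest = max(logits)
    exps = [math.exp(x - largest) for x in logits]  # shift for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


# Made-up logits standing in for one row of output.logits.
example_logits = [-1.5, 1.5]
probs = softmax(example_logits)

# Pick the index with the highest probability and look up its label.
best = max(range(len(probs)), key=probs.__getitem__)
print(ID2LABEL[best], probs[best])
```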

