<a href="https://colab.research.google.com/github/Eddiebee/AI-Craft/blob/main/Handling_multiple_sequences.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Models expect a batch of inputs.

In [1]:
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)


sequence = "I've been waiting for a great HuggingFace course my whole life."

tokens = tokenizer.tokenize(sequence)
ids = tokenizer.convert_tokens_to_ids(tokens)
input_ids = torch.tensor(ids)
# This call fails: the model expects a 2-D (batch, sequence) tensor, but input_ids is 1-D.
model(input_ids)

IndexError: too many indices for tensor of dimension 1

🤗 Transformers models expect a batch of sequences, not a single sequence, but in the cell above we passed in a single sequence as a 1-D tensor.

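Also, if we look closely at the `input_ids` returned by the tokenizer, we see that a batch dimension was added: the result is a 2-D tensor with a single row. (The tokenizer also adds the special tokens `[CLS]` (101) and `[SEP]` (102), which the manual steps above omit.)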

In [3]:
tokenized_inputs = tokenizer(sequence, return_tensors="pt")
print(tokenized_inputs["input_ids"])

tensor([[  101,  1045,  1005,  2310,  2042,  3403,  2005,  1037,  2307, 17662,
         12172,  2607,  2026,  2878,  2166,  1012,   102]])
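
The difference between a single sequence and a batch of one comes down to tensor shape. The following is a minimal sketch, assuming only that `torch` is installed; the ID values are illustrative and no model is loaded.

```python
import torch

# Token IDs for one sentence as a flat Python list (illustrative values).
ids = [1045, 1005, 2310, 2042, 3403]

flat = torch.tensor(ids)        # 1-D tensor: no batch dimension
nested = torch.tensor([ids])    # 2-D tensor: a batch holding one sequence
unsqueezed = flat.unsqueeze(0)  # same 2-D result, built from the 1-D tensor

print(flat.shape)
print(nested.shape)
print(torch.equal(nested, unsqueezed))
```

Wrapping the list in another list and calling `unsqueeze(0)` give the same `(1, sequence_length)` tensor, so either should work when you prepare the IDs by hand.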


Let's try again but now with an additional dimension.

In [5]:
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

sequence = "I've been waiting for an amazing HuggingFace course my whole life."

tokens = tokenizer.tokenize(sequence)
ids = tokenizer.convert_tokens_to_ids(tokens)

input_ids = torch.tensor([ids])
print("Input IDs:", input_ids)

output = model(input_ids)
print("Logits:", output.logits)

Input IDs: tensor([[ 1045,  1005,  2310,  2042,  3403,  2005,  2019,  6429, 17662, 12172,
          2607,  2026,  2878,  2166,  1012]])
Logits: tensor([[-3.8957,  4.1832]], grad_fn=<AddmmBackward0>)


_Batching_ is the act of sending multiple sentences through the model all at once. Here we build a batch of two by duplicating the same sequence.

In [7]:
batched_ids = [ids, ids]

batched_input_ids = torch.tensor(batched_ids)
print("Input IDs:", batched_input_ids)

batched_outputs = model(batched_input_ids)
print("Logits:", batched_outputs.logits)

Input IDs: tensor([[ 1045,  1005,  2310,  2042,  3403,  2005,  2019,  6429, 17662, 12172,
          2607,  2026,  2878,  2166,  1012],
        [ 1045,  1005,  2310,  2042,  3403,  2005,  2019,  6429, 17662, 12172,
          2607,  2026,  2878,  2166,  1012]])
Logits: tensor([[-3.8957,  4.1832],
        [-3.8957,  4.1832]], grad_fn=<AddmmBackward0>)


_Padding_ the inputs solves the problem of batching sequences of different lengths, which is often the case. Remember, tensors are rectangular, so every row must have the same length; we pad the shorter sequences until they match the longest one.

The special token added by setting `padding=True` is called the _padding token_. Its ID is available as `tokenizer.pad_token_id`.
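
To make the idea concrete, here is a minimal sketch of right-padding in plain Python. The helper `pad_batch` is hypothetical (it is not part of 🤗 Transformers), and `pad_id=0` is only a stand-in for `tokenizer.pad_token_id`.

```python
def pad_batch(sequences, pad_id):
    """Right-pad every ID list with pad_id up to the longest length in the batch."""
    max_len = max(len(seq) for seq in sequences)
    return [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]


# Two ID lists of different lengths; 0 stands in for the real padding token ID.
batch = pad_batch([[200, 200, 200], [200, 200]], pad_id=0)
print(batch)  # [[200, 200, 200], [200, 200, 0]]
```

Once every row has the same length, the nested list can be passed to `torch.tensor` without a shape error.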

In [8]:
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

sequence1_ids = [[200, 200, 200]]
sequence2_ids = [[200, 200]]
batched_ids = [
    [200, 200, 200],
    [200, 200, tokenizer.pad_token_id],
]

print(model(torch.tensor(sequence1_ids)).logits)
print(model(torch.tensor(sequence2_ids)).logits)
print(model(torch.tensor(batched_ids)).logits)

We strongly recommend passing in an `attention_mask` since your input_ids may be padded. See https://huggingface.co/docs/transformers/troubleshooting#incorrect-output-when-padding-tokens-arent-masked.


tensor([[ 1.5694, -1.3895]], grad_fn=<AddmmBackward0>)
tensor([[ 0.5803, -0.4125]], grad_fn=<AddmmBackward0>)
tensor([[ 1.5694, -1.3895],
        [ 1.3374, -1.2163]], grad_fn=<AddmmBackward0>)


Looking closely, the second row of the batched output (`[1.3374, -1.2163]`) differs from the logits we got for the second sequence on its own (`[0.5803, -0.4125]`). This is because the _attention layers_ of a Transformer contextualize each token by attending to every token in the sequence, _padding tokens_ included. We therefore need a way to tell the _attention layers_ to ignore the _padding tokens_, and we do this with an _attention mask_, which masks out the positions where a _padding token_ sits.
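
The effect of padding on attention can be sketched with a toy softmax in plain Python. This is an illustration under simplifying assumptions, not the model's actual implementation: the scores are made up, and masked positions are simply given zero weight (libraries typically get the same effect by adding a very large negative number to the masked scores before the softmax).

```python
import math


def softmax(scores, mask=None):
    """Softmax over scores; positions where mask is 0 receive zero weight."""
    if mask is None:
        mask = [1] * len(scores)
    # Subtract the largest kept score for numerical stability.
    top = max(s for s, m in zip(scores, mask) if m)
    exps = [math.exp(s - top) if m else 0.0 for s, m in zip(scores, mask)]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up attention scores from one token to the tokens in its sequence.
real_scores = [2.0, 1.0]
padded_scores = real_scores + [0.5]  # the last score belongs to a padding token

unpadded = softmax(real_scores)
unmasked = softmax(padded_scores)                # padding takes some of the weight
masked = softmax(padded_scores, mask=[1, 1, 0])  # padding is ignored

print(unpadded)
print(unmasked)
print(masked)
```

Without a mask, the padding position takes a share of the attention weight, so the weights on the real tokens change. With the mask, the real tokens get the same weights as in the unpadded sequence, which is why masking restores the original logits.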

#### Attention Masks:
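Attention masks are tensors with the exact same shape as the input IDs tensor, filled with 0s and 1s: a 1 marks a token that should be attended to, and a 0 marks a token (such as padding) that the attention layers should ignore.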

In [9]:
batched_ids = [
    [200, 200, 200],
    [200, 200, tokenizer.pad_token_id],
]

attention_mask = [
    [1, 1, 1],
    [1, 1, 0],
]

outputs = model(torch.tensor(batched_ids),
                attention_mask=torch.tensor(attention_mask))
print(outputs.logits)

tensor([[ 1.5694, -1.3895],
        [ 0.5803, -0.4125]], grad_fn=<AddmmBackward0>)
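
The mask above was written by hand. As a rough sketch, it could also be derived from the unpadded sequence lengths; the helper `attention_mask_from_lengths` below is hypothetical and not part of 🤗 Transformers.

```python
def attention_mask_from_lengths(lengths):
    """Build a right-padded attention mask: 1 for real tokens, 0 for padding."""
    max_len = max(lengths)
    return [[1] * n + [0] * (max_len - n) for n in lengths]


# Lengths of the two sequences before padding.
mask = attention_mask_from_lengths([3, 2])
print(mask)  # [[1, 1, 1], [1, 1, 0]]
```

Working from lengths rather than comparing IDs against the padding ID avoids masking a real token by mistake when a tokenizer reuses another token as its padding token.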


In [10]:
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

sequence1_ids = [[200, 200, 200]]
sequence2_ids = [[200, 200]]
batched_ids = [
    [200, 200, 200],
    [200, 200, tokenizer.pad_token_id],
]

attention_mask = [
    [1, 1, 1],
    [1, 1, 0]
]

print(model(torch.tensor(sequence1_ids)).logits)
print(model(torch.tensor(sequence2_ids)).logits)
print(model(torch.tensor(batched_ids),
            attention_mask=torch.tensor(attention_mask)).logits)

tensor([[ 1.5694, -1.3895]], grad_fn=<AddmmBackward0>)
tensor([[ 0.5803, -0.4125]], grad_fn=<AddmmBackward0>)
tensor([[ 1.5694, -1.3895],
        [ 0.5803, -0.4125]], grad_fn=<AddmmBackward0>)
