We explored the simplest of use cases: doing inference on a single sequence of a small length. However, some questions emerge already:

- How do we handle multiple sequences?
- How do we handle multiple sequences of different lengths?
- Are vocabulary indices the only inputs that allow a model to work well?
- Is there such a thing as too long a sequence?

## Models expect a batch of inputs
As we saw previously, sequences get translated into lists of numbers. Let’s convert this list of numbers to a tensor and send it to the model:

In [4]:
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from pprint import pprint

In [5]:
# define our model
checkpoint = "distilbert/distilbert-base-uncased-finetuned-sst-2-english"

In [6]:
# Initialize our tokenizer from the same checkpoint as the model
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

In [7]:
# Initializing our model
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

In [8]:
sequence = "I've been waiting for a HuggingFace course my whole life."

In [9]:
# Creating tokens from sequence
tokens = tokenizer.tokenize(sequence)
pprint(tokens, compact=True)

['i', "'", 've', 'been', 'waiting', 'for', 'a', 'hugging', '##face', 'course',
 'my', 'whole', 'life', '.']


In [10]:
# Converting tokens to numbers(ids)
ids = tokenizer.convert_tokens_to_ids(tokens)
pprint(ids, compact=True)

[1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172, 2607, 2026, 2878, 2166,
 1012]


In [11]:
type(ids)

list

In [12]:
# Converting ids from list to tensors
input_ids = torch.tensor(ids)
input_ids

tensor([ 1045,  1005,  2310,  2042,  3403,  2005,  1037, 17662, 12172,  2607,
         2026,  2878,  2166,  1012])

In [13]:
# Sending the 1-D tensor of IDs straight to the model
# raises an error
model(input_ids)

IndexError: too many indices for tensor of dimension 1

The problem is that we sent a single sequence to the model as a 1-D tensor, whereas Transformers models expect a batch of sequences by default, i.e. a 2-D tensor of shape (batch_size, sequence_length). Here we tried to do everything the tokenizer did behind the scenes when we applied it to a sequence.

But if you look closely, you’ll see that the tokenizer didn’t just convert the list of input IDs into a tensor; it also added a dimension on top of it.
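
As a minimal sketch of what "adding a dimension" means, the helper below reports the shape of a plain nested list. `nested_shape` is a hypothetical name used only for illustration; it is not part of PyTorch or Transformers. Wrapping `ids` in one more pair of brackets turns a one-dimensional list into the (batch_size, sequence_length) layout.

```python
def nested_shape(values):
    """Return the shape of a rectangular nested list as a tuple."""
    shape = []
    while isinstance(values, list):
        shape.append(len(values))
        values = values[0] if values else None
    return tuple(shape)


# Token IDs copied from the output above.
ids = [1045, 1005, 2310, 2042, 3403, 2005, 1037,
       17662, 12172, 2607, 2026, 2878, 2166, 1012]

print(nested_shape(ids))    # a single sequence: one dimension
print(nested_shape([ids]))  # a batch holding one sequence: two dimensions
```

The first call reports one dimension of length 14, the second a batch of size 1 containing 14 tokens.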

Let’s try again and add a new dimension:

In [25]:
# First, move the model and the tensors to the same device.
# (On my machine the kernel dies otherwise.)
# 'mps' is the Apple-silicon GPU backend; use 'cuda' or 'cpu' elsewhere.
device = torch.device('mps')

In [26]:
model = model.to(device)

In [27]:
input_ids = torch.tensor([ids], device=device)
input_ids

tensor([[ 1045,  1005,  2310,  2042,  3403,  2005,  1037, 17662, 12172,  2607,
          2026,  2878,  2166,  1012]], device='mps:0')

In [28]:
# model output
output = model(input_ids)
output

SequenceClassifierOutput(loss=None, logits=tensor([[-2.7276,  2.8789]], device='mps:0', grad_fn=<LinearBackward0>), hidden_states=None, attentions=None)

## Putting it all together

In [32]:
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

torch.mps.empty_cache()
device = torch.device('mps')

# Data
sequence = "I've been waiting for a HuggingFace course my whole life."

# Model
checkpoint = "distilbert/distilbert-base-uncased-finetuned-sst-2-english"

# Initializing model and send to device
model = AutoModelForSequenceClassification.from_pretrained(checkpoint).to(device)

# Initializing tokenizer
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

# tokens
tokens = tokenizer.tokenize(sequence)

# tokens to ids
ids = tokenizer.convert_tokens_to_ids(tokens)

# Adding a dimension to ids and converting ids from list to tensors
# Also sending it to device
input_ids = torch.tensor([ids], device=device)

# model output
output = model(input_ids)

In [33]:
output

SequenceClassifierOutput(loss=None, logits=tensor([[-2.7276,  2.8789]], device='mps:0', grad_fn=<LinearBackward0>), hidden_states=None, attentions=None)

In [34]:
# logits
logits = output.logits
logits

tensor([[-2.7276,  2.8789]], device='mps:0', grad_fn=<LinearBackward0>)
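
The logits are unnormalized scores, so they are easier to read once converted to probabilities. The sketch below applies a softmax in plain Python to the values printed above; `softmax` here is a hand-written helper for illustration. With tensors you would normally call `torch.softmax(logits, dim=-1)` instead, and `model.config.id2label` tells you which label each column stands for.

```python
import math


def softmax(logits):
    """Convert a list of logits into probabilities that sum to 1."""
    top = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Logits copied from the output above.
logits_row = [-2.7276, 2.8789]
probs = softmax(logits_row)
print(probs)
```

Almost all of the probability mass lands on the second column.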

Batching is the act of sending multiple sentences through the model, all at once. If you only have one sentence, you can build a batch with a single sequence, as we did above. Here we go one step further and repeat the same sequence twice:

In [36]:
batched_ids = [ids, ids]
pprint(batched_ids, compact=True)

[[1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172, 2607, 2026, 2878,
  2166, 1012],
 [1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172, 2607, 2026, 2878,
  2166, 1012]]


This is a batch of two identical sequences!

Let’s convert this `batched_ids` list into a tensor and pass it through our model:

In [37]:
batched_input_ids = torch.tensor(batched_ids, device=device)
batched_input_ids

tensor([[ 1045,  1005,  2310,  2042,  3403,  2005,  1037, 17662, 12172,  2607,
          2026,  2878,  2166,  1012],
        [ 1045,  1005,  2310,  2042,  3403,  2005,  1037, 17662, 12172,  2607,
          2026,  2878,  2166,  1012]], device='mps:0')

In [38]:
# batched model output
batch_output = model(batched_input_ids)
batch_output

SequenceClassifierOutput(loss=None, logits=tensor([[-2.7276,  2.8789],
        [-2.7276,  2.8789]], device='mps:0', grad_fn=<LinearBackward0>), hidden_states=None, attentions=None)

In [39]:
batch_logits = batch_output.logits
batch_logits

tensor([[-2.7276,  2.8789],
        [-2.7276,  2.8789]], device='mps:0', grad_fn=<LinearBackward0>)

We obtain the same logits as before (but twice)!

Batching allows the model to work when you feed it multiple sentences. Using multiple sequences is just as simple as building a batch with a single sequence. 

There’s a second issue, though. When you’re trying to batch together two (or more) sentences, they might be of different lengths. 

If you’ve ever worked with tensors before, you know that they need to be of rectangular shape, so you won’t be able to convert the list of input IDs into a tensor directly. 

To work around this problem, we usually pad the inputs.

## Padding the inputs
The following list of lists cannot be converted to a tensor:

In [40]:
batched_ids = [
    [200, 200, 200],
    [200, 200]
]

In order to work around this, we’ll use padding to make our tensors have a rectangular shape. 

Padding makes sure all our sentences have the same length by adding a special word called the padding token to the sentences with fewer values. 

For example, if you have 10 sentences with 10 words and 1 sentence with 20 words, padding will ensure all the sentences have 20 words. 

In our example, the resulting tensor looks like this:

In [41]:
padding_id = 100

batched_ids = [
    [200, 200, 200],
    [200, 200, padding_id],
]
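
Writing the padded rows by hand does not scale beyond toy examples. A minimal sketch of right-padding to the longest sequence could look like this; `pad_batch` is a hypothetical helper name, and the padding ID of 100 is the same placeholder value used above.

```python
def pad_batch(sequences, pad_id):
    """Right-pad every sequence with pad_id up to the longest length."""
    max_len = max(len(seq) for seq in sequences)
    return [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]


padded = pad_batch([[200, 200, 200], [200, 200]], pad_id=100)
print(padded)  # [[200, 200, 200], [200, 200, 100]]
```

Every row now has the same length, so the result can be converted to a tensor.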

The padding token ID can be found in `tokenizer.pad_token_id`. 

Let’s use it and send our two sentences through the model individually and batched together:

In [42]:
sequence1_ids = [[200, 200, 200]]
sequence2_ids = [[200, 200]]

In [44]:
tokenizer.pad_token_id

0

In [43]:
# Now we batch the sequence ids
batched_ids = [
    [200, 200, 200],
    [200, 200, tokenizer.pad_token_id],
]

In [45]:
# Converting sequences and batch to tensors and moving to device
sequence1_ids = torch.tensor(sequence1_ids, device=device)
sequence2_ids = torch.tensor(sequence2_ids, device=device)
batched_ids = torch.tensor(batched_ids, device=device)

In [47]:
# printing their logits
print(model(sequence1_ids).logits)
print(model(sequence2_ids).logits)
print(model(batched_ids).logits)

tensor([[ 1.5694, -1.3895]], device='mps:0', grad_fn=<LinearBackward0>)
tensor([[ 0.5803, -0.4125]], device='mps:0', grad_fn=<LinearBackward0>)
tensor([[ 1.5694, -1.3895],
        [ 1.3373, -1.2163]], device='mps:0', grad_fn=<LinearBackward0>)


There’s something wrong with the logits in our batched predictions: the second row should be the same as the logits for the second sentence, but we’ve got completely different values!

This is because the key feature of Transformer models is attention layers that *contextualize* each token. These will take into account the padding tokens since they attend to all of the tokens of a sequence.

To get the same result when passing individual sentences of different lengths through the model or when passing a batch with the same sentences and padding applied, we need to tell those attention layers to ignore the padding tokens. This is done by using an attention mask.
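
To see why a padding token changes the result, and why masking restores it, here is a deliberately simplified sketch: one query token, made-up attention scores, and a plain softmax. Real models use multi-head attention with learned projections, so treat `attention_weights` and the numbers below as illustrative assumptions only. Masked positions get a score of negative infinity, which the softmax turns into a weight of zero.

```python
import math


def attention_weights(scores, mask=None):
    """Softmax over scores; positions where mask == 0 receive zero weight."""
    if mask is not None:
        scores = [s if m == 1 else float("-inf") for s, m in zip(scores, mask)]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up scores of one query token against the tokens of a sequence.
unpadded = attention_weights([2.0, 1.0])
padded_unmasked = attention_weights([2.0, 1.0, 0.5])                # pad token takes part
padded_masked = attention_weights([2.0, 1.0, 0.5], mask=[1, 1, 0])  # pad token ignored

print(unpadded)
print(padded_unmasked)
print(padded_masked)
```

Without the mask, the padding position takes a share of the attention and shifts the weights of the real tokens. With the mask, the real tokens get the same weights as in the unpadded sequence.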

## Attention masks
Attention masks are tensors with the exact same shape as the input IDs tensor, filled with 0s and 1s: 1s indicate the corresponding tokens should be attended to, and 0s indicate the corresponding tokens should not be attended to (i.e., they should be ignored by the attention layers of the model).

Let’s complete the previous example with an attention mask:

In [56]:
# '0' to ignore padding token id
attention_mask = [
    [1, 1, 1],
    [1, 1, 0],
]

attention_mask = torch.tensor(attention_mask, device=device)
# batched_ids is already a tensor on the device (see above), so it is not wrapped again

In [57]:
outputs = model(batched_ids, attention_mask=attention_mask)
print(outputs.logits)

tensor([[ 1.5694, -1.3895],
        [ 0.5803, -0.4125]], device='mps:0', grad_fn=<LinearBackward0>)


In [58]:
print(model(sequence1_ids).logits)
print(model(sequence2_ids).logits)

tensor([[ 1.5694, -1.3895]], device='mps:0', grad_fn=<LinearBackward0>)
tensor([[ 0.5803, -0.4125]], device='mps:0', grad_fn=<LinearBackward0>)


Now we get the same logits for the second sentence in the batch.
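
The mask above was typed by hand. As a sketch, it can also be derived from the lengths of the unpadded sequences; `build_attention_mask` is a hypothetical helper, not a Transformers function.

```python
def build_attention_mask(sequences):
    """Return a 0/1 mask per sequence, as wide as the longest sequence."""
    max_len = max(len(seq) for seq in sequences)
    return [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences]


mask = build_attention_mask([[200, 200, 200], [200, 200]])
print(mask)  # [[1, 1, 1], [1, 1, 0]]
```

Building the mask from the original lengths, rather than by comparing IDs against the padding ID, avoids masking a real token that happens to share that ID.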

### Task

```
Apply the tokenization manually on the two sentences used in section 2 (“I’ve been waiting for a HuggingFace course my whole life.” and “I hate this so much!”). Pass them through the model and check that you get the same logits as in section 2. Now batch them together using the padding token, then create the proper attention mask. Check that you obtain the same results when going through the model!
```

### Auto Padding

In [227]:
torch.manual_seed(42)

raw_inputs = [
    "I've been waiting for a HuggingFace course my whole life.",
    "I hate this so much!",
]

checkpoint = "distilbert/distilbert-base-uncased-finetuned-sst-2-english"

model_0 = AutoModelForSequenceClassification.from_pretrained(checkpoint).to(device)

tokenizer = AutoTokenizer.from_pretrained(checkpoint)

token_results = tokenizer(
    raw_inputs, padding=True,
    return_tensors="pt"
)

token_results = token_results.to(device)

output = model_0(**token_results)
output

logits_1 = output.logits
logits_1

tensor([[-1.5607,  1.6123],
        [ 4.1692, -3.3464]], device='mps:0', grad_fn=<LinearBackward0>)

### Manual Padding

In [287]:
torch.manual_seed(42)

sentence_1 = "I’ve been waiting for a HuggingFace course my whole life."
sentence_2 = "I hate this so much!"

checkpoint = "distilbert/distilbert-base-uncased-finetuned-sst-2-english"

model_1 = AutoModelForSequenceClassification.from_pretrained(checkpoint).to(device)

# Initialize the tokenizer
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

# Tokenize, convert to IDs, then let prepare_for_model add the special tokens ([CLS], [SEP])
token_sent_1 = tokenizer.tokenize(sentence_1)
input_id_1 = tokenizer.convert_tokens_to_ids(token_sent_1)
input_id_1 = tokenizer.prepare_for_model(input_id_1)
input_id_1 = input_id_1['input_ids']
# pprint(input_id_1, compact=True)
# print(len(input_id_1))

token_sent_2 = tokenizer.tokenize(sentence_2)
input_id_2 = tokenizer.convert_tokens_to_ids(token_sent_2)
input_id_2 = tokenizer.prepare_for_model(input_id_2)
input_id_2 = input_id_2['input_ids']
# pprint(input_id_2, compact=True)
# print(len(input_id_2))

padding_range = len(input_id_1) - len(input_id_2)
# padding_range

# Applying manual padding to input_id_2
for pad in range(padding_range):
    input_id_2.append(tokenizer.pad_token_id)

# input_id_2

# batched
batched_ids = [input_id_1, input_id_2]
# pprint(batched_ids, compact=True)

# to tensor and device
batched_ids = torch.tensor(batched_ids, device=device)

# Creating attention masks: 1 for real tokens, 0 for padding.
# Compare against tokenizer.pad_token_id instead of a hard-coded 0.
attention_mask_1 = [int(token_id != tokenizer.pad_token_id) for token_id in input_id_1]
attention_mask_2 = [int(token_id != tokenizer.pad_token_id) for token_id in input_id_2]

# print(attention_mask_1)
# print(attention_mask_2)

attention_mask = [attention_mask_1, attention_mask_2]
attention_mask = torch.tensor(attention_mask, device=device)
# attention_mask

# Output logits
logits_2 = model_1(batched_ids, attention_mask=attention_mask).logits
logits_2

You're using a DistilBertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


tensor([[-1.5979,  1.6390],
        [ 4.1692, -3.3464]], device='mps:0', grad_fn=<LinearBackward0>)

In [288]:
logits_1 

tensor([[-1.5607,  1.6123],
        [ 4.1692, -3.3464]], device='mps:0', grad_fn=<LinearBackward0>)

In [191]:
token_results['input_ids']

tensor([[  101,  1045,  1005,  2310,  2042,  3403,  2005,  1037, 17662, 12172,
          2607,  2026,  2878,  2166,  1012,   102],
        [  101,  1045,  5223,  2023,  2061,  2172,   999,   102,     0,     0,
             0,     0,     0,     0,     0,     0]], device='mps:0')

In [193]:
token_results['attention_mask']

tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]], device='mps:0')

In [197]:
batched_ids

tensor([[  101,  1045,  1521,  2310,  2042,  3403,  2005,  1037, 17662, 12172,
          2607,  2026,  2878,  2166,  1012,   102],
        [  101,  1045,  5223,  2023,  2061,  2172,   999,   102,     0,     0,
             0,     0,     0,     0,     0,     0]], device='mps:0')

In [198]:
token_results['input_ids'][0] == batched_ids[0]

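tensor([ True,  True, False,  True,  True,  True,  True,  True,  True,  True,
         True,  True,  True,  True,  True,  True], device='mps:0')

The single mismatch at index 2 explains why the first row of `logits_2` differs from `logits_1`: `sentence_1` in the manual version was typed with a typographic apostrophe (’, ID 1521), whereas `raw_inputs` uses a straight one (', ID 1005). With identical text, both approaches should yield the same input IDs and therefore the same logits.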

In [199]:
token_results['input_ids'][1] == batched_ids[1]

tensor([True, True, True, True, True, True, True, True, True, True, True, True,
        True, True, True, True], device='mps:0')

In [200]:
token_results['attention_mask'][0] == attention_mask[0]

tensor([True, True, True, True, True, True, True, True, True, True, True, True,
        True, True, True, True], device='mps:0')

In [201]:
token_results['attention_mask'][1] == attention_mask[1]

tensor([True, True, True, True, True, True, True, True, True, True, True, True,
        True, True, True, True], device='mps:0')
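
Scanning rows of `True`/`False` by eye is error-prone. As a small sketch, the hypothetical helper below lists only the positions where two rows of IDs differ, using the first rows copied from the two outputs above.

```python
def find_mismatches(a, b):
    """Return (index, a_value, b_value) for every position where a and b differ."""
    return [(i, x, y) for i, (x, y) in enumerate(zip(a, b)) if x != y]


# First rows copied from token_results['input_ids'] and batched_ids above.
auto_row = [101, 1045, 1005, 2310, 2042, 3403, 2005, 1037,
            17662, 12172, 2607, 2026, 2878, 2166, 1012, 102]
manual_row = [101, 1045, 1521, 2310, 2042, 3403, 2005, 1037,
              17662, 12172, 2607, 2026, 2878, 2166, 1012, 102]

print(find_mismatches(auto_row, manual_row))  # [(2, 1005, 1521)]
```

The only difference is at index 2, the position of the apostrophe token.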

## Longer sequences
With Transformer models, there is a limit to the lengths of the sequences we can pass to the models. Most models handle sequences of up to 512 or 1024 tokens, and will crash when asked to process longer sequences. There are two solutions to this problem:

- Use a model with a longer supported sequence length.
- Truncate your sequences.
  
Models have different supported sequence lengths, and some specialize in handling very long sequences. 

`Longformer` is one example, and another is `LED`. If you’re working on a task that requires very long sequences, we recommend you take a look at those models.

Otherwise, we recommend you truncate your sequences by choosing a `max_sequence_length` and slicing:

`sequence = sequence[:max_sequence_length]`
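
One caveat with a plain slice: if the IDs already include the special tokens (101 at the start and 102 at the end in the tokenizer output above), slicing cuts off the closing one. The sketch below truncates the raw IDs first and then adds the special tokens. `truncate_with_special_tokens` is a hypothetical helper, and the default IDs are assumptions taken from the outputs for this checkpoint. In practice, passing `truncation=True` and `max_length` to the tokenizer call takes care of this for you.

```python
def truncate_with_special_tokens(ids, max_length, cls_id=101, sep_id=102):
    """Truncate raw token IDs so that [CLS] + ids + [SEP] fits in max_length."""
    if max_length < 2:
        raise ValueError("max_length must leave room for the two special tokens")
    kept = ids[: max_length - 2]
    return [cls_id] + kept + [sep_id]


# Raw token IDs (no special tokens), copied from the first example.
ids = [1045, 1005, 2310, 2042, 3403, 2005, 1037,
       17662, 12172, 2607, 2026, 2878, 2166, 1012]

short = truncate_with_special_tokens(ids, max_length=8)
print(short)  # [101, 1045, 1005, 2310, 2042, 3403, 2005, 102]
```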
