<a href="https://colab.research.google.com/github/Monalisha-Roy/huggingFace/blob/main/Handling_multiple_sequences.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Handling multiple sequences (PyTorch)

In [1]:
!pip install datasets evaluate transformers[sentencepiece]


In [3]:
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

sequence = "I've been waiting for a HuggingFace course my whole life."

tokens = tokenizer.tokenize(sequence)
ids = tokenizer.convert_tokens_to_ids(tokens)
input_ids = torch.tensor(ids)
# This line will fail.
model(input_ids)

IndexError: too many indices for tensor of dimension 1

The problem is that we sent a single sequence to the model, whereas Transformers models expect a batch of sequences by default, so the input tensor needs a leading batch dimension.

In [4]:
tokenized_inputs = tokenizer(sequence, return_tensors="pt")
print(tokenized_inputs["input_ids"])

tensor([[  101,  1045,  1005,  2310,  2042,  3403,  2005,  1037, 17662, 12172,
          2607,  2026,  2878,  2166,  1012,   102]])


In [15]:
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

sequence = "I've been waiting for a HuggingFace cource my whole life."

tokens = tokenizer.tokenize(sequence)
ids = tokenizer.convert_tokens_to_ids(tokens)

input_ids = torch.tensor([ids])
print("Input IDs: ", input_ids)

output = model(input_ids)
print("Logits: ", output.logits)

Input IDs:  tensor([[ 1045,  1005,  2310,  2042,  3403,  2005,  1037, 17662, 12172,  2522,
          3126,  3401,  2026,  2878,  2166,  1012]])
Logits:  tensor([[-1.9851,  2.0809]], grad_fn=<AddmmBackward0>)


Batching is the act of sending multiple sentences through the model, all at once.

When batching two or more sentences, they may have different lengths. Tensors must be rectangular, so a list of inputs with unequal lengths can't be converted into a tensor directly.

## Padding the inputs

We'll use padding to make our tensors rectangular.
Padding ensures all our sentences have the same length by adding a special token called the **padding token** to the shorter sentences.
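
As a rough illustration of the idea, here is a minimal pure-Python sketch of right-padding. The helper name `pad_batch` is hypothetical (it is not part of Transformers), and it assumes padding is always appended on the right up to the longest sequence in the batch.

```python
def pad_batch(sequences, pad_id):
    """Right-pad every sequence with pad_id up to the longest length."""
    max_len = max(len(seq) for seq in sequences)
    return [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]


# Two sequences of different lengths become a rectangular batch.
batch = pad_batch([[200, 200, 200], [200, 200]], pad_id=100)
print(batch)  # [[200, 200, 200], [200, 200, 100]]
```

Once every row has the same length, the nested list can be passed to `torch.tensor` without a shape error.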


In [6]:
batched_ids = [
    [200, 200, 200],
    [200, 200],
]

In [7]:
padding_id = 100
batched_ids = [
    [200, 200, 200],
    [200, 200, padding_id],
]

In [8]:
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

sequence1_ids = [[200, 200, 200]]
sequence2_ids = [[200, 200]]
batched_ids = [
    [200, 200, 200],
    [200, 200, tokenizer.pad_token_id],
]

print(model(torch.tensor(sequence1_ids)).logits)
print(model(torch.tensor(sequence2_ids)).logits)
print(model(torch.tensor(batched_ids)).logits)

We strongly recommend passing in an `attention_mask` since your input_ids may be padded. See https://huggingface.co/docs/transformers/troubleshooting#incorrect-output-when-padding-tokens-arent-masked.


tensor([[ 1.5694, -1.3895]], grad_fn=<AddmmBackward0>)
tensor([[ 0.5803, -0.4125]], grad_fn=<AddmmBackward0>)
tensor([[ 1.5694, -1.3895],
        [ 1.3374, -1.2163]], grad_fn=<AddmmBackward0>)


There’s something wrong with the logits in our batched predictions: the second row should be the same as the logits for the second sentence, but we’ve got completely different values!

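This is because the key feature of Transformer models is attention layers that contextualize each token. These layers attend to all of the tokens of a sequence, so they take the padding tokens into account as well. To get the same result for a sentence with or without padding, we need to tell the attention layers to ignore the padding tokens.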

## Attention masks

Attention masks are tensors with the exact same shape as the input IDs tensor, filled with 0s and 1s: 1s indicate the corresponding tokens should be attended to, and 0s indicate the corresponding tokens should not be attended to (i.e., they should be ignored by the attention layers of the model).
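
To see what "should not be attended to" can mean numerically, here is a simplified pure-Python sketch of a masked softmax. It is an illustration under stated assumptions, not the model's actual implementation: the scores are made up, and `masked_softmax` is a hypothetical helper that assumes at least one position in the mask is 1.

```python
import math


def masked_softmax(scores, mask):
    """Softmax over scores, giving zero weight wherever mask is 0."""
    # Masked positions get -inf, so exp() turns them into exactly 0.
    filled = [s if m == 1 else float("-inf") for s, m in zip(scores, mask)]
    peak = max(filled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in filled]
    total = sum(exps)
    return [e / total for e in exps]


scores = [2.0, 1.0, 0.5]  # made-up scores; the last position is padding

unmasked = masked_softmax(scores, [1, 1, 1])
masked = masked_softmax(scores, [1, 1, 0])
print(unmasked)  # the padding position takes some of the weight
print(masked)    # the padding position gets weight 0
```

With the mask applied, the weights on the two real tokens are the same as if the padding position had never been there, which is why masked and unpadded inputs can produce matching logits.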

In [9]:
batched_ids = [
    [200, 200, 200],
    [200, 200, tokenizer.pad_token_id],
]

attention_mask = [
    [1, 1, 1],
    [1, 1, 0]
]

outputs = model(torch.tensor(batched_ids), attention_mask=torch.tensor(attention_mask))

print(outputs.logits)

tensor([[ 1.5694, -1.3895],
        [ 0.5803, -0.4125]], grad_fn=<AddmmBackward0>)
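
The mask in the cell above was written by hand. One possible way to derive it from the original sequence lengths is sketched below; `build_attention_mask` is a hypothetical helper, and it assumes right-padding.

```python
def build_attention_mask(lengths, padded_len):
    """Return one row per sequence: 1 for real tokens, 0 for padding."""
    return [[1] * n + [0] * (padded_len - n) for n in lengths]


# Sequence lengths 3 and 2, padded to length 3 as in the batch above.
mask = build_attention_mask([3, 2], padded_len=3)
print(mask)  # [[1, 1, 1], [1, 1, 0]]
```

Working from lengths rather than comparing each ID against the padding ID keeps the mask independent of which ID the tokenizer uses for padding.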


In [10]:
max_sequence_length = tokenizer.model_max_length  # maximum input length reported by the tokenizer
sequence = sequence[:max_sequence_length]  # keeps only the first max_sequence_length items

In [16]:
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

sequence1 = "I've been waiting for a course on HuggingFace my whole life."
sequence2 = "I hate this so much."

token1 = tokenizer.tokenize(sequence1)
token2 = tokenizer.tokenize(sequence2)

id1 = tokenizer.convert_tokens_to_ids(token1)
id2 = tokenizer.convert_tokens_to_ids(token2)

input_id1 = torch.tensor([id1])
print("Input ID1: ", input_id1)
input_id2 = torch.tensor([id2])
print("Input ID2: ", input_id2)

output1 = model(input_id1)
print("Logits1: ", output1.logits)
output2 = model(input_id2)
print("Logits2: ", output2.logits)

Input ID1:  tensor([[ 1045,  1005,  2310,  2042,  3403,  2005,  1037,  2607,  2006, 17662,
         12172,  2026,  2878,  2166,  1012]])
Input ID2:  tensor([[1045, 5223, 2023, 2061, 2172, 1012]])
Logits1:  tensor([[-1.5003,  1.6616]], grad_fn=<AddmmBackward0>)
Logits2:  tensor([[ 3.1249, -2.6450]], grad_fn=<AddmmBackward0>)


In [24]:
print(input_id1, input_id2)
input_id1.shape, input_id2.shape

tensor([[ 1045,  1005,  2310,  2042,  3403,  2005,  1037,  2607,  2006, 17662,
         12172,  2026,  2878,  2166,  1012]]) tensor([[1045, 5223, 2023, 2061, 2172, 1012]])


(torch.Size([1, 15]), torch.Size([1, 6]))

In [29]:
input_id1 = [[1045, 1005, 2310, 2042, 3403, 2005, 1037, 2607, 2006, 17662,
              12172, 2026, 2878, 2166, 1012]]

input_id2 = [[1045, 5223, 2023, 2061, 2172, 1012]]

# Pad the shorter sequence with pad tokens so both rows have length 15.
batched_ids = [
    input_id1[0],
    input_id2[0] + [tokenizer.pad_token_id] * (len(input_id1[0]) - len(input_id2[0])),
]

attention_mask = [
    [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
    [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
]

print("Logits1: ", output1.logits)
print("Logits2: ", output2.logits)
outputs = model(torch.tensor(batched_ids), attention_mask=torch.tensor(attention_mask))
print(outputs.logits)

Logits1:  tensor([[-1.5003,  1.6616]], grad_fn=<AddmmBackward0>)
Logits2:  tensor([[ 3.1249, -2.6450]], grad_fn=<AddmmBackward0>)
tensor([[-1.5003,  1.6616],
        [ 3.1249, -2.6450]], grad_fn=<AddmmBackward0>)


An enhanced version of the above code, in which the tokenizer pads the batch and builds the attention mask:

In [35]:
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Load tokenizer and model
checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

# Input sequences
sequences = [
    "I've been waiting for a course on HuggingFace my whole life.",
    "I hate this so much."
]

# Tokenize the batch without adding special tokens
tokenized_inputs = tokenizer(
    sequences,
    add_special_tokens=False,  # Disable [CLS] and [SEP]
    padding='max_length',      # Let the tokenizer pad every sequence to max_length
    max_length=15,             # Length of the longest sequence in this batch
    truncation=True,
    return_tensors='pt'
)

# Extract input IDs and attention masks
input_ids = tokenized_inputs['input_ids']
attention_mask = tokenized_inputs['attention_mask']

# Display input details
print("Input IDs:\n", input_ids)
print("Attention Mask:\n", attention_mask)

# Pass inputs through the model
outputs = model(input_ids, attention_mask=attention_mask)

# Extract logits
logits = outputs.logits
print("Logits:\n", logits)


Input IDs:
 tensor([[ 1045,  1005,  2310,  2042,  3403,  2005,  1037,  2607,  2006, 17662,
         12172,  2026,  2878,  2166,  1012],
        [ 1045,  5223,  2023,  2061,  2172,  1012,     0,     0,     0,     0,
             0,     0,     0,     0,     0]])
Attention Mask:
 tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]])
Logits:
 tensor([[-1.5003,  1.6616],
        [ 3.1249, -2.6450]], grad_fn=<AddmmBackward0>)


In [34]:
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Load tokenizer and model
checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

# Original sequences
sequence1 = "I've been waiting for a course on HuggingFace my whole life."
sequence2 = "I hate this so much."

# Tokenize manually without special tokens
token1 = tokenizer.tokenize(sequence1)
token2 = tokenizer.tokenize(sequence2)

# Convert tokens to IDs
id1 = tokenizer.convert_tokens_to_ids(token1)
id2 = tokenizer.convert_tokens_to_ids(token2)

# Pad manually
batched_ids = [
    id1 + [tokenizer.pad_token_id] * (15 - len(id1)),  # Manually pad sequence 1
    id2 + [tokenizer.pad_token_id] * (15 - len(id2))   # Manually pad sequence 2
]

# Create attention masks
attention_mask = [
    [1] * len(id1) + [0] * (15 - len(id1)),  # Mask for sequence 1
    [1] * len(id2) + [0] * (15 - len(id2))   # Mask for sequence 2
]

# Convert to PyTorch tensors
input_ids = torch.tensor(batched_ids)
attention_mask = torch.tensor(attention_mask)

# Verify the inputs
print("Input IDs:\n", input_ids)
print("Attention Mask:\n", attention_mask)

# Get model outputs
outputs = model(input_ids, attention_mask=attention_mask)

# Extract logits
logits = outputs.logits
print("Logits:\n", logits)


Input IDs:
 tensor([[ 1045,  1005,  2310,  2042,  3403,  2005,  1037,  2607,  2006, 17662,
         12172,  2026,  2878,  2166,  1012],
        [ 1045,  5223,  2023,  2061,  2172,  1012,     0,     0,     0,     0,
             0,     0,     0,     0,     0]])
Attention Mask:
 tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]])
Logits:
 tensor([[-1.5003,  1.6616],
        [ 3.1249, -2.6450]], grad_fn=<AddmmBackward0>)
