# Handling multiple sequences

"In the previous exercise you saw how sequences get translated into lists of numbers. Let’s convert this list of numbers to a tensor and send it to the model:"

In [1]:
import tensorflow as tf
from transformers import AutoTokenizer, TFAutoModelForSequenceClassification

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = TFAutoModelForSequenceClassification.from_pretrained(checkpoint)

sequence = "I've been waiting for a HuggingFace course my whole life."

tokens = tokenizer.tokenize(sequence)
ids = tokenizer.convert_tokens_to_ids(tokens)
input_ids = tf.constant(ids)
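# The course expects this call to fail because the batch dimension is missing;
# in this run the model accepted the 1-D input (see the output below).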
model(input_ids)

All PyTorch model weights were used when initializing TFDistilBertForSequenceClassification.

All the weights of TFDistilBertForSequenceClassification were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFDistilBertForSequenceClassification for predictions without further training.


TFSequenceClassifierOutput(loss=None, logits=<tf.Tensor: shape=(1, 2), dtype=float32, numpy=array([[-2.7276218,  2.8789387]], dtype=float32)>, hidden_states=None, attentions=None)

"The problem is that we sent a single sequence to the model, whereas 🤗 Transformers models expect multiple sentences by default."

"If you look closely, you’ll see that the tokenizer didn’t just convert the list of input IDs into a tensor, it added a dimension on top of it:"

In [2]:
tokenized_inputs = tokenizer(sequence, return_tensors="tf")
print(tokenized_inputs["input_ids"])

tf.Tensor(
[[  101  1045  1005  2310  2042  3403  2005  1037 17662 12172  2607  2026
   2878  2166  1012   102]], shape=(1, 16), dtype=int32)


In [3]:
# Let’s try again and add a new dimension:
import tensorflow as tf
from transformers import AutoTokenizer, TFAutoModelForSequenceClassification

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = TFAutoModelForSequenceClassification.from_pretrained(checkpoint)

sequence = "I've been waiting for a HuggingFace course my whole life."

tokens = tokenizer.tokenize(sequence)
ids = tokenizer.convert_tokens_to_ids(tokens)

input_ids = tf.constant([ids])
print("Input IDs:", input_ids)

output = model(input_ids)
print("Logits:", output.logits)


Input IDs: tf.Tensor(
[[ 1045  1005  2310  2042  3403  2005  1037 17662 12172  2607  2026  2878
   2166  1012]], shape=(1, 14), dtype=int32)
Logits: tf.Tensor([[-2.7276218  2.8789387]], shape=(1, 2), dtype=float32)


"Batching is the act of sending multiple sentences through the model, all at once. If you only have one sentence, you can just build a batch with a single sequence:"

In [4]:
batched_ids = [ids, ids]
batched_tensor = tf.constant(batched_ids)
print("Input IDs:", batched_tensor)
output = model(batched_tensor)
print("Logits:", output.logits)

Input IDs: tf.Tensor(
[[ 1045  1005  2310  2042  3403  2005  1037 17662 12172  2607  2026  2878
   2166  1012]
 [ 1045  1005  2310  2042  3403  2005  1037 17662 12172  2607  2026  2878
   2166  1012]], shape=(2, 14), dtype=int32)
Logits: tf.Tensor(
[[-2.7276225  2.8789394]
 [-2.727621   2.878938 ]], shape=(2, 2), dtype=float32)


Every sequence in a batch must have the same length, otherwise the batch cannot be converted to a tensor. For example, this list of lists is not rectangular:

In [None]:
batched_ids = [
    [200, 200, 200],
    [200, 200]
]
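
One way to see why the list above is a problem is to check its shape by hand. The sketch below uses a hypothetical helper, `batch_shape` (not part of TensorFlow or 🤗 Transformers), that returns the rectangular shape of a list of lists or raises if the rows differ in length. This is roughly the check a tensor constructor has to make.

```python
def batch_shape(batch):
    """Return (num_sequences, sequence_length), or raise if rows differ in length."""
    lengths = {len(row) for row in batch}
    if len(lengths) > 1:
        raise ValueError(f"ragged batch: row lengths {sorted(lengths)}")
    return (len(batch), lengths.pop() if lengths else 0)


# The ragged list from the cell above: rows of length 3 and 2.
ragged = [[200, 200, 200], [200, 200]]
try:
    batch_shape(ragged)
except ValueError as err:
    ragged_error = str(err)

# Once the short row is filled up to length 3, the shape is well defined.
rectangular_shape = batch_shape([[200, 200, 200], [200, 200, 100]])
print(ragged_error)
print(rectangular_shape)
```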

They need to be padded into a rectangular shape.  
"Padding makes sure all our sentences have the same length by adding a special word called the padding token to the sentences with fewer values. "

In [5]:
padding_id = 100

batched_ids = [
    [200, 200, 200],
    [200, 200, padding_id],
]
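
The padding step can also be written as a small function instead of by hand. This is a minimal sketch, assuming right padding and a hypothetical helper name `pad_batch`. The pad ID of 100 is the same placeholder value used in the cell above, not a value read from the tokenizer.

```python
def pad_batch(batch, pad_id):
    """Right-pad every sequence with pad_id up to the longest length in the batch."""
    max_len = max((len(seq) for seq in batch), default=0)
    # Build new lists so the caller's sequences are left untouched.
    return [seq + [pad_id] * (max_len - len(seq)) for seq in batch]


padded = pad_batch([[200, 200, 200], [200, 200]], pad_id=100)
print(padded)
```

The result is rectangular, so it can be handed to `tf.constant` in the same way as the hand-written batch.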

"The padding token ID can be found in tokenizer.pad_token_id. Let’s use it and send our two sentences through the model individually and batched together:"

In [6]:
model = TFAutoModelForSequenceClassification.from_pretrained(checkpoint)

sequence1_ids = [[200, 200, 200]]
sequence2_ids = [[200, 200]]
batched_ids = [
    [200, 200, 200],
    [200, 200, tokenizer.pad_token_id],
]

print(model(tf.constant(sequence1_ids)).logits)
print(model(tf.constant(sequence2_ids)).logits)
print(model(tf.constant(batched_ids)).logits)


tf.Tensor([[ 1.5693687 -1.3894594]], shape=(1, 2), dtype=float32)
tf.Tensor([[ 0.5802988  -0.41252303]], shape=(1, 2), dtype=float32)
tf.Tensor(
[[ 1.5693679 -1.3894584]
 [ 1.3373474 -1.2163184]], shape=(2, 2), dtype=float32)


In the batched logits, you'd expect the second row of logits to be the same as the individual logits for sequence 2, but they aren't.  
"This is because the key feature of Transformer models is attention layers that contextualize each token. These will take into account the padding tokens since they attend to all of the tokens of a sequence. To get the same result when passing individual sentences of different lengths through the model or when passing a batch with the same sentences and padding applied, we need to tell those attention layers to ignore the padding tokens. This is done by using an attention mask."  
* Tell the attention layers to ignore the padding tokens to get the same logits output
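
To make the effect of padding concrete, here is a toy version of the softmax step inside an attention layer. The scores are made up, and `softmax` and `masked_softmax` are illustrative helpers, not DistilBERT's actual implementation. The sketch only assumes that masked positions are excluded before the softmax is taken.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def masked_softmax(scores, mask):
    """Softmax that gives zero weight to positions where mask is 0.

    Assumes at least one position is unmasked.
    """
    masked = [s if keep else float("-inf") for s, keep in zip(scores, mask)]
    return softmax(masked)


# Made-up attention scores of one token towards [token, token, padding].
scores = [2.0, 1.0, 0.5]

unmasked = softmax(scores)                  # the padding position gets some weight
masked = masked_softmax(scores, [1, 1, 0])  # the padding position gets zero weight
without_padding = softmax(scores[:2])       # the same sequence with no padding at all

print(unmasked)
print(masked)
print(without_padding)
```

Without the mask, the padding position takes part of the attention weight, so the weights on the real tokens shift. With the mask, the weights on the real tokens match those of the unpadded sequence.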

### Attention Mask

"Attention masks are tensors with the exact same shape as the input IDs tensor, filled with 0s and 1s: 1s indicate the corresponding tokens should be attended to, and 0s indicate the corresponding tokens should not be attended to."

In [8]:
batched_ids = [
    [200, 200, 200],
    [200, 200, tokenizer.pad_token_id],
]

attention_mask = [
    [1, 1, 1],
    [1, 1, 0],
]

outputs = model(tf.constant(batched_ids), attention_mask=tf.constant(attention_mask))
print("output logits from model: ",outputs.logits)
print(model(tf.constant(sequence1_ids)).logits)
print(model(tf.constant(sequence2_ids)).logits)

output logits from model:  tf.Tensor(
[[ 1.5693679  -1.3894584 ]
 [ 0.58030313 -0.41252705]], shape=(2, 2), dtype=float32)
tf.Tensor([[ 1.5693687 -1.3894594]], shape=(1, 2), dtype=float32)
tf.Tensor([[ 0.5802988  -0.41252303]], shape=(1, 2), dtype=float32)


In [36]:
sent1 = "I love machine learning"
sent2 = "artificial intelligence is way better"

tokens1 = tokenizer.tokenize(sent1)
ids1 = tokenizer.convert_tokens_to_ids(tokens1)

tokens2 = tokenizer.tokenize(sent2)
ids2 = tokenizer.convert_tokens_to_ids(tokens2)

batched_ids = [ids1, ids2]
print(batched_ids)

[[1045, 2293, 3698, 4083], [7976, 4454, 2003, 2126, 2488]]


In [37]:
new_ids1 = ids1.copy()
new_ids1.append(tokenizer.pad_token_id) # append padding token to make equal lengths
print(new_ids1)

[1045, 2293, 3698, 4083, 0]


In [38]:
batched_ids_new = [new_ids1, ids2]
print(batched_ids_new)

[[1045, 2293, 3698, 4083, 0], [7976, 4454, 2003, 2126, 2488]]


In [39]:
attention_mask = [
    [1, 1, 1, 1, 0],
    [1, 1, 1, 1, 1],
]

In [40]:
input_tensor = tf.constant(batched_ids_new)

outputs = model(input_tensor, attention_mask=tf.constant(attention_mask))
print("output logits from model: ",outputs.logits)

# Wrap each list in another list to add the batch dimension before calling the model.
print(model(tf.constant([ids1])).logits)
print(model(tf.constant([ids2])).logits)

output logits from model:  tf.Tensor(
[[-2.7265766  2.8925383]
 [ 3.191364  -2.592868 ]], shape=(2, 2), dtype=float32)
tf.Tensor([[-2.7265766  2.8925393]], shape=(1, 2), dtype=float32)
tf.Tensor([[ 3.1913645 -2.5928683]], shape=(1, 2), dtype=float32)
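
Both attention masks above were typed by hand. Assuming padding is always added on the right, the same masks can be derived from the lengths of the unpadded sequences. `attention_mask_for` is a hypothetical helper name, not a 🤗 Transformers function.

```python
def attention_mask_for(batch):
    """Build a 0/1 mask from unpadded sequences: 1 per real token, 0 per padding slot."""
    max_len = max((len(seq) for seq in batch), default=0)
    return [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in batch]


# The toy batch used earlier: one sequence of length 3, one of length 2.
toy_mask = attention_mask_for([[200, 200, 200], [200, 200]])

# The two tokenized sentences from the cells above (lengths 4 and 5).
sentence_mask = attention_mask_for(
    [[1045, 2293, 3698, 4083], [7976, 4454, 2003, 2126, 2488]]
)
print(toy_mask)
print(sentence_mask)
```

In practice the tokenizer can do both steps: calling it on a list of sentences with `padding=True` returns `input_ids` together with a matching `attention_mask`.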


### Longer Sequences  
Models usually have a limit on how many tokens they can be passed. Most handle up to 512 or 1024 tokens.  

Solutions:  
* Use a model which supports longer sequence lengths (and more tokens)
* Truncate your sequences  
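
On raw ID lists, truncation is just slicing. A minimal sketch, assuming truncation from the right and a hypothetical helper `truncate_batch`. It ignores special tokens, which a real tokenizer would account for when truncating.

```python
def truncate_batch(batch, max_length):
    """Keep at most max_length token IDs from the start of each sequence."""
    if max_length < 0:
        raise ValueError("max_length must be non-negative")
    return [seq[:max_length] for seq in batch]


# Sequences longer than max_length are cut; shorter ones are left as they are.
truncated = truncate_batch([[1, 2, 3, 4, 5], [6, 7]], max_length=3)
print(truncated)
```

Truncation only shortens sequences, so a truncated batch may still need padding before it can become a tensor.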

Sequence: a sentence that has been tokenized. Its sequence length is the number of tokens.

How do we handle multiple sequences?  
* Batch them  

How do we handle multiple sequences of different lengths?  
* Pad the shorter ones to a common length, and pass an attention mask so the padding is ignored  

Are vocabulary indices the only inputs that allow a model to work well?  
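* No. When a batch contains padding, the model also needs an attention mask to give the same logits as the unpadded sequences.  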

Is there such a thing as too long a sequence?  
* Yes. Truncate the sequence to the model's limit, or use a model that supports longer sequences.