## **2.4 Handling multiple sequences**

### Model expects batch inputs

In [1]:
import torch
from transformers import AutoModelForSequenceClassification,AutoTokenizer

checkpoint = 'distilbert-base-uncased-finetuned-sst-2-english'
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

sequence = 'Transformer is good for large language model'
tokens = tokenizer.tokenize(sequence)


In [2]:
tokens

['transform', '##er', 'is', 'good', 'for', 'large', 'language', 'model']

In [3]:
ids = tokenizer.convert_tokens_to_ids(tokens)
print(ids)

[10938, 2121, 2003, 2204, 2005, 2312, 2653, 2944]


In [4]:
input_ids = torch.tensor(ids)
input_ids

tensor([10938,  2121,  2003,  2204,  2005,  2312,  2653,  2944])

In [5]:
model(input_ids)

IndexError: too many indices for tensor of dimension 1

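This fails because the model expects a batch of sequences by default, while we passed only a single one. Converting the one-dimensional `input_ids` into a two-dimensional tensor (a batch of size 1) fixes it.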
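As a minimal sketch of that fix, assuming only `torch` is installed, the batch dimension can be added either by wrapping the list of IDs in another list or by calling `unsqueeze(0)` on the 1D tensor. The IDs are the ones printed by `convert_tokens_to_ids` earlier; no model is loaded here.

```python
import torch

# Token IDs printed earlier by convert_tokens_to_ids (no special tokens added).
ids = [10938, 2121, 2003, 2204, 2005, 2312, 2653, 2944]

flat = torch.tensor(ids)        # shape (8,): one sequence, no batch dimension
wrapped = torch.tensor([ids])   # shape (1, 8): a batch containing one sequence
unsqueezed = flat.unsqueeze(0)  # same result, starting from the 1D tensor

print(flat.shape, wrapped.shape, unsqueezed.shape)
```

Either 2D form can be passed to the model; the 1D form is what triggers the `IndexError`.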

In [7]:
tokenized_inputs = tokenizer(sequence,return_tensors='pt')
print(tokenized_inputs)
print(tokenized_inputs['input_ids'])

{'input_ids': tensor([[  101, 10938,  2121,  2003,  2204,  2005,  2312,  2653,  2944,   102]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}
tensor([[  101, 10938,  2121,  2003,  2204,  2005,  2312,  2653,  2944,   102]])



In [12]:
import torch
from transformers import AutoTokenizer,AutoModelForSequenceClassification

checkpoint = 'distilbert-base-uncased-finetuned-sst-2-english'
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
sequence = 'Transformer is good for large language model'
tokens = tokenizer.tokenize(sequence)
ids = tokenizer.convert_tokens_to_ids(tokens)
input_ids = torch.tensor([ids])  # the only change: the tensor is now 2D instead of 1D
print(f"Input IDs:{input_ids}")
output = model(input_ids)
print('Logits:',output.logits)

Input IDs:tensor([[10938,  2121,  2003,  2204,  2005,  2312,  2653,  2944]])
Logits: tensor([[-0.1372,  0.1444]], grad_fn=<AddmmBackward0>)


### Padding the inputs

In [16]:
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
sequence1_ids = [[200,200,200]]
sequence2_ids = [[200,200]]
batched_ids = [
    [200,200,200],
    [200,200,tokenizer.pad_token_id],
]
print(model(torch.tensor(sequence1_ids)).logits)
print(model(torch.tensor(sequence2_ids)).logits)
print(model(torch.tensor(batched_ids)).logits)

We strongly recommend passing in an `attention_mask` since your input_ids may be padded. See https://huggingface.co/docs/transformers/troubleshooting#incorrect-output-when-padding-tokens-arent-masked.


tensor([[ 1.5694, -1.3895]], grad_fn=<AddmmBackward0>)
tensor([[ 0.5803, -0.4125]], grad_fn=<AddmmBackward0>)
tensor([[ 1.5694, -1.3895],
        [ 1.3374, -1.2163]], grad_fn=<AddmmBackward0>)


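The output above is wrong: the second row of the logits for `batched_ids` differs from the logits for `sequence2_ids`. The attention layers attend to every token in the sequence, so they take the padding token into account as well. To get the same result whether we pass individual sentences of different lengths or a padded batch containing the same sentences, we need to tell the attention layers to ignore the padding tokens. This is done with an attention mask.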
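To illustrate what "ignoring" a padding token means inside an attention layer, here is a minimal pure-Python sketch of a masked softmax. It is not the library's implementation, only the idea: the `masked_softmax` helper and the score values are hypothetical, and the sketch assumes at least one position is unmasked.

```python
import math

def masked_softmax(scores, mask):
    """Softmax over `scores`, giving zero weight to positions where mask == 0."""
    # Replace masked scores with -inf so that exp() maps them to exactly 0.
    masked = [s if m == 1 else float("-inf") for s, m in zip(scores, mask)]
    peak = max(masked)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in masked]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical attention scores of one query against three keys;
# the third key stands for a padding token.
scores = [2.0, 1.0, 3.0]

unmasked = masked_softmax(scores, [1, 1, 1])  # padding competes for attention
masked = masked_softmax(scores, [1, 1, 0])    # padding gets zero weight
short = masked_softmax(scores[:2], [1, 1])    # the unpadded two-token sequence

print(unmasked)
print(masked)
print(short)
```

With the mask applied, the weights on the two real tokens match those of the unpadded sequence, which is why a masked batch can reproduce the single-sequence logits.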

### Attention masks

In [17]:
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
sequence1_ids = [[200,200,200]]
sequence2_ids = [[200,200]]
batched_ids = [
    [200,200,200],
    [200,200,tokenizer.pad_token_id],
]
attention_mask = [
    [1,1,1],
    [1,1,0]
]
print(model(torch.tensor(sequence1_ids)).logits)
print(model(torch.tensor(sequence2_ids)).logits)
print(model(torch.tensor(batched_ids),attention_mask=torch.tensor(attention_mask)).logits)

tensor([[ 1.5694, -1.3895]], grad_fn=<AddmmBackward0>)
tensor([[ 0.5803, -0.4125]], grad_fn=<AddmmBackward0>)
tensor([[ 1.5694, -1.3895],
        [ 0.5803, -0.4125]], grad_fn=<AddmmBackward0>)
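
The batch and mask above were written by hand. A minimal sketch of building both for sequences of arbitrary length could look like the following; `pad_batch` is a hypothetical helper, and `pad_id=0` is an assumption for illustration (in the notebook, `tokenizer.pad_token_id` would be passed instead).

```python
def pad_batch(sequences, pad_id):
    """Right-pad ID lists to a common length and build the matching attention mask."""
    max_len = max(len(seq) for seq in sequences)
    batched_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        batched_ids.append(list(seq) + [pad_id] * n_pad)     # real tokens, then padding
        attention_mask.append([1] * len(seq) + [0] * n_pad)  # 1 = attend, 0 = ignore
    return batched_ids, attention_mask

# pad_id=0 is assumed here only so the sketch runs without a tokenizer.
batched_ids, attention_mask = pad_batch([[200, 200, 200], [200, 200]], pad_id=0)
print(batched_ids)
print(attention_mask)
```

In practice, calling the tokenizer on a list of sentences with `padding=True` returns padded `input_ids` together with the corresponding `attention_mask`, so a helper like this is mainly useful for understanding what the tokenizer does.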
