# From Tokenization to Model Output

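Next, we walk through the full pipeline: instantiating a model, tokenizing text into tokens, converting tokens to vocabulary IDs, building the model inputs, and running the model for inference to get outputs.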

## 1. Instantiating a Model

Outline:
1. Loading a blank (untrained) model
2. Loading a pretrained model
3. Saving a model

### 1.1 Loading a Blank Model

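If we need to pretrain a model from scratch, for example a BERT-like model, we can instantiate it from a default configuration. Its weights are randomly initialized: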

In [1]:
from transformers import BertConfig, BertModel

model_cfg = BertConfig()
model = BertModel(config=model_cfg)
print(f"type of model: {type(model)}")
print(model)

type of model: <class 'transformers.models.bert.modeling_bert.BertModel'>
BertModel(
  (embeddings): BertEmbeddings(
    (word_embeddings): Embedding(30522, 768, padding_idx=0)
    (position_embeddings): Embedding(512, 768)
    (token_type_embeddings): Embedding(2, 768)
    (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (encoder): BertEncoder(
    (layer): ModuleList(
      (0-11): 12 x BertLayer(
        (attention): BertAttention(
          (self): BertSdpaSelfAttention(
            (query): Linear(in_features=768, out_features=768, bias=True)
            (key): Linear(in_features=768, out_features=768, bias=True)
            (value): Linear(in_features=768, out_features=768, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (output): BertSelfOutput(
            (dense): Linear(in_features=768, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-12, el

The printout above already describes the model architecture, but the more common practice is to inspect the model configuration:

In [4]:
print(model_cfg)

BertConfig {
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "transformers_version": "4.52.4",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 30522
}
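
As a rough cross-check, the parameter count of this default BertModel can be worked out from the configuration values alone. The sketch below assumes the standard BERT layout (embeddings followed by LayerNorm, identical encoder layers, and a pooler). The helper name `bert_param_count` is made up for illustration and is not part of transformers.

```python
def bert_param_count(vocab_size=30522, hidden_size=768, num_hidden_layers=12,
                     intermediate_size=3072, max_position_embeddings=512,
                     type_vocab_size=2):
    """Count the parameters of a BertModel from its config values (hypothetical helper)."""

    def linear(n_in, n_out):
        return n_in * n_out + n_out  # weight matrix + bias vector

    layer_norm = 2 * hidden_size  # weight + bias

    embeddings = (
        vocab_size * hidden_size                  # word_embeddings
        + max_position_embeddings * hidden_size   # position_embeddings
        + type_vocab_size * hidden_size           # token_type_embeddings
        + layer_norm
    )

    per_layer = (
        4 * linear(hidden_size, hidden_size)      # query, key, value, attention output
        + layer_norm
        + linear(hidden_size, intermediate_size)  # feed-forward up-projection
        + linear(intermediate_size, hidden_size)  # feed-forward down-projection
        + layer_norm
    )

    pooler = linear(hidden_size, hidden_size)
    return embeddings + num_hidden_layers * per_layer + pooler


total = bert_param_count()
print(f"{total:,}")  # 109,482,240
```

With a model in memory, `sum(p.numel() for p in model.parameters())` should agree with this figure.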



### 1.2 Loading a Pretrained Model

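Pretraining a BERT-like model from scratch costs a great deal of compute. A more sensible approach is to start from a pretrained model and fine-tune it on the target task.

Let's look at how to load a pretrained model.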

#### 1.2.1 Loading with AutoModel

AutoModel makes instantiation convenient; we only need to supply the correct checkpoint name:

In [6]:
from transformers import AutoModel

model = AutoModel.from_pretrained("bert-base-uncased")
print(f"type of model: {type(model)}")
# print(model)

type of model: <class 'transformers.models.bert.modeling_bert.BertModel'>


Besides AutoModel, since we already know we are working with a BERT-like model, we can also call the from_pretrained method of BertModel to load the pretrained weights:

In [7]:
from transformers import BertModel

model = BertModel.from_pretrained("bert-base-uncased")
print(f"type of model: {type(model)}")
# print(model)

type of model: <class 'transformers.models.bert.modeling_bert.BertModel'>


### 1.3 Saving a Model

Saving a model in transformers is just as simple, and the call mirrors the one used for loading:

In [8]:
model.save_pretrained("bert_base_uncased")

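Running this statement creates a subdirectory named "bert_base_uncased" in the current directory. The subdirectory contains the following two files:
- config.json
- model.safetensors

The former stores the model architecture configuration; the latter stores the model weights.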

In [9]:
!ls bert_base_uncased

config.json       model.safetensors
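
A saved directory can be passed back to from_pretrained in place of a checkpoint name. The sketch below round-trips a deliberately tiny, randomly initialized model through a temporary directory so that nothing is downloaded. The small config values are arbitrary choices for illustration, not a real checkpoint.

```python
import os
import tempfile

import torch
from transformers import BertConfig, BertModel

# Tiny, randomly initialized model (illustrative sizes only).
tiny_cfg = BertConfig(
    vocab_size=100,
    hidden_size=32,
    num_hidden_layers=1,
    num_attention_heads=2,
    intermediate_size=64,
)
tiny_model = BertModel(tiny_cfg)

with tempfile.TemporaryDirectory() as save_dir:
    tiny_model.save_pretrained(save_dir)
    saved_files = sorted(os.listdir(save_dir))
    print(saved_files)

    # Loading from a local directory uses the same call as loading by name.
    reloaded = BertModel.from_pretrained(save_dir)

# The reloaded weights should be identical to the ones that were saved.
same_weights = all(
    torch.equal(p, q)
    for p, q in zip(tiny_model.state_dict().values(), reloaded.state_dict().values())
)
print(same_weights)
```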


## 2. Tokenizing and Building Model Inputs

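Outline:
1. Converting a text sequence directly into model inputs
2. Tokenizing into a token sequence
3. Converting the token sequence into vocabulary IDs
4. Converting vocabulary IDs back into text
5. Handling batched inputs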

### 2.1 Converting a Text Sequence Directly into Model Inputs

In [74]:
from pprint import pprint
from transformers import AutoTokenizer

ckpt = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(ckpt)
sequence = "I've been waiting for a HuggingFace course my whole life."
model_inputs = tokenizer(sequence, return_tensors="pt")
# print(f"model_inputs shape: {model_inputs['input_ids'].shape}")
pprint(model_inputs["input_ids"])

tensor([[  101,  1045,  1005,  2310,  2042,  3403,  2005,  1037, 17662, 12172,
          2607,  2026,  2878,  2166,  1012,   102]])


### 2.2 Tokenizing into a Token Sequence

In [11]:
tokens = tokenizer.tokenize(sequence)
print(f"tokens: {tokens}")

tokens: ['i', "'", 've', 'been', 'waiting', 'for', 'a', 'hugging', '##face', 'course', 'my', 'whole', 'life', '.']
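
The split of "huggingface" into 'hugging' and '##face' comes from WordPiece's greedy longest-match-first lookup. The following is a simplified sketch of that lookup over a toy vocabulary. The function name and the vocabulary are invented for illustration, and the real tokenizer also lowercases the text and splits off punctuation before this step.

```python
def wordpiece(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of one lowercase word (simplified sketch)."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink from the right.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry the "##" prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk_token]  # no piece fits, so the whole word is unknown
        pieces.append(match)
        start = end
    return pieces


# Toy vocabulary for illustration only; the real one has 30522 entries.
toy_vocab = {"hugging", "hug", "##ging", "##face", "course", "##s"}

print(wordpiece("huggingface", toy_vocab))  # ['hugging', '##face']
print(wordpiece("courses", toy_vocab))      # ['course', '##s']
print(wordpiece("xyz", toy_vocab))          # ['[UNK]']
```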


### 2.3 Converting the Token Sequence into Vocabulary IDs

In [20]:
token_ids = tokenizer.convert_tokens_to_ids(tokens)  # list

import torch
print(f"token_ids: {torch.tensor(token_ids)}")

token_ids: tensor([ 1045,  1005,  2310,  2042,  3403,  2005,  1037, 17662, 12172,  2607,
         2026,  2878,  2166,  1012])


Note that the sequence produced by convert_tokens_to_ids differs from the one produced by calling the tokenizer directly:

In [21]:
print(f"model_input_ids' shape: {model_inputs['input_ids'].shape}")
print(f"model_input_ids: \n{model_inputs['input_ids']}")

token_ids = torch.tensor(token_ids)
print(f"token_ids' shape: {token_ids.shape}")
print(f"token_ids: \n{token_ids}")

model_input_ids' shape: torch.Size([1, 16])
model_input_ids: 
tensor([[  101,  1045,  1005,  2310,  2042,  3403,  2005,  1037, 17662, 12172,
          2607,  2026,  2878,  2166,  1012,   102]])
token_ids' shape: torch.Size([14])
token_ids: 
tensor([ 1045,  1005,  2310,  2042,  3403,  2005,  1037, 17662, 12172,  2607,
         2026,  2878,  2166,  1012])


The ID sequence produced by calling the tokenizer has one extra token at each end, two in total. We return to this when we discuss the decode method.

### 2.4 Converting Vocabulary IDs Back into Text

Now let's see what we get when we decode the sequence produced by convert_tokens_to_ids:

In [24]:
sentence1 = tokenizer.decode(token_ids)
print(f"sentence1: {sentence1}")

sentence1: i ' ve been waiting for a huggingface course my whole life.


Then compare it with decoding the result of calling the tokenizer directly:

In [26]:
sentence2 = tokenizer.decode(model_inputs["input_ids"][0])
print(f"sentence2: {sentence2}")

sentence2: [CLS] i ' ve been waiting for a huggingface course my whole life. [SEP]


As we can see, calling the tokenizer adds special tokens to the start and end of the original token ID sequence: [CLS] and [SEP].
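
The relationship between the two ID sequences can be reproduced by hand. This small sketch uses only the IDs printed in the outputs above, where 101 and 102 are the values shown at the [CLS] and [SEP] positions.

```python
# IDs copied from the convert_tokens_to_ids output above.
token_ids = [1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172,
             2607, 2026, 2878, 2166, 1012]
cls_id, sep_id = 101, 102  # IDs of [CLS] and [SEP], as seen in the tokenizer output

# Wrapping the plain IDs in [CLS] ... [SEP] gives the tokenizer's input_ids.
with_special_tokens = [cls_id] + token_ids + [sep_id]
print(with_special_tokens)
print(len(token_ids), len(with_special_tokens))  # 14 16
```

With a loaded tokenizer, the same two IDs are available as `tokenizer.cls_token_id` and `tokenizer.sep_token_id`.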

### 2.5 Handling Batched Inputs

Outline:
1. Pad
2. Truncate

#### 2.5.1 Pad

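For a batch of input sentences with different lengths, we can pad each sentence to one of the following lengths:
- the length of the longest sentence in the batch
- a length we specify ourselves
- the maximum length supported by the pretrained model

First, define a batch of sentences to feed to the model: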

In [27]:
sentences = [
    "I've been waiting for a HuggingFace course my whole life.", 
    "So have I!"
]

In [75]:
# Pad to the length of the longest sentence in the batch:
batch_inputs1 = tokenizer(sentences, padding="longest")
print(f"batch_inputs1:")
for k, v in batch_inputs1.items():
    print(f"{k}: {v}")
assert len(batch_inputs1["input_ids"][1]) == len(batch_inputs1["input_ids"][0])
print("\n")

# Pad to a length we specify:
batch_inputs2 = tokenizer(sentences, padding="max_length", max_length=20)
print(f"batch_inputs2:")
for k, v in batch_inputs2.items():
    print(f"{k}: {v}")
assert len(batch_inputs2["input_ids"][0]) == len(batch_inputs2["input_ids"][1]) == 20
print("\n")

# Pad to the maximum length supported by the pretrained model:
batch_inputs3 = tokenizer(sentences, padding="max_length")
print(f"batch_inputs3:")
for k, v in batch_inputs3.items():
    print(f"{k}: {v}")
assert len(batch_inputs3["input_ids"][0]) == len(batch_inputs3["input_ids"][1]) == model_cfg.max_position_embeddings



batch_inputs1:
input_ids: [[101, 1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172, 2607, 2026, 2878, 2166, 1012, 102], [101, 2061, 2031, 1045, 999, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]
token_type_ids: [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]
attention_mask: [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]


batch_inputs2:
input_ids: [[101, 1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172, 2607, 2026, 2878, 2166, 1012, 102, 0, 0, 0, 0], [101, 2061, 2031, 1045, 999, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]
token_type_ids: [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]
attention_mask: [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0], [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]


batch_inputs3:
input_ids: [[101, 1045, 1005, 2310, 2042, 3403, 2005, 1037, 1
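
To make the padding step concrete, here is a minimal pure-Python sketch that right-pads a batch and builds the matching attention mask. The helper `pad_batch` is hypothetical and only mimics the `padding="longest"` and `padding="max_length"` behaviour shown above; raising an error on over-long input is a choice made for this sketch, not what the library does.

```python
def pad_batch(batch_ids, pad_id=0, max_length=None):
    """Right-pad every sequence and build the matching attention mask (sketch)."""
    target = max_length if max_length is not None else max(len(ids) for ids in batch_ids)
    input_ids, attention_mask = [], []
    for ids in batch_ids:
        n_pad = target - len(ids)
        if n_pad < 0:
            raise ValueError("sequence is longer than max_length; truncate it first")
        input_ids.append(ids + [pad_id] * n_pad)
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Unpadded IDs taken from the outputs above.
batch = [
    [101, 1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172, 2607, 2026, 2878, 2166, 1012, 102],
    [101, 2061, 2031, 1045, 999, 102],
]

longest = pad_batch(batch)               # like padding="longest"
fixed = pad_batch(batch, max_length=20)  # like padding="max_length", max_length=20
print(longest["input_ids"][1])
print(fixed["attention_mask"][0])
```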

#### 2.5.2 Truncate

If the input sequence is longer than the maximum length supported by the pretrained model, it must be truncated.

In [54]:
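# Space-separated on purpose: one unbroken word beyond WordPiece's per-word character limit becomes a single [UNK].
sentence_to_truncate = "Ha " * (model_cfg.max_position_embeddings + 10)
inputs = tokenizer(sentence_to_truncate, truncation=True, return_tensors="pt")
assert inputs["input_ids"].shape[1] == model_cfg.max_position_embeddings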
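
What truncation does to a single sequence can be sketched without the library. The helper below is hypothetical and assumes truncation from the right, with two slots reserved for [CLS] and [SEP]. The placeholder ID is an arbitrary stand-in, not a real vocabulary entry.

```python
def truncate_with_special_tokens(token_ids, max_length, cls_id=101, sep_id=102):
    """Keep at most max_length IDs in total, reserving two slots for [CLS] and [SEP]."""
    kept = token_ids[: max_length - 2]
    return [cls_id] + kept + [sep_id]


max_length = 512    # max_position_embeddings from the config above
placeholder_id = 7  # arbitrary stand-in for a real token ID
long_ids = [placeholder_id] * (max_length + 10)

truncated = truncate_with_special_tokens(long_ids, max_length)
print(len(long_ids), len(truncated))  # 522 512
```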

## 3. Running the Model for Inference to Get Outputs

Outline:
1. Shape of the model input tensor
2. Attention Mask
3. Calling the model to get outputs

### 3.1 Shape of the Model Input Tensor

In [58]:
ckpt = "bert-base-uncased"
# print(f"ckpt is: {ckpt}")
model = AutoModel.from_pretrained(ckpt)
tokenizer = AutoTokenizer.from_pretrained(ckpt)
sentence = "I've been waiting for a HuggingFace course my whole life."
tokens = tokenizer.tokenize(sentence)
token_ids = tokenizer.convert_tokens_to_ids(tokens)
token_ids = torch.tensor(token_ids)
outputs = model(token_ids)

IndexError: too many indices for tensor of dimension 1

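As the IndexError shows, the model places requirements on the shape of its input tensor: input_ids must be two-dimensional, with shape (batch_size, sequence_length). The embedding dimension is not part of the input. It only appears in the output hidden states, whose shape is (batch_size, sequence_length, hidden_size).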

In [59]:
ckpt = "bert-base-uncased"
# print(f"ckpt is: {ckpt}")
model = AutoModel.from_pretrained(ckpt)
tokenizer = AutoTokenizer.from_pretrained(ckpt)
sentence = "I've been waiting for a HuggingFace course my whole life."
tokens = tokenizer.tokenize(sentence)
token_ids = tokenizer.convert_tokens_to_ids(tokens)
token_ids = torch.tensor([token_ids])  # add the batch_size dimension
outputs = model(token_ids)
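
The shapes involved can be checked without downloading a checkpoint. The sketch below uses a small, randomly initialized BertModel, so the output values are meaningless and only the shapes matter. The reduced config sizes are arbitrary assumptions for illustration.

```python
import torch
from transformers import BertConfig, BertModel

# Small random model; vocab_size keeps its default so the IDs below are valid.
tiny_cfg = BertConfig(hidden_size=32, num_hidden_layers=1,
                      num_attention_heads=2, intermediate_size=64)
tiny_model = BertModel(tiny_cfg).eval()

ids = [1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172,
       2607, 2026, 2878, 2166, 1012]

flat = torch.tensor(ids)     # shape (14,): no batch dimension
batched = flat.unsqueeze(0)  # shape (1, 14): (batch_size, sequence_length)

with torch.no_grad():
    out = tiny_model(batched)

print(flat.shape, batched.shape)
print(out.last_hidden_state.shape)  # (batch_size, sequence_length, hidden_size)
```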

### 3.2 Attention Mask

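Padding must be paired with an attention mask. The attention mechanism attends to every token in the input sequence, including the padding tokens. Padding tokens carry no real meaning, so they should be ignored when the attention distribution is computed; the attention mask does this by masking those positions out.

First, without a mask, the model's final output for a padded sequence differs from its output for the same sequence without padding (final output here means the output after a task-specific head has been added):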

In [64]:
from transformers import AutoModelForSequenceClassification

ckpt = "bert-base-uncased"
model = AutoModelForSequenceClassification.from_pretrained(ckpt)

sequence1_ids = [[200, 200, 200]]
sequence2_ids = [[200, 200]]
batched_ids = [
    [200, 200, 200],
    [200, 200, tokenizer.pad_token_id],
]

print("using sequence1_ids:\n", model(torch.tensor(sequence1_ids)).logits)
print("using sequence2_ids:\n", model(torch.tensor(sequence2_ids)).logits)
print("using batched_ids:\n", model(torch.tensor(batched_ids)).logits)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


using sequence1_ids:
 tensor([[-0.1254,  0.0164]], grad_fn=<AddmmBackward0>)
using sequence2_ids:
 tensor([[-0.2898,  0.1925]], grad_fn=<AddmmBackward0>)
using batched_ids:
 tensor([[-0.1254,  0.0164],
        [-0.1301, -0.0056]], grad_fn=<AddmmBackward0>)


The second row of the batched_ids result differs from the sequence2_ids result. This is where masking becomes necessary:

In [65]:
batched_ids = [
    [200, 200, 200],
    [200, 200, tokenizer.pad_token_id],
]

attention_mask = [
    [1, 1, 1],
    [1, 1, 0],
]

outputs = model(torch.tensor(batched_ids), attention_mask=torch.tensor(attention_mask))
print(outputs.logits)

tensor([[-0.1254,  0.0164],
        [-0.2898,  0.1925]], grad_fn=<AddmmBackward0>)


Now the second row of the batched_ids result matches the sequence2_ids result.
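
Why the mask restores the unpadded result can be seen in a toy softmax. This is a conceptual sketch in plain Python with made-up attention scores; the library's actual implementation works on tensors and may differ in detail.

```python
import math


def masked_softmax(scores, mask):
    """Softmax over scores where positions with mask == 0 get zero weight."""
    # Replacing a masked score with -inf makes exp() return 0 for that position.
    masked_scores = [s if m == 1 else float("-inf") for s, m in zip(scores, mask)]
    peak = max(masked_scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in masked_scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical raw attention scores of one query against three keys;
# the third key is a padding token.
scores = [2.0, 1.0, 0.5]

without_mask = masked_softmax(scores, [1, 1, 1])
with_mask = masked_softmax(scores, [1, 1, 0])
no_padding = masked_softmax(scores[:2], [1, 1])

print(without_mask)  # the padding position receives some attention
print(with_mask)     # the padding position receives none
print(no_padding)    # same weights as with_mask on the real tokens
```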

In fact, the attention mask is generated automatically when we call the tokenizer directly:

In [73]:
tokenizer = AutoTokenizer.from_pretrained(ckpt)
inputs = tokenizer(
    ["I've been waiting for a HuggingFace course my whole life.", "So have I!"],
    padding=True
)
pprint(inputs)

{'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
                    [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]],
 'input_ids': [[101,
                1045,
                1005,
                2310,
                2042,
                3403,
                2005,
                1037,
                17662,
                12172,
                2607,
                2026,
                2878,
                2166,
                1012,
                102],
               [101, 2061, 2031, 1045, 999, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]],
 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
                    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]}


As we can see, attention_mask is already included in the tokenizer's output.

### 3.3 Calling the Model to Get Outputs

In [68]:
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

ckpt = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(ckpt)
model = AutoModelForSequenceClassification.from_pretrained(ckpt)
sequences = ["I've been waiting for a HuggingFace course my whole life.", "So have I!"]

tokens = tokenizer(sequences, padding=True, truncation=True, return_tensors="pt")
output = model(**tokens)
pprint(output.logits)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


tensor([[-0.4455, -0.0447],
        [-0.4071, -0.1793]], grad_fn=<AddmmBackward0>)
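
Logits are unnormalized scores. A common next step is to turn them into probabilities with a softmax and take the arg max as the predicted class. The sketch below does this in plain Python on the logits printed above. Because the classification head is newly initialized (see the warning), the resulting probabilities carry no real meaning here.

```python
import math


def softmax(row):
    """Turn one row of logits into probabilities that sum to 1."""
    peak = max(row)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


# Logits copied from the output above (from an untrained classification head).
logits = [[-0.4455, -0.0447],
          [-0.4071, -0.1793]]

probs = [softmax(row) for row in logits]
predicted = [row.index(max(row)) for row in probs]
print(probs)
print(predicted)  # [1, 1]
```

On tensors, `torch.softmax(output.logits, dim=-1)` performs the same normalization.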
