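A pipeline consists of three steps:  
preprocessing, model inference, and postprocessing.  
Preprocessing: the tokenizer converts the text into input IDs.  
Model inference: the model (a machine learning or deep learning algorithm) computes raw numeric outputs from the input IDs, usually logits, which are unnormalized scores rather than probabilities.  
Postprocessing: the logits are converted into probabilities, typically with softmax, and these give the prediction.  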
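
As a hedged illustration of the postprocessing step, the sketch below applies softmax to a pair of logits using only the standard library. The function name `softmax` and the sample values are chosen for illustration; in practice a framework function would normally do this.

```python
import math

def softmax(logits):
    """Convert raw scores (logits) into probabilities that sum to 1."""
    # Subtracting the maximum improves numerical stability
    # and does not change the result.
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

logits = [-2.7276, 2.8789]  # illustrative two-class logits
probs = softmax(logits)
print(probs)
```

The larger logit receives almost all of the probability mass, which is why the class with the highest logit is also the predicted class.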


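Getting the model inputs with the tokenizer  
The tokenizer splits the text into tokens (words or subwords) and then maps each token to an integer, called an input ID.  
The input IDs are then converted into a tensor.  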
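
The two stages can be sketched with a toy subword tokenizer. This is a minimal sketch of greedy longest-match splitting, assuming a tiny hypothetical vocabulary (`VOCAB`) with made-up IDs; the helper names `wordpiece` and `tokenize` are illustrative and are not the library's implementation.

```python
# Hypothetical toy vocabulary; real vocabularies hold tens of thousands of entries.
VOCAB = {"[UNK]": 0, "token": 1, "##izer": 2, "##s": 3, "are": 4, "simple": 5}

def wordpiece(word, vocab):
    """Greedy longest-match-first split of one word into subword tokens."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a continuation piece
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return ["[UNK]"]  # nothing matched: treat the whole word as unknown
        pieces.append(piece)
        start = end
    return pieces

def tokenize(text, vocab):
    """Split on whitespace, lowercase, then split each word into subwords."""
    tokens = []
    for word in text.lower().split():
        tokens.extend(wordpiece(word, vocab))
    return tokens

tokens = tokenize("Tokenizers are simple", VOCAB)
ids = [VOCAB[t] for t in tokens]  # token -> input ID lookup
print(tokens)
print(ids)
```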


Tokenization: performed by the tokenizer's tokenize() method, which returns a list of strings, the tokens.

In [None]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

sequence = "Using a Transformer network is simple"
tokens = tokenizer.tokenize(sequence)

print(tokens)


['Using', 'a', 'transform', '##er', 'network', 'is', 'simple']

Converting tokens to input IDs: performed by the tokenizer's convert_tokens_to_ids() method.

In [None]:
ids = tokenizer.convert_tokens_to_ids(tokens)

print(ids)



[7993, 170, 11303, 1200, 2443, 1110, 3014]

Once converted to a tensor of the appropriate framework and shape, these input IDs can be used as model input.

In [None]:
decoded_string = tokenizer.decode([7993, 170, 11303, 1200, 2443, 1110, 3014])
print(decoded_string)

'Using a Transformer network is simple'
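
The `decode()` call goes in the opposite direction: IDs are mapped back to tokens, and subword pieces are merged into words. The merging part can be sketched as follows, assuming the `##` continuation convention seen in the token list earlier; `merge_wordpieces` is a hypothetical helper, not a library function, and real decoding also handles casing, punctuation, and special tokens.

```python
def merge_wordpieces(tokens):
    """Join subword tokens into a string, attaching '##' pieces to the previous word."""
    words = []
    for tok in tokens:
        if tok.startswith("##") and words:
            words[-1] += tok[2:]  # continuation piece: glue onto the previous word
        else:
            words.append(tok)
    return " ".join(words)

text = merge_wordpieces(["token", "##izer", "##s", "are", "simple"])
print(text)
```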

Note also that the tensor passed to the model has one more dimension, the batch dimension, than the plain list of input IDs.  
For example:

In [None]:
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

sequence = "I've been waiting for a HuggingFace course my whole life."

tokens = tokenizer.tokenize(sequence)
ids = tokenizer.convert_tokens_to_ids(tokens)
input_ids = torch.tensor(ids)
# This line fails: the IDs from convert_tokens_to_ids form a 1-D tensor, but the model expects an extra batch dimension
model(input_ids)

IndexError: Dimension out of range (expected to be in range of [-1, 0], but got 1)

In [None]:
# Calling the tokenizer directly performs tokenization and ID conversion in one step
# and adds the batch dimension, so the result is a valid input tensor
tokenized_inputs = tokenizer(sequence, return_tensors="pt")
print(tokenized_inputs["input_ids"])

tensor([[  101,  1045,  1005,  2310,  2042,  3403,  2005,  1037, 17662, 12172,
          2607,  2026,  2878,  2166,  1012,   102]])

In [None]:
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification
# Fix for the failing code above: wrap the IDs in a list to add the batch dimension
checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

sequence = "I've been waiting for a HuggingFace course my whole life."

tokens = tokenizer.tokenize(sequence)
ids = tokenizer.convert_tokens_to_ids(tokens)

input_ids = torch.tensor([ids])
print("Input IDs:", input_ids)

output = model(input_ids)
print("Logits:", output.logits)


Input IDs: [[ 1045,  1005,  2310,  2042,  3403,  2005,  1037, 17662, 12172,  2607, 2026,  2878,  2166,  1012]]
Logits: [[-2.7276,  2.8789]]

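##### Padding the inputs
When processing a batch, the sentences usually differ in length (number of tokens), so their input IDs do not form a rectangular matrix and cannot be turned into a tensor. The input IDs of the shorter sentences therefore need to be padded.  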
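
Padding itself is simple to sketch in plain Python. The helper `pad_batch` below is hypothetical, and the pad ID of 0 is an assumption made for illustration; a real tokenizer exposes its own value as `tokenizer.pad_token_id`.

```python
def pad_batch(batch, pad_id):
    """Right-pad every list of IDs to the length of the longest one."""
    longest = max(len(ids) for ids in batch)
    return [ids + [pad_id] * (longest - len(ids)) for ids in batch]

PAD_ID = 0  # assumed value for this sketch
batch = [[200, 200, 200], [200, 200]]
padded = pad_batch(batch, PAD_ID)
print(padded)
```

After padding, every row has the same length, so the batch can be converted into a tensor.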
However, using padding alone causes the following problem:  

In [None]:
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

sequence1_ids = [[200, 200, 200]]
sequence2_ids = [[200, 200]]
batched_ids = [
    [200, 200, 200],
    [200, 200, tokenizer.pad_token_id],
]

print(model(torch.tensor(sequence1_ids)).logits)
print(model(torch.tensor(sequence2_ids)).logits)
print(model(torch.tensor(batched_ids)).logits)

tensor([[ 1.5694, -1.3895]], grad_fn=<AddmmBackward>)
tensor([[ 0.5803, -0.4125]], grad_fn=<AddmmBackward>)
tensor([[ 1.5694, -1.3895],
        [ 1.3373, -1.2163]], grad_fn=<AddmmBackward>)

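Padding changes the logits the model produces for the second sentence. This is because the attention layers of a Transformer model contextualize every token using all the other tokens in the sequence, and the padding token is included in that context.  
##### To solve this, an attention mask is introduced, which marks the tokens that should be ignored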

In [None]:
batched_ids = [
    [200, 200, 200],
    [200, 200, tokenizer.pad_token_id],
]

attention_mask = [
    [1, 1, 1],
    [1, 1, 0],
]

outputs = model(torch.tensor(batched_ids), attention_mask=torch.tensor(attention_mask))
print(outputs.logits)
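
Why the mask removes the influence of padding can be sketched on a single row of attention scores. This is a simplified illustration under stated assumptions, not the model's actual code: the scores are made up, `masked_softmax` is a hypothetical helper, and at least one position is assumed to be unmasked.

```python
import math

def masked_softmax(scores, mask):
    """Softmax over scores that gives zero weight to positions where mask is 0."""
    # Masked positions are set to -inf so that exp(-inf) evaluates to 0.
    masked = [s if m == 1 else float("-inf") for s, m in zip(scores, mask)]
    peak = max(masked)
    exps = [math.exp(s - peak) for s in masked]
    total = sum(exps)
    return [e / total for e in exps]

scores = [2.0, 1.0, 3.0]  # hypothetical attention scores for one query token
unmasked = masked_softmax(scores, [1, 1, 1])
masked = masked_softmax(scores, [1, 1, 0])  # last position is padding
print(unmasked)
print(masked)
```

With the mask, the padded position receives zero attention weight, and the remaining weights are the same as if the padding token were not there at all.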

##### Calling the tokenizer directly combines tokenization and ID conversion, and can also pad and truncate

In [None]:
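# Define the batch first; the calls below need it
sequences = ["I've been waiting for a HuggingFace course my whole life.", "So have I!"]

# Pad the sequences to the length of the longest sequence in the batch
model_inputs = tokenizer(sequences, padding="longest")

# Pad the sequences to the model's maximum length
# (512 for BERT or DistilBERT)
model_inputs = tokenizer(sequences, padding="max_length")

# Pad the sequences to the specified maximum length
model_inputs = tokenizer(sequences, padding="max_length", max_length=8)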

sequences = ["I've been waiting for a HuggingFace course my whole life.", "So have I!"]

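# Truncate sequences that are longer than the model's maximum length
# (512 for BERT or DistilBERT)
model_inputs = tokenizer(sequences, truncation=True)

# Truncate sequences that are longer than the specified maximum length
model_inputs = tokenizer(sequences, max_length=8, truncation=True)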


sequences = ["I've been waiting for a HuggingFace course my whole life.", "So have I!"]

# Return PyTorch tensors
model_inputs = tokenizer(sequences, padding=True, return_tensors="pt")

# Return TensorFlow tensors
model_inputs = tokenizer(sequences, padding=True, return_tensors="tf")

# Return NumPy arrays
model_inputs = tokenizer(sequences, padding=True, return_tensors="np")

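The tokenizer also adds the special tokens [CLS] and [SEP], which mark the start and the end of the sequence respectively. They are the IDs 101 and 102 in the tokenizer output shown earlier, and they are absent from the IDs returned by convert_tokens_to_ids().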
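
How special tokens interact with truncation can be sketched as follows. The IDs 101 and 102 are taken from the tokenizer output earlier in these notes; the helper `build_inputs` is hypothetical and assumes that truncation reserves two positions for the special tokens.

```python
CLS_ID, SEP_ID = 101, 102  # IDs seen in the tokenizer output earlier in these notes

def build_inputs(ids, max_length=None):
    """Wrap content IDs in start/end markers, truncating the content first if needed."""
    if max_length is not None:
        # Reserve two positions for the special tokens.
        ids = ids[: max(max_length - 2, 0)]
    return [CLS_ID] + ids + [SEP_ID]

# Content IDs of the example sentence, as returned by convert_tokens_to_ids
ids = [1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172, 2607, 2026, 2878, 2166, 1012]

full = build_inputs(ids)
short = build_inputs(ids, max_length=8)
print(full)
print(short)
```

The untruncated result matches the 16 IDs printed by the direct tokenizer call, which explains why that output is two IDs longer than the list built by hand.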