**A pipeline handles both the preprocessing of the input and the postprocessing of the model's output**

In [1]:
from transformers import pipeline

classifier = pipeline("sentiment-analysis")
classifier("I've been waiting for a HuggingFace course my whole life.")

No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


[{'label': 'POSITIVE', 'score': 0.9598048329353333}]

**It also accepts a list of sentences**

In [2]:

classifier(
    ["I've been waiting for a HuggingFace course my whole life.", "I hate this so much!"]
)

[{'label': 'POSITIVE', 'score': 0.9598048329353333},
 {'label': 'NEGATIVE', 'score': 0.9994558691978455}]

# Behind the pipeline #
- Preprocess the input
- Feed the preprocessed data into the model
- Postprocess the model's output

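## How the input is preprocessed ##
### Use a tokenizer to encode the input text into numbers the model can process ###
- Split the sentence with a tokenization method, that is, break it into individual tokens (there are many tokenization methods)
- Map each token to a number using a lookup table (the vocabulary)
- Add any extra inputs the model needs, such as the `[CLS]` and `[SEP]` special tokens in BERT or the attention mask (position encodings are usually added inside the model rather than by the tokenizer)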
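The three steps above can be sketched with a toy tokenizer. This is only an illustration under stated assumptions: the regex-based splitting and the tiny `VOCAB` table are hypothetical stand-ins, not the WordPiece algorithm or the full vocabulary that the real checkpoint uses. The IDs themselves are copied from the tokenizer output shown in this notebook for "I hate this so much!", and `[UNK]` is assumed to be 100 as in the BERT uncased vocabulary.

```python
import re

# Toy vocabulary (hypothetical subset). A real tokenizer loads tens of
# thousands of entries from vocab.txt.
VOCAB = {
    "[PAD]": 0, "[UNK]": 100, "[CLS]": 101, "[SEP]": 102,
    "i": 1045, "hate": 5223, "this": 2023, "so": 2061, "much": 2172, "!": 999,
}


def toy_tokenize(text):
    """Step 1: lowercase, then split into words and punctuation marks.

    This is a simplification and does not perform subword (WordPiece) splitting.
    """
    return re.findall(r"\w+|[^\w\s]", text.lower())


def toy_encode(text):
    tokens = toy_tokenize(text)
    # Step 2: map each token to its ID, falling back to [UNK] for unknown tokens
    ids = [VOCAB.get(token, VOCAB["[UNK]"]) for token in tokens]
    # Step 3: add the special tokens the model expects around the sequence
    return [VOCAB["[CLS]"]] + ids + [VOCAB["[SEP]"]]


print(toy_tokenize("I hate this so much!"))
print(toy_encode("I hate this so much!"))
```

For this one sentence the toy encoder should reproduce the non-padded part of the second row of `input_ids` in the real tokenizer output; for arbitrary text it will not, because real subword splitting is more involved.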

**The `from_pretrained` method of the `AutoTokenizer` class loads the tokenizer that matches a given model checkpoint**

In [4]:
from transformers import AutoTokenizer

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

In [5]:
raw_inputs = [
    "I've been waiting for a HuggingFace course my whole life.",
    "I hate this so much!",
]
inputs = tokenizer(raw_inputs, padding=True, truncation=True, return_tensors="pt")
inputs

{'input_ids': tensor([[  101,  1045,  1005,  2310,  2042,  3403,  2005,  1037, 17662, 12172,
          2607,  2026,  2878,  2166,  1012,   102],
        [  101,  1045,  5223,  2023,  2061,  2172,   999,   102,     0,     0,
             0,     0,     0,     0,     0,     0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]])}

**The output contains two entries: `input_ids` and `attention_mask`**
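To make the relationship between the two entries concrete, here is a minimal pure-Python sketch of what `padding=True` appears to do in the output above: shorter sequences are right-padded with ID 0 up to the longest sequence in the batch, and the attention mask marks real tokens with 1 and padding with 0. The helper name `pad_batch` is hypothetical and is not part of the `transformers` API; the real tokenizer also handles truncation and tensor conversion.

```python
def pad_batch(sequences, pad_id=0):
    """Right-pad every sequence to the batch maximum and build attention masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(list(seq) + [pad_id] * n_pad)
        # 1 = real token the model should attend to, 0 = padding to ignore
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Unpadded token IDs taken from the tokenizer output shown above
batch = pad_batch([
    [101, 1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172,
     2607, 2026, 2878, 2166, 1012, 102],
    [101, 1045, 5223, 2023, 2061, 2172, 999, 102],
])
print(batch["input_ids"][1])
print(batch["attention_mask"][1])
```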

**The `from_pretrained` method of the `AutoModel` class loads the model**

**An `AutoModel` returns hidden states (`last_hidden_state`), not class predictions**

In [6]:
from transformers import AutoModel

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModel.from_pretrained(checkpoint)

Some weights of the model checkpoint at distilbert-base-uncased-finetuned-sst-2-english were not used when initializing DistilBertModel: ['classifier.bias', 'classifier.weight', 'pre_classifier.weight', 'pre_classifier.bias']
- This IS expected if you are initializing DistilBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [11]:
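# What dictionary unpacking does
# Define a function that takes several parameters
def my_function(a, b, c):
    return a + b + c

# Store the arguments for the function in a dictionary.
# It is named kwargs so that it does not overwrite the tokenizer output `inputs`.
kwargs = {'a': 1, 'b': 2, 'c': 3}

# Passing the dictionary directly raises a TypeError,
# because the whole dictionary is bound to `a` while `b` and `c` are missing
# result = my_function(kwargs)

# **kwargs unpacks the dictionary into keyword arguments a=1, b=2, c=3
result = my_function(**kwargs)
result  # evaluates to 6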


6


In [9]:
# **inputs unpacks the key-value pairs of the tokenizer output into keyword arguments for the model
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)

torch.Size([2, 16, 768])


![image.png](attachment:image.png)

In [10]:
from transformers import AutoModelForSequenceClassification

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
outputs = model(**inputs)

In [12]:
outputs.logits.shape

torch.Size([2, 2])

In [13]:
# The model outputs raw logits only; a softmax is still needed to turn them into probabilities
outputs.logits

tensor([[-1.5607,  1.6123],
        [ 4.1692, -3.3464]], grad_fn=<AddmmBackward0>)

In [14]:
import torch

predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
print(predictions)

tensor([[4.0195e-02, 9.5980e-01],
        [9.9946e-01, 5.4418e-04]], grad_fn=<SoftmaxBackward0>)
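As a cross-check on what softmax computes, the same probabilities can be reproduced without PyTorch. This is a minimal sketch using only the standard library; the logits are copied (rounded to four decimals) from the output above, so the results should agree with `predictions` only up to rounding.

```python
import math


def softmax(row):
    """Exponentiate each logit and normalize so the row sums to 1."""
    shift = max(row)  # subtracting the maximum improves numerical stability
    exps = [math.exp(x - shift) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


# Logits copied from the model output shown above
logits = [[-1.5607, 1.6123], [4.1692, -3.3464]]
probs = [softmax(row) for row in logits]
for row in probs:
    print(row)
```

Softmax is applied per row (`dim=-1` in the PyTorch call), so each sentence gets its own probability distribution over the two classes.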


In [15]:
# The id2label attribute of the model's config shows which label each index corresponds to
model.config.id2label

{0: 'NEGATIVE', 1: 'POSITIVE'}
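Putting the last two steps together, the probabilities and the `id2label` mapping are enough to rebuild output in the same shape the pipeline returned at the start of this notebook. The sketch below is an assumption about the postprocessing, not the library's actual implementation: the helper name `to_pipeline_output` is hypothetical, and the probabilities are copied from the printed `predictions` tensor.

```python
# Mapping copied from model.config.id2label above
ID2LABEL = {0: "NEGATIVE", 1: "POSITIVE"}


def to_pipeline_output(probabilities, id2label):
    """Pick the most probable class per row and attach its label and score."""
    results = []
    for row in probabilities:
        best = max(range(len(row)), key=row.__getitem__)  # argmax over classes
        results.append({"label": id2label[best], "score": row[best]})
    return results


# Probabilities copied from the softmax output above
predictions = [[4.0195e-02, 9.5980e-01], [9.9946e-01, 5.4418e-04]]
results = to_pipeline_output(predictions, ID2LABEL)
print(results)
```

The first sentence comes out as `POSITIVE` and the second as `NEGATIVE`, matching the labels the `classifier` pipeline produced for the same two sentences.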