# Behind the pipeline (PyTorch)

Install the Transformers, Datasets, and Evaluate libraries to run this notebook.

In [7]:
!pip install datasets evaluate "transformers[sentencepiece]" tf-keras



In [3]:
from pprint import pprint

## 1. Reproducing the sentiment-analysis pipeline

In this section, we reproduce step by step what the following pipeline does:

In [8]:
from transformers import pipeline

classifier = pipeline("sentiment-analysis")
classifier(
    [
        "I've been waiting for a HuggingFace course my whole life.",
        "I hate this so much!",
    ]
)

No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision 714eb0f (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.



Device set to use mps:0


[{'label': 'POSITIVE', 'score': 0.9598049521446228},
 {'label': 'NEGATIVE', 'score': 0.9994558691978455}]

## 2. Tokenization

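In Transformers, the input must be tokenized in exactly the same way as it was when the model was trained. In other words, a tokenizer is bound to a specific model checkpoint. The code below shows this: the tokenizer is loaded by checkpoint name.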

In [11]:
from transformers import AutoTokenizer

ckpt = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(ckpt)
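To see why the tokenizer has to match the checkpoint, here is a minimal sketch using two hypothetical toy vocabularies (the names `vocab_a`, `vocab_b`, and `encode`, and all ids, are made up for illustration and are not part of Transformers). The same sentence maps to different ids under each vocabulary, so ids produced by one tokenizer would be meaningless to a model trained with the other.

```python
# Two hypothetical vocabularies; the ids are invented for illustration only.
vocab_a = {"[UNK]": 0, "i": 1, "hate": 2, "this": 3}
vocab_b = {"[UNK]": 0, "this": 1, "i": 2, "love": 3}


def encode(text, vocab):
    """Lowercase, split on whitespace, and map each token to its id.

    Unknown tokens fall back to the [UNK] id. Real tokenizers use subword
    algorithms; this whitespace split is a deliberate simplification.
    """
    return [vocab.get(token, vocab["[UNK]"]) for token in text.lower().split()]


ids_a = encode("I hate this", vocab_a)
ids_b = encode("I hate this", vocab_b)
print(ids_a)  # [1, 2, 3]
print(ids_b)  # [2, 0, 1]
```

Loading the tokenizer from the same checkpoint name as the model avoids this mismatch.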

Next, we use this tokenizer on the same two sentences. Because return_tensors="pt" is set, the results are returned as PyTorch tensors (torch.Tensor):

In [13]:
raw_inputs = [
    "I've been waiting for a HuggingFace course my whole life.",
    "I hate this so much!",
]
inputs = tokenizer(raw_inputs, padding=True, truncation=True, return_tensors="pt")
pprint(inputs)
print(f"type of input ids: {type(inputs['input_ids'])}")

{'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]]),
 'input_ids': tensor([[  101,  1045,  1005,  2310,  2042,  3403,  2005,  1037, 17662, 12172,
          2607,  2026,  2878,  2166,  1012,   102],
        [  101,  1045,  5223,  2023,  2061,  2172,   999,   102,     0,     0,
             0,     0,     0,     0,     0,     0]])}
type of input ids: <class 'torch.Tensor'>
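The zeros at the end of the second row come from padding=True: the shorter sequence is padded to the length of the longest one, and the attention mask marks which positions are real tokens. A minimal plain-Python sketch of that idea follows; `pad_batch` is a hypothetical helper, not the tokenizer's actual implementation, and the ids are copied from the output above.

```python
PAD_ID = 0  # padding id as it appears in the tokenizer output above

# Unpadded token ids for the two sentences, taken from the output above.
sequences = [
    [101, 1045, 1005, 2310, 2042, 3403, 2005, 1037,
     17662, 12172, 2607, 2026, 2878, 2166, 1012, 102],
    [101, 1045, 5223, 2023, 2061, 2172, 999, 102],
]


def pad_batch(sequences, pad_id=PAD_ID):
    """Right-pad every sequence to the longest length and build the mask."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks a padding position.
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return input_ids, attention_mask


input_ids, attention_mask = pad_batch(sequences)
print(attention_mask[1])  # [1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
```

The model uses the mask to ignore the padded positions when computing attention.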


## 3. Raw model output

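The raw model output is the output produced before any task-specific head is applied. In Transformers, a head is the set of layers (a linear layer, a softmax layer, or other non-linear layers) added on top of the last layer of a pretrained transformer-like model for tasks such as text classification or sentiment analysis.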

First, let's load the model:

In [14]:
from transformers import AutoModel

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModel.from_pretrained(checkpoint)



Next, let's look at the raw model output:

In [15]:
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)

torch.Size([2, 16, 768])


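The shape reads as [batch size, sequence length, hidden size]. The last_hidden_state above is the output of the last transformer block, that is, the residual output after its feed-forward layer, so its last dimension equals the dimension of the embedding vectors: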

In [18]:
assert outputs.last_hidden_state.shape[-1] == model.embeddings.word_embeddings.weight.shape[-1]

## 4. Output of the model fine-tuned for sentiment analysis

A "sentiment-analysis" pipeline is essentially a "sequence classification" task. We can load a model fine-tuned for this kind of task and inspect its output:

In [19]:
from transformers import AutoModelForSequenceClassification

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
outputs = model(**inputs)

In [20]:
print(outputs.logits.shape)

torch.Size([2, 2])
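The output now has one row per sentence and one column per label, instead of one hidden vector per token. As a rough sketch of how a classification head produces that shape, the toy example below takes the hidden vector of the first token of each sequence and applies a single linear layer. Everything here is an assumption for illustration: the sizes are tiny (hidden size 4 instead of 768), the weights are made up, `linear_head` is a hypothetical helper, and the head in the actual checkpoint may contain additional layers.

```python
# Toy hidden states with shape [batch=2, seq_len=3, hidden=4].
hidden_states = [
    [[0.1, 0.2, 0.3, 0.4], [0.0, 0.0, 0.0, 0.0], [1.0, 1.0, 1.0, 1.0]],
    [[-0.5, 0.5, -0.5, 0.5], [0.2, 0.2, 0.2, 0.2], [0.0, 1.0, 0.0, 1.0]],
]

# Made-up head parameters: weight has shape [num_labels=2, hidden=4].
weight = [[1.0, 0.0, -1.0, 0.0], [0.0, 1.0, 0.0, -1.0]]
bias = [0.0, 0.1]


def linear_head(hidden_states, weight, bias):
    """Map [batch, seq_len, hidden] to [batch, num_labels].

    Only the first token's vector of each sequence is used, followed by
    one linear layer (a simplification of a real classification head).
    """
    logits = []
    for sequence in hidden_states:
        first_token = sequence[0]
        logits.append([
            sum(w * x for w, x in zip(row, first_token)) + b
            for row, b in zip(weight, bias)
        ])
    return logits


logits = linear_head(hidden_states, weight, bias)
print(len(logits), len(logits[0]))  # 2 2
```

The sequence-length dimension disappears because the head summarizes each sentence with a single vector before classifying it.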


Note that the fine-tuned model outputs logits (raw, unnormalized scores), not probabilities:

In [11]:
print(outputs.logits)

tensor([[-1.5607,  1.6123],
        [ 4.1692, -3.3464]], grad_fn=<AddmmBackward0>)


To turn the logits into probabilities, apply the softmax function:

In [21]:
import torch

predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
print(predictions)

tensor([[4.0195e-02, 9.5981e-01],
        [9.9946e-01, 5.4419e-04]], grad_fn=<SoftmaxBackward0>)
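For reference, softmax exponentiates each logit and divides by the row sum, so every row becomes a probability distribution. The sketch below recomputes it in plain Python from the logits printed earlier; `softmax` here is a hand-written helper, not the PyTorch function. Since the printed logits are rounded to four decimals, the results should agree with the tensor above only approximately.

```python
import math


def softmax(row):
    """Numerically stable softmax over one list of logits."""
    largest = max(row)
    # Subtracting the max does not change the result but avoids overflow.
    exps = [math.exp(x - largest) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


# Logits copied (rounded) from the output printed earlier.
logits = [[-1.5607, 1.6123], [4.1692, -3.3464]]
probs = [softmax(row) for row in logits]

for row in probs:
    print([round(p, 4) for p in row], "sum =", round(sum(row), 6))
```

Each row sums to 1, and the larger logit in a row always gets the larger probability.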


The mapping from class index to sentiment label is:

In [13]:
model.config.id2label

{0: 'NEGATIVE', 1: 'POSITIVE'}

Finally, let's convert the predictions into the same format that the pipeline returns:

In [23]:
# Target format (the pipeline output from section 1):
# [{'label': 'POSITIVE', 'score': 0.9598049521446228},
#  {'label': 'NEGATIVE', 'score': 0.9994558691978455}]
final_results = [
    {
        "label": model.config.id2label[prediction.argmax().item()],
        "score": prediction.max().item(),
    }
    for prediction in predictions
]
pprint(final_results)


[{'label': 'POSITIVE', 'score': 0.9598050713539124},
 {'label': 'NEGATIVE', 'score': 0.9994558691978455}]
