# Understanding Hugging Face's `AutoModel` Family: Auto-Loading Model Classes for Different Tasks

> Guide article: [05. Understanding Hugging Face's `AutoModel` Family: Auto-Loading Model Classes for Different Tasks](https://github.com/Hoper-J/LLM-Guide-and-Demos-zh_CN/blob/master/Guide/05.%20理解%20Hugging%20Face%20的%20%60AutoModel%60%20系列：不同任务的自动模型加载类.md)

Below are short code examples for the individual `AutoModel` classes, one task per section.

Run online: [Kaggle](https://www.kaggle.com/code/aidemos/04-hugging-face-automodel) | [Colab](https://colab.research.google.com/drive/1gLTXcvG-tEDOqnR7qM-3-S812qnBUGlh?usp=sharing)


## Install the libraries

In [1]:
!uv add transformers
!uv add sentencepiece
!uv add sacremoses

Resolved 198 packages in 0.99ms
Audited 193 packages in 0.06ms
Resolved 198 packages in 0.98ms
Audited 193 packages in 0.06ms
Resolved 198 packages in 0.96ms
Audited 193 packages in 0.06ms


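## Set the model download mirror

Note: this must be set before importing `transformers` or any other module that reads it; otherwise it has no effect.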

In [2]:
import os
os.environ['HF_ENDPOINT'] = 'https://hf-mirror.com'
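
The ordering requirement can be made explicit with a small guard. The sketch below is an illustration and not part of the original notebook: `set_hf_endpoint` is a hypothetical helper that sets `HF_ENDPOINT` and reports whether `transformers` or `huggingface_hub` had already been imported, in which case the new value may be ignored until the kernel is restarted.

```python
import os
import sys

def set_hf_endpoint(url: str) -> bool:
    """Set HF_ENDPOINT; return True if no Hub library had been imported yet.

    `set_hf_endpoint` is a hypothetical helper written for this sketch.
    """
    already_imported = [m for m in ("transformers", "huggingface_hub") if m in sys.modules]
    os.environ["HF_ENDPOINT"] = url
    if already_imported:
        print(f"Warning: {already_imported} imported before HF_ENDPOINT was set; "
              "restart the kernel for the mirror to apply.")
        return False
    return True

in_time = set_hf_endpoint("https://hf-mirror.com")
print(os.environ["HF_ENDPOINT"], in_time)
```

Calling it in the first cell of a fresh kernel should return `True`; a `False` result suggests the mirror setting came too late.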

## Example 1: Text generation (`AutoModelForCausalLM`)

In [3]:
from transformers import AutoTokenizer, AutoModelForCausalLM

# Model name
model_name = "gpt2"

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)

# GPT-2 has no pad token, so reuse eos_token as pad_token
tokenizer.pad_token = tokenizer.eos_token

# Load the pretrained model
model = AutoModelForCausalLM.from_pretrained(model_name)

# Input text
input_text = "Once upon a time"

# Encode the input
inputs = tokenizer(input_text, return_tensors="pt")

# Generate text (sampling is enabled, so the output differs from run to run)
outputs = model.generate(**inputs, max_length=50, do_sample=True, top_p=0.95, temperature=0.7)

# Decode the generated text
generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(generated_text)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Once upon a time, it was a new way to do things. The new tools are better and faster.

The toolset now lets you make the most of your time with your phone. It can save you more time. You can use


## Example 2: Fill-in-the-blank (`AutoModelForMaskedLM`)


In [4]:
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

# Model name
model_name = "bert-base-uncased"

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Load the pretrained model
model = AutoModelForMaskedLM.from_pretrained(model_name)

# Input text containing the [MASK] token
input_text = "The capital of France is [MASK]."

# Encode the input
inputs = tokenizer(input_text, return_tensors="pt")

# Run the model without tracking gradients
with torch.no_grad():
    outputs = model(**inputs)
    predictions = outputs.logits

# Pick the highest-scoring token at the [MASK] position (assumes exactly one [MASK])
masked_index = (inputs.input_ids == tokenizer.mask_token_id)[0].nonzero(as_tuple=True)[0]
predicted_token_id = predictions[0, masked_index].argmax(dim=-1).item()
predicted_token = tokenizer.decode([predicted_token_id])

print(f"Prediction: {predicted_token}")

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForMaskedLM: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Prediction: paris
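
The cell above handles a single `[MASK]`; with several masks, `argmax(...).item()` would fail because it returns more than one value. A minimal pure-Python sketch of the more general lookup follows, using invented toy data. `top_k_at_masks`, the 6-token vocabulary, and the choice of id 5 as the mask id are all assumptions made for this illustration.

```python
def top_k_at_masks(input_ids, logits, mask_token_id, k=3):
    """For each masked position, return the k highest-scoring token ids."""
    results = {}
    for pos, token_id in enumerate(input_ids):
        if token_id != mask_token_id:
            continue
        scores = logits[pos]
        ranked = sorted(range(len(scores)), key=lambda t: scores[t], reverse=True)
        results[pos] = ranked[:k]
    return results

# Toy data: one row of scores per input position, one column per vocabulary id
toy_ids = [0, 2, 5, 3, 5]
toy_logits = [
    [0.1, 0.0, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.9, 0.0, 0.0, 0.0],
    [0.2, 1.5, 0.1, 0.7, 0.3, 0.0],
    [0.0, 0.0, 0.0, 0.8, 0.0, 0.0],
    [0.4, 0.1, 2.0, 0.3, 1.1, 0.0],
]
candidates = top_k_at_masks(toy_ids, toy_logits, mask_token_id=5, k=2)
print(candidates)  # {2: [1, 3], 4: [2, 4]}
```

With real model output, each candidate id would then be passed to `tokenizer.decode` as in the cell above.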


## Example 3: Sequence-to-sequence tasks (`AutoModelForSeq2SeqLM`)


In [5]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Model name (English-to-German translation)
model_name = "Helsinki-NLP/opus-mt-en-de"

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Load the pretrained model
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Input text
input_text = "Hello, how are you?"

# Encode the input
inputs = tokenizer(input_text, return_tensors="pt")

# Generate the translation with beam search
outputs = model.generate(**inputs, max_length=40, num_beams=4, early_stopping=True)

# Decode the generated text
translated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(f"Translation: {translated_text}")

Translation: Hallo, wie geht's?


## Example 4: Question answering (`AutoModelForQuestionAnswering`)

In [6]:
from transformers import AutoTokenizer, AutoModelForQuestionAnswering
import torch

# Model name
model_name = "distilbert-base-uncased-distilled-squad"

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Load the pretrained model
model = AutoModelForQuestionAnswering.from_pretrained(model_name)

# Context and question
context = "Hugging Face is creating a tool that democratizes AI."
question = "What is Hugging Face creating?"

# Encode the question/context pair (calling the tokenizer directly replaces the older encode_plus)
inputs = tokenizer(question, context, return_tensors="pt")

# Run the model without tracking gradients
with torch.no_grad():
    outputs = model(**inputs)

# Most likely start and end positions of the answer (end is made exclusive for slicing)
answer_start = torch.argmax(outputs.start_logits)
answer_end = torch.argmax(outputs.end_logits) + 1

# Decode the answer span
answer = tokenizer.convert_tokens_to_string(tokenizer.convert_ids_to_tokens(inputs["input_ids"][0][answer_start:answer_end]))
print(f"Answer: {answer}")

Answer: a tool that democratizes ai
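
Taking the two `argmax` values independently, as the cell above does, can yield an end position before the start position, which produces an empty answer. One common remedy is to search for the best-scoring pair with `start <= end`. The sketch below shows this on invented logits; `best_span` and its `max_answer_len` parameter are hypothetical names for this illustration.

```python
def best_span(start_logits, end_logits, max_answer_len=15):
    """Return (start, end_exclusive, score) of the best span with start <= end."""
    best = None
    for s, s_score in enumerate(start_logits):
        last = min(len(end_logits), s + max_answer_len)
        for e in range(s, last):
            score = s_score + end_logits[e]
            if best is None or score > best[2]:
                best = (s, e + 1, score)
    return best

# Toy logits where the independent argmax picks start=2 and end=0 (an invalid span)
toy_start = [0.1, 0.3, 2.0, 0.2, 0.1]
toy_end = [3.0, 0.1, 0.2, 0.4, 2.5]
start, end, score = best_span(toy_start, toy_end)
print(start, end, score)  # 2 5 4.5
```

The returned `start` and `end` can be used directly as the slice bounds on `inputs["input_ids"][0]`.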


## Example 5: Named entity recognition (`AutoModelForTokenClassification`)


In [7]:
from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch

# Model name
model_name = "dbmdz/bert-large-cased-finetuned-conll03-english"

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Load the pretrained model
model = AutoModelForTokenClassification.from_pretrained(model_name)

# Mapping from class id to label name (a dict, despite the variable name)
label_list = model.config.id2label

# Input text
input_text = "Hugging Face Inc. is a company based in New York City. Its headquarters are in DUMBO, therefore very close to the Manhattan Bridge."

# Encode the input
inputs = tokenizer(input_text, return_tensors="pt")

# Run the model without tracking gradients
with torch.no_grad():
    outputs = model(**inputs)

# Per-token class scores and the highest-scoring class for each token
logits = outputs.logits
predictions = torch.argmax(logits, dim=2)

# Map the predictions to label names
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
pred_labels = [label_list[prediction.item()] for prediction in predictions[0]]

# Print one token and its label per line
for token, label in zip(tokens, pred_labels):
    print(f"{token}: {label}")

Some weights of the model checkpoint at dbmdz/bert-large-cased-finetuned-conll03-english were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


[CLS]: O
Hu: I-ORG
##gging: I-ORG
Face: I-ORG
Inc: I-ORG
.: O
is: O
a: O
company: O
based: O
in: O
New: I-LOC
York: I-LOC
City: I-LOC
.: O
Its: O
headquarters: O
are: O
in: O
D: I-LOC
##UM: I-LOC
##BO: I-LOC
,: O
therefore: O
very: O
close: O
to: O
the: O
Manhattan: I-LOC
Bridge: I-LOC
.: O
[SEP]: O
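
The output above is per WordPiece token, so `DUMBO` appears as `D`, `##UM`, `##BO`. A minimal sketch of merging the pieces and grouping adjacent tokens into entities follows. `group_entities` is a hypothetical helper for this illustration; the token-classification pipeline in `transformers` offers its own aggregation via `aggregation_strategy`, which may behave differently.

```python
def group_entities(tokens, labels):
    """Merge WordPiece continuations and group adjacent tokens sharing an entity type."""
    entities = []
    current = None  # running [entity_type, text]
    for token, label in zip(tokens, labels):
        if token in ("[CLS]", "[SEP]") or label == "O":
            if current:
                entities.append(tuple(current))
                current = None
            continue
        entity_type = label.split("-", 1)[-1]  # "I-ORG" -> "ORG"
        if token.startswith("##") and current:
            current[1] += token[2:]            # glue the continuation piece on
        elif current and current[0] == entity_type and not label.startswith("B-"):
            current[1] += " " + token          # same entity, next word
        else:
            if current:
                entities.append(tuple(current))
            current = [entity_type, token]
    if current:
        entities.append(tuple(current))
    return entities

# Tokens and labels copied from the output above
tokens = ("[CLS] Hu ##gging Face Inc . is a company based in New York City . Its "
          "headquarters are in D ##UM ##BO , therefore very close to the "
          "Manhattan Bridge . [SEP]").split()
labels = ("O I-ORG I-ORG I-ORG I-ORG O O O O O O I-LOC I-LOC I-LOC O O O O O "
          "I-LOC I-LOC I-LOC O O O O O O I-LOC I-LOC O O").split()
entities = group_entities(tokens, labels)
print(entities)
```

For this sentence the sketch yields four entities: `Hugging Face Inc` (ORG), `New York City`, `DUMBO`, and `Manhattan Bridge` (all LOC).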


## Example 6: Text classification (`AutoModelForSequenceClassification`)


In [8]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
import torch.nn.functional as F

# Model name
model_name = "distilbert-base-uncased-finetuned-sst-2-english"

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Load the pretrained model
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Input text
input_text = "I love using transformers library!"

# Encode the input
inputs = tokenizer(input_text, return_tensors="pt")

# Run the model without tracking gradients
with torch.no_grad():
    outputs = model(**inputs)

# Convert the class scores to probabilities
logits = outputs.logits
probabilities = F.softmax(logits, dim=1)

# Map the predicted class index to a label (for this checkpoint, 0 = negative, 1 = positive)
labels = ['Negative', 'Positive']
prediction = torch.argmax(probabilities, dim=1).item()
predicted_label = labels[prediction]

# Print the result
print(f"Text: {input_text}")
print(f"Predicted sentiment: {predicted_label}")

Text: I love using transformers library!
Predicted sentiment: Positive
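
The step from logits to a label is a softmax followed by an index lookup. The pure-Python sketch below shows that step on invented logits, using a dict shaped like `model.config.id2label` instead of a hard-coded list. `softmax`, `classify`, and `toy_id2label` are names made up for this illustration.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a flat list of scores."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def classify(logits, id2label):
    """Return (label, probability) for the highest-scoring class."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=lambda i: probs[i])
    return id2label[best], probs[best]

toy_id2label = {0: "NEGATIVE", 1: "POSITIVE"}  # same shape as model.config.id2label
label, confidence = classify([-4.0, 4.0], toy_id2label)
print(label, round(confidence, 4))  # POSITIVE 0.9997
```

Reading the mapping from the model's config avoids relying on the label order of one particular checkpoint.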


## Example 7: Feature extraction (`AutoModel`)


In [9]:
from transformers import AutoTokenizer, AutoModel
import torch

# Model name
model_name = "bert-base-uncased"

# Load the tokenizer and the model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

# Input text
input_text = "This is a sample sentence."

# Encode the input
inputs = tokenizer(input_text, return_tensors="pt")

# Run the model without tracking gradients
with torch.no_grad():
    outputs = model(**inputs)

# Hidden states of the last layer
last_hidden_states = outputs.last_hidden_state

# Print the shape: (batch size, number of tokens, hidden size)
print(f"Last hidden state shape: {last_hidden_states.shape}")

Last hidden state shape: torch.Size([1, 8, 768])
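
The shape `[1, 8, 768]` means one sentence, eight tokens (including `[CLS]` and `[SEP]`), and a 768-dimensional vector per token. One common way to turn this into a single sentence vector, which is not part of the original notebook, is to average the token vectors while skipping padding. A pure-Python sketch on toy numbers, with `mean_pool` as a hypothetical helper:

```python
def mean_pool(hidden_states, attention_mask):
    """Average token vectors where attention_mask is 1; hidden_states is [seq_len][hidden]."""
    hidden_size = len(hidden_states[0])
    summed = [0.0] * hidden_size
    count = 0
    for vector, keep in zip(hidden_states, attention_mask):
        if not keep:
            continue  # ignore padding positions
        count += 1
        for j, value in enumerate(vector):
            summed[j] += value
    return [s / max(count, 1) for s in summed]

# Three tokens with hidden size 2; the last position is padding
toy_hidden = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
toy_mask = [1, 1, 0]
sentence_vector = mean_pool(toy_hidden, toy_mask)
print(sentence_vector)  # [2.0, 3.0]
```

With real tensors the same idea applies to `last_hidden_states[0]` and `inputs["attention_mask"][0]`.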


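## Viewing the source code

Taking `AutoModelForQuestionAnswering` as an example, use the `inspect` module to view the source of the concrete class it loads:

### Viewing the `__init__` method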

In [10]:
import inspect
from transformers import AutoModelForQuestionAnswering

# Load the pretrained model
model = AutoModelForQuestionAnswering.from_pretrained("distilbert-base-uncased-distilled-squad")

# Get and print the source of the __init__ method
init_code = inspect.getsource(model.__init__)
print(init_code)

    def __init__(self, config: PretrainedConfig):
        super().__init__(config)

        self.distilbert = DistilBertModel(config)
        self.qa_outputs = nn.Linear(config.dim, config.num_labels)
        if config.num_labels != 2:
            raise ValueError(f"config.num_labels should be 2, but it is {config.num_labels}")

        self.dropout = nn.Dropout(config.qa_dropout)

        # Initialize weights and apply final processing
        self.post_init()
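
When the full source is more than needed, `inspect.signature` and `inspect.getdoc` give a shorter view of a callable. The sketch below uses the standard library's `json.dumps` as a stand-in so that it runs without downloading a model; pointing `target` at `model.__init__` or `model.forward` is the assumed use in this notebook.

```python
import inspect
import json

target = json.dumps  # stand-in; in the notebook this could be model.forward

# Parameter names without the function body
signature = inspect.signature(target)
param_names = list(signature.parameters)
print(param_names)

# First line of the docstring, if there is one
doc = inspect.getdoc(target) or ""
first_line = doc.splitlines()[0] if doc else ""
print(first_line)
```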



### Viewing the `forward` method

In [11]:
# Get and print the source of the forward method
forward_code = inspect.getsource(model.forward)
print(forward_code)

    @auto_docstring
    def forward(
        self,
        input_ids: Optional[torch.Tensor] = None,
        attention_mask: Optional[torch.Tensor] = None,
        head_mask: Optional[torch.Tensor] = None,
        inputs_embeds: Optional[torch.Tensor] = None,
        start_positions: Optional[torch.Tensor] = None,
        end_positions: Optional[torch.Tensor] = None,
        output_attentions: Optional[bool] = None,
        output_hidden_states: Optional[bool] = None,
        return_dict: Optional[bool] = None,
    ) -> Union[QuestionAnsweringModelOutput, tuple[torch.Tensor, ...]]:
        r"""
        input_ids (`torch.LongTensor` of shape `(batch_size, num_choices)`):
            Indices of input sequence tokens in the vocabulary.

            Indices can be obtained using [`AutoTokenizer`]. See [`PreTrainedTokenizer.encode`] and
            [`PreTrainedTokenizer.__call__`] for details.

            [What are input IDs?](../glossary#input-ids)
        inputs_embeds (`torch.FloatTenso

### Quick lookup with `help`

Besides `inspect`, Python's built-in `help` function can also show a class's documentation and methods:

In [12]:
help(AutoModelForQuestionAnswering)

Help on class AutoModelForQuestionAnswering in module transformers.models.auto.modeling_auto:

class AutoModelForQuestionAnswering(transformers.models.auto.auto_factory._BaseAutoModelClass)
 |  AutoModelForQuestionAnswering(*args, **kwargs) -> None
 |
 |  This is a generic model class that will be instantiated as one of the model classes of the library (with a question answering head) when created
 |  with the [`~AutoModelForQuestionAnswering.from_pretrained`] class method or the [`~AutoModelForQuestionAnswering.from_config`] class
 |  method.
 |
 |  This class cannot be instantiated directly using `__init__()` (throws an error).
 |
 |  Method resolution order:
 |      AutoModelForQuestionAnswering
 |      transformers.models.auto.auto_factory._BaseAutoModelClass
 |      builtins.object
 |
 |  Class methods defined here:
 |
 |  from_config(**kwargs) from transformers.models.auto.auto_factory._BaseAutoModelClass
 |      Instantiates one of the model classes of the library (with a questi