The Jupyter notebook for this article is available in the [Chapter 4 code repository](https://github.com/datawhalechina/learn-nlp-with-transformers/tree/main/docs/%E7%AF%87%E7%AB%A04-%E4%BD%BF%E7%94%A8Transformers%E8%A7%A3%E5%86%B3NLP%E4%BB%BB%E5%8A%A1).

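You can also open this tutorial directly in a Google Colab notebook and download the required datasets and models there.
If you are running this notebook in Google Colab, you may need to install the Transformers and 🤗 Datasets libraries first. Run the following command to install them.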

In [None]:
!pip install transformers datasets

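If you are running this notebook locally, make sure the dependencies above are installed.
You can also find a multi-GPU distributed training version of this notebook [here](https://github.com/huggingface/transformers/tree/master/examples/text-classification).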


# Fine-tuning a pretrained model for text classification

We will show how to use models from the [🤗 Transformers](https://github.com/huggingface/transformers) library to solve text classification tasks taken from the [GLUE Benchmark](https://gluebenchmark.com/).

![Widget inference on a text classification task](https://github.com/huggingface/notebooks/blob/master/examples/images/text_classification.png?raw=1)

The GLUE benchmark consists of 9 sentence-level classification tasks:
- [CoLA](https://nyu-mll.github.io/CoLA/) (Corpus of Linguistic Acceptability): determine whether a sentence is grammatically correct.
- [MNLI](https://arxiv.org/abs/1704.05426) (Multi-Genre Natural Language Inference): given a hypothesis, determine whether another sentence entails it, contradicts it, or is unrelated to it.
- [MRPC](https://www.microsoft.com/en-us/download/details.aspx?id=52398) (Microsoft Research Paraphrase Corpus): determine whether two sentences are paraphrases of each other.
- [QNLI](https://rajpurkar.github.io/SQuAD-explorer/) (Question-answering Natural Language Inference): determine whether the second sentence contains the answer to the question in the first sentence.
- [QQP](https://data.quora.com/First-Quora-Dataset-Release-Question-Pairs) (Quora Question Pairs2): determine whether two questions are semantically equivalent.
- [RTE](https://aclweb.org/aclwiki/Recognizing_Textual_Entailment) (Recognizing Textual Entailment): determine whether a sentence entails a given hypothesis.
- [SST-2](https://nlp.stanford.edu/sentiment/index.html) (Stanford Sentiment Treebank): determine whether a sentence has positive or negative sentiment.
- [STS-B](http://ixa2.si.ehu.es/stswiki/index.php/STSbenchmark) (Semantic Textual Similarity Benchmark): determine the similarity of two sentences with a score from 0 to 5.
- [WNLI](https://cs.nyu.edu/faculty/davise/papers/WinogradSchemas/WS.html) (Winograd Natural Language Inference): determine whether a sentence with an anonymous pronoun and the same sentence with that pronoun replaced are entailed or not.

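For these tasks, we will show how to load the datasets with the easy-to-use 🤗 Datasets library and how to fine-tune a pretrained model with the `Trainer` API from Transformers.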

In [2]:
GLUE_TASKS = ["cola", "mnli", "mnli-mm", "mrpc", "qnli", "qqp", "rte", "sst2", "stsb", "wnli"]

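In principle, this notebook works with any transformer model from the [Model Hub](https://huggingface.co/models) and any text classification task.

If your task differs from the ones here, it will most likely need only small changes to this notebook. You should also adjust the batch size used for fine-tuning to fit your GPU memory, so that you avoid out-of-memory errors.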

In [3]:
task = "cola"
model_checkpoint = "distilbert-base-uncased"
batch_size = 16

## Loading the data

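We will use the [🤗 Datasets](https://github.com/huggingface/datasets) library to load the data and the `evaluate` library to load the matching metric. Each takes a single call: `load_dataset` for the data and `evaluate.load` for the metric, which replaces the older `load_metric`.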

In [10]:
from datasets import load_dataset

- `load_metric` has been removed in `datasets` 3.0.0, so we install the `evaluate` library and load the metric from there instead.

In [12]:
!pip install evaluate==0.4.0



In [15]:
import evaluate


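Except for `mnli-mm`, every task can be loaded directly by its name; `mnli-mm` loads the `mnli` dataset and later uses its mismatched validation set. The data is cached automatically once it has been loaded.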

In [17]:
actual_task = "mnli" if task == "mnli-mm" else task
dataset = load_dataset("glue", actual_task)
# metric = load_metric('glue', actual_task)
metric = evaluate.load('glue', actual_task)

The `dataset` object is a [`DatasetDict`](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasetdict). It holds the training, validation, and test sets under the keys `train`, `validation`, and `test`.

In [18]:
dataset

DatasetDict({
    train: Dataset({
        features: ['sentence', 'label', 'idx'],
        num_rows: 8551
    })
    validation: Dataset({
        features: ['sentence', 'label', 'idx'],
        num_rows: 1043
    })
    test: Dataset({
        features: ['sentence', 'label', 'idx'],
        num_rows: 1063
    })
})

Given a split key (`train`, `validation`, or `test`) and an index, you can look at a single example:

In [19]:
dataset["train"][0]

{'sentence': "Our friends won't buy this analysis, let alone the next one we propose.",
 'label': 1,
 'idx': 0}

To get a better sense of what the data looks like, the following function displays a few randomly selected examples from the dataset.

In [20]:
import datasets
import random
import pandas as pd
from IPython.display import display, HTML

def show_random_elements(dataset, num_examples=10):
    assert num_examples <= len(dataset), "Can't pick more elements than there are in the dataset."
    # Draw num_examples distinct random indices.
    picks = random.sample(range(len(dataset)), num_examples)

    df = pd.DataFrame(dataset[picks])
    for column, typ in dataset.features.items():
        if isinstance(typ, datasets.ClassLabel):
            df[column] = df[column].transform(lambda i: typ.names[i])
    display(HTML(df.to_html()))

In [21]:
show_random_elements(dataset["train"])

| | sentence | label | idx |
|---|---|---|---|
| 0 | John suddenly put off the customers. | acceptable | 3681 |
| 1 | The change pocketed. | unacceptable | 2616 |
| 2 | There soared oil in price. | unacceptable | 3276 |
| 3 | He's the happiest that I believe that he's ever been. | acceptable | 1707 |
| 4 | I loved the policeman the baker intensely with all my heart. | unacceptable | 5797 |
| 5 | The money which I will make a proposal for us to squander amounts to $400,000. | acceptable | 1225 |
| 6 | Mary wondered which picture of himself Bill saw? | acceptable | 366 |
| 7 | It was the director that she wants to meet. | acceptable | 5142 |
| 8 | They can't stand each other, him and her. | acceptable | 1851 |
| 9 | Not reading Shakespeare satisfied me | acceptable | 8243 |


The evaluation metric is an instance of `evaluate.EvaluationModule`, the replacement for the former [`datasets.Metric`](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasets.Metric):

In [22]:
metric

EvaluationModule(name: "glue", module_type: "metric", features: {'predictions': Value(dtype='int64', id=None), 'references': Value(dtype='int64', id=None)}, usage: """
Compute GLUE evaluation metric associated to each GLUE dataset.
Args:
    predictions: list of predictions to score.
        Each translation should be tokenized into a list of tokens.
    references: list of lists of references for each translation.
        Each reference should be tokenized into a list of tokens.
Returns: depending on the GLUE subset, one or several of:
    "accuracy": Accuracy
    "f1": F1 score
    "pearson": Pearson Correlation
    "spearmanr": Spearman Correlation
    "matthews_correlation": Matthew Correlation
Examples:

    >>> glue_metric = evaluate.load('glue', 'sst2')  # 'sst2' or any of ["mnli", "mnli_mismatched", "mnli_matched", "qnli", "rte", "wnli", "hans"]
    >>> references = [0, 1]
    >>> predictions = [0, 1]
    >>> results = glue_metric.compute(predictions=predictions, references=ref

Call the metric's `compute` method with `predictions` and `references` (the labels) to get the metric value:

In [23]:
import numpy as np

fake_preds = np.random.randint(0, 2, size=(64,))
fake_labels = np.random.randint(0, 2, size=(64,))
metric.compute(predictions=fake_preds, references=fake_labels)

{'matthews_correlation': -0.12855839970025792}
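
To make the `matthews_correlation` value above less opaque, here is a minimal pure-Python sketch of the Matthews correlation coefficient for binary labels. It illustrates the formula and is not the code the metric runs internally. The function name `matthews_corrcoef` and the choice to return 0 when the denominator is 0 are assumptions made for this sketch.

```python
import math

def matthews_corrcoef(predictions, references):
    """Matthews correlation coefficient for binary labels (0 or 1)."""
    pairs = list(zip(predictions, references))
    tp = sum(1 for p, r in pairs if p == 1 and r == 1)  # true positives
    tn = sum(1 for p, r in pairs if p == 0 and r == 0)  # true negatives
    fp = sum(1 for p, r in pairs if p == 1 and r == 0)  # false positives
    fn = sum(1 for p, r in pairs if p == 0 and r == 1)  # false negatives
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Assumed convention for this sketch: a zero denominator yields 0.
    if denominator == 0:
        return 0.0
    return (tp * tn - fp * fn) / denominator

perfect = matthews_corrcoef([0, 1, 1, 0], [0, 1, 1, 0])
inverted = matthews_corrcoef([1, 0, 0, 1], [0, 1, 1, 0])
constant = matthews_corrcoef([1, 1, 1, 1], [0, 1, 1, 0])
```

The score is 1 for perfect predictions, -1 for fully inverted ones, and, under the convention assumed here, 0 when the model predicts the same class for every example. Random predictions like the fake ones above land near 0.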

Each text classification task has its own metric:

- for CoLA: [Matthews Correlation Coefficient](https://en.wikipedia.org/wiki/Matthews_correlation_coefficient)
- for MNLI (matched or mismatched): Accuracy
- for MRPC: Accuracy and [F1 score](https://en.wikipedia.org/wiki/F1_score)
- for QNLI: Accuracy
- for QQP: Accuracy and [F1 score](https://en.wikipedia.org/wiki/F1_score)
- for RTE: Accuracy
- for SST-2: Accuracy
- for STS-B: [Pearson Correlation Coefficient](https://en.wikipedia.org/wiki/Pearson_correlation_coefficient) and [Spearman's Rank Correlation Coefficient](https://en.wikipedia.org/wiki/Spearman%27s_rank_correlation_coefficient)
- for WNLI: Accuracy

So make sure the metric you load matches the task.
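
One way to keep the metric and the task aligned is a small lookup table. The sketch below transcribes the list above into a dict; `GLUE_TASK_METRICS` and `primary_metric` are hypothetical names introduced for illustration, and the metric keys follow the names shown in the metric's usage text (`accuracy`, `f1`, `pearson`, `spearmanr`, `matthews_correlation`).

```python
# Hypothetical lookup table, transcribed from the task/metric list above.
GLUE_TASK_METRICS = {
    "cola": ["matthews_correlation"],
    "mnli": ["accuracy"],
    "mnli-mm": ["accuracy"],
    "mrpc": ["accuracy", "f1"],
    "qnli": ["accuracy"],
    "qqp": ["accuracy", "f1"],
    "rte": ["accuracy"],
    "sst2": ["accuracy"],
    "stsb": ["pearson", "spearmanr"],
    "wnli": ["accuracy"],
}

def primary_metric(task):
    """Return the first metric listed for a task, raising on unknown tasks."""
    try:
        return GLUE_TASK_METRICS[task][0]
    except KeyError:
        raise ValueError(f"Unknown GLUE task: {task!r}") from None

cola_metric = primary_metric("cola")
stsb_metric = primary_metric("stsb")
```

Failing loudly on an unknown task name is a deliberate choice here, since silently falling back to accuracy would hide a mismatch between task and metric.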

## Preprocessing the data

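Before feeding the data to the model, we need to preprocess it. The preprocessing tool is called a `Tokenizer`. It first tokenizes the input, then converts the tokens to the corresponding token IDs in the pretrained model's vocabulary, and finally arranges them in the input format the model expects.

To do this, we instantiate our tokenizer with the `AutoTokenizer.from_pretrained` method, which ensures that:

- we get the tokenizer that matches the pretrained model, and
- when the tokenizer for the given checkpoint is loaded, the vocabulary the model was pretrained with (more precisely, the token vocabulary) is downloaded as well.

The downloaded vocabulary is cached, so it is not downloaded again on later runs.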

In [24]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model_checkpoint, use_fast=True)

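Note: `use_fast=True` requests a fast tokenizer, an instance of `transformers.PreTrainedTokenizerFast`, because the preprocessing relies on some of its features (such as fast multi-threaded tokenization). If the model you chose has no fast tokenizer, simply drop this option.

Almost every model has a fast tokenizer. The [table of model tokenizers](https://huggingface.co/transformers/index.html#bigtable) lists the tokenizer features available for each pretrained model.

A tokenizer can preprocess a single text or a pair of texts, and its output matches the input format of the pretrained model: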

In [25]:
tokenizer("Hello, this one sentence!", "And this sentence goes with it.")

{'input_ids': [101, 7592, 1010, 2023, 2028, 6251, 999, 102, 1998, 2023, 6251, 3632, 2007, 2009, 1012, 102], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
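
The output above shows how a sentence pair is packed into one sequence. The sketch below reproduces that layout by hand, assuming that 101 and 102 are the IDs of the `[CLS]` and `[SEP]` special tokens in this vocabulary. `pack_pair` is a hypothetical helper for illustration, not part of the tokenizer API.

```python
CLS_ID = 101  # assumed ID of the [CLS] token in this vocabulary
SEP_ID = 102  # assumed ID of the [SEP] token in this vocabulary

def pack_pair(ids_a, ids_b=None):
    """Lay out one or two token-ID lists as [CLS] A [SEP] (B [SEP])."""
    input_ids = [CLS_ID] + list(ids_a) + [SEP_ID]
    if ids_b is not None:
        input_ids += list(ids_b) + [SEP_ID]
    # No padding is applied here, so every position is attended to.
    return {"input_ids": input_ids, "attention_mask": [1] * len(input_ids)}

# Token IDs of the two sentences, copied from the tokenizer output above.
first = [7592, 1010, 2023, 2028, 6251, 999]
second = [1998, 2023, 6251, 3632, 2007, 2009, 1012]
packed = pack_pair(first, second)
```

With these inputs, `packed` has the same `input_ids` and `attention_mask` as the tokenizer output shown above.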

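Depending on the pretrained model you choose, the tokenizer returns different outputs, because each tokenizer is paired with its pretrained model. You can learn more [here](https://huggingface.co/transformers/preprocessing.html).

To preprocess the data, we need to know which columns hold the text for each task, so we define the following dict.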


In [26]:
task_to_keys = {
    "cola": ("sentence", None),
    "mnli": ("premise", "hypothesis"),
    "mnli-mm": ("premise", "hypothesis"),
    "mrpc": ("sentence1", "sentence2"),
    "qnli": ("question", "sentence"),
    "qqp": ("question1", "question2"),
    "rte": ("sentence1", "sentence2"),
    "sst2": ("sentence", None),
    "stsb": ("sentence1", "sentence2"),
    "wnli": ("sentence1", "sentence2"),
}

Check the data format:

In [27]:
sentence1_key, sentence2_key = task_to_keys[task]
if sentence2_key is None:
    print(f"Sentence: {dataset['train'][0][sentence1_key]}")
else:
    print(f"Sentence 1: {dataset['train'][0][sentence1_key]}")
    print(f"Sentence 2: {dataset['train'][0][sentence2_key]}")

Sentence: Our friends won't buy this analysis, let alone the next one we propose.


Then we put the preprocessing code in a function:

In [28]:
def preprocess_function(examples):
    if sentence2_key is None:
        return tokenizer(examples[sentence1_key], truncation=True)
    return tokenizer(examples[sentence1_key], examples[sentence2_key], truncation=True)

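The preprocessing function works on a single example or on several examples at once. With several examples, each key of the result maps to a list with one entry per example: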

In [29]:
preprocess_function(dataset['train'][:5])

{'input_ids': [[101, 2256, 2814, 2180, 1005, 1056, 4965, 2023, 4106, 1010, 2292, 2894, 1996, 2279, 2028, 2057, 16599, 1012, 102], [101, 2028, 2062, 18404, 2236, 3989, 1998, 1045, 1005, 1049, 3228, 2039, 1012, 102], [101, 2028, 2062, 18404, 2236, 3989, 2030, 1045, 1005, 1049, 3228, 2039, 1012, 102], [101, 1996, 2062, 2057, 2817, 16025, 1010, 1996, 13675, 16103, 2121, 2027, 2131, 1012, 102], [101, 2154, 2011, 2154, 1996, 8866, 2024, 2893, 14163, 8024, 3771, 1012, 102]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]}

Next, we preprocess every example in `dataset` with the `map` method, which applies `preprocess_function` to all examples in every split.

In [30]:
encoded_dataset = dataset.map(preprocess_function, batched=True)

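Even better, the results are cached automatically, so they are not recomputed the next time you run this step. The 🤗 Datasets library checks whether the inputs to `map` have changed: if they have not, it reuses the cached data, and if they have, it processes the data again. Be aware that a stale cache can affect your results if something changed that the library did not detect. In that case, pass `load_from_cache_file=False` to `map` to skip the cache and force the preprocessing to run again. Also note that `batched=True`, used above, is an argument of `map` that passes the examples to the function in batches, which lets the fast tokenizer use multi-threading to process the texts of a batch in parallel.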
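
The caching behavior described above can be pictured with a toy in-memory cache. This is only an illustration of the idea (reuse a result when neither the function nor the data changed) and not how 🤗 Datasets implements its cache. `fingerprint`, `cached_map`, and `_cache` are hypothetical names; only the `load_from_cache_file` parameter name is borrowed from `map`.

```python
import hashlib

_cache = {}  # fingerprint -> processed result (toy in-memory cache)

def fingerprint(fn, data):
    """Hash the function's bytecode, constants, and names together with the data."""
    code = fn.__code__
    payload = repr((code.co_code, code.co_consts, code.co_names, data))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

def cached_map(fn, data, load_from_cache_file=True):
    """Apply fn to every item and return (result, cache_hit)."""
    key = fingerprint(fn, data)
    if load_from_cache_file and key in _cache:
        return _cache[key], True
    result = [fn(item) for item in data]
    _cache[key] = result
    return result, False

def lowercase(text):
    return text.lower()

def uppercase(text):
    return text.upper()

texts = ("Hello World", "Fine-Tuning")
first_result, first_hit = cached_map(lowercase, texts)    # computed
_, second_hit = cached_map(lowercase, texts)              # served from the cache
_, changed_hit = cached_map(uppercase, texts)             # function changed
_, forced_hit = cached_map(lowercase, texts, load_from_cache_file=False)  # cache skipped
```

The second call is a cache hit because nothing changed. Switching the function or passing `load_from_cache_file=False` forces the work to be done again.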

## Fine-tuning the pretrained model

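Now that the data is ready, we need to download and load our pretrained model and then fine-tune it. Since our task is sequence classification, we need a model class built for it, so we use `AutoModelForSequenceClassification`. As with the tokenizer, the `from_pretrained` method downloads and loads the model and caches it, so it is not downloaded again.

Note that STS-B is a regression problem (a single output) and MNLI is a 3-class classification problem, while all other tasks have 2 classes: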


In [31]:
from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer

num_labels = 3 if task.startswith("mnli") else 1 if task=="stsb" else 2
model = AutoModelForSequenceClassification.from_pretrained(model_checkpoint, num_labels=num_labels)

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


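Because we are fine-tuning for text classification but loaded a pretrained language model, the warning tells us that some weights did not match when the model was loaded: the pretraining head of the language model was discarded, and a text classification head was randomly initialized in its place.

To build a `Trainer`, we need three more things. The most important is [`TrainingArguments`](https://huggingface.co/transformers/main_classes/trainer.html#transformers.TrainingArguments), which holds all the attributes that define the training process.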


In [32]:
metric_name = "pearson" if task == "stsb" else "matthews_correlation" if task == "cola" else "accuracy"

args = TrainingArguments(
    "test-glue",
    evaluation_strategy = "epoch",
    save_strategy = "epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=5,
    weight_decay=0.01,
    load_best_model_at_end=True,
    metric_for_best_model=metric_name,
)



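The `evaluation_strategy = "epoch"` argument above tells the training code to run an evaluation on the validation set once per epoch.

The `batch_size` was defined earlier in this notebook.

Finally, because different tasks need different evaluation metrics, we define a function that computes the metric for the current task from the model's predictions: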

In [33]:
def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    if task != "stsb":
        predictions = np.argmax(predictions, axis=1)
    else:
        predictions = predictions[:, 0]
    return metric.compute(predictions=predictions, references=labels)
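
The branch in `compute_metrics` can be illustrated without NumPy. The sketch below mirrors the same rule on plain lists: take the arg-max over the logits for classification tasks, and take the single output column for the STS-B regression task. `to_predictions` and the sample values are hypothetical and only serve as an example.

```python
def to_predictions(outputs, task):
    """Turn raw model outputs into predictions the metric can score."""
    if task == "stsb":
        # Regression: the model emits one value per example.
        return [row[0] for row in outputs]
    # Classification: pick the index of the largest logit in each row.
    return [max(range(len(row)), key=row.__getitem__) for row in outputs]

# Made-up logits for three examples of a 2-class task.
class_logits = [[0.2, 1.7], [2.3, -0.4], [-1.0, -0.5]]
class_preds = to_predictions(class_logits, "cola")

# Made-up regression outputs for two examples.
regression_outputs = [[3.8], [0.4]]
regression_preds = to_predictions(regression_outputs, "stsb")
```

Here `class_preds` holds one class index per example and `regression_preds` holds one score per example, which is the shape the metric's `compute` method expects.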

We pass all of this to the `Trainer`:

In [None]:
validation_key = "validation_mismatched" if task == "mnli-mm" else "validation_matched" if task == "mnli" else "validation"
trainer = Trainer(
    model,
    args,
    train_dataset=encoded_dataset["train"],
    eval_dataset=encoded_dataset[validation_key],
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)

Start training:

- You need to log in to wandb before training.

In [38]:
import wandb
wandb.init(project="your_project_name", name="your_run_name")


In [39]:
trainer.train()

| Epoch | Training Loss | Validation Loss | Matthews Correlation |
|---|---|---|---|
| 1 | 0.2329 | 0.450604 | 0.516439 |
| 2 | 0.2096 | 0.781156 | 0.514215 |
| 3 | 0.1448 | 0.901496 | 0.537666 |
| 4 | 0.1021 | 1.048726 | 0.527142 |
| 5 | 0.0672 | 1.081243 | 0.522959 |


TrainOutput(global_step=2675, training_loss=0.14512323932112933, metrics={'train_runtime': 172.655, 'train_samples_per_second': 247.633, 'train_steps_per_second': 15.493, 'total_flos': 229000686898068.0, 'train_loss': 0.14512323932112933, 'epoch': 5.0})

After training, evaluate the model:

In [40]:
trainer.evaluate()

{'eval_loss': 0.9014955163002014,
 'eval_matthews_correlation': 0.5376662783035702,
 'eval_runtime': 0.7932,
 'eval_samples_per_second': 1314.854,
 'eval_steps_per_second': 83.203,
 'epoch': 5.0}
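
The evaluation above reports the loss and Matthews correlation of epoch 3 rather than epoch 5. This is consistent with `load_best_model_at_end=True` and `metric_for_best_model` in the training arguments: the best checkpoint is reloaded when training ends. The sketch below shows that selection rule on the logged values, assuming that a higher metric is better. `best_checkpoint` is a hypothetical helper, not a `Trainer` method.

```python
# Per-epoch validation results, copied from the training log above.
eval_history = [
    {"epoch": 1, "eval_loss": 0.450604, "eval_matthews_correlation": 0.516439},
    {"epoch": 2, "eval_loss": 0.781156, "eval_matthews_correlation": 0.514215},
    {"epoch": 3, "eval_loss": 0.901496, "eval_matthews_correlation": 0.537666},
    {"epoch": 4, "eval_loss": 1.048726, "eval_matthews_correlation": 0.527142},
    {"epoch": 5, "eval_loss": 1.081243, "eval_matthews_correlation": 0.522959},
]

def best_checkpoint(history, metric_name, greater_is_better=True):
    """Return the logged entry with the best value of the given metric."""
    pick = max if greater_is_better else min
    return pick(history, key=lambda entry: entry[metric_name])

best_by_mcc = best_checkpoint(eval_history, "eval_matthews_correlation")
best_by_loss = best_checkpoint(eval_history, "eval_loss", greater_is_better=False)
```

Selecting by Matthews correlation picks epoch 3, while selecting by validation loss would pick epoch 1, so the choice of `metric_for_best_model` decides which checkpoint you end up evaluating.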

To see how your model fared you can compare it to the [GLUE Benchmark leaderboard](https://gluebenchmark.com/leaderboard).

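## Hyperparameter search

The `Trainer` also supports hyperparameter search, using [optuna](https://optuna.org/) or [Ray Tune](https://docs.ray.io/en/latest/tune/) as the backend.

Run the following two lines to install the dependencies: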

In [41]:
! pip install optuna
! pip install ray[tune]

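During hyperparameter search, the `Trainer` runs several trainings, so instead of a model it needs a function that defines the model, which it calls to re-initialize the model at the start of each run: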

In [42]:
def model_init():
    return AutoModelForSequenceClassification.from_pretrained(model_checkpoint, num_labels=num_labels)

Then we create the `Trainer` much as before:

In [43]:
trainer = Trainer(
    model_init=model_init,
    args=args,
    train_dataset=encoded_dataset["train"],
    eval_dataset=encoded_dataset[validation_key],
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)


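Now call the `hyperparameter_search` method. Note that this can take a long time, so one option is to run the search on part of the dataset first and then train on the full dataset with the best hyperparameters.
For example, you could search on 1/10 of the data; the call below searches on the full training set that was passed to the `Trainer` above: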

In [44]:
best_run = trainer.hyperparameter_search(n_trials=10, direction="maximize")

[I 2024-12-20 13:40:58,280] A new study created in memory with name: no-name-2a2aee3a-ea2f-41bd-8e98-4ddbe0f8b6f7

| Epoch | Training Loss | Validation Loss | Matthews Correlation |
|---|---|---|---|
| 1 | 0.5712 | 0.562609 | 0.065589 |
| 2 | 0.5226 | 0.574807 | 0.348903 |
| 3 | 0.5111 | 0.567856 | 0.415156 |
| 4 | 0.4901 | 0.609662 | 0.418103 |
| 5 | 0.4807 | 0.623321 | 0.415114 |


[I 2024-12-20 13:49:24,080] Trial 0 finished with value: 0.41511377999074933 and parameters: {'learning_rate': 1.2197396229337402e-06, 'num_train_epochs': 5, 'seed': 8, 'per_device_train_batch_size': 4}. Best is trial 0 with value: 0.41511377999074933.

| Epoch | Training Loss | Validation Loss | Matthews Correlation |
|---|---|---|---|
| 1 | No log | 0.579611 | 0.0 |
| 2 | No log | 0.536322 | 0.33369 |
| 3 | No log | 0.527782 | 0.400568 |
| 4 | 0.521700 | 0.517789 | 0.412741 |
| 5 | 0.521700 | 0.517443 | 0.424645 |


[I 2024-12-20 13:51:40,301] Trial 1 finished with value: 0.42464460528655273 and parameters: {'learning_rate': 4.703631753423137e-06, 'num_train_epochs': 5, 'seed': 14, 'per_device_train_batch_size': 64}. Best is trial 1 with value: 0.42464460528655273.

| Epoch | Training Loss | Validation Loss | Matthews Correlation |
|---|---|---|---|
| 1 | No log | 0.464252 | 0.466462 |
| 2 | 0.421900 | 0.520083 | 0.512052 |
| 3 | 0.421900 | 0.649905 | 0.495515 |
| 4 | 0.143000 | 0.84499 | 0.497995 |


[I 2024-12-20 13:53:40,253] Trial 2 finished with value: 0.49799539554804517 and parameters: {'learning_rate': 6.554162171029673e-05, 'num_train_epochs': 4, 'seed': 25, 'per_device_train_batch_size': 32}. Best is trial 2 with value: 0.49799539554804517.

| Epoch | Training Loss | Validation Loss | Matthews Correlation |
|---|---|---|---|
| 1 | 0.505 | 0.470903 | 0.471959 |
| 2 | 0.3654 | 0.546821 | 0.48127 |
| 3 | 0.3138 | 0.734678 | 0.493572 |
| 4 | 0.2481 | 0.811071 | 0.519931 |
| 5 | 0.2025 | 0.862313 | 0.519563 |


[I 2024-12-20 13:58:21,715] Trial 3 finished with value: 0.519563286537562 and parameters: {'learning_rate': 9.815740017028768e-06, 'num_train_epochs': 5, 'seed': 16, 'per_device_train_batch_size': 8}. Best is trial 3 with value: 0.519563286537562.

| Epoch | Training Loss | Validation Loss | Matthews Correlation |
|---|---|---|---|
| 1 | 0.5696 | 0.558246 | 0.065589 |
| 2 | 0.4999 | 0.534413 | 0.366856 |
| 3 | 0.4628 | 0.541587 | 0.412188 |
| 4 | 0.4327 | 0.525048 | 0.428606 |
| 5 | 0.4236 | 0.530894 | 0.422394 |


[I 2024-12-20 14:03:21,618] Trial 4 finished with value: 0.42239354524181305 and parameters: {'learning_rate': 1.7442191955755123e-06, 'num_train_epochs': 5, 'seed': 29, 'per_device_train_batch_size': 8}. Best is trial 3 with value: 0.519563286537562.

| Epoch | Training Loss | Validation Loss | Matthews Correlation |
|---|---|---|---|
| 1 | 0.5655 | 0.535146 | 0.34115 |
| 2 | 0.4752 | 0.515124 | 0.40747 |
| 3 | 0.4371 | 0.511156 | 0.401824 |


[I 2024-12-20 14:05:24,982] Trial 5 finished with value: 0.40182361094459157 and parameters: {'learning_rate': 4.184271341577123e-06, 'num_train_epochs': 3, 'seed': 29, 'per_device_train_batch_size': 16}. Best is trial 3 with value: 0.519563286537562.

| Epoch | Training Loss | Validation Loss | Matthews Correlation |
|---|---|---|---|
| 1 | 0.5352 | 0.489974 | 0.452479 |
| 2 | 0.3959 | 0.463008 | 0.488237 |
| 3 | 0.31 | 0.53593 | 0.477348 |
| 4 | 0.2404 | 0.621681 | 0.506213 |
| 5 | 0.2087 | 0.666568 | 0.495548 |


[I 2024-12-20 14:08:29,396] Trial 6 finished with value: 0.49554751348625065 and parameters: {'learning_rate': 1.0840630246494861e-05, 'num_train_epochs': 5, 'seed': 26, 'per_device_train_batch_size': 16}. Best is trial 3 with value: 0.519563286537562.

| Epoch | Training Loss | Validation Loss | Matthews Correlation |
|---|---|---|---|
| 1 | No log | 0.531707 | 0.379245 |


[I 2024-12-20 14:08:57,579] Trial 7 pruned. 

| Epoch | Training Loss | Validation Loss | Matthews Correlation |
|---|---|---|---|
| 1 | No log | 0.571957 | 0.0 |


[I 2024-12-20 14:09:32,395] Trial 8 finished with value: 0.0 and parameters: {'learning_rate': 8.780971681506628e-06, 'num_train_epochs': 1, 'seed': 30, 'per_device_train_batch_size': 64}. Best is trial 3 with value: 0.519563286537562.

| Epoch | Training Loss | Validation Loss | Matthews Correlation |
|---|---|---|---|
| 1 | No log | 0.464937 | 0.460974 |
| 2 | 0.427300 | 0.491362 | 0.496228 |
| 3 | 0.427300 | 0.544285 | 0.541357 |
| 4 | 0.160400 | 0.731689 | 0.530857 |
| 5 | 0.160400 | 0.884695 | 0.516214 |


[I 2024-12-20 14:11:58,177] Trial 9 finished with value: 0.5162139919308135 and parameters: {'learning_rate': 5.213880648932613e-05, 'num_train_epochs': 5, 'seed': 25, 'per_device_train_batch_size': 32}. Best is trial 3 with value: 0.519563286537562.


`hyperparameter_search` returns the hyperparameters of the best run:

In [45]:
best_run

BestRun(run_id='3', objective=0.519563286537562, hyperparameters={'learning_rate': 9.815740017028768e-06, 'num_train_epochs': 5, 'seed': 16, 'per_device_train_batch_size': 8}, run_summary=None)
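
To make the search loop concrete, here is a minimal sketch of the general idea: sample a configuration, score it, and keep the best. It uses plain random search with a made-up objective, which is a simplification swapped in for illustration; real backends such as optuna use their own samplers and train a model for each trial. The functions `sample_hyperparameters`, `toy_objective`, and `search` are hypothetical, and the sampling ranges are assumptions chosen to resemble the values seen in the trial log.

```python
import math
import random

def sample_hyperparameters(rng):
    """Draw one hypothetical configuration (log-uniform learning rate)."""
    return {
        "learning_rate": 10 ** rng.uniform(-6, -4),
        "num_train_epochs": rng.randint(1, 5),
        "per_device_train_batch_size": rng.choice([4, 8, 16, 32, 64]),
    }

def toy_objective(params):
    """Stand-in for 'train, then evaluate': peaks at a learning rate of 1e-5."""
    return -abs(math.log10(params["learning_rate"]) + 5)

def search(n_trials, direction="maximize", seed=0):
    """Run n_trials random trials and return (best_trial, all_trials)."""
    rng = random.Random(seed)
    trials = []
    for run_id in range(n_trials):
        params = sample_hyperparameters(rng)
        trials.append({
            "run_id": run_id,
            "objective": toy_objective(params),
            "hyperparameters": params,
        })
    pick = max if direction == "maximize" else min
    return pick(trials, key=lambda trial: trial["objective"]), trials

best_trial, all_trials = search(n_trials=10)
```

As in the real call, `direction="maximize"` means the trial with the highest objective wins, and `best_trial["hyperparameters"]` plays the role of `best_run.hyperparameters`.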

Set the `Trainer` to the best hyperparameters found and train again:

In [46]:
for n, v in best_run.hyperparameters.items():
    setattr(trainer.args, n, v)

trainer.train()


| Epoch | Training Loss | Validation Loss | Matthews Correlation |
|---|---|---|---|
| 1 | 0.505 | 0.470903 | 0.471959 |
| 2 | 0.3654 | 0.546821 | 0.48127 |
| 3 | 0.3138 | 0.734678 | 0.493572 |
| 4 | 0.2481 | 0.811071 | 0.519931 |
| 5 | 0.2025 | 0.862313 | 0.519563 |


TrainOutput(global_step=5345, training_loss=0.3305862112732219, metrics={'train_runtime': 274.1791, 'train_samples_per_second': 155.938, 'train_steps_per_second': 19.495, 'total_flos': 201335439884616.0, 'train_loss': 0.3305862112732219, 'epoch': 5.0})

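Finally, don't forget to [see how to upload your model](https://huggingface.co/transformers/model_sharing.html) to the [🤗 Model Hub](https://huggingface.co/models). You can then use your uploaded model directly by its name, just as we did with the checkpoint at the beginning of this notebook.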