# Getting Started with Fine-Tuning in Hugging Face Transformers

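This example walks through the main steps of fine-tuning a model with Transformers:
- Downloading the dataset
- Preprocessing the data
- Configuring training hyperparameters
- Setting up evaluation metrics
- A brief introduction to the Trainer
- Running the training
- Saving the model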

## The YelpReviewFull dataset

**Hugging Face dataset: [YelpReviewFull](https://huggingface.co/datasets/yelp_review_full)**

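### Dataset summary

The Yelp reviews dataset consists of reviews from Yelp. It is extracted from the Yelp Dataset Challenge 2015 data.

### Supported tasks and leaderboards
Text classification, sentiment classification: the dataset is mainly used for text classification, that is, predicting the sentiment of a given text.

### Languages
The reviews are mainly written in English.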

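### Dataset structure

#### Data instances
A typical data point comprises a text and the corresponding label.

An example from the YelpReviewFull test set looks as follows: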

```python
{
    'label': 0,
    'text': 'I got \'new\' tires from them and within two weeks got a flat. I took my car to a local mechanic to see if i could get the hole patched, but they said the reason I had a flat was because the previous patch had blown - WAIT, WHAT? I just got the tire and never needed to have it patched? This was supposed to be a new tire. \\nI took the tire over to Flynn\'s and they told me that someone punctured my tire, then tried to patch it. So there are resentful tire slashers? I find that very unlikely. After arguing with the guy and telling him that his logic was far fetched he said he\'d give me a new tire \\"this time\\". \\nI will never go back to Flynn\'s b/c of the way this guy treated me and the simple fact that they gave me a used tire!'
}
```

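#### Data fields

- 'text': The review text. It is escaped with double quotes ("), and any internal double quote is escaped by two double quotes (""). New lines are escaped by a backslash followed by an "n" character, that is "\n".
- 'label': Corresponds to the score of the review (between 1 and 5 stars). In the loaded dataset it is stored as a class index from 0 to 4 (the samples in this notebook show the values 0 and 4).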
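
The field conventions above can be illustrated with a small sketch. The helper names `unescape_review` and `label_to_stars` are hypothetical and not part of the `datasets` library. The sketch assumes that the loaded `text` still contains the literal two-character sequences `\n` and `\"` (as the sample instance suggests) and that `label` is a zero-based class index.

```python
def unescape_review(text: str) -> str:
    """Turn the literal escape sequences found in the raw text into real characters."""
    # A backslash followed by "n" stands for a newline in the raw reviews.
    text = text.replace("\\n", "\n")
    # A backslash followed by a double quote stands for a plain double quote.
    text = text.replace('\\"', '"')
    return text


def label_to_stars(label: int) -> int:
    """Map a zero-based class index (0..4) to a star rating (1..5)."""
    if not 0 <= label <= 4:
        raise ValueError(f"label must be in 0..4, got {label}")
    return label + 1


raw = 'Great food.\\nWould come back \\"this time\\".'
print(unescape_review(raw))
print(label_to_stars(0), label_to_stars(4))
```

Whether you want to unescape the text before tokenization is a modeling choice; the rest of this notebook feeds the text to the tokenizer unchanged.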

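#### Data splits

The Yelp reviews full star dataset is constructed by randomly taking 130,000 training samples and 10,000 test samples for each star rating from 1 to 5. In total there are 650,000 training samples and 50,000 test samples.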

## Downloading the dataset

In [1]:
from datasets import load_dataset

dataset = load_dataset("yelp_review_full")

In [2]:
dataset

DatasetDict({
    train: Dataset({
        features: ['label', 'text'],
        num_rows: 650000
    })
    test: Dataset({
        features: ['label', 'text'],
        num_rows: 50000
    })
})

In [3]:
dataset["train"][0]

{'label': 4,
 'text': "dr. goldberg offers everything i look for in a general practitioner.  he's nice and easy to talk to without being patronizing; he's always on time in seeing his patients; he's affiliated with a top-notch hospital (nyu) which my parents have explained to me is very important in case something happens and you need surgery; and you can get referrals to see specialists without having to see him first.  really, what more do you need?  i'm sitting here trying to think of any complaints i have about him, but i'm really drawing a blank."}

In [4]:
import random
import pandas as pd
import datasets
from IPython.display import display, HTML

In [5]:
def show_random_elements(dataset, num_examples=10):
    assert num_examples <= len(dataset), "Can't pick more elements than there are in the dataset."
    picks = []
    for _ in range(num_examples):
        pick = random.randint(0, len(dataset)-1)
        while pick in picks:
            pick = random.randint(0, len(dataset)-1)
        picks.append(pick)
    
    df = pd.DataFrame(dataset[picks])
    for column, typ in dataset.features.items():
        if isinstance(typ, datasets.ClassLabel):
            df[column] = df[column].transform(lambda i: typ.names[i])
    display(HTML(df.to_html()))

In [6]:
show_random_elements(dataset["train"])

Unnamed: 0,label,text
0,2 star,"First time at China Tango, I ordered Singapore noodles for lunch. Dry and super salty, just the way I don't like it. It's close to my workplace so I decided to give this place another shot. On the plus side, restaurant is new, clean and modern. Although the food is pretty average, the lunch specials are a really good value."
1,4 stars,"A must-try if you're in Vegas! \n\nEven though it looks packed, the lines actually go super quick and you get your food pretty quickly as well. I got the Caribbean Jerk Chicken and I usually would be picky and say no to the bell peppers, but I decided to keep them in. So glad I did because they were DELICIOUS!! The bread was so crisp, the bell peppers were sweet and they went perfect with the chicken and spicy jerk sauce (haHhaa jerk sauce sounds totally wrong). \n\nThe sandwich was smaller than I had anticipated but in the end, I was happy because I wasn't overly full and could still fit into my dress that night :)"
2,2 star,"I give this place two stars, but I do it with love. I was a semi-regular when I lived on Peel. Always a lineup outside, because they card at the door (bring ID!) and it's usually packed to the brim. Amazingly cheap for most things. The 99-cent shooters on Thursday night are only slightly alcoholic but most are delicious. Never bothered with the food, although I hear it's in the okay/fine/nothing to write home about realm of quality. A decent place to knock back a pitcher or six with the homies. Impossible to hear yourself over the din. Some people say the bartenders are rude; I've never had one I minded. Surprising number of tourists. Don't bring anyone here that you want to get with, for the love of God; you will automatically fail."
3,5 stars,"Like another reviewer, I was intending to go to the other Greek café down the street, but found it closed. Gratefully, I found this little gem! We had the gyro and fries, soup and the falafel platter. Everything was delicious and well seasoned. All items were fresh, (as the owner explained, nothing is frozen), and all items are individually prepared from her personal recipes. Couldn't be much better, especially with a menu that contained other Greek and Mediterranean specialties such as leg of lamb. While the servings were ample, the desserts, rice pudding and cheesecake are something to save room for. The server was a very nice older lady, just like mom. In fact, the entire experience was like being invited to the owner's house while she prepared delicacies from her kitchen. Definitely worth a try...I will be back."
4,4 stars,"Excelent! This is a terrific restaraunt, good food and great service. One of the few open on Thanksgiving. Terrific fried chicken."
5,3 stars,"We like this place and go often. I like their happy hour options to add an appetizer for $.99 to your beer. We order edamame for our daughter who loves it, when coupled with their egg drop soup is a nutritious and affordable kid friendly option. We like the different rolls you can get here and the prices are reasonable. I do think they add too much vinegar to their sushi rice which makes all of the rolls taste similar and tangy. They are friendly and I just adore the decor."
6,4 stars,"I was unaware of the existence of Ross until I moved to Vegas. It was while shopping at Target that I noticed the store just down the way. I figured I would pop by and see what they had to offer. Boy, was I in for a surprise! I now visit Ross on a weekly basis, searching for good buys. \n\nThe store is iffy on cleanliness. I think it depends on the time of day you go. I've seen it in shambles with merchandise everywhere and I've seen it relatively clean and organized. The employees aren't enthusiastic by any means, but they will help if asked. \n\nThe main reason I shop here, though, is for the deals. They have tons of linens, clothes, shoes, and home decor for great prices. I dare say 90% of my home decor is from Ross. \n\nIt's hit or miss as far as finding what you're looking for. If you're just in to browse though, you'll more than likely walk out with something."
7,3 stars,"Well, All good things must come to an end. Something has changed and this place has lost stars in my review. Not sure where to go next..... Today's experience was bad; loud discussion about the cost and time associated with her 30 vs 60 min massage, after she was seated in the dark quiet massage area. Seemed like miscommunication was abundant because within the first 15 min of my 30 min foot massage, they changed therapists three times! I had reviewed my preferences (deep pressure, concentrate at the arch) with two people already-just tolerated the light feathery stroking of the third person. Luckily, I had earned a free massage on my frequent visitor card therefore, didn't pay beyond the tip. Very sad day"
8,4 stars,"We went here for dinner this weekend and the wait was about an hour and fifteen minutes, we were able to put our name in and they would just send us a text letting us know our table was ready which is pretty great. That meant we could leave and go to another place for a drink if we had wanted to. Luckily there was room at the bar for us so we ordered a drink and the pesto fries appetizer while we waited. The fries were delicious and the pesto was amazing. So good I decided to order the pesto pasta as my entree and I am so glad I did. The pasta sauce was also very creamy which I loved. My husband ordered the fish and chips and tried a few bites of it as well, it was great with wonderful flaky crust. Their tartar sauce was amazing. My best friend ordered their burger which I didn't try because I don't eat beef but she said it was delicious. I will be going here again in the future."
9,4 stars,I enjoy shopping here because of the good prices and the staff actually are nice too u. Good selection on fish and produce. When im feeling lazy they have frozen loompia all ready that actually taste good and not freezer burned. Found a store that was selling freezer burned loompia before and it still bugs me. I buy cases of coconut water here as well cause of the low price.


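## Preprocessing the data

Once the dataset has been downloaded locally, use a tokenizer to process the text. Inputs of varying length can be handled with padding and truncation strategies.

The `map` method of Datasets applies a preprocessing function to the entire dataset in one call.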
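
As a rough mental model of what `map` does when called with `batched=True`, the sketch below re-implements the idea in plain Python: the function receives a dictionary of column lists rather than one row at a time, and the columns it returns are merged into the result. `toy_batched_map` and `add_length` are hypothetical names, the batch size of 1000 is assumed to mirror the library default, and this is not how `datasets` is implemented internally.

```python
from typing import Callable, Dict, List

Batch = Dict[str, List]


def toy_batched_map(columns: Batch, fn: Callable[[Batch], Batch], batch_size: int = 1000) -> Batch:
    """Apply fn to slices of the columns and merge the returned columns back in."""
    num_rows = len(next(iter(columns.values())))
    result: Batch = {name: [] for name in columns}
    for start in range(0, num_rows, batch_size):
        batch = {name: values[start:start + batch_size] for name, values in columns.items()}
        new_columns = fn(batch)
        # Keep the original columns and add (or overwrite with) the new ones.
        for name, values in {**batch, **new_columns}.items():
            result.setdefault(name, []).extend(values)
    return result


def add_length(batch: Batch) -> Batch:
    # The function sees a list of texts, not a single string.
    return {"num_chars": [len(text) for text in batch["text"]]}


toy = {"label": [0, 4, 2], "text": ["bad", "really great", "fine"]}
mapped = toy_batched_map(toy, add_length, batch_size=2)
print(mapped["num_chars"])
```

Working on whole batches is what lets a tokenizer process many texts per call instead of one at a time.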

The cell below pads every example to the maximum length and processes the whole dataset:

In [7]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")


def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)


tokenized_datasets = dataset.map(tokenize_function, batched=True)
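
To make `padding="max_length"` and `truncation=True` concrete, here is a pure-Python sketch of their effect on a single sequence of token ids. `pad_or_truncate` is a hypothetical helper, the pad id of 0 is an assumption, and a real tokenizer also takes care of special tokens such as `[CLS]` and `[SEP]`, which this sketch ignores.

```python
from typing import Dict, List


def pad_or_truncate(token_ids: List[int], max_length: int, pad_id: int = 0) -> Dict[str, List[int]]:
    """Return fixed-length input_ids plus an attention mask marking real tokens."""
    kept = token_ids[:max_length]          # truncation: drop anything past max_length
    num_pad = max_length - len(kept)       # padding needed to reach max_length, if any
    return {
        "input_ids": kept + [pad_id] * num_pad,
        "attention_mask": [1] * len(kept) + [0] * num_pad,
    }


short = pad_or_truncate([101, 1799, 102], max_length=6)
long = pad_or_truncate(list(range(1, 11)), max_length=6)
print(short)
print(long)
```

Padding every example to the maximum length is simple but wasteful for short reviews, since the model still processes the padded positions.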

In [8]:
show_random_elements(tokenized_datasets["train"], num_examples=1)

Unnamed: 0,label,text,input_ids,token_type_ids,attention_mask
0,1 star,"While saying a place is full of knuckleheads is not saying much in Las Vegas the level of knucklehead at the Hard Rock is even higher. They are usually closing the pool for one thing or another and seriously who gives a fuck about Classic Rock anymore? \nThe Rehab pool party is a big disappointment too. Each deck chair will cost you $150 to rent, the people there are not attractive, and the DJ recycles his tracks through a list of very weak hip-hop, uninteresting dance set and cheeseball medleys that will make you want to kill someone.\nThe only highlight of Rehab was when these 10 women brawled it out after the liquor had been flowing and the sun was shining. Luckily I was only visiting friends and was able to retire to my room closer to the strip when I wanted to get away. I did enjoy the craps table by the pool. The people running it were entertaining and I was on a roll. If you think it might be a good idea to spend your money at the Hard Rock, think again and spend it on a nice BJ or maybe a meal at the MGM.","[101, 1799, 2157, 170, 1282, 1110, 1554, 1104, 18325, 19046, 12970, 1110, 1136, 2157, 1277, 1107, 5976, 6554, 1103, 1634, 1104, 18325, 19046, 3925, 1120, 1103, 9322, 2977, 1110, 1256, 2299, 119, 1220, 1132, 1932, 5134, 1103, 4528, 1111, 1141, 1645, 1137, 1330, 1105, 5536, 1150, 3114, 170, 9367, 1164, 6667, 2977, 4169, 136, 165, 183, 1942, 4638, 11336, 17266, 4528, 1710, 1110, 170, 1992, 10866, 1315, 119, 2994, 5579, 2643, 1209, 2616, 1128, 109, 4214, 1106, 9795, 117, 1103, 1234, 1175, 1132, 1136, 8394, 117, 1105, 1103, 6027, 1231, 19964, 1116, 1117, 2390, 1194, 170, 2190, 1104, 1304, 4780, ...]","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...]","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...]"


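### Sampling the data

We use 1,000 samples to demonstrate small-scale training on BERT (with the PyTorch-based Trainer).

The `shuffle()` function randomly reorders the rows of the dataset. If you want more control over the algorithm used for shuffling, you can pass the `generator` parameter to use a different `numpy.random.Generator`.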

In [9]:
small_train_dataset = tokenized_datasets["train"].shuffle(seed=42).select(range(1000))
small_eval_dataset = tokenized_datasets["test"].shuffle(seed=42).select(range(1000))
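
The shuffle-then-select pattern amounts to drawing a reproducible random subset. The sketch below shows the same idea on a plain list with the standard-library `random` module. This is a stand-in for illustration only: `seeded_subset` is a hypothetical helper, and because `datasets` relies on a NumPy generator, the same seed will not pick the same rows.

```python
import random
from typing import List, Sequence


def seeded_subset(rows: Sequence, num_samples: int, seed: int) -> List:
    """Shuffle row indices with a fixed seed, then keep the first num_samples rows."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)   # private RNG, leaves the global state alone
    return [rows[i] for i in indices[:num_samples]]


rows = [f"review {i}" for i in range(10_000)]
first = seeded_subset(rows, num_samples=1000, seed=42)
second = seeded_subset(rows, num_samples=1000, seed=42)
print(len(first), first == second)
```

Fixing the seed means that repeated runs train and evaluate on the same subset, which keeps results comparable between experiments.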

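## Fine-tuning configuration

### Loading the BERT model

The warning printed below tells us that some weights are being discarded (the `cls.predictions.*` and `cls.seq_relationship.*` weights of the pretraining heads) and that others are randomly initialized (the new classification head). This is perfectly normal when fine-tuning: we are removing the heads used for the pretraining objectives and replacing them with a new sequence classification head, for which no pretrained weights exist. The library therefore warns that the model should be fine-tuned before it is used for inference, which is exactly what we are about to do.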

In [10]:
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased", num_labels=5)

Some weights of the model checkpoint at bert-base-cased were not used when initializing BertForSequenceClassification: ['cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.bias', 'cls.seq_relationship.weight', 'cls.predictions.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-cased and are newly initi

### Training hyperparameters (TrainingArguments)

Full list of parameters and their default values: https://huggingface.co/docs/transformers/v4.36.1/en/main_classes/trainer#transformers.TrainingArguments

Source definition: https://github.com/huggingface/transformers/blob/v4.36.1/src/transformers/training_args.py#L161

**Most important setting: the path where model weights are saved (`output_dir`)**

In [27]:
from transformers import TrainingArguments

model_dir = "models/bert-base-cased"

# logging_steps defaults to 500; given the size of our training data and the number of steps, we set it to 100
training_args = TrainingArguments(output_dir=f"{model_dir}/test_trainer",
                                  logging_dir=f"{model_dir}/test_trainer/runs",
                                  logging_steps=100)
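
The choice of `logging_steps=100` follows from the size of the run. Here is a back-of-the-envelope sketch, assuming the 1,000-sample training subset, a single device, no gradient accumulation, and a batch size of 8 with 3 epochs (taken here as the `TrainingArguments` defaults; confirm them in your own printed configuration). `total_training_steps` is a hypothetical helper.

```python
import math


def total_training_steps(num_samples: int, batch_size: int, num_epochs: int) -> int:
    """Number of optimizer steps on one device when the last partial batch is kept."""
    batches_per_epoch = math.ceil(num_samples / batch_size)
    return batches_per_epoch * num_epochs


steps = total_training_steps(num_samples=1000, batch_size=8, num_epochs=3)
logged_at_100 = [s for s in range(1, steps + 1) if s % 100 == 0]
logged_at_500 = [s for s in range(1, steps + 1) if s % 500 == 0]
print(steps, logged_at_100, logged_at_500)
```

Under these assumptions the run has 375 steps, so logging every 500 steps would never record a training loss, while logging every 100 steps records three.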

In [28]:
# Full hyperparameter configuration
print(training_args)

TrainingArguments(
_n_gpu=1,
adafactor=False,
adam_beta1=0.9,
adam_beta2=0.999,
adam_epsilon=1e-08,
auto_find_batch_size=False,
bf16=False,
bf16_full_eval=False,
data_seed=None,
dataloader_drop_last=False,
dataloader_num_workers=0,
dataloader_pin_memory=True,
ddp_backend=None,
ddp_bucket_cap_mb=None,
ddp_find_unused_parameters=None,
ddp_timeout=1800,
debug=[],
deepspeed=None,
disable_tqdm=False,
do_eval=False,
do_predict=False,
do_train=False,
eval_accumulation_steps=None,
eval_delay=0,
eval_steps=None,
evaluation_strategy=IntervalStrategy.NO,
fp16=False,
fp16_backend=auto,
fp16_full_eval=False,
fp16_opt_level=O1,
fsdp=[],
fsdp_config={'fsdp_min_num_params': 0, 'xla': False, 'xla_fsdp_grad_ckpt': False},
fsdp_min_num_params=0,
fsdp_transformer_layer_cls_to_wrap=None,
full_determinism=False,
gradient_accumulation_steps=1,
gradient_checkpointing=False,
greater_is_better=None,
group_by_length=False,
half_precision_backend=auto,
hub_model_id=None,
hub_private_repo=False,
hub_strategy=HubSt

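### Metric evaluation during training (Evaluate)

The **[Hugging Face Evaluate library](https://huggingface.co/docs/evaluate/index)** gives access, with a single line of code, to dozens of evaluation methods across domains (natural language processing, computer vision, reinforcement learning, and more). The full list of currently supported metrics is at https://huggingface.co/evaluate-metric

The Trainer does not automatically evaluate model performance during training, so we need to pass it a function that computes and reports metrics.

The Evaluate library provides a simple accuracy metric, which you can load with the `evaluate.load` function: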

In [29]:
import numpy as np
import evaluate

metric = evaluate.load("accuracy")


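Next, call the metric's `compute` function to calculate the accuracy of the predictions.

Before passing predictions to `compute`, we need to convert the logits into predicted class indices (**all Transformers models return logits**).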

In [30]:
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)
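
The two steps in `compute_metrics` (an argmax over the class dimension, then a comparison with the labels) can be checked by hand on a tiny example. The sketch below uses plain Python lists in place of NumPy and the Evaluate library; the helper names are hypothetical and the logits are made up for illustration.

```python
from typing import List


def argmax_predictions(logits: List[List[float]]) -> List[int]:
    """Pick the index of the largest logit in each row (one row per example)."""
    return [max(range(len(row)), key=row.__getitem__) for row in logits]


def accuracy(predictions: List[int], references: List[int]) -> float:
    """Fraction of predictions that equal the reference labels."""
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / len(references)


logits = [
    [2.1, 0.3, -1.0, 0.0, 0.5],   # class 0 scores highest
    [0.1, 0.2, 0.3, 2.5, 0.4],    # class 3
    [0.0, 0.1, 0.2, 0.3, 1.9],    # class 4
    [1.5, 1.4, 0.0, 0.0, 0.0],    # class 0
]
labels = [0, 3, 2, 0]
predictions = argmax_predictions(logits)
print(predictions, accuracy(predictions, labels))
```

No softmax is needed before the argmax, because softmax preserves the ordering of the logits.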

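#### Monitoring metrics during training

To monitor how evaluation metrics change during training, set the `evaluation_strategy` parameter in `TrainingArguments` so that metrics are reported at the end of each epoch. Note that the cell below re-creates `training_args` without `logging_steps`, so the default of 500 applies; because the run that follows has only 375 steps, its Training Loss column shows `No log`.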

In [31]:
from transformers import TrainingArguments, Trainer

training_args = TrainingArguments(output_dir=f"{model_dir}/test_trainer", evaluation_strategy="epoch")

## Start training

### Instantiating the Trainer

The `kernel version` warning printed after the cell below does not currently affect this example.

In [32]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=small_train_dataset,
    eval_dataset=small_eval_dataset,
    compute_metrics=compute_metrics,
)

Detected kernel version 4.4.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.


## Checking GPU usage with nvidia-smi

To watch GPU usage in real time, poll with the `watch` command, `watch -n 1 nvidia-smi`:

```shell
Every 1.0s: nvidia-smi                                                   Wed Dec 20 14:37:41 2023

Wed Dec 20 14:37:41 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03             Driver Version: 535.129.03   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Tesla T4                       Off | 00000000:00:0D.0 Off |                    0 |
| N/A   64C    P0              69W /  70W |   6665MiB / 15360MiB |     98%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A     18395      C   /root/miniconda3/bin/python                6660MiB |
+---------------------------------------------------------------------------------------+
```
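
The same figures can be read programmatically. The sketch below parses one line of output from `nvidia-smi --query-gpu=utilization.gpu,memory.used,memory.total --format=csv,noheader,nounits`. `parse_gpu_line` and `query_gpus` are hypothetical helpers, and the sample line is derived from the table above rather than captured from a real query.

```python
import shutil
import subprocess
from typing import Dict, List

QUERY = [
    "nvidia-smi",
    "--query-gpu=utilization.gpu,memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_gpu_line(line: str) -> Dict[str, int]:
    """Parse 'util, used, total' (percent, MiB, MiB) into a dictionary."""
    util, used, total = (int(field.strip()) for field in line.split(","))
    return {"util_percent": util, "memory_used_mib": used, "memory_total_mib": total}


def query_gpus() -> List[Dict[str, int]]:
    """Return one entry per GPU, or an empty list if nvidia-smi is not installed."""
    if shutil.which("nvidia-smi") is None:
        return []
    output = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    return [parse_gpu_line(line) for line in output.splitlines() if line.strip()]


# Sample line built from the table above: 98% utilization, 6665 MiB of 15360 MiB.
sample = parse_gpu_line("98, 6665, 15360")
print(sample)
print(f"{sample['memory_used_mib'] / sample['memory_total_mib']:.0%} of GPU memory in use")
```

On a machine with an NVIDIA driver, calling `query_gpus()` in a loop gives roughly the same information as `watch`, in a form that can be logged alongside training metrics.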

In [33]:
trainer.train()

Epoch,Training Loss,Validation Loss,Accuracy
1,No log,1.137018,0.553
2,No log,1.309998,0.559
3,No log,1.370839,0.583


TrainOutput(global_step=375, training_loss=0.5570142822265625, metrics={'train_runtime': 345.3604, 'train_samples_per_second': 8.687, 'train_steps_per_second': 1.086, 'total_flos': 789354427392000.0, 'train_loss': 0.5570142822265625, 'epoch': 3.0})

In [18]:
small_test_dataset = tokenized_datasets["test"].shuffle(seed=64).select(range(100))

In [19]:
trainer.evaluate(small_test_dataset)

{'eval_loss': 0.9963672757148743,
 'eval_accuracy': 0.56,
 'eval_runtime': 2.9722,
 'eval_samples_per_second': 33.645,
 'eval_steps_per_second': 4.374,
 'epoch': 3.0}
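
The throughput figures reported by `trainer.train()` and `trainer.evaluate()` can be reproduced from the run sizes, which is a quick way to confirm that a run used the data you intended. The sketch assumes 1,000 training samples over 3 epochs, 100 evaluation samples, and an evaluation batch size of 8 (taken here as the `TrainingArguments` default).

```python
import math

# Values copied from the outputs above.
train_runtime, global_step = 345.3604, 375
eval_runtime = 2.9722

train_samples_per_second = round(1000 * 3 / train_runtime, 3)
train_steps_per_second = round(global_step / train_runtime, 3)

eval_samples_per_second = round(100 / eval_runtime, 3)
eval_steps_per_second = round(math.ceil(100 / 8) / eval_runtime, 3)

print(train_samples_per_second, train_steps_per_second)
print(eval_samples_per_second, eval_steps_per_second)
```

Under these assumptions the computed values agree with the reported `train_samples_per_second`, `train_steps_per_second`, `eval_samples_per_second`, and `eval_steps_per_second`.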

### Saving the model and training state

- Use `trainer.save_model` to save the model; it can be reloaded later with `from_pretrained()`
- Use `trainer.save_state` to save the training state

In [20]:
trainer.save_model()

In [21]:
trainer.save_state()
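
`trainer.save_state()` writes a `trainer_state.json` file into `output_dir`. Here is a small sketch for inspecting it with the standard library. The helper `summarize_trainer_state` is hypothetical, and the demo writes a hand-made stand-in file (with values copied from the run above) instead of reading a real one, so treat the exact set of keys as an assumption to check against your own file.

```python
import json
import tempfile
from pathlib import Path
from typing import Any, Dict


def summarize_trainer_state(path: Path) -> Dict[str, Any]:
    """Return the step count, epoch, and last evaluation record from a state file."""
    state = json.loads(path.read_text(encoding="utf-8"))
    eval_logs = [entry for entry in state.get("log_history", []) if "eval_accuracy" in entry]
    return {
        "global_step": state.get("global_step"),
        "epoch": state.get("epoch"),
        "last_eval": eval_logs[-1] if eval_logs else None,
    }


# Hand-made stand-in for trainer_state.json; a real file has more fields.
demo_state = {
    "global_step": 375,
    "epoch": 3.0,
    "log_history": [
        {"epoch": 1.0, "eval_accuracy": 0.553, "eval_loss": 1.137018, "step": 125},
        {"epoch": 3.0, "eval_accuracy": 0.583, "eval_loss": 1.370839, "step": 375},
    ],
}

with tempfile.TemporaryDirectory() as tmp:
    state_path = Path(tmp) / "trainer_state.json"
    state_path.write_text(json.dumps(demo_state), encoding="utf-8")
    summary = summarize_trainer_state(state_path)

print(summary["global_step"], summary["last_eval"]["eval_accuracy"])
```

To inspect a real run, point `summarize_trainer_state` at the `trainer_state.json` inside the `output_dir` used in this notebook.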