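# Getting Started with Fine-Tuning in Hugging Face Transformers

This example walks through the main steps of fine-tuning a model with Transformers:
- Downloading the dataset
- Preprocessing the data
- Configuring training hyperparameters
- Setting up evaluation metrics
- A brief introduction to the Trainer
- Running the training
- Saving the model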

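## YelpReviewFull Dataset

**Hugging Face dataset: [YelpReviewFull](https://huggingface.co/datasets/yelp_review_full)**

### Dataset Summary

The Yelp reviews dataset consists of reviews from Yelp. It is extracted from the Yelp Dataset Challenge 2015 data.

### Supported Tasks and Leaderboards
Text classification, sentiment classification: the dataset is mainly used for text classification, that is, predicting the sentiment of a given text.

### Language
The reviews are mainly written in English.

### Dataset Structure

#### Data Instances
A typical data point consists of a text and its corresponding label.

An example from the YelpReviewFull test set: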

```json
{
    'label': 0,
    'text': 'I got \'new\' tires from them and within two weeks got a flat. I took my car to a local mechanic to see if i could get the hole patched, but they said the reason I had a flat was because the previous patch had blown - WAIT, WHAT? I just got the tire and never needed to have it patched? This was supposed to be a new tire. \\nI took the tire over to Flynn\'s and they told me that someone punctured my tire, then tried to patch it. So there are resentful tire slashers? I find that very unlikely. After arguing with the guy and telling him that his logic was far fetched he said he\'d give me a new tire \\"this time\\". \\nI will never go back to Flynn\'s b/c of the way this guy treated me and the simple fact that they gave me a used tire!'
}
```

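#### Data Fields

- 'text': The review text is escaped using double quotes ("), and any internal double quote is escaped by two double quotes (""). Newlines are escaped by a backslash followed by an "n" character, i.e. "\n".
- 'label': Corresponds to the score of the review. The raw scores range from 1 to 5 stars; in the loaded dataset they are stored as class indices 0 to 4, as in the example above, where 'label': 0 marks a 1-star review.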
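The escaping rules for the two fields can be illustrated with a small sketch. It assumes a raw CSV row of the form `"<stars>","<text>"`; the sample row and the helper name `parse_raw_row` are hypothetical and only serve the illustration. The class names are the ones that appear in the displayed samples later in this notebook.

```python
import csv
import io

# Class names as they appear in the displayed samples (index 0 = "1 star").
LABEL_NAMES = ["1 star", "2 star", "3 stars", "4 stars", "5 stars"]

def parse_raw_row(line):
    """Parse one raw CSV row of the assumed form "<stars>","<text>"."""
    # csv.reader undoes the quoting: "" inside a quoted field becomes a single ".
    stars, text = next(csv.reader(io.StringIO(line)))
    # Newlines are stored as the two characters backslash + n; turn them into real newlines.
    text = text.replace("\\n", "\n")
    # Map the 1..5 star score to a 0-based class index.
    return {"label": int(stars) - 1, "text": text}

# Hypothetical raw row, written to follow the escaping rules described above.
raw = '"2","She said ""first come first serve"".\\nNever again."'
row = parse_raw_row(raw)
print(row["label"], LABEL_NAMES[row["label"]])
print(row["text"])
```

When the dataset is loaded through `load_dataset`, none of this parsing has to be done by hand; the sketch only makes the field description concrete.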

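#### Data Splits

The Yelp reviews full star dataset is constructed by randomly taking 130,000 training samples and 10,000 test samples for each star rating from 1 to 5. In total there are 650,000 training samples and 50,000 test samples.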


## Downloading the Dataset

In [2]:
from datasets import load_dataset

In [3]:
# 650,000 training samples and 50,000 test samples

dataset = load_dataset('yelp_review_full')

In [4]:
dataset

DatasetDict({
    train: Dataset({
        features: ['label', 'text'],
        num_rows: 650000
    })
    test: Dataset({
        features: ['label', 'text'],
        num_rows: 50000
    })
})

In [5]:
dataset['train'][0]

{'label': 4,
 'text': "dr. goldberg offers everything i look for in a general practitioner.  he's nice and easy to talk to without being patronizing; he's always on time in seeing his patients; he's affiliated with a top-notch hospital (nyu) which my parents have explained to me is very important in case something happens and you need surgery; and you can get referrals to see specialists without having to see him first.  really, what more do you need?  i'm sitting here trying to think of any complaints i have about him, but i'm really drawing a blank."}

In [6]:
import pandas as pd
import datasets
from IPython.display import display, HTML
import numpy as np

In [7]:
def show_random_elements(dataset, num_examples=10):
    # Fail early if more examples are requested than the dataset contains.
    assert num_examples <= len(dataset), "Can't pick more elements than there are in the dataset."
    # Draw unique row indices. np.random.choice(n, ...) samples from range(n), so n must be
    # num_rows (not num_rows - 1) for the last row to be eligible. replace=False guarantees
    # distinct indices, which makes a manual retry loop over random.randint unnecessary.
    picks = np.random.choice(dataset.num_rows, size=num_examples, replace=False).tolist()
    df = pd.DataFrame(dataset[picks])
    
    for column, typ in dataset.features.items():
        if isinstance(typ, datasets.ClassLabel):
            df[column] = df[column].transform(lambda i:typ.names[i])
    display(HTML(df.to_html()))

In [8]:
show_random_elements(dataset['train'])

Unnamed: 0,label,text
0,3 stars,"The other day I stopped into Dillon's on Central Avenue just off of the canal. I typically ride my bike by that location at least once a week, so this time I decided to stop by for lunch. The atmosphere was different than I had expected. It was designed in a Hollywood kitsch design to give it the atmosphere on a New York/L.A. style deli (outside of New York/L.A.) which was bizarre. I want barbeque, not deli!\nThe dining area was nice - clean, nice furniture, etc. However, it did not feel like a barbeque place. The crowd was mainly older (60+) on a Saturday afternoon. I did not fit in with my bike apparel, Nook and cell phone. \nThe service was good, not great. I do not want to be too harsh in saying that the older customers got most of the attention. I hardly got noticed. Not that I require constant nurturing, but just fill my ice tea!\n\nThe food was very good. I got a beef brisket sandwich with mashed potatoes and a side of gravy. Also, I could not resist the $1.00 plate of onion rings. I was satisfied. It was done just right and not loaded with salt that keeps you thirsty for hours on end. I thought the meal was pricey for what I received - no not the onion rings, but a $10.00 sandwich!\n\nWhen my check arrived they forgot to include a pen for me. I waited patiently for the waitress to return, but was somewhat frustrated.\n\nI would say it was good, but I would not rush back."
1,5 stars,"Vancouver could take a hint from these guys, they sure know how to do Lebanese food. For under $5 you can get a hearty wrap or for under $10 you can choose up to 5 items from their delicious array of grilled eggplant, sauteed cauliflower, falafel, garlic potatoes and many meaty delights.\nNo matter how broke you are this place will treat you right and keep you coming back for more."
2,5 stars,"OMG, what can I say?! \n\nI have a reluctant sweet tooth but that's not a problem here where you can have the most delicious kick-arse hot chocolate with chilli ever, I'm still thinking about it this morning! \n\nThe only problem here is trying to decide which chocolates to buy and not wanting to buy them for everyone you know! \n\nAs Arnie said, I'll be back...."
3,3 stars,"I cannot recall the number of times I have been here. But, it sure has been a lot. The food here is slightly inconsistent at times, but mostly really good. I wish the owners had thought about expanding the place, as it is too cramped right now. Absolutely love the chole bhature. This is the most authentic chole bhature you can find anywhere in Tempe. The only problem with this establishment is the staff at the counter. I always find them a bit rude and disinterested in customer service."
4,2 star,"The food is decent for the neighborhood, but the customer service needs some serious help. I asked the hostess if I needed to put my name on the list for the bar area. She stated \""first come first serve.\"" The first table in bar area that came open I sat down and the bartender asked me to please get up that people were waiting for the table. He said what the hostess stated was true, but he had his \""own list\"""
5,4 stars,Still Love this place!
6,1 star,"I am totally disgusted with this restaurant. \n\nI first visited this restaurant last weekend after hearing good things about their food. My visit was enjoyable- the food was delicious and tasted like it was made from scratch. The waitress was nice and the prices were reasonable. I was excited about my visit so I told invited a friend to join me the following weekend.\n\nMy friend and I went for breakfast this morning (now a week later) and it was like I went to an entirely different restaurant. The place was busy but we were able to get a table immediately. The waitress was in a bad mood from the time we sat down. My friend ordered coffee and juice and the waitress looked at her like she was crazy. She came back 3 times within 5 minutes to ask if we were ready to order after we told her we needed time. She was annoyed that my friend needed more time and then asked a question about the menu. I know it must have been terribly inconvenient for the waitress to do her job. The food came out in about 20-30 minutes. The food tasted good, just like my last visit. When we were only half way through our food, she brought us our bill and asked us if we needed anything else. We told her we were fine, but were just finishing our food and chatting. She came by 5 times in 10 minutes asking if we needed anything else so we would hurry up and pay the bill and leave. She had another waitress come ask us if we needed anything else within those 10 minutes as well. Then, my friend took out her card and had it on the table and I was getting out my card and the manager came and without saying a word grabbed the bill and walked away. I tried to stop him but he wouldn't turn around. I had to flag down the waitress to bring our bill back as we had two separate bills and my card hadn't been added yet. 
Then, the hostess, who is probably the snootiest young lady I've ever met came and said \""can I take this\"" referring to the book t hat the bill is held in and didn't wait for a response before grabbing it and walking away. She came back saying she needed the other receipt and I told her I wasn't done yet but I would leave it when I was. I could see the manager, and several of the staff pointing and talking about us (on more than one occasion) at the hostess station which was 3 booths away. I have never felt so bullied at a restaurant in my entire life!\n\nSince when is it a crime to talk with your party when you are out a restaurant? We were only there for about 1 1/2 hours from the time we sat down to the time we left which is not an exuberant amount of time. Furthermore, there were many open tables elsewhere. Regardless of whether there are tables open or not, there is no reason for a restaurant staff to bother guests to get them to leave. The manager has no place in management. \n\nThe only reason my waitress got a tip was so the hard working buser got a portion of the money. She certainly didn't deserve one.\n\nThe restaurant deserves an F in manners, customers service, atmosphere, hospitality, and professionalism. I'm completely embarrassed that I spoke so highly of it to my friend who had to experience such terrible and rude service. \n\nGood luck staying open."
7,1 star,"Yes, super over priced! Like most people, the buffet line was too long so had to eat here. The place is nice, but the food and service sucks.\n\nThe hostess skirt is so short that just standing I can almost see something I don't need to see.\n\nThe noodles were oily, the XLB were disgusting, most items lacked flavor. The one decent thing was the beef pancake.\n\nAlso our waitress tried to be nice, but she was up in our face. She would come to our table and stand there for a few seconds, stare at us then say something. I really wanted green onion pancake. It took forever to come. I asked twice where is my green onion pancake. The people there kept saying it's coming. Finally everyone was done eating and still no green onion pancake. I complained and the stupid lady goes \""oh you ordered that?\"" I was so mad, she can see the red in my eyes and then she switced to nice mode saying how sorry she is and she will put in the order now. I said no, I want the check so I can leave. She kept saying it will be ready. I'm like NO! \n\nBad service! A little too late to fix anything."
8,1 star,"My friends strangely loved going to QQ Asian Buffet, but I haven't the slightest clue as to why they'd trek all the way from downtown and west side Madison to this remote place. Maybe I lied. They probably just loved the notion of all-you-can-eat for $10, and taking a break from their frequent Fugu and Saigon adventures. \n\nBut I myself hated it and dreaded having to go to QQ Asian Buffet all the time. Sure, $10 for a buffet dinner is great. But you get what you pay for, so what kind of quality will you expect? So, I'll admit the selection of foods here isn't under par. But their sushi is terrible, their Chinese foods are really oily and greasy, and their American food is unappetizing. I'm probably not alone in thinking this way because every time I've been to QQ Asian Buffet, there were lots of empty tables."
9,3 stars,"Convenience for a reasonable price,,,thats it.. no frills.......Best location"


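## Preprocessing the Data

After downloading the dataset, use a tokenizer to process the text. Because the inputs vary in length, padding and truncation strategies are used to bring them to a uniform length.

The `map` method of Datasets applies a preprocessing function to the entire dataset in one call.

Below, the whole dataset is processed with the pad-to-max-length strategy: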

In [9]:
from transformers import AutoTokenizer

In [10]:
tokenizer = AutoTokenizer.from_pretrained('bert-base-cased')

In [11]:
def tokenize_function(examples):
    return tokenizer(examples['text'], padding="max_length", truncation=True)

In [12]:
tokenized_datasets = dataset.map(tokenize_function, batched=True)
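What `padding="max_length", truncation=True` does to each tokenized sequence can be sketched without the library. This is a simplified illustration, not the tokenizer's implementation: the helper `pad_or_truncate` and the padding id `0` are assumptions, and a real tokenizer additionally keeps special tokens such as the closing `[SEP]` when it truncates.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Return fixed-length input_ids and the matching attention_mask."""
    kept = token_ids[:max_length]          # truncation: drop tokens beyond max_length
    n_pad = max_length - len(kept)         # padding: number of slots left to fill
    return {
        "input_ids": kept + [pad_id] * n_pad,
        "attention_mask": [1] * len(kept) + [0] * n_pad,  # 1 = real token, 0 = padding
    }

short_seq = pad_or_truncate([101, 2048, 1139, 102], max_length=6)
long_seq = pad_or_truncate([101, 2048, 1139, 119, 119, 102], max_length=4)
print(short_seq)
print(long_seq)
```

Every sequence comes out with the same length, which is what allows examples to be stacked into one batch tensor; the attention mask tells the model which positions to ignore.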

In [13]:
show_random_elements(tokenized_datasets['train'], num_examples=1)

Unnamed: 0,label,text,input_ids,token_type_ids,attention_mask
0,1 star,"Oh my... really having a difficult time with this because we love our Veterinarian, Dr. Foster. We however were faced with one of the worst customer service situations today with one of their staff. Kimberly (who referred to herself as a vet-tech) was just horribly, horribly rude and invited us to seek out the services of a different vet. Our dog Charlie has had to go in every couple of months to have her thyroid levels checked as it runs low and is on medication. We have been working with Dr. Foster in keeping in at the level it should. So far frequent blood draws and changes in medication dosages. 2 weeks ago was her last blood draw and results and treatment have been withheld from us. Apparently the presiding vet is the only vet allowed to interpret the results regardless that their are other vets that are on staff that could interpret the results. Vets went on vacation, vets had days off and as yet we still have no results. we have had to schedule another meeting with our vet in order to hear the results.\nThe earliest they seem to be able to do this is 3 weeks from the date the blood was drawn. Needless to say staff can destroy a business. 
We as consumers fund the company, create jobs for people and ultimately pay the salaries.","[101, 2048, 1139, 119, 119, 119, 1541, 1515, 170, 2846, 1159, 1114, 1142, 1272, 1195, 1567, 1412, 159, 24951, 2983, 5476, 117, 1987, 119, 7895, 119, 1284, 1649, 1127, 3544, 1114, 1141, 1104, 1103, 4997, 8132, 1555, 7832, 2052, 1114, 1141, 1104, 1147, 2546, 119, 26564, 113, 1150, 2752, 1106, 1941, 1112, 170, 1396, 1204, 118, 13395, 114, 1108, 1198, 16358, 14791, 4999, 117, 16358, 14791, 4999, 14708, 1105, 4022, 1366, 1106, 5622, 1149, 1103, 1826, 1104, 170, 1472, 1396, 1204, 119, 3458, 3676, 4117, 1144, 1125, 1106, 1301, 1107, 1451, 2337, 1104, 1808, 1106, 1138, 1123, 21153, 16219, 3001, ...]","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...]","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...]"


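### Sampling the Data

Use 10,000 training samples and 1,000 evaluation samples to demonstrate small-scale training on BERT (based on the PyTorch Trainer).

The `shuffle()` function randomly reorders the rows of the dataset. For more control over the algorithm used to shuffle, you can pass the `generator` parameter to use a different `numpy.random.Generator`.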

In [15]:
# small_train_dataset = tokenized_datasets["train"].shuffle(seed=42).select(range(1000))
small_train_dataset = tokenized_datasets["train"].shuffle(seed=42).select(range(10000))
# small_eval_dataset = tokenized_datasets["test"].shuffle(seed=42).select(range(1000))
small_eval_dataset = tokenized_datasets["test"].shuffle(seed=42).select(range(1000))

In [16]:
small_train_dataset.num_rows, small_eval_dataset.num_rows

(10000, 1000)
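The `shuffle(seed=...).select(range(n))` pattern used above amounts to taking the first `n` entries of a seeded permutation of the row indices. A minimal sketch with the standard library follows; the helper `shuffle_and_select` is hypothetical, and because Datasets uses a NumPy generator internally, the concrete indices will differ from the ones the library picks.

```python
import random

def shuffle_and_select(num_rows, n, seed):
    """Return n distinct row indices: a seeded permutation, then its first n entries."""
    indices = list(range(num_rows))
    random.Random(seed).shuffle(indices)   # private RNG, global random state is untouched
    return indices[:n]

# Sizes mirror the evaluation subset: 1,000 rows out of the 50,000 test rows.
eval_idx = shuffle_and_select(num_rows=50_000, n=1_000, seed=42)
again = shuffle_and_select(num_rows=50_000, n=1_000, seed=42)
print(len(eval_idx), len(set(eval_idx)), eval_idx == again)
```

Fixing the seed makes the subset reproducible across runs, and selecting from a permutation guarantees that no row is drawn twice.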

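## Fine-Tuning Configuration

### Loading the BERT Model

The warning below tells us that some weights (`classifier.weight` and `classifier.bias` of the classification layer) are newly initialized rather than loaded from the checkpoint. This is entirely normal when fine-tuning: the head used for the masked language modeling pretraining task is dropped and replaced with a new head, for which there are no pretrained weights. The library therefore warns that the model should be fine-tuned before it is used for inference, which is exactly what we are about to do.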

In [17]:
from transformers import AutoModelForSequenceClassification

In [18]:
model = AutoModelForSequenceClassification.from_pretrained('bert-base-cased', num_labels=5)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


### Training Hyperparameters (TrainingArguments)

Full list of parameters and their defaults: https://huggingface.co/docs/transformers/v4.36.1/en/main_classes/trainer#transformers.TrainingArguments

Source definition: https://github.com/huggingface/transformers/blob/v4.36.1/src/transformers/training_args.py#L161

**The most important setting: the path where model weights are saved (`output_dir`)**

In [19]:
from transformers import TrainingArguments

In [18]:
model_dir = "models/bert-base-cased-finetune-yelp"
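# logging_steps defaults to 500; given our training data size and step count, set it to 100.
# output_dir: directory for the checkpoints and other outputs produced during training.
# num_train_epochs: number of training epochs.
# per_device_train_batch_size: training batch size per device.
# per_device_eval_batch_size: evaluation batch size per device.
# warmup_steps: number of warmup steps during which the learning rate gradually increases.
# weight_decay: weight decay coefficient, used to reduce overfitting.
# logging_dir: directory where log files are stored.
# logging_steps: log every this many steps.
# save_steps: save a checkpoint every this many steps.
# eval_steps: run evaluation every this many steps.
# evaluation_strategy: evaluation strategy, one of "no", "steps", or "epoch".
# learning_rate: initial learning rate.
# load_best_model_at_end: whether to load the best model at the end of training.
# metric_for_best_model: metric used to select the best model.
# greater_is_better: whether a larger value of that metric means a better model.
# save_total_limit: maximum number of checkpoints to keep; older ones are deleted beyond it.
# seed: random seed, for reproducibility.
# fp16: whether to train with half-precision floats (only effective on supported hardware).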

training_args = TrainingArguments(output_dir=model_dir,
                                  per_device_train_batch_size=8,
                                  num_train_epochs=5,
                                  logging_steps=100)
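The interplay of `learning_rate` and `warmup_steps` described in the comments above can be made concrete with a small sketch of a linear schedule with warmup, which is the default scheduler type in `TrainingArguments`. The function `linear_schedule_lr` is a hypothetical helper that approximates the shape of the schedule; it is not the library's code. The defaults used here (`5e-5`, no warmup) mirror the library defaults.

```python
def linear_schedule_lr(step, total_steps, base_lr=5e-5, warmup_steps=0):
    """Learning rate at a given step: linear warmup, then linear decay to zero."""
    if step < warmup_steps:
        # Warmup phase: ramp up from 0 to base_lr.
        return base_lr * step / warmup_steps
    # Decay phase: go linearly from base_lr down to 0 at total_steps.
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)

for step in (0, 50, 100, 550, 1000):
    print(step, linear_schedule_lr(step, total_steps=1000, warmup_steps=100))
```

With `warmup_steps=0`, as in the configuration above, the learning rate starts at its full value and only the decay phase applies.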

In [20]:
# Full hyperparameter configuration
print(training_args)

TrainingArguments(
_n_gpu=1,
adafactor=False,
adam_beta1=0.9,
adam_beta2=0.999,
adam_epsilon=1e-08,
auto_find_batch_size=False,
bf16=False,
bf16_full_eval=False,
data_seed=None,
dataloader_drop_last=False,
dataloader_num_workers=0,
dataloader_pin_memory=True,
ddp_bucket_cap_mb=None,
ddp_find_unused_parameters=None,
ddp_timeout=1800,
debug=[],
deepspeed=None,
disable_tqdm=False,
do_eval=False,
do_predict=False,
do_train=False,
eval_accumulation_steps=None,
eval_delay=0,
eval_steps=None,
evaluation_strategy=no,
fp16=False,
fp16_backend=auto,
fp16_full_eval=False,
fp16_opt_level=O1,
fsdp=[],
fsdp_min_num_params=0,
fsdp_transformer_layer_cls_to_wrap=None,
full_determinism=False,
gradient_accumulation_steps=1,
gradient_checkpointing=False,
greater_is_better=None,
group_by_length=False,
half_precision_backend=auto,
hub_model_id=None,
hub_private_repo=False,
hub_strategy=every_save,
hub_token=<HUB_TOKEN>,
ignore_data_skip=False,
include_inputs_for_metrics=False,
jit_mode_eval=False,
label_nam

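### Evaluating Metrics During Training (Evaluate)

The **[Hugging Face Evaluate library](https://huggingface.co/docs/evaluate/index)** gives access, with a single line of code, to dozens of evaluation methods across domains (natural language processing, computer vision, reinforcement learning, and more). The currently supported metrics are listed at **https://huggingface.co/evaluate-metric**.

The Trainer does not automatically evaluate model performance during training, so we need to pass it a function that computes and reports metrics.

The Evaluate library provides a simple accuracy metric that can be loaded with `evaluate.load`: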

In [20]:
import numpy as np
import evaluate

In [21]:
metric = evaluate.load("accuracy")

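Next, call `compute` on the metric to calculate the accuracy of the predictions.

Before passing the predictions to `compute`, the logits need to be converted into predictions (**all Transformers models return logits**).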

In [22]:
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)
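The logits-to-accuracy step in `compute_metrics` can be traced by hand on a tiny batch. The sketch below is a dependency-free stand-in for `np.argmax` plus the accuracy metric; the helper names and the logits are made up for illustration.

```python
def argmax(row):
    """Index of the largest value in a list of scores."""
    return max(range(len(row)), key=row.__getitem__)

def accuracy_from_logits(logits, labels):
    """Pick the highest-scoring class per row, then compare with the labels."""
    predictions = [argmax(row) for row in logits]
    correct = sum(p == y for p, y in zip(predictions, labels))
    return {"accuracy": correct / len(labels)}

# Four examples, five classes (one score per star rating).
logits = [
    [2.1, 0.3, -1.0, -0.5, -2.0],   # predicted class 0
    [-1.2, 0.1, 0.4, 1.9, 0.2],     # predicted class 3
    [0.0, 0.2, 0.1, -0.3, 1.5],     # predicted class 4
    [0.5, 1.1, 0.9, -0.7, -1.3],    # predicted class 1
]
labels = [0, 3, 2, 1]
result = accuracy_from_logits(logits, labels)
print(result)  # three of the four predictions match the labels
```

No softmax is needed: it preserves the ordering of the scores, so the argmax of the logits is already the predicted class.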

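#### Monitoring Metrics During Training

To monitor how the evaluation metrics change during training, set the `evaluation_strategy` parameter in `TrainingArguments` so that the metrics are reported at the end of each epoch.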

In [23]:
from transformers import TrainingArguments, Trainer

In [26]:
model_dir = "models/bert-base-cased-finetune-yelp"
training_args = TrainingArguments(output_dir=model_dir,
                                  evaluation_strategy="epoch", 
                                  per_device_train_batch_size=16,
                                  num_train_epochs=3,
                                  logging_steps=30)

## Start Training

### Instantiating the Trainer

The `kernel version` warning below does not affect running this example for now.

In [27]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=small_train_dataset,
    eval_dataset=small_eval_dataset,
    compute_metrics=compute_metrics,
)

Detected kernel version 4.15.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.


In [28]:
!nvidia-smi

Mon Jan 29 13:53:32 2024       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.108.03   Driver Version: 510.108.03   CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla V100-DGXS...  Off  | 00000000:07:00.0  On |                    0 |
| N/A   38C    P0    53W / 300W |   1863MiB / 32768MiB |      5%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-DGXS...  Off  | 00000000:08:00.0 Off |                    0 |
| N/A   38C    P0    38W / 300W |      4MiB / 32768MiB |      0%      Default |
|       

In [29]:
trainer.train()



Epoch,Training Loss,Validation Loss,Accuracy
1,1.0563,1.07775,0.539
2,0.9162,0.922774,0.598
3,0.7799,0.931847,0.594




TrainOutput(global_step=471, training_loss=1.0160937987568526, metrics={'train_runtime': 316.2455, 'train_samples_per_second': 94.863, 'train_steps_per_second': 1.489, 'total_flos': 7893544273920000.0, 'train_loss': 1.0160937987568526, 'epoch': 3.0})

In [30]:
small_test_dataset = tokenized_datasets["test"].shuffle(seed=64).select(range(100))

In [31]:
trainer.evaluate(small_test_dataset)



{'eval_loss': 1.0364636182785034,
 'eval_accuracy': 0.52,
 'eval_runtime': 0.6256,
 'eval_samples_per_second': 159.858,
 'eval_steps_per_second': 6.394,
 'epoch': 3.0}
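The step counts reported by `trainer.train()` and `trainer.evaluate()` above can be cross-checked with a short calculation. The sketch assumes no gradient accumulation and that the last partial batch is kept, as in the printed configuration. The device count for this run is not stated in the outputs; four devices is an assumption chosen because it reproduces the reported numbers, whereas a single device would give 1,875 training steps. The evaluation batch size of 8 per device is the library default.

```python
import math

def optimizer_steps(num_samples, per_device_batch_size, num_devices, num_epochs=1):
    """Number of batches processed, assuming no gradient accumulation."""
    effective_batch = per_device_batch_size * num_devices
    return math.ceil(num_samples / effective_batch) * num_epochs

# Training: 10,000 samples, batch size 16 per device, 3 epochs.
train_steps = optimizer_steps(10_000, 16, num_devices=4, num_epochs=3)
single_device_steps = optimizer_steps(10_000, 16, num_devices=1, num_epochs=3)
# Evaluation on the 100-sample subset with the default per-device batch size of 8.
eval_steps = optimizer_steps(100, 8, num_devices=4)
print(train_steps, single_device_steps, eval_steps)
```

Under the four-device assumption, `train_steps` equals the reported `global_step=471`, and `eval_steps` divided by the reported `eval_runtime` of 0.6256 seconds gives roughly the reported `eval_steps_per_second` of 6.394.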

### Saving the Model and Training State

- Use `trainer.save_model` to save the model; it can later be reloaded with `from_pretrained()`.
- Use `trainer.save_state` to save the training state.

In [None]:
trainer.save_model(model_dir)

In [None]:
trainer.save_state()
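Once the model is saved, reloading it for inference could look like the following sketch. The helpers `stars_from_label` and `predict_stars` are hypothetical names introduced here. The tokenizer is reloaded from `bert-base-cased` on the assumption that the saved directory contains only the model, since the Trainer above was not given a tokenizer.

```python
def stars_from_label(label_id):
    """Map a 0-based class index (0..4) back to a 1..5 star rating."""
    return label_id + 1

def predict_stars(texts, model_dir="models/bert-base-cased-finetune-yelp"):
    """Reload the fine-tuned model from model_dir and predict a star rating per text."""
    # Imported lazily so that defining this function does not require torch/transformers.
    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    # Assumption: the tokenizer was not saved with the model, so load the original by name.
    tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
    model = AutoModelForSequenceClassification.from_pretrained(model_dir)
    model.eval()  # disable dropout for inference

    inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    return [stars_from_label(i) for i in logits.argmax(dim=-1).tolist()]
```

Calling `predict_stars(["Still Love this place!"])` would return a list with one rating between 1 and 5. The result depends on the trained weights, so no expected value is shown.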