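# Getting Started with Fine-Tuning in Hugging Face Transformers

This example walks through the main steps of fine-tuning a model with Transformers:
- Downloading the dataset
- Preprocessing the data
- Configuring training hyperparameters
- Setting up evaluation metrics
- A brief introduction to the Trainer
- Running the training
- Saving the model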

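## The YelpReviewFull dataset

**Hugging Face dataset: [YelpReviewFull](https://huggingface.co/datasets/yelp_review_full)**

### Dataset summary

The Yelp reviews dataset consists of reviews from Yelp. It is extracted from the Yelp Dataset Challenge 2015 data.

### Supported tasks and leaderboards
Text classification, sentiment classification: the dataset is mainly used for text classification, that is, predicting the sentiment of a given text.

### Language
The reviews are written mainly in English.

### Dataset structure

#### Data instances
A typical data point consists of a text and its corresponding label.

An example from the YelpReviewFull test set: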

```json
{
    'label': 0,
    'text': 'I got \'new\' tires from them and within two weeks got a flat. I took my car to a local mechanic to see if i could get the hole patched, but they said the reason I had a flat was because the previous patch had blown - WAIT, WHAT? I just got the tire and never needed to have it patched? This was supposed to be a new tire. \\nI took the tire over to Flynn\'s and they told me that someone punctured my tire, then tried to patch it. So there are resentful tire slashers? I find that very unlikely. After arguing with the guy and telling him that his logic was far fetched he said he\'d give me a new tire \\"this time\\". \\nI will never go back to Flynn\'s b/c of the way this guy treated me and the simple fact that they gave me a used tire!'
}
```

#### 数据字段

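- 'text': the review text. It is escaped using double quotes ("), and any internal double quote is escaped with two double quotes (""). Newlines are escaped as a backslash followed by an "n" character, i.e. "\n".
- 'label': the review score (between 1 and 5 stars). It is stored as a class index from 0 to 4, so the example above with 'label': 0 is a 1-star review.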
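
The field conventions above can be illustrated with a small, dependency-free sketch. The helper names `label_to_stars` and `unescape_review` are hypothetical and are not part of the `datasets` library. The label names are copied from the sample output later in this notebook, and their order is an assumption.

```python
# Label names as they appear in this notebook's sample output (order assumed).
LABEL_NAMES = ["1 star", "2 star", "3 stars", "4 stars", "5 stars"]


def label_to_stars(label: int) -> int:
    """Convert a zero-based class index (0-4) to a star rating (1-5)."""
    if not 0 <= label < len(LABEL_NAMES):
        raise ValueError(f"label must be in 0..{len(LABEL_NAMES) - 1}, got {label}")
    return label + 1


def unescape_review(text: str) -> str:
    """Undo the escaping described above: literal backslash-n and doubled quotes."""
    return text.replace("\\n", "\n").replace('""', '"')


# A made-up record in the same shape as the dataset rows.
example = {"label": 0, "text": 'He said ""this time"".\\nNever again.'}
print(label_to_stars(example["label"]), LABEL_NAMES[example["label"]])
print(unescape_review(example["text"]))
```

For fine-tuning itself the text can be left as-is, since the tokenizer works on the raw string. Unescaping is mainly useful when displaying reviews.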

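#### Data splits

The Yelp reviews full star dataset is built by randomly taking 130,000 training samples and 10,000 test samples for each star rating from 1 to 5. In total there are 650,000 training samples and 50,000 test samples.


## Downloading the dataset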

In [1]:
from datasets import load_dataset

In [3]:
# 650,000 training examples and 50,000 test examples

dataset = load_dataset('yelp_review_full')

In [4]:
dataset

DatasetDict({
    train: Dataset({
        features: ['label', 'text'],
        num_rows: 650000
    })
    test: Dataset({
        features: ['label', 'text'],
        num_rows: 50000
    })
})

In [5]:
dataset['train'][0]

{'label': 4,
 'text': "dr. goldberg offers everything i look for in a general practitioner.  he's nice and easy to talk to without being patronizing; he's always on time in seeing his patients; he's affiliated with a top-notch hospital (nyu) which my parents have explained to me is very important in case something happens and you need surgery; and you can get referrals to see specialists without having to see him first.  really, what more do you need?  i'm sitting here trying to think of any complaints i have about him, but i'm really drawing a blank."}

In [6]:
import random
import pandas as pd
import datasets
from IPython.display import display, HTML
import numpy as np

In [7]:
def show_random_elements(dataset, num_examples=10):
    # Fail early if more examples are requested than the dataset contains.
    assert num_examples <= len(dataset), "Can't pick more elements than there are in the dataset."
    # Draw distinct row indices from the full range [0, num_rows).
    # (Equivalent to repeatedly calling random.randint until every pick is unique.)
    picks = np.random.choice(dataset.num_rows, size=num_examples, replace=False).tolist()
    df = pd.DataFrame(dataset[picks])
    
    for column, typ in dataset.features.items():
        if isinstance(typ, datasets.ClassLabel):
            df[column] = df[column].transform(lambda i:typ.names[i])
    display(HTML(df.to_html()))

In [8]:
show_random_elements(dataset['train'])

Unnamed: 0,label,text
0,1 star,Horrible preschool. Teachers are lazy and the kids do not learn much. We pulled our son out this year. We realize now why they lost so many kids last year. School has bad leadership and fired teachers because of low enrollment.
1,1 star,"I been waiting for this place to open for 4 months, and I am severely disappointed. The service was terrible, we waited 45 minutes and never got our food, but noticed people who ordered the same thing get theirs(sounds like the order process needs some work) \n\nI asked for a refund and the manager didn't apologize and didn't say anything, but to add insult to injury the manager charged us for the drinks!!!!(sounds like the manager needs to learn some customer service 101) \n\nOn the way out I overheard people talking that they been waiting 15 minutes for 1/2 a sandwich and some people got the wrong order but didn't complain because they been waiting long as well and were just hungry. Please Mind you that this place wasn't that busy. \n\nI plan on never eating here again."
2,5 stars,"I heard of Liberty Tech and Tire through a facebook page called \""Huntridge Neighborhood\"", I did not realize that the business had change from a European parts establishment to Liberty Tech and Tire (repair shop). Because of the outstanding rave about Harun and Liberty Tech and Tire, I took a chance and stopped by for an oil change... I knew that my brakes were not in the best shape but wasn't ready to fork out the money to have them replaced just yet. \nWell needless to say, due to the great deal that Huran offered me, they changed the booth drums and disc on the back axle for only $99.00 plus tax and he threw in the oil change with the deal. That's not only great customer service but a real neighborhood business that definitely gets my support and business."
3,5 stars,"OMG! Delicious, decadent, and worth every penny.\n\nWe started with the chop and heirloom tomato salads. Theynwere excellent. Foie gras torchon was also very good as an app.\n\nFor dinner my wife and I shared the chili rubbed bone-in double rib eye. We ordered this based on the review from the Food Network show \""The Best Thing I Ever Ate.\"". It did not disappoint! They chili rub made very a very tasty crust, and the onions, jalapenos, and peppers were a nice twist. Definitely not your typical steakhouse fare.\n\nWe will definitely go back."
4,4 stars,"Great sushi restaurant inside of Red Rock Casino. Kinda pricey but a great place for a date night if you're wanting to splurge a little. The \""Top 6 Roll\"" is my favorite and my husband loves the \""Jeffrey Roll.\"" Great selection of sake if you're into that...my hubby always tries something different each time we go. Great selection of specials that are $8 and under to allow you to order a variety of things without killing your bill.\n\nThey used to have these to-die-for mussels in a spicy coconut broth but they got rid of them because not enough people were ordering them. Wish they would bring them back. \n\nDecor is super modern and trendy, service has always been great and it's open pretty late on Fridays and weekends."
5,4 stars,"I loved this car wash..once I figured out how to get into it (easiest to enter from 44th street). No Cheesecake Factory menu that takes 20 minutes to read and figure out what the difference is between the $4 wash, the $7 wash, the $12 wash...you get the idea.\n\nI took my husband's vehicle in while he was out of town. I know, I'm nice like that. $25 for an SUV complete wash. This is a true hand wash. No conveyor belts or automated water squirter things. Guys with hoses who do a really good job. They use compressed air to dry, and I watched them go over and over the car to make sure nothing was missed. I was hard-pressed to find any water streaks. It can be slow, but it's worth it.\n\nThe only downside is the waiting area is outside (covered), so I don't anticipate his vehicle's next wash until November. Luckily, my 10 year old vehicle has no aversions to machine washes. But then again, It would be lucky to get washed before November anyway."
6,3 stars,"I want to give this place 5 stars, since I've recommended it to all sorts of people. But that was based upon an early experience I've simply not been able to replicate.\n\nYou see, when you say 'Rhumbar' to me, I hear Caribbean Rum Bar. The first time I saw this place it was new. Semi-empty. Afternoon on a warmer Nevada day. Smartly selected music (like Stephen Marley's 'Mind Control' and a latin cover of a U2 song) was carefully piped in... but not in a way that upstaged my Cruzan Rum inspired drink.\n\n(Let me back up. Any 'rum' place which won't serve Cruzan Rum, in my opinion, isn't a rum place at all. Bacardi? Bring Advil. Cruzan Rum? Leave the weed behind.)\n\nAnd so the first time I was here I had an delightful drink with fresh fruit in it. Overpriced? In the real world, kinda, on the strip, not at all. It was a groovy place 'to chill' as the kids say and you could simply hang for hours if you wanted. Understand you're outside in the Mirage palms and waterfalls and such. It's simply lovely.\n\nAnd so I've tried to come back to this place ever since this experience. And some idiot has given it a semi-clubby make over. LOUD DJ. KAREOKE. SPORTS BAR BIG TVS. \n\nWTF, Rhumbar? \n\nMirage: you now have two DEAFENING DANCE CLUBS on premise. That INK thing and now 1 OAK. Let the bouncer boys and half naked girls 'chill' in those subwoofered strobe holes. But please... PLEASE... let the 'rest of us' have a place to hang out with only moderate mayhem?\n\nThat's why I've giving the Rhumbar 3 stars. It's identity crisis is inexplicable. If Mirage really wants people to see a big party outside (which may be a smart move) then do this for me. At night, when the pool closes, re-open it as 'The Japonais Lounge', with new lounge chairs and hanging lamps. A place to have a drink and retain your ear drums. I'd love it if you had to be over 40 to get in -- hee hee."
7,2 star,"It's never a good sign when you wait 15 minutes for a server to come to you. I was greeted with \""What would you like to drink\"" No hello, welcome, sorry for your wait. \n\nFood was nothing to write home about. I had the California chicken sandwich and it was very salty. \n\nThis was my second time here and I won't return!"
8,5 stars,Great local place!! Check it out
9,1 star,"My husband and I along with his family were walking around after an awesome lunch at Noodle no. 9 and came back to the Bellagio to go hang out at the Bellagio Cafe for a little while before heading out to the airport to catch our flights back home. As we were walking through the lobby of the Bellagio, they were taking down their garden ornaments, and there was places that were cut off and two men who were security funneling and stopping traffic through roped off areas that people were walking in and out of.\n\nAs we were walking through the tall man said to us \""Do YOU PEOPLE not understand when a man tells you to stop that you have to stop?!\"" -- to which I didn't ever remember the other guy telling us we had to stop, and if he did it was an honest mistake as I was at the front of our group and I wasn't particularly feeling too well as I was a bit anemic at the time. He continued to be rude to us as I walked past and I told him that he \""didn't have to be so rude\"" and we continued onward. The use of the phrase \""you people\"" particularly bothered me because my husband and his mother are Korean, and his younger sister is half Korean, while her father (my husband's step-dad) is a white male.\n\nJust the way the guy said it to us, and REPEATED it over and over again as we kept walking, each time more loud and more rude didn't sit well with me. So I scuttled ahead to find a place where I could sit down, because I was so upset at just how rude that guy was. People make honest mistakes.. if we walked through and the other guy told us to stop which somehow all 5 of us seemed to have not heard, then we would have apologized, and that should have been the end of it. 
We were all so upset with how this guy treated us, that we decided to head back to the hotel lobby and just leave for the airport -- well to do so, we had to go back where that guy was.\n\nAs we were approaching people were allowed to walk through the pathway and we were going with the flow of traffic. He then stopped us again, even though people on the other side were being let through (like traffic on a two-way road) -- which made absolutely no sense. It was pretty clear that he was singling us out because of the indecent before hand. My husband's father told the guy that the situation was ridiculous, and continued to walk forward as people were still being allowed to walk through on the left side of us. Once he did that the \""security guard\"" (as he called himself) physically put his hands on Jason's father and pushed him. My husband's father then told him that his actions were uncalled for and was physical assault to which the guard responded by pushing my husband's father further back with his chest and getting up in his face repeating that he was a security gaurd, and he had the right to put his hands on him and kept telling us to \""look at the badge\"" meaning his name tag with his name Michael on it and where he was from - San Bernardino, CA. My husband's mother was appalled and told him to get his hands off of her husband, and the guy touched me telling me to get back even though I never even moved from my original position, nor did I try to even start a confrontation.\n\nWe had done nothing wrong. We were on a FAMILY vacation in Vegas. None of us have ever been involved in any crimes, nor were we being aggressive in any manor. It was uncalled for and made for a horrible ending to our vacation.\n\nWe then waited to file a complaint with the head of security, in which we got an apology -- but he said the manor would be looked into. And the more we had talked to him, the more it seemed apparent that they probably had other incidents happen with Michael. 
Unfortunately we weren't able to file a formal complaint on paper with the hotel as we had to leave to catch our flights back home.\n\nNot only that, but in the beginning of our vacation, my husband and I went to check into our room with our confirmation email printed out (which had 2 confirm numbers for each room that his father had reserved) and the lady at the check out desk told us that we couldn't check in because it was only under my husband's father's name -- although we told her we had ANOTHER confirmation number, that also had my husband under another room's list and the woman was rude and said \""I don't care, that's only a piece of paper. It doesn't mean anything.\"" and when we went to another teller, they were more than nice to hear us out and we were able to get our room.\n\nOther than those two incidents, we had nice clean rooms and had a good time otherwise. However since the security incident happened the way it did, never again will we be staying with the Bellagio...."


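## Preprocessing the data

Once the dataset has been downloaded, a tokenizer is used to process the text. Because the inputs vary in length, padding and truncation strategies are used to bring them to a uniform size.

The `map` method of Datasets applies a preprocessing function to the entire dataset in one call.

The cells below process the whole dataset, padding every example to the maximum length: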

In [9]:
from transformers import AutoTokenizer

In [10]:
tokenizer = AutoTokenizer.from_pretrained('bert-base-cased')

In [11]:
def tokenize_function(examples):
    return tokenizer(examples['text'], padding="max_length", truncation=True)

In [12]:
tokenized_datasets = dataset.map(tokenize_function, batched=True)
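
To make the effect of `padding="max_length"` and `truncation=True` concrete, here is a minimal pure-Python sketch of the idea. It is not the tokenizer's implementation: the helper `pad_or_truncate` is hypothetical, `pad_id=0` is an assumption, and unlike a real tokenizer it does not keep the closing special token when it truncates.

```python
from typing import Dict, List


def pad_or_truncate(token_ids: List[int], max_length: int, pad_id: int = 0) -> Dict[str, List[int]]:
    """Return fixed-length input_ids plus an attention mask (1 = real token, 0 = padding)."""
    kept = token_ids[:max_length]      # truncation: drop everything past max_length
    n_pad = max_length - len(kept)     # padding: number of filler positions needed
    return {
        "input_ids": kept + [pad_id] * n_pad,
        "attention_mask": [1] * len(kept) + [0] * n_pad,
    }


short = pad_or_truncate([101, 1188, 102], max_length=5)
long = pad_or_truncate([101, 1188, 17287, 1110, 1388, 102], max_length=5)
print(short)
print(long)
```

Every example ends up with the same length, which is what allows the rows to be stacked into batches. The attention mask tells the model which positions to ignore.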

In [13]:
show_random_elements(tokenized_datasets['train'], num_examples=1)

Unnamed: 0,label,text,input_ids,token_type_ids,attention_mask
0,2 star,"This cafe is located on the main casino floor of Binions Hotel and Casino in Fremont Street, Las Vegas.\n\nI had the pulled pork sandwich, which was more like a bbq sauce sandwich with pulled pork garnish. But hey, at least they toasted the bun. \n\nThe price seemed a little stiff at close to $8. But the coffee is good and strong.\n\nAll in all, not a place I'm likely to return to, unless it's for a fast cup of joe.","[101, 1188, 17287, 1110, 1388, 1113, 1103, 1514, 14330, 1837, 1104, 21700, 5266, 4556, 1105, 14773, 1107, 13359, 26642, 1715, 117, 5976, 6554, 119, 165, 183, 165, 183, 2240, 1125, 1103, 1865, 19915, 14327, 117, 1134, 1108, 1167, 1176, 170, 171, 1830, 4426, 14313, 14327, 1114, 1865, 19915, 176, 1813, 21816, 119, 1252, 23998, 117, 1120, 1655, 1152, 17458, 1174, 1103, 171, 3488, 119, 165, 183, 165, 183, 1942, 4638, 3945, 1882, 170, 1376, 11111, 1120, 1601, 1106, 109, 129, 119, 1252, 1103, 3538, 1110, 1363, 1105, 2012, 119, 165, 183, 165, 183, 1592, 2339, 1107, 1155, 117, 1136, 170, ...]","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...]","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...]"


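### Sampling the data

We use 1000 samples to demonstrate small-scale training on BERT (based on the PyTorch Trainer).

The `shuffle()` function randomly reorders the rows of the dataset. For more control over the shuffling algorithm, pass the `generator` argument to use a different `numpy.random.Generator`.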

In [14]:
small_train_dataset = tokenized_datasets["train"].shuffle(seed=42).select(range(1000))
small_eval_dataset = tokenized_datasets["test"].shuffle(seed=42).select(range(1000))
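
The "shuffle with a fixed seed, then take the first N rows" pattern used above can be sketched with the standard library alone. This illustrates the idea only and is not how `datasets` implements it. The helper `shuffled_subset` is hypothetical.

```python
import random
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def shuffled_subset(rows: Sequence[T], n: int, seed: int) -> List[T]:
    """Shuffle row indices with a fixed seed, then keep the first n rows."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)   # seeded, so the order is reproducible
    return [rows[i] for i in indices[:n]]


rows = [f"review-{i}" for i in range(10)]
first = shuffled_subset(rows, n=3, seed=42)
second = shuffled_subset(rows, n=3, seed=42)
print(first == second)   # the same seed yields the same subset
```

Fixing the seed keeps the training and evaluation subsets identical across runs, which makes results comparable.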

In [15]:
small_train_dataset.num_rows, small_eval_dataset.num_rows

(1000, 1000)

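## Fine-tuning configuration

### Loading the BERT model

Loading the model prints a warning that some weights are discarded (the pretraining heads, `cls.predictions.*` and `cls.seq_relationship.*`) and that others are randomly initialized (the new `classifier` layer). This is entirely normal when fine-tuning. We are removing the heads used for BERT's pretraining tasks (masked language modeling and next sentence prediction) and replacing them with a new classification head that has no pretrained weights. The library therefore warns that the model should be fine-tuned before it is used for inference, which is exactly what we are about to do.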

In [16]:
from transformers import AutoModelForSequenceClassification

In [17]:
model = AutoModelForSequenceClassification.from_pretrained('bert-base-cased', num_labels=5)

Some weights of the model checkpoint at bert-base-cased were not used when initializing BertForSequenceClassification: ['cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.bias', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-cased and are newly initi

### Training hyperparameters (TrainingArguments)

Full list of arguments and their default values: https://huggingface.co/docs/transformers/v4.36.1/en/main_classes/trainer#transformers.TrainingArguments

Source definition: https://github.com/huggingface/transformers/blob/v4.36.1/src/transformers/training_args.py#L161

**The most important setting: the path where model weights are saved (output_dir)**

In [18]:
from transformers import TrainingArguments

In [19]:
model_dir = "models/bert-base-cased-finetune-yelp"
# logging_steps defaults to 500; given our training set size and number of training steps, set it to 100
training_args = TrainingArguments(output_dir=model_dir,
                                  per_device_train_batch_size=16,
                                  num_train_epochs=5,
                                  logging_steps=100)

In [20]:
# The full hyperparameter configuration
print(training_args)

TrainingArguments(
_n_gpu=1,
adafactor=False,
adam_beta1=0.9,
adam_beta2=0.999,
adam_epsilon=1e-08,
auto_find_batch_size=False,
bf16=False,
bf16_full_eval=False,
data_seed=None,
dataloader_drop_last=False,
dataloader_num_workers=0,
dataloader_pin_memory=True,
ddp_bucket_cap_mb=None,
ddp_find_unused_parameters=None,
ddp_timeout=1800,
debug=[],
deepspeed=None,
disable_tqdm=False,
do_eval=False,
do_predict=False,
do_train=False,
eval_accumulation_steps=None,
eval_delay=0,
eval_steps=None,
evaluation_strategy=no,
fp16=False,
fp16_backend=auto,
fp16_full_eval=False,
fp16_opt_level=O1,
fsdp=[],
fsdp_min_num_params=0,
fsdp_transformer_layer_cls_to_wrap=None,
full_determinism=False,
gradient_accumulation_steps=1,
gradient_checkpointing=False,
greater_is_better=None,
group_by_length=False,
half_precision_backend=auto,
hub_model_id=None,
hub_private_repo=False,
hub_strategy=every_save,
hub_token=<HUB_TOKEN>,
ignore_data_skip=False,
include_inputs_for_metrics=False,
jit_mode_eval=False,
label_nam

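### Evaluating metrics during training (Evaluate)

The **[Hugging Face Evaluate library](https://huggingface.co/docs/evaluate/index)** gives access, in a single line of code, to dozens of evaluation methods across domains (natural language processing, computer vision, reinforcement learning, and more). The **full list of currently supported metrics is at https://huggingface.co/evaluate-metric**.

The Trainer does not automatically evaluate model performance during training, so we need to pass it a function that computes and reports metrics.

The Evaluate library provides a simple accuracy function, which can be loaded with `evaluate.load`.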

In [21]:
import numpy as np
import evaluate

In [22]:
metric = evaluate.load("accuracy")

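Next, call the `compute` function to calculate the accuracy of the predictions.

Before passing predictions to `compute`, the logits must be converted to predicted class indices (**all Transformers models return logits**).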

In [23]:
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)
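
To spell out what the argmax and the accuracy metric compute, here is a dependency-free sketch with made-up logits for four reviews and five classes. The helpers `argmax` and `accuracy` are hypothetical stand-ins written for illustration. They are not the `numpy` or `evaluate` implementations.

```python
from typing import List, Sequence


def argmax(values: Sequence[float]) -> int:
    """Index of the largest value (the first one wins on ties)."""
    return max(range(len(values)), key=lambda i: values[i])


def accuracy(logits: List[Sequence[float]], labels: List[int]) -> float:
    """Fraction of rows whose highest-scoring class equals the reference label."""
    predictions = [argmax(row) for row in logits]
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return correct / len(labels)


logits = [
    [0.1, 2.3, 0.0, -1.0, 0.5],   # predicted class 1
    [1.9, 0.2, 0.1, 0.0, 0.3],    # predicted class 0
    [0.0, 0.1, 0.2, 0.3, 4.0],    # predicted class 4
    [0.5, 0.4, 3.0, 0.1, 0.2],    # predicted class 2
]
labels = [1, 0, 3, 2]
print(accuracy(logits, labels))   # 3 of 4 predictions match -> 0.75
```

No softmax is needed before the argmax, because softmax preserves the ordering of the logits.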

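#### Monitoring metrics during training

To monitor how the evaluation metrics change during training, we usually set the `evaluation_strategy` argument in `TrainingArguments` so that metrics are reported at the end of each epoch. Newer Transformers releases name this argument `eval_strategy`.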

In [24]:
from transformers import TrainingArguments, Trainer

In [25]:
training_args = TrainingArguments(output_dir=model_dir,
                                  evaluation_strategy="epoch", 
                                  per_device_train_batch_size=16,
                                  num_train_epochs=3,
                                  logging_steps=30)
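
As a sanity check on these settings, the number of optimizer steps can be worked out by hand. The sketch below assumes a single device, no gradient accumulation, and that the last partial batch is kept. The helper name `total_optimization_steps` is hypothetical.

```python
import math


def total_optimization_steps(num_examples: int, batch_size: int, epochs: int) -> int:
    """Optimizer steps for a single-device run without gradient accumulation."""
    steps_per_epoch = math.ceil(num_examples / batch_size)   # last partial batch counts
    return steps_per_epoch * epochs


steps = total_optimization_steps(num_examples=1000, batch_size=16, epochs=3)
print(steps)         # 63 steps per epoch * 3 epochs = 189
print(steps // 30)   # multiples of logging_steps=30 within the run: 6
```

The result agrees with the `Total optimization steps = 189` line in the training log of this notebook. With the default `logging_steps=500`, a run this short would never reach the first logging step, which is why a smaller value is set here.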

## Starting the training

### Instantiating the Trainer

`kernel version` warning: this version issue does not currently affect running this example.

In [26]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=small_train_dataset,
    eval_dataset=small_eval_dataset,
    compute_metrics=compute_metrics,
)

In [27]:
# `Trainer` has no `device` attribute; the device is exposed on its training arguments.
trainer.args.device

In [28]:
!nvidia-smi

Sun Jan 28 21:35:21 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 537.70                 Driver Version: 537.70       CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                     TCC/WDDM  | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  NVIDIA GeForce RTX 3060 Ti   WDDM  | 00000000:01:00.0  On |                  N/A |
| 30%   29C    P2              40W / 200W |   1414MiB /  8192MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                    

In [None]:
trainer.train()

The following columns in the training set don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: text. If text are not expected by `BertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 1000
  Num Epochs = 3
  Instantaneous batch size per device = 16
  Total train batch size (w. parallel, distributed & accumulation) = 16
  Gradient Accumulation steps = 1
  Total optimization steps = 189
  Number of trainable parameters = 108314117


Epoch,Training Loss,Validation Loss


In [None]:
small_test_dataset = tokenized_datasets["test"].shuffle(seed=64).select(range(100))

In [None]:
trainer.evaluate(small_test_dataset)

### Saving the model and training state

- Use `trainer.save_model` to save the model; it can be reloaded later with `from_pretrained()`.
- Use `trainer.save_state` to save the training state.

In [None]:
trainer.save_model(model_dir)

In [None]:
trainer.save_state()