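# Getting Started with Fine-Tuning in Hugging Face Transformers

This example walks through the main steps of fine-tuning a model with Transformers:
- Downloading the dataset
- Preprocessing the data
- Configuring training hyperparameters
- Setting up evaluation metrics
- A brief introduction to the Trainer
- Running the training
- Saving the model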

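## The YelpReviewFull Dataset

**Hugging Face dataset: [YelpReviewFull](https://huggingface.co/datasets/yelp_review_full)**

### Dataset Summary

The Yelp reviews dataset consists of reviews from Yelp. It is extracted from the Yelp Dataset Challenge 2015 data.

### Supported Tasks and Leaderboards
Text classification and sentiment classification: the dataset is mainly used for text classification, where the task is to predict the sentiment of a given text.

### Languages
The reviews are written mainly in English.

### Dataset Structure

#### Data Instances
A typical data point consists of a text and its corresponding label.

An example from the YelpReviewFull test set looks as follows: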

```python
{
    'label': 0,
    'text': 'I got \'new\' tires from them and within two weeks got a flat. I took my car to a local mechanic to see if i could get the hole patched, but they said the reason I had a flat was because the previous patch had blown - WAIT, WHAT? I just got the tire and never needed to have it patched? This was supposed to be a new tire. \\nI took the tire over to Flynn\'s and they told me that someone punctured my tire, then tried to patch it. So there are resentful tire slashers? I find that very unlikely. After arguing with the guy and telling him that his logic was far fetched he said he\'d give me a new tire \\"this time\\". \\nI will never go back to Flynn\'s b/c of the way this guy treated me and the simple fact that they gave me a used tire!'
}
```

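#### Data Fields

- 'text': The review text is escaped using double quotes ("), and any internal double quote is escaped by two double quotes (""). Newlines are escaped as a backslash followed by an "n" character, i.e. "\n".
- 'label': The star rating of the review. It is stored as a class index from 0 to 4, which corresponds to 1 to 5 stars (the example above has `'label': 0`, i.e. 1 star).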
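
The two field conventions can be made concrete with a minimal sketch in plain Python. The helper names `label_to_stars` and `unescape_review` are hypothetical and are not part of the dataset or the `datasets` library. The sketch assumes the label is a 0-based class index and that newlines and double quotes show up in the text as the literal two-character sequences `\n` and `\"`, as they do in the example instance above.

```python
def label_to_stars(label: int) -> int:
    """Convert a 0-based class index (0..4) to a star rating (1..5)."""
    if not 0 <= label <= 4:
        raise ValueError(f"label must be in 0..4, got {label}")
    return label + 1


def unescape_review(text: str) -> str:
    """Turn literal backslash-n and backslash-quote sequences into real characters."""
    return text.replace("\\n", "\n").replace('\\"', '"')


# A shortened, hand-written sample in the same shape as a dataset row.
sample = {"label": 0, "text": 'he said \\"this time\\". \\nI will never go back'}
print(label_to_stars(sample["label"]))
print(unescape_review(sample["text"]))
```

Whether you need to unescape the text at all depends on the downstream task; for tokenization the raw text is often used as-is.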

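#### Data Splits

The Yelp reviews full star dataset is constructed by randomly taking 130,000 training samples and 10,000 test samples for each star rating from 1 to 5. In total there are 650,000 training samples and 50,000 test samples.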


## Downloading the Dataset

In [2]:
from datasets import load_dataset

In [3]:
# 650,000 training examples and 50,000 test examples

dataset = load_dataset('yelp_review_full')

In [4]:
dataset

DatasetDict({
    train: Dataset({
        features: ['label', 'text'],
        num_rows: 650000
    })
    test: Dataset({
        features: ['label', 'text'],
        num_rows: 50000
    })
})

In [54]:
dataset['train'][0]

{'label': 4,
 'text': "dr. goldberg offers everything i look for in a general practitioner.  he's nice and easy to talk to without being patronizing; he's always on time in seeing his patients; he's affiliated with a top-notch hospital (nyu) which my parents have explained to me is very important in case something happens and you need surgery; and you can get referrals to see specialists without having to see him first.  really, what more do you need?  i'm sitting here trying to think of any complaints i have about him, but i'm really drawing a blank."}

In [36]:
import random
import pandas as pd
import datasets
from IPython.display import display, HTML
import numpy as np

In [46]:
def show_random_elements(dataset, num_examples=10):
    # Fail early if more examples are requested than the dataset contains.
    assert num_examples <= len(dataset), "Can't pick more elements than there are in the dataset."
    # Sample distinct row indices from the full range [0, len(dataset)).
    picks = np.random.choice(len(dataset), size=num_examples, replace=False).tolist()
    df = pd.DataFrame(dataset[picks])

    # Replace integer class ids with their human-readable label names.
    for column, typ in dataset.features.items():
        if isinstance(typ, datasets.ClassLabel):
            df[column] = df[column].transform(lambda i: typ.names[i])
    display(HTML(df.to_html()))

In [49]:
show_random_elements(dataset['train'])

Unnamed: 0,label,text
0,4 stars,"Very cool little spot in the Wynn. We sat at the bar and had some drinks with Bill. He was a trip and a great bartender. The drinks were really heavy pour.\n\nWe split an order of the Prosciutto and Arugula pizza and the Lasagna. The pizza was probably the best thing we had there though. Crispy crust on the bottom and really tender on the top. The lasagna wasn't bad, it just wasn't as great as we had hoped. It had chopped up Italian sausage on the top that sorta ruined it for us. It wasn't like crumbled up sausage, it was like cooked in the casing and \n\nVery friendly atmosphere and pretty good food."
1,5 stars,"Great guy, great business! I've never met someone more competent with catering services. Organized, honest, and experienced. Can't go wrong with Classic Catering."
2,1 star,"Don't pay them ahead of time. No pun intended, I paid him $2000 to do a tankless hot water heater and he said he would be back the next day. It's been five months now and he hasn't even called me back message after message. Not to be trusted."
3,4 stars,"Top notch boofay of boofays. Sushi and seafood is amazing. Brunch is always amazing here. The desserts are the prime reason to go here. I can't get enough. The only thing to improve to make this a 5 is the staff. Although they work hard it's certain they have too many tables to attend to. Certainly nice. Just runnin' their azz off to get everyone and pull plates away fast. we had some stack up on different visits. Still, I love this buffet."
4,5 stars,"Everything that can be said has been said about this wonderful place, so I will keep it short. The tiramisu gelato was rich, creamy, and otherworldly....possibly the best tiramisu I've ever had, but in frozen and transcendent form. The $8 72% Dark Chocolate bar I bought for the lady? Smooth, rich, and perfect. The coffee......I guess it was coffee, but honestly it tasted chocolate infused......heavenly.\n\nMy buddy picked up a bag of the \""hot chocolate\"" for his mother....as opposed to standard powder/mix, these were miniature balls of chocolate designed to be dissolved in milk. Incredible.\n\nWhile the lines can get long and inconvenient, who cares when you can watch the mesmerizing chocolate fountain and stare at the incredible fondant cakes surrounding you?\n\nI'll stop by for a crepe next time, for sure."
5,3 stars,"He is very talented and funny, but some of his \""edgy\"" jokes make it not as family friendly as it could be. (PG-13) Really not necessary to make for a great act. I think he feels like he needs it for Vegas... he even made a comment to that point, but really he could have kept the act totally clean, and it would have been just as fun. Overall entertaining show. The Paranormal Mentalist show at Bally's was a little simpler with less props, but I thought it was a little better for the kids."
6,5 stars,"AWM is probably one of my favorite places on earth. They have everything I need: wine, cheese & chocolate. But if you have a hankering for something else they have other good stuff too like tasty toasted sandwiches, local brews, and excellent cocktails. The owner and staff are amazing and are really what makes this place special. They will make you a killer cocktail on request, provide samples to ensure you get just what you want, and engage in stimulating bar conversation as needed. AWM is unpretentious, yet unique; local, yet cosmopolitan; your neighborhood bar, yet also a little bit trendy. Obviously, I'm in love.\nIf you don't think this place deserves 5 stars we probably can't be friends."
7,1 star,"My wife and I visited for our anniversary, hoping for a great seafood dinner. We left incredibly disappointed overall. We were seated right away but it took nearly 10 minutes for our server to greet us and take our drink order, two iced teas. It then took about 10 minutes just to bring us our drinks, despite it not being terribly busy. Our server then left and took 10 more minutes before coming back to take our order and was not very friendly, looking like this was the last place she wanted to be. \n\nWe ordered stuffed mushrooms as an appetizer, and placed our entree orders, in fear we would have to wait forever. I ordered the halibut and my wife ordered a shrimp and scallop scampi. Our appetizer was decent, not great. We opted for soup, clam chowder, for our entree side which was actually pretty good. \n\nOur dinners then came and things went downhill quickly. My plate included halibut, seafood pasta, and vegetables. The vegetables were clearly frozen and out of a bag. Very disappointing already. The seafood pasta was crab (possibly krab) mixed with rotini noodles. It was OK at best. The halibut was a nice sized portion but did not taste good at all. It tasted old if that even makes sense. My wife's dish was better than mine, but very salty. The scallops were rubbery and not cooked well.\n\nAll of this cost us $60 without tip or any drinks from the bar. If the food were delicious I'd be fine with this, but sadly you can get a better meal at red lobster *cringes* or if you were to cook at home. The service was poor which did not help any. For comparison, I went to Four Peaks Brewing today, ordered a stuffed rainbow trout entree for $16 (half what I paid at Seafood Market) and it was leaps and bounds better...from a brew pub! That should be embarrassing to Seafood Market. I definitely would not return here."
8,4 stars,That's cool. Had a wonderful experience in Mirage! Nice beds... Nice people!
9,2 star,"The sushi was pretty good, a bit expensive though. The atmosphere is alright if you don't mind tourists in the midst of their Vegas vacation.. I found the staff to be very rude, I was asked if I was ready to order at least 10 times in the first 15 minutes I was there.. .It made me feel very uncomfortable and unwanted, I ended up ordering something random just to get them off my case... Then I waited 25 minutes for my bill, after they didn't bring the bill.. I dropped my money on the table and walked out.\nI found the staff to be pretentious and rude...and the sushi was somewhere below par.."


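## Preprocessing the Data

Once the dataset has been downloaded locally, use a tokenizer to process the text. Inputs of varying length can be handled with padding and truncation strategies.

The Datasets `map` method applies a preprocessing function to the entire dataset in one call.

The code below pads every example to the maximum length and applies this to the whole dataset: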
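
As a rough illustration of what padding and truncation do to a single example, here is a toy sketch in plain Python. It is not the tokenizer's implementation: `pad_or_truncate` is a hypothetical helper, the padding id `0` is an assumption, and a real tokenizer also takes care to keep special tokens in place when it truncates.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Return fixed-length input_ids and the matching attention_mask."""
    ids = list(token_ids[:max_length])   # truncation: drop tokens past max_length
    mask = [1] * len(ids)                # 1 marks a real token
    padding = max_length - len(ids)
    ids += [pad_id] * padding            # padding: fill the remainder with pad_id
    mask += [0] * padding                # 0 marks a padding position
    return {"input_ids": ids, "attention_mask": mask}


# A short sequence is padded; a long one is cut to max_length.
short = pad_or_truncate([101, 1284, 119, 102], max_length=6)
long = pad_or_truncate([101, 1284, 22542, 1106, 119, 102, 7, 8], max_length=6)
print(short)
print(long)
```

The attention mask is what lets the model ignore padded positions, which is why it appears as a separate column next to `input_ids` after tokenization.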

In [73]:
from transformers import AutoTokenizer

In [75]:
tokenizer = AutoTokenizer.from_pretrained('bert-base-cased')

In [76]:
def tokenize_function(examples):
    return tokenizer(examples['text'], padding="max_length", truncation=True)

In [77]:
tokenized_datasets = dataset.map(tokenize_function, batched=True)

In [81]:
show_random_elements(tokenized_datasets['train'], num_examples=1)

Unnamed: 0,label,text,input_ids,token_type_ids,attention_mask
0,2 star,We ventured to Feedbag tonight and were not impressed with the food. The service was very good and portion sizes were nice but the instant macaroni and cheese and instant mashed potatoes left us disappointed.,"[101, 1284, 22542, 1106, 11907, 1174, 17097, 3568, 1105, 1127, 1136, 7351, 1114, 1103, 2094, 119, 1109, 1555, 1108, 1304, 1363, 1105, 3849, 10855, 1127, 3505, 1133, 1103, 6879, 23639, 14452, 2605, 1105, 9553, 1105, 6879, 12477, 10680, 15866, 1286, 1366, 9333, 119, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...]","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...]","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...]"


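### Sampling the Data

We use 1,000 samples to demonstrate small-scale training on BERT (with the PyTorch Trainer).

The `shuffle()` method randomly reorders the rows of the dataset. For more control over the shuffling algorithm, pass the `generator` parameter to use a different `numpy.random.Generator`.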
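
The idea behind "shuffle with a fixed seed, then take the first n rows" can be sketched in plain Python. This is only an illustration of the pattern: `shuffled_subset` is a hypothetical helper, and it uses the standard-library `random` module rather than the NumPy generator that the Datasets library relies on, so it will not reproduce the library's actual row order.

```python
import random


def shuffled_subset(rows, n, seed):
    """Shuffle row indices with a seeded RNG, then keep the first n rows."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)   # a private RNG, so global state is untouched
    return [rows[i] for i in indices[:n]]


rows = [f"review-{i}" for i in range(10)]
first = shuffled_subset(rows, n=3, seed=42)
second = shuffled_subset(rows, n=3, seed=42)
print(first == second)  # the same seed gives the same subset
```

Fixing the seed is what makes the small training and evaluation subsets reproducible across runs.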

In [82]:
small_train_dataset = tokenized_datasets["train"].shuffle(seed=42).select(range(1000))
small_eval_dataset = tokenized_datasets["test"].shuffle(seed=42).select(range(1000))

In [84]:
small_train_dataset.num_rows, small_eval_dataset.num_rows

(1000, 1000)

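## Fine-Tuning Configuration

### Loading the BERT Model

The warning tells us that some weights are being discarded (the `cls.predictions` and `cls.seq_relationship` heads used for pretraining) and that others are randomly initialized (the new `classifier` layer). This is completely normal when fine-tuning: we are removing the heads used for the pretraining tasks and replacing them with a new classification head for which there are no pretrained weights. The library therefore warns us that the model should be fine-tuned before it is used for inference, which is exactly what we are about to do.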
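
One way to picture where the two halves of the warning come from is as a comparison between the parameter names stored in the checkpoint and the names the new model expects. The sketch below uses a small, hand-picked subset of names for illustration; the `classifier.*` names are an assumption about the new head, and the actual loading logic in Transformers is more involved than a set difference.

```python
# Illustrative subset of parameter names; a real checkpoint has many more.
checkpoint_keys = {
    "bert.embeddings.word_embeddings.weight",
    "bert.pooler.dense.weight",
    "cls.predictions.bias",
    "cls.seq_relationship.weight",
}
# Assumed names for the new classification head.
model_keys = {
    "bert.embeddings.word_embeddings.weight",
    "bert.pooler.dense.weight",
    "classifier.weight",
    "classifier.bias",
}

unused = sorted(checkpoint_keys - model_keys)             # in the checkpoint, not in the model
newly_initialized = sorted(model_keys - checkpoint_keys)  # in the model, not in the checkpoint
loaded = sorted(checkpoint_keys & model_keys)             # copied from the checkpoint

print("not used:", unused)
print("newly initialized:", newly_initialized)
print("loaded:", loaded)
```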

In [85]:
from transformers import AutoModelForSequenceClassification

In [90]:
model = AutoModelForSequenceClassification.from_pretrained('bert-base-cased', num_labels=5)

model.safetensors:   0%|          | 0.00/436M [00:00<?, ?B/s]

Some weights of the model checkpoint at bert-base-cased were not used when initializing BertForSequenceClassification: ['cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-cased and are newly initi

### Training Hyperparameters (TrainingArguments)

Full list of parameters and their defaults: https://huggingface.co/docs/transformers/v4.36.1/en/main_classes/trainer#transformers.TrainingArguments

Source definition: https://github.com/huggingface/transformers/blob/v4.36.1/src/transformers/training_args.py#L161

**The most important setting: where the model weights are saved (`output_dir`)**

In [91]:
from transformers import TrainingArguments

In [92]:
model_dir = "models/bert-base-cased-finetune-yelp"
# logging_steps defaults to 500; given the size of our training data and the batch size, set it to 100
training_args = TrainingArguments(output_dir=model_dir,
                                  per_device_train_batch_size=16,
                                  num_train_epochs=5,
                                  logging_steps=100)
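
The choice of `logging_steps=100` can be checked with a little arithmetic. The sketch below is only an estimate: it assumes the 1,000-sample training subset, a single device, no gradient accumulation, and that the last partial batch is kept. `logged_steps` is a hypothetical helper, not a Trainer API.

```python
import math

num_train_samples = 1000   # size of small_train_dataset
batch_size = 16            # per_device_train_batch_size
num_epochs = 5             # num_train_epochs

# Assumes one device, no gradient accumulation, and the last partial batch is kept.
steps_per_epoch = math.ceil(num_train_samples / batch_size)
total_steps = steps_per_epoch * num_epochs


def logged_steps(logging_steps):
    """Steps at which a periodic training log would be emitted."""
    return list(range(logging_steps, total_steps + 1, logging_steps))


print(steps_per_epoch, total_steps)
print(logged_steps(500))
print(logged_steps(100))
```

Under these assumptions the run is shorter than 500 steps, so the default interval would never trigger a periodic log, while an interval of 100 gives a few readings of the training loss.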

In [93]:
# Full hyperparameter configuration
print(training_args)

TrainingArguments(
_n_gpu=0,
adafactor=False,
adam_beta1=0.9,
adam_beta2=0.999,
adam_epsilon=1e-08,
auto_find_batch_size=False,
bf16=False,
bf16_full_eval=False,
data_seed=None,
dataloader_drop_last=False,
dataloader_num_workers=0,
dataloader_pin_memory=True,
ddp_bucket_cap_mb=None,
ddp_find_unused_parameters=None,
ddp_timeout=1800,
debug=[],
deepspeed=None,
disable_tqdm=False,
do_eval=False,
do_predict=False,
do_train=False,
eval_accumulation_steps=None,
eval_delay=0,
eval_steps=None,
evaluation_strategy=no,
fp16=False,
fp16_backend=auto,
fp16_full_eval=False,
fp16_opt_level=O1,
fsdp=[],
fsdp_min_num_params=0,
fsdp_transformer_layer_cls_to_wrap=None,
full_determinism=False,
gradient_accumulation_steps=1,
gradient_checkpointing=False,
greater_is_better=None,
group_by_length=False,
half_precision_backend=auto,
hub_model_id=None,
hub_private_repo=False,
hub_strategy=every_save,
hub_token=<HUB_TOKEN>,
ignore_data_skip=False,
include_inputs_for_metrics=False,
jit_mode_eval=False,
label_nam

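### Evaluating Metrics During Training (Evaluate)

The **[Hugging Face Evaluate library](https://huggingface.co/docs/evaluate/index)** provides dozens of evaluation methods across domains (natural language processing, computer vision, reinforcement learning, and more), each loadable with a single line of code. **Full list of currently supported metrics: https://huggingface.co/evaluate-metric**

The Trainer does not automatically evaluate model performance during training, so we need to pass it a function that computes and reports metrics.

The Evaluate library provides a simple accuracy function, which you can load with `evaluate.load`.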
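
To show the shape of such a metrics function, here is a hand-rolled sketch in plain Python. It is a stand-in for illustration, not the Evaluate library's accuracy implementation: it assumes the function receives a `(logits, labels)` pair and returns a dictionary of named metrics, and it uses nested lists where the notebook would normally see NumPy arrays.

```python
def compute_metrics(eval_pred):
    """Compute accuracy from (logits, labels), returning a dict of named metrics."""
    logits, labels = eval_pred
    # Predicted class = index of the largest logit in each row.
    predictions = [max(range(len(row)), key=row.__getitem__) for row in logits]
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return {"accuracy": correct / len(labels)}


# Four toy examples over five classes; the last prediction is wrong on purpose.
logits = [[0.1, 2.0, 0.3, 0.0, 0.0],   # predicts class 1
          [1.5, 0.2, 0.1, 0.0, 0.0],   # predicts class 0
          [0.0, 0.1, 0.2, 0.3, 3.0],   # predicts class 4
          [0.0, 0.9, 0.1, 0.0, 0.0]]   # predicts class 1
labels = [1, 0, 4, 2]
print(compute_metrics((logits, labels)))
```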

In [94]:
import numpy as np
import evaluate

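ModuleNotFoundError: No module named 'evaluate'

This error means the `evaluate` package is not installed in the current environment; install it (for example with `pip install evaluate`) and rerun the cell.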