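# Getting Started with Fine-Tuning in Hugging Face Transformers

This example walks through the main steps of fine-tuning a model with Transformers:
- Downloading the dataset
- Preprocessing the data
- Configuring training hyperparameters
- Setting up evaluation metrics
- Introducing the Trainer
- Running the training
- Saving the model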

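## The YelpReviewFull dataset

**Hugging Face dataset: [YelpReviewFull](https://huggingface.co/datasets/yelp_review_full)**

### Dataset summary

The Yelp reviews dataset consists of reviews from Yelp, extracted from the Yelp Dataset Challenge 2015 data.

### Supported tasks and leaderboards
Text classification and sentiment classification: the dataset is mainly used for text classification, that is, predicting the sentiment of a given text.

### Language
The reviews are written mainly in English.

### Dataset structure

#### Data instances
A typical data point consists of a text and its corresponding label.

An example from the YelpReviewFull test set: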

```json
{
    'label': 0,
    'text': 'I got \'new\' tires from them and within two weeks got a flat. I took my car to a local mechanic to see if i could get the hole patched, but they said the reason I had a flat was because the previous patch had blown - WAIT, WHAT? I just got the tire and never needed to have it patched? This was supposed to be a new tire. \\nI took the tire over to Flynn\'s and they told me that someone punctured my tire, then tried to patch it. So there are resentful tire slashers? I find that very unlikely. After arguing with the guy and telling him that his logic was far fetched he said he\'d give me a new tire \\"this time\\". \\nI will never go back to Flynn\'s b/c of the way this guy treated me and the simple fact that they gave me a used tire!'
}
```

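#### Data fields

- 'text': the review text, escaped with double quotes ("); any internal double quote is escaped as two double quotes (""). Newlines are escaped as a backslash followed by an "n" character, that is, "\n".
- 'label': the review's star rating (1 to 5 stars), stored as a class index from 0 to 4.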
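
To make the field descriptions concrete, the sketch below converts a stored label into a star rating and restores line breaks in a review. It assumes that labels are zero-based class indices (as the sample with `'label': 0` suggests) and that newlines appear as the two literal characters `\n`. `label_to_stars` and `restore_newlines` are hypothetical helper names, not part of the `datasets` library.

```python
# Class names as displayed in this notebook's sample tables (index 0 -> "1 star").
LABEL_NAMES = ["1 star", "2 star", "3 stars", "4 stars", "5 stars"]


def label_to_stars(label: int) -> int:
    """Map a zero-based class index (assumed 0..4) to a star rating (1..5)."""
    if not 0 <= label < len(LABEL_NAMES):
        raise ValueError(f"unexpected label: {label}")
    return label + 1


def restore_newlines(text: str) -> str:
    """Turn the literal two-character sequence backslash + n into a real newline."""
    return text.replace("\\n", "\n")


# Made-up sample used only for illustration.
sample = {"label": 0, "text": "Great food.\\nSlow service."}
print(label_to_stars(sample["label"]), LABEL_NAMES[sample["label"]])
print(restore_newlines(sample["text"]))
```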

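#### Data splits

The Yelp reviews full star dataset is constructed by randomly taking 130,000 training samples and 10,000 test samples for each star rating from 1 to 5. In total there are 650,000 training samples and 50,000 test samples.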
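
The totals follow directly from the per-class counts. As a quick check, and assuming the classes are perfectly balanced as described, a classifier that guesses uniformly at random would be right about one time in five, which is a useful baseline when reading the accuracy figures later on:

```python
NUM_CLASSES = 5
TRAIN_PER_CLASS = 130_000
TEST_PER_CLASS = 10_000

train_total = NUM_CLASSES * TRAIN_PER_CLASS
test_total = NUM_CLASSES * TEST_PER_CLASS
chance_accuracy = 1 / NUM_CLASSES  # uniform guessing over balanced classes

print(train_total, test_total, chance_accuracy)
```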

## Download the dataset

In [1]:
from datasets import load_dataset

dataset = load_dataset("yelp_review_full")

In [2]:
dataset

DatasetDict({
    train: Dataset({
        features: ['label', 'text'],
        num_rows: 650000
    })
    test: Dataset({
        features: ['label', 'text'],
        num_rows: 50000
    })
})

In [3]:
dataset["train"][111]

{'label': 2,
 'text': "As far as Starbucks go, this is a pretty nice one.  The baristas are friendly and while I was here, a lot of regulars must have come in, because they bantered away with almost everyone.  The bathroom was clean and well maintained and the trash wasn't overflowing in the canisters around the store.  The pastries looked fresh, but I didn't partake.  The noise level was also at a nice working level - not too loud, music just barely audible.\\n\\nI do wish there was more seating.  It is nice that this location has a counter at the end of the bar for sole workers, but it doesn't replace more tables.  I'm sure this isn't as much of a problem in the summer when there's the space outside.\\n\\nThere was a treat receipt promo going on, but the barista didn't tell me about it, which I found odd.  Usually when they have promos like that going on, they ask everyone if they want their receipt to come back later in the day to claim whatever the offer is.  Today it was one of th

In [5]:
import random
import pandas as pd
import datasets
from IPython.display import display, HTML

In [7]:
def show_random_elements(dataset, num_examples=10):
    assert num_examples <= len(dataset), "Can't pick more elements than there are in the dataset."
    picks = []
    for _ in range(num_examples):
        pick = random.randint(0, len(dataset)-1)
        while pick in picks:
            pick = random.randint(0, len(dataset)-1)
        picks.append(pick)
    
    df = pd.DataFrame(dataset[picks])
    for column, typ in dataset.features.items():
        if isinstance(typ, datasets.ClassLabel):
            df[column] = df[column].transform(lambda i: typ.names[i])
    display(HTML(df.to_html()))

In [16]:
show_random_elements(dataset["train"])

Unnamed: 0,label,text
0,3 stars,"If you go to Cave Creek you must go to the Buffalo Chip, Harold's , hideaway etc. just great local restaurants in a great town north of Phoenix/Scottsdale. I like the Chip more for drinking dancing and rodeo than the food. But it is okay for a little grub But it is fun. So try it"
1,3 stars,"Chuy's was pretty good. Seems like all of their locations are pretty similar. I think they'd be best known for their cheap margaritas... $2 pints, $4 small pitchers. They aren't too strong and come from a mediocre mix, but if you're looking for something sweet and cheap its a good bet.\n\nFree serve-yourself chips and salsa are pretty good. \n\nAs for food... pretty decent prices and okay to good food. I wouldn't say its authentic Mexican, actually I think the menu is a bit confused. But the food is good overall. It's a place we'll go to every month or so."
2,3 stars,"I ended up eating at Taggia while staying at Firesky so it was a choice of convenience. I've had the food from here several times using room service and it's never anything to complain about. It was the same story the day I had lunch here. I had an organic greens salad and shared the margherita and goat cheese pizzas with my fellow lunchers. All of the food was good - the goat cheese pizza in particular with its thin, crispy crust.\n\nUnfortunately the day we ate here our service was MIA. We were told we could seat ourselves so we did. After about 10 minutes someone came by to take our drink order and maybe 10 minutes later our waters arrived. Well 2 out of 3 of them did anyway. Then we ordered two salads and two pizzas to share. One pizza came first. WTH? Where were the salads? Or the other pizza? The salads showed up a few minutes later and then our server realized that she had forgotten our second pizza. No biggie since we had salads and one pizza to eat. But the service was lackluster with a L. Like Andrea R says, I wouldn't go out of my way to eat here, but when in the area it's a good option to have."
3,2 star,"I recently had a work luncheon at Ricardo's, I had been before years ago and it was extremely unmemorable. This visit would be more memorable but for the wrong reasons. \n\nWhen given the choice, I prefer to order off the menu than choose a buffet. But the whole group went to the buffet and I didn't want to be the oddball. I had two carne asada tacos, cheese enchilada and chips & salsa. The enchilada was bland the only hint of flavor was the acidity from the tomatoes. The salsa, too, was bland and watery. The chips were pretty generic. The first taco was ok, a bit bland, but tender. The second was filled with grizzly meat. It really turned my stomach. Fortunately, the service was friendly and they were able to accomodate our large group."
4,4 stars,"We had a great time at this resort over the long weekend. The staff was super friendly, especially Adam, David and Cassie. Great job!!! And our suite was perfect to accommodate three women with lots of bags, make-up and shoes. The Hole in the Wall restaurant had a really good breakfast, friendly staff and an outdoors patio. Not so for the Rico Restaurant. They were a bit rude, overwhelmed and obviously didn't want our business. We also floated down the Lazy River, it was definitely Lazy...pretty slow but perfect temp. All in all, I'll be back."
5,1 star,"Im an owner with no kids, this place is not for my husband and I.. The element here is all about families and cooking in and playing in the pool from the moment it opens.\n\nThe restaurant bar is a bit of a joke, and the pressure to buy more points makes a relaxing vacation more stressful. We were an original owner and saw most of it built.\n\nWe noticed that they no longer offer a shuttle which is a mistake for those that want to go to the strip and not have to worry about driving. But after this weekend I see that they don't need to offer the shuttle because more than half the people there don't plan on leaving the facility at all.\n\nThe guests that we ran into all seemed to be there on a free vacation offer. They were tatted up and pushing baby strollers... and screaming to each other WAIT TIL YOU SEE THE POOL....\n\nMy hubby and myself both looked at each other and said OUT LOUD, we don't think we will be coming back here again ever to this location.\n\nWe came home and looked into selling it all together, but then thought maybe we would try another location that Diamond Resorts has to offer before we do so..\n\nSo bottom line, if you have kids and love the pool and slides and pool some more.. this is for you.. If your looking for a weekend with the hubby or friends in Vegas to relax and to enjoy what Vegas is all about... this resort is not for you.."
6,3 stars,"Booked a room here through Priceline for the Tuesday before Thanksgiving. Actually booked it on the drive in from Las Vegas through my cell phone, which was pretty sweet. Paid $25 + tax, so you can't beat the price. We had a hard time finding it as Google Maps was wrong about it's location, but you can't blame the hotel for that.\n\nWhat I can blame the hotel for is not giving me a king size bed. Priceline had booked me with 2 doubles, and in my experience I am always able to switch unless they're sold out. The front desk clerk told me they were indeed not sold out, but it was their policy not to let Priceline users switch rooms. So much for considering staying there the rest of Thanksgiving week.\n\nI was going to let it go, but then at 9:30am the maid knocked on our door and woke me up 2 hours earlier than I had planned (we got to sleep at 5am, give me a break). I ended up coming out of my room and saw that my do not disturb sign was on, so she must have chosen to ruin my day for fun. Tried to fall back asleep but was then kept up by the sounds of what looked to be a loud garbage truck parked right outside of our room.\n\nI give up on sleeping. At least they have solid free Internet so I can Yelp this hotel. Courtyard, you're lucky you're getting 3 stars from me."
7,2 star,"Been to 4 Cirques and this is the least favorite. Sets and costumes are absolutely amazing but the acts were very unimpressive compared to the older ones we've seen. \nThe only exception was the opening act with the two twin men on ropes that swung out into the audience. They weren't in the book at the shop so I think they added them in later to spice up the program. Breathtaking!\n\nPros- Art direction, sets, lights\n\nCons- Acts seen before and ANNOYING clowns"
8,2 star,"KOOLAID KID reminded me of home...nice touch..i know I know..this is suppose to be about the chicken & waffles, but I must say quenching my thirst is very important to me..so back to the food..it was just that chicken(no flavor) & waffles(nothing special)..mac & cheese was very nice...and the new building was very very nice..okay that's all"
9,1 star,Just called this location and I live 1.8 miles away. I asked them to deliver and they informed me that they would not deliver to my house because it was a couple hundred yards out of the map plan. They asked me to call the power and southern store. This store advised me that they could not deliver because jimmy johns has a two mile radius they can deliver to. Called this store back and they once again decided to tell me even though I was in the two mile radius they did not want to deliver to me and my only option was for pickup. I will never eat at this location. I know the owners at Firehouse Subs and they go out of the way and this location is just lazy. Not getting my money jimmy johns no matter how fast you are. Laziness is worse


## Preprocess the data

Once the dataset has been downloaded locally, use a tokenizer to process the text. Inputs of varying length can be handled with padding and truncation strategies.
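
As a rough illustration of what padding and truncation do to a sequence of token ids, here is a minimal pure-Python sketch. It is not the tokenizer's implementation: `pad_or_truncate` is a hypothetical helper, the padding id `0` is an assumption for this example, and real tokenizers also take care to keep special tokens such as the closing separator when truncating.

```python
from typing import Dict, List

PAD_ID = 0  # assumed padding token id for this illustration


def pad_or_truncate(token_ids: List[int], max_length: int) -> Dict[str, List[int]]:
    """Return fixed-length ids plus a mask marking real tokens (1) vs padding (0)."""
    kept = token_ids[:max_length]        # truncation: drop anything past max_length
    num_pad = max_length - len(kept)     # padding: fill up to max_length
    return {
        "input_ids": kept + [PAD_ID] * num_pad,
        "attention_mask": [1] * len(kept) + [0] * num_pad,
    }


short = pad_or_truncate([101, 1327, 102], max_length=5)
long = pad_or_truncate([101, 1327, 170, 1282, 106, 102], max_length=5)
print(short)
print(long)
```

Every output has the same length, which is what allows examples to be stacked into a batch; the attention mask tells the model which positions are padding.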

The `map` method in Datasets applies a preprocessing function to the entire dataset in one call.
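
With `batched=True`, the preprocessing function receives a dictionary of column lists rather than a single row, and returns a dictionary of new column lists. The sketch below imitates that contract in plain Python; `map_batched` and `count_words` are hypothetical names for illustration, and the real library additionally handles caching, multiprocessing, and on-disk storage.

```python
from typing import Callable, Dict, List

Batch = Dict[str, List]


def map_batched(columns: Batch, fn: Callable[[Batch], Batch], batch_size: int = 2) -> Batch:
    """Apply fn to slices of a column-oriented table and merge its outputs as columns."""
    num_rows = len(next(iter(columns.values())))
    result: Batch = {name: list(values) for name, values in columns.items()}
    for start in range(0, num_rows, batch_size):
        batch = {name: values[start:start + batch_size] for name, values in columns.items()}
        for name, values in fn(batch).items():
            if start == 0:
                result[name] = []  # a new (or overwritten) column starts empty
            result[name].extend(values)
    return result


def count_words(batch: Batch) -> Batch:
    # Receives a dict of lists and returns a dict of lists of the same length.
    return {"num_words": [len(text.split()) for text in batch["text"]]}


table = {"text": ["good food", "bad", "would come back again"], "label": [3, 0, 4]}
mapped = map_batched(table, count_words)
print(mapped)
```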

The cell below pads every example to the maximum length and processes the whole dataset:

In [8]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")


def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)


tokenized_datasets = dataset.map(tokenize_function, batched=True)

In [12]:
show_random_elements(tokenized_datasets["train"], num_examples=1)
tokenized_datasets["train"].num_rows

Unnamed: 0,label,text,input_ids,token_type_ids,attention_mask
0,5 stars,"What a place! It was my first time at Aria and I simply loved it! We decided to eat at Javier's since the decor was so inviting. You have to see it to believe it. The food is super fresh and extremely tasty and the service is top notch. \n\nI wasn't feeling too hungry so I got the ceviche with octopus on recommendation from our server and it was probably the best I have ever eaten. The chips are fresh and the salsa is spicy!\n\nThe design detail of this place is outstanding. You have to go there to take a look even if you're not close by, it's well worth the cab ride. I hope to go back and have a full meal.","[101, 1327, 170, 1282, 106, 1135, 1108, 1139, 1148, 1159, 1120, 12900, 1105, 146, 2566, 3097, 1122, 106, 1284, 1879, 1106, 3940, 1120, 14280, 112, 188, 1290, 1103, 1260, 19248, 1108, 1177, 16067, 119, 1192, 1138, 1106, 1267, 1122, 1106, 2059, 1122, 119, 1109, 2094, 1110, 7688, 4489, 1105, 4450, 27629, 13913, 1105, 1103, 1555, 1110, 1499, 23555, 119, 165, 183, 165, 183, 2240, 1445, 112, 189, 2296, 1315, 7555, 1177, 146, 1400, 1103, 172, 21075, 1162, 1114, 184, 5822, 17298, 1113, 13710, 1121, 1412, 9770, 1105, 1122, 1108, 1930, 1103, 1436, 146, 1138, 1518, 8527, 119, 1109, 13228, 1132, ...]","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...]","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...]"


650000

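### Sampling the data

Use 1,000 samples to demonstrate small-scale training on BERT (with the PyTorch Trainer).

The `shuffle()` function randomly reorders the rows of the dataset. For more control over the shuffling algorithm, pass the `generator` parameter to use a different `numpy.random.Generator`.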

In [14]:
small_train_dataset = tokenized_datasets["train"].shuffle(seed=42).select(range(1000))
small_eval_dataset = tokenized_datasets["test"].shuffle(seed=42).select(range(1000))
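
The shuffle-then-select pattern above amounts to taking the first 1,000 entries of a seeded random permutation of row indices. A minimal sketch using only the standard library follows; `shuffled_subset` is a hypothetical helper, and because it uses Python's `random` module rather than a NumPy generator, it will not reproduce the exact rows that `datasets` selects.

```python
import random
from typing import List


def shuffled_subset(num_rows: int, subset_size: int, seed: int) -> List[int]:
    """Return the first subset_size row indices of a seeded random permutation."""
    indices = list(range(num_rows))
    random.Random(seed).shuffle(indices)  # same seed -> same permutation
    return indices[:subset_size]


first = shuffled_subset(num_rows=650_000, subset_size=1000, seed=42)
second = shuffled_subset(num_rows=650_000, subset_size=1000, seed=42)
print(first == second, len(set(first)))
```

Fixing the seed makes the subset reproducible across runs, and sampling from a permutation guarantees that no row is picked twice.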

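## Fine-tuning configuration

### Load the BERT model

The warning printed on loading tells us that some weights are being discarded (the `cls.predictions.*` and `cls.seq_relationship.*` pretraining heads) and that others are randomly initialized (the `classifier` layer). This is completely normal when fine-tuning: we are removing the heads used for the pretraining tasks and replacing them with a new classification head for which there are no pretrained weights. The library therefore warns that the model should be fine-tuned before it is used for inference, which is exactly what we are about to do.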
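
One way to read this warning is as a set difference between the parameter names stored in the checkpoint and the names the new model expects. The sketch below is only an illustration of that idea, not the library's loading code. The unused names are copied from the warning printed in this notebook; the two `classifier.*` names are an assumption, since the printed warning is truncated before it lists them, and `explain_loading` is a hypothetical helper.

```python
from typing import Dict, List

# Names reported as unused in the notebook's warning output.
CHECKPOINT_ONLY = [
    "cls.predictions.bias",
    "cls.predictions.transform.dense.weight",
    "cls.predictions.transform.dense.bias",
    "cls.predictions.transform.LayerNorm.weight",
    "cls.predictions.transform.LayerNorm.bias",
    "cls.seq_relationship.weight",
    "cls.seq_relationship.bias",
]
# Assumed names for the new head (the printed warning is cut off before listing them).
MODEL_ONLY = ["classifier.weight", "classifier.bias"]


def explain_loading(checkpoint_keys: List[str], model_keys: List[str]) -> Dict[str, List[str]]:
    """Split parameter names into loaded, dropped, and freshly initialized groups."""
    ckpt, model = set(checkpoint_keys), set(model_keys)
    return {
        "loaded": sorted(ckpt & model),
        "dropped": sorted(ckpt - model),
        "newly_initialized": sorted(model - ckpt),
    }


shared = ["bert.embeddings.word_embeddings.weight"]  # one shared encoder weight, for illustration
report = explain_loading(shared + CHECKPOINT_ONLY, shared + MODEL_ONLY)
print(report["dropped"])
print(report["newly_initialized"])
```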

In [15]:
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased", num_labels=5)

Some weights of the model checkpoint at bert-base-cased were not used when initializing BertForSequenceClassification: ['cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.bias', 'cls.predictions.transform.dense.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-cased and are newly initi

### Training hyperparameters (TrainingArguments)

Full list of parameters and default values: https://huggingface.co/docs/transformers/v4.36.1/en/main_classes/trainer#transformers.TrainingArguments

Source definition: https://github.com/huggingface/transformers/blob/v4.36.1/src/transformers/training_args.py#L161

**Most important setting: the path where model weights are saved (`output_dir`)**

In [17]:
from transformers import TrainingArguments

model_dir = "models/bert-base-cased-finetune-yelp"

# logging_steps defaults to 500; given our training data size and number of steps, set it to 100
training_args = TrainingArguments(output_dir=model_dir,
                                  per_device_train_batch_size=16,
                                  num_train_epochs=5,
                                  logging_steps=100)
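
The reasoning behind the `logging_steps` comment can be checked with a little arithmetic. The sketch assumes a single GPU and no gradient accumulation (the configuration printed in this notebook reports `_n_gpu=1` and `gradient_accumulation_steps=1`) and uses the 1,000-sample training subset.

```python
import math

num_train_samples = 1000  # size of small_train_dataset
batch_size = 16           # per_device_train_batch_size
num_epochs = 5            # num_train_epochs

steps_per_epoch = math.ceil(num_train_samples / batch_size)
total_steps = steps_per_epoch * num_epochs


def logging_points(logging_steps: int) -> list:
    """Optimizer steps at which a training-loss entry would be logged."""
    return list(range(logging_steps, total_steps + 1, logging_steps))


print(steps_per_epoch, total_steps)
print(logging_points(100))  # with the value chosen above
print(logging_points(500))  # with the default
```

With the default of 500, the run would finish before the first logging point is reached, so no intermediate training loss would be reported.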

In [12]:
# Full hyperparameter configuration
print(training_args)

TrainingArguments(
_n_gpu=1,
adafactor=False,
adam_beta1=0.9,
adam_beta2=0.999,
adam_epsilon=1e-08,
auto_find_batch_size=False,
bf16=False,
bf16_full_eval=False,
data_seed=None,
dataloader_drop_last=False,
dataloader_num_workers=0,
dataloader_persistent_workers=False,
dataloader_pin_memory=True,
ddp_backend=None,
ddp_broadcast_buffers=None,
ddp_bucket_cap_mb=None,
ddp_find_unused_parameters=None,
ddp_timeout=1800,
debug=[],
deepspeed=None,
disable_tqdm=False,
dispatch_batches=None,
do_eval=False,
do_predict=False,
do_train=False,
eval_accumulation_steps=None,
eval_delay=0,
eval_steps=None,
evaluation_strategy=IntervalStrategy.NO,
fp16=False,
fp16_backend=auto,
fp16_full_eval=False,
fp16_opt_level=O1,
fsdp=[],
fsdp_config={'min_num_params': 0, 'xla': False, 'xla_fsdp_grad_ckpt': False},
fsdp_min_num_params=0,
fsdp_transformer_layer_cls_to_wrap=None,
full_determinism=False,
gradient_accumulation_steps=1,
gradient_checkpointing=False,
gradient_checkpointing_kwargs=None,
greater_is_better=

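### Evaluating metrics during training (Evaluate)

The **[Hugging Face Evaluate library](https://huggingface.co/docs/evaluate/index)** provides dozens of evaluation methods across domains (natural language processing, computer vision, reinforcement learning, and more), each loadable with a single line of code. The full list of supported metrics: https://huggingface.co/evaluate-metric

The Trainer does not automatically evaluate model performance during training, so we need to pass it a function that computes and reports metrics.

The Evaluate library provides a simple accuracy metric that can be loaded with `evaluate.load`: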

In [24]:
import numpy as np
import evaluate

metric = evaluate.load("accuracy")


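Next, call the metric's `compute` method to calculate the accuracy of the predictions.

Before passing predictions to `compute`, the logits must be converted to predicted class indices (**all Transformers models return logits**).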

In [20]:
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)
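
To see what the argmax step and the accuracy metric compute, here is a dependency-free sketch on made-up logits. `argmax_last_axis` plays the role of `np.argmax(logits, axis=-1)` for a 2-D input, and `accuracy` is a hypothetical stand-in for the metric; it returns a bare float rather than the dictionary the Evaluate metric returns.

```python
from typing import List, Sequence


def argmax_last_axis(logits: Sequence[Sequence[float]]) -> List[int]:
    """For each row of class scores, return the index of the largest score."""
    return [max(range(len(row)), key=row.__getitem__) for row in logits]


def accuracy(predictions: Sequence[int], references: Sequence[int]) -> float:
    """Fraction of predictions that equal their reference label."""
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / len(references)


# Made-up scores for four reviews over the five star classes.
logits = [
    [0.1, 2.3, -1.0, 0.0, 0.5],  # highest score at index 1
    [1.5, 0.2, 0.1, 0.0, -0.3],  # highest score at index 0
    [0.0, 0.1, 0.2, 0.3, 4.0],   # highest score at index 4
    [0.9, 0.8, 0.7, 0.6, 0.5],   # highest score at index 0
]
labels = [1, 0, 4, 2]

predictions = argmax_last_axis(logits)
print(predictions, accuracy(predictions, labels))
```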

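#### Monitoring metrics during training

To monitor how evaluation metrics change during training, set the `evaluation_strategy` parameter in `TrainingArguments` so that metrics are reported at the end of each epoch.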

In [21]:
from transformers import TrainingArguments, Trainer

training_args = TrainingArguments(output_dir=model_dir,
                                  evaluation_strategy="epoch", 
                                  per_device_train_batch_size=16,
                                  num_train_epochs=3,
                                  logging_steps=30)

## Start training

### Instantiate the Trainer

The `kernel version` warning printed below does not currently affect this example.

In [22]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=small_train_dataset,
    eval_dataset=small_eval_dataset,
    compute_metrics=compute_metrics,
)

Detected kernel version 5.4.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.


## Monitoring GPU usage with nvidia-smi

To watch GPU usage in real time, poll with the `watch` command: `watch -n 1 nvidia-smi`:

```shell
Every 1.0s: nvidia-smi                                                   Wed Dec 20 14:37:41 2023

Wed Dec 20 14:37:41 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03             Driver Version: 535.129.03   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Tesla T4                       Off | 00000000:00:0D.0 Off |                    0 |
| N/A   64C    P0              69W /  70W |   6665MiB / 15360MiB |     98%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A     18395      C   /root/miniconda3/bin/python                6660MiB |
+---------------------------------------------------------------------------------------+
```
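
If the memory figures need to be read programmatically, one option is to parse them out of the table text. The sketch below is a minimal example that assumes the `<used>MiB / <total>MiB` layout shown above; `parse_memory_usage` is a hypothetical helper, and the table layout may differ between driver versions.

```python
import re
from typing import Optional, Tuple

MEMORY_PATTERN = re.compile(r"(\d+)MiB\s*/\s*(\d+)MiB")


def parse_memory_usage(line: str) -> Optional[Tuple[int, int]]:
    """Extract (used_mib, total_mib) from one line of the nvidia-smi table, if present."""
    match = MEMORY_PATTERN.search(line)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


# Line copied from the nvidia-smi output above.
line = "| N/A   64C    P0              69W /  70W |   6665MiB / 15360MiB |     98%      Default |"
used, total = parse_memory_usage(line)
print(f"{used} MiB of {total} MiB ({used / total:.1%})")
```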

In [24]:
trainer.train()

Epoch,Training Loss,Validation Loss,Accuracy
1,1.2286,1.240987,0.47
2,0.9117,0.968127,0.577
3,0.663,0.979953,0.583


TrainOutput(global_step=189, training_loss=1.0004738197124825, metrics={'train_runtime': 351.8582, 'train_samples_per_second': 8.526, 'train_steps_per_second': 0.537, 'total_flos': 789354427392000.0, 'train_loss': 1.0004738197124825, 'epoch': 3.0})
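
The throughput figures in `TrainOutput` can be reproduced from the run's basic quantities, which is a handy sanity check. The sketch assumes a single GPU and no gradient accumulation, with the values copied from the output above.

```python
import math

num_samples = 1000        # size of small_train_dataset
batch_size = 16           # per_device_train_batch_size
num_epochs = 3            # num_train_epochs
train_runtime = 351.8582  # seconds, from TrainOutput

global_step = math.ceil(num_samples / batch_size) * num_epochs
samples_per_second = num_samples * num_epochs / train_runtime
steps_per_second = global_step / train_runtime

print(global_step, round(samples_per_second, 3), round(steps_per_second, 3))
```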

In [25]:
small_test_dataset = tokenized_datasets["test"].shuffle(seed=64).select(range(100))

In [26]:
trainer.evaluate(small_test_dataset)

{'eval_loss': 1.111964464187622,
 'eval_accuracy': 0.53,
 'eval_runtime': 2.9464,
 'eval_samples_per_second': 33.94,
 'eval_steps_per_second': 4.412,
 'epoch': 3.0}

### Save the model and training state

- Use `trainer.save_model` to save the model; it can later be reloaded with `from_pretrained()`
- Use `trainer.save_state` to save the training state

In [27]:
trainer.save_model(model_dir)

In [21]:
trainer.save_state()

In [23]:
# trainer.model.save_pretrained("./")
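
Reloading the saved model later might look like the sketch below. `load_finetuned` is a hypothetical helper, and it assumes training has already written the model to `model_dir`. Because no tokenizer was passed to the `Trainer` in this notebook, the sketch assumes none was saved alongside the model and loads it from the base checkpoint instead.

```python
MODEL_DIR = "models/bert-base-cased-finetune-yelp"  # output_dir used in this notebook


def load_finetuned(model_dir: str = MODEL_DIR):
    """Reload the saved classifier; the tokenizer comes from the base checkpoint."""
    # Imported lazily so that defining this helper does not require the library.
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    # The number of labels is read from the saved config, so it is not passed again.
    model = AutoModelForSequenceClassification.from_pretrained(model_dir)
    # Assumption: the Trainer had no tokenizer, so model_dir contains none.
    tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
    return model, tokenizer
```

After training has finished and the model has been saved, `model, tokenizer = load_finetuned()` would give a pair ready for inference.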

## Homework: train on the full YelpReviewFull dataset and see how high accuracy can go

In [1]:
from datasets import load_dataset

dataset = load_dataset("yelp_review_full")

In [33]:
dataset

DatasetDict({
    train: Dataset({
        features: ['label', 'text'],
        num_rows: 650000
    })
    test: Dataset({
        features: ['label', 'text'],
        num_rows: 50000
    })
})

In [3]:
dataset["train"][111]

{'label': 2,
 'text': "As far as Starbucks go, this is a pretty nice one.  The baristas are friendly and while I was here, a lot of regulars must have come in, because they bantered away with almost everyone.  The bathroom was clean and well maintained and the trash wasn't overflowing in the canisters around the store.  The pastries looked fresh, but I didn't partake.  The noise level was also at a nice working level - not too loud, music just barely audible.\\n\\nI do wish there was more seating.  It is nice that this location has a counter at the end of the bar for sole workers, but it doesn't replace more tables.  I'm sure this isn't as much of a problem in the summer when there's the space outside.\\n\\nThere was a treat receipt promo going on, but the barista didn't tell me about it, which I found odd.  Usually when they have promos like that going on, they ask everyone if they want their receipt to come back later in the day to claim whatever the offer is.  Today it was one of th

In [34]:
import random
import pandas as pd
import datasets
from IPython.display import display, HTML

In [35]:
def show_random_elements(dataset, num_examples=10):
    assert num_examples <= len(dataset), "Can't pick more elements than there are in the dataset."
    picks = []
    for _ in range(num_examples):
        pick = random.randint(0, len(dataset)-1)
        while pick in picks:
            pick = random.randint(0, len(dataset)-1)
        picks.append(pick)
    
    df = pd.DataFrame(dataset[picks])
    for column, typ in dataset.features.items():
        if isinstance(typ, datasets.ClassLabel):
            df[column] = df[column].transform(lambda i: typ.names[i])
    display(HTML(df.to_html()))

In [6]:
show_random_elements(dataset["train"])

Unnamed: 0,label,text
0,1 star,"The waitress was awesome however the cook had messed up an order numerous times for a turkery burger with no cheese. They had brought out a burger with cheese, requested to put the burger on a plate. When they brought it out for a third time, it was the same burger, there was edges of the cheese still left on the meat, they had just scrapped the cheese off. When asked to speak to the cook/owner they refused to come out to speak to us. They didn't charge for the meal, however if someone was seriously allergic to cheese, this could have been a disaster. Will definitely not go back to this restaurant for sure."
1,1 star,"It's hard to imagine only 1 star due to this hotel being above average in cost. \nWhen I'm paying for a nicer hotel, I expect at least B+ service. We left with the impression that this hotel is all about squeezing every dollar out of you and they're not worried about feedback since there are 1500 more people checking in every day. \n\n1. Lazy river. They advertise this big time but they don't tell you they charge $24 for an inner tube. If not free, they should at least rent them for a few bucks, right? TIP: Go to CVS on the strip and get the same inner tube for $4. \n\n2. They advertise the shark reef but it costs $18 for this mildly entertaining adventure.\n\n3. No coffee in the rooms. Ok, my math says they save around $1,000,000 per year by not having coffee in 4000 rooms. Fine. At least have enough coffee shops in the hotel. There's only Starbucks and one other coffee place. I don't even mind the extra markup. However I waited 30 - 45 minutes in the morning for a cup of coffee with a bunch of angry people. They're also grumbling about the lazy river while waiting for coffee. \n\n4. After driving hours to get there, we had to wait 45 minutes to valet our car. The valet parker told me it's like that all summer long. Translation: They're too cheap to hire 2 more valet guys to make it quicker. When I left and came back that night, another 30 minutes in the valet line. Outrageous.\n\n5. Problems with our room: Bathtub was dirty and 2 lights didn't work. \n\nAny time you try to give feedback they say the same thing: Sorry, we have 1500 people checking in today.\n\nOn day two, I phoned guest services to tell them I'm not happy and I may check out a day early. Their response: \""Ok sir, I have you leaving on the 27th now. Any thing else?\"" They didn't even ask what the problem was. \n\nI filled out an email survey stating all of the above and asked them to contact me. NO RESPONSE. My wife is so easy to please and she encouraged me to check out early. 
We came across dozens of guests that were having a similar experience. Keep ignoring us but eventually it will catch up with this hotel. \n\nI will never stay here again and it's a shame. If they just cared a little, it would be a great place. The wave pool and lazy river are fun for kids."
2,1 star,The service was terrible. I would go back again because the food was good but would not go back just because of the service.
3,5 stars,"What an awesome way to start the night. Just before going to work i came into just for appetizers and a few cocktails. The staff was great, the ambiance was phenominal and the drink prices were decent. i defiantly found a good place to bring dates or even a friend visiting. plus Light Nightclub is next door !!!!"
4,4 stars,"This location has gotten much better lately. They have the best balloon guy here. He does great balloon- items. Kids got a motorcycle and a helicopter today. The service has gotten much better. They are actually starting to remember to bring silverware instead of making the customers ask for it. The only thing that is better at other locations is that the rolls are usually overdone at this location, which is why I only gave 4 stars. Keep up the good work, Gilbert and Germann location. Just take the bread out a bit earlier so it doesn't overcook and this location would be a 5-star location"
5,3 stars,The spicy tofu soup was okay. The sides were great. The fresh noodle soup was bland. Service was nice.
6,2 star,"Pho here was ok, broth was to salty for my taste even after squeezing in half a lemon. Portions were decent, the server was very nice and pleasant. The middle asian man taking our order was kinda sad looking... Didn't even crack a smile.."
7,4 stars,The best Philly I've ever had and I am not a huge red meat fan. The sandwiches are huge my 2 teenage boys have a hard time finishing these so who cares if they're a bit pricey.\nDang now I'm craving one!
8,2 star,"This place is another Vegas night club that is living off its reputation. \n\nFirst, the dance floor is way too crowded for any real dancing. I could barely move without bumping into other people. Second, my girlfriend and I waited in line way too long while some average looking girls at best cut to the front of the line. If you are looking for a place to dance and have a few drinks without a bunch of hassles don't come here."
9,2 star,20 years I waited for this slop. Everything was lukewarm at best. Desserts that were supposed to be cold were warm. My 17 dollar cocktail gave me a good buzz for 5 minutes so 1 star for that. 2nd star is for service. Remember people...its a buffet. Lower your expectations or you will be slapped in the face by disappointment like myself.


In [36]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")


def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)


tokenized_datasets = dataset.map(tokenize_function, batched=True)

In [9]:
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased", num_labels=5)

Some weights of the model checkpoint at bert-base-cased were not used when initializing BertForSequenceClassification: ['cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-cased and are newly initi

In [28]:
from transformers import TrainingArguments, Trainer
model_dir = "models/bert-base-cased-finetune-yelp"
training_args = TrainingArguments(output_dir=model_dir,
                                  evaluation_strategy="epoch", 
                                  save_strategy="steps",           # save checkpoints every fixed number of steps
                                  save_steps=500,                  # save a checkpoint every 500 steps
                                  save_total_limit=5,              # keep at most 5 checkpoints; older ones are deleted
                                  per_device_train_batch_size=16,
                                  num_train_epochs=3,
                                  logging_steps=30)
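
With these settings it is possible to work out in advance which checkpoints the full run should produce. The sketch assumes a single GPU, no gradient accumulation, and that the oldest checkpoints are the ones deleted once the limit is reached.

```python
import math

num_train_samples = 650_000
batch_size = 16       # per_device_train_batch_size
num_epochs = 3        # num_train_epochs
save_steps = 500
save_total_limit = 5

steps_per_epoch = math.ceil(num_train_samples / batch_size)
total_steps = steps_per_epoch * num_epochs

checkpoint_steps = list(range(save_steps, total_steps + 1, save_steps))
kept_at_end = checkpoint_steps[-save_total_limit:]
last_in_epoch_1 = max(s for s in checkpoint_steps if s <= steps_per_epoch)

print(steps_per_epoch, total_steps)
print(kept_at_end)
print(last_in_epoch_1)
```

These numbers line up with the checkpoint directories used later in this notebook: `checkpoint-40500` is the last one written before the first epoch ends, and `checkpoint-121500` is the last one written before training stops at step 121875.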

In [30]:
import numpy as np
import evaluate

metric = evaluate.load("accuracy")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)

In [31]:
train_dataset = tokenized_datasets["train"]
eval_dataset = tokenized_datasets["test"]
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    compute_metrics=compute_metrics,
)

Detected kernel version 5.4.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.


In [None]:
trainer.train()



Epoch,Training Loss,Validation Loss


IOPub message rate exceeded.
The Jupyter server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--ServerApp.iopub_msg_rate_limit`.

Current values:
ServerApp.iopub_msg_rate_limit=1000.0 (msgs/sec)
ServerApp.rate_limit_window=3.0 (secs)



In [None]:
# Restart training:

In [15]:
model_dir = "models/bert-base-cased-finetune-yelp"
trainer.save_model(model_dir)

In [22]:
trainer.save_state()

In [38]:
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("models/bert-base-cased-finetune-yelp/checkpoint-40500", num_labels=5)

In [39]:
from transformers import TrainingArguments, Trainer
model_dir = "models/bert-base-cased-finetune-yelp"
training_args = TrainingArguments(output_dir=model_dir,
                                  evaluation_strategy="epoch", 
                                  save_strategy="steps",           # save checkpoints every fixed number of steps
                                  save_steps=500,                  # save a checkpoint every 500 steps
                                  save_total_limit=5,              # keep at most 5 checkpoints; older ones are deleted
                                  per_device_train_batch_size=16,
                                  num_train_epochs=3,
                                  logging_steps=30)

In [40]:
import numpy as np
import evaluate

metric = evaluate.load("accuracy")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)

In [43]:
train_dataset = tokenized_datasets["train"]
eval_dataset = tokenized_datasets["test"]
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    compute_metrics=compute_metrics,
)

Detected kernel version 5.4.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.


In [None]:
trainer.train(resume_from_checkpoint="models/bert-base-cased-finetune-yelp/checkpoint-40500")

Epoch,Training Loss,Validation Loss,Accuracy
1,0.7236,0.772914,0.66964


In [None]:
The notebook got out of sync. The saved checkpoints indicate that training had finished, so training is resumed here from the last saved checkpoint:

In [47]:
trainer.train(resume_from_checkpoint="models/bert-base-cased-finetune-yelp/checkpoint-121500")

Epoch,Training Loss,Validation Loss,Accuracy
3,0.4874,0.737763,0.68832


TrainOutput(global_step=121875, training_loss=0.001603093993358123, metrics={'train_runtime': 2068.9667, 'train_samples_per_second': 942.499, 'train_steps_per_second': 58.906, 'total_flos': 5.1308879758535885e+17, 'train_loss': 0.001603093993358123, 'epoch': 3.0})
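
The checkpoint paths above were typed by hand. As a minimal sketch, the newest checkpoint could instead be located by scanning the output directory; `find_latest_checkpoint` is a hypothetical helper written for illustration. The step numbers are compared as integers because a plain string sort would rank `checkpoint-40500` after `checkpoint-121500`.

```python
import os
import re
import tempfile
from typing import Optional

CHECKPOINT_DIR = re.compile(r"^checkpoint-(\d+)$")


def find_latest_checkpoint(output_dir: str) -> Optional[str]:
    """Return the path of the highest-numbered checkpoint-N directory, or None."""
    best_step, best_path = -1, None
    for name in os.listdir(output_dir):
        match = CHECKPOINT_DIR.match(name)
        path = os.path.join(output_dir, name)
        if match and os.path.isdir(path) and int(match.group(1)) > best_step:
            best_step, best_path = int(match.group(1)), path
    return best_path


# Demonstration on a throwaway directory layout.
with tempfile.TemporaryDirectory() as tmp:
    for step in (40500, 121000, 121500):
        os.mkdir(os.path.join(tmp, f"checkpoint-{step}"))
    latest_name = os.path.basename(find_latest_checkpoint(tmp))
    empty_result = find_latest_checkpoint(os.path.join(tmp, "checkpoint-40500"))

print(latest_name)
```

The returned path could then be passed as `resume_from_checkpoint`.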

In [None]:
Conclusion: training on the full training set raised accuracy to 0.68832, up from 0.583 in the 1,000-sample run.