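# Getting Started with Fine-Tuning in Hugging Face Transformers

This example walks through the main steps of fine-tuning a model with Transformers, including:
- Downloading the dataset
- Preprocessing the data
- Configuring training hyperparameters
- Setting up evaluation metrics for training
- A basic introduction to the Trainer
- Hands-on training
- Saving the model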

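## The YelpReviewFull Dataset

**Hugging Face dataset: [YelpReviewFull](https://huggingface.co/datasets/yelp_review_full)**

### Dataset Summary

The Yelp reviews dataset consists of reviews from Yelp. It is extracted from the Yelp Dataset Challenge 2015 data.

### Supported Tasks and Leaderboards
Text classification, sentiment classification: the dataset is mainly used for text classification: given the text, predict the sentiment.

### Languages
The reviews were mainly written in English.

### Dataset Structure

#### Data Instances
A typical data point comprises a text and the corresponding label.

An example from the YelpReviewFull test set looks as follows: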

```json
{
    'label': 0,
    'text': 'I got \'new\' tires from them and within two weeks got a flat. I took my car to a local mechanic to see if i could get the hole patched, but they said the reason I had a flat was because the previous patch had blown - WAIT, WHAT? I just got the tire and never needed to have it patched? This was supposed to be a new tire. \\nI took the tire over to Flynn\'s and they told me that someone punctured my tire, then tried to patch it. So there are resentful tire slashers? I find that very unlikely. After arguing with the guy and telling him that his logic was far fetched he said he\'d give me a new tire \\"this time\\". \\nI will never go back to Flynn\'s b/c of the way this guy treated me and the simple fact that they gave me a used tire!'
}
```

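#### Data Fields

- 'text': The review text is escaped using double quotes ("), and any internal double quote is escaped by two double quotes (""). Newlines are escaped by a backslash followed by an "n" character, i.e. "\n".
- 'label': Corresponds to the score of the review (between 1 and 5). The loaded dataset stores it as a 0-based class index, so `0` means 1 star and `4` means 5 stars.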
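
As a small illustration of these two fields, the sketch below converts a label into a star count and turns the escaped `\n` sequences back into real line breaks. It assumes labels are 0-based class indices (the sample above has `label: 0`); both helper names are hypothetical and not part of any library.

```python
def label_to_stars(label: int) -> int:
    """Map a 0-based class index (0..4) to a star rating (1..5)."""
    if not 0 <= label <= 4:
        raise ValueError(f"label must be in 0..4, got {label}")
    return label + 1


def unescape_newlines(text: str) -> str:
    """Replace the two-character sequence backslash + 'n' with a real newline."""
    return text.replace("\\n", "\n")


print(label_to_stars(0))  # 1
print(unescape_newlines("Good food.\\nSlow service."))
```

Whether to unescape the text before tokenization is a judgment call; the rest of this notebook feeds the text to the tokenizer unchanged.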

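#### Data Splits

The Yelp reviews full star dataset is constructed by randomly taking 130,000 training samples and 10,000 testing samples for each review star from 1 to 5. In total there are 650,000 training samples and 50,000 testing samples.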
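
The totals follow directly from the per-class counts. The short check below reproduces them and also derives the accuracy of uniform random guessing, which is a useful baseline when reading the accuracy numbers later on. The constant names are illustrative only.

```python
NUM_CLASSES = 5            # star ratings 1 through 5
TRAIN_PER_CLASS = 130_000  # training samples drawn per star rating
TEST_PER_CLASS = 10_000    # test samples drawn per star rating

train_total = NUM_CLASSES * TRAIN_PER_CLASS
test_total = NUM_CLASSES * TEST_PER_CLASS

# With perfectly balanced classes, guessing uniformly at random is right 1 time in 5.
chance_accuracy = 1 / NUM_CLASSES

print(train_total, test_total, chance_accuracy)  # 650000 50000 0.2
```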

## Downloading the Dataset

In [12]:
import torch

model_path = "/root/dataDisk/hf/hub/models/"
datasets_path = "/root/dataDisk/hf/hub/datasets/"

# Check whether CUDA is accessible
print("CUDA is available:", torch.cuda.is_available())

if torch.cuda.is_available():
    print("GPU device name:", torch.cuda.get_device_name(0))
    print("Number of GPUs available:", torch.cuda.device_count())
else:
    print("No GPU available, using CPU")

CUDA is available: True
GPU device name: NVIDIA GeForce RTX 4090
Number of GPUs available: 1


In [13]:
from datasets import load_dataset

dataset = load_dataset(datasets_path+"yelp_review_full")

In [14]:
dataset

DatasetDict({
    train: Dataset({
        features: ['label', 'text'],
        num_rows: 650000
    })
    test: Dataset({
        features: ['label', 'text'],
        num_rows: 50000
    })
})

In [16]:
dataset["train"][112]

{'label': 1,
 'text': "I'm not a huge fan of this location. I think that it was oddly built- the small, alley-like front makes it difficult to get past people on your way to/from the bathroom or tables when it's busy. And there's hardly any tables to sit at. Furthermore, people tend to clog the front on their way in, which makes things particularly difficult (especially in the winter weather). The staff were pretty impersonal to be, but maybe that was due to the high traffic of the place and the time I was there. And the coffee that I had was cold- I'm sure it was probably the bottom of the batch. I'd probably only walk in here again if someone else suggested it before or after a movie or while we were shopping in the area."}

In [17]:
import random
import pandas as pd
import datasets
from IPython.display import display, HTML

In [18]:
def show_random_elements(dataset, num_examples=20):
    assert num_examples <= len(dataset), "Can't pick more elements than there are in the dataset."
    # Draw unique random row indices without replacement.
    picks = random.sample(range(len(dataset)), num_examples)
    
    df = pd.DataFrame(dataset[picks])
    for column, typ in dataset.features.items():
        if isinstance(typ, datasets.ClassLabel):
            df[column] = df[column].transform(lambda i: typ.names[i])
    display(HTML(df.to_html()))

In [19]:
show_random_elements(dataset["train"])

Unnamed: 0,label,text
0,3 stars,"If you go to Cave Creek you must go to the Buffalo Chip, Harold's , hideaway etc. just great local restaurants in a great town north of Phoenix/Scottsdale. I like the Chip more for drinking dancing and rodeo than the food. But it is okay for a little grub But it is fun. So try it"
1,3 stars,"Chuy's was pretty good. Seems like all of their locations are pretty similar. I think they'd be best known for their cheap margaritas... $2 pints, $4 small pitchers. They aren't too strong and come from a mediocre mix, but if you're looking for something sweet and cheap its a good bet.\n\nFree serve-yourself chips and salsa are pretty good. \n\nAs for food... pretty decent prices and okay to good food. I wouldn't say its authentic Mexican, actually I think the menu is a bit confused. But the food is good overall. It's a place we'll go to every month or so."
2,3 stars,"I ended up eating at Taggia while staying at Firesky so it was a choice of convenience. I've had the food from here several times using room service and it's never anything to complain about. It was the same story the day I had lunch here. I had an organic greens salad and shared the margherita and goat cheese pizzas with my fellow lunchers. All of the food was good - the goat cheese pizza in particular with its thin, crispy crust.\n\nUnfortunately the day we ate here our service was MIA. We were told we could seat ourselves so we did. After about 10 minutes someone came by to take our drink order and maybe 10 minutes later our waters arrived. Well 2 out of 3 of them did anyway. Then we ordered two salads and two pizzas to share. One pizza came first. WTH? Where were the salads? Or the other pizza? The salads showed up a few minutes later and then our server realized that she had forgotten our second pizza. No biggie since we had salads and one pizza to eat. But the service was lackluster with a L. Like Andrea R says, I wouldn't go out of my way to eat here, but when in the area it's a good option to have."
3,2 star,"I recently had a work luncheon at Ricardo's, I had been before years ago and it was extremely unmemorable. This visit would be more memorable but for the wrong reasons. \n\nWhen given the choice, I prefer to order off the menu than choose a buffet. But the whole group went to the buffet and I didn't want to be the oddball. I had two carne asada tacos, cheese enchilada and chips & salsa. The enchilada was bland the only hint of flavor was the acidity from the tomatoes. The salsa, too, was bland and watery. The chips were pretty generic. The first taco was ok, a bit bland, but tender. The second was filled with grizzly meat. It really turned my stomach. Fortunately, the service was friendly and they were able to accomodate our large group."
4,4 stars,"We had a great time at this resort over the long weekend. The staff was super friendly, especially Adam, David and Cassie. Great job!!! And our suite was perfect to accommodate three women with lots of bags, make-up and shoes. The Hole in the Wall restaurant had a really good breakfast, friendly staff and an outdoors patio. Not so for the Rico Restaurant. They were a bit rude, overwhelmed and obviously didn't want our business. We also floated down the Lazy River, it was definitely Lazy...pretty slow but perfect temp. All in all, I'll be back."
5,1 star,"Im an owner with no kids, this place is not for my husband and I.. The element here is all about families and cooking in and playing in the pool from the moment it opens.\n\nThe restaurant bar is a bit of a joke, and the pressure to buy more points makes a relaxing vacation more stressful. We were an original owner and saw most of it built.\n\nWe noticed that they no longer offer a shuttle which is a mistake for those that want to go to the strip and not have to worry about driving. But after this weekend I see that they don't need to offer the shuttle because more than half the people there don't plan on leaving the facility at all.\n\nThe guests that we ran into all seemed to be there on a free vacation offer. They were tatted up and pushing baby strollers... and screaming to each other WAIT TIL YOU SEE THE POOL....\n\nMy hubby and myself both looked at each other and said OUT LOUD, we don't think we will be coming back here again ever to this location.\n\nWe came home and looked into selling it all together, but then thought maybe we would try another location that Diamond Resorts has to offer before we do so..\n\nSo bottom line, if you have kids and love the pool and slides and pool some more.. this is for you.. If your looking for a weekend with the hubby or friends in Vegas to relax and to enjoy what Vegas is all about... this resort is not for you.."
6,3 stars,"Booked a room here through Priceline for the Tuesday before Thanksgiving. Actually booked it on the drive in from Las Vegas through my cell phone, which was pretty sweet. Paid $25 + tax, so you can't beat the price. We had a hard time finding it as Google Maps was wrong about it's location, but you can't blame the hotel for that.\n\nWhat I can blame the hotel for is not giving me a king size bed. Priceline had booked me with 2 doubles, and in my experience I am always able to switch unless they're sold out. The front desk clerk told me they were indeed not sold out, but it was their policy not to let Priceline users switch rooms. So much for considering staying there the rest of Thanksgiving week.\n\nI was going to let it go, but then at 9:30am the maid knocked on our door and woke me up 2 hours earlier than I had planned (we got to sleep at 5am, give me a break). I ended up coming out of my room and saw that my do not disturb sign was on, so she must have chosen to ruin my day for fun. Tried to fall back asleep but was then kept up by the sounds of what looked to be a loud garbage truck parked right outside of our room.\n\nI give up on sleeping. At least they have solid free Internet so I can Yelp this hotel. Courtyard, you're lucky you're getting 3 stars from me."
7,2 star,"Been to 4 Cirques and this is the least favorite. Sets and costumes are absolutely amazing but the acts were very unimpressive compared to the older ones we've seen. \nThe only exception was the opening act with the two twin men on ropes that swung out into the audience. They weren't in the book at the shop so I think they added them in later to spice up the program. Breathtaking!\n\nPros- Art direction, sets, lights\n\nCons- Acts seen before and ANNOYING clowns"
8,2 star,"KOOLAID KID reminded me of home...nice touch..i know I know..this is suppose to be about the chicken & waffles, but I must say quenching my thirst is very important to me..so back to the food..it was just that chicken(no flavor) & waffles(nothing special)..mac & cheese was very nice...and the new building was very very nice..okay that's all"
9,1 star,Just called this location and I live 1.8 miles away. I asked them to deliver and they informed me that they would not deliver to my house because it was a couple hundred yards out of the map plan. They asked me to call the power and southern store. This store advised me that they could not deliver because jimmy johns has a two mile radius they can deliver to. Called this store back and they once again decided to tell me even though I was in the two mile radius they did not want to deliver to me and my only option was for pickup. I will never eat at this location. I know the owners at Firehouse Subs and they go out of the way and this location is just lazy. Not getting my money jimmy johns no matter how fast you are. Laziness is worse


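## Preprocessing the Data

Once the dataset is downloaded locally, use a Tokenizer to process the text. Inputs of varying length can be handled with padding and truncation strategies.

The `map` method of Datasets applies a preprocessing function over the entire dataset in one pass.
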
### google-bert/bert-base-cased
#### BERT base model (cased)
Pretrained model on English language using a masked language modeling (MLM) objective. It was introduced in this paper and first released in this repository. This model is case-sensitive: it makes a difference between english and English.

Disclaimer: The team releasing BERT did not write a model card for this model so this model card has been written by the Hugging Face team.

#### Model description
BERT is a transformers model pretrained on a large corpus of English data in a self-supervised fashion. This means it was pretrained on the raw texts only, with no humans labelling them in any way (which is why it can use lots of publicly available data) with an automatic process to generate inputs and labels from those texts. More precisely, it was pretrained with two objectives:

- Masked language modeling (MLM): taking a sentence, the model randomly masks 15% of the words in the input, then runs the entire masked sentence through the model and has to predict the masked words. This is different from traditional recurrent neural networks (RNNs) that usually see the words one after the other, or from autoregressive models like GPT which internally mask the future tokens. It allows the model to learn a bidirectional representation of the sentence.
- Next sentence prediction (NSP): the model concatenates two masked sentences as inputs during pretraining. Sometimes they correspond to sentences that were next to each other in the original text, sometimes not. The model then has to predict if the two sentences were following each other or not.

This way, the model learns an inner representation of the English language that can then be used to extract features useful for downstream tasks: if you have a dataset of labeled sentences for instance, you can train a standard classifier using the features produced by the BERT model as inputs.
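
To make the MLM objective more concrete, here is a deliberately simplified sketch that hides roughly 15% of the tokens and records what the model would have to predict. It is not BERT's actual preprocessing: the real procedure works on WordPiece tokens and sometimes substitutes a random token or leaves the chosen token unchanged instead of always inserting `[MASK]`. The function name and the whitespace tokenization are assumptions made for illustration.

```python
import random

MASK_TOKEN = "[MASK]"


def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Return (masked_tokens, labels); labels are None where nothing is predicted."""
    if not tokens:
        return [], []
    rng = random.Random(seed)
    # Mask a fixed share of positions, and at least one so the example is never empty.
    n_mask = max(1, round(len(tokens) * mask_prob))
    positions = set(rng.sample(range(len(tokens)), n_mask))
    masked = [MASK_TOKEN if i in positions else tok for i, tok in enumerate(tokens)]
    labels = [tok if i in positions else None for i, tok in enumerate(tokens)]
    return masked, labels


words = "the staff were friendly and the coffee was hot when it arrived".split()
masked, labels = mask_tokens(words)
print(masked)
print(labels)
```

The model sees `masked` as input and is trained to recover the original tokens at the positions where `labels` is not `None`.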

#### Intended uses & limitations
You can use the raw model for either masked language modeling or next sentence prediction, but it's mostly intended to be fine-tuned on a downstream task. See the model hub to look for fine-tuned versions on a task that interests you.

Note that this model is primarily aimed at being fine-tuned on tasks that use the whole sentence (potentially masked) to make decisions, such as sequence classification, token classification or question answering. For tasks such as text generation you should look at models like GPT2.

The following processes the whole dataset using the pad-to-maximum-length strategy:

In [20]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model_path+"bert-base-cased")


def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)


tokenized_datasets = dataset.map(tokenize_function, batched=True)

In [21]:
show_random_elements(tokenized_datasets["train"], num_examples=1)

Unnamed: 0,label,text,input_ids,token_type_ids,attention_mask
0,2 star,"Just an update, just adding another star for the apology they gave us.....","[101, 2066, 1126, 11984, 117, 1198, 5321, 1330, 2851, 1111, 1103, 13382, 1152, 1522, 1366, 119, 119, 119, 119, 119, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...]","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...]","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...]"
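
The output above shows `input_ids` followed by a run of zeros and an `attention_mask` that switches from 1 to 0 at the same position. A minimal plain-Python sketch of that behaviour is given below. It only imitates the effect of `padding="max_length"` and `truncation=True` on a list of ids; the real tokenizer additionally keeps the closing `[SEP]` token when it truncates, and the function name here is hypothetical.

```python
PAD_ID = 0  # id used for padding in the output above


def pad_or_truncate(token_ids, max_length, pad_id=PAD_ID):
    """Cut token_ids to max_length, then right-pad with pad_id up to max_length."""
    ids = list(token_ids[:max_length])
    n_pad = max_length - len(ids)
    return {
        "input_ids": ids + [pad_id] * n_pad,
        # 1 marks a real token, 0 marks padding the model should ignore.
        "attention_mask": [1] * len(ids) + [0] * n_pad,
    }


print(pad_or_truncate([101, 2066, 1126, 102], max_length=6))
# {'input_ids': [101, 2066, 1126, 102, 0, 0], 'attention_mask': [1, 1, 1, 1, 0, 0]}
```

Padding every review to the maximum length keeps all tensors the same shape, at the cost of spending compute on padding for short reviews.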


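### Sampling the Data

Use 1,000 samples to demonstrate small-scale training on BERT (based on the PyTorch Trainer).

The `shuffle()` function randomly rearranges the values of a column. If you want more control over the algorithm used to shuffle the dataset, you can specify the `generator` parameter in this function to use a different `numpy.random.Generator`.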

In [22]:
small_train_dataset = tokenized_datasets["train"].shuffle(seed=42).select(range(1000))
small_eval_dataset = tokenized_datasets["test"].shuffle(seed=42).select(range(1000))
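
The `shuffle(seed=42).select(range(1000))` pattern is a seeded way of drawing a fixed-size random subset. The sketch below shows the same idea with the standard library only. It is an analogy, not the Datasets implementation, so the indices it picks will differ from the ones selected above; the function name is hypothetical.

```python
import random


def sample_indices(num_rows, k, seed):
    """Shuffle row indices with a fixed seed and keep the first k of them."""
    rng = random.Random(seed)
    indices = list(range(num_rows))
    rng.shuffle(indices)
    return indices[:k]


first = sample_indices(num_rows=650_000, k=1000, seed=42)
second = sample_indices(num_rows=650_000, k=1000, seed=42)
print(len(first), len(set(first)), first == second)  # 1000 1000 True
```

Fixing the seed makes the subset reproducible across runs, which keeps the later training and evaluation numbers comparable.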

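## Fine-Tuning Configuration

### Loading the BERT Model

The warning shown after the next cell tells us that some weights (`classifier.weight` and `classifier.bias`) were not loaded from the checkpoint and are randomly initialized. This is entirely normal when fine-tuning: the head used for pretraining is dropped and replaced with a new classification head for which no pretrained weights exist. The library therefore warns that the model should be fine-tuned before it is used for inference, which is exactly what we are about to do.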

In [23]:
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(model_path+"bert-base-cased", num_labels=5)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at /root/dataDisk/hf/hub/models/bert-base-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
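
To give a sense of how small the newly initialized part is, the sketch below computes the shapes of `classifier.weight` and `classifier.bias`. It assumes the classification head is a single linear layer from the encoder's hidden size to `num_labels`, and that the hidden size of `bert-base-cased` is 768; neither value is printed in this notebook.

```python
HIDDEN_SIZE = 768  # assumed hidden size of bert-base-cased
NUM_LABELS = 5     # one output per star rating, as passed to from_pretrained

weight_shape = (NUM_LABELS, HIDDEN_SIZE)  # classifier.weight
bias_shape = (NUM_LABELS,)                # classifier.bias
new_params = NUM_LABELS * HIDDEN_SIZE + NUM_LABELS

print(weight_shape, bias_shape, new_params)  # (5, 768) (5,) 3845
```

Only these few thousand parameters start from random values; everything else is loaded from the pretrained checkpoint and then updated during fine-tuning.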


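### Training Hyperparameters (TrainingArguments)

Full list of configuration parameters and default values: https://huggingface.co/docs/transformers/v4.36.1/en/main_classes/trainer#transformers.TrainingArguments

Source code definition: https://github.com/huggingface/transformers/blob/v4.36.1/src/transformers/training_args.py#L161

**The most important setting: the path where model weights are saved (`output_dir`)**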

In [24]:
from transformers import TrainingArguments

model_dir = "models/bert-base-cased-finetune-yelp"

# logging_steps defaults to 500; given our training data size and step count, set it to 100
training_args = TrainingArguments(output_dir=model_dir,
                                  per_device_train_batch_size=16,
                                  num_train_epochs=5,
                                  logging_steps=100)
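
The reason for lowering `logging_steps` becomes clear once the number of optimizer steps is worked out. The sketch below does that arithmetic under a few assumptions: the 1,000-sample training subset selected earlier, a single GPU, no gradient accumulation, and the last partial batch being kept. The helper name is hypothetical.

```python
import math


def logging_points(num_samples, batch_size, num_epochs, logging_steps):
    """Return (steps_per_epoch, total_steps, steps at which the loss is logged)."""
    steps_per_epoch = math.ceil(num_samples / batch_size)
    total_steps = steps_per_epoch * num_epochs
    logged_at = list(range(logging_steps, total_steps + 1, logging_steps))
    return steps_per_epoch, total_steps, logged_at


print(logging_points(1000, 16, 5, logging_steps=100))  # (63, 315, [100, 200, 300])
print(logging_points(1000, 16, 5, logging_steps=500))  # (63, 315, [])
```

With the default of 500 the run would finish before the first logging point, so no training loss would ever be reported.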

In [25]:
# Full hyperparameter configuration
print(training_args)

TrainingArguments(
_n_gpu=1,
adafactor=False,
adam_beta1=0.9,
adam_beta2=0.999,
adam_epsilon=1e-08,
auto_find_batch_size=False,
bf16=False,
bf16_full_eval=False,
data_seed=None,
dataloader_drop_last=False,
dataloader_num_workers=0,
dataloader_persistent_workers=False,
dataloader_pin_memory=True,
ddp_backend=None,
ddp_broadcast_buffers=None,
ddp_bucket_cap_mb=None,
ddp_find_unused_parameters=None,
ddp_timeout=1800,
debug=[],
deepspeed=None,
disable_tqdm=False,
dispatch_batches=None,
do_eval=False,
do_predict=False,
do_train=False,
eval_accumulation_steps=None,
eval_delay=0,
eval_steps=None,
evaluation_strategy=no,
fp16=False,
fp16_backend=auto,
fp16_full_eval=False,
fp16_opt_level=O1,
fsdp=[],
fsdp_config={'min_num_params': 0, 'xla': False, 'xla_fsdp_grad_ckpt': False},
fsdp_min_num_params=0,
fsdp_transformer_layer_cls_to_wrap=None,
full_determinism=False,
gradient_accumulation_steps=1,
gradient_checkpointing=False,
gradient_checkpointing_kwargs=None,
greater_is_better=None,
group_by_le

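### Metric Evaluation During Training (Evaluate)

The **[Hugging Face Evaluate library](https://huggingface.co/docs/evaluate/index)** gives access to evaluation methods for dozens of domains (natural language processing, computer vision, reinforcement learning, and more) with a single line of code. **Full list of currently supported metrics: https://huggingface.co/evaluate-metric**

The Trainer does not automatically evaluate model performance during training. We therefore need to pass the Trainer a function that computes and reports metrics.

The Evaluate library provides a simple accuracy function that you can load with `evaluate.load`.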

In [2]:
pip list | grep evaluate

evaluate                  0.4.1
Note: you may need to restart the kernel to use updated packages.


In [11]:
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification, TrainingArguments, Trainer
import numpy as np
import evaluate

# Prepare and tokenize dataset
dataset = load_dataset(datasets_path+"yelp_review_full")
tokenizer = AutoTokenizer.from_pretrained(model_path+"bert-base-cased")

def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)

tokenized_datasets = dataset.map(tokenize_function, batched=True)

small_train_dataset = tokenized_datasets["train"].shuffle(seed=42).select(range(200))
small_eval_dataset = tokenized_datasets["test"].shuffle(seed=42).select(range(200))

# Setup evaluation 
metric = evaluate.load("metrics/accuracy/"+"accuracy.py")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)

# Load pretrained model and evaluate model after each epoch
model = AutoModelForSequenceClassification.from_pretrained(model_path+"bert-base-cased", num_labels=5)
training_args = TrainingArguments(output_dir="test_trainer", evaluation_strategy="epoch")

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=small_train_dataset,
    eval_dataset=small_eval_dataset,
    compute_metrics=compute_metrics,
)

trainer.train()

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at /root/dataDisk/hf/hub/models/bert-base-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Detected kernel version 5.4.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.


Epoch,Training Loss,Validation Loss,Accuracy
1,No log,1.591947,0.255
2,No log,1.477073,0.36
3,No log,1.407272,0.45


TrainOutput(global_step=75, training_loss=1.507139383951823, metrics={'train_runtime': 15.1283, 'train_samples_per_second': 39.661, 'train_steps_per_second': 4.958, 'total_flos': 157870885478400.0, 'train_loss': 1.507139383951823, 'epoch': 3.0})

In [26]:
import numpy as np
import evaluate

metric = evaluate.load("metrics/accuracy/"+"accuracy.py")


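Next, call the `compute` function to calculate the accuracy of the predictions.

Before passing the predictions to `compute`, we need to convert the logits into predictions (**all Transformers models return logits**).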

In [27]:
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)
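
The conversion from logits to predictions is just a row-wise argmax, and accuracy is the share of predictions that match the labels. The plain-Python stand-in below shows both steps on made-up logits without needing NumPy or the Evaluate library. The function names and the numbers are illustrative assumptions.

```python
def argmax(row):
    """Index of the largest value in a list of scores."""
    return max(range(len(row)), key=row.__getitem__)


def accuracy_from_logits(logits, labels):
    """Turn per-class scores into class predictions and score them against labels."""
    predictions = [argmax(row) for row in logits]
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return {"accuracy": correct / len(labels)}


logits = [
    [0.1, 2.3, -1.0, 0.0, 0.5],  # predicted class 1
    [1.5, 0.2, 0.3, 0.1, 0.0],   # predicted class 0
    [0.0, 0.1, 0.2, 0.3, 3.0],   # predicted class 4
    [0.9, 0.8, 0.7, 0.6, 0.5],   # predicted class 0
]
labels = [1, 0, 3, 0]
print(accuracy_from_logits(logits, labels))  # {'accuracy': 0.75}
```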

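#### Monitoring Metrics During Training

To monitor how the evaluation metrics change during training, we can set the `evaluation_strategy` parameter in `TrainingArguments` so that the metrics are reported at the end of each epoch.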

In [28]:
from transformers import TrainingArguments, Trainer

training_args = TrainingArguments(output_dir=model_dir,
                                  evaluation_strategy="epoch", 
                                  per_device_train_batch_size=16,
                                  num_train_epochs=3,
                                  logging_steps=30)

## Starting Training

### Instantiating the Trainer

The `kernel version` warning does not currently affect running this example code.

In [29]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=small_train_dataset,
    eval_dataset=small_eval_dataset,
    compute_metrics=compute_metrics,
)

Detected kernel version 5.4.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.


## Checking GPU Usage with nvidia-smi

To view GPU usage in real time, poll with the `watch` command: `watch -n 1 nvidia-smi`:

```shell
Every 1.0s: nvidia-smi                                                   Wed Dec 20 14:37:41 2023

Wed Dec 20 14:37:41 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03             Driver Version: 535.129.03   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Tesla T4                       Off | 00000000:00:0D.0 Off |                    0 |
| N/A   64C    P0              69W /  70W |   6665MiB / 15360MiB |     98%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A     18395      C   /root/miniconda3/bin/python                6660MiB |
+---------------------------------------------------------------------------------------+
```
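
If the same numbers are needed inside a script rather than on screen, `nvidia-smi` can also print selected fields as CSV. The sketch below is one possible way to read memory use and utilization from Python. The helper names are hypothetical, and the parser assumes every field is a plain integer, which may not hold on every GPU or driver.

```python
import subprocess

QUERY_FIELDS = "memory.used,memory.total,utilization.gpu"


def parse_gpu_line(line):
    """Parse one CSV line such as '6665, 15360, 98' into a dict of integers."""
    used, total, util = (int(part.strip()) for part in line.split(","))
    return {"memory_used_mib": used, "memory_total_mib": total, "gpu_util_percent": util}


def query_gpus():
    """Run nvidia-smi once and return one dict per GPU (requires an NVIDIA driver)."""
    result = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY_FIELDS}", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return [parse_gpu_line(line) for line in result.stdout.splitlines() if line.strip()]


# Values taken from the table above, parsed without calling nvidia-smi:
print(parse_gpu_line("6665, 15360, 98"))
```

Calling `query_gpus()` in a loop with a short sleep gives roughly the same view as `watch -n 1 nvidia-smi`, in a form that can be logged.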

In [31]:
trainer.train()

Epoch,Training Loss,Validation Loss,Accuracy
1,1.6175,1.518585,0.329
2,1.255,1.125827,0.497
3,0.9414,1.004606,0.584


TrainOutput(global_step=189, training_loss=1.2967164125392046, metrics={'train_runtime': 60.5968, 'train_samples_per_second': 49.508, 'train_steps_per_second': 3.119, 'total_flos': 789354427392000.0, 'train_loss': 1.2967164125392046, 'epoch': 3.0})
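
The throughput figures in `TrainOutput` can be cross-checked from the other values it reports. The sketch below assumes the 1,000-sample training subset and 3 epochs used in this run, and takes the runtime and step count from the output above. It shows what the numbers are consistent with, not necessarily how the library computes them internally.

```python
num_samples = 1000        # size of small_train_dataset
num_epochs = 3            # num_train_epochs in TrainingArguments
train_runtime = 60.5968   # seconds, from TrainOutput
global_step = 189         # optimizer steps, from TrainOutput

samples_per_second = num_samples * num_epochs / train_runtime
steps_per_second = global_step / train_runtime

print(round(samples_per_second, 3), round(steps_per_second, 3))  # 49.508 3.119
```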

In [32]:
small_test_dataset = tokenized_datasets["test"].shuffle(seed=64).select(range(100))

In [33]:
trainer.evaluate(small_test_dataset)

{'eval_loss': 1.058894395828247,
 'eval_accuracy': 0.51,
 'eval_runtime': 0.5723,
 'eval_samples_per_second': 174.733,
 'eval_steps_per_second': 22.715,
 'epoch': 3.0}

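### Saving the Model and Training State

- Use the `trainer.save_model` method to save the model; it can later be reloaded with the `from_pretrained()` method
- Use the `trainer.save_state` method to save the training state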

In [34]:
trainer.save_model(model_dir)

In [35]:
trainer.save_state()
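
A saved model can be loaded back with `from_pretrained()` and used for prediction. The sketch below is one way to do that, under stated assumptions: the Trainer above was created without a tokenizer, so the tokenizer is loaded from the original checkpoint path rather than from the saved directory; labels are 0-based, so 1 is added to get a star rating; and both function names are hypothetical.

```python
def load_finetuned(model_dir, tokenizer_path):
    """Load the fine-tuned classifier and the tokenizer of the base checkpoint."""
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(tokenizer_path)
    model = AutoModelForSequenceClassification.from_pretrained(model_dir)
    model.eval()  # disable dropout for inference
    return tokenizer, model


def predict_stars(text, tokenizer, model):
    """Return the predicted star rating (1..5) for a single review."""
    import torch

    inputs = tokenizer(text, truncation=True, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    return int(logits.argmax(dim=-1).item()) + 1


# Example usage with the paths from this notebook (not executed here):
# tokenizer, model = load_finetuned("models/bert-base-cased-finetune-yelp",
#                                   "/root/dataDisk/hf/hub/models/bert-base-cased")
# print(predict_stars("The coffee was cold and the staff ignored us.", tokenizer, model))
```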

In [23]:
# trainer.model.save_pretrained("./")

## Homework: Train on the full YelpReviewFull dataset and see how high the accuracy can go

In [1]:
pwd

'/root/dataDisk/code-workspace/LLM-quickstart/transformers'

In [None]:
### per_device_train_batch_size =16
model = AutoModelForSequenceClassification.from_pretrained(model_path+"bert-base-cased", num_labels=5)
training_args = TrainingArguments(output_dir="test_trainer", 
                                  evaluation_strategy="epoch",
                                  per_device_train_batch_size=16,
                                  num_train_epochs=3,
                                  logging_steps=30,
                                  save_total_limit=3)

Thu Jun 13 03:03:29 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 4090        Off | 00000000:12:00.0 Off |                  Off |
| 30%   51C    P2             300W / 450W |  12205MiB / 24564MiB |     75%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
+---------------------------------------------------------------------------------------+

 [121875/121875 02:09, Epoch 3/3]
Epoch	Training Loss	Validation Loss	Accuracy
3	0.518700	0.754804	0.688700
 [1250/1250 00:50]



In [None]:
### per_device_train_batch_size =24
Thu Jun 13 03:18:56 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 4090        Off | 00000000:12:00.0 Off |                  Off |
| 74%   57C    P2             391W / 450W |  17347MiB / 24564MiB |    100%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
+---------------------------------------------------------------------------------------+

In [None]:
Thu Jun 13 03:21:44 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 4090        Off | 00000000:12:00.0 Off |                  Off |
| 72%   59C    P2             388W / 450W |  22547MiB / 24564MiB |    100%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
+---------------------------------------------------------------------------------------+

In [1]:
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification, TrainingArguments, Trainer
import numpy as np
import evaluate

model_path = "/root/dataDisk/hf/hub/models/"
datasets_path = "/root/dataDisk/hf/hub/datasets/"

# Prepare and tokenize dataset
dataset = load_dataset(datasets_path+"yelp_review_full")
tokenizer = AutoTokenizer.from_pretrained(model_path+"bert-base-cased")

def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)

tokenized_datasets = dataset.map(tokenize_function, batched=True)

all_train_dataset = tokenized_datasets["train"].shuffle(seed=42).select(range(650000))
all_eval_dataset = tokenized_datasets["test"].shuffle(seed=42).select(range(10000))

# Setup evaluation 
metric = evaluate.load("metrics/accuracy/"+"accuracy.py")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)

# Load pretrained model and evaluate model after each epoch
model = AutoModelForSequenceClassification.from_pretrained(model_path+"bert-base-cased", num_labels=5)
training_args = TrainingArguments(output_dir="test_trainer_24", 
                                  evaluation_strategy="epoch",
                                  per_device_train_batch_size=32,
                                  num_train_epochs=3,
                                  logging_steps=30,
                                  save_total_limit=3)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=all_train_dataset,
    eval_dataset=all_eval_dataset,
    compute_metrics=compute_metrics,
)

trainer.train()

trainer.evaluate(all_eval_dataset)

trainer.save_model("test_trainer_24")

trainer.save_state()

  from .autonotebook import tqdm as notebook_tqdm
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at /root/dataDisk/hf/hub/models/bert-base-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Detected kernel version 5.4.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.


Epoch,Training Loss,Validation Loss,Accuracy
1,0.7697,0.729502,0.6794
2,0.6355,0.703414,0.6933
3,0.5605,0.738425,0.6933


Checkpoint destination directory test_trainer_24/checkpoint-500 already exists and is non-empty.Saving will proceed but saved results may be invalid.
Checkpoint destination directory test_trainer_24/checkpoint-1000 already exists and is non-empty.Saving will proceed but saved results may be invalid.
IOPub message rate exceeded.
The Jupyter server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--ServerApp.iopub_msg_rate_limit`.

Current values:
ServerApp.iopub_msg_rate_limit=1000.0 (msgs/sec)
ServerApp.rate_limit_window=3.0 (secs)



In [None]:
Thu Jun 13 10:41:21 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 4090        Off | 00000000:12:00.0 Off |                  Off |
| 30%   27C    P8              13W / 450W |  22547MiB / 24564MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
+---------------------------------------------------------------------------------------+

Thu Jun 13 12:09:26 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 4090        Off | 00000000:12:00.0 Off |                  Off |
| 32%   56C    P2             291W / 450W |  17209MiB / 24564MiB |     82%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
+---------------------------------------------------------------------------------------+

In [3]:
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification, TrainingArguments, Trainer
import numpy as np
import evaluate

# Paths to the local model and dataset directories
model_path = "/root/dataDisk/hf/hub/models/"
datasets_path = "/root/dataDisk/hf/hub/datasets/"
result_output_dir = "test_trainer_32_liner"

# Load and tokenize the dataset
# Load the Yelp review dataset
dataset = load_dataset(datasets_path + "yelp_review_full")
# Load the BERT tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_path + "bert-base-cased")

# Tokenization function
def tokenize_function(examples):
    # Tokenize each batch of texts, padding and truncating to the maximum length
    return tokenizer(examples["text"], padding="max_length", truncation=True)

# Tokenize the whole dataset
tokenized_datasets = dataset.map(tokenize_function, batched=True)

# Prepare the training and validation sets
all_train_dataset = tokenized_datasets["train"].shuffle(seed=42).select(range(650000))  # shuffle and keep all 650,000 training samples
all_eval_dataset = tokenized_datasets["test"].shuffle(seed=42).select(range(10000))  # shuffle the test split and take 10,000 samples for validation

# Evaluation metric
metric = evaluate.load("metrics/accuracy/" + "accuracy.py")  # load the accuracy metric from a local script

def compute_metrics(eval_pred):
    logits, labels = eval_pred  # unpack model outputs and ground-truth labels
    predictions = np.argmax(logits, axis=-1)  # pick the class with the highest logit
    return metric.compute(predictions=predictions, references=labels)  # compute accuracy

# Load the pretrained model
model = AutoModelForSequenceClassification.from_pretrained(model_path + "bert-base-cased", num_labels=5)  # BERT with a 5-class classification head

# Training arguments
training_args = TrainingArguments(
    output_dir=result_output_dir,  # output directory for checkpoints and the final model
    evaluation_strategy="epoch",  # evaluate at the end of every epoch
    save_strategy="epoch",  # save a checkpoint at the end of every epoch
    learning_rate=3e-5,  # peak learning rate
    per_device_train_batch_size=32,  # training batch size per device
    per_device_eval_batch_size=32,  # evaluation batch size per device
    num_train_epochs=5,  # number of training epochs
    logging_steps=30,  # log every 30 steps
    save_total_limit=3,  # keep at most 3 checkpoints
    load_best_model_at_end=True,  # reload the best checkpoint when training ends
    gradient_accumulation_steps=2,  # accumulate gradients over 2 steps (effective batch size 64 per device)
    fp16=True,  # mixed-precision training to reduce GPU memory usage
    lr_scheduler_type="linear",  # linear learning-rate schedule
    warmup_steps=500  # warmup steps during which the learning rate ramps up
)

# Create the Trainer
trainer = Trainer(
    model=model,  # model to train
    args=training_args,  # training arguments
    train_dataset=all_train_dataset,  # training set
    eval_dataset=all_eval_dataset,  # validation set
    compute_metrics=compute_metrics,  # metric function
)

# Train the model
trainer.train()

# Evaluate the model
trainer.evaluate(all_eval_dataset)

# Save the model and the trainer state
trainer.save_model(result_output_dir)
trainer.save_state()


Map: 100%|██████████| 50000/50000 [00:12<00:00, 3963.62 examples/s]
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at /root/dataDisk/hf/hub/models/bert-base-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Detected kernel version 5.4.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.


Epoch,Training Loss,Validation Loss,Accuracy
0,0.7073,0.71293,0.6833
2,0.5613,0.743686,0.6897
4,0.4021,0.889443,0.6793
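
The `lr_scheduler_type="linear"` and `warmup_steps=500` settings in the cell above combine a linear warmup with a linear decay of the learning rate. The sketch below only approximates that shape in plain Python and is not the library's own implementation. The function name `linear_lr` and the `total_steps` value are assumptions made for illustration.

```python
def linear_lr(step, base_lr=3e-5, warmup_steps=500, total_steps=50_000):
    """Approximate learning rate at `step`: linear warmup, then linear decay.

    `total_steps` is a hypothetical value picked for illustration only.
    """
    if step < warmup_steps:
        # Warmup: ramp linearly from 0 up to base_lr
        return base_lr * step / max(1, warmup_steps)
    # Decay: ramp linearly from base_lr down to 0 at total_steps
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


# Print the rate at a few points along the schedule
for step in (0, 250, 500, 25_250, 50_000):
    print(step, linear_lr(step))
```

The rate peaks at `base_lr` exactly when warmup ends and reaches zero on the last step, so later epochs train with much smaller updates than early ones.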




In [None]:
from datasets import load_dataset  # function for loading datasets
from transformers import AutoTokenizer, AutoModelForSequenceClassification, TrainingArguments, Trainer  # required transformers classes
import numpy as np  # array operations
import evaluate  # evaluation metrics

# Paths to the local model and dataset directories
model_path = "/root/dataDisk/hf/hub/models/"
datasets_path = "/root/dataDisk/hf/hub/datasets/"
result_output_dir = "test_trainer_32_liner_1"

# Load and tokenize the dataset
dataset = load_dataset(datasets_path + "yelp_review_full")  # load the Yelp review dataset
tokenizer = AutoTokenizer.from_pretrained(model_path + "bert-base-cased")  # load the BERT tokenizer

# Tokenization function
def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)  # tokenize, pad, and truncate each batch of texts

# Tokenize the whole dataset
tokenized_datasets = dataset.map(tokenize_function, batched=True)

# Prepare the training and validation sets
all_train_dataset = tokenized_datasets["train"].shuffle(seed=42).select(range(650000))  # shuffle and keep all 650,000 training samples
all_eval_dataset = tokenized_datasets["test"].shuffle(seed=42).select(range(10000))  # shuffle the test split and take 10,000 samples for validation

# Evaluation metric
metric = evaluate.load("metrics/accuracy/" + "accuracy.py")  # load the accuracy metric from a local script

def compute_metrics(eval_pred):
    logits, labels = eval_pred  # unpack model outputs and ground-truth labels
    predictions = np.argmax(logits, axis=-1)  # pick the class with the highest logit
    return metric.compute(predictions=predictions, references=labels)  # compute accuracy

# Load the pretrained model
model = AutoModelForSequenceClassification.from_pretrained(model_path + "bert-base-cased", num_labels=5)  # BERT with a 5-class classification head

# Training arguments
training_args = TrainingArguments(
    output_dir=result_output_dir,  # output directory for checkpoints and the final model
    evaluation_strategy="epoch",  # evaluate at the end of every epoch
    save_strategy="epoch",  # save a checkpoint at the end of every epoch
    learning_rate=2e-5,  # lower learning rate than the previous run
    per_device_train_batch_size=32,  # training batch size per device
    per_device_eval_batch_size=32,  # evaluation batch size per device
    num_train_epochs=10,  # more training epochs than the previous run
    logging_steps=30,  # log every 30 steps
    save_total_limit=3,  # keep at most 3 checkpoints
    load_best_model_at_end=True,  # reload the best checkpoint when training ends
    gradient_accumulation_steps=2,  # accumulate gradients over 2 steps (effective batch size 64 per device)
    fp16=True,  # mixed-precision training to reduce GPU memory usage
    lr_scheduler_type="linear",  # linear learning-rate schedule
    warmup_steps=500,  # warmup steps during which the learning rate ramps up
    logging_dir="logs",  # directory for log files
    report_to="tensorboard"  # report training progress to TensorBoard
)

# Create the Trainer
trainer = Trainer(
    model=model,  # model to train
    args=training_args,  # training arguments
    train_dataset=all_train_dataset,  # training set
    eval_dataset=all_eval_dataset,  # validation set
    compute_metrics=compute_metrics,  # metric function
)

# Train the model
trainer.train()

# Evaluate the model
trainer.evaluate(all_eval_dataset)

# Save the model and the trainer state
trainer.save_model(result_output_dir)
trainer.save_state()


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at /root/dataDisk/hf/hub/models/bert-base-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Detected kernel version 5.4.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.


Epoch,Training Loss,Validation Loss,Accuracy
0,0.7111,0.72244,0.6795


In [1]:
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification, TrainingArguments, Trainer, IntervalStrategy
import numpy as np
import evaluate
from torch.optim import AdamW
import torch
from transformers.trainer_callback import TrainerCallback, TrainerState, TrainerControl

# Custom callback that prints the latest metrics after each evaluation
class LoggingCallback(TrainerCallback):
    def on_evaluate(self, args: TrainingArguments, state: TrainerState, control: TrainerControl, **kwargs):
        print(f"Epoch {state.epoch}:")
        # Print the training loss if the second-to-last log entry has one
        if len(state.log_history) > 1 and 'loss' in state.log_history[-2]:
            print(f"  Training Loss = {state.log_history[-2]['loss']}")
        else:
            print("  Training Loss not available")
        # Print the validation loss if present
        if 'eval_loss' in state.log_history[-1]:
            print(f"  Validation Loss = {state.log_history[-1]['eval_loss']}")
        else:
            print("  Validation Loss not available")
        # Print the accuracy if present
        if 'eval_accuracy' in state.log_history[-1]:
            print(f"  Accuracy = {state.log_history[-1]['eval_accuracy']}")
        else:
            print("  Accuracy not available")

# Custom Trainer subclass that builds a plain AdamW optimizer
class CustomTrainer(Trainer):
    def create_optimizer(self):
        self.optimizer = AdamW(self.model.parameters(), lr=self.args.learning_rate, eps=self.args.adam_epsilon)
        return self.optimizer

# Paths to the local model and dataset directories
model_path = "/root/dataDisk/hf/hub/models/"
datasets_path = "/root/dataDisk/hf/hub/datasets/"
result_output_dir = "test_trainer_adamW"

# Load and tokenize the dataset
dataset = load_dataset(datasets_path + "yelp_review_full")
tokenizer = AutoTokenizer.from_pretrained(model_path + "bert-base-cased")

def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)

tokenized_datasets = dataset.map(tokenize_function, batched=True)

all_train_dataset = tokenized_datasets["train"].shuffle(seed=42).select(range(650000))
all_eval_dataset = tokenized_datasets["test"].shuffle(seed=42).select(range(10000))

# Evaluation metric
metric = evaluate.load("metrics/accuracy/" + "accuracy.py")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)

# Load the pretrained model with a 5-class classification head
model = AutoModelForSequenceClassification.from_pretrained(model_path + "bert-base-cased", num_labels=5)

# Training arguments
training_args = TrainingArguments(
    output_dir=result_output_dir,  # output directory for checkpoints and the final model
    evaluation_strategy=IntervalStrategy.EPOCH,  # evaluate at the end of every epoch
    save_strategy=IntervalStrategy.EPOCH,  # save a checkpoint at the end of every epoch
    learning_rate=2e-5,  # peak learning rate
    per_device_train_batch_size=42,  # training batch size per device
    per_device_eval_batch_size=42,  # evaluation batch size per device
    num_train_epochs=3,  # number of training epochs
    logging_steps=30,  # log every 30 steps
    save_total_limit=3,  # keep at most 3 checkpoints
    load_best_model_at_end=True,  # reload the best checkpoint when training ends
    gradient_accumulation_steps=1,  # no gradient accumulation (effective batch size 42 per device)
    fp16=True,  # mixed-precision training to reduce GPU memory usage
    lr_scheduler_type="linear",  # linear learning-rate schedule
    warmup_steps=500,  # warmup steps during which the learning rate ramps up
    logging_dir='./logs',  # directory for log files
    report_to="none"  # disable reporting integrations such as TensorBoard
)

# Create the custom Trainer
trainer = CustomTrainer(
    model=model,  # model to train
    args=training_args,  # training arguments
    train_dataset=all_train_dataset,  # training set
    eval_dataset=all_eval_dataset,  # validation set
    compute_metrics=compute_metrics,  # metric function
    callbacks=[LoggingCallback()]  # register the custom callback
)

# Start training
trainer.train()

# Evaluate the model
trainer.evaluate(all_eval_dataset)

# Save the model and the trainer state
trainer.save_model(result_output_dir)
trainer.save_state()


  from .autonotebook import tqdm as notebook_tqdm
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at /root/dataDisk/hf/hub/models/bert-base-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Detected kernel version 5.4.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.


Epoch,Training Loss,Validation Loss,Accuracy
1,0.7054,0.708036,0.6856
2,0.6235,0.705144,0.6906
3,0.5421,0.731232,0.6959


Epoch 1.0:
  Training Loss = 0.7054
  Validation Loss = 0.7080358266830444
  Accuracy = 0.6856




Epoch 2.0:
  Training Loss = 0.6235
  Validation Loss = 0.7051438689231873
  Accuracy = 0.6906




Epoch 3.0:
  Training Loss = 0.5421
  Validation Loss = 0.7312320470809937
  Accuracy = 0.6959


Epoch 3.0:


KeyError: 'loss'
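
A `KeyError: 'loss'` like the one above can occur when a callback indexes `state.log_history[-2]['loss']` without checking first. Not every entry in the log history has a `loss` key: evaluation entries hold `eval_*` keys instead. A minimal sketch of a more defensive lookup follows, assuming the log history is a list of dicts. The helper name `latest_value` is hypothetical, and the sample entries (in particular the final summary entry) are made up for illustration.

```python
def latest_value(log_history, key):
    """Return the most recent value logged under `key`, or None if it never appears."""
    for entry in reversed(log_history):
        if key in entry:
            return entry[key]
    return None


# Illustrative log history: a training log, an evaluation log, and a final
# summary entry that has no 'loss' key (the summary values are invented).
history = [
    {"loss": 0.5421, "epoch": 3.0},
    {"eval_loss": 0.731232, "eval_accuracy": 0.6959, "epoch": 3.0},
    {"train_loss": 0.63, "epoch": 3.0},
]

print("Training Loss:", latest_value(history, "loss"))
print("Validation Loss:", latest_value(history, "eval_loss"))
print("Missing key:", latest_value(history, "eval_f1"))
```

Scanning backwards for the key avoids relying on a fixed position such as `[-2]`, which shifts whenever an extra entry is appended.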

In [2]:
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification, TrainingArguments, Trainer, IntervalStrategy
import numpy as np
import evaluate
from torch.optim import AdamW
import torch
from transformers.trainer_callback import TrainerCallback, TrainerState, TrainerControl

# Custom callback that prints the latest metrics after each evaluation
class LoggingCallback(TrainerCallback):
    def on_evaluate(self, args: TrainingArguments, state: TrainerState, control: TrainerControl, **kwargs):
        print(f"Epoch {state.epoch}:")
        # Print the training loss if the second-to-last log entry has one
        if len(state.log_history) > 1 and 'loss' in state.log_history[-2]:
            print(f"  Training Loss = {state.log_history[-2]['loss']}")
        else:
            print("  Training Loss not available")
        # Print the validation loss if present
        if 'eval_loss' in state.log_history[-1]:
            print(f"  Validation Loss = {state.log_history[-1]['eval_loss']}")
        else:
            print("  Validation Loss not available")
        # Print the accuracy if present
        if 'eval_accuracy' in state.log_history[-1]:
            print(f"  Accuracy = {state.log_history[-1]['eval_accuracy']}")
        else:
            print("  Accuracy not available")

# Custom Trainer subclass that builds a plain AdamW optimizer
class CustomTrainer(Trainer):
    def create_optimizer(self):
        self.optimizer = AdamW(self.model.parameters(), lr=self.args.learning_rate, eps=self.args.adam_epsilon)
        return self.optimizer

# Paths to the local model and dataset directories
model_path = "/root/dataDisk/hf/hub/models/"
datasets_path = "/root/dataDisk/hf/hub/datasets/"
result_output_dir = "test_trainer_adamW/"

# Load and tokenize the dataset
dataset = load_dataset(datasets_path + "yelp_review_full")
tokenizer = AutoTokenizer.from_pretrained(model_path + "bert-base-cased")

def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)

tokenized_datasets = dataset.map(tokenize_function, batched=True)

all_train_dataset = tokenized_datasets["train"].shuffle(seed=42).select(range(650000))
all_eval_dataset = tokenized_datasets["test"].shuffle(seed=42).select(range(10000))

# Evaluation metric
metric = evaluate.load("metrics/accuracy/" + "accuracy.py")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)

# Load the pretrained model with a 5-class classification head
model = AutoModelForSequenceClassification.from_pretrained(model_path + "bert-base-cased", num_labels=5)

# Training arguments
training_args = TrainingArguments(
    output_dir=result_output_dir,  # output directory for checkpoints and the final model
    evaluation_strategy=IntervalStrategy.EPOCH,  # evaluate at the end of every epoch
    save_strategy=IntervalStrategy.EPOCH,  # save a checkpoint at the end of every epoch
    learning_rate=2e-5,  # peak learning rate
    per_device_train_batch_size=42,  # training batch size per device
    per_device_eval_batch_size=42,  # evaluation batch size per device
    num_train_epochs=5,  # number of training epochs
    logging_steps=30,  # log every 30 steps
    save_total_limit=3,  # keep at most 3 checkpoints
    load_best_model_at_end=True,  # reload the best checkpoint when training ends
    gradient_accumulation_steps=1,  # no gradient accumulation (effective batch size 42 per device)
    fp16=True,  # mixed-precision training to reduce GPU memory usage
    lr_scheduler_type="linear",  # linear learning-rate schedule
    warmup_steps=500,  # warmup steps during which the learning rate ramps up
    logging_dir='./logs',  # directory for log files
    report_to="none"  # disable reporting integrations such as TensorBoard
)

# Create the custom Trainer
trainer = CustomTrainer(
    model=model,  # model to train
    args=training_args,  # training arguments
    train_dataset=all_train_dataset,  # training set
    eval_dataset=all_eval_dataset,  # validation set
    compute_metrics=compute_metrics,  # metric function
    callbacks=[LoggingCallback()]  # register the custom callback
)

# Resume training from a saved checkpoint
trainer.train(resume_from_checkpoint=result_output_dir + "checkpoint-30954")

# Evaluate the model
trainer.evaluate(all_eval_dataset)

# Save the model and the trainer state
trainer.save_model(result_output_dir)
trainer.save_state()

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at /root/dataDisk/hf/hub/models/bert-base-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Detected kernel version 5.4.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.


Epoch,Training Loss,Validation Loss,Accuracy
3,0.5749,0.733922,0.6957
4,0.5337,0.777048,0.6889
5,0.4521,0.837755,0.6881
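
The checkpoint directory names in this run encode the global optimizer step, which makes it possible to check which epoch a checkpoint belongs to. The sketch below is an approximation that assumes a single GPU and that the last partial batch of each epoch is kept. The helper name `steps_per_epoch` is our own and not a library function.

```python
import math


def steps_per_epoch(num_samples, per_device_batch_size, grad_accum_steps=1, num_devices=1):
    """Approximate number of optimizer steps in one epoch."""
    # Number of batches per epoch, counting the final partial batch
    batches = math.ceil(num_samples / (per_device_batch_size * num_devices))
    # One optimizer step is taken every `grad_accum_steps` batches
    return max(1, batches // grad_accum_steps)


per_epoch = steps_per_epoch(650_000, 42)
print("steps per epoch:", per_epoch)
print("end of epoch 2:", 2 * per_epoch)
print("end of epoch 3:", 3 * per_epoch)
```

With 650,000 samples and a batch size of 42, two epochs end at step 30954, which matches the `checkpoint-30954` directory that training resumes from, so the resumed run starts at the beginning of epoch 3.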


Epoch 3.0:
  Training Loss = 0.5749
  Validation Loss = 0.7339217066764832
  Accuracy = 0.6957


Checkpoint destination directory test_trainer_adamW/checkpoint-46431 already exists and is non-empty.Saving will proceed but saved results may be invalid.


Epoch 4.0:
  Training Loss = 0.5337
  Validation Loss = 0.7770480513572693
  Accuracy = 0.6889
Epoch 5.0:
  Training Loss = 0.4521
  Validation Loss = 0.8377554416656494
  Accuracy = 0.6881


Epoch 5.0:
  Training Loss not available
  Validation Loss = 0.7051438689231873
  Accuracy = 0.6906
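
With `load_best_model_at_end=True` and no `metric_for_best_model` set, the Trainer selects the checkpoint with the lowest evaluation loss. That appears to explain why the final evaluation above reports the epoch 2 figures (validation loss 0.7051, accuracy 0.6906) rather than those of epoch 5. The sketch below repeats that selection on the per-epoch results logged in this notebook. The tuple layout and variable names are illustrative choices.

```python
# (epoch, validation loss, accuracy), copied from the training outputs:
# epochs 1-2 from the initial run, epochs 3-5 from the resumed run.
results = [
    (1, 0.708036, 0.6856),
    (2, 0.705144, 0.6906),
    (3, 0.733922, 0.6957),
    (4, 0.777048, 0.6889),
    (5, 0.837755, 0.6881),
]

# Default behaviour: the best model is the one with the lowest validation loss
best_by_loss = min(results, key=lambda r: r[1])
# Alternative criterion: the highest accuracy
best_by_accuracy = max(results, key=lambda r: r[2])

print("best epoch by validation loss:", best_by_loss[0])
print("best epoch by accuracy:", best_by_accuracy[0])
```

The two criteria disagree here: validation loss favors epoch 2, while accuracy favors epoch 3. If accuracy is the target, passing `metric_for_best_model="accuracy"` to `TrainingArguments` should make the Trainer select on that metric instead.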


In [None]:
Fri Jun 14 12:26:51 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 4090        Off | 00000000:12:00.0 Off |                  Off |
| 31%   56C    P2             295W / 450W |  23245MiB / 24564MiB |    100%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
+---------------------------------------------------------------------------------------+

In [1]:
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification, TrainingArguments, Trainer, IntervalStrategy
import numpy as np
import evaluate
from transformers.optimization import AdamW
import torch
from transformers.trainer_callback import TrainerCallback, TrainerState, TrainerControl

# 自定义回调函数，在每个epoch结束后输出结果
class LoggingCallback(TrainerCallback):
    def on_evaluate(self, args: TrainingArguments, state: TrainerState, control: TrainerControl, **kwargs):
        print(f"Epoch {state.epoch}:")
        # 检查并打印训练损失
        if len(state.log_history) > 1 and 'loss' in state.log_history[-2]:
            print(f"  Training Loss = {state.log_history[-2]['loss']}")
        else:
            print("  Training Loss not available")
        # 检查并打印验证损失
        if 'eval_loss' in state.log_history[-1]:
            print(f"  Validation Loss = {state.log_history[-1]['eval_loss']}")
        else:
            print("  Validation Loss not available")
        # 检查并打印准确率
        if 'eval_accuracy' in state.log_history[-1]:
            print(f"  Accuracy = {state.log_history[-1]['eval_accuracy']}")
        else:
            print("  Accuracy not available")

# 自定义 Trainer 类，使用 AdamW 优化器
class CustomTrainer(Trainer):
    def create_optimizer(self):
        self.optimizer = AdamW(self.model.parameters(), lr=self.args.learning_rate, eps=self.args.adam_epsilon)
        return self.optimizer

# 设置模型和数据集路径
model_path = "/root/dataDisk/hf/hub/models/"
datasets_path = "/root/dataDisk/hf/hub/datasets/"
result_output_dir = "test_trainer_adamW_1/"

# 准备和标记数据集
dataset = load_dataset(datasets_path + "yelp_review_full")
tokenizer = AutoTokenizer.from_pretrained(model_path + "bert-base-cased")

def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)

tokenized_datasets = dataset.map(tokenize_function, batched=True)

all_train_dataset = tokenized_datasets["train"].shuffle(seed=42).select(range(650000))
all_eval_dataset = tokenized_datasets["test"].shuffle(seed=42).select(range(10000))

# 设置评估指标
metric = evaluate.load("metrics/accuracy/" + "accuracy.py")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)

# 加载预训练模型，并指定输出标签的数量
model = AutoModelForSequenceClassification.from_pretrained(model_path + "bert-base-cased", num_labels=5)

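# Configure the training arguments
training_args = TrainingArguments(
    output_dir=result_output_dir,  # where checkpoints and the final model are written
    evaluation_strategy=IntervalStrategy.EPOCH,  # evaluate at the end of every epoch
    save_strategy=IntervalStrategy.EPOCH,  # save a checkpoint at the end of every epoch
    learning_rate=2e-6,  # learning rate
    per_device_train_batch_size=128,  # training batch size per device
    per_device_eval_batch_size=128,  # evaluation batch size per device
    num_train_epochs=10,  # number of training epochs
    logging_steps=30,  # log every 30 steps
    save_total_limit=3,  # keep at most 3 checkpoints
    load_best_model_at_end=True,  # reload the best checkpoint when training ends
    gradient_accumulation_steps=2,  # accumulate gradients over 2 batches per optimizer step
    fp16=True,  # mixed-precision training to reduce GPU memory use
    lr_scheduler_type="cosine",  # cosine learning-rate schedule
    warmup_steps=1000,  # warmup steps during which the learning rate is raised gradually
    logging_dir='./logs',  # log directory
    report_to="none"  # do not report to TensorBoard or other logging integrations
)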

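# Create the custom Trainer instance
trainer = CustomTrainer(
    model=model,  # model to fine-tune
    args=training_args,  # training arguments
    train_dataset=all_train_dataset,  # training dataset
    eval_dataset=all_eval_dataset,  # evaluation dataset
    compute_metrics=compute_metrics,  # metric function
    callbacks=[LoggingCallback()]  # custom logging callback
)

# Resume training from a previously saved checkpoint
trainer.train(resume_from_checkpoint=result_output_dir + "checkpoint-46430")

# Evaluate the model
trainer.evaluate(all_eval_dataset)

# Save the model and the trainer state
trainer.save_model()
trainer.save_state()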


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at /root/dataDisk/hf/hub/models/bert-base-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Detected kernel version 5.4.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.


| Epoch | Training Loss | Validation Loss | Accuracy |
|---|---|---|---|
| 6 | 0.6724 | 0.719761 | 0.6881 |
| 8 | 0.6517 | 0.719231 | 0.6893 |
| 9 | 0.6505 | 0.719111 | 0.6885 |


Epoch 6.999935387995089:
  Training Loss = 0.6724
  Validation Loss = 0.7197611927986145
  Accuracy = 0.6881




Epoch 8.0:
  Training Loss = 0.6665
  Validation Loss = 0.7198094129562378
  Accuracy = 0.6867




Epoch 8.99993538799509:
  Training Loss = 0.6517
  Validation Loss = 0.7192310690879822
  Accuracy = 0.6893




Epoch 9.999741551980359:
  Training Loss = 0.6505
  Validation Loss = 0.7191110253334045
  Accuracy = 0.6885




Epoch 9.999741551980359:
  Training Loss not available
  Validation Loss = 0.718285858631134
  Accuracy = 0.6872
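
The configuration above combines `warmup_steps=1000` with `lr_scheduler_type="cosine"`. As a rough illustration of the learning-rate curve this produces, the sketch below reimplements a linear-warmup-then-cosine-decay rule in plain Python. It is not the Transformers implementation: the function name `cosine_lr_with_warmup` and the total of 10,000 steps are assumptions chosen for illustration, while the base learning rate and the warmup length are taken from the configuration.

```python
import math

def cosine_lr_with_warmup(step, base_lr, warmup_steps, total_steps):
    """Learning rate at `step`: linear warmup, then half-cosine decay to zero."""
    if step < warmup_steps:
        # Warmup phase: scale linearly from 0 up to base_lr
        return base_lr * step / max(1, warmup_steps)
    # Fraction of the post-warmup phase completed, clamped to [0, 1]
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, progress)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

BASE_LR = 2e-6        # learning_rate from the configuration above
WARMUP_STEPS = 1000   # warmup_steps from the configuration above
TOTAL_STEPS = 10_000  # hypothetical total, for illustration only

for step in (0, 500, 1000, 5500, 10_000):
    lr = cosine_lr_with_warmup(step, BASE_LR, WARMUP_STEPS, TOTAL_STEPS)
    print(f"step {step:>6}: lr = {lr:.3e}")
```

Under these assumptions the learning rate starts at zero, reaches its peak at the end of warmup, and then decays smoothly towards zero at the last step.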


In [None]:
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification, TrainingArguments, Trainer, IntervalStrategy
import numpy as np
import evaluate
from transformers.optimization import AdamW
import torch
from transformers.trainer_callback import TrainerCallback, TrainerState, TrainerControl

# Custom callback that prints the results after each evaluation (once per epoch here)
class LoggingCallback(TrainerCallback):
    def on_evaluate(self, args: TrainingArguments, state: TrainerState, control: TrainerControl, **kwargs):
        print(f"Epoch {state.epoch}:")
        # Check for the training loss and print it
        if len(state.log_history) > 1 and 'loss' in state.log_history[-2]:
            print(f"  Training Loss = {state.log_history[-2]['loss']}")
        else:
            print("  Training Loss not available")
        # Check for the validation loss and print it
        if 'eval_loss' in state.log_history[-1]:
            print(f"  Validation Loss = {state.log_history[-1]['eval_loss']}")
        else:
            print("  Validation Loss not available")
        # Check for the accuracy metric and print it
        if 'eval_accuracy' in state.log_history[-1]:
            print(f"  Accuracy = {state.log_history[-1]['eval_accuracy']}")
        else:
            print("  Accuracy not available")

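# Custom Trainer subclass that builds its own AdamW optimizer.
# Only lr and eps are taken from the training arguments; other optimizer
# settings (such as weight_decay) are not passed through.
class CustomTrainer(Trainer):
    def create_optimizer(self):
        self.optimizer = AdamW(self.model.parameters(), lr=self.args.learning_rate, eps=self.args.adam_epsilon)
        return self.optimizer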

# Local paths for the model, the dataset, and the training output
model_path = "/root/dataDisk/hf/hub/models/"
datasets_path = "/root/dataDisk/hf/hub/datasets/"
result_output_dir = "test_trainer_adamW_2/"

# Load the dataset and the tokenizer, then tokenize the text
dataset = load_dataset(datasets_path + "yelp_review_full")
tokenizer = AutoTokenizer.from_pretrained(model_path + "bert-base-cased")

def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)

tokenized_datasets = dataset.map(tokenize_function, batched=True)

# Use all 650,000 training samples and 10,000 of the 50,000 test samples
all_train_dataset = tokenized_datasets["train"].shuffle(seed=42).select(range(650000))
all_eval_dataset = tokenized_datasets["test"].shuffle(seed=42).select(range(10000))

# Load the accuracy metric from a local script
metric = evaluate.load("metrics/accuracy/" + "accuracy.py")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)

# Load the pretrained model and set the number of output labels
model = AutoModelForSequenceClassification.from_pretrained(model_path + "bert-base-cased", num_labels=5)

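# Configure the training arguments
training_args = TrainingArguments(
    output_dir=result_output_dir,  # where checkpoints and the final model are written
    evaluation_strategy=IntervalStrategy.EPOCH,  # evaluate at the end of every epoch
    save_strategy=IntervalStrategy.EPOCH,  # save a checkpoint at the end of every epoch
    learning_rate=1e-6,  # learning rate
    per_device_train_batch_size=32,  # training batch size per device
    per_device_eval_batch_size=32,  # evaluation batch size per device
    num_train_epochs=5,  # number of training epochs
    logging_steps=30,  # log every 30 steps
    save_total_limit=3,  # keep at most 3 checkpoints
    load_best_model_at_end=True,  # reload the best checkpoint when training ends
    gradient_accumulation_steps=4,  # accumulate gradients over 4 batches per optimizer step
    fp16=True,  # mixed-precision training to reduce GPU memory use
    lr_scheduler_type="cosine_with_restarts",  # cosine learning-rate schedule with restarts
    warmup_steps=1000,  # warmup steps during which the learning rate is raised gradually
    logging_dir='./logs',  # log directory
    report_to="none"  # do not report to TensorBoard or other logging integrations
)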

# Create the custom Trainer instance
trainer = CustomTrainer(
    model=model,  # model to fine-tune
    args=training_args,  # training arguments
    train_dataset=all_train_dataset,  # training dataset
    eval_dataset=all_eval_dataset,  # evaluation dataset
    compute_metrics=compute_metrics,  # metric function
    callbacks=[LoggingCallback()]  # custom logging callback
)

# Start training
trainer.train()

# Evaluate the model
trainer.evaluate(all_eval_dataset)

# Save the model and the trainer state
trainer.save_model()
trainer.save_state()


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at /root/dataDisk/hf/hub/models/bert-base-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Detected kernel version 5.4.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.


| Epoch | Training Loss | Validation Loss | Accuracy |
|---|---|---|---|
| 0 | 0.8974 | 0.871679 | 0.6117 |


Epoch 0.9999015457320075:
  Training Loss = 0.8974
  Validation Loss = 0.8716785907745361
  Accuracy = 0.6117
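
`LoggingCallback` reads the training loss from `state.log_history[-2]`, which works only when the entry just before the evaluation record is a step log. The final block of the earlier run prints "Training Loss not available", most likely because the evaluation called after training finds the end-of-training summary in that position instead. A more tolerant lookup could search backwards for the most recent entry that holds the requested key. The helper `latest_logged` and the sample history below are hypothetical and hand-written for illustration; they only mimic the list-of-dicts shape of `log_history`.

```python
def latest_logged(log_history, key):
    """Return the most recent value logged under `key`, or None if it never appears."""
    for entry in reversed(log_history):
        if key in entry:
            return entry[key]
    return None

# Hand-written sample that mimics the shape of Trainer's state.log_history
example_history = [
    {"loss": 0.6505, "epoch": 10.0, "step": 100},                               # step log
    {"eval_loss": 0.7191, "eval_accuracy": 0.6885, "epoch": 10.0, "step": 100}, # evaluation
    {"train_runtime": 123.0, "train_loss": 0.66, "epoch": 10.0, "step": 100},   # training summary
    {"eval_loss": 0.7183, "eval_accuracy": 0.6872, "epoch": 10.0, "step": 100}, # final evaluation
]

# The step log is found even though it is not at index -2
print("Training Loss =", latest_logged(example_history, "loss"))
print("Validation Loss =", latest_logged(example_history, "eval_loss"))
print("Accuracy =", latest_logged(example_history, "eval_accuracy"))
```

Inside the callback, the same helper could replace the fixed `[-2]` and `[-1]` lookups, with a `None` result mapped to the "not available" message.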


