<font size="5">Fine-tuning on the YelpReviewFull Dataset</font>

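The yelp_review_full dataset:  
The full review dataset from Yelp (a US business review site). It contains 650,000 training reviews and 50,000 test reviews written by users about businesses. Each review carries a 1-5 star rating label, and the dataset is commonly used for natural language processing tasks such as sentiment analysis and text classification.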

<font size="4">1 Loading the data</font>

In [2]:
import os

os.environ['http_proxy'] = 'http://127.0.0.1:1087'
os.environ['https_proxy'] = 'http://127.0.0.1:1087'

from datasets import load_dataset

dataset = load_dataset("yelp_review_full")

In [3]:
dataset

DatasetDict({
    train: Dataset({
        features: ['label', 'text'],
        num_rows: 650000
    })
    test: Dataset({
        features: ['label', 'text'],
        num_rows: 50000
    })
})

<font size="4">2 Preprocessing the data</font>

In [6]:
import random
import pandas as pd
import datasets
from IPython.display import display, HTML

In [4]:
def show_random_elements(dataset, num_examples=10):
    assert num_examples <= len(dataset), "Can't pick more elements than there are in the dataset."
    picks = []  # indices drawn so far
    for _ in range(num_examples):  # draw num_examples indices
        pick = random.randint(0, len(dataset) - 1)  # random index within the valid range
        while pick in picks:  # redraw if this index was already picked
            pick = random.randint(0, len(dataset) - 1)
        picks.append(pick)  # record the new index

    df = pd.DataFrame(dataset[picks])  # convert the selected rows to a DataFrame
    for column, typ in dataset.features.items():  # dataset.features: {'label': ClassLabel(names=['1 star', '2 star', '3 stars', '4 stars', '5 stars'], id=None), 'text': Value(dtype='string', id=None)}
        if isinstance(typ, datasets.ClassLabel):  # find the 'label' column
            df[column] = df[column].transform(lambda i: typ.names[i])  # map the integer labels 0-4 to the names '1 star' ... '5 stars'
    display(HTML(df.to_html()))  # render as an HTML table
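
The rejection loop above draws distinct indices one at a time. As a possible alternative (a sketch, not the notebook's code), `random.sample` from the standard library returns distinct indices in a single call. The helper name `pick_unique_indices` and its `seed` parameter are hypothetical.

```python
import random


def pick_unique_indices(dataset_len, num_examples=10, seed=None):
    """Return num_examples distinct indices in [0, dataset_len)."""
    if num_examples > dataset_len:
        raise ValueError("Can't pick more elements than there are in the dataset.")
    rng = random.Random(seed)  # local generator, leaves the global random state untouched
    return rng.sample(range(dataset_len), num_examples)


# 650000 is the size of the training split shown earlier
picks = pick_unique_indices(650000, num_examples=10, seed=0)
print(picks)
```

The resulting list could be passed to `dataset[picks]` in the same way as the list built by the loop.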



In [7]:
show_random_elements(dataset["train"])

Unnamed: 0,label,text
0,3 stars,"Three and a half stars.\nNow we only ate here once, the food was really good and the service was great. We discussed our experience after the meal and honestly we have been more impressed with Doan and Ben Thanh so no I would NOT rate Lang Van the best hands down yada yada.\n\nBut this was only one single meal and the cooking was certainly tasty enough to be in the ballpark (if not in the lead). I wouldn't lose sleep over which one is \""the best\"" in this case. If you like Vietnamese then you'll enjoy dining at any of those three restaurants, we do."
1,5 stars,"Just had the BBQ on the terrace offered on Sundays during the summer. It comes with an appetizer,salad, soup, shellfish buffet, an entree with sides and dessert. The food quality was fantastic (Kobe sliders, duck burgers, Olympia oysters, king crabs). There are drink specials for $8. The cost is $49 which is expensive but I thought it was worth every penny. The restaurant is beautiful and check out the bar. Quite a change from the Marquessa. We'll be back to try the regular menu soon."
2,2 star,"My and my husband's contracts are up and we're torn between two phones right now. I went here to check them out and ask my short list of questions. \n\nThey took my name and it went in queue on their tv screen showing the next 5 people up for service. This is a pretty good system - I like knowing where I am on the list and, in the meantime, I was able to play around with the phones. \n\nWhile I was on deck, the employee at the door taking names asked me if I had decided on one yet. Since he wasn't doing anything I asked him a couple of questions about the two phones I was interested in and also solicited his opinion. He was glad to help. We eventually asked our salesperson some of the same questions; he gave us totally different answers! And the questions we got differing answers for were like \""what's the difference between the two in battery time?\"" etc. Uhhh, someone is telling us something wrong here and shouldn't every employee either know the product or find out the answer if they're not sure, instead of just giving us some bogus answer? \n\nI think I'll just use the store to browse the phones and I'll compare and order them online."
3,5 stars,"Great hotel. I like it how they don't have casino in the hotel. It's clean and more for family with kids. No crazy partying noise at the hotel which is good. One thing though, everytime we come back from somewhere by taxi, door men are \""checking in? checking in?\"" even though we were staying for over a week (for business). That was annoying! It's such a nice hotel but that \""checking in? checking in?\"" sounds so cheap. Otherwise, oh one more thing, house keeper threw away a bag of my personal item in a bathroom. We talked to a front desk and they said \""we never hear any negative comments like that\"" that's it. So that was annoying but over all it's a great hotel."
4,1 star,Crappy service. Came in to order one sandwich. Told me 30 min wait because they are in middle of a big order. Employees goofing off while making sandwiches. Should have one person working counter orders.
5,3 stars,"Took a few clients here on business. We had the tasting menu, and while the Foie Gras Custard 'Br\u00fbl\u00e9e' was one of the best things I've eaten, I was pretty underwhelmed by the rest of the dishes. When you are dropping $200 (or more) per person on food and wine, I want to be blown away. This place gets 3 stars for the fantastic service, though."
6,1 star,"I wouldn't give them any stars at all. The day we moved in the apartment was dirty carpet stained and scratches on the walls. We reported that to the office and they said that maintenance was going to come in to take care of that but they never showed up. A week ago our neighbor caught his town home on fire, and ours suffered a lot of water and smoke damages. We didn't heard from the property manager until we contacted them to see what was the plan and when the out house was going to be ready for us to move back. So far it's been a week and our house is a complete disaster. we really like the home and the location but they lack customer service skills. We moved from Texas and we were not familiar with the area and didn't have a lot of time to find a place since my husbands job needed him here ASAP. I am pretty sure that we can find something much better for the same price... My advise is to look somewhere else!!!"
7,4 stars,"Excellent service, items delivered on time. The initial delivery was two chairs short, but with one phone call, the missing items were delivered promptly and with a smile. Our guests thought our party decor looked great. I wouldn't hesitate to use RSVP again."
8,3 stars,just ok.
9,5 stars,"This place is like a local Etsy. If you're looking for a gifts, house decor or other locally made goods, come here first. My favorite part of the store is the hand-picked vintage furniture. Such cute stuff all around and the prices are super reasonable. The owner is a warm and cheerful and has done an amazing job with this store. It's now my \""go to\"" place for gifts (especially for babies) and household goods. Stop by and check out this gem for yourself! And if you follow them on Instagram or Facebook you'll be the first to know about new arrivals."


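Tokenizer: converts raw text into the numeric form the model understands (token IDs), and handles text splitting, punctuation, and other details so that the input matches what the model expects.  
bert-base-cased: the pretrained model. "bert" refers to the BERT model, "base" is the base size (fewer parameters than the large version), and "cased" means the model is case-sensitive (for example, "Apple" and "apple" are treated as different tokens).  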
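
To make the "cased" distinction concrete, here is a toy sketch. The vocabulary `TOY_VOCAB` and the function `toy_encode` are invented for illustration; the real tokenizer uses WordPiece with a far larger vocabulary and different IDs.

```python
# Toy vocabulary, made up for this example only
TOY_VOCAB = {"[UNK]": 0, "Apple": 1, "apple": 2, "is": 3, "good": 4}


def toy_encode(text, lowercase=False):
    """Split on whitespace and look each word up in TOY_VOCAB."""
    words = text.lower().split() if lowercase else text.split()
    return [TOY_VOCAB.get(word, TOY_VOCAB["[UNK]"]) for word in words]


cased_a = toy_encode("Apple is good")
cased_b = toy_encode("apple is good")
uncased_a = toy_encode("Apple is good", lowercase=True)
uncased_b = toy_encode("apple is good", lowercase=True)

print(cased_a, cased_b)      # case-sensitive: the first ID differs
print(uncased_a, uncased_b)  # lowercased first: both sequences are identical
```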

In [8]:
from transformers import AutoTokenizer

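tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")  # AutoTokenizer loads the tokenizer that matches the given pretrained model name (tokenization rules differ between models)


def tokenize_function(examples):
    """
    examples["text"]: review texts from the yelp_review_full dataset
    padding="max_length": pad shorter texts up to max_length
    truncation=True: cut texts that are longer than max_length
    max_length=512: maximum sequence length in tokens (512 is the longest input BERT was designed for; longer sequences cannot be processed)
    """
    return tokenizer(examples["text"], padding="max_length", truncation=True, max_length=512)


tokenized_datasets = dataset.map(tokenize_function, batched=True)  # apply tokenize_function to the whole dataset, in batches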



Map:   0%|          | 0/650000 [00:00<?, ? examples/s]

Map:   0%|          | 0/50000 [00:00<?, ? examples/s]
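
What `padding="max_length"` and `truncation=True` do to a single sequence can be sketched in plain Python. This is a simplified illustration, not the library's implementation: the helper `pad_or_truncate` is hypothetical, the ID 101 for [CLS] matches the first `input_ids` value in this notebook's output, and 102 for [SEP] and 0 for [PAD] are assumed for bert-base-cased.

```python
# Special-token IDs (101 is visible in the notebook output; 102 and 0 are assumptions)
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0


def pad_or_truncate(token_ids, max_length=512):
    """Wrap token_ids in [CLS] ... [SEP], then cut or pad to exactly max_length."""
    body = token_ids[: max_length - 2]  # leave room for [CLS] and [SEP]
    ids = [CLS_ID] + body + [SEP_ID]
    return ids + [PAD_ID] * (max_length - len(ids))


short_ids = pad_or_truncate([1430, 5745, 12169], max_length=8)
long_ids = pad_or_truncate(list(range(1000, 1020)), max_length=8)

print(short_ids)  # padded with PAD_ID up to length 8
print(long_ids)   # truncated so that the total length is 8
```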

In [8]:
show_random_elements(tokenized_datasets["train"], num_examples=1)  # show one tokenized example

Unnamed: 0,label,text,input_ids,token_type_ids,attention_mask
0,1 star,"Hertz fails in their customer service. As Club Gold members, we initially asked one of the attendants where we should wait in line. He noted that the Club Gold line was closed, and told us to wait in the general line. After 30 minutes, we were served, only to be moved to the Club Gold line upstairs, who then told us to come back down and wait even more. We waited in three lines to get our car!! At the end of it, they gave us a car where the windows aren't even automatic. I will be sure to take my business elsewhere next time, as Hertz does not understand customer service.","[101, 1430, 5745, 12169, 1107, 1147, 8132, 1555, 119, 1249, 1998, 3487, 1484, 117, 1195, 2786, 1455, 1141, 1104, 1103, 19389, 1116, 1187, 1195, 1431, 3074, 1107, 1413, 119, 1124, 2382, 1115, 1103, 1998, 3487, 1413, 1108, 1804, 117, 1105, 1500, 1366, 1106, 3074, 1107, 1103, 1704, 1413, 119, 1258, 1476, 1904, 117, 1195, 1127, 1462, 117, 1178, 1106, 1129, 1427, 1106, 1103, 1998, 3487, 1413, 8829, 117, 1150, 1173, 1500, 1366, 1106, 1435, 1171, 1205, 1105, 3074, 1256, 1167, 119, 1284, 3932, 1107, 1210, 2442, 1106, 1243, 1412, 1610, 106, 106, 1335, 1103, 1322, 1104, 1122, 117, 1152, 1522, ...]","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...]","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...]"


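token_type_ids (token type IDs): distinguish the sentence segments within one input sequence. For example, [CLS] I like cats [SEP] Do you like dogs? [SEP] becomes [0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1].  
attention_mask: tells the model which tokens are real text (marked 1) and which are padding placeholders (marked 0), so that the model does not attend to the padding.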
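
The two fields can be reproduced by hand for the sentence pair in the example. The following is a minimal sketch working on already-split tokens; `build_pair_inputs` is a hypothetical helper, and padding positions are assumed to get token type 0.

```python
def build_pair_inputs(tokens_a, tokens_b, max_length):
    """Build tokens, token_type_ids and attention_mask for a sentence pair."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    if len(tokens) > max_length:
        raise ValueError("max_length is too small for this pair")
    # Segment A (including [CLS] and the first [SEP]) is type 0, segment B is type 1
    token_type_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    attention_mask = [1] * len(tokens)  # real tokens
    pad = max_length - len(tokens)
    tokens += ["[PAD]"] * pad
    token_type_ids += [0] * pad  # assumption: padding uses type 0
    attention_mask += [0] * pad  # padding is masked out
    return tokens, token_type_ids, attention_mask


tokens, type_ids, mask = build_pair_inputs(
    ["I", "like", "cats"], ["Do", "you", "like", "dogs", "?"], max_length=14
)
print(tokens)
print(type_ids)
print(mask)
```

The first 11 entries of `type_ids` match the list in the example; the last 3 positions are padding.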

<font size="4">3 Fine-tuning on a subset of the data</font>

In [20]:
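train_val = tokenized_datasets["train"].shuffle(seed=42)  # shuffle the training split; seed=42 fixes the random seed so the order is identical on every run
train_dataset = train_val.select(range(20000))  # first 20,000 shuffled examples form the training set
eval_dataset = train_val.select(range(20000, 22000))  # next 2,000 examples form the validation set, used during training to monitor accuracy and loss, tune hyperparameters, and detect overfitting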
print(train_dataset.shape, eval_dataset.shape)

(20000, 5) (2000, 5)


In [21]:
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased", num_labels=5)  
"""
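AutoModelForSequenceClassification: auto class that loads a model for sequence classification tasks (text classification, sentiment analysis, topic identification, etc.)
from_pretrained: loads the pretrained weights and initializes a new classification head
num_labels=5: sets the number of classes to 5; reviews in yelp_review_full are labeled 1-5 stars, so there are 5 classes
"""
print(model.bert.encoder.layer[:2]) 
"""
model: the full sequence classification model, made of two parts: the base BERT model (model.bert) and the classification head (model.classifier)
model.bert: the base BERT model, whose core is a Transformer encoder (model.bert.encoder)
model.bert.encoder.layer: the encoder is a stack of Transformer layers (12 in bert-base-cased), stored in this ModuleList
[:2]: takes the first 2 elements, so this prints the first two Transformer layers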
"""


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


ModuleList(
  (0-1): 2 x BertLayer(
    (attention): BertAttention(
      (self): BertSelfAttention(
        (query): Linear(in_features=768, out_features=768, bias=True)
        (key): Linear(in_features=768, out_features=768, bias=True)
        (value): Linear(in_features=768, out_features=768, bias=True)
        (dropout): Dropout(p=0.1, inplace=False)
      )
      (output): BertSelfOutput(
        (dense): Linear(in_features=768, out_features=768, bias=True)
        (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
        (dropout): Dropout(p=0.1, inplace=False)
      )
    )
    (intermediate): BertIntermediate(
      (dense): Linear(in_features=768, out_features=3072, bias=True)
      (intermediate_act_fn): GELUActivation()
    )
    (output): BertOutput(
      (dense): Linear(in_features=3072, out_features=768, bias=True)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
  )
)
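
As a worked example, the parameter count of one `BertLayer` can be derived from the shapes printed above. The helper functions are hypothetical; every number comes from the `in_features`/`out_features` values in the printout.

```python
def linear_params(in_features, out_features):
    """Weights plus biases of a Linear layer with bias=True."""
    return in_features * out_features + out_features


def layer_norm_params(hidden):
    """Scale and shift vectors of a LayerNorm with elementwise_affine=True."""
    return 2 * hidden


hidden, intermediate = 768, 3072

# query, key, value and the attention output projection are all 768 -> 768
attention = 4 * linear_params(hidden, hidden)
# feed-forward block: 768 -> 3072 followed by 3072 -> 768
feed_forward = linear_params(hidden, intermediate) + linear_params(intermediate, hidden)
# one LayerNorm after attention, one after the feed-forward block
norms = 2 * layer_norm_params(hidden)

per_layer = attention + feed_forward + norms
print(per_layer)      # 7087872
print(2 * per_layer)  # 14175744 for the two layers printed above
```

Dropout and GELU have no trainable parameters, so they do not appear in the count.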


In [11]:
from transformers import TrainingArguments, Trainer
"""
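TrainingArguments: holds the hyperparameters and settings for training (number of epochs, learning rate, batch size, log and output paths, etc.)
Trainer: high-level training API configured by TrainingArguments; it wraps training, evaluation, and checkpoint saving, so there is no need to write the training loop (forward pass, backward pass, parameter updates) by hand
"""
import numpy as np
import evaluate  # evaluation library with many built-in metrics (accuracy, F1, BLEU, ...) and support for custom metrics, so model performance can be computed during training

metric = evaluate.load("/home/cc/projects/llm-quickstart/LLM-quickstart-main/mywork/week2/evaluate-main/metrics/accuracy")  # accuracy: correctly predicted samples / total samples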

model_dir = "/home/cc/models/finetuned-models/bert-base-cased-finetune-yelp-1"

In [15]:
def compute_metrics(eval_pred):
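    logits, labels = eval_pred  # unpack the (logits, labels) tuple
    predictions = np.argmax(logits, axis=-1)  # index of the largest logit is the predicted class
    return metric.compute(predictions=predictions, references=labels)  # computes correctly predicted samples / total samples and returns a dict (e.g. {"accuracy": 0.60})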
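
The same computation can be checked without the `evaluate` library. This is a small sketch using NumPy only; `accuracy_from_logits` and the example logits are made up for illustration.

```python
import numpy as np


def accuracy_from_logits(logits, labels):
    """Fraction of rows whose arg-max class equals the reference label."""
    predictions = np.argmax(logits, axis=-1)
    return float(np.mean(predictions == labels))


# Four fake examples with 5 classes each (labels 0-4 stand for 1-5 stars)
logits = np.array([
    [0.1, 2.0, 0.3, 0.0, -1.0],  # predicted class 1
    [1.5, 0.2, 0.1, 0.0, 0.0],   # predicted class 0
    [0.0, 0.1, 0.2, 0.3, 3.0],   # predicted class 4
    [0.0, 0.9, 0.2, 0.1, 0.0],   # predicted class 1
])
labels = np.array([1, 0, 4, 2])

acc = accuracy_from_logits(logits, labels)
print({"accuracy": acc})  # 3 of 4 predictions are correct
```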

In [24]:
import sys
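# Raise the recursion limit from the default (about 1000) to 10000 (enough for BERT's nested module hierarchy)
sys.setrecursionlimit(10000)
training_args = TrainingArguments(
    output_dir=model_dir,  # output path for checkpoints and the model
    per_device_train_batch_size=16,  # training batch size per GPU
    per_device_eval_batch_size=32,  # evaluation batch size per GPU
    gradient_accumulation_steps=2,  # gradient accumulation steps (effective batch size = 16 * 2 * 2 GPUs = 64)
    num_train_epochs=5,  # number of training epochs
    learning_rate=5e-5,  # learning rate
    weight_decay=0.01,  # weight decay (an L2-style regularizer)
    warmup_ratio=0.05,  # warm up over the first 5% of training steps
    lr_scheduler_type="linear",  # learning-rate scheduler type
    logging_dir="./logs",  # log directory
    logging_steps=500,  # log every 500 steps
    evaluation_strategy="epoch",  # evaluate at the end of every epoch
    save_strategy="epoch",  # save a checkpoint at the end of every epoch
    save_total_limit=2,  # keep at most 2 checkpoints
    load_best_model_at_end=True,  # reload the best checkpoint when training ends
    metric_for_best_model="accuracy",  # choose the best checkpoint by accuracy
    greater_is_better=True,  # higher accuracy is better
    report_to="tensorboard",  # log to TensorBoard
    fp16=True,  # mixed-precision training (uses the Tensor Cores of the RTX 4090)
    dataloader_num_workers=4,  # number of data-loading worker processes
    remove_unused_columns=False,  # do not drop dataset columns the model does not use (keeps all features)
    gradient_checkpointing=False,  # gradient checkpointing is off (turning it on would save GPU memory at the cost of speed)
)
trainer = Trainer(
    model=model,  # the model to train
    args=training_args,  # training configuration
    train_dataset=train_dataset,  # training set
    eval_dataset=eval_dataset,  # validation set
    compute_metrics=compute_metrics,  # metric function, here accuracy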
)
print(training_args)

TrainingArguments(
_n_gpu=2,
adafactor=False,
adam_beta1=0.9,
adam_beta2=0.999,
adam_epsilon=1e-08,
auto_find_batch_size=False,
bf16=False,
bf16_full_eval=False,
data_seed=None,
dataloader_drop_last=False,
dataloader_num_workers=4,
dataloader_persistent_workers=False,
dataloader_pin_memory=True,
ddp_backend=None,
ddp_broadcast_buffers=None,
ddp_bucket_cap_mb=None,
ddp_find_unused_parameters=None,
ddp_timeout=1800,
debug=[],
deepspeed=None,
disable_tqdm=False,
dispatch_batches=None,
do_eval=True,
do_predict=False,
do_train=False,
eval_accumulation_steps=None,
eval_delay=0,
eval_steps=None,
evaluation_strategy=epoch,
fp16=True,
fp16_backend=auto,
fp16_full_eval=False,
fp16_opt_level=O1,
fsdp=[],
fsdp_config={'min_num_params': 0, 'xla': False, 'xla_fsdp_grad_ckpt': False},
fsdp_min_num_params=0,
fsdp_transformer_layer_cls_to_wrap=None,
full_determinism=False,
gradient_accumulation_steps=2,
gradient_checkpointing=False,
gradient_checkpointing_kwargs=None,
greater_is_better=True,
group_by_l
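
The combination `lr_scheduler_type="linear"` with `warmup_ratio=0.05` describes a learning rate that rises linearly during warmup and then decays linearly to zero. The sketch below illustrates that shape under the assumption that the warmup length is the ceiling of `total_steps * warmup_ratio`; `linear_warmup_lr` is a hypothetical helper, not the library's scheduler, and `total_steps=1560` is the `global_step` reported by this run.

```python
import math


def linear_warmup_lr(step, total_steps, base_lr=5e-5, warmup_ratio=0.05):
    """Learning rate at a given step: linear warmup, then linear decay to 0."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)  # assumed rounding rule
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = total_steps - step
    return base_lr * max(0.0, remaining / max(1, total_steps - warmup_steps))


total_steps = 1560
warmup_steps = math.ceil(total_steps * 0.05)

for step in (0, warmup_steps // 2, warmup_steps, total_steps // 2, total_steps):
    print(step, linear_warmup_lr(step, total_steps))
```

The rate starts at 0, reaches `base_lr` exactly at the end of warmup, and returns to 0 at the last step.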

In [25]:
trainer.train()  # train the model

Epoch,Training Loss,Validation Loss,Accuracy
0,No log,0.904373,0.606
2,0.969500,1.053166,0.604
4,0.267300,1.397879,0.607




TrainOutput(global_step=1560, training_loss=0.5845682676021869, metrics={'train_runtime': 1080.2132, 'train_samples_per_second': 92.574, 'train_steps_per_second': 1.444, 'total_flos': 2.626971534360576e+16, 'train_loss': 0.5845682676021869, 'epoch': 4.99})

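Metric definitions:    
Epoch: the training epoch, i.e. the number of complete passes the model has made over the training set. For example, Epoch 0 is the first epoch (the counter starts at 0 here) and Epoch 4 is the fifth.  
Training Loss: the loss on the training set (typically cross-entropy, which measures the gap between the model's predictions and the true training labels). The smaller the value, the better the model fits the training set.  
Validation Loss: the loss on the validation set, which measures the prediction error on data the model has not been trained on. The smaller the value, the better the model generalizes.  
Accuracy: the prediction accuracy on the validation set (correctly predicted samples / total samples), a direct measure of classification performance. The higher the value, the better the model.  

The three logged accuracies are 0.606 (60.6%), 0.604 (60.4%), and 0.607 (60.7%): the variation is tiny and there is essentially no improvement.  
Although the training loss keeps falling, the actual classification performance on the validation set does not improve and stays around 60%. Together with the rising validation loss, this indicates that the model is overfitting the training set rather than learning patterns that generalize, so it cannot predict the validation set well.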
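
The trend can be checked directly on the logged values. In this short sketch the rows are copied from the table above and `None` stands for the "No log" entry; the variable names are illustrative.

```python
# (epoch, training_loss, validation_loss, accuracy), copied from the training log
history = [
    (0, None, 0.904373, 0.606),
    (2, 0.969500, 1.053166, 0.604),
    (4, 0.267300, 1.397879, 0.607),
]

best_by_loss = min(history, key=lambda row: row[2])      # lowest validation loss
best_by_accuracy = max(history, key=lambda row: row[3])  # highest accuracy

val_losses = [row[2] for row in history]
val_loss_rising = all(a < b for a, b in zip(val_losses, val_losses[1:]))

print("best epoch by validation loss:", best_by_loss[0])
print("best epoch by accuracy:", best_by_accuracy[0])
print("validation loss rises at every logged evaluation:", val_loss_rising)
```

The two criteria disagree here: validation loss favors the earliest evaluation, while `metric_for_best_model="accuracy"` favors the last one by a margin of 0.001.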

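TrainOutput fields:  
global_step=1560: total number of training steps (parameter updates). Steps = training samples ÷ samples per step × epochs; here 20000 ÷ 64 ≈ 312 steps per epoch, and 312 × 5 epochs = 1560.  
training_loss=0.5845682676021869: average training loss over all training steps. The value is fairly low (0.58), which means the model fits the training set well and has largely memorized the training data.  
metrics:   
      train_runtime=1080.2132: total training time in seconds (about 18 minutes)  
      train_samples_per_second=92.574: training samples processed per second  
      train_steps_per_second=1.444: training steps completed per second  
      total_flos=2.626971534360576e+16: total number of floating-point operations during training  
      train_loss=0.5845682676021869: same as training_loss above, the average training loss  
      epoch=4.99: number of epochs actually completed (close to 5), in line with the configured number of epochs (slightly below 5, possibly because the last pass did not process every sample)  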
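
The step counts can be reproduced with a little arithmetic. The sketch below assumes that batches per epoch are rounded up and that a trailing incomplete accumulation group is dropped; this rule is an inference that matches both `global_step` values reported in this notebook (1560 for 20,000 samples and 45705 for 585,000 samples), not a statement about every library version. `update_steps` is a hypothetical helper.

```python
import math


def update_steps(num_samples, per_device_batch=16, n_gpu=2, grad_accum=2, epochs=5):
    """Number of optimizer updates for the batch settings used in this notebook."""
    batches_per_epoch = math.ceil(num_samples / (per_device_batch * n_gpu))
    steps_per_epoch = batches_per_epoch // grad_accum  # incomplete group is dropped
    return steps_per_epoch * epochs


print(update_steps(20000))   # 1560
print(update_steps(585000))  # 45705
```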

In [26]:
test_dataset = tokenized_datasets["test"].shuffle(seed=64).select(range(100))  # pick 100 examples as the test set

In [27]:
trainer.evaluate(test_dataset)  # evaluate on the test set



{'eval_loss': 0.9968435764312744,
 'eval_accuracy': 0.52,
 'eval_runtime': 0.5526,
 'eval_samples_per_second': 180.966,
 'eval_steps_per_second': 3.619,
 'epoch': 4.99}

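eval_loss=0.9968: loss on the 100 test samples. It lies within the range of the validation losses logged during training (0.904 to 1.398).   
eval_accuracy=0.52: accuracy on the test samples is 52%, below the roughly 60% measured on the validation set. With only 100 test samples the estimate is noisy, but the overall result is mediocre and falls short of good generalization.  
eval_runtime, eval_samples_per_second: evaluation throughput; the run finished in 0.55 seconds at about 181 samples per second. 
epoch=4.99: the evaluation was run after nearly 5 epochs of training. 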
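
An accuracy measured on a small test subset carries sampling noise. As a rough sketch, treating each prediction as an independent Bernoulli trial (an assumption), the standard error of an accuracy `p` over `n` samples is `sqrt(p * (1 - p) / n)`. The values below use the 100-sample result above and the 1000-sample result from the full-data run later in this notebook.

```python
import math


def accuracy_standard_error(accuracy, num_samples):
    """Binomial standard error of an accuracy estimate."""
    return math.sqrt(accuracy * (1.0 - accuracy) / num_samples)


se_100 = accuracy_standard_error(0.52, 100)
se_1000 = accuracy_standard_error(0.671, 1000)

print(f"100 samples:  0.520 +/- {se_100:.3f}")
print(f"1000 samples: 0.671 +/- {se_1000:.3f}")
```

With 100 samples the standard error is about 5 percentage points, so differences of a few points between test and validation accuracy are hard to interpret.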

In [None]:
trainer.save_model(model_dir)  # save the model

In [29]:
trainer.save_state()

<font size="4">4 Fine-tuning on the full dataset</font>

In [12]:
train_val = tokenized_datasets["train"].shuffle(seed=42)
split_index = int(0.9 * len(train_val))  # 90% for the training set, 10% for the validation set
train_dataset = train_val.select(range(split_index))
eval_dataset = train_val.select(range(split_index, len(train_val)))
print(train_dataset.shape, eval_dataset.shape)

(585000, 5) (65000, 5)


In [18]:
trainer.train()



Epoch,Training Loss,Validation Loss,Accuracy
1,0.7398,0.729362,0.679092
2,0.6686,0.712976,0.692523
3,0.5819,0.736168,0.692631
4,0.4685,0.812186,0.689308
5,0.3543,0.924588,0.684877




TrainOutput(global_step=45705, training_loss=0.5784050744806516, metrics={'train_runtime': 30702.4255, 'train_samples_per_second': 95.269, 'train_steps_per_second': 1.489, 'total_flos': 7.696205667072e+17, 'train_loss': 0.5784050744806516, 'epoch': 5.0})

In [None]:
test_dataset = tokenized_datasets["test"].shuffle(seed=64).select(range(1000))

In [20]:
trainer.evaluate(test_dataset)



{'eval_loss': 0.7708825469017029,
 'eval_accuracy': 0.671,
 'eval_runtime': 3.4172,
 'eval_samples_per_second': 292.64,
 'eval_steps_per_second': 4.682,
 'epoch': 5.0}