## 1. Background

This lab explores fine-tuning on a single machine with two GPUs, using the Hugging Face transformers library together with DeepSpeed.

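## 2. Objectives
Understand the core idea of ZeRO and complete a full-parameter fine-tuning run in parallel across two GPUs on a single machine.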

## 3. Hardware requirements

Two GPUs (for example RTX 4090, V100, or A100).


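## 4. Technical background

### ZeRO

In plain data parallelism, every data-parallel (DP) rank keeps a complete copy of the model weights, so adding GPUs only speeds up training and does not reduce per-GPU memory use.

ZeRO (Zero Redundancy Optimizer) optimizes data parallelism further in order to save GPU memory.

ZeRO partitions the model weights, gradients, and optimizer states across the DP ranks. Whenever a computation needs a full tensor, the shards held by the different ranks are gathered; once the computation finishes, the gathered tensor is released so that each rank again stores only its own shard. The approach trades some extra communication latency for lower memory use.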
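The memory effect of each stage can be estimated with simple arithmetic. The sketch below assumes mixed-precision training with Adam, using the accounting from the ZeRO paper: 2 bytes per parameter for fp16 weights, 2 for fp16 gradients, and 12 for fp32 optimizer states (master weights, momentum, variance). The function name `zero_model_state_bytes` and the round figure of 7 billion parameters are illustrative assumptions; activations, temporary buffers, and fragmentation are ignored.

```python
def zero_model_state_bytes(num_params: int, num_gpus: int, stage: int) -> float:
    """Approximate per-GPU bytes for weights, gradients and optimizer states."""
    weights = 2 * num_params   # fp16 weights
    grads = 2 * num_params     # fp16 gradients
    optim = 12 * num_params    # fp32 master weights + Adam momentum + variance
    if stage >= 1:             # stage 1 partitions the optimizer states
        optim /= num_gpus
    if stage >= 2:             # stage 2 also partitions the gradients
        grads /= num_gpus
    if stage >= 3:             # stage 3 also partitions the weights
        weights /= num_gpus
    return float(weights + grads + optim)


NUM_PARAMS = 7_000_000_000  # assumed round number for a "7B" model
for stage in range(4):
    gb = zero_model_state_bytes(NUM_PARAMS, num_gpus=2, stage=stage) / 1e9
    print(f"stage {stage}: {gb:.0f} GB per GPU")
```

Under these assumptions, two GPUs need roughly 112, 70, 63, and 56 GB each for stages 0 to 3, so the savings grow with the stage but remain bounded by the number of GPUs.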

## 5. Procedure

### Environment setup



In [None]:
!pip install torch modelscope accelerate==0.27.0 deepspeed

### Dataset download



In [None]:
!modelscope download --dataset liucong/Chinese-DeepSeek-R1-Distill-data-110k-SFT --local_dir ./Chinese-DeepSeek-R1-Distill-data-110k-SFT

In [None]:
# Preview the first five records of the JSONL file
with open('Chinese-DeepSeek-R1-Distill-data-110k-SFT/distill_r1_110k_sft.jsonl', 'r', encoding='utf-8') as f:
    for count, line in enumerate(f):
        print(line)
        if count == 4:
            break

### DeepSpeed config



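train_batch_size : the number of samples consumed per model update (optimizer step); equal to num_gpus * train_micro_batch_size_per_gpu * gradient_accumulation_steps

train_micro_batch_size_per_gpu : the number of samples a single GPU processes in one forward/backward pass

stage : 0, 1, 2, 3 correspond to ZeRO disabled, optimizer-state partitioning, optimizer-state + gradient partitioning, and optimizer-state + gradient + parameter partitioning

offload_optimizer : offloads optimizer-state storage to CPU or NVMe, and optimizer computation to CPU

overlap_comm : overlaps gradient reduction with the backward computation

contiguous_gradients : copies gradients into a contiguous buffer as they are produced, reducing memory fragmentation

reduce_bucket_size : the maximum number of elements reduced in a single reduce operation

stage3_prefetch_bucket_size : the maximum number of parameter elements fetched ahead of their use

stage3_param_persistence_threshold : parameters with fewer elements than this threshold are not partitioned

stage3_max_live_parameters : the maximum number of parameter elements kept resident on each GPU before releasing

stage3_max_reuse_distance : a parameter is not released if it will be reused within this many parameter elements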
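Put together, these options could form a configuration like the sketch below. It is not the `deepspeed_config.json` used by this lab, which is not reproduced here; every value is an illustrative assumption, and `"auto"` relies on the Hugging Face Trainer integration filling in the value from `TrainingArguments`. The helper `effective_batch_size` is a hypothetical name that simply restates the train_batch_size formula.

```python
import json

# Illustrative ZeRO stage 3 settings; all values are assumptions, not the lab's config.
example_config = {
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
    "fp16": {"enabled": "auto"},
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
        "overlap_comm": True,
        "contiguous_gradients": True,
        "reduce_bucket_size": "auto",
        "stage3_prefetch_bucket_size": "auto",
        "stage3_param_persistence_threshold": "auto",
        "stage3_max_live_parameters": 1_000_000_000,
        "stage3_max_reuse_distance": 1_000_000_000,
    },
}


def effective_batch_size(num_gpus: int, micro_batch_per_gpu: int, accumulation_steps: int) -> int:
    """Samples consumed per optimizer step (the train_batch_size formula)."""
    return num_gpus * micro_batch_per_gpu * accumulation_steps


# Print the config as JSON; it could be saved to a file of your choosing.
print(json.dumps(example_config, indent=2))
print(effective_batch_size(num_gpus=2, micro_batch_per_gpu=1, accumulation_steps=1))
```

With two GPUs, a micro-batch of 1 per GPU, and no gradient accumulation, each optimizer step consumes 2 samples.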



In [None]:
!pwd
%cd demo5

### 5.1. Imports and environment variables

In [3]:
import os
# Choose which two GPUs to use (adjust the indices for your machine);
# this must be set before CUDA is initialized.
os.environ["CUDA_VISIBLE_DEVICES"] = "6,7"

import torch
from datasets import load_dataset
from transformers import AutoTokenizer, DataCollatorForSeq2Seq
from transformers import AutoModelForCausalLM, TrainingArguments, Trainer

### 5.2. Dataset loading and preprocessing

In [4]:
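dataset_path = "./Chinese-DeepSeek-R1-Distill-data-110k-SFT"
dataset = load_dataset(dataset_path, split="train")
# Keep only 10 shuffled samples so the demo trains quickly
dataset = dataset.shuffle(seed=42).select(range(10))

# Local path to the Qwen2.5-7B-Instruct checkpoint; change it to match your setup
tokenizer = AutoTokenizer.from_pretrained("/nvme/models/models/Qwen2.5-7B-Instruct")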

In [5]:
system_prompt = "You are Qwen, created by Alibaba Cloud. You are a helpful assistant."

def generate_r1_prompt(prompt, completion):
    """Tokenize one sample and mask the prompt so the loss covers only the response."""
    instruction = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": prompt},
    ]
    response = [{"role": "assistant", "content": completion}]
    full = instruction + response

    # add_generation_prompt=True appends the assistant header, so the mask also covers it
    tokenized_instruction = tokenizer.apply_chat_template(
        instruction, tokenize=True, add_generation_prompt=True, return_dict=True
    )
    tokenized_full = tokenizer.apply_chat_template(full, tokenize=True, return_dict=True)

    input_ids = tokenized_full["input_ids"]
    attention_mask = tokenized_full["attention_mask"]
    # Positions labeled -100 are ignored by the loss, so only response tokens are supervised
    labels = input_ids.copy()
    instruction_length = len(tokenized_instruction["input_ids"])
    labels[:instruction_length] = [-100] * instruction_length
    return {"input_ids": input_ids, "attention_mask": attention_mask, "labels": labels}

dataset = dataset.map(
    lambda x: generate_r1_prompt(x["instruction"], x["output"]),
    remove_columns=["instruction", "output"],
)



In [None]:
print(tokenizer.decode(dataset[0]["input_ids"]))

In [None]:
print(tokenizer.decode(list(filter(lambda x: x != -100, dataset[0]["labels"]))))
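
The two cells above decode the full sequence and the supervised part of the labels. The masking idea itself can be shown on toy data without a tokenizer. The sketch below uses made-up token ids and a hypothetical helper `mask_prompt`; it mirrors the slicing in `generate_r1_prompt` and relies on -100 being the index that PyTorch's cross-entropy loss ignores by default.

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss


def mask_prompt(input_ids: list[int], prompt_length: int) -> list[int]:
    """Return labels whose first prompt_length positions are excluded from the loss."""
    labels = input_ids.copy()
    labels[:prompt_length] = [IGNORE_INDEX] * prompt_length
    return labels


# Made-up ids: the first three stand for the prompt, the last two for the response
toy_input_ids = [11, 12, 13, 21, 22]
toy_labels = mask_prompt(toy_input_ids, prompt_length=3)
print(toy_labels)  # [-100, -100, -100, 21, 22]

# Only the unmasked positions contribute to the training loss
supervised = [t for t in toy_labels if t != IGNORE_INDEX]
print(supervised)  # [21, 22]
```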

### 5.3. Model loading and baseline inference

In [None]:
# Load the base model onto the GPU to get a baseline answer before fine-tuning
model = AutoModelForCausalLM.from_pretrained("/nvme/models/models/Qwen2.5-7B-Instruct").to("cuda")

# Prompt (in Chinese): "Which is larger, 1.11 or 1.9?"
prompt = "1.11和1.9哪个大"
inputs = tokenizer.apply_chat_template([{"role": "user", "content": prompt}],
                                       add_generation_prompt=True,
                                       tokenize=True,
                                       return_tensors="pt",
                                       return_dict=True
                                       ).to("cuda")

# With top_k=1 only the most likely token can be sampled, so this behaves like greedy decoding
gen_kwargs = {"max_new_tokens": 100, "do_sample": True, "top_k": 1}

with torch.no_grad():
    outputs = model.generate(**inputs, **gen_kwargs)
    print("Original model output:\n", tokenizer.decode(outputs[0], skip_special_tokens=False))


### 5.4. Training arguments

In [None]:
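training_args = TrainingArguments(
    output_dir="./fine_tuned_qwen",
    per_device_train_batch_size=1,
    num_train_epochs=10,
    save_strategy="no",  # no checkpoints are saved, so save_total_limit has no effect
    logging_dir="./logs",
    logging_steps=1,  # log the loss at every optimizer step
    evaluation_strategy="no",
    save_total_limit=1,
    deepspeed="deepspeed_config.json",  # path to the DeepSpeed config file
    fp16=True,
)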
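
To get a feel for how long this run is, the number of optimizer steps can be estimated from these arguments. The sketch below assumes gradient accumulation is left at 1 (the `TrainingArguments` default), that the run is launched on both GPUs, and that the 10 selected samples are split evenly between them; `total_optimizer_steps` is a hypothetical helper, not part of transformers.

```python
import math


def total_optimizer_steps(num_samples: int, num_gpus: int, per_device_batch: int,
                          accumulation_steps: int, epochs: int) -> int:
    """Estimate optimizer steps, assuming samples are split evenly across GPUs."""
    batches_per_epoch = math.ceil(num_samples / (num_gpus * per_device_batch))
    steps_per_epoch = max(batches_per_epoch // accumulation_steps, 1)
    return steps_per_epoch * epochs


print(total_optimizer_steps(num_samples=10, num_gpus=2, per_device_batch=1,
                            accumulation_steps=1, epochs=10))  # 50
```

With `logging_steps=1`, that would correspond to about 50 logged loss values under the stated assumptions.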

In [None]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
    eval_dataset=None,
    tokenizer=tokenizer,
    # Pads each batch to its longest sequence; labels are padded with -100
    data_collator=DataCollatorForSeq2Seq(tokenizer=tokenizer, padding=True)
)

### 5.5. Run training and inspect the fine-tuned model's output

In [None]:
!bash run.sh