# MiniCPM-2B Parameter-Efficient Fine-Tuning (LoRA): Single A100 80G Example

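GPUs with less memory can trade time for space by lowering the batch size and raising the gradient accumulation steps (`grad_accum`) to compensate.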
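As a rough illustration of that trade-off, the sketch below computes how many gradient accumulation steps keep the effective batch size constant as the per-device batch shrinks. The launch command logged later in this notebook uses `--per_device_train_batch_size 64` with `--gradient_accumulation_steps 1`; the helper name `grad_accum_steps` and the smaller batch sizes are hypothetical and only serve the example.

```python
def grad_accum_steps(target_batch, per_device_batch, num_devices=1):
    """Accumulation steps needed so that
    per_device_batch * num_devices * steps == target_batch."""
    per_step = per_device_batch * num_devices
    if per_step <= 0 or target_batch % per_step != 0:
        raise ValueError("target_batch must be a positive multiple of per_device_batch * num_devices")
    return target_batch // per_step


# Keep the effective batch at 64 while shrinking the per-device batch.
for micro_batch in (64, 32, 16, 8):
    steps = grad_accum_steps(64, micro_batch)
    print(f"per_device_train_batch_size={micro_batch} -> gradient_accumulation_steps={steps}")
```

Each halving of the per-device batch doubles the number of forward/backward passes per optimizer step, which is where the extra time goes.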

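This notebook is a code example that fine-tunes MiniCPM-2B with LoRA on the `OCNLI` dataset so that it can classify the relationship between two sentences (natural language inference).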

## Minimum Hardware Requirements
- GPU memory: 12 GB
- GPU architecture: Ampere (recommended)
- RAM: 16 GB

## 1. Prepare the Dataset

Convert the raw JSON Lines files into a more general chat-message format.

In [3]:
# Convert to ChatML format
import os
import shutil
import json

input_dir = "data/ocnli_public"
output_dir = "data/ocnli_public_chatml"
if os.path.exists(output_dir):
    shutil.rmtree(output_dir)
os.makedirs(output_dir, exist_ok=True)

for fn in ["train.json", "dev.json"]:
    data_out_list = []
    with open(os.path.join(input_dir, fn), "r", encoding="utf-8") as f, open(os.path.join(output_dir, fn), "w", encoding="utf-8") as fo:
        for line in f:
            if len(line.strip()) > 0:
                data = json.loads(line)
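                # The prompt stays in Chinese to match the dataset. It asks which of
                # [entailment, neutral, contradiction] describes the two sentences.
                data_out = {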
                    "messages": [
                        {
                            "role": "user",
                            "content": f"请判断下边两个句子的关系属于 [entailment, neutral, contradiction]中的哪一种？\n句子1: {data['sentence1']}\n句子2：{data['sentence2']}\n"
                        },
                        {
                            "role": "assistant",
                            "content": data["label"],
                        },
                    ]
                }
                data_out_list.append(data_out)
        json.dump(data_out_list, fo, ensure_ascii=False, indent=4)
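Before training, it can help to sanity-check the converted records. The sketch below is one possible check, not part of the original notebook: it assumes each record should hold exactly one user turn followed by one assistant turn whose content is one of the three labels named in the prompt. The helper names `is_valid_chatml` and `split_valid` are hypothetical, and the `"-"` label in the sample data is only an illustrative placeholder for an entry without a usable label.

```python
VALID_LABELS = {"entailment", "neutral", "contradiction"}


def is_valid_chatml(record, valid_labels=VALID_LABELS):
    """True if record has a non-empty user turn followed by an assistant turn with a known label."""
    messages = record.get("messages")
    if not isinstance(messages, list) or len(messages) != 2:
        return False
    user, assistant = messages
    if user.get("role") != "user" or assistant.get("role") != "assistant":
        return False
    content = user.get("content")
    if not isinstance(content, str) or not content.strip():
        return False
    return assistant.get("content") in valid_labels


def split_valid(records):
    """Return (valid_records, number_of_dropped_records)."""
    kept = [r for r in records if is_valid_chatml(r)]
    return kept, len(records) - len(kept)


# Illustrative in-memory records; real use would load the converted JSON file.
samples = [
    {"messages": [{"role": "user", "content": "句子1: a\n句子2：b\n"},
                  {"role": "assistant", "content": "neutral"}]},
    {"messages": [{"role": "user", "content": "句子1: c\n句子2：d\n"},
                  {"role": "assistant", "content": "-"}]},
]
kept, dropped = split_valid(samples)
print(f"kept={len(kept)} dropped={dropped}")
```

Whether to drop such records or handle them differently is a judgment call; the point is only to know how many there are before they reach the trainer.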


## 2. Fine-Tune with LoRA

Run the whole job from the command line with a single command.

In [22]:
!bash lora_finetune_ocnli.sh

20240315212836


huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


[2024-03-15 21:28:38,758] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-03-15 21:28:45,799] [INFO] [runner.py:568:main] cmd = /usr/bin/python3 -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMF19 --master_addr=127.0.0.1 --master_port=19888 --enable_each_rank_log=None finetune.py --model_name_or_path MiniCPM-2B-sft-bf16 --output_dir output/ocnli_public_chatml/20240315212836/ --train_data_path data/ocnli_public_chatml/train.json --eval_data_path data/ocnli_public_chatml/dev.json --learning_rate 5e-5 --per_device_train_batch_size 64 --per_device_eval_batch_size 128 --model_max_length 128 --bf16 --use_lora --gradient_accumulation_steps 1 --warmup_steps 100 --max_steps 1000 --weight_decay 0.01 --evaluation_strategy steps --eval_steps 500 --save_strategy steps --save_steps 500 --seed 42 --log_level info --logging_strategy steps --logging_steps 10 --deepspeed configs/ds_config_zero3_offload.json
[2024-03-15 21:28:47,849] [
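The launch command above passes `--use_lora`. As a generic illustration of why LoRA is parameter-efficient, the sketch below implements the usual low-rank update, y = W x + (alpha / r) · B (A x), in plain Python and compares parameter counts. It does not reproduce what `finetune.py` does internally. The rank, scaling factor, and the 2304-wide layer are assumed values chosen for the example, and the helper names are hypothetical.

```python
def matvec(matrix, vec):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def lora_forward(W, A, B, x, alpha, rank):
    """y = W x + (alpha / rank) * B (A x); W stays frozen, only A and B are trained."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / rank
    return [b + scale * u for b, u in zip(base, update)]


def lora_param_counts(d_in, d_out, rank):
    """Trainable parameters: full weight matrix vs. the two LoRA factors."""
    return d_in * d_out, rank * (d_in + d_out)


# Tiny example: 2x2 frozen weight with rank-1 adapters.
W = [[1.0, 2.0], [3.0, 4.0]]
A = [[1.0, 1.0]]        # rank x d_in
B_zero = [[0.0], [0.0]]  # d_out x rank, zero-initialised: adapter starts as a no-op
x = [1.0, 1.0]
print(lora_forward(W, A, B_zero, x, alpha=2, rank=1))

# Assumed layer size and rank, for illustration only.
full, lora = lora_param_counts(2304, 2304, 8)
print(f"full={full} lora={lora} ratio={lora / full:.4f}")
```

With B initialised to zero the adapted layer reproduces the frozen layer exactly, so training starts from the base model's behaviour.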

## 3. Verify with Inference

In [7]:
import json
import torch
from tqdm import tqdm
from transformers import AutoModelForCausalLM, AutoTokenizer

In [27]:
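# Checkpoint directory from a fine-tuning run (its timestamp differs from the
# run logged above); replace it with your own output path.
path = "output/ocnli_public_chatml/20240316002856/checkpoint-1500"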
tokenizer = AutoTokenizer.from_pretrained(path)
model = AutoModelForCausalLM.from_pretrained(
    path, torch_dtype=torch.bfloat16, device_map="cuda", trust_remote_code=True
)

In [28]:
res, history = model.chat(tokenizer, query="<用户>请判断下边两个句子的关系属于 [entailment, neutral, contradiction]中的哪一种？\n句子1: 身上裹一件工厂发的棉大衣,手插在袖筒里\n句子2：身上至少一件衣服\n<AI>")
res, history

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


('entailment',
 [{'role': 'user',
   'content': '<用户>请判断下边两个句子的关系属于 [entailment, neutral, contradiction]中的哪一种？\n句子1: 身上裹一件工厂发的棉大衣,手插在袖筒里\n句子2：身上至少一件衣服\n<AI>'},
  {'role': 'assistant', 'content': 'entailment'}])
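The reply above is a bare label, but sampled generations are not guaranteed to stay that clean. The sketch below shows one way to map a free-form reply onto a single label, treating replies that name zero or several labels as unparseable. The helper `parse_label` is hypothetical and not part of the model's API.

```python
LABELS = ("entailment", "neutral", "contradiction")


def parse_label(response):
    """Return the one label mentioned in the reply, or None if zero or several appear."""
    text = response.strip().lower()
    found = [label for label in LABELS if label in text]
    return found[0] if len(found) == 1 else None


for reply in ["entailment", " Neutral\n", "entailment or neutral", "不确定"]:
    print(repr(reply), "->", parse_label(reply))
```

A plain substring test would accept a reply such as "entailment or neutral" whenever the gold label is either of the two; rejecting ambiguous replies makes the score slightly stricter.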

In [29]:
with open("data/ocnli_public_chatml/dev.json", 'r', encoding="utf-8") as f:
    dev_sample_list = json.load(f)


In [30]:
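pos = 0  # replies that contain the gold label
neg = 0  # all other replies
# Score the first 500 dev samples; a reply counts as correct when the gold
# label appears anywhere in the lower-cased response.
for sample in tqdm(dev_sample_list[:500]):
    res, history = model.chat(tokenizer, query="<用户>{}<AI>".format(sample["messages"][0]["content"]), max_length=128, top_p=0.5, temperature=0.8)
    if sample["messages"][1]["content"] in res.strip().lower():
        pos += 1
    else:
        neg += 1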

  2%|▏         | 8/500 [00:00<00:44, 11.07it/s]
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.

In [31]:
pos / (pos+neg), pos, neg

(0.81, 405, 95)
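Because the score comes from only 500 samples, it carries some sampling uncertainty. As a rough sketch, the normal-approximation (Wald) interval below puts the 405/500 result at roughly 0.78 to 0.84 at the 95% level. The helper name `accuracy_interval` is hypothetical, and the Wald interval is just one simple choice among several.

```python
import math


def accuracy_interval(correct, total, z=1.96):
    """Accuracy with a normal-approximation confidence interval (z=1.96 for about 95%)."""
    p = correct / total
    se = math.sqrt(p * (1 - p) / total)  # binomial standard error
    return p, p - z * se, p + z * se


acc, low, high = accuracy_interval(405, 500)
print(f"accuracy={acc:.3f}, 95% interval=[{low:.3f}, {high:.3f}]")
```

Since the evaluation loop passes sampling parameters (`top_p`, `temperature`), repeated runs may also differ slightly from one another.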