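# Fine-Tuning Generative Models

LLM training steps:

1. Language modeling (pretraining): the model is pretrained on one or more large-scale datasets, producing a base (pretrained) model. The objective is to predict the next token given the input. No labeled data is required.
2. Supervised fine-tuning (SFT): the model is trained to predict the next token on labeled inputs (conversations or instructions), producing an instruction-following or chat model.
3. Preference tuning: further improves model quality by aligning the model's behavior with safety expectations or human preferences.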
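
Steps 1 and 2 share the same next-token objective and differ only in the data. As a minimal sketch of that objective, the snippet below pairs each prefix of a sequence with the token that follows it. The helper name `next_token_pairs` and the token ids are made up for illustration.

```python
def next_token_pairs(token_ids):
    """Return (context, target) pairs for next-token prediction."""
    pairs = []
    for i in range(1, len(token_ids)):
        # The model sees token_ids[:i] and is trained to predict token_ids[i].
        pairs.append((token_ids[:i], token_ids[i]))
    return pairs


# Hypothetical token ids, not produced by any real tokenizer.
for context, target in next_token_pairs([10, 20, 30, 40]):
    print(context, "->", target)
```

In practice the framework does this implicitly by shifting the label sequence one position relative to the inputs, so all pairs are trained in a single forward pass.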

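## Full Fine-Tuning

* Goal: update all model parameters
* Data: a smaller, labeled dataset
* Pros: can improve performance significantly
* Cons: high training cost, long training time, large storage footprint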
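
To put a rough number on the cost, here is a back-of-the-envelope sketch. It assumes fp32 weights and gradients plus an Adam-style optimizer that keeps two fp32 states per parameter, and it ignores activations entirely. The function name and the byte counts are assumptions for illustration, not measurements.

```python
def full_finetune_bytes(n_params, weight_bytes=4, grad_bytes=4, optimizer_bytes=8):
    """Approximate memory for weights, gradients, and two Adam states per parameter.

    Activations, buffers, and framework overhead are deliberately excluded.
    """
    return n_params * (weight_bytes + grad_bytes + optimizer_bytes)


n_params = 1_100_000_000  # roughly a 1.1B-parameter model
gib = full_finetune_bytes(n_params) / 1024**3
print(f"about {gib:.1f} GiB before activations")
```

Under these assumptions, every parameter costs 16 bytes during training, which is why full fine-tuning quickly outgrows a single consumer GPU.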

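## Parameter-Efficient Fine-Tuning (PEFT)

### Adapters

Adapter modules are inserted inside the transformer, and only the adapters are fine-tuned while the pretrained weights stay frozen.

Each adapter can specialize in a different task.

AdapterHub is a central repository for sharing trained adapters.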
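
As an illustration of the idea, the sketch below implements one common adapter design in plain Python: a bottleneck that projects down, applies a nonlinearity, projects back up, and adds the result to its input. The class name, the sizes, the omitted biases, and the zero-initialized up-projection are all assumptions chosen to keep the example small. Real adapter libraries differ in the details.

```python
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def relu(vector):
    return [max(0.0, x) for x in vector]


class BottleneckAdapter:
    """Down-project, apply ReLU, up-project, then add the residual."""

    def __init__(self, hidden_size, bottleneck_size, seed=0):
        rng = random.Random(seed)
        self.down = [[rng.gauss(0, 0.02) for _ in range(hidden_size)]
                     for _ in range(bottleneck_size)]
        # Zero up-projection: the adapter initially passes its input through unchanged.
        self.up = [[0.0] * bottleneck_size for _ in range(hidden_size)]

    def __call__(self, hidden):
        update = matvec(self.up, relu(matvec(self.down, hidden)))
        return [h + u for h, u in zip(hidden, update)]

    def num_parameters(self):
        return sum(len(row) for row in self.down) + sum(len(row) for row in self.up)


adapter = BottleneckAdapter(hidden_size=8, bottleneck_size=2)
print(adapter([1.0] * 8))        # unchanged input at initialization
print(adapter.num_parameters())  # 2 * 8 * 2 trainable weights
```

Only the `down` and `up` matrices would be trained, so the trainable parameter count scales with the bottleneck size rather than with the size of the transformer layer.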

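### Low-Rank Adaptation (LoRA)

LoRA fine-tunes a small subset of parameters derived from the base model: the original weights stay frozen, and small low-rank matrices that approximate the weight update are trained instead.

Rationale: large models have a very low intrinsic dimension, so the weight update can be represented well by low-rank matrices.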
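
A quick count shows why the low-rank form is cheap. For a weight matrix of shape `d_out x d_in`, LoRA trains two factors of shapes `d_out x r` and `r x d_in`. The shapes and the rank below are illustrative assumptions, not values read from a specific model.

```python
def lora_trainable_params(d_out, d_in, r):
    """Parameters in the two low-rank factors B (d_out x r) and A (r x d_in)."""
    return r * (d_out + d_in)


d_out, d_in, r = 2048, 2048, 8  # illustrative layer shape and rank
full = d_out * d_in
lora = lora_trainable_params(d_out, d_in, r)
print(f"full: {full:,}  lora: {lora:,}  ratio: {lora / full:.2%}")
```

The ratio grows linearly with `r`, which is the trade-off between compression and capacity.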

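### Model Compression (Quantization)

Lowering the numerical precision of the parameters reduces memory requirements.

`QLoRA`: converts weights from a higher-bit to a lower-bit representation while keeping them close to the original weights. It uses blockwise quantization: similar weights are grouped into additional blocks, and each block is quantized separately.

`A Visual Guide to Quantization`: <https://www.maartengrootendorst.com/blog/quantization/>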
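
To show why quantizing in blocks helps, the sketch below uses simple symmetric absmax quantization to signed 4-bit integers. This is a deliberate simplification: QLoRA's NF4 data type uses non-uniform levels rather than evenly spaced integers. The function names, block size, and weights are all made up for illustration.

```python
def quantize_block(block, bits=4):
    """Symmetric absmax quantization of one block to signed integer codes."""
    levels = 2 ** (bits - 1) - 1                      # 7 for 4-bit
    scale = max(abs(x) for x in block) / levels or 1.0  # guard against all-zero blocks
    codes = [round(x / scale) for x in block]
    return codes, scale


def dequantize_block(codes, scale):
    return [c * scale for c in codes]


def blockwise_roundtrip(weights, block_size=4, bits=4):
    """Quantize and dequantize the weights, one block at a time."""
    restored = []
    for start in range(0, len(weights), block_size):
        codes, scale = quantize_block(weights[start:start + block_size], bits)
        restored.extend(dequantize_block(codes, scale))
    return restored


# Hypothetical weights: four small values followed by four large ones.
weights = [0.01, -0.02, 0.015, 0.03, 1.2, -0.9, 0.4, 1.5]
restored = blockwise_roundtrip(weights, block_size=4)
single = blockwise_roundtrip(weights, block_size=len(weights))
max_error = max(abs(a - b) for a, b in zip(weights, restored))

print("blockwise:", [round(x, 4) for x in restored])
print("one block:", [round(x, 4) for x in single])
```

With a single block, the large weights set the scale and the small ones collapse to zero. With one scale per block, the small weights keep their own resolution.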



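## QLoRA in Practice

1. Template the instructions: prepare instruction data that follows the chat template
2. Quantize the model: bitsandbytes, via `transformers.BitsAndBytesConfig`
3. Configure the `LoRA` fine-tuning hyperparameters: peft
4. Configure the training arguments: find the best values for the task by experiment
5. Run training
6. Merge the weights
7. Evaluate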

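### LoRA Hyperparameters

* `r`: the rank, typically 4 to 64. A higher rank means less compression (more trainable parameters) and more representational capacity
* `lora_alpha`: scales how much of the learned change is added to the original weights, balancing the original model's knowledge against the new task's knowledge; commonly set to twice `r`
* `target_modules`: selects which layers LoRA is applied to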
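
The interaction between `r` and `lora_alpha` is easier to see in the forward pass. A minimal plain-Python sketch, assuming the standard scaling `lora_alpha / r` (peft's default), computes `h = W x + (lora_alpha / r) * B (A x)`. The helper names and the tiny matrices are illustrative only.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x, lora_alpha, r):
    """Frozen base output plus the scaled low-rank update."""
    scale = lora_alpha / r
    base = matvec(W, x)               # frozen pretrained weights
    update = matvec(B, matvec(A, x))  # trainable low-rank path
    return [b + scale * u for b, u in zip(base, update)]


W = [[1, 0], [0, 1]]  # 2 x 2 frozen weight
A = [[1, 1]]          # r x d_in, with r = 1
B = [[1], [0]]        # d_out x r
print(lora_forward(W, A, B, x=[1, 2], lora_alpha=2, r=1))
```

With `lora_alpha=28` and `r=64`, the values used in the configuration in these notes, the scale works out to 28 / 64 = 0.4375.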

In [57]:
from transformers import AutoTokenizer
from datasets import load_dataset

# Only the chat model's tokenizer is needed here, because it carries the chat template.
template_tokenizer = AutoTokenizer.from_pretrained('TinyLlama/TinyLlama-1.1B-Chat-v1.0')

def format_prompt(example):
    """Render one conversation into a single string using the chat template."""
    chat = example['messages']
    prompt = template_tokenizer.apply_chat_template(chat, tokenize=False)
    return {'text': prompt}

# Shuffle the SFT test split and keep 3,000 examples.
dataset = load_dataset('HuggingFaceH4/ultrachat_200k', split='test_sft').shuffle(seed=42).select(range(3_000))

dataset = dataset.map(format_prompt)

print(dataset['text'][0])
print(type(dataset['text'][0]))

<|user|>
Write a compelling mystery story set in a vineyard, where a seasoned detective investigates a murder with twists and turns that will keep the reader engaged until the very end. Add complex characters, multiple suspects, and red herrings to create suspense and challenge the detective's deductive reasoning. Use vivid descriptive language to paint a picture of the vineyard setting, its wine-making process, and the people who live and work there. Make sure to reveal clues and motives gradually, and create a satisfying resolution that ties up all loose ends.</s>
<|assistant|>
Detective Jameson had been called to the vineyard on the outskirts of town to investigate a murder. The sun was setting, casting long shadows over the grape vines, and the air was heavy with the sweet scent of fermented grapes. The body was lying in the middle of the vineyard, surrounded by broken grape vines and a scattering of grapes. The victim's throat had been slashed, and there were bruises around the ne
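
The layout of the printed sample can be reproduced by hand. The function below is a simplified approximation inferred from the output above (role tag, newline, content, end-of-sequence token, newline). It is not the tokenizer's actual template, which is stored with the tokenizer and should be used for real data. The name `render_chat` is hypothetical.

```python
def render_chat(messages, eos_token="</s>"):
    """Hand-written approximation of the chat layout seen in the sample above."""
    parts = []
    for message in messages:
        parts.append(f"<|{message['role']}|>\n{message['content']}{eos_token}\n")
    return "".join(parts)


chat = [
    {"role": "user", "content": "Hi"},
    {"role": "assistant", "content": "Hello"},
]
print(render_chat(chat))
```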

In [53]:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

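# Fine-tune the base (pretrained) checkpoint, not the chat model.
model_name = 'TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T'

# 4-bit NF4 quantization with double quantization; computation runs in float16.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type='nf4',
    bnb_4bit_compute_dtype='float16',
    bnb_4bit_use_double_quant=True,
)

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map='auto',
    quantization_config=bnb_config,
)
# The generation cache is not needed during training.
model.config.use_cache = False
model.config.pretraining_tp = 1

tokenizer = AutoTokenizer.from_pretrained(model_name)
# Set an explicit padding token and pad on the left.
tokenizer.pad_token = '<PAD>'
tokenizer.padding_side = 'left'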


In [59]:
inputs = tokenizer(dataset['text'][0])
print(inputs)

# The current library version kept raising errors here, so tokenize the dataset manually first
prepared_dataset = dataset.map(lambda x: tokenizer(x['text']))
prepared_dataset

{'input_ids': [1, 529, 29989, 1792, 29989, 29958, 13, 6113, 263, 752, 7807, 29236, 5828, 731, 297, 263, 325, 457, 19852, 29892, 988, 263, 4259, 287, 6459, 573, 7405, 1078, 263, 13406, 411, 3252, 2879, 322, 12169, 393, 674, 3013, 278, 9591, 17785, 2745, 278, 1407, 1095, 29889, 3462, 4280, 4890, 29892, 2999, 12326, 29879, 29892, 322, 2654, 902, 29878, 886, 304, 1653, 8872, 1947, 322, 18766, 278, 6459, 573, 29915, 29879, 28262, 5313, 573, 24481, 29889, 4803, 325, 3640, 29037, 573, 4086, 304, 10675, 263, 7623, 310, 278, 325, 457, 19852, 4444, 29892, 967, 19006, 29899, 28990, 1889, 29892, 322, 278, 2305, 1058, 5735, 322, 664, 727, 29889, 8561, 1854, 304, 10320, 284, 1067, 1041, 322, 3184, 3145, 22020, 29892, 322, 1653, 263, 24064, 10104, 393, 260, 583, 701, 599, 23819, 10614, 29889, 2, 29871, 13, 29966, 29989, 465, 22137, 29989, 29958, 13, 6362, 522, 573, 5011, 265, 750, 1063, 2000, 304, 278, 325, 457, 19852, 373, 278, 714, 808, 381, 1372, 310, 4726, 304, 23033, 263, 13406, 29889, 450, 6575

Dataset({
    features: ['prompt', 'prompt_id', 'messages', 'text', 'input_ids', 'attention_mask'],
    num_rows: 3000
})

In [54]:
from peft import LoraConfig, prepare_model_for_kbit_training, get_peft_model

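peft_config = LoraConfig(
    # The LoRA update is scaled by lora_alpha / r; note that 28 is below the "twice r" rule of thumb.
    lora_alpha=28,
    lora_dropout=0.1,
    r=64,
    bias='none',
    task_type='CAUSAL_LM',
    # Apply LoRA to the attention and MLP projection layers.
    target_modules=['k_proj', 'q_proj', 'v_proj', 'o_proj', 'gate_proj', 'up_proj', 'down_proj'],
)
# Prepare the quantized model for training, then wrap it with the LoRA adapters.
model = prepare_model_for_kbit_training(model)
model = get_peft_model(model, peft_config)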

In [55]:
from trl import SFTConfig

training_args = SFTConfig(
    output_dir='models/lora_01',
    # 2 examples per step, accumulated over 4 steps: 8 examples per optimizer update on one device.
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    optim='paged_adamw_32bit',
    learning_rate=2e-4,
    lr_scheduler_type='cosine',
    num_train_epochs=1,
    logging_steps=10,
    fp16=True,
    # Recompute activations in the backward pass to save memory.
    gradient_checkpointing=True,
    max_seq_length=512,
    dataset_text_field='text',
    remove_unused_columns=False,
)

# training_args.dataset_kwargs = {"skip_prepare_dataset": True}
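
The batch settings above determine how many optimizer updates one epoch takes. The arithmetic below is a rough sketch that assumes a single device and that no examples are dropped. The helper name `optimizer_steps` is made up, and the trainer's own rounding may differ slightly.

```python
import math


def optimizer_steps(num_examples, per_device_batch_size, grad_accum_steps,
                    num_devices=1, epochs=1):
    """Return (effective batch size, approximate number of optimizer updates)."""
    effective_batch = per_device_batch_size * grad_accum_steps * num_devices
    steps = math.ceil(num_examples / effective_batch) * epochs
    return effective_batch, steps


effective_batch, steps = optimizer_steps(
    num_examples=3_000, per_device_batch_size=2, grad_accum_steps=4,
)
print(effective_batch, steps)
```

With `logging_steps=10`, the loss is reported once every 10 of these updates.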

In [48]:
dataset = dataset.remove_columns('prompt')

In [60]:
from trl import SFTTrainer


trainer = SFTTrainer(
    model=model,
    train_dataset=prepared_dataset,
    args=training_args,
    peft_config=peft_config,
)
trainer.train()
trainer.model.save_pretrained('models/TinyLlama-1.1B-qlora')


No label_names provided for model class `PeftModelForCausalLM`. Since `PeftModel` hides base models input arguments, if label_names is not given, label_names can't be set automatically within `Trainer`. Note that empty label_names list will be used instead.


Step,Training Loss
10,1.6762
20,1.4772
30,1.4503
40,1.4858
50,1.4729
60,1.3883
70,1.4939
80,1.4445
90,1.4268
100,1.4011
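
One way to read these numbers, assuming the logged loss is the mean per-token cross-entropy in nats, is to convert it to perplexity with `exp(loss)`. The sketch below does that for the values logged above. This is training loss, not a held-out evaluation, so it gives only a rough sense of progress.

```python
import math

# Training losses copied from the log above (steps 10 through 100).
logged_losses = [1.6762, 1.4772, 1.4503, 1.4858, 1.4729,
                 1.3883, 1.4939, 1.4445, 1.4268, 1.4011]

perplexities = [math.exp(loss) for loss in logged_losses]
for step, (loss, ppl) in enumerate(zip(logged_losses, perplexities), start=1):
    print(f"step {step * 10:>3}: loss={loss:.4f}  perplexity={ppl:.2f}")
```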


In [61]:
from peft import AutoPeftModelForCausalLM
from transformers import pipeline


model = AutoPeftModelForCausalLM.from_pretrained(
    'models/TinyLlama-1.1B-qlora',
    device_map='auto',
    low_cpu_mem_usage=True,
)

merged_model = model.merge_and_unload()

pipe = pipeline('text-generation', model=merged_model, tokenizer=tokenizer)
prompt = '''<|user|>
Write a compelling story about an adventure in dnd world. </s>
<|assistant|>
'''
print(pipe(prompt)[0]['generated_text'])

Device set to use cuda:0


<|user|>
Write a compelling story about an adventure in dnd world. </s>
<|assistant|>
It was a warm summer day, and I was sitting on the shady porch of my friend's house, enjoying a few moments of relaxation. But then, I heard a strange sound, and I quickly turned around to see a mysterious creature running through the grass. It was too late to run, but I managed to catch it before it could get very far.

The creature was a large reptile, with a long tail and sharp claws. It was a vicious creature, and I was afraid for my life. But, as I watched it run away, it was suddenly caught in a vine, and I quickly grabbed it.

As I was trying to free it from the vine, I noticed that it was covered in a thick layer of dirt and grime. The creature was trembling and had a look of desperation on its face. I didn't know what to do, so I decided to take it home with me.

As I approached my house, I was surprised to find that the creature had grown into a large snake, with a long, slimy tail and teeth