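## Supervised Fine-Tuning

The previous chapter introduced the basic concepts of distributed training, the main distributed strategies, and how DeepSpeed accelerates model training. This chapter covers supervised fine-tuning (SFT): starting from an already pretrained model, the model is further fine-tuned on labeled, task-specific data so that it acquires the ability to follow instructions.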


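### Prompt Learning and In-Context Learning

#### Prompt Learning

Unlike traditional supervised learning, prompt learning directly uses a language model pretrained on large amounts of raw text and, by defining a prompt function, lets the model perform few-shot or even zero-shot learning.

The prompt learning workflow is simple and can be described in three stages: prompt addition, answer search, and answer mapping: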

<img src="./images/提示学习.png" style="zoom:60%;" /> 

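Concretely, the raw input `x` is passed through a template to build a prompt, which is then fed to the language model. The language model fills the open slots in the template probabilistically, and the final predicted label is obtained from the model's output.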
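
The three stages can be sketched in a few lines of Python. This is only a toy illustration: the template, the verbalizer words, and `toy_lm_score` are hypothetical stand-ins, and a real system would replace `toy_lm_score` with the probability a pretrained language model assigns to each candidate filler.

```python
# Toy illustration of the three prompt-learning stages.
# TEMPLATE, VERBALIZER and toy_lm_score are hypothetical, not part of any library.

TEMPLATE = "{x} Overall, it was a [Z] movie."
# Answer mapping table: each candidate filler word maps to a task label.
VERBALIZER = {"great": "positive", "terrible": "negative"}


def add_prompt(x: str) -> str:
    """Stage 1 (prompt addition): wrap the raw input in the template."""
    return TEMPLATE.format(x=x)


def toy_lm_score(prompt: str, filler: str) -> float:
    """Stand-in for P(filler | prompt) from a real language model."""
    cues = {"great": ["loved", "enjoyed"], "terrible": ["boring", "hated"]}
    text = prompt.lower()
    return float(sum(text.count(cue) for cue in cues[filler]))


def search_answer(prompt: str) -> str:
    """Stage 2 (answer search): pick the filler that scores highest."""
    return max(VERBALIZER, key=lambda z: toy_lm_score(prompt, z))


def predict(x: str) -> str:
    """Stage 3 (answer mapping): map the chosen filler to a label."""
    return VERBALIZER[search_answer(add_prompt(x))]


print(predict("I loved every minute of it."))
print(predict("The plot was boring and I hated the ending."))
```

Keeping the verbalizer separate from the template makes it easy to try different label words without touching the rest of the pipeline.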

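#### In-Context Learning

In-context learning (ICL) means the model can learn from a few examples placed in its context: the model is given several concrete examples of a task together with the test sample, and it continues the text to produce the answer for the test sample based on the given examples. In-context learning can be viewed as a subclass of prompt learning. Its key idea is learning by analogy; the whole process requires no parameter updates, only forward inference.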

<img src="./images/语境学习.png" style="zoom:60%;" /> 
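
As a minimal sketch of how such a few-shot prompt might be assembled, the helper below concatenates labeled demonstrations and leaves the answer slot of the test sample empty for the model to continue. The function name, the `Review:`/`Sentiment:` format, and the example texts are assumptions made for illustration.

```python
def build_icl_prompt(demos, query):
    """Concatenate labeled demonstrations and leave the query's answer blank."""
    blocks = [f"Review: {text}\nSentiment: {label}" for text, label in demos]
    # The test sample uses the same format, but its label is left for the model.
    blocks.append(f"Review: {query}\nSentiment:")
    return "\n\n".join(blocks)


demos = [
    ("A delightful film.", "positive"),
    ("A waste of two hours.", "negative"),
]
prompt = build_icl_prompt(demos, "Surprisingly moving.")
print(prompt)
```

No gradient step is involved: the demonstrations only change the input text, which is why the same frozen model can be reused across tasks.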

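### Efficient Model Fine-Tuning

Because large models have an enormous number of parameters, fine-tuning all of them for a downstream task requires substantial compute. To reduce cost, researchers have proposed a range of parameter-efficient fine-tuning methods that adapt the model to downstream tasks by training only a small number of parameters. This section focuses on LoRA, which reduces both the number of trainable parameters and GPU memory usage while yielding a model whose performance is comparable to full fine-tuning.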

#### LoRA

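The authors of LoRA argue that projecting the parameter update onto a much smaller subspace does not hurt learning. They therefore propose keeping the pretrained parameters fixed and adding, as a bypass alongside the original weight matrix, a product of low-rank matrices as the trainable parameters that model the change in the weights. The computation flow of LoRA is shown below: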

<img src="./images/LoRA结构图.png" style="zoom:60%;" /> 

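Concretely, suppose the pretrained weight is $W_0 \in \mathbb{R}^{d \times k}$. The update is parameterized as $\Delta W = BA$ with $B \in \mathbb{R}^{d \times r}$, $A \in \mathbb{R}^{r \times k}$ and $r \ll \min(d, k)$, so the number of trainable parameters drops from $d \times k$ to $d \times r + r \times k$. The forward pass becomes: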

$$
h = W_0 x + \Delta W x = W_0 x + B A x
$$

For a model trained with LoRA, the original weights and the trained update can be merged, i.e. $W = W_0 + BA$, so inference incurs no extra overhead.
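
The equivalence between the bypass form and the merged form can be checked numerically. The NumPy sketch below uses small illustrative sizes (`d`, `k`, `r` are arbitrary choices) and random values for `B` and `A`; in actual LoRA training one of the two factors starts at zero so that $\Delta W = 0$ initially.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r = 64, 48, 4                # illustrative sizes with r much smaller than d and k

W0 = rng.normal(size=(d, k))       # frozen pretrained weight
B = rng.normal(size=(d, r))        # low-rank factor (random only to make the check non-trivial)
A = rng.normal(size=(r, k))        # low-rank factor
x = rng.normal(size=(k,))

h_bypass = W0 @ x + B @ (A @ x)    # training-time form: frozen path plus low-rank bypass
h_merged = (W0 + B @ A) @ x        # inference-time form: one merged matrix

full_params = d * k                # parameters updated by full fine-tuning
lora_params = d * r + r * k        # parameters updated by LoRA

print(np.allclose(h_bypass, h_merged))
print(full_params, lora_params)
```

With these sizes LoRA trains 448 parameters instead of 3072 for this one matrix, and the saving grows as `d` and `k` get larger relative to `r`.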

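The `peft` library provides LoRA along with several other parameter-efficient fine-tuning methods and is compatible with `transformers`. An example: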

In [None]:
from transformers import AutoModelForSeq2SeqLM
from peft import get_peft_config, get_peft_model, LoraConfig, TaskType

model_name_or_path = "bigscience/mt0-large"
tokenizer_name_or_path = "bigscience/mt0-large"

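'''
LoraConfig has two key parameters: r and lora_alpha
r: LoRA attention dimension, i.e. the rank, which controls the dimension of the low-rank matrices in LoRA
lora_alpha: a scaling factor that controls the magnitude of the update matrix ΔW during fine-tuning, i.e. how strongly it affects the original model
'''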

peft_config = LoraConfig(
    task_type=TaskType.SEQ_2_SEQ_LM, inference_mode=False, r=8, lora_alpha=32, lora_dropout=0.1
)

model = AutoModelForSeq2SeqLM.from_pretrained(model_name_or_path)
model = get_peft_model(model, peft_config)

# get_peft_model wraps the base model and returns a PeftModel; with LoRA, the wrapped base_model is a LoraModel

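#### LoRA Variants

1. AdaLoRA

LoRA assigns the same rank to every low-rank matrix, ignoring differences in how important the parameters of different modules and layers are for a given fine-tuning task. AdaLoRA instead adjusts the rank dynamically during fine-tuning according to each weight matrix's importance to the downstream task, further reducing the number of trainable parameters while maintaining or improving performance.

2. QLoRA

QLoRA does not change the logic of LoRA; it quantizes the pretrained model to 4-bit to further reduce the cost of fine-tuning, chiefly GPU memory.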

#### LoRA demo

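The two training scripts below, generated with ChatGPT, compare training a model with and without LoRA.  
The dataset is small, yet the LoRA run was still about twice as fast as the run without LoRA; the gap becomes more pronounced as the parameter count grows.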

In [2]:
# GPU acceleration on Mac (MPS backend)
import torch
print(torch.backends.mps.is_available())

True


In [None]:
# install packages
pip install torch transformers datasets peft accelerate

In [None]:
# Without LoRA
import time
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer, Trainer, TrainingArguments
from datasets import load_dataset

# Use the DistilBERT base model
model_name = "distilbert-base-uncased"
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Load the IMDb dataset (loading it locally kept failing, so this was run in Google Colab)
dataset = load_dataset("imdb")

# Split the dataset into train and eval sets
# Assuming you want to use 1000 samples for training and 500 for evaluation
train_dataset = dataset["train"].shuffle(seed=42).select([i for i in range(1000)])
eval_dataset = dataset["test"].shuffle(seed=42).select([i for i in range(500)]) 

def preprocess_function(examples):
    return tokenizer(examples["text"], truncation=True, padding="max_length", max_length=128)

encoded_train_dataset = train_dataset.map(preprocess_function, batched=True)
encoded_eval_dataset = eval_dataset.map(preprocess_function, batched=True) # Preprocess the eval dataset

# Set the training arguments
training_args = TrainingArguments(
    output_dir='./results_no_lora',
    evaluation_strategy='epoch',
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    num_train_epochs=10,
    weight_decay=0.01,
)

# Create the Trainer, providing both train_dataset and eval_dataset
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=encoded_train_dataset,
    eval_dataset=encoded_eval_dataset, # Pass the evaluation dataset
    tokenizer=tokenizer,
)

# Record the training start time
start_time = time.time()

# Start training
trainer.train()

# Record the training end time
end_time = time.time()

# Print the training duration
training_time = end_time - start_time
print(f"Training completed in: {training_time // 60:.0f} minutes and {training_time % 60:.0f} seconds")

model.save_pretrained("./distilbert_no_lora")

In [None]:
import time
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer, Trainer, TrainingArguments
from datasets import load_dataset
from peft import get_peft_model, LoraConfig, TaskType

# Use the DistilBERT base model
model_name = "distilbert-base-uncased"
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# LoRA configuration
lora_config = LoraConfig(
    task_type=TaskType.SEQ_CLS,  # sequence classification task
    inference_mode=False,
    r=8,  # rank
    lora_alpha=16,
    lora_dropout=0.1,
    target_modules=['q_lin', 'k_lin']  # target modules to adapt
)

# Apply the LoRA configuration
model = get_peft_model(model, lora_config)

# Load the IMDb dataset
dataset = load_dataset("imdb")

# Split the dataset into train and eval sets
train_dataset = dataset["train"].shuffle(seed=42).select([i for i in range(1000)])
eval_dataset = dataset["test"].shuffle(seed=42).select([i for i in range(500)]) 

# Preprocessing function
def preprocess_function(examples):
    return tokenizer(examples["text"], truncation=True, padding="max_length", max_length=128)

encoded_train_dataset = train_dataset.map(preprocess_function, batched=True)
encoded_eval_dataset = eval_dataset.map(preprocess_function, batched=True)

# Set the training arguments
training_args = TrainingArguments(
    output_dir='./results_with_lora',
    evaluation_strategy='epoch',
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    num_train_epochs=10,
    weight_decay=0.01,
)

# Create the Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=encoded_train_dataset,
    eval_dataset=encoded_eval_dataset,
    tokenizer=tokenizer,
)

# Record the training start time
start_time = time.time()

# Start training
trainer.train()

# Record the training end time
end_time = time.time()

# Print the training duration
training_time = end_time - start_time
print(f"Training completed in: {training_time // 60:.0f} minutes and {training_time % 60:.0f} seconds")

# Save the model
model.save_pretrained("./distilbert_with_lora")

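### Extending the Model Context Window

Demand for long-text modeling with large models keeps growing, and these tasks require the model to handle text that exceeds the usual context window. In long conversations or when summarizing long documents, a conventional context window may fail to capture the global context, leading to information loss or imprecise modeling results.

To better meet long-text requirements, the main approaches to extending a language model's long-text modeling ability are:  
- `Fine-tuning with a larger context window`: the direct approach, fine-tuning an existing pretrained Transformer with a longer context window to fit long-text modeling needs
- `Positional encoding`: improved positional encodings that achieve some degree of length extrapolation, so the model can be trained on a short context window and run inference on a longer one
- `Interpolation`: positions beyond the context window are compressed into the pretrained context window by interpolating the positional encoding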
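
The interpolation idea from the last item can be sketched as a linear rescaling of position indices. This is a simplified illustration: `interpolate_positions` is a hypothetical helper, the lengths 2048 and 8192 are example values, and real implementations usually apply the scaling inside the positional encoding itself (for example, to the rotary embedding angles).

```python
def interpolate_positions(seq_len: int, train_len: int) -> list[float]:
    """Linearly rescale positions so they stay inside the trained range [0, train_len)."""
    # Sequences that already fit in the trained window are left unchanged.
    scale = min(1.0, train_len / seq_len)
    return [i * scale for i in range(seq_len)]


# An 8192-token input mapped into a window that was trained with 2048 positions.
positions = interpolate_positions(seq_len=8192, train_len=2048)
print(positions[:5])
print(max(positions))
```

Every position now falls between two positions seen during pretraining, which is the sense in which the longer input is compressed into the original window.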

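### Building Instruction Data

#### Manually Constructed Instructions

Building instructions manually is straightforward: collect a large amount of question-answer data from the web and filter it by hand, or have annotators write prompts and the corresponding responses directly. LIMA, for example, samples instruction data from several sources, including high-quality online Q&A communities as well as prompts and responses written by hand by annotators. This data still needs further processing afterwards.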


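#### Automatically Constructed Instructions

Manually building instruction data is expensive and labor-intensive, so more efficient alternatives are needed. One example is Self-Instruct, which uses a large model's own generation ability to produce instructions automatically. Its data generation process is an iterative bootstrapping algorithm with four steps: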

<img src="./images/Self-Instruct.png" style="zoom:60%;" /> 
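
The bootstrapping loop can be sketched as follows, assuming the four steps are instruction generation, classification-task identification, instance generation, and filtering, as described in the Self-Instruct paper. All `fake_llm_*` functions are hypothetical stand-ins for calls to a real language model, and `difflib.SequenceMatcher` replaces the ROUGE-L overlap check used in the paper. For brevity the filter here runs before instance generation.

```python
import random
from difflib import SequenceMatcher


def fake_llm_generate(examples, rng):
    """Step 1 stand-in: a real LLM would be prompted with `examples`."""
    verbs = ["Summarize", "Translate", "Classify", "Rewrite", "Explain"]
    objects = ["this paragraph", "the sentence", "the review", "this email"]
    return f"{rng.choice(verbs)} {rng.choice(objects)}."


def is_classification(instruction):
    """Step 2 stand-in: a real pipeline asks the LLM; here a keyword check."""
    return instruction.lower().startswith("classify")


def fake_llm_instance(instruction, classification):
    """Step 3 stand-in: produce an (input, output) pair for the instruction."""
    output = "<label>" if classification else "<free-form answer>"
    return {"instruction": instruction, "input": "<input text>", "output": output}


def too_similar(candidate, pool, threshold=0.7):
    """Step 4: reject candidates that nearly duplicate an existing instruction."""
    return any(
        SequenceMatcher(None, candidate, task["instruction"]).ratio() > threshold
        for task in pool
    )


def self_instruct(seed_tasks, rounds=20, seed=0):
    rng = random.Random(seed)
    pool = list(seed_tasks)
    for _ in range(rounds):
        examples = rng.sample(pool, k=min(3, len(pool)))  # in-context examples
        instruction = fake_llm_generate(examples, rng)
        if too_similar(instruction, pool):
            continue
        task = fake_llm_instance(instruction, is_classification(instruction))
        pool.append(task)  # accepted tasks feed later rounds
    return pool


seeds = [{"instruction": "Write a haiku about autumn.", "input": "", "output": "..."}]
pool = self_instruct(seeds)
print(len(pool), "tasks in the pool")
```

The key property is that accepted tasks are added back to the pool, so later rounds can sample them as in-context examples.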

#### Open-Source Instruction Datasets

A collection of open-source instruction datasets:

<img src="./images/dataset.png" style="zoom:60%;" /> 

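### DeepSpeed-Chat SFT in Practice

Tests of some of the methods involved.

#### SFT

Before applying SFT, test the base model first. The original plan was to test the output of opt-1.3b on Google Colab; note that the cells below load a local opt-350m checkpoint.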

In [12]:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("./DeepSpeed-Chat/download/facebook/opt-350m/")
model = AutoModelForCausalLM.from_pretrained("./DeepSpeed-Chat/download/facebook/opt-350m/")

In [5]:
# get_accelerator

from deepspeed import get_accelerator

accelerator = get_accelerator()
device_name = accelerator.device_name()
print(device_name)

mps


##### Loading the tokenizer

In [9]:
import os
import json
from transformers import AutoTokenizer

def get_tokenizer(model_name_or_path, fast_tokenizer=True):
    print(model_name_or_path)
    tokenizer = AutoTokenizer.from_pretrained(
        model_name_or_path, fast_tokenizer=fast_tokenizer)
    tokenizer.pad_token = tokenizer.eos_token
    # make sure tokenizer is right pad in our logic
    tokenizer.padding_side = 'right'
    return tokenizer

def load_hf_tokenizer(model_name_or_path,
                      fast_tokenizer=True,
                      add_special_tokens=None):
    if os.path.exists(model_name_or_path):
        # Locally tokenizer loading has some issue, so we need to force download
        model_json = os.path.join(model_name_or_path, "config.json")
        if os.path.exists(model_json):
            model_json_file = json.load(open(model_json))
            model_name = model_json_file.get("_name_or_path",
                                             model_name_or_path)
            # Pass the local directory here rather than model_name
            tokenizer = get_tokenizer(model_name_or_path,
                                      fast_tokenizer=fast_tokenizer)
    else:
        tokenizer = get_tokenizer(model_name_or_path,
                                  fast_tokenizer=fast_tokenizer)

    if add_special_tokens is not None:
        add_special_tokens = [add_special_tokens] if isinstance(add_special_tokens, str) \
            else add_special_tokens
        tokenizer.add_special_tokens(
            {'additional_special_tokens': add_special_tokens})

    return tokenizer

In [11]:
model_path = './DeepSpeed-Chat/download/facebook/opt-350m/'
tokenizer = load_hf_tokenizer(model_path)
print(tokenizer)

./DeepSpeed-Chat/download/facebook/opt-350m/
GPT2TokenizerFast(name_or_path='./DeepSpeed-Chat/download/facebook/opt-350m/', vocab_size=50265, model_max_length=1000000000000000019884624838656, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'bos_token': '</s>', 'eos_token': '</s>', 'unk_token': '</s>', 'pad_token': '</s>'}, clean_up_tokenization_spaces=False),  added_tokens_decoder={
	1: AddedToken("<pad>", rstrip=False, lstrip=False, single_word=False, normalized=True, special=True),
	2: AddedToken("</s>", rstrip=False, lstrip=False, single_word=False, normalized=True, special=True),
}


##### LoRA implementation details

In [10]:
# Load the model
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("./DeepSpeed-Chat/download/facebook/opt-350m/")
print(model)

OPTForCausalLM(
  (model): OPTModel(
    (decoder): OPTDecoder(
      (embed_tokens): Embedding(50272, 512, padding_idx=1)
      (embed_positions): OPTLearnedPositionalEmbedding(2050, 1024)
      (project_out): Linear(in_features=1024, out_features=512, bias=False)
      (project_in): Linear(in_features=512, out_features=1024, bias=False)
      (layers): ModuleList(
        (0-23): 24 x OPTDecoderLayer(
          (self_attn): OPTSdpaAttention(
            (k_proj): Linear(in_features=1024, out_features=1024, bias=True)
            (v_proj): Linear(in_features=1024, out_features=1024, bias=True)
            (q_proj): Linear(in_features=1024, out_features=1024, bias=True)
            (out_proj): Linear(in_features=1024, out_features=1024, bias=True)
          )
          (activation_fn): ReLU()
          (self_attn_layer_norm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
          (fc1): Linear(in_features=1024, out_features=4096, bias=True)
          (fc2): Linear(in_features

In [31]:
import math
import torch
from torch import nn
import torch.nn.functional as F  # F.linear lives in torch.nn.functional

class LinearLayer_LoRA(nn.Module):
    # an simple implementation of LoRA
    # for now only support Linear Layer
    def __init__(self,
                 weight,
                 lora_dim=0,
                 lora_scaling=1,
                 lora_droppout=0,
                 bias=None):
        super(LinearLayer_LoRA, self).__init__()
        self.weight = weight
        self.bias = bias

        if lora_dim <= 0:
            raise ValueError(
                "You are training to use LoRA, whose reduced dim should be larger than 1"
            )

        try:
            # for zero stage 3
            rows, columns = weight.ds_shape
        except:
            rows, columns = weight.shape
        self.lora_right_weight = nn.Parameter(torch.zeros(
            columns,
            lora_dim))  # apply transpose so in forward we do not need to
        self.lora_left_weight = nn.Parameter(torch.zeros(lora_dim, rows))
        self.lora_scaling = lora_scaling / lora_dim

        if lora_droppout > 0:
            self.lora_dropout = nn.Dropout(lora_droppout)
        else:
            self.lora_dropout = nn.Identity()

        self.reset_parameters()
        # disable the original weight gradient
        self.weight.requires_grad = False
        # fuse LoRA to the original weight
        self.fuse_lora = False

    def eval(self):
        self.lora_dropout.eval()

    #   self.fuse_lora_weight()

    def train(self, mode=True):
        self.lora_dropout.train(mode)
        # self.unfuse_lora_weight()

    def reset_parameters(self):
        nn.init.kaiming_uniform_(self.lora_right_weight, a=math.sqrt(5))
        nn.init.zeros_(self.lora_left_weight)

    def fuse_lora_weight(self):
        if not self.fuse_lora:
            self.weight.data += self.lora_scaling * torch.matmul(
                self.lora_left_weight.t(), self.lora_right_weight.t())
        self.fuse_lora = True

    def unfuse_lora_weight(self):
        if self.fuse_lora:
            self.weight.data -= self.lora_scaling * torch.matmul(
                self.lora_left_weight.t(), self.lora_right_weight.t())
        self.fuse_lora = False

    def forward(self, input):
        if self.fuse_lora:
            return F.linear(input, self.weight, self.bias)
        else:
            return F.linear(
                input, self.weight,
                self.bias) + (self.lora_dropout(input) @ self.lora_right_weight
                              @ self.lora_left_weight) * self.lora_scaling

In [32]:
from torch import nn
from deepspeed.compression.helper import recursive_getattr, recursive_setattr

def convert_linear_layer_to_lora(model,
                                 part_module_name,
                                 lora_dim=0,
                                 lora_scaling=1,
                                 lora_droppout=0):
    replace_name = []
    for name, module in model.named_modules():
        if isinstance(module, nn.Linear) and part_module_name in name:
            replace_name.append(name)
    print(replace_name)
    for name in replace_name:
        module = recursive_getattr(model, name)
        tmp = LinearLayer_LoRA(
            module.weight, lora_dim, lora_scaling, lora_droppout,
            module.bias).to(module.weight.device).to(module.weight.dtype)
        recursive_setattr(model, name, tmp)
    return model

In [33]:
model = convert_linear_layer_to_lora(model=model, part_module_name='decoder.layers.',lora_dim=128)
print(model)

['model.decoder.layers.0.self_attn.k_proj', 'model.decoder.layers.0.self_attn.v_proj', 'model.decoder.layers.0.self_attn.q_proj', 'model.decoder.layers.0.self_attn.out_proj', 'model.decoder.layers.0.fc1', 'model.decoder.layers.0.fc2', 'model.decoder.layers.1.self_attn.k_proj', 'model.decoder.layers.1.self_attn.v_proj', 'model.decoder.layers.1.self_attn.q_proj', 'model.decoder.layers.1.self_attn.out_proj', 'model.decoder.layers.1.fc1', 'model.decoder.layers.1.fc2', 'model.decoder.layers.2.self_attn.k_proj', 'model.decoder.layers.2.self_attn.v_proj', 'model.decoder.layers.2.self_attn.q_proj', 'model.decoder.layers.2.self_attn.out_proj', 'model.decoder.layers.2.fc1', 'model.decoder.layers.2.fc2', 'model.decoder.layers.3.self_attn.k_proj', 'model.decoder.layers.3.self_attn.v_proj', 'model.decoder.layers.3.self_attn.q_proj', 'model.decoder.layers.3.self_attn.out_proj', 'model.decoder.layers.3.fc1', 'model.decoder.layers.3.fc2', 'model.decoder.layers.4.self_attn.k_proj', 'model.decoder.layer
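
To get a feel for what `lora_dim=128` means for this model, the weight and LoRA parameter counts of the replaced layers can be worked out by hand. This is a back-of-the-envelope sketch: biases are ignored, and the shape of `fc2` is assumed to be 4096 to 1024 because the printed model summary above is cut off before that line.

```python
# (out_features, in_features) of the Linear layers in one OPT decoder layer.
# The fc2 shape is an assumption; the other shapes come from the printed model.
layer_shapes = {
    "k_proj": (1024, 1024),
    "v_proj": (1024, 1024),
    "q_proj": (1024, 1024),
    "out_proj": (1024, 1024),
    "fc1": (4096, 1024),
    "fc2": (1024, 4096),
}
num_layers, lora_dim = 24, 128

# Frozen weight matrices: rows * cols each.
frozen = sum(rows * cols for rows, cols in layer_shapes.values()) * num_layers
# LoRA adds a (cols x lora_dim) and a (lora_dim x rows) matrix per layer.
trainable = sum(lora_dim * (rows + cols) for rows, cols in layer_shapes.values()) * num_layers

print(f"frozen weights:  {frozen:,}")
print(f"LoRA parameters: {trainable:,}")
print(f"ratio:           {trainable / frozen:.4f}")
```

Under these assumptions the LoRA matrices amount to roughly 19% of the frozen weights they sit beside; a smaller `lora_dim` shrinks that share proportionally.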

##### Dataset

In [13]:
# Creates the prompt dataset 

# Initialize variables
local_rank = -1
data_path = './DeepSpeed-Chat/download/Dahoas/rm-static'
data_split = '2,4,4'
output_path = './tmp/data_files/'
train_phase = '1'  # not important here
seed = 123  # arbitrary value
tokenizer = tokenizer
max_seq_length = 512
end_of_conversation_token="<|endoftext|>"
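
The `data_split` string is read here as the relative share of the data given to the three training phases, which is how DeepSpeed-Chat documents it. The helper below is a hypothetical sketch of turning such a string into row counts; it is not the library's own splitting code, and the row count 10000 is an example value.

```python
def split_sizes(data_split: str, num_rows: int) -> list[int]:
    """Turn a proportion string such as '2,4,4' into per-phase row counts."""
    parts = [float(p) for p in data_split.split(",")]
    total = sum(parts)
    # Cumulative boundaries keep the counts summing exactly to num_rows.
    bounds = [0]
    running = 0.0
    for p in parts:
        running += p / total
        bounds.append(int(round(running * num_rows)))
    return [hi - lo for lo, hi in zip(bounds, bounds[1:])]


sizes = split_sizes("2,4,4", 10000)
print(sizes)
```

With `'2,4,4'`, the first phase (SFT) would see 20% of the rows and the two later phases 40% each.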

In [14]:
import os
import hashlib
import torch

def create_prompt_dataset(local_rank,
                          data_path,
                          data_split,
                          output_path,
                          train_phase,
                          seed,
                          tokenizer,
                          max_seq_len,
                          end_of_conversation_token="<|endoftext|>",
                          sft_only_data_path=[],
                          reload=False):
    """
    Creates the prompt dataset
    """
    os.makedirs(output_path, exist_ok=True)
    fname = "_".join(data_path)
    sft_cache_key = "_".join(sft_only_data_path)
    tokenizer_name = tokenizer.init_kwargs["name_or_path"].replace("/", "_")
    fname = f"{fname}_split{data_split}_phase{train_phase}_seed{seed}_tokenizer{tokenizer_name}_seqlen{max_seq_len}_sft{sft_cache_key}"
    fname = "_".join(fname.split("/"))
    fname = hashlib.sha256(fname.encode()).hexdigest(
    )  # hash the file name to avoid too long file name
    train_fname = f"{output_path}/traindata_{fname}.pt"
    eval_fname = f"{output_path}/evaldata_{fname}.pt"
    
    print(train_fname)
    print(eval_fname)

In [15]:
create_prompt_dataset(local_rank, data_path, data_split, output_path, train_phase, seed, tokenizer, max_seq_len=max_seq_length)

./tmp/data_files//traindata_b8c90ef6e7ecd4d64f21e401eac09054cf9d084abc799cf0124886c853c3468f.pt
./tmp/data_files//evaldata_b8c90ef6e7ecd4d64f21e401eac09054cf9d084abc799cf0124886c853c3468f.pt


In [16]:
# Load the data
from datasets import load_from_disk, load_dataset
raw_datasets = load_dataset(data_path)
print(raw_datasets)
print(raw_datasets['train'][:5])

DatasetDict({
    train: Dataset({
        features: ['prompt', 'response', 'chosen', 'rejected'],
        num_rows: 76256
    })
    test: Dataset({
        features: ['prompt', 'response', 'chosen', 'rejected'],
        num_rows: 5103
    })
})
{'prompt': ['\n\nHuman: Can you describe the steps to clean fingerprints and smudges from a laptop screen\n\nAssistant: Yes, certainly. To clean your screen, you first need to use a microfiber cloth or soft, damp cloth to gently wipe down the surface of the screen. Next, you’ll want to grab a soft, lint-free, microfiber cleaning cloth and gently rub it back and forth across the screen to remove fingerprints and smudges.\n\nHuman: Can I spray isopropyl alcohol onto the cloth and clean it that way?\n\nAssistant:', '\n\nHuman: What are some foods that are good for diabetics?\n\nAssistant: To be honest, some of these are better than others, and they’re a little more like opinions than facts. For example, many of the diets say to limit vegetables w

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


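#### Testing the original model

Test the raw output of the base model (opt-1.3b in the original plan; the cells below load opt-350m) so that it can be compared with the output after training.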

In [2]:
# Test 1: proper generation requires beam search or greedy search
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("./DeepSpeed-Chat/download/facebook/opt-350m/")
model = AutoModelForCausalLM.from_pretrained("./DeepSpeed-Chat/download/facebook/opt-350m/")
text = "What are we having for dinner"
outputs = model(torch.tensor([tokenizer.encode(text)]))
logits = outputs.logits
predicted_ids = torch.argmax(logits, dim=-1)

predicted_text = tokenizer.decode(predicted_ids[0], skip_special_tokens=True)
print(predicted_text)


's you looking for dinner tonight
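
The cell above takes the argmax of the logits at every input position in a single forward pass, so each printed token is the model's guess for the token that follows that prefix rather than a continuation of the whole sentence. Greedy decoding instead appends the best next token and feeds the extended sequence back in. The sketch below shows that loop with a hypothetical lookup table in place of a real model; with `transformers`, `model.generate(input_ids, max_new_tokens=20, do_sample=False)` performs greedy decoding, and setting `num_beams` above 1 switches to beam search.

```python
# Hypothetical toy "model": maps the last token to scores for the next token.
NEXT_TOKEN_SCORES = {
    "<s>": {"what": 0.9, "pizza": 0.1},
    "what": {"about": 0.6, "pizza": 0.4},
    "about": {"pizza": 0.7, "</s>": 0.3},
    "pizza": {"</s>": 0.8, "about": 0.2},
}


def greedy_decode(prompt_tokens, max_new_tokens=10, eos="</s>"):
    """Append the highest-scoring next token until EOS or the length limit."""
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        scores = NEXT_TOKEN_SCORES[tokens[-1]]
        next_token = max(scores, key=scores.get)
        if next_token == eos:
            break
        tokens.append(next_token)  # the new token becomes part of the next input
    return tokens


decoded = greedy_decode(["<s>", "what"])
print(decoded)
```

A real language model conditions on the full prefix rather than only the last token, but the control flow of the loop is the same.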


In [4]:
# huggingface
from transformers import pipeline

generator = pipeline('text-generation', model=model, tokenizer=tokenizer)
generator("What are we having for dinner?")

Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. Model will be on CPU.


[{'generated_text': "What are we having for dinner?\nI'm having a steak and a salad.\nI"}]

In [2]:
import pandas as pd