# 8-QLoRA

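QLoRA (Quantized Low-Rank Adaptation of Large Language Models) is a parameter-efficient fine-tuning technique that builds on LoRA. It keeps LoRA's parameter efficiency and uses quantization to cut memory use further, so that very large models can be fine-tuned on consumer-grade hardware.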

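Large models have so many parameters (for example, a 65B-parameter model) that even with LoRA the full-precision base model still occupies a large amount of GPU memory. QLoRA quantizes the pretrained base model to 4 bits and freezes it, inserts trainable low-rank decomposition matrices A and B into the model, and trains only those matrices. This greatly improves memory efficiency and lowers the hardware requirements, while keeping fine-tuning quality close to that of LoRA.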

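Concretely, QLoRA freezes the weights $W_q$ of the 4-bit quantized pretrained model and injects trainable rank-decomposition matrices A and B. During fine-tuning only the down-projection matrix A and the up-projection matrix B are trained; the quantized base model is never updated. After fine-tuning, the product BA is added to the dequantized base weight W to adapt the model to the target task.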

![images](./images/qlora.png)

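The base model is quantized to 4 bits in the NF4 (NormalFloat4) format, which is designed for the approximately normal distribution of large-model weights, and double quantization (quantizing the quantization constants themselves) reduces the memory footprint further. The low-rank matrix A is initialized from a random Gaussian distribution and B is initialized to zero, so the bypass BA is the zero matrix at the start of training and does not disturb the original output of the quantized base model.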
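
To make the block-wise 4-bit idea concrete, here is a minimal pure-Python sketch. It is a simplification and not the bitsandbytes NF4 kernel: the 16-level codebook is built from evenly spaced quantiles of a standard normal distribution and mirrored around zero, whereas the real NF4 table is asymmetric and contains an exact zero. The helper names `make_codebook`, `quantize_block`, and `dequantize_block` are hypothetical.

```python
from statistics import NormalDist
import random

def make_codebook():
    """16 levels in [-1, 1] taken from quantiles of N(0, 1) (simplified, symmetric)."""
    nd = NormalDist()
    half = [nd.inv_cdf(0.5 + (i + 0.5) / 16) for i in range(8)]  # 8 positive quantiles
    top = half[-1]
    return [-h / top for h in reversed(half)] + [h / top for h in half]

def quantize_block(block, codebook):
    """Return (absmax scale, list of 4-bit indices) for one block of weights."""
    absmax = max(abs(w) for w in block) or 1.0  # avoid division by zero
    idx = [min(range(16), key=lambda j: abs(codebook[j] - w / absmax)) for w in block]
    return absmax, idx

def dequantize_block(absmax, idx, codebook):
    """Map the indices back to floats and undo the absmax scaling."""
    return [codebook[j] * absmax for j in idx]

random.seed(0)
codebook = make_codebook()
weights = [random.gauss(0.0, 0.02) for _ in range(64)]  # one block of 64 weights
absmax, idx = quantize_block(weights, codebook)
restored = dequantize_block(absmax, idx, codebook)
max_err = max(abs(a - b) for a, b in zip(weights, restored))
print(f"absmax={absmax:.5f}, max abs error={max_err:.5f}")
```

In this sketch each weight costs 4 bits plus a share of one scale per block; double quantization would compress those per-block scales in turn.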

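Suppose the output of the quantized pretrained backbone is $W_q x$ (the weights are dequantized on the fly to a 16-bit compute precision such as FP16 or BF16 to preserve accuracy; the code in this notebook uses BF16). During QLoRA fine-tuning the output can be written as follows.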

$$h=W_q x + \Delta Wx = W_q x + BA x=(W_q + BA)x$$

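Here $B \in \mathbb{R}^{d \times r}$ and $A \in \mathbb{R}^{r \times k}$, where $r$ is the rank of the QLoRA low-rank matrices and $r \ll \min(d, k)$. $W_q$ is the 4-bit quantized pretrained base weight; it is stored at far lower precision than the FP16/BF16 weights used by plain LoRA, which greatly reduces memory use.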
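
The identity above and the parameter saving can be checked numerically. The following is a small pure-Python sketch with made-up matrices (all values are illustrative, and `W` simply stands in for the dequantized $W_q$); `matvec`, `matmul`, and `lora_param_count` are hypothetical helper names.

```python
def matvec(M, v):
    """Multiply matrix M (list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def matmul(P, Q):
    """Multiply two matrices given as lists of rows."""
    cols = list(zip(*Q))
    return [[sum(p * q for p, q in zip(row, col)) for col in cols] for row in P]

def lora_param_count(d, k, r):
    """Trainable parameters of B (d x r) plus A (r x k)."""
    return r * (d + k)

W = [[1.0, 2.0], [0.0, 1.0], [3.0, -1.0]]  # stand-in for dequantized W_q, d=3, k=2
A = [[0.5, -0.5]]                          # r x k with r=1
B = [[2.0], [0.0], [1.0]]                  # d x r
x = [4.0, 2.0]

bypass = matvec(B, matvec(A, x))           # B(Ax), the form used during training
h_split = [a + b for a, b in zip(matvec(W, x), bypass)]
BA = matmul(B, A)
merged = [[w + d for w, d in zip(rw, rd)] for rw, rd in zip(W, BA)]
h_merged = matvec(merged, x)               # (W + BA)x, the form after merging

B_zero = [[0.0], [0.0], [0.0]]             # zero-initialized B
h_init = [a + b for a, b in zip(matvec(W, x), matvec(B_zero, matvec(A, x)))]

print(h_split, h_merged, h_init)
print(lora_param_count(512, 512, 16), 512 * 512)
```

With B set to zero the output equals $Wx$, which is why training starts from the unmodified base model. For a 512 × 512 layer with rank 16, the adapter trains 16384 parameters against 262144 in the frozen weight.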


In [2]:
import torch
from torch import optim, nn
import os
import platform
import argparse
import random
import time
import math
import warnings
import torch.distributed as dist
from contextlib import nullcontext
from torch.utils.data import DataLoader, DistributedSampler
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
from model.model import MiniMindLM
from model.LMConfig import LMConfig
from model.dataset import SFTDataset
import bitsandbytes as bnb
from bitsandbytes.nn import Linear4bit, Params4bit

warnings.filterwarnings('ignore')

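### Optional parameter settings

First, let's look at the optional training parameters. In actual use they are read from the command line; here we wrap them in a class.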

In [3]:
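class args:
    epochs: int = 5 # number of training epochs (fine-tuning continues from an earlier checkpoint)
    batch_size: int = 2 # the toy dataset has only two samples, so the batch size is 2
    learning_rate: float = 5e-4 # learning rate
    device: str = 'cuda' if torch.cuda.is_available() else 'cpu'
    dtype: str = 'bfloat16' # 16-bit float: 1 sign bit + 8 exponent bits + 7 mantissa bits
    # use_wandb: bool = False # whether to use wandb (not used here)
    wandb_project: str = 'MiniMind-Notebook'
    num_workers: int = 1 # number of data-loading worker processes
    # ddp: bool = False # single-node multi-GPU
    accumulation_steps: int = 1 # gradient accumulation steps
    grad_clip: float = 1.0 # gradient clipping threshold
    warmup_iters: int = 0 # learning-rate warmup iterations
    log_interval: int = 1 # log every step, for observation only
    local_rank: int = 1 # device index
    dim: int = 512 # embedding dimension (model hyperparameter)
    n_layers: int = 2 # number of MiniMind blocks (model hyperparameter)
    max_seq_len: int = 512 # maximum sequence length
    use_moe: bool = False # whether to enable mixture of experts
    data_path: str = './toydata/lora_data.jsonl' # dataset path
    save_dir: str = "./output"  # directory for saved models
    save_weight: str = "minimind_qlora"  # checkpoint file prefix
    save_interval: int = 1  # save every N steps, 0 disables saving; this notebook only demonstrates training (saving is optional but recommended)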
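
For reference, reading a few of these options from the command line could look like the sketch below. This is an assumption about how such a script might be wired up, not the project's actual parser; the flag names simply mirror the class fields above, and `build_parser` is a hypothetical helper.

```python
import argparse

def build_parser():
    """Build a parser whose flags mirror some fields of the `args` class."""
    parser = argparse.ArgumentParser(description="QLoRA fine-tuning options (sketch)")
    parser.add_argument("--epochs", type=int, default=5)
    parser.add_argument("--batch_size", type=int, default=2)
    parser.add_argument("--learning_rate", type=float, default=5e-4)
    parser.add_argument("--accumulation_steps", type=int, default=1)
    parser.add_argument("--grad_clip", type=float, default=1.0)
    parser.add_argument("--data_path", type=str, default="./toydata/lora_data.jsonl")
    parser.add_argument("--save_dir", type=str, default="./output")
    return parser

# Passing an explicit list keeps the example runnable inside a notebook,
# where sys.argv contains Jupyter's own arguments.
cli_args = build_parser().parse_args(["--epochs", "3", "--learning_rate", "1e-4"])
print(cli_args.epochs, cli_args.batch_size, cli_args.learning_rate)
```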

In [4]:
print(f'Working device: {args.device}')

Working device: cuda


## QLoRA Adapter

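The QLoRA matrices have the same structure as the LoRA matrices: a fully connected network with one hidden layer, attached alongside the backbone and updated during training. Only the initialization precision needs to change so that it fits 4-bit computation. Let's see how a QLoRA network is defined outside the MiniMind backbone.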

In [5]:
class QLoRA(nn.Module):
    def __init__(self, in_features, out_features, rank):
        super().__init__()
        self.rank = rank
        # Store the QLoRA matrices in bfloat16 to save memory
        self.A = nn.Linear(in_features, rank, bias=False, dtype=torch.bfloat16)
        self.B = nn.Linear(rank, out_features, bias=False, dtype=torch.bfloat16)
        self.A.weight.data.normal_(mean=0.0, std=0.02)
        self.B.weight.data.zero_()

    def forward(self, x):
        # Cast the input to bfloat16 to match the adapter weights and the 4-bit compute dtype
        x = x.to(torch.bfloat16)
        return self.B(self.A(x))

As you can see, the QLoRA network is simple and intuitive. Next we define a function that applies it to specific linear layers of the MiniMind model.

In [6]:
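# Attach QLoRA modules to the model's linear layers; this only changes the structure, no training happens here
def apply_qlora(model, rank=16):
    for name, module in model.named_modules():
        # Look for Linear4bit layers and attach only to square ones
        if isinstance(module, bnb.nn.Linear4bit):
            # For 4-bit layers, compare out_features with in_features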
            if module.out_features == module.in_features:
                qlora = QLoRA(
                    module.in_features, 
                    module.out_features, 
                    rank=rank
                ).to(args.device)
                
                setattr(module, 'qlora', qlora)
                original_forward = module.forward

                def forward_with_qlora(x, layer1=original_forward, layer2=qlora):
                    x = x.to(args.device)
                    out1 = layer1(x).to(torch.bfloat16)
                    out2 = layer2(x).to(torch.bfloat16)
                    return (out1 + out2).to(torch.bfloat16)
                    
                module.forward = forward_with_qlora
                print(f'apply qlora on module: {name}')
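
`forward_with_qlora` receives `original_forward` and `qlora` through default arguments (`layer1=...`, `layer2=...`). This matters because a Python closure created inside a loop looks up loop variables when it is called, not when it is defined. A small standalone sketch of the difference, using plain functions instead of modules (`make_late_bound` and `make_default_bound` are hypothetical names):

```python
def make_late_bound():
    funcs = []
    for scale in (1, 2, 3):
        # `scale` is looked up when the function is called, after the loop has ended
        funcs.append(lambda x: x * scale)
    return funcs

def make_default_bound():
    funcs = []
    for scale in (1, 2, 3):
        # the default argument freezes the current value of `scale`
        funcs.append(lambda x, scale=scale: x * scale)
    return funcs

late = [f(10) for f in make_late_bound()]
bound = [f(10) for f in make_default_bound()]
print(late)   # [30, 30, 30]
print(bound)  # [10, 20, 30]
```

Without the default-argument binding, every patched layer would end up calling the forward function and adapter of the last layer visited by the loop.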

As with LoRA, we can declare a small model to test the QLoRA attachment.

In [7]:
class TestModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear1 = bnb.nn.Linear4bit(64, 512)
        self.linear2 = bnb.nn.Linear4bit(512, 512)
        self.linear3 = bnb.nn.Linear4bit(512, 64)

    @property
    def device(self):
        return next(self.parameters()).device
    
    def forward(self, x):
        out = self.linear3(self.linear2(self.linear1(x)))
        return out

Following the logic of apply_qlora, QLoRA modules are attached to backbone modules that satisfy in_features == out_features.

In [8]:
test_model = TestModel()
apply_qlora(test_model)
print(test_model)

apply qlora on module: linear2
TestModel(
  (linear1): Linear4bit(in_features=64, out_features=512, bias=True)
  (linear2): Linear4bit(
    in_features=512, out_features=512, bias=True
    (qlora): QLoRA(
      (A): Linear(in_features=512, out_features=16, bias=False)
      (B): Linear(in_features=16, out_features=512, bias=False)
    )
  )
  (linear3): Linear4bit(in_features=512, out_features=64, bias=True)
)


In [9]:
del test_model

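With the QLoRA modules attached to specific backbone modules, we can freeze the backbone and fine-tune. Because the backbone weights are never updated during training, we can save only the QLoRA parameters to reduce storage. The functions below load and save the QLoRA weights.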

In [10]:
def load_qlora(model, path):
    state_dict = torch.load(path, map_location=model.device)
    for name, module in model.named_modules():
        if hasattr(module, 'qlora'):
            qlora_state = {k.replace(f'{name}.qlora.', ''): v for k, v in state_dict.items() if f'{name}.qlora.' in k}
            module.qlora.load_state_dict(qlora_state)

def save_qlora(model, path):
    state_dict = {}
    for name, module in model.named_modules():
        if hasattr(module, 'qlora'):
            qlora_state = {f'{name}.qlora.{k}': v for k, v in module.qlora.state_dict().items()}
            state_dict.update(qlora_state)
    torch.save(state_dict, path)
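
The key handling in these two functions can be illustrated with plain dictionaries. The sketch below uses string placeholders instead of tensors, and it selects keys with `startswith` rather than the substring test used above, which is a slightly stricter choice of mine; `extract_adapter_state` is a hypothetical helper.

```python
def extract_adapter_state(state_dict, module_name):
    """Select the entries of one module's adapter and strip the module prefix."""
    prefix = f"{module_name}.qlora."
    return {k[len(prefix):]: v for k, v in state_dict.items() if k.startswith(prefix)}

# Stand-in for a saved checkpoint; real values would be tensors.
saved = {
    "layers.0.attention.wq.qlora.A.weight": "A0",
    "layers.0.attention.wq.qlora.B.weight": "B0",
    "layers.1.attention.wq.qlora.A.weight": "A1",
    "layers.1.attention.wq.qlora.B.weight": "B1",
}

layer0 = extract_adapter_state(saved, "layers.0.attention.wq")
layer1 = extract_adapter_state(saved, "layers.1.attention.wq")
print(layer0)
print(layer1)
```

The stripped keys (`A.weight`, `B.weight`) are what the adapter's own `load_state_dict` expects.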

## Fine-Tuning MiniMind with QLoRA

Next, we initialize the tokenizer, MiniMindLM, and the data iterator.

In [11]:
def init_model(lm_config):
    tokenizer = AutoTokenizer.from_pretrained('./model/minimind_tokenizer')
    
    # Load the model on the CPU (float32)
    model = MiniMindLM(lm_config)
    moe_path = '_moe' if lm_config.use_moe else ''
    ckp = f'./output/minimind_sft_{lm_config.dim}{moe_path}.pth'
    state_dict = torch.load(ckp, map_location='cpu')
    model.load_state_dict(state_dict, strict=False)
    
    # Replace nn.Linear with bnb.nn.Linear4bit
    replace_linear_with_4bit(model)
    model = model.to(args.device)  # move to the GPU
    
    # Freeze all parameters
    for param in model.parameters():
        param.requires_grad = False

    return model, tokenizer

# 4-bit replacement: swap every nn.Linear in the model for bnb.nn.Linear4bit, carrying over weight and bias
def replace_linear_with_4bit(model):
    # Collect the modules to replace
    replace_list = []
    for name, module in model.named_modules():
        if isinstance(module, nn.Linear):
            replace_list.append((name, module))
    
    for name, module in replace_list:
        # Resolve the parent module and the attribute name
        parts = name.split('.')
        parent = model
        for part in parts[:-1]:
            parent = getattr(parent, part)
        attr_name = parts[-1]
        
        # Create the Linear4bit layer
        has_bias = module.bias is not None
        quant_linear = Linear4bit(
            module.in_features,
            module.out_features,
            bias=has_bias,  # keep the original bias setting
            quant_type='nf4',  # 4-bit quantization type
            compute_dtype=torch.bfloat16  # run 4-bit matmuls in bfloat16
        )
        
        # Wrap the weight in Params4bit; it is quantized automatically when moved to CUDA
        quant_linear.weight = Params4bit(
            module.weight.data,  # reuse the original weight data
            requires_grad=False,  # the weight is never updated
            quant_type='nf4',  # 4-bit quantization type
            compress_statistics=True  # double quantization
        )
        
        # If the original layer has a bias, copy it to the new quantized layer
        if has_bias:
            quant_linear.bias = nn.Parameter(module.bias.data, requires_grad=False)
        
        # Replace the original linear layer with the quantized one
        setattr(parent, attr_name, quant_linear)
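
The parent lookup in `replace_linear_with_4bit` is ordinary attribute traversal over a dotted name. Here is a standalone sketch on a toy object tree built from `SimpleNamespace` (the tree and the helper name `resolve_parent` are illustrative assumptions, not part of the model code):

```python
from types import SimpleNamespace

def resolve_parent(root, dotted_name):
    """Return (parent object, attribute name) for a dotted path such as 'a.b.c'."""
    parts = dotted_name.split(".")
    parent = root
    for part in parts[:-1]:
        parent = getattr(parent, part)
    return parent, parts[-1]

# Toy object tree standing in for a model with nested submodules.
model_like = SimpleNamespace(
    attention=SimpleNamespace(wq="old_linear", wo="old_linear"),
    output="old_linear",
)

parent, attr = resolve_parent(model_like, "attention.wq")
setattr(parent, attr, "quantized_linear")  # swap the layer in place
print(model_like.attention.wq, model_like.attention.wo)
```

Collecting the modules into a list first, as the function above does, avoids modifying the module tree while `named_modules()` is still iterating over it.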

In [12]:
lm_config = LMConfig(dim=args.dim, n_layers=args.n_layers, max_seq_len=args.max_seq_len, use_moe=args.use_moe)
model, tokenizer = init_model(lm_config)
apply_qlora(model)

train_ds = SFTDataset(args.data_path, tokenizer, max_length=lm_config.max_seq_len)
train_loader = DataLoader(
    train_ds,
    batch_size=args.batch_size,
    pin_memory=True,
    drop_last=False,
    shuffle=False,
    num_workers=args.num_workers,
)

apply qlora on module: layers.0.attention.wq
apply qlora on module: layers.0.attention.wo
apply qlora on module: layers.1.attention.wq
apply qlora on module: layers.1.attention.wo


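As shown, the QLoRA modules are attached to the Query and Output linear layers of the attention blocks. Now let's check the share of trainable parameters under QLoRA fine-tuning: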

In [13]:
total_params = sum(p.numel() for p in model.parameters())
qlora_params_count = sum(p.numel() for name, p in model.named_parameters() if 'qlora' in name)
print(f"Total LLM parameters: {total_params}")
print(f"QLoRA parameters: {qlora_params_count}")
print(f"QLoRA parameter share: {qlora_params_count / total_params * 100:.2f}%")

Total LLM parameters: 7801344
QLoRA parameters: 65536
QLoRA parameter share: 0.84%
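
The adapter count can be reproduced by hand from the four attached layers and the default rank of 16. The sketch below is plain arithmetic; the total is copied from the printed output rather than recomputed, and `numel()` on a quantized parameter may count packed storage elements rather than logical weights, so treat the percentage as indicative.

```python
dim, rank = 512, 16
adapted_layers = [
    "layers.0.attention.wq", "layers.0.attention.wo",
    "layers.1.attention.wq", "layers.1.attention.wo",
]

per_layer = dim * rank + rank * dim   # A is rank x dim, B is dim x rank
qlora_total = per_layer * len(adapted_layers)
reported_total = 7801344              # total printed by the cell above
share = f"{qlora_total / reported_total * 100:.2f}%"
print(per_layer, qlora_total, share)
```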


Next, freeze the parameters of the MiniMindLM backbone to prepare for QLoRA fine-tuning.

In [14]:
qlora_params = []
for name, param in model.named_parameters():
    if 'qlora' in name:
        param.requires_grad = True
        qlora_params.append(param)
    else:
        param.requires_grad = False

### Start training

Next, we define the optimizer, loss function, and learning-rate schedule for MiniMind QLoRA fine-tuning and run a short training session.

In [15]:
# Learning-rate schedule (cosine decay with a floor of lr / 10)
def get_lr(current_step, total_steps, lr):
    return lr / 10 + 0.5 * lr * (1 + math.cos(math.pi * current_step / total_steps))

# QLoRA optimizer: PagedAdamW (paged optimizer states to save GPU memory)
optimizer = bnb.optim.PagedAdamW(
    qlora_params,
    lr=args.learning_rate,
    betas=(0.9, 0.999),  # default AdamW betas
    weight_decay=0.01,  # weight decay
    optim_bits=8  # 8-bit optimizer states to save memory
)

# Mixed-precision configuration
device_type = "cuda" if "cuda" in args.device else "cpu"
if args.dtype == 'bfloat16':
    amp_dtype = torch.bfloat16
elif args.dtype == 'float16':
    amp_dtype = torch.float16
else:
    amp_dtype = torch.bfloat16
autocast_ctx = nullcontext() if device_type == "cpu" else torch.amp.autocast(device_type='cuda', dtype=amp_dtype)
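
The schedule can be evaluated on its own to see its shape. The sketch below repeats the `get_lr` formula and plugs in the settings from `args`, assuming one batch per epoch as with the two-sample toy dataset; the variable names `base_lr`, `iters`, and `schedule` are mine.

```python
import math

def get_lr(current_step, total_steps, lr):
    # same formula as in the cell above: a floor of lr / 10 plus a cosine term
    return lr / 10 + 0.5 * lr * (1 + math.cos(math.pi * current_step / total_steps))

base_lr, epochs, iters = 5e-4, 5, 1  # values from `args` and a one-batch loader
schedule = [get_lr(epoch * iters + 1, epochs * iters, base_lr) for epoch in range(epochs)]
for epoch, lr in enumerate(schedule, start=1):
    print(f"epoch {epoch}: lr={lr:.8f}")
```

The rate starts near 1.1 times the base value and decays to one tenth of it at the final step.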

Now let's look at the training function.

In [16]:
def train_epoch(epoch, loader, iters, lora_params, start_step=0, wandb=None):
    start_time = time.time()
    for step, (X, Y, loss_mask) in enumerate(loader, start=start_step + 1):
        X = X.to(args.device)
        Y = Y.to(args.device)
        loss_mask = loss_mask.to(args.device)
        lr = get_lr(epoch * iters + step, args.epochs * iters, args.learning_rate)

        for param_group in optimizer.param_groups:
            param_group['lr'] = lr

        with autocast_ctx:
            res = model(input_ids=X)
            logits = res.logits

            # Shift logits, labels and loss_mask in sync: drop the last logit position and the first
            # label/mask position so no future token is predicted (this differs from pretraining)
            shift_logits = logits[..., :-1, :].contiguous()  # [batch, seq_len-1, vocab]
            shift_labels = Y[..., 1:].contiguous()      # [batch, seq_len-1]
            shift_loss_mask = loss_mask[..., 1:].contiguous()  # [batch, seq_len-1]

            loss_fct = nn.CrossEntropyLoss(reduction='none')
            raw_loss = loss_fct(
                shift_logits.reshape(-1, shift_logits.size(-1)),
                shift_labels.reshape(-1)
            )
            shift_loss_mask = shift_loss_mask.reshape(-1)
            masked_loss = (raw_loss * shift_loss_mask).sum() / (shift_loss_mask.sum() + 1e-8)
            total_loss = masked_loss + (res.aux_loss if res.aux_loss is not None else 0.0)
            # Gradient accumulation: divide the loss by the number of accumulation steps, which emulates a larger batch size under limited memory
            total_loss = total_loss / args.accumulation_steps

        total_loss.backward()

        if (step + 1) % args.accumulation_steps == 0:
            torch.nn.utils.clip_grad_norm_(lora_params, args.grad_clip)
            optimizer.step()
            optimizer.zero_grad(set_to_none=True)

        if step % args.log_interval == 0 or step == iters - 1:
            spend_time = time.time() - start_time
            current_loss = total_loss.item() * args.accumulation_steps
            current_aux_loss = res.aux_loss if res.aux_loss is not None else 0.0
            current_logits_loss = current_loss - current_aux_loss
            current_lr = optimizer.param_groups[-1]['lr']
            eta_min = spend_time / (step + 1) * iters // 60 - spend_time // 60
            print(
                f'Epoch:[{epoch + 1}/{args.epochs}]({step}/{iters}), ' 
                f'loss: {current_loss:.4f}, '
                f'logits_loss: {current_logits_loss:.4f}, '
                f'aux_loss: {current_aux_loss:.4f}, '
                f'lr: {current_lr:.8f}, '
                f'epoch_time: {eta_min:.1f}min'
            )
            if wandb: 
                wandb.log({
                    "loss": current_loss, 
                    "logits_loss": current_logits_loss, 
                    "aux_loss": current_aux_loss, 
                    "learning_rate": current_lr, 
                    "epoch_time": eta_min
                })

        if args.save_interval > 0 and (step % args.save_interval == 0 or step == iters - 1):
            if not dist.is_initialized() or dist.get_rank() == 0:
                os.makedirs(args.save_dir, exist_ok=True)
                model.eval()
                lora_save_path = f'{args.save_dir}/{args.save_weight}_{lm_config.dim}.pth'
                save_qlora(model, lora_save_path)
                print(f'Model saved to: {lora_save_path}')
                model.train()

        del X, Y, loss_mask, res, total_loss


Next, we run the training loop for `args.epochs` epochs (5 here) and watch the loss.

In [17]:
iter_per_epoch = len(train_loader)
for epoch in range(args.epochs):
    train_epoch(epoch, train_loader, iter_per_epoch, qlora_params)
print('QLoRA training finished!')

Epoch:[1/5](1/1), loss: 8.4772, logits_loss: 8.4772, aux_loss: 0.0000, lr: 0.00050225, epoch_time: 0.0min
Model saved to: ./output/minimind_qlora_512.pth
Epoch:[2/5](1/1), loss: 8.4673, logits_loss: 8.4673, aux_loss: 0.0000, lr: 0.00037725, epoch_time: 0.0min
Model saved to: ./output/minimind_qlora_512.pth
Epoch:[3/5](1/1), loss: 8.4525, logits_loss: 8.4525, aux_loss: 0.0000, lr: 0.00022275, epoch_time: 0.0min
Model saved to: ./output/minimind_qlora_512.pth
Epoch:[4/5](1/1), loss: 8.4407, logits_loss: 8.4407, aux_loss: 0.0000, lr: 0.00009775, epoch_time: 0.0min
Model saved to: ./output/minimind_qlora_512.pth
Epoch:[5/5](1/1), loss: 8.4357, logits_loss: 8.4357, aux_loss: 0.0000, lr: 0.00005000, epoch_time: 0.0min
Model saved to: ./output/minimind_qlora_512.pth
QLoRA training finished!


In [18]:
del model
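
The masked average and the gradient-accumulation scaling used in `train_epoch` reduce to simple arithmetic. Here is a pure-Python sketch with made-up per-token losses (the values are illustrative and `masked_mean` is a hypothetical helper, not part of the training code):

```python
def masked_mean(token_losses, mask, eps=1e-8):
    """Average the per-token losses over positions where mask == 1."""
    kept = sum(l * m for l, m in zip(token_losses, mask))
    return kept / (sum(mask) + eps)

# Four token positions; only the last two count toward the loss.
token_losses = [2.0, 4.0, 1.0, 3.0]
loss_mask = [0, 0, 1, 1]

loss = masked_mean(token_losses, loss_mask)  # (1.0 + 3.0) / 2
accumulation_steps = 4
scaled = loss / accumulation_steps           # per-micro-batch loss passed to backward()

print(round(loss, 6), round(scaled, 6))
```

Dividing by the accumulation steps makes the summed gradients of several micro-batches match the gradient of their averaged loss.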

## References

- [QLoRA: Efficient Finetuning of Quantized LLMs](https://arxiv.org/pdf/2305.14314)