# Fine-tuning with LoRA and QLoRA
![Fine-tuning steps](image-41.png)
![Types of fine-tuning](image-42.png)

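Non-parametric fine-tuning works by passing task-specific data to the model as context, without training any of the model's parameters. The model uses the supplied context to improve its accuracy and efficiency on the new task. It is a cost-effective way to customize a model with minimal effort and resources.

In-context learning optimizes a model's responses by supplying task-specific examples as input. It works by placing a task description and relevant examples in the prompt as context, which guides the model toward responses that better match the user's preferences.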

![In-context learning](image-43.png)
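
To make the idea concrete, here is a minimal sketch of how a few-shot prompt could be assembled. The helper name `build_few_shot_prompt` and the `Q:`/`A:` layout are illustrative assumptions, not a format required by any particular model.

```python
def build_few_shot_prompt(task_description, examples, query):
    """Assemble a prompt: task description, worked examples, then the new query."""
    lines = [task_description, ""]
    for question, answer in examples:
        lines.append(f"Q: {question}")
        lines.append(f"A: {answer}")
        lines.append("")
    # The final question is left unanswered so the model completes it.
    lines.append(f"Q: {query}")
    lines.append("A:")
    return "\n".join(lines)


examples = [
    ("What is 2 + 3?", "5"),
    ("What is 10 - 4?", "6"),
]
prompt = build_few_shot_prompt(
    "Answer the arithmetic question with a single number.",
    examples,
    "What is 7 + 8?",
)
print(prompt)
```

The resulting string can be passed to the tokenizer like any other prompt. No model weights change, only the input does.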

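Retrieval-augmented generation (RAG) is an AI framework that combines an external knowledge base with a pretrained model to customize the model for a specific task. It works by giving the model task-specific data as external memory. On every query, the knowledge base is searched for relevant information, and the model uses it to generate a response that matches the user's needs. Among non-parametric fine-tuning techniques, RAG is one of the most effective ways to customize a model without changing its parameters.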

![How RAG works](image-44.png)
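
The retrieve-then-prompt loop can be sketched with a toy retriever. This is a simplified illustration under stated assumptions: it ranks documents by bag-of-words cosine similarity, whereas real RAG systems usually use learned embeddings and a vector store. The function names and the sample knowledge base are hypothetical.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase the text and strip basic punctuation from each word."""
    words = (w.strip(".,?!").lower() for w in text.split())
    return [w for w in words if w]


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(query, documents, top_k=1):
    """Return the top_k documents most similar to the query."""
    q = Counter(tokenize(query))
    ranked = sorted(
        documents,
        key=lambda d: cosine(q, Counter(tokenize(d))),
        reverse=True,
    )
    return ranked[:top_k]


def build_rag_prompt(query, documents, top_k=1):
    """Prepend the retrieved passages to the question as context."""
    context = "\n".join(retrieve(query, documents, top_k))
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"


knowledge_base = [
    "Lisbon is the capital of Portugal.",
    "LoRA trains small low-rank matrices while the base weights stay frozen.",
    "The Douro river flows through Porto.",
]
rag_prompt = build_rag_prompt("What is the capital of Portugal?", knowledge_base)
print(rag_prompt)
```

The prompt produced here would then be sent to the unchanged pretrained model, so the knowledge base can be updated without retraining.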


In [7]:
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch

model_name = "deepseek-ai/deepseek-llm-7b-chat"
model = AutoModelForCausalLM.from_pretrained(model_name, device_map = "auto")

param_dtypes = [param.dtype for param in model.parameters()]
print("Parameter dtypes:", param_dtypes)

In [None]:
print(model.get_memory_footprint())

In [None]:
tokenizer = AutoTokenizer.from_pretrained(model_name)
input = tokenizer("Portugal is", return_tensors="pt").to('cuda')

response = model.generate(**input, max_new_tokens = 50)
print(tokenizer.batch_decode(response, skip_special_tokens=True))

In [None]:
import gc
del model
gc.collect()
torch.cuda.empty_cache()

In [None]:
bnb_config = BitsAndBytesConfig(
    load_in_8bit = True
)

In [None]:
model_name = "meta-llama/Meta-Llama-3.1-8B-Instruct"
quantized_model = AutoModelForCausalLM.from_pretrained(model_name,
                    quantization_config=bnb_config,
                    device_map = "auto")

In [None]:
param_dtypes = [param.dtype for param in quantized_model.parameters()]
print("Parameter dtypes:", param_dtypes)

In [None]:
print(quantized_model.get_memory_footprint())
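
As a rough cross-check on the value printed by `get_memory_footprint()`, the weight memory can be estimated as parameter count times bytes per parameter. The sketch below is a back-of-the-envelope calculation that assumes a round figure of 8 billion parameters. The real footprint will differ, because some layers are typically kept in higher precision and quantization stores extra scaling constants.

```python
# Approximate storage cost per parameter for common formats.
BYTES_PER_PARAM = {"float32": 4.0, "float16": 2.0, "int8": 1.0, "nf4": 0.5}


def estimate_weight_memory_gb(num_params, dtype):
    """Estimate weight memory in GB (1 GB = 1e9 bytes), ignoring overhead."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


num_params = 8_000_000_000  # assumed round figure for an "8B" model
for dtype in BYTES_PER_PARAM:
    print(f"{dtype:>8}: ~{estimate_weight_memory_gb(num_params, dtype):.1f} GB")
```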

In [None]:
tokenizer = AutoTokenizer.from_pretrained(model_name)
input = tokenizer("Portugal is", return_tensors="pt").to('cuda')

response = quantized_model.generate(**input, max_new_tokens = 50)
print(tokenizer.batch_decode(response, skip_special_tokens=True))

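# How LoRA (Low-Rank Adaptation) works
LoRA freezes the pretrained weights and represents the weight update as the product of two much smaller, low-rank matrices. These new matrices are injected into the transformer layers (typically the attention projections) and trained on the dataset. Keeping the original pretrained weights frozen helps avoid catastrophic forgetting. The learned update is then combined with the original weights to produce the result.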

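![How LoRA works](image-45.png)

For example, consider a model layer with a weight matrix W of dimensions [m × n], as shown below:

![Pretrained weight matrix](image-46.png)

LoRA adds new weights to the transformer layer in the form of two low-rank matrices, for example A with dimensions [m × k] and B with dimensions [k × n]. Their product A·B has the same shape as W and serves as a low-rank approximation of the update applied to W. This technique is called low-rank adaptation, as shown below:

![Low-rank matrices](image-47.png)

We then fine-tune, which yields updated A and B:

![A and B matrices with updated weights](image-48.png)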

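After fine-tuning, the learned update is merged with the original weights to obtain the weights of the fully LoRA-fine-tuned model:
$$
W' = W + \Delta W = W + A B
$$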

![New weights of the fine-tuned model](image-49.png)
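
The merge step can be checked by hand on a tiny case. The sketch below uses plain Python lists with m = 2, n = 3 and k = 1. The numbers are made up for illustration, and the scaling factor that libraries such as PEFT apply to the update is omitted here.

```python
def matmul(X, Y):
    """Multiply two matrices given as lists of rows."""
    return [
        [sum(X[i][p] * Y[p][j] for p in range(len(Y))) for j in range(len(Y[0]))]
        for i in range(len(X))
    ]


def add(X, Y):
    """Element-wise sum of two matrices of the same shape."""
    return [[x + y for x, y in zip(row_x, row_y)] for row_x, row_y in zip(X, Y)]


# Frozen pretrained weights, shape [m x n] = [2 x 3]
W = [[1.0, 2.0, 3.0],
     [4.0, 5.0, 6.0]]

# Trained low-rank factors with k = 1
A = [[0.5], [-1.0]]    # shape [m x k]
B = [[1.0, 0.0, 2.0]]  # shape [k x n]

delta_W = matmul(A, B)      # rank-1 update with the same shape as W
W_merged = add(W, delta_W)  # [[1.5, 2.0, 4.0], [3.0, 5.0, 4.0]]
print(W_merged)
```

Only the 5 numbers in A and B were trained, yet they change all 6 entries of W. The saving grows quickly as m and n get larger.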

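The result above uses k = 1. So how do we find the right value of k for LoRA? It depends on the following factors:

- **Model size**: for a large model, a small rank is often enough to capture the needed information. For a small model, a higher rank is needed to adapt it.
- **Task complexity**: for a simple task, a small rank is enough to adapt the model to the data, because the model already has strong general understanding. A complex task calls for a higher rank.
- **Compute resources**: the choice of rank depends on the compute available. A higher rank means more trainable parameters, and therefore more training time and compute. A lower rank means fewer trainable parameters and less time and compute.
- **Precision level**: the required precision of the weights also matters when choosing the rank. Higher ranks can represent the weight update more precisely than lower ranks.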
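
The compute trade-off is easy to quantify: for a single [m × n] weight matrix, A and B together hold k·(m + n) trainable values, compared with m·n for the full matrix. The sketch below assumes a 4096 × 4096 layer purely for illustration.

```python
def lora_trainable_params(m, n, k):
    """Trainable values in A [m x k] plus B [k x n]."""
    return k * (m + n)


m, n = 4096, 4096  # assumed layer shape, for illustration only
full = m * n
for k in (1, 4, 16, 64):
    p = lora_trainable_params(m, n, k)
    print(f"k={k:>2}: {p:>9,} trainable params ({100 * p / full:.2f}% of the full matrix)")
```

The count grows linearly with k, so doubling the rank doubles the adapter size.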

In [None]:
# Load the base model (8-bit quantized)
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch

bnb_config = BitsAndBytesConfig(
    load_in_8bit=True,
)

model_name = "meta-llama/Meta-Llama-3.1-8B-Instruct"
quantized_model = AutoModelForCausalLM.from_pretrained(model_name,
                    quantization_config = bnb_config,
                    device_map = "auto")
                    
tokenizer = AutoTokenizer.from_pretrained(model_name)
input = tokenizer("Natalia sold clips to 48 of her friends in April, and then she sold half as \
many clips in May. How many clips did Natalia sell altogether in April and May?", return_tensors="pt").to('cuda')

response = quantized_model.generate(**input, max_new_tokens = 100)
print(tokenizer.batch_decode(response, skip_special_tokens=True))

In [None]:
# Load the training dataset
from datasets import load_dataset

dataset = "openai/gsm8k"
data = load_dataset(dataset, 'main')

tokenizer.pad_token = tokenizer.eos_token
data = data.map(lambda samples: tokenizer(samples["question"], samples["answer"], truncation=True, padding="max_length", max_length=100), batched=True)
train_sample = data["train"].select(range(400))

display(train_sample)

In [None]:
# LoRA configuration
import peft
from peft import LoraConfig

lora_config = LoraConfig(
    r=16,
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.1,
    bias="none",
    task_type="CAUSAL_LM"
)
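
Two numbers follow directly from this configuration. In PEFT the LoRA update is scaled by `lora_alpha / r` by default, and the number of trainable parameters depends on the shapes of the targeted modules. The sketch below assumes 32 decoder layers, a 4096 × 4096 `q_proj` and a 4096 × 1024 `v_proj`. These shapes are assumptions about the 8B model and should be verified against the loaded model's config.

```python
r, lora_alpha = 16, 16
scaling = lora_alpha / r  # factor applied to the low-rank update

# Assumed shapes (in_features, out_features); check them on the real model.
num_layers = 32
proj_shapes = {"q_proj": (4096, 4096), "v_proj": (4096, 1024)}

# Each adapted module adds r * (in_features + out_features) trainable values.
per_layer = sum(r * (i + o) for i, o in proj_shapes.values())
total_trainable = num_layers * per_layer

print(f"scaling = {scaling}")
print(f"trainable LoRA parameters = {total_trainable:,}")
```

After wrapping a model with PEFT, `print_trainable_parameters()` on the PEFT model reports the actual figure.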

In [None]:
# Set the training arguments
from transformers import TrainingArguments
import os

working_dir = './'
output_directory = os.path.join(working_dir, "lora")

training_args = TrainingArguments(
    output_dir = output_directory,
    auto_find_batch_size = True,
    learning_rate = 3e-4,
    num_train_epochs=5
)

In [None]:
# Set up the trainer
import transformers
from trl import SFTTrainer

trainer = SFTTrainer(
    model = quantized_model,
    args = training_args,
    train_dataset = train_sample,
    peft_config = lora_config,
    tokenizer = tokenizer,
    data_collator = transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False)
)

In [None]:
# Train
trainer.train()

In [None]:
# Save the model (only the LoRA adapter weights are written)
model_path = os.path.join(output_directory, "lora_model")
trainer.model.save_pretrained(model_path)

In [None]:
# Load the model
# Use the same directory the adapter was saved to above (./lora/lora_model)
model_path = "./lora/lora_model"

from peft import AutoPeftModelForCausalLM
from transformers import BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_8bit=True,
)




loaded_model = AutoPeftModelForCausalLM.from_pretrained(
                                        model_path,
                                        quantization_config = bnb_config,
                                        device_map = 'auto')
                                        
tokenizer = AutoTokenizer.from_pretrained(model_name)
input = tokenizer("Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May?", return_tensors="pt").to('cuda')

response = loaded_model.generate(**input, max_new_tokens = 100)
print(tokenizer.batch_decode(response, skip_special_tokens=True))

# QLoRA
As the name suggests, QLoRA adds quantization on top of LoRA, as shown below:

![QLoRA overview](image-50.png)

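## NF (NormalFloat)
NormalFloat (NF) is a data type that is information-theoretically optimal for normally distributed values. It uses quantile quantization to ensure that each quantization bin receives the same number of values from the input tensor.

QLoRA uses a special quantization type called 4-bit NormalFloat (NF4), which compresses the model's weights from 32-bit floating point to a 4-bit format. Model weights tend to follow a normal distribution (most values are close to zero). They are first scaled to fit the range [−1, 1] and then quantized to 4 bits.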
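
The scale-then-quantize idea can be sketched with the standard library alone. This is a simplified illustration of quantile quantization, not the exact NF4 codebook used by bitsandbytes: the 16 levels here are taken at the midpoints of equal-probability bins of a standard normal distribution and normalized to [−1, 1]. The function names and sample weights are hypothetical.

```python
from statistics import NormalDist


def normal_quantile_levels(num_bits=4):
    """Build 2**num_bits levels from standard-normal quantiles, scaled to [-1, 1]."""
    n = 2 ** num_bits
    probs = [(i + 0.5) / n for i in range(n)]  # midpoints of equal-mass bins
    quantiles = [NormalDist().inv_cdf(p) for p in probs]
    max_abs = max(abs(q) for q in quantiles)
    return [q / max_abs for q in quantiles]


def quantize(weights, levels):
    """Scale weights by their absolute maximum, then pick the nearest level index."""
    absmax = max(abs(w) for w in weights) or 1.0  # avoid dividing by zero
    codes = [
        min(range(len(levels)), key=lambda i: abs(levels[i] - w / absmax))
        for w in weights
    ]
    return codes, absmax


def dequantize(codes, absmax, levels):
    """Map level indices back to approximate weight values."""
    return [levels[c] * absmax for c in codes]


levels = normal_quantile_levels()
weights = [0.02, -0.15, 0.4, -0.01, 0.07, -0.3]
codes, absmax = quantize(weights, levels)
restored = dequantize(codes, absmax, levels)
print(codes)
print([round(v, 3) for v in restored])
```

Each weight is stored as a 4-bit index plus one shared scale (`absmax`) per block. The levels are packed more densely near zero, where most weights lie.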

## How it works
![How QLoRA works](image-51.png)


In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch

bnb_config = BitsAndBytesConfig(
    load_in_4bit = True,
    bnb_4bit_use_double_quant = True,
    bnb_4bit_quant_type = "nf4",
    bnb_4bit_compute_dtype = torch.bfloat16
)

In [None]:
model_name = "meta-llama/Meta-Llama-3.1-8B-Instruct"
quantized_model = AutoModelForCausalLM.from_pretrained(model_name,
                    quantization_config = bnb_config,
                    device_map = "auto")
print(quantized_model.get_memory_footprint())

In [None]:
param_dtypes = [param.dtype for param in quantized_model.parameters()]
print("Parameter dtypes:", param_dtypes)

In [None]:
tokenizer = AutoTokenizer.from_pretrained(model_name)
input = tokenizer("Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May?", return_tensors="pt").to('cuda')

response = quantized_model.generate(**input, max_new_tokens = 100)
print(tokenizer.batch_decode(response, skip_special_tokens=True))

In [None]:
from datasets import load_dataset
import peft
from peft import LoraConfig
import transformers
from transformers import TrainingArguments
import os
from trl import SFTTrainer

# Preprocess the dataset

dataset = "openai/gsm8k"
data = load_dataset(dataset, 'main')

tokenizer.pad_token = tokenizer.eos_token
data = data.map(lambda samples: tokenizer(samples["question"], samples["answer"], truncation=True, padding="max_length", max_length=100), batched=True)
train_sample = data["train"].select(range(400))

# LoRA configurations

lora_config = LoraConfig(
    r=16,
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.1,
    bias="none",
    task_type="CAUSAL_LM"
)

# Setting the training arguments

working_dir = './'
output_directory = os.path.join(working_dir, "qlora")

training_args = TrainingArguments(
    output_dir = output_directory,
    auto_find_batch_size = True,
    learning_rate = 3e-4,
    num_train_epochs=5
)

# Setting the trainer

trainer = SFTTrainer(
    model = quantized_model,
    args = training_args,
    train_dataset = train_sample,
    peft_config = lora_config,
    tokenizer = tokenizer,
    data_collator = transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False)
)

# Train the model

trainer.train()

In [None]:
# Save the model.
model_path = os.path.join(output_directory, "qlora_model")
trainer.model.save_pretrained(model_path)

In [None]:
#We are going to clean some variables just to avoid memory problems
import gc
import torch
del quantized_model
del trainer
del train_sample
del data
torch.cuda.empty_cache()
gc.collect()

In [None]:
# Use the same directory the adapter was saved to above (./qlora/qlora_model)
model_path = "./qlora/qlora_model"

from peft import AutoPeftModelForCausalLM
from transformers import BitsAndBytesConfig

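# Match the quantization settings used during training
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)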

loaded_model = AutoPeftModelForCausalLM.from_pretrained(
                                        model_path,
                                        quantization_config = bnb_config,
                                        device_map = 'auto')
                                        
tokenizer = AutoTokenizer.from_pretrained(model_name)
input = tokenizer("Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May?", return_tensors="pt").to('cuda')

response = loaded_model.generate(**input, max_new_tokens = 100)
print(tokenizer.batch_decode(response, skip_special_tokens=True))