# Qwen2.5 Coder 32B Instruction Fine-Tuning Tutorial

## Environment Setup

This tutorial runs on a Google Colab NVIDIA L4 instance (24 GB of VRAM). To run it:

1. Click "Runtime"
2. Click "Run all"

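## Tutorial Contents

This tutorial covers the following:

1. [Data Preparation](#Data) - how to prepare and process the training data
2. [Model Training](#Train) - how to train and optimize the model
3. [Model Inference](#Inference) - how to run inference with the trained model
4. [Model Saving](#Save) - how to save the training results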

## Install Dependencies

In [1]:
!pip install unsloth

Collecting unsloth
  Downloading unsloth-2025.1.5-py3-none-any.whl.metadata (60 kB)
Collecting unsloth_zoo>=2025.1.2 (from unsloth)
  Downloading unsloth_zoo-2025.1.3-py3-none-any.whl.metadata (16 kB)
Collecting xformers>=0.0.27.post2 (from unsloth)
  Downloading xformers-0.0.29.post1-cp311-cp311-manylinux_2_28_x86_64.whl.metadata (1.0 kB)
Collecting bitsandbytes (from unsloth)
  Downloading bitsandbytes-0.45.0-py3-none-manylinux_2_24_x86_64.whl.metadata (2.9 kB)
Collecting tyro (from unsloth)
  Downloading tyro-0.9.9-py3-none-any.whl.metadata (9.4 kB)
Collecting datasets>=2.16.0 (from unsloth)
  Downloading datasets-3.2.0-py3-none-any.whl.metadata (20 kB)
Collecting trl!=0.9.0,!=0.9.1,!=0.9.2,!=0.9.3,>=0.7.9 (from unsloth)
  Downloading trl-0.13.0-p

## Model Initialization

In [2]:
from unsloth import FastLanguageModel
import torch

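# Basic configuration
max_seq_length = 2048 # maximum sequence length
dtype = None # None auto-detects the data type
load_in_4bit = True # use 4-bit quantization to reduce memory usage

# List of 4-bit quantized models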
fourbit_models = [
    "unsloth/Meta-Llama-3.1-8B-bnb-4bit",
    "unsloth/Meta-Llama-3.1-70B-bnb-4bit",
    "unsloth/Mistral-Small-Instruct-2409",
    "unsloth/mistral-7b-instruct-v0.3-bnb-4bit",
    "unsloth/Phi-3.5-mini-instruct",
    "unsloth/Phi-3-medium-4k-instruct",
    "unsloth/gemma-2-27b-bnb-4bit",
    "unsloth/Llama-3.2-1B-bnb-4bit",
    "unsloth/Llama-3.2-1B-Instruct-bnb-4bit",
    "unsloth/Llama-3.2-3B-Instruct-bnb-4bit",
]

# List of Qwen-series models
qwen_models = [
    "unsloth/Qwen2.5-Coder-32B-Instruct",
    "unsloth/Qwen2.5-Coder-7B",
    "unsloth/Qwen2.5-14B-Instruct",
    "unsloth/Qwen2.5-7B",
    "unsloth/Qwen2.5-72B-Instruct",
]

# Load the pretrained model and tokenizer
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Qwen2.5-Coder-32B-Instruct", # Qwen2.5 Coder 32B Instruct
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
)

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
==((====))==  Unsloth 2025.1.5: Fast Qwen2 patching. Transformers: 4.47.1.
   \\   /|    GPU: NVIDIA L4. Max memory: 22.168 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.5.1+cu121. CUDA: 8.9. CUDA Toolkit: 12.1. Triton: 3.1.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.29.post1. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!



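## LoRA Adapter Configuration

Add LoRA adapters for parameter-efficient fine-tuning. Only a small fraction of the parameters is trained, typically 1% to 10%; in this run the trainer reports 134,217,728 trainable parameters, well under 1% of a 32B-parameter model.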

In [3]:
model = FastLanguageModel.get_peft_model(
    model,
    r = 16, # LoRA rank; controls the number of trainable parameters
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",], # modules that receive adapters
    lora_alpha = 16, # LoRA scaling factor
    lora_dropout = 0, # LoRA dropout rate
    bias = "none", # do not train bias terms
    use_gradient_checkpointing = "unsloth", # gradient checkpointing to save VRAM
    random_state = 3407, # random seed
    use_rslora = False, # whether to use rank-stabilized LoRA (rsLoRA)
    loftq_config = None, # LoftQ configuration
)

Unsloth 2025.1.5 patched 64 layers with 64 QKV layers, 64 O layers and 64 MLP layers.
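
As a sanity check on the adapter size, the trainable parameter count can be estimated by hand: a rank-`r` adapter on a linear layer with `d_in` inputs and `d_out` outputs adds `r * (d_in + d_out)` weights. The sketch below applies that formula to the seven target modules. The layer shapes are assumptions based on the published Qwen2.5 32B configuration (hidden size 5120, MLP size 27648, 8 key/value heads of dimension 128); they are not stated in this notebook, and `lora_param_count` is a hypothetical helper.

```python
# Assumed Qwen2.5 32B shapes (not stated in this notebook).
HIDDEN = 5120        # model width
KV_DIM = 8 * 128     # key/value heads * head dimension
MLP = 27648          # feed-forward width
LAYERS = 64          # matches "patched 64 layers" in the log above

# (d_in, d_out) of each linear layer that receives an adapter.
TARGET_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, MLP),
    "up_proj": (HIDDEN, MLP),
    "down_proj": (MLP, HIDDEN),
}


def lora_param_count(rank, shapes=TARGET_SHAPES, layers=LAYERS):
    """Count adapter weights: A is (rank x d_in), B is (d_out x rank)."""
    per_layer = sum(rank * (d_in + d_out) for d_in, d_out in shapes.values())
    return per_layer * layers


print(f"{lora_param_count(16):,}")  # prints 134,217,728
```

Under these assumed shapes, `r = 16` gives 134,217,728, which agrees with the trainable parameter count the trainer prints when training starts. Doubling the rank doubles the count.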


<a name="Data"></a>
## Data Preparation

### Chat Format

We use the Qwen-2.5 chat format for conversation-style fine-tuning. The dataset is [Maxime Labonne's FineTome-100k](https://huggingface.co/datasets/mlabonne/FineTome-100k).

An example of the Qwen-2.5 chat format:
```
<|im_start|>system
You are Qwen, created by Alibaba Cloud. You are a helpful assistant.<|im_end|>
<|im_start|>user
What is 2+2?<|im_end|>
<|im_start|>assistant
It's 4.<|im_end|>
```
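
To make the layout concrete, here is a minimal pure-Python sketch that renders a list of role/content messages in this format. `render_chatml` is a hypothetical helper written for illustration; the real template is applied by the tokenizer and may handle cases (such as tool calls) that this sketch ignores. Inserting the default system prompt when none is given is an assumption based on the formatted sample shown later in this notebook.

```python
DEFAULT_SYSTEM = "You are Qwen, created by Alibaba Cloud. You are a helpful assistant."


def render_chatml(messages, add_generation_prompt=False):
    """Render role/content messages as <|im_start|>role ... <|im_end|> blocks."""
    # Assumption: prepend the default system prompt when the conversation has none.
    if not messages or messages[0]["role"] != "system":
        messages = [{"role": "system", "content": DEFAULT_SYSTEM}] + list(messages)
    text = "".join(
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages
    )
    # For inference, leave an open assistant turn for the model to complete.
    if add_generation_prompt:
        text += "<|im_start|>assistant\n"
    return text


print(render_chatml([
    {"role": "user", "content": "What is 2+2?"},
    {"role": "assistant", "content": "It's 4."},
]))
```

With `add_generation_prompt=True` the rendered text ends in an open assistant turn, which is the form used for inference; for training data the flag stays `False`.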

In [4]:
from unsloth.chat_templates import get_chat_template

# Configure the tokenizer to use the qwen-2.5 chat template
tokenizer = get_chat_template(
    tokenizer,
    chat_template = "qwen-2.5",
)

def formatting_prompts_func(examples):
    """Apply the chat template to a batch of conversations.
    Args:
        examples: dict containing a list of conversations
    Returns:
        dict containing the formatted text
    """
    convos = examples["conversations"]
    texts = [tokenizer.apply_chat_template(convo, tokenize = False, add_generation_prompt = False) for convo in convos]
    return { "text" : texts, }

# Load the dataset
from datasets import load_dataset
dataset = load_dataset("mlabonne/FineTome-100k", split = "train")


### Data Format Conversion

Convert the ShareGPT format to the generic Hugging Face format:

Before:
```
{"from": "system", "value": "You are an assistant"}
{"from": "human", "value": "What is 2+2?"}
{"from": "gpt", "value": "It's 4."}
```

After:
```
{"role": "system", "content": "You are an assistant"}
{"role": "user", "content": "What is 2+2?"}
{"role": "assistant", "content": "It's 4."}
```
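
The mapping itself is a simple key and role rename. The sketch below shows it for a single conversation; `sharegpt_to_messages` and `ROLE_MAP` are hypothetical names used only for illustration, and the library's `standardize_sharegpt` may accept more variants than the three roles handled here.

```python
# ShareGPT speaker names mapped to Hugging Face chat roles.
ROLE_MAP = {"system": "system", "human": "user", "gpt": "assistant"}


def sharegpt_to_messages(turns):
    """Convert [{'from': ..., 'value': ...}] turns to [{'role': ..., 'content': ...}]."""
    return [{"role": ROLE_MAP[t["from"]], "content": t["value"]} for t in turns]


conversation = [
    {"from": "system", "value": "You are an assistant"},
    {"from": "human", "value": "What is 2+2?"},
    {"from": "gpt", "value": "It's 4."},
]
for message in sharegpt_to_messages(conversation):
    print(message)
```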

In [5]:
from unsloth.chat_templates import standardize_sharegpt
# Standardize the dataset format, then apply the chat template
dataset = standardize_sharegpt(dataset)
dataset = dataset.map(formatting_prompts_func, batched = True,)


### Inspect a Sample

In [6]:
# Inspect the structure of the conversation at index 5
dataset[5]["conversations"]

[{'content': 'How do astronomers determine the original wavelength of light emitted by a celestial body at rest, which is necessary for measuring its speed using the Doppler effect?',
  'role': 'user'},
 {'content': 'Astronomers make use of the unique spectral fingerprints of elements found in stars. These elements emit and absorb light at specific, known wavelengths, forming an absorption spectrum. By analyzing the light received from distant stars and comparing it to the laboratory-measured spectra of these elements, astronomers can identify the shifts in these wavelengths due to the Doppler effect. The observed shift tells them the extent to which the light has been redshifted or blueshifted, thereby allowing them to calculate the speed of the star along the line of sight relative to Earth.',
  'role': 'assistant'}]

In [7]:
# Inspect the text after the chat template is applied
dataset[5]["text"]

'<|im_start|>system\nYou are Qwen, created by Alibaba Cloud. You are a helpful assistant.<|im_end|>\n<|im_start|>user\nHow do astronomers determine the original wavelength of light emitted by a celestial body at rest, which is necessary for measuring its speed using the Doppler effect?<|im_end|>\n<|im_start|>assistant\nAstronomers make use of the unique spectral fingerprints of elements found in stars. These elements emit and absorb light at specific, known wavelengths, forming an absorption spectrum. By analyzing the light received from distant stars and comparing it to the laboratory-measured spectra of these elements, astronomers can identify the shifts in these wavelengths due to the Doppler effect. The observed shift tells them the extent to which the light has been redshifted or blueshifted, thereby allowing them to calculate the speed of the star along the line of sight relative to Earth.<|im_end|>\n'

<a name="Train"></a>
## Model Training

Training uses `SFTTrainer` from Hugging Face TRL. See the [TRL SFT documentation](https://huggingface.co/docs/trl/sft_trainer) for details.

In [8]:
from trl import SFTTrainer
from transformers import TrainingArguments, DataCollatorForSeq2Seq
from unsloth import is_bfloat16_supported

# Configure the trainer
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=max_seq_length,
    data_collator=DataCollatorForSeq2Seq(tokenizer=tokenizer),
    dataset_num_proc=4,
    packing=False,
    args=TrainingArguments(
        per_device_train_batch_size=1, # batch size per device
        gradient_accumulation_steps=4, # gradient accumulation steps
        warmup_steps=5, # warmup steps
        max_steps=100, # maximum number of training steps
        learning_rate=2e-4, # learning rate
        fp16=not is_bfloat16_supported(), # fall back to fp16 when bf16 is unavailable
        bf16=is_bfloat16_supported(), # use bf16 when supported
        logging_steps=1, # log every step
        optim="paged_adamw_8bit", # optimizer
        weight_decay=0.01, # weight decay
        lr_scheduler_type="linear", # learning-rate scheduler
        seed=3407, # random seed
        output_dir="outputs", # output directory
        report_to="none", # disable external logging tools
    ),
)

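
The combination of `warmup_steps=5`, `max_steps=100`, and `lr_scheduler_type="linear"` describes a learning rate that ramps up linearly for five steps and then decays linearly to zero. The sketch below reproduces that shape in plain Python; `linear_schedule_lr` is a hypothetical helper, and the trainer's exact step alignment may differ by one step.

```python
def linear_schedule_lr(step, base_lr=2e-4, warmup_steps=5, total_steps=100):
    """Learning rate at a given optimizer step (0-based)."""
    if step < warmup_steps:
        # Linear ramp from 0 up to base_lr during warmup.
        return base_lr * step / warmup_steps
    # Linear decay from base_lr down to 0 at total_steps.
    remaining = max(0, total_steps - step)
    return base_lr * remaining / (total_steps - warmup_steps)


for step in (0, 1, 5, 50, 99, 100):
    print(step, f"{linear_schedule_lr(step):.2e}")
```

With only 100 steps, the rate is already falling for 95% of the run, so a longer run would also need `max_steps` raised for the schedule to stretch accordingly.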

### Train on Assistant Responses Only

In [9]:
from unsloth.chat_templates import train_on_responses_only
# Compute the loss only on the assistant's responses
trainer = train_on_responses_only(
    trainer,
    instruction_part = "<|im_start|>user\n",
    response_part = "<|im_start|>assistant\n",
)


### Verify the Training Mask

In [10]:
# Decode the input text
tokenizer.decode(trainer.train_dataset[5]["input_ids"])

'<|im_start|>system\nYou are Qwen, created by Alibaba Cloud. You are a helpful assistant.<|im_end|>\n<|im_start|>user\nHow do astronomers determine the original wavelength of light emitted by a celestial body at rest, which is necessary for measuring its speed using the Doppler effect?<|im_end|>\n<|im_start|>assistant\nAstronomers make use of the unique spectral fingerprints of elements found in stars. These elements emit and absorb light at specific, known wavelengths, forming an absorption spectrum. By analyzing the light received from distant stars and comparing it to the laboratory-measured spectra of these elements, astronomers can identify the shifts in these wavelengths due to the Doppler effect. The observed shift tells them the extent to which the light has been redshifted or blueshifted, thereby allowing them to calculate the speed of the star along the line of sight relative to Earth.<|im_end|>\n'

In [11]:
# Decode the labels, showing masked positions (-100) as spaces
space = tokenizer(" ", add_special_tokens = False).input_ids[0]
tokenizer.decode([space if x == -100 else x for x in trainer.train_dataset[5]["labels"]])

'                                                          \nAstronomers make use of the unique spectral fingerprints of elements found in stars. These elements emit and absorb light at specific, known wavelengths, forming an absorption spectrum. By analyzing the light received from distant stars and comparing it to the laboratory-measured spectra of these elements, astronomers can identify the shifts in these wavelengths due to the Doppler effect. The observed shift tells them the extent to which the light has been redshifted or blueshifted, thereby allowing them to calculate the speed of the star along the line of sight relative to Earth.<|im_end|>\n'
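
The effect shown above (everything before the assistant's reply replaced by the ignore index `-100`) can be sketched on plain token lists. This is an illustration of the idea only, not the library's implementation: `mask_non_response` and `find_all` are hypothetical helpers, and the integer ids stand in for the tokenized marker strings.

```python
IGNORE_INDEX = -100  # label value that the loss function skips


def find_all(seq, pattern):
    """Start indices of every occurrence of pattern inside seq."""
    n, m = len(seq), len(pattern)
    return [i for i in range(n - m + 1) if seq[i:i + m] == pattern]


def mask_non_response(input_ids, instruction_ids, response_ids):
    """Keep labels only for tokens after a response marker, up to the next instruction marker."""
    labels = [IGNORE_INDEX] * len(input_ids)
    instruction_starts = find_all(input_ids, instruction_ids)
    for r in find_all(input_ids, response_ids):
        start = r + len(response_ids)
        later = [i for i in instruction_starts if i >= start]
        end = later[0] if later else len(input_ids)
        labels[start:end] = input_ids[start:end]
    return labels


# Toy ids: [10, 11] plays the user marker, [20, 21] the assistant marker.
ids = [1, 2, 10, 11, 5, 6, 20, 21, 7, 8, 9, 10, 11, 3, 20, 21, 4]
print(mask_non_response(ids, [10, 11], [20, 21]))
```

Only the tokens of the two assistant turns (`7, 8, 9` and `4`) keep their labels, so the system prompt and the user turns contribute nothing to the loss.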

### Show Memory Usage

In [12]:
# Query GPU properties and the memory reserved so far
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")

GPU = NVIDIA L4. Max memory = 22.168 GB.
18.957 GB of memory reserved.
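
A rough back-of-the-envelope estimate shows why 4-bit loading is needed on this GPU. The sketch below counts weight storage only and assumes a round 32 billion parameters (an assumption taken from the model name, not a measured figure); `weight_memory_gib` is a hypothetical helper.

```python
GIB = 1024 ** 3  # same unit the memory report above divides by


def weight_memory_gib(num_params, bits_per_param):
    """Memory needed to store the weights alone, in GiB."""
    return num_params * bits_per_param / 8 / GIB


NUM_PARAMS = 32e9  # assumed round figure for a "32B" model
for label, bits in [("16-bit", 16), ("4-bit", 4)]:
    print(f"{label}: {weight_memory_gib(NUM_PARAMS, bits):.1f} GiB")
```

Under these assumptions the 16-bit weights need about 59.6 GiB and the 4-bit weights about 14.9 GiB, so only the latter fits in the 22.168 GB reported. The reserved figure is higher than the estimate, presumably because layers kept in higher precision, quantization metadata, the adapters, and allocator overhead also take space.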


### Start Training

In [13]:
trainer_stats = trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 100,000 | Num Epochs = 1
O^O/ \_/ \    Batch size per device = 1 | Gradient Accumulation steps = 4
\        /    Total batch size = 4 | Total steps = 100
 "-____-"     Number of trainable parameters = 134,217,728


| Step | Training Loss |
|------|---------------|
| 1 | 1.0183 |
| 2 | 0.7375 |
| 3 | 0.7703 |
| 4 | 0.8063 |
| 5 | 1.3266 |
| 6 | 1.0445 |
| 7 | 0.8609 |
| 8 | 0.7464 |
| 9 | 0.7707 |
| 10 | 0.4988 |
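
The banner's "Total batch size = 4" and "Total steps = 100" follow directly from the trainer settings. The arithmetic below, using only numbers from this notebook, also shows how small a slice of the dataset this run covers.

```python
per_device_batch = 1    # per_device_train_batch_size
grad_accum = 4          # gradient_accumulation_steps
num_gpus = 1            # "Num GPUs = 1" in the banner
max_steps = 100         # max_steps
dataset_size = 100_000  # "Num examples = 100,000" in the banner

# One optimizer step consumes this many examples.
effective_batch = per_device_batch * grad_accum * num_gpus
# Total examples seen over the whole run.
examples_seen = effective_batch * max_steps

print(effective_batch, examples_seen, f"{examples_seen / dataset_size:.1%}")
# prints: 4 400 0.4%
print(dataset_size // effective_batch)  # steps for one full epoch: 25000
```

Each logged loss value is therefore an average over just four examples, which is one plausible reason the values in the table jump around from step to step.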


In [14]:
# Show training statistics
used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
used_percentage = round(used_memory / max_memory * 100, 3)
lora_percentage = round(used_memory_for_lora / max_memory * 100, 3)
print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
print(f"{round(trainer_stats.metrics['train_runtime']/60, 2)} minutes used for training.")
print(f"Peak reserved memory = {used_memory} GB.")
print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
print(f"Peak reserved memory % of max memory = {used_percentage} %.")
print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")

1713.7428 seconds used for training.
28.56 minutes used for training.
Peak reserved memory = 20.176 GB.
Peak reserved memory for training = 1.219 GB.
Peak reserved memory % of max memory = 91.014 %.
Peak reserved memory for training % of max memory = 5.499 %.


<a name="Inference"></a>
## Model Inference

Run a quick inference test with the trained model. Sampling uses `temperature = 1.5` and `min_p = 0.1`.
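
To see what these two parameters do together, here is a simplified sketch of min-p filtering on a toy distribution: the logits are divided by the temperature, and then every token whose probability falls below `min_p` times the top probability is discarded. `min_p_filter` is a hypothetical helper; the ordering (temperature first, then min-p) is an assumption about the generation pipeline, and the real implementation works on tensors.

```python
import math


def min_p_filter(logits, temperature=1.5, min_p=0.1):
    """Return the renormalized distribution that sampling would draw from."""
    # Temperature > 1 flattens the distribution before filtering.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]  # subtract max for stability
    total = sum(exps)
    probs = [e / total for e in exps]
    # Drop tokens that are less than min_p times as likely as the best token.
    threshold = min_p * max(probs)
    kept = [p if p >= threshold else 0.0 for p in probs]
    norm = sum(kept)
    return [p / norm for p in kept]


print(min_p_filter([3.0, 2.0, -3.0]))
```

In this toy case the third token is removed while the first two stay eligible. The relative threshold is what makes a high temperature such as 1.5 usable: the distribution is flattened for variety, but the long tail of unlikely tokens is still cut off.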

In [15]:
from unsloth.chat_templates import get_chat_template

# Configure the tokenizer for inference
tokenizer = get_chat_template(
    tokenizer,
    chat_template = "qwen-2.5",
)
FastLanguageModel.for_inference(model)

# Prepare the test input
messages = [
    {"role": "user", "content": """Here is a programming problem for testing:

    **Matrix Chain Multiplication Optimization**

    ### Problem:
    Given a chain of matrices `A1, A2, ..., An`, where the dimensions of Ai are `P[i-1] x P[i]`,
    find the optimal parenthesization order that minimizes the total scalar multiplication cost.

    **Input:**
    1. An array `P` representing dimensions, e.g., P = [10, 20, 30, 40].

    **Output:**
    1. The optimal parenthesization order (e.g., `(A1 x (A2 x A3))`).
    2. The minimum scalar multiplication cost.
    3. A comparison to the naive left-to-right multiplication cost.

    ### Constraints:
    - Use dynamic programming to solve this problem efficiently.
    - Provide a solution for P of length up to 10^5 (optional for advanced testing).

    ### Example:
    Input: P = [10, 20, 30]
    Output:
    - Optimal order: `(A1 x A2)`
    - Minimum cost: 6000
    - Naive cost: 6000

    Input: P = [10, 20, 30, 40]
    Output:
    - Optimal order: `((A1 x A2) x A3)`
    - Minimum cost: 18000
    - Naive cost: 24000

    Implement the solution and evaluate it against these criteria."""}
]
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize = True,
    add_generation_prompt = True,
    return_tensors = "pt",
).to("cuda")

outputs = model.generate(input_ids = inputs, max_new_tokens = 64, use_cache = True,
                         temperature = 1.5, min_p = 0.1)
tokenizer.batch_decode(outputs)

The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


['<|im_start|>system\nYou are Qwen, created by Alibaba Cloud. You are a helpful assistant.<|im_end|>\n<|im_start|>user\nHere is a programming problem for testing:\n\n    **Matrix Chain Multiplication Optimization**\n\n    ### Problem:\n    Given a chain of matrices `A1, A2, ..., An`, where the dimensions of Ai are `P[i-1] x P[i]`,\n    find the optimal parenthesization order that minimizes the total scalar multiplication cost.\n\n    **Input:**\n    1. An array `P` representing dimensions, e.g., P = [10, 20, 30, 40].\n\n    **Output:**\n    1. The optimal parenthesization order (e.g., `(A1 x (A2 x A3))`).\n    2. The minimum scalar multiplication cost.\n    3. A comparison to the naive left-to-right multiplication cost.\n\n    ### Constraints:\n    - Use dynamic programming to solve this problem efficiently.\n    - Provide a solution for P of length up to 10^5 (optional for advanced testing).\n\n    ### Example:\n    Input: P = [10, 20, 30]\n    Output:\n    - Optimal order: `(A1 x A2)
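
To judge what the model generates for this prompt, it helps to have a reference to compare against. The following is a minimal sketch of the classic O(n^3) dynamic program for matrix chain multiplication; it is not the model's output, the function names are hypothetical, and it assumes `len(p) >= 2`. It does not attempt the prompt's optional length-10^5 case, for which a cubic algorithm is far too slow.

```python
import math


def matrix_chain_order(p):
    """Return (minimum cost, parenthesization) for matrices A_i of size p[i-1] x p[i]."""
    n = len(p) - 1  # number of matrices
    cost = [[0] * n for _ in range(n)]
    split = [[0] * n for _ in range(n)]
    for length in range(2, n + 1):        # length of the sub-chain
        for i in range(n - length + 1):   # first matrix of the sub-chain (0-based)
            j = i + length - 1            # last matrix of the sub-chain
            cost[i][j] = math.inf
            for k in range(i, j):         # split between matrix k and k + 1
                c = cost[i][k] + cost[k + 1][j] + p[i] * p[k + 1] * p[j + 1]
                if c < cost[i][j]:
                    cost[i][j], split[i][j] = c, k

    def render(i, j):
        if i == j:
            return f"A{i + 1}"
        k = split[i][j]
        return f"({render(i, k)} x {render(k + 1, j)})"

    return cost[0][n - 1], render(0, n - 1)


def left_to_right_cost(p):
    """Cost of multiplying strictly left to right: ((A1 x A2) x A3) x ..."""
    return sum(p[0] * p[k - 1] * p[k] for k in range(2, len(p)))


for dims in ([10, 20, 30], [10, 20, 30, 40]):
    print(dims, matrix_chain_order(dims), left_to_right_cost(dims))
```

For `P = [10, 20, 30, 40]` this sketch gives a minimum cost of 18000 with `((A1 x A2) x A3)`, matching the prompt. It also gives 18000 for the left-to-right cost, so the prompt's "Naive cost: 24000" does not appear to correspond to left-to-right evaluation as defined here; that is worth keeping in mind when evaluating the model's answer.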

### Streaming Inference

Use `TextStreamer` to display the output token by token as it is generated.

In [16]:
FastLanguageModel.for_inference(model)

# Prepare the same test input
messages = [
    {"role": "user", "content": """Here is a programming problem for testing:

    **Matrix Chain Multiplication Optimization**

    ### Problem:
    Given a chain of matrices `A1, A2, ..., An`, where the dimensions of Ai are `P[i-1] x P[i]`,
    find the optimal parenthesization order that minimizes the total scalar multiplication cost.

    **Input:**
    1. An array `P` representing dimensions, e.g., P = [10, 20, 30, 40].

    **Output:**
    1. The optimal parenthesization order (e.g., `(A1 x (A2 x A3))`).
    2. The minimum scalar multiplication cost.
    3. A comparison to the naive left-to-right multiplication cost.

    ### Constraints:
    - Use dynamic programming to solve this problem efficiently.
    - Provide a solution for P of length up to 10^5 (optional for advanced testing).

    ### Example:
    Input: P = [10, 20, 30]
    Output:
    - Optimal order: `(A1 x A2)`
    - Minimum cost: 6000
    - Naive cost: 6000

    Input: P = [10, 20, 30, 40]
    Output:
    - Optimal order: `((A1 x A2) x A3)`
    - Minimum cost: 18000
    - Naive cost: 24000

    Implement the solution and evaluate it against these criteria."""}
]
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize = True,
    add_generation_prompt = True,
    return_tensors = "pt",
).to("cuda")

# Stream the generation with TextStreamer
from transformers import TextStreamer
text_streamer = TextStreamer(tokenizer, skip_prompt = True)
_ = model.generate(input_ids = inputs, streamer = text_streamer, max_new_tokens = 128,
                   use_cache = True, temperature = 1.5, min_p = 0.1)

Here is a possible solution for the problem:

```python
# Function to compute the matrix chain multiplication problem
def matrix_chain_multiplication(P):
    # n represents the number of matrices
    n = len(P) - 1

    # Initialize two arrays: m for the cost and s for the splits
    m = [[0] * n for _ in range(n)]
    s = [[0] * n for _ in range(n)]

    # Loop through lengths i from 2 to n
    for i in range(2, n+1):  
        # Loop over the first i-1 matrices
```



<a name="Save"></a>
## Saving the Model

### Save the LoRA Adapters

You can save locally or upload to the Hugging Face Hub. Note that this saves only the LoRA weights, not the full model.

In [17]:
# Save the adapters and tokenizer locally
model.save_pretrained("lora_model") # save the LoRA adapter weights
tokenizer.save_pretrained("lora_model") # save the tokenizer
# Or upload to the Hugging Face Hub
# model.push_to_hub("your_name/lora_model", token = "...") # upload the adapters to the Hub
# tokenizer.push_to_hub("your_name/lora_model", token = "...") # upload the tokenizer to the Hub

('lora_model/tokenizer_config.json',
 'lora_model/special_tokens_map.json',
 'lora_model/vocab.json',
 'lora_model/merges.txt',
 'lora_model/added_tokens.json',
 'lora_model/tokenizer.json')

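## Loading and Using the Model

### 1. Load the LoRA Model

The code below shows how to load the saved LoRA model. Change `if False` to `if True` to run it.

### 2. Model Format Conversion

The following export formats are supported:
1. float16 - for general-purpose inference
2. int4 - for deployment on low-resource devices
3. GGUF - for inference frameworks such as llama.cpp

### 3. Quantization Options
- q8_0: 8-bit quantization; fast, with little loss of precision
- q4_k_m: 4-bit quantization; smaller model size
- q5_k_m: 5-bit quantization; balances size and precision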
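
To get a feel for the size trade-off between these options, the sketch below estimates the exported file size from an average number of bits per weight. The bits-per-weight figures are rough assumptions (the k-quant methods mix several precisions, so the real average varies by model), the 32 billion parameter count is a round figure taken from the model name, and `approx_gguf_size_gb` is a hypothetical helper.

```python
# Assumed average bits per weight; approximate, not measured on this model.
APPROX_BITS_PER_WEIGHT = {"q4_k_m": 4.85, "q5_k_m": 5.7, "q8_0": 8.5}


def approx_gguf_size_gb(num_params, method):
    """Rough file size in GB (10^9 bytes) for a given quantization method."""
    return num_params * APPROX_BITS_PER_WEIGHT[method] / 8 / 1e9


NUM_PARAMS = 32e9  # assumed round figure for a "32B" model
for method in ("q4_k_m", "q5_k_m", "q8_0"):
    print(f"{method}: ~{approx_gguf_size_gb(NUM_PARAMS, method):.0f} GB")
```

Treat the results as order-of-magnitude guidance for choosing a method and checking free disk space, not as exact file sizes.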

In [18]:
if False:
    from unsloth import FastLanguageModel
    # Load the saved LoRA model
    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name = "lora_model", # path to the saved adapters
        max_seq_length = max_seq_length, # maximum sequence length
        dtype = dtype, # data type
        load_in_4bit = load_in_4bit, # whether to load in 4-bit
    )
    FastLanguageModel.for_inference(model) # enable inference mode

    # Test case
    messages = [
        {"role": "user", "content": """Here is a programming problem for testing:

        **Matrix Chain Multiplication Optimization**

        ### Problem:
        Given a chain of matrices `A1, A2, ..., An`, where the dimensions of Ai are `P[i-1] x P[i]`,
        find the optimal parenthesization order that minimizes the total scalar multiplication cost.

        **Input:**
        1. An array `P` representing dimensions, e.g., P = [10, 20, 30, 40].

        **Output:**
        1. The optimal parenthesization order (e.g., `(A1 x (A2 x A3))`).
        2. The minimum scalar multiplication cost.
        3. A comparison to the naive left-to-right multiplication cost.

        ### Constraints:
        - Use dynamic programming to solve this problem efficiently.
        - Provide a solution for P of length up to 10^5 (optional for advanced testing).

        ### Example:
        Input: P = [10, 20, 30]
        Output:
        - Optimal order: `(A1 x A2)`
        - Minimum cost: 6000
        - Naive cost: 6000

        Input: P = [10, 20, 30, 40]
        Output:
        - Optimal order: `((A1 x A2) x A3)`
        - Minimum cost: 18000
        - Naive cost: 24000

        Implement the solution and evaluate it against these criteria."""}
    ]
    # Prepare the input
    inputs = tokenizer.apply_chat_template(
        messages, # list of messages
        tokenize = True, # return token ids
        add_generation_prompt = True, # append the assistant prompt
        return_tensors = "pt", # return PyTorch tensors
    ).to("cuda") # move to the GPU

    # Generate text
    from transformers import TextStreamer
    text_streamer = TextStreamer(tokenizer, skip_prompt = True) # create the streamer
    _ = model.generate(input_ids = inputs, streamer = text_streamer, max_new_tokens = 128,
                       use_cache = True, temperature = 1.5, min_p = 0.1) # generate text

In [19]:
if False:
    # Load through the standard Hugging Face interfaces
    from peft import AutoPeftModelForCausalLM
    from transformers import AutoTokenizer

    model = AutoPeftModelForCausalLM.from_pretrained(
        "lora_model", # path to the saved adapters
        load_in_4bit=load_in_4bit, # load in 4-bit
    )
    tokenizer = AutoTokenizer.from_pretrained("lora_model") # load the tokenizer

In [20]:
# Save merged weights in 16-bit
if False: model.save_pretrained_merged("model", tokenizer, save_method = "merged_16bit",) # save locally
if False: model.push_to_hub_merged("hf/model", tokenizer, save_method = "merged_16bit", token = "") # upload to the Hub

# Save merged weights in 4-bit
if False: model.save_pretrained_merged("model", tokenizer, save_method = "merged_4bit",) # save locally
if False: model.push_to_hub_merged("hf/model", tokenizer, save_method = "merged_4bit", token = "") # upload to the Hub

# Save only the LoRA weights
if False: model.save_pretrained_merged("model", tokenizer, save_method = "lora",) # save locally
if False: model.push_to_hub_merged("hf/model", tokenizer, save_method = "lora", token = "") # upload to the Hub

In [21]:
# Save as GGUF (8-bit quantization)
if False: model.save_pretrained_gguf("model", tokenizer,) # save locally
if False: model.push_to_hub_gguf("hf/model", tokenizer, token = "") # upload to the Hub

# Save as GGUF (16-bit)
if False: model.save_pretrained_gguf("model", tokenizer, quantization_method = "f16") # save locally
if False: model.push_to_hub_gguf("hf/model", tokenizer, quantization_method = "f16", token = "") # upload to the Hub

# Save as GGUF (4-bit quantization)
if False: model.save_pretrained_gguf("model", tokenizer, quantization_method = "q4_k_m") # save locally
if False: model.push_to_hub_gguf("hf/model", tokenizer, quantization_method = "q4_k_m", token = "") # upload to the Hub

# Export several quantization formats in one call
if False:
    model.push_to_hub_gguf(
        "hf/model", # Hub repository name
        tokenizer, # tokenizer
        quantization_method = ["q4_k_m", "q8_0", "q5_k_m",], # list of quantization methods
        token = "", # Hub access token
    )