## LLaMA 2 Instruction Fine-Tuning (Alpaca-Style on Dolly-15K Dataset)

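Key training elements of the example code:
- Build training data in the Alpaca instruction style (the original example uses Dolly-15K; the cells below load two local FinGPT datasets instead)
- Load the model with 4-bit (NF4) quantization (`LLaMA 2-7B` in the original example; `Llama2-Chinese-13b-Chat` in the cells below)
- Train the model with QLoRA in `bf16` mixed precision
- Use `SFTTrainer` from `HuggingFace TRL` for supervised instruction fine-tuning
- Use Flash Attention to speed up training (requires hardware support)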

### Download the databricks-dolly-15k Dataset

In [29]:
from datasets import load_dataset, concatenate_datasets
from random import randrange

# Load the dataset from the Hub
# dataset = load_dataset("databricks/databricks-dolly-15k", split="train")
# dataset = load_dataset("/home/rr-ai/huggingface_datasets/databricks-dolly-15k", split="train")

data_fineval_dir = "/home/rr-ai/python-project/traindata/fingpt-fineval/"
data_finfq_dir = "/home/rr-ai/python-project/traindata/fingpt-fiqa_qa/"

dataset_fineval = load_dataset(data_fineval_dir, split="train")
dataset_finfq = load_dataset(data_finfq_dir, split="train")

dataset = concatenate_datasets([dataset_fineval, dataset_finfq])
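
The cell above merges the two datasets with `concatenate_datasets`. The sketch below uses plain Python lists (not the `datasets` API) to illustrate one practical difference from `interleave_datasets`, assuming that function's default `stopping_strategy="first_exhausted"`: interleaving stops once the shorter dataset runs out, so rows of the larger dataset are dropped. The helper names and toy rows are hypothetical.

```python
from itertools import chain


def concatenate(*datasets):
    """Append datasets end to end, keeping every row."""
    return list(chain.from_iterable(datasets))


def interleave_first_exhausted(*datasets):
    """Alternate rows round-robin and stop once the shortest dataset runs out."""
    shortest = min(len(d) for d in datasets)
    return [d[i] for i in range(shortest) for d in datasets]


# Toy stand-ins for a small and a large dataset
small = [f"fineval-{i}" for i in range(2)]
large = [f"fiqa-{i}" for i in range(5)]

combined = concatenate(small, large)
mixed = interleave_first_exhausted(small, large)

print(len(combined), combined)
print(len(mixed), mixed)
```

With the row counts reported in this notebook (18166 combined, 17110 of them from `dataset_finfq`, so 1056 from `dataset_fineval`), interleaving under that default would keep only 2 × 1056 = 2112 rows, whereas concatenation keeps all 18166.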

In [30]:
# Total number of examples (15011 in the original Dolly-15K; the combined dataset's count is shown below)
dataset

Dataset({
    features: ['input', 'output', 'instruction'],
    num_rows: 18166
})

In [31]:
dataset_finfq

Dataset({
    features: ['input', 'output', 'instruction'],
    num_rows: 17110
})

In [4]:
dataset_fineval[1054]

{'input': '1848年，芝加哥的82位商人发起组建了____。\nA. 芝加哥期权交易所(CBOE)\nB. 芝加哥期货交易所(CBOT)\nC. 芝加哥股票交易所(CHX)\nD. 芝加哥商业交易所(CME)\n',
 'output': 'B. 芝加哥期货交易所(CBOT)',
 'instruction': '以下是中国关于期货从业资格证考试的单项选择题，请选出其中的正确答案。'}

In [32]:
dataset[18100]

{'input': 'Do I need to own all the funds my target-date funds owns to mimic it?',
 'output': 'If you read Joel Greenblatt\'s The Little Book That Beats the Market, he says: Owning two stocks eliminates 46% of the non market risk of owning just one stock.    This risk is reduced by 72% with 4 stocks, by 81% with 8 stocks, by 93% with 16 stocks, by 96% with 32 stocks, and by 99% with 500 stocks.  Conclusion: After purchasing 6-8 stocks, benefits of adding stocks to decrease risk are small.   Overall market risk won\'t be eliminated merely by adding more stocks.  And that\'s just specific stocks. So you\'re very right that allocating a 1% share to a specific type of fund is not going to offset your other funds by much. You are correct that you can emulate the lifecycle fund by simply buying all the underlying funds, but there are two caveats: Generally, these funds are supposed to be cheaper than buying the separate funds individually. Check over your math and make sure everything is in 

In [5]:
# Print one randomly selected example
print(dataset[randrange(len(dataset))])

{'input': 'What time period is used by yahoo finance to calculate beta', 'output': 'Citing the Yahoo Finance Help page, Beta: The Beta used is Beta of Equity. Beta is the monthly price   change of a particular company relative to the monthly price change of   the S&P500. The time period for Beta is 3 years (36 months) when   available. Regarding customised time periods, I do not think so.', 'instruction': 'Offer your thoughts or opinion on the input financial query or topic using your financial background.'}


### Format the Instruction Data in Alpaca Style

`Alpaca-style` format: https://github.com/tatsu-lab/stanford_alpaca#data-release

In [30]:
# def format_instruction(sample_data):
#     """
#     Formats the given data into a structured instruction format.

#     Parameters:
#     sample_data (dict): A dictionary containing 'response' and 'instruction' keys.

#     Returns:
#     str: A formatted string containing the instruction, input, and response.
#     """
#     # Check if required keys exist in the sample_data
#     if 'response' not in sample_data or 'instruction' not in sample_data:
#         # Handle the error or return a default message
#         return "Error: 'response' or 'instruction' key missing in the input data."

#     return f"""### Instruction:
# Use the Input below to create an instruction, which could have been used to generate the input using an LLM. 
 
# ### Input:
# {sample_data['response']}
 
# ### Response:
# {sample_data['instruction']}
# """

In [33]:
def format_instruction(sample_data):
    """
    Formats the given data into a structured instruction format.

    Parameters:
    sample_data (dict): A dictionary containing 'instruction', 'input', and 'output' keys.

    Returns:
    str: A formatted string containing the instruction, input, and response.
    """
    # Check if required keys exist in the sample_data
    if 'input' not in sample_data or 'output' not in sample_data or 'instruction' not in sample_data:
        # Handle the error or return a default message
        return "Error: 'input', 'output', or 'instruction' key missing in the input data."

    return f"""### Instruction:
{sample_data['instruction']} 
 
### Input:
{sample_data['input']}
 
### Response:
{sample_data['output']}
"""

In [34]:
# Pick a random example and print it in Alpaca format
print(format_instruction(dataset[randrange(len(dataset))]))
# print(format_instruction(dataset[0]))

### Instruction:
Based on your financial expertise, provide your response or viewpoint on the given financial question or topic. The response format is open. 
 
### Input:
What's the purpose of having separate checking and savings accounts?
 
### Response:
A checking account is instant access. It can be tapped via check or debit card.  A savings account is supposed to be used to accumulate cash for a goal that is is longer term or for an emergency.  Many people need to separate these funds into different accounts to be able to know if they are overspending or falling short on their savings. In the United States the Federal Reserve also looks at these accounts differently. Money in a checking account generally can't be used to fund loans, money in a savings account can be used as a source of loans by the bank. An even greater percentage of funds in longer term accounts can be used to fund loans. This includes Certificates of Deposit, and retirement accounts.
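
At inference time the prompt usually needs to match the training template but stop right after the `### Response:` header, so that the model generates the answer. The helper below is a minimal sketch of that idea. The name `build_inference_prompt` is hypothetical, and the single-space separator lines are an assumption copied from how the template appears in the cell above.

```python
def build_inference_prompt(instruction: str, input_text: str) -> str:
    """Build an Alpaca-style prompt with an empty response section."""
    # Mirrors the training template, including the trailing space after the
    # instruction and the single-space separator lines, but leaves the
    # response empty for the model to complete.
    return (
        "### Instruction:\n"
        f"{instruction} \n"
        " \n"
        "### Input:\n"
        f"{input_text}\n"
        " \n"
        "### Response:\n"
    )


prompt = build_inference_prompt(
    "Offer your thoughts or opinion on the input financial query or topic using your financial background.",
    "What time period is used by yahoo finance to calculate beta",
)
print(prompt)
```

Keeping whitespace identical to the training template is a deliberate choice here: a fine-tuned model can be sensitive to small formatting differences between training and inference prompts.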



### Speed Up Training with Flash Attention

Check whether your GPU supports `flash-attn` acceleration:

```shell
$ python -c "import torch; assert torch.cuda.get_device_capability()[0] >= 8, 'Hardware not supported for Flash Attention'"

Traceback (most recent call last):
  File "<string>", line 1, in <module>
AssertionError: Hardware not supported for Flash Attention
```
**Result: the NVIDIA T4 used in this demo does not support Flash Attention**
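
The one-line shell check above can also be written as a small Python helper that reports the result instead of raising an assertion error, and that degrades gracefully when `torch` or a CUDA device is missing. This is a sketch under the same rule as the shell check (compute capability major version of at least 8). The function names are hypothetical.

```python
def supports_flash_attention(capability):
    """Return True when the (major, minor) compute capability is 8.0 or newer."""
    major, _minor = capability
    return major >= 8


def current_gpu_capability():
    """Return the active GPU's compute capability, or None without torch/CUDA."""
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    return torch.cuda.get_device_capability()


capability = current_gpu_capability()
if capability is None:
    print("No CUDA device detected; Flash Attention is unavailable.")
else:
    supported = supports_flash_attention(capability)
    print(f"Compute capability {capability}: flash-attn supported = {supported}")
```

For reference, a T4 reports compute capability (7, 5), which fails the check, while an A100 reports (8, 0), which passes.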

#### Install the flash-attn Package (Requires GPU Hardware Support)

```shell
$ MAX_JOBS=4 pip install flash-attn --no-build-isolation
```

### Load the Model

In [35]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

# If the hardware supports it and flash-attn installed successfully, set use_flash_attention to True
use_flash_attention = False
 
# Uncomment to use flash-attn
# if torch.cuda.get_device_capability()[0] >= 8:
#     from utils.llama_patch import replace_attn_with_flash_attn
#     print("Using flash attention")
#     replace_attn_with_flash_attn()
#     use_flash_attention = True
 
 
# Get the LLaMA 2-7B model weights
# Weights that do not require approval from Meta AI
# model_id = "NousResearch/Llama-2-7b-hf" 
# Once Meta AI has granted access, this model ID can be used for the download
# model_id = "meta-llama/Llama-2-7b-hf" 
# model_id = '/home/rr-ai/huggingface_models/Llama-2-7b-hf/' 
model_id = '/home/rr-ai/huggingface_models/Llama2-Chinese-13b-Chat/' 

 
# Load the quantized model with bitsandbytes (BnB)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)
 
# Load the model and tokenizer
model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb_config, use_cache=True, device_map="auto")
model.config.pretraining_tp = 1 
 
# Verify that the model is using flash attention by comparing docstrings
if use_flash_attention:
    from utils.llama_patch import forward    
    assert model.model.layers[0].self_attn.forward.__doc__ == forward.__doc__, "Model is not using flash attention"
 
 
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

Loading checkpoint shards: 100%|████████████████████████████████████████████████████| 3/3 [00:03<00:00,  1.32s/it]
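
To get a feel for what the `BitsAndBytesConfig` above buys, the sketch below estimates the weight memory of a 4-bit model with and without double quantization. It is a rough approximation, not a measurement. It assumes the block sizes described in the QLoRA paper (one 32-bit constant per 64 weights, and with double quantization an 8-bit constant per 64 weights plus one 32-bit constant per 256 of those). It treats every parameter as quantized and ignores activations, the optimizer state, and the LoRA adapters. The function name and the round 13-billion parameter count are illustrative assumptions.

```python
def nf4_weight_gib(n_params, double_quant=True, block_size=64, second_level_block=256):
    """Estimate the memory of 4-bit quantized weights in GiB."""
    weight_bits = 4.0
    if double_quant:
        # 8-bit first-level constants, themselves quantized in blocks
        # that share one 32-bit second-level constant
        const_bits = 8 / block_size + 32 / (block_size * second_level_block)
    else:
        # one 32-bit quantization constant per block of weights
        const_bits = 32 / block_size
    total_bytes = n_params * (weight_bits + const_bits) / 8
    return total_bytes / 2**30


n_params = 13_000_000_000  # assumed round figure for a 13B model

with_dq = nf4_weight_gib(n_params)
without_dq = nf4_weight_gib(n_params, double_quant=False)
print(f"with double quantization:    {with_dq:.2f} GiB")
print(f"without double quantization: {without_dq:.2f} GiB")
```

In practice some layers (for example embeddings and normalization layers) are not quantized, so actual usage will be higher than this estimate.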


### Load the PEFT Model with a QLoRA Configuration

In [36]:
from peft import LoraConfig, prepare_model_for_kbit_training, get_peft_model
 
# QLoRA configuration
peft_config = LoraConfig(
        lora_alpha=16,
        lora_dropout=0.1,
        r=16,
        bias="none",
        task_type="CAUSAL_LM", 
)
 
 
# Load the PEFT model with the QLoRA configuration
model = prepare_model_for_kbit_training(model)
qlora_model = get_peft_model(model, peft_config)

In [37]:
qlora_model.print_trainable_parameters()

trainable params: 13,107,200 || all params: 13,028,971,520 || trainable%: 0.10060041945659269
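
The trainable-parameter count printed above can be reproduced by hand. The sketch below assumes the LLaMA 2 13B shape (hidden size 5120, 40 decoder layers) and assumes that, because `LoraConfig` does not set `target_modules`, PEFT falls back to its LLaMA default of adapting `q_proj` and `v_proj`. Each adapted 5120 × 5120 projection gains two low-rank matrices, `A` (r × 5120) and `B` (5120 × r). The function name is hypothetical.

```python
def lora_param_count(hidden_size, n_layers, rank, n_target_modules):
    """Count LoRA parameters for square projections of size hidden_size."""
    # A is (rank x in_features), B is (out_features x rank)
    per_module = rank * (hidden_size + hidden_size)
    return per_module * n_target_modules * n_layers


trainable = lora_param_count(hidden_size=5120, n_layers=40, rank=16, n_target_modules=2)
total = 13_028_971_520  # "all params" as printed by print_trainable_parameters()
trainable_pct = 100 * trainable / total

print(f"trainable params: {trainable:,}")
print(f"trainable%: {trainable_pct}")
```

Under these assumptions the result agrees with the 13,107,200 trainable parameters reported by the notebook, which suggests only the query and value projections are being adapted.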


### Training Hyperparameters

In [38]:
import datetime

# Demo-run switch (set to False for a real training run)
demo_train = False
timestamp = datetime.datetime.now().strftime("%Y%m%d_%H%M%S")

output_dir = f"models/Llama2-Chinese-13b-Chat-int4-finfq-{timestamp}"

In [39]:
from transformers import TrainingArguments
 
args = TrainingArguments(
    output_dir=output_dir,
    num_train_epochs=1 if demo_train else 3,
    max_steps=450, # a positive max_steps overrides num_train_epochs
    per_device_train_batch_size=2, # largest batch size that fits in the 16 GB of an NVIDIA T4
    gradient_accumulation_steps=1 if demo_train else 4,
    gradient_checkpointing=True,
    optim="paged_adamw_32bit",
    logging_steps=10,
    save_strategy="steps" if demo_train else "epoch",
    save_steps=10,
    learning_rate=2e-4,
    bf16=True,
    max_grad_norm=0.3,
    warmup_ratio=0.03,
    lr_scheduler_type="constant"
)

### Instantiate the SFTTrainer

In [40]:
from trl import SFTTrainer
 
# Maximum sequence length (the 1158-example figure is from an earlier run; see the generated count in the output below)
max_seq_length = 2048 
 
trainer = SFTTrainer(
    model=qlora_model,
    train_dataset=dataset,
    peft_config=peft_config,
    max_seq_length=max_seq_length,
    tokenizer=tokenizer,
    packing=True,
    formatting_func=format_instruction, 
    args=args,
)

Generating train split: 2762 examples [00:01, 1894.81 examples/s]
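
With `packing=True`, the 18166 formatted examples become 2762 training sequences because short examples are concatenated and cut into blocks of at most `max_seq_length` tokens. The sketch below illustrates the idea on toy token lists. It is a simplified stand-in, not TRL's actual implementation, and the function name, token IDs, and the choice to drop the incomplete tail are assumptions made for illustration.

```python
def pack(tokenized_examples, seq_length, eos_token_id):
    """Concatenate token lists (separated by EOS) and cut fixed-length blocks."""
    buffer = []
    for tokens in tokenized_examples:
        buffer.extend(tokens)
        buffer.append(eos_token_id)  # mark the boundary between examples
    n_full = len(buffer) // seq_length
    # Keep only full blocks; the incomplete tail is dropped in this sketch
    return [buffer[i * seq_length:(i + 1) * seq_length] for i in range(n_full)]


examples = [[1, 2, 3], [4, 5], [6, 7, 8, 9], [10]]
packed = pack(examples, seq_length=4, eos_token_id=0)

print(f"{len(examples)} examples -> {len(packed)} packed sequences")
for block in packed:
    print(block)
```

Packing avoids spending compute on padding tokens, at the cost of a single training sequence sometimes spanning several unrelated examples.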


### Train the Model

In [41]:
trainer.train()

`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...


Step,Training Loss
10,2.0507
20,1.9413
30,1.8185
40,1.7877
50,1.7971
60,1.7859
70,1.7623
80,1.7561
90,1.7529
100,1.759




TrainOutput(global_step=450, training_loss=1.7406717427571614, metrics={'train_runtime': 7654.1957, 'train_samples_per_second': 0.47, 'train_steps_per_second': 0.059, 'total_flos': 5.69112250023936e+17, 'train_loss': 1.7406717427571614, 'epoch': 1.3})
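
The metrics in `TrainOutput` can be sanity-checked from the hyperparameters. The arithmetic below assumes a single training process (one GPU) and uses the 2762 packed sequences reported when the trainer was created. The variable names are illustrative.

```python
per_device_batch = 2      # per_device_train_batch_size
grad_accum = 4            # gradient_accumulation_steps (demo_train is False)
max_steps = 450           # overrides num_train_epochs
packed_sequences = 2762   # "Generating train split: 2762 examples"
runtime_s = 7654.1957     # train_runtime

# Sequences consumed per optimizer step and over the whole run
sequences_per_step = per_device_batch * grad_accum
sequences_seen = sequences_per_step * max_steps

epochs = sequences_seen / packed_sequences
samples_per_second = sequences_seen / runtime_s
steps_per_second = max_steps / runtime_s

print(f"sequences per optimizer step: {sequences_per_step}")
print(f"epochs: {epochs:.2f}")
print(f"samples/s: {samples_per_second:.3f}, steps/s: {steps_per_second:.4f}")
```

These values round to the reported `epoch` of 1.3, `train_samples_per_second` of 0.47, and `train_steps_per_second` of 0.059, which confirms that the run stopped at `max_steps` rather than after 3 full epochs.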

### Save the Model

In [42]:
trainer.save_model()

### Model Inference (Testing)