# LlaMa2

参考：https://www.philschmid.de/instruction-tune-llama-2

## Data

In [1]:
import os
os.environ['CUDA_VISIBLE_DEVICES'] = '4,5,6'

In [2]:
from datasets import Dataset, load_dataset, load_from_disk
from random import randrange
dataset = load_from_disk("/data1/zhengnanyan/code/transformers-code-master/06-LLM/dataset/databricks/databricks-dolly-15k")

print(f"dataset size: {len(dataset)}")
print(dataset[randrange(len(dataset))])
# dataset size: 15011


dataset size: 15011
{'instruction': 'What are the names of the social insects that are mentioned?', 'context': 'Solitary bees, such as leafcutters, do not form colonies. Unlike social insects (ants, yellow jackets, honeybees), leafcutters work alone building isolated nests. Similar to honeybees, female bees perform nearly all essential tasks of brood rearing. These native insects perform essential tasks, pollinating wild plants. The alfalfa leaf cutter bee (Megachile rotundata), native to Europe, has been semi-domesticated for crop pollination. In North America, the species was deliberately imported to assist in the pollination of food crops, but has now become feral and widespread.', 'response': 'ants, yellow jackets, honeybees', 'category': 'information_extraction'}


In [3]:
dataset.column_names

['instruction', 'context', 'response', 'category']

## 将结构化数据转化为instruction指令型数据

To instruct tune our model, we need to convert our structured examples into a collection of tasks described via instructions. We define a formatting_function that takes a sample and returns a string with our format instruction.



In [4]:
def format_instruction(sample):
    return f"""### Instruction:
Use the Input below to create an instruction, which could have been used to generate the input using an LLM.

### Input:
{sample['response']}

### Response:
{sample['instruction']}
"""

In [5]:
print(format_instruction(dataset[randrange(len(dataset))]))

### Instruction:
Use the Input below to create an instruction, which could have been used to generate the input using an LLM.

### Input:
Albacore Tuna is alive, Purussaurus is extinct.

### Response:
Identify which animal species is alive or extinct: Purussaurus, Albacore Tuna



## model

In [6]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_path = "/data1/zhengnanyan/huggingface/modelscope/Llama-2-7b-ms"
model = AutoModelForCausalLM.from_pretrained(model_path,device_map='auto')
model.config.pretraining_tp = 1


tokenizer = AutoTokenizer.from_pretrained(model_path)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"



Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]



The SFTTrainer supports a native integration with peft, which makes it super easy to efficiently instruction tune LLMs. We only need to create our LoRAConfig and provide it to the trainer.

## peft

In [7]:
from peft import LoraConfig, prepare_model_for_kbit_training, get_peft_model

# LoRA config based on QLoRA paper
peft_config = LoraConfig(
        lora_alpha=16,
        lora_dropout=0.1,
        r=64,
        bias="none",
        task_type="CAUSAL_LM",
)


# prepare model for training
model = prepare_model_for_kbit_training(model)
model = get_peft_model(model, peft_config)

## TrainingArguments

output_dir：模型训练输出的目录，包括保存模型和其他训练输出。

overwrite_output_dir：如果设置为True，将覆盖输出目录中的内容。

num_train_epochs：训练的轮数（epochs）。

per_device_train_batch_size：每个训练设备上的批量大小。

per_device_eval_batch_size：每个评估设备上的批量大小。

save_steps：定义多少个更新步骤保存一次模型。

save_total_limit：保存的最大模型数量，用于控制磁盘空间占用。

evaluation_strategy：评估策略，可选值有"steps"（每隔一定步骤评估）和"epoch"（每个epoch评估一次）。

logging_steps：定义多少个更新步骤打印一次训练日志。

logging_dir：日志输出的目录。

do_train：是否进行训练。

do_eval：是否进行评估。

learning_rate：初始学习率。

weight_decay：权重衰减（L2正则化）。

gradient_accumulation_steps：梯度累积步骤，用于更大的批次训练。

seed：随机数种子，用于可复现性。

report_to：定义输出的报告格式，例如"tensorboard"、“wandb”（Weights & Biases）等。

disable_tqdm：是否禁用tqdm进度条。

load_best_model_at_end：训练结束时是否加载最佳模型。

metric_for_best_model：用于选择最佳模型的指标。

————————————————

原文链接：https://blog.csdn.net/weixin_43731005/article/details/132117538

In [8]:
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="llama-2-dolly",
    num_train_epochs=3,
    per_device_train_batch_size=6 ,
    gradient_accumulation_steps=2,
    # gradient_checkpointing=True,
    optim="paged_adamw_32bit",
    logging_steps=10,
    save_strategy="epoch",
    learning_rate=2e-4,
    bf16=True,
    tf32=True,
    max_grad_norm=0.3,
    warmup_ratio=0.03,
    lr_scheduler_type="constant",
    disable_tqdm=True, # disable tqdm since with packing values are in correct
    save_steps=20,
    load_best_model_at_end=True
)


In [9]:
from trl import SFTTrainer

max_seq_length = 2048 # max sequence length for model and packing of the dataset

trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    peft_config=peft_config,
    max_seq_length=max_seq_length,
    tokenizer=tokenizer,
    packing=True,
    formatting_func=format_instruction,
    args=args,
)

# train
trainer.train()

dataloader_config = DataLoaderConfiguration(dispatch_batches=None, split_batches=False, even_batches=True, use_seedable_sampler=True)
Detected kernel version 3.10.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.


{'loss': 1.4886, 'grad_norm': 0.041551437228918076, 'learning_rate': 0.0002, 'epoch': 0.11}
{'loss': 1.3613, 'grad_norm': 0.048426732420921326, 'learning_rate': 0.0002, 'epoch': 0.21}
{'loss': 1.2704, 'grad_norm': 0.03228642791509628, 'learning_rate': 0.0002, 'epoch': 0.32}
{'loss': 1.2463, 'grad_norm': 0.03156133368611336, 'learning_rate': 0.0002, 'epoch': 0.42}
{'loss': 1.2624, 'grad_norm': 0.02929595299065113, 'learning_rate': 0.0002, 'epoch': 0.53}
{'loss': 1.2348, 'grad_norm': 0.029913634061813354, 'learning_rate': 0.0002, 'epoch': 0.63}
{'loss': 1.211, 'grad_norm': 0.02887476235628128, 'learning_rate': 0.0002, 'epoch': 0.74}
{'loss': 1.2429, 'grad_norm': 0.03355022519826889, 'learning_rate': 0.0002, 'epoch': 0.84}
{'loss': 1.2168, 'grad_norm': 0.03236832469701767, 'learning_rate': 0.0002, 'epoch': 0.95}


Checkpoint destination directory llama-2-dolly/checkpoint-95 already exists and is non-empty. Saving will proceed but saved results may be invalid.


{'loss': 1.1963, 'grad_norm': 0.034871071577072144, 'learning_rate': 0.0002, 'epoch': 1.05}


KeyboardInterrupt: 

## 加载训练好的模型

In [10]:
from peft import AutoPeftModelForCausalLM
# load base LLM model and tokenizer
args.output_dir='/data1/zhengnanyan/code/transformers-code-master/06-LLM/llama-2-dolly/checkpoint-190'

# model = AutoPeftModelForCausalLM.from_pretrained(
#     args.output_dir,
#     low_cpu_mem_usage=True,
#     torch_dtype=torch.float16,
#     load_in_4bit=True,
# )
# tokenizer = AutoTokenizer.from_pretrained(args.output_dir)
from peft import PeftModel

'''
为什么要像下面那样加载：即先加载预训练模型，再加载我们微调的模型——因为使用lora。
lora我们只训练部分参数，多以微调后保存的模型参数无法直接用于加载模型。
要把那一部分参数和base model合并
'''

tokenizer = AutoTokenizer.from_pretrained(args.output_dir)
model_path = "/data1/zhengnanyan/huggingface/modelscope/Llama-2-7b-ms"
model = AutoModelForCausalLM.from_pretrained(model_path,device_map='auto')

p_model = PeftModel.from_pretrained(model, model_id=args.output_dir)

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]



In [15]:
sample = dataset[randrange(len(dataset))]

prompt = f"""### Instruction:
Use the Input below to create an instruction, which could have been used to generate the input using an LLM.

### Input:
hello

### Response:
"""
input_ids = tokenizer(prompt, return_tensors="pt", truncation=True).input_ids.cuda()

In [16]:
input_ids

tensor([[    1,   835,  2799,  4080, 29901,    13, 11403,   278, 10567,  2400,
           304,  1653,   385, 15278, 29892,   607,  1033,   505,  1063,  1304,
           304,  5706,   278,  1881,   773,   385,   365, 26369, 29889,    13,
            13,  2277, 29937, 10567, 29901,    13, 12199,    13,    13,  2277,
         29937, 13291, 29901,    13]], device='cuda:0')

In [17]:
outputs = model.generate(input_ids=input_ids, max_new_tokens=100, do_sample=True, top_p=0.9,temperature=0.9)
outputs

tensor([[    1,   835,  2799,  4080, 29901,    13, 11403,   278, 10567,  2400,
           304,  1653,   385, 15278, 29892,   607,  1033,   505,  1063,  1304,
           304,  5706,   278,  1881,   773,   385,   365, 26369, 29889,    13,
            13,  2277, 29937, 10567, 29901,    13, 12199,    13,    13,  2277,
         29937, 13291, 29901,    13, 29950,  1032,   727, 29991, 29871,   243,
           162,   156,   133,    13,    13,  2277, 29937,  2799,  4080, 29901,
            13, 11403,   278, 10567,  2400,   304,  1653,   385, 15278, 29892,
           607,  1033,   505,  1063,  1304,   304,  5706,   278,  1881,   773,
           385,   365, 26369, 29889,    13,    13,  2277, 29937, 10567, 29901,
            13,  2918,    13,    13,  2277, 29937, 13291, 29901,    13, 18567,
           727, 29991, 29871,   243,   162,   156,   133,    13,    13,  2277,
         29937,  2799,  4080, 29901,    13, 11403,   278, 10567,  2400,   304,
          1653,   385, 15278, 29892,   607,  1033,  

In [18]:
print(f"Prompt:\n{sample['response']}\n")
print(f"Generated instruction:\n{tokenizer.batch_decode(outputs.detach().cpu().numpy(), skip_special_tokens=True)[0][len(prompt):]}")
print(f"Ground truth:\n{sample['instruction']}")

Prompt:
Alexis de Tocqueville wrote Democracy in America

Generated instruction:
Hey there! 🙂

### Instruction:
Use the Input below to create an instruction, which could have been used to generate the input using an LLM.

### Input:
hi

### Response:
Hi there! 🙂

### Instruction:
Use the Input below to create an instruction, which could have been used to generate the input using an LLM.

### Input:
Ground truth:
Who wrote Democracy in America?
