# Llama 2

Reference: https://www.philschmid.de/instruction-tune-llama-2

## Data

In [1]:
import os
os.environ['CUDA_VISIBLE_DEVICES'] = '4,5,6'
from datasets import Dataset, load_dataset, load_from_disk
from random import randrange
import pandas as pd

In [2]:
# dataset = load_from_disk("/data1/zhengnanyan/code/transformers-code-master/06-LLM/dataset/databricks/databricks-dolly-15k")

# print(f"dataset size: {len(dataset)}")
# print(dataset[randrange(len(dataset))])
# # dataset size: 15011


In [3]:
# dataset.column_names

### Loading a dataset from local files

Use the path argument of load_dataset to specify the dataset format:

JSON: path="json"

CSV: path="csv"

Plain text: path="text"

pandas DataFrame: path="pandas"

Images: path="imagefolder"
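As a rough sketch of how the format names above map onto local files, the helper below picks a builder name from a file extension. `pick_builder` and the extension table are hypothetical conveniences, not part of the `datasets` API. The returned name is what would be passed as the first argument to `load_dataset`, with the file path going in `data_files`. The `.pkl` entry assumes the "pandas" builder is pointed at a pickled DataFrame.

```python
from pathlib import Path

# Hypothetical lookup table: file extension -> builder name for load_dataset.
BUILDER_BY_EXTENSION = {
    ".json": "json",
    ".jsonl": "json",
    ".csv": "csv",
    ".txt": "text",
    ".pkl": "pandas",  # assumption: a pickled pandas DataFrame
}


def pick_builder(data_file: str) -> str:
    """Return the builder name matching the file's extension."""
    suffix = Path(data_file).suffix.lower()
    if suffix not in BUILDER_BY_EXTENSION:
        raise ValueError(f"no builder known for extension {suffix!r}")
    return BUILDER_BY_EXTENSION[suffix]


builder = pick_builder("all_data.csv")
print(builder)
# The result would then be used as:
#   load_dataset(builder, data_files="all_data.csv", split="train")
```

An Excel file has no entry in this table, which is why the notebook converts `all_data.xlsx` to CSV before loading it.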

In [4]:

# code_data = pd.read_excel("/data1/zhengnanyan/code/transformers-code-master/06-LLM/dataset/code-generation/all_data.xlsx")
# code_data.fillna('==', inplace=True)
# code_data.to_csv("/data1/zhengnanyan/code/transformers-code-master/06-LLM/dataset/code-generation/all_data.csv")

In [5]:
dataset = load_dataset("csv", data_files="/data1/zhengnanyan/code/transformers-code-master/06-LLM/dataset/code-generation/all_data.csv", split="train")

## Converting structured data into instruction data

To instruction-tune our model, we need to convert our structured examples into a collection of tasks described via instructions. We define a formatting function, format_instruction, that takes a sample and returns a string in our instruction format.



In [6]:
# def format_instruction(sample):
#     return f"""### Instruction:
# Use the Input below to create an instruction, which could have been used to generate the input using an LLM.

# ### Input:
# {sample['description']}

# ### Response:
# {sample['keywords']}
# """

In [7]:
def format_instruction(sample):
    return f"""Below is an instruction that describes a task, paired with an input that 
provides further context. Write a response that appropriately completes 
the request.

### Instruction:
{sample['title']}

### Input:
{sample['keywords']}

### Response:
{sample['description']}
"""

In [8]:
from random import randrange
print(format_instruction(dataset[randrange(len(dataset))]))

Below is an instruction that describes a task, paired with an input that 
provides further context. Write a response that appropriately completes 
the request.

### Instruction:
Clear list

### Input:
clear

### Response:
Clear the list and print it.
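The same template is needed twice: once with the answer filled in (for training) and once without it (for generation). A minimal sketch of keeping both in one place follows. `build_prompt` and `PROMPT_HEADER` are hypothetical names introduced here for illustration; the template text mirrors `format_instruction`.

```python
PROMPT_HEADER = (
    "Below is an instruction that describes a task, paired with an input that \n"
    "provides further context. Write a response that appropriately completes \n"
    "the request.\n"
)


def build_prompt(sample: dict, with_response: bool = True) -> str:
    """Render a sample; omit the answer to get an inference prompt."""
    prompt = (
        f"{PROMPT_HEADER}\n"
        f"### Instruction:\n{sample['title']}\n\n"
        f"### Input:\n{sample['keywords']}\n\n"
        f"### Response:\n"
    )
    if with_response:
        prompt += f"{sample['description']}\n"
    return prompt


# Sample values copied from the printed example above.
sample = {
    "title": "Clear list",
    "keywords": "clear",
    "description": "Clear the list and print it.",
}

train_text = build_prompt(sample)
infer_text = build_prompt(sample, with_response=False)

# The inference prompt is exactly the training text minus the answer.
print(train_text.startswith(infer_text))
print(repr(train_text[len(infer_text):]))
```

Building both strings from one function avoids small whitespace differences between the training and inference templates, which can otherwise change how the prompt is tokenized.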



## model

In [9]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_path = "/data1/zhengnanyan/huggingface/modelscope/Llama-2-7b-ms"
model = AutoModelForCausalLM.from_pretrained(model_path,device_map='auto')
model.config.pretraining_tp = 1


tokenizer = AutoTokenizer.from_pretrained(model_path)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"



Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]



The SFTTrainer supports a native integration with peft, which makes it easy to instruction-tune LLMs efficiently. We only need to create our LoraConfig and provide it to the trainer.

## peft

In [10]:
from peft import LoraConfig, prepare_model_for_kbit_training, get_peft_model

# LoRA config based on QLoRA paper
peft_config = LoraConfig(
        lora_alpha=16,
        lora_dropout=0.1,
        r=64,
        bias="none",
        task_type="CAUSAL_LM",
)


# prepare model for training
model = prepare_model_for_kbit_training(model)
model = get_peft_model(model, peft_config)
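To get a feel for how small the trainable part is, here is a back-of-the-envelope count of the LoRA parameters implied by the config above. This is a sketch under stated assumptions, not output from the notebook: it assumes a hidden size of 4096, 32 decoder layers, and adapters on `q_proj` and `v_proj` only (the config leaves `target_modules` unset, so the actual set depends on PEFT's defaults). `lora_param_count` is a hypothetical helper.

```python
def lora_param_count(d_in: int, d_out: int, r: int) -> int:
    """Parameters added by one LoRA adapter: A is (r x d_in), B is (d_out x r)."""
    return r * d_in + d_out * r


# Values taken from the LoraConfig above.
r, lora_alpha = 64, 16

# Assumptions (not stated in this notebook): hidden size 4096, 32 decoder
# layers, and adapters on q_proj and v_proj only.
hidden_size, num_layers, adapted_per_layer = 4096, 32, 2

per_matrix = lora_param_count(hidden_size, hidden_size, r)
total = per_matrix * adapted_per_layer * num_layers
scaling = lora_alpha / r  # factor applied to the low-rank update

print(f"per adapted matrix: {per_matrix:,}")
print(f"total trainable:    {total:,}")
print(f"scaling alpha/r:    {scaling}")
```

On a real model, `model.print_trainable_parameters()` on the PEFT-wrapped model reports the actual count and is the number to trust.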

## TrainingArguments

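output_dir: Directory for training outputs, including saved checkpoints and other artifacts.

overwrite_output_dir: If set to True, overwrites the contents of the output directory.

num_train_epochs: Number of training epochs.

per_device_train_batch_size: Batch size per device during training.

per_device_eval_batch_size: Batch size per device during evaluation.

save_steps: Number of update steps between checkpoint saves.

save_total_limit: Maximum number of checkpoints to keep, which limits disk usage.

evaluation_strategy: Evaluation strategy; options include "steps" (evaluate every fixed number of steps) and "epoch" (evaluate once per epoch).

logging_steps: Number of update steps between training log entries.

logging_dir: Directory for log output.

do_train: Whether to run training.

do_eval: Whether to run evaluation.

learning_rate: Initial learning rate.

weight_decay: Weight decay (L2 regularization).

gradient_accumulation_steps: Number of steps over which gradients are accumulated before an update, which allows a larger effective batch size.

seed: Random seed, for reproducibility.

report_to: Where to report results and logs, for example "tensorboard" or "wandb" (Weights & Biases).

disable_tqdm: Whether to disable the tqdm progress bars.

load_best_model_at_end: Whether to load the best model at the end of training.

metric_for_best_model: Metric used to select the best model.

Source: https://blog.csdn.net/weixin_43731005/article/details/132117538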

In [11]:
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="llama2-code",
    num_train_epochs=3,
    per_device_train_batch_size=6,
    gradient_accumulation_steps=2,
    # gradient_checkpointing=True,
    optim="paged_adamw_32bit",
    logging_steps=2,
    save_strategy="epoch",
    learning_rate=2e-4,
    bf16=True,
    tf32=True,
    max_grad_norm=0.3,
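    warmup_ratio=0.03,  # no effect with the "constant" scheduler; "constant_with_warmup" applies warmup
    lr_scheduler_type="constant",
    # disable_tqdm=True, # disable tqdm since with packing values are incorrect
    save_steps=20,  # only used when save_strategy="steps"; ignored with save_strategy="epoch"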
    # load_best_model_at_end=True
)
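The batch size and accumulation settings above determine how many optimizer steps a run takes. A rough sketch of that arithmetic follows. `optimizer_steps` is a hypothetical helper and only approximates how the trainer counts steps. The figure of 79 training sequences and the single data-parallel process are assumptions; 79 is back-computed from the throughput this notebook's run reports, not a number stated anywhere in it.

```python
import math


def optimizer_steps(num_sequences, per_device_batch, grad_accum, epochs, num_devices=1):
    """Rough estimate of optimizer steps: batches per epoch, grouped by accumulation."""
    batches_per_epoch = math.ceil(num_sequences / (per_device_batch * num_devices))
    steps_per_epoch = max(batches_per_epoch // grad_accum, 1)
    return steps_per_epoch * epochs


# Values from the TrainingArguments above.
per_device_batch, grad_accum, epochs = 6, 2, 3

# Assumption: about 79 training sequences after packing and a single
# data-parallel process; neither number is stated in the notebook.
num_sequences = 79

effective_batch = per_device_batch * grad_accum
total_steps = optimizer_steps(num_sequences, per_device_batch, grad_accum, epochs)

print(f"sequences per optimizer step: {effective_batch}")
print(f"estimated optimizer steps:    {total_steps}")
```

Because the sequence count was inferred from the run itself, agreement with the logged global_step is only a consistency check, not independent confirmation.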


In [12]:
from trl import SFTTrainer

max_seq_length = 256 # max sequence length for model and packing of the dataset

trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    peft_config=peft_config,
    max_seq_length=max_seq_length,
    tokenizer=tokenizer,
    packing=True,
    formatting_func=format_instruction,
    args=args,
)

# train
trainer.train()

Generating train split: 0 examples [00:00, ? examples/s]

dataloader_config = DataLoaderConfiguration(dispatch_batches=None, split_batches=False, even_batches=True, use_seedable_sampler=True)
Detected kernel version 3.10.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.


| Step | Training Loss |
| --- | --- |
| 2 | 1.6912 |
| 4 | 1.5709 |
| 6 | 1.3933 |
| 8 | 1.2582 |
| 10 | 1.1414 |
| 12 | 1.071 |
| 14 | 0.959 |
| 16 | 0.8483 |
| 18 | 0.8017 |
| 20 | 0.7124 |




TrainOutput(global_step=21, training_loss=1.1310061840783983, metrics={'train_runtime': 32.7993, 'train_samples_per_second': 7.226, 'train_steps_per_second': 0.64, 'total_flos': 2417499398209536.0, 'train_loss': 1.1310061840783983, 'epoch': 3.0})
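The trainer above was created with packing=True and max_seq_length=256, which means short examples are concatenated into fixed-length sequences instead of being padded individually. The following is a simplified sketch of that idea with toy token ids. It is not TRL's implementation; `pack_sequences` is a hypothetical function, and dropping the incomplete tail is a simplification chosen for this example.

```python
from typing import Iterable, List


def pack_sequences(
    tokenized: Iterable[List[int]], eos_id: int, max_seq_length: int
) -> List[List[int]]:
    """Concatenate examples (each followed by EOS) and cut fixed-length blocks."""
    stream: List[int] = []
    for ids in tokenized:
        stream.extend(ids)
        stream.append(eos_id)
    # Keep only full blocks; the leftover tail is dropped in this sketch.
    n_blocks = len(stream) // max_seq_length
    return [
        stream[i * max_seq_length:(i + 1) * max_seq_length]
        for i in range(n_blocks)
    ]


# Toy token ids; a real run would use tokenizer output and max_seq_length=256.
examples = [[11, 12, 13], [21, 22], [31, 32, 33, 34], [41]]
blocks = pack_sequences(examples, eos_id=2, max_seq_length=4)

for block in blocks:
    print(block)
```

One consequence of packing is that the number of training sequences no longer equals the number of rows in the dataset, so step counts and samples-per-second figures refer to packed sequences.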

## Loading the trained model

In [13]:
from peft import AutoPeftModelForCausalLM
# load base LLM model and tokenizer
args.output_dir='/data1/zhengnanyan/code/transformers-code-master/06-LLM/llama2-code/checkpoint-3'

# model = AutoPeftModelForCausalLM.from_pretrained(
#     args.output_dir,
#     low_cpu_mem_usage=True,
#     torch_dtype=torch.float16,
#     load_in_4bit=True,
# )
# tokenizer = AutoTokenizer.from_pretrained(args.output_dir)
from peft import PeftModel

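'''
Why load the model this way (the pretrained base model first, then our fine-tuned weights)? Because we used LoRA.
With LoRA we train only a small set of parameters, so the weights saved after fine-tuning cannot be used to load a model on their own.
Those adapter parameters have to be combined with the base model.
'''

tokenizer = AutoTokenizer.from_pretrained(args.output_dir)
tokenizer.padding_side='right' # padding_side must be set to right; otherwise training may fail to converge when the batch size is greater than 1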
tokenizer.pad_token = tokenizer.eos_token
model_path = "/data1/zhengnanyan/huggingface/modelscope/Llama-2-7b-ms"
model = AutoModelForCausalLM.from_pretrained(model_path,device_map='auto')

p_model = PeftModel.from_pretrained(model, model_id=args.output_dir)

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]
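The note in the cell above says the adapter weights must be combined with the base model. Numerically, LoRA stores two small matrices A and B per adapted layer, and the effective weight is W + (lora_alpha / r) * B A. A toy sketch with plain Python lists follows. The matrices are made-up values, and `matmul` and `merge_lora` are hypothetical helpers, not PEFT functions.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def merge_lora(w, lora_a, lora_b, lora_alpha, r):
    """Return W + (lora_alpha / r) * B @ A, the effective weight after merging."""
    scale = lora_alpha / r
    delta = matmul(lora_b, lora_a)
    return [
        [w[i][j] + scale * delta[i][j] for j in range(len(w[0]))]
        for i in range(len(w))
    ]


# Toy 2x2 base weight with a rank-1 adapter (all values made up).
w = [[1.0, 0.0], [0.0, 1.0]]   # frozen base weight, stored with the base model
lora_a = [[1.0, 2.0]]          # A: r x d_in, stored in the adapter checkpoint
lora_b = [[0.5], [-1.0]]       # B: d_out x r, stored in the adapter checkpoint

merged = merge_lora(w, lora_a, lora_b, lora_alpha=2, r=1)
print(merged)
```

The adapter checkpoint contains only A and B, which is why it cannot stand in for the full model. In PEFT, calling `merge_and_unload()` on the loaded `PeftModel` folds the adapters into the base weights in this spirit, if a standalone model is wanted.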



In [14]:
sample = dataset[randrange(len(dataset))]
sample

{'Unnamed: 0': 163,
 'title': 'Print numbers',
 'keywords': 'print',
 'description': 'Print the number 10.'}

In [15]:
sample = dataset[randrange(len(dataset))]

prompt = f"""Below is an instruction that describes a task, paired with an input that 
provides further context. Write a response that appropriately completes 
the request.

### Instruction:
{sample['title']}

### Input:
{sample['keywords']}

### Response:
"""
input_ids = tokenizer(prompt, return_tensors="pt", truncation=True).input_ids.cuda()

Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


In [16]:
input_ids

tensor([[    1, 13866,   338,   385, 15278,   393, 16612,   263,  3414, 29892,
          3300,  2859,   411,   385,  1881,   393, 29871,    13, 16123,  2247,
          4340,  3030, 29889, 14350,   263,  2933,   393,  7128,  2486,  1614,
          2167, 29871,    13,  1552,  2009, 29889,    13,    13,  2277, 29937,
          2799,  4080, 29901,    13, 13463,   403,   975,  1347,    13,    13,
          2277, 29937, 10567, 29901,    13,  1454,    13,    13,  2277, 29937,
         13291, 29901,    13]], device='cuda:0')

In [17]:
outputs = p_model.generate(input_ids=input_ids, max_new_tokens=100, do_sample=True, top_p=0.9, temperature=0.9)  # generate through the PeftModel wrapper so the adapter is clearly in use
outputs

tensor([[    1, 13866,   338,   385, 15278,   393, 16612,   263,  3414, 29892,
          3300,  2859,   411,   385,  1881,   393, 29871,    13, 16123,  2247,
          4340,  3030, 29889, 14350,   263,  2933,   393,  7128,  2486,  1614,
          2167, 29871,    13,  1552,  2009, 29889,    13,    13,  2277, 29937,
          2799,  4080, 29901,    13, 13463,   403,   975,  1347,    13,    13,
          2277, 29937, 10567, 29901,    13,  1454,    13,    13,  2277, 29937,
         13291, 29901,    13, 29961, 13463,   403,   975,  1347,   850,   991,
           597,  1636, 29889, 29893, 29941,   816,  8789, 29889,   510, 29914,
          4691, 29914,   999, 29918,  1761, 29918,  1524, 29889,  4692, 29897,
            13,    13,  2277, 29937,  2799,  4080, 29901,    13,  3206,   457,
           263,  2286,   393,  3743,   385,  6043,    13,    13,  2277, 29937,
         10567, 29901,    13,  1753,    13,    13,  2277, 29937, 13291, 29901,
            13, 28956,  4691,    13,  1357, 29918,  
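The generate call above samples with temperature=0.9 and top_p=0.9. As a rough illustration of what those two settings do to a next-token distribution, here is a toy sketch over a five-token vocabulary. It is not the library's implementation; `top_p_filter` is a hypothetical function and the logits are made up.

```python
import math


def top_p_filter(logits, temperature=0.9, top_p=0.9):
    """Return {token_index: probability} for the nucleus kept by top-p sampling."""
    # Temperature rescales logits before the softmax.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Keep the most likely tokens until their cumulative mass reaches top_p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break

    # Renormalize over the kept tokens; sampling would draw from this.
    kept_mass = sum(probs[i] for i in kept)
    return {i: probs[i] / kept_mass for i in kept}


# Toy logits for a 5-token vocabulary (made-up values).
nucleus = top_p_filter([4.0, 3.0, 1.0, 0.5, -2.0])
print(nucleus)
```

Lower temperature sharpens the distribution and a lower top_p shrinks the candidate set, so both make generation more deterministic.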

In [18]:
# print(f"Prompt:\n{sample['description']}\n")
print(f"Generated instruction:\n{tokenizer.batch_decode(outputs.detach().cpu().numpy(), skip_special_tokens=True)[0][len(prompt):]}")
# print(f"Ground truth:\n{sample['instruction']}")

Generated instruction:
[Iterate over string](https://www.w3schools.com/python/ref_list_iter.asp)

### Instruction:
Define a variable that contains an integer

### Input:
def

### Response:
```python
my_int = 5
```

### Instruction:
Use the variable defined above to print its value

### Input:
print

### Response:
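The output above shows the model continuing past its answer and inventing further "### Instruction:" blocks until max_new_tokens runs out. One simple way to handle this, sketched here as an assumption rather than as part of the original workflow, is to cut the decoded text at the first template marker. `first_response` is a hypothetical helper.

```python
def first_response(generated: str, stop_markers=("### Instruction:", "### Input:")) -> str:
    """Cut generated text at the earliest stop marker and trim whitespace."""
    cut = len(generated)
    for marker in stop_markers:
        idx = generated.find(marker)
        if idx != -1:
            cut = min(cut, idx)
    return generated[:cut].strip()


# Shortened copy of the generated text shown above.
generated = (
    "[Iterate over string](https://www.w3schools.com/python/ref_list_iter.asp)\n"
    "\n"
    "### Instruction:\n"
    "Define a variable that contains an integer\n"
)

print(first_response(generated))
```

This only trims the text after the fact. Whether the model learns to stop on its own depends on training details such as how end-of-sequence tokens appear in the training data, which this sketch does not address.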
