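# I. Introduction
- Dataset preparation reference: (https://wandb.ai/capecape/alpaca_ft/reports/How-to-Fine-Tune-an-LLM-Part-1-Preparing-a-Dataset-for-Instruction-Tuning--Vmlldzo1NTcxNzE2)
- Fine-tuning reference: (https://wandb.ai/capecape/alpaca_ft/reports/How-to-Fine-Tune-an-LLM-Part-2-Instruction-Tuning-Llama-2--Vmlldzo1NjY0MjE1)
## 1. What is Instruction Tuning
- Instruction tuning is part of post-training: the model is fine-tuned on instructions so that it aligns with the way humans phrase their requests.
- The training procedure is the same as in pre-training; the difference is the dataset. Pre-training needs no hand-built dataset because it is self-supervised (autoregressive) learning. Instruction tuning does require a constructed dataset, and a high-quality one in particular, which is critical for the chat ability obtained later.
- The dataset consists of instruction-answer pairs, as in the two examples below:  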

    <table>
        <tr>
            <td>
            <font size=6> Instruction </font> <br> Explain the concept of a bubble sort algorithm to a non-technical audience.
            </td>
            <td>
            <font size=6> Answer </font> <br> A bubble sort algorithm is a type of sorting algorithm that is used to sort elements in an array. It works by looking at each element of the array and comparing it to the next element. If the first element is bigger than the second element, they are swapped. This process is repeated until the whole array is sorted. This type of sorting is one of the simplest sorting algorithms, but it can be slow if the array has many elements.
            </td>
        </tr>
        <tr>
            <td>
            <font size=6> Instruction </font> <br> Make the second sentence shorter. <br> <font size=6> Context </font> <br> Winter is usually the coldest season of the year. Snow is a common element during winter.
            </td>
            <td>
            <font size=6> Answer </font> <br> Winter is the coldest season, often accompanied by snow.
            </td>
        </tr>
    </table>

Existing high-quality instruction datasets are either built by hand or generated automatically. The [Flan Collection](https://github.com/google-research/FLAN) and the [Dolly15k dataset](https://www.databricks.com/blog/2023/04/12/dolly-first-open-commercially-viable-instruction-tuned-llm) were built manually, while the [Alpaca dataset](https://crfm.stanford.edu/2023/03/13/alpaca.html) was generated with LLMs. The open-source community also provides instruction datasets such as [OpenOrca](https://huggingface.co/datasets/Open-Orca/OpenOrca), [Open-Platypus](https://huggingface.co/datasets/garage-bAInd/Open-Platypus), and [openhermes](https://huggingface.co/datasets/teknium/openhermes).

We use the Alpaca dataset and cover how to preprocess it, format it, and fine-tune an LLM on it.

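## 2. What is the Alpaca Dataset?
This is a synthetic dataset generated with the OpenAI davinci model and used to fine-tune Llama. It covers many kinds of instructions, including email writing, social media, and productivity tools. The pipeline used to build the dataset is shown below:  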
    <img src=./imgs/f8eba1d5.png>  


That is the older version. Here we use a newer version that was generated with GPT-4.

# II. Alpaca-GPT4 Dataset
- It is a JSON file, available here: [Alpaca-GPT4 Dataset](https://github.com/Instruction-Tuning-with-GPT-4/GPT-4-LLM/blob/main/data/alpaca_gpt4_data.json)
- It contains 52K examples
- It was generated with GPT-4

Each record has the following fields:
    ```
    instruction: str, describes the task the model should perform. 
                      Each of the 52K instructions is unique.
    input:       str, optional context or input for the task.
    output:      str, the answer to the instruction as generated by GPT-4.
    
    ```
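As a minimal sketch of what records look like under this schema, the snippet below builds two records by hand and checks whether the optional `input` field is set. The texts are borrowed from the examples in section I (the first output is shortened), and `has_context` is a hypothetical helper name used only for illustration.

```python
# Two hand-written records in the Alpaca-GPT4 layout; the keys match the schema above.
records = [
    {
        "instruction": "Explain the concept of a bubble sort algorithm to a non-technical audience.",
        "input": "",  # no extra context for this task
        "output": "A bubble sort algorithm is a type of sorting algorithm that is used to sort elements in an array.",
    },
    {
        "instruction": "Make the second sentence shorter.",
        "input": "Winter is usually the coldest season of the year. Snow is a common element during winter.",
        "output": "Winter is the coldest season, often accompanied by snow.",
    },
]


def has_context(record):
    """Return True when the optional `input` field carries context (hypothetical helper)."""
    return record["input"] != ""


for r in records:
    print(has_context(r), "-", r["instruction"])
```

Records without context leave `input` as an empty string, which is the condition the prompt-building code later in this section branches on.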
    
Next we load the dataset and log it to W&B as an artifact so that it can be loaded quickly later.

In [1]:
import os
# get the current working directory
current_directory = os.getcwd()
print("Current working directory:", current_directory)

# set the new working directory
new_directory = "/mnt/d/lib/llm/llm-course"
os.chdir(new_directory)

# read the working directory again to confirm the change
current_directory = os.getcwd()
print("New working directory:", current_directory)

Current working directory: /home/liangxianbing
New working directory: /mnt/d/code/llm/llm-course


In [9]:
import json
import wandb


with open("./mynotes/data/alpaca_dataset/alpaca_gpt4_data.json", "r") as f:
    alpaca = json.load(f)


with wandb.init(project="alpaca_ft"):
    at = wandb.Artifact(
        name="alpaca_gpt4", 
        type="dataset",
        description="A GPT4 generated Alpaca like dataset for instruction finetuning",
        metadata={"url":"https://github.com/Instruction-Tuning-with-GPT-4/GPT-4-LLM#how-good-is-the-data"},
    )
    at.add_file("./mynotes/data/alpaca_dataset/alpaca_gpt4_data.json")
    
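    # also attach the rows as a W&B Table so they can be browsed in the UI
    table = wandb.Table(columns=list(alpaca[0].keys()))
    for row in alpaca:
        table.add_data(*row.values())
    at.add(table, "alpaca_gpt4_table")

    # log the artifact; without this call the file and table are never uploaded
    wandb.log_artifact(at)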

[34m[1mwandb[0m: Currently logged in as: [33mskyl4rking[0m ([33mskyl4rking-fudan-university[0m). Use [1m`wandb login --relogin`[0m to force relogin



## 1. Tokenization

In [13]:
import json


with open("./mynotes/data/alpaca_dataset/alpaca_gpt4_data.json", "r") as f:
    alpaca = json.load(f)


print(len(alpaca))


one_row = alpaca[232]
print(one_row)

52002
{'instruction': 'Sort the following list in alphabetical order.', 'input': 'Camouflage, Furniture, Plaster', 'output': 'Camouflage, Furniture, Plaster sorted in alphabetical order:\nCamouflage, Furniture, Plaster'}


### (1). Define the functions that format a record (that is, build the prompt)

In [18]:
def prompt_no_input(row):
    return ("Below is an instruction that describes a task. "
            "Write a response that appropriately completes the request.\n\n"
            "### Instruction:\n{instruction}\n\n### Response:\n").format_map(row)


def prompt_input(row):
    return ("Below is an instruction that describes a task, paired with an input that provides further context. "
            "Write a response that appropriately completes the request.\n\n"
            "### Instruction:\n{instruction}\n\n### Input:\n{input}\n\n### Response:\n").format_map(row)

row = alpaca[232]
print(prompt_input(row))

Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
Sort the following list in alphabetical order.

### Input:
Camouflage, Furniture, Plaster

### Response:



### (2). Build the prompt for every record

In [20]:
def create_prompt(row):
    return prompt_no_input(row) if row["input"] == "" else prompt_input(row)


prompts = [create_prompt(row) for row in alpaca]  # all LLM inputs are here
print(prompts[232])

Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
Sort the following list in alphabetical order.

### Input:
Camouflage, Furniture, Plaster

### Response:



### (3). Define the outputs (the labels): just append an EOS token

In [21]:
EOS_TOKEN = "</s>"
outputs = [row['output'] + EOS_TOKEN for row in alpaca]
print(outputs[232])

Camouflage, Furniture, Plaster sorted in alphabetical order:
Camouflage, Furniture, Plaster</s>


### (4). Assemble the dataset

In [25]:
dataset = [{"prompt":s, "output":t, "example": s+t} for s, t in zip(prompts, outputs)]
print(dataset[232]['prompt'])
print('____________________________________________________________________________________________')
print(dataset[232]['output'])
print('____________________________________________________________________________________________')
print(dataset[232]['example'])

Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
Sort the following list in alphabetical order.

### Input:
Camouflage, Furniture, Plaster

### Response:

____________________________________________________________________________________________
Camouflage, Furniture, Plaster sorted in alphabetical order:
Camouflage, Furniture, Plaster</s>
____________________________________________________________________________________________
Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
Sort the following list in alphabetical order.

### Input:
Camouflage, Furniture, Plaster

### Response:
Camouflage, Furniture, Plaster sorted in alphabetical order:
Camouflage, Furniture, Plaster</s>


### (5). Tokenization
- The output below is padded with three `2`s: `2` is the EOS token id, which is used as the pad token here.

In [30]:
from transformers import AutoTokenizer
model_id = 'meta-llama/Llama-2-7b-hf'
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token

tokenizer.encode("My experiments are going strong!", padding='max_length', max_length=10)

# the result can also be returned directly as a torch Tensor
# tokenizer.encode("My experiments are going strong!", 
#                  padding='max_length', 
#                  max_length=10,
#                  return_tensors="pt")

[1, 1619, 15729, 526, 2675, 4549, 29991, 2, 2, 2]

## 2. Build the dataset and upload it to W&B
### (1). Split into training and validation sets

In [31]:
import random
import pandas as pd
random.shuffle(dataset)  # shuffle inplace


train_dataset = dataset[:-1000]
eval_dataset = dataset[-1000:]


train_table = wandb.Table(dataframe=pd.DataFrame(train_dataset))
eval_table  = wandb.Table(dataframe=pd.DataFrame(eval_dataset))


with wandb.init(project="alpaca_ft", job_type="split_data"):
    wandb.log({"train_dataset":train_table, "eval_dataset":eval_table})


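The shuffle above is unseeded, so every run produces a different train/eval split. A minimal sketch of a reproducible variant follows; the seed value 42 and the helper name `split_dataset` are assumptions for illustration and are not part of the original notebook.

```python
import random


def split_dataset(items, n_eval=1000, seed=42):
    """Shuffle a copy of `items` with a fixed seed and hold out the last `n_eval` items."""
    if not 0 < n_eval < len(items):
        raise ValueError("n_eval must be between 1 and len(items) - 1")
    shuffled = list(items)                  # copy, so the caller's list is left untouched
    random.Random(seed).shuffle(shuffled)   # private RNG, does not disturb the global state
    return shuffled[:-n_eval], shuffled[-n_eval:]


# toy data standing in for the formatted examples
toy = [{"example": f"sample {i}"} for i in range(5000)]
train_toy, eval_toy = split_dataset(toy)
print(len(train_toy), len(eval_toy))
```

Calling `split_dataset` twice with the same seed returns the same split, which keeps evaluation numbers comparable across runs.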

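#### Option 1: Pack multiple samples into one long sequence
As the figure below shows, multiple samples are concatenated and cut into long sequences of 1,024 tokens each.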
<img src=./imgs/d9f4c0c2.png width=auto height=200>

In [35]:
max_seq_len = 1024


def pack(dataset, max_seq_len=1024):
    tkds_ids = tokenizer([s["example"] for s in dataset])["input_ids"]  # token ids of every example
    
    # concatenate the token ids of all examples into one long stream
    all_token_ids = []
    for tokenized_input in tkds_ids:
        all_token_ids.extend(tokenized_input + [tokenizer.eos_token_id])    
    
    # cut the stream into chunks of max_seq_len + 1 tokens and derive the labels from each chunk
    packed_ds = []
    for i in range(0, len(all_token_ids), max_seq_len+1):
        input_ids = all_token_ids[i : i + max_seq_len+1]
        if len(input_ids) == (max_seq_len+1):
            packed_ds.append({"input_ids": input_ids[:-1], "labels": input_ids[1:]})  # < --- ‼️ ⛔️
            # if you use the model.output.loss you don't need to shift, it is done for you!
    return packed_ds


train_ds_packed = pack(train_dataset)
eval_ds_packed = pack(eval_dataset)
print(len(train_ds_packed[0]["input_ids"]), len(train_ds_packed[0]["labels"]))
print(len(train_ds_packed[44]["input_ids"]), len(train_ds_packed[44]["labels"]))

1024 1024
1024 1024
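To make the packing and the one-token shift easier to see, here is a toy version of the same idea that works on plain lists and needs no tokenizer. The token ids are made up (1 plays the role of BOS, 2 of EOS), and `pack_ids` is a hypothetical name for this sketch.

```python
def pack_ids(sequences, eos_id, max_seq_len):
    """Concatenate token-id lists (each followed by EOS) and cut them into (input, label) chunks."""
    stream = []
    for seq in sequences:
        stream.extend(seq + [eos_id])

    packed = []
    # each chunk needs max_seq_len + 1 tokens so that inputs and labels can be offset by one
    for i in range(0, len(stream), max_seq_len + 1):
        chunk = stream[i : i + max_seq_len + 1]
        if len(chunk) == max_seq_len + 1:  # the incomplete tail is dropped
            packed.append({"input_ids": chunk[:-1], "labels": chunk[1:]})
    return packed


toy_sequences = [[1, 10, 11], [1, 20, 21, 22], [1, 30]]  # made-up token ids
toy_packed = pack_ids(toy_sequences, eos_id=2, max_seq_len=4)
for row in toy_packed:
    print(row)
# {'input_ids': [1, 10, 11, 2], 'labels': [10, 11, 2, 1]}
# {'input_ids': [20, 21, 22, 2], 'labels': [21, 22, 2, 1]}
```

Each label is the token that follows the corresponding input position, and a chunk may start or end in the middle of a sample. The last two tokens of the stream do not fill a chunk and are discarded.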


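#### Option 2: Pad every sample to the same length
- Pad to the length of the longest sample in the batch
- Or pad to a fixed maximum length

The two padding options are shown below:  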
<img src=./imgs/ff512cfb.png width=auto height=200>

In [37]:
a = tokenizer(["My experiments are going strong!", 
           "I love Llamas"], 
          padding='longest',
          return_tensors="pt")



b = tokenizer(["My experiments are going strong!", 
           "I love Llamas"], 
          padding='max_length',
          max_length=10,
          return_tensors="pt")

print(a)
print(b)

{'input_ids': tensor([[    1,  1619, 15729,   526,  2675,  4549, 29991],
        [    1,   306,  5360,   365,  5288,   294,     2]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 0]])}
{'input_ids': tensor([[    1,  1619, 15729,   526,  2675,  4549, 29991,     2,     2,     2],
        [    1,   306,  5360,   365,  5288,   294,     2,     2,     2,     2]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 0, 0, 0],
        [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]])}
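The two padding modes can be sketched on plain lists as follows. This is an illustration of right-padding with an attention mask, not the tokenizer's actual implementation; `pad_batch` is a hypothetical helper, and the token ids are copied from the output above.

```python
def pad_batch(batch, pad_id, max_length=None):
    """Right-pad token-id lists to the longest one (or to `max_length`) and build attention masks."""
    target = max_length if max_length is not None else max(len(seq) for seq in batch)
    input_ids, attention_mask = [], []
    for seq in batch:
        if len(seq) > target:
            raise ValueError("sequence longer than the padding target")
        n_pad = target - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        attention_mask.append([1] * len(seq) + [0] * n_pad)  # 0 marks padding positions
    return {"input_ids": input_ids, "attention_mask": attention_mask}


batch = [
    [1, 1619, 15729, 526, 2675, 4549, 29991],  # "My experiments are going strong!"
    [1, 306, 5360, 365, 5288, 294],            # "I love Llamas"
]
longest = pad_batch(batch, pad_id=2)                 # like padding='longest'
fixed = pad_batch(batch, pad_id=2, max_length=10)    # like padding='max_length', max_length=10
print(longest["input_ids"][1], longest["attention_mask"][1])
print(fixed["input_ids"][0], fixed["attention_mask"][0])
```

Padding to the longest sample wastes fewer positions per batch, while a fixed maximum gives every batch the same shape.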


### (2). Save the data to W&B (Option 1)

In [38]:
import json
def save_jsonl(data, filename):
    with open(filename, 'w') as file:
        for entry in data:
            json.dump(entry, file)
            file.write('\n')


# dump everything to jsonl files
save_jsonl(train_ds_packed, "train_packed_alpaca.jsonl")
save_jsonl(eval_ds_packed, "eval_packed_alpaca.jsonl")


# Create a W&B artifact
packed_at = wandb.Artifact(
    name="packed_alpaca",
    type="dataset",
    description="Alpaca dataset packed in sequences",
    metadata={"max_seq_len":1024, "model_id":model_id})


packed_at.add_file("train_packed_alpaca.jsonl")
packed_at.add_file("eval_packed_alpaca.jsonl")


# log the artifact to the project, we can give this run a job_type like `preprocess`
with wandb.init(project="alpaca_ft", job_type="preprocess"):
    wandb.log_artifact(packed_at)



# III. Fine-Tuning Llama 2 with Alpaca-GPT4

## 1. Download the preprocessed dataset from W&B

In [2]:
import wandb
from pathlib import Path


run = wandb.init(project="alpaca_ft")
artifact = run.use_artifact('capecape/alpaca_ft/packed_alpaca:v0', type='dataset')
artifact_dir = Path(artifact.download())

[34m[1mwandb[0m: Currently logged in as: [33mskyl4rking[0m ([33mskyl4rking-fudan-university[0m). Use [1m`wandb login --relogin`[0m to force relogin



[34m[1mwandb[0m: Downloading large artifact packed_alpaca:v0, 130.66MB. 2 files... 
[34m[1mwandb[0m:   2 of 2 files downloaded.  
Done. 0:0:1.4


Because the files were saved as JSON Lines, they can be opened and inspected directly with `json`.

In [3]:
import json


def load_jsonl(filename):
    data = []
    with open(filename, 'r') as file:
        for line in file:
            data.append(json.loads(line))
    return data


train_ds_packed = load_jsonl(artifact_dir/"train_packed_alpaca.jsonl")
eval_ds_packed = load_jsonl(artifact_dir/"eval_packed_alpaca.jsonl")

print(len(train_ds_packed[0]["input_ids"]), len(train_ds_packed[0]["labels"]))
print(len(train_ds_packed[44]["input_ids"]), len(train_ds_packed[44]["labels"]))

1024 1024
1024 1024


## 2. Load the dataset from disk with Hugging Face Datasets

In [4]:
import wandb
from datasets import load_from_disk # for some reason load_dataset gives an error


run = wandb.init(project="alpaca_ft")
artifact = run.use_artifact('capecape/alpaca_ft/packed_alpaca_hf:v0', type='dataset')
artifact_dir = artifact.download()
ds_packed = load_from_disk(artifact_dir)


# we are back where we started!
train_ds_packed = ds_packed["train"]
eval_ds_packed  = ds_packed["eval"]
max_seq_len = artifact.metadata["max_seq_len"]



[34m[1mwandb[0m: Downloading large artifact packed_alpaca_hf:v0, 178.69MB. 7 files... 
[34m[1mwandb[0m:   7 of 7 files downloaded.  
Done. 0:0:1.5


## 3. Build the PyTorch DataLoader
- Inspecting the dataset shows that the labels are simply the inputs shifted by one position, as in the figure below:  
    <img src=./imgs/b3f82f6a.png width=auto height=200>

In [5]:
from torch.utils.data import DataLoader
from transformers import default_data_collator


batch_size = 8  # I have an A100 GPU with 40GB of RAM 😎


train_dataloader = DataLoader(
    train_ds_packed,
    batch_size=batch_size,
    collate_fn=default_data_collator, # we don't need any special collator 😎
)


eval_dataloader = DataLoader(
    eval_ds_packed,
    batch_size=batch_size,
    collate_fn=default_data_collator,
    shuffle=False,
)

b = next(iter(train_dataloader))
b.keys(), b["input_ids"][0][:25], b["labels"][0][:25]

(dict_keys(['input_ids', 'labels']),
 tensor([    1, 13866,   338,   385, 15278,   393, 16612,   263,  3414, 29889,
         14350,   263,  2933,   393,  7128,  2486,  1614,  2167,   278,  2009,
         29889,    13,    13,  2277, 29937]),
 tensor([13866,   338,   385, 15278,   393, 16612,   263,  3414, 29889, 14350,
           263,  2933,   393,  7128,  2486,  1614,  2167,   278,  2009, 29889,
            13,    13,  2277, 29937,  2799]))

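## 3. Training preparation (helper functions, model, and training setup)
- Only some layers are trained rather than all parameters, so the remaining layers are frozen.
- Automatic mixed precision ([Automatic Mixed Precision](https://wandb.ai/wandb_fc/tips/reports/How-To-Use-Autocast-in-PyTorch--VmlldzoyMTk4NTky)) is used: computing in half precision speeds up training.
- Gradient checkpointing ([Gradient Checkpointing](https://blog.csdn.net/Solo95/article/details/131606918)): some forward activations are discarded and the rest are kept. When the backward pass needs a discarded activation, it is recomputed with another forward pass. Not storing those activations is what saves GPU memory.

### (1). Hyperparameters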

In [6]:
from types import SimpleNamespace


gradient_accumulation_steps = 32 // batch_size

# hyperparameters
config = SimpleNamespace(
    model_id='meta-llama/Llama-2-7b-hf',
    dataset_name="alpaca-gpt4",
    precision="bf16",  # faster and better than fp16, requires new GPUs
    n_freeze=24,  # How many layers we don't train, LLama 7B has 32.
    lr=2e-4,
    n_eval_samples=10, # How many samples to generate on validation
    max_seq_len=max_seq_len, # Length of the sequences to pack
    epochs=3,  # we do 3 passes over the dataset.
    gradient_accumulation_steps=gradient_accumulation_steps,  # how many micro-batches we accumulate before each optimizer update, simulates larger batch sizes
    batch_size=batch_size,  # what my GPU can handle, depends on how many layers are we training  
    log_model=False,  # upload the model to W&B?
    mom=0.9, # optim param
    gradient_checkpointing = True,  # saves even more memory
    freeze_embed = True,  # why train this? let's keep them frozen ❄️
)


config.total_train_steps = config.epochs * len(train_dataloader) // config.gradient_accumulation_steps
print(config.total_train_steps)

1050
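The step count above follows from plain integer arithmetic. The sketch below reproduces it; `n_batches = 1400` is an assumed value for `len(train_dataloader)`, chosen because it is consistent with the printed 1050, and is not stated in the notebook.

```python
def total_train_steps(n_batches, epochs, grad_accum_steps):
    """Optimizer updates over the whole run, using the same integer arithmetic as the config cell."""
    return epochs * n_batches // grad_accum_steps


batch_size = 8
grad_accum_steps = 32 // batch_size               # 4 micro-batches per optimizer update
effective_batch = batch_size * grad_accum_steps   # 32 packed sequences per update
n_batches = 1400  # assumption: len(train_dataloader), consistent with the printed 1050

steps = total_train_steps(n_batches, epochs=3, grad_accum_steps=grad_accum_steps)
print(effective_batch, steps)  # 32 1050
```

With sequences of 1,024 tokens, each optimizer update under these settings sees 32 × 1,024 = 32,768 tokens.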


### (2). Load the model

In [7]:
import torch
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained(
    config.model_id,
    device_map=0,
    trust_remote_code=True,
    low_cpu_mem_usage=True,
    torch_dtype=torch.bfloat16,
    use_cache=False,
)


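### (3). Freeze some parameters & gradient checkpointing
- The first `n_freeze` transformer blocks are frozen (the cell below sets `n_freeze = 30` rather than using the 24 stored in `config`)
- The model has 32 Llama layers in total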

In [8]:
def param_count(m):
    params = sum([p.numel() for p in m.parameters()])/1_000_000
    trainable_params = sum([p.numel() for p in m.parameters() if p.requires_grad])/1_000_000
    print(f"Total params: {params:.2f}M, Trainable: {trainable_params:.2f}M")
    return params, trainable_params

params, trainable_params = param_count(model)

n_freeze = 30 # you can play with this parameter
print(len(model.model.layers)) # 32 layers in total
# print(model.model.layers)

# freeze layers (disable gradients)
for param in model.parameters(): param.requires_grad = False
for param in model.lm_head.parameters(): param.requires_grad = True
for param in model.model.layers[n_freeze:].parameters(): param.requires_grad = True

# Just freeze embeddings for small memory decrease
if config.freeze_embed:
    model.model.embed_tokens.weight.requires_grad_(False)
    
if config.gradient_checkpointing:
    model.gradient_checkpointing_enable(gradient_checkpointing_kwargs={"use_reentrant": False})  # <- pytorch changed this
    


Total params: 6738.42M, Trainable: 6738.42M
32


In [9]:
if config.gradient_checkpointing:
    model.gradient_checkpointing_enable(gradient_checkpointing_kwargs={"use_reentrant": False})  # <- pytorch changed this
    
params, trainable_params = param_count(model)

Total params: 6738.42M, Trainable: 535.84M
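The two counts printed above can be checked by hand. The sketch below assumes the commonly published Llama-2-7B dimensions (hidden size 4096, MLP size 11008, vocabulary 32000, 32 layers, no bias terms); these numbers come from the model's public configuration, not from this notebook.

```python
# Assumed Llama-2-7B dimensions (from the public model config, not from this notebook)
hidden, intermediate, vocab, n_layers = 4096, 11008, 32000, 32

attn = 4 * hidden * hidden        # q, k, v and o projections
mlp = 3 * hidden * intermediate   # gate, up and down projections
norms = 2 * hidden                # two RMSNorm weight vectors per block
per_layer = attn + mlp + norms

embed = vocab * hidden            # input embedding matrix
lm_head = vocab * hidden          # output projection (separate from the embedding)
final_norm = hidden

total = n_layers * per_layer + embed + lm_head + final_norm

n_freeze = 30                     # value used in the freezing cell above
trainable = (n_layers - n_freeze) * per_layer + lm_head

print(f"Total params: {total / 1e6:.2f}M, Trainable: {trainable / 1e6:.2f}M")
# Total params: 6738.42M, Trainable: 535.84M
```

Under these assumptions, the trainable share is the last two transformer blocks plus `lm_head`, about 8% of the model.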


### (4). Optimizer & Scheduler

In [10]:
from transformers import get_cosine_schedule_with_warmup


optim = torch.optim.Adam(model.parameters(), lr=config.lr, betas=(0.9,0.99), eps=1e-5)
scheduler = get_cosine_schedule_with_warmup(
    optim,
    num_training_steps=config.total_train_steps,
    num_warmup_steps=config.total_train_steps // 10,
)


def loss_fn(x, y):
    "A Flat CrossEntropy" 
    return torch.nn.functional.cross_entropy(x.view(-1, x.shape[-1]), y.view(-1))
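To see what the scheduler does to the learning rate, here is a dependency-free sketch of linear warmup followed by a half-cosine decay. It assumes the default half-cosine shape and is meant as an illustration of the curve, not as the library's code; `cosine_lr` is a hypothetical name.

```python
import math


def cosine_lr(step, base_lr, total_steps, warmup_steps):
    """Linear warmup to `base_lr`, then half-cosine decay to zero at `total_steps`."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * max(0.0, 0.5 * (1.0 + math.cos(math.pi * progress)))


total_steps = 1050
warmup_steps = total_steps // 10  # 105 warmup steps, as in the cell above
for step in (0, 50, 105, 500, 1050):
    print(step, f"{cosine_lr(step, 2e-4, total_steps, warmup_steps):.2e}")
```

The rate starts at zero, reaches the configured `lr` of 2e-4 at the end of warmup, and decays back to zero by the last step.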


### (5). Helper functions for evaluation

In [11]:
from types import SimpleNamespace
from transformers import GenerationConfig

gen_config = GenerationConfig.from_pretrained(config.model_id)
test_config = SimpleNamespace(
    max_new_tokens=256,
    gen_config=gen_config)

In [12]:
from transformers import GenerationConfig
gen_config = GenerationConfig.from_pretrained(config.model_id)


def generate(prompt, max_new_tokens=100, gen_config=gen_config):
    with torch.inference_mode():
        tokenized_prompt = tokenizer(prompt, return_tensors='pt')['input_ids'].cuda()
        output = model.generate(tokenized_prompt, 
                            max_new_tokens=max_new_tokens, 
                            generation_config=gen_config)
    return tokenizer.decode(output[0][len(tokenized_prompt[0]):], skip_special_tokens=True)

In [13]:
import wandb
from tqdm.auto import tqdm

def prompt_table(examples, log=False, table_name="predictions"):
    table = wandb.Table(columns=["prompt", "generation", "concat", "output", "max_new_tokens", "temperature", "top_p"])
    for example in tqdm(examples, leave=False):
        prompt, gpt4_output = example["prompt"], example["output"]
        out = generate(prompt, test_config.max_new_tokens, test_config.gen_config)
        table.add_data(prompt, out, prompt+out, gpt4_output, test_config.max_new_tokens, test_config.gen_config.temperature, test_config.gen_config.top_p)
    if log:
        wandb.log({table_name:table})
    return table


def to_gpu(tensor_dict):
    return {k: v.to('cuda') for k, v in tensor_dict.items()}

class Accuracy:
    "A simple Accuracy function compatible with HF models"
    def __init__(self):
        self.count = 0
        self.tp = 0.
    def update(self, logits, labels):
        logits, labels = logits.argmax(dim=-1).view(-1).cpu(), labels.view(-1).cpu()
        tp = (logits == labels).sum()
        self.count += len(logits)
        self.tp += tp
        return tp / len(logits)
    def compute(self):
        return self.tp / self.count

In [14]:
@torch.no_grad()
def validate():
    model.eval()
    eval_acc = Accuracy()
    total_loss, n_batches = 0.0, 0
    for step, batch in enumerate(tqdm(eval_dataloader)):
        batch = to_gpu(batch)
        with torch.amp.autocast("cuda", dtype=torch.bfloat16):
            out = model(**batch)
            loss = loss_fn(out.logits, batch["labels"])  # you could use out.loss and not shift the dataset
        total_loss += loss.item()
        n_batches += 1
        eval_acc.update(out.logits, batch["labels"])
    # we log results at the end: the mean loss over all eval batches, not just the last one
    wandb.log({"eval_loss": total_loss / max(n_batches, 1),
               "eval_accuracy": eval_acc.compute()})
    prompt_table(eval_dataset[:config.n_eval_samples], log=True)
    model.train()

## 4. Training

In [15]:
from pathlib import Path
def save_model(model, model_name, models_folder="models", log=False):
    """Save the model to wandb as an artifact
    Args:
        model (nn.Module): Model to save.
        model_name (str): Name of the model.
        models_folder (str, optional): Folder to save the model. Defaults to "models".
    """
    model_name = f"{wandb.run.id}_{model_name}"
    file_name = Path(f"{models_folder}/{model_name}")
    file_name.parent.mkdir(parents=True, exist_ok=True)
    model.save_pretrained(file_name, safe_serialization=True)
    # save tokenizer for easy inference, in the same folder as the model weights
    tokenizer = AutoTokenizer.from_pretrained(model.name_or_path)
    tokenizer.save_pretrained(file_name)
    if log:
        at = wandb.Artifact(model_name, type="model")
        at.add_dir(file_name)
        wandb.log_artifact(at)

In [None]:
from tqdm import tqdm


wandb.init(project="alpaca_ft", # the project I am working on
           tags=["baseline","7b"],
           job_type="train",
           config=config) # the Hyperparameters I want to keep track of


# Training
acc = Accuracy()
model.train()
train_step = 0
pbar = tqdm(total=config.total_train_steps)
for epoch in range(config.epochs):
    for step, batch in enumerate(train_dataloader):
        batch = to_gpu(batch)
        with torch.amp.autocast("cuda", dtype=torch.bfloat16):
            out = model(**batch)
            loss = loss_fn(out.logits, batch["labels"]) / config.gradient_accumulation_steps  # you could use out.loss and not shift the dataset
        loss.backward()  # backward runs outside the autocast context
        if (step + 1) % config.gradient_accumulation_steps == 0:  # update only after a full group of micro-batches
            # we can log the metrics to W&B
            wandb.log({"train/loss": loss.item() * config.gradient_accumulation_steps,
                       "train/accuracy": acc.update(out.logits, batch["labels"]),
                       "train/learning_rate": scheduler.get_last_lr()[0],
                       "train/global_step": train_step})
            optim.step()
            scheduler.step()
            optim.zero_grad(set_to_none=True)
            train_step += 1
            pbar.update(1)
    validate()
pbar.close()
# we save the model checkpoint at the end
save_model(
    model,
    model_name=config.model_id.replace("/", "_"),
    models_folder="models/", log=config.log_model)
    
wandb.finish()

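The division of the loss by `gradient_accumulation_steps` in the loop above is what makes accumulated micro-batches behave like one large batch. A toy check on a one-parameter least-squares problem shows the equivalence; the data, the weight, and the helper `grad_mse` are made up for this sketch.

```python
def grad_mse(w, xs, ys):
    """Gradient of mean((w * x - y) ** 2) with respect to the scalar weight w."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


w = 0.5
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0, 14.0, 16.0]

# gradient of one large batch of 8 samples
full_batch_grad = grad_mse(w, xs, ys)

# the same 8 samples as 4 micro-batches of 2, each gradient scaled by 1 / accum_steps
accum_steps, micro = 4, 2
accumulated = 0.0
for i in range(accum_steps):
    mx = xs[i * micro:(i + 1) * micro]
    my = ys[i * micro:(i + 1) * micro]
    accumulated += grad_mse(w, mx, my) / accum_steps  # mirrors `loss / gradient_accumulation_steps`

print(full_batch_grad, accumulated)
```

The two values agree because all micro-batches have the same size; with unequal micro-batches the scaling would need to be weighted by sample count.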

In [None]:
# evaluation
with wandb.init(project="alpaca_ft", # the project I am working on
           job_type="eval",
           config=config): # the Hyperparameters I want to keep track of
    model.eval()
    prompt_table(eval_dataset[:250], log=True, table_name="eval_predictions")

In [None]:
import openai


def gpt4_judge(instruction, gen1, gen2, model="gpt-4"):
    system_prompt = ("You will be presented with a choice of two possible responses for an instruction.\n"
                     "You have to pick the best one and give a reason why.\n"
                     "The response should follow the instructions and use the provided context if there is some.\n"
                     "If both answers are equivalent, pick the value 0")
    message = "{instruction}\n Answer 1: \n{gen1}\n Answer 2:\n{gen2}".format(instruction=instruction, gen1=gen1, gen2=gen2)
    completion = openai.chat.completions.create(
        model=model,
        messages=[{"role": "system", "content": system_prompt},
                  {"role": "user", "content": message}],
        function_call={"name": "make_choice"},  # force the model to answer through this function
        functions=[{
            "name": "make_choice",
            "description": "Select the best generation and explain why",
            "parameters": {
                "type": "object",
                "properties": {
                    "choice": {
                        "type": "integer",
                        "description": "the chosen alternative, zero if equivalent",
                    },
                    "argument": {
                        "type": "string",
                        "description": "Reason why the choice was made",
                    },
                },
                "required": ["choice", "argument"],  # part of the JSON schema, next to "properties"
            },
        }],
    )
    return completion
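The judge returns the raw completion, so the choice still has to be decoded from the function-call arguments. A minimal sketch follows, assuming the arguments arrive as a JSON string with the `choice` and `argument` fields declared above; `parse_choice` and the sample strings are hypothetical and only illustrate the decoding and tallying step.

```python
import json
from collections import Counter


def parse_choice(arguments_json):
    """Decode the JSON arguments of a `make_choice` call into (choice, argument)."""
    args = json.loads(arguments_json)
    return int(args["choice"]), args.get("argument", "")


# Made-up argument strings standing in for real judge responses. With a real response
# the string would be read from completion.choices[0].message.function_call.arguments
fake_arguments = [
    '{"choice": 1, "argument": "Answer 1 follows the instruction more closely."}',
    '{"choice": 2, "argument": "Answer 2 uses the provided context."}',
    '{"choice": 0, "argument": "Both answers are equivalent."}',
    '{"choice": 1, "argument": "Answer 1 is more concise."}',
]

tally = Counter(parse_choice(a)[0] for a in fake_arguments)
print(tally)
```

Counting the choices over many evaluation prompts gives a simple win/tie/loss summary for the two models being compared.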
