<a href="https://colab.research.google.com/github/119020/NLP_2025_Spring_Materials/blob/main/Tutorial_LLM_training_phoenix_lora_baseline.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Tutorial 2: Train your own LLMs
### **Course Name:** Large Language Models **<font color="red">(CSC 6203)</font>**
 *Presenter: Ke Ji*




This notebook provides an overview of using the `transformers` Python package to train a custom model efficiently. It covers the following steps:

1. Loading the model, tokenizer, and dataset with Hugging Face functions.
2. Performing basic data cleaning.
3. Training the model with parameter-efficient techniques that include quantization (QLoRA in this notebook).
4. Evaluating the model's performance on a test set.
5. Saving your custom model and preparing it for deployment.

## Preliminary Preparation

Before proceeding with model training, ensure your environment is properly configured by following these steps:

1. Install the necessary Python packages.
2. Import the required libraries.

In [None]:
## Install the necessary Python packages with pip.
!pip install -q h5py typing-extensions wheel fschat
# !pip install -q accelerate==0.21.0 peft==0.4.0 bitsandbytes==0.40.2 transformers==4.31.0 fschat
!pip install -q -U bitsandbytes
!pip install -q -U git+https://github.com/huggingface/transformers.git
!pip install -q -U git+https://github.com/huggingface/peft.git
!pip install -q -U git+https://github.com/huggingface/accelerate.git
!pip install -q datasets


You can use this command to check your GPU resources.

In [None]:
!nvidia-smi

Fri Mar 15 06:35:23 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  Tesla T4                       Off | 00000000:00:04.0 Off |                    0 |
| N/A   56C    P8              11W /  70W |      0MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                    

## Load Pre-trained model and tokenizer

First, let's load the model we are going to use: phoenix-inst-chat-7b. The model has about 7 billion parameters, so its weights alone take roughly 14 GB in 16-bit precision (about 28 GB in 32-bit precision).

Because our GPU resources are limited, we use QLoRA to reduce memory usage.
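
To make the memory argument concrete, here is a rough back-of-the-envelope estimate of the space needed just to store the weights at different precisions. It is a sketch under simplifying assumptions: the parameter count is rounded to exactly 7 billion, and quantization constants, activations, gradients, and optimizer state are ignored. The helper name `weight_memory_gib` is ours, not part of any library.

```python
# Back-of-the-envelope weight memory for a model with ~7 billion parameters.
# Assumption: exactly 7e9 parameters; quantization constants, activations,
# gradients and optimizer state are ignored.
N_PARAMS = 7_000_000_000

BYTES_PER_PARAM = {
    "fp32": 4.0,
    "fp16/bf16": 2.0,
    "int8": 1.0,
    "4-bit (nf4/fp4)": 0.5,
}

def weight_memory_gib(n_params: int, bytes_per_param: float) -> float:
    """Return the memory needed to store the weights, in GiB."""
    return n_params * bytes_per_param / 1024**3

for name, size in BYTES_PER_PARAM.items():
    print(f"{name:>16}: {weight_memory_gib(N_PARAMS, size):6.2f} GiB")
```

On the Tesla T4 shown above (15360 MiB, i.e. 15 GiB), 16-bit weights would leave almost no room for activations, while 4-bit weights leave most of the memory free for training.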

In [None]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

## Quantization type (fp4 or nf4). According to the QLoRA paper, nf4 should be used when training 4-bit base models (e.g. with LoRA adapters).
bnb_4bit_quant_type = "nf4"

# Activate nested quantization for 4-bit base models (double quantization)
use_nested_quant = True

## If you don't have enough GPU resources, you can use an LLM with fewer parameters (such as Qwen2.5-0.5b or microsoft/phi-1_5).
model_id = "FreedomIntelligence/phoenix-inst-chat-7b"
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=use_nested_quant,
    bnb_4bit_quant_type=bnb_4bit_quant_type,
    bnb_4bit_compute_dtype=torch.bfloat16
)
## Remember to use LoRA or QLoRA (parameter-efficient fine-tuning methods) if you use an LLM with more parameters.
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb_config, device_map={"":0})

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Loading checkpoint shards:   0%|          | 0/5 [00:00<?, ?it/s]

Next, we preprocess the model to prepare it for training, using the `prepare_model_for_kbit_training` function from PEFT.

In [None]:
from peft import prepare_model_for_kbit_training

model.gradient_checkpointing_enable()
model = prepare_model_for_kbit_training(model)

In [None]:
def print_trainable_parameters(model):
    """
    Prints the number of trainable parameters in the model.
    """
    trainable_params = 0
    all_param = 0
    for _, param in model.named_parameters():
        all_param += param.numel()
        if param.requires_grad:
            trainable_params += param.numel()
    print(
        f"trainable params: {trainable_params} || all params: {all_param} || trainable%: {100 * trainable_params / all_param}"
    )

In [None]:
from peft import LoraConfig, get_peft_model

## You can try different parameter-efficient strategies for model training. For more info, please check https://github.com/huggingface/peft
config = LoraConfig(
    r=8,
    lora_alpha=8,
    target_modules=["query_key_value"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)

model = get_peft_model(model, config)

In [None]:
from fastchat.conversation import get_conv_template
device = "cuda"
model.eval()

@torch.no_grad()
def generate(prompt):
    input_ids = tokenizer.encode(prompt, add_special_tokens=False, return_tensors='pt').to(device)
    outputs = model.generate(input_ids, do_sample=False, max_new_tokens=1024)
    return tokenizer.decode(*outputs, skip_special_tokens=True)

conv = get_conv_template('phoenix')
conv.append_message(conv.roles[0], "Are you a good AI assistant?")
# conv.append_message(conv.roles[1], "No, I am evil!")
# conv.append_message(conv.roles[0], "Can you tell me what you can do?")

conv.append_message(conv.roles[1], None)
prompt = conv.get_prompt()
print(prompt)
response = generate(prompt)
print("-"*80)
print(response)

A chat between a curious human and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the human's questions.

Human: <s>Are you a good AI assistant?</s>Assistant: <s>
--------------------------------------------------------------------------------
A chat between a curious human and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the human's questions.

Human: Are you a good AI assistant?Assistant: As an AI language model, I am designed to assist users with a wide range of tasks and questions. I am constantly learning and improving my capabilities to provide the best possible responses to the questions and tasks that I am given. I am not a person, but rather a computer program that has been trained on a large dataset of text.
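
To see what the conversation template does, the prompt printed above can be rebuilt with plain string operations. This is only a sketch: the layout is inferred from the printed prompt rather than taken from the fastchat source, and `build_phoenix_prompt` is a hypothetical helper.

```python
# A plain-Python sketch of the prompt layout seen in the output above.
# Assumption: the layout is inferred from the printed prompt, not from the
# fastchat source; build_phoenix_prompt is a hypothetical helper.
SYSTEM = (
    "A chat between a curious human and an artificial intelligence assistant. "
    "The assistant gives helpful, detailed, and polite answers to the human's questions."
)

def build_phoenix_prompt(turns, system=SYSTEM):
    """turns: list of (role, message) pairs; a message of None opens the reply slot."""
    prompt = system + "\n\n"
    for role, message in turns:
        if message is None:
            # Open slot: the model continues generating from here.
            prompt += f"{role}: <s>"
        else:
            prompt += f"{role}: <s>{message}</s>"
    return prompt

prompt = build_phoenix_prompt([("Human", "Are you a good AI assistant?"), ("Assistant", None)])
print(prompt)
```

Each completed turn is wrapped in `<s>` and `</s>`, and the prompt ends with an opened assistant turn so that generation continues as the assistant's reply.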


## Data Preparation

Let's load a dataset to fine-tune our model on. We use Huatuo26M-Lite, a Chinese medical question-answering dataset, and convert each question/answer pair into a two-turn conversation.

In [None]:
from datasets import load_dataset

# data = load_dataset("Abirate/english_quotes")
dataset = load_dataset("FreedomIntelligence/Huatuo26M-Lite")
dataset = dataset['train'].map(lambda sample: {"conversations": [{"role": "human", "value": sample['question']}, {"role": "gpt", "value": sample['answer']}]}, batched=False)
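
The effect of the `map` call on a single record can be sketched in plain Python. The record below is made up for illustration; only the field names `question` and `answer` come from the code above. Note that `map` adds the returned `conversations` column to each sample, while this sketch returns only the new field.

```python
def to_conversation(sample):
    """Mirror the lambda passed to `map`: wrap one QA pair as a two-turn chat."""
    return {
        "conversations": [
            {"role": "human", "value": sample["question"]},
            {"role": "gpt", "value": sample["answer"]},
        ]
    }

# Hypothetical record with the same field names as the dataset.
record = {"question": "What causes a fever?", "answer": "Usually an infection."}
converted = to_conversation(record)
print(converted["conversations"])
```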

Then split the dataset into a training set and a validation set. To keep the run short, we sample only 40 training examples and 8 validation examples.

In [None]:
from torch.utils.data import random_split
train_dataset_size, val_dataset_size = 40, 8
train_dataset, val_dataset, _ = random_split(dataset, [train_dataset_size, val_dataset_size, len(dataset)-train_dataset_size-val_dataset_size])

### Customized Dataset
Create a specialized dataset class named `InstructDataset` to handle our custom dataset.

In [None]:
import json, copy
import transformers
from typing import Dict, Sequence, List
from dataclasses import dataclass
from torch.utils.data import Dataset

IGNORE_INDEX = -100
DEFAULT_PAD_TOKEN = "<pad>"
DEFAULT_BOS_TOKEN = "<s>"
DEFAULT_EOS_TOKEN = "</s>"
DEFAULT_UNK_TOKEN = "<unk>"
default_conversation = get_conv_template('phoenix')

class InstructDataset(Dataset):
    def __init__(self, data: Sequence, tokenizer: transformers.PreTrainedTokenizer) -> None:
        super().__init__()
        self.tokenizer = tokenizer
        self.data = data

    def __len__(self):
        return len(self.data)

    def __getitem__(self, index) -> Dict[str, torch.Tensor]:
        sources = self.data[index]
        if isinstance(index, int):
            sources = [sources]
        data_dict = preprocess([e['conversations'] for e in sources], self.tokenizer)
        if isinstance(index, int):
            data_dict = dict(input_ids=data_dict["input_ids"][0], labels=data_dict["labels"][0])
        return data_dict

def preprocess(
        sources: Sequence[str],
        tokenizer: transformers.PreTrainedTokenizer,
        max_length=1024
) -> Dict:
    ## add end signal and concatenate together
    conversations = []
    intermediates = []
    for source in sources:
        header = f"{default_conversation.system_message}"
        conversation, intermediate = _add_speaker_and_signal(header, source)
        conversations.append(conversation)
        intermediates.append(intermediate)

    ## tokenize conversations
    conversations_tokenized = _tokenize_fn(conversations, tokenizer)
    input_ids = conversations_tokenized["input_ids"]
    targets = copy.deepcopy(input_ids)

    ## keep only machine responses as targets
    assert len(targets) == len(intermediates)
    for target, inters in zip(targets, intermediates):
        mask = torch.zeros_like(target, dtype=torch.bool)
        for inter in inters:
            tokenized = _tokenize_fn(inter, tokenizer)
            start_idx = tokenized["input_ids"][0].size(0) - 1
            end_idx = tokenized["input_ids"][1].size(0)
            mask[start_idx:end_idx] = True
        target[~mask] = IGNORE_INDEX

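    ## truncate each sequence (not the list of sequences) to max_length tokens
    input_ids = [ids[:max_length] for ids in input_ids]
    targets = [target[:max_length] for target in targets]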
    return dict(input_ids=input_ids, labels=targets)

def _add_speaker_and_signal(header, source, get_conversation=True):
    BEGIN_SIGNAL = DEFAULT_BOS_TOKEN
    END_SIGNAL = DEFAULT_EOS_TOKEN
    conversation = header
    intermediate = []
    for sentence in source:
        from_str = sentence["role"]
        if from_str.lower() == "human":
            from_str = default_conversation.roles[0]
        elif from_str.lower() == "gpt":
            from_str = default_conversation.roles[1]
        else:
            from_str = 'unknown'
        # store the string w/o and w/ the response
        value = (from_str + ": " + BEGIN_SIGNAL + sentence["value"] + END_SIGNAL)
        if sentence["role"].lower() == "gpt":
            start = conversation + from_str + ": " + BEGIN_SIGNAL
            end = conversation + value
            intermediate.append([start, end])
        if get_conversation:
            conversation += value
    return conversation, intermediate

## Tokenize each string separately and record the unpadded lengths
def _tokenize_fn(strings: Sequence[str], tokenizer: transformers.PreTrainedTokenizer) -> Dict:
    tokenized_list = [
        tokenizer(
            text,
            return_tensors="pt",
            padding="longest",
            max_length=tokenizer.model_max_length,
            truncation=True,
        ) for text in strings
    ]
    input_ids = labels = [tokenized.input_ids[0] for tokenized in tokenized_list]
    input_ids_lens = labels_lens = [
        tokenized.input_ids.ne(tokenizer.pad_token_id).sum().item()
        for tokenized in tokenized_list
    ]
    return dict(
        input_ids=input_ids,
        labels=labels,
        input_ids_lens=input_ids_lens,
        labels_lens=labels_lens,
    )

@dataclass
class DataCollatorForSupervisedDataset(object):
    tokenizer: transformers.PreTrainedTokenizer

    def __call__(self, instances: Sequence[Dict]) -> Dict[str, torch.Tensor]:
        input_ids, labels = tuple([instance[key] for instance in instances] for key in ("input_ids", "labels"))
        input_ids = torch.nn.utils.rnn.pad_sequence(
            input_ids,
            batch_first=True,
            padding_value=self.tokenizer.pad_token_id)
        labels = torch.nn.utils.rnn.pad_sequence(labels, batch_first=True, padding_value=IGNORE_INDEX)
        return dict(
            input_ids=input_ids,
            labels=labels,
            attention_mask=input_ids.ne(self.tokenizer.pad_token_id),
        )
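
The key idea in `preprocess` is that only the assistant's tokens contribute to the loss; everything else is replaced with `IGNORE_INDEX`. A minimal sketch of that masking with plain lists, using made-up token ids and a hypothetical helper `mask_labels`:

```python
IGNORE_INDEX = -100  # same sentinel as above; PyTorch's cross-entropy ignores it by default

def mask_labels(input_ids, response_spans):
    """Copy input_ids, keeping only tokens inside the [start, end) response spans."""
    labels = [IGNORE_INDEX] * len(input_ids)
    for start, end in response_spans:
        labels[start:end] = input_ids[start:end]
    return labels

# Toy ids: positions 0-4 stand for the system prompt and human turn,
# positions 5-8 for the assistant reply (hypothetical values).
input_ids = [11, 12, 13, 14, 15, 21, 22, 23, 2]
labels = mask_labels(input_ids, [(5, 9)])
print(labels)  # [-100, -100, -100, -100, -100, 21, 22, 23, 2]
```

The real code finds each span by tokenizing the conversation up to the start and up to the end of every assistant turn, then marks the positions in between.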

In [None]:
train_dataset = InstructDataset(train_dataset, tokenizer)
val_dataset = InstructDataset(val_dataset, tokenizer)
data_collator = DataCollatorForSupervisedDataset(tokenizer=tokenizer)
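
What the collator does to a batch can be sketched without PyTorch: shorter examples are right-padded, labels are padded with `IGNORE_INDEX`, and the attention mask marks real tokens. The pad id and the function name `collate` are illustrative assumptions. One difference from the class above: this sketch derives the mask from sequence lengths, whereas the class compares `input_ids` against the pad token id.

```python
IGNORE_INDEX = -100
PAD_ID = 0  # hypothetical pad token id, for illustration only

def collate(instances, pad_id=PAD_ID):
    """Right-pad a batch of variable-length examples to the longest one."""
    max_len = max(len(x["input_ids"]) for x in instances)
    batch = {"input_ids": [], "labels": [], "attention_mask": []}
    for x in instances:
        n_pad = max_len - len(x["input_ids"])
        batch["input_ids"].append(x["input_ids"] + [pad_id] * n_pad)
        # Padded label positions must not contribute to the loss.
        batch["labels"].append(x["labels"] + [IGNORE_INDEX] * n_pad)
        batch["attention_mask"].append([1] * len(x["input_ids"]) + [0] * n_pad)
    return batch

batch = collate([
    {"input_ids": [1, 2, 3], "labels": [IGNORE_INDEX, 2, 3]},
    {"input_ids": [4], "labels": [4]},
])
print(batch)
```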

In [None]:
sample_data = train_dataset[1]

print("=" * 80)
print("Debugging: ")
print(sample_data)
print("-" * 80)
print(f"input_ids:\n{tokenizer.decode(sample_data['input_ids'])}")
z = [token if token != IGNORE_INDEX else tokenizer.unk_token_id for token in sample_data['labels']]
print("-" * 80)
print(f"labels:\n{tokenizer.decode(z)}")
print("=" * 80)


Debugging: 
{'input_ids': tensor([    36,  44799,   5299,    267,  99579,   7384,    530,    660,  48763,
         64225, 103800,     17,   1387, 103800,  19502,  66799,     15,  53180,
            15,    530, 214804,  41259,    427,    368,   7384,   1256,  11732,
          6149, 114330,     29,    210,      1, 105430,  92968,   1954,     25,
          2129,   8627,   7786,  37761,     42, 127696,     24,     17,     19,
         44270,   2950,  55768,   2498,   3101,      2,   9096,  61339,     29,
           210,      1,   7136,  59280,  27557,    355, 159110,  92968,   1954,
            25,   2129,   8627,   7786,  37761,     42, 127696,     24,     17,
            19,    355, 134686, 135128,  55768,    373,    420,   7436,  12142,
         93152,  82131,    355,  22523,  50238,  12160,  20885,    355,  11333,
         36347,    420,   5022,  59280,  92968,  20418,  90899,   2808,  28723,
         19367,    355,   7436,  12142,  18412,   4672,   7544,  23875,  20451,
           909,

## Training

### General Training Hyperparameters

In [None]:
## Set training parameters
## Some hyperparameters are very important to the final training performance, such as learning_rate, per_device_train_batch_size, and warmup_ratio.
training_arguments = transformers.TrainingArguments(
    output_dir="./checkpoints",
    num_train_epochs=1,
    per_device_train_batch_size=2,
    per_device_eval_batch_size=2,
    gradient_accumulation_steps=2,
    optim='paged_adamw_32bit',
    save_steps=0,
    logging_steps=1,
    learning_rate=2e-7,
    weight_decay=0.001,
    max_steps=-1,
    warmup_ratio=0.03,
    group_by_length=True,
    lr_scheduler_type="cosine",
    gradient_checkpointing=True,
    report_to="none"
)
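
To build intuition for `warmup_ratio` and `lr_scheduler_type="cosine"`, the sketch below computes a learning rate that rises linearly during warmup and then follows half a cosine wave down to zero. It is a plain-Python approximation under stated assumptions: the warmup length is taken as the ceiling of `warmup_ratio` times the total number of steps (which is how we understand the Trainer to derive it), and `cosine_lr` and `total_steps = 10` are illustrative choices, not library names or settings.

```python
import math

def cosine_lr(step, total_steps, base_lr=2e-7, warmup_ratio=0.03):
    """Learning rate after `step` optimizer updates: linear warmup, then cosine decay."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

total_steps = 10  # hypothetical run length
for step in range(total_steps + 1):
    print(step, f"{cosine_lr(step, total_steps):.2e}")
```

With so few steps the warmup lasts a single update, so the schedule is dominated by the cosine decay.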

In [None]:
## Put the model in training mode
model.train()
## Construct the trainer for training and evaluation.
trainer = transformers.Trainer(
    model=model,
    tokenizer=tokenizer,
    args=training_arguments,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    data_collator=data_collator
)



In [None]:
## Note that only a very small fraction of the parameters is involved in training.
model.print_trainable_parameters()

trainable params: 3,932,160 || all params: 7,072,948,224 || trainable%: 0.055594355783029126
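
The trainable-parameter count reported above can be reproduced by hand. LoRA adds two small matrices to each adapted layer: `A` of shape `r x in_features` and `B` of shape `out_features x r`. The hidden size and layer count below are assumptions about the backbone (BLOOM-7B1-style figures, not stated in this notebook); only `r` and the `query_key_value` target come from the `LoraConfig`.

```python
# Assumed architecture figures (BLOOM-7B1-style backbone, not stated in this notebook):
hidden_size = 4096
num_layers = 30
r = 8  # LoRA rank from the LoraConfig above

# The fused query_key_value projection maps hidden_size -> 3 * hidden_size.
in_features, out_features = hidden_size, 3 * hidden_size

# LoRA adds A (r x in_features) and B (out_features x r) per adapted layer.
params_per_layer = r * in_features + out_features * r
trainable = params_per_layer * num_layers
print(f"{trainable:,}")  # 3,932,160
```

Under these assumptions the result matches the printed count, which suggests that the adapter weights are the only trainable parameters.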


In [None]:
## Call trainer.train() to start training!
trainer.train()

`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...


Step,Training Loss
1,1.919
2,2.6335
3,1.9965
4,2.8355
5,2.5039
6,2.9019
7,2.6537
8,2.6153
9,1.6328
10,2.2493


TrainOutput(global_step=10, training_loss=2.3941301703453064, metrics={'train_runtime': 113.5069, 'train_samples_per_second': 0.352, 'train_steps_per_second': 0.088, 'total_flos': 260143231991808.0, 'train_loss': 2.3941301703453064, 'epoch': 1.0})
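
The `global_step=10` in the output above follows from the dataset size and the batch settings. The arithmetic below is a sketch that assumes a single GPU and that the number of optimizer steps per epoch is the number of batches integer-divided by `gradient_accumulation_steps`; the exact rounding may differ between `transformers` versions.

```python
import math

train_samples = 40    # train_dataset_size
per_device_batch = 2  # per_device_train_batch_size
grad_accum = 2        # gradient_accumulation_steps
num_gpus = 1          # assumption: a single GPU, as in the nvidia-smi output
epochs = 1            # num_train_epochs

# Number of samples that contribute to each optimizer update.
effective_batch = per_device_batch * grad_accum * num_gpus
batches_per_epoch = math.ceil(train_samples / (per_device_batch * num_gpus))
steps_per_epoch = max(1, batches_per_epoch // grad_accum)
total_steps = steps_per_epoch * epochs
print(effective_batch, total_steps)  # 4 10
```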

Once the training is completed, we can evaluate our model and get its perplexity on the validation set like this:

In [None]:
import math
eval_results = trainer.evaluate()
print(f"Perplexity: {math.exp(eval_results['eval_loss']):.2f}")

Perplexity: 16.65
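
Perplexity is the exponential of the mean negative log-likelihood per token, which is what `eval_loss` measures. A perplexity of 16.65 therefore corresponds to an evaluation loss of about 2.81. The sketch below shows the definition on a few made-up token probabilities; `perplexity` is a hypothetical helper, not a library function.

```python
import math

def perplexity(token_log_probs):
    """exp of the mean negative log-likelihood over the scored tokens."""
    nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(nll)

# Hypothetical natural-log probabilities assigned to four target tokens.
log_probs = [math.log(0.5), math.log(0.25), math.log(0.5), math.log(0.25)]
print(round(perplexity(log_probs), 4))  # 2.8284
```

A model that assigned probability 1 to every target token would reach the minimum perplexity of 1.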


## Save Trained LoRA

In [None]:
output_path = "lora"
trainer.save_model(output_path)

### Test the trained model

In [None]:
from fastchat.conversation import get_conv_template
device = "cuda"
model.eval()
@torch.no_grad()
def generate(prompt):
    ## First, tokenize the prompt, which is already wrapped in the conversation template used during training.
    input_ids = tokenizer.encode(prompt, add_special_tokens=False, return_tensors='pt').to(device)
    ## Second, the model generates output ids for the input.
    outputs = trainer.model.generate(input_ids, do_sample=False, max_new_tokens=1024)
    ## Finally, decode the output ids back into text to get the final result.
    return tokenizer.decode(*outputs, skip_special_tokens=True)

conv = get_conv_template('phoenix')
conv.append_message(conv.roles[0], "Are you a good AI assistant?")
# conv.append_message(conv.roles[1], "No, I am evil!")
# conv.append_message(conv.roles[0], "Can you tell me what you can do?")

conv.append_message(conv.roles[1], None)
prompt = conv.get_prompt()
print(prompt)
response = generate(prompt)
print("-"*80)
print(response)

A chat between a curious human and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the human's questions.

Human: <s>Are you a good AI assistant?</s>Assistant: <s>
--------------------------------------------------------------------------------
A chat between a curious human and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the human's questions.

Human: Are you a good AI assistant?Assistant: As an AI language model, I am designed to assist users with a wide range of tasks and questions. I am constantly learning and improving my capabilities to provide the best possible responses to the questions and tasks that I am given. I am not a person, but rather a computer program that has been trained on a large dataset of text.


## Clean GPU Memory

In [None]:
# Empty VRAM
import gc
del model
del trainer
gc.collect()
# Release the cached CUDA memory held by PyTorch back to the GPU
torch.cuda.empty_cache()
gc.collect()

14253

## Load the base model and merge the trained LoRA adapter

In [None]:
from peft import PeftModel

model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=BitsAndBytesConfig(load_in_8bit=True), device_map={"":0})
model = PeftModel.from_pretrained(model, output_path)
model = model.merge_and_unload()
model.config.max_length = 512
model.eval()

tokenizer = transformers.AutoTokenizer.from_pretrained(model_id, padding_side="left")
# tokenizer.pad_token = tokenizer.unk_token



Loading checkpoint shards:   0%|          | 0/5 [00:00<?, ?it/s]
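
Conceptually, merging a LoRA adapter folds the low-rank update into the base weight, `W' = W + (lora_alpha / r) * B @ A`, so that no extra matrices are needed at inference time. The sketch below checks this on tiny hand-written matrices. All values and helper names are illustrative, and it ignores the dequantization that a quantized base model needs before merging.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def merge_lora(W, A, B, alpha, r):
    """Return W + (alpha / r) * B @ A, the merged weight."""
    scale = alpha / r
    BA = matmul(B, A)
    return [[w + scale * d for w, d in zip(w_row, d_row)] for w_row, d_row in zip(W, BA)]

# Toy shapes: out_features=2, in_features=3, rank r=1 (all values hypothetical).
W = [[1.0, 0.0, 2.0], [0.0, 1.0, 0.0]]
A = [[1.0, 2.0, 3.0]]  # r x in_features
B = [[0.5], [-1.0]]    # out_features x r
merged = merge_lora(W, A, B, alpha=1, r=1)

# The merged weight gives the same output as base path plus adapter path.
x = [[1.0], [1.0], [1.0]]  # one input column vector
adapter_out = [[a + b] for (a,), (b,) in zip(matmul(W, x), matmul(B, matmul(A, x)))]
assert matmul(merged, x) == adapter_out
print(merged)
```

In this notebook `lora_alpha` and `r` are both 8, so the scaling factor is 1, the same as in the toy example.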



## Answer generation

In [None]:
from tqdm import tqdm
@torch.no_grad()
def generate(query_list, return_answer: bool = False):
    def conv_format(query):
        conv = get_conv_template('phoenix')
        conv.append_message(conv.roles[0], query)
        conv.append_message(conv.roles[1], None)
        return conv.get_prompt()

    query_list = [conv_format(query) for query in query_list]
    ## First, tokenize the queries, which are already wrapped in the conversation template used during training.
    inputs = tokenizer(query_list, padding=True, truncation=True, return_tensors="pt", add_special_tokens=False).to("cuda")
    input_ids, attention_mask = inputs.input_ids, inputs.attention_mask
    n_input, n_seq = input_ids.shape[0], input_ids.shape[-1]
    output_ids = []
    step = 1
    ## Second, the model generates output ids for each batch of `step` inputs.
    for index in tqdm(range(0, n_input, step)):
        outputs = model.generate(
            input_ids=input_ids[index: min(n_input, index+step)],
            ## pass the mask so that padding tokens are not attended to
            attention_mask=attention_mask[index: min(n_input, index+step)],
            do_sample=False,
            max_new_tokens=64,
            # temperature=0.7,
            repetition_penalty=1.0,
        )
        output_ids += outputs
    ## Finally, decode the output ids back into text to get the final results.
    if return_answer:
        ## Every prompt is padded to n_seq tokens, so the generated answer starts at position n_seq.
        ## Slicing token ids avoids cutting by string length, which breaks once special tokens are stripped.
        answers = tokenizer.batch_decode([ids[n_seq:] for ids in output_ids], skip_special_tokens=True)
        return [answer.strip() for answer in answers]
    return tokenizer.batch_decode(output_ids, skip_special_tokens=True)

# test
print("\n".join(generate(["What's the weather like today?", "Who are you?"])))


 50%|█████     | 1/2 [00:08<00:08,  8.02s/it]A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.
100%|██████████| 2/2 [00:13<00:00,  6.90s/it]

A chat between a curious human and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the human's questions.

Human: What's the weather like today?Assistant: I'm sorry, but I am an AI language model and do not have the ability to access real-time weather information. You can check the weather forecast for your location on a reliable weather website or app.
A chat between a curious human and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the human's questions.

Human: Who are you?Assistant: I'm sorry, but I don't have any personal information. I'm just a computer program designed to help answer questions and provide information. Is there something specific you would like to know?
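
The padding warning in the output above concerns which side the pad tokens are placed on. A decoder-only model continues from the last position of its input, so batched prompts are usually padded on the left, keeping the real tokens adjacent to the generated ones. The sketch below pads plain lists of token ids and builds the matching attention mask. The ids and the helper `pad_batch` are hypothetical.

```python
def pad_batch(sequences, pad_id, side="left"):
    """Pad token-id lists to equal length and build the matching attention mask."""
    max_len = max(len(s) for s in sequences)
    input_ids, attention_mask = [], []
    for s in sequences:
        pad = [pad_id] * (max_len - len(s))
        mask_pad = [0] * len(pad)
        if side == "left":
            input_ids.append(pad + s)
            attention_mask.append(mask_pad + [1] * len(s))
        else:
            input_ids.append(s + pad)
            attention_mask.append([1] * len(s) + mask_pad)
    return input_ids, attention_mask

ids, mask = pad_batch([[5, 6, 7], [8]], pad_id=0)
print(ids)   # [[5, 6, 7], [0, 0, 8]]
print(mask)  # [[1, 1, 1], [0, 0, 1]]
```

With right padding the second row would be `[8, 0, 0]`, and generation would continue after two pad tokens rather than after the prompt.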





## Evaluate a trained model on a given test dataset

In [None]:
import os
# TODO: correctly put test data files into an accessible path
test_file = "zh_med.json"
assert os.path.exists(test_file), "Invalid test_file path"

with open(test_file, 'r', encoding='utf-8') as reader:
    test_data = json.load(reader)
print(test_data[0])

['什么是医学伦理学，它在医疗领域有何重要性？', '医学伦理学是研究医疗领域伦理问题的学科。它涉及研究医疗专业人员、患者和其他相关利益相关者之间的伦理关系，以及在医疗实践中出现的道德困境。\n\n医学伦理学在医疗领域具有以下重要性：\n\n1. 保护患者权益：医学伦理学关注患者的权益和尊严。它确保医疗决策是以患者的最大利益为出发点，并尊重患者的自主权和知情同意权。\n\n2. 促进医务人员职业道德：医学伦理学提供了医务人员在面对道德困境时的指导原则，帮助他们保持专业的道德标准和行为规范。\n\n3. 增加医疗决策的公正性：医学伦理学关注公正和公平的医疗分配原则。它确保资源在医疗领域的分配是公正和可持续的。\n\n4. 促进研究伦理：医学伦理学对医学研究进行伦理审查，确保研究参与者的权益和福利得到保护，并确保研究过程是符合伦理标准的。\n\n5. 保护医疗机构声誉：医学伦理学的遵循有助于确保医疗机构遵守伦理原则，保护其声誉和公众信任度。\n\n总之，医学伦理学在医疗领域的重要性在于维护患者权益、指导医务人员的职业道德行为、促进医疗决策公正和保护医学研究伦理。它为医疗行业提供了一个道德框架，确保医疗服务的质量和道德高于一切。']


In [None]:
model_answers = generate([data[0] for data in test_data], return_answer=True)

  0%|          | 0/20 [00:00<?, ?it/s]A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.

In [None]:
for data, answer in zip(test_data, model_answers):
    data.append(answer)

In [None]:
with open("saved_data.json", 'w', encoding='utf-8') as writer:
    json.dump(test_data, writer, indent=4, ensure_ascii=False)
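
The `ensure_ascii=False` argument matters here because the test data is in Chinese. By default the `json` module escapes every non-ASCII character, which makes the saved file hard to read. The sketch below compares the two settings on a made-up record in the same `[question, reference, model_answer]` layout.

```python
import json

# Hypothetical record in the same [question, reference, model_answer] layout.
record = ["什么是医学伦理学？", "reference answer", "model answer"]

escaped = json.dumps([record])                       # default: non-ASCII is \u-escaped
readable = json.dumps([record], ensure_ascii=False)  # keeps the characters as written

print(escaped)
print(readable)
# Both forms decode to the same data; only the on-disk representation differs.
assert json.loads(escaped) == json.loads(readable)
```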