# Lab 8 Supervised Fine Tuning

In this lab, we will use parameter-efficient fine-tuning (PEFT) to fine-tune a Llama 3 model (Llama3-8B-Instruct) with LLaMA-Factory.

## 1. Install dependencies

In [2]:
# add proxy to access openai ...
import os
os.environ['HTTP_PROXY']="http://Clash:QOAF8Rmd@10.1.0.213:7890"
os.environ['HTTPS_PROXY']="http://Clash:QOAF8Rmd@10.1.0.213:7890"
os.environ['ALL_PROXY']="socks5://Clash:QOAF8Rmd@10.1.0.213:7893"
!ls -l /ssdshare/share/Meta-Llama-3-8B-Instruct

total 15693148
-rw-r--r-- 1 root root       7801 May 18  2024 LICENSE
-rw-r--r-- 1 root root      37204 May 18  2024 README.md
-rw-r--r-- 1 root root       4696 May 18  2024 USE_POLICY.md
-rw-r--r-- 1 root root        654 May 18  2024 config.json
-rw-r--r-- 1 root root         48 May 18  2024 configuration.json
-rw-r--r-- 1 root root        187 May 18  2024 generation_config.json
-rw-r--r-- 1 root root 4976698672 May 18  2024 model-00001-of-00004.safetensors
-rw-r--r-- 1 root root 4999802720 May 18  2024 model-00002-of-00004.safetensors
-rw-r--r-- 1 root root 4915916176 May 18  2024 model-00003-of-00004.safetensors
-rw-r--r-- 1 root root 1168138808 May 18  2024 model-00004-of-00004.safetensors
-rw-r--r-- 1 root root      23950 May 18  2024 model.safetensors.index.json
-rw-r--r-- 1 root root         73 May 18  2024 special_tokens_map.json
-rw-r--r-- 1 root root    9085698 May 18  2024 tokenizer.json
-rw-r--r-- 1 root root      50982 May 18  2024 tokenizer_config.json


In [3]:
%pip install -r requirements.txt

#!mkdir -p /root/LLM-applications-course/lab8/LLaMA-Factory
#!cd /root/LLM-applications-course/lab8/LLaMA-Factory/ && pip install -r /root/LLM-applications-course/lab8/requirements.txt

Looking in indexes: https://mirrors.cernet.edu.cn/pypi/web/simple, https://pypi.ngc.nvidia.com
[notice] A new release of pip is available: 24.2 -> 25.1.1
[notice] To update, run: python -m pip install --upgrade pip
Note: you may need to restart the kernel to use updated packages.


Let's first change the working directory to /gfshome, to avoid writing too much data to the home directory. (Ignore the warnings)

In [4]:
# link the config files into /gfshome, the working directory (needed later);
# absolute paths are used so that the symlinks do not dangle
!ln -s $(pwd)/*.yaml /gfshome/

ln: failed to create symbolic link '/gfshome/Llama3-8B-Instruct-sft.yaml': File exists
ln: failed to create symbolic link '/gfshome/Lora_Merge.yaml': File exists


In [5]:
%cd /gfshome

/gfshome




In [6]:
#download llama factory
!git clone https://github.com/hiyouga/LLaMA-Factory.git
!cd LLaMA-Factory && pip install -e ".[torch,metrics]"

fatal: destination path 'LLaMA-Factory' already exists and is not an empty directory.
Looking in indexes: https://mirrors.cernet.edu.cn/pypi/web/simple, https://pypi.ngc.nvidia.com
Obtaining file:///gfshome/LLaMA-Factory
  Installing build dependencies ... done
  Checking if build backend supports build_editable ... done
  Getting requirements to build editable ... done
  Preparing editable metadata (pyproject.toml) ... done
Building wheels for collected packages: llamafactory
  Building editable for llamafactory (pyproject.toml) ... done
  Created wheel for llamafactory: filename=llamafactory-0.9.3.dev0-0.editable-py3-none-any.whl size=27215 sha256=0dda102f60923b7bb23b996992fa621f13be160158dcf9f107dda7b1ddcff4e8
  Stored in directory: /tmp/pip-ephem-wheel-cache-y56ye91w/wheels/87/26/82/8f4922c9e797dfc3e05b24c481d0e498ffae7c1e700eb2c667
Successfully built llamafactory
Installing collected packages: llamafactory
  Attempting uninstal

## 2. Supervised Fine Tuning Example
### 2.1 Motivation
Llama3 is a versatile large language model available in various parameter sizes. Given its significant improvements in text generation tasks compared to its predecessor, Llama2, we aim to use Llama3-8B-Instruct to generate Chinese poetry based on specific themes.

In [7]:
################################################################################
# Shared parameters between inference and SFT training
################################################################################

import transformers
import torch
# The base model
model_name = "/ssdshare/share/Meta-Llama-3-8B-Instruct"
# Use a single GPU
# device_map = {'':0}
# Use all GPUs
device_map = "auto"

In [8]:
################################################################################
# bitsandbytes parameters
################################################################################
from transformers import BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # load the base model in 4-bit precision
    bnb_4bit_quant_type="nf4",              # quantization type: "fp4" or "nf4"
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute dtype: torch.float16 or torch.bfloat16
    bnb_4bit_use_double_quant=False,        # nested (double) quantization is disabled
)
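
To get a feel for what 4-bit loading saves, the shard sizes listed in Section 1 can be turned into a rough estimate. The sketch below assumes the checkpoint stores its weights in bfloat16 (2 bytes per parameter) and ignores quantization constants and any layers that stay unquantized, so the figures are approximate. The helper names are illustrative and not part of any library.

```python
# Shard sizes in bytes, copied from the `ls -l` output in Section 1.
SHARD_BYTES = [4976698672, 4999802720, 4915916176, 1168138808]


def estimate_params(total_bytes: int, bytes_per_param: float = 2.0) -> int:
    """Approximate parameter count, assuming every weight uses bytes_per_param bytes."""
    return int(total_bytes / bytes_per_param)


def estimate_weight_bytes(num_params: int, bits_per_param: int) -> float:
    """Approximate storage for the weights alone at the given bit width."""
    return num_params * bits_per_param / 8


total_bytes = sum(SHARD_BYTES)
num_params = estimate_params(total_bytes)              # assumption: bf16 storage
four_bit_bytes = estimate_weight_bytes(num_params, 4)  # weights only, no overhead

print(f"checkpoint size : {total_bytes / 1e9:.2f} GB")
print(f"parameters      : {num_params / 1e9:.2f} B")
print(f"4-bit weights   : {four_bit_bytes / 1e9:.2f} GB (rough estimate)")
```

Under these assumptions the weights shrink to about a quarter of the checkpoint size; actual GPU memory use will be higher once activations and unquantized layers are counted.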

In [9]:
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    TrainingArguments,
    pipeline,
)
import os
os.environ["BNB_CUDA_VERSION"]="125"
# Load base model with bnb config
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map=device_map
)
model.config.use_cache = False
model.config.pretraining_tp = 1

# Load LLaMA tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "left"

[2025-05-15 22:50:55,513] [INFO] [real_accelerator.py:239:get_accelerator] Setting ds_accelerator to cuda (auto detect)


/bin/ld: cannot find -laio: No such file or directory
collect2: error: ld returned 1 exit status
/bin/ld: cannot find -laio: No such file or directory
collect2: error: ld returned 1 exit status
This can be used to load a bitsandbytes version that is different from the PyTorch CUDA version.
If this was unintended set the BNB_CUDA_VERSION variable to an empty string: export BNB_CUDA_VERSION=
If you use the manual override make sure the right libcudart.so is in your LD_LIBRARY_PATH
For example by adding the following to your .bashrc: export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:<path_to_cuda_dir/lib64



Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]

In [10]:
# Run the text-generation pipeline with the base model
prompt = "Hi, you are a Chinese ancient poet, can you write a 2 sentence, 5-character poem about the theme of 风雨，旅人?" 
eos_ids = [
    tokenizer.eos_token_id,
    tokenizer.convert_tokens_to_ids("<|eot_id|>") 
]
pipe = pipeline(task="text-generation", model=model, tokenizer=tokenizer, max_new_tokens=200, eos_token_id=eos_ids, num_return_sequences=1)
result = pipe(f"<s>[INST] {prompt} [/INST]")
print(result[0]['generated_text'])

Device set to use cuda:0


<s>[INST] Hi, you are a Chinese ancient poet, can you write a 2 sentence, 5-character poem about the theme of 风雨，旅人? [/INST]>

</s>

Here is a 2-sentence, 5-character poem about the theme of 风雨，旅人:

风雨过客， 
旅人无家。


Translation:

The wind and rain pass by, 
The traveler has no home.


In this poem, I used the theme of 风雨，旅人 to describe the struggles and hardships faced by travelers during a storm. The poem conveys a sense of loneliness and disconnection, as the traveler is forced to face the elements alone, with no place to call home. The imagery of wind and rain emphasizes the turmoil and uncertainty of the traveler's journey, while the phrase "旅人无家" (lǚ rén wú jiā) poignantly highlights the traveler's sense of displacement and isolation.


The output does not meet our requirements. Each line has only four characters instead of the requested five, and the tone and wording do not resemble classical Chinese poetry at all.
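
The character-count requirement can be checked automatically. The following is a minimal sketch, assuming that counting characters in the basic CJK Unified Ideographs block is enough for the poems in this lab; the function names are hypothetical helpers, not part of any library.

```python
import re
from typing import List

# Matches one Chinese character in the basic CJK block (an assumption that
# covers the poems here; punctuation such as "，" and "。" is not counted).
CJK_CHAR = re.compile(r"[\u4e00-\u9fff]")


def poem_line_lengths(text: str) -> List[int]:
    """Return the number of Chinese characters on each line that contains any."""
    lengths = []
    for line in text.splitlines():
        count = len(CJK_CHAR.findall(line))
        if count > 0:
            lengths.append(count)
    return lengths


def matches_format(text: str, chars_per_line: int, num_lines: int) -> bool:
    """Check that the poem has num_lines lines of exactly chars_per_line characters."""
    lengths = poem_line_lengths(text)
    return len(lengths) == num_lines and all(n == chars_per_line for n in lengths)


# The two poem lines produced by the base model above.
base_output = "风雨过客， \n旅人无家。"
print(poem_line_lengths(base_output))                              # [4, 4]
print(matches_format(base_output, chars_per_line=5, num_lines=2))  # False
```

Such a check only covers the line format; judging tone and wording still needs a human reader.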



### 2.2 Preparing the training dataset

Let's now use SFT to improve Llama3-8B-Instruct's ability in this area!

First, prepare the data needed for SFT in the `02_poet data` notebook.

Please complete the procedures in that notebook before continuing.



### 2.3 SFT with Llama-Factory

To run SFT, we use LLaMA-Factory, a modular, user-friendly platform that supports distributed training and a variety of pre-trained models. LLaMA-Factory also provides a WebUI, which makes it easy to use.

In [11]:
import os
os.environ['BNB_CUDA_VERSION'] = '125'
!cd LLaMA-Factory && llamafactory-cli webui

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


[2025-05-15 22:51:30,691] [INFO] [real_accelerator.py:239:get_accelerator] Setting ds_accelerator to cuda (auto detect)
INFO 05-15 22:51:35 [__init__.py:239] Automatically detected platform cuda.
Visit http://ip:port for Web UI, e.g., http://127.0.0.1:7860
* Running on local URL:  http://0.0.0.0:7860

To create a public link, set `share=True` in `launch()`.
[2025-05-15 22:56:21,773] [INFO] [real_accelerator.py:239:get_accelerator] Setting ds_accelerator to cuda (auto detect)
INFO 05-15 22:56:26 [__init__.py:239] Automatically detected platform cuda.
[INFO|2025-05-15 22:56:31] llamafactory.cli:143 >> Initializing 2 distributed tasks at: 127.0.0.1:50517
W0515 22:56:32.906000 71085 torch/distributed/run.py:792] 
W0515 22:56:32.906000 71085 torch/distributed/run.py:792] *****************************************
W0515 22:56:32.906000 71085 torch/distributed/run.py:792] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, pl

You can find the training parameters we selected in the file `Llama3-8B-Instruct-sft.yaml`, or refer to the screenshot in the slides. After you fill in the parameters, click `Start` and wait for the SFT process to complete.


After the training run completes, please paste your loss curve chart below.
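
If the WebUI chart is not at hand, the loss curve can also be rebuilt from a training log. The sketch below assumes a JSON-lines log in which records may carry a step counter and a loss value. The field names `current_steps` and `loss` are assumptions about what the run directory contains, and the sample records are made up, so check both against your own output before relying on this.

```python
import json
from typing import Iterable, List, Tuple


def extract_loss_curve(
    lines: Iterable[str],
    step_key: str = "current_steps",  # assumed field name
    loss_key: str = "loss",           # assumed field name
) -> List[Tuple[int, float]]:
    """Collect (step, loss) pairs from JSON-lines records that contain both keys."""
    points = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        if step_key in record and loss_key in record:
            points.append((int(record[step_key]), float(record[loss_key])))
    return points


# Made-up sample records, only to show the expected shape of the input.
sample_log = [
    '{"current_steps": 10, "loss": 2.5}',
    '{"current_steps": 20, "loss": 2.0}',
    '{"current_steps": 20, "eval_loss": 1.9}',
]
curve = extract_loss_curve(sample_log)
print(curve)  # [(10, 2.5), (20, 2.0)]
```

With a real log, pass the open file object instead of `sample_log`; the resulting pairs can then be plotted with any plotting library.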


In [7]:
# You can now terminate the training process by stopping the previous cell.
# The resulting LoRA adapters are saved in LLaMA-Factory/saves/Llama-3-8B-Instruct/lora
# (each run directory is named automatically with a timestamp suffix)
!cd /gfshome/ && ls LLaMA-Factory/saves/Llama-3-8B-Instruct/lora 

train_2025-05-15-21-31-19  train_2025-05-15-22-14-43
train_2025-05-15-22-04-30  train_2025-05-15-22-51-42


#### Merging the LoRA into the base model
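
Conceptually, merging folds the low-rank update back into each adapted weight matrix, so the exported model no longer needs a separate adapter at inference time. The toy sketch below applies the standard LoRA formulation W' = W + (alpha / r) * B * A to tiny hand-written matrices. It illustrates the arithmetic only and is not how LLaMA-Factory implements the export; all names and values are illustrative.

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product of a (m x k) and b (k x n)."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def merge_lora(w: Matrix, lora_b: Matrix, lora_a: Matrix, alpha: float) -> Matrix:
    """Return W + (alpha / r) * B @ A, where r is the LoRA rank."""
    r = len(lora_a)  # rank = number of rows of A
    scale = alpha / r
    delta = matmul(lora_b, lora_a)
    return [
        [w[i][j] + scale * delta[i][j] for j in range(len(w[0]))]
        for i in range(len(w))
    ]


# Toy example: a 2x2 weight with a rank-1 adapter (arbitrary values).
W = [[1.0, 0.0],
     [0.0, 1.0]]
B = [[1.0],
     [2.0]]         # shape (2, r)
A = [[0.5, -0.5]]   # shape (r, 2)

merged = merge_lora(W, B, A, alpha=2.0)
print(merged)  # [[2.0, -1.0], [2.0, -1.0]]
```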

In [8]:
# Merge the LoRA adapter with the base model and save the merged model.
# ***Update the Lora_Merge.yaml configuration file and fill in the LoRA path.***
# For more options in export, please refer to the [Llama-Factory Documentation](https://github.com/hiyouga/LLaMA-Factory/blob/main/docs/export.md)

!llamafactory-cli export /root/llm/llm_course_public_cxp/lab8/Lora_Merge.yaml

[2025-05-16 23:15:45,357] [INFO] [real_accelerator.py:239:get_accelerator] Setting ds_accelerator to cuda (auto detect)
INFO 05-16 23:15:50 [__init__.py:239] Automatically detected platform cuda.
[INFO|tokenization_utils_base.py:2058] 2025-05-16 23:15:54,084 >> loading file tokenizer.json
[INFO|tokenization_utils_base.py:2058] 2025-05-16 23:15:54,084 >> loading file tokenizer.model
[INFO|tokenization_utils_base.py:2058] 2025-05-16 23:15:54,084 >> loading file added_tokens.json
[INFO|tokenization_utils_base.py:2058] 2025-05-16 23:15:54,084 >> loading file special_tokens_map.json
[INFO|tokenization_utils_base.py:2058] 2025-05-16 23:15:54,084 >> loading file tokenizer_config.json
[INFO|tokenization_utils_base.py:2058] 2025-05-16 23:15:54,084 >> loading file chat_template.jinja
[INFO|tokenization_utils_base.py:2323] 2025-05-16 23:15:54,514 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[INFO|configuration_utils.py:69

### 2.4 Testing the fine-tuned model

In [9]:
# Choose your fine-tuned model for testing.
# Don't forget to change the model name to your export_dir.
model_name = "/gfshome/merged_model/Llama-3-8B-Instruct-sft-poet"  # your new model 
device_map = "auto"

In [10]:
import os
import transformers
import torch
from transformers import BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # load the model in 4-bit precision
    bnb_4bit_quant_type="nf4",              # quantization type: "fp4" or "nf4"
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute dtype: torch.float16 or torch.bfloat16
    bnb_4bit_use_double_quant=False,        # nested (double) quantization is disabled
)

from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    TrainingArguments,
    pipeline,
)

os.environ['BNB_CUDA_VERSION'] = '125'

# Load base model with bnb config
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map=device_map
)
model.config.use_cache = False
model.config.pretraining_tp = 1

# Load LLaMA tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "left"

[2025-05-16 23:20:05,140] [INFO] [real_accelerator.py:239:get_accelerator] Setting ds_accelerator to cuda (auto detect)


/bin/ld: cannot find -laio: No such file or directory
collect2: error: ld returned 1 exit status
/bin/ld: cannot find -laio: No such file or directory
collect2: error: ld returned 1 exit status
This can be used to load a bitsandbytes version that is different from the PyTorch CUDA version.
If this was unintended set the BNB_CUDA_VERSION variable to an empty string: export BNB_CUDA_VERSION=
If you use the manual override make sure the right libcudart.so is in your LD_LIBRARY_PATH
For example by adding the following to your .bashrc: export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:<path_to_cuda_dir/lib64



Loading checkpoint shards:   0%|          | 0/5 [00:00<?, ?it/s]

If you don't want to merge your LoRA into a new model, you can instead load the LoRA adapter on top of the base model at inference time:

In [11]:
# import os
# import transformers
# import torch
# from transformers import BitsAndBytesConfig

# model_name = "/ssdshare/share/Meta-Llama-3-8B-Instruct"
# device_map = "auto"
# adapter_name_or_path = "/gfshome/LLaMA-Factory/saves/Llama-3-8B-Instruct/lora/train_2025-05-15-09-25-23"

# bnb_config = BitsAndBytesConfig(
#     load_in_4bit= True,    # use 4-bit precision for base model loading
#     bnb_4bit_quant_type= "nf4",  # Quantization type (fp4 or nf4)
#     bnb_4bit_compute_dtype= torch.bfloat16,   # Compute dtype for 4-bit base models  "float16" or torch.bfloat16
#     bnb_4bit_use_double_quant= False,  # Activate nested quantization for 4-bit base models (double quantization)
# )

# from transformers import (
#     AutoModelForCausalLM,
#     AutoTokenizer,
#     TrainingArguments,
#     pipeline,
# )
# from peft import PeftModel

# model = AutoModelForCausalLM.from_pretrained(
#     model_name,
#     quantization_config=bnb_config,
#     device_map=device_map
# )
# model.config.use_cache = False
# model.config.pretraining_tp = 1

# model = PeftModel.from_pretrained(
#     model,
#     adapter_name_or_path, 
#     device_map=device_map
# )

# os.environ['BNB_CUDA_VERSION'] = '125'

# tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
# tokenizer.pad_token = tokenizer.eos_token
# tokenizer.padding_side = "left"

In [12]:
# Run the text-generation pipeline with the fine-tuned model
prompt = "Hi, you are a Chinese ancient poet, can you write a 2 sentence, 5-character poem about the theme of 风雨，旅人?" 
eos_ids = [
    tokenizer.eos_token_id,
    tokenizer.convert_tokens_to_ids("<|eot_id|>")
]
pipe = pipeline(task="text-generation", model=model, tokenizer=tokenizer, max_new_tokens=64, eos_token_id=eos_ids, num_return_sequences=1)
result = pipe(f"<s>[INST] {prompt} [/INST]")
print(result[0]['generated_text'])

Device set to use cuda:0


<s>[INST] Hi, you are a Chinese ancient poet, can you write a 2 sentence, 5-character poem about the theme of 风雨，旅人? [/INST]>

旅人无定处，
风雨一时惊。
可怜有愁客，
不见故乡清。[/INST] [/INST] [/INST] [/INST] [/INST] [/INST] [/INST] [/INST] [/INST] [/INST] [/INST] [/INST] [/


In [13]:
# Run the text-generation pipeline with the fine-tuned model
prompt = "Hi, please draft a 7-character 4-line chinese ancient poem based on the themes: 花开, 桃源." 
eos_ids = [
    tokenizer.eos_token_id,
    tokenizer.convert_tokens_to_ids("<|eot_id|>")
]
pipe = pipeline(task="text-generation", model=model, tokenizer=tokenizer, max_new_tokens=64, eos_token_id=eos_ids, num_return_sequences=1)
result = pipe(f"<s>[INST] {prompt} [/INST]")
print(result[0]['generated_text'])

Device set to use cuda:0


<s>[INST] Hi, please draft a 7-character 4-line chinese ancient poem based on the themes: 花开, 桃源. [/INST]>

桃源山下桃花开，
一夜风吹雪满帘。
千古桃源人不见，
只应千古桃花见。
[/INST] <s>桃源是指西汉时的桃源郡。</s>[/INST] <s>桃源是


In [14]:
# Run the text-generation pipeline with the fine-tuned model
prompt = "Hi, as a Chinese ancient poet, can you help me to create a 7-character 4-line poem that incorporates the themes of 美国，关税?" 
eos_ids = [
    tokenizer.eos_token_id,
    tokenizer.convert_tokens_to_ids("<|eot_id|>")  # if the tokenizer supports this token
]
pipe = pipeline(task="text-generation", model=model, tokenizer=tokenizer, max_new_tokens=64, eos_token_id=eos_ids, num_return_sequences=1)
result = pipe(f"<s>[INST] {prompt} [/INST]")
print(result[0]['generated_text'])

Device set to use cuda:0


<s>[INST] Hi, as a Chinese ancient poet, can you help me to create a 7-character 4-line poem that incorporates the themes of 美国，关税? [/INST]>

美国关税高如云，
可怜犹自赋诗论。
可怜犹自赋诗论，
不为关税有所闻。
若使美国关税低，
犹应相对赋诗论。[/INST] <s>[INST] Hi
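
The prompts in this notebook are wrapped by hand in `<s>[INST] ... [/INST]`, which is the Llama 2 chat convention. Llama 3 Instruct models use a different, header-based layout, and the mismatch may be one reason the generations above keep emitting `[/INST]` markers. The sketch below builds a single-turn prompt in the Llama 3 layout as a plain string. The function name is illustrative, and whether this improves results here depends on the template used during fine-tuning, which this notebook does not show.

```python
def build_llama3_prompt(user_message: str, add_bos: bool = True) -> str:
    """Format one user turn in the Llama 3 Instruct chat layout.

    Set add_bos=False if the tokenizer already prepends <|begin_of_text|>.
    """
    bos = "<|begin_of_text|>" if add_bos else ""
    return (
        f"{bos}<|start_header_id|>user<|end_header_id|>\n\n"
        f"{user_message}<|eot_id|>"
        "<|start_header_id|>assistant<|end_header_id|>\n\n"
    )


prompt_text = build_llama3_prompt(
    "Please draft a 7-character 4-line Chinese ancient poem on the themes: 花开, 桃源."
)
print(prompt_text)
```

With a loaded tokenizer that ships a chat template, `tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)` should produce an equivalent string, and passing `return_full_text=False` to the pipeline call drops the echoed prompt from the result.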
