<a href="https://www.kaggle.com/code/aisuko/supervise-fine-tuning-llm?scriptVersionId=165069244" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# Overview

Supervised fine-tuning (SFT) is a crucial step in RLHF. Let's use it to fine-tune a causal language model.

In [1]:
%%capture
!pip install transformers==4.36.2
!pip install accelerate==0.25.0
!pip install datasets==2.15.0
!pip install peft==0.7.1
!pip install bitsandbytes==0.41.3
!pip install trl==0.7.7
!pip install tqdm==4.66.1
!pip install flash-attn==2.4.2

In [2]:
import os
from huggingface_hub import login
from kaggle_secrets import UserSecretsClient

user_secrets = UserSecretsClient()

login(token=user_secrets.get_secret("HUGGINGFACE_TOKEN"))

os.environ["WANDB_API_KEY"]=user_secrets.get_secret("WANDB_API_KEY")
os.environ["WANDB_PROJECT"] = "Supervise-fine-tune-models"
os.environ["WANDB_NOTES"] = "Supervised fine-tuning of causal language models"
os.environ["WANDB_NAME"] = "sft-facebook-opt350m-with-openassistant-guanaco"
os.environ["MODEL_NAME"] = "facebook/opt-350m"

Token will not been saved to git credential helper. Pass `add_to_git_credential=True` if you want to set the git credential as well.
Token is valid (permission: write).
Your token has been saved to /root/.cache/huggingface/token
Login successful


In [3]:
!accelerate estimate-memory ${MODEL_NAME} --library_name transformers

Loading pretrained config for `facebook/opt-350m` from `transformers`...
config.json: 100%|█████████████████████████████| 644/644 [00:00<00:00, 2.83MB/s]
┌────────────────────────────────────────────────────┐
│    Memory Usage for loading `facebook/opt-350m`    │
├───────┬─────────────┬──────────┬───────────────────┤
│ dtype │Largest Layer│Total Size│Training using Adam│
├───────┼─────────────┼──────────┼───────────────────┤
│float32│   98.19 MB  │ 1.23 GB  │      4.94 GB      │
│float16│   49.09 MB  │631.71 MB │      2.47 GB      │
│  int8 │   24.55 MB  │315.85 MB │      1.23 GB      │
│  int4 │   12.27 MB  │157.93 MB │     631.71 MB     │
└───────┴─────────────┴──────────┴───────────────────┘
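
The figures in the table follow from simple arithmetic, sketched below. The parameter count is an assumption inferred from the PEFT summary later in this notebook (332,769,280 total minus 1,572,864 LoRA parameters), and the 4x factor for Adam is the rough rule of thumb that matches the table. Real usage also depends on activations, batch size and CUDA overhead, so treat these as estimates only.

```python
NUM_PARAMS = 331_196_416  # assumed: inferred from the PEFT summary in this notebook


def estimate_memory(num_params, bits_per_param):
    """Return (weights_bytes, adam_training_bytes) for a given precision."""
    weights = num_params * bits_per_param // 8
    # Rule of thumb: weights + gradients + two Adam moment buffers ~ 4x the weights
    return weights, 4 * weights


for dtype, bits in [("float32", 32), ("float16", 16), ("int8", 8), ("int4", 4)]:
    weights, training = estimate_memory(NUM_PARAMS, bits)
    print(f"{dtype:>7}: weights {weights / 2**20:8.2f} MiB, "
          f"Adam training {training / 2**30:5.2f} GiB")
```

The float16 row, for example, comes out at about 631.71 MiB of weights and 2.47 GiB for training, in line with the table.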


# Loading the Datasets

Here we use [timdettmers/openassistant-guanaco](https://huggingface.co/datasets/timdettmers/openassistant-guanaco?row=6), a subset of the Open Assistant dataset. It contains the highest-rated paths in the conversation tree, with a total of 9846 samples. To keep the run short, the cell below loads only the first 500 training samples and holds out 20% of them as a test split.

In [4]:
from datasets import load_dataset

dataset_name="timdettmers/openassistant-guanaco"
dataset=load_dataset(dataset_name, split="train[:500]")
dataset=dataset.train_test_split(test_size=0.2)
dataset

DatasetDict({
    train: Dataset({
        features: ['text'],
        num_rows: 400
    })
    test: Dataset({
        features: ['text'],
        num_rows: 100
    })
})
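
Each row has a single `text` field holding a whole conversation. Assuming the turns are delimited by `### Human:` and `### Assistant:` markers (an assumption about this dataset's layout; verify it against `dataset['train'][0]['text']`), a small hypothetical helper can split a sample into turns for inspection. The sample string here is made up for illustration.

```python
import re


def split_turns(text):
    """Split a guanaco-style conversation into (role, content) pairs."""
    # The capturing group keeps the role names in the result of re.split
    parts = re.split(r"### (Human|Assistant): ?", text)
    roles, contents = parts[1::2], parts[2::2]
    return [(role, content.strip()) for role, content in zip(roles, contents)]


sample = "### Human: Hi there### Assistant: Hello! How can I help?"  # made-up sample
for role, content in split_turns(sample):
    print(f"{role}: {content}")
```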

# Loading tokenizer

In [5]:
from transformers import AutoTokenizer

tokenizer=AutoTokenizer.from_pretrained(os.getenv("MODEL_NAME"))

# Quantizing the model

Both NF4 (4-bit NormalFloat) and pure FP4 quantization work well. However, based on the theoretical considerations and empirical results from the QLoRA paper, we use NF4 quantization for better performance.

* **bnb_4bit_use_double_quant**: It applies a second quantization after the first one to save an additional 0.4 bits per parameter.

While 4-bit bitsandbytes stores weights in 4 bits, the computation still happens in 16-bit or 32-bit, and any compute dtype can be chosen (float16, bfloat16, float32, etc.). Below we load the model in 4-bit using NF4 quantization with double quantization, and use float16 as the compute dtype for faster training:

In [6]:
from transformers import BitsAndBytesConfig
import torch

load_in_4bit=True

if load_in_4bit:
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=load_in_4bit,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_use_double_quant=True,
        bnb_4bit_compute_dtype=torch.float16
    )
    # let accelerate place the model on the available device(s) automatically
    device_map="auto"
    torch_dtype=torch.bfloat16 # if the GPU does not support bfloat16, use torch.float16 instead
else:
    device_map=None
    quantization_config=None
    torch_dtype=None
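
As a rough check on the 0.4-bit figure for double quantization, here is the overhead arithmetic. It assumes the block sizes described in the QLoRA paper (one 32-bit scaling constant per 64 weights, with those constants re-quantized to 8 bits in blocks of 256). These block sizes are not stated in this notebook, so treat them as assumptions.

```python
WEIGHT_BLOCK = 64   # assumed: weights sharing one scaling constant
CONST_BLOCK = 256   # assumed: first-level constants sharing one second-level constant

# Without double quantization: one fp32 constant per block of weights
overhead_single = 32 / WEIGHT_BLOCK

# With double quantization: 8-bit constants, plus one fp32 constant per CONST_BLOCK of them
overhead_double = 8 / WEIGHT_BLOCK + 32 / (WEIGHT_BLOCK * CONST_BLOCK)

saved = overhead_single - overhead_double
print(f"{overhead_single:.3f} -> {overhead_double:.3f} bits/param, saving {saved:.3f}")
```

Under these assumptions the saving is about 0.37 bits per parameter, which is the figure usually rounded to 0.4.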

# Loading model

Here we quantize the model to 4-bit and load it. Transformers can also load it with FlashAttention-2, a faster and more efficient implementation of the standard attention mechanism that can significantly speed up inference by:

1. additionally parallelizing the attention computation over sequence length
2. partitioning the work between GPU threads to reduce communication and shared memory reads/writes between them.

See the [list of supported architectures](https://huggingface.co/docs/transformers/perf_infer_gpu_one#flashattention-2).


**Note: 1. FlashAttention-2 can only be used when the model's dtype is fp16 or bf16. Make sure to cast the model to the appropriate dtype and load it on a supported device before using FlashAttention-2. 2. The GPU used for this notebook is not supported (FlashAttention-2 requires Ampere or newer), so the option is commented out in the code below.**
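
Since FlashAttention-2 needs an Ampere-or-newer GPU (CUDA compute capability 8.0 and up), one option is to choose the attention implementation at runtime instead of commenting the argument in and out. The helper below is a hypothetical sketch: `pick_attn_implementation` is not part of any library, and it assumes `"eager"` is an acceptable fallback value for `attn_implementation`.

```python
def pick_attn_implementation(compute_capability):
    """Return an `attn_implementation` value for a (major, minor) CUDA capability."""
    major, _minor = compute_capability
    # Ampere and newer GPUs report a major compute capability of 8 or higher
    return "flash_attention_2" if major >= 8 else "eager"


print(pick_attn_implementation((7, 5)))  # capability of a T4
print(pick_attn_implementation((8, 0)))  # capability of an A100
```

In a notebook, the tuple could come from `torch.cuda.get_device_capability()` when a GPU is available, and the result could be passed as `attn_implementation=...` to `from_pretrained`.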

In [7]:
from transformers import AutoModelForCausalLM


def print_trainable_parameters(model):
    trainable_params=0
    all_params=0
    for _, param in model.named_parameters():
        all_params+=param.numel()
        if param.requires_grad:
            trainable_params+=param.numel()
    print(f"trainable params: {trainable_params} || all params: {all_params} || trainable%: {100 * trainable_params/all_params:.2f}")


model=AutoModelForCausalLM.from_pretrained(
    os.getenv("MODEL_NAME"),
    quantization_config=quantization_config,
    device_map=device_map,
    trust_remote_code=True,
    torch_dtype=torch_dtype,
    # Disabled: on this notebook's GPU it fails with
    # "RuntimeError: FlashAttention only supports Ampere GPUs or newer."
    # attn_implementation="flash_attention_2",
)
          
print_trainable_parameters(model)

trainable params: 27936768 || all params: 179677184 || trainable%: 15.55


# Freeze the Original Weights

In [8]:
from peft import prepare_model_for_kbit_training

#gradient checkpointing to save memory
model.gradient_checkpointing_enable()

#freeze base model layers and cast layernorm to fp32
prepared_model=prepare_model_for_kbit_training(
    model, use_gradient_checkpointing=True
)

prepared_model.get_memory_footprint()

264151040

In [9]:
from peft import LoraConfig, get_peft_model, TaskType

use_peft=True

peft_config=LoraConfig(
    r=16,
    lora_alpha=32,
    bias="none",
    lora_dropout=0.05,
    task_type=TaskType.CAUSAL_LM,
    target_modules=None
)

peft_model=get_peft_model(model,peft_config)
peft_model.print_trainable_parameters()

trainable params: 1,572,864 || all params: 332,769,280 || trainable%: 0.472659014678278
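
The trainable-parameter count reported above can be reproduced by hand. The sketch assumes that OPT-350m has 24 decoder layers with hidden size 1024, and that `target_modules=None` makes PEFT fall back to its default targets for OPT (`q_proj` and `v_proj`). None of these values are printed in this notebook, so check them against the model config.

```python
def lora_param_count(r, in_features, out_features):
    """A rank-r adapter adds A (r x in) and B (out x r) next to one linear layer."""
    return r * in_features + out_features * r


HIDDEN_SIZE = 1024      # assumed hidden size of OPT-350m
NUM_LAYERS = 24         # assumed number of decoder layers
TARGETS_PER_LAYER = 2   # assumed default targets: q_proj and v_proj

trainable = NUM_LAYERS * TARGETS_PER_LAYER * lora_param_count(16, HIDDEN_SIZE, HIDDEN_SIZE)
total = 332_769_280     # "all params" from the output above
print(trainable, f"{100 * trainable / total:.4f}%")
```

With `r=16` this gives 1,572,864 trainable parameters, matching the PEFT summary; doubling `r` would double that number.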


# Training Model

In [10]:
from transformers import TrainingArguments, Trainer
from trl import SFTTrainer

training_args=TrainingArguments(
    output_dir=os.getenv("WANDB_NAME"),
    per_device_train_batch_size=8,
    gradient_checkpointing=True,
    gradient_accumulation_steps=4,
    learning_rate=1.41e-5,
    num_train_epochs=2,
    optim="paged_adamw_8bit",
    report_to="wandb",
    run_name=os.getenv("WANDB_NAME"),
    save_steps=100,
    logging_steps=50,
    save_total_limit=1,
    push_to_hub=False,
)

sft_trainer=SFTTrainer(
    model=peft_model,
    args=training_args,
    max_seq_length=512,
    train_dataset=dataset['train'],
    dataset_text_field="text",
    tokenizer=tokenizer
)

sft_trainer.train()

[34m[1mwandb[0m: Currently logged in as: [33murakiny[0m ([33mcausal_language_trainer[0m). Use [1m`wandb login --relogin`[0m to force relogin
[34m[1mwandb[0m: wandb version 0.16.3 is available!  To upgrade, please run:
[34m[1mwandb[0m:  $ pip install wandb --upgrade
[34m[1mwandb[0m: Tracking run with wandb version 0.16.2
[34m[1mwandb[0m: Run data is saved locally in [35m[1m/kaggle/working/wandb/run-20240302_003218-j0d8etx9[0m
[34m[1mwandb[0m: Run [1m`wandb offline`[0m to turn off syncing.
[34m[1mwandb[0m: Syncing run [33msft-facebook-opt350m-with-openassistant-guanaco[0m
[34m[1mwandb[0m: ⭐️ View project at [34m[4mhttps://wandb.ai/causal_language_trainer/Supervise-fine-tune-models[0m
[34m[1mwandb[0m: 🚀 View run at [34m[4mhttps://wandb.ai/causal_language_trainer/Supervise-fine-tune-models/runs/j0d8etx9[0m
You're using a GPT2TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a metho


TrainOutput(global_step=24, training_loss=2.6247194608052573, metrics={'train_runtime': 295.6855, 'train_samples_per_second': 2.706, 'train_steps_per_second': 0.081, 'total_flos': 713933299777536.0, 'train_loss': 2.6247194608052573, 'epoch': 1.92})
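
The `global_step=24` and `epoch=1.92` values follow from the batch settings. The sketch below assumes a single training device and that the Trainer computes optimizer steps per epoch by floor division, which is consistent with the numbers logged here.

```python
import math

num_samples = 400       # rows in dataset['train']
per_device_batch = 8    # per_device_train_batch_size
grad_accum = 4          # gradient_accumulation_steps
epochs = 2              # num_train_epochs

batches_per_epoch = math.ceil(num_samples / per_device_batch)  # micro-batches per epoch
updates_per_epoch = batches_per_epoch // grad_accum             # optimizer steps per epoch
global_steps = updates_per_epoch * epochs
epoch_reached = global_steps * grad_accum / batches_per_epoch

print(global_steps, epoch_reached)
```

With only 24 optimizer steps, `logging_steps=50` is never reached, so no intermediate training loss is logged for this run.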

In [11]:
kwargs={
    'model_name': f'{os.getenv("WANDB_NAME")}',
    'finetuned_from': os.getenv('MODEL_NAME'),
    'tasks': 'Text Generation',
#     'dataset_tags':'',
    'dataset':'timdettmers/openassistant-guanaco'
}

tokenizer.push_to_hub(os.getenv("WANDB_NAME"))
sft_trainer.push_to_hub(**kwargs)

CommitInfo(commit_url='https://huggingface.co/aisuko/sft-facebook-opt350m-with-openassistant-guanaco/commit/fc1ccb935f5ec63dbd36e83bfcd869010c795c2d', commit_message='End of training', commit_description='', oid='fc1ccb935f5ec63dbd36e83bfcd869010c795c2d', pr_url=None, pr_revision=None, pr_num=None)

# Inference

In [12]:
del sft_trainer, tokenizer
torch.cuda.empty_cache()

In [13]:
from peft import PeftConfig, PeftModel
from transformers import AutoModelForCausalLM

peft_model_name="aisuko/"+os.getenv("WANDB_NAME")

peft_config=PeftConfig.from_pretrained(peft_model_name)
base_model=AutoModelForCausalLM.from_pretrained(peft_config.base_model_name_or_path)

peft_model=PeftModel.from_pretrained(base_model, peft_model_name)

In [14]:
from transformers import AutoTokenizer

tokenizer=AutoTokenizer.from_pretrained(peft_config.base_model_name_or_path)

In [15]:
prompt="The weather in Melbourne is"
inputs=tokenizer(prompt, return_tensors="pt")

In [16]:
outputs=peft_model.generate(**inputs)



In [17]:
# For more accurate results, train for more steps (more data or more epochs).
tokenizer.batch_decode(outputs, skip_special_tokens=True)

["The weather in Melbourne is so bad that I'm not sure if I should be excited or sad"]
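
Because the adapter was fine-tuned on conversations rather than free text, prompting in the same layout as the training data may give more relevant completions. The helper below is hypothetical and assumes the `### Human:` / `### Assistant:` turn markers of the guanaco data; adjust it if the samples look different.

```python
def build_prompt(user_message):
    """Wrap a user message in the assumed guanaco-style turn markers."""
    return f"### Human: {user_message}### Assistant:"


prompt = build_prompt("What is the weather usually like in Melbourne?")
print(prompt)
```

The resulting string can be passed to `tokenizer(prompt, return_tensors="pt")` in the same way as the plain prompt used in this section.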