# HuggingFace Supervised Fine-tuning Trainer (SFT)
## LoRA Fine-tuning

https://huggingface.co/docs/trl/en/sft_trainer

## TinyLlama
https://arxiv.org/pdf/2401.02385
https://huggingface.co/TinyLlama/TinyLlama-1.1B-Chat-v0.1
https://huggingface.co/facebook/opt-350m
https://huggingface.co/facebook/MobileLLM-125M

## Example scripts
https://github.com/huggingface/trl/blob/main/examples/scripts/sft.py

## Inspired by
https://colab.research.google.com/github/huggingface/smol-course/blob/main/1_instruction_tuning/notebooks/sft_finetuning_example.ipynb


In [1]:
# ! pip install wandb

In [12]:
pip install peft

Collecting peft
  Downloading peft-0.14.0-py3-none-any.whl.metadata (13 kB)
Downloading peft-0.14.0-py3-none-any.whl (374 kB)
Installing collected packages: peft
Successfully installed peft-0.14.0
Note: you may need to restart the kernel to use updated packages.


In [13]:
# Import necessary libraries
from transformers import AutoModelForCausalLM, AutoTokenizer
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer, setup_chat_format
import torch

In [14]:
import os

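# Select the base model. Only the last uncommented assignment takes effect,
# so the alternatives are kept here as commented-out options.
# model_name = "HuggingFaceTB/SmolLM2-135M"
# model_name = "TinyLlama/TinyLlama-1.1B-Chat-v0.1"
model_name = "facebook/opt-350m"

# Requires executing custom model code when loading (trust_remote_code=True)
# model_name = "facebook/MobileLLM-125M"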

os.environ["WANDB_PROJECT"] = "fb-opt-350-ft"
os.environ["WANDB_DIR"] = "./temp"
os.environ["WANDB_JOB_NAME"] = "some-job-name"

## 1. Prepare the dataset

**Dataset format support**

https://huggingface.co/docs/trl/en/sft_trainer#dataset-format-support

In [15]:
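# Load a sample dataset
from datasets import load_dataset

dataset_name = "HuggingFaceTB/smoltalk"
# "everyday-conversations" is a dataset configuration (subset) name, not a split
dataset_config = "everyday-conversations"

# Returns a DatasetDict; its "train" and "test" splits are passed to the trainer later
ds = load_dataset(path=dataset_name, name=dataset_config)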
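
The `everyday-conversations` configuration stores each example in the conversational format that `SFTTrainer` supports: a `messages` list of role/content dictionaries. The sketch below shows the shape of one such record with made-up content, plus a small hypothetical helper (`is_conversational`, which is not part of `datasets` or `trl`) that checks a record against that shape.

```python
# One record in the conversational ("messages") format.
# The text is invented for illustration; it is not taken from the dataset.
example = {
    "messages": [
        {"role": "user", "content": "Hi there"},
        {"role": "assistant", "content": "Hello! How can I help you today?"},
    ]
}

ALLOWED_ROLES = {"system", "user", "assistant"}


def is_conversational(record: dict) -> bool:
    """Hypothetical helper: True if the record looks like a chat-style example."""
    messages = record.get("messages")
    if not isinstance(messages, list) or not messages:
        return False
    return all(
        isinstance(m, dict)
        and m.get("role") in ALLOWED_ROLES
        and isinstance(m.get("content"), str)
        for m in messages
    )


print(is_conversational(example))
print(is_conversational({"text": "plain text"}))
```

A check like this can be run on `ds["train"][0]` before training to confirm the dataset matches the format the trainer expects.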



## 2. Load the model onto the appropriate available device (CPU/GPU)

In [16]:
# Check the machine in use and set the device to use for training
# cuda = GPU, mps = Metal Performance Shaders on macOS or Apple GPU, cpu otherwise
device = (
    "cuda"
    if torch.cuda.is_available()
    else "mps" if torch.backends.mps.is_available() else "cpu"
)

# Print the selected device (the model itself is loaded in the next step)
print("Model loaded to: ", device)



# Load the pretrained model & move it to the specified device
model = AutoModelForCausalLM.from_pretrained(
    pretrained_model_name_or_path=model_name
).to(device)

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(pretrained_model_name_or_path=model_name)

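# Set up the chat format: adds the ChatML special tokens and chat template to the
# tokenizer, and resizes the model's embedding layer to match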
model, tokenizer = setup_chat_format(model=model, tokenizer=tokenizer)

Model loaded to:  cpu
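
`setup_chat_format` defaults to the ChatML format, which wraps every message in `<|im_start|>` and `<|im_end|>` markers. As a rough illustration of what the resulting chat template produces, here is a plain-Python sketch. `render_chatml` is a hypothetical helper written for this note; in practice `tokenizer.apply_chat_template` does this work.

```python
def render_chatml(messages, add_generation_prompt=False):
    """Hypothetical helper: format messages the way a ChatML template would."""
    text = ""
    for message in messages:
        text += f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n"
    if add_generation_prompt:
        # Cue the model to answer as the assistant
        text += "<|im_start|>assistant\n"
    return text


chat = [
    {"role": "user", "content": "Hi there"},
    {"role": "assistant", "content": "Hello!"},
]
print(render_chatml(chat))
```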


## 3. Set up the training configuration

**SFTConfig**

https://huggingface.co/docs/trl/v0.12.2/en/sft_trainer#trl.SFTConfig

`SFTConfig` specifies the hyperparameters and settings for the fine-tuning run. It extends `transformers.TrainingArguments` with options specific to supervised fine-tuning, such as the maximum sequence length and packing.

### 3.1 PEFT configuration

**LoraConfig**

https://huggingface.co/docs/peft/en/package_reference/lora#peft.LoraConfig

**Task type**

The `task_type` parameter in LoRA (Low-Rank Adaptation) and the `peft` library specifies the type of task for which the model is being fine-tuned. The following are the possible values for `task_type` based on the common use cases supported by the `peft` library:

### **Possible Values for `task_type`**

1. **`"CAUSAL_LM"`**
   - **Description**: Fine-tuning for **causal language modeling** tasks, where the model predicts the next token in a sequence based on previous tokens.
   - **Examples**: GPT-like autoregressive models.

2. **`"SEQ_2_SEQ_LM"`**
   - **Description**: Fine-tuning for **sequence-to-sequence language modeling** tasks, where the input and output are sequences. Used in tasks like translation or summarization.
   - **Examples**: T5, BART.

3. **`"TOKEN_CLS"`**
   - **Description**: Fine-tuning for **token classification** tasks, where the goal is to classify each token in the input sequence.
   - **Examples**: Named Entity Recognition (NER), Part-of-Speech (POS) tagging.

4. **`"SEQ_CLS"`**
   - **Description**: Fine-tuning for **sequence classification** tasks, where the entire sequence is classified into categories.
   - **Examples**: Sentiment analysis, text classification.

5. **`"QUESTION_ANS"`**
   - **Description**: Fine-tuning for **extractive question answering**, where the model predicts an answer span from the input context and question.
   - **Examples**: SQuAD dataset.

6. **`"FEATURE_EXTRACTION"`**
   - **Description**: Fine-tuning a model that is used to produce embeddings or hidden states rather than task-specific predictions.
   - **Examples**: Sentence embedding models.

---

### **How `task_type` Influences Model Behavior**
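In `peft`, the `task_type` parameter mainly determines which wrapper class is placed around the base model, and therefore how inputs are passed through and which non-LoRA parts (such as a classification head) stay trainable. For instance:
- With `"CAUSAL_LM"`, the model is wrapped for autoregressive use, so text generation works on the adapted model.
- With `"SEQ_2_SEQ_LM"`, the wrapper handles both encoder and decoder inputs. Which modules actually receive LoRA layers is still controlled by `target_modules`.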

---

### **Custom/Library-Specific Extensions**
Some implementations of LoRA or similar libraries may add other task types or extend the list. To ensure compatibility, always refer to the latest documentation of the `peft` library or framework you're using.

In [17]:
from peft import LoraConfig

# LoRA rank: lower values mean fewer trainable parameters (a smaller adapter),
# higher values give the adapter more capacity
rank_dimension = 6

# Scaling factor: the LoRA update is scaled by lora_alpha / r
lora_alpha = 8

# Dropout applied to the LoRA layers (5%) to reduce overfitting
lora_dropout = 0.05

peft_config = LoraConfig(
    # Rank dimension - typical values are between 4 and 32
    r=rank_dimension,

    # Scaling factor - a common starting point is 2 x r (this run uses 8 with r=6)
    lora_alpha=lora_alpha,

    # Dropout probability for LoRA layers
    lora_dropout=lora_dropout,

    # Which bias parameters to train: "none", "all" or "lora_only"
    bias="none",

    # Modules to apply the adapter to. "all-linear" targets every linear layer
    # except the output layer; a list of module names restricts it to those modules.
    target_modules="all-linear",

    # Task type for model architecture
    task_type="CAUSAL_LM",
)
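
To make the effect of `r` and `lora_alpha` concrete: for a linear layer with a weight of shape `d_out × d_in`, LoRA trains two small matrices of shapes `d_out × r` and `r × d_in`, and scales their product by `lora_alpha / r`. The sketch below computes the resulting parameter counts for the values used above. The 1024 × 1024 layer size is an assumed example, not a dimension read from the model.

```python
def lora_param_count(d_in: int, d_out: int, r: int) -> int:
    """Trainable parameters LoRA adds to one linear layer: A is (r x d_in), B is (d_out x r)."""
    return r * d_in + d_out * r


def lora_scaling(lora_alpha: float, r: int) -> float:
    """Default LoRA scaling applied to the low-rank update."""
    return lora_alpha / r


# Assumed layer size, for illustration only
d_in, d_out = 1024, 1024
r, alpha = 6, 8

full = d_in * d_out
added = lora_param_count(d_in, d_out, r)
print(f"full weight: {full} params, LoRA adds: {added} ({added / full:.2%})")
print(f"scaling factor: {lora_scaling(alpha, r):.3f}")
```

Doubling `r` doubles the number of added parameters, which is why small ranks keep the adapter compact.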


### 3.2 Trainer configuration


This configuration defines a set of hyperparameters and settings for fine-tuning a model using **Supervised Fine-Tuning (SFT)** with recommendations inspired by the **QLoRA paper**. QLoRA is designed for efficient fine-tuning of large language models, leveraging optimizations such as quantization and memory-saving techniques.

Here’s a detailed explanation of each parameter:

---

### **1. Output Settings**
- **`output_dir=model_output_folder`**:
  - Specifies the directory where model checkpoints and outputs will be saved. The variable `model_output_folder` holds the path of that directory.

---

### **2. Training Duration**
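- **`num_train_epochs=1`**:
  - Defines the number of full passes through the dataset during training.
  - A single epoch is often enough when fine-tuning a pre-trained model and limits the risk of overfitting.

- **`max_steps=100`**:
  - Caps the number of optimizer update steps. When set to a positive value it overrides `num_train_epochs`, so this run stops after 100 steps.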

---

### **3. Batch Size Settings**
- **`per_device_train_batch_size=2`**:
  - The number of training samples processed per device (GPU, MPS, or CPU) during one forward and backward pass.
  - A small batch size is chosen to save memory, especially for large models.

- **`gradient_accumulation_steps=2`**:
  - Accumulates gradients over multiple steps before performing a weight update, effectively creating a larger batch size (`effective batch size = batch size × gradient accumulation steps`, here 2 × 2 = 4 per device).
  - Helps achieve the benefits of larger batch training without requiring as much memory.
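
As a worked example of the arithmetic above: the number of examples that contribute to one weight update is the per-device batch size times the accumulation steps times the number of devices. The sketch assumes a single device and uses the `max_steps=100` value from this notebook's configuration. With packing enabled, an "example" here is one packed sequence.

```python
def effective_batch_size(per_device_batch: int, accumulation_steps: int, num_devices: int = 1) -> int:
    """Number of examples that contribute to a single optimizer step."""
    return per_device_batch * accumulation_steps * num_devices


batch = effective_batch_size(per_device_batch=2, accumulation_steps=2)
max_steps = 100  # value used in this notebook's SFTConfig

print(f"effective batch size: {batch}")
print(f"examples processed in {max_steps} steps: {batch * max_steps}")
```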

---

### **4. Memory Optimization**
- **`gradient_checkpointing=True`**:
  - Enables recomputation of intermediate activations during the backward pass instead of storing them, reducing memory usage at the cost of additional computation.
  - Useful for fine-tuning large models with limited GPU memory.

---

### **5. Optimizer Settings**
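- **`optim="adamw_torch_fused"`**:
  - Uses PyTorch's fused implementation of the AdamW optimizer, which is typically faster on GPUs. Depending on the PyTorch version it may not be available on CPU, in which case `"adamw_torch"` is the non-fused alternative.
  - AdamW is a variant of Adam that includes decoupled weight decay, making it a popular choice for transformer models.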

- **`learning_rate=2e-4`**:
  - Learning rate specifies the step size for updating weights during optimization.
  - The value is chosen based on QLoRA recommendations for fine-tuning large models.

- **`max_grad_norm=0.3`**:
  - Implements gradient clipping to prevent excessively large gradients, which can destabilize training.
  - A low value of 0.3 is recommended for fine-tuning pre-trained models.
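
Gradient clipping by global norm can be sketched in a few lines of plain Python. This mirrors the idea behind `torch.nn.utils.clip_grad_norm_` (compute the L2 norm over all gradients, then rescale if it exceeds `max_grad_norm`), but it operates on a flat list of floats for illustration and is not the library implementation.

```python
import math


def clip_by_global_norm(grads, max_norm):
    """Rescale gradients so their combined L2 norm does not exceed max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm <= max_norm:
        return list(grads), total_norm
    scale = max_norm / total_norm
    return [g * scale for g in grads], total_norm


grads = [3.0, 4.0]  # toy gradients with L2 norm 5.0
clipped, norm_before = clip_by_global_norm(grads, max_norm=0.3)
norm_after = math.sqrt(sum(g * g for g in clipped))
print(norm_before, norm_after)
```

The direction of the update is preserved; only its length is capped.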

---

### **6. Learning Rate Schedule**
- **`warmup_ratio=0.03`**:
  - Specifies the proportion of the total training steps over which the learning rate is gradually increased from 0 to the target value (warmup phase).
  - Helps avoid large updates at the start of training.

- **`lr_scheduler_type="constant"`**:
  - Keeps the learning rate constant for the whole run. Simpler than decay schedules and works well for short fine-tuning tasks.
  - In `transformers`, the plain `"constant"` schedule does not apply warmup, so `warmup_ratio` only takes effect with a warmup-aware schedule such as `"constant_with_warmup"`.
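
The warmup behaviour can be sketched as a function of the step number. This is a simplified illustration of linear warmup followed by a constant rate (what `transformers` calls `"constant_with_warmup"`). Rounding the warmup step count with `math.ceil` is an assumption about how the ratio is converted to steps.

```python
import math


def lr_at_step(step, base_lr=2e-4, max_steps=100, warmup_ratio=0.03):
    """Learning rate under linear warmup followed by a constant rate."""
    warmup_steps = math.ceil(max_steps * warmup_ratio)  # assumed rounding
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    return base_lr


for step in (0, 1, 2, 3, 50):
    print(step, lr_at_step(step))
```

With 100 steps and a ratio of 0.03, only the first three steps use a reduced learning rate.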

---

### **7. Logging and Saving**
- **`logging_steps=10`**:
  - Logs training metrics (e.g., loss, learning rate) every 10 steps for monitoring progress.

- **`save_strategy="epoch"`**:
  - Saves model checkpoints at the end of each epoch, ensuring periodic backups without overloading storage.

---

### **8. Precision Settings**
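- **`bf16=True`**:
  - Uses **bfloat16 (Brain Floating Point)** precision instead of full 32-bit precision to reduce memory usage and speed up computations.
  - Bfloat16 keeps the same exponent range as float32 with fewer mantissa bits, making it suitable for training large models.
  - Requires hardware that supports bfloat16 (for example, NVIDIA Ampere or newer GPUs); set it to `False` on devices that do not.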
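
To see why bfloat16 keeps the range of float32 but not its precision, note that it keeps the sign bit, all 8 exponent bits, and only the top 7 mantissa bits of a float32. The sketch below emulates that by truncating the lower 16 bits. Real hardware typically rounds to nearest rather than truncating, so treat this as an approximation.

```python
import struct


def to_bfloat16(value: float) -> float:
    """Emulate bfloat16 by keeping only the top 16 bits of the float32 encoding (truncation)."""
    bits = struct.unpack(">I", struct.pack(">f", value))[0]
    truncated = bits & 0xFFFF0000
    return struct.unpack(">f", struct.pack(">I", truncated))[0]


print(to_bfloat16(1.0))      # exactly representable
print(to_bfloat16(3.14159))  # only about 2-3 significant decimal digits survive
print(to_bfloat16(1e38))     # large magnitudes still fit, unlike float16
```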

---

### **9. Integration Settings**
- **`push_to_hub=False`**:
  - Disables automatic pushing of the model to the Hugging Face Hub.

- **`report_to="wandb"`**:
  - Sends training metrics to Weights & Biases. Use `report_to="none"` to disable reporting to external tools such as TensorBoard or Weights & Biases.

---

### **Summary**
This configuration is tailored for efficient fine-tuning of language models using settings borrowed from QLoRA:
- **Memory optimization**: Gradient checkpointing, bfloat16 precision, and small per-device batches with gradient accumulation.
- **Learning efficiency**: Fused AdamW optimizer with a constant learning rate.
- **Minimal overfitting and stable updates**: A single epoch (capped by `max_steps`) and gradient clipping.
- **Practicality**: Saves checkpoints per epoch and reports metrics to Weights & Biases.

It is a practical setup for training large models while minimizing computational and memory overhead.

In [22]:
from datetime import datetime

# Get the current timestamp
current_time = datetime.now()

# Create a readable timestamp
formatted_time = current_time.strftime("%b-%d-%Y-%H-%M-%S")

# Create a name for the run
wandb_run_name = f"FT_run_{formatted_time}"

# Name for the fine-tuned model (also used as the Hugging Face Hub model id)
fine_tuned_model_name = "fine-tuned-chat-model-lora"

# Model assets output folder
model_output_folder = "c:/temp/sft_output"

# SFTTrainer configuration
sft_config = SFTConfig(
    
    ########################
    #### Output setting ####
    ########################
    # Output directory for model assets
    # Specifies the directory where model checkpoints and outputs will be saved.
    output_dir=model_output_folder,

    ###########################
    #### Training duration ####
    ###########################

    # Hyperparameter : Number of epochs (ignored here because max_steps is set)
    num_train_epochs=1,

    # Hyperparameter : Maximum number of gradient update steps during training.
    # A positive value overrides num_train_epochs.
    max_steps=100,


    ####################
    #### Batch size ####
    ####################
    # Set according to your GPU memory capacity
    # Number of training samples per device in each batch. Smaller values help fit large models into memory-constrained GPUs.
    per_device_train_batch_size=2,  

    # Useful for fine-tuning large models with limited GPU memory
    # Accumulates gradients over multiple steps before performing a weight update, 
    # effectively creating a larger batch size (effective batch size = batch size × gradient accumulation steps).
    gradient_accumulation_steps=2,


    #############################
    #### Memory optimization ####
    #############################

    # Enables recomputation of intermediate activations during the backward pass instead of storing them, 
    # reducing memory usage at the cost of additional computation. Useful for fine-tuning large models with limited GPU memory
    gradient_checkpointing=True,
    

    ###########################
    #### Optimizer setting ####
    ###########################
    # AdamW is a variant of Adam that includes weight decay
    optim="adamw_torch_fused",

    # Learning rate specifies the step size for updating weights during optimization.
    # The initial learning rate for the optimizer.
    # Value from QLoRA paper
    learning_rate=2e-4,  

    # Implements gradient clipping to prevent excessively large gradients, which can destabilize training
    # A low value of 0.3 is recommended for fine-tuning pre-trained models.
    max_grad_norm=0.3,

    ###########################################
    #### Learning rate schedule/dynamicity ####
    ###########################################

    # Proportion of the total training steps used to ramp the learning rate up from 0 to the target value (warmup phase).
    # Helps avoid large updates at the start of training.
    # Note: the plain "constant" schedule below does not apply warmup; use "constant_with_warmup" for this to take effect.
    warmup_ratio=0.03,

    # Keeps the learning rate constant. Simpler than decay schedules and works well for short fine-tuning tasks.
    lr_scheduler_type="constant",


    ########################################
    #### Evaluation/validation strategy ####
    ########################################

    # Evaluate every eval_steps steps; when eval_steps is not set, it defaults to logging_steps
    eval_strategy="steps",

    # Reload the best model at the end of training
    # load_best_model_at_end=True,  


    ##########################
    #### Logging & saving ####
    ##########################

    # Frequency of logging training metrics
    # Logs metrics (e.g., loss) every 10 steps during training.
    logging_steps=10,  

    # Saves model checkpoints at the end of each epoch, ensuring periodic backups without overloading storage.
    save_strategy="epoch",


    ###########################
    #### Precision setting ####
    ###########################

    # Uses bfloat16 (Brain Floating Point) precision instead of full 32-bit precision to reduce memory usage and speed up computations.
    bf16=True,

    ##############################
    #### Integration settings ####
    ##############################

    # Disables automatic pushing of the model to the Hugging Face Hub
    push_to_hub=False,

    # Reports training progress to Weights & Biases; use "none" to disable reporting to external tools
    report_to="wandb",

    # Run name shown in wandb; only used when reporting is enabled
    run_name=wandb_run_name,

    # Unique model id on the Hugging Face Hub; only used when pushing to the Hub
    hub_model_id=fine_tuned_model_name,

)



### 3.3 Set up the supervised fine-tuning trainer with LoRA

**SFTTrainer**

https://huggingface.co/docs/trl/v0.12.2/en/sft_trainer#trl.SFTTrainer

**SFTTrainer extends the transformers.Trainer class**

https://huggingface.co/docs/transformers/en/main_classes/trainer#transformers.Trainer

In [23]:
# Initialize the SFTTrainer

# Maximum number of tokens in a training sequence.
# Longer sequences are truncated; with packing enabled, shorter examples are
# concatenated into sequences of this length rather than padded.
# This affects both computational efficiency and memory usage during training and evaluation.
max_seq_length = 1512  # max sequence length for model and packing of the dataset


trainer = SFTTrainer(

    # The language model being fine-tuned.
    model=model,

    # Passes the fine-tuning configuration defined above 
    args=sft_config,

    # Training dataset
    train_dataset=ds["train"],

    # Evaluation dataset
    eval_dataset=ds["test"],

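    # Tokenizer used
    tokenizer=tokenizer,

    # Maximum sequence length
    # (the deprecation warning printed below indicates that max_seq_length and packing
    # are expected on SFTConfig rather than on SFTTrainer)
    max_seq_length=max_seq_length,

    # Enables packing, which concatenates multiple short examples into fixed-length sequences
    # to reduce padding and improve training efficiency.
    packing=True,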

    # PEFT configuration
    peft_config=peft_config,
)


Deprecated positional argument(s) used in SFTTrainer, please use the SFTConfig to set these arguments instead.


Generating train split: 0 examples [00:00, ? examples/s]

Generating train split: 0 examples [00:00, ? examples/s]

max_steps is given, it will override any value given in num_train_epochs
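
Packing can be pictured as concatenating tokenized examples, separated by an end-of-sequence token, and cutting the stream into blocks of `max_seq_length` tokens. The sketch below uses made-up token ids and a tiny block size. `pack_sequences` is a hypothetical helper, and TRL's actual implementation differs in details such as buffering and shuffling.

```python
def pack_sequences(tokenized_examples, max_seq_length, eos_token_id):
    """Hypothetical helper: concatenate examples and split into fixed-length blocks."""
    stream = []
    for tokens in tokenized_examples:
        stream.extend(tokens)
        stream.append(eos_token_id)  # mark the boundary between examples
    # Keep only full blocks; the incomplete tail is dropped in this sketch
    n_blocks = len(stream) // max_seq_length
    return [
        stream[i * max_seq_length:(i + 1) * max_seq_length]
        for i in range(n_blocks)
    ]


examples = [[11, 12, 13], [21, 22], [31, 32, 33, 34]]  # made-up token ids
packed = pack_sequences(examples, max_seq_length=4, eos_token_id=0)
print(packed)
```

Note that a block can start or end in the middle of an example, which is the trade-off packing makes to avoid padding.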


## 4. Train the model

Wandb - configuration
https://docs.wandb.ai/guides/track/environment-variables/

To disable Weights & Biases logging, set the following environment variable before training:

import os
os.environ["WANDB_DISABLED"] = "True"

In [None]:
import os 

# Train the model
trainer.train()

# Save the model
trainer.save_model(f"./temp/{fine_tuned_model_name}")

## 5. Upload to HF hub


In [10]:
import getpass

print("Provide the HUGGINGFACEHUB_API_TOKEN:")
HUGGINGFACEHUB_API_TOKEN=getpass.getpass()

trainer.push_to_hub(token=HUGGINGFACEHUB_API_TOKEN)


Provide the HUGGINGFACEHUB_API_TOKEN:


 ········


model.safetensors:   0%|          | 0.00/1.32G [00:00<?, ?B/s]

Upload 2 LFS files:   0%|          | 0/2 [00:00<?, ?it/s]

training_args.bin:   0%|          | 0.00/5.56k [00:00<?, ?B/s]

CommitInfo(commit_url='https://huggingface.co/acloudfan/fine-tuned-chat-model/commit/e0a163df1518126d0a5d8833a50ef84f45f541fd', commit_message='End of training', commit_description='', oid='e0a163df1518126d0a5d8833a50ef84f45f541fd', pr_url=None, repo_url=RepoUrl('https://huggingface.co/acloudfan/fine-tuned-chat-model', endpoint='https://huggingface.co', repo_type='model', repo_id='acloudfan/fine-tuned-chat-model'), pr_revision=None, pr_num=None)
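
For non-interactive runs, the token can be read from an environment variable instead of a prompt. The sketch below is one way to do that. `get_hf_token` is a hypothetical helper, and `HF_TOKEN` is assumed here as the variable name (it is the name the `huggingface_hub` library looks for).

```python
import getpass
import os


def get_hf_token(env_var: str = "HF_TOKEN") -> str:
    """Hypothetical helper: read the token from the environment, else prompt for it."""
    token = os.environ.get(env_var)
    if token:
        return token
    return getpass.getpass(f"Provide the {env_var}: ")


# Usage (commented out so the sketch does not prompt or push when pasted):
# trainer.push_to_hub(token=get_hf_token())
```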

## 6. Try out the model

In [16]:
from transformers import pipeline

question = "If you had a time machine, but could only go to the past or the future once and never return, which would you choose and why?"
# Pass device="cuda" to pipeline() to run generation on a GPU
generator = pipeline("text-generation", model="acloudfan/fine-tuned-chat-model")
output = generator([{"role": "user", "content": question}], max_new_tokens=128, return_full_text=False)[0]
print(output["generated_text"])

I'd choose the future. I'd like to see the future, but I don't want to be stuck in the past. What if I had a time machine and could travel back in time? Would I still be the same person? Would I still be the same person? Would I still be the same person? Would I still be the same person? Would I still be the same person? Would I still be the same person? Would I still be the same person? Would I still be the same person? Would I still be the same person? Would I still be the same person? Would I still be the same person?
