# 🧠 Week 5: Supervised Finetuning (SFT) - I
**Theme:** Teaching LLMs to Follow Instructions  
**Project:** LoRA vs. Full Finetuning on Hugging Face with DeepSpeed/TRL


## 📘 1. What is Supervised Finetuning (SFT)?
SFT = Pretrained model + Instruction-following data → Task-specific model

**Supervised Fine-Tuning (SFT)** is the process of further training a pre-trained language model on a labeled dataset to specialize it for specific tasks or domains.

**Key points:**
- Builds on top of a pretrained model like LLaMA, GPT, or Mistral.
- Uses instruction-response pairs (like question-answer).
- Enhances instruction-following ability.
- It's a middle stage between pretraining and alignment (e.g., RLHF).

**SFT Pipeline:**
1. Pretrained Model
2. Supervised Dataset
3. Fine-tuned Instruction Model


**Example:**
| Before SFT                 | After SFT                          |
|---------------------------|------------------------------------|
| Random generic responses  | Follows user instructions clearly |
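
To make the pipeline concrete, here is a minimal sketch of how one instruction-response pair can become a training example. It assumes a toy whitespace tokenizer and the common (but not universal) convention of computing the loss only on response tokens. `build_sft_example` and the prompt wording are illustrative names, not part of any library; real pipelines use the model's own tokenizer.

```python
def build_sft_example(instruction, response):
    """Return tokens plus a mask that is 1 only where the loss should apply."""
    # Toy whitespace tokenization (assumption); real code uses the model tokenizer.
    prompt_tokens = f"Instruction: {instruction} Response:".split()
    response_tokens = response.split()
    tokens = prompt_tokens + response_tokens
    # 0 = prompt token (ignored by the loss), 1 = response token (trained on)
    loss_mask = [0] * len(prompt_tokens) + [1] * len(response_tokens)
    return {"tokens": tokens, "loss_mask": loss_mask}

example = build_sft_example("What is the capital of France?", "Paris.")
print(example["tokens"])
print(example["loss_mask"])
```

Libraries typically encode the same mask by setting the labels of the masked positions to `-100`, the default ignore index of PyTorch's cross-entropy loss.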




## 📊 2. How to Get SFT Data

**4 Types of Data Sources:**
1. **Manual Curation**: Human-created prompts and responses.
2. **AI-Generated**: Use GPT models to self-generate instruction data.
3. **Open Datasets**: Alpaca, OASST1, Dolly, HH-RLHF, etc.
4. **Data Augmentation**: Rephrasing, adding context, changing perspective.

**Goal**: Create high-quality, diverse, and instruction-aligned examples.
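
Whichever source the data comes from, a small cleaning pass helps toward that goal. The sketch below is one possible filter, assuming each record is a dict with `instruction` and `response` keys. `clean_sft_examples` and the `min_response_chars` threshold are hypothetical choices for illustration.

```python
def clean_sft_examples(examples, min_response_chars=5):
    """Drop incomplete pairs and duplicates, keeping the first copy of each instruction."""
    seen = set()
    cleaned = []
    for ex in examples:
        instruction = (ex.get("instruction") or "").strip()
        response = (ex.get("response") or "").strip()
        # Skip records with a missing instruction or a too-short response
        if not instruction or len(response) < min_response_chars:
            continue
        # Normalize case and whitespace so near-identical prompts collapse together
        key = " ".join(instruction.lower().split())
        if key in seen:
            continue
        seen.add(key)
        cleaned.append({"instruction": instruction, "response": response})
    return cleaned

raw = [
    {"instruction": "What is Python?", "response": "Python is a programming language."},
    {"instruction": "what is  python?", "response": "A language."},  # duplicate after normalization
    {"instruction": "Explain Git.", "response": ""},                 # empty response
]
print(clean_sft_examples(raw))
```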

This walkthrough uses the second approach: generating instruction data with the OpenAI API.


Create a virtual environment

(base) C:\Users\ch939>conda create -n sft_env python=3.10

Activate the virtual environment

(base) C:\Users\ch939>conda activate sft_env

Next, select `sft_env` (Python 3.10.18) as the notebook kernel.

In the Anaconda Prompt, we can see:

(sft_env) C:\Users\ch939>

Install libraries

In [3]:
! conda install openai -y

3 channel Terms of Service accepted
Channels:
 - conda-forge
 - pytorch
 - defaults
Platform: win-64
Collecting package metadata (repodata.json): done
Solving environment: done

## Package Plan ##

  environment location: C:\Users\ch939\anaconda3\envs\sft_env

  added / updated specs:
    - openai


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    distro-1.9.0               |     pyhd8ed1ab_1          41 KB  conda-forge
    jiter-0.10.0               |  py310hc226416_0         178 KB  conda-forge
    openai-1.99.9              |     pyhd8ed1ab_0         303 KB  conda-forge
    ------------------------------------------------------------
                                           Total:         522 KB

The following NEW packages will be INSTALLED:

  annotated-types    conda-forge/noarch::annotated-types-0.7.0-pyhd8ed1ab_1 
  anyio              conda-forge/noarch::anyio-4.10.0-pyhe01879c_0

Copy the `.env` file into the working directory and install `python-dotenv` so the API key can be loaded from it.

In [9]:
!pip install python-dotenv

Collecting python-dotenv
  Downloading python_dotenv-1.1.1-py3-none-any.whl.metadata (24 kB)
Downloading python_dotenv-1.1.1-py3-none-any.whl (20 kB)
Installing collected packages: python-dotenv
Successfully installed python-dotenv-1.1.1


Verify that dotenv has been installed

In [3]:
import dotenv
print(dotenv.__file__)

c:\Users\ch939\anaconda3\envs\sft_env\lib\site-packages\dotenv\__init__.py


In [4]:
from openai import OpenAI
import os
import json
from dotenv import load_dotenv

load_dotenv()

def get_ai_generated_data():
    if not os.getenv("OPENAI_API_KEY"):
        print("⚠️ No OpenAI API key - using placeholder data")
        return [{"instruction": "What are your technical skills?", "response": "Python, data analysis"}]

    client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))
    prompt = "Create 2 interview Q&A pairs for a software developer in JSON format. Output only JSON."

    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.7
    )

    content = response.choices[0].message.content

    # If model returns Markdown-style JSON block
    if "```json" in content:
        content = content.split("```json")[1].split("```")[0]
    elif "```" in content:
        content = content.split("```")[1]

    try:
        data = json.loads(content)
        if isinstance(data, dict):
            return data.get("examples", [])
        elif isinstance(data, list):
            return data
        else:
            print("⚠️ Unexpected JSON format:", type(data))
            return []
    except Exception as e:
        print("❌ Failed to parse JSON:", e)
        print("Raw content:", content)
        return []

# Example call
get_ai_generated_data()


[{'question': 'Can you explain the difference between a stack and a queue?',
  'answer': 'A stack is a data structure that follows the Last In First Out (LIFO) principle, meaning that the last element added to the stack is the first one to be removed. Conversely, a queue operates on a First In First Out (FIFO) basis, where the first element added is the first one to be removed. This means that stacks are more suitable for scenarios like function call management, while queues are ideal for scheduling tasks.'},
 {'question': 'What is the purpose of version control systems like Git?',
  'answer': 'Version control systems like Git are used to track changes in source code over time. They allow multiple developers to collaborate on a project efficiently, manage different versions of code, revert to previous versions if needed, and maintain a history of changes. Git specifically enables branching and merging, which helps in developing features in isolation before integrating them into the mai
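
Note that the sample output above uses `question`/`answer` keys, while the placeholder data in the function uses `instruction`/`response`. A small normalization step keeps downstream formatting code simple. The sketch below is one way to do it; `KEY_ALIASES` and `normalize_pair` are hypothetical names, and the sample records are shortened stand-ins for the generated data.

```python
KEY_ALIASES = {
    "instruction": ("instruction", "question", "prompt"),
    "response": ("response", "answer", "output"),
}

def normalize_pair(record):
    """Map alternative key names onto instruction/response; return None if a side is missing."""
    normalized = {}
    for target, aliases in KEY_ALIASES.items():
        # Take the first alias that is present and non-empty
        value = next((record[k] for k in aliases if record.get(k)), None)
        if value is None:
            return None
        normalized[target] = value.strip()
    return normalized

generated = [
    {"question": "What is the purpose of version control systems like Git?",
     "answer": "They track changes in source code over time."},
    {"instruction": "What are your technical skills?", "response": "Python, data analysis"},
    {"question": "Missing answer"},  # incomplete record, dropped below
]
pairs = [p for p in map(normalize_pair, generated) if p is not None]
print(pairs)
```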

## 🧩 3. Formatting: ChatML
**ChatML** is a structured dialogue format used to simulate role-based conversations during SFT training.

**Structure:**
```
<|im_start|>user
What's the capital of France?
<|im_end|>
<|im_start|>assistant
Paris.
<|im_end|>
```

**Why it matters:**
- Improves consistency
- Helps multi-turn dialogue modeling
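- Matches the chat template of models trained on ChatML (the format was introduced by OpenAI; other families such as LLaMA define their own templates, so always check the tokenizer's chat template)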
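
As an illustration, the sketch below renders a list of messages in the ChatML layout shown above. `to_chatml` is a hypothetical helper, and the exact placement of newlines around `<|im_end|>` follows the example in this section, which may differ from the template a given model was trained with.

```python
def to_chatml(messages, add_generation_prompt=False):
    """Render a list of {"role", "content"} dicts as ChatML text."""
    parts = []
    for message in messages:
        parts.append(
            f"<|im_start|>{message['role']}\n{message['content']}\n<|im_end|>\n"
        )
    if add_generation_prompt:
        # Leave the assistant turn open so the model continues from here
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)

conversation = [
    {"role": "user", "content": "What's the capital of France?"},
    {"role": "assistant", "content": "Paris."},
]
print(to_chatml(conversation))
```

For real training runs, tokenizers that ship a chat template expose `tokenizer.apply_chat_template`, which is usually the safer choice because it reproduces the model's own format.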



## 🔍 4. Full Finetune vs. LoRA


| Aspect               | Full Fine-Tuning      | LoRA (Low-Rank Adaptation) |
|----------------------|-----------------------|-----------------------------|
| Trainable Params     | 100%                  | Typically ~0.1–1% (varies with rank and target modules) |
| Memory Usage         | Very High             | Low                         |
| Flexibility          | Maximum               | Good for most tasks         |
| Training Time        | Longer                | Faster                      |
| Use Case             | Critical domain shift | Resource-efficient tuning   |

**Recommendation**: Use LoRA for most educational and practical settings unless full retraining is justified.
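
The parameter savings in the table follow from how LoRA parameterizes the weight update. As a brief sketch in standard LoRA notation (these symbols are not defined elsewhere in these notes), a frozen weight matrix $W_0 \in \mathbb{R}^{d_{\text{out}} \times d_{\text{in}}}$ is augmented with a trainable low-rank product:

$$
h = W_0 x + \frac{\alpha}{r} B A x, \qquad B \in \mathbb{R}^{d_{\text{out}} \times r}, \quad A \in \mathbb{R}^{r \times d_{\text{in}}}
$$

Only $A$ and $B$ are trained, so each adapted matrix contributes $r\,(d_{\text{in}} + d_{\text{out}})$ trainable parameters instead of $d_{\text{in}}\, d_{\text{out}}$. In PEFT's `LoraConfig`, $r$ corresponds to `r` and $\alpha$ to `lora_alpha`.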


Install PyTorch with CUDA from the Anaconda Prompt; the installation takes about 15 minutes. Running the same command from inside the notebook, as in the line below with the `!` prefix, did not work:

!conda install pytorch torchvision torchaudio pytorch-cuda=11.8 -c pytorch -c nvidia

Run the following in the Anaconda Prompt:

conda install numpy --force-reinstall --yes

Reinstall PyTorch and the Hugging Face libraries to fix any broken dependencies:

conda install pytorch torchvision torchaudio pytorch-cuda=11.8 -c pytorch -c nvidia --force-reinstall --yes

pip install --force-reinstall transformers peft accelerate datasets

These steps still produced errors, so the environment was rebuilt from scratch as follows:


conda deactivate
conda env remove -n sft_env
conda create -n sft_env python=3.10
conda activate sft_env

# Install matching torch stack
conda install pytorch torchvision torchaudio pytorch-cuda=11.8 -c pytorch -c nvidia -y

# Install numpy via conda (critical for Windows)
conda install numpy -y

# Install Hugging Face stack via pip
pip install transformers peft accelerate datasets

In [3]:
import torch
import fsspec

print('PyTorch version:', torch.__version__)
print('CUDA available:', torch.cuda.is_available())
print('fsspec version:', fsspec.__version__)

# Optional: test transformers
try:
    from transformers import AutoTokenizer
    tokenizer = AutoTokenizer.from_pretrained('gpt2')
    print('✅ All good!')
except Exception as e:
    print('❌ Error:', e)

PyTorch version: 2.5.1
CUDA available: True
fsspec version: 2025.3.0
❌ Error: There was a specific connection error when trying to load gpt2:
401 Client Error: Unauthorized for url: https://huggingface.co/gpt2/resolve/main/config.json (Request ID: Root=1-68a2bb82-758a71b036d37c7903d168eb;50198f83-f8f6-49c5-a561-c5f9a6be3b34)

Invalid credentials in Authorization header




Public models such as `gpt2` do not require a Hugging Face login. The 401 error above comes from invalid stored credentials being sent with the request, so logging out clears it:

!huggingface-cli logout

In [4]:
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

print("CUDA:", torch.cuda.is_available())

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

print("Model device:", model.device)
print("✅ Success!")

CUDA: True


tokenizer_config.json:   0%|          | 0.00/26.0 [00:00<?, ?B/s]

To support symlinks on Windows, you either need to activate Developer Mode or to run Python as an administrator. In order to activate developer mode, see this article: https://docs.microsoft.com/en-us/windows/apps/get-started/enable-your-device-for-development


config.json:   0%|          | 0.00/665 [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/1.04M [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.36M [00:00<?, ?B/s]

Xet Storage is enabled for this repo, but the 'hf_xet' package is not installed. Falling back to regular HTTP download. For better performance, install the package with: `pip install huggingface_hub[hf_xet]` or `pip install hf_xet`


model.safetensors:   0%|          | 0.00/548M [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/124 [00:00<?, ?B/s]

Model device: cpu
✅ Success!


In [5]:
import transformers
print(transformers.__file__)

c:\Users\ch939\anaconda3\envs\sft_env\lib\site-packages\transformers\__init__.py


In [6]:
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import LoraConfig, get_peft_model, TaskType
import torch

model_name = "microsoft/DialoGPT-small"
tokenizer = AutoTokenizer.from_pretrained(model_name)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

base_model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float32)

lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    inference_mode=False,
    r=16,
    lora_alpha=32,
    lora_dropout=0.1,
    target_modules=["c_attn", "c_proj"]
)
lora_model = get_peft_model(base_model, lora_config)
lora_model.print_trainable_parameters()


tokenizer_config.json:   0%|          | 0.00/614 [00:00<?, ?B/s]

To support symlinks on Windows, you either need to activate Developer Mode or to run Python as an administrator. In order to activate developer mode, see this article: https://docs.microsoft.com/en-us/windows/apps/get-started/enable-your-device-for-development


vocab.json: 0.00B [00:00, ?B/s]

merges.txt: 0.00B [00:00, ?B/s]

config.json:   0%|          | 0.00/641 [00:00<?, ?B/s]

Xet Storage is enabled for this repo, but the 'hf_xet' package is not installed. Falling back to regular HTTP download. For better performance, install the package with: `pip install huggingface_hub[hf_xet]` or `pip install hf_xet`


model.safetensors:   0%|          | 0.00/351M [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/124 [00:00<?, ?B/s]

trainable params: 1,622,016 || all params: 126,061,824 || trainable%: 1.2867
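
The trainable-parameter count printed above can be checked by hand: each adapted weight matrix of shape `(d_in, d_out)` adds `r * (d_in + d_out)` LoRA parameters. The sketch below assumes the GPT-2-small layer shapes that DialoGPT-small is built on (hidden size 768, 12 transformer blocks); `lora_trainable_params` is a hypothetical helper, not a PEFT function.

```python
def lora_trainable_params(layer_shapes, rank, num_layers):
    """Sum rank * (d_in + d_out) over the adapted matrices of one block, times the block count."""
    per_layer = sum(rank * (d_in + d_out) for d_in, d_out in layer_shapes)
    return per_layer * num_layers

# Assumed GPT-2-small shapes. The target name "c_proj" matches both the
# attention output projection and the MLP output projection in each block.
gpt2_small_shapes = [
    (768, 2304),  # attn.c_attn: fused query/key/value projection
    (768, 768),   # attn.c_proj
    (3072, 768),  # mlp.c_proj
]

count = lora_trainable_params(gpt2_small_shapes, rank=16, num_layers=12)
print(f"{count:,}")                                 # 1,622,016
print(f"{100 * count / 126_061_824:.4f}%")          # 1.2867%
```

This also shows why the percentage depends on the choice of `target_modules`: listing `"c_proj"` adapts two matrices per block, not one.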




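The warnings and progress messages above (the Windows symlink notice and the missing `hf_xet` package) are harmless: the models loaded and the LoRA wrapper was applied successfully.

CUDA is available (`torch.cuda.is_available()` returned `True`), although a model stays on the CPU until it is explicitly moved to the GPU, which is why the earlier cell printed `Model device: cpu`.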

## ⚡ 5. DeepSpeed

**DeepSpeed** is a library from Microsoft that allows efficient distributed training of large models.

**Modes:**
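- **ZeRO-1/2/3** partition training state across data-parallel GPUs: stage 1 shards optimizer states, stage 2 also shards gradients, and stage 3 also shards the model parameters.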
- **CPU Offload** to reduce GPU memory usage.
- **Mixed Precision** for speed and efficiency.

**Best for**: Scaling training to large models like 13B+, saving memory, or training on multiple GPUs.
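
To get a feel for what the ZeRO stages save, the sketch below estimates per-GPU memory for model states only. It assumes mixed-precision Adam with the usual accounting of 16 bytes per parameter (2 for fp16 weights, 2 for fp16 gradients, 12 for fp32 optimizer state) and ignores activations and temporary buffers, so real usage will be higher. `zero_model_state_gb` is a hypothetical helper.

```python
def zero_model_state_gb(num_params, num_gpus, stage):
    """Approximate per-GPU memory (GB) for model states under a given ZeRO stage."""
    weights, grads, optim = 2.0, 2.0, 12.0  # bytes per parameter
    if stage >= 1:
        optim /= num_gpus    # ZeRO-1 partitions optimizer states
    if stage >= 2:
        grads /= num_gpus    # ZeRO-2 also partitions gradients
    if stage >= 3:
        weights /= num_gpus  # ZeRO-3 also partitions the weights
    return num_params * (weights + grads + optim) / 1e9

# A 13B-parameter model on 8 GPUs, stage 0 (no ZeRO) through stage 3
for stage in range(4):
    print(f"stage {stage}: {zero_model_state_gb(13e9, num_gpus=8, stage=stage):.1f} GB")
```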

A minimal example config that enables ZeRO stage 2 with fp16 mixed precision:
```json
{
  "zero_optimization": {"stage": 2},
  "fp16": {"enabled": true}
}
```
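
Building on the config above, here is a slightly fuller sketch that adds CPU offload of optimizer states and generates the JSON from Python. The `"auto"` values are placeholders that the Hugging Face Trainer integration fills in from its own arguments; the file name mentioned below is an arbitrary choice.

```python
import json

ds_config = {
    "zero_optimization": {
        "stage": 2,
        # Offloading optimizer states to CPU trades speed for GPU memory
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
    },
    "fp16": {"enabled": True},
    # "auto" defers these values to the Hugging Face TrainingArguments
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
}

config_text = json.dumps(ds_config, indent=2)
print(config_text)
```

Saving this text to a file such as `ds_config.json` and passing `deepspeed="ds_config.json"` to `TrainingArguments` is the usual way to connect it to the trainer; the run then needs to be started with a distributed launcher such as `deepspeed` or `accelerate launch`.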

## 🛠️ 6. TRL Package (SFTTrainer)

**TRL (Transformer Reinforcement Learning)** by Hugging Face includes:

- `SFTTrainer`: Simplified supervised training loop.
- `PPOTrainer`: RLHF with Proximal Policy Optimization.
- `DPOTrainer`: Direct Preference Optimization.
- `RewardTrainer`: For reward model training.

**Why TRL?**
- Abstracts away complex setup.
- Faster experimentation.
- Supports all major fine-tuning and alignment workflows.


Install the trl Package

pip install trl

Verify Installation

python -c "from trl import SFTTrainer; print('trl is installed successfully')"

In [6]:
# On a CUDA machine, optionally uncomment the following to pin the visible GPU
# import os
# os.environ["CUDA_VISIBLE_DEVICES"] = "0" 

from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from peft import LoraConfig, get_peft_model
from trl import SFTTrainer
from datasets import Dataset

# ✅ --- 1.Prepare the dataset (must have a 'text' field)
demo_dataset = Dataset.from_list([
    {"text": "Human: What is Python?\nAssistant: Python is a programming language."},
    {"text": "Human: How do I learn coding?\nAssistant: Start with basic concepts and practice regularly."}
])

# ✅ --- 2. Load Base Model and Tokenizer ---
model_name = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"  # Small, fast, public
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Set pad_token to eos_token if not defined
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    # Add device_map="auto" if using GPU and want to leverage accelerate
    # device_map="auto",
)

# ✅ --- 3. Define LoRA Configuration ---
lora_config = LoraConfig(
    r=8,                     # Rank of LoRA approximation
    lora_alpha=16,           # Scaling factor
    target_modules=["q_proj", "v_proj"],  # Try to target attention layers
    lora_dropout=0.05,       # Dropout for LoRA layers
    bias="none",             # No bias in LoRA
    task_type="CAUSAL_LM"    # For language modeling
)

# Wrap the model with LoRA
lora_model = get_peft_model(model, lora_config)

# Optional: Print how many parameters are being trained
lora_model.print_trainable_parameters()  # Should show small % (e.g., ~0.1%)


# ✅ --- 4. Training arguments
training_args = TrainingArguments(
    output_dir="./trl_sft_demo",
    num_train_epochs=1,
    per_device_train_batch_size=1,
    learning_rate=5e-4,
    logging_steps=1,
    save_steps=10,
    save_total_limit=1,
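    fp16=True,            # Requires a CUDA GPU; set to False when training on CPU
    optim="adamw_torch",  # Good default
    report_to="none",     # Disable external loggers (in Transformers 4.x, None enables all installed integrations)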
)

def formatting_func(example):
    return example["text"]

# ✅ --- 5. Initialize SFTTrainer
trainer = SFTTrainer(
    model=lora_model,                 # already wrapped by get_peft_model, so peft_config is not passed again
    args=training_args,
    train_dataset=demo_dataset,
    processing_class=tokenizer,       # replaces the older tokenizer=tokenizer argument
    formatting_func=formatting_func,  # replaces the older dataset_text_field="text" argument
    # max_seq_length is no longer an SFTTrainer argument in this TRL version
)

# ✅ --- 6. Train the model
trainer.train()

# ✅ --- 7. Save LoRA Adapter (Optional) ---
trainer.save_model("./sft_finetuned_lora")

trainable params: 1,126,400 || all params: 1,101,174,784 || trainable%: 0.1023




Applying formatting function to train dataset:   0%|          | 0/2 [00:00<?, ? examples/s]

Adding EOS to train dataset:   0%|          | 0/2 [00:00<?, ? examples/s]

Tokenizing train dataset:   0%|          | 0/2 [00:00<?, ? examples/s]

Truncating train dataset:   0%|          | 0/2 [00:00<?, ? examples/s]

Step,Training Loss
1,3.4285
2,3.9144


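Diagnosis: the installed TRL version no longer accepts `tokenizer=tokenizer`. The script below inspects the `SFTTrainer` signature; in TRL 0.21.0 the argument is named `processing_class`, so the final "NOT in the signature" message is expected and does not indicate a broken install.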

In [4]:
# save as diagnose_trl.py or run in prompt
import trl
from trl import SFTTrainer
from inspect import signature

print(f"TRL version: {trl.__version__}")

# Print all __init__ parameters of SFTTrainer
init_params = list(signature(SFTTrainer.__init__).parameters.keys())
print(f"SFTTrainer.__init__ parameters: {init_params}")

# Check for 'tokenizer'
if 'tokenizer' in init_params:
    print("✅ 'tokenizer' is supported!")
else:
    print("❌ 'tokenizer' is NOT in the signature — something is wrong")

TRL version: 0.21.0
SFTTrainer.__init__ parameters: ['self', 'model', 'args', 'data_collator', 'train_dataset', 'eval_dataset', 'processing_class', 'compute_loss_func', 'compute_metrics', 'callbacks', 'optimizers', 'optimizer_cls_and_kwargs', 'preprocess_logits_for_metrics', 'peft_config', 'formatting_func']
❌ 'tokenizer' is NOT in the signature — something is wrong
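
If the same script has to run against both older and newer TRL releases, the signature check can pick the keyword automatically. The sketch below is one way to do that; `tokenizer_kwarg` and the two stand-in classes are hypothetical and exist only so the example runs without TRL installed.

```python
from inspect import signature

def tokenizer_kwarg(trainer_cls, tokenizer):
    """Return the keyword argument this trainer class expects for the tokenizer."""
    params = signature(trainer_cls.__init__).parameters
    if "processing_class" in params:
        return {"processing_class": tokenizer}
    if "tokenizer" in params:
        return {"tokenizer": tokenizer}
    raise TypeError(
        f"{trainer_cls.__name__} accepts neither 'processing_class' nor 'tokenizer'"
    )

# Stand-in classes (hypothetical) that mimic the two signature styles
class NewStyleTrainer:
    def __init__(self, model=None, processing_class=None):
        self.processing_class = processing_class

class OldStyleTrainer:
    def __init__(self, model=None, tokenizer=None):
        self.tokenizer = tokenizer

print(tokenizer_kwarg(NewStyleTrainer, "tok"))
print(tokenizer_kwarg(OldStyleTrainer, "tok"))
```

With TRL installed, the call would look like `SFTTrainer(model=..., **tokenizer_kwarg(SFTTrainer, tokenizer))`.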


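Explanation of the fields in the summary that `trainer.train()` returns (the sample values below appear to come from a separate run, so the loss differs from the step losses logged above):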
1. global_step=2

    This means the training process completed 2 optimization steps (i.e., two batches were processed and used to update the model's parameters).

2. training_loss=9.432311534881592

    This is the final training loss averaged over the training steps. A higher loss suggests the model hasn't learned much yet, likely because:
        * It’s early in training (only 2 steps).
        * The model needs more tuning or a better learning rate.
        * The data might be complex or noisy.

3. train_runtime=2.921

    Total time in seconds the training took — in this case, around 2.9 seconds.

4. train_samples_per_second=0.685

    The average number of training examples processed per second. Since only 2 steps were taken, the dataset or batch size may have been small.

5. train_steps_per_second=0.685

    How many steps (i.e., parameter updates) were completed per second. It matches the sample rate, implying 1 sample per step.

6. total_flos=18722451456.0

    The total number of floating point operations (FLOPs) executed during training. It's a proxy for how computationally intensive the training was.
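
The throughput figures above can be reproduced from the other fields. The sketch below assumes a single device, no gradient accumulation, and the batch size of 1 used in the training arguments earlier; `throughput` is a hypothetical helper.

```python
def throughput(global_step, train_runtime, samples_per_step=1):
    """Return (steps per second, samples per second) for a finished run."""
    steps_per_second = global_step / train_runtime
    samples_per_second = steps_per_second * samples_per_step
    return steps_per_second, samples_per_second

steps_s, samples_s = throughput(global_step=2, train_runtime=2.921)
print(round(steps_s, 3), round(samples_s, 3))  # 0.685 0.685
```

The two rates are equal only because each step processes one sample; with a larger batch size, samples per second would be a multiple of steps per second.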




### Explanation of each metric logged during training
- ✅ loss
    *What it is*: Measures the model's prediction error — how far off the model is from the target output.

    *What to look for*: We want this to decrease over time.

    A value of 0.1097 or 0.1366 is relatively low, which is promising, assuming it continues trending downward.

    Temporary small increases (like from 0.1097 to 0.1366) can happen due to learning rate fluctuations or noisy batches.

- ✅ grad_norm (Gradient Norm)
    *What it is*: L2 norm of the gradients, a measure of how strongly the loss is pushing the weights to change at this step (the actual update size also depends on the learning rate and the optimizer).

    *What to look for*:

    If this value is too large, it may indicate exploding gradients.

    If too small (near zero), it may mean vanishing gradients or that training is plateauing.

    for example:

    0.969 → healthy magnitude, meaning the model is still learning.

    0.278 → much smaller, which could mean learning is slowing down — possibly nearing convergence, or may need LR adjustment.

- ✅ learning_rate
    *What it is*: The rate at which the model updates its weights. Often decays over time (e.g., cosine scheduler).

    *What to look for*:

    A decaying learning rate is common and helps fine-tune the model toward convergence.

    0.000126 → slightly higher; 0.000120 → lower. This drop suggests a learning rate schedule is being applied, as expected.

- ✅ epoch
    *What it is*: Indicates how far along you are in training (e.g., 4.67 = 67% through the 5th epoch).

    *What to look for*: Helps track progression. You’d want to compare loss and grad_norm across epochs to evaluate learning trends.
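
The quantities in this list are simple to compute by hand. The sketch below shows a global L2 gradient norm, a cosine learning-rate decay without warmup, and how to read a fractional epoch value. All three function names are hypothetical, and the peak rate of `5e-4` is taken from the training arguments used earlier; the scheduler a given run actually uses depends on its configuration.

```python
import math

def grad_l2_norm(gradients):
    """Global L2 norm: square root of the sum of squared gradient entries."""
    return math.sqrt(sum(g * g for g in gradients))

def cosine_lr(step, total_steps, peak_lr):
    """Cosine decay from peak_lr at step 0 down to 0 at total_steps (no warmup)."""
    progress = min(step, total_steps) / total_steps
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

def describe_epoch(epoch_value):
    """Split a fractional epoch such as 4.67 into (current epoch number, fraction done)."""
    completed = int(epoch_value)
    return completed + 1, epoch_value - completed

print(grad_l2_norm([3.0, -4.0]))                  # 5.0
print(cosine_lr(0, 100, 5e-4), cosine_lr(50, 100, 5e-4), cosine_lr(100, 100, 5e-4))
print(describe_epoch(4.67))                       # 5th epoch, about 67% done
```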

## ✅ Summary

You’ve learned:
- What SFT is and why it’s essential
- Where and how to get quality data
- How to use ChatML format
- When to choose LoRA vs full tuning
- How to leverage DeepSpeed and TRL for scale and alignment

For the full Llama 3 SFT code, see `class_5_llama3.py`.