# Using the Hugging Face SFTTrainer for QLoRA/PEFT Training
In this example we use the Hugging Face SFTTrainer class to fine-tune a sample [Llama 3.1-8B model](https://huggingface.co/meta-llama/Llama-3.1-8B?library=transformers) with [PEFT](https://huggingface.co/docs/peft/en/index).

This example builds on the earlier Trainer HF demo: https://www.youtube.com/watch?v=wGKU46ZNFnw&t=805s. We expand into training LLMs with the Trainer classes from Hugging Face's TRL library: https://github.com/huggingface/trl. [TRL](https://github.com/huggingface/trl) supports several fine-tuning techniques, such as Supervised Fine-Tuning (SFT), GRPO, and DPO. Here we use the SFTTrainer for PEFT/QLoRA-based fine-tuning of a [Llama 3.1-8B model](https://huggingface.co/meta-llama/Llama-3.1-8B).

### Setting
This example was run in a SageMaker AI Classic notebook instance with the conda_python3 kernel on a g5.16xlarge. That is likely more than this model needs, so a smaller instance may work.

### Prerequisites
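Ensure you have an HF access token (https://huggingface.co/docs/hub/en/security-tokens) and have requested access to the Llama 3.1 model linked above. The code below loads the Instruct variant, `meta-llama/Llama-3.1-8B-Instruct`, so make sure your access covers it.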

## Setup
We generate a mock dataset of dummy questions about myself; you can substitute your own dataset.

In [None]:
!pip install -U -q peft transformers[torch] trl bitsandbytes

In [None]:
import os
import random
import torch
from datasets import Dataset
from transformers import (
    AutoTokenizer,
    AutoModelForCausalLM,
    BitsAndBytesConfig,
    TrainingArguments,
)
from peft import LoraConfig, prepare_model_for_kbit_training

In [None]:
import os

# ensure you have requested access for Llama 3.1-8B as well
os.environ["HF_TOKEN"] = "Enter HF Hub Token"
os.environ["HUGGINGFACE_HUB_TOKEN"] = os.environ["HF_TOKEN"]
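Hard-coding the token in a notebook cell makes it easy to leak when the notebook is shared. A minimal sketch of one alternative, assuming the token was exported in the environment before the kernel started; `resolve_hf_token` is a hypothetical helper for this notebook, not part of any Hugging Face library:

```python
import os


def resolve_hf_token(env=None):
    """Return the Hugging Face token from the environment, or raise if missing.

    Hypothetical helper: it only reads variables, it never stores the token.
    """
    env = os.environ if env is None else env
    # HF_TOKEN is the variable set in the cell above; fall back to the older name.
    token = env.get("HF_TOKEN") or env.get("HUGGINGFACE_HUB_TOKEN")
    if not token or token == "Enter HF Hub Token":
        raise RuntimeError("Set HF_TOKEN to a valid Hugging Face access token first.")
    return token


# Explicit mapping so the sketch runs without touching real credentials.
print(resolve_hf_token({"HF_TOKEN": "hf_example_placeholder"}))
```

Failing early on the placeholder string avoids a less obvious authentication error later, when the gated model is downloaded.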

In [None]:
# HF hub ID
BASE_MODEL_NAME = "meta-llama/Llama-3.1-8B-Instruct"

# Local directories where checkpoints and adapter artifacts are written after training
OUTPUT_DIR = "./qlora-peft-output"
ADAPTER_DIR = os.path.join(OUTPUT_DIR, "adapter")

# -----------------------------
# 0. Synthetic "about me" dataset (1000 rows)
# Facts and question templates are paired at random, so individual rows may not
# fully make sense, but together they give a general idea of my profession and
# interests to train on.
# -----------------------------
base_facts = [
    "The user is a senior machine learning engineer at a large cloud company.",
    "The user specializes in building and deploying large language models in production.",
    "The user lives in a big US city and enjoys boxing, basketball, and tennis.",
    "The user creates technical YouTube videos about ML infrastructure and LLM serving.",
    "The user works with services similar to Amazon SageMaker and managed model hosting.",
    "The user likes combining fitness, combat sports, and engineering in their daily routine.",
    "The user often helps other engineers optimize GPU usage and model throughput.",
    "The user enjoys experimenting with multi-adapter inference and LoRA fine-tuning.",
    "The user is training for an amateur boxing fight while working full-time as an engineer.",
    "The user prefers concise, practical explanations with real-world deployment context.",
]

qa_templates = [
    lambda f: f"""### Instruction:
Answer the question about the user based on the known facts.

### Input:
What is the user's job and domain?

### Response:
{f}""",
    lambda f: f"""### Instruction:
You are an assistant that knows specific facts about one user.

### Input:
Summarize the user's background in one or two sentences.

### Response:
{f}""",
    lambda f: f"""### Instruction:
Use the stored personal profile of the user to answer this question.

### Input:
What kind of projects does the user usually work on?

### Response:
{f}""",
    lambda f: f"""### Instruction:
You are personalizing responses for this specific user.

### Input:
Describe the user's interests and profession together.

### Response:
{f}""",
]

random.seed(42)  # fixed seed so the synthetic dataset is reproducible

examples = []
for _ in range(1000):
    fact = random.choice(base_facts)
    template = random.choice(qa_templates)
    examples.append({"text": template(fact)})

In [None]:
#sample input
examples[1]

In [None]:
# convert to a HF Dataset
dataset = Dataset.from_list(examples).train_test_split(test_size=0.1, seed=42)
train_dataset = dataset["train"]
eval_dataset = dataset["test"]
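Because each row is drawn from 10 facts and 4 templates, the 1000 rows contain at most 40 distinct texts, so the held-out split mostly repeats rows that also appear in training. The sketch below uses only the standard library and shortened placeholder strings (not the real facts and templates) to count distinct rows and the overlap in a 90/10 split:

```python
import random

random.seed(0)

# Placeholder stand-ins for the 10 facts and 4 templates defined earlier.
facts = [f"fact {i}" for i in range(10)]
templates = [lambda f, k=k: f"question {k} -> {f}" for k in range(4)]

rows = [random.choice(templates)(random.choice(facts)) for _ in range(1000)]

# Simple 90/10 split, mirroring test_size=0.1.
random.shuffle(rows)
train_rows, eval_rows = rows[:900], rows[900:]

distinct = set(rows)
train_set = set(train_rows)
leaked = [r for r in eval_rows if r in train_set]

print(f"distinct texts: {len(distinct)} (upper bound 40)")
print(f"eval rows also present in train: {len(leaked)} of {len(eval_rows)}")
```

An eval loss measured on such a split mostly reflects memorization. For a real dataset, deduplicating before splitting gives a more meaningful held-out set.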

## Tokenization & Model Loading

In [None]:
# -----------------------------
# 1. Tokenizer
# -----------------------------
tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL_NAME, trust_remote_code=True)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

In [None]:
# -----------------------------
# 2. 4-bit quantization (QLoRA)
# -----------------------------
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)
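For a rough sense of what 4-bit loading saves, the following back-of-the-envelope sketch estimates weight memory only. It assumes about 8 billion parameters, ignores layers that stay in higher precision (such as embeddings and norms), and uses the quantization-constant overheads described in the QLoRA paper (block size 64, second-level block size 256). The function name is illustrative.

```python
N_PARAMS = 8.0e9  # assumption: approximate parameter count of an 8B model


def weight_gigabytes(bits_per_param, n_params=N_PARAMS):
    """Convert a per-parameter bit cost into gigabytes (10**9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


bits_fp16 = 16.0
# NF4 with one fp32 absmax constant per block of 64 weights.
bits_nf4 = 4 + 32 / 64
# Double quantization: 8-bit constants per block of 64 weights, plus one
# fp32 constant per 256 of those blocks.
bits_nf4_double = 4 + 8 / 64 + 32 / (64 * 256)

for label, bits in [
    ("fp16/bf16", bits_fp16),
    ("nf4", bits_nf4),
    ("nf4 + double quant", bits_nf4_double),
]:
    print(f"{label:>20}: {bits:.3f} bits/param, ~{weight_gigabytes(bits):.2f} GB")
```

Activations, LoRA gradients, and optimizer state come on top of these figures, so actual GPU usage during training will be higher.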

In [None]:
# -----------------------------
# 3. Load base model in 4-bit + prep for k-bit training
# -----------------------------
model = AutoModelForCausalLM.from_pretrained(
    BASE_MODEL_NAME,
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
)
model = prepare_model_for_kbit_training(model)

## LoRA & SFT Setup

In [None]:
# -----------------------------
# 4. LoRA config: https://huggingface.co/docs/peft/main/en/conceptual_guides/lora
# -----------------------------
lora_config = LoraConfig(
    r=32,
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],  # Llama-style
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)
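To see how small the adapter is relative to the base model, the sketch below counts the LoRA parameters this config would add. The model dimensions are assumptions about Llama 3.1-8B (hidden size 4096, 32 layers, 8 key/value heads of dimension 128); check the model's `config.json` before relying on them.

```python
# Assumed Llama 3.1-8B dimensions (verify against the model's config.json).
HIDDEN_SIZE = 4096
NUM_LAYERS = 32
NUM_KV_HEADS = 8
HEAD_DIM = 128

R = 32
LORA_ALPHA = 16

# (in_features, out_features) of each targeted projection.
target_shapes = {
    "q_proj": (HIDDEN_SIZE, HIDDEN_SIZE),
    "v_proj": (HIDDEN_SIZE, NUM_KV_HEADS * HEAD_DIM),
}


def lora_params(in_features, out_features, r):
    """LoRA adds A (r x in) and B (out x r) next to each frozen weight."""
    return r * (in_features + out_features)


per_layer = sum(lora_params(i, o, R) for i, o in target_shapes.values())
total = per_layer * NUM_LAYERS
scaling = LORA_ALPHA / R

print(f"trainable LoRA parameters: {total:,}")
print(f"LoRA scaling (lora_alpha / r): {scaling}")
```

Under these assumptions the adapter has about 13.6 million trainable parameters, and with `r=32` and `lora_alpha=16` the adapter update is scaled by 0.5. After the adapter is attached, `trainer.model.print_trainable_parameters()` should report the actual count.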

In [None]:
from trl import SFTConfig, SFTTrainer

# Define all configuration parameters within SFTConfig
config = SFTConfig(
    # --- Standard TrainingArguments parameters ---
    output_dir=OUTPUT_DIR,
    learning_rate=2e-4,
    num_train_epochs=1,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,  # effective batch size of 4 per device
    logging_steps=10,
    save_strategy="epoch",
    bf16=True,
    report_to="none",

    # --- SFT-specific parameters ---
    dataset_text_field="text",  # column that holds the formatted prompt + response
    max_length=512,
    packing=False,
)

# Initialize the SFTTrainer, passing the SFTConfig object via `args`
trainer = SFTTrainer(
    model=model,
    args=config,
    train_dataset=train_dataset,
    # Only used if an eval strategy is configured or trainer.evaluate() is called
    eval_dataset=eval_dataset,
    peft_config=lora_config,  # SFTTrainer wraps the model with this LoRA config
)
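The batch settings determine how many optimizer steps and log lines to expect from the run. A small arithmetic sketch, assuming a single GPU and the 900-row training split produced earlier; exact counts reported by the trainer can differ slightly across library versions when the row count is not evenly divisible:

```python
import math

train_rows = 900                 # 1000 rows with test_size=0.1
per_device_batch_size = 1
gradient_accumulation_steps = 4
num_devices = 1                  # assumption: a single GPU
num_epochs = 1
logging_steps = 10

effective_batch = per_device_batch_size * gradient_accumulation_steps * num_devices
steps_per_epoch = math.ceil(train_rows / effective_batch)
total_steps = steps_per_epoch * num_epochs

print(f"effective batch size: {effective_batch}")
print(f"optimizer steps: {total_steps}")
print(f"loss log entries: {total_steps // logging_steps}")
```

With these values the run takes 225 optimizer steps and logs the loss 22 times, which is a quick sanity check against the progress bar once training starts.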

## Training

In [None]:
trainer.train()

## Save Adapter Weights

In [None]:
# -----------------------------
# 6. Save adapter weights
# -----------------------------
os.makedirs(ADAPTER_DIR, exist_ok=True)
trainer.model.save_pretrained(ADAPTER_DIR)
tokenizer.save_pretrained(ADAPTER_DIR)

print(f"Adapter saved to: {ADAPTER_DIR}")

In [None]:
ADAPTER_DIR = "./qlora-peft-output/adapter"

## Inference with Base Model & Merged Model (with Adapter)

### Base Model

In [None]:
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

# Use the same checkpoint that was fine-tuned so the comparison is like-for-like
model_id = "meta-llama/Llama-3.1-8B-Instruct"

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",          # puts on GPU if available
    torch_dtype=torch.float16,  # safer for VRAM than fp32
    trust_remote_code=True
)

prompts = [
    "What types of projects do I usually work on?",
    "Where do I live and what city am I based in?",
    "What sports or hobbies am I known for?",
    "Summarize my background and personal interests.",
    "Based on your knowledge of me, what is my professional expertise?",
]

def generate(prompt):
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=150,
            do_sample=False
        )
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

print("\n================ BASE MODEL INFERENCE ================\n")

for p in prompts:
    wrapped = (
        "You are a helpful assistant. The user is asking a question about their personal background.\n"
        "If you do not know the answer, say so clearly.\n\n"
        f"Question: {p}\n"
        f"Answer:"
    )
    print("PROMPT:", p)
    result = generate(wrapped)
    print(result)
    print("-" * 100)

### Merged Model
We merge the adapter into the base model. You could run more rigorous evaluations, but for this proof of concept we simply compare by hand how the model's knowledge differs on the prompts above.

Merging: https://huggingface.co/docs/peft/main/en/package_reference/lora#peft.LoraModel.merge_and_unload

In [None]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel

# -------------------------------------------------------
# Paths
# -------------------------------------------------------
BASE_ID = "meta-llama/Llama-3.1-8B-Instruct"
ADAPTER_DIR = "./qlora-peft-output/adapter"
MERGED_DIR = "./merged-model"

# -------------------------------------------------------
# Load tokenizer
# -------------------------------------------------------
tokenizer = AutoTokenizer.from_pretrained(
    BASE_ID,
    trust_remote_code=True
)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

# -------------------------------------------------------
# Load base model in fp16/bf16
# (the merged model will be an unquantized, half-precision LLM)
# -------------------------------------------------------
print("Loading base model...")
base_model = AutoModelForCausalLM.from_pretrained(
    BASE_ID,
    device_map="auto",
    torch_dtype=torch.float16,
    trust_remote_code=True,
)

# -------------------------------------------------------
# Attach adapter
# -------------------------------------------------------
print("Loading adapter onto base model...")
model = PeftModel.from_pretrained(
    base_model,
    ADAPTER_DIR,
)

# -------------------------------------------------------
# Merge adapter -> base model
# -------------------------------------------------------
print("Merging LoRA adapter into base model weights...")
merged_model = model.merge_and_unload()   # <-- key line

# -------------------------------------------------------
# Save merged full model (optional)
# -------------------------------------------------------
merged_model.save_pretrained(MERGED_DIR)
tokenizer.save_pretrained(MERGED_DIR)

print(f"\nMerged model saved to: {MERGED_DIR}\n")
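Conceptually, merging folds the low-rank update into the frozen weight, W_merged = W + (lora_alpha / r) * B A, so that no adapter modules remain at inference time. The toy sketch below, in plain Python with made-up 2x3 matrices and a rank-1 adapter, checks that the merged weight gives the same output as keeping the adapter separate. It illustrates the arithmetic only and is not how PEFT implements `merge_and_unload`.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def add(a, b, scale=1.0):
    """Return a + scale * b element-wise."""
    return [[x + scale * y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


# Toy sizes: a 2x3 frozen weight with a rank-1 adapter (values are made up).
W = [[1.0, 0.0, 2.0], [0.0, 1.0, -1.0]]   # (out=2, in=3)
A = [[0.5, -1.0, 0.0]]                     # (rank=1, in=3)
B = [[2.0], [4.0]]                         # (out=2, rank=1)
scaling = 0.5                              # lora_alpha / r = 16 / 32 in the config above

x = [[1.0], [2.0], [3.0]]                  # one input column vector

# Adapter kept separate: W x + scaling * B (A x)
separate = add(matmul(W, x), matmul(B, matmul(A, x)), scaling)

# Adapter merged: (W + scaling * B A) x
W_merged = add(W, matmul(B, A), scaling)
merged = matmul(W_merged, x)

print("separate:", separate)
print("merged:  ", merged)
```

Both paths give the same vector here. In this notebook the adapter was trained against 4-bit weights and is merged into fp16 weights, so merged-model outputs may differ slightly from what the adapter saw during training.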

In [None]:
def generate(prompt):
    inputs = tokenizer(prompt, return_tensors="pt").to(merged_model.device)
    with torch.no_grad():
        outputs = merged_model.generate(
            **inputs,
            max_new_tokens=200,
            do_sample=False,
        )
    return tokenizer.decode(outputs[0], skip_special_tokens=True)


# -------------------------------------------------------
# Run merged-model inference
# -------------------------------------------------------
print("==================== MERGED MODEL OUTPUT ====================\n")

for p in prompts:
    wrapped = (
        "You are a personalized assistant that knows details about the user based "
        "on prior fine-tuning data.\n\n"
        f"Question: {p}\nAnswer:"
    )

    print(f"PROMPT: {p}\n")
    output = generate(wrapped)
    print(output)
    print("-" * 120)
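The training rows use the `### Instruction` / `### Input` / `### Response` layout, while the inference prompts above use a `Question:` / `Answer:` layout. A fine-tuned model often responds better to the format it was trained on, so it may be worth trying prompts in the training layout. A minimal sketch with two hypothetical helpers (`build_training_style_prompt` and `extract_response` are not part of the notebook above):

```python
def build_training_style_prompt(question, instruction=None):
    """Format a question in the ### Instruction / ### Input / ### Response layout."""
    instruction = instruction or "Answer the question about the user based on the known facts."
    return (
        f"### Instruction:\n{instruction}\n\n"
        f"### Input:\n{question}\n\n"
        "### Response:\n"
    )


def extract_response(generated_text):
    """Keep only the text after the last '### Response:' marker."""
    return generated_text.rsplit("### Response:", 1)[-1].strip()


prompt = build_training_style_prompt("What types of projects do I usually work on?")
print(prompt)

# Simulated model output, for illustration only (not a real generation).
fake_output = prompt + "The user builds and deploys large language models."
print(extract_response(fake_output))
```

To try it, pass `build_training_style_prompt(p)` to `generate` in place of `wrapped` and apply `extract_response` to the decoded text. The training questions are phrased about "the user" in the third person, so rephrasing the first-person prompts the same way may also help.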