# Module 2: Fine-tuning Overview

## 2.1 What is Fine-tuning?

 - Fine-tuning is the process of adapting a pretrained model to a specific task or domain.
 - Fine-tuning (also called post-training) customizes a model's behavior, injects or strengthens knowledge, and optimizes performance for specific domains and tasks.
 
 - For example:

OpenAI’s GPT-5 was post-trained to improve instruction following and helpful chat behavior.

The standard method of post-training is called **Supervised Fine-Tuning (SFT)**.
- It is the most common method of post-training.
- You give the model examples where the correct answer is already known, and it learns to reproduce that behavior.

----------------------
```
User: What is AI?

Assistant: AI is the field of building machines that can think and learn.
```
-------------------------

- The model learns, "When I see a question like this, this is how I should respond."
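A minimal sketch of how one SFT example is commonly turned into training targets. It assumes a toy whitespace tokenizer and the widespread convention of labelling prompt positions with an ignore value (`-100` is the default ignore index of PyTorch's cross-entropy loss), so only the assistant's tokens contribute to the loss. The function name `build_sft_example` is hypothetical and not part of any library.

```python
IGNORE_INDEX = -100  # label value skipped by the loss (PyTorch cross-entropy default)


def build_sft_example(prompt, response, vocab):
    """Return (input_ids, labels) for one prompt/response pair.

    Toy whitespace tokenizer: each new word gets the next free integer id.
    """
    prompt_ids = [vocab.setdefault(tok, len(vocab)) for tok in prompt.split()]
    response_ids = [vocab.setdefault(tok, len(vocab)) for tok in response.split()]
    input_ids = prompt_ids + response_ids
    # Mask the prompt so the model is only trained to produce the response.
    labels = [IGNORE_INDEX] * len(prompt_ids) + response_ids
    return input_ids, labels


vocab = {}
input_ids, labels = build_sft_example(
    "User: What is AI?",
    "Assistant: AI is the field of building machines that can think and learn.",
    vocab,
)
print(len(input_ids), labels[:6])
```

Real pipelines use the model's own tokenizer and chat template; the masking idea stays the same.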



**Other methods include:**

- Preference Optimization
   - Direct Preference Optimization (DPO)
   - Odds Ratio Preference Optimization (ORPO)

- Distillation

- Reinforcement Learning (RL)
   - Group Relative Policy Optimization (GRPO)
   - Group Sequence Policy Optimization (GSPO)

In reinforcement learning, a model called an "agent" learns to make decisions by interacting with an environment, receiving feedback in the form of rewards or penalties, and improving over time.

Example: Training a child (reward good behavior and correct or punish bad behavior).
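To make the reward idea concrete, here is a small sketch of the group-relative advantage used in GRPO-style training: several answers to the same prompt are scored, and each answer's advantage is its reward relative to the group average. The reward values are made up, and the use of the population standard deviation plus a small `eps` is an assumption of this sketch rather than a statement about any particular implementation.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the mean and spread of its own group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps avoids division by zero when every answer received the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled answers to one prompt; 1.0 = correct, 0.0 = incorrect (illustrative).
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
print(advantages)
```

Answers that beat the group average get a positive advantage and are reinforced; answers below it are discouraged.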



| Method       | Full Meaning                               | Simple Explanation                       |
| ------------ | ------------------------------------------ | ---------------------------------------- |
| SFT          | Supervised Fine-Tuning                     | Learn from correct examples              |
| DPO          | Direct Preference Optimization             | Learn which answer humans prefer         |
| ORPO         | Odds Ratio Preference Optimization         | Combines SFT + preference learning       |
| Distillation | Knowledge Distillation                     | Big model teaches small model            |
| RL           | Reinforcement Learning                     | Learn by reward and penalty              |
| GRPO         | Group Relative Policy Optimization         | Compare multiple answers and reward best |
| GSPO         | Group Sequence Policy Optimization         | GRPO-style RL applied at sequence level  |
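As an illustration of "learn which answer humans prefer", the sketch below computes the DPO loss for a single preference pair. The inputs are summed log-probabilities of the chosen and rejected answers under the model being trained and under a frozen reference model; the numeric values and the `beta` default are illustrative assumptions.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one (chosen, rejected) pair of answers."""
    # How much more the policy likes each answer than the reference does.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(x)) written as log(1 + exp(-x))
    return math.log1p(math.exp(-logits))


# Policy identical to the reference: no preference learned yet, loss is log(2).
baseline = dpo_loss(-10.0, -12.0, -10.0, -12.0)
# Policy has shifted probability toward the chosen answer: loss is lower.
improved = dpo_loss(-8.0, -14.0, -10.0, -12.0)
print(baseline, improved)
```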


##### **Fine-tuning a Pre-trained Model on a Dataset**

By fine-tuning a pre-trained model on a dataset, you can:


- **Update + Learn New Knowledge:** Inject and learn new domain-specific information.

- **Customize Behavior:** Adjust the model’s tone, personality, or response style.

- **Optimize for Tasks:** Improve accuracy and relevance for specific use cases.

**Example fine-tuning use-cases:**

- Enables LLMs to predict if a headline impacts a company positively or negatively.

- Can use historical customer interactions for more accurate and custom responses.

- Fine-tune LLM on legal texts for contract analysis, case law research, and compliance.

You can think of a fine-tuned model as a specialized agent designed to do specific tasks more effectively and efficiently. Does that make sense?


Some debate whether to use Retrieval-Augmented Generation (RAG) instead of fine-tuning, but fine-tuning can incorporate knowledge and behaviors directly into the model in ways RAG cannot. In practice, combining both approaches yields the best results - leading to greater accuracy, better usability, and fewer hallucinations.


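It is important to note that fine-tuning can cover much of what RAG does by building knowledge into the model's weights, whereas RAG cannot change the model itself. RAG remains the better fit for information that changes after training.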

##### **Real-World Applications of Fine-Tuning**

Fine-tuning can be applied across various domains and needs. Here are a few practical examples of how it makes a difference:

**1. Sentiment Analysis for Finance** – Train an LLM to determine if a news headline impacts a company positively or negatively, tailoring its understanding to financial context.

**2. Customer Support Chatbots** – Fine-tune on past customer interactions to provide more accurate and personalized responses in a company’s style and terminology.

**3. Legal Document Assistance** – Fine-tune on legal texts (contracts, case law, regulations) for tasks like contract analysis, case law research, or compliance support, ensuring the model uses precise legal language.

##### **The Benefits of Fine-Tuning**

Fine-tuning offers several notable benefits beyond what a base model or a purely retrieval-based system can provide:

1. Fine-Tuning vs. RAG: What’s the Difference?
- Fine-tuning can do most of what RAG can, but not the other way around. During training, fine-tuning embeds external knowledge directly into the model. This allows the model to handle niche queries, summarize documents, and maintain context without relying on an outside retrieval system. That is not to say RAG lacks advantages: it excels at accessing up-to-date information from external databases. A fine-tuned model can also be given fresh data by retraining it, but combining RAG with fine-tuning is usually more efficient.

2. Task-Specific Mastery
- Fine-tuning deeply integrates domain knowledge into the model. This makes it highly effective at handling structured, repetitive, or nuanced queries, scenarios where RAG-alone systems often struggle. In other words, a fine-tuned model becomes a specialist in the tasks or content it was trained on.

3. Independence from Retrieval
- A fine-tuned model has no dependency on external data sources at inference time. It remains reliable even if a connected retrieval system fails or is incomplete, because all needed information is already within the model’s own parameters. This self-sufficiency means fewer points of failure in production.

4. Faster Responses
- Fine-tuned models don’t need to call out to an external knowledge base during generation. Skipping the retrieval step means they can produce answers much more quickly. This speed makes fine-tuned models ideal for time-sensitive applications where every second counts.


5. Custom Behavior and Tone
- Fine-tuning allows precise control over how the model communicates. This ensures the model’s responses stay consistent with a brand’s voice, adhere to regulatory requirements, or match specific tone preferences. You get a model that not only knows what to say, but how to say it in the desired style.

6. Reliable Performance
- Even in a hybrid setup that uses both fine-tuning and RAG, the fine-tuned model provides a reliable fallback. If the retrieval component fails to find the right information or returns incorrect data, the model’s built-in knowledge can still generate a useful answer. This guarantees more consistent and robust performance for your system.

#### **Fine-tuning misconceptions**

You may have heard that fine-tuning does not teach a model new knowledge, or that RAG performs better than fine-tuning. **That is false**.

You can train a specialized coding model with fine-tuning and RL while RAG can’t change the model’s weights and only augments what the model sees at inference time.
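The sketch below illustrates what "augments what the model sees at inference time" means: a toy retriever picks the most relevant document and pastes it into the prompt, while the model's weights stay untouched. The word-overlap scoring, the function names, and the documents are all hypothetical; real systems typically use embedding search.

```python
def retrieve(query, documents, top_k=1):
    """Rank documents by how many lowercase words they share with the query."""
    query_words = set(query.lower().split())
    ranked = sorted(
        documents,
        key=lambda doc: len(query_words & set(doc.lower().split())),
        reverse=True,
    )
    return ranked[:top_k]


def build_rag_prompt(query, documents):
    """Prepend retrieved context to the question; the model itself is unchanged."""
    context = "\n".join(retrieve(query, documents))
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"


docs = [
    "The refund window is 30 days from delivery.",
    "Support is available on weekdays from 9am to 5pm.",
]
prompt = build_rag_prompt("How long is the refund window?", docs)
print(prompt)
```

Fine-tuning, by contrast, would train on examples like these so that the answer comes from the weights without any context block.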

1. Does Fine-Tuning Add New Knowledge to a Model?

**Yes - it absolutely can**. A common myth suggests that fine-tuning doesn’t introduce new knowledge, but in reality it does. If your fine-tuning dataset contains new domain-specific information, the model will learn that content during training and incorporate it into its responses. In effect, fine-tuning can and does teach the model new facts and patterns from scratch.

2. Is RAG Always Better Than Fine-Tuning?

**Not necessarily**. Many assume RAG will consistently outperform a fine-tuned model, but that’s not the case when fine-tuning is done properly. In fact, a well-tuned model often matches or even surpasses RAG-based systems on specialized tasks. Claims that “RAG is always better” usually stem from fine-tuning attempts that weren’t optimally configured - for example, using insufficient training or incorrect **LoRA parameters** (keep this term in mind; it comes up again later).

3. Is Fine-Tuning Expensive?

**Not at all!** While full fine-tuning or pretraining can be costly, these are not necessary (pretraining is especially not necessary). In most cases, LoRA or QLoRA fine-tuning can be done for minimal cost. In fact, with Unsloth’s free notebooks for Colab or Kaggle, you can fine-tune models without spending a dime. Better yet, you can even fine-tune locally on your own device.

#### **Pretraining vs Fine-tuning**

- **Pretraining:** Training on large, general datasets.
- **Fine-tuning:** Training on smaller, task-specific datasets.

##### **Why fine-tune instead of prompting**

Fine-tuning improves performance on specific tasks and reduces the amount of prompt engineering needed.

##### **Domain adaptation vs task adaptation**
- **Domain adaptation:** Adapting to a new domain.
- **Task adaptation:** Adapting to a new task.

In [1]:
%%capture
!pip install torch transformers datasets peft bitsandbytes numpy pandas

In [None]:
# Verify installations and check versions
import sys
import importlib

def check_package(package_name):
    try:
        module = importlib.import_module(package_name)
        version = getattr(module, '__version__', 'unknown')
        print(f"✓ {package_name}: {version}")
        return True
    except ImportError as e:
        print(f"✗ {package_name}: NOT FOUND")
        return False

print("Checking required packages:")
print("-" * 40)
check_package("torch")
check_package("transformers")
check_package("datasets")
check_package("peft")
print("-" * 40)
print("If all packages show ✓, you're ready to run the examples below!")
print("\nTo use the models, import them like this:")
print("from transformers import AutoModelForSequenceClassification")

Checking required packages:
----------------------------------------
✓ torch: 2.9.1+cpu


✓ transformers: 4.57.6


## 2.2 Types of Fine-tuning

Fine-tuning approaches vary based on computational resources, model size, and specific use cases. Each method has distinct trade-offs between performance, computational cost, and ease of implementation.

### 1. Full Fine-tuning

**What it is:** Updates all model parameters during training. Every weight and bias in the entire neural network is adjusted.

**When to use:**
- Unlimited computational resources
- Very specialized tasks requiring significant adaptation
- Smaller models (< 7B parameters)

**Advantages:**
- Maximum model adaptation to your specific task
- Can achieve the best performance

**Disadvantages:**
- Extremely computationally expensive
- Requires significant memory (24GB+ GPU memory for larger models)
- Slow training process
- Risk of catastrophic forgetting of general knowledge

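**Memory requirement:** $\text{Model Parameters} \times 16 \text{ bytes}$ (fp32 weights: 4, gradients: 4, Adam optimizer states: 8) $\approx$ 112 GB for a 7B model, before activations. Mixed precision, 8-bit optimizers, and sharding across GPUs reduce the per-device footprint.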
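A back-of-the-envelope estimate of full fine-tuning memory can be sketched as follows. It assumes fp32 weights (4 bytes), fp32 gradients (4 bytes), and an Adam-style optimizer that keeps two fp32 states per parameter (8 bytes); activations and framework overhead are deliberately left out, and 1 GB is taken as 10^9 bytes. The function name is hypothetical.

```python
def full_finetune_memory_gb(num_params, bytes_per_weight=4,
                            bytes_per_grad=4, bytes_optimizer=8):
    """Rough memory for weights + gradients + optimizer states, in GB."""
    bytes_per_param = bytes_per_weight + bytes_per_grad + bytes_optimizer
    return num_params * bytes_per_param / 1e9


estimate_7b = full_finetune_memory_gb(7e9)
print(f"7B model, fp32 + Adam: about {estimate_7b:.0f} GB before activations")
```

Lower-precision training changes the per-parameter byte counts, so the defaults here are a starting point rather than a fixed rule.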

**When NOT to use:**
- Limited GPU memory
- Large models (>13B parameters)
- Production environments with budget constraints

### 2. Feature Extraction

**What it is:** Freezes all pretrained model weights and uses the model as a fixed feature extractor. Only trains a small head layer on top.

**When to use:**
- Very limited training data
- Task very similar to pretraining task
- Minimal computational resources

**Advantages:**
- Extremely fast and cheap
- Preserves all general knowledge
- Good for downstream tasks similar to pretraining

**Disadvantages:**
- Limited adaptation to domain-specific patterns
- May underperform on very different tasks
- Cannot leverage task-specific representations

**Use case:** Using BERT embeddings for semantic similarity on small datasets

### 3. Parameter-Efficient Fine-Tuning (PEFT)

The most practical approach for most practitioners. Only trains a small subset of parameters while keeping most weights frozen.

#### **3.1 LoRA: Low-Rank Adaptation**

**Concept:** Instead of updating the full weight matrix $W \in \mathbb{R}^{d \times k}$, we learn two smaller matrices $A \in \mathbb{R}^{d \times r}$ and $B \in \mathbb{R}^{r \times k}$, so the weight update $\Delta W = A \times B$ has rank at most $r$, with $r$ much smaller than $d$ and $k$.

$$W_{\text{new}} = W_{\text{original}} + A \times B$$

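**Memory reduction:** Gradients and optimizer states are only needed for the adapter, which is typically tens to a few hundred MB for a 7B model. The frozen base weights must still be loaded (about 14 GB for 7B parameters in 16-bit precision).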

**Trade-offs:**
- Rank $r$ (default: 8 or 16) — higher = better performance but more parameters
- Alpha parameter — scaling factor for LoRA updates
- Target modules — which layers to apply LoRA to (usually attention layers)

**When to use:**
- Limited GPU memory (most practical scenarios)
- Medium to large models (7B-70B parameters)
- Production fine-tuning
- Multiple fine-tuning tasks (can stack multiple LoRA adapters)
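The parameter savings follow directly from the shapes of $A$ and $B$. The sketch below counts trainable parameters for one weight matrix and builds a tiny rank-1 update in plain Python; the layer sizes are illustrative assumptions, not taken from a specific model.

```python
def lora_param_count(d, k, r):
    """Parameters in A (d x r) plus B (r x k)."""
    return r * (d + k)


def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


# One 4096 x 4096 projection with rank 8 (illustrative sizes).
full = 4096 * 4096
lora = lora_param_count(4096, 4096, r=8)
fraction = lora / full
print(f"full: {full:,}  lora: {lora:,}  fraction: {fraction:.4%}")

# Tiny example: A is 3 x 1, B is 1 x 3, so delta_w is a 3 x 3 matrix of rank 1.
A = [[1.0], [2.0], [3.0]]
B = [[0.5, 0.0, -0.5]]
delta_w = matmul(A, B)
print(delta_w)
```

Doubling `r` doubles the adapter's parameter count, which is the trade-off described in the list above.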

#### **3.2 QLoRA: Quantized LoRA**

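**Concept:** Combines LoRA with 4-bit quantization of the base model. The frozen base model is stored in 4 bits, and only the LoRA adapters are trained, in higher precision (typically 16-bit).

**Memory reduction:** 4-bit weights take about 0.5 bytes per parameter, so a 7B model needs roughly 3.5 GB for its weights instead of about 14 GB in 16-bit precision, plus a small adapter and activations.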

**Trade-off:** Minimal performance loss while achieving dramatic memory savings

**When to use:**
- Very limited GPU memory (consumer GPUs like RTX 3090)
- Larger models on a single GPU (the QLoRA paper reports fine-tuning a 65B model on one 48 GB GPU)
- Best balance of cost and performance
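To show what storing weights in 4 bits involves, here is a simplified sketch using symmetric absmax quantization on a small block of values. This is an assumption made for illustration: QLoRA itself uses the NF4 data type with per-block scaling, which is more involved. The helper `weight_memory_gb` only counts the weights, not adapters or activations.

```python
def quantize_absmax_4bit(block):
    """Map floats to integer codes in [-7, 7] using one scale per block."""
    scale = max(abs(v) for v in block) / 7 or 1.0  # fall back to 1.0 for all-zero blocks
    codes = [round(v / scale) for v in block]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate floats from the integer codes."""
    return [c * scale for c in codes]


def weight_memory_gb(num_params, bits):
    """Memory for the weights alone, in GB (10^9 bytes)."""
    return num_params * bits / 8 / 1e9


block = [0.12, -0.70, 0.33, 0.04]
codes, scale = quantize_absmax_4bit(block)
restored = dequantize(codes, scale)
max_error = max(abs(a - b) for a, b in zip(block, restored))
print(codes, restored, max_error)
print(weight_memory_gb(7e9, 16), weight_memory_gb(7e9, 4), weight_memory_gb(70e9, 4))
```

The rounding error per value is bounded by half the scale, which is why quantization costs a little accuracy in exchange for the memory savings.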

#### **3.3 Adapters**

**Concept:** Inserts small trainable modules (bottleneck adapters) between layers of the frozen base model.

**Advantages:**
- Can be more efficient than LoRA for some tasks
- Modular approach
- Task-specific adaptations
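A bottleneck adapter can be sketched in a few lines: project the hidden vector down to a small size, apply a nonlinearity, project back up, and add the result to the original hidden vector. The sizes, the ReLU nonlinearity, and the zero-initialized up-projection (so the adapter starts out as an identity function) are assumptions of this sketch.

```python
def linear(x, weight):
    """Multiply a length-n vector by an n x m weight matrix (list of rows)."""
    return [sum(xi * w for xi, w in zip(x, col)) for col in zip(*weight)]


def relu(x):
    return [max(0.0, v) for v in x]


def bottleneck_adapter(hidden, w_down, w_up):
    """hidden + up(relu(down(hidden))): a residual bottleneck block."""
    update = linear(relu(linear(hidden, w_down)), w_up)
    return [h + u for h, u in zip(hidden, update)]


hidden = [1.0, -2.0, 0.5, 3.0]                                # hidden size 4
w_down = [[0.1, 0.2], [0.0, -0.1], [0.3, 0.0], [0.1, 0.1]]    # 4 x 2 (bottleneck of 2)
w_up_zero = [[0.0] * 4 for _ in range(2)]                     # 2 x 4, zero-initialized

out = bottleneck_adapter(hidden, w_down, w_up_zero)
print(out)  # same as hidden, because the up-projection is all zeros
```

Only `w_down` and `w_up` would be trained; the surrounding layers of the base model stay frozen.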

#### **3.4 Prefix Tuning**

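**Concept:** Prepends trainable prefix vectors to the model's input (in the original method, to the keys and values of every attention layer) while the base model weights stay frozen.

**Advantages:**
- Base model weights are not updated; only the prefix is trained
- Good for generation tasks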

### 4. Instruction Tuning

**What it is:** Fine-tuning on instruction-response pairs to make models follow instructions better.

**Format:** `[INSTRUCTION, INPUT]` → `[OUTPUT]`

**When to use:**
- Want to improve instruction-following capabilities
- Building chat or assistant models
- Making general models task-specific

**Example training data:**
```
Instruction: Classify the sentiment of this review.
Input: "This movie was absolutely terrible."
Output: Negative
```

### 5. Supervised Fine-tuning (SFT)

**What it is:** Fine-tuning with labeled input-output pairs. The classic approach where you provide labeled examples of desired behavior.

**When to use:**
- You have quality labeled data
- Clear input-output relationships
- Any task with ground truth examples

**Example:**
- Classification tasks with labeled examples
- Named Entity Recognition with annotated text
- Question-answering with Q&A pairs

**Key consideration:** Quality > Quantity. 1,000 high-quality examples often beat 100,000 noisy examples.
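One way to act on "quality over quantity" is to clean the dataset before training. The sketch below drops empty texts, unlabeled rows, and near-duplicates that differ only in case or spacing. The rules and the `min_chars` threshold are illustrative assumptions; real projects usually add task-specific checks.

```python
def clean_examples(examples, min_chars=3):
    """Keep labeled, non-trivial, de-duplicated (text, label) pairs."""
    seen = set()
    cleaned = []
    for text, label in examples:
        # Normalize case and whitespace so trivial variants count as duplicates.
        normalized = " ".join(text.lower().split())
        if len(normalized) < min_chars or label is None:
            continue
        if normalized in seen:
            continue
        seen.add(normalized)
        cleaned.append((text.strip(), label))
    return cleaned


raw = [
    ("This movie is fantastic!", 1),
    ("this movie is  fantastic!", 1),   # duplicate after normalization
    ("", 0),                            # empty text
    ("Terrible experience.", None),     # missing label
    ("Awful, not recommended.", 0),
]
cleaned = clean_examples(raw)
print(cleaned)
```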

### 6. Reinforcement Learning from Human Feedback (RLHF) — Conceptual Overview

**What it is:** A two-stage process:
1. Train a reward model on human preference judgments
2. Use the reward model to fine-tune the language model via reinforcement learning

**When to use:**
- When you want to align model outputs with human preferences
- Quality is hard to define with labels but easy to judge
- Building chat models or instruction-following systems

**Example:** Human raters compare two model outputs and prefer one. The reward model learns this preference, then RL optimizes the model to maximize predicted reward.
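Stage 1 can be illustrated with the pairwise loss commonly used for reward models: the model should score the human-preferred output higher than the rejected one. The scalar scores below are made-up stand-ins for what a reward model would output, and the function name is hypothetical.

```python
import math


def pairwise_reward_loss(reward_chosen, reward_rejected):
    """-log(sigmoid(r_chosen - r_rejected)), written as log(1 + exp(-diff))."""
    diff = reward_chosen - reward_rejected
    return math.log1p(math.exp(-diff))


agrees = pairwise_reward_loss(2.0, 0.0)      # reward model agrees with the rater
undecided = pairwise_reward_loss(0.0, 0.0)   # no preference yet: log(2)
disagrees = pairwise_reward_loss(0.0, 2.0)   # reward model prefers the rejected output
print(agrees, undecided, disagrees)
```

Minimizing this loss over many rated pairs pushes the reward model toward the raters' preferences; stage 2 then optimizes the language model against it.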

**Challenge:** Computationally complex, requires expertise in RL


### Code Examples for Each Fine-tuning Type

> **Note:** All required packages (torch, transformers, datasets, peft) are already installed. If you encounter import errors, restart the kernel by clicking "Restart Kernel" in the notebook toolbar.


In [None]:
# ============================================================================
# TROUBLESHOOTING: If you see ImportError about PyTorch
# ============================================================================
# This usually means torch is missing from (or broken in) the kernel's
# environment. Run the cell below, then restart the kernel:

import subprocess
import sys

# Install with the interpreter the kernel is running, so the package lands
# in the kernel's own environment rather than another Python on the PATH.
subprocess.run(
    [sys.executable, "-m", "pip", "install", "--upgrade", "torch", "--quiet"],
    check=False,
)

# Compiled packages such as torch cannot be reliably re-imported in a live
# process, so a kernel restart is needed instead of editing sys.modules.
print("torch reinstalled. Restart the kernel, then run the examples again!")
print("\nClick 'Restart Kernel' in the notebook toolbar.")

torch reinstalled. Restart the kernel, then run the examples again!

Click 'Restart Kernel' in the notebook toolbar.


In [None]:
# ============================================================================
# CODE EXAMPLE 2: Feature Extraction (Freeze All, Train Only Head)
# ============================================================================

print("Feature Extraction Approach")
print("=" * 50)
print("\n# Strategy: Freeze base model, train only classifier head\n")

code_example = """
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    'bert-base-uncased', num_labels=2
)

# Freeze ALL base model parameters
for param in model.base_model.parameters():
    param.requires_grad = False

# Only classification head is trainable
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Trainable parameters: {trainable}")
# Output: 1,538 parameters (classifier head only: 768 x 2 weights + 2 biases)
"""

print(code_example)

print("Characteristics:")
print("  ✓ Extremely fast training")
print("  ✓ Minimal memory requirement")
print("  ✓ Preserves general knowledge")
print("  ✗ Limited adaptation to domain-specific patterns")
print("\nBest for: Classification with very limited training data")



In [None]:
# ============================================================================
# CODE EXAMPLE 3: LoRA Fine-tuning (MOST PRACTICAL)
# ============================================================================

print("LoRA (Low-Rank Adaptation) - THE GOLD STANDARD")
print("=" * 50)
print("\n# LoRA Configuration Example\n")

code_example = """
from peft import LoraConfig, TaskType, get_peft_model
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    'bert-base-uncased', num_labels=2
)

# Define LoRA config
lora_config = LoraConfig(
    task_type=TaskType.SEQ_CLS,
    r=8,                          # Rank (lower = fewer params)
    lora_alpha=32,                # Scaling factor
    lora_dropout=0.1,
    bias="none",
    target_modules=["query", "value"]  # Attention layers
)

# Apply LoRA
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()

# Expected output (approximate):
# about 0.3M trainable params (294,912 LoRA weights plus the classifier head)
# out of roughly 110M total, i.e. around 0.27% trainable
"""

print(code_example)

print("Key LoRA Parameters:")
print("  r (rank):      8-16 typical. Higher = more parameters")
print("  lora_alpha:    Scaling factor (update is scaled by lora_alpha / r); often 2x the rank")
print("  lora_dropout:  0.05-0.1 for regularization")
print("  target_modules: Which layers to apply LoRA to")
print("\nMemory: the adapter is small (tens to hundreds of MB for 7B models), but the frozen base weights must still be loaded")
print("Performance: often close to full fine-tuning quality")


In [None]:
# ============================================================================
# CODE EXAMPLE 4: QLoRA (Quantized LoRA) - ULTRA MEMORY EFFICIENT
# ============================================================================

print("QLoRA - Extreme Memory Efficiency")
print("=" * 50)
print("\n# QLoRA = 4-bit Quantization + LoRA\n")

code_example = """
from transformers import BitsAndBytesConfig, AutoModelForCausalLM
from peft import LoraConfig, get_peft_model, TaskType
import torch

# 4-bit quantization config
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

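# Load a 70B model in 4-bit. The 4-bit weights alone take about 35GB, so this
# needs a 48GB-class GPU or several GPUs (device_map="auto" spreads the layers).
model = AutoModelForCausalLM.from_pretrained(
    'meta-llama/Llama-2-70b-hf',  # Transformers-format checkpoint (gated; requires access)
    quantization_config=bnb_config,
    device_map="auto"
)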

# Add LoRA on top
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.CAUSAL_LM,
    target_modules=["q_proj", "v_proj"]
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
"""

print(code_example)

print("Memory Requirements:")
print("  Full 70B model:    280GB+ for fp32 weights alone (impossible on consumer hardware)")
print("  QLoRA approach:    ~35GB of 4-bit weights (48GB GPU or multiple GPUs)")
print("\nTrade-offs:")
print("  ✓ About 8x smaller weights than fp32 (4x smaller than 16-bit)")
print("  ✓ Small quality loss compared with 16-bit LoRA")
print("  ✓ Can fine-tune much larger models on the same hardware")
print("  ✗ Slightly slower training than LoRA")


In [None]:
# ============================================================================
# CODE EXAMPLE 5: Instruction Tuning - Data Formatting
# ============================================================================

print("Instruction Tuning - Prepare Data")
print("=" * 50)
print("\n# Data Format: Instruction + Input → Output\n")

from datasets import Dataset
import pandas as pd

# Create instruction tuning dataset
instruction_data = {
    "instruction": [
        "Classify the sentiment of this review.",
        "Classify the sentiment of this review.",
        "Translate to French:",
        "Summarize in 2 sentences:",
    ],
    "input": [
        "This product is amazing!",
        "Terrible quality, broke immediately.",
        "The weather is beautiful today.",
        "Machine learning transforms industries..."
    ],
    "output": [
        "Positive",
        "Negative",
        "Le temps est magnifique aujourd'hui.",
        "ML revolutionizes sectors. Impact grows daily."
    ]
}

df = pd.DataFrame(instruction_data)
dataset = Dataset.from_pandas(df)

# Format as prompt-response pairs
def format_instruction(examples):
    texts = []
    for instr, inp, output in zip(examples["instruction"], examples["input"], examples["output"]):
        text = f"Instruction: {instr}\nInput: {inp}\nOutput: {output}"
        texts.append(text)
    return {"text": texts}

formatted = dataset.map(format_instruction, batched=True)

print("Example formatted training data:")
print("-" * 50)
print(formatted[0]["text"])
print("-" * 50)
print(f"\nDataset size: {len(formatted)} examples")
print("\nThis format teaches the model to:")
print("  1. Read instructions")
print("  2. Process inputs")
print("  3. Generate correct outputs")

Instruction Tuning - Prepare Data

# Data Format: Instruction + Input → Output




Example formatted training data:
--------------------------------------------------
Instruction: Classify the sentiment of this review.
Input: This product is amazing!
Output: Positive
--------------------------------------------------

Dataset size: 4 examples

This format teaches the model to:
  1. Read instructions
  2. Process inputs
  3. Generate correct outputs





In [None]:
# ============================================================================
# CODE EXAMPLE 6: Supervised Fine-tuning (SFT) - Full Training Pipeline
# ============================================================================

print("Supervised Fine-tuning - Complete Example")
print("=" * 50)

from transformers import AutoTokenizer, Trainer, TrainingArguments
from datasets import Dataset
import pandas as pd

# Step 1: Create training data
data = {
    "text": [
        "This movie is fantastic!",
        "I loved this product!",
        "Terrible experience.",
        "Awful, not recommended.",
        "Amazing quality and shipping!",
    ],
    "label": [1, 1, 0, 0, 1]  # 1=positive, 0=negative
}

dataset = Dataset.from_pandas(pd.DataFrame(data))
dataset_split = dataset.train_test_split(test_size=0.2, seed=42)

print(f"✓ Created dataset: {len(dataset_split['train'])} train, {len(dataset_split['test'])} test")

# Step 2: Tokenize
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def tokenize(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True, max_length=128)

tokenized = dataset_split.map(tokenize, batched=True)
print(f"✓ Tokenized data")

# Step 3: Define training args
training_args = TrainingArguments(
    output_dir="./sft_model",
    num_train_epochs=3,
    per_device_train_batch_size=8,
    learning_rate=2e-5,
    weight_decay=0.01,
    eval_strategy="epoch",
)

print(f"✓ Training configuration set:")
print(f"  - Epochs: 3")
print(f"  - Learning rate: 2e-5")
print(f"  - Batch size: 8")

print("\n# To run training, uncomment and execute:")
print("""
# from transformers import AutoModelForSequenceClassification
# model = AutoModelForSequenceClassification.from_pretrained(
#     "bert-base-uncased", num_labels=2
# )
# trainer = Trainer(
#     model=model,
#     args=training_args,
#     train_dataset=tokenized['train'],
#     eval_dataset=tokenized['test'],
# )
# trainer.train()
""")



Supervised Fine-tuning - Complete Example
✓ Created dataset: 4 train, 1 test



✓ Tokenized data
✓ Training configuration set:
  - Epochs: 3
  - Learning rate: 2e-5
  - Batch size: 8









### Summary: Which Fine-tuning Method Should YOU Use?

**Quick Decision Guide:**

| Your Situation | Recommended Method | Why |
|---|---|---|
| **Very limited data** | Feature Extraction | Fast, cheap, works well on similar tasks |
| **Standard scenario** | LoRA | Near full fine-tuning quality at a fraction of the cost |
| **Limited GPU memory** | QLoRA | 4-bit base weights cut memory to roughly a quarter of 16-bit |
| **Want better instruction following** | Instruction Tuning | Trains on (instruction, input, output) examples |
| **Have labeled data** | SFT | Standard supervised learning |
| **Want best performance** | Full Fine-tuning + RL | Most expensive, but best results |

### Comparison: Memory & Speed Trade-offs

| Method | Memory | Speed | Performance | Best For |
|--------|--------|-------|-------------|----------|
| **Full Fine-tuning** | Highest: weights + gradients + optimizer states | Slow | Highest (if optimized) | Unlimited budget, small models |
| **Feature Extraction** | Low: frozen base model, tiny trainable head | Fastest | Lowest | Limited data, simple tasks |
| **LoRA** | 16-bit base weights + small adapter | Medium | High (near FT) | **Most practical choice** |
| **QLoRA** | 4-bit base weights + small adapter | Medium | Good (slight ↓) | Consumer GPUs, large models |
| **Instruction Tuning** | (depends on method) | Medium | High | Chat/assistant models |
| **SFT** | (depends on method) | Medium | High | Labeled task data |

### Decision Tree: Which Fine-tuning Method to Use?

```
Do you have quality labeled data?
├─ YES → Use Supervised Fine-tuning (SFT)
│         └─ Want instruction-following? → Use Instruction Tuning
└─ NO → Skip to next question

Do you have unlimited GPU memory?
├─ YES → Use Full Fine-tuning
└─ NO → Use LoRA or QLoRA (RECOMMENDED)
        └─ How much GPU memory?
           ├─ > 16GB → Use LoRA
           └─ < 16GB → Use QLoRA
```
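The decision tree above can also be written as a small function, which makes the two independent questions explicit: the training objective (driven by your data) and the training strategy (driven by your hardware). The function name and return format are hypothetical, and treating exactly 16 GB as the QLoRA case is an assumption, since the tree does not say which branch it belongs to.

```python
def recommend_method(has_labeled_data, wants_instruction_following,
                     unlimited_gpu_memory, gpu_memory_gb):
    """Return (objective, strategy) following the decision tree."""
    # Question 1: what kind of data do you have?
    objective = None
    if has_labeled_data:
        objective = "Instruction Tuning" if wants_instruction_following else "SFT"

    # Question 2: how much GPU memory is available?
    if unlimited_gpu_memory:
        strategy = "Full Fine-tuning"
    elif gpu_memory_gb > 16:
        strategy = "LoRA"
    else:
        strategy = "QLoRA"
    return objective, strategy


print(recommend_method(True, False, False, 24))
print(recommend_method(False, False, False, 8))
```

In practice the right memory threshold depends on model size, so the 16 GB cut-off should be read as a rule of thumb.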




#### **Choose the Right Model And Methods**

When preparing for fine-tuning, one of the first decisions you'll face is selecting the right model. Here's a step-by-step guide to help you choose:

1. Choose a model that aligns with your use case

- E.g. For image-based training, select a vision model such as Llama 3.2 Vision. For code datasets, opt for a specialized model like Qwen2.5-Coder.

- Licensing and Requirements: Different models may have specific licensing terms and *system requirements*. Be sure to review these carefully to avoid compatibility issues.

2. Assess your storage, compute capacity and dataset

- Use the Unsloth [VRAM guideline](https://unsloth.ai/docs/get-started/fine-tuning-for-beginners/unsloth-requirements#system-requirements) to determine the VRAM requirements for the model you’re considering.
- Your dataset determines which type of model you should use and how long training will take.

3. Select a Model and Parameters

- We recommend using the latest model for the best performance and capabilities. For instance, as of January 2025, the leading 70B model is Llama 3.3.

- You can stay up to date by exploring our model catalog to find the newest and relevant options.

4. Choose Between Base and Instruct Models

Further details below:

**Instruct or Base Model?**

- When preparing for fine-tuning, one of the first decisions you'll face is whether to use an instruct model or a base model.

**Instruct Models**

- Instruct models have already been fine-tuned to follow instructions, making them ready to use without any further fine-tuning. These models, often also distributed as GGUF files, are optimized for direct usage and respond effectively to prompts right out of the box. Instruct models work with conversational chat templates like ChatML or ShareGPT.

**Base Models**

- Base models, on the other hand, are the original pre-trained versions without instruction fine-tuning. These are specifically designed for customization through fine-tuning, allowing you to adapt them to your unique needs. Base models are compatible with instruction-style templates like Alpaca or Vicuna, but they generally do not support conversational chat templates out of the box.
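To show the difference between the two template families, here is a sketch that formats the same request in an Alpaca-style instruction template and in a ChatML-style conversational template. The exact wording and special tokens vary between models, so treat these strings as illustrative and check the tokenizer's own chat template before training; the function names are hypothetical.

```python
def format_alpaca(instruction, response=""):
    """Alpaca-style single-turn instruction prompt."""
    return (
        "Below is an instruction that describes a task. "
        "Write a response that appropriately completes the request.\n\n"
        f"### Instruction:\n{instruction}\n\n### Response:\n{response}"
    )


def format_chatml(messages):
    """ChatML-style multi-turn conversation built from role/content dicts."""
    return "".join(
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages
    )


alpaca_text = format_alpaca("Classify the sentiment: 'I loved this product!'", "Positive")
chatml_text = format_chatml([
    {"role": "user", "content": "Classify the sentiment: 'I loved this product!'"},
    {"role": "assistant", "content": "Positive"},
])
print(alpaca_text)
print(chatml_text)
```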

**How To Choose**

The decision often depends on the quantity, quality, and type of your data:

- 1,000+ Rows of Data: If you have a large dataset with over 1,000 rows, it's generally best to fine-tune the base model.

- 300–1,000 Rows of High-Quality Data: With a medium-sized, high-quality dataset, both the base and the instruct model are viable options.

- Less than 300 Rows: For smaller datasets, the instruct model is typically the better choice. Fine-tuning the instruct model enables it to align with specific needs while preserving its built-in instructional capabilities. This ensures it can follow general instructions without additional input unless you intend to significantly alter its functionality.
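The row-count guidance above can be expressed as a small helper. The thresholds come from the list; assigning exactly 1,000 rows to the "base" branch is an assumption, and the function name is hypothetical. Data quality and type still matter, so the result is a starting suggestion only.

```python
def suggest_starting_model(num_rows):
    """Suggest base vs instruct as a starting point from dataset size alone."""
    if num_rows >= 1000:
        return "base"
    if num_rows >= 300:
        return "base or instruct"
    return "instruct"


for rows in (150, 500, 5000):
    print(rows, "->", suggest_starting_model(rows))
```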