# Fine-Tuning a Language Model for Healthcare Question Answering

**Module 02 | Notebook 4 of 4**

In this notebook, we will fine-tune a specialized language model (`TinyLlama`) to answer medical questions using the `MedQuad` dataset.

## Learning Objectives
1.  Understand the difference between **Context (RAG)** and **Fine-Tuning**.
2.  Learn about **PEFT (Parameter-Efficient Fine-Tuning)** and **LoRA**.
3.  Fine-tune a model on a custom dataset.
4.  Export the model for local use.

---

## What You'll Build Today

By the end of this notebook, you will have:
- ✅ A custom medical question-answering model
- ✅ Understanding of when to use fine-tuning vs RAG
- ✅ Hands-on experience with LoRA (Low-Rank Adaptation)
- ✅ A model that learns by training only a small fraction (well under 1%) of the original parameter count!

**Estimated Time:** 45-60 minutes (depending on hardware)
**Prerequisites:** Basic Python, understanding of neural networks


## 1. When to use What? (RAG vs. Fine-Tuning)

Before we start, it's important to know *when* to fine-tune.

| Feature | **RAG (Retrieval-Augmented Gen)** | **Fine-Tuning** |
| :--- | :--- | :--- |
| **Analogy** | Giving the model an open textbook during the exam. | Sending the model to medical school for 4 years. |
| **Goal** | Add new *knowledge* (facts, data). | Change *behavior*, style, or learn specialized jargon. |
| **Pros** | Cheaper, easier to update facts. | Better performance on specific tasks, faster inference (no retrieval). |
| **Cons** | Limited context window. | Expensive to train, hard to update facts (requires re-training). |

**In this notebook**, we are doing **Fine-Tuning** to teach the model how to *act* like a medical assistant and understand medical terminology, not necessarily to memorize every drug interaction (RAG would be better for that).

### 🌍 Real-World Examples

**When to use RAG:**
- A company chatbot that needs to answer questions about constantly changing product documentation
- A legal assistant that references current laws and regulations
- A customer service bot with access to your latest knowledge base

**When to use Fine-Tuning:**
- Teaching a model to write in a specific tone (e.g., Shakespearean English)
- Making a model understand medical/legal jargon and respond appropriately
- Creating a coding assistant that follows your company's specific style guide
- Teaching a model to structure outputs in a particular format (e.g., always JSON)

**💡 Pro Tip:** Many production systems use BOTH! They fine-tune for style/behavior and use RAG for facts.

### 📚 Think of it this way:

**RAG = Open Book Exam**
- You (the model) can look up answers in provided documents
- You don't need to memorize everything
- If the documents are updated, you automatically have new information

**Fine-Tuning = Closed Book Exam (After Studying)**
- You've internalized the patterns and style
- You respond faster (no need to search documents)
- But updating your knowledge requires studying again (retraining)


In [None]:
%%capture
!pip install transformers datasets accelerate peft trl bitsandbytes

In [None]:
import torch
import os
from transformers import (
    AutoModelForCausalLM, 
    AutoTokenizer, 
    BitsAndBytesConfig,
    TrainingArguments
)
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from datasets import load_dataset
from trl import SFTTrainer, SFTConfig

# Device setup: prefer NVIDIA CUDA, then Apple Silicon (MPS), then CPU
if torch.cuda.is_available():
    device = "cuda"
elif torch.backends.mps.is_available():
    device = "mps"  # For Mac users
else:
    device = "cpu"

print(f"Using device: {device}")

---

## 2. Load the Dataset

We will use the `MedQuad` dataset, loaded from the Hugging Face Hub. It contains pairs of `Question` and `Answer`.

In [None]:
# Load dataset
dataset_name = "keivalya/MedQuad-MedicalQnADataset"
dataset = load_dataset(dataset_name, split="train")

# Use a small subset for demonstration (first 500 examples)
dataset = dataset.select(range(500))

print(f"Training on {len(dataset)} examples")
print("Sample:", dataset[0])

### 📊 Understanding Our Dataset

Let's explore what we're working with:

**MedQuad Dataset:**
- Contains medical questions and expert answers
- Sourced from trusted health organizations (NIH, CDC, etc.)
- Format: {Question, Answer} pairs

**Why only 500 examples?**
- Full dataset has 16,000+ examples
- For learning purposes, 500 is enough to see results quickly
- In production, you'd use the full dataset (takes 2-4 hours to train)

**Dataset Quality Matters!**
- ✅ Good: Clear questions, accurate answers, consistent formatting
- ❌ Bad: Contradictory information, poor grammar, inconsistent style
- **Rule of thumb:** 500 high-quality examples > 5,000 mediocre ones


### 🎯 Why Data Formatting is Critical

The model doesn't "know" what's a question vs an answer. We need to teach it!

**Bad formatting:**
```
What are the symptoms of diabetes? Increased thirst, frequent urination...
```
The model sees one continuous text blob - it doesn't know where the question ends!

**Good formatting:**
```
### Question:
What are the symptoms of diabetes?

### Answer:
Increased thirst, frequent urination...
```
Now the model learns the pattern:
1. Text after "### Question:" = what the user asks
2. Text after "### Answer:" = what I should respond

**This is called "prompt formatting" - it's one of the most important parts of fine-tuning!**


### Formatting
To train a chat model, we format the data clearly so the model knows what is the input and what is the output.

```
### Question:
{User's Question}

### Answer:
{Model's Answer}
```

In [None]:
def formatting_func(example):
    text = f"""### Question:
{example['Question']}

### Answer:
{example['Answer']}"""
    return text

print(formatting_func(dataset[0]))

## 2.5 Understanding Quantization (Memory Optimization)

Before we load our model, let's understand how we'll fit a billion-parameter model on your laptop!

### What is Quantization?

**The Problem:**
- Modern LLMs are HUGE. TinyLlama (our "small" model) has 1.1 BILLION parameters
- Each parameter is typically stored as a 32-bit float = 4 bytes
- 1.1B × 4 bytes = 4.4 GB just for the model weights!

**The Solution: Quantization**
- Store numbers in fewer bits (4-bit or 8-bit instead of 32-bit)
- **4-bit quantization:** 1.1B × 0.5 bytes ≈ 550 MB (8× smaller!)
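
The arithmetic above can be reproduced in a few lines. This is a minimal sketch that counts weights only; it ignores activations, optimizer state, and the small overhead that quantization formats add for scaling constants. The helper name `weight_memory_gb` is made up for this illustration.

```python
def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Approximate memory for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    bytes_per_param = bits_per_param / 8
    return num_params * bytes_per_param / 1e9

TINYLLAMA_PARAMS = 1.1e9  # rounded figure used in this notebook

for bits in (32, 16, 8, 4):
    print(f"{bits:>2}-bit: {weight_memory_gb(TINYLLAMA_PARAMS, bits):.2f} GB")
```

The 16-bit row is the figure that matters on Mac and CPU, where the model is loaded in `float16`.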

### The Trade-off
```
Higher Precision → More Memory → Better Quality (slightly)
Lower Precision → Less Memory → Fits on smaller hardware → Tiny quality loss
```
**For learning and experimentation, 4-bit is perfect!**

### Hardware-Specific Notes:
- **NVIDIA GPU:** We use `bitsandbytes` library for 4-bit quantization
- **Mac (M1/M2/M3):** We use float16 (16-bit) instead
  - Why? The `bitsandbytes` library doesn't support Mac GPUs yet
  - Good news: TinyLlama is small enough that float16 works fine!
- **CPU Only:** We also use float16, but training will be slower


---

## 3. Model Setup (with Conditional Quantization)

We use **TinyLlama-1.1B**. It's small enough to run on most laptops.

### Hardware Note
*   **NVIDIA GPU**: We can use **4-bit quantization** to save massive memory.
*   **Mac (M1/M2/M3)**: 4-bit quantization (`bitsandbytes`) is not natively supported. We will load the model in `float16` instead. TinyLlama is small (about 2.2 GB in `float16`), so this works fine!

In [None]:
model_id = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"

# Determine optimized settings based on hardware
if device == "cuda":
    # Quantization Config (NVIDIA only)
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.float16,
    )
    model_kwargs = {"quantization_config": bnb_config}
else:
    # Mac/CPU: Load in half-precision (float16) to save RAM
    bnb_config = None
    model_kwargs = {"torch_dtype": torch.float16}

print(f"Loading model with config: {model_kwargs}")

# Load Model
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto" if device == "cuda" else None,  # On MPS/CPU the model is placed manually below
    trust_remote_code=True,
    **model_kwargs
)

# On Apple Silicon, move the model to the MPS device explicitly
if device == "mps":
    model.to("mps")

# Load Tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

---

## 4. Setting up LoRA (Low-Rank Adaptation)

### An Analogy
Imagine you want to customize your car (Pre-trained Model).
*   **Full Fine-Tuning**: Rebuilding the entire engine. Powerful, but expensive and slow.
*   **LoRA**: Adding a "Turbocharger" plugin. You don't touch the engine; you just add a small, focused part that modifies the performance.

### Parameters
*   **`r` (Rank)**: The size of the "plugin". Larger = more capacity to adapt, but more memory and compute. (Common: 8, 16)
*   **`target_modules`**: Where to attach the plugin. In a Transformer, we usually attach to the Attention layers (here `q_proj`, `k_proj`, `v_proj`, `o_proj`).

### 🔬 How LoRA Actually Works (Technical)

Let's understand what's happening under the hood:

**Traditional Fine-Tuning:**
```
Original Model: 1.1 Billion parameters
Fine-Tuning: Update ALL 1.1 Billion parameters
Memory needed: ~20 GB
Time: Many hours
```
**LoRA Fine-Tuning:**
```
Original Model: 1.1 Billion parameters (FROZEN ❄️)
LoRA Adapters: ~2-20 Million parameters (TRAINABLE 🔥)
Memory needed: ~4 GB
Time: Much faster!
```
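A rough back-of-the-envelope estimate shows where numbers of this size come from. The sketch below assumes fp32 training with Adam (4 bytes for the weight, 4 for the gradient, 8 for the two Adam moments per trainable parameter) and an adapter of about 4.5 million parameters; both are assumptions for illustration. Activations are excluded, so real usage will be higher.

```python
def training_state_gb(trainable_params, frozen_params=0.0, frozen_bytes=4.0):
    """Rough memory for weights, gradients, and Adam state (activations excluded)."""
    # Each trainable fp32 parameter: 4 B weight + 4 B gradient + 8 B Adam moments
    trainable_bytes = trainable_params * (4 + 4 + 8)
    return (trainable_bytes + frozen_params * frozen_bytes) / 1e9

ADAPTER_PARAMS = 4.5e6  # assumed adapter size; check print_trainable_parameters()

full = training_state_gb(trainable_params=1.1e9)
lora_fp16 = training_state_gb(ADAPTER_PARAMS, frozen_params=1.1e9, frozen_bytes=2.0)
lora_4bit = training_state_gb(ADAPTER_PARAMS, frozen_params=1.1e9, frozen_bytes=0.5)

print(f"Full fine-tuning:       {full:.1f} GB")
print(f"LoRA, fp16 base model:  {lora_fp16:.2f} GB")
print(f"LoRA, 4-bit base model: {lora_4bit:.2f} GB")
```

Most of the saving comes from not storing gradients and optimizer state for the frozen weights.
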
**The Math (Simplified):**
- In a transformer, we have weight matrices like `W` (e.g., 4096 × 4096)
- LoRA adds two small matrices: `A` (r × 4096) and `B` (4096 × r)
- Instead of updating `W`, we learn `A` and `B`
- Final output: `W·x + (lora_alpha / r)·B·(A·x)` ≈ Updated model behavior

**Why does this work?**
- The "important" changes to a model often live in a lower-dimensional space
- `r` (rank) controls this dimension - usually 8 or 16 is enough!
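
To make the low-rank update concrete, here is a toy sketch in plain Python (no PyTorch) with made-up sizes and values. It checks that running the frozen weight plus the adapter path gives the same output as a single merged weight `W + (alpha / r)·B·A`. In real LoRA, `B` starts at zero so the adapter is initially a no-op; non-zero values are used here only so the effect is visible.

```python
def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def matmul(P, Q):
    """Multiply two matrices given as lists of rows."""
    cols = list(zip(*Q))
    return [[sum(p * q for p, q in zip(row, col)) for col in cols] for row in P]

d, r, alpha = 8, 2, 4   # toy sizes: hidden dimension 8, rank 2
scale = alpha / r       # LoRA scales the update by alpha / r

W = [[1.0 if i == j else 0.0 for j in range(d)] for i in range(d)]   # frozen weight
A = [[0.1 * (i + j) for j in range(d)] for i in range(r)]            # r x d, trainable
B = [[0.5 if j == 0 else -0.5 for j in range(r)] for i in range(d)]  # d x r, trainable
x = [float(i + 1) for i in range(d)]

# Adapter path: W stays frozen, the low-rank correction is added on top
base = matvec(W, x)
delta = matvec(B, matvec(A, x))
y_adapter = [b + scale * dl for b, dl in zip(base, delta)]

# Equivalent merged weight: W' = W + scale * (B @ A)
BA = matmul(B, A)
W_merged = [[W[i][j] + scale * BA[i][j] for j in range(d)] for i in range(d)]
y_merged = matvec(W_merged, x)

print("same output:", all(abs(a - b) < 1e-9 for a, b in zip(y_adapter, y_merged)))
print("trainable:", r * d + d * r, "vs full matrix:", d * d)
```

The gap between the two counts widens quickly as `d` grows while `r` stays small, which is the regime real models are in.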

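### Visualizing Parameter Count
```
Full Fine-Tuning:        [████████████████████] 100% parameters
LoRA (r=16, attention):  [░░░░░░░░░░░░░░░░░░░░] well under 1% parameters
```
With our config you train only a few million out of 1.1 billion parameters; `model.print_trainable_parameters()` in the setup cell prints the exact count.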
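
The trainable-parameter count can be estimated by hand. The sketch below assumes TinyLlama-1.1B's dimensions (hidden size 2048, 22 layers, and grouped-query attention with a key/value width of 256); treat these as assumptions and verify them against `model.config`. It counts adapters on `q_proj`, `k_proj`, `v_proj`, and `o_proj` only.

```python
def lora_params(in_features: int, out_features: int, r: int) -> int:
    """Parameters added by one LoRA adapter: A is (r x in), B is (out x r)."""
    return r * in_features + out_features * r

# Assumed TinyLlama-1.1B dimensions (verify against model.config)
HIDDEN = 2048         # hidden_size
KV_DIM = 256          # num_key_value_heads (4) * head_dim (64)
LAYERS = 22           # num_hidden_layers
TOTAL_PARAMS = 1.1e9  # rounded

def trainable_lora_params(r: int) -> int:
    """Total LoRA parameters when targeting the four attention projections."""
    per_layer = (
        lora_params(HIDDEN, HIDDEN, r)    # q_proj
        + lora_params(HIDDEN, KV_DIM, r)  # k_proj
        + lora_params(HIDDEN, KV_DIM, r)  # v_proj
        + lora_params(HIDDEN, HIDDEN, r)  # o_proj
    )
    return per_layer * LAYERS

for r in (8, 16, 64):
    n = trainable_lora_params(r)
    print(f"r={r:>2}: {n:,} trainable ({100 * n / TOTAL_PARAMS:.2f}% of the base model)")
```

If the number printed by `model.print_trainable_parameters()` differs, the assumed dimensions or target modules are the first things to check.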


### 🎛️ Understanding LoRA Parameters

Let's break down each parameter in our config:

**`r=16` (Rank)**
- The "size" of our adaptation
- Higher = More powerful but slower and more memory
- Typical values: 8, 16, 32, 64
- **Our choice (16):** Good balance for most tasks

**`lora_alpha=32` (Scaling Factor)**
- Controls how much influence LoRA has
- Rule of thumb: `alpha = 2 × r`
- **Our choice (32):** Standard scaling for r=16

**`target_modules` (Where to apply LoRA)**
```
Transformer Layer:
┌────────────────┐
│   Attention    │ ← We target these!
│ - q_proj     ✓ │ (Query, Key, Value, Output)
│ - k_proj     ✓ │
│ - v_proj     ✓ │
│ - o_proj     ✓ │
├────────────────┤
│  Feed Forward  │ ← We skip these
│ - gate_proj    │ (To save memory)
│ - up_proj      │
│ - down_proj    │
└────────────────┘
```
- **Why attention layers?** They control how the model "understands" relationships
- **Pro tip:** For maximum quality, target FFN layers too (but uses more memory)

**`lora_dropout=0.05`**
- Randomly zeroes 5% of the activations entering the LoRA path during training
- Prevents overfitting (memorizing instead of learning)
- **Our choice (0.05):** Conservative, works well for small datasets

**`task_type="CAUSAL_LM"`**
- Tells LoRA we're doing language generation (not classification)
- CAUSAL_LM = predict the next token (like GPT)


In [None]:
if bnb_config: # If using quantization
    model = prepare_model_for_kbit_training(model)

lora_config = LoraConfig(
    r=16, 
    lora_alpha=32, 
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()

---

## 5. Training

## 🏋️ Understanding Training Parameters

Let's demystify each training parameter:

### Core Training Settings

**`num_train_epochs=1`**
- An "epoch" = one complete pass through the entire dataset
- 1 epoch on 500 examples = Model sees each Q&A pair once
- **Why just 1?** For demo purposes. Real projects often train for a few epochs while watching validation loss.

**`per_device_train_batch_size=2`**
- How many examples to process at once
- Larger = faster training but more memory
- **Our choice (2):** Conservative for laptops with 8-16 GB RAM
- If you have a beefy GPU: try 4 or 8

**`gradient_accumulation_steps=4`**
- This is clever! We simulate a larger batch size without using more memory
- Effective batch size = 2 × 4 = 8
- **How it works:**
```
Step 1: Process 2 examples, calculate gradients (don't update yet)
Step 2: Process 2 more, accumulate gradients
Step 3: Process 2 more, accumulate gradients
Step 4: Process 2 more, accumulate gradients
NOW: Update the model with accumulated gradients from 8 examples
```
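The equivalence can be checked numerically on a toy problem. This sketch uses a one-parameter model `y_hat = w * x` with a mean-squared-error loss and made-up data; it is not how the trainer is implemented, only a demonstration that averaging four micro-batch gradients matches one gradient over all eight examples. The match is exact here because every micro-batch has the same size.

```python
def mse_grad(w, batch):
    """Gradient of mean squared error for y_hat = w * x, with respect to w."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)

# Made-up (x, y) pairs, roughly following y = 2x
data = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 7.8),
        (5.0, 10.1), (6.0, 12.2), (7.0, 13.8), (8.0, 16.1)]
w = 0.5
micro_batch_size, accumulation_steps = 2, 4

# One big batch of 8 examples
full_grad = mse_grad(w, data)

# Four micro-batches of 2, accumulated before a single update
accumulated = 0.0
for step in range(accumulation_steps):
    start = step * micro_batch_size
    micro = data[start:start + micro_batch_size]
    accumulated += mse_grad(w, micro) / accumulation_steps  # average across micro-batches

print(full_grad, accumulated)
```

Only one micro-batch needs to be held in memory at a time, which is where the saving comes from.
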
**`learning_rate=2e-4` (0.0002)**
- How big of a "step" to take when updating parameters
- Too high → Model explodes 💥 or doesn't learn
- Too low → Training takes forever 🐌
- **Our choice (2e-4):** Standard for LoRA fine-tuning

### Advanced Settings

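**`bf16` (Brain Float 16)**
- Uses 16-bit precision instead of 32-bit for faster training
- "Brain Float" = a 16-bit format that keeps float32's exponent range, which makes it well suited to training
- Saves memory and speeds up computation
- Needs hardware support (for example NVIDIA Ampere or newer GPUs); on other hardware, set it to `False`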

**`max_length=512`**
- Maximum number of tokens (words/subwords) in a training example
- Longer sequences = more context but more memory
- 512 tokens covers short Q&A pairs; longer answers are cut off at the limit

**`packing=False`**
- Could we pack multiple short examples into one sequence?
- No - we want clean question-answer separation

### 📊 What to Expect:
```
Training on 500 examples:
- With NVIDIA GPU: ~10-15 minutes
- With Mac M1/M2: ~30-45 minutes
- With CPU only: ~60-120 minutes
- You'll see loss decreasing (good!) - aim for < 1.0
```


In [None]:
training_args = SFTConfig(
    output_dir="./tinyllama-medical",
    num_train_epochs=1,
    per_device_train_batch_size=2, # Keep low for standard laptops
    gradient_accumulation_steps=4,
    learning_rate=2e-4,
    logging_steps=10,
    bf16=(device == "cuda" and torch.cuda.is_bf16_supported()),  # bfloat16 mixed precision only where the GPU supports it
    dataset_text_field="text",
    max_length=512,
    packing=False,
)

trainer = SFTTrainer(
    model=model,  # Already wrapped with LoRA by get_peft_model above, so no peft_config is passed here
    train_dataset=dataset,
    formatting_func=formatting_func,
    args=training_args,
    processing_class=tokenizer
)

print("Starting Training...")
trainer.train()

### 📈 Understanding the Training Output

While training, you'll see output like this:
```
| Step | Training Loss |
| ---- | ------------- |
| 10   | 2.456         |
| 20   | 1.892         |
| 30   | 1.423         |
| 40   | 1.156         |
```
**What does this mean?**
- **Loss** = How "wrong" the model's predictions are
- Lower loss = Better predictions
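- Starting loss (~2-3) = The pretrained model already predicts text reasonably well but hasn't adapted to our format (pure random guessing would score around 10)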
- Good final loss (~0.8-1.2) = Model is learning the patterns!
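
To put loss values in context, it helps to compare them with the loss of a uniform random guess, which is `ln(vocab_size)`, and to convert loss into perplexity, `exp(loss)`. The sketch below assumes a vocabulary of 32,000 tokens for TinyLlama's tokenizer; check `len(tokenizer)` for the real value.

```python
import math

def random_guess_loss(vocab_size: int) -> float:
    """Cross-entropy (in nats) of a uniform guess over the vocabulary."""
    return math.log(vocab_size)

def perplexity(loss: float) -> float:
    """exp(loss): the effective number of equally likely next tokens."""
    return math.exp(loss)

VOCAB_SIZE = 32_000  # assumed tokenizer size; check len(tokenizer)

print(f"uniform-guess loss: {random_guess_loss(VOCAB_SIZE):.2f}")
for loss in (2.5, 1.0):
    print(f"loss {loss}: perplexity ~ {perplexity(loss):.1f}")
```

A loss of 1.0 means the model is, on average, about as uncertain as if it were choosing between roughly three equally likely tokens.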

**⚠️ Warning Signs:**
- Loss increasing → Learning rate too high
- Loss stuck at same value → Learning rate too low or data too small
- Loss drops to nearly 0 → Model is memorizing (overfitting)

**💡 What's Actually Happening During Training:**
```
For each Q&A pair:
1. Model reads the question
2. Tries to predict the answer word-by-word
3. Compares prediction to actual answer
4. Calculates "how wrong" it was (loss)
5. Updates the LoRA adapters to do better next time
6. Repeat!
```


In [None]:
# Save the adapter (the plugin)
trainer.model.save_pretrained("./tinyllama-medical-adapter")
tokenizer.save_pretrained("./tinyllama-medical-adapter")
print("Adapter saved!")

---

## 6. Testing

In [None]:
# Put model in evaluation mode and re-enable the KV cache for faster generation
model.eval()
model.config.use_cache = True

### 🎲 Understanding Text Generation Parameters

When we ask our model a question, we need to control HOW it generates the answer:

**`max_new_tokens=100`**
- Maximum number of tokens (words/subwords) to generate
- Think of it as "answer length limit"
- 100 tokens ≈ 75-80 words

**`do_sample=True`**
- Should the model be creative or deterministic?
- **True:** Model samples from the probability distribution over next tokens (varied output)
- **False:** Always picks the single most probable token (deterministic, can become repetitive)

**`temperature=0.7`**
- Controls randomness (only matters if do_sample=True)
- Must be greater than 0; values between roughly 0.1 and 2.0 are typical
```
Temperature 0.1 → Very focused, conservative
Temperature 0.7 → Balanced (our choice)
Temperature 1.0 → More creative
Temperature 2.0 → Wild, incoherent
```
**Visual Example:**
```
Question: "What helps a headache?"
Temperature 0.1: "Take aspirin and rest." (boring but safe)
Temperature 0.7: "Try aspirin or ibuprofen. Rest in a dark room may help."
Temperature 1.5: "Aspirin, meditation, cucumber slices, purple thoughts..."
```
**For Medical Applications:**
- Keep temperature low (0.3-0.7) for accuracy
- Higher temperature risks hallucinations (making up facts)
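
Under the hood, temperature divides the model's raw scores (logits) before they are turned into probabilities. The sketch below uses made-up logits for four candidate tokens to show how a low temperature concentrates probability on the top choice while a high temperature flattens the distribution. The function name is hypothetical and this is not the library's internal code.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities after dividing by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be > 0")
    scaled = [value / temperature for value in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for four candidate next tokens
logits = [2.0, 1.0, 0.5, -1.0]

for t in (0.1, 0.7, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(f"T={t}: " + ", ".join(f"{p:.3f}" for p in probs))
```

At `T=0.1` almost all probability sits on the first token, which is why low temperatures behave nearly deterministically.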


In [None]:
def ask(question):
    prompt = f"### Question:\n{question}\n\n### Answer:\n"
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    
    outputs = model.generate(
        **inputs, 
        max_new_tokens=100, 
        do_sample=True, 
        temperature=0.7
    )
    
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))

ask("What are the symptoms of a cold?")

### 🧪 Test Your Fine-Tuned Model

Let's compare the model's performance on different types of questions:

**Test 1: Medical Knowledge (From Training)**
```python
ask("What are the symptoms of diabetes?")
# Expected: Should give medically accurate symptoms
```
**Test 2: General Medical (Not in Training)**
```python
ask("How can I prevent the flu?")
# Expected: Reasonable medical advice, even if not trained on this exact question
```
**Test 3: Off-Topic Question**
```python
ask("What's the capital of France?")
# Interesting: Model might still answer, or might try to frame as medical
```
### 📊 Evaluating Quality

**Good Signs ✅:**
- Answers are medically relevant
- Uses appropriate medical terminology
- Maintains professional tone
- Acknowledges uncertainty when appropriate

**Bad Signs ❌:**
- Makes up fake medical facts
- Uses inappropriate casual language
- Gives dangerous medical advice
- Refuses to answer simple questions

**Remember:** This model is NOT ready for real medical advice! It's a learning exercise.

### 🔬 Advanced: Compare to Base Model

Want to see the difference fine-tuning made? Load the base model and compare:
```python
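# Load base model (not fine-tuned)
base_model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16).to(device)
# Ask the same question to both models
prompt = "### Question:\nWhat are the symptoms of a cold?\n\n### Answer:\n"
inputs = tokenizer(prompt, return_tensors="pt").to(device)
base_outputs = base_model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(base_outputs[0], skip_special_tokens=True))
ask("What are the symptoms of a cold?")  # Fine-tuned model: see the difference!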
```


## 🎯 Student Challenge: Build an Empathetic Mental Health Bot

### The Goal
Medical data is factual and clinical. But what if we want a bot that's warm and comforting?

### Why This Matters
- Tone and style matter in AI applications
- Same model, different fine-tuning = completely different personality
- This demonstrates fine-tuning for BEHAVIOR, not just knowledge

### Your Mission (Step-by-Step)

**STEP 1: Create Your Dataset**
```python
from datasets import Dataset

mental_health_data = [
    {"question": "I feel sad.", "answer": "I'm sorry to hear that. It's okay to feel down sometimes. Do you want to talk about it?"},
    {"question": "I am anxious.", "answer": "Take a deep breath. Anxiety is tough, but you are not alone. Let's focus on the present moment."},
    {"question": "Nobody likes me.", "answer": "That must be a painful thought. I care about you, and I'm sure others do too, even if it's hard to see right now."},
    {"question": "I can't sleep.", "answer": "Trouble sleeping is really difficult. Have you tried any relaxation techniques? I'm here to help."},
    {"question": "I feel overwhelmed.", "answer": "It's completely understandable to feel that way. Let's break things down together, one step at a time."}
]

mh_dataset = Dataset.from_list(mental_health_data)
print(f"Created dataset with {len(mh_dataset)} examples")
```
**STEP 2: Modify the Formatting Function**

The key difference: Add a system message to set the tone!
```python
def empathetic_format(example):
    text = f"""You are a caring, empathetic friend who listens without judgment.

### Question:
{example['question']}

### Answer:
{example['answer']}"""
    return text

# Test it:
print(empathetic_format(mh_dataset[0]))
```
**STEP 3: Train the Model**
```python
# Same LoRA config as before

empathetic_training_args = SFTConfig(
    output_dir="./tinyllama-empathetic",
    num_train_epochs=3, # Note: More epochs for small dataset!
    per_device_train_batch_size=1, # Small batch: the dataset has only 5 examples
    gradient_accumulation_steps=4,
    learning_rate=2e-4,
    logging_steps=5,
    bf16=(device == "cuda" and torch.cuda.is_bf16_supported()), # bfloat16 only where the GPU supports it
    dataset_text_field="text",
    max_length=512,
    packing=False,
)

empathetic_trainer = SFTTrainer(
    model=model, # Start from your medical model (already LoRA-wrapped); if you reload the base model, also pass peft_config=lora_config
    train_dataset=mh_dataset,
    formatting_func=empathetic_format,
    args=empathetic_training_args,
    processing_class=tokenizer
)

print("Training empathetic model...")
empathetic_trainer.train()
```
**STEP 4: Test Your Creation**
```python
def ask_empathetic(question):
    prompt = f"""You are a caring, empathetic friend who listens without judgment.

### Question:
{question}

### Answer:
"""
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(
        **inputs,
        max_new_tokens=150, # Longer for empathetic responses
        do_sample=True,
        temperature=0.8 # Slightly higher for warmth
    )
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))

# Try it!
ask_empathetic("I'm worried about my exams.")
ask_empathetic("I had a bad day.")
```
### 🤔 Discussion Questions

1. **How does the tone differ from the medical model?**
   - Medical: factual, clinical
   - Empathetic: warm, validating

2. **Could we combine both?**
   - Possibly! Train on both datasets
   - The model could pick up medical content AND an empathetic tone, though this needs careful evaluation

3. **What are the risks?**
   - AI should not replace professional mental health care
   - Could give harmful advice if not carefully supervised
   - Important to include disclaimers

### 🌟 Extension Ideas

1. **Add more training data:**
   - Find mental health conversation datasets
   - Create 100+ examples for better quality

2. **Add safety guardrails:**
   - Train model to suggest professional help for serious issues
   - Add crisis hotline information to responses

3. **Combine with RAG:**
   - Fine-tune for empathetic tone
   - Use RAG to pull from verified mental health resources

4. **Multi-turn conversations:**
   - Most mental health chats are multi-turn
   - Challenge: Modify the format to include conversation history


## 🔧 Troubleshooting Common Issues

### Issue 1: "CUDA Out of Memory"

**Problem:** GPU runs out of memory (VRAM)

**Solutions:**
```python
# Option A: Reduce batch size
per_device_train_batch_size=1 # Down from 2

# Option B: Reduce LoRA rank
r=8 # Down from 16

# Option C: Reduce max_length
max_length=256 # Down from 512

# Option D: Enable gradient checkpointing
model.gradient_checkpointing_enable()
```
### Issue 2: Model Outputs Gibberish

**Problem:** Loss decreased but outputs are nonsensical

**Likely Causes:**
- Learning rate too high
- Trained for too many epochs (overfit)
- Bad data formatting

**Solutions:**
```python
# Lower learning rate
learning_rate=1e-4 # Down from 2e-4

# Fewer epochs
num_train_epochs=1 # Down from 3

# Check your formatting function output manually
print(formatting_func(dataset[0]))
```
### Issue 3: Model Just Repeats the Question

**Problem:** Output looks like:
```
### Question:
What is diabetes?
### Answer:
What is diabetes?
```
**Cause:** Model hasn't learned where answers should go

**Solutions:**
- Check formatting function includes "### Answer:" marker
- Train for more steps
- Increase dataset size
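
One quick sanity check is to confirm that the prompt used at inference time is an exact prefix of the text used for training. The sketch below mirrors the two templates from this notebook in standalone helpers; the helper names are hypothetical and the sample question and answer are made up.

```python
def training_text(question: str, answer: str) -> str:
    """Mirror of the notebook's training template."""
    return f"### Question:\n{question}\n\n### Answer:\n{answer}"

def inference_prompt(question: str) -> str:
    """Mirror of the prompt built inside ask()."""
    return f"### Question:\n{question}\n\n### Answer:\n"

def check_prompt_consistency(question: str, answer: str) -> bool:
    """True if the inference prompt is an exact prefix of the training text."""
    return training_text(question, answer).startswith(inference_prompt(question))

print(check_prompt_consistency("What is diabetes?", "Diabetes is a chronic condition..."))
```

If this prints `False` after you change either template, the model is being prompted with a pattern it never saw during training, which can produce echoing or off-format output.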

### Issue 4: Training is Extremely Slow

**Speed Benchmarks:**
- **GPU (NVIDIA):** 10-15 minutes for 500 examples
- **Mac M1/M2:** 30-45 minutes
- **CPU Only:** 60-120 minutes

**Speed-up Tips:**
```python
# 1. Reduce number of examples for testing
dataset = dataset.select(range(100)) # Just 100 examples

# 2. Increase batch size (if memory allows)
per_device_train_batch_size=4

# 3. Reduce gradient accumulation steps
gradient_accumulation_steps=2

# 4. Log less often
logging_steps=50 # Log every 50 steps instead of every 10
```
### Issue 5: "ImportError: bitsandbytes not found"

**On Mac:** This is expected! Use this instead:
```python
# For Mac, use this loading code:
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    trust_remote_code=True
)
model.to("mps") # Move to Mac GPU
```
### Issue 6: Model Refuses to Follow Fine-Tuning


---

## Key Takeaways
1.  **LoRA** lets us fine-tune much faster and with far less memory by freezing the base model and training only small adapter matrices.
2.  **Quantization** is great for NVIDIA GPUs, but smaller models (1B) run fine on Mac/CPU in `float16`.
3.  **Data Formatting** is critical in teaching the model *how* to speak (e.g., Q&A format).