# How to Fine-Tune LLMs: The No-BS Guide

### *From "What's fine-tuning?" to "I just trained my own AI" in one notebook*

---

## Wait, What Even IS Fine-Tuning?

Okay so imagine this...

You just hired the **smartest person alive**. We're talking genius-level IQ. They've literally read:
- Every book ever written
- All of Wikipedia (yes, even the weird articles about obscure 18th-century philosophers)
- Every Stack Overflow answer
- Billions of random blog posts, tweets, and Reddit comments

They know **EVERYTHING**. Grammar? Perfect. Facts? Encyclopedia brain. Writing? Shakespeare who?

### But here's the problem...

This genius has **NO CLUE** about YOUR specific stuff:

| What they know | What they DON'T know |
|----------------|---------------------|
| General English | Your company's slang |
| Random facts | Your product details |
| How to write | YOUR writing style |
| Generic advice | Your domain expertise |

Ask them to answer mental health questions like a therapist? They'll give you a generic Wikipedia response. 

Ask them to sound like YOUR customer support team? They'll sound like a robot reading a manual.

**That's exactly what a pre-trained LLM is like.**

Smart? Absolutely. Useful for YOUR specific task on day one? Not really.

---

## So What's Fine-Tuning Then?

**Fine-tuning** = Giving this genius a crash course in YOUR specific stuff.

Think of it like onboarding a new employee:
- Day 1: "Here's how WE talk to customers"
- Day 2: "Here's what WE know about mental health"
- Day 3: "Here's OUR style, OUR vibe, OUR domain"

After onboarding, that genius now sounds like they've worked at YOUR company for years.

**That's fine-tuning in a nutshell.**

### Before vs After Fine-Tuning:

**BEFORE** (vanilla pre-trained model):
```
User: "I'm feeling anxious"

AI: "Anxiety is a feeling of worry, nervousness, or unease, 
     typically about an imminent event or something with an 
     uncertain outcome. It is considered a normal reaction to 
     stress and can sometimes be beneficial in certain situations."
     
     ...thanks Wikipedia, very helpful (not)
```

**AFTER** (our fine-tuned model):
```
User: "I'm feeling anxious"  

AI: "I hear you, and I'm sorry you're going through that. 
     Anxiety can feel really overwhelming sometimes. Would you 
     like to talk about what's on your mind? I'm here to listen 
     and help however I can."
     
     ...NOW we're talking! Empathetic, supportive, actually useful.
```

See the difference? Same base intelligence, but now it knows HOW to respond for our specific use case.

---

## What You'll Actually Learn (Not Just Copy-Paste)

By the end of this notebook, you'll genuinely understand:

### 1. The "What" and "Why" Behind Every Line

Most tutorials be like: "Just run this code, trust me bro"

And then when something breaks, you're Googling for 3 hours.

**Not this notebook.** 

We explain EVERYTHING:
- What each line does
- WHY it's there
- What happens if you change it
- How to debug when things inevitably go wrong

You won't just follow along. You'll actually UNDERSTAND.

### 2. The Cheat Codes: LoRA + Quantization

Here's the dirty secret about fine-tuning:

**The Old Way (Full Fine-Tuning):**
```
Fine-tuning a 7B parameter model:

Memory needed:     ~112 GB of GPU RAM
Hardware required: Multiple A100 80GB GPUs ($10,000+ each)
Your budget:       Crying
```

**The Cheat Code (LoRA + 4-bit Quantization):**
```
Same 7B model, comparable results:

Memory needed:     ~4-8 GB of GPU RAM
Hardware required: Free Colab/Kaggle T4
Your budget:       Zero dollars
```

That's not a typo. Comparable results on many tasks, at roughly **1/20th the memory.**

- **LoRA** = Only train a small slice of the model, typically a few percent of the parameters or less (the rest stays frozen)
- **Quantization** = Compress the weights from 32-bit to 4-bit (8x smaller)

Combined = You can fine-tune models that "shouldn't" fit on your GPU.

This is why fine-tuning went from "big tech only" to "anyone with a laptop" in like 2 years.
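
To see where those numbers come from, here's a back-of-the-envelope sketch. It assumes float32 weights, an Adam-style optimizer that keeps two extra values per weight, and decimal gigabytes. The layer size and rank in the LoRA part are made-up illustration values, not TinyLlama's real shapes. Activations and framework overhead are ignored, so real usage will be higher.

```python
# Rough memory and parameter arithmetic (estimates only, not measurements).

PARAMS = 7_000_000_000  # a "7B" model


def to_gb(num_bytes: float) -> float:
    """Convert bytes to decimal gigabytes."""
    return num_bytes / 1e9


# Full fine-tuning in float32: weights + gradients + two Adam moment buffers,
# each the same size as the weights (4 bytes per value).
fp32_weights_gb = to_gb(PARAMS * 4)
full_finetune_gb = fp32_weights_gb * 4

# 4-bit quantization: half a byte per weight (ignoring bookkeeping overhead).
int4_weights_gb = to_gb(PARAMS * 0.5)

# LoRA on one HYPOTHETICAL 4096 x 4096 linear layer with rank r = 16:
# instead of training the full d_out x d_in matrix, train two thin matrices
# A (r x d_in) and B (d_out x r).
d_in, d_out, r = 4096, 4096, 16
full_layer_params = d_in * d_out
lora_layer_params = r * d_in + d_out * r
lora_fraction = lora_layer_params / full_layer_params

print(f"float32 weights:     {fp32_weights_gb:.1f} GB")
print(f"full fine-tuning:    {full_finetune_gb:.1f} GB")
print(f"4-bit weights:       {int4_weights_gb:.1f} GB")
print(f"LoRA share of layer: {lora_fraction:.2%}")
```

The gap between 3.5 GB of 4-bit weights and the ~4-8 GB quoted above is the stuff this sketch leaves out: adapter weights, their gradients and optimizer states, and activations.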

### 3. Train a REAL Model on REAL Data

This isn't a toy example with 10 fake sentences.

We're using:
- **Real Model:** TinyLlama 1.1B (that's 1,100,000,000 parameters)
- **Real Data:** Mental health conversational dataset from Kaggle
- **Real Result:** A working chatbot that gives empathetic mental health responses

When you finish, you'll have an actual fine-tuned model that you built yourself.

### 4. Not Crash Your GPU (The Memory Management Survival Guide)

If you've done any ML, you know this error:
```
RuntimeError: CUDA out of memory. Tried to allocate 2.00 GiB. 
GPU 0 has a total capacity of 15.78 GiB of which 1.24 GiB is free.
```

The training crashes. Your progress is lost. You question your career choices.

**We'll teach you ALL the tricks:**
- 4-bit quantization (8x smaller model)
- Gradient checkpointing (trade speed for memory)
- LoRA (only train a few percent of the parameters)
- Gradient accumulation (fake large batches)
- Proper batch sizing

Your GPU will survive. Your training will complete. No tears.
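
"Fake large batches" sounds like cheating, so here's a tiny pure-Python sketch of why gradient accumulation works. It uses a made-up one-weight model (`y = w * x`) and four made-up data points; nothing here comes from the notebook's real training loop. The point is only that averaging the gradients of equal-sized micro-batches gives the same gradient as one big batch.

```python
# Gradient accumulation on a toy model: prediction = w * x, loss = mean squared error.
# The data and starting weight are illustrative assumptions.

def grad_mse(w: float, batch: list) -> float:
    """Gradient of mean((w * x - y) ** 2) with respect to w over one batch."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)


data = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9), (4.0, 8.2)]
w = 0.5

# What we WANT: the gradient over all 4 examples at once.
full_batch_grad = grad_mse(w, data)

# What the GPU can afford: 2 examples at a time, accumulated over 2 steps.
micro_batch_size = 2
accum_steps = len(data) // micro_batch_size

accumulated_grad = 0.0
for step in range(accum_steps):
    start = step * micro_batch_size
    micro_batch = data[start:start + micro_batch_size]
    # Divide by accum_steps so the sum of micro-batch means is the overall mean.
    accumulated_grad += grad_mse(w, micro_batch) / accum_steps

effective_batch_size = micro_batch_size * accum_steps

print(f"full-batch gradient:  {full_batch_grad:.6f}")
print(f"accumulated gradient: {accumulated_grad:.6f}")
print(f"effective batch size: {effective_batch_size}")
```

Memory use is set by the micro-batch, while the optimizer step behaves as if the batch were `micro_batch_size * accum_steps`. The price is more forward/backward passes per update.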

---

## The Roadmap

Here's what we're doing, step by step:

| Chapter | What We Do | Why It Matters |
|---------|-----------|----------------|
| 1-3 | Meet the data, set up, load from Kaggle | Good data = good model. Garbage in = garbage out. |
| 4 | Quick theory on LoRA and Quantization | Know WHY the tricks work, not just that they work |
| 5-6 | Load model + Apply LoRA | The magic memory-saving setup |
| 7 | Format the data **(CRITICAL!)** | The #1 cause of "my model outputs garbage" |
| 8-9 | Configure training + Actually train | Where the learning happens |
| 10 | Test the model | Watch your creation come to life |
| 11 | Save and load for later | So you don't lose your work |

---

## The One Thing You MUST Remember

Before we write a single line of code, burn this into your brain:

### THE GOLDEN RULE OF FINE-TUNING
```
+------------------------------------------------------------------+
|                                                                  |
|   Your INFERENCE format must EXACTLY match your TRAINING format  |
|                                                                  |
|   - Miss one </s> token?        --> Garbage output               |
|   - Wrong special tokens?       --> Garbage output               |
|   - Extra whitespace?           --> Probably garbage output      |
|   - Different prompt structure? --> Definitely garbage output    |
|                                                                  |
|   FORMAT. MUST. MATCH.                                           |
|                                                                  |
+------------------------------------------------------------------+
```

I'm telling you this now because it will save you HOURS of debugging later. 

Every week on Reddit/Discord/Twitter, someone posts:
> "My fine-tuned model outputs random HTML tags and nonsense, what's wrong?"

And the answer is almost always: "Your inference format doesn't match your training format."

Don't be that person. We'll show you the correct format and explain exactly why it matters.
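
One low-tech way to keep the two formats in sync is to build both from the same function. The sketch below is only an illustration: the `<|user|>` / `<|assistant|>` markers and the helper names are assumptions made up for this example, and the real template for this notebook is the one defined in the formatting chapter.

```python
# Sketch: one function owns the prompt layout, training and inference both call it.
# The template below is HYPOTHETICAL, use whatever your training data actually uses.

EOS_TOKEN = "</s>"  # assumed end-of-sequence marker


def build_prompt(question: str) -> str:
    """Everything the model sees BEFORE it starts answering."""
    return f"<|user|>\n{question}{EOS_TOKEN}\n<|assistant|>\n"


def build_training_text(question: str, answer: str) -> str:
    """Training example = the inference prompt + the answer + end-of-sequence."""
    return build_prompt(question) + answer + EOS_TOKEN


training_text = build_training_text("I'm feeling anxious", "I hear you.")
inference_prompt = build_prompt("I'm feeling anxious")

# Both paths go through build_prompt(), so the training text always starts
# with exactly the string we send at inference time.
print(training_text.startswith(inference_prompt))
```

If you ever change the template, you change it in one place and both sides move together.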

---

## Let's Do This

Here's what you'll have by the end:

- [x] A working fine-tuned mental health chatbot
- [x] Understanding of every single line of code
- [x] The ability to adapt this to YOUR own dataset
- [x] Knowledge of LoRA, quantization, and memory management
- [x] Confidence to fine-tune other models in the future

Alright, enough talk. Let's build something cool.

## Chapter 1: The Data

We're using **Mental Health Conversational Data** from Kaggle - Q&A pairs about mental health.

**Why this dataset?**
- It's conversational (Q&A format) - perfect for teaching a chatbot
- It's domain-specific - we can see if the model actually learned something
- It's small enough to train on free Colab GPUs

Dataset: https://www.kaggle.com/datasets/elvis23/mental-health-conversational-data

## Chapter 2: Setup

Installing all the tools we need.

In [1]:
# INSTALLATION
# transformers - THE library for LLMs from Hugging Face
# datasets - Easy data loading
# peft - Parameter-Efficient Fine-Tuning (LoRA lives here)
# trl - Has SFTTrainer for easy fine-tuning
# bitsandbytes - 4-bit quantization magic
# accelerate - Makes training go brrr

!pip install -q transformers datasets peft trl bitsandbytes accelerate
!pip install -q scipy kagglehub

[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m518.9/518.9 kB[0m [31m14.7 MB/s[0m eta [36m0:00:00[0m00:01[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m59.1/59.1 MB[0m [31m33.9 MB/s[0m eta [36m0:00:00[0m:00:01[0m00:01[0m
[?25h

In [2]:
# IMPORTS
import torch
# torch = PyTorch
# 
# Think of PyTorch as the ELECTRICITY that powers everything
# Every single AI calculation happens through PyTorch:
#   - Matrix multiplications? PyTorch
#   - GPU computations? PyTorch  
#   - Gradients for learning? PyTorch
#   - Tensors (fancy arrays)? PyTorch
#
# Without PyTorch, nothing works. It's the foundation.
# TensorFlow and JAX are alternatives, but most open LLM tooling targets PyTorch
#
# WHY WE NEED IT: Everything else (transformers, peft, trl) is built ON TOP of PyTorch

from transformers import (
    AutoModelForCausalLM,
    # Let's break this name down:
    #   Auto = "Hey library, figure out the architecture yourself"
    #          You just say "TinyLlama" and it knows it's a LlamaForCausalLM
    #          You say "GPT2" and it knows it's a GPT2LMHeadModel
    #          No need to memorize 100 different class names
    #
    #   Model = The actual neural network with billions of numbers (weights)
    #           This IS the AI. The weights store all the knowledge.
    #
    #   ForCausalLM = "For Causal Language Modeling"
    #                 Causal = can only look at PAST tokens, not future
    #                 Language Modeling = predicting the next word
    #                 This is how ChatGPT works: predict next token, add it, repeat
    #
    # WHY WE NEED IT: To load the pre-trained TinyLlama model

    AutoTokenizer,
    # Models don't understand "Hello how are you"
    # They understand token IDs like [15496, 703, 527, 345]
    # (illustrative numbers - the real IDs depend on the tokenizer)
    #
    # Tokenizer does TWO things:
    #   1. Text -> Numbers (encoding): "Hello" -> [15496]
    #   2. Numbers -> Text (decoding): [15496] -> "Hello"
    #
    # Every model has its OWN tokenizer with its OWN vocabulary
    # TinyLlama's tokenizer is different from GPT-4's tokenizer
    # "Auto" means it automatically picks the right one for your model
    #
    # WHY WE NEED IT: To convert our text data into numbers the model understands

    BitsAndBytesConfig,
    # This is the MEMORY CHEAT CODE config
    #
    # Normal model: each weight = 32 bits (float32) = 4 bytes
    #   7 billion weights * 4 bytes = 28 GB just to LOAD the model
    #   Full training needs ~4x that (weights + gradients + 2 Adam states)
    #   = ~112 GB (nobody has this)
    #
    # With BitsAndBytes: each weight = 4 bits = 0.5 bytes
    #   7 billion weights * 0.5 bytes = 3.5 GB
    #   That fits on a free Colab GPU!
    #
    # BitsAndBytesConfig = the settings for HOW to compress
    #   - Which 4-bit format? (nf4 is the usual choice)
    #   - What precision for math? (bfloat16 if your GPU supports it,
    #     otherwise float16 - older cards like T4/P100 lack native bfloat16)
    #   - Double quantization? (yes, saves more memory)
    #
    # WHY WE NEED IT: To run big models on small GPUs

    TrainingArguments,
    # A container for all training settings
    # Learning rate, batch size, epochs, etc.
    #
    # ACTUALLY we don't use this one - SFTConfig (which builds on
    # TrainingArguments) replaces it
    # This import is unnecessary but doesn't hurt
    #
    # WHY IT'S HERE: Probably copy-pasted from old code lol
)

from peft import (
    # PEFT = Parameter-Efficient Fine-Tuning
    # The library that makes LoRA possible
    # Made by HuggingFace
    
    LoraConfig,
    # The settings for LoRA adapters:
    #   - r = rank (size of adapters, 8-64 typically)
    #   - lora_alpha = scaling factor
    #   - lora_dropout = regularization
    #   - target_modules = which layers get adapters
    #
    # Think of it as the blueprint for the adapters
    #
    # WHY WE NEED IT: To tell LoRA how to set up the adapters

    get_peft_model,
    # This function does the magic transformation:
    #   INPUT: normal model (no adapters yet)
    #   OUTPUT: same model with base weights frozen + LoRA adapters attached (trainable)
    #
    # It wraps your model and injects tiny trainable matrices
    # The original weights stay frozen (unchanged)
    # Only the new small matrices get trained
    #
    # WHY WE NEED IT: To actually ADD LoRA to our model

    prepare_model_for_kbit_training,
    # Quantized (4-bit) models are weird and need special prep
    #
    # This function does behind-the-scenes stuff:
    #   1. Enables gradient checkpointing (memory trick)
    #   2. Casts layer norms to float32 (stability)
    #   3. Prepares embeddings for training
    #   4. Handles other quantization edge cases
    #
    # Without this, training a 4-bit model would crash or give garbage
    #
    # WHY WE NEED IT: To make quantized models trainable

    TaskType,
    # Tells LoRA what kind of task you're doing
    # Different tasks need slightly different setups
    #
    # Options:
    #   CAUSAL_LM = next token prediction (us, ChatGPT-style)
    #   SEQ_2_SEQ_LM = translation, summarization
    #   SEQ_CLS = classification (sentiment, spam detection)
    #   TOKEN_CLS = NER (finding names, places in text)
    #   QUESTION_ANS = extractive QA
    #
    # WHY WE NEED IT: To tell LoRA we're doing text generation
)

from trl import SFTTrainer, SFTConfig
# TRL = Transformer Reinforcement Learning library
# But we're using it for SFT (Supervised Fine-Tuning), not RL
#
# SFTTrainer = The coach that runs training
#   - Loads batches of data
#   - Runs forward pass (model makes predictions)
#   - Calculates loss (how wrong was it?)
#   - Runs backward pass (calculate gradients)
#   - Updates weights (optimizer step)
#   - Logs metrics, saves checkpoints
#   - ALL AUTOMATIC - we just call trainer.train()
#
# SFTConfig = All the training hyperparameters
#   - How many epochs?
#   - What batch size?
#   - What learning rate?
#   - When to save?
#   - etc.
#
# WHY WE NEED IT: So we don't write 200 lines of training loop code

from datasets import Dataset
# HuggingFace's data container class
#
# Why not just use a Python list?
#   - Dataset is optimized for large data (memory-mapped)
#   - Has .map() for applying functions to all examples
#   - Has .train_test_split() for splitting data
#   - Works seamlessly with HuggingFace trainers
#   - Can shuffle, batch, filter easily
#
# WHY WE NEED IT: Trainer expects data in this format

import os
# Built-in Python library for operating system stuff
#   - os.makedirs() = create folders
#   - os.path.join() = combine paths
#   - os.listdir() = list files in folder
#
# WHY WE NEED IT: To create output directories for saving models

import random
# Built-in Python library for randomness
#   - random.choice() = pick random item
#   - random.shuffle() = shuffle a list
#   - random.seed() = reproducibility
#
# WHY WE NEED IT: Actually not used in this notebook lol
#                 But good habit to import for ML work

import pandas as pd
# THE data manipulation library
# Think Excel but in Python
#
# We use it for:
#   - pd.read_json() = load the intents.json file
#   - Iterating through rows with .iterrows()
#
# Could we do this without pandas? Yes
# Is pandas easier? Way easier
#
# WHY WE NEED IT: To load and process the JSON dataset

# ============================================================
# CHECKING OUR SETUP
# ============================================================

print("All imports successful!")
# If any import failed, we'd get an error before this
# Seeing this message = everything installed correctly

print(f"PyTorch: {torch.__version__}")
# Shows PyTorch version like "2.8.0+cu126"
#   - 2.8.0 = PyTorch version
#   - cu126 = CUDA 12.6 (GPU support)
# Good for debugging ("it worked on version X")

print(f"CUDA available: {torch.cuda.is_available()}")
# CUDA = NVIDIA's GPU computing platform
#
# True = You have an NVIDIA GPU AND PyTorch can use it
#        Training will be fast (minutes)
#
# False = No GPU, CPU only
#         Training will be SLOW (hours)
#         Go to Runtime > Change runtime type > GPU

if torch.cuda.is_available():
    # Only run this if we have a GPU
    
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    # Shows GPU name like "Tesla P100-PCIE-16GB"
    #   - Tesla P100 = the GPU model
    #   - 16GB = VRAM amount
    #
    # The 0 means "first GPU" (index 0)
    # Most people only have 1 GPU so it's always 0
    #
    # Common free GPUs:
    #   - Tesla T4: 16GB, good
    #   - Tesla P100: 16GB, good
    #   - Tesla K80: 12GB, older but works
    
    print(f"Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.2f} GB")
    # Shows VRAM in GB
    #
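    # .total_memory = bytes of VRAM
    # / 1e9 = divide by 1 billion = convert to decimal GB
    #   (a "16GB" card is sized in GiB, 2**30 bytes each,
    #    so the number printed here comes out around 17)
    # :.2f = format as decimal with 2 places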
    #
    # More VRAM = can train bigger models, use bigger batches
    # 16GB is plenty for TinyLlama with 4-bit quantization

All imports successful!
PyTorch: 2.8.0+cu126
CUDA available: True
GPU: Tesla P100-PCIE-16GB
Memory: 17.06 GB
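
The tokenizer comments in the imports cell describe two directions: text to numbers and numbers back to text. Here's a toy version of that round trip. It is a word-level tokenizer with a five-entry vocabulary invented for this example. Real tokenizers like TinyLlama's split text into subword pieces and have tens of thousands of entries, so treat this as a mental model, not as how `AutoTokenizer` is implemented.

```python
# Toy word-level tokenizer (hypothetical vocabulary, for illustration only).

vocab = {"<unk>": 0, "hello": 1, "how": 2, "are": 3, "you": 4}
id_to_token = {token_id: token for token, token_id in vocab.items()}


def encode(text: str) -> list:
    """Text -> numbers. Unknown words map to the <unk> ID."""
    return [vocab.get(word, vocab["<unk>"]) for word in text.lower().split()]


def decode(token_ids: list) -> str:
    """Numbers -> text."""
    return " ".join(id_to_token[token_id] for token_id in token_ids)


ids = encode("Hello how are you")
print(ids)                      # [1, 2, 3, 4]
print(decode(ids))              # hello how are you
print(encode("Hello friend"))   # [1, 0]  ("friend" is not in the vocabulary)
```

Notice that the IDs only mean something together with the vocabulary that produced them, which is why every model ships with its own tokenizer.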


## Chapter 3: Loading the Data

In [3]:
import kagglehub
# kagglehub = Kaggle's official library for downloading datasets
#
# Kaggle is like GitHub but for datasets and ML competitions
# They host thousands of free datasets
# Instead of manually downloading + uploading to Colab, kagglehub does it automatically
#
# Alternative methods (old way):
#   1. Download ZIP from kaggle.com manually
#   2. Upload to Colab/Kaggle notebook
#   3. Unzip it
#   4. Find the path
#   ... super annoying
#
# kagglehub way:
#   1. One line of code
#   ... that's it
#
# WHY WE NEED IT: To grab the dataset without manual downloading BS

path = kagglehub.dataset_download("elvis23/mental-health-conversational-data")
# This one line does A LOT:
#   1. Connects to Kaggle's servers
#   2. Downloads the dataset ZIP file
#   3. Extracts it to a local folder
#   4. Returns the path to that folder
#
# The string "elvis23/mental-health-conversational-data" is the dataset ID
#   - "elvis23" = the username who uploaded it
#   - "mental-health-conversational-data" = the dataset name
#   - You can find this in the Kaggle URL:
#     kaggle.com/datasets/elvis23/mental-health-conversational-data
#                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
#                        this part is the ID
#
# path = where the files got saved
#   - On Kaggle: "/kaggle/input/mental-health-conversational-data"
#   - On Colab: "/root/.cache/kagglehub/datasets/..."
#
# WHY WE NEED IT: To actually get the data onto our machine

print("Dataset path:", path)
# Shows where the dataset was saved
#
# Output looks like:
#   "Dataset path: /kaggle/input/mental-health-conversational-data"
#
# We need this path to load the files in the next step
# If something went wrong, we'd see an error here instead
#
# WHY WE NEED IT: To confirm download worked + know where files are

print("Files:", os.listdir(path))
# os.listdir(path) = list all files in that folder
#
# Output looks like:
#   "Files: ['intents.json']"
#
# This tells us:
#   - Download worked (folder exists)
#   - What files we have to work with
#   - The filename we need to load (intents.json)
#
# Some datasets have multiple files (train.csv, test.csv, etc.)
# This one just has intents.json
#
# WHY WE NEED IT: To see what files we're working with
#                 So we know what to load next

Dataset path: /kaggle/input/mental-health-conversational-data
Files: ['intents.json']


In [4]:
# Load the JSON file
dataset = pd.read_json(os.path.join(path, 'intents.json'))
# Let's break this down from the INSIDE OUT:
#
# STEP 1: os.path.join(path, 'intents.json')
#   - path = "/kaggle/input/mental-health-conversational-data" (from previous cell)
#   - 'intents.json' = the filename we saw in os.listdir()
#   - os.path.join() = combines them into a full path
#   - Result: "/kaggle/input/mental-health-conversational-data/intents.json"
#
#   Why not just use path + '/intents.json'?
#   - os.path.join() handles slashes correctly on ANY operating system
#   - Windows uses backslashes: C:\folder\file
#   - Mac/Linux use forward slashes: /folder/file
#   - os.path.join() figures it out automatically
#   - It's a good habit even if you're always on Linux
#
# STEP 2: pd.read_json(...)
#   - pd = pandas (we imported it as pd)
#   - read_json() = reads a JSON file into a DataFrame
#   - DataFrame = pandas's table structure (rows and columns, like Excel)
#
#   JSON looks like this:
#   {
#     "intents": [
#       {"tag": "greeting", "patterns": ["Hi", "Hey"], "responses": ["Hello!"]},
#       {"tag": "goodbye", "patterns": ["Bye"], "responses": ["See ya!"]}
#     ]
#   }
#
#   pandas converts it to a table:
#   | intents                                                    |
#   |------------------------------------------------------------|
#   | {'tag': 'greeting', 'patterns': ['Hi'...], 'responses':... |
#   | {'tag': 'goodbye', 'patterns': ['Bye'...], 'responses':... |
#
# STEP 3: dataset = ...
#   - Stores the DataFrame in a variable called 'dataset'
#   - Now we can work with the data
#
# WHY WE NEED IT: To get the raw data into Python so we can process it

print(f"Loaded {len(dataset)} intent categories")
# len(dataset) = number of rows in the DataFrame
#
# Our dataset has 80 rows, each row is an "intent category"
# Intent categories are like topics:
#   - greeting (Hi, Hey, Hello)
#   - anxiety (I feel anxious, I'm worried)
#   - depression (I feel sad, I'm depressed)
#   - etc.
#
# Output: "Loaded 80 intent categories"
#
# This confirms:
#   - File loaded successfully (no error)
#   - We have 80 different conversation topics
#
# WHY WE NEED IT: Sanity check - make sure data loaded correctly

print(dataset.head())
# .head() = show first 5 rows of the DataFrame
#
# Output looks like:
#                                              intents
# 0  {'tag': 'greeting', 'patterns': ['Hi', 'Hey'...
# 1  {'tag': 'morning', 'patterns': ['Good morning...
# 2  {'tag': 'afternoon', 'patterns': ['Good after...
# 3  {'tag': 'evening', 'patterns': ['Good evening...
# 4  {'tag': 'night', 'patterns': ['Good night'],...
#
# Each row has ONE column called 'intents'
# Inside that column is a dictionary with:
#   - 'tag': the category name ("greeting")
#   - 'patterns': list of user questions ["Hi", "Hey", "Hello"]
#   - 'responses': list of bot answers ["Hello there!", "Hi!"]
#
# .head() options:
#   - .head() = first 5 rows (default)
#   - .head(10) = first 10 rows
#   - .tail() = last 5 rows
#   - .sample(5) = random 5 rows
#
# WHY WE NEED IT: To peek at the data structure
#                 So we know how to extract Q&A pairs in the next step

Loaded 80 intent categories
                                             intents
0  {'tag': 'greeting', 'patterns': ['Hi', 'Hey', ...
1  {'tag': 'morning', 'patterns': ['Good morning'...
2  {'tag': 'afternoon', 'patterns': ['Good aftern...
3  {'tag': 'evening', 'patterns': ['Good evening'...
4  {'tag': 'night', 'patterns': ['Good night'], '...
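
The imports cell says we could do this without pandas. For the curious, here's roughly what that looks like with only the standard library. The JSON string is the two-intent sample from the comments above, inlined so the snippet runs anywhere; for the real file you'd read `intents.json` from disk instead.

```python
import json

# Inline sample with the same shape as intents.json (two intents only).
raw_json = """
{
  "intents": [
    {"tag": "greeting", "patterns": ["Hi", "Hey"], "responses": ["Hello!"]},
    {"tag": "goodbye", "patterns": ["Bye"], "responses": ["See ya!"]}
  ]
}
"""

# json.loads gives a plain dict; the list of intents lives under the "intents" key.
intents = json.loads(raw_json)["intents"]

tags = [intent["tag"] for intent in intents]

print(f"Loaded {len(intents)} intent categories")
print(tags)
```

The result is a plain list of dictionaries, the same dictionaries pandas stores one per row in the `intents` column.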


In [5]:
# EXPAND INTO Q&A PAIRS
# Each pattern (question) gets paired with EACH response (answer)
# This gives us MORE training data!
#
# THE PROBLEM:
#   Raw data looks like:
#   {
#     "patterns": ["Hi", "Hey", "Hello"],      <- 3 questions
#     "responses": ["Hello!", "Hi there!"]     <- 2 answers
#   }
#
#   But for training, we need INDIVIDUAL pairs:
#   Q: Hi        A: Hello!
#   Q: Hi        A: Hi there!
#   Q: Hey       A: Hello!
#   Q: Hey       A: Hi there!
#   Q: Hello     A: Hello!
#   Q: Hello     A: Hi there!
#
#   3 questions x 2 answers = 6 training examples from 1 intent!
#
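# WHY THIS MATTERS:
#   - More varied training examples usually help the model
#   - Original: 80 intents
#   - After expansion: 661 Q&A pairs
#   - That's about 8x more training examples
#   - Caveat: the same question now maps to several different answers,
#     so the model learns a spread of replies, not one fixed answer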

qa_pairs = []
# Creating an empty list to store our Q&A pairs
# We'll fill it up with dictionaries like:
#   {'question': 'Hi', 'answer': 'Hello there!'}
#   {'question': 'Hey', 'answer': 'Hello there!'}
#   ... etc
#
# WHY A LIST: Easy to append to, easy to convert to Dataset later

for idx, row in dataset.iterrows():
    # .iterrows() = loop through DataFrame row by row
    #
    # Each iteration gives us:
    #   idx = row number (0, 1, 2, ... 79)
    #   row = the actual row data (a pandas Series)
    #
    # It's like doing:
    #   for i in range(len(dataset)):
    #       row = dataset.iloc[i]
    # But cleaner
    #
    # We don't actually use idx, but iterrows() always returns both
    # Some people write: for _, row in dataset.iterrows()
    # The _ means "I don't care about this value"
    #
    # WHY ITERROWS: To process each intent category one at a time

    intent = row['intents']
    # row['intents'] = get the 'intents' column from this row
    #
    # Remember the DataFrame structure:
    #   | intents                                           |
    #   |---------------------------------------------------|
    #   | {'tag': 'greeting', 'patterns': [...], ...}       |
    #
    # So row['intents'] gives us that dictionary:
    #   {'tag': 'greeting', 'patterns': ['Hi', 'Hey'], 'responses': ['Hello!']}
    #
    # Now we can access patterns and responses from it
    #
    # WHY: To extract the dictionary containing patterns and responses

    patterns = intent.get('patterns', [])   # Questions
    # intent.get('patterns', []) = safely get 'patterns' key
    #
    # Two ways to get dictionary values:
    #   intent['patterns']      <- CRASHES if key doesn't exist
    #   intent.get('patterns')  <- Returns None if key doesn't exist
    #   intent.get('patterns', [])  <- Returns [] if key doesn't exist
    #
    # The [] is the DEFAULT value if 'patterns' is missing
    # This prevents crashes on malformed data
    #
    # patterns = ['Hi', 'Hey', 'Hello', 'Howdy'] (list of user inputs)
    #
    # WHY .get(): Defensive coding - don't crash on bad data

    responses = intent.get('responses', []) # Answers
    # Same thing but for responses
    #
    # responses = ['Hello there!', 'Hi! How are you?'] (list of bot replies)
    #
    # WHY: To get the list of possible answers

    if patterns and responses:
        # This checks TWO things:
        #   1. patterns is not empty (has at least 1 question)
        #   2. responses is not empty (has at least 1 answer)
        #
        # Empty list = False in Python (falsy value)
        # Non-empty list = True in Python (truthy value)
        #
        # So this is shorthand for:
        #   if len(patterns) > 0 and len(responses) > 0:
        #
        # WHY: Skip intents that are missing questions or answers
        #      Can't make a Q&A pair without both!

        # Pair EACH pattern with EACH response for more data
        for pattern in patterns:
            # Loop through each question
            # pattern = "Hi" then "Hey" then "Hello" etc.
            
            for response in responses:
                # NESTED LOOP: For each question, loop through ALL answers
                # response = "Hello there!" then "Hi! How are you?" etc.
                #
                # This creates the CARTESIAN PRODUCT:
                #   pattern="Hi" + response="Hello there!"
                #   pattern="Hi" + response="Hi! How are you?"
                #   pattern="Hey" + response="Hello there!"
                #   pattern="Hey" + response="Hi! How are you?"
                #   ... and so on
                #
                # 3 patterns x 2 responses = 6 combinations
                #
                # WHY NESTED LOOPS: To create ALL possible Q&A combinations
                #                   More training data = better model

                qa_pairs.append({
                    'question': pattern.strip(),
                    'answer': response.strip()
                })
                # .append() = add item to end of list
                #
                # We're adding a dictionary with:
                #   'question': the user input
                #   'answer': the bot response
                #
                # .strip() = remove whitespace from both ends
                #   "  Hi  " -> "Hi"
                #   "\nHello\n" -> "Hello"
                #   Cleans up messy data
                #
                # After all loops, qa_pairs looks like:
                # [
                #   {'question': 'Hi', 'answer': 'Hello there!'},
                #   {'question': 'Hi', 'answer': 'Hi! How are you?'},
                #   {'question': 'Hey', 'answer': 'Hello there!'},
                #   ... 661 total pairs
                # ]
                #
                # WHY THIS FORMAT: Easy to convert to HuggingFace Dataset
                #                  Easy to access with pair['question']

print(f"Created {len(qa_pairs)} Q&A pairs!")
# len(qa_pairs) = how many pairs we created
#
# Output: "Created 661 Q&A pairs!"
#
# Started with 80 intents, ended with 661 pairs
# That's the power of the cartesian product!
#
# WHY: Confirm the expansion worked and see how much data we have

print("\nExamples:")
# \n = newline, just adds blank line for readability

for i in range(3):
    # Loop 3 times: i = 0, 1, 2
    # Show first 3 examples as a sanity check
    
    print(f"Q: {qa_pairs[i]['question']}")
    # qa_pairs[i] = the i-th pair (a dictionary)
    # qa_pairs[i]['question'] = the question from that pair
    #
    # Output: "Q: Hi"
    
    print(f"A: {qa_pairs[i]['answer'][:80]}...\n")
    # qa_pairs[i]['answer'][:80] = first 80 characters of answer
    #
    # [:80] is string slicing - prevents super long answers from flooding the screen
    # Some answers are like 200 characters, this keeps output clean
    #
    # The "..." is appended unconditionally, so it shows up even when
    # the answer is shorter than 80 characters and nothing was cut
    # \n adds blank line between examples
    #
    # Output: "A: Hello there. Tell me how are you feeling today?..."
    #
    # WHY: Visual confirmation that our data looks right
    #      Always inspect your data before training!

Created 661 Q&A pairs!

Examples:
Q: Hi
A: Hello there. Tell me how are you feeling today?...

Q: Hi
A: Hi there. What brings you here today?...

Q: Hi
A: Hi there. How are you feeling today?...
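
The nested loops above build a cartesian product by hand. As an alternative sketch (not what the notebook cell uses), Python's `itertools.product` does the same pairing in one expression. The intent below is the three-questions, two-answers example from the comments.

```python
from itertools import product

# Sample intent: 3 questions x 2 answers, same as the example in the comments.
intent = {
    "patterns": ["Hi", "Hey", "Hello"],
    "responses": ["Hello!", "Hi there!"],
}

# product() yields every (pattern, response) combination, with the first
# iterable acting as the outer loop, just like the nested for loops.
pairs = [
    {"question": pattern.strip(), "answer": response.strip()}
    for pattern, response in product(intent["patterns"], intent["responses"])
]

print(f"Created {len(pairs)} Q&A pairs")
for pair in pairs:
    print(pair)
```

If either list is empty, `product` yields nothing, so the `if patterns and responses` guard is built in.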



In [6]:
# Convert to HuggingFace Dataset format
train_data = Dataset.from_list(qa_pairs)
# Dataset = HuggingFace's data container class (we imported it earlier)
# .from_list() = create a Dataset from a list of dictionaries
#
# Our qa_pairs looks like:
# [
#   {'question': 'Hi', 'answer': 'Hello there!'},
#   {'question': 'Hey', 'answer': 'Hi! How are you?'},
#   {'question': 'Hello', 'answer': 'Hello there!'},
#   ... 661 total
# ]
#
# Dataset.from_list() converts it to a table structure:
#   | question | answer                    |
#   |----------|---------------------------|
#   | Hi       | Hello there!              |
#   | Hey      | Hi! How are you?          |
#   | Hello    | Hello there!              |
#   | ...      | ...                       |
#
# Each dictionary KEY becomes a COLUMN
# Each dictionary becomes a ROW
#
# WHY NOT JUST USE THE LIST?
#   - SFTTrainer EXPECTS a HuggingFace Dataset
#   - Dataset has useful methods: .map(), .train_test_split(), .shuffle()
#   - Dataset is memory-efficient for large data (memory-mapped)
#   - Dataset integrates perfectly with the HuggingFace ecosystem
#
# OTHER WAYS TO CREATE A DATASET:
#   - Dataset.from_dict({'question': [...], 'answer': [...]})
#   - Dataset.from_pandas(dataframe)
#   - Dataset.from_csv('file.csv')
#   - load_dataset('huggingface/dataset_name')  <- from HuggingFace Hub
#
# .from_list() is cleanest when you already have list of dicts
#
# WHY WE NEED IT: Trainer won't accept a raw Python list
#                 This is the format HuggingFace tools expect

print(f"Dataset: {len(train_data)} examples")
# len(train_data) = number of rows in the Dataset
#
# Output: "Dataset: 661 examples"
#
# Same as len(qa_pairs) - just confirming conversion worked
# If this number was different, something went wrong
#
# WHY: Sanity check - make sure no data was lost in conversion

print(f"Columns: {train_data.column_names}")
# .column_names = list of column names in the Dataset
#
# Output: "Columns: ['question', 'answer']"
#
# These came from the dictionary keys!
# Every dict in qa_pairs had 'question' and 'answer' keys
# So Dataset has 'question' and 'answer' columns
#
# If you had dicts like {'q': '...', 'a': '...'}
# You'd get columns ['q', 'a'] instead
#
# WHY: Confirm the structure is what we expect
#      We'll reference these column names later when formatting
#
# BONUS - OTHER USEFUL DATASET PROPERTIES:
#   train_data[0]              <- first row as dict
#   train_data['question']     <- all values in the question column
#   train_data[0:5]            <- first 5 rows as a dict of lists (not a new Dataset)
#   train_data.select(range(5)) <- first 5 rows as a new Dataset
#   train_data.shape           <- (num_rows, num_columns)
#   train_data.features        <- column types (string, int, etc.)

Dataset: 661 examples
Columns: ['question', 'answer']
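
To make the "keys become columns, dicts become rows" idea concrete, here's a pure-Python picture of the conversion. This is only a mental model, not how the `datasets` library does it internally (the real thing stores columns in Apache Arrow), and `rows_to_columns` is a hypothetical helper name.

```python
def rows_to_columns(rows):
    """Turn a list of dicts (rows) into a dict of lists (columns)."""
    if not rows:
        return {}
    column_names = list(rows[0].keys())
    columns = {name: [] for name in column_names}
    for row in rows:
        # Every row must have the same keys, otherwise the table has holes
        if set(row.keys()) != set(column_names):
            raise ValueError(f"Row has different keys: {sorted(row.keys())}")
        for name in column_names:
            columns[name].append(row[name])
    return columns

rows = [
    {"question": "Hi", "answer": "Hello there!"},
    {"question": "Hey", "answer": "Hi! How are you?"},
]
columns = rows_to_columns(rows)
print(list(columns.keys()))      # column names come from the dict keys
print(len(columns["question"]))  # number of rows
```

The key check is the useful part: if one dict in your list is missing a key, you want to find out now rather than halfway through training.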


## Chapter 4: Quick Theory

**Why LoRA?**
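- Full fine-tuning a 7B model with Adam in float32 needs roughly 112GB of GPU memory (about 16 bytes per parameter for weights + gradients + optimizer states), before activations
- LoRA freezes the original model and adds tiny "adapter" layers
- We only train a small fraction of the parameters (a few percent or less, depending on the LoRA settings)
- Results are often close to full fine-tuning, at a fraction of the memory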
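
Where does the "tiny" come from? In the standard LoRA formulation, each adapted weight matrix gets two small matrices whose product is added to the frozen weight. A quick back-of-the-envelope sketch, where the matrix size and the rank are assumed values picked for illustration (not this notebook's actual config):

```python
def lora_param_count(d_out, d_in, rank):
    """Extra trainable params LoRA adds to one d_out x d_in weight matrix."""
    # LoRA learns two small matrices: B (d_out x rank) and A (rank x d_in)
    return rank * (d_out + d_in)

d_out, d_in = 4096, 4096  # assumed size of one projection matrix
rank = 8                  # assumed LoRA rank

full = d_out * d_in
extra = lora_param_count(d_out, d_in, rank)
print(f"Frozen weights in this matrix: {full:,}")
print(f"Trainable LoRA weights:        {extra:,}")
print(f"Fraction trainable:            {extra / full:.2%}")
```

The overall percentage for a whole model depends on the rank and on how many layers you attach adapters to, which is why you'll see different numbers quoted in different tutorials.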

**Why Quantization?**
- Normal: 32 bits per weight -> 7B model = 28GB
- 4-bit: 4 bits per weight -> 7B model = 3.5GB
- Combine both = fine-tune 7B models on consumer GPUs
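
The numbers above are just parameter count times bits per weight. A small sketch of that arithmetic, using 1 GB = 1e9 bytes and counting only the weights (the 16 bytes per parameter figure for full fine-tuning is a common rule of thumb that ignores activations):

```python
def weights_gb(num_params, bits_per_weight):
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9

for label, bits in [("float32", 32), ("float16", 16), ("4-bit", 4)]:
    print(f"7B model in {label:>7}: {weights_gb(7e9, bits):5.1f} GB")

# Full fine-tuning with Adam in float32 (rule of thumb, ignores activations):
# 4 bytes weights + 4 bytes gradients + 8 bytes optimizer states = 16 bytes/param
full_finetune_gb = 7e9 * 16 / 1e9
print(f"Full fine-tuning estimate: {full_finetune_gb:.0f} GB")
```

Swap in your own parameter count to get a first guess at whether a model has any chance of fitting on your GPU.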

## Chapter 5: Model Setup

In [7]:
# 4-BIT QUANTIZATION CONFIG
# This compresses the model to fit in GPU memory
#
# THE PROBLEM:
#   Model weights are normally stored as float32 (32 bits per number)
#   TinyLlama has 1.1 billion weights
#   1.1B * 32 bits = 4.4 GB just to LOAD the model
#
#   Bigger models are worse:
#   7B model * 32 bits = 28 GB  (won't fit on most GPUs)
#   13B model * 32 bits = 52 GB (definitely won't fit)
#   70B model * 32 bits = 280 GB (lol good luck)
#
#   And that's just to LOAD it!
#   Training needs 3-4x more memory for gradients + optimizer states
#
# THE SOLUTION: Quantization
#   Store weights in fewer bits
#   32 bits -> 4 bits = 8x smaller!
#   7B model: 28 GB -> 3.5 GB (fits on free Colab!)
#
# WHY WE NEED IT: To run models that would otherwise need $10,000 GPUs

bnb_config = BitsAndBytesConfig(
    # BitsAndBytesConfig = settings container for quantization
    # BitsAndBytes = the library that does the actual compression
    # Created by Tim Dettmers (legend in ML efficiency)
    #
    # We're creating a config object with our preferred settings
    # This config gets passed to the model loader later

    load_in_4bit=True,
    # load_in_4bit = compress weights to 4-bit when loading
    #
    # Bit depth comparison:
    #   float32 = 32 bits = full precision (default)
    #   float16 = 16 bits = half precision (2x smaller)
    #   int8    = 8 bits  = 4x smaller
    #   int4    = 4 bits  = 8x smaller  <- WE'RE USING THIS
    #
    # True = yes, compress to 4-bit
    # False = don't compress (you'd need way more VRAM)
    #
    # The compression happens WHEN the model loads
    # Original model on disk: float16 or float32
    # Model in memory: 4-bit (thanks to this setting)
    #
    # WHY: 8x memory savings. The main reason we can run big models.

    bnb_4bit_quant_type="nf4",
    # bnb_4bit_quant_type = WHICH 4-bit format to use
    #
    # Options:
    #   "fp4" = 4-bit floating point
    #           Standard 4-bit, works okay
    #
    #   "nf4" = NormalFloat 4-bit  <- THE GOOD ONE
    #           Specifically designed for neural network weights
    #           Based on the observation that neural net weights
    #           roughly follow a NORMAL DISTRIBUTION (bell curve)
    #           So nf4 spaces out its 16 values (2^4=16) to match
    #           that bell curve, giving better precision where it matters
    #
    # The QLoRA paper (which introduced nf4) reports it beating fp4 in its experiments
    # There's little reason to use fp4 unless you're experimenting
    #
    # WHY "nf4": Better quality for neural network weights at the same size

    bnb_4bit_compute_dtype=torch.bfloat16,
    # bnb_4bit_compute_dtype = precision for CALCULATIONS
    #
    # IMPORTANT DISTINCTION:
    #   - Weights are STORED in 4-bit (saves memory)
    #   - Calculations are DONE in higher precision (better accuracy)
    #
    # When the model does math:
    #   1. Decompress 4-bit weights to bfloat16 (on the fly)
    #   2. Do the matrix multiplication in bfloat16
    #   3. Keep result in bfloat16
    #
    # Why not compute in 4-bit?
    #   - 4-bit math is imprecise and causes errors to accumulate
    #   - Decompressing to 16-bit for math is fast (GPU is good at this)
    #   - Best of both worlds: small storage, accurate compute
    #
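    # Options:
    #   torch.float32  = most precise, but slower and more memory
    #   torch.float16  = half precision, good but can overflow
    #   torch.bfloat16 = "brain float 16", same range as float32 but less precise
    #                    designed by Google specifically for ML
    #                    far less prone to overflow than float16
    #                    usually the best choice for training when the GPU supports it
    #
    # NOTE: bfloat16 runs natively on newer NVIDIA GPUs (Ampere and later)
    #       Older cards like the T4/P100 lack native support, so it can be slow there
    #       and torch.float16 is the usual fallback
    #
    # WHY bfloat16: Stable training, good speed, designed for exactly this use case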

    bnb_4bit_use_double_quant=True,
    # bnb_4bit_use_double_quant = quantize the quantization constants too
    #
    # NERDY DETAIL (skip if you want):
    #   When you quantize weights, you need to store "scaling factors"
    #   These tell you how to convert 4-bit back to real numbers
    #   
    #   Example: weights [0.12, 0.15, 0.11, 0.14]
    #   Quantized: [2, 3, 2, 3] with scale=0.05  (round each weight / scale)
    #   To get back: 2*0.05=0.10, 3*0.05=0.15, etc.
    #
    #   The scales themselves take up memory (usually float32)
    #   One float32 scale per block of 64 weights = 0.5 extra bits per weight
    #   Double quant = quantize the scales too (to 8-bit)
    #   Saves another ~0.37 bits per parameter
    #
    # In practice (numbers from the QLoRA paper):
    #   double_quant=False: ~4.5 bits per weight
    #   double_quant=True:  ~4.13 bits per weight
    #   That's about 8% extra memory savings!
    #
    # True = yes, do the extra compression
    # False = don't bother (slightly faster loading, slightly more memory)
    #
    # WHY True: Extra memory savings with essentially no quality loss. Why not?
)

print("Quantization config ready!")
# Just confirms we created the config without errors
#
# This config doesn't DO anything yet
# It's just settings saved in a variable
# The actual quantization happens when we load the model
# We'll pass this config to AutoModelForCausalLM.from_pretrained()
#
# WHY: Confirmation message, good practice to verify steps completed

# ============================================================
# MEMORY SAVINGS SUMMARY
# ============================================================
#
# TinyLlama 1.1B without quantization:
#   1.1B params * 32 bits = 4.4 GB
#   Training memory: ~17 GB (gradients, optimizer, activations)
#
# TinyLlama 1.1B WITH this config:
#   1.1B params * 4 bits = 0.55 GB
#   Training memory: ~3-4 GB (fits easily on free GPUs!)
#
# For a 7B model:
#   Without: 28 GB (needs A100 or better)
#   With: 3.5 GB (runs on T4/P100!)
#
# THE TRADEOFF:
#   - Quality loss: usually small, but it varies by model and task (check on your own eval)
#   - Speed: slightly slower (decompression overhead)
#   - Memory: 8x smaller (HUGE win)
#
# For fine-tuning on consumer hardware, this is basically mandatory

Quantization config ready!
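
If the scaling-factor idea still feels abstract, here's a toy version in plain Python. Big caveat: this uses a simple uniform integer grid, while real NF4 uses 16 unevenly spaced levels, so treat it as a sketch of the principle and not of what bitsandbytes actually does. The function names are made up, and the block sizes in `bits_per_weight` (64 weights per scale, 256 scales per second-level constant) are the figures described in the QLoRA paper, assumed here rather than read from the library.

```python
def quantize_block(weights, levels=7):
    """Toy absmax quantization of one block to integers in [-levels, levels]."""
    scale = max(abs(w) for w in weights) / levels  # one float scale per block
    codes = [round(w / scale) for w in weights]
    return codes, scale

def dequantize_block(codes, scale):
    """Turn the small integers back into approximate real weights."""
    return [c * scale for c in codes]

def bits_per_weight(block_size=64, scale_bits=32, double_quant=False,
                    dq_scale_bits=8, dq_block_size=256):
    """Storage cost per weight: 4 bits plus the share of the scaling constants."""
    if not double_quant:
        return 4 + scale_bits / block_size
    # Scales stored in 8 bits, plus one 32-bit constant per dq_block_size scales
    return 4 + dq_scale_bits / block_size + scale_bits / (block_size * dq_block_size)

weights = [0.12, -0.15, 0.11, 0.14]
codes, scale = quantize_block(weights)
restored = dequantize_block(codes, scale)

print("codes:   ", codes)
print("restored:", [round(r, 4) for r in restored])
print(f"plain 4-bit:  {bits_per_weight():.3f} bits per weight")
print(f"double quant: {bits_per_weight(double_quant=True):.3f} bits per weight")
```

The restored weights are close to the originals but not identical, which is exactly the small quality cost you pay for the memory savings.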


In [8]:
# MODEL SELECTION
model_name = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
# This is the Hugging Face model ID
# Format: "organization/model-name"
#
# Let's break down the name:
#   TinyLlama/         = the organization that made it (TinyLlama team)
#   TinyLlama          = model family name
#   1.1B               = 1.1 billion parameters (the "size" of the brain)
#   Chat               = fine-tuned for conversation (not just raw text completion)
#   v1.0               = version 1.0
#
# WHERE THIS COMES FROM:
#   Hugging Face Hub = like GitHub but for AI models
#   URL: huggingface.co/TinyLlama/TinyLlama-1.1B-Chat-v1.0
#   
#   Anyone can upload models there
#   You just use the ID string to download them
#   No manual downloading - the library handles everything
#
# WHY TINYLLAMA?
#   1. It's SMALL: 1.1B parameters
#      - Loads in ~0.8 GB with 4-bit quantization
#      - Trains fast (minutes, not hours)
#      - Perfect for learning/experimenting
#
#   2. It's CAPABLE: Despite being small
#      - Trained on 3 trillion tokens (massive dataset)
#      - Punches above its weight class
#      - Actually produces coherent responses
#
#   3. It's CHAT-TUNED: The "-Chat" part matters
#      - Base models just continue whatever text you give them (no notion of turns)
#      - Chat models understand conversation format
#      - Already knows <|system|>, <|user|>, <|assistant|> format
#      - We're building on top of existing chat ability
#
#   4. It's FREE: No API keys, no payments
#      - Open source, open weights
#      - Use it however you want
#
# Small enough for free Colab, but still capable
# ^ This is the key tradeoff
#   Bigger model = smarter but needs more VRAM
#   Smaller model = fits anywhere but less capable
#   TinyLlama hits a sweet spot for learning

# Other options: "microsoft/phi-2", "mistralai/Mistral-7B-v0.1"
#
# IF YOU HAVE MORE GPU MEMORY, TRY THESE:
#
# "microsoft/phi-2" (2.7B parameters)
#   - 2.5x bigger than TinyLlama
#   - Microsoft's "small but mighty" model
#   - Surprisingly good at reasoning
#   - Needs ~2 GB with 4-bit quant
#   - Good middle ground
#
# "mistralai/Mistral-7B-v0.1" (7B parameters)
#   - 6x bigger than TinyLlama
#   - One of the best open source models
#   - Beats many 13B models in benchmarks
#   - Needs ~4 GB with 4-bit quant
#   - Use if you have T4/P100 with room to spare
#
# "meta-llama/Llama-2-7b-chat-hf" (7B parameters)
#   - Meta's official Llama 2
#   - Requires accepting license on HuggingFace
#   - Very capable, widely used
#   - Similar requirements to Mistral-7B
#
# "mistralai/Mistral-7B-Instruct-v0.2" (7B parameters)
#   - Instruction-tuned version of Mistral
#   - Better at following commands
#   - Great for chat applications
#
# PARAMETER COUNT REFERENCE (rough, for 16-bit weights; 4-bit needs about a quarter):
#   1B params   = small, fast, limited capability
#   3B params   = decent balance
#   7B params   = good quality, needs ~16GB GPU
#   13B params  = high quality, needs ~28GB+ GPU
#   70B params  = state of the art, needs multiple GPUs
#
# FOR THIS TUTORIAL:
#   Stick with TinyLlama - guaranteed to work on free GPUs
#   Once you understand the process, scale up to bigger models

print(f"Using model: {model_name}")
# Output: "Using model: TinyLlama/TinyLlama-1.1B-Chat-v1.0"
#
# Just confirms which model we're using
# Useful when you copy-paste code and forget what you set
#
# This doesn't LOAD the model yet
# We're just storing the ID string in a variable
# Actual loading happens in the next cell with AutoModelForCausalLM.from_pretrained()
#
# WHY A VARIABLE?
#   We reference model_name multiple times:
#   - Loading the model
#   - Loading the tokenizer
#   - Saving metadata
#   
#   If it's a variable, we change it in ONE place
#   If we hardcoded it everywhere, we'd have to change multiple lines
#   
#   model_name = "different/model"  <- change once, works everywhere
#
# WHY: Good practice, keeps code DRY (Don't Repeat Yourself)

Using model: TinyLlama/TinyLlama-1.1B-Chat-v1.0
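
Since the chat format came up, here's roughly what a prompt in that format looks like as a plain string. This is a sketch that assumes the Zephyr-style layout shown on the TinyLlama chat model card; the exact whitespace is an assumption and `build_prompt` is a made-up helper. In real code, `tokenizer.apply_chat_template(...)` is the safer way to build prompts because it reads the template that ships with the model.

```python
EOS = "</s>"  # TinyLlama's end-of-sequence token (printed later in this notebook)

def build_prompt(user_message, system_message=None):
    """Build a Zephyr-style chat prompt that ends where the assistant should reply."""
    parts = []
    if system_message:
        parts.append(f"<|system|>\n{system_message}{EOS}\n")
    parts.append(f"<|user|>\n{user_message}{EOS}\n")
    parts.append("<|assistant|>\n")  # left open: the model writes what comes next
    return "".join(parts)

prompt = build_prompt("I'm feeling anxious", system_message="You are a supportive listener.")
print(prompt)
```

The prompt deliberately stops right after the assistant tag, so the most natural continuation for the model is the assistant's reply.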


In [9]:
# LOAD TOKENIZER
# Converts text <-> token IDs
#
# WHY WE NEED A TOKENIZER:
#   Models don't understand text. At all.
#   They only understand numbers (tensors of integers)
#
#   Human: "Hello, how are you?"
#   Model: "wtf is this gibberish?"
#
#   Tokenizer: [15496, 11, 703, 527, 345, 30]   (illustrative IDs)
#   Model: "Ah yes, I know exactly what to do!"
#
#   Tokenizer is the translator between human world and model world

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
# Let's break this down:
#
# AutoTokenizer
#   - "Auto" = automatically detect the right tokenizer class
#   - TinyLlama uses LlamaTokenizer under the hood
#   - GPT-2 uses GPT2Tokenizer
#   - BERT uses BertTokenizer
#   - You don't need to know which - Auto figures it out
#
# .from_pretrained(model_name, ...)
#   - Downloads the tokenizer files from Hugging Face Hub
#   - model_name = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
#   - It fetches:
#       tokenizer.model    (the SentencePiece model file)
#       tokenizer.json     (the actual vocabulary)
#       tokenizer_config.json  (settings)
#       special_tokens_map.json  (like <s>, </s>, etc.)
#
#   IMPORTANT: Tokenizer MUST match the model!
#   - TinyLlama's tokenizer knows 32,000 tokens
#   - Those exact 32,000 tokens map to the model's embedding layer
#   - If you use wrong tokenizer, token ID 5000 means different things
#   - Result: complete garbage output
#   - Always load tokenizer from SAME model ID as the model
#
# trust_remote_code=True
#   - Some models ship custom tokenizer code in their repo
#   - This flag allows downloading and running that code
#   - TinyLlama uses the standard Llama tokenizer, so it isn't strictly needed here
#   - Security note: only enable it for repos you trust
#   - For sketchy random models, leave it at the default (False)
#
# WHAT'S IN A TOKENIZER:
#   1. Vocabulary: mapping of text pieces to IDs
#      "hello" -> 15496
#      "the" -> 278
#      "ing" -> 292  (yes, word PIECES, not just words)
#
#   2. Encoding rules: how to break text into pieces
#      "unhappiness" -> ["un", "happiness"] or ["unhapp", "iness"]?
#      Different tokenizers do it differently
#
#   3. Special tokens: control tokens
#      <s> = start of sequence
#      </s> = end of sequence
#      <pad> = padding
#      <unk> = unknown token
#
# WHY WE NEED IT: Can't feed text to model without converting to numbers first

# Set padding token (many models don't have one by default)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
    print(f"Set pad_token to: '{tokenizer.eos_token}'")
#
# THE PADDING PROBLEM:
#   When training, we process BATCHES of examples
#   Batch = multiple examples at once (faster than one-by-one)
#
#   But examples have different lengths:
#     Example 1: "Hi"                    -> [15496]           (1 token)
#     Example 2: "How are you doing?"    -> [5765, 526, 345, 892]  (4 tokens)
#
#   GPUs need rectangular tensors (same length for all)
#   Can't have jagged arrays!
#
#   Solution: PAD shorter examples to match longest:
#     Example 1: [15496, PAD, PAD, PAD]  (padded to 4 tokens)
#     Example 2: [5765, 526, 345, 892]   (already 4 tokens)
#
#   Now both are length 4 - can stack into a batch!
#
# THE MISSING PAD TOKEN:
#   Many models (like the original Llama family) weren't trained with padding
#   They only have: <s> (start), </s> (end), <unk> (unknown)
#   No <pad> token exists in their vocabulary!
#
#   tokenizer.pad_token is None = no padding token defined
#
# THE FIX:
#   tokenizer.pad_token = tokenizer.eos_token
#   "Just use the end-of-sequence token for padding"
#   
#   </s> = token ID 2 (for TinyLlama)
#   Now padding uses ID 2
#
#   Is this a hack? Kind of.
#   Does it work? Yes, it's a common choice for fine-tuning.
#   
#   Padding positions are masked out by the attention mask,
#   so the model never attends to them
#   One thing to watch: if your data collator masks every pad token out of
#   the loss and pad == EOS, the real </s> at the end of each example can get
#   masked too, and the model may not learn when to stop
#
# WHY WE CHECK FIRST (if ... is None):
#   Some models DO have a pad token already
#   Don't want to override it if it exists
#   Only set it if it's missing
#   (The output below has no "Set pad_token to" line, which means this
#    tokenizer already came with a pad token and the if-block was skipped)

tokenizer.padding_side = "right"
# padding_side = WHERE to add the padding tokens
#
# Two options:
#   "right" = pad at the END      [Hello, <pad>, <pad>]
#   "left"  = pad at the START    [<pad>, <pad>, Hello]
#
# FOR DECODER-ONLY MODELS (GPT, Llama, TinyLlama):
#   Use "right" padding
#
#   Why? These models generate LEFT to RIGHT
#   They predict: given everything before, what comes next?
#   
#   With right padding:
#     [Hello, how, are, you, <pad>, <pad>]
#     Model sees real tokens first, padding after
#     Makes sense - real content, then filler
#
#   With left padding:
#     [<pad>, <pad>, Hello, how, are, you]
#     Model sees padding first, then real tokens
#     Can confuse the model during training
#     (Left padding IS the usual choice for batched generation at inference,
#      so new tokens continue right after the real text)
#
# FOR ENCODER MODELS (BERT):
#   Either side works, usually use "right"
#
# FOR ENCODER-DECODER (T5):
#   Usually "right" for both
#
# TinyLlama is decoder-only, so "right" is correct
#
# WHY IT MATTERS:
#   Wrong padding side can hurt training performance
#   Model might learn weird patterns from padding position
#   "right" is the safe default for causal LMs

print(f"Vocab size: {tokenizer.vocab_size:,}")
# tokenizer.vocab_size = how many tokens the tokenizer knows
#
# Output: "Vocab size: 32,000"
#
# This means:
#   - 32,000 unique tokens in the vocabulary
#   - Token IDs range from 0 to 31,999
#   - Any text gets broken into these 32,000 building blocks
#
# The :, in the f-string adds commas for readability
#   32000 -> 32,000
#   1000000 -> 1,000,000
#
# VOCAB SIZE COMPARISON:
#   TinyLlama: 32,000 tokens
#   GPT-2: 50,257 tokens
#   Llama 2: 32,000 tokens
#   GPT-4: ~100,000 tokens (estimated)
#
# Bigger vocab = more tokens to represent concepts
#              = more efficient (fewer tokens per word)
#              = but bigger embedding matrix
#
# WHY WE PRINT IT: Sanity check, confirms tokenizer loaded correctly

print(f"EOS token: '{tokenizer.eos_token}' (ID: {tokenizer.eos_token_id})")
# EOS = End Of Sequence
#
# Output: "EOS token: '</s>' (ID: 2)"
#
# tokenizer.eos_token = the string representation "</s>"
# tokenizer.eos_token_id = the numeric ID (2)
#
# WHY EOS MATTERS:
#   1. Tells model when to STOP generating
#      Without EOS, model rambles forever
#      "Hello how are you I am fine the weather is nice today and..."
#      With EOS: "Hello how are you?</s>" -> model stops
#
#   2. We use it in training data
#      Every example ends with </s>
#      Model learns: "when I generate </s>, I'm done"
#
#   3. It doubles as the pad_token (see above)
#
# OTHER SPECIAL TOKENS:
#   tokenizer.bos_token = "<s>" (Beginning Of Sequence)
#   tokenizer.unk_token = "<unk>" (Unknown - rare/unseen words)
#   tokenizer.pad_token = "</s>" (same token as EOS here)
#
# WHY WE PRINT IT: 
#   - Confirms EOS is what we expect
#   - We'll reference this token in our prompt format
#   - Good to know the actual string and ID

Vocab size: 32,000
EOS token: '</s>' (ID: 2)
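
Here's the padding idea from the comments above as a few lines of plain Python. It's a simplified sketch of what a tokenizer or data collator does for you (the helper name `pad_batch` is made up, and the token IDs are the illustrative ones from the comments). The attention mask is the part that tells the model which positions are real.

```python
def pad_batch(sequences, pad_id, side="right"):
    """Pad token-ID lists to equal length and build matching attention masks."""
    if side not in ("right", "left"):
        raise ValueError("side must be 'right' or 'left'")
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        padding = [pad_id] * (max_len - len(seq))
        mask_real = [1] * len(seq)      # 1 = real token
        mask_pad = [0] * len(padding)   # 0 = padding, ignored by attention
        if side == "right":
            input_ids.append(seq + padding)
            attention_mask.append(mask_real + mask_pad)
        else:
            input_ids.append(padding + seq)
            attention_mask.append(mask_pad + mask_real)
    return input_ids, attention_mask

batch = [[15496], [5765, 526, 345, 892]]  # illustrative IDs from the comments above
ids, mask = pad_batch(batch, pad_id=2, side="right")
print(ids)   # [[15496, 2, 2, 2], [5765, 526, 345, 892]]
print(mask)  # [[1, 0, 0, 0], [1, 1, 1, 1]]
```

Calling it with `side="left"` puts the padding in front instead, which is the layout you'd typically want for batched generation.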


In [10]:
# LOAD MODEL
print("Loading model (might take a minute)...")
# Just a heads up message because loading takes time
#
# What happens during loading:
#   1. Download model files from Hugging Face (if not cached)
#      - First time: downloads ~2.2 GB of files
#      - After that: loads from local cache (fast)
#   2. Read the weight files into memory
#   3. Apply quantization (compress to 4-bit)
#   4. Move to GPU
#
# This can take 30 seconds to a few minutes depending on:
#   - Internet speed (if downloading)
#   - Disk speed (if loading from cache)
#   - GPU speed (for quantization)
#
# WHY PRINT THIS: So you don't think it crashed when nothing happens for a while

model = AutoModelForCausalLM.from_pretrained(
    # AutoModelForCausalLM = Auto-detect the right model class
    #
    # "Auto" does the magic:
    #   - Reads config.json from the model repo
    #   - Sees architecture: "LlamaForCausalLM"
    #   - Automatically uses the LlamaForCausalLM class
    #
    # Without Auto, you'd need to know the exact class:
    #   from transformers import LlamaForCausalLM
    #   model = LlamaForCausalLM.from_pretrained(...)
    #
    # Auto saves you from memorizing 50+ model class names
    #
    # "ForCausalLM" part:
    #   - Causal = can only see PAST tokens (not future)
    #   - LM = Language Model (predicts next token)
    #   - This is how GPT/ChatGPT style models work
    #   - Input: "The cat sat on the"
    #   - Output: "mat" (predicts next word)
    #
    # .from_pretrained() = load a pre-trained model
    #   - "pre-trained" = someone already trained it on trillions of tokens
    #   - We're not starting from random weights
    #   - We're building on top of existing knowledge
    #
    # WHY WE NEED IT: To get the actual neural network with all its weights

    model_name,
    # model_name = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
    # We defined this earlier
    #
    # This tells from_pretrained() WHERE to find the model:
    #   1. First checks local cache (~/.cache/huggingface/)
    #   2. If not there, downloads from huggingface.co/TinyLlama/TinyLlama-1.1B-Chat-v1.0
    #
    # The model repo contains:
    #   - config.json (architecture settings)
    #   - model.safetensors (the actual weights, ~2.2 GB)
    #   - generation_config.json (default generation settings)
    #
    # WHY: Tells the function which model to load

    quantization_config=bnb_config,
    # bnb_config = the BitsAndBytesConfig we created earlier
    #
    # Remember what's in it:
    #   load_in_4bit=True
    #   bnb_4bit_quant_type="nf4"
    #   bnb_4bit_compute_dtype=torch.bfloat16
    #   bnb_4bit_use_double_quant=True
    #
    # This tells the loader:
    #   "When you load those weights, compress them to 4-bit"
    #
    # WITHOUT this config:
    #   Model loads in float16/float32
    #   TinyLlama would use ~2.2 GB
    #   Bigger models wouldn't fit at all
    #
    # WITH this config:
    #   Model loads in 4-bit
    #   TinyLlama uses ~0.8 GB
    #   7B models fit on consumer GPUs
    #
    # The quantization happens ON THE FLY during loading:
    #   1. Read original float16 weights from disk
    #   2. Compress each layer to 4-bit as it's loaded
    #   3. Store compressed version in GPU memory
    #
    # Original weights on disk are unchanged
    # Only the in-memory version is quantized
    #
    # WHY: This is what makes the memory magic happen

    device_map="auto",
    # device_map = WHERE to put the model layers
    #
    # "auto" = let the library figure it out automatically
    #
    # What "auto" does:
    #   1. Check available devices (GPU? Multiple GPUs? CPU?)
    #   2. Check available memory on each device
    #   3. Spread model layers across devices optimally
    #
    # For most people (1 GPU):
    #   - Puts entire model on GPU 0
    #   - Simple and fast
    #
    # For multiple GPUs:
    #   - Splits model across GPUs automatically
    #   - Layers 0-10 on GPU 0, layers 11-21 on GPU 1, etc. (22 layers, numbered 0-21)
    #   - Called "model parallelism"
    #
    # For huge models that don't fit on GPU:
    #   - Puts what fits on GPU
    #   - Puts overflow on CPU RAM
    #   - Slow but works
    #
    # OTHER OPTIONS:
    #   device_map="cuda:0"  = force everything on GPU 0
    #   device_map="cpu"     = force everything on CPU (slow!)
    #   device_map={"": 0}   = another way to say GPU 0
    #   device_map={"model.layers.0": 0, "model.layers.1": 1, ...}  = manual assignment
    #
    # "auto" is almost always what you want
    #
    # WHY: So we don't manually manage which GPU gets which layer

    trust_remote_code=True,
    # trust_remote_code = allow running custom code from the model repo
    #
    # Some models have custom Python files:
    #   - Custom attention implementations
    #   - Custom tokenization logic
    #   - Custom architecture tweaks
    #
    # True = download and execute that custom code
    # False = only use standard HuggingFace classes
    #
    # SECURITY WARNING:
    #   Custom code could theoretically be malicious
    #   Random person uploads model with evil code
    #   You run it and get hacked
    #
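    #   In practice:
    #   - TinyLlama uses the standard LlamaForCausalLM class built into transformers,
    #     so no custom code actually runs here and the flag isn't strictly needed
    #   - Well-known repos are lower risk, but popularity is not a security audit
    #   - Don't use trust_remote_code=True on sketchy random models
    #
    # For TinyLlama, True is harmless; the default (False) should work too
    #
    # WHY: Some models need custom code to load; only enable it for repos you trust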

    torch_dtype=torch.bfloat16,
    # torch_dtype = precision for the parts of the model that are NOT quantized
    #
    # Wait, didn't we already set 4-bit quantization?
    # Yes, but bitsandbytes only quantizes the big linear layers
    # Everything else (embeddings, norm layers, usually the output head)
    # stays in a regular float format, and torch_dtype picks which one
    #
    # The flow:
    #   1. Load weights (originally float16 or float32 on disk)
    #   2. Convert to torch_dtype (bfloat16)
    #   3. Quantize the linear layers to 4-bit for storage
    #   4. During forward pass: decompress to bnb_4bit_compute_dtype for math
    #      (also bfloat16 here, so everything lines up)
    #
    # bfloat16 vs float16:
    #   float16 = 16 bits, good precision, but LIMITED RANGE
    #             Can overflow on big numbers (gives infinity)
    #             
    #   bfloat16 = 16 bits, SAME RANGE as float32, less precision
    #              Much harder to overflow, slightly less accurate
    #              Google invented it specifically for ML training
    #
    # For training, bfloat16 is safer:
    #   - Far fewer overflow issues
    #   - Gradients stay stable
    #   - Basically same speed as float16 on GPUs with native bfloat16 support
    #
    # NOTE: Recent transformers versions warn that this argument is "deprecated"
    #       They want you to use `dtype` instead of `torch_dtype`
    #       The old name still works for now; switch to `dtype=` on newer versions
    #
    # WHY: Keeps the non-quantized layers and the compute path in one stable dtype
)

print("Model loaded!")
# Success message - model is now in GPU memory
#
# At this point:
#   - 1.1 billion parameters are loaded
#   - Compressed to 4-bit
#   - Sitting in GPU memory
#   - Ready to process text (but not ready for TRAINING yet)
#
# If you got here without errors, you're in good shape!
# Common errors that would stop you here:
#   - CUDA out of memory (model too big, try smaller one)
#   - Model not found (typo in model_name)
#   - trust_remote_code required (set it to True)
#
# WHY: Confirmation that loading succeeded

if torch.cuda.is_available():
    print(f"GPU memory used: {torch.cuda.memory_allocated()/1e9:.2f} GB")
# Let's see how much GPU memory the model is using
#
# torch.cuda.is_available() = check if we have a GPU
#   True = GPU exists and CUDA works
#   False = no GPU or CUDA broken
#
# torch.cuda.memory_allocated() = bytes currently used by PyTorch tensors
#   This is the actual memory YOUR stuff is using
#   Not total GPU memory, not memory used by other programs
#   Just PyTorch tensors (mainly our model weights)
#
# / 1e9 = divide by 1,000,000,000 = convert bytes to gigabytes
# :.2f = format with 2 decimal places
#
# OUTPUT: "GPU memory used: 0.77 GB"
#
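# 0.77 GB for a 1.1 billion parameter model!
# Without quantization it would be ~2.2 GB (float16) or ~4.4 GB (float32)
# That's roughly 3-6x memory savings from 4-bit quantization!
# (It's more than the 0.55 GB from the pure 4-bit math because some parts,
#  like the embeddings, are not quantized and stay in bfloat16)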
#
# This 0.77 GB is just the MODEL
# Training will use more:
#   - Gradients (for trainable LoRA params only)
#   - Optimizer states
#   - Activations (intermediate values)
#   - Input batches
#
# But LoRA keeps this small since we only train ~4% of params
#
# WHY: Verify quantization worked and see actual memory usage

# ============================================================
# WHAT WE HAVE NOW
# ============================================================
#
# model = a neural network with:
#   - 22 transformer layers
#   - 1.1 billion weights (quantized to 4-bit)
#   - Pre-trained knowledge from 3 trillion tokens
#   - Chat abilities from instruction tuning
#
# But the 4-bit weights can't be updated directly
# We can use it for inference (generating text)
# We CANNOT fine-tune it yet (no trainable LoRA params added)
#
# Next step: prepare_model_for_kbit_training()
# That will set up the model to be trainable with LoRA

Loading model (might take a minute)...


`torch_dtype` is deprecated! Use `dtype` instead!


Model loaded!
GPU memory used: 0.77 GB
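
Why 0.77 GB and not the 0.55 GB you'd get from "1.1B params x 4 bits"? A plausible explanation is that only the linear layers inside the transformer blocks get quantized, while the embeddings and output head stay in 16-bit. Here's a back-of-the-envelope check. Everything in it is an assumption to verify for yourself: the layer dimensions are what I believe TinyLlama's `config.json` contains, and which modules bitsandbytes leaves unquantized is assumed, not confirmed by this notebook.

```python
# Assumed TinyLlama-1.1B dimensions (check config.json before relying on them)
VOCAB, HIDDEN, LAYERS = 32_000, 2_048, 22
INTERMEDIATE = 5_632  # MLP width
KV_DIM = 256          # key/value projection width (grouped-query attention)

attention = 2 * HIDDEN * HIDDEN + 2 * HIDDEN * KV_DIM  # q, o + k, v projections
mlp = 3 * HIDDEN * INTERMEDIATE                        # gate, up, down projections
linear_params = LAYERS * (attention + mlp)             # assumed quantized to 4-bit
embedding_params = 2 * VOCAB * HIDDEN                  # embeddings + output head, assumed bfloat16

quantized_gb = linear_params * 0.5 / 1e9     # 4 bits = 0.5 bytes per weight
unquantized_gb = embedding_params * 2 / 1e9  # bfloat16 = 2 bytes per weight
total_gb = quantized_gb + unquantized_gb

print(f"4-bit linear layers: {quantized_gb:.2f} GB")
print(f"16-bit embeddings:   {unquantized_gb:.2f} GB")
print(f"Estimated total:     {total_gb:.2f} GB")
```

Under these assumptions the estimate lands close to the measured 0.77 GB; the small remaining gap would be consistent with quantization constants and norm weights, which this sketch ignores.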


In [11]:
# PREPARE FOR TRAINING
model = prepare_model_for_kbit_training(
    # prepare_model_for_kbit_training = "get this quantized model ready for training"
    #
    # THE PROBLEM:
    #   We loaded a 4-bit quantized model
    #   Quantized models are WEIRD and need special handling
    #   If you just try to train them directly, things break:
    #     - Gradients don't flow properly
    #     - Numerical instability (NaN losses)
    #     - Some layers don't update correctly
    #     - Memory usage is suboptimal
    #
    # THE SOLUTION:
    #   This function does ALL the behind-the-scenes fixes
    #   One function call, all problems solved
    #
    # It comes from the PEFT library (Hugging Face's library that implements LoRA)
    # They figured out all the edge cases so we don't have to
    #
    # WHY WE NEED IT: Quantized models need special prep before training works

    model,
    # The model we just loaded
    # This function modifies it IN PLACE but also returns it
    # We reassign: model = prepare_model_for_kbit_training(model)
    # Now 'model' refers to the prepared version
    #
    # WHAT THIS FUNCTION DOES TO THE MODEL:
    #
    # 1. FREEZES ALL PARAMETERS
    #    for param in model.parameters():
    #        param.requires_grad = False
    #
    #    requires_grad = "should we calculate gradients for this?"
    #    False = frozen, won't be trained
    #    True = trainable, will be updated
    #
    #    We freeze EVERYTHING because:
    #    - 4-bit weights can't be trained directly anyway
    #    - LoRA will add NEW trainable params on top later
    #    - We only want to train those new params
    #
    # 2. CASTS THE NON-QUANTIZED 16-BIT PARAMS TO FLOAT32
    #    Roughly (simplified):
    #    for param in model.parameters():
    #        if param.dtype in (torch.float16, torch.bfloat16):
    #            param.data = param.data.to(torch.float32)
    #
    #    This mainly matters for the norm layers (they stabilize training)
    #    These are sensitive to precision
    #    4-bit or even float16 can make them unstable
    #    float32 keeps them accurate and stable
    #
    #    The norm layers are tiny, so that part is nearly free
    #    (In recent PEFT versions the embeddings and output head get upcast too,
    #     which costs a bit more memory)
    #    But huge impact on training stability
    #
    # 3. ENABLES INPUT GRADIENTS FOR EMBEDDING LAYER
    #    model.enable_input_require_grads()
    #
    #    Embedding layer = converts token IDs to vectors
    #    Its weights are frozen, so its output normally carries no gradient info
    #    With gradient checkpointing on, that would cut off the backward pass
    #    This makes the embedding OUTPUT require grads so the training signal
    #    can flow back to the LoRA layers
    #
    # 4. SETS UP GRADIENT CHECKPOINTING (if enabled)
    #    This is a memory optimization technique (see below)
    #
    # WHY ONE FUNCTION:
    #   - These steps are fiddly and easy to mess up
    #   - Order matters for some of them
    #   - Different model architectures need slightly different handling
    #   - The function handles all edge cases automatically

    use_gradient_checkpointing=True,
    # gradient_checkpointing = a MEMORY SAVING trick
    #
    # THE MEMORY PROBLEM:
    #   During training, we need to store "activations"
    #   Activations = intermediate values from forward pass
    #
    #   Forward pass: Input -> Layer 1 -> Layer 2 -> ... -> Layer 22 -> Output
    #   
    #   Each layer produces activations:
    #   Input -> [Act1] -> [Act2] -> [Act3] -> ... -> [Act22] -> Output
    #
    #   WHY KEEP THEM?
    #   For backward pass (calculating gradients), we need these values
    #   Gradient at Layer 5 depends on activations FROM Layer 5
    #   
    #   Normally: store ALL activations (lots of memory!)
    #   22 layers x batch_size x sequence_length x hidden_dim = HUGE
    #
    # THE GRADIENT CHECKPOINTING TRICK:
    #   Instead of storing all activations:
    #   1. Only store activations at certain "checkpoint" layers
    #   2. During backward pass, RECOMPUTE the others on the fly
    #
    #   Example (simplified):
    #   Normal:      Store [Act1, Act2, Act3, Act4, Act5, Act6] = 6 units memory
    #   Checkpointed: Store [Act1, Act3, Act5] = 3 units memory
    #                 When we need Act2: recompute from Act1
    #                 When we need Act4: recompute from Act3
    #
    # THE TRADEOFF:
    #   Memory: ~30-50% LESS (big win!)
    #   Speed: ~20-30% SLOWER (recomputing takes time)
    #
    #   For us: Memory is the bottleneck, not speed
    #   We're on a free GPU with limited VRAM
    #   Trading speed for memory is a GREAT deal
    #
    # True = YES, use gradient checkpointing (save memory)
    # False = NO, keep all activations (faster but more memory)
    #
    # WHEN TO USE:
    #   True: When memory is tight (free Colab, small GPU)
    #   False: When you have tons of VRAM (A100 80GB etc.)
    #
    # For free GPUs, ALWAYS use True
    #
    # WHY: We'd run out of memory during training without this
)

print("Model prepared for training!")
# Success message - model is now ready for LoRA
#
# WHAT'S DIFFERENT NOW:
#   Before prepare_model_for_kbit_training():
#   - Model loaded and quantized
#   - Some params (norms, embeddings) could still have requires_grad=True
#   - Would crash or be unstable during training
#   - No gradient checkpointing
#
#   After prepare_model_for_kbit_training():
#   - All params have requires_grad=False (frozen)
#   - Norm layers are float32 (stable)
#   - Gradient checkpointing enabled (memory efficient)
#   - Ready for LoRA adapters to be added
#
# NEXT STEP:
#   Add LoRA adapters with get_peft_model()
#   Those adapters will be the ONLY trainable parts
#   ~4% of parameters, but that's all we need!
#
# WHY: Confirms preparation succeeded

# ============================================================
# MEMORY SAVINGS SUMMARY SO FAR
# ============================================================
#
# 1. Quantization (4-bit):
#    - Model weights: 4.4 GB -> 0.77 GB
#    - 5.7x smaller!
#
# 2. Gradient Checkpointing:
#    - Activations: ~2-4 GB -> ~1-2 GB (during training)
#    - ~2x smaller!
#
# 3. LoRA (coming next):
#    - Only train 4% of params
#    - Gradients + optimizer states for 4% instead of 100%
#    - ~25x smaller!
#
# Combined: A model that would need 20+ GB
#           Now fits in 4-5 GB during training
#           That's why we can train on free Colab!

Model prepared for training!
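
To make the checkpointing trade-off concrete, here's a toy sketch in plain Python (no PyTorch, no tensors). The "layers" are made-up functions, and `forward_store_all`, `forward_checkpointed`, and `recompute_activation` are hypothetical helper names, not part of PEFT or PyTorch. Real gradient checkpointing lives in `torch.utils.checkpoint` and is smarter about what it keeps, but the idea is the same: store only some activations and recompute the rest when the backward pass needs them.

```python
# Toy illustration of gradient checkpointing (plain Python, no tensors).
# Assumption: each "layer" is a cheap function; real layers are transformer blocks.

def make_layers(n):
    # Layer i adds (i + 1) to its input, so every activation is easy to check by hand.
    return [lambda x, i=i: x + (i + 1) for i in range(n)]

def forward_store_all(layers, x):
    """Normal training: keep every activation for the backward pass."""
    acts = []
    for layer in layers:
        x = layer(x)
        acts.append(x)
    return acts

def forward_checkpointed(layers, x, every=2):
    """Checkpointed: keep the input plus the output of every `every`-th layer."""
    saved = {0: x}  # key 0 = model input, key k = output of layer k
    for k, layer in enumerate(layers, start=1):
        x = layer(x)
        if k % every == 0:
            saved[k] = x
    return saved

def recompute_activation(layers, saved, k):
    """Rebuild the output of layer k from the nearest earlier checkpoint."""
    start = max(i for i in saved if i <= k)
    x = saved[start]
    extra_calls = 0
    for layer in layers[start:k]:
        x = layer(x)
        extra_calls += 1
    return x, extra_calls

layers = make_layers(6)
full = forward_store_all(layers, 0)
saved = forward_checkpointed(layers, 0, every=2)

print("stored (normal):      ", len(full), "activations")
print("stored (checkpointed):", len(saved) - 1, "activations (+ the input)")

value, extra = recompute_activation(layers, saved, 5)
print("layer 5 output recomputed:", value, "using", extra, "extra layer call(s)")
```

Half the activations are stored, and the price is re-running a layer whenever a missing activation is needed. That's the memory-for-speed trade described in the cell above.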


## Chapter 6: LoRA Config

In [12]:
# LORA CONFIGURATION
# These settings control the adapter layers
#
# QUICK LORA RECAP:
#   Instead of training ALL 1.1 billion parameters, we:
#   1. Freeze the original model (don't touch it)
#   2. Add tiny "adapter" matrices to certain layers
#   3. Only train those adapters (~50 million params)
#
#   Close to the same results, at a fraction of the memory and time
#
# HOW LORA WORKS (the actual math):
#   Original layer: y = Wx (W is a huge matrix, like 2048x2048)
#   
#   Instead of modifying W directly, LoRA adds a BYPASS:
#   y = Wx + BAx
#       ^^   ^^^
#       |    |__ LoRA adapter (two small matrices multiplied)
#       |_______ Original weights (frozen, unchanged)
#
#   B and A are MUCH smaller:
#   - A is (r x 2048) - compresses input to small "r" dimensions
#   - B is (2048 x r) - expands back to original size
#   - r = "rank" = how small? (we're using 64)
#
#   Original W: 2048 x 2048 = 4,194,304 parameters
#   LoRA A+B: (64 x 2048) + (2048 x 64) = 262,144 parameters
#   That's 16x fewer parameters for ONE layer!
#
# WHY WE NEED CONFIG: To tell LoRA exactly how to set up these adapters

lora_config = LoraConfig(
    # LoraConfig = container for all LoRA settings
    # We define the config here, apply it to model in next cell

    r=64,
    # r = RANK of the adapter matrices
    #
    # This is THE most important LoRA parameter
    #
    # WHAT IT MEANS:
    #   The "bottleneck" dimension in the A and B matrices
    #   A compresses to r dimensions, B expands back out
    #
    #   Think of it like:
    #   - Original: 2048-dimensional information
    #   - LoRA: squeeze through 64-dimensional bottleneck
    #   - Then expand back to 2048
    #
    #   Smaller r = tighter squeeze = less capacity = faster/smaller
    #   Larger r = looser squeeze = more capacity = slower/bigger
    #
    # HOW TO CHOOSE:
    #   r=4 or r=8:   Very small, simple tasks only
    #                 Like: changing response style, minor tweaks
    #                 
    #   r=16:         Good default for simple fine-tuning
    #                 Most tutorials use this
    #
    #   r=32:         Better for moderate complexity
    #                 Good balance of quality and efficiency
    #
    #   r=64:         What we're using - good for complex tasks
    #                 Learning new domain knowledge (mental health)
    #                 More capacity to learn new patterns
    #
    #   r=128+:       Heavy lifting, approaching full fine-tuning quality
    #                 Diminishing returns past 64-128 usually
    #
    # MEMORY IMPACT:
    #   r=16:  ~12 million trainable params
    #   r=32:  ~25 million trainable params
    #   r=64:  ~50 million trainable params (us)
    #   r=128: ~100 million trainable params
    #
    #   Still WAY less than 1.1 billion!
    #
    # WHY 64:
    #   We're teaching the model a whole new domain (mental health)
    #   Need enough capacity to learn the vocabulary and response patterns
    #   64 is generous but still very efficient
    #
    # r = rank of the adapter matrices
    # Higher = more capacity to learn, but more memory
    # 64 is good for complex tasks

    lora_alpha=128,
    # lora_alpha = SCALING FACTOR for LoRA outputs
    #
    # WHAT IT DOES:
    #   Remember: y = Wx + BAx
    #   Actually it's: y = Wx + (alpha/r) * BAx
    #                        ^^^^^^^^^^^^^
    #                        scaling factor
    #
    #   With r=64 and alpha=128:
    #   Scaling = 128/64 = 2.0
    #   LoRA output gets multiplied by 2.0
    #
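    # WHY SCALE?
    #   B starts at exactly zero (A starts random), so BA = 0 at step 0
    #   That means the model begins identical to the base model
    #   alpha/r controls how strongly the learned update counts
    #   Dividing by r keeps that strength comparable when you change r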
    #
    #   Higher alpha = LoRA has STRONGER influence
    #   Lower alpha = LoRA has WEAKER influence
    #
    # COMMON PATTERNS:
    #   alpha = r:       Scaling = 1.0 (neutral)
    #   alpha = 2*r:     Scaling = 2.0 (stronger, what we use)
    #   alpha = r/2:     Scaling = 0.5 (weaker)
    #
    # HOW TO CHOOSE:
    #   Start with alpha = r (scaling = 1.0)
    #   If model doesn't learn enough: increase alpha
    #   If model overfits/goes crazy: decrease alpha
    #
    #   alpha = 2*r is a popular choice for stronger learning
    #   That's what we're doing: 128 = 2 * 64
    #
    # RELATIONSHIP WITH LEARNING RATE:
    #   alpha affects how much LoRA contributes
    #   learning_rate affects how fast weights update
    #   They interact! If you change one, might need to adjust other
    #
    #   Rule of thumb: If you double alpha, might halve learning rate
    #
    # WHY 128:
    #   With r=64, alpha=128 gives 2x scaling
    #   Strong enough to learn new patterns
    #   Not so strong that it destabilizes training
    #
    # Scaling factor. Output scaled by alpha/r
    # 128/64 = 2x scaling

    lora_dropout=0.1,
    # lora_dropout = REGULARIZATION for LoRA layers
    #
    # WHAT IS DROPOUT:
    #   During training, randomly "drop" some connections
    #   Set them to zero temporarily
    #   Different connections dropped each batch
    #
    #   0.1 = zero out 10% of the values going into the LoRA path, picked at random each step
    #
    # WHY IT HELPS:
    #   Prevents OVERFITTING
    #   
    #   Overfitting = model memorizes training data
    #                 instead of learning general patterns
    #   
    #   "Q: Hi" -> "A: Hello there!" (memorized exact response)
    #   vs
    #   "Q: Hi" -> Understanding how to greet (learned the pattern)
    #
    #   Dropout forces the model to not rely on any single connection
    #   Has to learn robust patterns that work even with missing links
    #   Like studying with random pages torn out of your textbook
    #   Forces you to understand concepts, not memorize pages
    #
    # HOW TO CHOOSE:
    #   0.0:  No dropout (risk of overfitting)
    #   0.05: Light dropout (small datasets, simple tasks)
    #   0.1:  Standard dropout (what we use - good default)
    #   0.2:  Heavy dropout (very small data, high overfit risk)
    #   0.3+: Rarely used, too aggressive
    #
    # OUR SITUATION:
    #   661 training examples = pretty small dataset
    #   Risk of overfitting is real
    #   0.1 dropout helps prevent that
    #
    # NOTE: Dropout only happens during TRAINING
    #       During inference, all connections are used
    #       (PyTorch rescales the kept values during training,
    #        so inference needs no extra adjustment)
    #
    # Dropout for regularization (prevents overfitting)

    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",  # Attention layers
        "gate_proj", "up_proj", "down_proj",      # MLP layers
    ],
    # target_modules = WHICH layers get LoRA adapters
    #
    # TRANSFORMER ARCHITECTURE RECAP:
    #   Each transformer layer has two main parts:
    #
    #   1. ATTENTION (the "thinking" part)
    #      - Decides which tokens to focus on
    #      - "When generating next word, look at these previous words"
    #      - Has 4 projection matrices:
    #        q_proj = Query projection  ("what am I looking for?")
    #        k_proj = Key projection    ("what do I contain?")
    #        v_proj = Value projection  ("what info do I have?")
    #        o_proj = Output projection ("combine attention results")
    #
    #   2. MLP/FFN (the "processing" part)
    #      - Transforms information after attention
    #      - Where factual knowledge is often stored
    #      - Has 3 projection matrices (in Llama architecture):
    #        gate_proj = Gating mechanism ("how much to let through")
    #        up_proj   = Expand to larger dimension
    #        down_proj = Compress back down
    #
    # WHICH TO TARGET?
    #   Minimal: ["q_proj", "v_proj"]
    #     - Just attention, most common in early LoRA papers
    #     - 2 modules per layer, fewer params
    #
    #   Standard: ["q_proj", "k_proj", "v_proj", "o_proj"]
    #     - All attention layers
    #     - 4 modules per layer
    #     - Good for learning new patterns
    #
    #   Full (what we use): attention + MLP
    #     - 7 modules per layer
    #     - Maximum learning capacity
    #     - Better for learning new knowledge/domain
    #
    # WHY WE TARGET ALL 7:
    #   We're teaching mental health domain knowledge
    #   That knowledge needs to be stored somewhere
    #   MLP layers are believed to store factual knowledge
    #   Attention layers learn patterns and relationships
    #   We want BOTH, so we target everything
    #
    # MORE MODULES = MORE TRAINABLE PARAMS:
    #   2 modules (q,v only):    ~9 million params
    #   4 modules (attention):   ~18 million params
    #   7 modules (all):         ~50 million params
    #   (at r=64 on TinyLlama; the 3 MLP modules are the biggest, so they add the most)
    #
    #   Still just 4% of total model!
    #
    # HOW TO FIND MODULE NAMES:
    #   Different models have different names!
    #   Llama/TinyLlama: q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj
    #   GPT-2: c_attn, c_proj, c_fc
    #   BERT: query, key, value, dense
    #
    #   To find them: print([n for n, _ in model.named_modules()])
    #   Or check the model's documentation
    #
    # Which layers get LoRA adapters
    # More modules = more learning capacity

    bias="none",
    # bias = whether to train bias terms in LoRA layers
    #
    # WHAT ARE BIASES:
    #   Linear layer: y = Wx + b
    #                     ^^   ^
    #                     |    |__ bias (added to output)
    #                     |_______ weights (multiplied with input)
    #
    #   Weights are the big matrices
    #   Biases are small vectors (one number per output dimension)
    #
    # OPTIONS:
    #   "none" = don't train any biases (most efficient)
    #   "all"  = train every bias term in the model
    #   "lora_only" = only train biases of the modules that got LoRA adapters
    #
    # WHY "none":
    #   Biases are tiny compared to weights
    #   Training them adds complexity but minimal benefit
    #   Original LoRA paper found "none" works great
    #   Keeps things simple and efficient
    #
    # WHEN TO USE OTHER OPTIONS:
    #   "all": If you're really squeezing for every bit of quality
    #          Marginal improvement, rarely worth it
    #   "lora_only": Middle ground, almost never used
    #
    # WHY: Maximum efficiency, following best practices

    task_type=TaskType.CAUSAL_LM,
    # task_type = what kind of task are we doing?
    #
    # OPTIONS:
    #   TaskType.CAUSAL_LM     = next token prediction (us!)
    #                            GPT-style text generation
    #
    #   TaskType.SEQ_2_SEQ_LM  = encoder-decoder models
    #                            Like T5, for translation/summarization
    #
    #   TaskType.SEQ_CLS       = sequence classification
    #                            "Is this email spam?" -> yes/no
    #
    #   TaskType.TOKEN_CLS     = token classification
    #                            NER: "John went to Paris" -> [PERSON, O, O, LOCATION]
    #
    #   TaskType.QUESTION_ANS  = extractive QA
    #                            Find answer span in context
    #
    #   TaskType.FEATURE_EXTRACTION = get embeddings
    #                                  Not really training, just using model
    #
    # WHY CAUSAL_LM:
    #   TinyLlama is a causal (autoregressive) language model
    #   It generates text left-to-right, predicting next token
    #   That's exactly what we want for chat responses
    #
    # WHAT THIS AFFECTS:
    #   How LoRA layers are set up internally
    #   Mostly just tells PEFT which config defaults to use
    #
    # WHY: Tells LoRA we're doing text generation, not classification etc.
)

print("LoRA config:")
print(f"  Rank: {lora_config.r}")
# Output: "Rank: 64"
# Confirms our bottleneck dimension

print(f"  Alpha: {lora_config.lora_alpha}")
# Output: "Alpha: 128"
# Confirms our scaling factor

print(f"  Scaling: {lora_config.lora_alpha / lora_config.r}")
# Output: "Scaling: 2.0"
# The actual multiplier applied to LoRA outputs
# alpha/r = 128/64 = 2.0
# LoRA contributions are doubled

print(f"  Modules: {lora_config.target_modules}")
# Output: "Modules: {'down_proj', 'q_proj', 'gate_proj', 'up_proj', 'v_proj', 'o_proj', 'k_proj'}"
# Shows all 7 modules we're targeting
# Note: it's a set so order is random, that's fine
#
# WHY PRINT ALL THIS: Verify config is what we expect before applying

# ============================================================
# WHAT WE'VE CONFIGURED
# ============================================================
#
# LoRA Adapters will be:
#   - Rank 64 (good capacity for learning new domain)
#   - Scaled 2x (alpha=128, strong influence)
#   - 10% dropout (prevent overfitting)
#   - Applied to 7 module types (attention + MLP)
#   - Across all 22 transformer layers
#
# This will create:
#   22 layers × 7 modules × 2 matrices (A and B) = 308 LoRA matrices
#   Total: ~50 million trainable parameters
#   That's 4.4% of the 1.1 billion total
#
# NEXT STEP:
#   Apply this config to the model with get_peft_model()

LoRA config:
  Rank: 64
  Alpha: 128
  Scaling: 2.0
  Modules: {'down_proj', 'q_proj', 'gate_proj', 'up_proj', 'v_proj', 'o_proj', 'k_proj'}
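
If the `y = Wx + (alpha/r) * BAx` formula still feels abstract, here's a minimal sketch of it in plain Python with toy sizes (d=4, r=2). `matvec` and `lora_forward` are illustrative helpers written for this example, not PEFT functions, and the matrix values are made up. It also shows why starting B at zero matters: before any training, the adapter changes nothing.

```python
# Minimal LoRA forward pass in plain Python (toy sizes: d=4, r=2).
# Assumption: matvec and lora_forward are illustrative helpers, not PEFT APIs.

def matvec(M, v):
    """Multiply matrix M (a list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def lora_forward(W, A, B, x, alpha, r):
    """y = W x + (alpha / r) * B (A x); W is frozen, A and B are the adapter."""
    base = matvec(W, x)                  # frozen path
    low_rank = matvec(B, matvec(A, x))   # squeeze to r dims, then expand back
    scale = alpha / r
    return [b + scale * l for b, l in zip(base, low_rank)]

d, r, alpha = 4, 2, 4                    # alpha = 2 * r, same ratio as this notebook
W = [[1, 0, 0, 0],
     [0, 2, 0, 0],
     [0, 0, 3, 0],
     [0, 0, 0, 4]]                       # frozen d x d weight
A = [[1, 1, 0, 0],
     [0, 0, 1, 1]]                       # r x d, non-zero from the start (random in practice)
B = [[0, 0] for _ in range(d)]           # d x r, starts at zero

x = [1, 1, 1, 1]
y_start = lora_forward(W, A, B, x, alpha, r)
print("at init:     ", y_start)          # same as W x, because B is all zeros

B[0] = [0.5, 0.0]                        # pretend training nudged one row of B
y_trained = lora_forward(W, A, B, x, alpha, r)
print("after update:", y_trained)        # only the first output moved, scaled by alpha/r
```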


In [13]:
# APPLY LORA TO MODEL
model = get_peft_model(model, lora_config)
# get_peft_model = "take this model and add LoRA adapters to it"
#
# WHAT THIS FUNCTION DOES:
#   1. Takes your frozen base model
#   2. Reads the lora_config we just created
#   3. Finds all the target_modules in the model
#   4. Injects LoRA adapter matrices (A and B) into each one
#   5. Returns a wrapped model that looks the same but has adapters
#
# BEFORE get_peft_model():
#   model = frozen TinyLlama
#   All 1.1B params have requires_grad=False
#   Nothing is trainable
#   Just a static inference model
#
# AFTER get_peft_model():
#   model = TinyLlama + LoRA adapters
#   Original 1.1B params still frozen (requires_grad=False)
#   NEW ~50M LoRA params are trainable (requires_grad=True)
#   Ready for fine-tuning!
#
# WHAT'S HAPPENING UNDER THE HOOD:
#   For each target module (q_proj, k_proj, etc.):
#
#   BEFORE:
#   input -> [q_proj weights] -> output
#            (frozen 2048x2048)
#
#   AFTER:
#   input -> [q_proj weights] -----> (+) -> output
#            (frozen 2048x2048)       ^
#                                     |
#            [LoRA A] -> [LoRA B] ----+
#            (64x2048)   (2048x64)
#            (trainable) (trainable)
#
#   The original path is unchanged (frozen)
#   LoRA adds a parallel bypass path (trainable)
#   Outputs are summed together
#
# THE WRAPPER:
#   get_peft_model() returns a PeftModel object
#   It wraps the original model
#   From the outside, it behaves the same:
#     - Same input format
#     - Same output format
#     - Same forward() method
#   But internally, LoRA adapters are doing their thing
#
# WHY REASSIGN model = ...:
#   The function returns a NEW wrapped model
#   We want to use the wrapped version going forward
#   So we reassign the variable 'model' to point to it
#
# WHY WE NEED IT: This is what actually adds the trainable LoRA layers

# See how many parameters we're training
model.print_trainable_parameters()
# .print_trainable_parameters() = show parameter breakdown
#
# OUTPUT:
#   "trainable params: 50,462,720 || all params: 1,150,511,104 || trainable%: 4.3861"
#
# LET'S BREAK THAT DOWN:
#
# trainable params: 50,462,720
#   These are the LoRA adapter parameters
#   The ONLY things that will be updated during training
#   ~50 million numbers (A starts random, B starts at zero) that get optimized
#
#   Where do 50M params come from?
#   - 22 transformer layers
#   - 7 target modules per layer (q,k,v,o,gate,up,down)
#   - Each module gets 2 matrices: A and B
#   - Matrix sizes depend on layer dimensions and rank
#
#   Rough math for one q_proj:
#     A matrix: 64 × 2048 = 131,072 params
#     B matrix: 2048 × 64 = 131,072 params
#     Total: 262,144 params per module
#
#   262,144 × 7 modules × 22 layers ≈ 40 million
#   (Actual is 50M: k_proj/v_proj are smaller than q_proj,
#    but the three MLP projections are much bigger)
#
# all params: 1,150,511,104
#   Total parameters in the ENTIRE model
#   Original TinyLlama (1.1B) + LoRA adapters (50M)
#   1,100,048,384 + 50,462,720 = 1,150,511,104
#
#   The base model params are still there, just frozen
#   They do the heavy lifting during forward pass
#   But they don't change during training
#
# trainable%: 4.3861
#   What percentage of total params are trainable
#   50,462,720 / 1,150,511,104 × 100 = 4.39%
#
#   WE'RE ONLY TRAINING 4.4% OF THE MODEL!
#
#   This is the magic of LoRA:
#   - 96% of model is frozen (no gradients, no optimizer states)
#   - Only 4% needs gradients and optimizer
#   - Memory for training is ~25x smaller than full fine-tuning
#
# You'll often see ~1-2% quoted for LoRA setups
# We got 4.4% because:
#   - We used r=64 (higher rank = more params)
#   - We targeted 7 modules (more modules = more params)
#   - Still very efficient! Full fine-tuning would be 100%

# ============================================================
# WHAT JUST HAPPENED - VISUAL
# ============================================================
#
# BEFORE (frozen model):
#   Layer 1: [frozen weights] -> output
#   Layer 2: [frozen weights] -> output
#   ...
#   Layer 22: [frozen weights] -> output
#
#   Trainable: 0 params (0%)
#
# AFTER (model with LoRA):
#   Layer 1: [frozen weights] + [LoRA adapters] -> output
#   Layer 2: [frozen weights] + [LoRA adapters] -> output
#   ...
#   Layer 22: [frozen weights] + [LoRA adapters] -> output
#
#   Trainable: 50M params (4.4%)
#   Frozen: 1.1B params (95.6%)
#
# ============================================================
# WHY THIS IS AMAZING
# ============================================================
#
# FULL FINE-TUNING (the old way):
#   - Train all 1.1B parameters
#   - Need to store gradients for 1.1B params
#   - Need optimizer states for 1.1B params (2x for Adam)
#   - Memory: model(4GB) + gradients(4GB) + optimizer(8GB) = 16GB+
#   - For 7B model: 28GB + 28GB + 56GB = 112GB (way beyond a free GPU!)
#
# LORA FINE-TUNING (what we're doing):
#   - Train only 50M parameters (4.4%)
#   - Gradients for 50M params only
#   - Optimizer states for 50M params only
#   - Memory: model(0.7GB) + gradients(0.2GB) + optimizer(0.4GB) ≈ 1.3GB
#   - Rest is activations and batch data (~2-3GB)
#   - Total: ~4GB (fits on free GPU!)
#
# AND THE RESULTS ARE ALMOST AS GOOD:
#   In practice, LoRA often gets close to full fine-tuning quality
#   For our task (mental health chat), it's plenty
#   The base model already knows language
#   We just need to nudge it toward our specific domain
#
# ============================================================
# BONUS: OTHER USEFUL PEFT MODEL METHODS
# ============================================================
#
# model.print_trainable_parameters()  <- what we just used
# model.get_nb_trainable_parameters() <- returns (trainable, total) as a tuple
# model.save_pretrained("path/")      <- save ONLY the LoRA weights
# model.merge_and_unload()            <- merge LoRA into base model
# with model.disable_adapter(): ...   <- context manager: LoRA is off inside the block
#                                        and back on when the block exits
#
# The save_pretrained() is KEY:
#   Full model: ~2.2 GB on disk
#   LoRA adapters only: ~200 MB on disk
#   You save the small adapters, load base model + adapters later
#   Much easier to share and store!

trainable params: 50,462,720 || all params: 1,150,511,104 || trainable%: 4.3861
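
Want to check that 50,462,720 by hand? Here's a small sketch that rebuilds it from the layer shapes. The shapes are an assumption taken from TinyLlama-1.1B's published config (hidden size 2048, MLP size 5632, 4 key/value heads of size 64, so k_proj and v_proj output 256), not something printed in this notebook. `lora_params` and `count_trainable` are hypothetical helper names.

```python
# Reproduce the trainable-parameter count by hand.
# Assumption: layer shapes come from TinyLlama-1.1B's published config,
# not from anything printed in this notebook.

RANK = 64
NUM_LAYERS = 22
HIDDEN, KV_DIM, MLP_DIM = 2048, 256, 5632

# (in_features, out_features) of each targeted linear layer
MODULE_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, MLP_DIM),
    "up_proj": (HIDDEN, MLP_DIM),
    "down_proj": (MLP_DIM, HIDDEN),
}

def lora_params(in_features, out_features, rank):
    """A is (rank x in_features), B is (out_features x rank)."""
    return rank * in_features + out_features * rank

def count_trainable(target_modules, rank=RANK, num_layers=NUM_LAYERS):
    """Sum the adapter sizes for one layer, then multiply by the layer count."""
    per_layer = sum(lora_params(*MODULE_SHAPES[name], rank) for name in target_modules)
    return per_layer * num_layers

all_seven = count_trainable(MODULE_SHAPES)
attention_only = count_trainable(["q_proj", "k_proj", "v_proj", "o_proj"])
q_and_v = count_trainable(["q_proj", "v_proj"])

print(f"q,v only:       {q_and_v:,}")
print(f"attention only: {attention_only:,}")
print(f"all 7 modules:  {all_seven:,}")

# Rough fp32 training-state size for the adapters, assuming a standard Adam
# optimizer (gradients + 2 moment estimates per trainable parameter).
BYTES_FP32 = 4
grad_gb = all_seven * BYTES_FP32 / 1e9
adam_gb = 2 * grad_gb
print(f"gradients ~{grad_gb:.2f} GB, Adam states ~{adam_gb:.2f} GB")
```

The all-7 total matches the `print_trainable_parameters()` output exactly, and the gradient/optimizer sizes line up with the ~0.2 GB and ~0.4 GB estimates in the cell above.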


## Chapter 7: Data Formatting (CRITICAL!)

**THIS IS WHERE MOST PEOPLE MESS UP!**

TinyLlama was trained with a specific chat format:
```
<|system|>
{system message}</s>
<|user|>
{user message}</s>
<|assistant|>
{assistant response}</s>
```

**The `</s>` after EACH section is CRITICAL!** Without it, the model outputs garbage.

In [14]:
# FORMAT FUNCTION - MUST MATCH TINYLLAMA'S EXACT FORMAT!
#
# THIS IS THE MOST IMPORTANT CELL IN THE ENTIRE NOTEBOOK
# Getting this wrong = garbage output (those <<< < < < << you saw earlier)
# Getting this right = model actually works
#
# WHY FORMAT MATTERS:
#   TinyLlama was TRAINED on a specific chat format
#   During its original training, it saw millions of examples like:
#     <|system|>
#     You are helpful...</s>
#     <|user|>
#     Hello</s>
#     <|assistant|>
#     Hi there!</s>
#
#   The model learned: "When I see this pattern, I know how to respond"
#   
#   If you use a DIFFERENT format during fine-tuning:
#     [INST] Hello [/INST]    <- Llama 2 format
#     or
#     User: Hello\nAssistant: <- Generic format
#
#   The model goes: "WTF is this? I've never seen this pattern"
#   Result: Random garbage, HTML tags, nonsense
#
#   SAME FORMAT = model understands = good outputs
#   DIFFERENT FORMAT = model confused = garbage outputs

def format_prompt(example):
    # This function takes ONE training example and formats it
    #
    # INPUT: example = {'question': 'Hi', 'answer': 'Hello there!'}
    # OUTPUT: {'text': '<|system|>\nYou are...</s>\n<|user|>\nHi</s>\n...'}
    #
    # We'll apply this to every example in the dataset using .map()
    
    """
    Format Q&A into TinyLlama's chat format.
    
    CRITICAL: </s> must come after EACH section!
    - After system message
    - After user message  
    - After assistant message
    """
    # Docstring explaining what this function does
    # The CRITICAL note is there because this is where most people mess up
    
    question = example['question']
    # Extract the question from the example dict
    # example['question'] = "Hi" or "How do I manage anxiety?" etc.
    
    answer = example['answer']
    # Extract the answer from the example dict
    # example['answer'] = "Hello there!" or "Here are some tips..." etc.
    
    # CORRECT TinyLlama format with </s> after each part
    prompt = f"""<|system|>
You are a helpful mental health assistant. Provide supportive, empathetic, and informative responses.</s>
<|user|>
{question}</s>
<|assistant|>
{answer}</s>"""
    #
    # LET'S BREAK THIS DOWN LINE BY LINE:
    #
    # <|system|>
    #   The system message marker
    #   Tells model: "What follows is instructions about WHO you are"
    #   <| and |> are special delimiters TinyLlama recognizes
    #
    # You are a helpful mental health assistant. Provide supportive, empathetic, and informative responses.</s>
    #   The actual system prompt - defines the AI's personality
    #   </s> at the end = "end of system section"
    #   THIS </s> WAS MISSING IN THE BROKEN VERSION!
    #
    # <|user|>
    #   The user message marker
    #   Tells model: "What follows is from the human"
    #
    # {question}</s>
    #   The actual user question (inserted from our data)
    #   </s> at the end = "end of user section"
    #   THIS </s> WAS ALSO MISSING IN THE BROKEN VERSION!
    #
    # <|assistant|>
    #   The assistant message marker
    #   Tells model: "What follows is what I should say"
    #
    # {answer}</s>
    #   The actual answer (what we want model to learn)
    #   </s> at the end = "end of assistant section"
    #   This one was present in the broken version (only one that was!)
    #
    # THE BROKEN FORMAT (what was causing garbage):
    #   <|system|>
    #   You are helpful...      <- NO </s> HERE!
    #   <|user|>
    #   {question}              <- NO </s> HERE!
    #   <|assistant|>
    #   {answer}</s>
    #
    # THE FIXED FORMAT (what we're using now):
    #   <|system|>
    #   You are helpful...</s>  <- </s> ADDED!
    #   <|user|>
    #   {question}</s>          <- </s> ADDED!
    #   <|assistant|>
    #   {answer}</s>
    #
    # WHY </s> AFTER EACH SECTION:
    #   </s> = End Of Sequence token (token ID 2)
    #   TinyLlama was trained with </s> marking section boundaries
    #   It's like punctuation for the model
    #   
    #   Without </s>:
    #     Model sees: "<|system|>\nYou are helpful\n<|user|>"
    #     Thinks: "Is system message still going? Where does it end?"
    #     Gets confused about boundaries between sections
    #
    #   With </s>:
    #     Model sees: "<|system|>\nYou are helpful</s>\n<|user|>"
    #     Thinks: "System message ended, now user section starts"
    #     Clear boundaries = understands the structure
    #
    # THE f""" SYNTAX:
    #   f = f-string (allows {variable} insertions)
    #   """ = multi-line string (can span multiple lines)
    #   Combined: multi-line string with variable insertion
    #
    # WHY MULTI-LINE:
    #   - Easier to read and verify the format
    #   - Newlines are preserved exactly as written
    #   - No need for \n escape characters everywhere
    #
    # WHITESPACE MATTERS:
    #   The prompt must be EXACTLY like this
    #   No extra spaces, no missing newlines
    #   Even invisible whitespace differences cause problems!
    #
    #   That's why we put the string at column 0 (no indentation)
    #   If we indented it, we'd get extra spaces in the prompt

    return {"text": prompt}
    # Return a dictionary with key "text"
    #
    # WHY A DICT WITH "text":
    #   HuggingFace datasets work with dictionaries
    #   .map() expects function to return dict
    #   "text" is the column name where formatted prompt goes
    #   SFTTrainer will look for "text" column during training
    #
    # The full prompt is now one string in the "text" field
    # Trainer will tokenize this and use it for training

# Test it
sample = train_data[0]
# Get the first example from our dataset
# sample = {'question': 'Hi', 'answer': 'Hello there. Tell me how are you feeling today?'}

formatted = format_prompt(sample)
# Apply our format function to this one example
# formatted = {'text': '<|system|>\nYou are...</s>\n...'}

print("Formatted prompt (notice </s> after each section!):")
print(formatted['text'])
# Print so we can visually verify the format is correct
#
# OUTPUT:
#   <|system|>
#   You are a helpful mental health assistant. Provide supportive, empathetic, and informative responses.</s>
#   <|user|>
#   Hi</s>
#   <|assistant|>
#   Hello there. Tell me how are you feeling today?</s>
#
# WHAT TO CHECK:
#   1. </s> after system message? YES
#   2. </s> after user message (Hi)? YES
#   3. </s> after assistant message? YES
#   4. All role markers present? YES (<|system|>, <|user|>, <|assistant|>)
#   5. Newlines in right places? YES
#
# If any of these are wrong, training might "work" but model outputs garbage
#
# WHY WE TEST ON ONE EXAMPLE:
#   Always inspect your data before training!
#   Easy to make typos in the format string
#   5 seconds of checking saves hours of debugging later

# ============================================================
# THE FULL PICTURE: WHAT MODEL LEARNS
# ============================================================
#
# During training, model sees hundreds of examples like:
#
#   <|system|>
#   You are a helpful mental health assistant...</s>
#   <|user|>
#   Hi</s>
#   <|assistant|>
#   Hello there! How are you feeling today?</s>
#
#   <|system|>
#   You are a helpful mental health assistant...</s>
#   <|user|>
#   I'm feeling anxious</s>
#   <|assistant|>
#   I'm sorry to hear that. Anxiety can be really challenging...</s>
#
# Model learns the PATTERN:
#   "When I see <|system|>...<|user|>...question</s><|assistant|>"
#   "I should generate a helpful mental health response"
#
# The loss function teaches it:
#   - Plain next-token prediction over the formatted text
#   - The part we care about: given everything up to <|assistant|>,
#     predict the answer tokens that come after
#   - Get better at matching the training answers
#
# ============================================================
# DURING INFERENCE (using the trained model)
# ============================================================
#
# We give the model:
#   <|system|>
#   You are a helpful mental health assistant...</s>
#   <|user|>
#   How do I manage stress?</s>
#   <|assistant|>
#   <- MODEL GENERATES FROM HERE
#
# Model recognizes the pattern and generates an appropriate response
# Because it learned: "after <|assistant|>\n, I produce mental health advice"
#
# IF THE FORMAT DOESN'T MATCH:
#   Training: <|system|>...\n<|user|>...\n<|assistant|>...
#   Inference: <|system|>...NO NEWLINE<|user|>...
#   Model: "This looks different, I don't recognize this pattern"
#   Result: Garbage output
#
# That's why we're so careful about EXACT format matching!

Formatted prompt (notice </s> after each section!):
<|system|>
You are a helpful mental health assistant. Provide supportive, empathetic, and informative responses.</s>
<|user|>
Hi</s>
<|assistant|>
Hello there. Tell me how are you feeling today?</s>
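
Eyeballing one example is good, but you can also turn the checklist into code. The sketch below is a hedged example: `build_inference_prompt` and `check_training_text` are hypothetical helpers written for this guide (they're not part of transformers or TRL), and they assume exactly the format shown above. The first builds the prompt you'd feed the model at inference time (same layout, stopping right after the assistant marker); the second flags missing markers or missing `</s>` tokens.

```python
# Hypothetical helpers (not part of transformers/TRL): build the inference prompt
# and sanity-check a formatted training example against the checklist above.

SYSTEM_PROMPT = (
    "You are a helpful mental health assistant. "
    "Provide supportive, empathetic, and informative responses."
)
EOS = "</s>"

def build_inference_prompt(question, system_prompt=SYSTEM_PROMPT):
    """Same layout as training, but stop right after the assistant marker."""
    return (
        f"<|system|>\n{system_prompt}{EOS}\n"
        f"<|user|>\n{question}{EOS}\n"
        f"<|assistant|>\n"
    )

def check_training_text(text):
    """Return a list of problems; an empty list means the format looks right."""
    problems = []
    markers = ["<|system|>", "<|user|>", "<|assistant|>"]
    positions = [text.find(m) for m in markers]
    if -1 in positions:
        missing = [m for m, p in zip(markers, positions) if p == -1]
        return [f"missing marker: {m}" for m in missing]
    if positions != sorted(positions):
        problems.append("markers are out of order")
    # Each section runs from its marker to the next marker (or the end of the text).
    ends = positions[1:] + [len(text)]
    for marker, start, end in zip(markers, positions, ends):
        section = text[start:end].rstrip("\n")
        if not section.startswith(marker + "\n"):
            problems.append(f"no newline after {marker}")
        if not section.endswith(EOS):
            problems.append(f"no {EOS} at end of {marker} section")
    return problems

good = build_inference_prompt("Hi") + "Hello there. Tell me how are you feeling today?" + EOS
broken = good.replace(f"Hi{EOS}", "Hi")   # drop the </s> after the user turn

print(check_training_text(good))
print(check_training_text(broken))
```

Running a check like this over the whole formatted dataset before training is cheap insurance against the broken-format problem described above.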


In [15]:
# FORMAT ALL DATA
print("Formatting dataset...")
# Just a status message so you know something's happening
# With 661 examples it's fast, but good habit for larger datasets

formatted_dataset = train_data.map(
    # .map() = apply a function to EVERY example in the dataset
    #
    # Think of it like a for loop but optimized:
    #
    #   WHAT .map() DOES (conceptually):
    #   results = []
    #   for example in train_data:
    #       result = format_prompt(example)
    #       results.append(result)
    #   formatted_dataset = Dataset.from_list(results)
    #
    # But .map() is BETTER than a for loop because:
    #   - Results go straight into an Arrow-backed dataset (no giant Python list)
    #   - Can parallelize across CPU cores (num_proc parameter)
    #   - Memory efficient (doesn't load everything at once)
    #   - Integrates with HuggingFace caching system
    #   - Shows a nice progress bar
    #
    # HOW IT WORKS:
    #   1. Takes each example: {'question': 'Hi', 'answer': 'Hello!'}
    #   2. Passes it to format_prompt()
    #   3. Gets back: {'text': '<|system|>\n...'}
    #   4. Collects all results into new dataset
    #
    # WHY NOT A FOR LOOP:
    #   For 661 examples, a loop would be fine
    #   But for 100,000+ examples, .map() scales better
    #   (especially with batched=True or num_proc)
    #   Good habit to use .map() even for small datasets
    #   Plus you get the progress bar for free!

    format_prompt,
    # The function to apply to each example
    #
    # This is our format_prompt function we defined above
    # Note: we pass the function itself, not format_prompt()
    #   format_prompt  = "here's the function, you call it"
    #   format_prompt() = "I'm calling it now" (wrong!)
    #
    # .map() will call format_prompt(example) for each example
    #
    # FUNCTION REQUIREMENTS:
    #   - Input: one example (a dictionary)
    #   - Output: a dictionary (new/modified columns)
    #   
    # Our function:
    #   - Input: {'question': '...', 'answer': '...'}
    #   - Output: {'text': '...'}

    remove_columns=train_data.column_names,
    # remove_columns = delete these columns after mapping
    #
    # train_data.column_names = ['question', 'answer']
    #
    # BEFORE .map():
    #   | question          | answer                    |
    #   |-------------------|---------------------------|
    #   | Hi                | Hello there!              |
    #   | How are you?      | I'm doing well!           |
    #
    # AFTER .map() WITHOUT remove_columns:
    #   | question          | answer          | text                      |
    #   |-------------------|-----------------|---------------------------|
    #   | Hi                | Hello there!    | <|system|>\n...           |
    #   | How are you?      | I'm doing well! | <|system|>\n...           |
    #   
    #   We'd have the old columns PLUS the new 'text' column
    #   Wastes memory keeping data we don't need anymore
    #
    # AFTER .map() WITH remove_columns:
    #   | text                      |
    #   |---------------------------|
    #   | <|system|>\n...           |
    #   | <|system|>\n...           |
    #
    #   Only the 'text' column remains
    #   Clean and memory efficient!
    #
    # WHY REMOVE OLD COLUMNS:
    #   - We don't need 'question' and 'answer' separately anymore
    #   - They're now embedded in the 'text' column (formatted prompt)
    #   - Saves memory
    #   - Cleaner dataset structure
    #   - SFTTrainer only needs the 'text' column anyway
    #
    # HOW train_data.column_names WORKS:
    #   Returns list of all column names: ['question', 'answer']
    #   Equivalent to: remove_columns=['question', 'answer']
    #   But dynamic - if columns change, this still works
)

# WHAT JUST HAPPENED:
#
#   661 examples transformed from:
#     {'question': 'Hi', 'answer': 'Hello there!'}
#
#   To:
#     {'text': '<|system|>\nYou are a helpful mental health assistant. Provide supportive, empathetic, and informative responses.</s>\n<|user|>\nHi</s>\n<|assistant|>\nHello there!</s>'}
#
#   All 661 examples now have the correct TinyLlama chat format
#   Ready for training!

# OTHER USEFUL .map() PARAMETERS:
#
#   num_proc=4
#     Use 4 CPU cores in parallel (faster for big datasets)
#     Default is 1 (sequential)
#     Don't set higher than your CPU cores
#
#   batched=True
#     Pass batches of examples instead of one at a time
#     Function receives: {'question': ['Hi', 'Hey', ...], 'answer': [...]}
#     Faster for some operations, but we don't need it here
#
#   batch_size=1000
#     If batched=True, how many examples per batch
#
#   load_from_cache_file=False
#     Don't use cached results, always recompute
#     Useful when debugging and changing the function
#
#   desc="Processing"
#     Custom description for the progress bar

print(f"Formatted {len(formatted_dataset)} examples")
# len(formatted_dataset) = number of examples in the new dataset
#
# OUTPUT: "Formatted 661 examples"
#
# Should be same as original train_data (661)
# We didn't add or remove examples, just transformed them
#
# If this number was different:
#   - Less: some examples got filtered out (bad data?)
#   - More: something weird happened (shouldn't happen with our function)
#
# WHY CHECK: Sanity check that .map() worked correctly

# ============================================================
# QUICK PEEK AT THE RESULT
# ============================================================
#
# You can inspect the formatted data:
#
#   print(formatted_dataset[0])
#   # {'text': '<|system|>\nYou are...'}
#
#   print(formatted_dataset['text'][0])
#   # '<|system|>\nYou are...'
#
#   print(formatted_dataset[0:3])
#   # First 3 examples
#
# ============================================================
# WHAT'S NEXT
# ============================================================
#
# formatted_dataset now has 661 examples like:
#   {'text': '<|system|>\n...\n<|user|>\nHi</s>\n<|assistant|>\nHello!</s>'}
#
# Next step: Split into train and validation sets
#   - Training set: what model learns from
#   - Validation set: what we test on (to check for overfitting)
#
# Then: Create the trainer and start training!

Formatting dataset...


Map:   0%|          | 0/661 [00:00<?, ? examples/s]

Formatted 661 examples
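
If you ever want the `batched=True` variant mentioned in the comments above, the function has to accept a dict of lists and return a dict of lists. Here's a small sketch, assuming the same `question`/`answer` columns and the same system prompt. `format_prompt_batched` is a hypothetical name, not something defined in this notebook.

```python
SYSTEM_PROMPT = (
    "You are a helpful mental health assistant. "
    "Provide supportive, empathetic, and informative responses."
)

def format_prompt_batched(batch: dict) -> dict:
    """Batched variant: receives {'question': [...], 'answer': [...]}, returns {'text': [...]}."""
    texts = [
        f"<|system|>\n{SYSTEM_PROMPT}</s>\n<|user|>\n{q}</s>\n<|assistant|>\n{a}</s>"
        for q, a in zip(batch["question"], batch["answer"])
    ]
    return {"text": texts}

# Tiny stand-in batch, shaped like what .map(batched=True) would pass in
batch = {
    "question": ["Hi", "How are you?"],
    "answer": ["Hello there!", "I'm doing well!"],
}
out = format_prompt_batched(batch)
print(len(out["text"]))
print(out["text"][0])

# With the datasets library this could be applied like so (not run here):
# formatted_dataset = train_data.map(
#     format_prompt_batched,
#     batched=True,
#     remove_columns=train_data.column_names,
# )
```

For 661 examples the one-at-a-time version is plenty; the batched form mostly pays off on much larger datasets.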


In [16]:
# TRAIN/VALIDATION SPLIT
split_dataset = formatted_dataset.train_test_split(
    test_size=0.1,
    seed=42,
)
print(f"Training: {len(split_dataset['train'])} examples")
print(f"Validation: {len(split_dataset['test'])} examples")

Training: 594 examples
Validation: 67 examples
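
Where do 594 and 67 come from? A quick back-of-the-envelope check, assuming the split rounds the validation count UP and gives the rest to training (that assumption is consistent with the output above):

```python
import math

n_examples = 661
test_size = 0.1

# Assumption: validation count is rounded up, training gets the remainder
n_val = math.ceil(n_examples * test_size)
n_train = n_examples - n_val

print(f"Training: {n_train} examples")
print(f"Validation: {n_val} examples")
```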


## Chapter 8: Training Config

In [19]:
# TRAINING CONFIGURATION
#
# This is the CONTROL PANEL for training
# Every setting here affects how the model learns
# Wrong settings = wasted hours, bad results
# Right settings = efficient training, good model
#
# Think of it like a recipe:
#   - Ingredients (data) - already prepared
#   - Oven temperature (learning rate) - too hot burns it, too cold undercooks
#   - Cooking time (epochs) - too long = overdone, too short = raw
#   - etc.

output_dir = "./fine_tuned_model"
# Where to save checkpoints and final model
#
# During training, the trainer will save:
#   ./fine_tuned_model/checkpoint-100/  (after 100 steps)
#   ./fine_tuned_model/checkpoint-200/  (after 200 steps)
#   etc.
#
# "./" means current directory
# So it creates a folder called "fine_tuned_model" right where you are
#
# WHY SAVE CHECKPOINTS:
#   - If training crashes, you can resume from last checkpoint
#   - Can compare different points in training
#   - load_best_model_at_end uses these to find the best one

os.makedirs(output_dir, exist_ok=True)
# Create the output directory if it doesn't exist
#
# os.makedirs() = create folder (and parent folders if needed)
# exist_ok=True = don't crash if folder already exists
#
# Without exist_ok=True:
#   First run: creates folder, works fine
#   Second run: folder exists, CRASHES with FileExistsError
#
# With exist_ok=True:
#   First run: creates folder
#   Second run: folder exists, that's fine, continue
#
# WHY WE NEED IT: Training will try to save files here, folder must exist

training_args = SFTConfig(
    # SFTConfig = Supervised Fine-Tuning Configuration
    # All training hyperparameters in one object
    #
    # This is from the TRL library (trl.SFTConfig)
    # Alternative: TrainingArguments from transformers (similar but less SFT-specific)
    #
    # We pass this to SFTTrainer later
    # Trainer reads all settings from this config

    output_dir=output_dir,
    # Where to save stuff (we defined this above)
    # Checkpoints, logs, final model all go here

    # ============================================================
    # TRAINING DURATION
    # ============================================================
    
    num_train_epochs=5,
    # num_train_epochs = how many times to go through ALL the data
    #
    # 1 epoch = model sees every training example exactly once
    # 5 epochs = model sees every example 5 times
    #
    # EXAMPLE WITH OUR DATA:
    #   We have ~594 training examples (after 90/10 split)
    #   1 epoch = 594 examples processed
    #   5 epochs = 594 × 5 = 2,970 total examples processed
    #
    # WHY MULTIPLE EPOCHS:
    #   Model doesn't learn everything in one pass
    #   Like studying for an exam - you read notes multiple times
    #   Each pass, model picks up patterns it missed before
    #
    # HOW MANY EPOCHS TO USE (rough rules of thumb):
    #   1-2 epochs:  Large datasets (millions of examples)
    #   3-5 epochs:  Medium datasets (thousands of examples)
    #   5-10 epochs: Small datasets (hundreds of examples) <- US (~600 examples)
    #   10+ epochs:  Tiny datasets (risk of overfitting!)
    #
    # SIGNS OF TOO MANY EPOCHS (overfitting):
    #   Training loss keeps going down
    #   Validation loss starts going UP
    #   Model memorizes training data instead of learning patterns
    #
    # SIGNS OF TOO FEW EPOCHS:
    #   Both losses still decreasing when training ends
    #   Model outputs are generic/bad
    #   Didn't learn enough
    #
    # WHY 5: Good starting point for ~600 examples

    # ============================================================
    # BATCH SIZE
    # ============================================================
    
    per_device_train_batch_size=4,
    # How many examples to process at once PER GPU
    #
    # WHAT IS A BATCH:
    #   Instead of processing 1 example at a time:
    #     forward(example1), backward(), update weights
    #     forward(example2), backward(), update weights
    #     ... (very slow!)
    #
    #   We process multiple examples together:
    #     forward([example1, example2, example3, example4]), backward(), update weights
    #     (much faster! GPUs love parallel work)
    #
    # BATCH SIZE TRADEOFFS:
    #   Larger batch (8, 16, 32):
    #     + Faster (more GPU parallelism)
    #     + More stable gradients (averaged over more examples)
    #     - Uses more GPU memory
    #     - Might generalize worse (debated)
    #
    #   Smaller batch (1, 2, 4):
    #     + Uses less memory
    #     + Can fit on smaller GPUs
    #     - Noisier gradients (more variance)
    #     - Slower (less parallelism)
    #
    # HOW TO CHOOSE:
    #   Start with largest batch that fits in memory
    #   If you get "CUDA out of memory", reduce batch size
    #   4 is conservative and works on most free GPUs
    #
    # "per_device" because if you have multiple GPUs:
    #   2 GPUs × batch_size 4 = 8 examples per step total
    #   We have 1 GPU, so it's just 4

    per_device_eval_batch_size=4,
    # Same but for validation/evaluation
    #
    # Can usually be LARGER than train batch because:
    #   - No gradients stored during eval (forward pass only)
    #   - Uses less memory than training
    #
    # We keep it same as train for simplicity
    # Could set to 8 or 16 to speed up evaluation

    gradient_accumulation_steps=4,
    # Accumulate gradients over N steps before updating weights
    #
    # THE PROBLEM:
    #   We want large effective batch size (stable training)
    #   But large batches don't fit in GPU memory
    #
    # THE SOLUTION:
    #   Accumulate gradients from multiple small batches
    #   Then do one big weight update
    #
    # HOW IT WORKS:
    #   Step 1: forward(batch1), compute gradients, DON'T update yet, SAVE gradients
    #   Step 2: forward(batch2), compute gradients, ADD to saved gradients
    #   Step 3: forward(batch3), compute gradients, ADD to saved gradients
    #   Step 4: forward(batch4), compute gradients, ADD to saved gradients
    #   NOW: average all gradients, update weights
    #
    # EFFECTIVE BATCH SIZE:
    #   per_device_batch × gradient_accumulation × num_gpus
    #   4 × 4 × 1 = 16
    #
    #   It's LIKE training with batch size 16
    #   But only batch size 4 in memory at once!
    #
    # WHY 4:
    #   Gives us effective batch of 16 (good for stability)
    #   Without needing 4× the GPU memory
    #
    # TRADEOFF:
    #   More accumulation = fewer weight updates per epoch
    #   (each update needs several forward/backward passes)
    #   But roughly the same result as a genuinely larger batch

    # ============================================================
    # LEARNING RATE
    # ============================================================
    
    learning_rate=5e-5,
    # How big of a step to take when updating weights
    #
    # 5e-5 = 0.00005 = 0.005%
    #
    # THE MOST IMPORTANT HYPERPARAMETER
    # This controls how fast the model learns
    #
    # WHAT LEARNING RATE DOES (simplified, plain gradient descent):
    #   new_weight = old_weight - learning_rate × gradient
    #
    #   Gradient says "move this direction to reduce loss"
    #   Learning rate says "move THIS MUCH in that direction"
    #
    # TOO HIGH (1e-3, 1e-2):
    #   - Takes huge steps
    #   - Overshoots the optimal values
    #   - Loss goes crazy (spikes, NaN)
    #   - Model learns garbage
    #
    # TOO LOW (1e-7, 1e-8):
    #   - Takes tiny steps
    #   - Barely moves from starting point
    #   - Would need millions of steps to learn anything
    #   - Wasted time
    #
    # JUST RIGHT (1e-5 to 1e-4 for fine-tuning):
    #   - Steady progress
    #   - Loss decreases smoothly
    #   - Model learns without going crazy
    #
    # COMMON LEARNING RATES:
    #   Pre-training from scratch: 1e-4 to 1e-3
    #   Full fine-tuning: 1e-5 to 5e-5
    #   LoRA fine-tuning: 1e-5 to 2e-4
    #
    # WHY 5e-5:
    #   Conservative choice for LoRA
    #   Less likely to destabilize training
    #   Works well with our alpha=128 (2× scaling)
    #   If too slow, could try 1e-4

    lr_scheduler_type="cosine",
    # How learning rate changes during training
    #
    # CONSTANT (no scheduler):
    #   Learning rate stays 5e-5 the whole time
    #   Simple but not optimal
    #
    # LINEAR:
    #   Starts at 5e-5, decreases linearly to 0
    #   Step 0: 5e-5
    #   Step 500: 2.5e-5
    #   Step 1000: 0
    #
    # COSINE (what we use):
    #   Follows half a cosine curve from max to min
    #   Starts at 5e-5 (once warmup is done)
    #   Decreases slowly at first (slower than linear at start)
    #   Drops fastest around the middle of training
    #   Flattens out again as it smoothly approaches zero
    #
    #   It looks like this:
    #   LR |‾‾‾‾\
    #      |     \
    #      |      \_____
    #      +------------- Steps
    #
    # WHY COSINE:
    #   - Allows faster learning at start (when there's lots to learn)
    #   - Slows down at end (fine-tuning, don't overshoot)
    #   - Empirically works well for many tasks
    #   - Industry standard for most fine-tuning
    #
    # OTHER OPTIONS:
    #   "linear": straight line decrease
    #   "polynomial": customizable curve
    #   "constant": no change
    #   "constant_with_warmup": constant after warmup

    warmup_ratio=0.1,
    # What fraction of training to "warm up" the learning rate
    #
    # WARMUP = start with tiny LR and gradually increase to full LR
    #
    # WHY WARMUP:
    #   At the very start, model weights are "cold"
    #   Gradients can be wild and unstable
    #   Big learning rate + wild gradients = disaster
    #
    #   Warmup says: "Start gentle, then ramp up"
    #   Lets the model stabilize before going full speed
    #
    # 0.1 = 10% of training steps are warmup
    #
    # EXAMPLE:
    #   Total steps: 500
    #   Warmup: 10% × 500 = 50 steps
    #
    #   Step 0-50: LR increases from 0 to 5e-5 (warmup)
    #   Step 50-500: LR follows cosine schedule from 5e-5 down
    #
    # HOW TO CHOOSE:
    #   0.03-0.1: typical range
    #   Small dataset: 0.1 (more warmup relative to short training)
    #   Large dataset: 0.03 (don't waste steps on warmup)
    #
    # WHY 0.1:
    #   Our training is relatively short
    #   10% warmup adds stability without losing much training time

    # ============================================================
    # PRECISION
    # ============================================================
    
    bf16=True,
    # Use bfloat16 precision for training computations
    #
    # PRECISION OPTIONS:
    #   fp32 (32-bit float): highest precision, slowest, most memory
    #   fp16 (16-bit float): half precision, fast, can overflow
    #   bf16 (bfloat16): half precision, fast, STABLE (no overflow)
    #
    # BFLOAT16 ADVANTAGES:
    #   - 2× faster than fp32 on modern GPUs
    #   - 2× less memory than fp32
    #   - Same numerical RANGE as fp32 (no overflow)
    #   - Slightly less PRECISION than fp32 (fine for training)
    #
    # WHY NOT FP16:
    #   fp16 has smaller range of values
    #   Very large or small numbers become infinity or zero
    #   Causes NaN losses during training
    #   Needs "loss scaling" tricks to work (extra complexity)
    #
    # WHY BF16:
    #   Best of both worlds
    #   Fast like fp16
    #   Stable like fp32
    #   No special tricks needed
    #
    # REQUIREMENT:
    #   bf16 needs Ampere GPU or newer (RTX 30xx, A100, etc.)
    #   T4 and P100 don't have native bf16; depending on your library versions,
    #   bf16=True there either falls back to slow emulation or raises an error
    #   If you get errors, set bf16=False and fp16=True instead
    #
    # WHY True: Faster training, lower memory, stable

    # ============================================================
    # REGULARIZATION
    # ============================================================
    
    weight_decay=0.05,
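    # Weight decay - shrink the weights a little on every update
    #
    # WHAT IT DOES:
    #   Classic L2 regularization adds a penalty to the loss:
    #     loss = original_loss + weight_decay × sum(weights²)
    #   The Trainer's default AdamW optimizer gets a similar effect by
    #   shrinking the weights directly in the update step ("decoupled" decay):
    #     new_weight = old_weight - learning_rate × weight_decay × old_weight - (Adam step)
    #
    #   Either way, the model is punished for having large weights
    #   Encourages smaller, more distributed weights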
    #
    # WHY IT HELPS:
    #   Large weights = model is "memorizing" specific examples
    #   Small weights = model learning general patterns
    #   Prevents overfitting
    #
    # EXAMPLE:
    #   Without weight decay: model might learn
    #     "When input is EXACTLY 'Hi', output EXACTLY 'Hello there!'"
    #   With weight decay: model learns
    #     "Greetings should be responded to with greetings"
    #
    # HOW TO CHOOSE:
    #   0.0: no regularization
    #   0.01: light regularization
    #   0.05: moderate regularization (us)
    #   0.1: strong regularization
    #
    # WHY 0.05:
    #   We have small dataset (661 examples)
    #   Risk of overfitting is real
    #   0.05 is moderate - not too aggressive

    max_grad_norm=0.5,
    # Gradient clipping - cap gradient magnitudes
    #
    # WHAT IT DOES:
    #   If gradient magnitude > 0.5, scale it down to 0.5
    #   Prevents "exploding gradients"
    #
    # THE PROBLEM:
    #   Sometimes gradients get HUGE (gradient explosion)
    #   Huge gradient × learning rate = massive weight update
    #   Massive update = model goes crazy
    #   Common with small batches and certain architectures
    #
    # THE SOLUTION:
    #   "Clip" gradients to max magnitude
    #   Direction preserved, magnitude limited
    #   Wild gradients become manageable gradients
    #
    # EXAMPLE:
    #   Gradient = [100, -200, 50]
    #   Magnitude = sqrt(100² + 200² + 50²) ≈ 229
    #   max_grad_norm = 0.5
    #   Since 229 > 0.5, scale by 0.5/229: [0.22, -0.44, 0.11]
    #
    # HOW TO CHOOSE:
    #   1.0: standard, light clipping
    #   0.5: moderate clipping (us)
    #   0.1: aggressive clipping
    #
    # WHY 0.5:
    #   More conservative than default 1.0
    #   Extra stability for our small dataset
    #   Prevents any single bad batch from ruining training

    # ============================================================
    # EVALUATION
    # ============================================================
    
    eval_strategy="steps",
    # When to run evaluation on validation set
    #
    # OPTIONS:
    #   "no": never evaluate (fastest, but flying blind)
    #   "epoch": evaluate after each epoch
    #   "steps": evaluate every N steps (most control)
    #
    # WHY "steps":
    #   We want to see progress during training
    #   Not just at end of each epoch
    #   Can catch overfitting earlier
    #
    # Combined with eval_steps below

    eval_steps=50,
    # Evaluate every 50 training steps
    #
    # A "step" = one weight update
    # With gradient_accumulation_steps=4 and batch_size=4:
    #   1 step = 16 examples processed
    #   50 steps = 800 examples = ~1.3 epochs
    #
    # WHAT HAPPENS DURING EVAL:
    #   1. Pause training
    #   2. Run model on ALL validation examples
    #   3. Calculate validation loss
    #   4. Log the metrics
    #   5. Resume training
    #
    # WHY 50:
    #   Often enough to see trends
    #   Not so often that it slows down training
    #   With ~37 steps per epoch (594 / 16), that's one eval every ~1.3 epochs,
    #   so about 3 evals over the whole ~185-step run

    # ============================================================
    # LOGGING
    # ============================================================
    
    logging_steps=25,
    # Log training metrics every 25 steps
    #
    # WHAT GETS LOGGED:
    #   - Training loss (how wrong model is on training data)
    #   - Learning rate (current LR after scheduler)
    #   - Gradient norm (are gradients exploding?)
    #   - Speed (samples per second, steps per second)
    #
    # More frequent than eval (eval is expensive, logging is cheap)
    # Lets you see training progress in real-time
    #
    # WHY 25:
    #   See progress without flooding the screen
    #   Every 25 steps ≈ every few minutes of training

    logging_first_step=True,
    # Log metrics after the very first step
    #
    # WHY:
    #   See initial loss immediately
    #   Useful for sanity checking
    #   If first loss is NaN or 1000+, something is wrong
    #
    #   Normal initial loss: 2-4 (for language models)
    #   Suspiciously high: 10+ (maybe wrong format)
    #   NaN: definitely broken (precision issues, bad data)

    # ============================================================
    # CHECKPOINTING
    # ============================================================
    
    save_strategy="steps",
    # When to save model checkpoints
    #
    # OPTIONS:
    #   "no": never save (risky!)
    #   "epoch": save after each epoch
    #   "steps": save every N steps
    #
    # WHY "steps":
    #   More frequent saves = less lost work if crash
    #   Can analyze different training stages
    #   Combined with save_steps below

    save_steps=100,
    # Save checkpoint every 100 steps
    #
    # WHAT GETS SAVED:
    #   - Model weights (the LoRA adapters)
    #   - Optimizer state (momentum, etc.)
    #   - Scheduler state (where in LR schedule)
    #   - Training progress (which step/epoch)
    #
    # This creates folders like:
    #   ./fine_tuned_model/checkpoint-100/
    #   ./fine_tuned_model/checkpoint-200/
    #
    # WHY 100:
    #   Often enough to not lose too much progress
    #   Not so often that disk fills up with checkpoints
    #   Balance between safety and disk space

    save_total_limit=3,
    # Only keep the 3 most recent checkpoints
    #
    # WITHOUT THIS:
    #   checkpoint-100, checkpoint-200, checkpoint-300, ...
    #   Could end up with dozens of checkpoints
    #   Each holds the ~200MB adapter PLUS optimizer state, adds up fast!
    #
    # WITH save_total_limit=3:
    #   Keeps: checkpoint-700, checkpoint-800, checkpoint-900
    #   Deletes: checkpoint-100 through checkpoint-600
    #   Auto-cleanup saves disk space
    #
    # EXCEPTION:
    #   The "best" checkpoint (if tracking) is always kept
    #   Won't delete the best one even if it's old
    #
    # WHY 3: Enough to recover from recent issues, not too much space

    load_best_model_at_end=True,
    # After training finishes, load the best checkpoint (not the last one)
    #
    # THE PROBLEM:
    #   Training might overshoot
    #   Step 500: loss 0.4 (good!)
    #   Step 600: loss 0.45 (overfitting...)
    #   Step 700: loss 0.5 (worse!)
    #
    #   Without this: you get step 700 model (the worst one!)
    #   With this: you get step 500 model (the best one!)
    #
    # HOW IT WORKS:
    #   Trainer tracks eval loss at each evaluation
    #   Remembers which checkpoint had lowest eval loss
    #   At the very end, loads that best checkpoint
    #
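    # REQUIRES:
    #   save_strategy must be set (need checkpoints to load)
    #   eval_strategy must match save_strategy (need eval metrics to compare)
    #   save_steps must be a round multiple of eval_steps (100 = 2 × 50, OK)
    #   Only SAVED checkpoints can be loaded as the "best" one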
    #
    # WHY True: Automatically get the best model, not just the last one

    metric_for_best_model="eval_loss",
    # Which metric to use when determining "best" checkpoint
    #
    # OPTIONS:
    #   "eval_loss": validation loss (lower is better) <- MOST COMMON
    #   "accuracy": if you compute accuracy (higher is better)
    #   "f1": if you compute F1 score (higher is better)
    #   Any custom metric you log
    #
    # For language models, eval_loss is standard
    # It measures: "how well does model predict validation data?"
    #
    # WHY "eval_loss": Direct measure of model quality on unseen data

    greater_is_better=False,
    # Is a HIGHER metric better, or LOWER?
    #
    # False = lower is better (loss, error rate)
    # True = higher is better (accuracy, F1 score)
    #
    # eval_loss: lower is better, so False
    #
    # EXAMPLE:
    #   Checkpoint A: eval_loss = 0.5
    #   Checkpoint B: eval_loss = 0.4
    #   
    #   greater_is_better=False
    #   B is better (0.4 < 0.5)
    #
    # WHY False: We're using loss, lower loss = better model

    seed=42,
    # Random seed for reproducibility
    #
    # WHAT IT AFFECTS:
    #   - Data shuffling order
    #   - Dropout randomness
    #   - Weight initialization (for new layers)
    #
    # WHY SET IT:
    #   Same seed = same random choices = same results
    #   Can reproduce experiments
    #   Can compare changes fairly (only your change differs)
    #
    # WHY 42:
    #   It's a meme (Hitchhiker's Guide to the Galaxy)
    #   "The answer to life, the universe, and everything"
    #   Any number works, 42 is tradition in ML

    report_to="none",
    # Where to send training metrics
    #
    # OPTIONS:
    #   "none": just local logging
    #   "tensorboard": send to TensorBoard (visualization tool)
    #   "wandb": send to Weights & Biases (popular tracking service)
    #   "mlflow": send to MLflow
    #
    # WHY "none":
    #   Keep it simple for this tutorial
    #   No external accounts needed
    #   Metrics still print to console
    #
    # IN PRODUCTION:
    #   Would use "wandb" or "tensorboard" for better tracking
    #   Nice graphs, comparison tools, experiment management
)

print("Training config ready!")
# Confirms all settings were accepted without errors
#
# If any parameter name was wrong, we'd get an error here
# Success means config is valid and ready for trainer
#
# NEXT: Create SFTTrainer with this config and start training!

# ============================================================
# SUMMARY OF OUR SETTINGS
# ============================================================
#
# Training Duration:
#   5 epochs, ~37 steps per epoch, ~185 total steps
#
# Batch Size:
#   Actual: 4
#   Effective: 16 (with 4× accumulation)
#
# Learning Rate:
#   Start: 5e-5
#   Schedule: cosine decay
#   Warmup: 10% of training
#
# Regularization:
#   Weight decay: 0.05
#   Gradient clipping: 0.5
#   (Plus LoRA dropout 0.1 from lora_config)
#
# Monitoring:
#   Log every 25 steps
#   Eval every 50 steps
#   Save every 100 steps
#
# Safety:
#   Keep best model (by eval_loss)
#   Keep last 3 checkpoints
#   bf16 precision for stability

Training config ready!
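
Quick sanity math for the settings above: how many weight updates we actually get, and what the learning rate looks like along the way. This is a rough sketch, assuming the usual linear-warmup-then-cosine-decay shape. `lr_at` is a hypothetical helper written for this illustration, not a library call, and the step count can be off by one per epoch depending on how your transformers version rounds the last partial accumulation step.

```python
import math

n_train = 594            # training examples after the 90/10 split
per_device_batch = 4
grad_accum = 4
epochs = 5
peak_lr = 5e-5
warmup_ratio = 0.1
eval_every = 50

batches_per_epoch = math.ceil(n_train / per_device_batch)   # 149 mini-batches
steps_per_epoch = batches_per_epoch // grad_accum            # 37 (assumption: rounded down)
total_steps = steps_per_epoch * epochs                       # 185 weight updates
warmup_steps = math.ceil(total_steps * warmup_ratio)         # 19 warmup steps
n_evals = total_steps // eval_every                          # evals at steps 50, 100, 150

def lr_at(step: int) -> float:
    """Linear warmup to peak_lr, then half-cosine decay down to 0."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

print(f"steps/epoch={steps_per_epoch}, total={total_steps}, "
      f"warmup={warmup_steps}, evals={n_evals}")
for step in (0, warmup_steps, total_steps // 2, total_steps):
    print(f"step {step:>3}: lr = {lr_at(step):.2e}")
```

Three evals (steps 50, 100, 150) lines up with the three rows in the training log further down.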

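Gradient clipping (`max_grad_norm=0.5`) is easier to trust once you've run the numbers yourself. Below is a plain-Python sketch of clipping by global norm using the example gradient from the config comments. `clip_by_global_norm` is a hypothetical helper for illustration; the real trainer applies the same idea across all trainable parameters at once.

```python
import math

def clip_by_global_norm(grads, max_norm):
    """Scale the whole gradient vector down if its L2 norm exceeds max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    # Small epsilon avoids division by zero; scale never goes above 1
    scale = min(1.0, max_norm / (total_norm + 1e-6))
    return [g * scale for g in grads], total_norm

clipped, norm = clip_by_global_norm([100.0, -200.0, 50.0], max_norm=0.5)
print(f"norm before: {norm:.2f}")
print("clipped:", [round(g, 3) for g in clipped])

# A gradient that is already small passes through untouched
small, _ = clip_by_global_norm([0.1, 0.2], max_norm=0.5)
print("small:", small)
```

Notice the direction is preserved: every component is multiplied by the same factor.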

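And here's why `gradient_accumulation_steps=4` with batch size 4 behaves like batch size 16. This is a toy sketch with a one-parameter model and made-up data (everything here is illustrative, nothing comes from the notebook): averaging the gradients of 4 equal-sized micro-batches gives the same gradient as one batch of 16.

```python
def mse_grad(w, xs, ys):
    """Gradient of mean squared error for the one-parameter model y_hat = w * x."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)

# Made-up data: 16 examples
xs = [float(i) for i in range(1, 17)]
ys = [3.0 * x + 1.0 for x in xs]
w = 0.5

# One big batch of 16
full_grad = mse_grad(w, xs, ys)

# Four micro-batches of 4, each gradient divided by the number of accumulation steps
accum_steps, micro_batch = 4, 4
accumulated = 0.0
for i in range(accum_steps):
    chunk = slice(i * micro_batch, (i + 1) * micro_batch)
    accumulated += mse_grad(w, xs[chunk], ys[chunk]) / accum_steps

print(f"full batch:  {full_grad}")
print(f"accumulated: {accumulated}")
```

With real text the match is only approximate, since sequences have different lengths and the loss is averaged over tokens, but the principle is the same.
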
## Chapter 9: Training!

In [20]:
# CREATE TRAINER
trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=split_dataset["train"],
    eval_dataset=split_dataset["test"],
    processing_class=tokenizer,
)

print("Trainer ready!")
print(f"  Train: {len(trainer.train_dataset)} examples")
print(f"  Eval: {len(trainer.eval_dataset)} examples")

Adding EOS to train dataset:   0%|          | 0/594 [00:00<?, ? examples/s]

Tokenizing train dataset:   0%|          | 0/594 [00:00<?, ? examples/s]

Truncating train dataset:   0%|          | 0/594 [00:00<?, ? examples/s]

Adding EOS to eval dataset:   0%|          | 0/67 [00:00<?, ? examples/s]

Tokenizing eval dataset:   0%|          | 0/67 [00:00<?, ? examples/s]

Truncating eval dataset:   0%|          | 0/67 [00:00<?, ? examples/s]

Trainer ready!
  Train: 594 examples
  Eval: 67 examples


In [21]:
# TRAIN!
if torch.cuda.is_available():
    torch.cuda.empty_cache()

print("Starting training...")
print("Watch the loss - should decrease over time!")
print()

train_result = trainer.train()

print("\nTraining complete!")
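print(f"Final loss: {train_result.training_loss:.4f}")
# Note: train_result.training_loss is the AVERAGE training loss over the whole run,
# not the loss at the last step, so it reads higher than the last logged value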

The tokenizer has new PAD/BOS/EOS tokens that differ from the model config and generation config. The model config and generation config were aligned accordingly, being updated with the tokenizer's values. Updated tokens: {'pad_token_id': 2}.


Starting training...
Watch the loss - should decrease over time!



Step,Training Loss,Validation Loss,Entropy,Num Tokens,Mean Token Accuracy
50,0.565,0.523949,0.534874,58328.0,0.865922
100,0.3416,0.374702,0.365874,117453.0,0.904407
150,0.2265,0.344615,0.316968,175305.0,0.911803



Training complete!
Final loss: 0.4685
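
How do you read that table? One rough way, sketched below with the three logged rows copied from the output above: look at the gap between validation and training loss, and convert the validation loss to perplexity with `exp(loss)` (standard for cross-entropy loss). The variable names are just for this illustration.

```python
import math

# Rows copied from the training log above
log = [
    {"step": 50,  "train_loss": 0.5650, "eval_loss": 0.523949},
    {"step": 100, "train_loss": 0.3416, "eval_loss": 0.374702},
    {"step": 150, "train_loss": 0.2265, "eval_loss": 0.344615},
]

for row in log:
    gap = row["eval_loss"] - row["train_loss"]      # positive = model does better on train data
    perplexity = math.exp(row["eval_loss"])         # lower is better
    print(f"step {row['step']:>3}: gap={gap:+.3f}  eval perplexity={perplexity:.2f}")

best = min(log, key=lambda r: r["eval_loss"])
print("lowest eval loss at step", best["step"])
```

Here the validation loss is still falling at step 150, which is good, but the gap to the training loss keeps growing. That's the pattern to watch: if validation loss turns upward while training loss keeps dropping, you've hit overfitting.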


In [22]:
# SAVE MODEL
#
# Training is done! Now we need to save our work
# Otherwise when the notebook closes, everything is GONE
#
# WHAT WE'RE SAVING:
#   Not the entire 1.1B parameter model (that would be huge)
#   Just the LoRA adapters (~50M parameters)
#   The tiny "diff" that makes base model -> our fine-tuned model
#
# IT'S LIKE:
#   You don't save a whole new copy of Photoshop when you edit a photo
#   You save the edits (layers, adjustments)
#   LoRA adapters = the edits
#   Base model = Photoshop (stays unchanged, download again later)

lora_output_dir = "./fine_tuned_lora"
# Where to save our LoRA adapters
#
# Different from output_dir ("./fine_tuned_model") which has checkpoints
# This is the FINAL clean save location
#
# WHY SEPARATE FOLDER:
#   output_dir has checkpoints, optimizer states, training logs
#   Messy, lots of files, some are huge
#   
#   lora_output_dir has just the final model
#   Clean, minimal, ready to share or deploy
#
# "./" means current directory
# Creates folder called "fine_tuned_lora" right here

os.makedirs(lora_output_dir, exist_ok=True)
# Create the directory if it doesn't exist
#
# os.makedirs() = create folder (and any parent folders needed)
# exist_ok=True = don't crash if folder already exists
#
# Same pattern we used for output_dir earlier
# Good habit: always create dir before saving to it

trainer.model.save_pretrained(lora_output_dir)
# Save the LoRA adapter weights
#
# LET'S BREAK THIS DOWN:
#
# trainer.model
#   The trained model inside our SFTTrainer
#   This is the PeftModel (base TinyLlama + LoRA adapters)
#   After training, LoRA weights have been optimized
#
# .save_pretrained(path)
#   Save model in HuggingFace format
#   Creates files that can be loaded with .from_pretrained() later
#
# WHAT IT SAVES (for LoRA/PEFT models):
#   adapter_config.json     - LoRA configuration (r, alpha, target_modules, etc.)
#   adapter_model.safetensors - The actual LoRA weights (~200MB)
#
# WHAT IT DOES NOT SAVE:
#   The base TinyLlama weights (1.1B params)
#   Those stay on Hugging Face Hub
#   We just save our tiny adapters
#
# THIS IS THE MAGIC OF LORA:
#   Full model save: ~2.2 GB (or 4.4GB for fp32)
#   LoRA adapter save: ~200 MB
#   10x smaller!
#
# To use the model later:
#   1. Load base TinyLlama (from Hub or cache)
#   2. Load LoRA adapters (from this folder)
#   3. Combine them
#   Done!
#
# WHY .safetensors FORMAT:
#   Newer, safer format than .bin
#   Can't contain malicious code (unlike pickle-based .bin)
#   Faster to load
#   Industry standard now

tokenizer.save_pretrained(lora_output_dir)
# Save the tokenizer too
#
# "Wait, we didn't change the tokenizer, why save it?"
#
# GOOD PRACTICE REASONS:
#   1. Everything needed in one folder
#      Don't need to remember "use TinyLlama tokenizer"
#      Just load from this folder, everything's there
#
#   2. Future compatibility
#      What if TinyLlama tokenizer changes?
#      You have YOUR version saved
#
#   3. Easier deployment
#      Copy one folder, everything works
#      No hunting for matching tokenizer
#
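# WHAT IT SAVES (exact file list depends on your transformers version):
#   tokenizer.json          - The actual vocabulary and rules
#   tokenizer.model         - The underlying SentencePiece model file
#   tokenizer_config.json   - Tokenizer settings
#   special_tokens_map.json - Special tokens (<s>, </s>, etc.)
#   chat_template.jinja     - The chat template (newer versions write it separately)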
#
# These files are small (a few MB total)
# Worth saving for convenience

print(f"Saved to: {lora_output_dir}")
# Confirm where files were saved
#
# Output: "Saved to: ./fine_tuned_lora"
#
# If something went wrong, we'd see an error before this
# Seeing this message = save successful!

for f in os.listdir(lora_output_dir):
    size = os.path.getsize(os.path.join(lora_output_dir, f)) / 1e6
    print(f"  {f}: {size:.2f} MB")
# List all saved files with their sizes
#
# LET'S BREAK IT DOWN:
#
# os.listdir(lora_output_dir)
#   Get list of all files in the folder
#   Returns: ['adapter_config.json', 'adapter_model.safetensors', 'tokenizer.json', ...]
#
# for f in ...:
#   Loop through each filename
#
# os.path.getsize(os.path.join(lora_output_dir, f))
#   Get file size in bytes
#   os.path.join() combines folder + filename into full path
#   "./fine_tuned_lora" + "adapter_model.safetensors" 
#   = "./fine_tuned_lora/adapter_model.safetensors"
#
# / 1e6
#   Convert bytes to megabytes
#   1e6 = 1,000,000 bytes = 1 MB
#
# :.2f
#   Format as decimal with 2 places
#   192.847264 -> 192.85
#
# TYPICAL OUTPUT (order and extra files vary by library version):
#   adapter_config.json: 0.00 MB
#   adapter_model.safetensors: 201.89 MB
#   tokenizer.json: 3.62 MB
#   tokenizer_config.json: 0.00 MB
#   special_tokens_map.json: 0.00 MB
#
# KEY OBSERVATION:
#   adapter_model.safetensors is ~200 MB
#   This contains ALL 50 million LoRA parameters
#   50M params × 4 bytes (float32) = 200 MB (checks out!)
#
#   Compare to full model: 1.1B × 4 bytes = 4.4 GB
#   We're saving 22x less data!
#
# WHY PRINT FILE SIZES:
#   Verify save worked correctly
#   Confirm files aren't empty (0 bytes = something wrong)
#   See how much space LoRA actually takes
#   Satisfying to see how small the adapters are!

# ============================================================
# WHAT WE SAVED
# ============================================================
#
# ./fine_tuned_lora/
# ├── adapter_config.json      (~0.5 KB)
# │   └── LoRA settings: r=64, alpha=128, target_modules, etc.
# │
# ├── adapter_model.safetensors (~200 MB)
# │   └── The actual trained LoRA weights
# │       22 layers × 7 modules × 2 matrices = 308 LoRA matrices
# │       This is what we trained!
# │
# ├── README.md                (~10 KB)
# │   └── Auto-generated model card
# │
# ├── tokenizer.json           (~3.6 MB)
# │   └── Full vocabulary (32,000 tokens)
# │
# ├── tokenizer.model          (~0.5 MB)
# │   └── SentencePiece model file
# │
# ├── tokenizer_config.json    (~1 KB)
# │   └── Tokenizer settings
# │
# ├── chat_template.jinja      (< 5 KB)
# │   └── Chat template
# │
# └── special_tokens_map.json  (~0.5 KB)
#     └── Maps special tokens like <s>, </s>, <pad>
#
# Total: ~206 MB (vs 2.2 GB for full model)
#
# ============================================================
# HOW TO USE THESE FILES LATER
# ============================================================
#
# OPTION 1: Load on same machine (files still here)
#
#   from peft import PeftModel
#   from transformers import AutoModelForCausalLM, AutoTokenizer
#   
#   # Load base model
#   base_model = AutoModelForCausalLM.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0")
#   
#   # Load LoRA adapters
#   model = PeftModel.from_pretrained(base_model, "./fine_tuned_lora")
#   
#   # Load tokenizer
#   tokenizer = AutoTokenizer.from_pretrained("./fine_tuned_lora")
#
#
# OPTION 2: Download the folder and use elsewhere
#
#   1. Download the ./fine_tuned_lora folder (zip it or whatever)
#   2. On new machine, same code as above but with new path
#
#
# OPTION 3: Upload to Hugging Face Hub (share with the world!)
#
#   from huggingface_hub import HfApi
#   api = HfApi()
#   api.upload_folder(
#       folder_path="./fine_tuned_lora",
#       repo_id="your-username/mental-health-tinyllama-lora",
#       repo_type="model",
#   )
#
#   Then anyone can use:
#   model = PeftModel.from_pretrained(base_model, "your-username/mental-health-tinyllama-lora")
#
# ============================================================
# COMPARE: LORA VS FULL MODEL SAVING
# ============================================================
#
# FULL FINE-TUNING SAVE:
#   - Save ALL 1.1B parameters
#   - File size: 2.2 GB (fp16) or 4.4 GB (fp32)
#   - Self-contained (no need for base model)
#   - Can't easily switch between fine-tuned versions
#
# LORA SAVE (what we did):
#   - Save only 50M adapter parameters
#   - File size: ~200 MB
#   - Need base model to use (download once, reuse)
#   - Can swap adapters easily!
#       - Load TinyLlama base
#       - Add mental-health adapter -> mental health bot
#       - Remove adapter, add coding adapter -> coding bot
#       - Same base model, different "personalities"
#
# LoRA adapters are like costume changes for your model
# Quick to save, quick to load, easy to swap!

Saved to: ./fine_tuned_lora
  special_tokens_map.json: 0.00 MB
  tokenizer.json: 3.62 MB
  README.md: 0.01 MB
  tokenizer.model: 0.50 MB
  adapter_config.json: 0.00 MB
  tokenizer_config.json: 0.00 MB
  chat_template.jinja: 0.00 MB
  adapter_model.safetensors: 201.89 MB
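
Want to sanity-check that ~200 MB number instead of taking it on faith? Here's a rough back-of-the-envelope sketch. It assumes TinyLlama's config values (hidden size 2048, MLP size 5632, 256-wide key/value projections, 22 layers) and r=64 on all seven projection modules. Those dimensions are assumptions pulled from the model config, not something this notebook printed, and `lora_param_count` is a made-up helper name, so treat the result as an estimate.

```python
# Rough LoRA size estimate. The dimensions below are ASSUMPTIONS
# (TinyLlama-1.1B config values), not numbers printed by this notebook.
R = 64          # LoRA rank (r=64 in adapter_config.json)
HIDDEN = 2048   # assumed hidden size
KV_DIM = 256    # assumed key/value projection width (4 KV heads x 64 dims)
MLP = 5632      # assumed MLP intermediate size
LAYERS = 22     # transformer layers

# (in_features, out_features) for each of the 7 targeted modules
MODULE_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, MLP),
    "up_proj": (HIDDEN, MLP),
    "down_proj": (MLP, HIDDEN),
}


def lora_param_count(shapes, rank, layers):
    """Each module gets two matrices: A (rank x in) and B (out x rank)."""
    per_layer = sum(rank * (n_in + n_out) for n_in, n_out in shapes.values())
    return per_layer * layers


params = lora_param_count(MODULE_SHAPES, R, LAYERS)
size_mb = params * 4 / 1e6  # 4 bytes per float32 weight

print(f"LoRA params: {params:,}")
print(f"Estimated adapter size: {size_mb:.2f} MB")
```

Under those assumptions the estimate comes out at about 50.5M parameters and roughly 201.85 MB, which lines up with the 201.89 MB file in the listing above. The small leftover is plausibly file header and metadata.
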


## Chapter 10: Testing!

In [23]:
# INFERENCE FUNCTION
# MUST use same format as training!
#
# THIS IS WHERE THE MAGIC BECOMES REAL
# We trained the model, now we actually USE it
#
# THE GOLDEN RULE:
#   Inference format MUST MATCH training format EXACTLY
#   Same special tokens, same structure, same everything
#   Any mismatch = garbage output
#
# DURING TRAINING:
#   Model learned: "When I see <|system|>...</s><|user|>...</s><|assistant|>"
#                  "I should generate a helpful mental health response"
#
# DURING INFERENCE:
#   We give it: "<|system|>...</s><|user|>...</s><|assistant|>"
#   Model goes: "I recognize this pattern! I know what to do!"
#   And generates a response
#
# If we used different format at inference:
#   Model: "Wtf is this? Never seen this pattern before"
#   Output: garbage, random text, HTML tags, nonsense

def generate_response(model, tokenizer, question, max_new_tokens=150):
    # Function that takes a question and returns model's answer
    #
    # PARAMETERS:
    #   model: our fine-tuned TinyLlama (with LoRA adapters)
    #   tokenizer: converts text <-> tokens
    #   question: what the user is asking (string)
    #   max_new_tokens: maximum length of response (default 150)
    #
    # RETURNS:
    #   The model's response as a string
    
    """
    Generate a response. Uses EXACT same format as training.
    """
    # Docstring - describes what function does
    # "EXACT same format" is emphasized because it's THAT important
    
    # Same format as training - with </s> after system and user!
    prompt = f"""<|system|>
You are a helpful mental health assistant. Provide supportive, empathetic, and informative responses.</s>
<|user|>
{question}</s>
<|assistant|>
"""
    # BUILD THE PROMPT - must match training format!
    #
    # LET'S COMPARE:
    #
    # TRAINING FORMAT:
    #   <|system|>
    #   You are a helpful mental health assistant...</s>
    #   <|user|>
    #   {question}</s>
    #   <|assistant|>
    #   {answer}</s>              <- answer included
    #
    # INFERENCE FORMAT:
    #   <|system|>
    #   You are a helpful mental health assistant...</s>
    #   <|user|>
    #   {question}</s>
    #   <|assistant|>
    #                            <- NO answer, model generates it!
    #
    # SAME:
    #   ✓ <|system|> with message and </s>
    #   ✓ <|user|> with question and </s>
    #   ✓ <|assistant|> marker
    #
    # DIFFERENT:
    #   Training: answer is provided (model learns to predict it)
    #   Inference: answer is missing (model generates it)
    #
    # Note: NO </s> after assistant - model generates that
    #
    # WHY NO </s> AFTER <|assistant|>:
    #   If we put </s> there, we're saying "assistant turn is done"
    #   But assistant hasn't said anything yet!
    #   Model would be confused: "Turn ended but I didn't speak?"
    #
    #   We leave it open: "<|assistant|>\n"
    #   Model knows: "My turn to speak, I should generate until </s>"
    #   Model generates: "Hello! How can I help?</s>"
    #   The MODEL produces the </s> when it's done talking
    #
    # THIS IS A COMMON MISTAKE:
    #   People add </s> after <|assistant|> in inference
    #   Model sees complete conversation, nothing to generate
    #   Outputs nothing or garbage

    inputs = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=256).to(model.device)
    # TOKENIZE THE PROMPT
    #
    # tokenizer(prompt, ...)
    #   Converts our text prompt into token IDs
    #   "Hello" -> [15496]  (illustrative ID; real values depend on the tokenizer)
    #   Our full prompt -> [bunch of token IDs]
    #
    # return_tensors="pt"
    #   Return PyTorch tensors (not lists, not numpy)
    #   "pt" = PyTorch
    #   "tf" = TensorFlow
    #   "np" = NumPy
    #   Model expects PyTorch tensors
    #
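    # truncation=True
    #   If prompt is longer than max_length, cut it off
    #   Without this: no clean guarantee; transformers logs a warning
    #   about max_length being set without truncation
    #   With this: silently truncate from the right (the end)
    #   Careful: the end of our prompt is the <|assistant|> marker,
    #   so a very long question would lose it
    #   If that happens, shorten the question or raise max_length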
    #
    # max_length=256
    #   Maximum prompt length in tokens
    #   Same as what we used in training
    #   Longer prompts get truncated
    #
    # .to(model.device)
    #   Move tensors to same device as model (GPU or CPU)
    #   
    #   model.device = where the model lives (cuda:0 or cpu)
    #   Tensors and model MUST be on same device
    #   GPU model + CPU tensor = error!
    #   
    #   .to(model.device) ensures they match
    #
    # WHAT inputs CONTAINS:
    #   {
    #     'input_ids': tensor([[1, 529, 29989, ...]]),  # token IDs
    #     'attention_mask': tensor([[1, 1, 1, ...]])    # which tokens to attend to
    #   }

    with torch.no_grad():
        # DISABLE GRADIENT COMPUTATION
        #
        # During training:
        #   PyTorch tracks all operations for backpropagation
        #   Stores intermediate values (activations) for gradient calculation
        #   Uses lots of memory!
        #
        # During inference:
        #   We're NOT training, just generating
        #   Don't need gradients
        #   Don't need to store activations
        #
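        # torch.no_grad() tells PyTorch:
        #   "Don't track operations, don't store gradients"
        #   Saves memory (no activations kept around for backprop)
        #   Also slightly faster
        #
        # Note: model.generate() already runs without gradients internally,
        # so this wrapper is a harmless safety net here
        # It matters most when you call model(**inputs) yourself for inference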
        
        outputs = model.generate(
            # .generate() = the text generation function
            #
            # This is where the magic happens
            # Model takes our prompt and produces new tokens
            #
            # HOW IT WORKS (simplified; the token IDs below are illustrative):
            #   1. Encode prompt: "Hello" -> [15496]
            #   2. Feed through model, get probability distribution over vocab
            #   3. Sample next token from distribution: "there" -> [15496, 727]
            #   4. Feed [15496, 727] through model, get next distribution
            #   5. Sample next token: [15496, 727, 11]
            #   6. Repeat until </s> token or max_new_tokens reached
            #
            # Each step, model predicts ONE token based on ALL previous tokens
            
            **inputs,
            # **inputs = unpack the dictionary
            # Same as: input_ids=inputs['input_ids'], attention_mask=inputs['attention_mask']
            # Just cleaner syntax
            
            max_new_tokens=max_new_tokens,
            # Maximum number of NEW tokens to generate
            # New = tokens after the prompt
            #
            # We pass this in as parameter (default 150)
            # Prompt might be 50 tokens, total output up to 50+150=200 tokens
            #
            # WHY LIMIT:
            #   Model could ramble forever
            #   Need a stopping condition
            #   Also prevents hanging if model doesn't produce </s>
            #
            # 150 tokens ≈ 100-120 words ≈ a decent paragraph
            
            do_sample=True,
            # SAMPLING vs GREEDY
            #
            # do_sample=False (GREEDY):
            #   Always pick the highest probability token
            #   "What's most likely next? Pick that."
            #   Deterministic: same input = same output every time
            #   Problem: boring, repetitive, gets stuck in loops
            #
            # do_sample=True (SAMPLING):
            #   Randomly sample from probability distribution
            #   Higher prob tokens more likely, but not guaranteed
            #   Stochastic: same input = different outputs
            #   More creative, more varied, more natural
            #
            # For chat: sampling (True) is the usual choice
            # For code/facts: might use greedy (False)
            
            temperature=0.7,
            # TEMPERATURE - controls randomness of sampling
            #
            # HOW IT WORKS:
            #   Logits (raw scores) are divided by temperature before softmax
            #   Mathematically: softmax(logits / temperature)
            #
            # temperature=1.0 (neutral):
            #   Use raw probabilities as-is
            #   
            # temperature < 1.0 (e.g., 0.7):
            #   "Sharpen" the distribution
            #   High-prob tokens become even higher prob
            #   Low-prob tokens become even lower prob
            #   Result: more focused, predictable outputs
            #
            # temperature > 1.0 (e.g., 1.5):
            #   "Flatten" the distribution
            #   All tokens become more equally likely
            #   Result: more random, creative, chaotic outputs
            #
            # VISUAL (same logits, rescaled by each temperature):
            #   Low temp (0.3):  [0.89, 0.09, 0.02, 0.00] - very confident
            #   Neutral (1.0):   [0.50, 0.25, 0.15, 0.10] - normal
            #   High temp (1.5): [0.41, 0.26, 0.19, 0.14] - uncertain
            #
            # WHY 0.7:
            #   Slightly focused - good balance
            #   Creative but not crazy
            #   A common starting point for chat applications
            #   Tune it for your own model and data
            
            top_p=0.9,
            # TOP-P (NUCLEUS) SAMPLING
            #
            # After temperature, ANOTHER filter on which tokens to consider
            #
            # HOW IT WORKS:
            #   1. Sort tokens by probability (highest first)
            #   2. Add probabilities until you reach top_p (0.9 = 90%)
            #   3. Only sample from those tokens
            #   4. Ignore the bottom 10% (long tail of unlikely tokens)
            #
            # EXAMPLE:
            #   Tokens sorted: [0.3, 0.25, 0.2, 0.15, 0.05, 0.03, 0.02]
            #   top_p=0.9
            #   Sum: 0.3+0.25+0.2+0.15=0.9 (stop here!)
            #   Only sample from first 4 tokens
            #   Bottom 3 (0.05, 0.03, 0.02) are ignored
            #
            # WHY IT HELPS:
            #   Removes garbage tokens (very low probability)
            #   Even with sampling, won't pick nonsense
            #   Adaptive: uses more tokens when uncertain, fewer when confident
            #
            # WHY 0.9:
            #   Keep 90% of probability mass
            #   Removes only the weirdest options
            #   Standard value, works well
            
            top_k=50,
            # TOP-K SAMPLING
            #
            # ANOTHER filter (can use with or without top_p)
            #
            # HOW IT WORKS:
            #   Only consider the top K most likely tokens
            #   Ignore everything else
            #
            # EXAMPLE:
            #   top_k=50
            #   Only sample from the 50 most likely next tokens
            #   Ignore the other 31,950 tokens in vocabulary
            #
            # WHY IT HELPS:
            #   Hard cutoff on number of choices
            #   Guarantees we never pick something super weird
            #
            # TOP_K vs TOP_P:
            #   top_k: fixed number of tokens (always exactly K)
            #   top_p: variable number (however many to reach P probability)
            #   
            #   Using both: tokens must pass BOTH filters
            #   Top 50 AND contribute to top 90% probability
            #
            # WHY 50:
            #   Reasonable variety without too much chaos
            #   Some use 40, some use 100
            #   50 is a safe middle ground
            
            pad_token_id=tokenizer.pad_token_id,
            # Tell model which token ID is padding
            #
            # Remember: we set pad_token = eos_token earlier
            # So pad_token_id = 2 (same as </s>)
            #
            # WHY NEEDED:
            #   Generation function needs to know about special tokens
            #   If not provided, might get warning or weird behavior
            
            eos_token_id=tokenizer.eos_token_id,
            # Tell model which token ID means "stop generating"
            #
            # eos_token_id = 2 (the </s> token)
            #
            # HOW IT'S USED:
            #   Model generates token by token
            #   If it generates token 2 (</s>), STOP
            #   "End of sequence - I'm done talking"
            #
            # Without this:
            #   generate() falls back to the model's own generation config
            #   Passing it explicitly keeps the stop token unambiguous
            
            repetition_penalty=1.2,
            # PENALIZE REPEATING TOKENS
            #
            # HOW IT WORKS:
            #   If a token already appeared (in the prompt or the output so far),
            #   its score (logit) is pushed down before sampling
            #   penalty=1.0: no penalty
            #   penalty=1.2: positive logits divided by 1.2, negative ones multiplied by 1.2
            #   penalty=2.0: much stronger penalty
            #
            # WHY IT HELPS:
            #   Without penalty, model might loop:
            #   "I think I think I think I think I think..."
            #   "The the the the the the the..."
            #
            #   With penalty, repeating tokens become less likely
            #   Forces model to use varied vocabulary
            #
            # WHY 1.2:
            #   Mild penalty - reduces repetition without killing natural patterns
            #   Some repetition is normal ("I am happy to help you")
            #   We don't want to ban all repetition, just excessive loops
            #
            # Common values: 1.0 (off), 1.1, 1.2, 1.5
            
            no_repeat_ngram_size=3,
            # HARD BAN ON REPEATING N-GRAMS
            #
            # n-gram = sequence of n tokens
            # 3-gram = sequence of 3 tokens
            #
            # HOW IT WORKS:
            #   Track all 3-token sequences that appeared
            #   If a token would complete a repeated 3-gram, set its probability to 0
            #   CANNOT repeat ANY 3-token sequence
            #   (The example uses words for readability; the ban works on tokens)
            #
            # EXAMPLE:
            #   Generated so far: "I am happy to help you"
            #   3-grams seen: ["I am happy", "am happy to", "happy to help", "to help you"]
            #   
            #   Next token prediction:
            #   Say the model later writes "to help" again...
            #   Generating "you" next would recreate "to help you"
            #   But that 3-gram already exists!
            #   So "you" is BLOCKED after "to help"
            #
            # WHY 3:
            #   2-gram: too strict (many 2-word phrases naturally repeat)
            #   3-gram: good balance (blocks "I think I think" patterns)
            #   4-gram: might allow some repetitive loops
            #
            # COMBINED WITH repetition_penalty:
            #   repetition_penalty: soft penalty (reduces probability)
            #   no_repeat_ngram_size: hard ban (zero probability)
            #   Together = very effective at preventing repetition
        )

    text = tokenizer.decode(outputs[0], skip_special_tokens=False)
    # DECODE TOKENS BACK TO TEXT
    #
    # outputs = tensor of token IDs
    #   Shape: [1, sequence_length]
    #   The 1 is batch dimension (we only generated 1 sequence)
    #
    # outputs[0] = get the first (only) sequence
    #   Shape: [sequence_length]
    #   Just the token IDs
    #
    # tokenizer.decode(...) = convert token IDs to string
    #   [1, 529, 29989, ...] -> "<|system|>\nYou are a helpful..."
    #
    # skip_special_tokens=False
    #   Keep special tokens like <s> and </s> in the decoded text
    #   We strip </s> ourselves when parsing the response
    #
    #   If skip_special_tokens=True:
    #     <s> and </s> are removed from the output
    #     The <|system|>, <|user|>, <|assistant|> markers appear to be
    #     ordinary text to this tokenizer (the prompt starts
    #     [1, 529, 29989, ...], not one ID per marker), so they would stay
    #
    # text now contains the FULL generated text:
    #   "<s><|system|>\n...\n<|user|>\n{question}</s>\n<|assistant|>\n{response}</s>"
    #   Prompt + generated response all together

    # Extract response after <|assistant|>
    if "<|assistant|>" in text:
        response = text.split("<|assistant|>")[-1]
        response = response.replace("</s>", "").strip()
    else:
        response = text
    # EXTRACT JUST THE RESPONSE
    #
    # We don't want to return the whole thing:
    #   "<|system|>\n...\n<|user|>\nHi</s>\n<|assistant|>\nHello!</s>"
    #
    # We just want:
    #   "Hello!"
    #
    # HOW WE EXTRACT IT:
    #
    # if "<|assistant|>" in text:
    #   Check if the marker exists (it should!)
    #   Safety check in case something weird happened
    #
    # text.split("<|assistant|>")
    #   Split string at "<|assistant|>" marker
    #   Before: "...<|user|>\nHi</s>\n<|assistant|>\nHello!</s>"
    #   After: ["...<|user|>\nHi</s>\n", "\nHello!</s>"]
    #   Returns list of parts
    #
    # [-1]
    #   Get the LAST element (index -1 in Python)
    #   That's everything AFTER the marker
    #   "\nHello!</s>"
    #
    # .replace("</s>", "")
    #   Remove the end-of-sequence token
    #   "\nHello!</s>" -> "\nHello!"
    #
    # .strip()
    #   Remove whitespace from both ends
    #   "\nHello!" -> "Hello!"
    #
    # else: response = text
    #   If no <|assistant|> found (shouldn't happen), return everything
    #   Fallback for weird edge cases
    #
    # RESULT:
    #   Clean response string: "Hello!"
    #   No special tokens, no prompt, just the answer

    return response
    # Return the clean response string
    # Caller can print it, store it, whatever they want

print("Ready to test!")
# Just confirms function was defined without errors
#
# The function doesn't DO anything until we call it
# Next cell will call it with test questions
#
# OUTPUT: "Ready to test!"
#
# If there was a syntax error in the function, we'd see it here
# Success = function is ready to use

# ============================================================
# GENERATION PARAMETERS SUMMARY
# ============================================================
#
# WHAT WE'RE USING:
#   max_new_tokens=150     (limit response length)
#   do_sample=True         (use sampling, not greedy)
#   temperature=0.7        (slightly focused randomness)
#   top_p=0.9              (consider top 90% probability mass)
#   top_k=50               (consider top 50 tokens max)
#   repetition_penalty=1.2 (soft penalty on repeats)
#   no_repeat_ngram_size=3 (hard ban on 3-gram repeats)
#
# THIS GIVES US:
#   - Creative but coherent responses
#   - No repetitive loops
#   - Natural-sounding text
#   - Reasonable length
#
# ============================================================
# ALTERNATIVE SETTINGS FOR DIFFERENT NEEDS
# ============================================================
#
# MORE CREATIVE (brainstorming, stories):
#   temperature=1.0
#   top_p=0.95
#   top_k=100
#
# MORE FOCUSED (factual, consistent):
#   temperature=0.3
#   top_p=0.8
#   top_k=30
#
# DETERMINISTIC (exactly reproducible):
#   do_sample=False
#   (ignores temperature, top_p, top_k)
#
# ============================================================
# COMMON ISSUES AND FIXES
# ============================================================
#
# ISSUE: Repetitive output ("I think I think I think")
# FIX: Increase repetition_penalty to 1.3-1.5
#      Or lower no_repeat_ngram_size to 2 (smaller n = stricter ban)
#
# ISSUE: Too random/incoherent
# FIX: Lower temperature (0.5)
#      Lower top_p (0.7)
#
# ISSUE: Too boring/repetitive content
# FIX: Higher temperature (0.9)
#      Higher top_p (0.95)
#
# ISSUE: Cuts off mid-sentence
# FIX: Increase max_new_tokens
#
# ISSUE: Doesn't stop, rambles forever
# FIX: Check eos_token_id is set correctly
#      Model might not have learned to produce </s>

Ready to test!
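
The temperature, top_k and top_p comments in the cell above are easier to trust once you see the numbers move. Here's a tiny pure-Python sketch of how the three knobs reshape a next-token distribution. It's a toy: the function names (`softmax_with_temperature`, `top_k_filter`, `top_p_filter`) and the four-token logits are made up for illustration, and the real transformers code works on tensors of logits, so read this as the idea rather than the library implementation.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """softmax(logits / temperature), computed in a numerically stable way."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_filter(probs, k):
    """Keep the k most likely tokens, zero the rest, renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep = set(order[:k])
    kept = [p if i in keep else 0.0 for i, p in enumerate(probs)]
    total = sum(kept)
    return [p / total for p in kept]


def top_p_filter(probs, top_p):
    """Keep the smallest set of top tokens whose cumulative mass reaches top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = set(), 0.0
    for i in order:
        keep.add(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    kept = [p if i in keep else 0.0 for i, p in enumerate(probs)]
    total = sum(kept)
    return [p / total for p in kept]


# Toy logits chosen so that temperature=1.0 gives [0.5, 0.25, 0.15, 0.10]
logits = [math.log(p) for p in (0.5, 0.25, 0.15, 0.10)]

for t in (0.3, 1.0, 1.5):
    probs = softmax_with_temperature(logits, t)
    print(f"T={t}: {[round(p, 2) for p in probs]}")

# Chain the filters roughly the way generate() does: temperature first,
# then top_k, then top_p
probs = softmax_with_temperature(logits, 0.7)
probs = top_k_filter(probs, 3)
probs = top_p_filter(probs, 0.9)
print("after all three:", [round(p, 2) for p in probs])
```

Run it and watch the top token's share: lower temperature piles probability onto it, higher temperature spreads it out, and the two filters then zero out whatever is left in the tail.
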

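The two anti-repetition settings from the cell above (`repetition_penalty` and `no_repeat_ngram_size`) can be sketched the same way. This is a simplified stand-in, not the transformers source: `apply_repetition_penalty` and `banned_next_tokens` are hypothetical helper names, the logits are invented, and the n-gram example uses whole words where the real thing uses token IDs.

```python
def apply_repetition_penalty(logits, seen_ids, penalty=1.2):
    """Soft penalty: push down the score of every token already seen.

    Positive logits are divided by the penalty, negative ones multiplied,
    so the token gets less likely either way.
    """
    out = list(logits)
    for tok in set(seen_ids):
        out[tok] = out[tok] / penalty if out[tok] > 0 else out[tok] * penalty
    return out


def banned_next_tokens(generated, n=3):
    """Hard ban: tokens that would complete an n-gram already in `generated`."""
    if len(generated) < n:
        return set()
    # The last n-1 tokens are the start of the n-gram we are about to finish
    prefix = tuple(generated[len(generated) - (n - 1):])
    banned = set()
    for i in range(len(generated) - n + 1):
        if tuple(generated[i:i + n - 1]) == prefix:
            banned.add(generated[i + n - 1])
    return banned


# Soft penalty on a toy 3-token vocabulary where tokens 0 and 1 were seen
print(apply_repetition_penalty([2.4, -1.2, 0.6], seen_ids=[0, 1], penalty=1.2))

# Hard ban, using words instead of token IDs for readability
words = "I am happy to help you and glad to help".split()
print(banned_next_tokens(words, n=3))
```

The sequence ends in "to help", which was followed by "you" earlier, so "you" is the one word the ban rules out next. Everything else is still fair game, which is why the soft penalty and the hard ban work well as a pair.
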

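Since the golden rule is "inference format must match training format", one way to make that hard to break is to build both strings from a single function. The sketch below is a suggestion, not what the notebook's training cell actually does: `build_prompt` and `SYSTEM_PROMPT` are hypothetical names, and it assumes training used exactly the template shown in `generate_response`.

```python
SYSTEM_PROMPT = (
    "You are a helpful mental health assistant. "
    "Provide supportive, empathetic, and informative responses."
)


def build_prompt(question, answer=None, system=SYSTEM_PROMPT):
    """Return training text if `answer` is given, else an inference prompt.

    Both share the same prefix, so the two formats cannot drift apart.
    """
    prompt = (
        f"<|system|>\n{system}</s>\n"
        f"<|user|>\n{question}</s>\n"
        f"<|assistant|>\n"
    )
    if answer is not None:
        # Training example: the answer is included and closed with </s>
        prompt += f"{answer}</s>"
    return prompt


inference_prompt = build_prompt("I'm feeling anxious")
training_text = build_prompt("I'm feeling anxious", "That sounds hard. What's going on?")

print(inference_prompt)
print(training_text.startswith(inference_prompt))
```

The inference prompt is left open after `<|assistant|>`, and the training text is that same prompt plus the answer and a closing `</s>`.
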
In [24]:
# TEST THE MODEL!

test_questions = [
    "Hi",
    "What are some ways to manage anxiety?",
    "I've been feeling stressed at work. Any suggestions?",
    "How can I improve my sleep?",
    "I'm feeling sad today",
]

print("Testing fine-tuned model:")
print("=" * 60)

for i, q in enumerate(test_questions, 1):
    print(f"\nQ{i}: {q}")
    print("-" * 40)
    response = generate_response(model, tokenizer, q)
    print(f"A: {response}")

Testing fine-tuned model:

Q1: Hi
----------------------------------------
A: Hello there. Tell me how are you feeling today?

Q2: What are some ways to manage anxiety?
----------------------------------------
A: 1. Exercise: Regular physical activity can help reduce feelings of anxiety by increasing endorphins in the brain.
2. Stay hydrated: Drinking water helps regulate moods.
3. Get enough sleep: Lack of sleep can cause anxiety.
4. Eat well: A balanced diet with plenty of fiber, vitamins, minerals, and healthy fats can improve digestion and promote good nutrient absorption which can aid in reducing symptoms of anxietiy.
5. Practice mindfulness: Pay attention to what you're doing without judgment. Focus on your breath or other sensory experiences. This practice can

Q3: I've been feeling stressed at work. Any suggestions?
----------------------------------------
A: How long have you been feeling this way? What else is on your mind?

Q4: How can I improve my sleep?
-------------------
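
One thing worth tightening in `generate_response`: it finds the answer by splitting the decoded text on `<|assistant|>`, which gets confused if the model ever writes that marker itself. A sturdier approach is to skip the prompt by position, since `generate()` on a decoder-only model returns the prompt tokens followed by the new ones. Here's a minimal sketch on plain Python lists. `extract_new_tokens` is a hypothetical helper, and the EOS ID of 2 is the value this notebook uses for `</s>`.

```python
def extract_new_tokens(output_ids, prompt_len, eos_token_id=2):
    """Return only the generated token IDs, cut at the first EOS."""
    new_ids = list(output_ids[prompt_len:])
    if eos_token_id in new_ids:
        new_ids = new_ids[:new_ids.index(eos_token_id)]
    return new_ids


# Toy example: 3 prompt tokens, 2 generated tokens, then EOS (plus padding)
fake_output = [1, 529, 29989, 15043, 727, 2, 2]
print(extract_new_tokens(fake_output, prompt_len=3))

# Inside generate_response this would look something like:
#   prompt_len = inputs["input_ids"].shape[1]
#   new_ids = extract_new_tokens(outputs[0].tolist(), prompt_len,
#                                tokenizer.eos_token_id)
#   response = tokenizer.decode(new_ids, skip_special_tokens=True).strip()
```

This also makes the string cleanup (`.replace("</s>", "")`) unnecessary, because the EOS token never reaches the decoder.
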

## Chapter 11: Load Model Later

In [25]:
# HOW TO LOAD YOUR MODEL LATER
#
# You trained the model, saved the LoRA adapters, now the session ends
# Tomorrow you come back and want to USE the model
# Or you want to use it on a different machine
# Or share it with a friend
#
# THIS IS HOW YOU DO IT
#
# THE PROCESS:
#   1. Load the base model (TinyLlama) - download from HuggingFace or cache
#   2. Load YOUR LoRA adapters - from the folder you saved
#   3. Combine them - PeftModel does this automatically
#   4. Done! Use generate_response() like before
#
# WHY TWO STEPS:
#   We only saved the adapters (~200 MB), not full model (2.2 GB)
#   Base model lives on HuggingFace Hub
#   We download base once, then apply our custom adapters
#   Like downloading Photoshop once, then loading different edit files

from peft import PeftModel
# PeftModel = the wrapper class that combines base model + adapters
#
# We imported other peft stuff earlier (LoraConfig, get_peft_model)
# PeftModel is specifically for LOADING saved adapters
#
# get_peft_model: add NEW adapters to a model (training time)
# PeftModel.from_pretrained: load SAVED adapters (inference time)

def load_model(base_model_name, lora_path):
    # A reusable function to load your fine-tuned model
    #
    # PARAMETERS:
    #   base_model_name: HuggingFace model ID
    #                    "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
    #
    #   lora_path: where you saved the LoRA adapters
    #              "./fine_tuned_lora" or any path
    #
    # RETURNS:
    #   model: ready-to-use model with LoRA adapters applied
    #   tokenizer: matching tokenizer
    #
    # USAGE:
    #   model, tokenizer = load_model("TinyLlama/...", "./fine_tuned_lora")
    #   response = generate_response(model, tokenizer, "Hi!")
    
    """Load base model with LoRA adapters."""
    # Docstring - short description of what function does
    
    tokenizer = AutoTokenizer.from_pretrained(lora_path)
    # LOAD TOKENIZER FROM YOUR SAVED FOLDER
    #
    # Remember: we saved tokenizer alongside LoRA adapters
    # lora_path contains:
    #   - adapter_config.json
    #   - adapter_model.safetensors
    #   - tokenizer.json         <- loading this!
    #   - tokenizer_config.json  <- and this!
    #   - special_tokens_map.json
    #
    # WHY FROM lora_path (not base_model_name)?
    #   We saved our exact tokenizer configuration
    #   Guaranteed to match what we trained with
    #   No risk of version mismatch
    #
    #   Could also do: AutoTokenizer.from_pretrained(base_model_name)
    #   Would work, but using saved copy is safer
    
    base_model = AutoModelForCausalLM.from_pretrained(
        # LOAD THE BASE MODEL (same as during training)
        #
        # This is the ORIGINAL TinyLlama, no fine-tuning
        # Same model everyone downloads
        # Our customization comes from LoRA adapters (loaded separately)
        
        base_model_name,
        # "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
        #
        # First time: downloads from HuggingFace Hub (~2.2 GB)
        # After that: loads from local cache (fast!)
        #
        # Cache location: ~/.cache/huggingface/hub/
        # The heavy download only happens once per machine
        
        quantization_config=BitsAndBytesConfig(
            load_in_4bit=True,
            bnb_4bit_quant_type="nf4",
            bnb_4bit_compute_dtype=torch.bfloat16,
            bnb_4bit_use_double_quant=True,
        ),
        # SAME QUANTIZATION CONFIG AS TRAINING
        #
        # We trained with 4-bit quantization
        # We should load with 4-bit quantization
        # Keeps memory usage low (~0.7 GB)
        #
        # COULD YOU LOAD WITHOUT QUANTIZATION?
        #   Yes! Load in fp16 or fp32
        #   Would use more memory but might be slightly better quality
        #   For inference, 4-bit is usually fine
        #
        # THE SETTINGS:
        #   load_in_4bit=True         - use 4-bit weights
        #   bnb_4bit_quant_type="nf4" - NormalFloat4 format
        #   bnb_4bit_compute_dtype=torch.bfloat16 - math in bf16
        #   bnb_4bit_use_double_quant=True - extra compression
        #
        # Same as training - consistency is key
        
        device_map="auto",
        # Automatically put model on GPU (or CPU if no GPU)
        # Same as training
        
        torch_dtype=torch.bfloat16,
        # Computation precision
        # Same as training
    )
    # NOW WE HAVE: base TinyLlama model, quantized, on GPU
    # But it's NOT fine-tuned yet - just vanilla TinyLlama
    # The magic happens in the next step...
    
    model = PeftModel.from_pretrained(base_model, lora_path)
    # COMBINE BASE MODEL + LORA ADAPTERS
    #
    # THIS IS THE KEY STEP!
    #
    # PeftModel.from_pretrained():
    #   Takes: base model + path to saved adapters
    #   Does: loads adapters from disk, attaches to base model
    #   Returns: combined model ready for inference
    #
    # WHAT IT LOADS FROM lora_path:
    #   adapter_config.json - LoRA settings (r=64, alpha=128, etc.)
    #   adapter_model.safetensors - the actual trained weights (~200MB)
    #
    # WHAT HAPPENS INTERNALLY:
    #   1. Read adapter_config.json to know LoRA structure
    #   2. Create empty LoRA matrices matching that config
    #   3. Load trained weights from adapter_model.safetensors
    #   4. Attach LoRA layers to base model's target modules
    #   5. Return wrapped PeftModel
    #
    # RESULT:
    #   base_model: vanilla TinyLlama (frozen, unchanged)
    #   model: TinyLlama + YOUR trained adapters
    #
    # model = base_model + your mental health fine-tuning
    # When you generate text, LoRA adapters modify the output
    # You get mental health responses, not generic ones!
    
    model.eval()
    # SET MODEL TO EVALUATION MODE
    #
    # PyTorch models have two modes:
    #   .train() - training mode
    #   .eval()  - evaluation/inference mode
    #
    # WHAT'S DIFFERENT IN EVAL MODE:
    #   1. Dropout is DISABLED
    #      Training: randomly drop 10% of connections (regularization)
    #      Inference: use ALL connections (best predictions)
    #
    #   2. BatchNorm uses running statistics
    #      (Not relevant for LLMs, but matters for other models)
    #
    #   3. It does NOT switch off sampling randomness
    #      With do_sample=True, outputs still vary from run to run
    #      .eval() only removes training-time noise such as dropout
    #
    # IMPORTANT:
    #   We're loading for INFERENCE, not more training
    #   .eval() ensures we get best predictions
    #   
    # COMMON MISTAKE:
    #   Forgetting .eval() when loading for inference
    #   Model still works but dropout is active
    #   Slightly worse and non-deterministic outputs
    #
    # ALWAYS call .eval() when loading for inference!
    
    return model, tokenizer
    # Return both model and tokenizer
    #
    # WHY RETURN BOTH:
    #   You need both to generate text
    #   tokenizer: convert question to tokens
    #   model: generate response tokens
    #   tokenizer: convert response back to text
    #
    #   Convenient to get both from one function call
    #
    # USAGE:
    #   model, tokenizer = load_model(...)
    #   response = generate_response(model, tokenizer, "Hi!")

print("To load later:")
print(f"model, tokenizer = load_model('{model_name}', './fine_tuned_lora')")
# Show the exact command to use later
#
# OUTPUT:
#   To load later:
#   model, tokenizer = load_model('TinyLlama/TinyLlama-1.1B-Chat-v1.0', './fine_tuned_lora')
#
# Copy-paste this command into a new notebook!
# (After defining the load_model function of course)
#
# WHY PRINT model_name:
#   You might forget which base model you used
#   This saves you from guessing later
#   The base model MUST match what you trained with!

# ============================================================
# COMPLETE WORKFLOW FOR USING SAVED MODEL
# ============================================================
#
# IN A NEW NOTEBOOK OR SCRIPT:
#
# Step 1: Imports
#   import torch
#   from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
#   from peft import PeftModel
#
# Step 2: Define load_model function (copy from above)
#
# Step 3: Define generate_response function (copy from earlier)
#
# Step 4: Load the model
#   model, tokenizer = load_model(
#       "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
#       "./fine_tuned_lora"  # or wherever you saved it
#   )
#
# Step 5: Use it!
#   response = generate_response(model, tokenizer, "How do I manage anxiety?")
#   print(response)
#
# ============================================================
# LOADING FROM DIFFERENT LOCATIONS
# ============================================================
#
# SAME MACHINE, SAME FOLDER:
#   lora_path = "./fine_tuned_lora"
#   (what we've been using)
#
# SAME MACHINE, DIFFERENT FOLDER:
#   lora_path = "/home/user/my_models/mental_health_lora"
#   (just change the path)
#
# DIFFERENT MACHINE:
#   1. Copy the fine_tuned_lora folder to new machine
#   2. Use the path where you put it
#   lora_path = "/path/on/new/machine/fine_tuned_lora"
#
# FROM GOOGLE DRIVE (Colab):
#   from google.colab import drive
#   drive.mount('/content/drive')
#   lora_path = "/content/drive/MyDrive/fine_tuned_lora"
#
# FROM HUGGINGFACE HUB (if you uploaded):
#   lora_path = "your-username/mental-health-tinyllama-lora"
#   (works just like a model name!)
#
# ============================================================
# ALTERNATIVE: MERGE LORA INTO BASE MODEL
# ============================================================
#
# Instead of loading base + adapters separately every time,
# you can MERGE them into one model:
#
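#   # Load the base model in 16-bit, WITHOUT quantization_config
#   # (merging into 4-bit weights adds rounding error, and the
#   # ~2.2 GB size below assumes 16-bit weights)
#   base_model = AutoModelForCausalLM.from_pretrained(
#       base_model_name, torch_dtype=torch.float16, device_map="auto"
#   )
#   model = PeftModel.from_pretrained(base_model, lora_path)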
#   
#   # Merge LoRA into base weights
#   merged_model = model.merge_and_unload()
#   
#   # Save the merged model
#   merged_model.save_pretrained("./merged_model")
#   tokenizer.save_pretrained("./merged_model")
#
# NOW:
#   - merged_model is a regular model (no LoRA wrapper)
#   - File size: ~2.2 GB (full model, not just adapters)
#   - Load with just AutoModelForCausalLM.from_pretrained()
#   - No need for PeftModel
#
# PROS OF MERGING:
#   - Simpler loading (one step, not two)
#   - Slightly faster inference (no LoRA overhead)
#   - Works with tools that don't support PEFT
#
# CONS OF MERGING:
#   - Much larger file size (2.2 GB vs 200 MB)
#   - Can't easily swap adapters anymore
#   - Loses the flexibility of LoRA
#
# For most cases, keeping adapters separate is better!
#
# ============================================================
# TROUBLESHOOTING
# ============================================================
#
# ERROR: "Can't find adapter_config.json"
#   The lora_path is wrong, or files weren't saved correctly
#   Check: os.listdir(lora_path) - do you see the files?
#
# ERROR: "Model architecture mismatch"
#   You're using a different base_model_name than you trained with
#   Must use exact same model: "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
#
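# ERROR: "CUDA out of memory"
#   GPU doesn't have enough memory (VRAM)
#   Make sure quantization_config is included (4-bit loading)
#   Or fall back to CPU: use device_map="cpu" AND drop quantization_config
#   (bitsandbytes 4-bit generally expects a CUDA GPU; CPU is much slower)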
#
# GARBAGE OUTPUT:
#   Are you using the same prompt format as training?
#   Same </s> tokens after each section?
#   Check generate_response function matches training format!

To load later:
model, tokenizer = load_model('TinyLlama/TinyLlama-1.1B-Chat-v1.0', './fine_tuned_lora')


# Done! You Just Fine-Tuned an LLM

### That's a 1.1-billion-parameter model adapted to do YOUR bidding. Not bad for a day's work.

---

## What You Actually Accomplished

Let's take a step back and appreciate what just happened:
```
BEFORE THIS NOTEBOOK:
- TinyLlama: A generic chatbot that gives Wikipedia-style responses
- You: "What's LoRA? What's quantization? Why is my GPU crying?"

AFTER THIS NOTEBOOK:
- Your Model: A mental health assistant that gives empathetic, supportive responses
- You: "I understand every line of code and can adapt this to any dataset"
```

You didn't just copy-paste code. You actually learned:

### 1. The Concepts

| Concept | What It Is | Why It Matters |
|---------|-----------|----------------|
| **Fine-tuning** | Teaching a pre-trained model YOUR specific task | Makes generic models useful for YOUR use case |
| **LoRA** | Train only ~4% of parameters (freeze the rest) | 25x less memory for gradients/optimizer |
| **Quantization** | Compress weights from 32-bit to 4-bit | 8x smaller model fits on cheap GPUs |
| **Prompt Format** | The exact template the model expects | Wrong format = garbage output (THE golden rule) |
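
Where do "25x" and "8x" come from? Here is a rough sketch of the arithmetic. It assumes fp32 gradients and an AdamW optimizer that keeps two fp32 buffers per trainable parameter. Those are assumptions for illustration, not measurements from this notebook, and real numbers shift with the optimizer and precision you pick.

```python
# Rough arithmetic behind the LoRA and Quantization rows.
# Assumptions: fp32 gradients, AdamW with two fp32 moment buffers per parameter.
TOTAL_PARAMS = 1.1e9               # TinyLlama
TRAINABLE_FRACTION = 0.04          # the "~4%" from the table
BYTES_PER_TRAINABLE = 4 + 4 + 4    # gradient + AdamW first moment + second moment


def training_state_gb(num_trainable: float) -> float:
    """Memory for gradients and optimizer states, in GB (1 GB = 1e9 bytes)."""
    return num_trainable * BYTES_PER_TRAINABLE / 1e9


full_ft_gb = training_state_gb(TOTAL_PARAMS)
lora_gb = training_state_gb(TOTAL_PARAMS * TRAINABLE_FRACTION)
compression = 32 / 4               # 32-bit weights stored in 4 bits

print(f"Full fine-tuning: {full_ft_gb:.1f} GB of gradients + optimizer state")
print(f"LoRA (~4%):       {lora_gb:.2f} GB ({full_ft_gb / lora_gb:.0f}x less)")
print(f"Quantization:     {compression:.0f}x smaller weights")
```

Under these assumptions, full fine-tuning needs roughly 13 GB just for gradients and optimizer state, while LoRA needs about half a gigabyte.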

### 2. The Code

Every cell in this notebook, you now understand:
- WHAT it does
- WHY it's there
- What BREAKS if you change it
- How to ADAPT it for your own use case

That's the difference between "tutorial follower" and "practitioner."

### 3. The Workflow
```
The Fine-Tuning Recipe (now burned into your brain):

1. GET DATA
   - Format as {'question': '...', 'answer': '...'}
   - More data = better model (usually)

2. LOAD BASE MODEL  
   - Pick a model (TinyLlama, Mistral, Llama, etc.)
   - Apply 4-bit quantization (BitsAndBytesConfig)
   - Prepare for training (prepare_model_for_kbit_training)

3. ADD LORA
   - Configure: rank, alpha, target modules
   - Apply: get_peft_model()
   - Now only ~4% of params are trainable

4. FORMAT DATA (CRITICAL!)
   - Must match model's expected format EXACTLY
   - TinyLlama: <|system|>...</s><|user|>...</s><|assistant|>...</s>
   - Miss one </s> = garbage output

5. TRAIN
   - Set hyperparameters (epochs, LR, batch size)
   - Create SFTTrainer
   - trainer.train() and watch the loss go down

6. SAVE
   - Save LoRA adapters (~200 MB, not 2+ GB)
   - Save tokenizer too (for convenience)

7. USE
   - Load base model + LoRA adapters
   - Use SAME format as training
   - Generate responses
```

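This same workflow carries over to most causal language models, datasets, and tasks. The main thing that changes from model to model is the prompt format in step 4.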

---

## Quick Reference: Common Issues and Fixes

Bookmark this. You'll need it.

### Problem: Model outputs garbage (random tokens, HTML tags, repetitive nonsense)
```
Cause:    99% of the time, your inference format doesn't match training format
Fix:      Check your prompt template character by character
          - </s> after system message?
          - </s> after user message?
          - Same special tokens?
          - Same whitespace/newlines?
```
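
One way to make a format mismatch much harder to hit: build the training text and the inference prompt from the same function. Below is a minimal sketch. `build_prompt` and `SYSTEM_PROMPT` are hypothetical names, and the exact newline placement is an assumption, so copy whatever your training cell actually used.

```python
from typing import Optional

# Hypothetical system message; use the one from your own training data.
SYSTEM_PROMPT = "You are a supportive mental health assistant."


def build_prompt(question: str, answer: Optional[str] = None) -> str:
    """Build a TinyLlama-style chat prompt.

    With an answer    -> a full training example.
    Without an answer -> an inference prompt that stops exactly where
                         the model should start writing.
    """
    prompt = (
        f"<|system|>\n{SYSTEM_PROMPT}</s>\n"
        f"<|user|>\n{question}</s>\n"
        f"<|assistant|>\n"
    )
    if answer is not None:
        prompt += f"{answer}</s>"
    return prompt


train_text = build_prompt("I'm feeling anxious", "That sounds really hard.")
infer_text = build_prompt("I'm feeling anxious")

# The inference prompt must be an exact prefix of the training text.
assert train_text.startswith(infer_text)
```

If that final `assert` ever fails, your two formats have drifted apart, and you've found your bug.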

### Problem: Loss stays high or spikes randomly
```
Cause:    Learning rate too high
Fix:      Lower it: 2e-4 --> 1e-4 --> 5e-5 --> 2e-5
          Start conservative, increase if training is too slow
```

### Problem: Training loss goes down, but validation loss goes UP
```
Cause:    Overfitting (model memorizing, not learning)
Fix:      - Reduce epochs: 5 --> 3 --> 2
          - Increase lora_dropout: 0.1 --> 0.15 --> 0.2
          - Increase weight_decay: 0.05 --> 0.1
          - Get more training data (best fix!)
```
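
To pick the "reduce epochs" number without guessing, look at where validation loss bottomed out. A small sketch with made-up loss values (`best_epoch` is a hypothetical helper, not something defined earlier in this notebook):

```python
from typing import List


def best_epoch(val_losses: List[float], patience: int = 1) -> int:
    """Return the 1-based epoch with the lowest validation loss.

    Scanning stops once the loss has failed to improve `patience`
    times in a row, which mimics simple early stopping.
    """
    best_idx, best_loss, bad_epochs = 0, float("inf"), 0
    for i, loss in enumerate(val_losses):
        if loss < best_loss:
            best_idx, best_loss, bad_epochs = i, loss, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                break
    return best_idx + 1


# Made-up numbers showing the classic overfitting pattern:
# training loss keeps falling, validation loss turns around.
train_losses = [1.90, 1.40, 1.10, 0.85, 0.60]
val_losses = [1.95, 1.55, 1.50, 1.62, 1.80]

print(f"Stop after epoch {best_epoch(val_losses)}")
```

The `transformers` library also ships an `EarlyStoppingCallback` that does this during training, provided you evaluate regularly and set `load_best_model_at_end=True`.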

### Problem: CUDA out of memory
```
Cause:    GPU doesn't have enough RAM
Fix:      - Reduce batch size: 4 --> 2 --> 1
          - Reduce max_seq_length: 512 --> 256 --> 128
          - Make sure quantization is enabled
          - Make sure gradient_checkpointing is True
          - Reduce LoRA rank: 64 --> 32 --> 16
```
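
Shrinking the batch size doesn't have to change how training behaves: raise `gradient_accumulation_steps` so the effective batch stays the same. A quick sketch of the arithmetic, where the target of 16 is a hypothetical value and not necessarily what this notebook used:

```python
def accumulation_steps(target_effective_batch: int, per_device_batch: int) -> int:
    """Gradient accumulation steps needed to keep the effective batch size."""
    if target_effective_batch % per_device_batch != 0:
        raise ValueError("target_effective_batch must be a multiple of per_device_batch")
    return target_effective_batch // per_device_batch


TARGET = 16  # hypothetical effective batch size

for per_device in (4, 2, 1):
    steps = accumulation_steps(TARGET, per_device)
    print(
        f"per_device_train_batch_size={per_device}, "
        f"gradient_accumulation_steps={steps} "
        f"-> effective batch {per_device * steps}"
    )
```

Smaller per-device batches use less memory per step. Training just takes more steps to process the same amount of data.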

### Problem: Model repeats itself endlessly
```
Cause:    Generation parameters not tuned
Fix:      - Add repetition_penalty=1.2
          - Add no_repeat_ngram_size=3
          - Lower temperature: 0.9 --> 0.7
```
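
Here's roughly how those settings look as keyword arguments for `model.generate()`, plus a tiny helper to check whether an output actually loops. This is a sketch: `max_new_tokens=256` is a hypothetical cap, and `has_repeated_ngram` is an illustrative helper that splits on whitespace instead of using real tokenizer tokens.

```python
# Keyword arguments you could pass as model.generate(**inputs, **generation_kwargs)
generation_kwargs = {
    "max_new_tokens": 256,       # hypothetical cap on response length
    "do_sample": True,           # temperature only matters when sampling
    "temperature": 0.7,
    "repetition_penalty": 1.2,   # penalize tokens that already appeared
    "no_repeat_ngram_size": 3,   # never emit the same 3-token sequence twice
}


def has_repeated_ngram(tokens, n=3):
    """Return True if any run of n consecutive tokens occurs more than once."""
    seen = set()
    for i in range(len(tokens) - n + 1):
        gram = tuple(tokens[i:i + n])
        if gram in seen:
            return True
        seen.add(gram)
    return False


looping = "i am here for you i am here for you".split()
healthy = "i am here for you whenever you need".split()

print(has_repeated_ngram(looping))
print(has_repeated_ngram(healthy))
```

Running the helper over a handful of generated responses is a cheap way to see whether the model loops before and after you change the settings.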

---

## Where To Go From Here

You've got the foundation. Here's how to level up:

### Level 1: Try Your Own Dataset

The mental health dataset was just an example. You can fine-tune for:

- **Customer Support:** Your company's support tickets
- **Sales:** Your sales call transcripts  
- **Legal:** Legal Q&A pairs
- **Medical:** Medical conversations (be careful with this one)
- **Creative Writing:** Stories in a specific style
- **Code:** Your codebase + documentation
- **Anything:** If you have question-answer pairs, you can fine-tune

Just format your data as:
```python
{'question': 'user input here', 'answer': 'desired output here'}
```

Use the same code, swap the dataset. Done.
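
If your data lives in a CSV export, getting it into that shape takes a few lines. A minimal sketch, where the column names `ticket` and `reply` are hypothetical and the in-memory string stands in for a real file:

```python
import csv
import io


def rows_to_examples(csv_file, question_col, answer_col):
    """Turn CSV rows into {'question': ..., 'answer': ...} dicts.

    Rows with an empty question or answer are skipped.
    """
    examples = []
    for row in csv.DictReader(csv_file):
        question = (row.get(question_col) or "").strip()
        answer = (row.get(answer_col) or "").strip()
        if question and answer:
            examples.append({"question": question, "answer": answer})
    return examples


# Stand-in for open("tickets.csv", newline="") on a real export.
raw = io.StringIO(
    "ticket,reply\n"
    "Where is my order?,It ships tomorrow.\n"
    "Hello?,\n"
)

examples = rows_to_examples(raw, "ticket", "reply")
print(examples)
```

The second row has no reply, so it gets dropped. Half-empty pairs add noise to training without teaching the model anything.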

### Level 2: Experiment with LoRA Settings

We used `r=64`, but you can experiment:
```
r=8    --> Very efficient, limited capacity
          Good for: Simple style changes
          Trainable params: ~6M

r=16   --> Efficient, decent capacity  
          Good for: Light adaptation
          Trainable params: ~13M

r=32   --> Balanced
          Good for: Learning new patterns
          Trainable params: ~25M

r=64   --> High capacity (what we used)
          Good for: Learning new domain knowledge
          Trainable params: ~50M

r=128  --> Very high capacity
          Good for: Complex tasks
          Trainable params: ~100M
```

Start with r=16. If results are bad, increase. Find the minimum that works.
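
Want to know the trainable parameter count before you train? LoRA adds `r * (in_features + out_features)` parameters per adapted layer, so the count grows linearly with `r`. The sketch below assumes TinyLlama's published layer sizes and that all seven attention and MLP projections are targeted. Your `target_modules` may differ, so treat the output as an estimate.

```python
# Assumed TinyLlama-1.1B dimensions (from its published config).
HIDDEN = 2048          # hidden_size
KV_DIM = 256           # 4 key/value heads * 64 dims each
INTERMEDIATE = 5632    # MLP intermediate_size
NUM_LAYERS = 22

# (in_features, out_features) for each adapted linear layer.
# Assumption: all seven projections are listed in target_modules.
TARGET_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_trainable_params(r: int) -> int:
    """LoRA adds an (in x r) and an (r x out) matrix to every target layer."""
    per_block = sum(r * (n_in + n_out) for n_in, n_out in TARGET_SHAPES.values())
    return per_block * NUM_LAYERS


for r in (8, 16, 32, 64, 128):
    print(f"r={r:<4} -> {lora_trainable_params(r) / 1e6:.1f}M trainable params")
```

Doubling `r` doubles the adapter size, so dropping from r=64 to r=16 cuts trainable parameters (and the saved adapter file) to a quarter.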

### Level 3: Try Bigger Models

TinyLlama (1.1B) is great for learning, but bigger is usually better:
```
If you have 16GB GPU (T4, P100):
  - Phi-2 (2.7B) - surprisingly capable
  - Mistral-7B with aggressive settings
  
If you have 24GB GPU (RTX 3090/4090):
  - Mistral-7B comfortably
  - Llama-2-7B
  - Llama-2-13B with quantization

If you have 40GB+ GPU (A100):
  - Pretty much anything
  - Llama-2-70B with 4-bit quantization (tight on 40GB, comfortable on 80GB)
```

Same code, just change:
```python
model_name = "mistralai/Mistral-7B-v0.1"
# or
model_name = "meta-llama/Llama-2-7b-hf"
```
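
Before you swap in a bigger model, a quick sanity check on whether the weights alone will fit is worth it. This sketch uses approximate parameter counts and counts only the 4-bit weights. Activations, LoRA adapters, optimizer state, and quantization overhead all come on top, so leave yourself several GB of headroom.

```python
# Approximate parameter counts (rounded, for estimation only).
MODELS = {
    "TinyLlama-1.1B": 1.1e9,
    "Phi-2": 2.7e9,
    "Mistral-7B": 7.2e9,
    "Llama-2-13B": 13e9,
    "Llama-2-70B": 70e9,
}


def weights_gb(num_params: float, bits: int = 4) -> float:
    """Raw weight size in GB (1 GB = 1e9 bytes) at the given bit width."""
    return num_params * bits / 8 / 1e9


for name, params in MODELS.items():
    print(f"{name:<16} ~{weights_gb(params):5.1f} GB in 4-bit")
```

By this estimate a 70B model needs about 35 GB for weights alone, which is why it only makes sense on the largest GPUs.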

### Level 4: Learn More Techniques

Once you're comfortable with basic fine-tuning:

- **DPO (Direct Preference Optimization):** Train on preference data (A is better than B)
- **RLHF:** Reinforcement Learning from Human Feedback (a key part of how ChatGPT was trained)
- **Merging:** Combine multiple LoRA adapters
- **Quantization-Aware Training:** Simulate low-precision weights during training so the quantized model loses less accuracy

But master the basics first. Everything else builds on what you learned today.

---

## The Golden Rule (One Last Time)

I know I've said it a hundred times, but I'll say it once more:
```
+------------------------------------------------------------------+
|                                                                  |
|   TRAINING FORMAT  ====  INFERENCE FORMAT                        |
|                                                                  |
|   They must match EXACTLY.                                       |
|                                                                  |
|   This is the #1 cause of "my fine-tuned model doesn't work."    |
|   Check it first. Check it always. Check it when in doubt.       |
|                                                                  |
+------------------------------------------------------------------+
```

---

## Final Words

You started this notebook wondering what fine-tuning even is.

You're ending it with:
- A working fine-tuned model
- Understanding of LoRA and quantization
- Knowledge of the entire fine-tuning workflow
- Ability to adapt this to any dataset or model

That's real progress. That's a real skill.

Now go build something cool with it.

**If this notebook helped you, please upvote it!** It helps others find it too.

Good luck out there.

---

*P.S. - When your fine-tuned model inevitably outputs garbage the first time you try it on your own dataset, remember: check the prompt format. It's always the prompt format.*