# Notebook 03 • Fine-Tuning Preparation

**Goal:** Prepare datasets, understand LoRA, and set up for fine-tuning experiments.

---

## 1. Setup

In [None]:
# Install dependencies
# !pip install transformers datasets peft accelerate bitsandbytes torch scikit-learn

import json
import os
from pathlib import Path
from datasets import Dataset
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import LoraConfig, get_peft_model, TaskType
import torch

## 2. Dataset Format

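Instruction-tuning datasets typically store each example as a record with three fields: `instruction` (the task), `input` (optional context, may be empty), and `output` (the expected response):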

In [None]:
# Example dataset entries
SAMPLE_DATASET = [
    {
        "instruction": "Explain what a Docker container is.",
        "input": "",
        "output": "A Docker container is a lightweight, standalone package that includes everything needed to run software: code, runtime, libraries, and settings. It's isolated from other containers and the host system."
    },
    {
        "instruction": "Convert this time to 24-hour format.",
        "input": "3:30 PM",
        "output": "15:30"
    },
    {
        "instruction": "Write a git commit message for these changes.",
        "input": "Fixed login bug, updated README, added unit tests",
        "output": "fix: resolve login authentication bug\n\n- Fix session timeout handling\n- Update README with new API endpoints\n- Add unit tests for auth module"
    },
]

print(f"Sample entries: {len(SAMPLE_DATASET)}")
print(json.dumps(SAMPLE_DATASET[0], indent=2))
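
Before training, it can help to check that every record actually follows this three-field layout. The sketch below is one possible validator; the helper name `validate_entry` and the rule that `instruction` and `output` must be non-empty are assumptions made for illustration, not requirements of any library.

```python
REQUIRED_KEYS = {"instruction", "input", "output"}

def validate_entry(entry):
    """Return a list of problems found in one dataset entry (empty if valid)."""
    problems = []
    missing = REQUIRED_KEYS - set(entry)
    if missing:
        problems.append(f"missing keys: {sorted(missing)}")
    # Every field that is present should be a string.
    for key in sorted(REQUIRED_KEYS & set(entry)):
        if not isinstance(entry[key], str):
            problems.append(f"{key!r} must be a string")
    # "input" may be empty, but "instruction" and "output" should carry text
    # (an assumption of this sketch).
    for key in ("instruction", "output"):
        if isinstance(entry.get(key), str) and not entry[key].strip():
            problems.append(f"{key!r} is empty")
    return problems

good = {"instruction": "Convert this time to 24-hour format.", "input": "3:30 PM", "output": "15:30"}
bad = {"instruction": "", "output": 42}

print(validate_entry(good))
print(validate_entry(bad))
```

Running the validator over the whole dataset and printing only the entries with problems keeps the check cheap to repeat as the dataset grows.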

## 3. Creating Your Dataset

Build a dataset for a specific use case. Let's create a DevOps Q&A dataset.

In [None]:
# DevOps Q&A dataset
DEVOPS_QA = [
    {"instruction": "What is CI/CD?", "input": "", "output": "CI/CD stands for Continuous Integration and Continuous Deployment. CI automatically builds and tests code changes. CD automatically deploys validated changes to production. Together, they enable faster, safer software delivery."},
    {"instruction": "Explain Kubernetes pods.", "input": "", "output": "A Kubernetes pod is the smallest deployable unit. It contains one or more containers that share storage, network, and settings. Pods are ephemeral - they can be created, destroyed, and replaced as needed."},
    {"instruction": "What's the difference between Docker and Kubernetes?", "input": "", "output": "Docker packages applications into containers. Kubernetes orchestrates containers across clusters. Think of Docker as the container runtime and Kubernetes as the container manager at scale."},
    {"instruction": "What is infrastructure as code?", "input": "", "output": "Infrastructure as Code (IaC) manages infrastructure through configuration files rather than manual processes. Tools like Terraform, Pulumi, and CloudFormation let you version, review, and automate infrastructure changes."},
    {"instruction": "Explain blue-green deployment.", "input": "", "output": "Blue-green deployment maintains two identical environments: blue (current) and green (new). Traffic switches from blue to green after validation. This enables zero-downtime deployments and quick rollbacks."},
    {"instruction": "What is a load balancer?", "input": "", "output": "A load balancer distributes incoming traffic across multiple servers. It improves availability, scalability, and reliability. Types include Layer 4 (transport) and Layer 7 (application) balancers."},
    {"instruction": "Explain microservices architecture.", "input": "", "output": "Microservices break applications into small, independent services. Each service handles a specific business function, communicates via APIs, and can be deployed separately. Benefits include scalability and flexibility; challenges include complexity and distributed system issues."},
    {"instruction": "What is observability?", "input": "", "output": "Observability is the ability to understand a system's internal state from its outputs. The three pillars are: metrics (numbers), logs (events), and traces (request paths). It's essential for debugging distributed systems."},
]

print(f"DevOps Q&A entries: {len(DEVOPS_QA)}")

# Save as JSONL
data_dir = Path("../data")
data_dir.mkdir(parents=True, exist_ok=True)

with open(data_dir / "devops_qa.jsonl", "w") as f:
    for entry in DEVOPS_QA:
        f.write(json.dumps(entry) + "\n")

print(f"Saved to {data_dir / 'devops_qa.jsonl'}")
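
The same write loop recurs whenever a JSONL file is saved in this notebook, so it may be worth wrapping it in a pair of helpers. The names `write_jsonl` and `read_jsonl` are hypothetical, and the demo writes to a temporary directory rather than `../data` so it has no side effects.

```python
import json
import tempfile
from pathlib import Path

def write_jsonl(path, entries):
    """Write one JSON object per line, UTF-8 encoded."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with open(path, "w", encoding="utf-8") as f:
        for entry in entries:
            f.write(json.dumps(entry, ensure_ascii=False) + "\n")

def read_jsonl(path):
    """Read a JSONL file back into a list of dicts, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

# Round-trip demo in a temporary directory.
entries = [{"instruction": "What is a load balancer?", "input": "", "output": "It distributes traffic."}]
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "demo.jsonl"
    write_jsonl(demo_path, entries)
    round_trip = read_jsonl(demo_path)

print(round_trip == entries)
```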

## 4. Train/Validation Split

In [None]:
from sklearn.model_selection import train_test_split

# Load and split
with open(data_dir / "devops_qa.jsonl") as f:
    all_data = [json.loads(line) for line in f]

train_data, val_data = train_test_split(all_data, test_size=0.2, random_state=42)

# Save splits
with open(data_dir / "train.jsonl", "w") as f:
    for entry in train_data:
        f.write(json.dumps(entry) + "\n")

with open(data_dir / "val.jsonl", "w") as f:
    for entry in val_data:
        f.write(json.dumps(entry) + "\n")

print(f"Train: {len(train_data)}, Validation: {len(val_data)}")
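
A random split does not protect against near-duplicate questions landing in both files, which would make validation scores look better than they are. The following sketch flags instructions shared by both splits; `find_overlap` is a hypothetical helper, and matching on lowercased, whitespace-normalised text is a simplifying assumption.

```python
def find_overlap(train, val):
    """Return instructions that appear in both splits (case- and whitespace-insensitive)."""
    def key(entry):
        return " ".join(entry["instruction"].lower().split())

    train_keys = {key(e) for e in train}
    return sorted({key(e) for e in val} & train_keys)

train_demo = [
    {"instruction": "What is CI/CD?", "input": "", "output": "a"},
    {"instruction": "Explain Kubernetes pods.", "input": "", "output": "b"},
]
val_demo = [{"instruction": "what is  ci/cd?", "input": "", "output": "c"}]

print(find_overlap(train_demo, val_demo))
```

An empty list means no instruction was found in both splits under this simple matching rule; paraphrased duplicates would still slip through.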

## 5. Understanding LoRA

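LoRA (Low-Rank Adaptation) freezes the pretrained weights and trains a pair of small low-rank matrices alongside selected layers, so only a small fraction of the parameters is updated. The rank `r` sets the size of those matrices, and `lora_alpha / r` scales their contribution.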

In [None]:
# LoRA configuration
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,                    # Rank of adaptation
    lora_alpha=32,          # Scaling factor
    lora_dropout=0.1,       # Dropout probability
    target_modules=["q_proj", "v_proj"],  # Attention projections to adapt (names are model-specific)
    bias="none",
)

print("LoRA Config:")
print(f"  Rank (r): {lora_config.r}")
print(f"  Alpha: {lora_config.lora_alpha}")
print(f"  Dropout: {lora_config.lora_dropout}")
print(f"  Target modules: {lora_config.target_modules}")
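
To make the roles of `r` and `lora_alpha` concrete, here is a minimal pure-Python sketch of the effective weight of an adapted layer, `W + (alpha / r) * B @ A`. It assumes the standard LoRA scaling and the initialisation described in the LoRA paper (random `A`, zero `B`); the tiny dimensions and the helper names `matmul` and `lora_weight` are illustrative only and do not mirror PEFT's internals.

```python
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def lora_weight(w, a, b, r, alpha):
    """Return W + (alpha / r) * B @ A, the effective weight of a LoRA-adapted layer."""
    scale = alpha / r
    delta = matmul(b, a)  # (d_out x r) @ (r x d_in) -> d_out x d_in
    return [[w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]

rng = random.Random(0)
d_out, d_in, r, alpha = 4, 6, 2, 8  # toy sizes, chosen for illustration
w = [[rng.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]  # frozen base weight
a = [[rng.gauss(0, 1) for _ in range(d_in)] for _ in range(r)]      # trainable, random init
b = [[0.0] * r for _ in range(d_out)]                               # trainable, zero init

# With B initialised to zero the update is zero, so the adapted layer
# starts out identical to the pretrained one.
print(lora_weight(w, a, b, r, alpha) == w)
```

Under the same scaling rule, the configuration above (`r=8`, `lora_alpha=32`) multiplies the low-rank update by 4.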

In [None]:
# Helper: report trainable vs. total parameters (requires a loaded model)
def print_trainable_parameters(model):
    """Print the number of trainable parameters."""
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total = sum(p.numel() for p in model.parameters())
    print(f"Trainable: {trainable:,} / {total:,} ({100 * trainable / total:.2f}%)")

# This would work with a loaded model:
# model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B")
# model = get_peft_model(model, lora_config)
# print_trainable_parameters(model)
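
Without loading a model, the savings can still be estimated by hand: a LoRA adapter on a `d_out x d_in` weight adds `r * (d_in + d_out)` parameters. The sketch below works this out for one square projection; the hidden size of 4096 is a hypothetical value picked for illustration, not read from a real checkpoint.

```python
def lora_param_count(d_in, d_out, r):
    """Parameters added by one LoRA adapter: A is (r x d_in), B is (d_out x r)."""
    return r * (d_in + d_out)

# Hypothetical dimensions chosen for illustration, not read from a real checkpoint.
hidden = 4096
r = 8

full = hidden * hidden
added = lora_param_count(hidden, hidden, r)

print(f"Full matrix:  {full:,}")
print(f"LoRA adapter: {added:,} ({100 * added / full:.2f}% of the full matrix)")
```

Doubling `r` doubles the adapter size, which is why rank is usually the first knob to adjust when trading capacity against memory.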

## 6. Prompt Template for Training

In [None]:
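def format_prompt(example, tokenizer=None):
    """Format example for instruction tuning.

    If a tokenizer with an EOS token is passed, the EOS token is appended
    so the model can learn where a response ends.
    """
    if example["input"]:
        prompt = f"""### Instruction:
{example['instruction']}

### Input:
{example['input']}

### Response:
{example['output']}"""
    else:
        prompt = f"""### Instruction:
{example['instruction']}

### Response:
{example['output']}"""

    # Skip this if your training script already appends EOS itself
    if tokenizer is not None and tokenizer.eos_token:
        prompt += tokenizer.eos_token

    return prompt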

# Example
example = DEVOPS_QA[0]
print(format_prompt(example, None))
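
At inference time the prompt should match the training template but stop right after the response header, leaving the answer for the model to generate. A minimal sketch of such a builder follows; `format_inference_prompt` is a hypothetical name, not part of any library.

```python
def format_inference_prompt(instruction, input_text=""):
    """Build the prompt up to and including the response header, without an answer."""
    parts = [f"### Instruction:\n{instruction}"]
    if input_text:
        parts.append(f"### Input:\n{input_text}")
    parts.append("### Response:\n")
    return "\n\n".join(parts)

prompt = format_inference_prompt("Convert this time to 24-hour format.", "3:30 PM")
print(prompt)
```

Keeping the headers and blank lines identical between training and inference matters, since a mismatch in the template is a common cause of degraded outputs after fine-tuning.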

## 7. 🎯 Your Tasks

### Task 1: Expand the Dataset
Add at least 20 more DevOps Q&A entries to make a meaningful training set.

In [None]:
# TODO: Add more entries
more_entries = [
    # {"instruction": "...", "input": "", "output": "..."},
]

# Combine and save
# full_dataset = DEVOPS_QA + more_entries
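
When combining lists, duplicated questions are easy to introduce by accident. One way to guard against that is sketched below; `merge_unique` is a hypothetical helper, and treating entries as duplicates when their lowercased `instruction` and `input` match is an assumption you may want to loosen or tighten.

```python
def merge_unique(*datasets):
    """Concatenate datasets, keeping the first entry for each (instruction, input) pair."""
    seen = set()
    merged = []
    for dataset in datasets:
        for entry in dataset:
            key = (entry["instruction"].strip().lower(), entry["input"].strip().lower())
            if key not in seen:
                seen.add(key)
                merged.append(entry)
    return merged

base = [{"instruction": "What is CI/CD?", "input": "", "output": "first"}]
extra = [
    {"instruction": "what is ci/cd?", "input": "", "output": "duplicate"},
    {"instruction": "What is GitOps?", "input": "", "output": "new"},
]

combined = merge_unique(base, extra)
print(len(combined))
```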

### Task 2: Prepare a Different Domain
Create a dataset for a domain you're interested in, such as customer support or code generation.

In [None]:
# TODO: Create your custom dataset
CUSTOM_DATASET = [
    # Add your entries
]

# Save it
# with open(data_dir / "custom.jsonl", "w") as f:
#     for entry in CUSTOM_DATASET:
#         f.write(json.dumps(entry) + "\n")

### Task 3: Document Your Dataset

In [None]:
# Create a README for your data
# Note: this is an f-string so the entry count is filled in;
# the literal JSON braces in the schema are doubled ({{ }}) to escape them.
data_readme = f"""# Dataset Documentation

## devops_qa.jsonl
- **Purpose**: DevOps Q&A fine-tuning
- **Format**: Instruction-tuning JSONL
- **Entries**: {len(DEVOPS_QA)}
- **Split**: 80/20 train/val

## Schema
```json
{{
  "instruction": "The task or question",
  "input": "Optional context (can be empty)",
  "output": "The expected response"
}}
```

## Source
Curated from DevOps documentation and common interview questions.
"""

with open(data_dir / "README.md", "w") as f:
    f.write(data_readme)

print("Data README created!")

## 8. 📝 Reflection

1. Why use LoRA instead of full fine-tuning?
2. What makes a good fine-tuning dataset?
3. How would you evaluate a fine-tuned model?
4. What are the risks of fine-tuning on small datasets?
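
As a starting point for question 3, one simple automatic metric is token-overlap F1 between a model's answer and the reference output. The sketch below is a rough baseline under stated assumptions (lowercasing, whitespace tokenisation, the hypothetical name `token_f1`); it is not a substitute for human review or task-specific checks.

```python
from collections import Counter

def token_f1(prediction, reference):
    """Token-overlap F1 between two strings (lowercased, whitespace-tokenised)."""
    pred = prediction.lower().split()
    ref = reference.lower().split()
    if not pred or not ref:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred == ref)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

print(token_f1("15:30", "15:30"))
print(token_f1("A pod is the smallest deployable unit",
               "The smallest deployable unit is a pod"))
```

Both calls score a perfect match even though the second pair differs in word order, which illustrates a limitation of overlap metrics: they reward shared vocabulary, not correctness.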

---

**Next:** Run `scripts/train_lora.py` to train your model!