# Lab 6: Fine-tuning Large Language Models

This notebook covers fine-tuning large language models (LLMs) using parameter-efficient techniques like LoRA (Low-Rank Adaptation).

## Learning Objectives

By the end of this lab, you will be able to:
- Load and prepare datasets for LLM fine-tuning
- Understand LoRA and parameter-efficient fine-tuning
- Fine-tune Llama-style causal language models using Hugging Face Transformers and PEFT
- Evaluate fine-tuned models
- Use fine-tuned models for inference


## Introduction to LLM Fine-tuning

Fine-tuning adapts pre-trained language models to specific tasks or domains. Instead of training from scratch, we:
1. Start with a pre-trained model (e.g., Llama, GPT)
2. Add task-specific layers or adapt existing parameters
3. Train on domain-specific data

### Why LoRA?

**LoRA (Low-Rank Adaptation)** is a parameter-efficient fine-tuning method that:
- Adds trainable low-rank matrices to existing weights
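- Keeps the base weights frozen, so gradients and optimizer state are needed only for the small adapter matrices, which cuts training memory substantially
- Typically stays close to full fine-tuning quality on many tasks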
- Allows multiple task-specific adapters
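
To make the savings concrete, the sketch below counts the trainable parameters LoRA adds to a single weight matrix. It assumes the standard LoRA formulation, in which a frozen weight `W` of shape `d x k` is used as `W + (alpha / r) * B @ A`, with `B` of shape `d x r` and `A` of shape `r x k`. The helper name `lora_param_count` and the 768-wide layer are illustrative choices, not part of any library.

```python
def lora_param_count(d: int, k: int, r: int) -> int:
    """Trainable parameters LoRA adds to one d x k weight matrix."""
    # B is d x r and A is r x k, so the adapter holds r * (d + k) values.
    return r * (d + k)


d = k = 768          # illustrative layer width (an assumption)
full = d * k         # parameters updated by full fine-tuning of this matrix

for r in (4, 8, 16, 32):
    lora = lora_param_count(d, k, r)
    print(f"r={r:>2}: {lora:>6} trainable vs {full} full ({lora / full:.1%})")
```

The count grows linearly with the rank `r`, which is why `r` is the main knob for trading adapter capacity against memory.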


In [None]:
# Install required packages
# Uncomment if needed:
# !pip install transformers datasets accelerate peft bitsandbytes

import torch
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    TrainingArguments,
    Trainer,
    DataCollatorForLanguageModeling
)
from datasets import load_dataset
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
import os

print("✓ Imports successful!")
print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"CUDA device: {torch.cuda.get_device_name(0)}")




## Step 1: Load Dataset

We'll use a small, hand-written medical Q&A dataset (five examples) for demonstration. In practice, you can use any domain-specific dataset.


In [None]:
# For demonstration, we'll create a simple Q&A dataset
# In practice, you would load from Hugging Face Hub or your own data

# Example: Medical Q&A format
sample_data = {
    "instruction": [
        "What are the symptoms of diabetes?",
        "How is hypertension treated?",
        "What causes migraines?",
        "Explain the function of the heart.",
        "What is the difference between a virus and bacteria?"
    ],
    "input": ["", "", "", "", ""],
    "output": [
        "Common symptoms of diabetes include increased thirst, frequent urination, extreme fatigue, blurred vision, and slow-healing sores.",
        "Hypertension is typically treated with lifestyle changes (diet, exercise) and medications such as ACE inhibitors, beta-blockers, or diuretics.",
        "Migraines can be caused by various factors including stress, hormonal changes, certain foods, sleep patterns, and environmental triggers.",
        "The heart pumps blood throughout the body, delivering oxygen and nutrients to tissues and removing carbon dioxide and waste products.",
        "Viruses are smaller than bacteria, require a host cell to reproduce, and are not considered living organisms. Bacteria are single-celled organisms that can reproduce independently."
    ]
}

# Convert to dataset format
from datasets import Dataset

dataset = Dataset.from_dict(sample_data)

# For a real scenario, you might load from Hugging Face:
# dataset = load_dataset("medical_questions", split="train")

print(f"✓ Dataset loaded: {len(dataset)} examples")
print("\nSample entry:")
print(dataset[0])


## Step 2: Load Pre-trained Model

We'll use a small GPT-2-based model (DialoGPT-small) for demonstration. The same workflow applies to Llama models once you have access to them; for production, consider larger models.
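
Model size is the main driver of memory needs, so a rough estimate helps when choosing a model. The sketch below covers the weights only (no activations, gradients, or optimizer state) and uses round, illustrative parameter counts; `weight_memory_gib` is a hypothetical helper, not a library function.

```python
# Bytes needed to store one parameter in common precisions.
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "int8": 1}


def weight_memory_gib(num_params: int, dtype: str) -> float:
    """Approximate memory for the model weights alone, in GiB."""
    return num_params * BYTES_PER_PARAM[dtype] / 1024**3


# Round, illustrative sizes (assumptions, not measured values).
models = {"~0.1B-parameter model": 100_000_000, "7B-parameter model": 7_000_000_000}

for name, num_params in models.items():
    for dtype in BYTES_PER_PARAM:
        gib = weight_memory_gib(num_params, dtype)
        print(f"{name:<22} {dtype:<8} {gib:6.2f} GiB")
```

Training needs noticeably more than this, which is where frozen base weights and quantized loading help.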


In [None]:
# Model configuration
model_name = "microsoft/DialoGPT-small"  # Using a smaller model for demo
# For Llama models, you would use: "meta-llama/Llama-2-7b-hf" (requires access)

print(f"Loading model: {model_name}")
print("Note: For Llama models, you need Hugging Face access token")

try:
    # Load tokenizer
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    
    # Add padding token if it doesn't exist
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token
    
    # Load model
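    # For 8-bit quantization (memory efficient; needs bitsandbytes and a CUDA GPU):
    # from transformers import BitsAndBytesConfig
    # model = AutoModelForCausalLM.from_pretrained(
    #     model_name,
    #     quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    #     device_map="auto"
    # )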
    
    # For regular loading:
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        torch_dtype=torch.float16 if torch.cuda.is_available() else torch.float32,
        device_map="auto" if torch.cuda.is_available() else None
    )
    
    print(f"✓ Model loaded: {model_name}")
    print(f"  - Model size: {sum(p.numel() for p in model.parameters()) / 1e6:.1f}M parameters")
    
except Exception as e:
    print(f"Error loading model: {e}")
    print("\nUsing a simpler approach for demonstration...")
    print("In practice, ensure you have:")
    print("1. Hugging Face account and access token")
    print("2. Sufficient GPU memory")
    print("3. Model access permissions (for Llama)")


## Step 3: Prepare Data for Training

Format the dataset into prompts that the model can learn from.


In [None]:
def format_prompt(instruction, input_text, output):
    """Format data into a prompt for instruction tuning"""
    if input_text:
        prompt = f"### Instruction:\n{instruction}\n\n### Input:\n{input_text}\n\n### Response:\n{output}"
    else:
        prompt = f"### Instruction:\n{instruction}\n\n### Response:\n{output}"
    return prompt

def tokenize_function(examples):
    """Tokenize the formatted prompts"""
    # Format prompts
    texts = [
        format_prompt(inst, inp, out)
        for inst, inp, out in zip(
            examples["instruction"],
            examples["input"],
            examples["output"]
        )
    ]
    
    # Tokenize
    tokenized = tokenizer(
        texts,
        truncation=True,
        padding=True,
        max_length=512,
        return_tensors="pt"
    )
    
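    # For causal LM, labels start as a copy of input_ids (the model shifts them internally).
    # DataCollatorForLanguageModeling(mlm=False) later rebuilds labels and masks padding with -100.
    tokenized["labels"] = tokenized["input_ids"].clone()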
    
    return tokenized

# Apply tokenization
print("Tokenizing dataset...")
tokenized_dataset = dataset.map(
    tokenize_function,
    batched=True,
    remove_columns=dataset.column_names
)

print("✓ Dataset tokenized")
print(f"  - Number of examples: {len(tokenized_dataset)}")
print(f"\nSample tokenized input (first 50 tokens):")
sample_ids = tokenized_dataset[0]["input_ids"][:50]
print(tokenizer.decode(sample_ids))
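
The cell above computes loss on every token, prompt included. A common variation is to train only on the response by setting the label at each prompt position to `-100`, the value PyTorch's cross-entropy loss ignores by default. The sketch below shows the idea on plain lists of made-up token IDs; `mask_prompt_labels` is a hypothetical helper.

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss


def mask_prompt_labels(input_ids, prompt_length):
    """Copy input_ids into labels, hiding the first prompt_length positions from the loss."""
    labels = list(input_ids)
    n = min(prompt_length, len(labels))
    labels[:n] = [IGNORE_INDEX] * n
    return labels


# Made-up token IDs: four prompt tokens followed by three response tokens.
input_ids = [101, 102, 103, 104, 201, 202, 203]
labels = mask_prompt_labels(input_ids, prompt_length=4)
print(labels)
```

Note that `DataCollatorForLanguageModeling` rebuilds labels from `input_ids`, so applying this in the notebook would likely require a collator that preserves the labels you provide.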


## Step 4: Configure LoRA

Set up LoRA for parameter-efficient fine-tuning.


In [None]:
try:
    from peft import LoraConfig, get_peft_model, TaskType
    
    # Configure LoRA
    lora_config = LoraConfig(
        r=16,  # Rank: lower = fewer parameters
        lora_alpha=32,  # Scaling factor
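        target_modules=["c_attn", "c_proj"],  # GPT-2-style projections used by DialoGPT (c_proj also matches the MLP output); for Llama use ["q_proj", "k_proj", "v_proj", "o_proj"]
        fan_in_fan_out=True,  # GPT-2 stores these weights in Conv1D layers (transposed relative to nn.Linear)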
        lora_dropout=0.05,
        bias="none",
        task_type=TaskType.CAUSAL_LM
    )
    
    # Prepare model for LoRA
    # If using 8-bit: model = prepare_model_for_kbit_training(model)
    model = get_peft_model(model, lora_config)
    
    # Print trainable parameters
    model.print_trainable_parameters()
    
    print("✓ LoRA configured successfully!")
    print("\nLoRA adds trainable parameters to specific layers")
    print("This allows fine-tuning with much less memory")
    
except ImportError:
    print("PEFT not installed. Install with: pip install peft")
    print("\nFor this demo, we'll proceed without LoRA")
    print("In production, always use LoRA for efficient fine-tuning")
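
To show what `r` and `lora_alpha` do inside an adapted layer, here is a minimal pure-Python sketch of a LoRA forward pass, `h = W x + (alpha / r) * B (A x)`. It follows the initialization described in the LoRA paper (random `A`, zero `B`), so the adapted layer starts out identical to the base layer. The tiny dimensions and the helper names `matvec` and `lora_forward` are assumptions for illustration, not PEFT internals.

```python
import random


def matvec(M, x):
    """Multiply matrix M (a list of rows) by vector x."""
    return [sum(m * v for m, v in zip(row, x)) for row in M]


def lora_forward(W, A, B, x, alpha, r):
    """Frozen base output plus the scaled low-rank update."""
    base = matvec(W, x)                # frozen pre-trained path
    update = matvec(B, matvec(A, x))   # project down to rank r, then back up
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, update)]


rng = random.Random(0)
d_out, d_in, r, alpha = 4, 6, 2, 4     # toy sizes (assumptions)
W = [[rng.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]
A = [[rng.gauss(0, 0.01) for _ in range(d_in)] for _ in range(r)]
B = [[0.0] * r for _ in range(d_out)]  # zero init: no change at step 0
x = [rng.gauss(0, 1) for _ in range(d_in)]

print("Matches base layer at init:", lora_forward(W, A, B, x, alpha, r) == matvec(W, x))
```

Only `A` and `B` would receive gradients during training; `W` stays untouched, which is what makes it possible to swap adapters per task.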


## Step 5: Training Setup

Configure training arguments and start fine-tuning.


In [None]:
# Training arguments
training_args = TrainingArguments(
    output_dir="./llm_finetuned",
    overwrite_output_dir=True,
    num_train_epochs=3,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    warmup_steps=10,
    logging_steps=5,
    save_steps=50,
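    # Evaluation is off by default; the argument is `eval_strategy` in recent transformers releases and `evaluation_strategy` in older ones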
    save_total_limit=2,
    prediction_loss_only=True,
    remove_unused_columns=False,
    fp16=torch.cuda.is_available(),  # Use mixed precision if GPU available
    report_to="none"  # Disable wandb/tensorboard
)

# Data collator
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=False  # Causal LM, not masked LM
)

# Initialize trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset,
    data_collator=data_collator,
)

print("✓ Trainer configured")
print(f"  - Training epochs: {training_args.num_train_epochs}")
print(f"  - Batch size: {training_args.per_device_train_batch_size}")
print(f"  - Output directory: {training_args.output_dir}")

# Note: For this demo with very small dataset, training might overfit quickly
# In practice, use larger datasets and proper train/val splits
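
The batch size and accumulation settings interact, so it helps to work out what they imply before training. The sketch below estimates the effective batch size and the approximate number of optimizer updates for the values used above (5 examples, batch size 2, 4 accumulation steps, 3 epochs). The helper names are hypothetical, and the rounding is an approximation; the `Trainer`'s exact step count may differ slightly between versions.

```python
import math


def effective_batch_size(per_device_batch, grad_accum_steps, num_devices=1):
    """Number of examples that contribute to each optimizer update."""
    return per_device_batch * grad_accum_steps * num_devices


def approx_update_steps(num_examples, per_device_batch, grad_accum_steps,
                        epochs, num_devices=1):
    """Rough count of optimizer updates over the whole run."""
    batches_per_epoch = math.ceil(num_examples / (per_device_batch * num_devices))
    updates_per_epoch = max(1, math.ceil(batches_per_epoch / grad_accum_steps))
    return updates_per_epoch * epochs


print("Effective batch size:", effective_batch_size(2, 4))
print("Approximate optimizer updates:", approx_update_steps(5, 2, 4, 3))
```

With so few updates, `warmup_steps=10` would never finish warming up and `save_steps=50` would never trigger, so both are worth lowering for a dataset this small.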


## Step 6: Fine-tune the Model

Train the model on your dataset.


In [None]:
print("Starting training...")
print("Note: This is a demonstration. For real training:")
print("  - Use larger datasets")
print("  - Monitor validation loss")
print("  - Use proper train/val/test splits")
print("  - Adjust hyperparameters")

# Uncomment to actually train:
# trainer.train()

# Save the model
# trainer.save_model()
# tokenizer.save_pretrained("./llm_finetuned")

print("\n✓ Training setup complete!")
print("Uncomment trainer.train() to start actual training")
print("\nAfter training, you can:")
print("1. Load the fine-tuned model")
print("2. Test on new examples")
print("3. Compare with base model")


## Step 7: Inference with Fine-tuned Model

Test the fine-tuned model on new examples.


In [None]:
def generate_response(model, tokenizer, instruction, max_length=200):
    """Generate response using the fine-tuned model"""
    prompt = f"### Instruction:\n{instruction}\n\n### Response:\n"
    
    inputs = tokenizer(prompt, return_tensors="pt")
    # Move inputs to the model's device (works on CPU and GPU alike)
    inputs = {k: v.to(model.device) for k, v in inputs.items()}
    
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_length=max_length,
            num_return_sequences=1,
            temperature=0.7,
            do_sample=True,
            pad_token_id=tokenizer.eos_token_id
        )
    
    response = tokenizer.decode(outputs[0], skip_special_tokens=True)
    # Extract just the response part
    if "### Response:" in response:
        response = response.split("### Response:")[-1].strip()
    
    return response

# Test examples
test_instructions = [
    "What are the symptoms of diabetes?",
    "How is hypertension treated?",
    "Explain the function of the heart."
]

print("Testing fine-tuned model (if trained):")
print("="*70)

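# Note: This will use the base model if not fine-tuned
# In practice, load your fine-tuned model. With LoRA, the output directory holds
# adapter weights, which you attach to the base model:
# from peft import PeftModel
# base_model = AutoModelForCausalLM.from_pretrained(model_name)
# model = PeftModel.from_pretrained(base_model, "./llm_finetuned")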

for instruction in test_instructions:
    print(f"\n📌 Instruction: {instruction}")
    try:
        response = generate_response(model, tokenizer, instruction)
        print(f"   Response: {response[:200]}...")
    except Exception as e:
        print(f"   Error: {e}")
        print("   (Model may need to be fine-tuned first)")
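
The `temperature=0.7` setting in `generate_response` controls how random sampling is: logits are divided by the temperature before the softmax, so values below 1 sharpen the distribution and values above 1 flatten it. The sketch below illustrates this on made-up logits; `softmax_with_temperature` is a hypothetical helper, not the library's implementation.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Convert logits to probabilities after dividing by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [value / temperature for value in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(value - peak) for value in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.1]  # made-up scores for three candidate tokens

for temperature in (1.0, 0.7, 0.3):
    probs = softmax_with_temperature(logits, temperature)
    print(f"T={temperature}: {[round(p, 3) for p in probs]}")
```

Lower temperatures concentrate probability on the top token, which tends to give more deterministic, repeatable answers for factual Q&A.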


## Summary

This lab covered:

1. **Dataset Preparation**: Formatting data for instruction tuning
2. **Model Loading**: Loading pre-trained LLMs (Llama/DialoGPT)
3. **LoRA Configuration**: Setting up parameter-efficient fine-tuning
4. **Training Setup**: Configuring training arguments and data collators
5. **Fine-tuning**: Training the model on domain-specific data
6. **Inference**: Using fine-tuned models for generation

### Key Takeaways

- **LoRA is efficient**: Only the small adapter matrices are trained, which sharply reduces the trainable parameter count and training memory
- **Data quality matters**: Well-formatted prompts lead to better results
- **Start small**: Test with small models before scaling up
- **Monitor training**: Watch for overfitting and adjust accordingly

### Next Steps

- Experiment with different LoRA configurations (r, alpha values)
- Try different prompt formats
- Fine-tune on your own domain-specific datasets
- Explore other PEFT methods (AdaLoRA, QLoRA)
