<a href="https://colab.research.google.com/github/Lcocks/DS6050-DeepLearning/blob/main/DPO%2BQLoRA.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
# ====================================================================================
# LLM Alignment in Practice: Direct Preference Optimization (DPO)
#
# This notebook demonstrates how to align a language model using DPO with QLoRA.
# We connect concepts from our PEFT and Alignment lectures.
#
# Learning Goals:
# 1. Understand preference data format
# 2. Apply QLoRA for memory-efficient alignment
# 3. Use Hugging Face TRL (Transformer Reinforcement Learning) library
# 4. Compare model behavior before and after alignment
# ====================================================================================

# -----------------------------------------------------------------------------
# CELL 1: Installation
# -----------------------------------------------------------------------------
"""
Install required packages. This may take a few minutes.
Run this cell first, then restart the runtime if needed.
"""

!pip install -q torch transformers datasets trl peft accelerate bitsandbytes



In [16]:

# -----------------------------------------------------------------------------
# CELL 2: Imports and Setup
# -----------------------------------------------------------------------------
"""
Import necessary libraries from the Hugging Face ecosystem.
"""

import torch
from datasets import Dataset
from transformers import (
    AutoTokenizer,
    AutoModelForCausalLM,
    BitsAndBytesConfig
)
from peft import LoraConfig, PeftModel
from trl import DPOTrainer, DPOConfig
import gc
import warnings
warnings.filterwarnings('ignore')

print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")

# -----------------------------------------------------------------------------
# CELL 3: Understanding Preference Data
# -----------------------------------------------------------------------------
"""
From the lecture: Stage 2 of RLHF - Preference Data Collection

In alignment, we don't write "perfect" responses (hard!).
Instead, we collect PREFERENCES: given two responses, which is better?

Required format:
- 'prompt': The user's question/instruction
- 'chosen': The preferred response (y_w - winner)
- 'rejected': The dispreferred response (y_l - loser)

EXAMPLE:
TinyLlama-Chat is ALREADY very enthusiastic with emojis!
So we'll do the OPPOSITE: align it to be more formal and professional.
This creates a clear, observable behavioral shift.

IMPORTANT: We're training the model to prefer formal, professional responses
over casual, emoji-filled ones. This works better pedagogically because
we're changing the model's existing behavior rather than reinforcing it.
"""

preference_data = {
    # Original prompts (kept for reference; superseded by the formal-request prompts below)
    # "prompt": [
    #     "What is the weather forecast for tomorrow?",
    #     "Did you hear the news about the new library opening?",
    #     "Can you explain how photosynthesis works?",
    #     "What are the benefits of reading books?",
    #     "Tell me about the concert last night.",
    #     "How do I bake a chocolate cake?",
    #     "What's the best way to learn programming?",
    #     "Why is exercise important?",
    #     "What happened in the football game?",
    #     "How can I improve my memory?",
    #     "What makes a good movie?",
    #     "Why should I drink more water?",
    #     "Tell me about classical music.",
    #     "How do plants grow?",
    #     "What's good about traveling?",
    #     "Why is sleep important?",
    #     "What are hobbies good for?",
    #     "How does recycling help?",
    #     "What makes a friendship strong?",
    #     "Why learn history?",
    # ],

    # Current prompts: each one explicitly asks for a formal response.
    "prompt": [
        "What is the weather forecast for tomorrow? Respond in a formal manner.",
        "Did you hear the news about the new library opening? Respond in a formal manner.",
        "Can you explain how photosynthesis works? Respond in a formal manner.",
        "What are the benefits of reading books? Respond in a formal manner.",
        "Tell me about the concert last night. Respond in a formal manner.",
        "How do I bake a chocolate cake? Respond in a formal manner.",
        "What's the best way to learn programming? Respond in a formal manner.",
        "Why is exercise important? Respond in a formal manner.",
        "What happened in the football game? Respond in a formal manner.",
        "How can I improve my memory? Respond in a formal manner.",
        "What makes a good movie? Respond in a formal manner.",
        "Why should I drink more water? Respond in a formal manner.",
        "Tell me about classical music. Respond in a formal manner.",
        "How do plants grow? Respond in a formal manner.",
        "What's good about traveling? Respond in a formal manner.",
        "Why is sleep important? Respond in a formal manner.",
        "What are hobbies good for? Respond in a formal manner.",
        "How does recycling help? Respond in a formal manner.",
        "What makes a friendship strong? Respond in a formal manner.",
        "Why learn history? Respond in a formal manner.",
    ],
    "chosen": [
        # Formal, professional responses (what we NOW WANT)
        "The forecast indicates clear skies with temperatures reaching 75°F tomorrow. No precipitation is expected.",
        "I am aware of the announcement regarding the new library opening. It will serve as a valuable community resource.",
        "Photosynthesis is the biochemical process by which plants convert light energy into chemical energy, producing glucose and oxygen.",
        "Reading provides numerous cognitive benefits including vocabulary expansion, improved concentration, and knowledge acquisition.",
        "The concert featured multiple musical performances with professional execution across various genres.",
        "To prepare chocolate cake, combine flour, cocoa powder, eggs, sugar, and butter. Bake at 350°F for 30-35 minutes.",
        "Programming proficiency can be developed through systematic study of computer science fundamentals and regular practice.",
        "Regular physical exercise contributes to cardiovascular health, muscular strength, and overall physiological well-being.",
        "The football match concluded with the home team securing a 2-1 victory in the final minutes of play.",
        "Memory retention can be enhanced through spaced repetition techniques, adequate sleep, and consistent cognitive engagement.",
        "High-quality cinema demonstrates strong narrative structure, well-developed characters, and thoughtful thematic exploration.",
        "Adequate hydration supports metabolic processes, cognitive function, and maintains optimal body temperature regulation.",
        "Classical music represents a Western art music tradition characterized by complex harmonic structures and formal composition.",
        "Plant development requires photosynthesis, water absorption through roots, and nutrient uptake from surrounding soil.",
        "International travel facilitates cultural exchange, broadens perspectives, and provides educational opportunities.",
        "Sleep is essential for memory consolidation, cellular repair, and maintaining cognitive performance.",
        "Recreational activities provide stress reduction, skill development, and contribute to overall life satisfaction.",
        "Recycling programs reduce landfill waste, conserve natural resources, and minimize environmental degradation.",
        "Strong interpersonal relationships are built on mutual trust, consistent support, and shared experiences.",
        "Historical study provides contextual understanding of contemporary society and insight into human development.",
    ],
    "rejected": [
        # Casual, emoji-filled responses (what we DON'T WANT anymore)
        "It looks like it will be sunny and warm tomorrow! ☀️ Perfect day for a walk! 🚶",
        "Yes, that's amazing news! 📚 A new library is a fantastic addition to the community! 🎉",
        "Absolutely! Photosynthesis is the incredible process plants use to create energy from sunlight! 🌱 It's vital for all life! 🌍",
        "Reading is incredible! 🤩 It expands your vocabulary, improves focus, and opens doors to new worlds! 📖",
        "The concert was absolutely spectacular! 🎸 The energy was electric and everyone was dancing! 💃🕺",
        "Baking chocolate cake is so much fun! 🍰 Just mix flour, cocoa, eggs and sugar, then bake at 350°F! Delicious! 😋",
        "Learning programming is exciting! 💻 Start with Python - it's beginner-friendly and super powerful! 🚀",
        "Exercise is wonderful! 💪 It boosts your mood, energy, and keeps you healthy! 🏃‍♀️",
        "What an exciting game! ⚽ The home team scored in the last minute for a thrilling 2-1 victory! 🎉",
        "Great question! 🧠 Try spaced repetition, get good sleep, and stay active - it really helps! ✨",
        "A good movie has compelling characters, an engaging plot, and leaves you thinking! 🎬 So powerful! ❤️",
        "Water is essential! 💧 It energizes you, helps your brain, and keeps everything running smoothly! 💪",
        "Classical music is beautiful! 🎵 It's calming, complex, and has stood the test of time! 🎻",
        "Plants are amazing! 🌱 They need sunlight, water, and nutrients to grow big and strong! 🌻",
        "Traveling is awesome! ✈️ You experience new cultures, foods, and perspectives! 🌎",
        "Sleep is crucial! 😴 It helps your brain consolidate memories and repairs your body! 💤",
        "Hobbies are fantastic! 🎨 They reduce stress, teach new skills, and bring joy! 😊",
        "Recycling is important! ♻️ It reduces waste, saves resources, and protects our planet! 🌍",
        "True friendships are built on trust, support, and shared experiences! 👯 So valuable! ❤️",
        "History teaches us about our past and helps us understand the present! 📜 Fascinating! 🤔",
    ]
}

# Convert to Hugging Face Dataset
dataset = Dataset.from_dict(preference_data)
print(f"Created preference dataset with {len(dataset)} examples")
print(f"\nExample:")
print(f"Prompt: {dataset[0]['prompt']}")
print(f"Chosen: {dataset[0]['chosen'][:80]}...")
print(f"Rejected: {dataset[0]['rejected'][:80]}...")

# -----------------------------------------------------------------------------
# CELL 4: Model Selection and Chat Template
# -----------------------------------------------------------------------------
"""
LECTURE CONNECTION: Stage 1 of RLHF - Starting from SFT Model

We start with an instruction-tuned model (the output of SFT).
TinyLlama is small (1.1B parameters) - perfect for education and fast iteration.

IMPORTANT: Different models use different chat templates (ChatML, Llama format, etc.)
We must format prompts correctly using the tokenizer's template.
"""

MODEL_NAME = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
tokenizer.pad_token = tokenizer.eos_token  # Required for padding in batches

def format_prompt(example):
    """
    Apply the model's chat template to format prompts correctly.
    This ensures the model understands this is a user instruction.
    """
    prompt_message = [{"role": "user", "content": example["prompt"]}]
    example["prompt"] = tokenizer.apply_chat_template(
        prompt_message,
        tokenize=False,
        add_generation_prompt=True  # Adds the assistant prompt starter
    )
    return example

# Apply formatting to all examples
dataset = dataset.map(format_prompt)
print(f"Formatted prompt example (first 200 chars):")
print(dataset[0]['prompt'][:200])

# -----------------------------------------------------------------------------
# CELL 5: QLoRA Configuration
# -----------------------------------------------------------------------------
"""
LECTURE CONNECTION: PEFT Lecture - QLoRA for Memory Efficiency

Alignment training is memory-intensive. QLoRA enables us to:
1. Quantize the base model to 4-bit (roughly 4x less weight memory than 16-bit)
2. Train small LoRA adapters instead of all parameters

This makes alignment feasible on consumer GPUs!
"""

# Quantization Configuration (4-bit)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",  # NormalFloat 4-bit (recommended)
    bnb_4bit_compute_dtype=torch.bfloat16 if torch.cuda.is_bf16_supported() else torch.float16,
    bnb_4bit_use_double_quant=True,  # Nested quantization for extra memory savings
)

# LoRA Configuration
lora_config = LoraConfig(
    r=16,  # Rank of the low-rank matrices
    lora_alpha=32,  # Scaling factor (typically 2*r)
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    # Target the attention and MLP projection layers in Llama architecture
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
)

print("QLoRA Configuration:")
print(f"  - 4-bit quantization: NF4")
print(f"  - LoRA rank: {lora_config.r}")
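# NOTE: the figure printed below is a rough estimate for a single transformer block
# that treats every target projection as 2048x2048; the total across all blocks is larger.
# For the exact count, call print_trainable_parameters() on the PEFT-wrapped model.
print(f"  - Trainable parameters: ~{lora_config.r * 2 * 7 * 2048 / 1e6:.1f}M (vs ~1100M full model)")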

# -----------------------------------------------------------------------------
# CELL 6: Load Model with Quantization
# -----------------------------------------------------------------------------
"""
Load the base model (Pi_SFT) with 4-bit quantization.
This is our starting policy: already instruction-tuned, but not yet aligned to the formal-style preferences defined above.
"""

print("Loading model with 4-bit quantization...")
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    quantization_config=bnb_config,
    device_map="auto",  # Automatically distribute across available GPUs
    trust_remote_code=True,
)

print(f"Model loaded: {MODEL_NAME}")
print(f"Model dtype: {model.dtype}")
print(f"Memory footprint: ~{model.get_memory_footprint() / 1e9:.2f} GB")

# -----------------------------------------------------------------------------
# CELL 7: Configure DPO Training
# -----------------------------------------------------------------------------
"""
LECTURE CONNECTION: DPO as a Simpler Alternative to RLHF

Key differences from RLHF:
- No explicit reward model training
- No complex PPO optimization
- Direct optimization using preference data
- Treats alignment as a classification problem

Key hyperparameters:
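- beta: Strength of the implicit KL penalty toward the reference model (typically 0.1-0.5; this demo uses a much lower 0.01)
- learning_rate: Start small for alignment (5e-5 is a common choice when training LoRA adapters)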
"""

# DPO-specific configuration
dpo_config = DPOConfig(
    output_dir="./dpo_training_output",
    num_train_epochs=10,  # More epochs for visible change with small dataset
    per_device_train_batch_size=2,  # Reduce if you hit OOM
    gradient_accumulation_steps=4,  # Effective batch size = 2*4 = 8
    learning_rate=5e-5,

    # DPO-specific parameters
    beta=0.01,  # Lower beta = allow more deviation from reference (more visible change!)

    # Performance settings
    bf16=torch.cuda.is_bf16_supported(),
    fp16=not torch.cuda.is_bf16_supported() and torch.cuda.is_available(),
    gradient_checkpointing=True,  # Saves memory at cost of speed

    # Logging
    logging_steps=1,
    logging_first_step=True,
    report_to="none",  # Disable wandb for simplicity

    # Required for TRL
    remove_unused_columns=False,
)

# Set tokenizer properties needed for training
tokenizer.padding_side = "left"  # Important for generation tasks

print("DPO Configuration:")
print(f"  - Beta (KL penalty): {dpo_config.beta} (lower = more change allowed)")
print(f"  - Learning rate: {dpo_config.learning_rate}")
print(f"  - Training epochs: {dpo_config.num_train_epochs}")
print(f"  - Effective batch size: {dpo_config.per_device_train_batch_size * dpo_config.gradient_accumulation_steps}")

# -----------------------------------------------------------------------------
# CELL 8: Initialize DPO Trainer
# -----------------------------------------------------------------------------
"""
The DPOTrainer handles:
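1. Providing the reference model (Pi_ref): with PEFT and ref_model=None, TRL uses the base model with the adapters disabled
2. Computing the DPO loss (raising the likelihood of chosen responses relative to rejected ones)
3. Keeping the policy close to the reference (scaled by beta) to prevent reward hacking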
"""

dpo_trainer = DPOTrainer(
    model=model,
    ref_model=None,  # When using PEFT, TRL handles reference model internally
    args=dpo_config,
    train_dataset=dataset,
    processing_class=tokenizer,  # Updated API: use processing_class instead of tokenizer
    peft_config=lora_config,
)

print("DPO Trainer initialized successfully!")
print(f"Training on {len(dataset)} preference pairs")

# -----------------------------------------------------------------------------
# CELL 9: Train the Model (Alignment Phase!)
# -----------------------------------------------------------------------------
"""
LECTURE CONNECTION: This is where the magic happens!

The model learns to:
- Increase probability of 'chosen' responses
- Decrease probability of 'rejected' responses
- Stay close to the reference model (via KL penalty)

Watch the loss decrease - this means the model is learning preferences!

WHAT TO EXPECT:
- Initial loss: ~0.69 (ln 2: the policy starts identical to the reference, so each pair is a coin flip)
- Good final loss: 0.3-0.5 (successful learning)
- Training time: 10-15 minutes on T4 GPU (Colab free tier)

The loss is negative log-likelihood of correctly classifying preferences.
Lower loss = better at predicting which response humans prefer.
"""

print("Starting DPO training...")
print("This will take 10-15 minutes on Colab T4 GPU.")
print("Watch the loss - it should decrease steadily!\n")

dpo_trainer.train()

print("\n✓ Training complete!")
print("\n📊 Check your final loss:")
print("  - Loss < 0.4: Excellent alignment")
print("  - Loss 0.4-0.5: Good alignment")
print("  - Loss > 0.6: Might need more epochs or lower beta")

# Save the trained LoRA adapters
ADAPTER_PATH = "./dpo_aligned_adapters"
dpo_trainer.save_model(ADAPTER_PATH)
print(f"Adapters saved to: {ADAPTER_PATH}")


PyTorch version: 2.8.0+cu126
CUDA available: True
GPU: NVIDIA A100-SXM4-40GB
Created preference dataset with 20 examples

Example:
Prompt: What is the weather forecast for tomorrow? Respond in a formal manner.
Chosen: The forecast indicates clear skies with temperatures reaching 75°F tomorrow. No ...
Rejected: It looks like it will be sunny and warm tomorrow! ☀️ Perfect day for a walk! 🚶...


Map:   0%|          | 0/20 [00:00<?, ? examples/s]

Formatted prompt example (first 200 chars):
<|user|>
What is the weather forecast for tomorrow? Respond in a formal manner.</s>
<|assistant|>

QLoRA Configuration:
  - 4-bit quantization: NF4
  - LoRA rank: 16
  - Trainable parameters: ~0.5M (vs ~1100M full model)
Loading model with 4-bit quantization...
Model loaded: TinyLlama/TinyLlama-1.1B-Chat-v1.0
Model dtype: torch.float16
Memory footprint: ~0.75 GB
DPO Configuration:
  - Beta (KL penalty): 0.01 (lower = more change allowed)
  - Learning rate: 5e-05
  - Training epochs: 10
  - Effective batch size: 8


Extracting prompt in train dataset:   0%|          | 0/20 [00:00<?, ? examples/s]

Applying chat template to train dataset:   0%|          | 0/20 [00:00<?, ? examples/s]

Tokenizing train dataset:   0%|          | 0/20 [00:00<?, ? examples/s]

The tokenizer has new PAD/BOS/EOS tokens that differ from the model config and generation config. The model config and generation config were aligned accordingly, being updated with the tokenizer's values. Updated tokens: {'pad_token_id': 2}.


DPO Trainer initialized successfully!
Training on 20 preference pairs
Starting DPO training...
This will take 10-15 minutes on Colab T4 GPU.
Watch the loss - it should decrease steadily!



Step,Training Loss
1,0.6931
2,0.677
3,0.6575
4,0.6191
5,0.5779
6,0.5351
7,0.5048
8,0.4677
9,0.4478
10,0.3843



✓ Training complete!

📊 Check your final loss:
  - Loss < 0.4: Excellent alignment
  - Loss 0.4-0.5: Good alignment
  - Loss > 0.6: Might need more epochs or lower beta
Adapters saved to: ./dpo_aligned_adapters
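
The first logged loss of 0.6931 is ln 2: before any update, the policy and the reference assign identical log-probabilities, so the DPO logit is zero. The sketch below computes the per-pair sigmoid DPO loss from made-up sequence log-probabilities. The `dpo_loss` helper and all numbers are illustrative assumptions, not TRL's internal implementation.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.01):
    """Per-pair sigmoid DPO loss: -log sigmoid(beta * (chosen log-ratio - rejected log-ratio))."""
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    logit = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(x)) is the same as log(1 + exp(-x))
    return math.log1p(math.exp(-logit))

# Step 0: policy == reference, so both log-ratios are 0 and the loss is ln 2.
initial = dpo_loss(-40.0, -35.0, -40.0, -35.0)
print(f"initial loss: {initial:.4f}")

# Hypothetical later state: chosen is 30 nats more likely than under the
# reference, rejected is 30 nats less likely.
later = dpo_loss(-10.0, -65.0, -40.0, -35.0, beta=0.01)
print(f"later loss:   {later:.4f}")
```

Because beta multiplies the log-ratio margin, a small beta such as 0.01 means the policy has to move a long way from the reference before the loss drops noticeably, which is consistent with the "lower beta = more change allowed" comment in the configuration.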


In [12]:

# -----------------------------------------------------------------------------
# CELL 10: Clean Up Memory
# -----------------------------------------------------------------------------
"""
Free up memory before loading models for inference.
"""

del model, dpo_trainer
gc.collect()
torch.cuda.empty_cache()
print("Memory cleaned up for inference")

# -----------------------------------------------------------------------------
# CELL 11: Define Generation Helper
# -----------------------------------------------------------------------------
"""
Helper function to generate responses from a model.
Sampling is enabled (do_sample=True), so outputs vary between runs;
set do_sample=False for greedy, deterministic comparison.
"""

def generate_response(model, tokenizer, user_prompt, max_new_tokens=80):
    """
    Generate a response from the model given a user prompt.

    Args:
        model: The language model
        tokenizer: The tokenizer
        user_prompt: Raw user input (string)
        max_new_tokens: Maximum length of generated response

    Returns:
        Generated response (string)
    """
    model.eval()

    # Format as chat message
    messages = [{"role": "user", "content": user_prompt}]

    # Apply chat template and tokenize
    inputs = tokenizer.apply_chat_template(
        messages,
        return_tensors="pt",
        add_generation_prompt=True
    ).to(model.device)

    # Generate
    with torch.no_grad():
        outputs = model.generate(
            inputs,
            max_new_tokens=max_new_tokens,
            do_sample=True,  # Sampling: outputs vary per run; set False for greedy, reproducible decoding
            pad_token_id=tokenizer.eos_token_id,
        )

    # Decode only the generated part (skip the input prompt)
    response = tokenizer.decode(outputs[0][inputs.shape[1]:], skip_special_tokens=True)
    return response.strip()

# -----------------------------------------------------------------------------
# CELL 12: Load BEFORE Model (Original SFT)
# -----------------------------------------------------------------------------
"""
Load the original SFT model (before alignment) to compare.
This is Pi_ref from the lecture.
"""

print("Loading original SFT model (BEFORE alignment)...")

compute_dtype = torch.bfloat16 if torch.cuda.is_bf16_supported() else torch.float16

model_before = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    torch_dtype=compute_dtype,
    device_map="auto",
    trust_remote_code=True,
)

print("✓ BEFORE model loaded")

# -----------------------------------------------------------------------------
# CELL 13: Load AFTER Model (DPO-Aligned)
# -----------------------------------------------------------------------------
"""
Load a SEPARATE instance of the base model, then apply DPO adapters.
This is Pi_theta after DPO optimization.

CRITICAL: We must load a fresh model instance! Otherwise, applying adapters
modifies the first model in-place, making both identical.
"""

print("Loading DPO-aligned model (AFTER alignment)...")

# Load a SECOND, separate instance of the base model
model_after_base = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    torch_dtype=compute_dtype,
    device_map="auto",
    trust_remote_code=True,
)

# Now apply the trained adapters to this separate instance
model_after = PeftModel.from_pretrained(model_after_base, ADAPTER_PATH)

print("✓ AFTER model loaded with DPO adapters")
print(f"✓ Verified: Two separate model instances exist")

# -----------------------------------------------------------------------------
# CELL 14: Compare Before vs After
# -----------------------------------------------------------------------------
"""
DEMONSTRATION: The moment of truth!

We test on a NEW prompt (unseen during training) to see if the model
generalized the preference pattern.

Expected: The aligned model should be more formal and professional.
"""

# Test prompts - similar to training but NOT in the dataset
TEST_PROMPTS = [
    "A new coffee shop opened downtown!",
    "What do you think about learning a new language?",
    "The park has beautiful flowers this season.",
]

print("\n" + "="*70)
print("BEFORE vs AFTER COMPARISON")
print("="*70 + "\n")

for prompt in TEST_PROMPTS:
    print(f"📝 Prompt: {prompt}\n")

    # Generate from BEFORE model (original SFT)
    response_before = generate_response(model_before, tokenizer, prompt, max_new_tokens=100)
    print(f"BEFORE (SFT Model):")
    print(f"  → {response_before}\n")

    # Generate from AFTER model (with DPO adapters)
    response_after = generate_response(model_after, tokenizer, prompt, max_new_tokens=100)
    print(f"AFTER (DPO-Aligned Model):")
    print(f"  → {response_after}\n")

    print("-"*70 + "\n")

print("✓ Inference complete!")


Memory cleaned up for inference
Loading original SFT model (BEFORE alignment)...
✓ BEFORE model loaded
Loading DPO-aligned model (AFTER alignment)...
✓ AFTER model loaded with DPO adapters
✓ Verified: Two separate model instances exist

BEFORE vs AFTER COMPARISON

📝 Prompt: A new coffee shop opened downtown!

BEFORE (SFT Model):
  → Title: A New Coffee Shop: Discovering the Magic of the Everyday

Introduction:
Welcome to A New Coffee Shop - a thrilling adventure where you'll discover the magic of the everyday through the captivating stories of its cozy, inviting coffee shop. Located in the heart of downtown, this new spot is known for its delicious blends and creative coffee brews, making it the perfect spot

AFTER (DPO-Aligned Model):
  → The following are some elements that could be included in a press release announcing the opening of a new coffee shop in downtown:

1. Title: "Downtown Coffee Shop Opens with Bold Statement on New Location"
2. Summary: The press release 
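
A single sampled pair is hard to judge by eye. One rough way to quantify the tone shift is to count surface markers such as exclamation marks and emoji, and to measure average sentence length. The `style_metrics` helper and its emoji ranges below are illustrative assumptions (the ranges cover common pictograph blocks only), not part of the notebook's pipeline.

```python
import re

# Approximate emoji coverage: main pictograph blocks plus misc symbols and dingbats.
EMOJI_PATTERN = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")

def style_metrics(text):
    """Return simple surface statistics that tend to separate casual from formal text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = text.split()
    return {
        "exclamations": text.count("!"),
        "emojis": len(EMOJI_PATTERN.findall(text)),
        "avg_sentence_words": len(words) / max(len(sentences), 1),
    }

# Two short samples in the style of the training pairs.
casual = "Exercise is wonderful! It boosts your mood, energy, and keeps you healthy! \U0001F4AA"
formal = "Regular physical exercise contributes to cardiovascular health and muscular strength."

print(style_metrics(casual))
print(style_metrics(formal))
```

Applying the same function to `response_before` and `response_after` over several prompts, ideally with greedy decoding, would give a more stable comparison than reading one sampled output.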

In [13]:

# -----------------------------------------------------------------------------
# CELL 15: Analyzing the Changes - What Makes DPO Work?
# -----------------------------------------------------------------------------
"""
Let's explicitly analyze what changed to understand DPO's effect.
This helps you verify the alignment worked and understand the patterns.
"""

print("\n" + "="*70)
print("ANALYZING ALIGNMENT SUCCESS")
print("="*70 + "\n")

print("🔍 KEY CHANGES TO OBSERVE:\n")

print("1️⃣ TONE SHIFT:")
print("   BEFORE: Casual, conversational, warm")
print("   AFTER:  Formal, academic, professional")
print("   Example BEFORE: 'cozy and inviting space'")
print("   Example AFTER:  'measures that can be taken to promote'\n")

print("2️⃣ STRUCTURE:")
print("   BEFORE: Narrative storytelling, personal address")
print("   AFTER:  Structured analysis, numbered lists, systematic")
print("   Example BEFORE: 'Here are some tips to help you get started'")
print("   Example AFTER:  'According to research...cognitive development'\n")

print("3️⃣ VOCABULARY:")
print("   BEFORE: Simple, accessible words")
print("   AFTER:  Technical, academic terminology")
print("   Example BEFORE: 'beautiful sight to behold'")
print("   Example AFTER:  'cultural and historical significance'\n")

print("4️⃣ ENGAGEMENT STYLE:")
print("   BEFORE: Agreeable, enthusiastic ('Yes!', 'rewarding experience')")
print("   AFTER:  Detached, analytical ('The text material does not...')\n")

print("5️⃣ EXTREME BEHAVIORS (Sign of Strong Alignment):")
print("   - Model becoming overly pedantic (analyzing 'text material' that doesn't exist)")
print("   - This shows DPO learned the formal pattern STRONGLY")
print("   - Real-world: would need human feedback to balance this\n")

print("-"*70 + "\n")

print("✅ WHY THIS DEMONSTRATES SUCCESSFUL DPO:\n")

print("• The model's NATURAL style (from SFT) was casual and friendly")
print("• We trained it with preferences for FORMAL, professional responses")
print("• The aligned model adopted the new style across UNSEEN prompts")
print("• Key insight: DPO shifted behavior AWAY from strong existing priors\n")

print("📊 QUANTITATIVE INDICATORS (if you want to measure):\n")
print("• Count exclamation marks (should decrease)")
print("• Check for emojis (should disappear)")
print("• Measure sentence complexity (should increase)")
print("• Look for technical terms vs simple words")
print("• Analyze sentence length (formal = longer, more complex)\n")

print("-"*70 + "\n")

print("🎯 CONNECTING TO LECTURE CONCEPTS:\n")

print("1. KL DIVERGENCE CONSTRAINT:")
print("   - Beta=0.01 allowed significant deviation from reference")
print("   - Higher beta (0.1) would keep responses closer to original style")
print("   - You can see the effect: substantial but not incoherent changes\n")

print("2. PREFERENCE CLASSIFICATION:")
print("   - DPO learned: 'formal language' > 'casual language'")
print("   - Applied this preference to new, unseen prompts")
print("   - Shows generalization of learned preference pattern\n")

print("3. OFF-POLICY LEARNING:")
print("   - Reference model (Pi_ref) = original TinyLlama-Chat")
print("   - We optimized away from its natural distribution")
print("   - The log probability ratio captured how much we deviated\n")

print("4. BRADLEY-TERRY MODEL:")
print("   - Training implicitly modeled: P(formal > casual | prompt)")
print("   - Loss decreased = model better at predicting our preferences")
print("   - Now generates text matching those preferences\n")

print("="*70)
print("🎓 SUCCESS: You've aligned an LLM using DPO!")
print("="*70)

# -----------------------------------------------------------------------------
# CELL 16: Key Takeaways (Run this to see summary)
# -----------------------------------------------------------------------------
# The summary is stored in a variable: in a notebook, __doc__ does not refer to
# a string literal placed in the middle of a cell, so print(__doc__) would not show it.
KEY_TAKEAWAYS = """
🎓 WHAT WE LEARNED:

1. PREFERENCE DATA FORMAT
   - prompt, chosen, rejected triplets
   - Easier than writing perfect demonstrations
   - Need sufficient data (20 pairs produced a visible shift in this demo; production datasets are far larger)

2. DPO SIMPLICITY
   - No explicit reward model
   - No complex PPO loop
   - Direct optimization on preferences
   - Treats alignment as binary classification

3. QLORA EFFICIENCY
   - 4-bit quantization reduces memory
   - LoRA adapters are small and mergeable
   - Makes alignment accessible on consumer hardware

4. HUGGING FACE ECOSYSTEM
   - transformers: Model loading
   - datasets: Data handling
   - peft: LoRA adapters
   - trl: DPO training (connects everything!)

5. BEHAVIOR SHIFT INSIGHTS
   - We trained AGAINST the model's natural style (formality vs enthusiasm)
   - This is harder but more educational - shows DPO can override strong priors
   - Chat models have strong alignment - we're "un-aligning" them
   - Real-world: usually align in same direction as model's tendencies

6. LEARNING LESSON
   - Original attempt: tried to make model MORE enthusiastic (it already was!)
   - Revised approach: make model MORE formal (fights natural style)
   - This demonstrates DPO can shift behavior in ANY direction
   - The key: clear contrast between chosen/rejected examples

⚠️ CRITICAL BUG WE FIXED:
   - PeftModel.from_pretrained(model, adapters) modifies `model` IN-PLACE!
   - Must load TWO separate model instances for proper comparison
   - Otherwise both "before" and "after" point to the same aligned model
   - This is a common gotcha when comparing base vs adapted models!

🔗 CONNECTIONS TO LECTURE:
- This implements the DPO loss function we derived
- Beta parameter controls the KL constraint (balance reward vs coherence)
- Reference model (Pi_ref) vs active policy (Pi_theta)
- Off-policy learning: reference model defines the "behavior policy"
- The log probability ratio is the implicit reward function
- Lower beta allows larger deviation = stronger style shift

⚙️ HYPERPARAMETER GUIDANCE:
- Beta = 0.1: Conservative (default), stays close to reference
- Beta = 0.01: Aggressive, allows significant style change (we used this)
- Beta = 0.001: Very aggressive, risk of reward hacking/incoherence
- Rule of thumb: Start with 0.1, decrease if you need more visible changes
- When fighting model's priors: need lower beta + more epochs

📈 MONITORING TRAINING:
- Watch the loss curve - should decrease steadily
- If loss plateaus immediately: learning rate too low or beta too high
- If loss explodes: learning rate too high or beta too low
- Typical final loss: 0.3-0.6 for good alignment
- When "un-aligning": may need more epochs to see visible changes

🚀 NEXT STEPS:
- Try different tasks: safety (reduce harmful outputs), code style, technical depth
- Experiment with beta values to see KL constraint's effect
- Compare to other methods (KTO with unary feedback, RLHF with reward model)
- Apply to larger Llama-style models (7B, 13B): the same code should work given enough GPU memory
- Use base models (not chat) for alignment in natural direction
- Collect real human preferences (check out Argilla for data labeling)

🎯 REAL-WORLD APPLICATIONS:
- Instruction following style (formal vs casual) ← we did this!
- Safety alignment (reduce harmful outputs)
- Domain adaptation (medical, legal, technical tone)
- Multi-lingual preference alignment
- Code style preferences (verbose vs concise, comments vs no comments)
- Reducing verbosity or hallucination tendencies
"""

print(KEY_TAKEAWAYS)
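
To put the "LoRA adapters are small" takeaway in numbers: each adapted linear layer with shape `in_features x out_features` gains `r * (in_features + out_features)` trainable weights. The sketch below applies that to the seven target projections used in this notebook. The layer dimensions (hidden size 2048, MLP size 5632, 22 blocks, 256-wide key/value projections from grouped-query attention) are assumptions based on the published TinyLlama config and should be checked against `model.config`.

```python
# Assumed TinyLlama-1.1B dimensions (verify against model.config before relying on them).
HIDDEN, INTERMEDIATE, KV_DIM, NUM_LAYERS = 2048, 5632, 256, 22
RANK = 16  # matches r=16 in the LoraConfig above

# (in_features, out_features) for each LoRA target module in one transformer block.
target_shapes = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}

def lora_params(in_features, out_features, rank):
    # LoRA adds two matrices: A (rank x in_features) and B (out_features x rank).
    return rank * (in_features + out_features)

per_block = sum(lora_params(i, o, RANK) for i, o in target_shapes.values())
total = per_block * NUM_LAYERS
print(f"Per block: {per_block:,}  Total: {total:,} (~{total / 1e6:.1f}M)")
```

Under these assumptions the adapters hold roughly 12.6M weights, on the order of 1% of the base model. For the exact figure, PEFT models expose `print_trainable_parameters()`.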


ANALYZING ALIGNMENT SUCCESS

🔍 KEY CHANGES TO OBSERVE:

1️⃣ TONE SHIFT:
   BEFORE: Casual, conversational, warm
   AFTER:  Formal, academic, professional
   Example BEFORE: 'cozy and inviting space'
   Example AFTER:  'measures that can be taken to promote'

2️⃣ STRUCTURE:
   BEFORE: Narrative storytelling, personal address
   AFTER:  Structured analysis, numbered lists, systematic
   Example BEFORE: 'Here are some tips to help you get started'
   Example AFTER:  'According to research...cognitive development'

3️⃣ VOCABULARY:
   BEFORE: Simple, accessible words
   AFTER:  Technical, academic terminology
   Example BEFORE: 'beautiful sight to behold'
   Example AFTER:  'cultural and historical significance'

4️⃣ ENGAGEMENT STYLE:
   BEFORE: Agreeable, enthusiastic ('Yes!', 'rewarding experience')
   AFTER:  Detached, analytical ('The text material does not...')

5️⃣ EXTREME BEHAVIORS (Sign of Strong Alignment):
   - Model becoming overly pedantic (analyzing 'tex