# Lab 3.1.2 Solutions: 8B Model LoRA Fine-Tuning

**Module:** 3.1 - Large Language Model Fine-Tuning  
**Difficulty:** ⭐⭐⭐☆☆ (Intermediate)  
**Exercises:** 4 (Custom LoRA Config, Hyperparameter Tuning, Custom Training Loop, Evaluation Metrics)

This notebook contains solutions for all exercises in the 8B Model LoRA Fine-Tuning notebook.

---

## Exercise 1 Solution: Custom LoRA Configuration

**Task:** Create a LoRA configuration optimized for code generation tasks.

In [None]:
from peft import LoraConfig, TaskType

# Code generation benefits from higher rank (more capacity for patterns)
# and targeting more modules for comprehensive adaptation

code_lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=32,  # Higher rank for complex code patterns
    lora_alpha=64,  # alpha = 2 * r is common
    lora_dropout=0.1,  # Slight dropout for regularization
    target_modules=[
        "q_proj",
        "k_proj",  # Include K for better attention
        "v_proj",
        "o_proj",
        "gate_proj",  # MLP modules for code logic
        "up_proj",
        "down_proj",
    ],
    bias="none",
    modules_to_save=["lm_head"],  # Important for specialized vocabulary
)

print("Code Generation LoRA Configuration:")
print(f"  Rank: {code_lora_config.r}")
print(f"  Alpha: {code_lora_config.lora_alpha}")
print(f"  Scaling: {code_lora_config.lora_alpha / code_lora_config.r}")
print(f"  Target modules: {code_lora_config.target_modules}")
print(f"  Dropout: {code_lora_config.lora_dropout}")

In [None]:
# Explanation of choices:

explanation = """
Why these settings for code generation?

1. HIGHER RANK (r=32):
   - Code has strict syntax rules and patterns
   - Higher rank captures more complex relationships
   - Trade-off: More parameters, but code tasks benefit

2. MORE TARGET MODULES:
   - gate_proj, up_proj, down_proj: MLP layers handle logic/reasoning
   - k_proj: Better key matching for code patterns
   - Code generation needs both attention AND reasoning adaptation

3. MODULES_TO_SAVE=["lm_head"]:
   - Fully fine-tunes the output layer
   - Important when dealing with specialized tokens (code syntax)
   - Helps model learn language-specific outputs better

4. DROPOUT=0.1:
   - Code datasets often have patterns that can be memorized
   - Slight dropout prevents overfitting to specific code snippets

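Parameter count estimate for 8B model (rough, assumes hidden size 4096):
- 7 target modules per layer × 32 layers = 224 adapted modules
- Each module adds rank × (d_in + d_out) params; for a square
  4096 × 4096 projection: 32 × (4096 + 4096) = 262K params
- Total LoRA params: ~60M under that assumption, on the order of 1% of
  the 8B base (the exact count depends on the real projection shapes)
- modules_to_save=["lm_head"] also trains a full copy of the output
  layer (vocab_size × hidden_size), which is NOT included in this count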
"""
print(explanation)
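
The estimate above treats every adapted projection as a square 4096 × 4096 matrix. As a rough cross-check, the sketch below counts LoRA parameters per module as r × (d_in + d_out). The layer shapes are assumptions modeled on a Llama-3-style 8B architecture (hidden size 4096, MLP width 14336, grouped-query K/V width 1024, 32 layers, vocabulary 128256) and are not taken from this lab; substitute the values from your model's config.

```python
# Assumed shapes (d_in, d_out) for one decoder layer of a Llama-3-style 8B model.
# These are illustrative; read the real values from your model's config.
HIDDEN, MLP, KV, LAYERS, VOCAB = 4096, 14336, 1024, 32, 128256

MODULE_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV),
    "v_proj": (HIDDEN, KV),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, MLP),
    "up_proj": (HIDDEN, MLP),
    "down_proj": (MLP, HIDDEN),
}


def lora_params(rank: int, shapes: dict, num_layers: int) -> int:
    """LoRA adds A (rank x d_in) and B (d_out x rank) per adapted module."""
    per_layer = sum(rank * (d_in + d_out) for d_in, d_out in shapes.values())
    return per_layer * num_layers


adapter_params = lora_params(rank=32, shapes=MODULE_SHAPES, num_layers=LAYERS)
lm_head_params = VOCAB * HIDDEN  # full copy trained via modules_to_save

print(f"LoRA adapter params: {adapter_params / 1e6:.1f}M")
print(f"lm_head params:      {lm_head_params / 1e6:.1f}M")
```

Under these assumed shapes the adapters come to roughly 84M parameters, while the saved `lm_head` adds about 525M trainable parameters on top, so `modules_to_save` can dominate the trainable count.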

---

## Exercise 2 Solution: Training Hyperparameter Tuning

**Task:** Implement a hyperparameter search comparing different learning rates and batch sizes.

In [None]:
import torch
from dataclasses import dataclass
from typing import List, Dict, Any
import json

@dataclass
class HyperparameterConfig:
    """Configuration for a single hyperparameter experiment."""
    learning_rate: float
    batch_size: int
    gradient_accumulation_steps: int
    warmup_ratio: float
    weight_decay: float
    
    @property
    def effective_batch_size(self) -> int:
        return self.batch_size * self.gradient_accumulation_steps
    
    def to_dict(self) -> Dict[str, Any]:
        return {
            "learning_rate": self.learning_rate,
            "batch_size": self.batch_size,
            "gradient_accumulation_steps": self.gradient_accumulation_steps,
            "effective_batch_size": self.effective_batch_size,
            "warmup_ratio": self.warmup_ratio,
            "weight_decay": self.weight_decay,
        }


def generate_search_space() -> List[HyperparameterConfig]:
    """
    Generate hyperparameter configurations for grid search.
    
    Returns:
        List of configurations to try
    """
    configs = []
    
    # Learning rates to try (log scale)
    learning_rates = [1e-5, 2e-5, 5e-5, 1e-4, 2e-4]
    
    # Batch size configurations (batch_size, grad_accum)
    # All result in different effective batch sizes
    batch_configs = [
        (1, 8),   # Effective: 8
        (2, 8),   # Effective: 16
        (4, 8),   # Effective: 32
        (4, 16),  # Effective: 64
    ]
    
    for lr in learning_rates:
        for batch_size, grad_accum in batch_configs:
            configs.append(HyperparameterConfig(
                learning_rate=lr,
                batch_size=batch_size,
                gradient_accumulation_steps=grad_accum,
                warmup_ratio=0.1,
                weight_decay=0.01,
            ))
    
    return configs


# Generate and display search space
search_space = generate_search_space()
print(f"Total configurations: {len(search_space)}")
print("\nSample configurations:")
for i, config in enumerate(search_space[:5]):
    print(f"  {i+1}. LR={config.learning_rate:.0e}, "
          f"Batch={config.batch_size}×{config.gradient_accumulation_steps}="
          f"{config.effective_batch_size}")

In [None]:
from transformers import TrainingArguments
import os

def create_training_args(
    config: HyperparameterConfig,
    output_dir: str,
    num_train_epochs: int = 3,
    max_steps: int = -1,
) -> TrainingArguments:
    """
    Create TrainingArguments from a hyperparameter config.
    
    Args:
        config: Hyperparameter configuration
        output_dir: Directory for outputs
        num_train_epochs: Number of training epochs
        max_steps: Max steps (-1 for epoch-based)
    
    Returns:
        TrainingArguments object
    """
    run_name = (f"lr{config.learning_rate:.0e}_"
                f"bs{config.effective_batch_size}")
    
    return TrainingArguments(
        output_dir=os.path.join(output_dir, run_name),
        
        # From config
        learning_rate=config.learning_rate,
        per_device_train_batch_size=config.batch_size,
        gradient_accumulation_steps=config.gradient_accumulation_steps,
        warmup_ratio=config.warmup_ratio,
        weight_decay=config.weight_decay,
        
        # Training duration
        num_train_epochs=num_train_epochs,
        max_steps=max_steps,
        
        # DGX Spark optimizations
        bf16=True,
        optim="adamw_8bit",
        gradient_checkpointing=True,
        
        # Logging
        logging_steps=10,
        eval_strategy="steps",
        eval_steps=50,
        save_strategy="steps",
        save_steps=100,
        
        # Misc
        run_name=run_name,
        report_to="none",  # Or "wandb" for tracking
        load_best_model_at_end=True,
        metric_for_best_model="eval_loss",
    )


# Example usage
example_config = search_space[0]
example_args = create_training_args(example_config, "./hp_search")
print("Example TrainingArguments created:")
print(f"  Output dir: {example_args.output_dir}")
print(f"  Learning rate: {example_args.learning_rate}")
print(f"  Effective batch size: {example_args.per_device_train_batch_size * example_args.gradient_accumulation_steps}")
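
Because `warmup_ratio` is a fraction of the total optimizer steps, the same ratio yields different warmup lengths for different effective batch sizes. The sketch below approximates the step counts for a single device. The helper `estimate_schedule` and the dataset size are hypothetical, and the rounding of the last partial accumulation window may differ between `Trainer` versions.

```python
import math


def estimate_schedule(num_examples: int, batch_size: int, grad_accum: int,
                      epochs: int, warmup_ratio: float) -> dict:
    """Approximate optimizer-step and warmup-step counts for one device."""
    micro_batches = math.ceil(num_examples / batch_size)
    # Assumption: a trailing partial accumulation window is dropped.
    steps_per_epoch = max(micro_batches // grad_accum, 1)
    total_steps = steps_per_epoch * epochs
    return {
        "steps_per_epoch": steps_per_epoch,
        "total_steps": total_steps,
        "warmup_steps": math.ceil(total_steps * warmup_ratio),
    }


# Hypothetical dataset of 10,000 examples with the (4, 8) batch configuration.
schedule = estimate_schedule(10_000, batch_size=4, grad_accum=8,
                             epochs=3, warmup_ratio=0.1)
print(schedule)
```

Comparing this output across the batch configurations in the search space shows how few optimizer steps the largest effective batch size leaves for warmup on a small dataset.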

In [None]:
# Simulated hyperparameter search results (for demonstration)
# In practice, you would run actual training

import matplotlib.pyplot as plt
import numpy as np

# Simulated results (realistic patterns)
np.random.seed(42)

results = []
for config in search_space:
    # Simulate loss based on hyperparameters
    # Lower is better
    base_loss = 2.0
    
    # Optimal LR around 2e-5
    lr_factor = abs(np.log10(config.learning_rate) - np.log10(2e-5)) * 0.3
    
    # Larger batch sizes generally help (up to a point)
    batch_factor = 0.1 * (1 - min(config.effective_batch_size, 32) / 32)
    
    # Too high LR causes instability
    instability = 0.5 if config.learning_rate > 1e-4 else 0
    
    final_loss = base_loss - 0.5 + lr_factor + batch_factor + instability
    final_loss += np.random.normal(0, 0.05)  # Add noise
    
    results.append({
        "config": config.to_dict(),
        "final_loss": max(0.5, final_loss),
        "best_step": np.random.randint(100, 500),
    })

# Sort by loss
results.sort(key=lambda x: x["final_loss"])

print("Top 5 Configurations:")
print("=" * 60)
for i, r in enumerate(results[:5]):
    c = r["config"]
    print(f"{i+1}. Loss: {r['final_loss']:.4f}")
    print(f"   LR: {c['learning_rate']:.0e}, "
          f"Batch: {c['effective_batch_size']}, "
          f"Best step: {r['best_step']}")
    print()
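
The results above are simulated. A real search could replace the simulation with a loop that calls a training function once per configuration. The sketch below assumes a hypothetical `train_fn` that takes a config dict and returns the final eval loss; `toy_train_fn` is a stand-in with a made-up loss so the example runs without a GPU.

```python
from typing import Callable, Dict, List


def run_search(configs: List[Dict], train_fn: Callable[[Dict], float]) -> List[Dict]:
    """Run train_fn on every config and return results sorted by loss."""
    results = []
    for cfg in configs:
        try:
            loss = train_fn(cfg)
        except RuntimeError as err:  # e.g. out-of-memory on a large batch
            print(f"Skipping {cfg}: {err}")
            continue
        results.append({"config": cfg, "final_loss": loss})
    return sorted(results, key=lambda r: r["final_loss"])


def toy_train_fn(cfg: Dict) -> float:
    """Stand-in for real training: a made-up loss, NOT a measured result."""
    if cfg["batch_size"] > 4:
        raise RuntimeError("simulated out-of-memory")
    return abs(cfg["learning_rate"] - 1e-4) * 1e3 + 1.0


toy_configs = [
    {"learning_rate": 1e-5, "batch_size": 2},
    {"learning_rate": 1e-4, "batch_size": 4},
    {"learning_rate": 2e-4, "batch_size": 8},  # skipped by the toy OOM rule
]
ranked = run_search(toy_configs, toy_train_fn)
print(ranked[0]["config"])
```

Catching the failure per configuration keeps one bad run from aborting the whole sweep; in a real search, `train_fn` would build `TrainingArguments` from the config and return the measured `eval_loss`.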

In [None]:
# Visualize hyperparameter search results

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Extract data for plotting
lrs = [r["config"]["learning_rate"] for r in results]
batch_sizes = [r["config"]["effective_batch_size"] for r in results]
losses = [r["final_loss"] for r in results]

# Plot 1: Learning rate vs Loss
ax1 = axes[0]
scatter1 = ax1.scatter(lrs, losses, c=batch_sizes, cmap='viridis', 
                       s=100, alpha=0.7)
ax1.set_xscale('log')
ax1.set_xlabel('Learning Rate')
ax1.set_ylabel('Final Loss')
ax1.set_title('Learning Rate vs Loss\n(color = batch size)')
plt.colorbar(scatter1, ax=ax1, label='Effective Batch Size')
ax1.axvline(x=2e-5, color='red', linestyle='--', alpha=0.5, label='Optimal LR region')
ax1.legend()

# Plot 2: Batch size vs Loss
ax2 = axes[1]
scatter2 = ax2.scatter(batch_sizes, losses, c=np.log10(lrs), cmap='plasma',
                       s=100, alpha=0.7)
ax2.set_xlabel('Effective Batch Size')
ax2.set_ylabel('Final Loss')
ax2.set_title('Batch Size vs Loss\n(color = log10(LR))')
plt.colorbar(scatter2, ax=ax2, label='log10(Learning Rate)')

plt.tight_layout()
plt.savefig('hp_search_results.png', dpi=150, bbox_inches='tight')
plt.show()

print("\nKey Insights (from the simulated data above, not measured results):")
print("- Lowest simulated loss is near LR 2e-5, by construction of the simulation")
print("- Larger batch sizes help stability, with diminishing returns past 32")
print("- High LR (>1e-4) is modeled as causing training instability")

---

## Exercise 3 Solution: Custom Training Loop with Memory Monitoring

**Task:** Implement a training loop that tracks memory usage and supports gradient accumulation.

In [None]:
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from typing import Optional, Callable, Dict, Any
from dataclasses import dataclass, field
import time
import gc

@dataclass
class TrainingMetrics:
    """Container for training metrics."""
    step: int = 0
    epoch: int = 0
    loss: float = 0.0
    learning_rate: float = 0.0
    gpu_memory_gb: float = 0.0
    tokens_per_second: float = 0.0
    gradient_norm: float = 0.0
    
    def to_dict(self) -> Dict[str, Any]:
        return {
            "step": self.step,
            "epoch": self.epoch,
            "loss": self.loss,
            "learning_rate": self.learning_rate,
            "gpu_memory_gb": self.gpu_memory_gb,
            "tokens_per_second": self.tokens_per_second,
            "gradient_norm": self.gradient_norm,
        }


class MemoryEfficientTrainer:
    """
    Custom trainer with memory monitoring and gradient accumulation.
    
    Optimized for DGX Spark's unified memory architecture.
    """
    
    def __init__(
        self,
        model: nn.Module,
        optimizer: torch.optim.Optimizer,
        scheduler: Optional[torch.optim.lr_scheduler._LRScheduler] = None,
        gradient_accumulation_steps: int = 1,
        max_grad_norm: float = 1.0,
        use_amp: bool = True,
        log_interval: int = 10,
    ):
        self.model = model
        self.optimizer = optimizer
        self.scheduler = scheduler
        self.gradient_accumulation_steps = gradient_accumulation_steps
        self.max_grad_norm = max_grad_norm
        self.use_amp = use_amp
        self.log_interval = log_interval
        
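        # Gradient scaler: loss scaling is only required for FP16. With the BF16
        # autocast used in train_epoch it is unnecessary but harmless.
        self.scaler = (
            torch.cuda.amp.GradScaler()
            if use_amp and torch.cuda.is_available()
            else None
        )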
        
        # Metrics history
        self.history: list = []
        self.global_step = 0
        
    def get_memory_stats(self) -> Dict[str, float]:
        """Get current GPU memory statistics."""
        if not torch.cuda.is_available():
            return {"allocated": 0, "reserved": 0, "max_allocated": 0}
        
        return {
            "allocated": torch.cuda.memory_allocated() / 1e9,
            "reserved": torch.cuda.memory_reserved() / 1e9,
            "max_allocated": torch.cuda.max_memory_allocated() / 1e9,
        }
    
    def compute_gradient_norm(self) -> float:
        """Compute total gradient norm across all parameters."""
        total_norm = 0.0
        for p in self.model.parameters():
            if p.grad is not None:
                param_norm = p.grad.data.norm(2)
                total_norm += param_norm.item() ** 2
        return total_norm ** 0.5
    
    def train_epoch(
        self,
        train_loader: DataLoader,
        epoch: int,
        callback: Optional[Callable[[TrainingMetrics], None]] = None,
    ) -> float:
        """
        Train for one epoch with gradient accumulation and memory monitoring.
        
        Args:
            train_loader: DataLoader for training data
            epoch: Current epoch number
            callback: Optional callback for each logging step
        
        Returns:
            Average loss for the epoch
        """
        self.model.train()
        total_loss = 0.0
        num_batches = 0
        accumulated_loss = 0.0
        
        # Reset memory stats
        if torch.cuda.is_available():
            torch.cuda.reset_peak_memory_stats()
        
        start_time = time.time()
        tokens_processed = 0
        
        for batch_idx, batch in enumerate(train_loader):
            # Move batch to the model's device (works on both CPU and GPU)
            device = next(self.model.parameters()).device
            if isinstance(batch, dict):
                batch = {k: v.to(device) if torch.is_tensor(v) else v
                        for k, v in batch.items()}
                input_ids = batch.get("input_ids")
            else:
                batch = batch.to(device)
                input_ids = batch

            # Track tokens
            if input_ids is not None:
                tokens_processed += input_ids.numel()

            # Forward pass with BF16 autocast (a no-op when use_amp is False)
            with torch.autocast(device_type=device.type, dtype=torch.bfloat16, enabled=self.use_amp):
                if isinstance(batch, dict):
                    outputs = self.model(**batch)
                    loss = outputs.loss if hasattr(outputs, 'loss') else outputs[0]
                else:
                    outputs = self.model(batch)
                    loss = outputs
                
                # Scale loss for gradient accumulation
                loss = loss / self.gradient_accumulation_steps
            
            # Backward pass
            if self.scaler:
                self.scaler.scale(loss).backward()
            else:
                loss.backward()
            
            accumulated_loss += loss.item()
            
            # Optimizer step (after accumulation)
            if (batch_idx + 1) % self.gradient_accumulation_steps == 0:
                # Gradient clipping
                if self.scaler:
                    self.scaler.unscale_(self.optimizer)
                
                grad_norm = self.compute_gradient_norm()
                torch.nn.utils.clip_grad_norm_(
                    self.model.parameters(), 
                    self.max_grad_norm
                )
                
                # Optimizer step
                if self.scaler:
                    self.scaler.step(self.optimizer)
                    self.scaler.update()
                else:
                    self.optimizer.step()
                
                if self.scheduler:
                    self.scheduler.step()
                
                self.optimizer.zero_grad()
                self.global_step += 1
                
                # Record metrics
                total_loss += accumulated_loss
                num_batches += 1
                
                # Log at intervals
                if self.global_step % self.log_interval == 0:
                    elapsed = time.time() - start_time
                    memory_stats = self.get_memory_stats()
                    
                    metrics = TrainingMetrics(
                        step=self.global_step,
                        epoch=epoch,
                        loss=accumulated_loss,
                        learning_rate=self.optimizer.param_groups[0]["lr"],
                        gpu_memory_gb=memory_stats["allocated"],
                        tokens_per_second=tokens_processed / elapsed,
                        gradient_norm=grad_norm,
                    )
                    
                    self.history.append(metrics.to_dict())
                    
                    if callback:
                        callback(metrics)
                    else:
                        print(f"Step {self.global_step}: "
                              f"loss={accumulated_loss:.4f}, "
                              f"lr={metrics.learning_rate:.2e}, "
                              f"mem={metrics.gpu_memory_gb:.1f}GB, "
                              f"tok/s={metrics.tokens_per_second:.0f}")
                
                accumulated_loss = 0.0
        
        avg_loss = total_loss / max(num_batches, 1)
        return avg_loss
    
    def get_memory_summary(self) -> str:
        """Get a summary of memory usage during training."""
        if not self.history:
            return "No training history available."
        
        memory_values = [h["gpu_memory_gb"] for h in self.history]
        return (
            f"Memory Usage Summary:\n"
            f"  Min: {min(memory_values):.2f} GB\n"
            f"  Max: {max(memory_values):.2f} GB\n"
            f"  Avg: {sum(memory_values)/len(memory_values):.2f} GB"
        )


print("MemoryEfficientTrainer class defined!")
print("\nFeatures:")
print("  - Gradient accumulation")
print("  - Mixed precision training (BF16)")
print("  - Memory monitoring")
print("  - Gradient norm tracking")
print("  - Tokens/second throughput")

In [None]:
# Example usage with a simple model

# Create a dummy model for demonstration
class DummyLM(nn.Module):
    def __init__(self, vocab_size=1000, hidden_size=256):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, hidden_size)
        self.transformer = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(hidden_size, 4, batch_first=True),
            num_layers=2
        )
        self.lm_head = nn.Linear(hidden_size, vocab_size)
    
    def forward(self, input_ids, labels=None, **kwargs):
        x = self.embedding(input_ids)
        x = self.transformer(x)
        logits = self.lm_head(x)
        
        loss = None
        if labels is not None:
            loss = nn.functional.cross_entropy(
                logits.view(-1, logits.size(-1)),
                labels.view(-1)
            )
        
        return type('Output', (), {'loss': loss, 'logits': logits})()

# Setup
model = DummyLM().cuda() if torch.cuda.is_available() else DummyLM()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

trainer = MemoryEfficientTrainer(
    model=model,
    optimizer=optimizer,
    gradient_accumulation_steps=4,
    max_grad_norm=1.0,
    use_amp=torch.cuda.is_available(),
    log_interval=2,
)

# Create dummy data
from torch.utils.data import TensorDataset

dummy_input_ids = torch.randint(0, 1000, (32, 64))
dummy_labels = torch.randint(0, 1000, (32, 64))
dummy_dataset = TensorDataset(dummy_input_ids, dummy_labels)

class DictDataLoader:
    def __init__(self, dataset, batch_size):
        self.loader = DataLoader(dataset, batch_size=batch_size)
    
    def __iter__(self):
        for input_ids, labels in self.loader:
            yield {"input_ids": input_ids, "labels": labels}
    
    def __len__(self):
        return len(self.loader)

train_loader = DictDataLoader(dummy_dataset, batch_size=4)

# Train for one epoch
print("\nStarting training...\n")
avg_loss = trainer.train_epoch(train_loader, epoch=1)
print(f"\nEpoch complete! Average loss: {avg_loss:.4f}")
print(f"\n{trainer.get_memory_summary()}")
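
The trainer divides each micro-batch loss by `gradient_accumulation_steps` before calling `backward()`. The framework-free sketch below illustrates why: for a mean-reduced loss and equal-sized micro-batches, summing the scaled micro-batch gradients reproduces the full-batch gradient. The scalar model `w * x` and the data are made up for illustration.

```python
def mse_grad(w: float, xs: list, ys: list) -> float:
    """Gradient of mean((w*x - y)^2) with respect to w."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


w = 0.5
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0, 14.0, 16.0]

# Full-batch gradient over all 8 examples.
full_grad = mse_grad(w, xs, ys)

# Accumulate over 4 micro-batches of size 2, scaling each by 1/accum_steps.
accum_steps = 4
micro_size = len(xs) // accum_steps
accumulated_grad = 0.0
for i in range(accum_steps):
    lo, hi = i * micro_size, (i + 1) * micro_size
    accumulated_grad += mse_grad(w, xs[lo:hi], ys[lo:hi]) / accum_steps

print(f"full-batch: {full_grad:.4f}, accumulated: {accumulated_grad:.4f}")
```

The equivalence relies on equal-sized micro-batches; if the last micro-batch of an epoch is smaller, or the number of batches is not a multiple of the accumulation steps, the match is only approximate.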

---

## Exercise 4 Solution: Evaluation Metrics Implementation

**Task:** Implement perplexity calculation and generation quality metrics.

In [None]:
import torch
import torch.nn.functional as F
from typing import List, Dict, Optional, Tuple
import numpy as np
from collections import Counter
import math

class LLMEvaluator:
    """
    Comprehensive evaluator for language model fine-tuning.
    
    Includes:
    - Perplexity calculation
    - BLEU score (for translation/paraphrase)
    - ROUGE-L score (for summarization)
    - Distinct-N (for diversity)
    """
    
    @staticmethod
    def calculate_perplexity(
        model: torch.nn.Module,
        dataloader: torch.utils.data.DataLoader,
        device: str = "cuda",
    ) -> float:
        """
        Calculate perplexity on a dataset.
        
        Perplexity = exp(average cross-entropy loss)
        Lower is better.
        
        Args:
            model: The language model
            dataloader: DataLoader with input_ids and labels
            device: Device to use
        
        Returns:
            Perplexity score
        """
        model.eval()
        total_loss = 0.0
        total_tokens = 0
        
        with torch.no_grad():
            for batch in dataloader:
                if isinstance(batch, dict):
                    input_ids = batch["input_ids"].to(device)
                    labels = batch.get("labels", input_ids).to(device)
                else:
                    input_ids = batch[0].to(device)
                    labels = batch[1].to(device) if len(batch) > 1 else input_ids
                
                outputs = model(input_ids=input_ids, labels=labels)
                loss = outputs.loss
                
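                # Count label tokens that contribute to the loss (-100 is ignored).
                # Hugging Face causal LMs shift labels internally, so this can
                # overcount by up to one token per sequence; the effect is small.
                num_tokens = (labels != -100).sum().item()
                total_loss += loss.item() * num_tokens
                total_tokens += num_tokens
        
        # An empty dataloader or fully masked labels would divide by zero
        if total_tokens == 0:
            raise ValueError("No label tokens found; cannot compute perplexity.")
        avg_loss = total_loss / total_tokens
        perplexity = math.exp(avg_loss)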
        
        return perplexity
    
    @staticmethod
    def get_ngrams(tokens: List[str], n: int) -> List[Tuple[str, ...]]:
        """Extract n-grams from a token list."""
        return [tuple(tokens[i:i+n]) for i in range(len(tokens) - n + 1)]
    
    @staticmethod
    def calculate_bleu(
        references: List[str],
        hypotheses: List[str],
        max_n: int = 4,
    ) -> Dict[str, float]:
        """
        Calculate simplified BLEU scores (no brevity penalty).
        
        Args:
            references: List of reference texts
            hypotheses: List of generated texts
            max_n: Maximum n-gram to consider
        
        Returns:
            Dictionary where "BLEU-n" is the clipped n-gram precision for
            each n up to max_n, and "BLEU" is their geometric mean
        """
        scores = {}
        
        for n in range(1, max_n + 1):
            total_matches = 0
            total_count = 0
            
            for ref, hyp in zip(references, hypotheses):
                ref_tokens = ref.lower().split()
                hyp_tokens = hyp.lower().split()
                
                ref_ngrams = Counter(LLMEvaluator.get_ngrams(ref_tokens, n))
                hyp_ngrams = Counter(LLMEvaluator.get_ngrams(hyp_tokens, n))
                
                # Clipped count
                for ngram, count in hyp_ngrams.items():
                    total_matches += min(count, ref_ngrams.get(ngram, 0))
                
                total_count += sum(hyp_ngrams.values())
            
            precision = total_matches / max(total_count, 1)
            scores[f"BLEU-{n}"] = precision
        
        # Geometric mean for overall BLEU
        if all(scores[f"BLEU-{n}"] > 0 for n in range(1, max_n + 1)):
            log_sum = sum(math.log(scores[f"BLEU-{n}"]) for n in range(1, max_n + 1))
            scores["BLEU"] = math.exp(log_sum / max_n)
        else:
            scores["BLEU"] = 0.0
        
        return scores
    
    @staticmethod
    def calculate_rouge_l(
        references: List[str],
        hypotheses: List[str],
    ) -> Dict[str, float]:
        """
        Calculate ROUGE-L (Longest Common Subsequence).
        
        Args:
            references: List of reference texts
            hypotheses: List of generated texts
        
        Returns:
            Dictionary with precision, recall, and F1
        """
        def lcs_length(x: List[str], y: List[str]) -> int:
            """Compute length of longest common subsequence."""
            m, n = len(x), len(y)
            dp = [[0] * (n + 1) for _ in range(m + 1)]
            
            for i in range(1, m + 1):
                for j in range(1, n + 1):
                    if x[i-1] == y[j-1]:
                        dp[i][j] = dp[i-1][j-1] + 1
                    else:
                        dp[i][j] = max(dp[i-1][j], dp[i][j-1])
            
            return dp[m][n]
        
        total_precision = 0.0
        total_recall = 0.0
        
        for ref, hyp in zip(references, hypotheses):
            ref_tokens = ref.lower().split()
            hyp_tokens = hyp.lower().split()
            
            lcs_len = lcs_length(ref_tokens, hyp_tokens)
            
            precision = lcs_len / max(len(hyp_tokens), 1)
            recall = lcs_len / max(len(ref_tokens), 1)
            
            total_precision += precision
            total_recall += recall
        
        n = len(references)
        avg_precision = total_precision / n
        avg_recall = total_recall / n
        
        f1 = (2 * avg_precision * avg_recall / 
              max(avg_precision + avg_recall, 1e-8))
        
        return {
            "ROUGE-L-P": avg_precision,
            "ROUGE-L-R": avg_recall,
            "ROUGE-L-F1": f1,
        }
    
    @staticmethod
    def calculate_distinct_n(
        texts: List[str],
        max_n: int = 2,
    ) -> Dict[str, float]:
        """
        Calculate Distinct-N scores (lexical diversity).
        
        Higher = more diverse generation.
        
        Args:
            texts: List of generated texts
            max_n: Maximum n-gram size
        
        Returns:
            Dictionary with Distinct-1, Distinct-2, etc.
        """
        scores = {}
        
        for n in range(1, max_n + 1):
            all_ngrams = []
            
            for text in texts:
                tokens = text.lower().split()
                ngrams = LLMEvaluator.get_ngrams(tokens, n)
                all_ngrams.extend(ngrams)
            
            if all_ngrams:
                unique_ngrams = len(set(all_ngrams))
                total_ngrams = len(all_ngrams)
                scores[f"Distinct-{n}"] = unique_ngrams / total_ngrams
            else:
                scores[f"Distinct-{n}"] = 0.0
        
        return scores


print("LLMEvaluator class defined!")
print("\nAvailable metrics:")
print("  - calculate_perplexity()")
print("  - calculate_bleu()")
print("  - calculate_rouge_l()")
print("  - calculate_distinct_n()")
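
`calculate_perplexity` weights each batch loss by its token count before averaging. The sketch below, using made-up batch losses and token counts, shows how that differs from a plain mean of batch losses when batches contain different numbers of tokens.

```python
import math

# Hypothetical (mean loss, non-ignored token count) pairs for three batches.
batch_stats = [(2.0, 100), (3.0, 10), (2.5, 50)]

# Token-weighted mean: every token contributes equally.
total_tokens = sum(n for _, n in batch_stats)
weighted_loss = sum(loss * n for loss, n in batch_stats) / total_tokens

# Naive mean: every batch contributes equally, regardless of size.
naive_loss = sum(loss for loss, _ in batch_stats) / len(batch_stats)

print(f"token-weighted perplexity: {math.exp(weighted_loss):.2f}")
print(f"naive batch-mean perplexity: {math.exp(naive_loss):.2f}")
```

In this made-up case the small, high-loss batch inflates the naive figure; the token-weighted version is the one that matches the definition of perplexity as the exponential of the average per-token loss.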

In [None]:
# Demo the evaluator

# Sample reference and generated texts
references = [
    "The quick brown fox jumps over the lazy dog",
    "Machine learning is a subset of artificial intelligence",
    "Python is a popular programming language for data science",
]

# Good generations (similar to reference)
good_hypotheses = [
    "The quick brown fox leaps over the lazy dog",
    "Machine learning is part of artificial intelligence",
    "Python is a widely used language for data science",
]

# Bad generations (very different)
bad_hypotheses = [
    "The cat sat on the mat",
    "Cooking recipes are available online",
    "The weather is nice today",
]

print("=" * 60)
print("GOOD GENERATIONS:")
print("=" * 60)
bleu_good = LLMEvaluator.calculate_bleu(references, good_hypotheses)
rouge_good = LLMEvaluator.calculate_rouge_l(references, good_hypotheses)
distinct_good = LLMEvaluator.calculate_distinct_n(good_hypotheses)

print(f"BLEU scores: {bleu_good}")
print(f"ROUGE-L: {rouge_good}")
print(f"Distinct-N: {distinct_good}")

print("\n" + "=" * 60)
print("BAD GENERATIONS:")
print("=" * 60)
bleu_bad = LLMEvaluator.calculate_bleu(references, bad_hypotheses)
rouge_bad = LLMEvaluator.calculate_rouge_l(references, bad_hypotheses)
distinct_bad = LLMEvaluator.calculate_distinct_n(bad_hypotheses)

print(f"BLEU scores: {bleu_bad}")
print(f"ROUGE-L: {rouge_bad}")
print(f"Distinct-N: {distinct_bad}")

print("\n" + "=" * 60)
print("INTERPRETATION:")
print("=" * 60)
print(f"Good vs Bad BLEU-2: {bleu_good['BLEU-2']:.3f} vs {bleu_bad['BLEU-2']:.3f}")
print(f"Good vs Bad ROUGE-L-F1: {rouge_good['ROUGE-L-F1']:.3f} vs {rouge_bad['ROUGE-L-F1']:.3f}")
print("\nHigher BLEU/ROUGE = closer to reference (better for many tasks)")
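
The `calculate_bleu` method above reports clipped n-gram precisions and their geometric mean, without the brevity penalty used in the standard BLEU definition. A minimal sketch of that term follows; `brevity_penalty` is a hypothetical helper, not part of `LLMEvaluator`, and it assumes the same whitespace tokenization.

```python
import math
from typing import List


def brevity_penalty(references: List[str], hypotheses: List[str]) -> float:
    """Corpus-level BLEU brevity penalty: 1 if c > r, else exp(1 - r / c)."""
    ref_len = sum(len(ref.split()) for ref in references)
    hyp_len = sum(len(hyp.split()) for hyp in hypotheses)
    if hyp_len == 0:
        return 0.0
    if hyp_len > ref_len:
        return 1.0
    return math.exp(1 - ref_len / hyp_len)


refs = ["the quick brown fox jumps over the lazy dog"]
short_hyp = ["the quick brown fox"]
print(f"BP for a short hypothesis: {brevity_penalty(refs, short_hyp):.3f}")
```

Multiplying the `"BLEU"` value by this factor would penalize hypotheses that score high precision only by being much shorter than their references.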

---

## Summary

These solutions demonstrate:

1. **Custom LoRA Configuration**: How to tune LoRA parameters for specific tasks (code generation)

2. **Hyperparameter Search**: Systematic exploration of learning rates and batch sizes with visualization

3. **Custom Training Loop**: Memory-efficient training with gradient accumulation and monitoring

4. **Evaluation Metrics**: Perplexity, BLEU, ROUGE-L, and Distinct-N for comprehensive model evaluation

### Key Takeaways

- **LoRA rank 8-32** works well for most tasks; higher for complex tasks like code
- **Learning rate** should be swept rather than assumed: the simulated search here centers on 2e-5, while LoRA runs often use higher rates (around 1e-4 to 2e-4)
- **Gradient accumulation** allows large effective batch sizes with limited memory
- **Multiple metrics** give a fuller picture than any single metric

In [None]:
# Cleanup
import gc
gc.collect()

try:
    import torch
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
        print("GPU cache cleared")
except ImportError:
    pass

print("Cleanup complete!")