# LoRA: Low-Rank Adaptation of Large Language Models

## 🎯 Overview

LoRA (Low-Rank Adaptation) is a widely used parameter-efficient fine-tuning technique that has changed how large language models are adapted. Instead of updating all model parameters, LoRA injects trainable low-rank matrices into transformer layers while keeping the original weights frozen.

**Key Innovation**: Decomposes weight updates into low-rank matrices A and B, reducing trainable parameters by up to 10,000× (the figure reported for GPT-3 175B) while maintaining comparable performance.

**Impact**: Widely adopted through the Hugging Face PEFT library, enabling fine-tuning of large models on consumer hardware.

## 📚 Background & Motivation

### The Problem
- Full fine-tuning requires updating all model parameters
- Memory requirements scale linearly with model size
- Storing separate copies for each task becomes prohibitive
- Training large models requires expensive hardware
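
As a rough, back-of-the-envelope illustration of the points above, the sketch below estimates the memory needed for weights, gradients, and optimizer state. It assumes fp32 values and an Adam-style optimizer that keeps two moment buffers per trainable parameter, and it ignores activations entirely. The helper name `training_state_gb`, the 7-billion-parameter model, and the 0.1% trainable budget are hypothetical values chosen only for illustration.

```python
def training_state_gb(total_params: int, trainable_params: int,
                      bytes_per_value: int = 4) -> float:
    """Rough memory (GB) for weights, gradients, and Adam state.

    Assumptions: every parameter is stored once; each trainable parameter
    also needs one gradient and two Adam moment buffers. Activation memory
    is ignored, and trainable parameters are treated as a subset of the total.
    """
    weights = total_params * bytes_per_value
    grads_and_adam = trainable_params * bytes_per_value * 3
    return (weights + grads_and_adam) / 1e9


n = 7_000_000_000  # hypothetical 7B-parameter model

# Full fine-tuning: every parameter is trainable
full_ft = training_state_gb(n, trainable_params=n)

# Hypothetical small trainable budget: 0.1% of the parameters
small_budget = training_state_gb(n, trainable_params=n // 1000)

print(f"Full fine-tuning:            ~{full_ft:.0f} GB")
print(f"Frozen base + small budget:  ~{small_budget:.0f} GB")
```

Under these assumptions most of the cost of full fine-tuning comes from gradients and optimizer state rather than from the weights themselves, which is the part that freezing the base model removes.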

### The LoRA Solution
- Hypothesis: Weight updates during adaptation have low "intrinsic rank"
- Decompose weight updates ΔW into low-rank matrices: ΔW = BA
- Only train A and B matrices, freeze original weights
- Merge weights during inference: W_new = W_original + BA
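
The merge step in the last bullet can be checked numerically. The following is a minimal NumPy sketch, not the notebook's PyTorch implementation: the sizes are arbitrary, the matrices are random stand-ins for trained weights, and the α/r scaling used later in the notebook is left out to match the formula above.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r = 8, 6, 2  # small illustrative sizes (assumed, not from a real model)

W0 = rng.normal(size=(d, k))  # frozen pre-trained weights
B = rng.normal(size=(d, r))   # trainable factor
A = rng.normal(size=(r, k))   # trainable factor
x = rng.normal(size=k)

# Training-time view: frozen path plus the low-rank path
y_adapter = W0 @ x + B @ (A @ x)

# Inference-time view: fold the update into a single matrix
W_merged = W0 + B @ A
y_merged = W_merged @ x

print("Same output after merging:", np.allclose(y_adapter, y_merged))
```

Because the two views agree, a merged model has the same shape and inference cost as the original one.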

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import torch
import torch.nn as nn
import torch.nn.functional as F
from typing import Optional
import seaborn as sns

# Set style
plt.style.use('default')
sns.set_palette("husl")
np.random.seed(42)

print("📦 Libraries imported successfully!")
print(f"🔢 NumPy version: {np.__version__}")
print(f"🔥 PyTorch version: {torch.__version__}")

## 🧮 Mathematical Foundation

### Core LoRA Mathematics

For a pre-trained weight matrix W₀ ∈ ℝᵈˣᵏ, LoRA represents the weight update as:

**W = W₀ + ΔW = W₀ + BA**

Where:
- **B ∈ ℝᵈˣʳ**: Up-projection matrix, mapping from rank r back to d (trainable)
- **A ∈ ℝʳˣᵏ**: Down-projection matrix, mapping from k down to rank r (trainable)
- **r**: Rank (much smaller than d, k)
- **W₀**: Original frozen weights
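
To make the shapes concrete, here is a small NumPy sketch under assumed sizes (d=64, k=48, r=4, chosen only for illustration). It traces an input through A and then B, and checks that the product BA has the full d × k shape but rank no larger than r.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r = 64, 48, 4  # illustrative sizes, not taken from a real model

B = rng.normal(size=(d, r))
A = rng.normal(size=(r, k))
delta_W = B @ A  # same shape as W0, but built from only r * (d + k) numbers

x = rng.normal(size=k)
h = A @ x  # k-dimensional input -> r-dimensional bottleneck
y = B @ h  # r-dimensional bottleneck -> d-dimensional output

print("delta_W shape:", delta_W.shape)
print("bottleneck shape:", h.shape, "output shape:", y.shape)
print("rank of delta_W is at most r:", np.linalg.matrix_rank(delta_W) <= r)
```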

### Parameter Reduction
- **Original parameters**: d × k
- **LoRA parameters**: r × (d + k)
- **Reduction ratio**: (d × k) / (r × (d + k))

For typical values (d=4096, k=4096, r=16):
- Original: 16,777,216 parameters
- LoRA: 131,072 parameters
- **Reduction: 128×**
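
The arithmetic above can be reproduced with a few lines of plain Python. The helper name `lora_param_counts` is hypothetical; it simply evaluates the three formulas for a single weight matrix and ignores biases.

```python
def lora_param_counts(d: int, k: int, r: int) -> tuple[int, int, float]:
    """Return (full parameters, LoRA parameters, reduction ratio) for one matrix."""
    full = d * k
    lora = r * (d + k)
    return full, lora, full / lora


for r in (4, 16, 64):
    full, lora, ratio = lora_param_counts(4096, 4096, r)
    print(f"r={r:<3} full={full:,}  lora={lora:,}  reduction={ratio:.0f}x")
```

For a square matrix (d = k) the ratio simplifies to d / (2r), so doubling the rank halves the reduction.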

In [None]:
class LoRALayer(nn.Module):
    """
    LoRA (Low-Rank Adaptation) layer implementation.
    
    This layer adds trainable low-rank matrices to a frozen linear layer.
    """
    
    def __init__(
        self,
        original_layer: nn.Linear,
        rank: int = 16,
        alpha: float = 16,
        dropout: float = 0.1
    ):
        super().__init__()
        self.rank = rank
        self.alpha = alpha
        self.scaling = alpha / rank
        
        # Store original layer (frozen)
        self.original_layer = original_layer
        self.original_layer.requires_grad_(False)
        
        in_features = original_layer.in_features
        out_features = original_layer.out_features
        
        # LoRA matrices
        self.lora_A = nn.Linear(in_features, rank, bias=False)
        self.lora_B = nn.Linear(rank, out_features, bias=False)
        self.dropout = nn.Dropout(dropout)
        
        # Initialize LoRA weights
        nn.init.kaiming_uniform_(self.lora_A.weight, a=np.sqrt(5))
        nn.init.zeros_(self.lora_B.weight)
        
        print(f"✅ LoRA layer created:")
        print(f"   Original params: {in_features * out_features:,}")
        print(f"   LoRA params: {rank * (in_features + out_features):,}")
        print(f"   Reduction: {(in_features * out_features) / (rank * (in_features + out_features)):.1f}x")
    
    def forward(self, x):
        # Original forward pass (frozen)
        original_output = self.original_layer(x)
        
        # LoRA forward pass
        lora_output = self.lora_B(self.lora_A(self.dropout(x)))
        
        # Combine with scaling
        return original_output + lora_output * self.scaling
    
    def merge_weights(self):
        """Merge LoRA weights into original layer for inference."""
        with torch.no_grad():
            # Compute LoRA weight update
            lora_weight = self.lora_B.weight @ self.lora_A.weight
            
            # Add to original weights
            self.original_layer.weight.add_(lora_weight * self.scaling)
            
            # Zero out LoRA weights
            self.lora_A.weight.zero_()
            self.lora_B.weight.zero_()
    
    def get_parameter_count(self):
        """Get parameter counts for analysis."""
        original_params = sum(p.numel() for p in self.original_layer.parameters())
        lora_params = sum(p.numel() for p in [self.lora_A.weight, self.lora_B.weight])
        return {
            'original': original_params,
            'lora': lora_params,
            'total': original_params + lora_params,
            'reduction': original_params / lora_params
        }

# Test the LoRA layer
original_linear = nn.Linear(1024, 1024)
lora_layer = LoRALayer(original_linear, rank=16, alpha=16)

# Test forward pass
x = torch.randn(32, 1024)
output = lora_layer(x)
print(f"\n🔄 Forward pass successful: {output.shape}")

# Analyze parameters
params = lora_layer.get_parameter_count()
print(f"\n📊 Parameter Analysis:")
for key, value in params.items():
    if isinstance(value, (int, float)):
        print(f"   {key}: {value:,.0f}")

## 🏗️ LoRA Attention Implementation

A common application of LoRA is in attention layers. Here we apply it to the query, key, value, and output projections; the original LoRA paper adapts only the query and value projections in most of its experiments.
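
Before building the module, it can help to estimate how many trainable parameters LoRA adds to one attention block. The sketch below is plain arithmetic under stated assumptions: four square projections with biases, the same rank on every adapted projection, and a hypothetical helper name `attention_lora_budget`.

```python
def attention_lora_budget(embed_dim: int, rank: int,
                          adapted_projections: int = 4) -> dict:
    """Parameter budget for one attention block with LoRA on some projections."""
    # Four base projections (q, k, v, out), each with a weight matrix and a bias
    base = 4 * (embed_dim * embed_dim + embed_dim)
    # Each adapted projection adds A (rank x embed_dim) and B (embed_dim x rank)
    lora = adapted_projections * rank * (embed_dim + embed_dim)
    return {
        "base": base,
        "lora": lora,
        "trainable_fraction": lora / (base + lora),
    }


all_four = attention_lora_budget(embed_dim=768, rank=16)
two_only = attention_lora_budget(embed_dim=768, rank=16, adapted_projections=2)

print(f"All four projections: {all_four['lora']:,} LoRA params "
      f"({all_four['trainable_fraction']:.2%} of the block)")
print(f"Two projections only: {two_only['lora']:,} LoRA params "
      f"({two_only['trainable_fraction']:.2%} of the block)")
```

Adapting fewer projections halves the adapter size here, which is one reason the choice of target layers is treated as a hyperparameter.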

In [None]:
class LoRAMultiHeadAttention(nn.Module):
    """
    Multi-Head Attention with LoRA adaptation.
    
    Applies LoRA to query, key, value, and output projections.
    """
    
    def __init__(
        self,
        embed_dim: int = 768,
        num_heads: int = 12,
        lora_rank: int = 16,
        lora_alpha: float = 16,
        dropout: float = 0.1
    ):
        super().__init__()
        self.embed_dim = embed_dim
        self.num_heads = num_heads
        self.head_dim = embed_dim // num_heads
        
        assert self.head_dim * num_heads == embed_dim, "embed_dim must be divisible by num_heads"
        
        # Original attention layers (frozen)
        self.q_proj = nn.Linear(embed_dim, embed_dim)
        self.k_proj = nn.Linear(embed_dim, embed_dim)
        self.v_proj = nn.Linear(embed_dim, embed_dim)
        self.out_proj = nn.Linear(embed_dim, embed_dim)
        
        # Apply LoRA to projections
        self.q_lora = LoRALayer(self.q_proj, lora_rank, lora_alpha, dropout)
        self.k_lora = LoRALayer(self.k_proj, lora_rank, lora_alpha, dropout)
        self.v_lora = LoRALayer(self.v_proj, lora_rank, lora_alpha, dropout)
        self.out_lora = LoRALayer(self.out_proj, lora_rank, lora_alpha, dropout)
        
        self.dropout = nn.Dropout(dropout)
        
    def forward(self, x, attention_mask=None):
        batch_size, seq_len, embed_dim = x.shape
        
        # Apply LoRA-enhanced projections
        q = self.q_lora(x)
        k = self.k_lora(x)
        v = self.v_lora(x)
        
        # Reshape for multi-head attention
        q = q.view(batch_size, seq_len, self.num_heads, self.head_dim).transpose(1, 2)
        k = k.view(batch_size, seq_len, self.num_heads, self.head_dim).transpose(1, 2)
        v = v.view(batch_size, seq_len, self.num_heads, self.head_dim).transpose(1, 2)
        
        # Scaled dot-product attention
        scores = torch.matmul(q, k.transpose(-2, -1)) / np.sqrt(self.head_dim)
        
        if attention_mask is not None:
            scores = scores.masked_fill(attention_mask == 0, -1e9)
        
        attn_weights = F.softmax(scores, dim=-1)
        attn_weights = self.dropout(attn_weights)
        
        # Apply attention to values
        attn_output = torch.matmul(attn_weights, v)
        
        # Reshape and apply output projection
        attn_output = attn_output.transpose(1, 2).contiguous().view(
            batch_size, seq_len, embed_dim
        )
        
        output = self.out_lora(attn_output)
        
        return output, attn_weights
    
    def get_trainable_parameters(self):
        """Get count of trainable vs total parameters."""
        total_params = sum(p.numel() for p in self.parameters())
        trainable_params = sum(p.numel() for p in self.parameters() if p.requires_grad)
        
        return {
            'total': total_params,
            'trainable': trainable_params,
            'frozen': total_params - trainable_params,
            'efficiency': trainable_params / total_params
        }

# Test LoRA attention
lora_attention = LoRAMultiHeadAttention(
    embed_dim=768,
    num_heads=12,
    lora_rank=16
)

# Test with sample input
x = torch.randn(4, 128, 768)  # batch_size, seq_len, embed_dim
output, weights = lora_attention(x)

print(f"\n🔄 LoRA Attention Test:")
print(f"   Input shape: {x.shape}")
print(f"   Output shape: {output.shape}")
print(f"   Attention weights shape: {weights.shape}")

# Analyze parameter efficiency
param_stats = lora_attention.get_trainable_parameters()
print(f"\n📊 Parameter Efficiency:")
for key, value in param_stats.items():
    if key == 'efficiency':
        print(f"   {key}: {value:.4f} ({value*100:.2f}%)")
    else:
        print(f"   {key}: {value:,}")

## 📊 LoRA Rank Analysis

The rank r is a crucial hyperparameter that controls the trade-off between parameter efficiency and model capacity.
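
One way to build intuition for the capacity side of this trade-off is to ask how well a rank-r matrix can approximate a given update. The NumPy sketch below uses a synthetic target whose singular values decay quickly, which is an assumption made for illustration and not a measurement of real fine-tuning updates. It reports the error of the best rank-r approximation obtained by truncated SVD.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64

# Synthetic "ideal" update with rapidly decaying singular values (assumed)
U, _ = np.linalg.qr(rng.normal(size=(d, d)))
V, _ = np.linalg.qr(rng.normal(size=(d, d)))
singular_values = 2.0 ** -np.arange(d)
target = U @ np.diag(singular_values) @ V.T


def best_rank_r_error(M: np.ndarray, r: int) -> float:
    """Relative Frobenius error of the best rank-r approximation of M."""
    u, s, vt = np.linalg.svd(M, full_matrices=False)
    approx = (u[:, :r] * s[:r]) @ vt[:r]  # keep the r largest singular values
    return float(np.linalg.norm(M - approx) / np.linalg.norm(M))


errors = {r: best_rank_r_error(target, r) for r in (1, 2, 4, 8, 16)}
for r, err in errors.items():
    print(f"rank {r:>2}: relative error {err:.2e}")
```

When the target update really is close to low-rank, a small r already captures most of it, and additional rank mainly adds parameters.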

In [None]:
def analyze_lora_rank_efficiency():
    """
    Analyze the trade-off between rank and parameter efficiency.
    """
    embed_dim = 768
    ranks = [1, 2, 4, 8, 16, 32, 64, 128]
    
    results = {
        'rank': [],
        'lora_params': [],
        'total_params': [],
        'efficiency': [],
        'reduction_ratio': []
    }
    
    # Original linear layer
    original_params = embed_dim * embed_dim
    
    for rank in ranks:
        # LoRA parameters: r * (d + k)
        lora_params = rank * (embed_dim + embed_dim)
        total_params = original_params + lora_params
        efficiency = lora_params / total_params
        reduction_ratio = original_params / lora_params
        
        results['rank'].append(rank)
        results['lora_params'].append(lora_params)
        results['total_params'].append(total_params)
        results['efficiency'].append(efficiency)
        results['reduction_ratio'].append(reduction_ratio)
    
    return results

# Analyze rank efficiency
rank_analysis = analyze_lora_rank_efficiency()

# Create visualization
fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2, 2, figsize=(15, 12))

# 1. Parameter count vs rank
ax1.plot(rank_analysis['rank'], rank_analysis['lora_params'], 'o-', label='LoRA params', linewidth=2)
ax1.axhline(y=768*768, color='r', linestyle='--', label='Original params')
ax1.set_xlabel('LoRA Rank')
ax1.set_ylabel('Parameters')
ax1.set_title('Parameter Count vs LoRA Rank')
ax1.legend()
ax1.grid(True, alpha=0.3)
ax1.set_yscale('log')

# 2. Reduction ratio vs rank
ax2.plot(rank_analysis['rank'], rank_analysis['reduction_ratio'], 'o-', color='green', linewidth=2)
ax2.set_xlabel('LoRA Rank')
ax2.set_ylabel('Parameter Reduction Ratio')
ax2.set_title('Parameter Reduction vs LoRA Rank')
ax2.grid(True, alpha=0.3)
ax2.set_yscale('log')

# 3. Efficiency percentage
efficiency_pct = [e * 100 for e in rank_analysis['efficiency']]
bars = ax3.bar(range(len(rank_analysis['rank'])), efficiency_pct, color='orange', alpha=0.7)
ax3.set_xlabel('LoRA Rank')
ax3.set_ylabel('Trainable Parameters (%)')
ax3.set_title('Training Efficiency by Rank')
ax3.set_xticks(range(len(rank_analysis['rank'])))
ax3.set_xticklabels(rank_analysis['rank'])
ax3.grid(True, alpha=0.3, axis='y')

# Add value labels on bars
for i, bar in enumerate(bars):
    height = bar.get_height()
    ax3.text(bar.get_x() + bar.get_width()/2., height + 0.1,
             f'{height:.2f}%', ha='center', va='bottom', fontsize=9)

# 4. Memory usage comparison
rank_subset = [1, 4, 16, 64]
memory_original = [768*768*4] * len(rank_subset)  # 4 bytes per float32
memory_lora = [rank * (768 + 768) * 4 for rank in rank_subset]

x_pos = np.arange(len(rank_subset))
width = 0.35

ax4.bar(x_pos - width/2, [m/1e6 for m in memory_original], width, 
        label='Original', color='red', alpha=0.7)
ax4.bar(x_pos + width/2, [m/1e6 for m in memory_lora], width, 
        label='LoRA', color='blue', alpha=0.7)

ax4.set_xlabel('LoRA Rank')
ax4.set_ylabel('Memory Usage (MB)')
ax4.set_title('Memory Usage Comparison')
ax4.set_xticks(x_pos)
ax4.set_xticklabels(rank_subset)
ax4.legend()
ax4.grid(True, alpha=0.3, axis='y')

plt.tight_layout()
plt.show()

# Print summary table
print("\n📊 LoRA Rank Analysis Summary:")
print("=" * 80)
print(f"{'Rank':<6} {'LoRA Params':<12} {'Reduction':<12} {'Efficiency':<12} {'Memory (MB)':<12}")
print("=" * 80)
for i, rank in enumerate(rank_analysis['rank']):
    if rank in [1, 4, 8, 16, 32, 64]:
        lora_params = rank_analysis['lora_params'][i]
        reduction = rank_analysis['reduction_ratio'][i]
        efficiency = rank_analysis['efficiency'][i] * 100
        memory_mb = lora_params * 4 / 1e6  # 4 bytes per float32
        # Pad the value together with its unit so the columns line up with the header
        print(f"{rank:<6} {lora_params:<12,} {f'{reduction:.1f}x':<12} {f'{efficiency:.2f}%':<12} {memory_mb:<12.2f}")

## 🎯 Practical LoRA Training Example

Let's implement a complete example showing how to use LoRA for fine-tuning.

In [None]:
class SimpleBERT(nn.Module):
    """
    Simplified BERT-like model for demonstration.
    """
    
    def __init__(self, vocab_size=30000, embed_dim=768, num_heads=12, num_layers=6):
        super().__init__()
        self.embed_dim = embed_dim
        
        # Embeddings
        self.token_embeddings = nn.Embedding(vocab_size, embed_dim)
        self.position_embeddings = nn.Embedding(512, embed_dim)
        
        # Transformer layers
        self.layers = nn.ModuleList([
            nn.TransformerEncoderLayer(
                d_model=embed_dim,
                nhead=num_heads,
                dim_feedforward=embed_dim * 4,
                dropout=0.1,
                batch_first=True
            ) for _ in range(num_layers)
        ])
        
        # Classification head
        self.classifier = nn.Linear(embed_dim, 2)  # Binary classification
        
    def forward(self, input_ids, attention_mask=None):
        seq_len = input_ids.size(1)
        position_ids = torch.arange(seq_len, device=input_ids.device).unsqueeze(0)
        
        # Embeddings
        token_embeds = self.token_embeddings(input_ids)
        pos_embeds = self.position_embeddings(position_ids)
        embeddings = token_embeds + pos_embeds
        
        # Transformer layers
        hidden_states = embeddings
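        # nn.TransformerEncoderLayer expects True at padded positions, the
        # opposite of the 1 = "attend" convention used for attention_mask
        padding_mask = (attention_mask == 0) if attention_mask is not None else None
        for layer in self.layers:
            hidden_states = layer(hidden_states, src_key_padding_mask=padding_mask)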
        
        # Classification (use [CLS] token)
        cls_hidden = hidden_states[:, 0]  # First token
        logits = self.classifier(cls_hidden)
        
        return logits

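def apply_lora_to_model(model, lora_rank=16, lora_alpha=16):
    """
    Freeze the base model and wrap its standalone linear layers with LoRA.

    nn.MultiheadAttention is left untouched: it packs q/k/v into a single
    in_proj_weight and reads out_proj.weight directly, so its out_proj
    cannot be swapped for a LoRALayer. Only linear1/linear2 are wrapped.
    """
    lora_layers = []
    
    # Freeze all base weights; keep the task-specific classifier trainable
    for param_name, param in model.named_parameters():
        param.requires_grad_('classifier' in param_name)
    
    def apply_lora_recursive(module, name=""):
        for child_name, child in module.named_children():
            full_name = f"{name}.{child_name}" if name else child_name
            
            if isinstance(child, nn.MultiheadAttention):
                continue  # see docstring
            elif isinstance(child, nn.Linear):
                # Skip embedding and final classifier layers
                if 'embeddings' not in full_name and 'classifier' not in full_name:
                    # Replace with LoRA layer
                    lora_layer = LoRALayer(child, lora_rank, lora_alpha)
                    setattr(module, child_name, lora_layer)
                    lora_layers.append((full_name, lora_layer))
                    print(f"✅ Applied LoRA to: {full_name}")
            else:
                apply_lora_recursive(child, full_name)
    
    apply_lora_recursive(model)
    return lora_layers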

def count_parameters(model):
    """
    Count trainable and total parameters.
    """
    total_params = sum(p.numel() for p in model.parameters())
    trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
    
    return {
        'total': total_params,
        'trainable': trainable_params,
        'frozen': total_params - trainable_params,
        'efficiency': trainable_params / total_params if total_params > 0 else 0
    }

# Create and analyze model
print("🤖 Creating SimpleBERT model...")
model = SimpleBERT(vocab_size=1000, embed_dim=256, num_heads=8, num_layers=4)

# Count parameters before LoRA
params_before = count_parameters(model)
print(f"\n📊 Parameters before LoRA:")
for key, value in params_before.items():
    if key == 'efficiency':
        print(f"   {key}: {value:.4f} ({value*100:.2f}%)")
    else:
        print(f"   {key}: {value:,}")

# Apply LoRA
print(f"\n🔧 Applying LoRA (rank=16)...")
lora_layers = apply_lora_to_model(model, lora_rank=16, lora_alpha=16)

# Count parameters after LoRA
params_after = count_parameters(model)
print(f"\n📊 Parameters after LoRA:")
for key, value in params_after.items():
    if key == 'efficiency':
        print(f"   {key}: {value:.4f} ({value*100:.2f}%)")
    else:
        print(f"   {key}: {value:,}")

# Test the model
batch_size, seq_len = 4, 128
input_ids = torch.randint(0, 1000, (batch_size, seq_len))
attention_mask = torch.ones(batch_size, seq_len)

with torch.no_grad():
    logits = model(input_ids, attention_mask)
    
print(f"\n🔄 Model test successful:")
print(f"   Input shape: {input_ids.shape}")
print(f"   Output shape: {logits.shape}")
print(f"   LoRA layers applied: {len(lora_layers)}")

# Calculate efficiency improvement
reduction_factor = params_before['total'] / params_after['trainable']
print(f"\n🚀 LoRA Efficiency:")
print(f"   Parameter reduction: {reduction_factor:.1f}x")
print(f"   Training efficiency: {params_after['efficiency']*100:.2f}% parameters trainable")
print(f"   Frozen parameters: {(1 - params_after['efficiency']) * 100:.1f}% (no gradients or optimizer state needed)")

## 🔧 Advanced LoRA Techniques

### 1. Adaptive Rank Selection
Different layers may benefit from different ranks based on their importance.

In [None]:
def analyze_layer_importance():
    """
    Simulate layer importance analysis for adaptive rank selection.
    """
    # Simulated importance scores (in practice, computed from gradients/activations)
    layer_names = [
        'layers.0.self_attn.q_proj',
        'layers.0.self_attn.k_proj', 
        'layers.0.self_attn.v_proj',
        'layers.0.self_attn.out_proj',
        'layers.0.linear1',
        'layers.0.linear2',
        'layers.1.self_attn.q_proj',
        'layers.1.self_attn.k_proj',
        'layers.1.self_attn.v_proj',
        'layers.1.self_attn.out_proj',
        'layers.1.linear1',
        'layers.1.linear2',
    ]
    
    # Simulate importance scores (higher = more important)
    np.random.seed(42)
    importance_scores = np.random.beta(2, 5, len(layer_names))  # Skewed towards lower values
    
    # Assign ranks based on importance
    def assign_adaptive_rank(importance):
        if importance > 0.7:
            return 64
        elif importance > 0.5:
            return 32
        elif importance > 0.3:
            return 16
        elif importance > 0.1:
            return 8
        else:
            return 4
    
    adaptive_ranks = [assign_adaptive_rank(score) for score in importance_scores]
    
    # Compare with fixed rank
    fixed_rank = 16
    fixed_params = len(layer_names) * fixed_rank * (256 + 256)  # Assume 256-dim layers
    adaptive_params = sum(rank * (256 + 256) for rank in adaptive_ranks)
    
    print("🧠 Adaptive Rank Selection Analysis:")
    print("=" * 60)
    print(f"{'Layer':<30} {'Importance':<12} {'Rank':<6}")
    print("=" * 60)
    
    for name, importance, rank in zip(layer_names, importance_scores, adaptive_ranks):
        print(f"{name:<30} {importance:<12.3f} {rank:<6}")
    
    print("\n📊 Parameter Comparison:")
    print(f"   Fixed rank ({fixed_rank}): {fixed_params:,} parameters")
    print(f"   Adaptive ranks: {adaptive_params:,} parameters")
    print(f"   Savings: {((fixed_params - adaptive_params) / fixed_params * 100):.1f}%")
    
    return layer_names, importance_scores, adaptive_ranks

# Visualize adaptive rank selection
layer_names, importance_scores, adaptive_ranks = analyze_layer_importance()

# Create visualization
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 6))

# 1. Importance scores
colors = plt.cm.viridis(importance_scores)
bars1 = ax1.bar(range(len(layer_names)), importance_scores, color=colors)
ax1.set_xlabel('Layer Index')
ax1.set_ylabel('Importance Score')
ax1.set_title('Layer Importance Scores')
ax1.grid(True, alpha=0.3, axis='y')

# Add colorbar
sm = plt.cm.ScalarMappable(cmap=plt.cm.viridis, norm=plt.Normalize(vmin=min(importance_scores), vmax=max(importance_scores)))
sm.set_array([])
cbar = plt.colorbar(sm, ax=ax1, shrink=0.8)
cbar.set_label('Importance Score')

# 2. Adaptive ranks
rank_colors = ['red' if r == 4 else 'orange' if r == 8 else 'yellow' if r == 16 else 'lightgreen' if r == 32 else 'green' for r in adaptive_ranks]
bars2 = ax2.bar(range(len(layer_names)), adaptive_ranks, color=rank_colors, alpha=0.7)
ax2.set_xlabel('Layer Index')
ax2.set_ylabel('LoRA Rank')
ax2.set_title('Adaptive Rank Assignment')
ax2.grid(True, alpha=0.3, axis='y')

# Add rank labels
for i, bar in enumerate(bars2):
    height = bar.get_height()
    ax2.text(bar.get_x() + bar.get_width()/2., height + 0.5,
             f'{int(height)}', ha='center', va='bottom', fontsize=9)

plt.tight_layout()
plt.show()

## 🎯 Practical Exercises

### Exercise 1: LoRA Rank Experiment
Implement and compare different LoRA ranks on a simple task.

In [None]:
def lora_rank_experiment():
    """
    Compare different LoRA ranks on a synthetic task.
    """
    print("🧪 LoRA Rank Experiment")
    print("=" * 50)
    
    # Create synthetic data
    torch.manual_seed(42)
    X = torch.randn(1000, 256)
    y = (X.sum(dim=1) > 0).float()  # Simple binary classification
    
    ranks_to_test = [1, 4, 8, 16, 32, 64]
    results = []
    
    for rank in ranks_to_test:
        print(f"\n🔄 Testing rank {rank}...")
        
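        # Create model with LoRA
        # Note: a 256 -> 1 layer has a 1 x 256 weight matrix, so any update to it
        # has rank at most 1. Ranks above 1 add parameters but no extra capacity,
        # which is why the reported "reduction" is at most 1 here.
        original_layer = nn.Linear(256, 1)
        lora_layer = LoRALayer(original_layer, rank=rank, alpha=rank)
        
        # Simple training loop (optimize only the trainable LoRA parameters)
        trainable = [p for p in lora_layer.parameters() if p.requires_grad]
        optimizer = torch.optim.Adam(trainable, lr=0.01)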
        criterion = nn.BCEWithLogitsLoss()
        
        losses = []
        for epoch in range(50):
            optimizer.zero_grad()
            logits = lora_layer(X).squeeze()
            loss = criterion(logits, y)
            loss.backward()
            optimizer.step()
            losses.append(loss.item())
        
        # Evaluate
        with torch.no_grad():
            logits = lora_layer(X).squeeze()
            predictions = (torch.sigmoid(logits) > 0.5).float()
            accuracy = (predictions == y).float().mean().item()
        
        param_count = lora_layer.get_parameter_count()
        
        results.append({
            'rank': rank,
            'final_loss': losses[-1],
            'accuracy': accuracy,
            'lora_params': param_count['lora'],
            'reduction': param_count['reduction']
        })
        
        print(f"   Final loss: {losses[-1]:.4f}")
        print(f"   Accuracy: {accuracy:.4f}")
        print(f"   Parameters: {param_count['lora']:,} ({param_count['reduction']:.1f}x reduction)")
    
    return results

# Run experiment
experiment_results = lora_rank_experiment()

# Visualize results
fig, (ax1, ax2, ax3) = plt.subplots(1, 3, figsize=(18, 5))

ranks = [r['rank'] for r in experiment_results]
accuracies = [r['accuracy'] for r in experiment_results]
losses = [r['final_loss'] for r in experiment_results]
param_counts = [r['lora_params'] for r in experiment_results]

# 1. Accuracy vs Rank
ax1.plot(ranks, accuracies, 'o-', linewidth=2, markersize=8)
ax1.set_xlabel('LoRA Rank')
ax1.set_ylabel('Accuracy')
ax1.set_title('Accuracy vs LoRA Rank')
ax1.grid(True, alpha=0.3)
ax1.set_ylim([0.5, 1.0])

# 2. Loss vs Rank
ax2.plot(ranks, losses, 'o-', color='red', linewidth=2, markersize=8)
ax2.set_xlabel('LoRA Rank')
ax2.set_ylabel('Final Loss')
ax2.set_title('Final Loss vs LoRA Rank')
ax2.grid(True, alpha=0.3)

# 3. Parameter count vs performance
scatter = ax3.scatter(param_counts, accuracies, c=ranks, s=100, cmap='viridis', alpha=0.7)
ax3.set_xlabel('LoRA Parameters')
ax3.set_ylabel('Accuracy')
ax3.set_title('Accuracy vs Parameter Count')
ax3.grid(True, alpha=0.3)
plt.colorbar(scatter, ax=ax3, label='LoRA Rank')

plt.tight_layout()
plt.show()

print("\n📊 Experiment Summary:")
print("=" * 70)
print(f"{'Rank':<6} {'Accuracy':<10} {'Loss':<10} {'Params':<10} {'Reduction':<10}")
print("=" * 70)
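for result in experiment_results:
    # Pad the value together with its unit so the column lines up with the header
    reduction_str = f"{result['reduction']:.1f}x"
    print(f"{result['rank']:<6} {result['accuracy']:<10.4f} {result['final_loss']:<10.4f} "
          f"{result['lora_params']:<10,} {reduction_str:<10}")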

## 💡 Key Takeaways

### LoRA Advantages:
1. **Extreme Parameter Efficiency**: Up to 10,000× fewer trainable parameters (reported for GPT-3 175B); the ratio depends on model size, rank, and which layers are adapted
2. **Memory Efficiency**: Significant reduction in GPU memory requirements
3. **Storage Efficiency**: Store multiple task-specific adaptations easily
4. **Training Speed**: Less gradient and optimizer work per step, since only A and B are updated
5. **Modularity**: Easy to combine multiple LoRA modules
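
The storage and modularity points can be sketched with a shared base matrix and a dictionary of per-task adapters. This is a minimal NumPy illustration under assumed sizes; the task names and the helpers `merged_weight` and `unmerged_weight` are hypothetical, and the adapters are random stand-ins for trained ones.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r = 32, 32, 4
W0 = rng.normal(size=(d, k))  # one shared copy of the base weights

# Hypothetical per-task adapters: each stores only B (d x r) and A (r x k)
adapters = {
    task: (rng.normal(size=(d, r)), rng.normal(size=(r, k)))
    for task in ("sentiment", "summarization")
}


def merged_weight(W: np.ndarray, adapter: tuple) -> np.ndarray:
    """Fold one adapter into the weights."""
    B, A = adapter
    return W + B @ A


def unmerged_weight(W_merged: np.ndarray, adapter: tuple) -> np.ndarray:
    """Remove a previously merged adapter to recover the base weights."""
    B, A = adapter
    return W_merged - B @ A


W_task = merged_weight(W0, adapters["sentiment"])
W_back = unmerged_weight(W_task, adapters["sentiment"])

per_task_values = r * (d + k)
print(f"Base values: {W0.size}, values stored per task: {per_task_values}")
print("Base recovered after unmerging:", np.allclose(W_back, W0))
```

Each additional task costs only the adapter's r × (d + k) values, and switching tasks amounts to subtracting one low-rank product and adding another.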

### Best Practices:
1. **Rank Selection**: Start with rank 16, adjust based on task complexity
2. **Alpha Scaling**: α = rank (a scaling factor α/r of 1) is a simple starting point; tune it together with the learning rate
3. **Layer Selection**: Apply to attention layers first, then FFN layers
4. **Initialization**: Zero-initialize B, random-initialize A
5. **Dropout**: Use dropout on input to LoRA layers
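
The initialization and scaling practices above have a simple consequence that is easy to verify: with B set to zero, the adapted layer starts out identical to the frozen layer, whatever A contains. The NumPy sketch below assumes small arbitrary sizes and α = rank.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r, alpha = 16, 16, 4, 4
scaling = alpha / r  # equals 1.0 when alpha == rank

W0 = rng.normal(size=(d, k))  # frozen base weights
A = rng.normal(size=(r, k))   # random initialization
B = np.zeros((d, r))          # zero initialization

x = rng.normal(size=k)
y_base = W0 @ x
y_lora = W0 @ x + scaling * (B @ (A @ x))

print("Scaling factor:", scaling)
print("Adapted output equals base output at step 0:", np.allclose(y_base, y_lora))
```

Training therefore begins from the pre-trained model's behavior, and the update grows only as B moves away from zero.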

### When to Use LoRA:
- **Large Model Fine-tuning**: When full fine-tuning is computationally prohibitive
- **Multiple Tasks**: When you need task-specific adaptations
- **Limited Resources**: When GPU memory or storage is constrained
- **Quick Iteration**: When you need to experiment with many variations

## 🚀 Next Steps

1. **Explore QLoRA**: Combine LoRA with quantization for extreme efficiency
2. **Try AdaLoRA**: Adaptive rank selection during training
3. **Experiment with DoRA**: Weight-Decomposed Low-Rank Adaptation, which splits each weight into magnitude and direction and applies LoRA to the direction
4. **Study LoRA+**: Uses different learning rates for the A and B matrices
5. **Apply to Real Tasks**: Use LoRA on actual NLP tasks with HuggingFace PEFT

**LoRA has changed how we approach large model adaptation, making it possible to fine-tune very large models efficiently on widely available hardware. It's an essential technique for any modern NLP practitioner!** 🎯