Copyright (C) 2025 Advanced Micro Devices, Inc. All rights reserved.
Portions of this notebook consist of AI-generated content.

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

# Lab 7: LoRA Fine-Tuning - Parameter-Efficient Model Adaptation

## Lab Overview

Welcome to an in-depth exploration of Low-Rank Adaptation (LoRA), a parameter-efficient fine-tuning technique that adapts large language models by training small low-rank matrices while the pre-trained weights stay frozen. This lab moves from the mathematical foundations to a from-scratch PyTorch implementation.

**Lab Goal**: Master LoRA implementation and application for efficient model adaptation, including mathematical foundations, architectural integration, and performance optimization.

## Learning Objectives

By the end of this lab, you will be able to:

1. **Understand LoRA Theory**: Grasp the mathematical foundations of low-rank matrix decomposition
2. **Implement LoRA Layers**: Build LoRA components from scratch with proper initialization
3. **Apply Parameter Efficiency**: Reduce trainable parameters by orders of magnitude
4. **Integrate with Transformers**: Apply LoRA to attention and feedforward layers
5. **Analyze Trade-offs**: Understand rank selection and performance implications

---

## 1. Environment Setup

In [None]:
# Core libraries for LoRA implementation
import math
from typing import Optional, Union

import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

# Configure device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")
print(f"PyTorch version: {torch.__version__}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"GPU Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")

# Set random seeds for reproducibility
torch.manual_seed(42)
np.random.seed(42)

AMD GPU environment initialized successfully
Using device: cuda
PyTorch version: 2.7.0
GPU: AMD Radeon Graphics
GPU Memory: 65.2 GB


## 2. LoRA Mathematical Foundations

Low-Rank Adaptation leverages the insight that the weight changes needed to adapt a pre-trained network often lie in low-dimensional subspaces. Instead of updating full weight matrices, LoRA expresses each update as the product of two much smaller matrices.

**Core Mathematical Concept:**

**Traditional Fine-tuning:**
- Update full weight matrix: `W_new = W_original + ΔW`
- Parameters to train: `d × k` (full matrix size)
- Memory requirement: Store and update entire weight matrix

**LoRA Approach:**
- Decompose update: `ΔW = A × B^T`
- Where `A ∈ R^(d×r)` and `B ∈ R^(k×r)`
- Parameters to train: `r × (d + k)` where `r << min(d,k)`
- Memory requirement: Only store and update A and B matrices

**Key Benefits:**

**Parameter Efficiency:**
- Reduction ratio: `(d × k) / (r × (d + k))`
- For large matrices and small ranks: roughly 100-1000x fewer trainable parameters per adapted matrix
- Example: a 4096×4096 matrix has about 16.8M parameters; LoRA with r=16 trains only 131,072 (128x fewer)
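
The reduction ratio above is plain arithmetic, so it can be checked directly. The following is a minimal sketch; the helper names `lora_param_count` and `reduction_ratio` are illustrative and not part of the lab code.

```python
def lora_param_count(d: int, k: int, r: int) -> int:
    """Trainable parameters in A (d x r) plus B (k x r)."""
    return r * (d + k)


def reduction_ratio(d: int, k: int, r: int) -> float:
    """Full-matrix parameter count divided by the LoRA parameter count."""
    return (d * k) / lora_param_count(d, k, r)


# Sweep a few ranks for a 4096 x 4096 weight matrix
d = k = 4096
for r in (1, 4, 16, 64):
    print(
        f"r={r:>2}: {lora_param_count(d, k, r):>9,} LoRA params, "
        f"{reduction_ratio(d, k, r):.0f}x fewer than {d * k:,}"
    )
```

Because the LoRA count grows linearly in `r` while the full count is fixed, doubling the rank halves the reduction ratio.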

**Mathematical Properties:**
- **Rank Control**: Parameter `r` controls adaptation expressiveness
- **Scaling Factor**: Alpha parameter `α` controls adaptation strength; the implementation in this lab scales the update by `α / r`
- **Initialization**: Proper initialization ensures training stability
- **Composability**: Multiple LoRA adapters can be combined or switched

**Computational Advantages:**
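- **Forward Pass**: `y = Wx + (α/r)(x A B^T) = Wx + (α/r)((xA)B^T)`, where `α/r` is the scaling used in this lab's implementation
- **Memory**: Intermediate computation `xA` has shape `(batch, r)`
- **Efficiency**: Computing `(xA)B^T` costs on the order of `r × (d + k)` multiply-adds per input row, versus `d × k` for a full-size update matrix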
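
To make the multiplication-order argument concrete, the sketch below counts multiply-accumulate operations for the two ways of evaluating `x A B^T`. The function names and the choice of 128 input rows are assumptions made for illustration; the counts ignore memory traffic and hardware effects.

```python
def matmul_cost(m: int, n: int, p: int) -> int:
    """Multiply-accumulate operations for an (m x n) @ (n x p) product."""
    return m * n * p


def lora_forward_costs(rows: int, d: int, k: int, r: int) -> dict:
    # Order 1: (x @ A) @ B^T never forms the d x k update matrix
    factored = matmul_cost(rows, d, r) + matmul_cost(rows, r, k)
    # Order 2: x @ (A @ B^T) materializes the full d x k update first
    materialized = matmul_cost(d, r, k) + matmul_cost(rows, d, k)
    return {"factored": factored, "materialized": materialized}


costs = lora_forward_costs(rows=128, d=4096, k=4096, r=16)
print(f"factored:     {costs['factored']:,} MACs")
print(f"materialized: {costs['materialized']:,} MACs")
print(f"ratio:        {costs['materialized'] / costs['factored']:.0f}x")
```

The factored order also keeps the largest intermediate at shape `(rows, r)` instead of `(d, k)`.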

In [None]:
# Implement LoRA Layer from Scratch
print("Building LoRA layer with mathematical rigor")


class LoRALayer(nn.Module):
    """
    Low-Rank Adaptation layer implementing ΔW = A × B^T decomposition.

    Args:
        in_dim: Input dimension
        out_dim: Output dimension
        rank: Rank of the decomposition (r)
        alpha: Scaling factor for LoRA adaptation
        dropout: Dropout probability for regularization
    """

    def __init__(self, in_dim: int, out_dim: int, rank: int, alpha: float = 1.0, dropout: float = 0.0):
        super().__init__()
        self.rank = rank
        self.alpha = alpha
        self.in_dim = in_dim
        self.out_dim = out_dim

        # Initialize A matrix with Gaussian distribution
        # Standard deviation based on rank for stable initialization
        std_dev = 1 / math.sqrt(rank)
        self.A = nn.Parameter(torch.randn(in_dim, rank) * std_dev)

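        # Initialize B matrix with zeros so that ΔW = 0 and the adapted layer
        # starts out identical to the base layer.
        # Stored as (rank, out_dim), i.e. already transposed (B^T in the ΔW = A × B^T notation)
        self.B = nn.Parameter(torch.zeros(rank, out_dim))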

        # Optional dropout for regularization
        self.dropout = nn.Dropout(dropout) if dropout > 0.0 else nn.Identity()

        # Scaling factor - controls adaptation strength
        self.scaling = alpha / rank

    def forward(self, x):
        """
        Forward pass: (x @ A) @ B * scaling, with scaling = alpha / rank.

        self.B has shape (rank, out_dim), so it plays the role of B^T
        in the ΔW = A × B^T notation and no explicit transpose is needed.
        """
        # Apply dropout to input if specified
        x_dropped = self.dropout(x)

        # Multiply by A first so the full (in_dim, out_dim) update matrix
        # is never materialized
        intermediate = x_dropped @ self.A  # Shape: (..., rank)
        output = intermediate @ self.B  # Shape: (..., out_dim)

        return output * self.scaling

    def get_parameter_count(self):
        """Calculate number of trainable parameters"""
        return self.rank * (self.in_dim + self.out_dim)

    def get_compression_ratio(self, original_params):
        """Calculate parameter compression ratio"""
        lora_params = self.get_parameter_count()
        return original_params / lora_params


# Test LoRA layer implementation
print("Testing LoRA layer with different configurations")

# Test configuration
in_dim, out_dim = 512, 256
batch_size, seq_len = 4, 32

# Create test input
test_input = torch.randn(batch_size, seq_len, in_dim).to(device)
print(f"Test input shape: {test_input.shape}")

# Test different ranks
ranks = [4, 16, 64]
for rank in ranks:
    lora = LoRALayer(in_dim, out_dim, rank=rank, alpha=16.0).to(device)
    output = lora(test_input)

    original_params = in_dim * out_dim
    lora_params = lora.get_parameter_count()
    compression = lora.get_compression_ratio(original_params)

    print(f"\nRank {rank}:")
    print(f"  Output shape: {output.shape}")
    print(f"  LoRA parameters: {lora_params:,}")
    print(f"  Original parameters: {original_params:,}")
    print(f"  Compression ratio: {compression:.1f}x")
    print(f"  Output statistics: mean={output.mean().item():.4f}, std={output.std().item():.4f}")

print("\nLoRA layer implementation complete with proper initialization and scaling")

Building LoRA layer with mathematical rigor
Testing LoRA layer with different configurations
Test input shape: torch.Size([4, 32, 512])

Rank 4:
  Output shape: torch.Size([4, 32, 256])
  LoRA parameters: 3,072
  Original parameters: 131,072
  Compression ratio: 42.7x
  Output statistics: mean=0.0000, std=0.0000

Rank 16:
  Output shape: torch.Size([4, 32, 256])
  LoRA parameters: 12,288
  Original parameters: 131,072
  Compression ratio: 10.7x
  Output statistics: mean=0.0000, std=0.0000

Rank 64:
  Output shape: torch.Size([4, 32, 256])
  LoRA parameters: 49,152
  Original parameters: 131,072
  Compression ratio: 2.7x
  Output statistics: mean=0.0000, std=0.0000

LoRA layer implementation complete with proper initialization and scaling
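
The all-zero output statistics above follow from the zero-initialized `B`: the low-rank update starts at exactly zero. The sketch below re-creates the two factors outside the class (same initialization as `LoRALayer`, but with a random regression target that is purely an assumption for illustration) to show what happens at the first training step: `B` receives a non-zero gradient, `A` receives a zero gradient because it is multiplied by `B = 0`, and one update to `B` is enough to activate the adapter.

```python
import math

import torch

torch.manual_seed(0)
in_dim, out_dim, rank, alpha = 512, 256, 4, 16.0

# Same initialization scheme as LoRALayer: Gaussian A, zero B
A = torch.nn.Parameter(torch.randn(in_dim, rank) / math.sqrt(rank))
B = torch.nn.Parameter(torch.zeros(rank, out_dim))
scaling = alpha / rank

x = torch.randn(8, in_dim)
target = torch.randn(8, out_dim)  # arbitrary target, for illustration only

out = (x @ A) @ B * scaling
print("output is exactly zero at init:", bool((out == 0).all()))

loss = torch.nn.functional.mse_loss(out, target)
loss.backward()

# B's gradient flows through x @ A; A's gradient is multiplied by B = 0
print("grad norm of A:", A.grad.norm().item())
print("grad norm of B:", B.grad.norm().item())

# One plain SGD step moves B away from zero, so the adapter becomes active
with torch.no_grad():
    B -= 0.1 * B.grad
out_after = (x @ A) @ B * scaling
print("output std after one step:", out_after.std().item())
```

In other words, zero output at initialization is the intended starting point, not a sign that the layer is broken.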


## 3. LoRA Integration with Existing Layers

Now we'll create a wrapper that integrates LoRA with existing linear layers, enabling parameter-efficient adaptation of pre-trained models.

**Integration Strategy:**

**Additive Adaptation:**
- Original computation: `y = Wx + b`
- With LoRA: `y = Wx + b + ΔW·x` where `ΔW = AB^T`
- Final form: `y = Wx + b + (α/r)(xA)B^T`, using the `α/r` scaling from the LoRA layer

**Design Principles:**
- **Frozen Base**: Original layer parameters remain unchanged
- **Additive Updates**: LoRA output is added to original output
- **Selective Application**: Apply LoRA only to chosen layers
- **Multiple Adapters**: Support for multiple task-specific adaptations

**Implementation Considerations:**
- **Parameter Freezing**: Ensure base model parameters don't update
- **Gradient Flow**: LoRA parameters receive gradients, base parameters don't
- **Memory Efficiency**: Avoid storing large intermediate matrices
- **Computational Efficiency**: Optimize matrix multiplication order
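
One consequence of the additive form is that a trained update can be folded into the base weight, giving a single matrix with the same outputs. This lab does not use merging elsewhere; the sketch below only checks the algebra with stand-in tensors. The random values for `A` and `B` are assumptions that play the role of trained factors, and `B` uses the `(rank, out_dim)` layout of the lab's `LoRALayer`.

```python
import torch

torch.manual_seed(0)
in_dim, out_dim, rank, alpha = 10, 2, 2, 4.0
scaling = alpha / rank

# Stand-ins for a frozen nn.Linear (weight stored as (out_dim, in_dim)) ...
W = torch.randn(out_dim, in_dim)
b = torch.randn(out_dim)
# ... and for trained LoRA factors (random here so the update is non-zero)
A = torch.randn(in_dim, rank)
B = torch.randn(rank, out_dim)

x = torch.randn(3, in_dim)

# Two-branch form: base output plus scaled low-rank correction
y_branches = x @ W.T + b + scaling * ((x @ A) @ B)

# Merged form: fold the update into the base weight once
W_merged = W + scaling * (A @ B).T
y_merged = x @ W_merged.T + b

print("outputs match:", torch.allclose(y_branches, y_merged, atol=1e-4))
```

The transpose in `W_merged` is needed because `nn.Linear` stores its weight as `(out_features, in_features)`, while `A @ B` has shape `(in_dim, out_dim)`.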

In [3]:
# Implement LoRA-Enhanced Linear Layer
print("Creating LinearWithLoRA for seamless integration")


class LinearWithLoRA(nn.Module):
    """
    Linear layer enhanced with LoRA adaptation.
    Combines frozen pre-trained weights with trainable low-rank adaptation.
    """

    def __init__(self, linear_layer: nn.Linear, rank: int, alpha: float = 1.0, dropout: float = 0.0):
        super().__init__()

        # Store original linear layer (will be frozen)
        self.linear = linear_layer

        # Create LoRA adaptation
        self.lora = LoRALayer(
            in_dim=linear_layer.in_features, out_dim=linear_layer.out_features, rank=rank, alpha=alpha, dropout=dropout
        )

        # Freeze original layer parameters by default
        self.freeze_original_parameters()

    def freeze_original_parameters(self):
        """Freeze original linear layer parameters"""
        for param in self.linear.parameters():
            param.requires_grad = False

    def unfreeze_original_parameters(self):
        """Unfreeze original linear layer parameters (for comparison)"""
        for param in self.linear.parameters():
            param.requires_grad = True

    def forward(self, x):
        """
        Forward pass combining original and LoRA outputs
        """
        # Original computation (frozen)
        original_output = self.linear(x)

        # LoRA adaptation (trainable)
        lora_output = self.lora(x)

        # Combine outputs
        return original_output + lora_output

    def get_parameter_analysis(self):
        """Analyze parameter distribution"""
        # Original parameters
        original_params = sum(p.numel() for p in self.linear.parameters())
        original_trainable = sum(p.numel() for p in self.linear.parameters() if p.requires_grad)

        # LoRA parameters
        lora_params = sum(p.numel() for p in self.lora.parameters())
        lora_trainable = sum(p.numel() for p in self.lora.parameters() if p.requires_grad)

        total_params = original_params + lora_params
        total_trainable = original_trainable + lora_trainable

        return {
            "original_params": original_params,
            "original_trainable": original_trainable,
            "lora_params": lora_params,
            "lora_trainable": lora_trainable,
            "total_params": total_params,
            "total_trainable": total_trainable,
            "efficiency_ratio": total_params / total_trainable if total_trainable > 0 else float("inf"),
        }


# Demonstrate LoRA integration
print("Testing LoRA integration with existing linear layers")

# Create original linear layer
torch.manual_seed(123)  # For reproducible comparison
original_layer = nn.Linear(10, 2).to(device)
test_input = torch.randn(1, 10).to(device)

print(f"Original layer parameters: {sum(p.numel() for p in original_layer.parameters())}")
print(f"Test input shape: {test_input.shape}")

# Get original output for comparison
with torch.no_grad():
    original_output = original_layer(test_input)
    print(f"Original output: {original_output}")

# Create LoRA-enhanced version
lora_layer = LinearWithLoRA(original_layer, rank=2, alpha=4.0).to(device)

# Analyze parameters
analysis = lora_layer.get_parameter_analysis()
print("\nParameter Analysis:")
print(f"  Original parameters: {analysis['original_params']} (trainable: {analysis['original_trainable']})")
print(f"  LoRA parameters: {analysis['lora_params']} (trainable: {analysis['lora_trainable']})")
print(f"  Total parameters: {analysis['total_params']} (trainable: {analysis['total_trainable']})")
print(f"  Efficiency ratio: {analysis['efficiency_ratio']:.1f}x parameter reduction")

# Test LoRA output
lora_output = lora_layer(test_input)
print(f"\nLoRA-enhanced output: {lora_output}")

# Verify the wrapper is a no-op at initialization: B starts at zero, so ΔW = 0
# and the output matches the original layer until training updates B
difference = (lora_output - original_output).abs().mean().item()
print(f"Mean absolute difference: {difference:.6f}")
print("Zero difference is expected at initialization because B is zero-initialized")

Creating LinearWithLoRA for seamless integration
Testing LoRA integration with existing linear layers
Original layer parameters: 22
Test input shape: torch.Size([1, 10])
Original output: tensor([[0.6639, 0.4487]], device='cuda:0')

Parameter Analysis:
  Original parameters: 22 (trainable: 0)
  LoRA parameters: 24 (trainable: 24)
  Total parameters: 46 (trainable: 24)
  Efficiency ratio: 1.9x parameter reduction

LoRA-enhanced output: tensor([[0.6639, 0.4487]], device='cuda:0', grad_fn=<AddBackward0>)
Mean absolute difference: 0.000000
Zero difference is expected at initialization because B is zero-initialized


In [4]:
# Demonstrate Parameter Efficiency Across Different Configurations
print("Analyzing parameter efficiency across different LoRA configurations")

# Test different rank and alpha combinations
configurations = [
    {"rank": 1, "alpha": 1.0},
    {"rank": 2, "alpha": 4.0},
    {"rank": 4, "alpha": 8.0},
    {"rank": 8, "alpha": 16.0},
]

# Create base layer for testing
base_layer = nn.Linear(512, 256).to(device)
test_batch = torch.randn(8, 32, 512).to(device)  # Batch of sequences

print(f"Base layer: {base_layer.in_features} -> {base_layer.out_features}")
print(f"Test input shape: {test_batch.shape}")
print()

# Original layer analysis
original_params = sum(p.numel() for p in base_layer.parameters())
print(f"Original layer parameters: {original_params:,}")

# Test each configuration
results = []
for config in configurations:
    # Create LoRA-enhanced layer
    lora_enhanced = LinearWithLoRA(base_layer, rank=config["rank"], alpha=config["alpha"]).to(device)

    # Get analysis
    analysis = lora_enhanced.get_parameter_analysis()

    # Test forward pass
    with torch.no_grad():
        original_out = base_layer(test_batch)
        lora_out = lora_enhanced(test_batch)

        # Measure adaptation magnitude
        adaptation_magnitude = (lora_out - original_out).norm().item()
        adaptation_relative = adaptation_magnitude / original_out.norm().item()

    # Store results
    result = {
        **config,
        **analysis,
        "adaptation_magnitude": adaptation_magnitude,
        "adaptation_relative": adaptation_relative,
    }
    results.append(result)

    print(f"Rank {config['rank']}, Alpha {config['alpha']}:")
    print(f"  Trainable parameters: {analysis['total_trainable']:,} ({analysis['efficiency_ratio']:.1f}x reduction)")
    print(f"  Adaptation magnitude: {adaptation_magnitude:.4f}")
    print(f"  Relative adaptation: {adaptation_relative:.4f} ({adaptation_relative * 100:.2f}%)")
    print()

# Summary analysis
print("Configuration Comparison Summary:")
print("=" * 60)
print(f"{'Rank':<6} {'Alpha':<8} {'Trainable':<12} {'Reduction':<12} {'Adapt%':<10}")
print("=" * 60)

for result in results:
    print(
        f"{result['rank']:<6} {result['alpha']:<8.1f} {result['total_trainable']:<12,} "
        f"{result['efficiency_ratio']:<12.1f}x {result['adaptation_relative'] * 100:<10.2f}%"
    )

print("\nKey Insights:")
print("- Trainable parameters grow linearly with rank: rank * (in_features + out_features)")
print("- The update is scaled by alpha / rank, so alpha and rank jointly set its strength")
print("- Adaptation is exactly zero here because B is zero-initialized; it becomes non-zero only after training")
print("- For this 512x256 layer, ranks 1-8 leave roughly 22x-172x fewer trainable parameters")

Analyzing parameter efficiency across different LoRA configurations
Base layer: 512 -> 256
Test input shape: torch.Size([8, 32, 512])

Original layer parameters: 131,328
Rank 1, Alpha 1.0:
  Trainable parameters: 768 (172.0x reduction)
  Adaptation magnitude: 0.0000
  Relative adaptation: 0.0000 (0.00%)

Rank 2, Alpha 4.0:
  Trainable parameters: 1,536 (86.5x reduction)
  Adaptation magnitude: 0.0000
  Relative adaptation: 0.0000 (0.00%)

Rank 4, Alpha 8.0:
  Trainable parameters: 3,072 (43.8x reduction)
  Adaptation magnitude: 0.0000
  Relative adaptation: 0.0000 (0.00%)

Rank 8, Alpha 16.0:
  Trainable parameters: 6,144 (22.4x reduction)
  Adaptation magnitude: 0.0000
  Relative adaptation: 0.0000 (0.00%)

Configuration Comparison Summary:
Rank   Alpha    Trainable    Reduction    Adapt%    
1      1.0      768          172.0       x 0.00      %
2      4.0      1,536        86.5        x 0.00      %
4      8.0      3,072        43.8        x 0.00      %
8      16.0     6,144        2

## 4. Multi-Layer Network Integration

Now we'll apply LoRA to complete neural networks, demonstrating how to selectively adapt different layers and analyze the impact on model behavior.

**Integration Strategies:**

**Selective Adaptation:**
- **All Layers**: Apply LoRA to every linear layer
- **Output Layers**: Only adapt final classification/output layers
- **Attention Layers**: Focus on transformer attention projections (Q, K, V, O)
- **Feedforward Layers**: Adapt only feedforward network components

**Layer-Specific Configuration:**
- **Different Ranks**: Use varying ranks for different layer types
- **Adaptive Alpha**: Scale adaptation strength per layer
- **Targeted Adaptation**: Apply LoRA based on layer importance

**Network Architecture Considerations:**
- **Parameter Distribution**: Where are most parameters located?
- **Gradient Flow**: Which layers benefit most from adaptation?
- **Task Relevance**: Which layers are most important for target tasks?
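
The parameter-distribution question can be answered with arithmetic before building anything. The sketch below assumes the 100 → 200 → 300 → 50 perceptron used in this section, a rank of 4, and frozen base layers; the dictionary and helper names are illustrative only.

```python
# Layer shapes (in_features, out_features) of the 100 -> 200 -> 300 -> 50 MLP
layer_shapes = {
    "input_proj": (100, 200),
    "hidden_transform": (200, 300),
    "output_proj": (300, 50),
}


def linear_params(shape):
    """Weights plus biases of a dense layer."""
    n_in, n_out = shape
    return n_in * n_out + n_out


def lora_params(shape, rank):
    """Parameters in the two low-rank factors; biases are not adapted."""
    n_in, n_out = shape
    return rank * (n_in + n_out)


strategies = {
    "output_only": ["output_proj"],
    "input_output": ["input_proj", "output_proj"],
    "all": list(layer_shapes),
}

base_total = sum(linear_params(s) for s in layer_shapes.values())
lora_totals = {
    name: sum(lora_params(layer_shapes[layer], rank=4) for layer in adapted)
    for name, adapted in strategies.items()
}

for name, trainable in lora_totals.items():
    share = 100 * trainable / base_total
    print(f"{name:<13} LoRA params: {trainable:>5,} ({share:.2f}% of {base_total:,} base params)")
```

The hidden layer holds most of the base parameters, yet its LoRA adapter is only modestly larger than the others because adapter size depends on `in_features + out_features`, not on their product.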

In [5]:
# Create and Analyze Multi-Layer Network with LoRA
print("Building comprehensive multi-layer network for LoRA integration")


class MultilayerPerceptron(nn.Module):
    """
    Multi-layer perceptron for demonstrating LoRA integration strategies
    """

    def __init__(self, num_features: int, num_hidden_1: int, num_hidden_2: int, num_classes: int):
        super().__init__()

        # Define network layers
        self.layers = nn.Sequential(
            nn.Linear(num_features, num_hidden_1),  # Input projection
            nn.ReLU(),
            nn.Linear(num_hidden_1, num_hidden_2),  # Hidden transformation
            nn.ReLU(),
            nn.Linear(num_hidden_2, num_classes),  # Output projection
        )

        # Layer names for analysis
        self.layer_names = ["input_proj", "hidden_transform", "output_proj"]

    def forward(self, x):
        return self.layers(x)

    def get_linear_layers(self):
        """Extract linear layers for LoRA application"""
        linear_layers = []
        for i, layer in enumerate(self.layers):
            if isinstance(layer, nn.Linear):
                linear_layers.append((i, layer))
        return linear_layers

    def analyze_parameters(self):
        """Analyze parameter distribution across layers"""
        analysis = {}
        total_params = 0

        for i, layer in enumerate(self.layers):
            if isinstance(layer, nn.Linear):
                layer_params = sum(p.numel() for p in layer.parameters())
                layer_name = self.layer_names[i // 2]  # Account for ReLU layers
                analysis[layer_name] = {
                    "layer_index": i,
                    "parameters": layer_params,
                    "shape": (layer.in_features, layer.out_features),
                }
                total_params += layer_params

        # Add percentage information
        for layer_info in analysis.values():
            layer_info["percentage"] = 100 * layer_info["parameters"] / total_params

        analysis["total_parameters"] = total_params
        return analysis


# Create test network
print("Creating multi-layer perceptron")
model = MultilayerPerceptron(num_features=100, num_hidden_1=200, num_hidden_2=300, num_classes=50).to(device)

print("Network architecture:")
for i, (name, module) in enumerate(model.named_modules()):
    if isinstance(module, nn.Linear):
        print(f"  {name}: {module.in_features} -> {module.out_features}")

# Analyze original network
original_analysis = model.analyze_parameters()
print("\nOriginal network parameter analysis:")
print(f"Total parameters: {original_analysis['total_parameters']:,}")

for layer_name, info in original_analysis.items():
    if isinstance(info, dict) and "parameters" in info:
        print(f"  {layer_name}: {info['parameters']:,} params ({info['percentage']:.1f}%) - {info['shape']}")

# Create test data
batch_size = 16
test_data = torch.randn(batch_size, 100).to(device)
print(f"\nTest data shape: {test_data.shape}")

# Get original output for comparison
with torch.no_grad():
    original_output = model(test_data)
    print(f"Original output shape: {original_output.shape}")
    print(
        f"Original output statistics: mean={original_output.mean().item():.4f}, std={original_output.std().item():.4f}"
    )

print("\nNetwork ready for LoRA integration")

Building comprehensive multi-layer network for LoRA integration
Creating multi-layer perceptron
Network architecture:
  layers.0: 100 -> 200
  layers.2: 200 -> 300
  layers.4: 300 -> 50

Original network parameter analysis:
Total parameters: 95,550
  input_proj: 20,200 params (21.1%) - (100, 200)
  hidden_transform: 60,300 params (63.1%) - (200, 300)
  output_proj: 15,050 params (15.8%) - (300, 50)

Test data shape: torch.Size([16, 100])
Original output shape: torch.Size([16, 50])
Original output statistics: mean=-0.0083, std=0.1189

Network ready for LoRA integration


In [None]:
# Apply LoRA to Multi-Layer Network with Different Strategies
print("Implementing various LoRA integration strategies")


def apply_lora_to_network(model, strategy="all", rank=4, alpha=8.0):
    """
    Apply LoRA to network layers based on different strategies

    Args:
        model: The neural network model
        strategy: 'all', 'output_only', 'input_output', or 'selective'
        rank: LoRA rank parameter
        alpha: LoRA alpha parameter
    """
    linear_layers = model.get_linear_layers()

    if strategy == "all":
        # Apply LoRA to all linear layers
        indices_to_modify = [i for i, _ in linear_layers]
    elif strategy == "output_only":
        # Apply LoRA only to output layer
        indices_to_modify = [linear_layers[-1][0]]
    elif strategy == "input_output":
        # Apply LoRA to input and output layers
        indices_to_modify = [linear_layers[0][0], linear_layers[-1][0]]
    elif strategy == "selective":
        # Apply LoRA with different ranks based on layer size
        # Larger layers get higher rank
        indices_to_modify = []
        for i, layer in linear_layers:
            layer_size = layer.in_features * layer.out_features
            if layer_size > 10000:  # Large layers
                layer_rank = rank * 2
            else:
                layer_rank = rank
            indices_to_modify.append((i, layer_rank))
    else:
        raise ValueError(f"Unknown strategy: {strategy}")

    # Apply LoRA transformations
    modified_count = 0
    for item in indices_to_modify:
        if isinstance(item, tuple):
            i, layer_rank = item
        else:
            i, layer_rank = item, rank

        original_layer = model.layers[i]
        lora_layer = LinearWithLoRA(original_layer, rank=layer_rank, alpha=alpha).to(device)
        model.layers[i] = lora_layer
        modified_count += 1

    return modified_count


# Test different LoRA application strategies
strategies = ["output_only", "input_output", "all"]

for strategy in strategies:
    print(f"\n{'=' * 50}")
    print(f"Testing strategy: {strategy.upper()}")
    print(f"{'=' * 50}")

    # Create a freshly initialized model for each strategy
    # (new random weights, not a copy of `model`)
    test_model = MultilayerPerceptron(100, 200, 300, 50).to(device)

    # Apply LoRA
    modified_layers = apply_lora_to_network(test_model, strategy=strategy, rank=4, alpha=8.0)
    print(f"Modified {modified_layers} layers with LoRA")

    # Analyze parameter efficiency
    # Note: layers without LoRA are not frozen here, so they still count as trainable
    total_params = sum(p.numel() for p in test_model.parameters())
    trainable_params = sum(p.numel() for p in test_model.parameters() if p.requires_grad)
    efficiency_ratio = total_params / trainable_params if trainable_params > 0 else float("inf")

    print("Parameter analysis:")
    print(f"  Total parameters: {total_params:,}")
    print(f"  Trainable parameters: {trainable_params:,}")
    print(f"  Efficiency ratio: {efficiency_ratio:.1f}x")
    print(f"  Percentage trainable: {100 * trainable_params / total_params:.2f}%")

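    # Test model functionality
    # Note: original_output comes from the differently initialized `model`, so this
    # difference mostly reflects the new random base weights; LoRA itself
    # contributes zero at initialization because B = 0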
    with torch.no_grad():
        lora_output = test_model(test_data)
        output_change = (lora_output - original_output).norm().item()
        relative_change = output_change / original_output.norm().item()

    print("Output analysis:")
    print(f"  Output change magnitude: {output_change:.4f}")
    print(f"  Relative change: {relative_change:.4f} ({relative_change * 100:.2f}%)")

    # Layer-by-layer analysis
    print("Layer details:")
    for i, (name, module) in enumerate(test_model.named_children()):
        if hasattr(module, "__len__"):  # Sequential module
            for j, layer in enumerate(module):
                if isinstance(layer, LinearWithLoRA):
                    analysis = layer.get_parameter_analysis()
                    print(f"  Layer {j}: LoRA enabled - {analysis['lora_trainable']:,} trainable params")
                elif isinstance(layer, nn.Linear):
                    params = sum(p.numel() for p in layer.parameters())
                    trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
                    print(f"  Layer {j}: Standard linear - {params:,} params ({trainable} trainable)")

print("\nStrategy comparison complete - different approaches offer different trade-offs")

Implementing various LoRA integration strategies

Testing strategy: OUTPUT_ONLY
Modified 1 layers with LoRA
Parameter analysis:
  Total parameters: 96,950
  Trainable parameters: 81,900
  Efficiency ratio: 1.2x
  Percentage trainable: 84.48%
Output analysis:
  Output change magnitude: 4.0566
  Relative change: 1.2044 (120.44%)
Layer details:
  Layer 0: Standard linear - 20,200 params (20200 trainable)
  Layer 2: Standard linear - 60,300 params (60300 trainable)
  Layer 4: LoRA enabled - 1,400 trainable params

Testing strategy: INPUT_OUTPUT
Modified 2 layers with LoRA
Parameter analysis:
  Total parameters: 98,150
  Trainable parameters: 62,900
  Efficiency ratio: 1.6x
  Percentage trainable: 64.09%
Output analysis:
  Output change magnitude: 4.6496
  Relative change: 1.3804 (138.04%)
Layer details:
  Layer 0: LoRA enabled - 1,200 trainable params
  Layer 2: Standard linear - 60,300 params (60300 trainable)
  Layer 4: LoRA enabled - 1,400 trainable params

Testing strategy: ALL
Modifie

In [None]:
# Comprehensive Parameter Management and Training Analysis
print("Implementing advanced parameter management for LoRA training")


def freeze_linear_layers(model, exclude_lora=True):
    """
    Freeze linear layer parameters while optionally preserving LoRA trainability

    Args:
        model: Neural network model
        exclude_lora: If True, keep LoRA parameters trainable
    """
    frozen_params = 0
    trainable_params = 0

    for name, param in model.named_parameters():
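        # LoRA parameters are registered as "<layer>.lora.A" and "<layer>.lora.B";
        # matching on ".lora." avoids false positives from a bare "A" or "B" substring
        if exclude_lora and ".lora." in name:
            # Keep LoRA parameters trainable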
            param.requires_grad = True
            trainable_params += param.numel()
        else:
            # Freeze all other parameters
            param.requires_grad = False
            frozen_params += param.numel()

    return frozen_params, trainable_params


def analyze_gradient_flow(model, sample_input, sample_target):
    """
    Analyze gradient flow through the model to verify LoRA training setup
    """
    model.train()

    # Forward pass
    output = model(sample_input)

    # Create dummy loss
    if len(output.shape) > 1 and output.shape[-1] > 1:
        # Classification scenario
        loss = F.cross_entropy(output, sample_target)
    else:
        # Regression scenario
        loss = F.mse_loss(output, sample_target)

    # Backward pass
    loss.backward()

    # Analyze gradients
    gradient_info = {}
    total_grad_norm = 0
    param_count = 0

    for name, param in model.named_parameters():
        if param.requires_grad and param.grad is not None:
            grad_norm = param.grad.norm().item()
            gradient_info[name] = {"grad_norm": grad_norm, "param_shape": param.shape, "param_count": param.numel()}
            total_grad_norm += grad_norm**2
            param_count += param.numel()

    total_grad_norm = total_grad_norm**0.5

    return {
        "loss": loss.item(),
        "total_grad_norm": total_grad_norm,
        "gradient_info": gradient_info,
        "trainable_param_count": param_count,
    }


# Create final model with comprehensive LoRA integration
print("Creating production-ready LoRA model")
final_model = MultilayerPerceptron(100, 200, 300, 50).to(device)

# Apply LoRA to all layers with different ranks based on layer importance
layer_configs = [
    {"layer_idx": 0, "rank": 8, "alpha": 16.0},  # Input layer - higher rank
    {"layer_idx": 2, "rank": 4, "alpha": 8.0},  # Hidden layer - medium rank
    {"layer_idx": 4, "rank": 8, "alpha": 16.0},  # Output layer - higher rank
]

print("Applying layer-specific LoRA configurations:")
for config in layer_configs:
    layer = final_model.layers[config["layer_idx"]]
    lora_layer = LinearWithLoRA(layer, rank=config["rank"], alpha=config["alpha"]).to(device)
    final_model.layers[config["layer_idx"]] = lora_layer
    print(f"  Layer {config['layer_idx']}: rank={config['rank']}, alpha={config['alpha']}")

# Freeze parameters appropriately
frozen_count, trainable_count = freeze_linear_layers(final_model, exclude_lora=True)

print("\nParameter freezing results:")
print(f"  Frozen parameters: {frozen_count:,}")
print(f"  Trainable parameters: {trainable_count:,}")
print(f"  Total parameters: {frozen_count + trainable_count:,}")
print(f"  Training efficiency: {(frozen_count + trainable_count) / trainable_count:.1f}x reduction")

# Detailed parameter analysis
print("\nDetailed parameter breakdown:")
for name, param in final_model.named_parameters():
    status = "TRAINABLE" if param.requires_grad else "FROZEN"
    print(f"  {name:<30} {str(param.shape):<20} {param.numel():<8,} {status}")

# Test gradient flow
print("\nTesting gradient flow:")
sample_input = torch.randn(4, 100).to(device)
sample_target = torch.randint(0, 50, (4,)).to(device)

gradient_analysis = analyze_gradient_flow(final_model, sample_input, sample_target)

print("Gradient flow analysis:")
print(f"  Loss value: {gradient_analysis['loss']:.6f}")
print(f"  Total gradient norm: {gradient_analysis['total_grad_norm']:.6f}")
print(f"  Trainable parameters with gradients: {gradient_analysis['trainable_param_count']:,}")

print("\nLoRA layer gradient details:")
for name, info in gradient_analysis["gradient_info"].items():
    # LoRA parameters are named "<layer>.lora.A" / "<layer>.lora.B"
    if ".lora." in name:
        print(f"  {name:<25} grad_norm: {info['grad_norm']:.6f}")

# Verify training readiness
trainable_with_grads = len([name for name in gradient_analysis["gradient_info"] if ".lora." in name])

print("\nTraining readiness check:")
print(f"  LoRA parameters with gradients: {trainable_with_grads}")
print(f"  Status: {'READY FOR TRAINING' if trainable_with_grads > 0 else 'CHECK CONFIGURATION'}")

print("\nLoRA integration complete - model ready for parameter-efficient fine-tuning")

Implementing advanced parameter management for LoRA training
Creating production-ready LoRA model
Applying layer-specific LoRA configurations:
  Layer 0: rank=8, alpha=16.0
  Layer 2: rank=4, alpha=8.0
  Layer 4: rank=8, alpha=16.0

Parameter freezing results:
  Frozen parameters: 95,550
  Trainable parameters: 7,200
  Total parameters: 102,750
  Training efficiency: 14.3x reduction

Detailed parameter breakdown:
  layers.0.linear.weight         torch.Size([200, 100]) 20,000   FROZEN
  layers.0.linear.bias           torch.Size([200])    200      FROZEN
  layers.0.lora.A                torch.Size([100, 8]) 800      TRAINABLE
  layers.0.lora.B                torch.Size([8, 200]) 1,600    TRAINABLE
  layers.2.linear.weight         torch.Size([300, 200]) 60,000   FROZEN
  layers.2.linear.bias           torch.Size([300])    300      FROZEN
  layers.2.lora.A                torch.Size([200, 4]) 800      TRAINABLE
  layers.2.lora.B                torch.Size([4, 300]) 1,200    TRAINABLE
  layer
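
Once only the LoRA factors are trainable, a training loop just needs to hand those tensors to the optimizer. The sketch below is self-contained and uses a hypothetical compact class `TinyLoRALinear` in place of the lab's `LinearWithLoRA`; the single-layer model, the random batch, the AdamW optimizer, and the learning rate are all assumptions chosen for illustration, not a recommended recipe.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)


class TinyLoRALinear(nn.Module):
    """Hypothetical compact stand-in for LinearWithLoRA, for this sketch only."""

    def __init__(self, base: nn.Linear, rank: int, alpha: float):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # freeze pre-trained weights
        self.A = nn.Parameter(torch.randn(base.in_features, rank) / rank**0.5)
        self.B = nn.Parameter(torch.zeros(rank, base.out_features))
        self.scaling = alpha / rank

    def forward(self, x):
        return self.base(x) + (x @ self.A) @ self.B * self.scaling


model = nn.Sequential(TinyLoRALinear(nn.Linear(100, 50), rank=4, alpha=8.0))
base_before = model[0].base.weight.detach().clone()

# Hand only the trainable tensors (A and B) to the optimizer
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable, lr=1e-2)

# Synthetic classification batch (random data, for illustration only)
inputs = torch.randn(32, 100)
labels = torch.randint(0, 50, (32,))

for step in range(5):
    optimizer.zero_grad()
    loss = nn.functional.cross_entropy(model(inputs), labels)
    loss.backward()
    optimizer.step()
    print(f"step {step}: loss={loss.item():.4f}")

print("base weight unchanged:", torch.equal(base_before, model[0].base.weight))
print("B moved away from zero:", bool(model[0].B.abs().sum() > 0))
```

Filtering on `requires_grad` keeps optimizer state (for AdamW, two extra buffers per parameter) limited to the small LoRA factors.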

## Lab Summary

### Technical Concepts Learned
- **Low-Rank Decomposition**: Understanding ΔW = A × B^T factorization for parameter-efficient weight updates
- **LoRA Initialization**: Gaussian initialization for A matrix, zero initialization for B, so that ΔW = 0 and the adapted model initially reproduces the base model
- **Parameter Freezing**: Keeping base model weights frozen while only training low-rank adaptation matrices
- **Rank and Alpha Selection**: Controlling adaptation capacity (rank) and strength (alpha/rank scaling factor)
- **Integration Strategies**: Applying LoRA selectively to different layers (all, output-only, attention layers)

### Experiment Further
- Apply LoRA to transformer attention layers (Q, K, V, O projections) and compare efficiency
- Implement QLoRA combining quantization with LoRA for extreme memory efficiency
- Compare different rank values (1, 4, 16, 64) on a text classification task
- Try multiple LoRA adapters on the same base model for multi-task learning
- Experiment with AdaLoRA for automatic rank allocation during training