## Exercises XP: W7_D2

#### What You’ll Learn

- The fundamentals of LoRA (Low-Rank Adaptation) and its application in deep learning models.  
- How to implement LoRA layers in PyTorch and integrate them into existing neural network architectures.  
- The differences between standard linear layers and LoRA-enhanced layers.  
- How to merge LoRA matrices with original weights for efficiency.  
- Freezing original layers to fine-tune only the LoRA layers in a model.

---

#### What You Will Create

- A custom LoRALayer module for efficient model adaptation.  
- A LinearWithLoRA class that applies LoRA to existing linear layers.  
- A modified Multi-Layer Perceptron (MLP) incorporating LoRA layers.  
- A complete training pipeline using LoRA-enhanced models.  
- A workflow to freeze and fine-tune only the LoRA parameters for efficient training.

---

#### LoRA Implementation Exercises

#### Exercise 1: Implementing the LoRALayer

**Objective:**

Implement a custom PyTorch module called *LoRALayer* that introduces low-rank adaptation matrices A and B.

**Instructions:**

1. Create a PyTorch class *LoRALayer* inheriting from *nn.Module*.  
2. Define the *__init__* method to initialize matrices A and B with the appropriate dimensions.  
3. Implement the *forward* method to compute the LoRA transformation on input *x*.  
4. Test the class with a small input tensor to verify its functionality.

---

#### Exercise 2: Implementing the LinearWithLoRA Layer

**Objective:**

Extend a standard PyTorch Linear layer to incorporate the *LoRALayer* for adaptable training.

**Instructions:**

1. Create a new class *LinearWithLoRA* that wraps an existing *nn.Linear* layer.  
2. Add an instance of *LoRALayer* to introduce low-rank adaptation.  
3. Implement the *forward* method to return the sum of the standard Linear transformation and the LoRA adaptation.  
4. Test this new layer with an input tensor.

---

#### Exercise 3: Creating a Small Neural Network and Applying LoRA

**Objective:**

Implement a simple feedforward neural network and apply LoRA to one of its layers.

**Instructions:**

1. Define a single-layer neural network using *nn.Linear*.  
2. Generate a random input tensor to test the layer.  
3. Replace the Linear layer with *LinearWithLoRA* and verify that the outputs remain unchanged initially.

---

#### Exercise 4: Merging LoRA Matrices and Testing Equivalence

**Objective:**

Implement an alternative approach where LoRA matrices are merged with the original weights for efficiency.

**Instructions:**

1. Create a new class *LinearWithLoRAMerged* that computes the combined weight matrix.  
2. Ensure that the output remains the same as *LinearWithLoRA*.  
3. Test with a sample input to verify correctness.

---

#### Exercise 5: Implementing a Multilayer Perceptron (MLP) and Replacing Layers with LoRA

**Objective:**

Extend a simple MLP and modify its layers to use LoRA.

**Instructions:**

1. Implement a 3-layer MLP.  
2. Replace each Linear layer with *LinearWithLoRAMerged*.  
3. Print the model architecture to verify modifications.

---

#### Exercise 6: Freezing the Original Linear Layers and Training LoRA

**Objective:**

Ensure only LoRA layers are trainable and train the model.

**Instructions:**

1. Implement a function to freeze standard Linear layers.  
2. Apply it to the MLP model.  
3. Print trainable parameters to confirm only LoRA layers are trainable.  
4. Train the model on a dataset and evaluate performance.

In [19]:
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision import datasets, transforms
from torch.utils.data import DataLoader
import copy
import time

### Exercise 1: Implementing the LoRALayer

In [None]:
class LoRALayer(nn.Module):
    def __init__(self, in_dim, out_dim, rank, alpha):
        """
        LoRALayer initializes two low-rank matrices A and B
        for efficient fine-tuning of large models.

        Args:
            in_dim (int): Input dimension
            out_dim (int): Output dimension
            rank (int): Rank of the low-rank decomposition
            alpha (float): Scaling factor for LoRA contribution
        """
        super().__init__()

        # Standard deviation for initialization (scaled by rank)
        std_dev = 1 / torch.sqrt(torch.tensor(rank).float())

        # Matrix A: projects input to low-rank space
        self.A = nn.Parameter(torch.randn(in_dim, rank) * std_dev)

        # Matrix B: projects back to output space (initialized as zeros)
        self.B = nn.Parameter(torch.zeros(rank, out_dim))

        # Scaling factor for LoRA contribution
        self.alpha = alpha

    def forward(self, x):
        """
        Forward pass applies low-rank adaptation:
        (x @ A) @ B scaled by alpha

        Args:
            x (Tensor): Input tensor of shape (batch_size, in_dim)

        Returns:
            Tensor: LoRA-transformed output of shape (batch_size, out_dim)
        """
        # Step 1: Project input x into low-rank space using A
        low_rank = x @ self.A

        # Step 2: Project back to output space using B
        lora_output = low_rank @ self.B

        # Step 3: Scale the output by alpha
        return self.alpha * lora_output

### Testing LoRALayer with dummy input

In [None]:
# Set parameters
in_dim = 4      # input dimension
out_dim = 3     # output dimension
rank = 2        # low-rank dimension
alpha = 1.0     # scaling factor

# Create layer
lora_layer = LoRALayer(in_dim, out_dim, rank, alpha)

# Dummy input tensor (batch_size=1)
x = torch.randn(1, in_dim)

# Forward pass
output = lora_layer(x)
print("Input:", x)
print("Output:", output)
print("Output shape:", output.shape)

Input: tensor([[ 0.3072,  0.1348, -1.7869,  3.1623]])
Output: tensor([[0., 0., 0.]], grad_fn=<MulBackward0>)
Output shape: torch.Size([1, 3])


### Manually modify B to test non-zero output

In [4]:
# Manually modify B to test non-zero output
with torch.no_grad():
    lora_layer.B.copy_(torch.randn(rank, out_dim))  # random non-zero B

# Forward pass again
output = lora_layer(x)
print("Output after modifying B:", output)

Output after modifying B: tensor([[-1.2190, -1.2040, -0.5797]], grad_fn=<MulBackward0>)


### Interpretation of the Two Outputs

**First Output (zeros):**

- The initial output of the *LoRALayer* is zero because matrix **B** is initialized with zeros.
- Even if the multiplication *x @ A* produces non-zero values, multiplying by *B* results in zeros.
- This is intentional in LoRA: starting with no contribution ensures the pre-trained weights are not disturbed at the beginning of fine-tuning.

**Second Output (non-zero after modifying B):**

- After manually setting matrix **B** to random values, the output becomes non-zero.
- This confirms that the computation *(x @ A) @ B*, scaled by *alpha*, is working as expected.
- During training, *A* and *B* will learn task-specific adaptations while keeping the original model parameters frozen.
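
The zero initialization of **B** also affects the first gradient step. The sketch below re-declares a compact version of *LoRALayer* so that it runs on its own; the seed, the batch of 5 samples, and the *sum()* loss are arbitrary choices made for illustration. It checks that, while **B** is still zero, the gradient with respect to **A** is zero and only **B** receives a learning signal.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)  # arbitrary seed, only for reproducibility


class LoRALayer(nn.Module):
    """Compact copy of the layer above: A is random, B starts at zero."""

    def __init__(self, in_dim, out_dim, rank, alpha):
        super().__init__()
        self.A = nn.Parameter(torch.randn(in_dim, rank) / rank ** 0.5)
        self.B = nn.Parameter(torch.zeros(rank, out_dim))
        self.alpha = alpha

    def forward(self, x):
        return self.alpha * (x @ self.A @ self.B)


layer = LoRALayer(in_dim=4, out_dim=3, rank=2, alpha=1.0)
x = torch.randn(5, 4)

# Any scalar loss works for this check; the sum is the simplest choice.
layer(x).sum().backward()

# dL/dA is proportional to B, so it vanishes while B is zero.
grad_A_is_zero = bool(torch.all(layer.A.grad == 0))
# dL/dB is proportional to x @ A, which is non-zero for random x and A.
grad_B_is_nonzero = bool(layer.B.grad.abs().sum() > 0)

print("Gradient of A is all zeros:", grad_A_is_zero)
print("Gradient of B is non-zero: ", grad_B_is_nonzero)
```

In other words, **B** moves first, and **A** starts receiving gradient only once **B** has left zero.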

### Exercise 2: Implementing the LinearWithLoRA Layer

#### LinearWithLoRA: Standard Linear + LoRA Adaptation

In [7]:
class LinearWithLoRA(nn.Module):
    def __init__(self, linear: nn.Linear, rank: int, alpha: float):
        """
        Wraps a standard nn.Linear layer and adds a LoRALayer for low-rank adaptation.

        Args:
            linear (nn.Linear): Pre-existing linear layer
            rank (int): Rank for LoRA decomposition
            alpha (float): Scaling factor for LoRA contribution
        """
        super().__init__()
        self.linear = linear  # Standard linear transformation
        self.lora = LoRALayer(
            linear.in_features,  # input dimension
            linear.out_features, # output dimension
            rank,
            alpha
        )

    def forward(self, x):
        """
        Computes the sum of:
        - Standard Linear output
        - LoRA low-rank adaptation
        """
        return self.linear(x) + self.lora(x)

#### Testing LinearWithLoRA

In [8]:
# Hyperparameters
in_dim = 4
out_dim = 3
rank = 2
alpha = 1.0

# Create a standard linear layer
linear_layer = nn.Linear(in_dim, out_dim)

# Wrap it with LoRA
linear_with_lora = LinearWithLoRA(linear_layer, rank, alpha)

# Dummy input tensor
x = torch.randn(1, in_dim)

# Outputs
print("Input:", x)
print("Standard Linear output:", linear_layer(x))
print("LinearWithLoRA output:", linear_with_lora(x))

Input: tensor([[-0.8154, -0.7798, -0.6719,  0.5486]])
Standard Linear output: tensor([[-0.3813, -0.1985, -0.5075]], grad_fn=<AddmmBackward0>)
LinearWithLoRA output: tensor([[-0.3813, -0.1985, -0.5075]], grad_fn=<AddBackward0>)


### Interpretation of LinearWithLoRA Test

- The output of *LinearWithLoRA* is **identical** to the output of the standard *Linear* layer.  
- This happens because matrix **B** inside the *LoRALayer* is initialized with zeros, meaning the LoRA contribution is initially **zero**.  
- As a result, *LinearWithLoRA(x) = Linear(x)* at the beginning of training.  
- During fine-tuning, the LoRA parameters (*A* and *B*) will update and add **low-rank adaptations** to the output. The pre-trained linear weights stay unchanged only if they are frozen, which Exercise 6 covers.

### Exercise 3: Creating a Small Neural Network and Applying LoRA

#### Interpretation of Small Network with LoRA

- A simple neural network with a single *Linear* layer was created and tested.  
- After replacing the *Linear* layer with *LinearWithLoRA*, the output remains **identical** at initialization.  
- This demonstrates that adding LoRA layers does **not affect the initial behavior** of the model.  
- During training, the LoRA layers adapt to the task; the original weights stay fixed once they are frozen (Exercise 6).

In [9]:
# 1. Define a simple single-layer network
class SimpleNet(nn.Module):
    def __init__(self, in_dim, out_dim):
        """
        Single Linear layer network
        """
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim)

    def forward(self, x):
        return self.linear(x)

# 2. Create and test the network
in_dim = 4
out_dim = 3
net = SimpleNet(in_dim, out_dim)

x = torch.randn(1, in_dim)
print("Input:", x)
print("Original network output:", net(x))

# 3. Replace the Linear layer with LinearWithLoRA
net.linear = LinearWithLoRA(net.linear, rank=2, alpha=1.0)
print("Network with LoRA output:", net(x))

Input: tensor([[-0.0342, -1.9971, -0.8241, -1.4815]])
Original network output: tensor([[-1.5232, -1.2068,  0.0345]], grad_fn=<AddmmBackward0>)
Network with LoRA output: tensor([[-1.5232, -1.2068,  0.0345]], grad_fn=<AddBackward0>)


### Interpretation of Small Network with LoRA

- Input used: *tensor([[-0.0342, -1.9971, -0.8241, -1.4815]])*  
- Original network output: *tensor([[-1.5232, -1.2068,  0.0345]])*  
- Network with LoRA output: *tensor([[-1.5232, -1.2068,  0.0345]])*

**Analysis:**

- The outputs are identical because the LoRA contribution is initialized to zero (matrix **B** is filled with zeros).  
- This confirms that introducing LoRA layers does **not change the original behavior** of the model at initialization.  
- When training begins, the LoRA parameters (*A* and *B*) will learn task-specific adaptations. The original network weights are left untouched provided they have been frozen first (Exercise 6).

### Exercise 4: Merging LoRA Matrices and Testing Equivalence

In [11]:
class LinearWithLoRAMerged(nn.Module):
    def __init__(self, linear: nn.Linear, rank: int, alpha: float):
        """
        Combines LoRA matrices into a single weight update for efficiency.

        Args:
            linear (nn.Linear): Pre-existing linear layer
            rank (int): Rank for LoRA decomposition
            alpha (float): Scaling factor for LoRA contribution
        """
        super().__init__()
        self.linear = linear
        self.lora = LoRALayer(
            linear.in_features, 
            linear.out_features,
            rank,
            alpha
        )

    def forward(self, x):
        """
        Computes the forward pass using merged LoRA weights:
        combined_weight = W + alpha * (A @ B).T
        """
        # Combine low-rank matrices A and B
        lora_matrix = self.lora.A @ self.lora.B  # shape: (in_dim, out_dim)

        # Merge with original weights (transpose because nn.Linear stores weight as (out_dim, in_dim))
        combined_weight = self.linear.weight + self.lora.alpha * lora_matrix.T

        # Standard linear operation with combined weights
        return F.linear(x, combined_weight, self.linear.bias)

#### Testing LinearWithLoRAMerged

##### Interpretation of LinearWithLoRAMerged Test

- We tested two implementations:
  1. *LinearWithLoRA* (standard, non-merged)
  2. *LinearWithLoRAMerged* (merged weights)

- Both outputs should be **identical** at initialization because:
  - Matrix **B** is initialized to zeros in both cases.
  - Therefore, the contribution *(A @ B)* is zero, and only the original linear weights are active.

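- Merging can make **inference** cheaper, because a single combined weight matrix replaces the two extra low-rank multiplications. Note that the *forward* method above rebuilds the combined weight on every call, so the saving only materializes when the merged weight is computed once and reused.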
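
The equivalence of the two layers can be written out in a few lines. The notation is an assumption made for this sketch: *x* is a batch of row vectors, *W* and *b* are the weight and bias of the wrapped *nn.Linear* (which stores *W* with shape *(out_dim, in_dim)*), and *r* is the rank.

```latex
\begin{aligned}
\text{nn.Linear:}\quad
  & y = x W^{\top} + b,
  \qquad W \in \mathbb{R}^{d_{\text{out}} \times d_{\text{in}}} \\
\text{LinearWithLoRA:}\quad
  & y = x W^{\top} + b + \alpha\,(x A)\,B,
  \qquad A \in \mathbb{R}^{d_{\text{in}} \times r},\;
         B \in \mathbb{R}^{r \times d_{\text{out}}} \\
\text{LinearWithLoRAMerged:}\quad
  & y = x \bigl(W + \alpha\,(A B)^{\top}\bigr)^{\top} + b
      = x W^{\top} + \alpha\, x A B + b
\end{aligned}
```

The last line expands to the second one, so the two forms agree for any **A** and **B**, not only when **B** is zero. The transpose on *A B* is the same one applied as *lora_matrix.T* in the code.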

In [12]:
# Create a standard linear layer
linear_layer = nn.Linear(4, 3)

# Wrap with LinearWithLoRA (non-merged)
layer_lora = LinearWithLoRA(linear_layer, rank=2, alpha=1.0)

# Wrap with LinearWithLoRAMerged
layer_lora_merged = LinearWithLoRAMerged(linear_layer, rank=2, alpha=1.0)

# Dummy input
x = torch.randn(1, 4)

# Outputs
print("Input:", x)
print("LinearWithLoRA output:", layer_lora(x))
print("LinearWithLoRAMerged output:", layer_lora_merged(x))

Input: tensor([[ 2.1742, -1.3012,  1.3707,  0.1366]])
LinearWithLoRA output: tensor([[-1.0264,  0.9748,  1.0572]], grad_fn=<AddBackward0>)
LinearWithLoRAMerged output: tensor([[-1.0264,  0.9748,  1.0572]], grad_fn=<AddmmBackward0>)


#### Interpretation of LinearWithLoRAMerged Test

- Input used: *tensor([[ 2.1742, -1.3012,  1.3707,  0.1366]])* 
- *LinearWithLoRA* output: *tensor([[-1.0264,  0.9748,  1.0572]])*  
- *LinearWithLoRAMerged* output: *tensor([[-1.0264,  0.9748,  1.0572]])*

**Analysis:**

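- The outputs are identical, which is consistent with the merged implementation being mathematically equivalent to the non-merged version. Because **B** is zero in both wrappers, this test only exercises the initialization case; the two wrappers also hold independent random **A** matrices, so their outputs would diverge once **B** becomes non-zero.  
- At initialization, LoRA does not contribute (matrix **B** is zeros), so both methods behave like a standard *Linear* layer.  
- The merged form pays off at **inference** when the combined weight is computed once and stored, so that each forward pass needs a single matrix multiplication. As written, *LinearWithLoRAMerged* recomputes *A @ B* on every call.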
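
The test above runs with **B** still at zero. The sketch below checks the equivalence with a non-zero **B** and folds the LoRA update into a plain *nn.Linear* once, so that inference needs no extra work per call. *fold_lora_into_linear* is a hypothetical helper written for this illustration, not part of the exercise; the compact class definitions are repeated so the block runs on its own, and the random **B** is only a stand-in for trained values.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)  # arbitrary seed, only for reproducibility


class LoRALayer(nn.Module):
    def __init__(self, in_dim, out_dim, rank, alpha):
        super().__init__()
        self.A = nn.Parameter(torch.randn(in_dim, rank) / rank ** 0.5)
        self.B = nn.Parameter(torch.zeros(rank, out_dim))
        self.alpha = alpha

    def forward(self, x):
        return self.alpha * (x @ self.A @ self.B)


class LinearWithLoRA(nn.Module):
    def __init__(self, linear, rank, alpha):
        super().__init__()
        self.linear = linear
        self.lora = LoRALayer(linear.in_features, linear.out_features, rank, alpha)

    def forward(self, x):
        return self.linear(x) + self.lora(x)


def fold_lora_into_linear(layer):
    """Hypothetical helper: return a plain nn.Linear holding W + alpha * (A @ B).T."""
    base = layer.linear
    folded = nn.Linear(base.in_features, base.out_features, bias=base.bias is not None)
    with torch.no_grad():
        delta = layer.lora.alpha * (layer.lora.A @ layer.lora.B).T
        folded.weight.copy_(base.weight + delta)
        if base.bias is not None:
            folded.bias.copy_(base.bias)
    return folded


layer = LinearWithLoRA(nn.Linear(4, 3), rank=2, alpha=1.0)
with torch.no_grad():
    # Stand-in for a trained B: any non-zero values make the LoRA term active.
    layer.lora.B.copy_(torch.randn(2, 3))

folded = fold_lora_into_linear(layer)
x = torch.randn(5, 4)

lora_is_active = not torch.allclose(layer(x), layer.linear(x))
outputs_match = torch.allclose(layer(x), folded(x), atol=1e-5)

print("LoRA term changes the output:", lora_is_active)
print("Folded nn.Linear matches:    ", outputs_match)
```

Folding is a one-way step in this sketch: the returned layer no longer exposes **A** and **B**, so it suits deployment after fine-tuning, not further training.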

### Exercise 5: Implementing a Multilayer Perceptron (MLP) and Replacing Layers with LoRA

#### Interpretation of MLP with LoRA

- A standard 3-layer MLP was built and tested on a dummy input.  
- The same MLP was then modified by replacing each *Linear* layer with *LinearWithLoRAMerged*.  

**Observations:**
- The outputs of the original MLP and the LoRA-enhanced MLP are **identical at initialization** because the LoRA matrices start with zero contribution.  
- This confirms that integrating LoRA layers does not disrupt the base model’s behavior.  
- During fine-tuning, only the LoRA matrices will update, enabling parameter-efficient adaptation.

In [13]:
class MultilayerPerceptron(nn.Module):
    def __init__(self, num_features, num_hidden_1, num_hidden_2, num_classes):
        """
        Standard 3-layer MLP:
        Input -> Linear -> ReLU -> Linear -> ReLU -> Linear -> Output
        """
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(num_features, num_hidden_1),
            nn.ReLU(),
            nn.Linear(num_hidden_1, num_hidden_2),
            nn.ReLU(),
            nn.Linear(num_hidden_2, num_classes)
        )

    def forward(self, x):
        return self.layers(x)

# =========================================
# Create and test the standard MLP
# =========================================
num_features = 4
num_hidden_1 = 8
num_hidden_2 = 6
num_classes = 3

# Create standard MLP
mlp = MultilayerPerceptron(num_features, num_hidden_1, num_hidden_2, num_classes)

# Dummy input
x = torch.randn(1, num_features)
print("Input:", x)
print("Standard MLP output:", mlp(x))


# =========================================
# Replace Linear layers with LoRA-enhanced layers
# =========================================
# Deep copy to preserve original MLP
import copy
mlp_lora = copy.deepcopy(mlp)

# Replace each Linear with LinearWithLoRAMerged
mlp_lora.layers[0] = LinearWithLoRAMerged(mlp_lora.layers[0], rank=2, alpha=1.0)
mlp_lora.layers[2] = LinearWithLoRAMerged(mlp_lora.layers[2], rank=2, alpha=1.0)
mlp_lora.layers[4] = LinearWithLoRAMerged(mlp_lora.layers[4], rank=2, alpha=1.0)

# Test the LoRA-enhanced MLP
print("MLP with LoRA output:", mlp_lora(x))

Input: tensor([[-1.0206,  0.8164,  1.1792, -1.3017]])
Standard MLP output: tensor([[-0.4673, -0.3002,  0.0748]], grad_fn=<AddmmBackward0>)
MLP with LoRA output: tensor([[-0.4673, -0.3002,  0.0748]], grad_fn=<AddmmBackward0>)


#### Interpretation of MLP with LoRA

- Input used: *tensor([[-1.0206,  0.8164,  1.1792, -1.3017]])*  
- Standard MLP output: *tensor([[-0.4673, -0.3002,  0.0748]])*  
- MLP with LoRA output: *tensor([[-0.4673, -0.3002,  0.0748]])*

**Analysis:**

- The outputs are identical because matrix **B** in every LoRA layer is initialized with zeros (matrix **A** is random, but its product with **B** vanishes).  
- This confirms that adding LoRA does **not affect the model’s initial predictions**.  
- Once training starts, LoRA parameters will adapt to the new task while the original weights remain frozen, enabling **parameter-efficient fine-tuning**.
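
Replacing layers by index works for a small *nn.Sequential*, but it becomes tedious for larger models. The sketch below walks the module tree and wraps every *nn.Linear* it finds. *replace_linear_with_lora* is a hypothetical helper introduced for this illustration; it uses the non-merged *LinearWithLoRA* wrapper to stay short, and the same pattern should apply to *LinearWithLoRAMerged*.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)  # arbitrary seed, only for reproducibility


class LoRALayer(nn.Module):
    def __init__(self, in_dim, out_dim, rank, alpha):
        super().__init__()
        self.A = nn.Parameter(torch.randn(in_dim, rank) / rank ** 0.5)
        self.B = nn.Parameter(torch.zeros(rank, out_dim))
        self.alpha = alpha

    def forward(self, x):
        return self.alpha * (x @ self.A @ self.B)


class LinearWithLoRA(nn.Module):
    def __init__(self, linear, rank, alpha):
        super().__init__()
        self.linear = linear
        self.lora = LoRALayer(linear.in_features, linear.out_features, rank, alpha)

    def forward(self, x):
        return self.linear(x) + self.lora(x)


def replace_linear_with_lora(module, rank, alpha):
    """Hypothetical helper: wrap every nn.Linear under `module`, in place."""
    # list(...) takes a snapshot, so replacing children during the loop is safe.
    for name, child in list(module.named_children()):
        if isinstance(child, nn.Linear):
            setattr(module, name, LinearWithLoRA(child, rank, alpha))
        else:
            # Recurse only into modules that were not just wrapped; otherwise
            # the inner `linear` of a new wrapper would be wrapped again.
            replace_linear_with_lora(child, rank, alpha)
    return module


mlp = nn.Sequential(
    nn.Linear(4, 8), nn.ReLU(),
    nn.Linear(8, 6), nn.ReLU(),
    nn.Linear(6, 3),
)
x = torch.randn(1, 4)
before = mlp(x)

replace_linear_with_lora(mlp, rank=2, alpha=1.0)
after = mlp(x)

num_wrapped = sum(isinstance(m, LinearWithLoRA) for m in mlp.modules())
print(mlp)
print("Wrapped layers:", num_wrapped)
print("Output unchanged at initialization:", torch.allclose(before, after))
```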

### Exercise 6: Freezing the Original Linear Layers and Training LoRA

In [14]:
def freeze_linear_layers(model):
    """
    Recursively freeze parameters of all nn.Linear layers
    inside the given model.
    """
    for child in model.children():
        if isinstance(child, nn.Linear):
            for param in child.parameters():
                param.requires_grad = False
        else:
            # Recursively check child modules
            freeze_linear_layers(child)

# =========================================
# Apply freezing to the MLP with LoRA
# =========================================
freeze_linear_layers(mlp_lora)

# Check which parameters are trainable
print("Trainable parameters after freezing:")
for name, param in mlp_lora.named_parameters():
    print(f"{name}: {param.requires_grad}")

Trainable parameters after freezing:
layers.0.linear.weight: False
layers.0.linear.bias: False
layers.0.lora.A: True
layers.0.lora.B: True
layers.2.linear.weight: False
layers.2.linear.bias: False
layers.2.lora.A: True
layers.2.lora.B: True
layers.4.linear.weight: False
layers.4.linear.bias: False
layers.4.lora.A: True
layers.4.lora.B: True


In [15]:
# =========================================
# Mini training loop (dummy example)
# =========================================
import torch.optim as optim

# Dummy data (random)
X_train = torch.randn(10, num_features)   # 10 samples
y_train = torch.randint(0, num_classes, (10,))  # random labels

# Loss function and optimizer
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(filter(lambda p: p.requires_grad, mlp_lora.parameters()), lr=0.01)

# Training for a few epochs
for epoch in range(3):
    optimizer.zero_grad()
    outputs = mlp_lora(X_train)
    loss = criterion(outputs, y_train)
    loss.backward()
    optimizer.step()
    print(f"Epoch {epoch+1}: Loss = {loss.item():.4f}")

Epoch 1: Loss = 1.0680
Epoch 2: Loss = 1.0657
Epoch 3: Loss = 1.0634


#### Interpretation of Freezing and Training LoRA

**Trainable parameters check:**

- layers.0.linear.weight: False  
- layers.0.linear.bias: False  
- layers.0.lora.A: True  
- layers.0.lora.B: True  
- layers.2.linear.weight: False  
- layers.2.linear.bias: False  
- layers.2.lora.A: True  
- layers.2.lora.B: True  
- layers.4.linear.weight: False  
- layers.4.linear.bias: False  
- layers.4.lora.A: True  
- layers.4.lora.B: True  

Only the LoRA parameters (A and B) are trainable, while all original Linear weights and biases are frozen.

---

**Mini training results (loss over 3 epochs):**

- Epoch 1: Loss = 1.0680  
- Epoch 2: Loss = 1.0657  
- Epoch 3: Loss = 1.0634  

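The loss decreases slightly from epoch to epoch, which indicates that the network is learning even though only the LoRA parameters are updated. Because the labels are random, this run demonstrates the mechanics and not a meaningful fit.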

---

**Conclusion:**
This setup allows efficient fine-tuning:  
- Original model weights stay intact (pre-trained knowledge preserved).  
- Only a small number of LoRA parameters are updated, reducing memory and compute costs.
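
The claim that the original weights stay intact can be checked directly. The sketch below freezes by parameter name instead of walking the module tree, takes one optimizer step, and compares the parameters before and after. It assumes the LoRA parameters live under an attribute called *lora*, as in this notebook; the tiny two-layer model, the random data, and the seed are made up for the check.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)  # arbitrary seed, only for reproducibility


class LoRALayer(nn.Module):
    def __init__(self, in_dim, out_dim, rank, alpha):
        super().__init__()
        self.A = nn.Parameter(torch.randn(in_dim, rank) / rank ** 0.5)
        self.B = nn.Parameter(torch.zeros(rank, out_dim))
        self.alpha = alpha

    def forward(self, x):
        return self.alpha * (x @ self.A @ self.B)


class LinearWithLoRA(nn.Module):
    def __init__(self, linear, rank, alpha):
        super().__init__()
        self.linear = linear
        self.lora = LoRALayer(linear.in_features, linear.out_features, rank, alpha)

    def forward(self, x):
        return self.linear(x) + self.lora(x)


model = nn.Sequential(
    LinearWithLoRA(nn.Linear(4, 8), rank=2, alpha=1.0),
    nn.ReLU(),
    LinearWithLoRA(nn.Linear(8, 3), rank=2, alpha=1.0),
)

# Freeze by name: parameters such as "0.lora.A" stay trainable, all others do not.
for name, param in model.named_parameters():
    param.requires_grad = ".lora." in name

# Snapshots taken before the update.
frozen_before = {
    name: param.detach().clone()
    for name, param in model.named_parameters()
    if not param.requires_grad
}
lora_B_before = model[2].lora.B.detach().clone()

optimizer = torch.optim.Adam(
    [p for p in model.parameters() if p.requires_grad], lr=0.01
)
X = torch.randn(10, 4)
y = torch.randint(0, 3, (10,))

# One training step on random data.
loss = nn.CrossEntropyLoss()(model(X), y)
optimizer.zero_grad()
loss.backward()
optimizer.step()

frozen_unchanged = all(
    torch.equal(param, frozen_before[name])
    for name, param in model.named_parameters()
    if name in frozen_before
)
lora_B_changed = not torch.equal(model[2].lora.B, lora_B_before)

print("Frozen Linear parameters unchanged:", frozen_unchanged)
print("LoRA matrix B updated:             ", lora_B_changed)
```

Name-based freezing is shorter than the recursive function, but it depends on the naming convention; the *isinstance* check used in *freeze_linear_layers* does not.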

### LoRA Implementation on MNIST - Full Pipeline

In [21]:
# -----------------------------------------
# 1. LoRALayer
# -----------------------------------------
class LoRALayer(nn.Module):
    def __init__(self, in_dim, out_dim, rank, alpha):
        """
        Implements low-rank adaptation using matrices A and B.
        """
        super().__init__()
        std_dev = 1 / torch.sqrt(torch.tensor(rank).float())
        self.A = nn.Parameter(torch.randn(in_dim, rank) * std_dev)
        self.B = nn.Parameter(torch.zeros(rank, out_dim))
        self.alpha = alpha

    def forward(self, x):
        # Compute (x @ A) @ B scaled by alpha
        return self.alpha * (x @ self.A @ self.B)

# -----------------------------------------
# 2. LinearWithLoRA (non-merged version)
# -----------------------------------------
class LinearWithLoRA(nn.Module):
    def __init__(self, linear: nn.Linear, rank: int, alpha: float):
        """
        Adds a LoRA adaptation to a standard Linear layer.
        """
        super().__init__()
        self.linear = linear
        self.lora = LoRALayer(linear.in_features, linear.out_features, rank, alpha)

    def forward(self, x):
        return self.linear(x) + self.lora(x)

# -----------------------------------------
# 3. LinearWithLoRAMerged (merged weights)
# -----------------------------------------
class LinearWithLoRAMerged(nn.Module):
    def __init__(self, linear: nn.Linear, rank: int, alpha: float):
        """
        Merges LoRA matrices with original weight for efficient inference.
        """
        super().__init__()
        self.linear = linear
        self.lora = LoRALayer(linear.in_features, linear.out_features, rank, alpha)

    def forward(self, x):
        # Combine LoRA matrices
        lora_matrix = self.lora.A @ self.lora.B
        combined_weight = self.linear.weight + self.lora.alpha * lora_matrix.T
        return F.linear(x, combined_weight, self.linear.bias)

# -----------------------------------------
# 4. Multilayer Perceptron (MLP)
# -----------------------------------------
class MultilayerPerceptron(nn.Module):
    def __init__(self, num_features, num_hidden_1, num_hidden_2, num_classes):
        """
        Standard 3-layer MLP: Linear -> ReLU -> Linear -> ReLU -> Linear
        """
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(num_features, num_hidden_1),
            nn.ReLU(),
            nn.Linear(num_hidden_1, num_hidden_2),
            nn.ReLU(),
            nn.Linear(num_hidden_2, num_classes)
        )

    def forward(self, x):
        return self.layers(x)

# -----------------------------------------
# 5. Dataset and DataLoader
# -----------------------------------------
BATCH_SIZE = 64

transform = transforms.ToTensor()

train_dataset = datasets.MNIST(root="data", train=True, transform=transform, download=True)
test_dataset = datasets.MNIST(root="data", train=False, transform=transform, download=True)

train_loader = DataLoader(dataset=train_dataset, batch_size=BATCH_SIZE, shuffle=True)
test_loader = DataLoader(dataset=test_dataset, batch_size=BATCH_SIZE, shuffle=False)

# -----------------------------------------
# 6. Training and Evaluation Functions
# -----------------------------------------
def compute_accuracy(model, data_loader, device):
    model.eval()
    correct_pred, num_examples = 0, 0
    with torch.no_grad():
        for features, targets in data_loader:
            features = features.view(-1, 28*28).to(device)
            targets = targets.to(device)
            logits = model(features)
            _, predicted_labels = torch.max(logits, 1)
            num_examples += targets.size(0)
            correct_pred += (predicted_labels == targets).sum().item()
    return correct_pred / num_examples * 100

def train(num_epochs, model, optimizer, train_loader, device):
    start_time = time.time()
    criterion = nn.CrossEntropyLoss()

    for epoch in range(num_epochs):
        model.train()
        for batch_idx, (features, targets) in enumerate(train_loader):
            features = features.view(-1, 28*28).to(device)
            targets = targets.to(device)

            # forward
            logits = model(features)
            loss = criterion(logits, targets)

            # backward
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

            if not batch_idx % 400:
                print(f"Epoch {epoch+1}/{num_epochs} | Batch {batch_idx}/{len(train_loader)} | Loss: {loss:.4f}")

        acc = compute_accuracy(model, train_loader, device)
        print(f"Epoch {epoch+1}/{num_epochs} Training Accuracy: {acc:.2f}%")

    print(f"Total Training Time: {(time.time() - start_time)/60:.2f} min")

# -----------------------------------------
# 7. Hyperparameters and Model Initialization
# -----------------------------------------
DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")

num_features = 28 * 28
num_hidden_1 = 128
num_hidden_2 = 64
num_classes = 10
learning_rate = 0.001
num_epochs = 2  # for demo, keep it low

# Standard MLP
model = MultilayerPerceptron(num_features, num_hidden_1, num_hidden_2, num_classes).to(DEVICE)
optimizer_pretrained = torch.optim.Adam(model.parameters(), lr=learning_rate)

print("Training standard MLP...")
train(num_epochs, model, optimizer_pretrained, train_loader, DEVICE)
print(f"Test Accuracy (Standard MLP): {compute_accuracy(model, test_loader, DEVICE):.2f}%")

# -----------------------------------------
# 8. Apply LoRA to MLP
# -----------------------------------------
model_lora = copy.deepcopy(model)

# Replace all linear layers by LoRA merged
model_lora.layers[0] = LinearWithLoRAMerged(model_lora.layers[0], rank=4, alpha=8)
model_lora.layers[2] = LinearWithLoRAMerged(model_lora.layers[2], rank=4, alpha=8)
model_lora.layers[4] = LinearWithLoRAMerged(model_lora.layers[4], rank=4, alpha=8)

print("Test Accuracy (LoRA MLP before fine-tuning):")
print(f"{compute_accuracy(model_lora, test_loader, DEVICE):.2f}%")

# -----------------------------------------
# 9. Freeze original Linear weights
# -----------------------------------------
def freeze_linear_layers(model):
    for child in model.children():
        if isinstance(child, nn.Linear):
            for param in child.parameters():
                param.requires_grad = False
        else:
            freeze_linear_layers(child)

freeze_linear_layers(model_lora)

# Check trainable params
print("Trainable parameters after freezing:")
for name, param in model_lora.named_parameters():
    print(f"{name}: {param.requires_grad}")

# -----------------------------------------
# 10. Fine-tune LoRA
# -----------------------------------------
optimizer_lora = torch.optim.Adam(filter(lambda p: p.requires_grad, model_lora.parameters()), lr=learning_rate)

print("Fine-tuning LoRA MLP...")
train(num_epochs, model_lora, optimizer_lora, train_loader, DEVICE)
print(f"Test Accuracy (LoRA Fine-tuned): {compute_accuracy(model_lora, test_loader, DEVICE):.2f}%")

print(f"Test Accuracy (Original MLP): {compute_accuracy(model, test_loader, DEVICE):.2f}%")

Training standard MLP...
Epoch 1/2 | Batch 0/938 | Loss: 2.3064
Epoch 1/2 | Batch 400/938 | Loss: 0.1621
Epoch 1/2 | Batch 800/938 | Loss: 0.1703
Epoch 1/2 Training Accuracy: 94.98%
Epoch 2/2 | Batch 0/938 | Loss: 0.1207
Epoch 2/2 | Batch 400/938 | Loss: 0.2612
Epoch 2/2 | Batch 800/938 | Loss: 0.2854
Epoch 2/2 Training Accuracy: 96.71%
Total Training Time: 0.49 min
Test Accuracy (Standard MLP): 96.19%
Test Accuracy (LoRA MLP before fine-tuning):
96.19%
Trainable parameters after freezing:
layers.0.linear.weight: False
layers.0.linear.bias: False
layers.0.lora.A: True
layers.0.lora.B: True
layers.2.linear.weight: False
layers.2.linear.bias: False
layers.2.lora.A: True
layers.2.lora.B: True
layers.4.linear.weight: False
layers.4.linear.bias: False
layers.4.lora.A: True
layers.4.lora.B: True
Fine-tuning LoRA MLP...
Epoch 1/2 | Batch 0/938 | Loss: 0.0400
Epoch 1/2 | Batch 400/938 | Loss: 0.1347
Epoch 1/2 | Batch 800/938 | Loss: 0.0904
Epoch 1/2 Training Accuracy: 96.81%
Epoch 2/2 | Batch 

### LoRA Implementation on MNIST – Final Results

**Training standard MLP:**
- Epoch 1 Training Accuracy: 94.98%
- Epoch 2 Training Accuracy: 96.71%
- Test Accuracy: **96.19%**

**LoRA MLP before fine-tuning:**
- Test Accuracy: **96.19%** (identical to standard MLP at initialization)

**Trainable parameters after freezing:**
- *linear.weight* and *linear.bias* → **False** (frozen)  
- *lora.A* and *lora.B* → **True** (trainable)  

**Fine-tuning LoRA:**
- Epoch 1 Training Accuracy: 96.81%
- Epoch 2 Training Accuracy: 97.05%
- Test Accuracy: **96.45%**

---

#### Conclusion

- LoRA integration preserves the original model behavior at start (no accuracy loss).  
- Fine-tuning only the LoRA parameters reaches comparable, here slightly higher, test accuracy (96.45% versus 96.19%) after two additional epochs on the same MNIST training set.  
- This confirms **parameter-efficient fine-tuning**: only small matrices *A* and *B* are trained, saving memory and computation while keeping pre-trained weights intact.
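
To put a number on "small matrices", the parameter counts for this pipeline can be worked out by hand. The sketch below uses the layer sizes and rank from the MNIST run above (784 → 128 → 64 → 10, rank 4); the two helper functions are hypothetical and only encode the shapes of *nn.Linear* and *LoRALayer*.

```python
def linear_params(in_dim, out_dim):
    """Weights plus biases of one nn.Linear(in_dim, out_dim)."""
    return in_dim * out_dim + out_dim


def lora_params(in_dim, out_dim, rank):
    """Entries of A (in_dim x rank) plus entries of B (rank x out_dim)."""
    return rank * (in_dim + out_dim)


# Layer sizes and rank taken from the MNIST pipeline above.
layer_dims = [(28 * 28, 128), (128, 64), (64, 10)]
rank = 4

frozen = sum(linear_params(i, o) for i, o in layer_dims)
trainable = sum(lora_params(i, o, rank) for i, o in layer_dims)

print(f"Frozen Linear parameters:  {frozen}")     # 109386
print(f"Trainable LoRA parameters: {trainable}")  # 4712
print(f"Trainable share of all parameters: {trainable / (frozen + trainable):.1%}")
```

For this small MLP the LoRA matrices make up roughly 4% of all parameters. The saving grows with layer width, since the LoRA count scales with *in_dim + out_dim* while the full weight scales with *in_dim × out_dim*.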