# Overview

**Note: All images come from the sources listed in the Credit section at the bottom.**

Low-rank adaptation (LoRA) is a machine learning technique that adapts a pretrained model (for example, an LLM or vision transformer) to a specific, often smaller, dataset by training only a small number of additional low-rank parameters instead of all of the model's weights.

This approach is important because it allows for efficient finetuning of large models on task-specific data, significantly reducing the computational cost and time required for finetuning.

Since LLMs are large, updating all model weights during training can be expensive due to GPU memory limitations. Suppose we have a large weight matrix $W$ for a given layer. During backpropagation, we learn a $\Delta W$ matrix, which contains information on how much we want to update the original weights to minimize the loss function during training.

In regular training and finetuning, the weight update is defined as follows:

$$W_{updated}=W+\Delta W$$

The LoRA method proposed by [Hu et al.](https://arxiv.org/abs/2106.09685) offers a more efficient alternative to computing the weight updates $\Delta W$ **by learning an approximation of it**, $\Delta W \approx AB$. In other words, in LoRA, we have the following, where $A$ and $B$ are two small weight matrices:

$$W_{updated}=W+A.B$$

(The "." in "A.B" stands for matrix multiplication.)

The figure below illustrates these formulas for full finetuning and LoRA side by side.

![](https://cdn.masto.host/sigmoidsocial/media_attachments/files/111/966/350/443/956/461/small/e5254a6add0141f4.webp)


## How does LoRA save GPU memory

For example, if a pretrained weight matrix $W$ is a $1000 \times 1000$ matrix, then the weight update matrix $\Delta W$ in regular finetuning is a $1000 \times 1000$ matrix as well. In this case, $\Delta W$ has $1,000,000$ parameters. If we consider a LoRA rank of $2$, then $A$ is a $1000 \times 2$ matrix and $B$ is a $2 \times 1000$ matrix, so we only have $2 \times 2 \times 1000 = 4000$ parameters that we need to update when using LoRA. With a rank of $2$, that's $250$ times fewer parameters.
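The arithmetic above can be reproduced in a few lines of Python. This is only a sketch; the helper names `full_param_count` and `lora_param_count` are introduced here for illustration and are not part of any library.

```python
def full_param_count(in_dim, out_dim):
    # Number of entries in a full weight update matrix Delta W
    return in_dim * out_dim


def lora_param_count(in_dim, out_dim, rank):
    # A has shape (in_dim, rank) and B has shape (rank, out_dim)
    return in_dim * rank + rank * out_dim


full = full_param_count(1000, 1000)
for rank in (1, 2, 4, 8):
    lora = lora_param_count(1000, 1000, rank)
    print(f"rank={rank}: {lora} LoRA parameters, {full / lora:.1f}x fewer than {full}")
```

For `rank=2` this gives the 4000 parameters and the factor of 250 from the example; the saving shrinks as the rank grows.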

Of course, $A$ and $B$ can't capture all the information that $\Delta W$ could capture, **but this is by design**. When using LoRA, we hypothesize that the model requires $W$ to be a large matrix with full rank to capture all the knowledge in the pretraining dataset. However, when we finetune an LLM, we don't need to update all the weights; we can capture the core information for the adaptation in a smaller number of weights than $\Delta W$ would need. Hence, we use the low-rank updates via $AB$.
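To make "low-rank" concrete, the following sketch builds a product $AB$ from random matrices and inspects its rank. The sizes and the seed are arbitrary choices for illustration, and float64 is used only to keep the numerical rank estimate robust.

```python
import torch

torch.manual_seed(0)
rank = 2

# Random stand-ins for learned LoRA matrices
A = torch.randn(100, rank, dtype=torch.float64)
B = torch.randn(rank, 100, dtype=torch.float64)

delta_W = A @ B  # same shape as a full 100 x 100 weight update
print(delta_W.shape)
print(torch.linalg.matrix_rank(delta_W).item())
```

Although the product has the same shape as a full update, its rank cannot exceed the inner dimension of $A$ and $B$, which is the restriction LoRA accepts on purpose.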

If you look closely, the full finetuning and LoRA depictions in the figure above look slightly different from the formulas shown earlier. That's due to the distributive law of matrix multiplication: we don't have to add the weight update to the original weights but can keep the two separate. For instance, if $x$ is the input data, then we can write the following for regular finetuning:

$$x.(W+\Delta W)=x.W+x.\Delta W$$

Similarly, we can write the following for LoRA:

$$x.(W+A.B)=x.W+x.A.B$$
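The identity can be checked numerically with random tensors. In this sketch, $W$ is stored as an (input, output) matrix to match the formula; note that `nn.Linear` stores its weight the other way around. All shapes are arbitrary.

```python
import torch

torch.manual_seed(0)
x = torch.randn(1, 10, dtype=torch.float64)
W = torch.randn(10, 5, dtype=torch.float64)   # (in_dim, out_dim), as in the formula
A = torch.randn(10, 2, dtype=torch.float64)
B = torch.randn(2, 5, dtype=torch.float64)

merged = x @ (W + A @ B)       # left-hand side: add the update to the weights first
separate = x @ W + x @ A @ B   # right-hand side: keep the two paths apart

print(torch.allclose(merged, separate))
```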

The fact that we can keep the LoRA weight matrices separate makes LoRA especially attractive. In practice, this means that we don't have to modify the weights of the pretrained model at all, as we can apply the LoRA matrices on the fly. This is especially useful if you are considering hosting a model for multiple customers. Instead of having to save the large updated models for each customer, you only have to save the small set of LoRA weights alongside the original pretrained model.
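A minimal sketch of the multi-customer idea, assuming one shared weight matrix and one small $(A, B)$ pair per customer. The customer names, the `adapters` dictionary, and the `forward` helper are hypothetical, and the random adapters stand in for trained ones.

```python
import torch

torch.manual_seed(0)
in_dim, out_dim, rank = 16, 8, 2

# One shared pretrained weight matrix that is never modified
W = torch.randn(in_dim, out_dim)

# Hypothetical per-customer adapters: each is just a small (A, B) pair
adapters = {
    name: (torch.randn(in_dim, rank), torch.randn(rank, out_dim))
    for name in ("customer_a", "customer_b")
}


def forward(x, customer):
    # Apply the shared weights plus the selected adapter on the fly
    A, B = adapters[customer]
    return x @ W + x @ A @ B


x = torch.randn(1, in_dim)
out_a = forward(x, "customer_a")
out_b = forward(x, "customer_b")

shared = W.numel()
per_customer = sum(t.numel() for t in adapters["customer_a"])
print(f"shared: {shared} values, per customer: {per_customer} values")
```

The same input yields different outputs per customer, while the storage cost of each additional customer is only the size of its adapter.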


# A LoRA Layer Code Implementation

We begin by initializing the **LoRALayer**, which creates the matrices $A$ and $B$, along with the **alpha** scaling hyperparameter and the rank hyperparameter. This layer can accept an input and compute the corresponding output, as illustrated in the figure below (the LoRA matrices $A$ and $B$ with rank $r$).

![](https://cdn.masto.host/sigmoidsocial/media_attachments/files/111/966/454/563/523/540/original/fcc1997b5e645a15.webp)

In [1]:
import torch
import torch.nn as nn

if torch.cuda.is_available():
    # NVIDIA CUDA Deep Neural Network (cuDNN) is a GPU-accelerated library of primitives for deep neural networks
    torch.backends.cudnn.deterministic=True


class LoRALayer(nn.Module):
    def __init__(self, in_dim, out_dim, rank, alpha):
        super().__init__()
        std_dev=1/torch.sqrt(torch.tensor(rank).float())
        self.A=nn.Parameter(torch.randn(in_dim, rank)*std_dev)
        self.B=nn.Parameter(torch.zeros(rank, out_dim))
        self.alpha=alpha
        
    def forward(self, x):
        x=self.alpha*(x@self.A@self.B)
        return x

In the code above, $rank$ is the hyperparameter that controls the inner dimension of the matrices $A$ and $B$. In other words, this parameter controls the number of additional parameters introduced by LoRA and is a key factor in determining the balance between model adaptability and parameter efficiency.

The second hyperparameter, $alpha$, is a scaling hyperparameter applied to the output of the low-rank adaptation. It controls how strongly the adapted layer's output is allowed to influence the original output of the layer being adapted. This can be seen as a way to regulate the impact of the low-rank adaptation on the layer's output.
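The effect of $alpha$ can be seen in a short experiment. The sketch below repeats the $LoRALayer$ definition so that it runs on its own, and fills $B$ with random values by hand as a stand-in for training, since the zero-initialized $B$ would otherwise produce an all-zero output for any $alpha$.

```python
import torch
import torch.nn as nn


class LoRALayer(nn.Module):
    # Same layer as defined above, repeated so this snippet is self-contained
    def __init__(self, in_dim, out_dim, rank, alpha):
        super().__init__()
        std_dev = 1 / torch.sqrt(torch.tensor(rank).float())
        self.A = nn.Parameter(torch.randn(in_dim, rank) * std_dev)
        self.B = nn.Parameter(torch.zeros(rank, out_dim))
        self.alpha = alpha

    def forward(self, x):
        return self.alpha * (x @ self.A @ self.B)


torch.manual_seed(0)
lora = LoRALayer(in_dim=10, out_dim=2, rank=2, alpha=1)
with torch.no_grad():
    # Assumption for illustration: pretend training moved B away from zero
    lora.B.normal_()

x = torch.randn(1, 10)
out_alpha_1 = lora(x)
lora.alpha = 4
out_alpha_4 = lora(x)

print(out_alpha_1)
print(out_alpha_4)
print(torch.allclose(out_alpha_4, 4 * out_alpha_1))
```

Because $alpha$ multiplies the LoRA output directly, changing it from 1 to 4 scales the contribution of the adaptation by the same factor.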

So far, the $LoRALayer$ class we implemented above allows us to transform the layer inputs $x$. However, in LoRA, we are usually interested in replacing existing $Linear$ layers so that the weight update is applied to the existing pretrained weights, as shown in the figure below (LoRA applied to an existing linear layer):

![](https://cdn.masto.host/sigmoidsocial/media_attachments/files/111/966/573/735/919/892/original/d5bcf6590c3a346b.png)

To incorporate the original Linear layer weights as shown in the figure above, we will implement a $LinearWithLoRA$ layer that uses the previously implemented $LoRALayer$ and can be used to replace existing $Linear$ layers in a neural network, for example, the self-attention or feed-forward modules in an LLM:

In [2]:
class LinearWithLoRA(nn.Module):
    def __init__(self, linear, rank, alpha):
        super().__init__()
        self.linear=linear
        self.lora=LoRALayer(
            linear.in_features, linear.out_features, rank, alpha
        )
    
    def forward(self, x):
        return self.linear(x)+self.lora(x)

Note that since we initialize the weight matrix $B$ (self.B in LoRALayer) with zeros, the matrix product of $A$ and $B$ is a zero matrix and doesn't affect the original weights (adding 0 to the original weights does not modify them).
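A natural question is whether training can get started when $B$ is all zeros. The sketch below repeats the two classes so that it runs on its own and inspects the gradients after one backward pass; the toy loss (a plain sum of the outputs) is an arbitrary choice for illustration.

```python
import torch
import torch.nn as nn


class LoRALayer(nn.Module):
    def __init__(self, in_dim, out_dim, rank, alpha):
        super().__init__()
        std_dev = 1 / torch.sqrt(torch.tensor(rank).float())
        self.A = nn.Parameter(torch.randn(in_dim, rank) * std_dev)
        self.B = nn.Parameter(torch.zeros(rank, out_dim))
        self.alpha = alpha

    def forward(self, x):
        return self.alpha * (x @ self.A @ self.B)


class LinearWithLoRA(nn.Module):
    def __init__(self, linear, rank, alpha):
        super().__init__()
        self.linear = linear
        self.lora = LoRALayer(linear.in_features, linear.out_features, rank, alpha)

    def forward(self, x):
        return self.linear(x) + self.lora(x)


torch.manual_seed(0)
layer = LinearWithLoRA(nn.Linear(10, 2), rank=2, alpha=4)
x = torch.randn(3, 10)

# B is all zeros, so the LoRA branch contributes nothing to the output yet
print(torch.count_nonzero(layer.lora.A @ layer.lora.B).item())

loss = layer(x).sum()  # toy loss, only used to obtain gradients
loss.backward()

# B receives a nonzero gradient, so an optimizer step can move it away from zero.
# A's gradient is zero at this first step because it passes through B = 0.
print(layer.lora.B.grad.abs().sum().item() > 0)
print(layer.lora.A.grad.abs().sum().item() == 0)
```

So the zero initialization leaves the pretrained behavior untouched at the start, while $B$ can still be updated from the first step onward.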


# Define a small Single Layer Neural Network

Let's try out LoRA on a small neural network layer represented by a single $Linear$ layer:

In [3]:
# Hyperparameters
random_seed=123

torch.manual_seed(random_seed)
layer=nn.Linear(10,2)
x=torch.randn((1, 10))

print(x)
print(layer)
print('Original output:', layer(x))

tensor([[ 0.5490,  0.3671,  0.1219,  0.6466, -1.4168,  0.8429, -0.6307,  1.2340,
          0.3127,  0.6972]])
Linear(in_features=10, out_features=2, bias=True)
Original output: tensor([[0.6639, 0.4487]], grad_fn=<AddmmBackward0>)


## Applying LoRA to Linear Layer

Let's apply LoRA to the Linear layer. The results are the same as before because we haven't trained the LoRA weights yet, so $B$ is still all zeros. In other words, everything works as expected:

In [4]:
## Applying LoRA to Linear Layer
layer_lora_1=LinearWithLoRA(layer, rank=2, alpha=4)
print(layer_lora_1(x))

tensor([[0.6639, 0.4487]], grad_fn=<AddBackward0>)


## Merging LoRA matrices and Original Weights

As mentioned above, the distributive law of matrix multiplication gives $x.(W+A.B)=x.W+x.A.B$.

This means that we can also combine or merge the LoRA matrices and original weights, which should result in an equivalent implementation. In code, this alternative implementation to the LinearWithLoRA layer looks as follows:

In [5]:
import torch.nn.functional as F

# This LoRA code is equivalent to LinearWithLoRA
class LinearWithLoRAMerged(nn.Module):
    def __init__(self, linear, rank, alpha):
        super().__init__()
        self.linear=linear
        self.lora=LoRALayer(
            linear.in_features, linear.out_features, rank, alpha
        )
    
    def forward(self, x):
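        lora=self.lora.A @ self.lora.B # combine LoRA matrices
        # then combine the LoRA update with the original weights
        # (nn.Linear stores its weight as (out_features, in_features), hence the transpose)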
        combined_weight=self.linear.weight+self.lora.alpha*lora.T
        return F.linear(x, combined_weight, self.linear.bias)

In short, $LinearWithLoRAMerged$ computes the left side of the equation $x.(W+A.B)=x.W+x.A.B$ whereas $LinearWithLoRA$ computes the right side; both are equivalent. We can verify that this results in the same outputs as before via the following code:

In [6]:
layer_lora_2=LinearWithLoRAMerged(layer, rank=2, alpha=4)
print(layer_lora_2(x))

tensor([[0.6639, 0.4487]], grad_fn=<AddmmBackward0>)
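Since $B$ is still zero in the check above, both layers simply reproduce the original output. A slightly stronger check, sketched below, gives both variants the same nonzero LoRA matrices (random values standing in for trained ones) and compares their outputs. The classes are repeated so that the snippet runs on its own.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class LoRALayer(nn.Module):
    def __init__(self, in_dim, out_dim, rank, alpha):
        super().__init__()
        std_dev = 1 / torch.sqrt(torch.tensor(rank).float())
        self.A = nn.Parameter(torch.randn(in_dim, rank) * std_dev)
        self.B = nn.Parameter(torch.zeros(rank, out_dim))
        self.alpha = alpha

    def forward(self, x):
        return self.alpha * (x @ self.A @ self.B)


class LinearWithLoRA(nn.Module):
    def __init__(self, linear, rank, alpha):
        super().__init__()
        self.linear = linear
        self.lora = LoRALayer(linear.in_features, linear.out_features, rank, alpha)

    def forward(self, x):
        return self.linear(x) + self.lora(x)


class LinearWithLoRAMerged(nn.Module):
    def __init__(self, linear, rank, alpha):
        super().__init__()
        self.linear = linear
        self.lora = LoRALayer(linear.in_features, linear.out_features, rank, alpha)

    def forward(self, x):
        lora = self.lora.A @ self.lora.B
        combined_weight = self.linear.weight + self.lora.alpha * lora.T
        return F.linear(x, combined_weight, self.linear.bias)


torch.manual_seed(0)
linear = nn.Linear(10, 2)
separate = LinearWithLoRA(linear, rank=2, alpha=4)
merged = LinearWithLoRAMerged(linear, rank=2, alpha=4)

with torch.no_grad():
    separate.lora.B.normal_()              # stand-in for trained values
    merged.lora.A.copy_(separate.lora.A)   # give both variants identical LoRA weights
    merged.lora.B.copy_(separate.lora.B)

x = torch.randn(1, 10)
# A loose tolerance absorbs float32 rounding differences between the two paths
print(torch.allclose(separate(x), merged(x), atol=1e-4))
print(torch.allclose(separate(x), linear(x), atol=1e-4))
```

The first comparison should agree up to rounding, while the second confirms that the LoRA branch now changes the output relative to the plain linear layer.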


# Applying LoRA Layers to LLM

**Why did we implement LoRA in the manner described above using PyTorch modules?**

This approach enables us to easily replace a Linear layer in an existing neural network (for example, the feed-forward or attention modules of an LLM) with our new $LinearWithLoRA$ or $LinearWithLoRAMerged$ layers.


## Multilayer Perceptron Model(without LoRA)

For simplicity, let's focus on a small 3-layer multilayer perceptron instead of an LLM for now, as illustrated in the figure below:

![](https://cdn.masto.host/sigmoidsocial/media_attachments/files/111/966/738/332/188/839/original/c29fe212ed35f033.webp)

In [7]:
class MultilayerPerceptron(nn.Module):
    def __init__(self, num_features, num_hidden_1, num_hidden_2, num_classes):
        super().__init__()
        self.layers=nn.Sequential(
            nn.Linear(num_features, num_hidden_1),
            nn.ReLU(),
            nn.Linear(num_hidden_1, num_hidden_2),
            nn.ReLU(),
            nn.Linear(num_hidden_2, num_classes)
        )
    
    def forward(self, x):
        x=self.layers(x)
        return x

# Architecture
num_features=784
num_hidden_1=128
num_hidden_2=256
num_classes=10

# Settings
DEVICE=torch.device('cuda' if torch.cuda.is_available() else 'cpu')
learning_rate=0.005
num_epochs=10
    
model=MultilayerPerceptron(
    num_features=num_features,
    num_hidden_1=num_hidden_1,
    num_hidden_2=num_hidden_2,
    num_classes=num_classes
)

model.to(DEVICE)
optimizer_pretrained=torch.optim.Adam(model.parameters(), lr=learning_rate)
print(DEVICE)
print(model)
print(optimizer_pretrained)

cuda
MultilayerPerceptron(
  (layers): Sequential(
    (0): Linear(in_features=784, out_features=128, bias=True)
    (1): ReLU()
    (2): Linear(in_features=128, out_features=256, bias=True)
    (3): ReLU()
    (4): Linear(in_features=256, out_features=10, bias=True)
  )
)
Adam (
Parameter Group 0
    amsgrad: False
    betas: (0.9, 0.999)
    capturable: False
    differentiable: False
    eps: 1e-08
    foreach: None
    fused: None
    lr: 0.005
    maximize: False
    weight_decay: 0
)


## Loading dataset

In [8]:
from torchvision import datasets
from torchvision import transforms
from torch.utils.data import DataLoader

BATCH_SIZE=64

# Note: transforms.ToTensor() scales input images to 0-1 range
train_dataset=datasets.MNIST(root='data', train=True, transform=transforms.ToTensor(), download=True)

test_dataset=datasets.MNIST(root='data', train=False, transform=transforms.ToTensor())

train_loader=DataLoader(dataset=train_dataset, batch_size=BATCH_SIZE, shuffle=True)

test_loader=DataLoader(dataset=test_dataset, batch_size=BATCH_SIZE, shuffle=False)

for images, labels in train_loader:
    print('Image batch dimensions:', images.shape)
    print('Image label dimensions:', labels.shape)
    break

Image batch dimensions: torch.Size([64, 1, 28, 28])
Image label dimensions: torch.Size([64])


## Define evaluation

In [9]:
def compute_accuracy(model, data_loader, device):
    model.eval()
    correct_pred, num_examples=0,0
    with torch.no_grad():
        for features, targets in data_loader:
            features=features.view(-1, 28*28).to(device)
            targets=targets.to(device)
            logits=model(features)
            _, predicted_labels=torch.max(logits,1)
            num_examples+=targets.size(0)
            correct_pred+=(predicted_labels==targets).sum()
        return correct_pred.float()/num_examples*100

## Training

In [10]:
import time

def train(num_epochs, model, optimizer, train_loader, device):
    start_time=time.time()
    for epoch in range(num_epochs):
        model.train()
        for batch_idx, (features, targets) in enumerate(train_loader):
            features=features.view(-1, 28*28).to(device)
            targets=targets.to(device)
            
            # forward and back propagation
            logits=model(features)
            loss=F.cross_entropy(logits, targets)
            optimizer.zero_grad()
            
            loss.backward()
            
            # update model parameters
            optimizer.step()
            
            # logging
            if not batch_idx %400:
                print('Epoch: %03d/%03d|Batch %03d/%03d| Loss: %.4f' % (epoch+1, num_epochs, batch_idx, len(train_loader), loss))
        
        with torch.set_grad_enabled(False):
            print('Epoch: %03d/%03d training accuracy: %.2f%%' % (epoch+1, num_epochs, compute_accuracy(model, train_loader, device)))
        
        print('Time elapsed: %.2f min' % ((time.time() - start_time)/60))
    print('Total Training Time: %.2f min' % ((time.time() - start_time)/60))
                  
                  
train(num_epochs, model, optimizer_pretrained, train_loader, DEVICE)
print(f'Test accuracy: {compute_accuracy(model, test_loader, DEVICE):.2f}%')

Epoch: 001/010|Batch 000/938| Loss: 2.3172
Epoch: 001/010|Batch 400/938| Loss: 0.0508
Epoch: 001/010|Batch 800/938| Loss: 0.0969
Epoch: 001/010 training accuracy: 96.87%
Time elapsed: 0.24 min
Epoch: 002/010|Batch 000/938| Loss: 0.1350
Epoch: 002/010|Batch 400/938| Loss: 0.1447
Epoch: 002/010|Batch 800/938| Loss: 0.2466
Epoch: 002/010 training accuracy: 97.66%
Time elapsed: 0.47 min
Epoch: 003/010|Batch 000/938| Loss: 0.1082
Epoch: 003/010|Batch 400/938| Loss: 0.2204
Epoch: 003/010|Batch 800/938| Loss: 0.1170
Epoch: 003/010 training accuracy: 97.75%
Time elapsed: 0.71 min
Epoch: 004/010|Batch 000/938| Loss: 0.0820
Epoch: 004/010|Batch 400/938| Loss: 0.0058
Epoch: 004/010|Batch 800/938| Loss: 0.0278
Epoch: 004/010 training accuracy: 98.04%
Time elapsed: 0.94 min
Epoch: 005/010|Batch 000/938| Loss: 0.0326
Epoch: 005/010|Batch 400/938| Loss: 0.0173
Epoch: 005/010|Batch 800/938| Loss: 0.0074
Epoch: 005/010 training accuracy: 98.43%
Time elapsed: 1.17 min
Epoch: 006/010|Batch 000/938| Loss:

# Replacing Linear with LoRA Layers

Using $LinearWithLoRAMerged$ (or, equivalently, $LinearWithLoRA$), we can then add the LoRA layers by replacing the original $Linear$ layers in the multilayer perceptron model:

In [11]:
import copy

model_lora=copy.deepcopy(model)

model_lora.layers[0]=LinearWithLoRAMerged(model_lora.layers[0], rank=4, alpha=8)
model_lora.layers[2]=LinearWithLoRAMerged(model_lora.layers[2], rank=4, alpha=8)
model_lora.layers[4]=LinearWithLoRAMerged(model_lora.layers[4], rank=4, alpha=8)
model_lora.to(DEVICE)
optimizer_lora=torch.optim.Adam(model_lora.parameters(), lr=learning_rate)
print(model_lora)

MultilayerPerceptron(
  (layers): Sequential(
    (0): LinearWithLoRAMerged(
      (linear): Linear(in_features=784, out_features=128, bias=True)
      (lora): LoRALayer()
    )
    (1): ReLU()
    (2): LinearWithLoRAMerged(
      (linear): Linear(in_features=128, out_features=256, bias=True)
      (lora): LoRALayer()
    )
    (3): ReLU()
    (4): LinearWithLoRAMerged(
      (linear): Linear(in_features=256, out_features=10, bias=True)
      (lora): LoRALayer()
    )
  )
)
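For larger models, replacing layers one index at a time becomes tedious. The sketch below shows one possible way to automate it; `replace_linear_with_lora` is a hypothetical helper written for this illustration, and it uses $LinearWithLoRA$ rather than the merged variant purely for brevity.

```python
import torch
import torch.nn as nn


class LoRALayer(nn.Module):
    def __init__(self, in_dim, out_dim, rank, alpha):
        super().__init__()
        std_dev = 1 / torch.sqrt(torch.tensor(rank).float())
        self.A = nn.Parameter(torch.randn(in_dim, rank) * std_dev)
        self.B = nn.Parameter(torch.zeros(rank, out_dim))
        self.alpha = alpha

    def forward(self, x):
        return self.alpha * (x @ self.A @ self.B)


class LinearWithLoRA(nn.Module):
    def __init__(self, linear, rank, alpha):
        super().__init__()
        self.linear = linear
        self.lora = LoRALayer(linear.in_features, linear.out_features, rank, alpha)

    def forward(self, x):
        return self.linear(x) + self.lora(x)


def replace_linear_with_lora(module, rank, alpha):
    # Hypothetical helper: walk the module tree and wrap every nn.Linear.
    # Wrapped layers are not descended into, so nothing is wrapped twice.
    for name, child in list(module.named_children()):
        if isinstance(child, nn.Linear):
            setattr(module, name, LinearWithLoRA(child, rank, alpha))
        else:
            replace_linear_with_lora(child, rank, alpha)


model = nn.Sequential(
    nn.Linear(784, 128),
    nn.ReLU(),
    nn.Linear(128, 256),
    nn.ReLU(),
    nn.Linear(256, 10),
)
replace_linear_with_lora(model, rank=4, alpha=8)
print(model)
```

In an actual LLM one would typically restrict the replacement to selected layers, for example by checking the layer name before wrapping.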


In [12]:
print(f'Test accuracy orig model:{compute_accuracy(model, test_loader, DEVICE):.2f}%')
print(f'Test accuracy LoRA model:{compute_accuracy(model_lora, test_loader, DEVICE):.2f}%')

Test accuracy orig model:97.21%
Test accuracy LoRA model:97.21%


## Freezing the Original Linear Layers

Then, we can freeze the original $Linear$ layers and only make the $LoRALayer$ layers trainable, as follows:

In [13]:
def freeze_linear_layers(model):
    for child in model.children():
        if isinstance(child, nn.Linear):
            for param in child.parameters():
                param.requires_grad=False
        else:
            # recursively freeze linear layers in children modules
            freeze_linear_layers(child)

freeze_linear_layers(model_lora)
for name, param in model_lora.named_parameters():
    print(f'{name}:{param.requires_grad}')

layers.0.linear.weight:False
layers.0.linear.bias:False
layers.0.lora.A:True
layers.0.lora.B:True
layers.2.linear.weight:False
layers.2.linear.bias:False
layers.2.lora.A:True
layers.2.lora.B:True
layers.4.linear.weight:False
layers.4.linear.bias:False
layers.4.lora.A:True
layers.4.lora.B:True


Based on the True and False values above, we can confirm that only the LoRA layers are trainable now (**True means trainable, False means frozen**). In practice, we would then train the network with this LoRA configuration on a new dataset or task. Here, for simplicity, we continue training on the same MNIST training set.
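To see how much the freezing reduces the training effort, one can count the trainable parameters. The sketch below rebuilds an untrained network with the same layer sizes as the model above; `count_parameters` is a hypothetical helper, and the classes are repeated so that the snippet runs on its own.

```python
import torch
import torch.nn as nn


class LoRALayer(nn.Module):
    def __init__(self, in_dim, out_dim, rank, alpha):
        super().__init__()
        std_dev = 1 / torch.sqrt(torch.tensor(rank).float())
        self.A = nn.Parameter(torch.randn(in_dim, rank) * std_dev)
        self.B = nn.Parameter(torch.zeros(rank, out_dim))
        self.alpha = alpha

    def forward(self, x):
        return self.alpha * (x @ self.A @ self.B)


class LinearWithLoRA(nn.Module):
    def __init__(self, linear, rank, alpha):
        super().__init__()
        self.linear = linear
        self.lora = LoRALayer(linear.in_features, linear.out_features, rank, alpha)

    def forward(self, x):
        return self.linear(x) + self.lora(x)


def count_parameters(model, trainable_only=False):
    # Hypothetical helper: sum the sizes of all (or only the trainable) parameters
    return sum(
        p.numel() for p in model.parameters()
        if p.requires_grad or not trainable_only
    )


model = nn.Sequential(
    nn.Linear(784, 128), nn.ReLU(),
    nn.Linear(128, 256), nn.ReLU(),
    nn.Linear(256, 10),
)
for idx in (0, 2, 4):
    model[idx] = LinearWithLoRA(model[idx], rank=4, alpha=8)
    # Freeze the wrapped linear layer; A and B stay trainable
    for param in model[idx].linear.parameters():
        param.requires_grad = False

total = count_parameters(model)
trainable = count_parameters(model, trainable_only=True)
print(f"total: {total}, trainable: {trainable} ({100 * trainable / total:.2f}%)")
```

With rank 4, only roughly 4% of the parameters in this small network remain trainable; for large models with much wider layers the fraction is typically far smaller.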

In [14]:
optimizer_lora=torch.optim.Adam(model_lora.parameters(), lr=learning_rate)
train(num_epochs, model_lora, optimizer_lora, train_loader, DEVICE)
print(f'Test accuracy LoRA finetune: {compute_accuracy(model_lora, test_loader, DEVICE):.2f}%')

Epoch: 001/010|Batch 000/938| Loss: 0.0239
Epoch: 001/010|Batch 400/938| Loss: 0.0857
Epoch: 001/010|Batch 800/938| Loss: 0.0432
Epoch: 001/010 training accuracy: 98.47%
Time elapsed: 0.24 min
Epoch: 002/010|Batch 000/938| Loss: 0.0381
Epoch: 002/010|Batch 400/938| Loss: 0.0532
Epoch: 002/010|Batch 800/938| Loss: 0.0172
Epoch: 002/010 training accuracy: 98.68%
Time elapsed: 0.49 min
Epoch: 003/010|Batch 000/938| Loss: 0.0049
Epoch: 003/010|Batch 400/938| Loss: 0.0515
Epoch: 003/010|Batch 800/938| Loss: 0.0338
Epoch: 003/010 training accuracy: 99.07%
Time elapsed: 0.73 min
Epoch: 004/010|Batch 000/938| Loss: 0.0116
Epoch: 004/010|Batch 400/938| Loss: 0.0180
Epoch: 004/010|Batch 800/938| Loss: 0.3807
Epoch: 004/010 training accuracy: 98.66%
Time elapsed: 0.97 min
Epoch: 005/010|Batch 000/938| Loss: 0.0021
Epoch: 005/010|Batch 400/938| Loss: 0.0014
Epoch: 005/010|Batch 800/938| Loss: 0.0160
Epoch: 005/010 training accuracy: 98.88%
Time elapsed: 1.21 min
Epoch: 006/010|Batch 000/938| Loss:

In [15]:
print(f'Test accuracy orig model:{compute_accuracy(model, test_loader, DEVICE):.2f}%')
print(f'Test accuracy LoRA model:{compute_accuracy(model_lora, test_loader, DEVICE):.2f}%')

Test accuracy orig model:97.21%
Test accuracy LoRA model:97.46%
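Because the finetuned model differs from the original only in its LoRA matrices, it would suffice to store those, as noted in the overview. The following sketch shows one way this could look, assuming the parameter names contain ".lora." as in the listing printed earlier; the smaller network is used only to keep the example short. Note that $alpha$ is a plain attribute and not part of the state dict, so it would have to be stored separately.

```python
import copy

import torch
import torch.nn as nn


class LoRALayer(nn.Module):
    def __init__(self, in_dim, out_dim, rank, alpha):
        super().__init__()
        std_dev = 1 / torch.sqrt(torch.tensor(rank).float())
        self.A = nn.Parameter(torch.randn(in_dim, rank) * std_dev)
        self.B = nn.Parameter(torch.zeros(rank, out_dim))
        self.alpha = alpha

    def forward(self, x):
        return self.alpha * (x @ self.A @ self.B)


class LinearWithLoRA(nn.Module):
    def __init__(self, linear, rank, alpha):
        super().__init__()
        self.linear = linear
        self.lora = LoRALayer(linear.in_features, linear.out_features, rank, alpha)

    def forward(self, x):
        return self.linear(x) + self.lora(x)


torch.manual_seed(0)
model = nn.Sequential(nn.Linear(784, 128), nn.ReLU(), nn.Linear(128, 10))
for idx in (0, 2):
    model[idx] = LinearWithLoRA(model[idx], rank=4, alpha=8)

# Keep only the entries that belong to a LoRA layer (keys such as "0.lora.A")
lora_state = {k: v for k, v in model.state_dict().items() if ".lora." in k}
print(sorted(lora_state))

lora_values = sum(v.numel() for v in lora_state.values())
all_values = sum(v.numel() for v in model.state_dict().values())
print(f"LoRA values: {lora_values} of {all_values}")

# Load the adapter into another copy of the wrapped base model;
# strict=False tolerates the base weights missing from lora_state
restored = copy.deepcopy(model)
result = restored.load_state_dict(lora_state, strict=False)
print(result.unexpected_keys)
```

The resulting dictionary could then be written to disk with `torch.save` and distributed separately from the pretrained weights.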


# Credit

* https://magazine.sebastianraschka.com/p/lora-and-dora-from-scratch
* https://arxiv.org/abs/2402.09353
* https://arxiv.org/abs/2106.09685
* https://github.com/rasbt/dora-from-scratch