# Overview

**Note: All images come from the sources listed in the Credit section at the bottom.**

Low-rank adaptation (LoRA) is a machine learning technique that modifies a pretrained model (for example, an LLM or vision transformer) to better suit a specific, often smaller, dataset by adjusting only a small, low-rank subset of the model's parameters.

This approach is important because it allows for efficient finetuning of large models on task-specific data, significantly reducing the computational cost and time required for finetuning.

For more details, see the notebook [LoRA From Scratch](https://www.kaggle.com/code/aisuko/lora-from-scratch).

In this notebook, we look at [Weight-Decomposed Low-Rank Adaptation](https://arxiv.org/abs/2402.09353) (DoRA), a new alternative to LoRA that may outperform it by a large margin. We are going to implement both LoRA and DoRA in PyTorch from scratch.

Thanks to Sebastian Raschka, PhD, for his [great write-up](https://magazine.sebastianraschka.com/p/lora-and-dora-from-scratch), which is credited at the bottom.

![](https://cdn.masto.host/sigmoidsocial/media_attachments/files/111/966/286/141/991/915/small/82bb43214ea19389.webp)

# Weight-Decomposed Low-Rank Adaptation (DoRA)

DoRA can be seen as an improvement or extension of LoRA that is built on top of it, and we can now easily adapt some of our previous code to implement DoRA. DoRA can be described in two steps. The first step is to decompose a pretrained weight matrix into a magnitude vector ($m$) and a directional matrix ($V$). The second step is to apply LoRA to the directional matrix $V$ and train the magnitude vector $m$ separately.

The decomposition into magnitude and directional components is inspired by the mathematical principle that **any vector can be represented as the product of its magnitude (a scalar value indicating its length)** and its direction (a unit vector indicating its orientation in space).

![](https://cdn.masto.host/sigmoidsocial/media_attachments/files/111/967/578/668/588/233/original/6444016e59fb5aa1.png)

Illustration of the direction and magnitude of a single vector. For example, the 2D vector [1, 2] can be decomposed into a magnitude of about 2.24 and a directional (unit) vector [0.447, 0.894]. Then 2.24 * [0.447, 0.894] ≈ [1, 2].
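
The numbers in this example can be checked directly. The snippet below is a minimal sketch in PyTorch; the variable names are illustrative and not part of the original notebook.

```python
import torch

# 2D example vector from the text
v = torch.tensor([1.0, 2.0])

# magnitude: the L2 norm (length) of the vector
magnitude = v.norm(p=2)

# direction: the unit vector pointing the same way as v
direction = v / magnitude

print(round(magnitude.item(), 3))                 # 2.236
print(direction)                                  # roughly [0.447, 0.894]

# multiplying magnitude and direction recovers the original vector
print(torch.allclose(magnitude * direction, v))   # True
```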

In DoRA, we apply the decomposition into magnitude and directional components to a whole pretrained weight matrix $W$ instead of a single vector, where each column (vector) of the weight matrix corresponds to the weights connecting all inputs to a particular output neuron.

The result of decomposing $W$ is a magnitude vector $m$, which holds the scale or length of each column vector in the weight matrix, and a directional matrix $V$, which holds the normalized columns, as illustrated in the figure below.

![](https://cdn.masto.host/sigmoidsocial/media_attachments/files/111/967/632/855/320/899/original/ecf712f2fac89b88.png)

Illustration of the weight matrix decomposition in DoRA
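
The same idea can be sketched for a matrix. The example below uses a small random matrix as a stand-in for a pretrained weight (an assumption for illustration) and takes the norm with `dim=0`, which is the convention used by the layer code in this notebook.

```python
import torch

torch.manual_seed(0)

# toy stand-in for a pretrained weight matrix (hypothetical values)
W = torch.randn(3, 4)

# magnitude vector: one L2 norm per column, shape (1, 4)
m = W.norm(p=2, dim=0, keepdim=True)

# directional matrix: every column rescaled to unit length
V = W / m

print(m.shape)                    # torch.Size([1, 4])
print(V.norm(p=2, dim=0))         # every entry is approximately 1.0
print(torch.allclose(m * V, W))   # True: m and V together reproduce W
```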

Then, DoRA takes the directional matrix $V$ and applies standard LoRA, for instance:

$$W^{\prime}=\frac{m(V+\Delta V)}{norm}=\frac{m(W+AB)}{norm}$$

The normalization, abbreviated here as `norm` to keep this overview simple, is based on the weight normalization method proposed in Salimans and Kingma's paper [Weight Normalization: A Simple Reparameterization to Accelerate Training of Deep Neural Networks](https://arxiv.org/abs/1602.07868).
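
For reference, weight normalization reparameterizes a weight vector $\mathbf{w}$ in terms of a scalar magnitude $g$ and a direction vector $\mathbf{v}$ (the symbols here follow the linked paper, not this notebook):

$$\mathbf{w}=\frac{g}{||\mathbf{v}||}\mathbf{v}$$

In DoRA, the magnitude vector $m$ plays the role of $g$, with one entry per column of the weight matrix.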

The DoRA two-step process (decomposing a pretrained weight matrix and applying LoRA to the directional matrix) is further illustrated in the figure from the DoRA paper below.

![](https://cdn.masto.host/sigmoidsocial/media_attachments/files/111/967/675/274/524/590/original/52ec39d84afdc908.webp)


The motivation for developing DoRA is based on analyzing and comparing the LoRA and full finetuning learning patterns. The DoRA authors found that LoRA either increases or decreases magnitude and direction updates proportionally but seems to lack the capability to make only subtle directional changes as found in full finetuning. Hence, the researchers propose the decoupling of magnitude and directional components.

In other words, their DoRA method aims to apply LoRA only to the directional component, $V$, while also allowing the magnitude component, $m$, to be trained separately.
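
To make "magnitude change" and "directional change" concrete, the sketch below measures both between two weight matrices. It is written in the spirit of the analysis described above, not copied from the paper: the function name and the exact averaging are assumptions, so check the DoRA paper for the authors' precise definitions.

```python
import torch
import torch.nn.functional as F

def magnitude_direction_change(W0, W1):
    """Average per-column magnitude change and directional change between W0 and W1."""
    m0 = W0.norm(p=2, dim=0)
    m1 = W1.norm(p=2, dim=0)
    delta_m = (m1 - m0).abs().mean()
    # cosine similarity between matching columns of the two matrices
    cos = F.cosine_similarity(W0, W1, dim=0)
    delta_d = (1 - cos).mean()
    return delta_m.item(), delta_d.item()

torch.manual_seed(0)
W0 = torch.randn(8, 6)

# pure rescaling: the magnitude changes, the direction does not
print(magnitude_direction_change(W0, 1.5 * W0))

# small additive update: in general both change
print(magnitude_direction_change(W0, W0 + 0.1 * torch.randn(8, 6)))
```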

Introducing the magnitude vector $m$ adds about $0.01\%$ more parameters compared to LoRA. However, across both LLM and vision transformer benchmarks, the authors found that DoRA outperforms LoRA even when the DoRA rank is halved, that is, when DoRA uses only about half the parameters of regular LoRA, as shown in the performance comparison below.

![](https://cdn.masto.host/sigmoidsocial/media_attachments/files/111/967/733/748/743/976/original/cf8372d4af905b1b.webp)

So, it seems that DoRA is much more robust to changes in rank. The possibility to successfully use DoRA with relatively small ranks makes this method even more parameter-efficient than LoRA.


## Implementing DoRA Layers in PyTorch

Previously, we said that a pretrained weight $W_{0}$ can be decomposed into a magnitude $m$ and a directional component $V$. That is, we have the following equation:

$$W_{0}=m \frac{V}{||V||_{c}}=||W||_{c} \frac{W}{||W||_{c}}$$

Where $||V||_{c}$ is the vector-wise norm of $V$ across each column. Then we can write DoRA including the LoRA weight update $BA$ as shown below:

$$W^{\prime}=\underline{m} \frac{V+\Delta V}{||V+\Delta V||_{c}}=\underline{m} \frac{W_{0}+\underline{BA}}{||W_{0}+\underline{BA}||_{c}}$$

Here, $\Delta V$ is the update to the directional component, matrix $V$. The underlined terms, $m$ and $BA$, are the trainable parameters.
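
One consequence of this formula is that, when the low-rank update is zero, $W^{\prime}$ equals $W_{0}$, so finetuning starts exactly from the pretrained weights. The sketch below checks this with plain tensors. The sizes and the random `W0` are assumptions for illustration, and the shapes of `A` and `B` follow the `LoRALayer` convention used later in this notebook.

```python
import torch

torch.manual_seed(0)
in_dim, out_dim, rank = 6, 4, 2

W0 = torch.randn(out_dim, in_dim)          # stand-in for a pretrained weight
m = W0.norm(p=2, dim=0, keepdim=True)      # magnitude initialized from W0

A = torch.randn(in_dim, rank)
B = torch.zeros(rank, out_dim)             # zero init, so the update starts at 0

delta = (A @ B).T                          # low-rank update, shape (out_dim, in_dim)
combined = W0 + delta
W_prime = m * combined / combined.norm(p=2, dim=0, keepdim=True)

print(torch.allclose(W_prime, W0))         # True: nothing has changed yet
```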

In [1]:
import torch

if torch.cuda.is_available():
    torch.backends.cudnn.deterministic=True
    DEVICE=torch.device('cuda')
else:
    DEVICE=torch.device('cpu')

DEVICE

device(type='cuda')

Compare this to the `LinearWithLoRAMerged` class in [LoRA From Scratch](https://www.kaggle.com/code/aisuko/lora-from-scratch). Both classes integrate a `LoRALayer` to augment the original linear layer's weights, but DoRA adds **weight normalization** and a learnable **magnitude adjustment**.

As the code below shows, `LinearWithDoRAMerged` introduces an additional step involving dynamic normalization of the augmented weights. After combining the original weights with the LoRA-adjusted weights (`self.linear.weight+self.lora.alpha*lora.T`, called `numerator` in the code), it calculates the **norm of these combined weights across columns** (`denominator`). Then, it **normalizes the combined weights** by dividing them by their norms (`directional_component=numerator/denominator`). This step ensures that each column of the combined weight matrix has unit norm, which can help stabilize the learning process by maintaining the scale of weight updates.

DoRA also introduces a learnable vector `self.m`, **which represents the magnitude of each column of the normalized weight matrix**. This parameter allows the model to dynamically adjust the scale of each weight vector in the combined weight matrix during training. This additional flexibility can help the model better capture the importance of different features.

In summary, `LinearWithDoRAMerged` extends the concept of `LinearWithLoRAMerged` by incorporating dynamic weight normalization and scaling to improve the training performance.
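
The two properties described above (unit-norm columns, and a scale set only by the magnitude vector) can be checked with plain tensors once the low-rank update is non-zero. This is a minimal sketch: the sizes, the random tensors, and the variable names are assumptions for illustration, and `B` is given small random values to mimic a partially trained layer.

```python
import torch

torch.manual_seed(0)
in_dim, out_dim, rank, alpha = 6, 4, 2, 4

W0 = torch.randn(out_dim, in_dim)            # stand-in for a pretrained weight
m = W0.norm(p=2, dim=0, keepdim=True)        # magnitude vector
A = torch.randn(in_dim, rank)
B = 0.1 * torch.randn(rank, out_dim)         # non-zero, as if some training had happened

combined_weight = W0 + alpha * (A @ B).T
column_norm = combined_weight.norm(p=2, dim=0, keepdim=True)
V = combined_weight / column_norm            # direction only
new_weight = m * V                           # direction rescaled by the magnitude vector

print(V.norm(p=2, dim=0))                    # every entry is approximately 1.0
print(torch.allclose(new_weight.norm(p=2, dim=0, keepdim=True), m))   # True
```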


In [2]:
import torch.nn as nn
import torch.nn.functional as F

class LoRALayer(nn.Module):
    def __init__(self, in_dim, out_dim, rank, alpha):
        super().__init__()
        std_dev=1/torch.sqrt(torch.tensor(rank).float())
        self.A=nn.Parameter(torch.randn(in_dim, rank)*std_dev)
        self.B=nn.Parameter(torch.zeros(rank, out_dim))
        self.alpha=alpha
    
    def forward(self, x):
        # @ is the matrix multiplication operator
        x=self.alpha*(x @ self.A @ self.B)
        return x


class LinearWithDoRA(nn.Module):
    def __init__(self, linear, rank, alpha):
        super().__init__()
        self.linear=linear
        self.lora=LoRALayer(
            linear.in_features, linear.out_features, rank, alpha
        )
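        # magnitude vector, initialized to the column norms of the pretrained weight
        self.m=nn.Parameter(self.linear.weight.norm(p=2, dim=0, keepdim=True))
        
    def forward(self, x):
        # column norms of the combined weight W0 + alpha*(A@B).T
        lora=self.lora.A @ self.lora.B
        combined_weight=self.linear.weight+self.lora.alpha*lora.T
        column_norm=combined_weight.norm(p=2, dim=0, keepdim=True)
        # scaling the weight columns by m/column_norm is the same as scaling the input features
        x_scaled=x*(self.m/column_norm)
        linear_output=self.linear(x_scaled)
        dora_output=self.lora(x_scaled)
        return linear_output+dora_output


# equivalent to LinearWithDoRA, but builds the merged weight explicitly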
class LinearWithDoRAMerged(nn.Module):
    def __init__(self, linear, rank, alpha):
        super().__init__()
        self.linear=linear
        self.lora=LoRALayer(
            linear.in_features, linear.out_features, rank, alpha
        )
        self.m=nn.Parameter(self.linear.weight.norm(p=2, dim=0, keepdim=True))
        
    def forward(self, x):
        lora=self.lora.A @self.lora.B
        numerator=self.linear.weight+self.lora.alpha*lora.T
        denominator=numerator.norm(p=2, dim=0, keepdim=True)
        directional_component=numerator/denominator
        new_weight=self.m*directional_component
        return F.linear(x, new_weight, self.linear.bias)

In [3]:
# hyperparameters
random_seed=123

torch.manual_seed(random_seed)

layer=nn.Linear(10,2)
x=torch.randn(1,10)

# Swapping existing Linear layers

In [4]:
layer_dora_1=LinearWithDoRA(layer, rank=2, alpha=4)
print(layer_dora_1(x))

tensor([[0.6639, 0.4487]], grad_fn=<AddBackward0>)


In [5]:
layer_dora_2=LinearWithDoRAMerged(layer, rank=2, alpha=4)
print(layer_dora_2(x))

tensor([[0.6639, 0.4487]], grad_fn=<AddmmBackward0>)


# Define Multilayer Perceptron Model (Without DoRA)



In [6]:
# hyperparameters
learning_rate=0.005
num_epochs=10

# architecture
num_features=784
num_hidden_1=128
num_hidden_2=256
num_classes=10

class MultilayerPerceptron(nn.Module):
    def __init__(self, num_features, num_hidden_1, num_hidden_2, num_classes):
        super().__init__()
        
        self.layers=nn.Sequential(
            nn.Linear(num_features, num_hidden_1),
            nn.ReLU(),
            nn.Linear(num_hidden_1, num_hidden_2),
            nn.ReLU(),
            nn.Linear(num_hidden_2, num_classes)
        )
    
    def forward(self, x):
        x=self.layers(x)
        return x

model=MultilayerPerceptron(
    num_features=num_features,
    num_hidden_1=num_hidden_1,
    num_hidden_2=num_hidden_2,
    num_classes=num_classes
)

model.to(DEVICE)
optimizer_pretrained=torch.optim.Adam(model.parameters(), lr=learning_rate)

# Loading Dataset

In [7]:
from torchvision import datasets, transforms
from torch.utils.data import DataLoader

BATCH_SIZE=64

# note transforms.ToTensor() scales input images to 0-1 range
train_dataset=datasets.MNIST(root='data', train=True, transform=transforms.ToTensor(), download=True)
test_dataset=datasets.MNIST(root='data', train=False, transform=transforms.ToTensor())

train_loader=DataLoader(dataset=train_dataset, batch_size=BATCH_SIZE, shuffle=False)
test_loader=DataLoader(dataset=test_dataset, batch_size=BATCH_SIZE, shuffle=False)

# checking the dataset
for images, labels in train_loader:
    print('Image batch dimensions:', images.shape)
    print('Image label dimensions:', labels.shape)
    break

Image batch dimensions: torch.Size([64, 1, 28, 28])
Image label dimensions: torch.Size([64])


# Define Evaluation and Training functions

In [8]:
import time

def compute_accuracy(model, data_loader, device):
    model.eval()
    correct_pred, num_examples=0,0
    with torch.no_grad():
        for features, targets in data_loader:
            features=features.view(-1, 28*28).to(device)
            targets=targets.to(device)
            logits=model(features)
            _, predicted_labels=torch.max(logits,1)
            num_examples+=targets.size(0)
            correct_pred+=(predicted_labels==targets).sum()
        return correct_pred.float()/num_examples*100

def train(num_epochs, model, optimizer, train_loader, device):
    start_time=time.time()
    for epoch in range(num_epochs):
        model.train()
        for batch_idx, (features, targets) in enumerate(train_loader):
            features=features.view(-1, 28*28).to(device)
            targets=targets.to(device)
            
            # forward pass and backpropagation
            logits=model(features)
            loss=F.cross_entropy(logits, targets)
            optimizer.zero_grad()
            
            loss.backward()
            
            # update model parameters
            optimizer.step()
            
            #logging
            if not batch_idx %400:
                print('Epoch %03d/%03d | Batch %03d/%03d | Loss: %.4f' % (epoch+1, num_epochs, batch_idx, len(train_loader), loss))
        with torch.set_grad_enabled(False):
            print('Epoch: %03d/%03d training accuracy: %0.2f%%' % (epoch+1, num_epochs, compute_accuracy(model, train_loader, device)))
        print('Time elapsed: %.2f min' % ((time.time()- start_time)/60))
    print('Total Training Time: %.2f min' % ((time.time()-start_time)/60))
    

train(num_epochs, model, optimizer_pretrained, train_loader, DEVICE)
print(f'Test accuracy: {compute_accuracy(model, test_loader, DEVICE):.2f}%')

Epoch 001/010 | Batch 000/938 | Loss: 2.3110
Epoch 001/010 | Batch 400/938 | Loss: 0.1470
Epoch 001/010 | Batch 800/938 | Loss: 0.2974
Epoch: 001/010 training accuracy: 95.42%
Time elapsed: 0.24 min
Epoch 002/010 | Batch 000/938 | Loss: 0.0643
Epoch 002/010 | Batch 400/938 | Loss: 0.1181
Epoch 002/010 | Batch 800/938 | Loss: 0.1319
Epoch: 002/010 training accuracy: 96.01%
Time elapsed: 0.48 min
Epoch 003/010 | Batch 000/938 | Loss: 0.1533
Epoch 003/010 | Batch 400/938 | Loss: 0.0541
Epoch 003/010 | Batch 800/938 | Loss: 0.1072
Epoch: 003/010 training accuracy: 97.22%
Time elapsed: 0.71 min
Epoch 004/010 | Batch 000/938 | Loss: 0.1163
Epoch 004/010 | Batch 400/938 | Loss: 0.0741
Epoch 004/010 | Batch 800/938 | Loss: 0.1829
Epoch: 004/010 training accuracy: 97.68%
Time elapsed: 0.95 min
Epoch 005/010 | Batch 000/938 | Loss: 0.0283
Epoch 005/010 | Batch 400/938 | Loss: 0.1460
Epoch 005/010 | Batch 800/938 | Loss: 0.1727
Epoch: 005/010 training accuracy: 96.90%
Time elapsed: 1.18 min
Epoch

In [9]:
import copy

model_dora=copy.deepcopy(model)
print(model_dora)

MultilayerPerceptron(
  (layers): Sequential(
    (0): Linear(in_features=784, out_features=128, bias=True)
    (1): ReLU()
    (2): Linear(in_features=128, out_features=256, bias=True)
    (3): ReLU()
    (4): Linear(in_features=256, out_features=10, bias=True)
  )
)


In [10]:
model_dora.layers[0]=LinearWithDoRAMerged(model_dora.layers[0], rank=4, alpha=8)
model_dora.layers[2]=LinearWithDoRAMerged(model_dora.layers[2], rank=4, alpha=8)
model_dora.layers[4]=LinearWithDoRAMerged(model_dora.layers[4], rank=4, alpha=8)

model_dora.to(DEVICE)
optimizer_dora=torch.optim.Adam(model_dora.parameters(), lr=learning_rate)

print(model_dora)

MultilayerPerceptron(
  (layers): Sequential(
    (0): LinearWithDoRAMerged(
      (linear): Linear(in_features=784, out_features=128, bias=True)
      (lora): LoRALayer()
    )
    (1): ReLU()
    (2): LinearWithDoRAMerged(
      (linear): Linear(in_features=128, out_features=256, bias=True)
      (lora): LoRALayer()
    )
    (3): ReLU()
    (4): LinearWithDoRAMerged(
      (linear): Linear(in_features=256, out_features=10, bias=True)
      (lora): LoRALayer()
    )
  )
)


# Freeze the original weights

In [11]:
def freeze_linear_layers(model):
    for child in model.children():
        if isinstance(child, nn.Linear):
            for param in child.parameters():
                param.requires_grad=False
        else:
            # recursively freeze linear layers in children modules
            freeze_linear_layers(child)

freeze_linear_layers(model_dora)

# check if linear layers are frozen
for name, param in model_dora.named_parameters():
    print(f'{name}: {param.requires_grad}')

layers.0.m: True
layers.0.linear.weight: False
layers.0.linear.bias: False
layers.0.lora.A: True
layers.0.lora.B: True
layers.2.m: True
layers.2.linear.weight: False
layers.2.linear.bias: False
layers.2.lora.A: True
layers.2.lora.B: True
layers.4.m: True
layers.4.linear.weight: False
layers.4.linear.bias: False
layers.4.lora.A: True
layers.4.lora.B: True
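
Besides listing the flags, it can be useful to count how many parameters remain trainable after freezing. The helper below is a hypothetical addition (the name `count_parameters` is not part of the original notebook), shown on a toy layer so that it runs on its own.

```python
import torch.nn as nn

def count_parameters(model):
    """Return (trainable, total) parameter counts for a module."""
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total = sum(p.numel() for p in model.parameters())
    return trainable, total

# toy check on a single linear layer: 10*2 weights + 2 biases
toy = nn.Linear(10, 2)
print(count_parameters(toy))        # (22, 22)

# after freezing, nothing in this layer is trainable
for p in toy.parameters():
    p.requires_grad = False
print(count_parameters(toy))        # (0, 22)
```

Applied to `model_dora`, the trainable count would cover only the `m`, `lora.A`, and `lora.B` tensors listed as `True` above.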


In [12]:
optimizer_dora=torch.optim.Adam(model_dora.parameters(), lr=learning_rate)
train(num_epochs, model_dora, optimizer_dora, train_loader, DEVICE)
print(f'Test accuracy DoRA finetune: {compute_accuracy(model_dora, test_loader, DEVICE):.2f}%')

Epoch 001/010 | Batch 000/938 | Loss: 0.0049
Epoch 001/010 | Batch 400/938 | Loss: 0.0406
Epoch 001/010 | Batch 800/938 | Loss: 0.1579
Epoch: 001/010 training accuracy: 98.90%
Time elapsed: 0.26 min
Epoch 002/010 | Batch 000/938 | Loss: 0.0279
Epoch 002/010 | Batch 400/938 | Loss: 0.0279
Epoch 002/010 | Batch 800/938 | Loss: 0.0810
Epoch: 002/010 training accuracy: 98.96%
Time elapsed: 0.52 min
Epoch 003/010 | Batch 000/938 | Loss: 0.0214
Epoch 003/010 | Batch 400/938 | Loss: 0.0106
Epoch 003/010 | Batch 800/938 | Loss: 0.0505
Epoch: 003/010 training accuracy: 99.12%
Time elapsed: 0.78 min
Epoch 004/010 | Batch 000/938 | Loss: 0.0062
Epoch 004/010 | Batch 400/938 | Loss: 0.0302
Epoch 004/010 | Batch 800/938 | Loss: 0.0102
Epoch: 004/010 training accuracy: 98.98%
Time elapsed: 1.04 min
Epoch 005/010 | Batch 000/938 | Loss: 0.0135
Epoch 005/010 | Batch 400/938 | Loss: 0.0144
Epoch 005/010 | Batch 800/938 | Loss: 0.0662
Epoch: 005/010 training accuracy: 99.08%
Time elapsed: 1.30 min
Epoch

In [13]:
print(f'Test accuracy: {compute_accuracy(model, test_loader, DEVICE):.2f}%')
print(f'Test accuracy DoRA finetune: {compute_accuracy(model_dora, test_loader, DEVICE):.2f}%')

Test accuracy: 97.19%
Test accuracy DoRA finetune: 97.62%


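The LoRA result reported in the notebook [LoRA From Scratch](https://www.kaggle.com/code/aisuko/lora-from-scratch) is $97.46\%$. Compared with that, and with the $97.19\%$ of the model before DoRA finetuning, DoRA reaches $97.62\%$ here, so DoRA seems like an effective extension of LoRA, although the differences in this small experiment are modest.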

# Credit

* https://magazine.sebastianraschka.com/p/lora-and-dora-from-scratch
* https://arxiv.org/abs/2402.09353
* https://arxiv.org/abs/2106.09685
* https://github.com/rasbt/dora-from-scratch/blob/main/lora-dora-mlp.ipynb