# Overview

**Note: All images come from the sources listed in the Credit section at the bottom.**

Low-rank adaptation (LoRA) is a machine learning technique that adapts a pretrained model (for example, an LLM or vision transformer) to a specific, often smaller, dataset by training only a small number of added low-rank parameters while the pretrained weights stay frozen.

This approach is important because it allows for efficient finetuning of large models on task-specific data, significantly reducing the computational cost and time required for finetuning.

For more details, see the notebook [Lora From Scratch](https://www.kaggle.com/code/aisuko/lora-from-scratch).

In this notebook, we look at [Weight-Decomposed Low-Rank Adaptation](https://arxiv.org/abs/2402.09353) (DoRA), a new alternative to LoRA that may outperform it by a large margin. We implement both LoRA and DoRA in PyTorch from scratch.

Thanks to Sebastian Raschka, PhD, for his [great write-up](https://magazine.sebastianraschka.com/p/lora-and-dora-from-scratch), which is credited at the bottom.

![](https://cdn.masto.host/sigmoidsocial/media_attachments/files/111/966/286/141/991/915/small/82bb43214ea19389.webp)

# Weight-Decomposed Low-Rank Adaptation (DoRA)

DoRA can be seen as an improvement or extension of LoRA that is built on top of it, so we can adapt some of our previous LoRA code to implement it. DoRA can be described in two steps. The first step is to decompose a pretrained weight matrix into a magnitude vector ($m$) and a directional matrix ($V$). The second step is to apply LoRA to the directional matrix $V$ and train the magnitude vector $m$ separately.

The decomposition into magnitude and directional components is inspired by the mathematical principle that **any vector can be represented as the product of its magnitude (a scalar value indicating its length) and its direction (a unit vector indicating its orientation in space)**.

![](https://cdn.masto.host/sigmoidsocial/media_attachments/files/111/967/578/668/588/233/original/6444016e59fb5aa1.png)

Illustration of the direction and magnitude of a single vector. For example, if we have a 2D vector [1, 2], we can decompose it into a magnitude of about 2.24 and a directional vector of about [0.447, 0.894]. Then 2.24 * [0.447, 0.894] ≈ [1, 2].
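The single-vector example above can be reproduced in a few lines of PyTorch. This is only a minimal sketch; the variable names are chosen for illustration and do not come from the DoRA paper.

```python
import torch

# The 2D vector from the example above
v = torch.tensor([1.0, 2.0])

# Magnitude: the L2 norm (length) of the vector, sqrt(1^2 + 2^2)
magnitude = v.norm(p=2)

# Direction: the vector rescaled to unit length
direction = v / magnitude

# Multiplying magnitude and direction recovers the original vector
reconstructed = magnitude * direction

print(f"magnitude: {magnitude.item():.3f}")
print("direction:", direction)
print("reconstructed:", reconstructed)
```

The direction has length 1 by construction, so all of the scale information lives in the single scalar `magnitude`.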

In DoRA, we apply the decomposition into magnitude and directional components to a whole pretrained weight matrix $W$ instead of a single vector, where each column (vector) of the weight matrix corresponds to the weights connecting all inputs to a particular output neuron.

So the result of decomposing $W$ is a directional matrix $V$ plus a magnitude vector $m$ that holds the scale or length of each column vector in the weight matrix, as illustrated in the figure below.

![](https://cdn.masto.host/sigmoidsocial/media_attachments/files/111/967/632/855/320/899/original/ecf712f2fac89b88.png)

Illustration of the weight matrix decomposition in DoRA
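The same idea carries over from one vector to every column of a matrix. The sketch below assumes a small random matrix `W` as a stand-in for a pretrained weight. The name `direction` is used here for the column-normalized matrix $V/||V||_{c}$ and is not taken from the paper.

```python
import torch

torch.manual_seed(0)

# Hypothetical stand-in for a pretrained weight matrix
W = torch.randn(3, 4)

# Magnitude vector: one L2 norm per column, shape (1, 4)
m = W.norm(p=2, dim=0, keepdim=True)

# Column-normalized direction: every column has unit length
direction = W / m

# Broadcasting m across the rows rescales each column back to its original length
W_rebuilt = m * direction

print("m shape:", tuple(m.shape))
print("column norms of direction:", direction.norm(p=2, dim=0))
print("reconstruction matches W:", torch.allclose(W_rebuilt, W, atol=1e-6))
```

Taking the norm over `dim=0` is what makes the decomposition column-wise; using `dim=1` would normalize rows instead.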

Then, DoRA takes the directional matrix $V$ and applies standard LoRA, for instance:

$$W^{\prime}=\frac{m(V+\Delta V)}{norm}=\frac{m(W+AB)}{norm}$$

The normalization, abbreviated as `norm` to keep this overview simple, is based on the weight normalization method proposed in Salimans and Kingma's paper [Weight Normalization: A Simple Reparameterization to Accelerate Training of Deep Neural Networks](https://arxiv.org/abs/1602.07868).

The DoRA two-step process (decomposing a pretrained weight matrix and applying LoRA to the directional matrix) is further illustrated in the figure from the DoRA paper below.

![](https://cdn.masto.host/sigmoidsocial/media_attachments/files/111/967/675/274/524/590/original/52ec39d84afdc908.webp)


The motivation for developing DoRA is based on analyzing and comparing the LoRA and full finetuning learning patterns. The DoRA authors found that LoRA either increases or decreases magnitude and direction updates proportionally but seems to lack the capability to make only subtle directional changes as found in full finetuning. Hence, the researchers propose the decoupling of magnitude and directional components.

In other words, their DoRA method aims to apply LoRA only to the directional component, $V$, while also allowing the magnitude component, $m$, to be trained separately.

Introducing the magnitude vector $m$ adds about $0.01\%$ more parameters compared to LoRA. However, across both LLM and vision transformer benchmarks, the authors found that DoRA outperforms LoRA even when the DoRA rank is halved, that is, when DoRA uses only about half the parameters of regular LoRA, as shown in the performance comparison below.

![](https://cdn.masto.host/sigmoidsocial/media_attachments/files/111/967/733/748/743/976/original/cf8372d4af905b1b.webp)

So, it seems that DoRA is much more robust to changes in rank. The possibility to successfully use DoRA with relatively small ranks makes this method even more parameter-efficient than LoRA.
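To get a feel for the parameter budget, the trainable parameters can be counted directly from the matrix shapes. The sketch below assumes a single hypothetical 4096 x 4096 linear layer and one magnitude entry per weight column. The helper names are made up for illustration, and the numbers describe only this toy layer, not the figures reported in the paper.

```python
def lora_params(in_dim: int, out_dim: int, rank: int) -> int:
    # A has shape (in_dim, rank) and B has shape (rank, out_dim)
    return rank * (in_dim + out_dim)


def dora_params(in_dim: int, out_dim: int, rank: int) -> int:
    # Same LoRA matrices plus one magnitude entry per weight column
    # (in_dim entries for a weight of shape (out_dim, in_dim))
    return lora_params(in_dim, out_dim, rank) + in_dim


# Hypothetical layer size, chosen only for illustration
in_dim = out_dim = 4096
full = in_dim * out_dim

lora_r16 = lora_params(in_dim, out_dim, rank=16)
dora_r16 = dora_params(in_dim, out_dim, rank=16)
dora_r8 = dora_params(in_dim, out_dim, rank=8)

print("full layer:  ", full)
print("LoRA rank 16:", lora_r16)
print("DoRA rank 16:", dora_r16)
print("DoRA rank 8: ", dora_r8)
print(f"magnitude vector vs. full layer: {100 * in_dim / full:.4f}%")
```

For this layer, the magnitude vector adds 4096 parameters on top of LoRA, and DoRA at rank 8 needs roughly half the trainable parameters of LoRA at rank 16.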


## Implementing DoRA Layers in PyTorch

Previously, we said that we can decompose a pretrained weight $W_{0}$ into a magnitude $m$ and a directional component $V$. At initialization, this gives the following equation:

$$W_{0}=m \frac{V}{||V||_{c}}=||W||_{c} \frac{W}{||W||_{c}}$$

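Here, $||V||_{c}$ is the vector-wise norm of $V$, taken across each column. Then we can write DoRA including the LoRA weight update $BA$ as shown below. (The DoRA paper writes the update as $BA$, while the code in this notebook computes `A @ B` with transposed shapes; both describe the same low-rank product.)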

$$W^{\prime}=\underline{m} \frac{V+\Delta V}{||V+\Delta V||_{c}}=\underline{m} \frac{W_{0}+\underline{BA}}{||W_{0}+\underline{BA}||_{c}}$$

Here, $\Delta V$ is the update to the directional component, matrix $V$.
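The equation above can be sketched as a standalone function before wrapping it in a layer. This is a minimal sketch under a few assumptions: the function name `dora_weight` is hypothetical, `A` and `B` use the `(in_dim, rank)` and `(rank, out_dim)` shapes from this notebook's LoRA layer, and the `alpha` scaling factor comes from the LoRA code rather than from the equation itself.

```python
import torch


def dora_weight(W0, A, B, m, alpha=1.0):
    """Return m * (W0 + alpha * dW) / ||W0 + alpha * dW||_c with a column-wise norm."""
    delta = (A @ B).T                                    # low-rank update, shape (out_dim, in_dim)
    numerator = W0 + alpha * delta                       # V + delta V
    col_norm = numerator.norm(p=2, dim=0, keepdim=True)  # one norm per column
    return m * numerator / col_norm


torch.manual_seed(0)
in_dim, out_dim, rank = 6, 3, 2

W0 = torch.randn(out_dim, in_dim)       # stand-in for a pretrained weight
A = torch.randn(in_dim, rank)
B = torch.zeros(rank, out_dim)          # zero init, so the update starts at zero
m = W0.norm(p=2, dim=0, keepdim=True)   # magnitude initialized to ||W0||_c

# With B = 0 the merged weight equals the pretrained weight
W_init = dora_weight(W0, A, B, m)
print("W' equals W0 at init:", torch.allclose(W_init, W0, atol=1e-6))

# After a nonzero update, direction changes but column norms are still set by m
B = torch.randn(rank, out_dim)
W_new = dora_weight(W0, A, B, m, alpha=0.5)
print("column norms equal m:", torch.allclose(W_new.norm(p=2, dim=0, keepdim=True), m, atol=1e-6))
```

The second check shows the decoupling in action: the low-rank update can only rotate each column, while its length is controlled entirely by the trainable vector `m`.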

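In [14]:
import torch
import torch.nn as nn
import torch.nn.functional as F

class LoRALayer(nn.Module):
    def __init__(self, in_dim, out_dim, rank, alpha):
        super().__init__()
        # A is randomly initialized and scaled by 1/sqrt(rank);
        # B starts at zero, so the initial update A @ B is zero
        std_dev=1/torch.sqrt(torch.tensor(rank).float())
        self.A=nn.Parameter(torch.randn(in_dim, rank)*std_dev)
        self.B=nn.Parameter(torch.zeros(rank, out_dim))
        self.alpha=alpha
    
    def forward(self, x):
        # @ is the matrix multiplication operator
        x=self.alpha*(x @ self.A @ self.B)
        return x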

class LinearWithDoRA(nn.Module):
    def __init__(self, linear, rank, alpha):
        super().__init__()
        self.linear=linear
        self.lora=LoRALayer(
            linear.in_features, linear.out_features, rank, alpha
        )
        self.m=nn.Parameter(torch.ones(1, linear.out_features))
        
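    def forward(self, x):
        linear_output=self.linear(x)
        lora_output=self.lora(x)
        # The small epsilon avoids division by zero: B starts at zero,
        # so lora_output is all zeros at initialization
        lora_output_norm=lora_output/(lora_output.norm(p=2, dim=1, keepdim=True)+1e-9)
        # Scale the normalized LoRA output by the trainable magnitude m.
        # Note: this normalizes the LoRA output per sample, not the weight columns;
        # LinearWithDoRAMerged below follows the weight-space equation.
        dora_modification=self.m * lora_output_norm
        return linear_output+dora_modification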
        
    
class LinearWithDoRAMerged(nn.Module):
    def __init__(self, linear, rank, alpha):
        super().__init__()
        self.linear=linear
        self.lora=LoRALayer(
            linear.in_features, linear.out_features, rank, alpha
        )
        self.m=nn.Parameter(self.linear.weight.norm(p=2, dim=0, keepdim=True))
        
    def forward(self, x):
        lora=self.lora.A @self.lora.B
        numerator=self.linear.weight+self.lora.alpha*lora.T
        denominator=numerator.norm(p=2, dim=0, keepdim=True)
        directional_component=numerator/denominator
        new_weight=self.m*directional_component
        return F.linear(x, new_weight, self.linear.bias)

In [15]:
# `layer` and `x` are assumed to be defined in an earlier cell of the notebook
layer_dora_1=LinearWithDoRA(layer, rank=2, alpha=4)
print(layer_dora_1(x))

tensor([[0.6639, 0.4487]], grad_fn=<AddBackward0>)


# Credit

* https://magazine.sebastianraschka.com/p/lora-and-dora-from-scratch
* https://arxiv.org/abs/2402.09353
* https://arxiv.org/abs/2106.09685
* https://github.com/rasbt/dora-from-scratch/blob/main/lora-dora-mlp.ipynb