## Low-Rank Adaptation (LoRA) 

* Reference paper: [LoRA: Low-Rank Adaptation of Large Language Models](https://arxiv.org/abs/2106.09685)


In [1]:
import torch 

The two main ideas behind LoRA are:
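1. We can write the fine-tuned weight matrix as $W^{\prime} = W_{0} + W_{\Delta}$, where $W_{0}$ holds the original weights and $W_{\Delta}$ is the update learned during fine-tuning,
2. $W_{\Delta}$ can in turn be parameterized as the product of two low-rank matrices, $W_{\Delta} = W_{A} W_{B}$, where $W_{A}$ has shape $m \times r$, $W_{B}$ has shape $r \times n$, and the rank $r$ is much smaller than $m$ and $n$.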
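
To make the second point concrete, here is a minimal sketch showing that the product of an $m \times r$ and an $r \times n$ matrix has the full $m \times n$ shape but a rank of at most $r$. The sizes and the random initialization of both factors are assumptions chosen only for illustration.

```python
import torch

torch.manual_seed(0)

m, n, r = 6, 5, 2            # illustrative sizes, not taken from the paper
w_a = torch.randn(m, r)      # shape (m, r)
w_b = torch.randn(r, n)      # random here only so the product is non-trivial
w_delta = w_a @ w_b          # shape (m, n), same as the weight matrix it updates

# The rank of a product cannot exceed the inner dimension r
rank = int(torch.linalg.matrix_rank(w_delta))
print(w_delta.shape, rank)
```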

Putting the two ideas together, we can freeze the original model and add small, learnable low-rank adapters (hence LoRA) to fine-tune an LLM. This is a critical *discovery* because only the adapters need gradients and optimizer state, which makes it possible to fine-tune LLMs on devices with a limited amount of VRAM.
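
As a rough illustration of the savings, the sketch below compares the number of trainable values in a full $m \times n$ update with the number in its two low-rank factors. The helper names and the sizes (`4096`, rank `8`) are hypothetical and chosen only to show the arithmetic.

```python
def full_param_count(m: int, n: int) -> int:
    """Trainable values when the whole m x n update is learned directly."""
    return m * n


def lora_param_count(m: int, n: int, r: int) -> int:
    """Trainable values in the two factors: (m x r) plus (r x n)."""
    return r * (m + n)


m, n, r = 4096, 4096, 8             # illustrative sizes only
full = full_param_count(m, n)       # 16_777_216
lora = lora_param_count(m, n, r)    # 65_536
ratio = full // lora                # 256
print(full, lora, ratio)
```

The saving only appears when $r$ is much smaller than $m$ and $n$. For a toy $4 \times 4$ matrix with $r = 2$, both counts are 16.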

To ensure we understand how LoRA works, let's implement it with a small example.

Let's first create a weight matrix, $W_0$, that represents the original weights learned during training.

In [2]:
w_0 = torch.randn(4, 4)
m, n = w_0.shape

Now, let's create two low-rank matrices whose product has the same shape as the original weight matrix. We are going to choose an arbitrary rank, $r = 2$. Notice that we initialize $W_{A}$ from a standard normal (Gaussian) distribution and $W_{B}$ with all zeros. This is intentional: because $W_{A} W_{B} = 0$ at initialization, $W^{\prime} = W_{0}$, so the first forward pass returns exactly the same output as the original model.

In [4]:
r = 2
w_a, w_b = torch.randn(m, r), torch.zeros(r, n)
w_prime = w_0 + w_a @ w_b

Let's verify that.

In [8]:
torch.isclose(w_0, w_prime)

tensor([[True, True, True, True],
        [True, True, True, True],
        [True, True, True, True],
        [True, True, True, True]])
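
The same idea can be checked on a forward and backward pass. The following is a minimal sketch, not a full training loop: the input batch `x`, the `target` tensor, and the mean-squared-error loss are hypothetical placeholders. It shows that the frozen matrix receives no gradient, and that with $W_{B}$ initialized to zeros only $W_{B}$ gets a non-zero gradient on the first step.

```python
import torch

torch.manual_seed(0)

m, n, r = 4, 4, 2
w_0 = torch.randn(m, n)                       # frozen: requires_grad is False by default
w_a = torch.randn(m, r, requires_grad=True)   # trainable adapter factor
w_b = torch.zeros(r, n, requires_grad=True)   # trainable adapter factor, zero-initialized

x = torch.randn(3, m)         # hypothetical batch of 3 inputs
target = torch.randn(3, n)    # hypothetical regression targets

# Forward pass through the adapted weights without materializing w_prime
y = x @ w_0 + (x @ w_a) @ w_b

# At initialization the adapter contributes nothing to the output
same_output = torch.allclose(y, x @ w_0)

loss = ((y - target) ** 2).mean()
loss.backward()

print(same_output)                      # output matches the original model
print(w_0.grad is None)                 # the frozen weights get no gradient
print(bool(torch.all(w_a.grad == 0)))   # w_a's gradient is multiplied by w_b, which is zero
print(bool(torch.any(w_b.grad != 0)))   # w_b is what starts learning first
```

Once an optimizer step moves $W_{B}$ away from zero, $W_{A}$ starts receiving non-zero gradients as well.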