## Low-Rank Adaptation (LoRA) 

* Reference paper: [LoRA: Low-Rank Adaptation of Large Language Models](https://arxiv.org/abs/2106.09685)


In [1]:
import torch 

The two main ideas behind LoRA are:
1. We can decompose the updated weight matrix as $W^{\prime} = W_{0} + W_{\Delta}$,
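1. $W_{\Delta}$ can be further decomposed as the product of two low-rank matrices, $W_{\Delta} = W_{A} W_{B}$, where $W_{A}$ has shape $m \times r$, $W_{B}$ has shape $r \times n$, and the rank $r$ is much smaller than $m$ and $n$.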

Putting the two ideas together, we can freeze the original model and add small, learnable low-rank adapters (LoRA) to fine-tune an LLM. This matters because only the adapter weights need gradients and optimizer state, which makes it feasible to fine-tune LLMs on devices with a limited amount of VRAM.
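To get a feel for the savings, the sketch below counts the trainable values in a full update $W_{\Delta}$ versus its two low-rank factors. The helper names and the layer sizes (`4096 x 4096`, rank `8`) are illustrative assumptions, not taken from any specific model.

```python
def full_update_params(m: int, n: int) -> int:
    """Number of entries in a dense m x n update matrix."""
    return m * n


def lora_update_params(m: int, n: int, r: int) -> int:
    """Number of entries in the two factors: (m x r) plus (r x n)."""
    return r * (m + n)


# Illustrative sizes only (assumed for this example).
rows, cols, rank = 4096, 4096, 8

full_count = full_update_params(rows, cols)
lora_count = lora_update_params(rows, cols, rank)

print(f"full update : {full_count:,} parameters")
print(f"LoRA factors: {lora_count:,} parameters")
print(f"ratio       : {lora_count / full_count:.4%}")
```

With these assumed sizes, the two factors hold well under 1% of the entries of the dense update, and the gap widens as the matrices grow while the rank stays small.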

To ensure we understand how LoRA works, let's implement it with a small example.

In [2]:
w_0 = torch.randn(4, 4)
m, n = w_0.shape
r = 2

In [3]:
w_a, w_b = torch.randn(m, r), torch.zeros(r, n)

In [4]:
w_prime = w_0 + w_a @ w_b

In [5]:
torch.isclose(w_0, w_prime)

tensor([[True, True, True, True],
        [True, True, True, True],
        [True, True, True, True],
        [True, True, True, True]])
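Every entry is `True` because `w_b` starts as all zeros, so `w_a @ w_b` is the zero matrix and the adapted weights initially match `w_0` exactly. As a rough sketch of what happens next, the cell below runs a few optimizer steps on toy data while keeping `w_0` frozen. The inputs, targets, learning rate, and loss are all made up for illustration, and this is not meant as a complete training recipe.

```python
import torch

torch.manual_seed(0)

m, n, r = 4, 4, 2
w_0 = torch.randn(m, n)                      # frozen weight: no gradient is tracked
w_a = torch.randn(m, r, requires_grad=True)  # trainable factor, random init
w_b = torch.zeros(r, n, requires_grad=True)  # trainable factor, zero init

# Toy data, assumed purely for this example.
x = torch.randn(3, m)
target = torch.randn(3, n)

# Only the two adapter factors are handed to the optimizer.
optimizer = torch.optim.SGD([w_a, w_b], lr=0.1)

w_0_before = w_0.clone()
for _ in range(5):
    optimizer.zero_grad()
    y = x @ (w_0 + w_a @ w_b)          # forward pass with the adapted weights
    loss = ((y - target) ** 2).mean()  # simple mean squared error
    loss.backward()
    optimizer.step()

# On the first step the gradient for w_a is zero (it is multiplied by w_b),
# so w_b moves first and w_a starts updating from the second step on.
frozen_unchanged = torch.equal(w_0, w_0_before)
adapter_moved = not torch.allclose(w_b, torch.zeros(r, n))
print(frozen_unchanged, adapter_moved)
```

The point of the sketch is that `w_0` never changes, while the much smaller `w_a` and `w_b` absorb the whole update.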

## Scaling up 

Now that we understand the main ideas behind LoRA, let's apply them to a more complex example. We are going to use a small-ish LLM (`microsoft/Phi-3-mini-4k-instruct`) and run a couple of tests.