## LoRA -- Low-Rank Adaptation

Ref: https://iaee.substack.com/p/lora-intuitively-and-exhaustively-explained-e944a6bff46b

### Fine-tuning?

- Fine-tuning is the process of tailoring a machine learning model to a specific application, which can be vital for achieving consistent, high-quality performance.
- LoRA is one method of fine-tuning.

### What, and Why, is Fine Tuning?

![img](https://substackcdn.com/image/fetch/w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc1131461-209a-4d88-a297-cf840ccbe2dd_800x166.png)

- As the state of the art in machine learning has evolved, expectations of model performance have risen, requiring more complex approaches to meet them. In the earlier days of machine learning it was feasible to build a model and train it in a single pass.

- This is still a popular strategy for simple problems, but for more complex problems it can be useful to think of training as two parts: “pre-training”, then “fine tuning”.

- The general idea is to do an initial training pass on a bulk dataset and to then refine the model on a tailored dataset.

![img](https://substackcdn.com/image/fetch/w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F865d7d46-c49e-4872-9b91-f378d611da71_800x337.png)

- This “pre-training” then “fine tuning” strategy allows practitioners to leverage multiple forms of data and use large pre-trained models for specific tasks. As a result, pre-training then fine tuning is a common and incredibly powerful paradigm. It comes with a few difficulties, though.

### Difficulties with Fine Tuning

- The most basic form of fine tuning is to use the same exact process you used to pre-train a model to then fine tune that model on new data. You might train a model on a huge corpus of general text data, for instance, then fine tune that model using the same training strategy on a more specific dataset.

- This can be very expensive and very slow.

- You would need enough memory to store not only the entire model, but also a gradient for every parameter in the model.

- LoRA can help with these issues and more: less GPU memory, less time, and the ability to train larger models.
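To get a feel for the memory point above, here is a rough back-of-the-envelope sketch. The 7-billion-parameter model and the 4-byte (32-bit float) storage are assumptions chosen for illustration, and `training_memory_gb` is a hypothetical helper, not part of any library.

```python
def training_memory_gb(num_params: int, bytes_per_value: int = 4) -> dict:
    """Rough memory for the weights plus one gradient per weight, in GB (1e9 bytes)."""
    weights_gb = num_params * bytes_per_value / 1e9
    grads_gb = num_params * bytes_per_value / 1e9  # one gradient value per parameter
    return {
        "weights": weights_gb,
        "gradients": grads_gb,
        "total": weights_gb + grads_gb,
    }

# Hypothetical 7-billion-parameter model stored as 32-bit floats.
estimate = training_memory_gb(7_000_000_000)
print(estimate)
```

This estimate leaves out optimizer state and activations, which add to the total in a real training run.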

### LoRA in a Nutshell

- “Low-Rank Adaptation” (LoRA) is a form of “parameter efficient fine tuning” (PEFT), which allows one to fine tune a large model using a small number of learnable parameters. 

- LoRA employs a few concepts which, when used together, massively improve fine tuning:
    - We can think of fine tuning as learning changes to parameters, instead of adjusting parameters themselves.
    - We can try to compress those changes into a smaller representation by removing duplicate information.
    - We can “load” our changes by simply adding them to the pre-trained parameters.

### 1. Fine-tuning as Parameter Changes

![img](https://substackcdn.com/image/fetch/w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd795783a-75dd-4bb8-b419-699fd9f968ff_800x216.png)

The most basic approach to fine tuning consists of iteratively updating parameters. Just like normal model training, you have the model make an inference, then update the parameters of the model based on how wrong that inference was.

- LoRA thinks of this slightly differently:
    - Instead of thinking of fine tuning as learning better parameters, you can think of fine tuning as learning parameter changes.
    - You can freeze the model parameters, exactly how they are, and learn the changes to those parameters necessary to make the model perform better at the fine tuned task.

- The loop is very similar to normal training:
    - We freeze the model parameters and learn the changes to those parameters.
    - The model makes an inference, and we measure how wrong that inference was.
    - Instead of updating the model parameters, we update the change in the model parameters.
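A minimal NumPy sketch of this idea, for a single layer. The 4x3 shape, the names `W`, `delta_W` and `forward`, and the hand-picked update are all assumptions for illustration; a real training step would compute the update from a loss.

```python
import numpy as np

rng = np.random.default_rng(0)

# Pre-trained weights of one layer (hypothetical 4x3 shape); frozen, never updated.
W = rng.normal(size=(4, 3))
W_frozen_copy = W.copy()

# The learnable change starts at zero, so the model initially behaves exactly as before.
delta_W = np.zeros_like(W)

def forward(x, W, delta_W):
    """Apply the layer using the frozen weights plus the learned change."""
    return x @ (W + delta_W).T

x = rng.normal(size=(2, 3))   # a batch of two inputs
baseline = x @ W.T            # output of the untouched pre-trained layer
assert np.allclose(forward(x, W, delta_W), baseline)

# A training step would modify delta_W only; one hand-picked update stands in for it here.
delta_W[0, 0] += 0.5

print(np.array_equal(W, W_frozen_copy))  # checks that W itself was never modified
print(delta_W.size == W.size)            # checks that the change holds as many values as W
```

Note that, stored this way, the change matrix is exactly as large as the weights it modifies.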

![img](https://substackcdn.com/image/fetch/w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F911fd8a7-fc78-4ba8-87e4-e63dfd281c4d_800x359.png)

The whole point of LoRA is to make fine tuning smaller and faster, so how does adding more data and extra steps help us do that?

### 2. Parameter Change Compression

![img](https://substackcdn.com/image/fetch/w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F01892d10-8ef7-476f-86a7-71c92f139207_800x235.png)

The weights of a network layer form a matrix, and matrices have certain properties that can be used to condense information.

#### Matrix Property 1. Linear Independence

![img](https://substackcdn.com/image/fetch/w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F210d267a-0a42-4596-bedf-31761daf7a87_800x327.png)

Each of these vectors points in a different direction. You can’t squash or stretch one vector to make it equal to the other, so the vectors are linearly independent.

![img](https://substackcdn.com/image/fetch/w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb9ea4851-7d50-496e-9e87-a6b564ce4cce_800x370.png)


Vectors A and B point in exactly the same direction, while vector C points in a different direction. As a result, no matter how you squash and stretch either A or B, they can never be used to describe C. Therefore, C is linearly independent of A and B. However, you can stretch A to equal B, and vice versa, so A and B are linearly dependent.

![img](https://substackcdn.com/image/fetch/w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fca8a54ff-9e4a-41d9-a411-4935acb6ba09_800x386.png)

Now let's say A and B point in slightly different directions.

A and B can then be used together (with some squashing and stretching) to describe C, and likewise A and B can each be described by the other vectors. In this situation we would say none of the vectors are linearly independent, because every vector can be described using the other vectors in the matrix.

![img](https://substackcdn.com/image/fetch/w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4c722798-7e18-4210-8d8f-c7b3faf97d9a_800x416.png)

- Conceptually speaking, linearly independent vectors can be thought of as containing different information, while linearly dependent vectors contain some duplicate information between them.
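The two situations above can be checked numerically. This is a small sketch with made-up 2-D vectors (not the ones in the figures); `are_parallel` is a hypothetical helper that tests whether one vector is just a stretched copy of the other.

```python
import numpy as np

def are_parallel(u, v, tol=1e-12):
    """True if 2-D vectors u and v lie on the same line (one is a stretch of the other)."""
    return abs(u[0] * v[1] - u[1] * v[0]) < tol

# Case 1: B is A stretched by 2, while C points somewhere else.
A = np.array([1.0, 2.0])
B = np.array([2.0, 4.0])
C = np.array([4.0, 5.0])
print(are_parallel(A, B))  # A and B: linearly dependent?
print(are_parallel(A, C))  # A and C: linearly dependent?

# Case 2: nudge B so it no longer points along A.
B2 = np.array([2.0, 1.0])

# Solve a * A + b * B2 = C for the stretch factors a and b.
coeffs = np.linalg.solve(np.column_stack([A, B2]), C)
print(coeffs)
print(np.allclose(coeffs[0] * A + coeffs[1] * B2, C))
```

In the second case C carries no information that A and B2 do not already hold between them, since it can be rebuilt from the two of them.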

#### Matrix Property 2. Rank

- The idea of rank is to quantify the amount of linear independence within a matrix.
- We can break a matrix down into some number of linearly independent vectors. This form of the matrix is called “reduced row echelon form”.

![img](https://substackcdn.com/image/fetch/w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff66c7ce0-a145-4451-9e85-bad14aa415c9_800x194.png)

- By breaking the matrix down into this form, you can count how many linearly independent vectors can be used to describe the original matrix.
- The number of linearly independent vectors is the “rank” of the matrix (4 in the above case).
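As a quick illustration of rank, the sketch below uses `np.linalg.matrix_rank` on a made-up 4x4 matrix (not the one in the figure). NumPy computes rank from a singular value decomposition rather than by row reduction, but the resulting count of independent vectors is the same idea.

```python
import numpy as np

# Hypothetical 4x4 matrix: row 3 = row 1 + row 2, and row 4 = 2 * row 1,
# so only two rows carry information the others do not.
M = np.array([
    [1.0, 0.0, 2.0, 1.0],
    [0.0, 1.0, 1.0, 3.0],
    [1.0, 1.0, 3.0, 4.0],
    [2.0, 0.0, 4.0, 2.0],
])

rank_M = int(np.linalg.matrix_rank(M))          # rank of the redundant matrix
rank_I = int(np.linalg.matrix_rank(np.eye(4)))  # identity: every row is independent

print(rank_M, rank_I)
```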

#### Matrix Property 3. Matrix Factors

- So, matrices can contain some level of “duplicate information” in the form of linear dependence.
- We can exploit this idea using factorization to represent a large matrix in terms of two smaller matrices. Just as a large number can be represented as the product of two smaller numbers, a matrix with enough linear dependence can be thought of as the product of two smaller matrices.

![img](https://substackcdn.com/image/fetch/w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F99835497-f13d-4e9a-85d4-0540477ffdc1_800x194.png)

The two vectors on the left, when multiplied together, give the matrix on the right. Crucially, even though they represent the same values, the vectors on the left occupy 40% of the space that the matrix on the right occupies.

- The larger the matrix, the more you can save by using this factorization trick.

- This idea of factorization is what allows LoRA to occupy such a small memory footprint.
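Here is a small sketch of the factorization trick. The 5x5 size is an assumption, picked because two length-5 vectors hold 10 numbers against 25 in the full matrix, which matches the 40% figure; the values in `u` and `v` are made up, and `factor_ratio` is a hypothetical helper.

```python
import numpy as np

u = np.array([1.0, 2.0, 3.0, 4.0, 5.0]).reshape(5, 1)  # 5x1 factor
v = np.array([2.0, 0.0, 1.0, 3.0, 1.0]).reshape(1, 5)  # 1x5 factor

M = u @ v                                  # the full 5x5 matrix the factors describe
ratio = (u.size + v.size) / M.size         # numbers stored in factors vs. in the matrix
rank_M = int(np.linalg.matrix_rank(M))     # every row is a multiple of v

def factor_ratio(n: int) -> float:
    """Storage of two length-n vectors relative to the n x n matrix they produce."""
    return (2 * n) / (n * n)

print(M.shape, ratio, rank_M)
print(factor_ratio(5), factor_ratio(1000))
```

The saving only applies when the matrix really is this redundant: a product of two vectors always has rank 1.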

#### The Core Idea Behind LoRA

LoRA thinks of tuning not as adjusting parameters, but as learning parameter changes. With LoRA we don’t learn the parameter changes directly, however; we learn the factors of the parameter change matrix.

![img](https://substackcdn.com/image/fetch/w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F939c41f0-cf60-4cf3-b36d-be60e7aef880_800x222.png)

- This idea of learning factors of the change matrix relies on the core assumption that weight matrices within a large language model have a lot of linear dependence, as a result of having significantly more parameters than is theoretically required.

- The idea behind LoRA is that, once you’ve learned the general task with pre-training, you can do fine tuning with significantly less information.

From the LoRA paper:

> learned over-parametrized models in fact reside on a low intrinsic dimension. We hypothesize that the change in weights during model adaptation also has a low “intrinsic rank”, leading to our proposed Low-Rank Adaptation (LoRA) approach. LoRA allows us to train some dense layers in a neural network indirectly by optimizing rank decomposition matrices of the dense layers’ change during adaptation instead, while keeping the pre-trained weights frozen 

This results in significantly fewer parameters being trained, which means an overall faster and more storage- and memory-efficient fine tuning process.
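To put a rough number on "significantly fewer parameters", the sketch below compares a full change matrix with its two factors. The 4096x4096 layer size and rank 8 are assumptions for illustration, and `lora_param_counts` is a hypothetical helper.

```python
def lora_param_counts(d_out: int, d_in: int, r: int) -> tuple[int, int]:
    """Return (values in a full change matrix, values in its two rank-r factors)."""
    full = d_out * d_in          # one learnable value per weight
    lora = r * (d_out + d_in)    # a (d_out x r) factor plus an (r x d_in) factor
    return full, lora

# Hypothetical 4096x4096 weight matrix with rank r = 8.
full, lora = lora_param_counts(4096, 4096, 8)
print(full, lora, lora / full)
```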

#### Fine-Tuning Flow with LoRA

- First, we freeze the model parameters. We’ll use these parameters to make inferences, but we won’t update them.

![img](https://substackcdn.com/image/fetch/w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ae8b590-41ff-42d0-99d2-c481a386ee82_800x173.png)

- We create two matrices. These are sized in such a way that, when they’re multiplied together, the result is the same size as the weight matrix we’re fine tuning.
- In a large model, with multiple weight matrices, you would create one of these pairs for each weight matrix.

![img](https://substackcdn.com/image/fetch/w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb3dafddc-df1c-4a5d-9a5c-fe5862bc3a83_800x184.png)

We then calculate the change matrix:

![img](https://substackcdn.com/image/fetch/w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd43a7cc8-7610-4aa3-a7a8-3bbd4e2403d1_800x189.png)

Then we pass our input through both the frozen weights and the change matrix.

![img](https://substackcdn.com/image/fetch/w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F98a46d28-3340-4d78-9de1-fd2d5d980190_800x384.png)

We calculate a loss based on the combination of both outputs, then update matrices A and B based on that loss.

![img](https://substackcdn.com/image/fetch/w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fef177f86-d5a3-42f8-a097-c09e3acf7fb9_800x374.png)

- We repeat this operation until we’ve optimized the factors of the change matrix for our fine tuning task.
- The step that updates matrices A and B is much faster than updating the full set of model parameters, because A and B are significantly smaller.
- This is why, despite the extra operations in the training process, LoRA is still typically faster than traditional fine-tuning.
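The flow above can be sketched end to end for a single linear layer in plain NumPy. Everything here is an assumption for illustration: the sizes, the synthetic task, the mean-squared-error loss, plain gradient descent, and the convention (borrowed from the LoRA paper) that the change matrix is `B @ A` with `B` initialised to zero. It is a toy, not the notebook's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, r, n = 8, 6, 2, 64   # hypothetical sizes

# Step 1: frozen pre-trained weights (kept only for inference).
W = rng.normal(size=(d_out, d_in))
W_before = W.copy()

# Synthetic fine tuning task: targets come from W plus a hidden rank-r change.
hidden_change = (0.5 * rng.normal(size=(d_out, r))) @ (0.5 * rng.normal(size=(r, d_in)))
x = rng.normal(size=(n, d_in))
targets = x @ (W + hidden_change).T

# Step 2: the two trainable factors. B starts at zero, so the change starts at zero.
A = 0.1 * rng.normal(size=(r, d_in))
B = np.zeros((d_out, r))

def loss_and_grads(A, B):
    delta_W = B @ A                        # Step 3: change matrix
    y = x @ W.T + x @ delta_W.T            # Step 4: frozen path + change path
    err = y - targets
    loss = np.mean(err ** 2)               # Step 5: loss on the combined output
    grad_y = 2.0 * err / err.size
    grad_delta = grad_y.T @ x              # gradient w.r.t. the change matrix
    return loss, grad_delta @ A.T, B.T @ grad_delta   # loss, grad for B, grad for A

initial_loss, _, _ = loss_and_grads(A, B)
lr = 0.1
for _ in range(2000):
    _, grad_B, grad_A = loss_and_grads(A, B)
    A -= lr * grad_A                       # only A and B are ever updated
    B -= lr * grad_B
final_loss, _, _ = loss_and_grads(A, B)

print(initial_loss, final_loss)
print(np.array_equal(W, W_before))         # checks that the frozen weights never moved
```

Only `A` and `B` receive updates here, 28 values in total against the 48 entries of `W`; the gap widens quickly as the layer grows.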

#### At Inference

When we ultimately want to make inferences with this fine tuned model, we can simply compute the change matrix, and add the changes to the weights. 

![img](https://substackcdn.com/image/fetch/w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff563e214-005a-4f41-aede-d9c7e3a59a44_800x157.png)

We can also multiply the change matrix by a scaling factor, allowing us to control the level of impact that change matrix has on the model.
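A short sketch of this merge step, assuming the change matrix is `B @ A` and using random stand-ins for the trained factors. The `merge` helper and the `scale` value of 0.5 are hypothetical. The check confirms that running the merged weights gives the same output as running the frozen path and the change path separately.

```python
import numpy as np

rng = np.random.default_rng(1)
d_in, d_out, r = 8, 6, 2               # hypothetical sizes

W = rng.normal(size=(d_out, d_in))     # frozen pre-trained weights
A = rng.normal(size=(r, d_in))         # random stand-ins for trained factors
B = rng.normal(size=(d_out, r))
scale = 0.5                            # hypothetical scaling factor

def merge(W, A, B, scale=1.0):
    """Fold the scaled change matrix into the weights for inference."""
    return W + scale * (B @ A)

x = rng.normal(size=(3, d_in))
two_path = x @ W.T + scale * (x @ A.T @ B.T)   # frozen path + scaled change path
merged = x @ merge(W, A, B, scale).T           # single matrix multiply after merging

print(np.allclose(two_path, merged))
```

Setting `scale` to 0 recovers the original pre-trained behaviour, and larger values give the learned change more influence.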

#### A Note on LoRA Rank

LoRA has a hyperparameter, named r, which describes the inner dimension shared by the A and B matrices used to construct the change matrix discussed previously. Higher r values mean larger A and B matrices, which means they can encode more linearly independent information in the change matrix.
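The effect of r can be seen in a small experiment. This sketch assumes a 64x48 change matrix built as `B @ A` from random factors; `change_matrix_stats` is a hypothetical helper that reports how many values the factors hold and the rank of the matrix they produce.

```python
import numpy as np

rng = np.random.default_rng(2)
d_out, d_in = 64, 48   # hypothetical layer size (3072 weights)

def change_matrix_stats(r):
    """Return (values stored in A and B, rank of the change matrix B @ A)."""
    B = rng.normal(size=(d_out, r))
    A = rng.normal(size=(r, d_in))
    delta_W = B @ A
    return A.size + B.size, int(np.linalg.matrix_rank(delta_W))

stats = {r: change_matrix_stats(r) for r in (1, 2, 4, 8)}
for r, (n_params, rank) in stats.items():
    print(r, n_params, rank)
```

The rank of the change matrix can never exceed r, so r caps how much independent information the update can carry, while the number of trainable values grows linearly with r.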

![img](https://substackcdn.com/image/fetch/w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2895306a-33a3-402f-94f6-8ed0f36e7589_800x222.png)

![img](https://substackcdn.com/image/fetch/w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd8def686-26de-40d7-b078-da2a59da124b_800x429.png)

![img](https://substackcdn.com/image/fetch/w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd33d2e1c-78db-4749-9146-91b1d3c27d13_800x264.png)

Code to check out: https://github.com/DanielWarfield1/MLWritingAndResearch/blob/main/LoRA.ipynb