### Parameter-efficient fine-tuning with LoRA

<i>Low-rank adaptation (LoRA)</i> is one of the most widely used techniques for parameter-efficient fine-tuning. It adapts a pretrained model to a specific, often smaller, dataset by training only a small number of parameters instead of updating every weight. The "low-rank" aspect refers to restricting the weight changes to a lower-dimensional subspace of the full weight parameter space, which captures the most influential directions of change during training. LoRA is popular because it enables efficient fine-tuning of large models on task-specific data, significantly reducing the computational cost and resources that fine-tuning usually requires.

Suppose a large weight matrix W belongs to a specific layer. (LoRA can be applied to all linear layers in an LLM, but we focus on a single layer for illustration.) During backpropagation, we learn a matrix ΔW, which encodes how much to update the original weights to minimise the loss function during training. (From here on, "weight" is shorthand for the model's weight parameters.) In regular training and fine-tuning, the weight update is defined as

W<sub>updated</sub> = W + ΔW

The LoRA method offers a more efficient alternative: instead of computing the full weight update ΔW, it learns an approximation of it:

ΔW ≈ AB

where A and B are two matrices much smaller than W, and AB denotes their matrix product. Using LoRA, we can reformulate the weight update defined earlier:

W<sub>updated</sub> = W + AB
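
The reformulated update can be checked numerically on a toy example. The sketch below is only an illustration: the sizes `d` and `r`, the random matrices, and the input `x` are assumptions chosen for the demo, not values from this notebook. It shows that AB has the same shape as W but rank at most r, and that, by the distributive law, applying the update separately gives the same result as applying the merged matrix W + AB.

```python
import torch

torch.manual_seed(0)

d, r = 6, 2                                   # toy sizes, chosen only for illustration
W = torch.randn(d, d, dtype=torch.float64)    # stands in for a pretrained weight matrix
A = torch.randn(d, r, dtype=torch.float64)    # low-rank factor of shape (d, r)
B = torch.randn(r, d, dtype=torch.float64)    # low-rank factor of shape (r, d)
x = torch.randn(3, d, dtype=torch.float64)    # a batch of 3 input vectors

delta_W = A @ B                               # same shape as W, but rank at most r
W_updated = W + delta_W

# Numerical rank: count singular values that are clearly non-zero
numerical_rank = int((torch.linalg.svdvals(delta_W) > 1e-8).sum())

# Distributive law: x(W + AB) = xW + (xA)B, so W + AB never has to be formed
merged = x @ W_updated
separate = x @ W + (x @ A) @ B

print("Shape of AB:", tuple(delta_W.shape))
print("Numerical rank of AB:", numerical_rank)
print("Merged and separate outputs match:", torch.allclose(merged, separate))
```

Keeping the two terms separate is what allows the pretrained matrix W to stay untouched while only A and B are learned.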

#### Preparing the dataset

In [1]:
import pandas as pd
train_df = pd.read_parquet("../Datasets/train.parquet")
valid_df = pd.read_parquet("../Datasets/valid.parquet")
test_df = pd.read_parquet("../Datasets/test.parquet")

In [2]:
import torch
from torch.utils.data import Dataset
from Chapter05 import tokeniser
from Chapter06 import SpamDataset

train_dataset = SpamDataset(
    "../Datasets/train.parquet",
    max_length=None,
    tokeniser=tokeniser
)
val_dataset = SpamDataset(
    "../Datasets/valid.parquet",
    max_length=None,
    tokeniser=tokeniser
)
test_dataset = SpamDataset(
    "../Datasets/test.parquet",
    max_length=None,
    tokeniser=tokeniser
)

In [3]:
from torch.utils.data import DataLoader

num_workers = 0
batch_size = 8
torch.manual_seed(42)

train_loader = DataLoader(
    dataset=train_dataset,
    batch_size=batch_size,
    shuffle=True,
    num_workers=num_workers,
    drop_last=True
)
val_loader = DataLoader(
    dataset=val_dataset,
    batch_size=batch_size,
    num_workers=num_workers,
    drop_last=False
)
test_loader = DataLoader(
    dataset=test_dataset,
    batch_size=batch_size,
    num_workers=num_workers,
    drop_last=False
)

In [4]:
print("Train loader:")
for input_batch, target_batch in train_loader:
    pass

print("Input batch dimensions:", input_batch.shape)
print("Target batch dimensions:", target_batch.shape)

Train loader:
Input batch dimensions: torch.Size([8, 109])
Target batch dimensions: torch.Size([8])


In [5]:
print(f"{len(train_loader)} training batches")
print(f"{len(val_loader)} validation batches")
print(f"{len(test_loader)} test batches")

130 training batches
19 validation batches
38 test batches
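
The batch counts above depend on both `batch_size` and `drop_last`. As a small illustration, the sketch below uses a hypothetical toy dataset of 20 examples (not the spam dataset) to show how `drop_last` changes the number of batches a `DataLoader` reports.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy dataset with 20 examples (hypothetical size, not the spam dataset)
toy_dataset = TensorDataset(torch.arange(20).unsqueeze(1), torch.zeros(20))

dropped = DataLoader(toy_dataset, batch_size=8, drop_last=True)
kept = DataLoader(toy_dataset, batch_size=8, drop_last=False)

print(len(dropped))  # 2: floor(20 / 8), the final 4 examples are discarded
print(len(kept))     # 3: ceil(20 / 8), the final batch holds only 4 examples
```

With `drop_last=True`, as used for the training loader, every batch has exactly `batch_size` examples; the validation and test loaders keep the final, possibly smaller, batch so that no example is skipped during evaluation.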


#### Initialising the model

In [None]:

from Chapter05 import model_configs, load_weights_into_gpt
from Chapter04 import GPTModel
from gpt_download import download_and_load_gpt2

CHOOSE_MODEL = "gpt2-small (124M)"
INPUT_PROMPT = "Every effort moves"

BASE_CONFIG = {
    "vocab_size": 50_257,
    "context_length": 1_024,
    "drop_rate": 0.0,
    "qkv_bias": True
}

BASE_CONFIG.update(model_configs[CHOOSE_MODEL])
model_size = CHOOSE_MODEL.split(" ")[-1].lstrip("(").rstrip(")")
settings, params = download_and_load_gpt2(
    model_size=model_size,
    models_dir="gpt2"
)

model = GPTModel(BASE_CONFIG)
load_weights_into_gpt(model, params)
model.eval()

In [4]:
# Double-check whether coherent text is generated (here it clearly is not)
from Chapter04 import generate_text_simple
from Chapter05 import text_to_token_ids, token_ids_to_text

text_1 = "Every effort moves you"

token_ids = generate_text_simple(
    model=model,
    idx=text_to_token_ids(text_1, tokeniser),
    max_new_tokens=15,
    context_size=BASE_CONFIG["context_length"]
)
print(token_ids_to_text(token_ids, tokeniser))

Every effort moves you to the point of the "I-don-it-from-the


In [None]:
# Replace output layer for classification fine-tuning
torch.manual_seed(123)
num_classes = 2
model.out_head = torch.nn.Linear(in_features=768, out_features=num_classes)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

In [9]:
# Classification accuracy of not-fine-tuned model
from Chapter06 import calc_accuracy_loader

torch.manual_seed(123)
train_accuracy = calc_accuracy_loader(
    train_loader, model, device, num_batches=10
)
val_accuracy = calc_accuracy_loader(val_loader, model, device, num_batches=10)
test_accuracy = calc_accuracy_loader(test_loader, model, device, num_batches=10)

In [10]:
print(f"Training accuracy: {train_accuracy*100:.2f}%")
print(f"Validation accuracy: {val_accuracy*100:.2f}%")
print(f"Test accuracy: {test_accuracy*100:.2f}%")

Training accuracy: 46.25%
Validation accuracy: 53.75%
Test accuracy: 58.75%


#### Parameter-efficient fine-tuning with LoRA

We define a LoRALayer class that creates the matrices A and B and takes the 'alpha' scaling factor and the 'rank' (r) setting as arguments.

In [11]:
import math

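class LoRALayer(torch.nn.Module):
    def __init__(self, in_dim, out_dim, rank, alpha):
        super().__init__()
        # A has shape (in_dim, rank) and gets a Kaiming-uniform initialisation
        self.A = torch.nn.Parameter(torch.empty(in_dim, rank))
        torch.nn.init.kaiming_uniform_(self.A, a=math.sqrt(5))
        # B has shape (rank, out_dim) and starts at zero, so AB = 0 before training
        self.B = torch.nn.Parameter(torch.zeros(rank, out_dim))
        self.alpha = alpha

    def forward(self, x):
        # Scaled low-rank term: alpha * (x A B)
        x = self.alpha * (x @ self.A @ self.B)
        return x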

The rank governs the inner dimension of matrices A and B. It therefore determines the number of extra parameters LoRA introduces, rank × (in_dim + out_dim), and so sets the trade-off between the model's adaptability and its parameter efficiency.
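
To make the effect of the rank concrete, the following sketch counts the values stored in A and B for a few ranks and compares them with a full weight update of the same layer. The helper name `lora_param_count` and the list of ranks are hypothetical; the dimension 768 is assumed because it is the `in_features` of the output head replaced earlier.

```python
def lora_param_count(in_dim, out_dim, rank):
    """Number of values in A (in_dim x rank) plus B (rank x out_dim)."""
    return in_dim * rank + rank * out_dim

in_dim = out_dim = 768      # assumed square layer, for illustration only
full = in_dim * out_dim     # size of a full 768 x 768 weight update

for rank in (1, 4, 8, 16, 64):
    extra = lora_param_count(in_dim, out_dim, rank)
    print(f"rank={rank:>2}: {extra:>7,} LoRA parameters "
          f"({extra / full:.2%} of {full:,})")
```

At rank 8, for example, A and B together hold 12,288 values, compared with 589,824 for the full update. The count grows linearly with the rank.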

Alpha is a scaling factor for the output of the low-rank adaptation. It controls how strongly the adapted layer's output can affect the original layer's output.
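
One way to see how alpha acts on a layer's output is to add the LoRA output to that of an existing linear layer. The sketch below is an assumption-laden illustration rather than this notebook's method: the wrapper class `LinearWithLoRA`, the layer sizes, and the random input are all hypothetical, and `LoRALayer` is repeated only so that the example runs on its own.

```python
import math
import torch

class LoRALayer(torch.nn.Module):
    # Same layer as defined above, repeated so the sketch is self-contained
    def __init__(self, in_dim, out_dim, rank, alpha):
        super().__init__()
        self.A = torch.nn.Parameter(torch.empty(in_dim, rank))
        torch.nn.init.kaiming_uniform_(self.A, a=math.sqrt(5))
        self.B = torch.nn.Parameter(torch.zeros(rank, out_dim))
        self.alpha = alpha

    def forward(self, x):
        return self.alpha * (x @ self.A @ self.B)

class LinearWithLoRA(torch.nn.Module):
    # Hypothetical wrapper: original layer output plus the scaled LoRA output
    def __init__(self, linear, rank, alpha):
        super().__init__()
        self.linear = linear
        self.lora = LoRALayer(linear.in_features, linear.out_features, rank, alpha)

    def forward(self, x):
        return self.linear(x) + self.lora(x)

torch.manual_seed(123)
linear = torch.nn.Linear(16, 4)          # toy layer, sizes chosen for illustration
wrapped = LinearWithLoRA(linear, rank=2, alpha=1.0)
x = torch.randn(3, 16)

with torch.no_grad():
    # B starts at zero, so the wrapped layer initially reproduces the original output
    same_at_init = torch.allclose(wrapped(x), linear(x))

    # Give B non-zero values (as training would) and compare two alpha settings
    wrapped.lora.B.normal_()
    wrapped.lora.alpha = 1.0
    lora_out_1 = wrapped.lora(x)
    wrapped.lora.alpha = 2.0
    lora_out_2 = wrapped.lora(x)

print(same_at_init)                                  # True
print(torch.allclose(lora_out_2, 2 * lora_out_1))    # True
```

Because B is initialised to zero, adding the LoRA term does not change the layer's behaviour before training; once B is non-zero, doubling alpha doubles the LoRA contribution to the output.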