<a href="https://colab.research.google.com/github/NID123-CH/LLM-Codes/blob/main/PEFT_LoRA.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## PEFT LoRA

Let's start by loading a model:

In [1]:
!pip install peft



In [2]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

base_model_id = "EleutherAI/pythia-160m"

model = AutoModelForCausalLM.from_pretrained(base_model_id)

In [3]:
def print_trainable_parameters(model):
    """
    Prints the number of trainable parameters in the model.
    """
    trainable_params = 0
    all_param = 0
    for _, param in model.named_parameters():
        all_param += param.numel()
        if param.requires_grad:
            trainable_params += param.numel()
    print(
        f"trainable params: {trainable_params} || all params: {all_param} || trainable%: {100 * trainable_params / all_param}"
    )

In [4]:
model

GPTNeoXForCausalLM(
  (gpt_neox): GPTNeoXModel(
    (embed_in): Embedding(50304, 768)
    (emb_dropout): Dropout(p=0.0, inplace=False)
    (layers): ModuleList(
      (0-11): 12 x GPTNeoXLayer(
        (input_layernorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (post_attention_layernorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (post_attention_dropout): Dropout(p=0.0, inplace=False)
        (post_mlp_dropout): Dropout(p=0.0, inplace=False)
        (attention): GPTNeoXSdpaAttention(
          (rotary_emb): GPTNeoXRotaryEmbedding()
          (query_key_value): Linear(in_features=768, out_features=2304, bias=True)
          (dense): Linear(in_features=768, out_features=768, bias=True)
          (attention_dropout): Dropout(p=0.0, inplace=False)
        )
        (mlp): GPTNeoXMLP(
          (dense_h_to_4h): Linear(in_features=768, out_features=3072, bias=True)
          (dense_4h_to_h): Linear(in_features=3072, out_features=768, bias=True)
      

In [5]:
print_trainable_parameters(model)

trainable params: 162322944 || all params: 162322944 || trainable%: 100.0


The model has about 162M parameters, in line with its "160m" name, and all of them are trainable.

We can use LoRA to attach a pair of low-rank matrices to each of the large linear layers in the model (our `target_modules`):

In [6]:
from peft import LoraConfig, get_peft_model

config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=[
        "query_key_value",
        "dense",
        "dense_h_to_4h",
        "dense_4h_to_h",
        "embed_out",
    ],
    bias="none",
    lora_dropout=0.05,  # Conventional
    task_type="CAUSAL_LM",
)

peft_model = get_peft_model(model, config)
peft_model.print_trainable_parameters()

trainable params: 1,588,224 || all params: 163,911,168 || trainable%: 0.9690


Thanks to LoRA, there are now only about 1.6M trainable parameters, a bit less than 1% of all parameters!
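The reported count can be checked by hand: a rank-`r` adapter on a linear layer adds `r * (in_features + out_features)` weights (matrix A plus matrix B). The sketch below is a back-of-the-envelope check, not PEFT's own accounting. It assumes `embed_out` maps 768 features to the 50304-entry vocabulary, mirroring `embed_in` (its shape is not shown in the printout above), and it takes the base parameter count from the earlier output. `lora_params` is a hypothetical helper.

```python
r = 8
num_layers = 12
hidden, vocab = 768, 50304  # assumption: embed_out is Linear(768 -> 50304)

# (in_features, out_features) of each targeted linear layer inside one block
per_block = {
    "query_key_value": (768, 2304),
    "dense": (768, 768),
    "dense_h_to_4h": (768, 3072),
    "dense_4h_to_h": (3072, 768),
}


def lora_params(in_features, out_features, rank):
    """Weights added by one adapter: A is (rank x in), B is (out x rank)."""
    return rank * (in_features + out_features)


block_total = sum(lora_params(i, o, r) for i, o in per_block.values())
embed_out_total = lora_params(hidden, vocab, r)
trainable = num_layers * block_total + embed_out_total

base_params = 162_322_944  # printed earlier for the unmodified model
all_params = base_params + trainable
pct = 100 * trainable / all_params

print(f"trainable params: {trainable:,} || all params: {all_params:,} || trainable%: {pct:.4f}")
```

Under these assumptions the totals match the figures printed by `peft_model.print_trainable_parameters()`.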

Notice that A and B matrices were created for each targeted linear layer:

In [7]:
peft_model

PeftModelForCausalLM(
  (base_model): LoraModel(
    (model): GPTNeoXForCausalLM(
      (gpt_neox): GPTNeoXModel(
        (embed_in): Embedding(50304, 768)
        (emb_dropout): Dropout(p=0.0, inplace=False)
        (layers): ModuleList(
          (0-11): 12 x GPTNeoXLayer(
            (input_layernorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
            (post_attention_layernorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
            (post_attention_dropout): Dropout(p=0.0, inplace=False)
            (post_mlp_dropout): Dropout(p=0.0, inplace=False)
            (attention): GPTNeoXSdpaAttention(
              (rotary_emb): GPTNeoXRotaryEmbedding()
              (query_key_value): lora.Linear(
                (base_layer): Linear(in_features=768, out_features=2304, bias=True)
                (lora_dropout): ModuleDict(
                  (default): Dropout(p=0.05, inplace=False)
                )
                (lora_A): ModuleDict(
                  (defaul

Let's take a closer look:

In [8]:
lin = peft_model.base_model.model.gpt_neox.layers[0].attention.query_key_value
lin

lora.Linear(
  (base_layer): Linear(in_features=768, out_features=2304, bias=True)
  (lora_dropout): ModuleDict(
    (default): Dropout(p=0.05, inplace=False)
  )
  (lora_A): ModuleDict(
    (default): Linear(in_features=768, out_features=8, bias=False)
  )
  (lora_B): ModuleDict(
    (default): Linear(in_features=8, out_features=2304, bias=False)
  )
  (lora_embedding_A): ParameterDict()
  (lora_embedding_B): ParameterDict()
  (lora_magnitude_vector): ModuleDict()
)

In [9]:
lin.lora_A, lin.lora_B

(ModuleDict(
   (default): Linear(in_features=768, out_features=8, bias=False)
 ),
 ModuleDict(
   (default): Linear(in_features=8, out_features=2304, bias=False)
 ))

In [10]:
print_trainable_parameters(lin.base_layer)
print_trainable_parameters(lin.lora_A)
print_trainable_parameters(lin.lora_B)

trainable params: 0 || all params: 1771776 || trainable%: 0.0
trainable params: 6144 || all params: 6144 || trainable%: 100.0
trainable params: 18432 || all params: 18432 || trainable%: 100.0
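These three counts follow directly from the layer shapes. As a rough illustration, the sketch below rebuilds stand-alone layers with the same shapes in plain PyTorch (they are not the PEFT objects above) and freezes the base layer with `requires_grad_(False)` to mimic the effect seen in the output. `count_params` is a hypothetical helper.

```python
import torch.nn as nn


def count_params(module):
    """Return (trainable, total) parameter counts for a module."""
    total = sum(p.numel() for p in module.parameters())
    trainable = sum(p.numel() for p in module.parameters() if p.requires_grad)
    return trainable, total


r = 8
base_layer = nn.Linear(768, 2304, bias=True)
base_layer.requires_grad_(False)          # frozen: excluded from training
lora_A = nn.Linear(768, r, bias=False)    # 8 * 768 weights
lora_B = nn.Linear(r, 2304, bias=False)   # 2304 * 8 weights

base_counts = count_params(base_layer)
a_counts = count_params(lora_A)
b_counts = count_params(lora_B)

for name, (trainable, total) in [("base", base_counts), ("A", a_counts), ("B", b_counts)]:
    print(f"{name}: trainable params: {trainable} || all params: {total}")
```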


Now, let's see how the output is produced under the hood:

In [11]:
torch.manual_seed(42)
x = torch.randn(1, 5, 768).float()

In [12]:
previous_dtype = x.dtype

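# Output of the frozen base layer (the original pretrained weights)
result = lin.base_layer(x)

for active_adapter in lin.active_adapters:
    # Skip adapters that have no LoRA weights on this layer
    if active_adapter not in lin.lora_A.keys():
        continue

    lora_A = lin.lora_A[active_adapter]
    lora_B = lin.lora_B[active_adapter]
    dropout = lin.lora_dropout[active_adapter]
    scaling = lin.scaling[active_adapter]
    # Cast the input to the adapter's dtype before applying it
    x = x.to(lora_A.weight.dtype)

    # Add the scaled low-rank update on top of the base output
    result += lora_B(lora_A(dropout(x))) * scaling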

result = result.to(previous_dtype)
result

tensor([[[-0.7590,  0.8904, -1.9062,  ..., -0.2870, -0.3660,  0.3444],
         [ 0.6140,  0.5264, -0.9148,  ...,  0.5473,  0.2673,  0.2375],
         [ 0.4470,  0.2263,  0.6217,  ..., -0.0528, -0.2111, -0.5163],
         [-2.4433,  1.6739, -0.1461,  ..., -0.4212,  0.1469,  0.1895],
         [ 0.8305, -1.2330,  0.0519,  ...,  0.3947, -0.0971,  0.4093]]],
       grad_fn=<AsStridedBackward0>)
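The loop above boils down to one formula: `output = base(x) + scaling * B(A(dropout(x)))`. The sketch below restates it with stand-alone PyTorch layers of the same shapes (not the PEFT objects). It assumes `scaling = lora_alpha / r`, PEFT's default when rank-stabilized scaling is not enabled, and it switches dropout to eval mode so the comparison is deterministic.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

r, lora_alpha = 8, 16
scaling = lora_alpha / r  # assumption: default LoRA scaling, 16 / 8 = 2.0

base_layer = nn.Linear(768, 2304, bias=True)
lora_A = nn.Linear(768, r, bias=False)
lora_B = nn.Linear(r, 2304, bias=False)
dropout = nn.Dropout(p=0.05)
dropout.eval()  # dropout becomes the identity, so results are reproducible

x = torch.randn(1, 5, 768)

with torch.no_grad():
    # Module-style computation, mirroring the loop above
    out = base_layer(x) + lora_B(lora_A(dropout(x))) * scaling

    # The same computation with explicit matrices: x W^T + b + s * (x A^T) B^T
    W, b = base_layer.weight, base_layer.bias
    A, B = lora_A.weight, lora_B.weight
    out_explicit = x @ W.T + b + scaling * (x @ A.T @ B.T)

max_diff = (out - out_explicit).abs().max().item()
print(out.shape, max_diff)
```

Both forms should agree up to floating-point rounding.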

We can also try out a real sentence as input:

In [13]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
  "EleutherAI/pythia-160m",
)

inputs = tokenizer("The capital of Argentina is", return_tensors="pt")

First, the tokenizer converts the sentence into token IDs:

In [14]:
inputs['input_ids']

tensor([[  510,  5347,   273, 23881,   310]])

Then, the model looks up the input embedding of each token ID:

In [15]:
embed = peft_model.base_model.model.gpt_neox.embed_in(inputs['input_ids'])
embed

tensor([[[ 0.0002,  0.0048, -0.0329,  ...,  0.0067,  0.0170,  0.0054],
         [ 0.0592, -0.0108, -0.0004,  ..., -0.0036, -0.0077, -0.0434],
         [ 0.0005, -0.0028,  0.0054,  ...,  0.0027, -0.0020, -0.0037],
         [-0.0086, -0.0027,  0.0196,  ..., -0.0363,  0.0144,  0.0205],
         [-0.0053, -0.0024,  0.0104,  ...,  0.0185,  0.0039, -0.0238]]])
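An embedding lookup is row indexing into the embedding weight matrix: token ID `i` selects row `i`. A toy sketch with made-up sizes and IDs (10 tokens, 4 dimensions; not the model's real values):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy sizes for illustration only; the model above uses Embedding(50304, 768)
embedding = nn.Embedding(num_embeddings=10, embedding_dim=4)
input_ids = torch.tensor([[5, 3, 2, 7, 1]])  # hypothetical token IDs

looked_up = embedding(input_ids)          # shape: (batch, tokens, dim)
indexed = embedding.weight[input_ids]     # plain indexing gives the same rows

print(looked_up.shape)
```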

The inputs are layer-normalized next:

In [16]:
lnorm = peft_model.base_model.model.gpt_neox.layers[0].input_layernorm
lnorm

LayerNorm((768,), eps=1e-05, elementwise_affine=True)

In [17]:
norm_embed = lnorm(embed)
norm_embed

tensor([[[-0.0725,  0.2857, -1.0191,  ...,  0.1744,  0.5764,  0.1426],
         [ 1.6593, -0.1973,  0.0192,  ..., -0.1529, -0.1523, -1.3258],
         [-0.0525,  0.0059,  0.3059,  ...,  0.0892, -0.0103, -0.2053],
         [-0.3531,  0.0372,  0.6126,  ..., -1.0641,  0.4470,  0.5608],
         [-0.3176,  0.0268,  0.5030,  ...,  0.7324,  0.2268, -1.0857]]])
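Layer normalization standardizes each token's vector to zero mean and unit variance over the feature dimension, then applies a learned scale and shift. The sketch below reproduces `nn.LayerNorm` by hand on a random input (not the notebook's embeddings), using the same `eps=1e-05` shown above.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

x = torch.randn(1, 5, 768)
layer_norm = nn.LayerNorm(768, eps=1e-5)

with torch.no_grad():
    mean = x.mean(dim=-1, keepdim=True)
    # LayerNorm uses the biased variance (divide by N, not N - 1)
    var = x.var(dim=-1, unbiased=False, keepdim=True)
    manual = (x - mean) / torch.sqrt(var + layer_norm.eps)
    manual = manual * layer_norm.weight + layer_norm.bias  # learned scale and shift

    reference = layer_norm(x)

print((manual - reference).abs().max().item())
```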

What if we pass these values as arguments to the linear layer we're experimenting with?

In [18]:
result = lin.base_layer(norm_embed)
result.shape

torch.Size([1, 5, 2304])

The variable `result` holds the output of the frozen base layer, which is the reference we're trying to replicate.

Now, let's use matrices A and B to manually compute this output and compare it to the one above:

In [19]:
active_adapter = 'default'
lora_A = lin.lora_A[active_adapter]
lora_B = lin.lora_B[active_adapter]

In [20]:
lora_A.weight.shape, lora_B.weight.shape

(torch.Size([8, 768]), torch.Size([2304, 8]))
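`nn.Linear` stores its weight as `(out_features, in_features)`, which is why A appears as `[8, 768]` and B as `[2304, 8]`. Their product `B @ A` has the same shape as the base layer's weight, but its rank cannot exceed `r = 8`. A small sketch with random stand-in matrices (both random here, purely to inspect the rank):

```python
import torch

torch.manual_seed(0)

r = 8
# float64 keeps the numerical rank estimate well-conditioned
A = torch.randn(r, 768, dtype=torch.float64)    # same shape as lora_A.weight
B = torch.randn(2304, r, dtype=torch.float64)   # same shape as lora_B.weight

delta_W = B @ A                                  # same shape as the base weight
rank = torch.linalg.matrix_rank(delta_W).item()

print(delta_W.shape, rank)
```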

In [21]:
lora_A.state_dict(), lora_B.state_dict()

(OrderedDict([('weight',
               tensor([[ 0.0144,  0.0122, -0.0091,  ..., -0.0090, -0.0081,  0.0105],
                       [-0.0340, -0.0193, -0.0330,  ...,  0.0217,  0.0141, -0.0046],
                       [ 0.0318,  0.0055,  0.0235,  ...,  0.0109,  0.0133,  0.0064],
                       ...,
                       [ 0.0317,  0.0257,  0.0197,  ..., -0.0324,  0.0329,  0.0323],
                       [-0.0344,  0.0144,  0.0213,  ...,  0.0324, -0.0068,  0.0024],
                       [ 0.0258, -0.0056, -0.0153,  ..., -0.0037,  0.0069, -0.0011]]))]),
 OrderedDict([('weight',
               tensor([[0., 0., 0.,  ..., 0., 0., 0.],
                       [0., 0., 0.,  ..., 0., 0., 0.],
                       [0., 0., 0.,  ..., 0., 0., 0.],
                       ...,
                       [0., 0., 0.,  ..., 0., 0., 0.],
                       [0., 0., 0.,  ..., 0., 0., 0.],
                       [0., 0., 0.,  ..., 0., 0., 0.]]))]))

Did you notice anything?

Matrix B is initialized with **zeros**, so the low-rank update starts at zero and the model's original behavior is preserved until the "add-on" (the adapter) is trained.

In [22]:
low_ranked = lora_A(norm_embed)
low_ranked

tensor([[[-0.7489,  0.5211, -0.8118,  0.0877,  0.2055, -0.6701,  0.4042,
           0.1906],
         [ 0.1474, -0.2758,  0.7828, -0.7500, -0.2218,  0.2913,  0.4416,
          -0.2960],
         [ 0.6522, -0.0904, -0.3519,  0.0844,  0.1331,  0.4722,  0.0928,
           0.2823],
         [-0.2392,  0.7513,  0.3848,  0.5732, -0.1293,  0.4033, -0.4269,
          -0.8909],
         [-0.4936, -0.2486, -0.4032, -0.9119,  0.1926, -0.2389, -0.6967,
           0.9784]]], grad_fn=<UnsafeViewBackward0>)

In [23]:
delta = lora_B(low_ranked)
delta

tensor([[[0., 0., 0.,  ..., 0., 0., 0.],
         [0., 0., 0.,  ..., 0., 0., 0.],
         [0., 0., 0.,  ..., 0., 0., 0.],
         [0., 0., 0.,  ..., 0., 0., 0.],
         [0., 0., 0.,  ..., 0., 0., 0.]]], grad_fn=<UnsafeViewBackward0>)
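An all-zero `delta` does not mean the adapter is stuck. With B at zero, the gradient with respect to A is zero at first, but the gradient with respect to B is not, so B can move away from zero on the first update and A follows afterwards. A minimal sketch with stand-alone layers and a hypothetical stand-in loss (the sum of `delta`):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

r = 8
lora_A = nn.Linear(768, r, bias=False)   # default random initialization
lora_B = nn.Linear(r, 2304, bias=False)
nn.init.zeros_(lora_B.weight)            # zero init, as in the output above

x = torch.randn(1, 5, 768)
delta = lora_B(lora_A(x))

loss = delta.sum()  # hypothetical loss, only used to obtain gradients
loss.backward()

delta_is_zero = bool((delta == 0).all())
grad_A_is_zero = bool((lora_A.weight.grad == 0).all())
grad_B_is_nonzero = bool((lora_B.weight.grad != 0).any())

print(delta_is_zero, grad_A_is_zero, grad_B_is_nonzero)
```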