# Seminar *Training large models*

by Denis Kuznedelev

Adapted from [YDS NLP course](https://github.com/yandexdataschool/nlp_course/blob/2023/week07_peft/practice.ipynb)

## Introduction

In this notebook, you will learn how to finetune large language models with limited GPU memory.

Over the last few years, models have grown greatly in size, and many of them no longer fit on a standard consumer GPU even for inference, let alone for conventional finetuning.

However, there exist several techniques that allow one to run inference on, and even finetune, large models with modest resources.

**Reduction of the model size**

There exist numerous approaches to model compression (an overview would require a separate lecture). One of the most successful in the context of LLMs is PTQ (post-training quantization), which stores the weights in low precision.

4-bit quantization typically leads to only minor degradation in performance relative to the floating-point baseline while offering huge memory savings:
* a model in `half` precision requires `16 bits` per parameter
* a model quantized to 4 bits requires `4+eps bits` per parameter (there is a small overhead for storing quantization statistics)

Therefore, we get an almost `4x` reduction in memory!
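The savings can be checked with a few lines of arithmetic. The sketch below is illustrative only: the round parameter count of 7 billion and the overhead of `0.5` bits per parameter (for example, one 32-bit scale shared by a block of 64 weights) are assumptions, not measurements of a particular model or library.

```python
def weight_memory_gb(num_params: int, bits_per_param: float) -> float:
    """Memory needed to store the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


NUM_PARAMS = 7_000_000_000  # assumption: a round "7B" model
EPS_BITS = 0.5              # assumption: illustrative overhead for quantization statistics

half_gb = weight_memory_gb(NUM_PARAMS, 16)
int4_gb = weight_memory_gb(NUM_PARAMS, 4 + EPS_BITS)

print(f"half precision:   {half_gb:.1f} GB")
print(f"4-bit + overhead: {int4_gb:.2f} GB")
print(f"reduction:        {half_gb / int4_gb:.2f}x")
```

The larger the overhead `eps`, the further the ratio falls below the ideal `4x`.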

**Reduction of the memory on optimizer states**

Another challenge is the optimizer state. With the `Adam` optimizer, commonly adopted for training Transformers, one needs `4 bytes` per parameter for each of the gradient, the first moment, and the second moment (one may try to store some of these in half precision, but this tends to cause instability).

Therefore, the total memory required to train something like `Llama-7b`, `Mistral-7b`, or `gemma-7b` exceeds the `80GB` available on high-end GPUs such as the `A100` and `H100`.
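A rough back-of-the-envelope estimate makes this concrete. The sketch assumes a round 7 billion parameters and weights kept in half precision, and it ignores activations and framework overhead, so it should be read as a lower bound rather than an exact figure.

```python
NUM_PARAMS = 7_000_000_000  # assumption: a round "7B" model

# Bytes stored per parameter during full finetuning with Adam.
bytes_per_param = {
    "weights (half precision)": 2,  # assumption: weights kept in half precision
    "gradients": 4,
    "Adam first moment": 4,
    "Adam second moment": 4,
}

total_bytes = NUM_PARAMS * sum(bytes_per_param.values())
total_gb = total_bytes / 1e9

for name, nbytes in bytes_per_param.items():
    print(f"{name:>26}: {NUM_PARAMS * nbytes / 1e9:6.1f} GB")
print(f"{'total':>26}: {total_gb:6.1f} GB")
```

Even before counting activations, the gradients and the two Adam moments alone already exceed the memory of a single 80GB card under these assumptions.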

Finetuning only the `lm_head` may not suffice for more complicated tasks.

Thus we look for something in between: a method that adapts, in some sense, every transformer layer, but with a small number of trainable parameters.

Approaches that train only a small subset of parameters are known in the literature as **parameter-efficient finetuning** (PEFT) methods.



We will cover two well-known techniques for parameter-efficient finetuning:
* Prompt tuning
* LoRA adapters

## Preparation

In [None]:
!pip install accelerate # reduces RAM consumption when loading; required for the bitsandbytes integration
!pip install bitsandbytes # to work with quantized models
!pip install peft # to work with PEFT techniques

Looking in indexes: https://pypi.org/simple/


In [None]:
!huggingface-cli login


    _|    _|  _|    _|    _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|_|_|_|    _|_|      _|_|_|  _|_|_|_|
    _|    _|  _|    _|  _|        _|          _|    _|_|    _|  _|            _|        _|    _|  _|        _|
    _|_|_|_|  _|    _|  _|  _|_|  _|  _|_|    _|    _|  _|  _|  _|  _|_|      _|_|_|    _|_|_|_|  _|        _|_|_|
    _|    _|  _|    _|  _|    _|  _|    _|    _|    _|    _|_|  _|    _|      _|        _|    _|  _|        _|
    _|    _|    _|_|      _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|        _|    _|    _|_|_|  _|_|_|_|

    To login, `huggingface_hub` requires a token generated from https://huggingface.co/settings/tokens .
Token: 
Add token as git credential? (Y/n) n
Token is valid (permission: write).
Your token has been saved to /root/.cache/huggingface/token
Login successful


In [None]:
from tqdm.auto import trange

import torch
import torch.nn as nn
import torch.nn.functional as F

from datasets import load_dataset
import transformers
from transformers import AutoTokenizer, AutoModelForCausalLM



In [None]:
assert torch.cuda.is_available(), "No CUDA, no party"
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

In [None]:
model_name = 'mistralai/Mistral-7B-v0.1'

# loading tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token_id = tokenizer.eos_token_id
# loading model
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map='auto',
    low_cpu_mem_usage=True,
    offload_state_dict=True,
    load_in_4bit=True,
    torch_dtype=torch.float32,  # weights are 4-bit; layernorms and activations are fp32
)
for param in model.parameters():
    param.requires_grad=False

model.gradient_checkpointing_enable()  # only store a small subset of activations, re-compute the rest.
model.enable_input_require_grads()

The `load_in_4bit` and `load_in_8bit` arguments are deprecated and will be removed in the future versions. Please, pass a `BitsAndBytesConfig` object in `quantization_config` argument instead.
Downloading shards: 100%|██████████| 2/2 [00:00<00:00, 3943.87it/s]
Loading checkpoint shards: 100%|██████████| 2/2 [00:19<00:00,  9.96s/it]
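The deprecation warning above recommends passing a `BitsAndBytesConfig` object through `quantization_config`. A minimal sketch of the equivalent call is given below. It is meant to mirror the arguments used in this notebook; the helper name `load_4bit_model` is hypothetical, and exact behavior may differ between `transformers` versions.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig


def load_4bit_model(model_name: str):
    """Hypothetical helper: load a causal LM with 4-bit weights via BitsAndBytesConfig."""
    quantization_config = BitsAndBytesConfig(load_in_4bit=True)
    return AutoModelForCausalLM.from_pretrained(
        model_name,
        device_map='auto',
        low_cpu_mem_usage=True,
        offload_state_dict=True,
        quantization_config=quantization_config,  # replaces the deprecated load_in_4bit=True
        torch_dtype=torch.float32,
    )
```

Calling `load_4bit_model(model_name)` would then take the place of the `from_pretrained` call in the cell above.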


## Prompt tuning

Prompt tuning prepends learnable token embeddings to the prompt; these embeddings are optimized via backpropagation while the model weights stay frozen.

<img src="https://thumtblog.github.io/images/robust-prefix-tuning/observation-good.png" width=600px>

The number of learnable parameters is $N_{tokens} \times d_{embed}$.

This approach is cheap and is known to work well in simple cases.
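As a quick check of the formula, the snippet below counts the trainable parameters for 16 prompt tokens. The embedding width of 4096 is an assumption about `Mistral-7B`; in the notebook it can be read from `model.config.hidden_size`.

```python
def prompt_tuning_param_count(num_tokens: int, embed_dim: int) -> int:
    """Number of trainable parameters: one embedding vector per learnable prompt token."""
    return num_tokens * embed_dim


num_prompts = 16   # number of learnable prompt tokens used later in this notebook
embed_dim = 4096   # assumption: hidden size of Mistral-7B

n_trainable = prompt_tuning_param_count(num_prompts, embed_dim)
print(f"trainable parameters: {n_trainable:,}")
```

Under this assumption the count agrees with the `65,536` trainable parameters that PEFT reports for 16 virtual tokens later in the notebook.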

In [None]:
prompt = 'A quick brown fox'
batch = tokenizer(prompt, return_tensors='pt', return_token_type_ids=False).to(device)

for i in range(7):
    next_token = model(**batch).logits[0, -1].argmax(-1).reshape(1, 1)
    batch['input_ids'] = torch.cat([batch['input_ids'], next_token], dim=-1)
    batch['attention_mask'] = torch.cat([batch['attention_mask'], torch.ones_like(next_token)], dim=-1)

print(f"\nOutput: {tokenizer.decode(batch['input_ids'][0].cpu().numpy().tolist(), skip_special_tokens=True)}")


Output: A quick brown fox jumps over a lazy dog.


<img src="https://static.wikia.nocookie.net/theodd1souts/images/e/ee/Odd_alphabet.jpg/revision/latest/scale-to-width-down/1000?cb=20180616072819" width=400px>

What a blatant lie!

This particular fox assures you that it didn't in fact jump over the lazy dog.

No, sir! The fox was just minding its own business.

Your task is to train the model to tell the truth: no dog was jumped over today.


In [None]:
the_truth = "A quick brown fox did not jump over the lazy dog. Besides, that dog deserved it anyway!"
batch = tokenizer(the_truth, return_tensors='pt', return_token_type_ids=False).to(device)
outputs = model(**batch)

next_word_logits = outputs.logits[:, :-1]
true_next_tokens = batch['input_ids'][:, 1:]
loss = F.cross_entropy(next_word_logits.flatten(0, 1), true_next_tokens.flatten(0, 1))

print(f"Loss: {loss.item():.2f}")

Loss: 3.06


**Your task**

Implement prompt tuning using the template below.

In [None]:
class WordEmbeddingsWithLearnedPrompts(nn.Module):
    """
    To perform prompt tuning, replace the model's original word embeddings with this layer,
    which substitutes trainable prompts for the first N token embeddings.
    """

    def __init__(self, word_embeddings: nn.Embedding, num_prompts: int):
        super().__init__()
        self.original_word_embeddings = word_embeddings
        self.num_prompts = num_prompts
        self.learnable_prompts = nn.Parameter(
            torch.randn(1, num_prompts, word_embeddings.embedding_dim),
            requires_grad=True
        )

    def forward(self, input_ids: torch.LongTensor):
        # input_ids shape: [batch_size, seq length]
        assert input_ids.dtype == torch.int64
        assert input_ids.shape[1] > self.num_prompts
        assert torch.all(input_ids[:, :self.num_prompts] == tokenizer.pad_token_id).item(), "don't forget to prepend num_prompts PAD tokens to input_ids"

        # Your task: embed input_ids, but replace the first :num_prompts: tokens with self.learnable_prompts
        # This is because we will prepend :num_prompts: padding tokens at the beginning

        # After you are done, you must produce a word embedding vector for each token in input_ids,
        # except that the first :num_prompts: vectors should equal learnable_prompts;
        # any additional vectors after first :num_prompts: ones should be embedded as usual
        # Note: since you're dealing with trainable params, please torch.cat instead of item assignment

        learnable_embeddings = self.learnable_prompts.repeat(input_ids.shape[0], 1, 1)
        inputs_embeddings = self.original_word_embeddings(input_ids[:, self.num_prompts:])

        embeddings = torch.cat((learnable_embeddings, inputs_embeddings), dim=1)

        return embeddings

In [None]:
num_prompts = 16
test_emb_layer = WordEmbeddingsWithLearnedPrompts(model.model.embed_tokens, num_prompts=num_prompts).to(device)
test_input_ids = tokenizer("a cat say on a may", return_tensors='pt')['input_ids'].to(device)

space_for_prompts = torch.full(
    size=(len(test_input_ids), num_prompts),
    fill_value=tokenizer.pad_token_id,
    dtype=torch.int64,
    device=device
)
test_inputs_with_prompts = torch.cat([space_for_prompts, test_input_ids], dim=1)

with torch.cuda.amp.autocast():
  test_prompt_embeddings = test_emb_layer(test_inputs_with_prompts)

assert test_prompt_embeddings.shape[:2] == test_inputs_with_prompts.shape
assert test_prompt_embeddings.shape[-1] == model.config.hidden_size
assert torch.allclose(test_prompt_embeddings[:, :num_prompts], test_emb_layer.learnable_prompts.float())
assert torch.allclose(test_prompt_embeddings[:, num_prompts:], model.model.embed_tokens(test_input_ids).float())
print("Looks legit!")

Looks legit!


In [None]:
assert isinstance(model.model.embed_tokens, nn.Embedding), "you have already replaced the embedding layer. If the replacement is broken, please reload the model"
model.model.embed_tokens = WordEmbeddingsWithLearnedPrompts(model.model.embed_tokens, num_prompts=num_prompts).to(device)

In [None]:
opt = torch.optim.Adam([model.model.embed_tokens.learnable_prompts], lr=0.01)

Prepare batch

In [None]:
the_truth = "A quick brown fox did not jump over the lazy dog. Besides, that dog deserved it anyway!"
batch = tokenizer(the_truth, return_tensors='pt', return_token_type_ids=False).to(device)
space_for_prompts = torch.full([len(batch['input_ids']), num_prompts], fill_value=tokenizer.pad_token_id,
                               dtype=torch.int64, device=device)
batch['input_ids'] = torch.cat([space_for_prompts, batch['input_ids']], dim=1)
batch['attention_mask'] = torch.cat([torch.ones_like(space_for_prompts), batch['attention_mask']], dim=1)

In [None]:
n_iter = 250
atol = 0.17

pbar = trange(n_iter)
for i in pbar:
  outputs = model(**batch)
  next_word_logits = outputs.logits[:, num_prompts:-1, :]
  true_next_tokens = batch['input_ids'][:, num_prompts + 1:]
  loss = F.cross_entropy(next_word_logits.flatten(0, 1), true_next_tokens.flatten(0, 1))
  loss.backward()
  opt.step()
  opt.zero_grad()
  pbar.set_description(f"Loss {loss.item():.2f}")
  if loss < atol:
    break

assert loss.item() <= atol
print("Good job!")

Loss 0.17:  16%|█▋        | 41/250 [00:07<00:40,  5.20it/s]

Good job!





In [None]:
prompt = 'A quick brown fox'
batch = tokenizer(prompt, return_tensors='pt', return_token_type_ids=False).to(device)
batch['input_ids'] = torch.cat([space_for_prompts, batch['input_ids']], dim=1)
batch['attention_mask'] = torch.cat([torch.ones_like(space_for_prompts), batch['attention_mask']], dim=1)

for i in range(15):
    next_token = model(**batch).logits[0, -1].argmax(-1).reshape(1, 1)
    batch['input_ids'] = torch.cat([batch['input_ids'], next_token], dim=-1)
    batch['attention_mask'] = torch.cat([batch['attention_mask'], torch.ones_like(next_token)], dim=-1)

print(f"\nOutput: {tokenizer.decode(batch['input_ids'][0, num_prompts:].cpu().numpy().tolist(), skip_special_tokens=True)}")


Output: A quick brown fox did not jump over the lazy dog. Besides, that dog deserved it anyway


If you did everything right, the model will deny that the fox jumped over the lazy dog.


### Using HuggingFace PEFT (2 points)


[PEFT](https://github.com/huggingface/peft) is a sister library of 🤗 Transformers that allows you to apply various **p**arameter-**e**fficient **f**ine-**t**uning methods to pre-trained transformers. This library provides implementations of common PEFT techniques:
* LoRA
* Prefix-Tuning
* Prompt-Tuning
* IA3
* and more

In [None]:
import peft

In [None]:
del model
torch.cuda.empty_cache()

In [None]:
# re-loading model
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map='auto',
    low_cpu_mem_usage=True,
    offload_state_dict=True,
    load_in_4bit=True,
    torch_dtype=torch.float32,  # weights are 4-bit; layernorms and activations are fp32
)
for param in model.parameters():
    param.requires_grad = False

model.gradient_checkpointing_enable()  # only store a small subset of activations, re-compute the rest.
model.enable_input_require_grads()

The `load_in_4bit` and `load_in_8bit` arguments are deprecated and will be removed in the future versions. Please, pass a `BitsAndBytesConfig` object in `quantization_config` argument instead.
Downloading shards: 100%|██████████| 2/2 [00:00<00:00, 2776.77it/s]
Loading checkpoint shards: 100%|██████████| 2/2 [00:24<00:00, 12.17s/it]


Sanity check that we have reloaded the model

In [None]:
assert isinstance(model.model.embed_tokens, nn.Embedding), "please reload the model"

In [None]:
peft_config = peft.PromptTuningConfig(task_type=peft.TaskType.CAUSAL_LM, num_virtual_tokens=16)
model = peft.get_peft_model(model, peft_config)  # note: for most peft methods, this line also modifies model in-place
model.print_trainable_parameters()

trainable params: 65,536 || all params: 7,241,797,632 || trainable%: 0.000904968673943746


**Your task**

Optimize the PEFT-wrapped model to achieve a next-token prediction `loss < 0.17`, this time using PEFT.

**Note**

You no longer need to prepend PAD tokens, but you still need to skip the first `num_virtual_tokens` logits.

Finally, generate the sentence to make sure that the model learned the truth.
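The indexing in the note above can be illustrated without a model. In this toy sketch (the token strings and `num_virtual_tokens = 3` are made up for illustration), the wrapped model is assumed to return one logit vector per virtual token followed by one per real token, and the logits at each position are scored against the next real token.

```python
num_virtual_tokens = 3                     # illustrative value, not the one used in training
tokens = ["A", "quick", "brown", "fox"]    # stand-ins for real token ids

# One output position per virtual token, then one per real token.
logit_positions = [f"<virtual_{i}>" for i in range(num_virtual_tokens)] + tokens

# Skip the virtual positions and drop the last position (it has no next token to predict).
predicting_positions = logit_positions[num_virtual_tokens:-1]
# Targets are the real tokens shifted by one.
targets = tokens[1:]

assert len(predicting_positions) == len(targets)
for position, target in zip(predicting_positions, targets):
    print(f"logits at {position!r} are scored against {target!r}")
```

This mirrors slicing the logits with `[:, num_virtual_tokens:-1]` and the input ids with `[:, 1:]`.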

In [None]:
the_truth = "A quick brown fox did not jump over the lazy dog. Besides, that dog deserved it anyway!"
batch = tokenizer(the_truth, return_tensors='pt', return_token_type_ids=False).to(device)

In [None]:
opt = torch.optim.Adam(model.parameters(), lr=0.01)

In [None]:
n_iter = 200
atol = 0.17

pbar = trange(n_iter)
for i in pbar:
  outputs = model(**batch)
  next_word_logits = outputs.logits[:, num_prompts:-1, :]
  true_next_tokens = batch['input_ids'][:, 1:]
  loss = F.cross_entropy(next_word_logits.flatten(0, 1), true_next_tokens.flatten(0, 1))
  loss.backward()
  opt.step()
  opt.zero_grad()
  pbar.set_description(f"Loss {loss.item():.2f}")
  if loss < atol:
    break

assert loss.item() <= atol
print("Good job!")

Loss 0.17:  15%|█▌        | 30/200 [00:06<00:38,  4.38it/s]

Good job!





In [None]:
prompt = 'A quick brown fox'
batch = tokenizer(prompt, return_tensors='pt', return_token_type_ids=False).to(device)

for i in range(15):
    next_token = model(**batch).logits[0, -1].argmax(-1).reshape(1, 1)
    batch['input_ids'] = torch.cat([batch['input_ids'], next_token], dim=-1)
    batch['attention_mask'] = torch.cat([batch['attention_mask'], torch.ones_like(next_token)], dim=-1)

print(f"\nOutput: {tokenizer.decode(batch['input_ids'][0].cpu().numpy().tolist(), skip_special_tokens=True)}")


Output: A quick brown fox did not jump over the lazy dog. Besides, that dog deserved it anyway


## Parameter-efficient finetuning with LoRA

When training on more serious tasks, you can use low-rank adapters based on the LoRA paper.

The core idea is to add low-rank adapters in parallel with existing linear layers, like this:

<img src="https://i.imgur.com/6bQLNiG.png" width=300px>

Specifically, applying the adapter looks as follows:
$$
y = W_{0} x + \frac{\alpha}{r} B A x
$$
Above:
* $W_{0}$ is the original (frozen) weight
* $A$ and $B$ are learnable low-rank matrices
* $r$ is their rank
* $\alpha$ is the relative weight of the weight update

In the original LoRA paper, the adapters were only added to attention projection matrices.

However, subsequent works show that it is useful to adapt FFNs as well. But before we do any training, we need to implement the basic LoRA layer.
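To make the formula concrete, here is a minimal sketch on random tensors. The dimensions, the value of `alpha`, and the initialization (small Gaussian `A`, zero `B`) are illustrative assumptions; the point is only that with `B = 0` the adapted layer initially reproduces the frozen layer exactly, while far fewer parameters are trained.

```python
import torch

torch.manual_seed(0)
d_in, d_out, r, alpha = 32, 16, 4, 8.0  # illustrative sizes

W0 = torch.randn(d_out, d_in)           # frozen pre-trained weight
A = torch.randn(r, d_in) * 0.01         # learnable, shape [r, d_in]
B = torch.zeros(d_out, r)               # learnable, shape [d_out, r], zero at start


def lora_forward(x: torch.Tensor) -> torch.Tensor:
    """y = W0 x + (alpha / r) * B A x, applied to a batch of row vectors x."""
    return x @ W0.T + (alpha / r) * (x @ A.T @ B.T)


x = torch.randn(5, d_in)
y_frozen = x @ W0.T
y_start = lora_forward(x)               # equals y_frozen because B is zero

full_params = W0.numel()                # parameters updated by full finetuning
lora_params = A.numel() + B.numel()     # parameters updated by LoRA
print(torch.allclose(y_start, y_frozen), full_params, lora_params)
```

The adapter adds `r * (d_in + d_out)` trainable parameters instead of the `d_in * d_out` of the full weight, which is where the savings come from when `r` is small.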

In [None]:
# re-load the model to remove any previous PEFT tuners
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map='auto',
    low_cpu_mem_usage=True,
    offload_state_dict=True,
    load_in_4bit=True,
    torch_dtype=torch.float32,  # weights are 4-bit; layernorms and activations are fp32
)
for param in model.parameters():
    param.requires_grad=False

model.gradient_checkpointing_enable()  # only store a small subset of activations, re-compute the rest.
model.enable_input_require_grads()

The `load_in_4bit` and `load_in_8bit` arguments are deprecated and will be removed in the future versions. Please, pass a `BitsAndBytesConfig` object in `quantization_config` argument instead.
Downloading shards: 100%|██████████| 2/2 [00:00<00:00, 6413.31it/s]
Loading checkpoint shards: 100%|██████████| 2/2 [00:25<00:00, 12.99s/it]


**Your task**

Implement a LoRA adapter.

In [None]:
class LoRALayer(nn.Module):
    """Wraps an existing (frozen) linear layer with a LoRA-like adapter."""
    def __init__(self, module: nn.Linear, rank: int):
        super().__init__()
        self.module = module  # pre-trained (frozen) linear layer
        self.adapter_A = nn.Parameter(torch.empty(module.in_features, rank, device=module.weight.device))
        nn.init.kaiming_uniform_(self.adapter_A, a=5 ** 0.5)
        self.adapter_B = nn.Parameter(torch.zeros(rank, module.out_features, device=module.weight.device))

    def forward(self, inputs):
        # Apply self.module and the LoRA adapter, then return their sum.
        # Note: no alpha / r scaling is applied here, i.e. alpha = r in the formula above.
        module_outputs = self.module(inputs)
        adapter_outputs = inputs @ self.adapter_A.unsqueeze(0) @ self.adapter_B.unsqueeze(0)
        return module_outputs + adapter_outputs

Test implementation

In [None]:
test_linear = nn.Linear(128, 128)
test_linear.weight.data[...] = torch.eye(128)
test_adapter = LoRALayer(test_linear, rank=8)

assert torch.allclose(test_adapter(torch.ones(1, 1, 128)), test_linear.bias + 1), "please check your forward pass"

test_adapter.adapter_A.data[...] = torch.linspace(0.1, -0.5, 128 * 8).view(128, 8)
test_adapter.adapter_B.data[...] = torch.linspace(0.5, -0.1, 128 * 8).view(8, 128)
test_linear.bias.data[...] = torch.linspace(1., -1., 128)

dummy_loss = F.mse_loss(test_adapter(torch.ones(1, 128) / 128).squeeze(), torch.linspace(-1, 1, 128))
assert torch.allclose(dummy_loss, torch.tensor(1.3711389), rtol=0, atol=1e-4)
dummy_loss.backward()
assert all(w.grad is not None for w in [test_adapter.adapter_A, test_adapter.adapter_B]), "some adapter weights have no grad"
assert torch.allclose(test_adapter.adapter_A.grad.sum(), torch.tensor(-0.60158), rtol=0, atol=1e-4), "bad grad w.r.t. A"
assert torch.allclose(test_adapter.adapter_B.grad.sum(), torch.tensor(0.9931), rtol=0, atol=1e-4), "bad grad w.r.t. B"
# note: bad grad means that your code is different from LoRA paper OR that your code is not autograd-friendly (e.g. no_grad)
del dummy_loss, test_linear, test_adapter
print("All tests passed!")

All tests passed!


### Apply LoRA to the model

The code below applies LoRA adapters on top of `Q/K/V` linear layers of Transformer attention.

You may also choose to modify other layers:

    self_attn.o_proj - attention output projection
    mlp.up_proj, mlp.gate_proj, mlp.down_proj - transformer feedforward layers
    lm_head - output LM head

In [None]:
lora_rank = 8
attention_layer_name = 'Attention'

for name, module in model.model.layers.named_modules():
    if attention_layer_name in repr(type(module)):
        module.q_proj = LoRALayer(module.q_proj, rank=lora_rank).to(device)
        module.k_proj = LoRALayer(module.k_proj, rank=lora_rank).to(device)
        module.v_proj = LoRALayer(module.v_proj, rank=lora_rank).to(device)

assert sum(isinstance(module, LoRALayer) for module in model.modules()) == 96 # for Mistral-7b

In [None]:
batch = tokenizer("This model wants to share its greatest secret:", return_tensors='pt', return_token_type_ids=False).to(device)
# test a single training step, make sure we get meaningful gradients
with torch.cuda.amp.autocast(dtype=torch.float32):
    out = model.forward(**batch)
    (out.logits.norm() / 100).backward()

for i, module in enumerate(model.modules()):
    if isinstance(module, LoRALayer):
        assert module.adapter_B.grad is not None
        assert module.adapter_B.grad.norm().item() > 0

model.zero_grad(set_to_none=True)
print("Grad check successful, well done!")

Grad check successful, well done!


Let us finetune the model on some custom dataset

In [None]:
data = load_dataset("Abirate/english_quotes")
data = data.map(lambda samples: tokenizer(samples['quote']), batched=True)
model._hf_peft_config_loaded = True  # silence a warning from HF trainer

trainer = transformers.Trainer(
    model=model,
    train_dataset=data['train'],
    args=transformers.TrainingArguments(
        per_device_train_batch_size=4,
        gradient_accumulation_steps=4,
        warmup_steps=100,
        max_steps=200,
        learning_rate=2e-4,
        fp16=True,
        logging_steps=1,
        output_dir='outputs'
    ),
    data_collator=transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False)
)
# Silence the warnings. Please re-enable for inference!
model.config.use_cache = False
trainer.train()



| Step | Training Loss |
|------|---------------|
| 1 | 1.6455 |
| 2 | 1.1446 |
| 3 | 1.5689 |
| 4 | 1.3259 |
| 5 | 1.4543 |
| 6 | 1.5173 |
| 7 | 1.3265 |
| 8 | 1.5284 |
| 9 | 1.7489 |
| 10 | 1.943 |


TrainOutput(global_step=200, training_loss=1.2602548521757126, metrics={'train_runtime': 413.8954, 'train_samples_per_second': 7.731, 'train_steps_per_second': 0.483, 'total_flos': 1.3073398340124672e+16, 'train_loss': 1.2602548521757126, 'epoch': 1.28})

### Inference with the finetuned model

In [None]:
model.config.use_cache = True

In [None]:
batch = tokenizer("Two things are infinite: ", return_tensors='pt').to(device)

with torch.cuda.amp.autocast():
  output_tokens = model.generate(**batch, max_new_tokens=24)

print(f"\n\n{tokenizer.decode(output_tokens[0], skip_special_tokens=True)}")

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.




Two things are infinite:  the universe and human stupidity; and I'm not sure about the universe.

- Albert Einstein




The PEFT library provides an implementation of LoRA as well.

In [None]:
from peft import LoraConfig, get_peft_model

In [None]:
# re-load the model to remove any previous PEFT tuners
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map='auto',
    low_cpu_mem_usage=True,
    offload_state_dict=True,
    load_in_4bit=True,
    torch_dtype=torch.float32,  # weights are 4-bit; layernorms and activations are fp32
)
for param in model.parameters():
    param.requires_grad=False

model.gradient_checkpointing_enable()  # only store a small subset of activations, re-compute the rest.
model.enable_input_require_grads()

The `load_in_4bit` and `load_in_8bit` arguments are deprecated and will be removed in the future versions. Please, pass a `BitsAndBytesConfig` object in `quantization_config` argument instead.
Downloading shards: 100%|██████████| 2/2 [00:00<00:00, 2402.24it/s]
Loading checkpoint shards: 100%|██████████| 2/2 [00:25<00:00, 12.70s/it]


In [None]:
config = LoraConfig(
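    r=16, # rank of the LoRA adapter
    lora_alpha=16, # scaling factor alpha; the adapter update is scaled by lora_alpha / r
    target_modules=["q_proj", "k_proj"], # layers to apply LoRA to
    lora_dropout=0.1, # dropout applied to the input of the LoRA branch
    bias="none", # which biases to train; "none" keeps all biases frozen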
    task_type="CAUSAL_LM"
)
model = get_peft_model(model, config)
model.print_trainable_parameters()

trainable params: 6,815,744 || all params: 7,248,547,840 || trainable%: 0.0940290959023318
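The reported number of trainable parameters can be reproduced by hand. The sketch below assumes the `Mistral-7B` shapes: 32 transformer layers, `q_proj` mapping 4096 to 4096 features, and `k_proj` mapping 4096 to 1024 features. These values are assumptions stated here, not read from the loaded model.

```python
def lora_param_count(in_features: int, out_features: int, rank: int) -> int:
    """A has shape [rank, in_features], B has shape [out_features, rank]."""
    return rank * in_features + out_features * rank


RANK = 16
NUM_LAYERS = 32                      # assumption: Mistral-7B depth
TARGET_SHAPES = {                    # assumption: (in_features, out_features) per module
    "q_proj": (4096, 4096),
    "k_proj": (4096, 1024),
}

per_layer = sum(lora_param_count(i, o, RANK) for i, o in TARGET_SHAPES.values())
total = NUM_LAYERS * per_layer
print(f"per layer: {per_layer:,}  total: {total:,}")
```

Under these assumptions the total agrees with the `6,815,744` trainable parameters printed above.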


You can train the model in exactly the same way as before.

**Note** PEFT allows you to save the adapter weights alone and push them to the Hub.

In [None]:
model.push_to_hub("username/adapter_name", use_auth_token=True)

Note that for floating-point models, a LoRA adapter can be **seamlessly** merged into the model weights, thus incurring **zero** overhead at inference.

However, for quantized models things are more subtle: one has to either compute the adapter in parallel with the main weight or apply some additional trick to merge it into the quantized weight.
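For the floating-point case, merging amounts to adding the scaled low-rank product to the original weight. The following is a minimal sketch on random tensors; the sizes, `alpha`, and variable names are illustrative assumptions, and it is not meant to describe how PEFT implements merging internally.

```python
import torch

torch.manual_seed(0)
d_in, d_out, r, alpha = 32, 16, 4, 8.0   # illustrative sizes

W0 = torch.randn(d_out, d_in)            # original weight
A = torch.randn(r, d_in)                 # trained adapter matrices (random stand-ins)
B = torch.randn(d_out, r)
x = torch.randn(5, d_in)

# Adapter evaluated in parallel with the main weight: two extra matmuls per call.
y_parallel = x @ W0.T + (alpha / r) * (x @ A.T @ B.T)

# Adapter merged once into the weight: a single matmul per call afterwards.
W_merged = W0 + (alpha / r) * (B @ A)
y_merged = x @ W_merged.T

print(torch.allclose(y_parallel, y_merged, atol=1e-3))
```

With a quantized `W0` the sum `W0 + (alpha / r) * B A` is generally no longer representable on the same low-precision grid, which is why the merge is not as direct in that case.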

## Materials for further study

*  [PEFT documentation](https://huggingface.co/docs/peft/v0.10.0/en/index)
*  [LoRA paper](https://arxiv.org/abs/2106.09685)
*  [Prompt tuning paper](https://arxiv.org/abs/2104.08691)