### Parameter Efficient Fine-Tuning
In this notebook, you're gonna fine-tune large language models within limited GPU memory.

In [1]:
# Original library versions
# %pip install --quiet transformers==4.34.1 accelerate==0.24.0 sentencepiece==0.1.99 optimum==1.13.2 peft==0.5.0 bitsandbytes==0.41.2.post2

# Preferred versions for Colab as of October 2025 (thanks, Lev!)
%pip install --quiet "bitsandbytes==0.45.3" "transformers>=4.43,<4.46" "accelerate>=0.33,<0.36" "peft>=0.11.1" "optimum>=1.20.0" "sentencepiece"

Note: you may need to restart the kernel to use updated packages.


In [1]:
import torch
import torch.nn as nn
import torch.nn.functional as F

import transformers
from tqdm.auto import tqdm, trange
assert torch.cuda.is_available(), "you need cuda for this part"
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

The cache for model files in Transformers v4.22.0 has been updated. Migrating your old cache. This is a one-time only operation. You can interrupt this and resume the migration later on by calling `transformers.utils.move_cache()`.
0it [00:00, ?it/s]


In [16]:
model_name = 'Enoch/llama-7b-hf'

# loading Llama tokenizer ...
tokenizer = transformers.LlamaTokenizer.from_pretrained(model_name, device_map=device)
tokenizer.pad_token_id = tokenizer.eos_token_id

# ... and the model itself
model = transformers.AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map='auto',
    low_cpu_mem_usage=True,
    offload_state_dict=True,
    load_in_4bit=True,
    torch_dtype=torch.float32,  # weights are 4-bit; layernorms and activations are fp32
)
for param in model.parameters():
    param.requires_grad=False

model.gradient_checkpointing_enable()  # only store a small subset of activations, re-compute the rest.
model.enable_input_require_grads()     # override an implementation quirk in gradient checkpoints that disables backprop unless inputs require grad
# more on gradient checkpointing: https://pytorch.org/docs/stable/checkpoint.html https://arxiv.org/abs/1604.06174

The `load_in_4bit` and `load_in_8bit` arguments are deprecated and will be removed in the future versions. Please, pass a `BitsAndBytesConfig` object in `quantization_config` argument instead.
Loading checkpoint shards: 100%|████████████████| 33/33 [00:09<00:00,  3.51it/s]
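The deprecation warning above asks for a `BitsAndBytesConfig` passed through `quantization_config`. Below is a minimal sketch of an equivalent call, assuming the same checkpoint and the default 4-bit settings. The function name `load_llama_4bit` is hypothetical, and actually calling it needs a CUDA GPU with `bitsandbytes` installed.

```python
def load_llama_4bit(model_name: str = "Enoch/llama-7b-hf"):
    """Load the checkpoint in 4-bit without the deprecated `load_in_4bit` keyword."""
    # Imports live inside the function so that merely defining it needs no GPU.
    import torch
    import transformers

    # Assumption: default 4-bit settings, mirroring `load_in_4bit=True` above.
    quantization_config = transformers.BitsAndBytesConfig(load_in_4bit=True)
    return transformers.AutoModelForCausalLM.from_pretrained(
        model_name,
        device_map="auto",
        low_cpu_mem_usage=True,
        quantization_config=quantization_config,
        torch_dtype=torch.float32,  # non-quantized parts (layernorms, activations) stay fp32
    )
```

Calling `model = load_llama_4bit()` should be a drop-in replacement for the loading cell; the freezing and gradient checkpointing lines that follow it stay the same.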


### Prompt tuning: the story of a fox (1 point)

![img](https://i.imgur.com/Ux3qQAu.png) (source: theodd1souts.fandom.com)

In [3]:
prompt = 'A quick brown fox'
batch = tokenizer(prompt, return_tensors='pt', return_token_type_ids=False).to(device)

for i in range(10):
    next_token = model(**batch).logits[0, -1].argmax(-1).reshape(1, 1)
    batch['input_ids'] = torch.cat([batch['input_ids'], next_token], dim=-1)
    batch['attention_mask'] = torch.cat([batch['attention_mask'], torch.ones_like(next_token)], dim=-1)

print("\nOutput:", tokenizer.decode(batch['input_ids'][0].cpu().numpy().tolist()))

Starting from v4.46, the `logits` model output will have the same type as the model (except at train time, where it will always be FP32)



Output: <s>A quick brown fox jumps over the lazy dog.
A quick


What a blatant lie! This particular fox assures you that it did not, in fact, jump over the lazy dog. No, sir! The fox was just minding its own business. __Your task is to train the model to tell the truth: no dog was jumped over today.__

In [4]:
the_truth = "A quick brown fox did not jump over the lazy dog. Besides, that dog deserved it anyway!"
batch = tokenizer(the_truth, return_tensors='pt', return_token_type_ids=False).to(device)
outputs = model(**batch)

next_word_logits = outputs.logits[:, :-1]
true_next_tokens = batch['input_ids'][:, 1:]
loss = F.cross_entropy(next_word_logits.flatten(0, 1), true_next_tokens.flatten(0, 1))

print("Loss:", loss)

Loss: tensor(3.0729, device='cuda:0', grad_fn=<NllLossBackward0>)


Except, we can't train the entire model - that would be 28GB gradients in float32. Instead, let's run [prompt tuning](https://arxiv.org/abs/2104.08691).

![img](https://i.imgur.com/VwNNKnb.png)
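To make the memory argument concrete, here is a back-of-the-envelope sketch that counts parameters from the LLaMA-7B layer dimensions and compares fp32 gradient storage for full fine-tuning against prompt tuning. It assumes 16 prompt vectors and ignores optimizer state, which would add more on top.

```python
# LLaMA-7B dimensions (from the published model config).
vocab_size, hidden, intermediate, n_layers = 32000, 4096, 11008, 32

attention = 4 * hidden * hidden      # q, k, v, o projections
mlp = 3 * hidden * intermediate      # gate, up, down projections
norms = 2 * hidden                   # two RMSNorm weight vectors per layer
per_layer = attention + mlp + norms

total_params = (
    n_layers * per_layer
    + 2 * vocab_size * hidden        # input embeddings + lm_head
    + hidden                         # final RMSNorm
)
prompt_params = 16 * hidden          # assumption: 16 learnable prompt vectors

bytes_fp32 = 4
print(f"all parameters: {total_params:,}")
print(f"fp32 gradients: {total_params * bytes_fp32 / 1e9:.1f} GB")
print(f"prompt params:  {prompt_params:,}")
print(f"prompt grads:   {prompt_params * bytes_fp32 / 1e6:.2f} MB")
```

The exact count lands slightly below the round "7B" used for the 28GB figure, while the prompt gradients fit in a fraction of a megabyte.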


In [5]:
class WordEmbeddingsWithLearnedPrompts(nn.Module):
    """
    To perform prompt tuning, you will need to replace the model's original word embeddings with a layer - THIS layer
    - that inserts trainable prompts instead of the first N token embeddings.
    """

    def __init__(self, word_embeddings: nn.Embedding, num_prompts: int):
        super().__init__()
        self.original_word_embeddings = word_embeddings
        self.num_prompts = num_prompts
        self.learnable_prompts = nn.Parameter(
            torch.randn(1, num_prompts, word_embeddings.embedding_dim), requires_grad=True
        )

    def forward(self, input_ids: torch.LongTensor):
        # input_ids shape: [batch_size, seq_length]
        assert input_ids.dtype == torch.int64
        assert input_ids.shape[1] > self.num_prompts
        assert torch.all(input_ids[:, :self.num_prompts] == tokenizer.pad_token_id).item(), (
            "Don't forget to prepend num_prompts PAD tokens to input_ids"
        )

        # Embed the input_ids using the original word embeddings
        input_embeddings = self.original_word_embeddings(input_ids)  # Shape: [batch_size, seq_length, embedding_dim]

        # Replace the first num_prompts token embeddings with the learnable prompts
        batch_size = input_ids.shape[0]
        learnable_prompts_expanded = self.learnable_prompts.expand(batch_size, -1, -1)  # Shape: [batch_size, num_prompts, embedding_dim]
        remaining_embeddings = input_embeddings[:, self.num_prompts:, :]  # Shape: [batch_size, seq_length - num_prompts, embedding_dim]

        # Concatenate learnable prompts with the embeddings of the remaining tokens
        output_embeddings = torch.cat([learnable_prompts_expanded, remaining_embeddings], dim=1)

        return output_embeddings


In [6]:
num_prompts = 16
test_emb_layer = WordEmbeddingsWithLearnedPrompts(model.model.embed_tokens, num_prompts=num_prompts).to(device)
test_input_ids = tokenizer("a cat say on a may", return_tensors='pt')['input_ids'].to(device)

space_for_prompts = torch.full([len(test_input_ids), num_prompts], fill_value=tokenizer.pad_token_id,
                               dtype=torch.int64, device=device)
test_inputs_with_prompts = torch.cat([space_for_prompts, test_input_ids], dim=1)

with torch.cuda.amp.autocast():
  test_prompt_embeddings = test_emb_layer(test_inputs_with_prompts)

assert test_prompt_embeddings.shape[:2] == test_inputs_with_prompts.shape
assert test_prompt_embeddings.shape[-1] == model.config.hidden_size
assert torch.allclose(test_prompt_embeddings[:, :num_prompts], test_emb_layer.learnable_prompts.float())
assert torch.allclose(test_prompt_embeddings[:, num_prompts:], model.model.embed_tokens(test_input_ids).float())
print("Looks legit!")

Looks legit!




__Now that it works,__ let's inject learnable prompts into the main model and teach it about foxes.

In [7]:
assert isinstance(model.model.embed_tokens, nn.Embedding), "you have already replaced the embedding layer. If the replacement is broken, please reload the model"

model.model.embed_tokens = WordEmbeddingsWithLearnedPrompts(model.model.embed_tokens, num_prompts=num_prompts).to(device)

opt = torch.optim.Adam([model.model.embed_tokens.learnable_prompts], lr=0.01)

In [10]:
the_truth = "A quick brown fox did not jump over the lazy dog. Besides, that dog deserved it anyway!"
batch = tokenizer(the_truth, return_tensors='pt', return_token_type_ids=False).to(device)
space_for_prompts = torch.full(
    [len(batch['input_ids']), num_prompts],
    fill_value=tokenizer.pad_token_id,
    dtype=torch.int64, device=device)
batch['input_ids'] = torch.cat([space_for_prompts, batch['input_ids']], dim=1)
batch['attention_mask'] = torch.cat([torch.ones_like(space_for_prompts), batch['attention_mask']], dim=1)


target_loss = 0.10
max_steps = 1500
log_every = 50

last_loss = None

for step in range(1, max_steps + 1):
    opt.zero_grad(set_to_none=True)
    with torch.cuda.amp.autocast(dtype=torch.bfloat16):
        outputs = model(**batch)
        next_word_logits = outputs.logits[:, num_prompts:-1, :]          # [B, T-1-P, V]
        true_next_tokens = batch['input_ids'][:, num_prompts + 1:]       # [B, T-1-P]
        loss = F.cross_entropy(
            next_word_logits.flatten(0, 1),
            true_next_tokens.flatten(0, 1)
        )
    loss.backward()
    opt.step()

    last_loss = loss.item()
    if step % log_every == 0 or step == 1:
        print(f"step={step:4d}  loss={last_loss:.4f}")

    if last_loss <= target_loss:
        print(f"early stop at step {step}, loss={last_loss:.4f}")
        break

        
# outputs = model(**batch)
# next_word_logits = outputs.logits[:, num_prompts : -1, :]
# true_next_tokens = batch['input_ids'][:, num_prompts + 1:]
# loss = F.cross_entropy(next_word_logits.flatten(0, 1), true_next_tokens.flatten(0, 1))
# print("Loss:", loss)


# raise NotImplemented("Your task: iteratively train the model to reduce loss using prompt optimizer (opt)")

step=   1  loss=0.0824
early stop at step 1, loss=0.0824




In [11]:
# Final loss assertion
assert loss.item() <= 0.1
print("Good job!")

Good job!


In [14]:
prompt = 'A quick brown fox'
batch = tokenizer(prompt, return_tensors='pt', return_token_type_ids=False).to(device)
batch['input_ids'] = torch.cat([space_for_prompts, batch['input_ids']], dim=1)
batch['attention_mask'] = torch.cat([torch.ones_like(space_for_prompts), batch['attention_mask']], dim=1)


for i in range(15):
    next_token = model(**batch).logits[0, -1].argmax(-1).reshape(1, 1)
    batch['input_ids'] = torch.cat([batch['input_ids'], next_token], dim=-1)
    batch['attention_mask'] = torch.cat([batch['attention_mask'], torch.ones_like(next_token)], dim=-1)

print("\nOutput:", tokenizer.decode(batch['input_ids'][0, num_prompts:].cpu().numpy().tolist()))

# if you did everything right, the model will deny that the fox jumped over the lazy dog


Output: <s>A quick brown fox did not jump over the lazy dog. Besides, that dog deserved it


### Using HuggingFace PEFT (2 points)

[`peft`](https://huggingface.co/docs/peft/index) is a sister library of `transformers` that lets you apply various __p__arameter __e__fficient __f__ine-__t__uning methods to pre-trained transformers. The library implements prompt tuning, prefix tuning, and several adapter-based techniques under a common interface.



In [23]:
model_name = 'Enoch/llama-7b-hf'

# loading Llama tokenizer ...
tokenizer = transformers.LlamaTokenizer.from_pretrained(model_name, device_map=device)
tokenizer.pad_token_id = tokenizer.eos_token_id

# ... and the model itself
model = transformers.AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map='auto',
    low_cpu_mem_usage=True,
    offload_state_dict=True,
    load_in_4bit=True,
    torch_dtype=torch.float32,  # weights are 4-bit; layernorms and activations are fp32
)
for param in model.parameters():
    param.requires_grad=False

model.gradient_checkpointing_enable()  # only store a small subset of activations, re-compute the rest.
model.enable_input_require_grads()     # override an implementation quirk in gradient checkpoints that disables backprop unless inputs require grad
# more on gradient checkpointing: https://pytorch.org/docs/stable/checkpoint.html https://arxiv.org/abs/1604.06174

The `load_in_4bit` and `load_in_8bit` arguments are deprecated and will be removed in the future versions. Please, pass a `BitsAndBytesConfig` object in `quantization_config` argument instead.
Loading checkpoint shards: 100%|████████████████| 33/33 [00:10<00:00,  3.06it/s]


In [24]:
import peft
assert isinstance(model.model.embed_tokens, nn.Embedding), "please reload the model"

peft_config = peft.PromptTuningConfig(task_type=peft.TaskType.CAUSAL_LM, num_virtual_tokens=16)
model = peft.get_peft_model(model, peft_config)  # note: for most peft methods, this line also modifies model in-place
print("Trainable parameters:", sum(p.numel() for p in model.parameters() if p.requires_grad))
print("Total parameters (excluding quantization):", sum(p.numel() for p in model.parameters()))

Trainable parameters: 65536
Total parameters (excluding quantization): 3500478464


In [None]:
# Your task: optimize the PEFT-wrapped model to achieve next token prediction loss < 0.1, but this time using PEFT
# Please note: you no longer need to prepend PAD tokens, but you still need to skip :num_virtual_tokens: first logits.
# Finally, generate the sentence to make sure that the model learned the truth.
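The slicing described in the task can be easy to get wrong by one position. The toy sketch below uses plain lists (hypothetical token ids and a hypothetical `num_virtual_tokens = 3`) to show which logits rows line up with which targets once the virtual-token rows are skipped.

```python
num_virtual_tokens = 3                 # hypothetical; the notebook uses 16
input_ids = [1, 319, 4996, 17354]      # hypothetical token ids

# The wrapped model returns one logits row per virtual token plus one per real token.
logit_rows = [f"virtual_{i}" for i in range(num_virtual_tokens)]
logit_rows += [f"after_token_{i}" for i in range(len(input_ids))]

# Drop the virtual-token rows and the row after the last token (nothing left to predict).
prediction_rows = logit_rows[num_virtual_tokens:-1]
targets = input_ids[1:]                # the row after token i is scored against token i + 1

assert len(prediction_rows) == len(targets)
for row, target in zip(prediction_rows, targets):
    print(f"{row:>14} -> target id {target}")
```

With real tensors this corresponds to `logits[:, num_virtual_tokens:-1]` paired with `input_ids[:, 1:]`.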

In [25]:
# Define the ground truth sentence
# the_truth = "A quick brown fox did not jump over the lazy dog. Besides, that dog deserved it anyway!"
the_truth = "A quick brown fox did not jump over the lazy dog. Besides, that dog deserved it anyway!"
batch = tokenizer(the_truth, return_tensors="pt", return_token_type_ids=False).to(device)

In [26]:
# Training Configuration
loss_threshold = 0.1  # Desired loss threshold
num_epochs = 100  # Max number of epochs
learning_rate = 5e-2 #<YOUR CODE HERE>  # Learning rate

# Define the optimizer for trainable parameters (PEFT prompts)
optimizer = torch.optim.AdamW(
    [p for p in model.parameters() if p.requires_grad],
    lr=learning_rate, betas=(0.9, 0.999), weight_decay=0.0
) #<YOUR CODE HERE>

# Define the ground truth
the_truth = "A quick brown fox did not jump over the lazy dog. Besides, that dog deserved it anyway!"
batch = tokenizer(the_truth, return_tensors="pt", return_token_type_ids=False).to(device)

model.train()

# Training Loop
for epoch in range(num_epochs):
    optimizer.zero_grad(set_to_none=True)

    with torch.cuda.amp.autocast(dtype=torch.bfloat16):
        # Forward pass
        outputs = model(**batch)

        # Skip logits for virtual tokens and the last token
        next_word_logits = outputs.logits[:, peft_config.num_virtual_tokens:-1, :]
        true_next_tokens = batch["input_ids"][:, 1:]  # Shift ground truth tokens by one

        # Compute the loss
        loss = F.cross_entropy(
            next_word_logits.reshape(-1, next_word_logits.size(-1)),
            true_next_tokens.reshape(-1)
        )

    # Backpropagation runs outside the autocast context, as in the earlier training loop
    loss.backward()
    optimizer.step()

    # Print loss for tracking
    print(f"Epoch {epoch + 1}/{num_epochs}, Loss: {loss.item():.6f}")

    # Stop training if loss is below threshold
    if loss.item() < loss_threshold:
        print("Loss threshold reached. Stopping training.")
        break
else:
    print("Maximum epochs reached without meeting the loss threshold.")



Epoch 1/100, Loss: 8.140562
Epoch 2/100, Loss: 6.515396
Epoch 3/100, Loss: 5.520456
Epoch 4/100, Loss: 5.023056
Epoch 5/100, Loss: 4.482587
Epoch 6/100, Loss: 3.950639
Epoch 7/100, Loss: 3.455940
Epoch 8/100, Loss: 2.990002
Epoch 9/100, Loss: 2.574778
Epoch 10/100, Loss: 2.191291
Epoch 11/100, Loss: 1.843169
Epoch 12/100, Loss: 1.551942
Epoch 13/100, Loss: 1.306780
Epoch 14/100, Loss: 1.090453
Epoch 15/100, Loss: 0.881252
Epoch 16/100, Loss: 0.714252
Epoch 17/100, Loss: 0.551675
Epoch 18/100, Loss: 0.417700
Epoch 19/100, Loss: 0.319308
Epoch 20/100, Loss: 0.247861
Epoch 21/100, Loss: 0.197905
Epoch 22/100, Loss: 0.157392
Epoch 23/100, Loss: 0.120003
Epoch 24/100, Loss: 0.097450
Loss threshold reached. Stopping training.


In [27]:
# Final assertion to ensure loss is below threshold
assert loss.item() < loss_threshold, "Training failed to reduce loss below threshold."
print("Training successful! Loss is below 0.1.")

Training successful! Loss is below 0.1.


In [28]:
prompt = "A quick brown fox"
batch = tokenizer(prompt, return_tensors="pt", return_token_type_ids=False).to(device)

# Generate 15 tokens greedily
for i in range(15):
    # Forward pass to get the logits
    outputs = model(**batch)
    next_token = outputs.logits[0, -1].argmax(-1).reshape(1, 1)

    # Append the next token to input_ids
    batch["input_ids"] = torch.cat([batch["input_ids"], next_token], dim=-1)

    # Update the attention_mask to match the new input_ids length
    new_attention_mask = torch.ones_like(next_token, dtype=batch["attention_mask"].dtype).to(device)
    batch["attention_mask"] = torch.cat([batch["attention_mask"], new_attention_mask], dim=-1)

# Decode the generated sequence
# PEFT adds the virtual tokens internally, so `input_ids` holds only real tokens and needs no slicing here
decoded_output = tokenizer.decode(batch["input_ids"][0].cpu().numpy().tolist(), skip_special_tokens=True)
print("\nOutput:", decoded_output)



Output: A quick brown fox did not jump over the lazy dog. Besides, that dog deserved it


### Parameter-efficient finetuning with LoRA (2 points)

When training on more serious tasks, you can use low-rank adapters based on the [LoRA paper](https://arxiv.org/pdf/2106.09685.pdf).

The core idea is to add low-rank adapters __in parallel with existing linear layers,__ like this:
<center><img src="https://i.imgur.com/6bQLNiG.png" width=240px></center>

In the original LoRA paper, the adapters were only added to attention projection matrices. However, [subsequent works](https://arxiv.org/abs/2305.14314) show that it is useful to adapt FFNs as well. But before we do any training, we need to implement the basic LoRA layer.
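Before implementing the layer, it may help to see the arithmetic on tiny matrices. The sketch below uses plain Python lists and the convention `x @ W + x @ A @ B`, with `A` of shape `[in, rank]` and `B` of shape `[rank, out]`; all numeric values are made up for illustration. It shows that a zero-initialized `B` leaves the frozen layer's output unchanged, and it compares parameter counts for a 4096x4096 projection at rank 8.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def add(a, b):
    """Element-wise sum of two matrices of the same shape."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

x = [[1.0, 2.0, 3.0]]                       # one input row with 3 features
W = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]    # frozen weight, shape [3, 2]
A = [[0.5], [-0.5], [1.0]]                  # adapter A, shape [3, rank=1]
B_zero = [[0.0, 0.0]]                       # adapter B at initialization, shape [1, 2]
B_trained = [[2.0, -1.0]]                   # hypothetical values after some training

base = matmul(x, W)
at_init = add(base, matmul(matmul(x, A), B_zero))
after_training = add(base, matmul(matmul(x, A), B_trained))

print("frozen output:  ", base)
print("with B = 0:     ", at_init)
print("with trained B: ", after_training)

# Parameter count for one 4096x4096 projection with a rank-8 adapter.
full_params = 4096 * 4096
adapter_params = 4096 * 8 + 8 * 4096
print(f"adapter / full = {adapter_params:,} / {full_params:,}")
```

Starting from `B = 0` means training begins exactly at the pre-trained model, and only the two thin matrices receive gradients.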

In [29]:
# re-load the model to remove any previous PEFT tuners
model = transformers.AutoModelForCausalLM.from_pretrained(
    model_name, device_map='auto', low_cpu_mem_usage=True, offload_state_dict=True,
    load_in_4bit=True, torch_dtype=torch.float32,  # weights are 4-bit; layernorms and activations are fp32
)
for param in model.parameters():
    param.requires_grad=False
model.gradient_checkpointing_enable()
model.enable_input_require_grads()

The `load_in_4bit` and `load_in_8bit` arguments are deprecated and will be removed in the future versions. Please, pass a `BitsAndBytesConfig` object in `quantization_config` argument instead.
Loading checkpoint shards: 100%|████████████████| 33/33 [00:09<00:00,  3.35it/s]


In [30]:
class LoRALayer(nn.Module):
    """Wraps an existing (frozen) linear layer with a LoRA-like adapter."""
    def __init__(self, module: nn.Linear, rank: int):
        super().__init__()
        self.module = module  # pre-trained (frozen) linear layer
        self.adapter_A = nn.Parameter(torch.empty(module.in_features, rank, device=module.weight.device))
        nn.init.kaiming_uniform_(self.adapter_A, a=5 ** 0.5)
        self.adapter_B = nn.Parameter(torch.zeros(rank, module.out_features, device=module.weight.device))

    def forward(self, input):
        # Apply self.module and LoRA adapter, return the sum (self.module outputs + adapter outputs)
        original_output = self.module(input)
        lora_output = input @ self.adapter_A @ self.adapter_B

        return original_output + lora_output

In [31]:
# test your implementation
test_linear = nn.Linear(128, 128)
test_linear.weight.data[...] = torch.eye(128)
test_adapter = LoRALayer(test_linear, rank=8)

assert torch.allclose(test_adapter(torch.ones(1, 1, 128)), test_linear.bias + 1), "please check your forward pass"

test_adapter.adapter_A.data[...] = torch.linspace(0.1, -0.5, 128 * 8).view(128, 8)
test_adapter.adapter_B.data[...] = torch.linspace(0.5, -0.1, 128 * 8).view(8, 128)
test_linear.bias.data[...] = torch.linspace(1., -1., 128)

dummy_loss = F.mse_loss(test_adapter(torch.ones(1, 128) / 128).squeeze(), torch.linspace(-1, 1, 128))
assert torch.allclose(dummy_loss, torch.tensor(1.3711389), rtol=0, atol=1e-4)
dummy_loss.backward()
assert all(w.grad is not None for w in [test_adapter.adapter_A, test_adapter.adapter_B]), "some adapter weights have no grad"
assert torch.allclose(test_adapter.adapter_A.grad.sum(), torch.tensor(-0.60158), rtol=0, atol=1e-4), "bad grad w.r.t. A"
assert torch.allclose(test_adapter.adapter_B.grad.sum(), torch.tensor(0.9931), rtol=0, atol=1e-4), "bad grad w.r.t. B"
# note: bad grad means that your code is different from LoRA paper OR that your code is not autograd-friendly (e.g. no_grad)
del dummy_loss, test_linear, test_adapter
print("All tests passed!")

All tests passed!


### Apply LoRA to the model

The code below applies LoRA adapters on top of Q/K/V linear layers in Llama attention. You may also choose to modify other layers:
* self_attn.o_proj - attention output projection
* mlp.up_proj, mlp.gate_proj, mlp.down_proj - transformer feedforward layers
* lm_head - output LM head

In [32]:
lora_rank = 8

for name, module in model.model.layers.named_modules():
    if 'LlamaDecoderLayer' in repr(type(module)):
        module.self_attn.q_proj = LoRALayer(module.self_attn.q_proj, rank=lora_rank).to(device)
        module.self_attn.k_proj = LoRALayer(module.self_attn.k_proj, rank=lora_rank).to(device)
        module.self_attn.v_proj = LoRALayer(module.self_attn.v_proj, rank=lora_rank).to(device)

assert sum(isinstance(module, LoRALayer) for module in model.modules()) == 96  # for Llama-7B

In [33]:
batch = tokenizer("This model wants to share its greatest secret:", return_tensors='pt', return_token_type_ids=False).to(device)
# test a single training step, make sure we get meaningful gradients
with torch.cuda.amp.autocast(dtype=torch.float32):
    out = model.forward(**batch)
    (out.logits.norm() / 100).backward()

for i, module in enumerate(model.modules()):
    if isinstance(module, LoRALayer):
        assert module.adapter_B.grad is not None
        assert module.adapter_B.grad.norm().item() > 0

model.zero_grad(set_to_none=True)
print("Grad check successful, well done!")

Grad check successful, well done!




### (example) How to train your model

The example below shows how to train the LoRA adapters on a dummy dataset. You will need to run a _similar_ training task later.

__Note:__ please scroll down for the homework task

In [34]:
# checking if the model can learn. Change max_steps for proper training
import datasets
data = datasets.load_dataset("Abirate/english_quotes", split="train[:32]") # 32 lines
data = data.map(lambda samples: tokenizer(samples['quote']), batched=True)
model._hf_peft_config_loaded = True  # silence a warning from HF trainer

trainer = transformers.Trainer(
    model=model, train_dataset=data,
    args=transformers.TrainingArguments(
        per_device_train_batch_size=2, 
        gradient_accumulation_steps=1, # for effectively larger batch size
        warmup_steps=250, 
        max_steps=100, 
        learning_rate=2e-4, 
        fp16=True,
        logging_steps=1, 
        output_dir='outputs', 
        report_to="none",  # the string "none" disables logging integrations; None falls back to the default
        save_strategy="no" # to make it work as of October 2025 (thanks, Vlad!) 
    ),
    data_collator=transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False)
)
# if you see cache warnings, set `model.config.use_cache = False` to silence them. Please re-enable for inference!

trainer.train()

# NOTE: this is just an example! you do not have to wait for this progressbar to finish :)

Generating train split: 100%|████| 2508/2508 [00:00<00:00, 451938.24 examples/s]
Map: 100%|█████████████████████████████| 32/32 [00:00<00:00, 2510.06 examples/s]
  self.scaler = torch.cuda.amp.GradScaler(**kwargs)
max_steps is given, it will override any value given in num_train_epochs
/usr/bin/ld: cannot find -laio: No such file or directory
collect2: error: ld returned 1 exit status
/usr/bin/ld: cannot find -laio: No such file or directory
collect2: error: ld returned 1 exit status


Step,Training Loss
1,1.2671
2,0.3777
3,1.4698
4,1.4346
5,0.8571
6,1.6613
7,1.8577
8,1.286
9,0.5655
10,1.3035


TrainOutput(global_step=100, training_loss=0.5433662439323962, metrics={'train_runtime': 28.33, 'train_samples_per_second': 7.06, 'train_steps_per_second': 3.53, 'total_flos': 634829603291136.0, 'train_loss': 0.5433662439323962, 'epoch': 6.25})
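One detail of the example above is worth checking by hand: `warmup_steps=250` is larger than `max_steps=100`, so the run ends while the learning rate is still warming up. The sketch below assumes the Trainer's default linear schedule (linear warmup, then linear decay); the helper `linear_warmup_lr` is a hypothetical re-implementation for illustration, and the exact step indexing inside the Trainer may differ by one.

```python
def linear_warmup_lr(step: int, base_lr: float, warmup_steps: int, max_steps: int) -> float:
    """Learning rate at a given scheduler step: linear warmup followed by linear decay."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, max_steps - step)
    return base_lr * remaining / max(1, max_steps - warmup_steps)

for step in (1, 50, 100):
    lr = linear_warmup_lr(step, base_lr=2e-4, warmup_steps=250, max_steps=100)
    print(f"step {step:3d}: lr = {lr:.2e}")
```

Under these assumptions the learning rate never exceeds about 40% of the nominal `2e-4`, which is fine for a smoke test but worth revisiting for a real run.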

### Final task: *actually* train the model (5 points)

Your task is to fine-tune the model to _generate python code_. Please use the above examples for inspiration. More specifically,

* __dataset:__ use [codeparrot-clean](https://huggingface.co/datasets/codeparrot/codeparrot-clean) or any other data containing python code. Since you do not need much data for this exercise, a small subset of the `codeparrot-clean` train split is enough
* __preprocessing:__ select python code based on file extensions (.py) (you may skip this for codeparrot, which is 100% python)
* __short lines:__ please take the first 512 characters of each line
* __adapter type:__ please use LoRA as defined above __plus at least one of:__
   - extra adapter on lm_head
   - extra adapter on MLP components (mlp.*)
   - trainable input embeddings (requires tweaking memory usage)

* __training:__ you do not have to train to convergence. If all goes well, your model should `.generate` code after 500 steps. Please use batch size of at least 4 (4 x 1 x 512 tokens) using `gradient_accumulation_steps=4`.
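As a quick sanity check on the training budget in the list above, the sketch below works out the effective batch size and the amount of data consumed. It assumes a single GPU, one sequence per device, and sequences padded or truncated to 512 tokens; the variable names mirror `TrainingArguments` fields but this is plain arithmetic.

```python
per_device_train_batch_size = 1   # assumption: one 512-token sequence fits per forward pass
gradient_accumulation_steps = 4
max_length = 512                  # tokens per sequence after truncation/padding
max_steps = 500

sequences_per_step = per_device_train_batch_size * gradient_accumulation_steps
tokens_per_step = sequences_per_step * max_length   # upper bound, includes padding
sequences_seen = sequences_per_step * max_steps

print(f"effective batch size: {sequences_per_step} sequences")
print(f"tokens per optimizer step: up to {tokens_per_step}")
print(f"sequences consumed in {max_steps} steps: {sequences_seen}")
```

So a few thousand training samples are enough to cover 500 optimizer steps without repeating data.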


Note: the peft library also has a LoRA implementation. However, we ask that for this assignment you show at least one complete training run with your own LoRA code.

__Alternative assignment:__ Instead of doing python code, feel free to substitute the task with any other dataset, e.g. your favorite artist or podcast, as long as it's ethical. If you choose your own task, please show examples of what your model learned - or did not learn, akin to the code examples below.

In [39]:
prompts = ['', 'import', 'from', 'while', 'try', 'if', 'for', 'torch']  # feel free to add a few more that are not 100% associated with Python

# <A WHOLE LOT OF YOUR CODE>
# generate baseline samples with the selected prompts before finetuning
# please feel free to use transformers.Trainer (as above) or your custom training code
# after the training concludes, please show examples of text generated by your model. It is expected to look like Python code fragments
# print the generation examples nicely (suggestion: use pandas or HTML) for easier comparison
# note: your LoRA-enhanced model can run generation the same way as the non-trained model (above)

In [None]:
import torch
from torch import nn
import transformers
from datasets import load_dataset

class LoRALayer(nn.Module):
    """LoRA adapter on top of an arbitrary linear layer (including bitsandbytes Linear4bit)."""
    def __init__(self, module: nn.Module, rank: int):
        super().__init__()
        self.module = module
        in_features = module.in_features
        out_features = module.out_features

        # A: [in_features, r], B: [r, out_features]
        self.adapter_A = nn.Parameter(torch.empty(in_features, rank, device=next(module.parameters()).device))
        nn.init.kaiming_uniform_(self.adapter_A, a=5 ** 0.5)
        self.adapter_B = nn.Parameter(torch.zeros(rank, out_features, device=next(module.parameters()).device))

        # alpha = 16
        self.scaling = 16.0 / rank

        # freeze the base layer
        for p in self.module.parameters():
            p.requires_grad = False

    def forward(self, input: torch.Tensor) -> torch.Tensor:
        original_output = self.module(input)
        lora_output = (input @ self.adapter_A @ self.adapter_B) * self.scaling
        return original_output + lora_output


# LLaMA linear layers to wrap with LoRA
TARGET_LINEAR_NAMES = [
    # attention
    "q_proj", "k_proj", "v_proj", "o_proj",
    # MLP
    "gate_proj", "up_proj", "down_proj",
    # lm_head
    "lm_head",
]

def replace_module_with_lora(model: nn.Module, target_names, rank: int, include_lm_head: bool = True):
    """
    Recursively wraps target linear layers (nn.Linear or bnb.nn.Linear4bit) with our LoRALayer.
    Returns a list of paths to replaced modules.
    """
    replaced = []

    def _should_wrap(name):
        if name == "lm_head":
            return include_lm_head
        return any(name.endswith(t) or name == t for t in target_names if t != "lm_head")

    def _wrap(parent, name, module):
        # Check if this is a linear layer; bitsandbytes Linear4bit also has .in_features/.out_features
        if not (hasattr(module, "in_features") and hasattr(module, "out_features")):
            return False

        # Wrap the module with LoRALayer
        lora = LoRALayer(module, rank=rank)
        setattr(parent, name, lora)
        return True

    def _recursive(module: nn.Module, prefix=""):
        for child_name, child in module.named_children():
            full_name = f"{prefix}.{child_name}" if prefix else child_name
            # Decide whether to wrap
            if _should_wrap(child_name):
                if _wrap(module, child_name, child):
                    replaced.append(full_name)
                    continue
            # Or go inside
            _recursive(child, full_name)

    _recursive(model)
    return replaced

def count_trainable_params(model: nn.Module):
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total = sum(p.numel() for p in model.parameters())
    return trainable, total


In [None]:
model_name = 'Enoch/llama-7b-hf'
tokenizer = transformers.LlamaTokenizer.from_pretrained(model_name)
if tokenizer.pad_token_id is None:
    tokenizer.pad_token_id = tokenizer.eos_token_id

model = transformers.AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map='auto',
    low_cpu_mem_usage=True,
    offload_state_dict=True,
    load_in_4bit=True,
    torch_dtype=torch.float32
)

# freeze base weights
for p in model.parameters():
    p.requires_grad = False

model.gradient_checkpointing_enable()
model.enable_input_require_grads()

# silence cache warnings during training (re-enable use_cache for inference)
if getattr(model.config, "use_cache", True):
    model.config.use_cache = False

# wrap LoRA layers
lora_rank = 8 
replaced_paths = replace_module_with_lora(model, TARGET_LINEAR_NAMES, rank=lora_rank, include_lm_head=True)
print(f"LoRA wrapped {len(replaced_paths)} modules:")
for p in replaced_paths[:10]:
    print("  ", p)
if len(replaced_paths) > 10:
    print("  ...")

# count train params
trainable, total = count_trainable_params(model)
print(f"Trainable parameters: {trainable:,} / {total:,} ({100*trainable/total:.2f}%)")


The `load_in_4bit` and `load_in_8bit` arguments are deprecated and will be removed in the future versions. Please, pass a `BitsAndBytesConfig` object in `quantization_config` argument instead.
Loading checkpoint shards: 100%|████████████████| 33/33 [00:12<00:00,  2.69it/s]


LoRA wrapped 225 modules:
   model.layers.0.self_attn.q_proj
   model.layers.0.self_attn.k_proj
   model.layers.0.self_attn.v_proj
   model.layers.0.self_attn.o_proj
   model.layers.0.mlp.gate_proj
   model.layers.0.mlp.up_proj
   model.layers.0.mlp.down_proj
   model.layers.1.self_attn.q_proj
   model.layers.1.self_attn.k_proj
   model.layers.1.self_attn.v_proj
  ...
Trainable parameters: 20,277,248 / 3,520,690,176 (0.58%)
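The printed counts can be reproduced by hand, which is a cheap way to confirm that every intended layer was wrapped. The sketch below assumes the LLaMA-7B dimensions and rank 8 used above; `lora_params` is a hypothetical helper.

```python
hidden, intermediate, vocab_size, n_layers, rank = 4096, 11008, 32000, 32, 8

def lora_params(in_features: int, out_features: int, rank: int) -> int:
    """Parameters in one adapter: A is [in, rank], B is [rank, out]."""
    return in_features * rank + rank * out_features

attention = 4 * lora_params(hidden, hidden, rank)            # q, k, v, o
mlp = (
    2 * lora_params(hidden, intermediate, rank)              # gate_proj, up_proj
    + lora_params(intermediate, hidden, rank)                # down_proj
)
lm_head = lora_params(hidden, vocab_size, rank)

wrapped_modules = n_layers * 7 + 1                           # 7 per decoder layer + lm_head
trainable = n_layers * (attention + mlp) + lm_head

print(f"wrapped modules: {wrapped_modules}")
print(f"trainable parameters: {trainable:,}")
```

Both numbers match the output above. The total in that printout is roughly half the true 7B because each 4-bit weight tensor is stored packed, two values per byte, and `numel()` counts the packed elements.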


In [None]:
N_TRAIN = 5000  # as discussed in the tg chat: 2k–5k is enough
stream = load_dataset("codeparrot/codeparrot-clean", split="train", streaming=True)

# Collect only the first N_TRAIN samples into memory
samples = []
for ex in stream:
    # In codeparrot-clean, the "content" field is the code text (100% python)
    text = (ex.get("content") or "").strip()
    if not text:
        continue  # skip empty files
    samples.append({"text": text[:512]})
    if len(samples) >= N_TRAIN:
        break

from datasets import Dataset
train_ds = Dataset.from_list(samples)

def tok_map(ex):
    out = tokenizer(
        ex["text"],
        truncation=True,
        max_length=512,
        padding="max_length",
        return_tensors=None,
    )
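    # Note: DataCollatorForLanguageModeling(mlm=False) rebuilds labels from input_ids
    # and masks pad positions with -100, so this copy only matters without that collator.
    out["labels"] = out["input_ids"].copy()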
    return out

train_tok = train_ds.map(tok_map, batched=False, remove_columns=["text"])
train_tok = train_tok.with_format("torch")
len(train_tok)


Map: 100%|█████████████████████████| 5000/5000 [00:04<00:00, 1218.11 examples/s]


5000
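
Because every sample is padded to 512 tokens, short snippets consist mostly of padding, and those positions should not contribute to the loss. The sketch below illustrates the idea on plain Python lists. The helper `mask_pad_labels` and the token ids are made up for illustration and are not part of this notebook. It masks by attention mask rather than by pad token id, which is one way to keep a real end-of-sequence token when the pad token and the EOS token share an id.

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips by default


def mask_pad_labels(input_ids, attention_mask):
    """Copy input_ids into labels, replacing padded positions with IGNORE_INDEX."""
    return [tok if keep else IGNORE_INDEX for tok, keep in zip(input_ids, attention_mask)]


# Hypothetical toy sequence: 4 real tokens followed by 3 pad tokens (ids are made up)
input_ids = [1, 306, 1053, 2897, 2, 2, 2]
attention_mask = [1, 1, 1, 1, 0, 0, 0]

labels = mask_pad_labels(input_ids, attention_mask)
print(labels)  # [1, 306, 1053, 2897, -100, -100, -100]
```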

In [None]:

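def generate_snippet(prompt: str, max_new_tokens: int = 80) -> str:
    """Sample a continuation of `prompt` from the current model and return the decoded text."""
    model.eval()
    inputs = tokenizer(prompt, return_tensors="pt").to(next(model.parameters()).device)
    with torch.no_grad():
        out = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            do_sample=True, temperature=0.7, top_p=0.9,
            eos_token_id=tokenizer.eos_token_id,
            pad_token_id=tokenizer.pad_token_id,  # set explicitly to avoid the missing-pad-token warning
        )
    return tokenizer.decode(out[0], skip_special_tokens=True)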

baseline = {p: generate_snippet(p) for p in prompts}
baseline


{'': 'Delegates to the 2019 ACFAS Annual Scientific Session are invited to attend the ACFAS 2019 Annual Business Meeting. The meeting will be held on Tuesday, March 12, from 11:30 a.m. to 12:30 p.m., in the Regency Ballroom A',
 'import': 'import Foundation\n\nextension NSURL: URL {\n    public var absoluteString: String {\n        return (scheme as NSString).absoluteString\n    }\n}\n\nextension NSURL: URLConvertible {\n    public var url: String {\n        return absoluteString\n    }\n}\n\nextension URL: URLConvertible {\n    public var url: String {\n',
 'while': "while(1){\n    sleep(1);\n    echo 'Looping';\n}\necho 'Done';\n\\end{code}\n\n\\strong{OUTPUT}\n\n\\begin{code}\nLooping\nLooping\nLooping\nLooping\nLooping\nLooping\nLooping\nLooping\nLooping\nLooping\nLooping\nLooping",
 'try': "try to get the following error:\n\n\\begin{blockquote}\n\nC:\\ProgramData\\Panopta\\Panopta\\panopta.exe: 15576\n  ERROR: The requested operation requires elevation.\n\\end{blockquote}\n\nI'm u

In [None]:
from transformers import DataCollatorForLanguageModeling, TrainingArguments, Trainer

data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

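training_args = TrainingArguments(
    output_dir="outputs_lora_custom",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,   # effective batch size: 1 * 4 = 4 sequences per optimizer step
    learning_rate=2e-4,
    num_train_epochs=1,              # ignored: max_steps below takes precedence
    max_steps=500,
    warmup_steps=50,
    logging_steps=10,
    save_strategy="no", # as in tg chat
    report_to="none",                # "none" disables external loggers; None falls back to all installed integrations
    fp16=False,
    bf16=True,                       # needs a GPU with bfloat16 support, e.g. A100
    optim="adamw_torch",
    dataloader_num_workers=2,
)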

# tell Trainer that there are trainable adapters on top of the quantized model (our custom LoRA)
model._hf_peft_config_loaded = True   # critical to bypass the check (fix from tg chat)

# Keep the KV cache off during training: it is incompatible with gradient checkpointing
# and otherwise triggers a warning from the LLaMA modeling code
if getattr(model.config, "use_cache", True):
    model.config.use_cache = False
trainables = [n for n, p in model.named_parameters() if p.requires_grad]
print(f"Trainable tensors: {len(trainables)} (e.g.: {trainables[:5]})")

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_tok,
    data_collator=data_collator,
)
trainer.train()


max_steps is given, it will override any value given in num_train_epochs


Trainable tensors: 450 (e.g.: ['model.layers.0.self_attn.q_proj.adapter_A', 'model.layers.0.self_attn.q_proj.adapter_B', 'model.layers.0.self_attn.k_proj.adapter_A', 'model.layers.0.self_attn.k_proj.adapter_B', 'model.layers.0.self_attn.v_proj.adapter_A'])


Step,Training Loss
10,1.1868
20,1.0539
30,1.1456
40,1.1482
50,1.0195
60,1.2222
70,1.1783
80,1.1065
90,1.0391
100,1.0646


TrainOutput(global_step=500, training_loss=1.1075844860076904, metrics={'train_runtime': 768.3578, 'train_samples_per_second': 2.603, 'train_steps_per_second': 0.651, 'total_flos': 4.0720102588416e+16, 'train_loss': 1.1075844860076904, 'epoch': 0.4})
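
The throughput and epoch figures in the `TrainOutput` above follow from the training arguments. The sketch below redoes that arithmetic, assuming a single GPU so that one optimizer step consumes `per_device_train_batch_size * gradient_accumulation_steps` samples. The variable names are illustrative.

```python
per_device_batch = 1
grad_accum = 4
max_steps = 500
n_train = 5000
train_runtime_s = 768.3578  # 'train_runtime' from the TrainOutput above

samples_per_step = per_device_batch * grad_accum  # assumes a single GPU
samples_seen = samples_per_step * max_steps
epochs = samples_seen / n_train
samples_per_second = samples_seen / train_runtime_s
steps_per_second = max_steps / train_runtime_s

print(samples_seen, epochs, round(samples_per_second, 3), round(steps_per_second, 3))
```

With these numbers the run covers 2,000 of the 5,000 samples, which is why the reported epoch is 0.4 rather than 1.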

In [None]:
model.config.use_cache = True

after = {p: generate_snippet(p) for p in prompts}

from IPython.display import HTML, display
table_template = """<table style="border:1px solid black" >
  <tr>
    <th style="text-align: center; border:1px solid black">PROMPT</th>
    <th style="text-align: center; border:1px solid black">BEFORE</th>
    <th style="text-align: center; border:1px solid black">AFTER</th>
  </tr>
{}
</table>"""

row_template = '''  <tr>
    <td style="width:20%; border:1px solid black"><pre align="left">{}</pre></td>
    <td style="width:40%; border:1px solid black"><pre align="left">{}</pre></td>
    <td style="width:40%; border:1px solid black"><pre align="left">{}</pre></td>
  </tr>'''

def escape_html(s: str):
    return (s.replace("&", "&amp;")
             .replace("<", "&lt;")
             .replace(">", "&gt;"))

rows = []
for p in prompts:
    rows.append(row_template.format(
        escape_html(p),
        escape_html(baseline[p]),
        escape_html(after[p])
    ))

display(HTML(table_template.format('\n'.join(rows))))


PROMPT,BEFORE,AFTER
,"Delegates to the 2019 ACFAS Annual Scientific Session are invited to attend the ACFAS 2019 Annual Business Meeting. The meeting will be held on Tuesday, March 12, from 11:30 a.m. to 12:30 p.m., in the Regency Ballroom A","#!/usr/bin/env python # # Copyright 2008 Google Inc. # # Licensed under the Apache License, Version 2.0 (the ""License""); # you may not use this file except in compliance with the License. # You may obtain a copy of the License at # # http://www."
import,import Foundation extension NSURL: URL {  public var absoluteString: String {  return (scheme as NSString).absoluteString  } } extension NSURL: URLConvertible {  public var url: String {  return absoluteString  } } extension URL: URLConvertible {  public var url: String {,"import unittest from django.conf import settings from django.core.urlresolvers import reverse from django.http import HttpResponseRedirect from django.test import TestCase, RequestFactory from .. import settings_local from ..models import Article from ..views import ArticleView class ArticleViewTestCase(TestCase):  def setUp("
from,from __future__ import absolute_import from __future__ import division from __future__ import print_function from __future__ import unicode_literals import collections import inspect import logging import os import warnings from . import util from . import _version class _PillowError(Exception):  pass,"from __future__ import unicode_literals from django.db import migrations, models import django.utils.timezone from django.conf import settings import django.db.models.deletion class Migration(migrations.Migration):  dependencies = [  ('auth', '0002_0488"
while,while(1){  sleep(1);  echo 'Looping'; } echo 'Done'; \end{code} \strong{OUTPUT} \begin{code} Looping Looping Looping Looping Looping Looping Looping Looping Looping Looping Looping Looping,"while True:  print ""Please enter your name: ""  name = raw_input("""")  print ""Please enter your birthdate: ""  birthdate = raw_input("""")  print ""Please enter your gender: ""  gender = raw_input("""")  print ""Please enter your favorite color: ""  color = raw_input("""
try,try to get the following error: \begin{blockquote} C:\ProgramData\Panopta\Panopta\panopta.exe: 15576  ERROR: The requested operation requires elevation. \end{blockquote} I'm using the latest version of WSL2. Answer: The solution was to install,"try:  import json except ImportError:  json = None from django.conf import settings from django.core.exceptions import ImproperlyConfigured from django.db import models from django.db.models import fields from django.db.models.fields import IntegerField, StringField, BooleanField from django.db.models."
if,if( !( 32767 ) ){ 	return; } var _$ = _; var _$1 = _1; var _$2 = _2; var _$3 = _3; var _$4 = _4; var _$5 = _5; var _$6 = _6; var,"if (typeof(R) == ""undefined"") {  var R = require(""R""); } if (typeof(R.r) == ""undefined"") {  var R.r = R.require; } if (typeof(R.r.require) == ""undefined"") {  R.r.require = R.r.require ||"
for,"forums, the official message boards for the WWE Universe. WWE Universe 101: What is the WWE Universe? WWE Universe 101: What is the WWE Universe? 2018-10-28 07:00:00Z 0 WWE Universe 1","for a = 1:20  for i = 1:10  fprintf(fid, '10%d', i)  fprintf(fid, '10%d', i)  end  fprintf(fid, '10%d', i) end fid.close() %%"
torch,"torchbearer: (noun) a person who carries the Olympic torch in the torch relay. ""I'm a torchbearer for the next generation."" I'm honored to announce that I've been selected as a torchbearer for the 2010 Winter Olympics in Vancouver. I'm part of","torch_path = find_torch_path() def run(argv):  """"""  Run a torch experiment on a dataset.  Args:  argv: A list of command line arguments.  Returns:  A tuple of the last command line output, the elapsed time.  """"""  # Get the dataset"


In [None]:
# This template helps to compare generated code samples in pretty table form
# feel free to present your work in other forms

from IPython.display import HTML, display
table_template = """<table style="border:1px solid black" >
  <tr>
    <th style="text-align: center; border:1px solid black">PROMPT</th>
    <th style="text-align: center; border:1px solid black">BEFORE</th>
    <th style="text-align: center; border:1px solid black">AFTER</th>
  </tr>
{}
</table>"""

row_template = '''  <tr>
    <td style="width:20%; border:1px solid black"><pre align="left">`{}`</pre></td>
    <td style="width:40%; border:1px solid black"><pre align="left">{}</pre></td>
    <td style="width:40%; border:1px solid black"><pre align="left">{}</pre></td>
  </tr>'''

rows = []

for prompt in prompts:
    # replace placeholders in the format() arguments
    rows.append(row_template.format(prompt, "BEFORE FINETUNING", "TO BE GENERATED AFTER FINETUNING"))

display(HTML(table_template.format('\n'.join(rows))))

PROMPT,BEFORE,AFTER
``,BEFORE FINETUNING,TO BE GENERATED AFTER FINETUNING
`import`,BEFORE FINETUNING,TO BE GENERATED AFTER FINETUNING
`from`,BEFORE FINETUNING,TO BE GENERATED AFTER FINETUNING
`while`,BEFORE FINETUNING,TO BE GENERATED AFTER FINETUNING
`try`,BEFORE FINETUNING,TO BE GENERATED AFTER FINETUNING
`if`,BEFORE FINETUNING,TO BE GENERATED AFTER FINETUNING
`for`,BEFORE FINETUNING,TO BE GENERATED AFTER FINETUNING
`torch`,BEFORE FINETUNING,TO BE GENERATED AFTER FINETUNING


If you've reached this point: congratulations! You've completed everything in this practice session.

If you want to dig deeper, try to implement prompt-tuning (for bonus points!).
You can read more about prompt tuning variants in paper [1](https://arxiv.org/abs/2104.08691) or paper [2](https://arxiv.org/abs/2101.00190). Both versions can be implemented by passing trainable prompts as `model.forward(..., past_key_values=your_prompts)`.
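
As a starting point for the bonus task, here is a minimal sketch of the input-level variant described in paper [1]: a few trainable vectors are prepended to the token embeddings while the rest of the model stays frozen. Note that this is a different route from the `past_key_values` approach mentioned above. The `SoftPrompt` class, its initialization scale, and the toy embedding table are assumptions made for illustration, not code from this notebook.

```python
import torch
import torch.nn as nn


class SoftPrompt(nn.Module):
    """Prepends `num_prompts` trainable vectors to a batch of token embeddings."""

    def __init__(self, num_prompts: int, hidden_size: int):
        super().__init__()
        # Small random init (an arbitrary choice for this sketch)
        self.prompts = nn.Parameter(torch.randn(num_prompts, hidden_size) * 0.02)

    def forward(self, token_embeds: torch.Tensor, attention_mask: torch.Tensor):
        batch = token_embeds.shape[0]
        # Broadcast the same prompts to every sequence in the batch
        prompts = self.prompts.unsqueeze(0).expand(batch, -1, -1).to(token_embeds.dtype)
        embeds = torch.cat([prompts, token_embeds], dim=1)
        # Prompt positions are always attended to
        prompt_mask = torch.ones(
            batch, self.prompts.shape[0],
            dtype=attention_mask.dtype, device=attention_mask.device,
        )
        mask = torch.cat([prompt_mask, attention_mask], dim=1)
        return embeds, mask


# Toy check with a stand-in embedding table (hypothetical sizes)
embed_tokens = nn.Embedding(100, 16)
embed_tokens.weight.requires_grad_(False)  # the base model stays frozen
soft_prompt = SoftPrompt(num_prompts=4, hidden_size=16)

input_ids = torch.randint(0, 100, (2, 5))
attention_mask = torch.ones_like(input_ids)
embeds, mask = soft_prompt(embed_tokens(input_ids), attention_mask)
print(embeds.shape, mask.shape)
```

With the real model, the outputs would be passed as `model(inputs_embeds=embeds, attention_mask=mask, ...)`, and any labels would need `-100` at the prompt positions so that the lengths match and the prompts are excluded from the loss.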



### Read more

* How post-training quantization works: https://arxiv.org/abs/2208.07339
* An overview of running large models: https://huggingface.co/docs/accelerate/package_reference/big_modeling
* A general library for different adapter types: https://adapterhub.ml/


### [extra info] Running other models.

This notebook's code can run with other models of similar size, such as [Falcon-7B](https://huggingface.co/tiiuae/falcon-7b), [OPT-6.7B](https://huggingface.co/facebook/opt-6.7b) or [BLOOM-7.1B](https://huggingface.co/bigscience/bloom-7b1). However, they will require minor code tweaks:
1. Change the model name in `AutoModelForCausalLM.from_pretrained()` __and__ `AutoTokenizer.from_pretrained()`.
2. In the prompt tuning code, change `model.model.embed_tokens` to refer to the target model's word embeddings. Simply `print(model)` to navigate to them.
3. Change the code that adds LoRA layers, specifically the part that picks which transformer block components to wrap, since those components have different names in other models.
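
For the last item, one way to discover the component names of an unfamiliar model is to list its linear layers by the final part of their module path. The sketch below does this on a toy module. `linear_leaf_names` and `ToyBlock` are hypothetical helpers, and the layer names in the toy block are invented for illustration. If the quantized layers in your setup do not subclass `nn.Linear`, match on the class name instead.

```python
from collections import Counter

import torch.nn as nn


def linear_leaf_names(model: nn.Module) -> Counter:
    """Count nn.Linear submodules by the last component of their dotted path."""
    return Counter(
        name.rsplit(".", 1)[-1]
        for name, module in model.named_modules()
        if isinstance(module, nn.Linear)
    )


class ToyBlock(nn.Module):
    """Stand-in for a transformer block; the attribute names are made up."""

    def __init__(self):
        super().__init__()
        self.query_key_value = nn.Linear(8, 24)
        self.dense = nn.Linear(8, 8)
        self.norm = nn.LayerNorm(8)  # not a Linear layer, so it is skipped


toy = nn.Sequential(ToyBlock(), ToyBlock())
print(linear_leaf_names(toy))
```

The names that appear once per block are the candidates to put into the list of LoRA target names.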