### Parameter Efficient Fine-Tuning

In [1]:
# Original library versions
# %pip install --quiet transformers==4.34.1 accelerate==0.24.0 sentencepiece==0.1.99 optimum==1.13.2 peft==0.5.0 bitsandbytes==0.41.2.post2

# Preferred versions for Colab as of October 2025 (thanks, Lev!)
# %pip install --quiet "bitsandbytes==0.45.3" "transformers>=4.43,<4.46" "accelerate>=0.33,<0.36" "peft>=0.11.1" "optimum>=1.20.0" "sentencepiece"

In [2]:
import torch
import torch.nn as nn
import torch.nn.functional as F

import transformers
from tqdm.auto import tqdm, trange

assert torch.cuda.is_available(), "you need cuda for this part"

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

In [3]:
model_name = 'Enoch/llama-7b-hf'

# loading Llama tokenizer ...
tokenizer = transformers.LlamaTokenizer.from_pretrained(model_name)
tokenizer.pad_token_id = tokenizer.eos_token_id

# ... and the model itself
model = transformers.AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map='auto',
    low_cpu_mem_usage=True,
    offload_state_dict=True,
    load_in_4bit=True,
    torch_dtype=torch.float32,  # weights are 4-bit; layernorms and activations are fp32
)
for param in model.parameters():
    param.requires_grad=False

model.gradient_checkpointing_enable()  # only store a small subset of activations, re-compute the rest.
model.enable_input_require_grads()     # override an implementation quirk in gradient checkpoints that disables backprop unless inputs require grad
# more on gradient checkpointing: https://pytorch.org/docs/stable/checkpoint.html https://arxiv.org/abs/1604.06174

You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama.LlamaTokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565 - if you loaded a llama tokenizer from a GGUF file you can ignore this message
The `load_in_4bit` and `load_in_8bit` arguments are deprecated and will be removed in the future versions. Please, pass a `BitsAndBytesConfig` object in `quantization_config` argument instead.



### Prompt tuning: the story of a fox 

![img](https://i.imgur.com/Ux3qQAu.png) (source: theodd1souts.fandom.com)

In [4]:
prompt = 'A quick brown fox'
batch = tokenizer(prompt, return_tensors='pt', return_token_type_ids=False).to(device)

for i in range(10):
    next_token = model(**batch).logits[0, -1].argmax(-1).reshape(1, 1)
    batch['input_ids'] = torch.cat([batch['input_ids'], next_token], dim=-1)
    batch['attention_mask'] = torch.cat([batch['attention_mask'], torch.ones_like(next_token)], dim=-1)

print("\nOutput:", tokenizer.decode(batch['input_ids'][0].cpu().numpy().tolist()))

Starting from v4.46, the `logits` model output will have the same type as the model (except at train time, where it will always be FP32)



Output: <s>A quick brown fox jumps over the lazy dog.
A quick


In [5]:
the_truth = "A quick brown fox did not jump over the lazy dog. Besides, that dog deserved it anyway!"
batch = tokenizer(the_truth, return_tensors='pt', return_token_type_ids=False).to(device)
outputs = model(**batch)

next_word_logits = outputs.logits[:, :-1]
true_next_tokens = batch['input_ids'][:, 1:]
loss = F.cross_entropy(next_word_logits.flatten(0, 1), true_next_tokens.flatten(0, 1))

print("Loss:", loss)

Loss: tensor(3.0729, device='cuda:0', grad_fn=<NllLossBackward0>)


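However, we cannot train the entire model: with roughly 7 billion parameters, the gradients alone would take about 28 GB in float32. Instead, let's use [prompt tuning](https://arxiv.org/abs/2104.08691), which keeps the model frozen and trains only a small set of prompt embeddings placed at the start of the input.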

![img](https://i.imgur.com/VwNNKnb.png)
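
The memory arithmetic behind that claim can be checked in a few lines. This is a rough sketch that assumes a nominal 7 billion parameters and the Llama-7B hidden size of 4096; neither number is stated above, and the constant names are illustrative.

```python
# Rough memory arithmetic: full fine-tuning vs. prompt tuning.
# Assumptions (not from the notebook): a nominal 7e9 parameters, hidden size 4096.
NUM_MODEL_PARAMS = 7_000_000_000
HIDDEN_SIZE = 4096
NUM_PROMPTS = 16          # same value the notebook uses below
BYTES_PER_FP32 = 4

full_grad_bytes = NUM_MODEL_PARAMS * BYTES_PER_FP32
prompt_params = NUM_PROMPTS * HIDDEN_SIZE
prompt_grad_bytes = prompt_params * BYTES_PER_FP32

print(f"full fine-tuning gradients: {full_grad_bytes / 1e9:.0f} GB")
print(f"prompt tuning parameters:   {prompt_params}")
print(f"prompt tuning gradients:    {prompt_grad_bytes / 1024:.0f} KiB")
```

Under these assumptions, the trainable state shrinks from tens of gigabytes to a few hundred kibibytes, which is why the frozen 4-bit model plus prompts fits on a single GPU.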


In [6]:
class WordEmbeddingsWithLearnedPrompts(nn.Module):
    """
    Drop-in replacement for the model's word embedding layer: embeds input_ids as usual,
    then substitutes trainable prompt vectors for the first num_prompts token embeddings.
    """

    def __init__(self, word_embeddings: nn.Embedding, num_prompts: int):
        super().__init__()
        self.original_word_embeddings = word_embeddings
        self.num_prompts = num_prompts
        self.learnable_prompts = nn.Parameter(
            torch.randn(1, num_prompts, word_embeddings.embedding_dim), requires_grad=True
        )

    def forward(self, input_ids: torch.LongTensor):
        # input_ids shape: [batch_size, seq_length]
        assert input_ids.dtype == torch.int64
        assert input_ids.shape[1] > self.num_prompts
        assert torch.all(input_ids[:, :self.num_prompts] == tokenizer.pad_token_id).item(), (
            "Don't forget to prepend num_prompts pad tokens to input_ids"
        )

        # Embed the input_ids using the original word embeddings
        input_embeddings = self.original_word_embeddings(input_ids)  # Shape: [batch_size, seq_length, embedding_dim]

        # Replace the first num_prompts token embeddings with the learnable prompts
        batch_size = input_ids.shape[0]
        learnable_prompts_expanded = self.learnable_prompts.expand(batch_size, -1, -1)  # Shape: [batch_size, num_prompts, embedding_dim]
        remaining_embeddings = input_embeddings[:, self.num_prompts:, :]  # Shape: [batch_size, seq_length - num_prompts, embedding_dim]

        # Concatenate learnable prompts with the embeddings of the remaining tokens
        output_embeddings = torch.cat([learnable_prompts_expanded, remaining_embeddings], dim=1)

        return output_embeddings


In [7]:
num_prompts = 16
test_emb_layer = WordEmbeddingsWithLearnedPrompts(model.model.embed_tokens, num_prompts=num_prompts).to(device)
test_input_ids = tokenizer("a cat sat on a mat", return_tensors='pt')['input_ids'].to(device)

space_for_prompts = torch.full([len(test_input_ids), num_prompts], fill_value=tokenizer.pad_token_id,
                               dtype=torch.int64, device=device)
test_inputs_with_prompts = torch.cat([space_for_prompts, test_input_ids], dim=1)

with torch.autocast(device_type='cuda'):
    test_prompt_embeddings = test_emb_layer(test_inputs_with_prompts)

assert test_prompt_embeddings.shape[:2] == test_inputs_with_prompts.shape
assert test_prompt_embeddings.shape[-1] == model.config.hidden_size
assert torch.allclose(test_prompt_embeddings[:, :num_prompts], test_emb_layer.learnable_prompts.float())
assert torch.allclose(test_prompt_embeddings[:, num_prompts:], model.model.embed_tokens(test_input_ids).float())
print("Looks legit!")

Looks legit!




In [8]:
assert isinstance(model.model.embed_tokens, nn.Embedding), "you have already replaced the embedding layer. If the replacement is broken, please reload the model"

model.model.embed_tokens = WordEmbeddingsWithLearnedPrompts(model.model.embed_tokens, num_prompts=num_prompts).to(device)

opt = torch.optim.Adam([model.model.embed_tokens.learnable_prompts], lr=0.01)

In [9]:
the_truth = "A quick brown fox did not jump over the lazy dog. Besides, that dog deserved it anyway!"
batch = tokenizer(the_truth, return_tensors='pt', return_token_type_ids=False).to(device)
space_for_prompts = torch.full([len(batch['input_ids']), num_prompts], fill_value=tokenizer.pad_token_id,
                               dtype=torch.int64, device=device)
batch['input_ids'] = torch.cat([space_for_prompts, batch['input_ids']], dim=1)
batch['attention_mask'] = torch.cat([torch.ones_like(space_for_prompts), batch['attention_mask']], dim=1)

outputs = model(**batch)
next_word_logits = outputs.logits[:, num_prompts : -1, :]
true_next_tokens = batch['input_ids'][:, num_prompts + 1:]
loss = F.cross_entropy(next_word_logits.flatten(0, 1), true_next_tokens.flatten(0, 1))
print("Loss:", loss)


# Iteratively update the learned prompts with the prompt optimizer (opt) to reduce the loss

model.train()
num_epochs = 100

for epoch in range(num_epochs):
    opt.zero_grad()

    outputs = model(**batch)
    next_word_logits = outputs.logits[:, num_prompts : -1, :]
    true_next_tokens = batch['input_ids'][:, num_prompts + 1:]
    loss = F.cross_entropy(next_word_logits.flatten(0, 1), true_next_tokens.flatten(0, 1))

    loss.backward()
    opt.step()

    if epoch % 20 == 0:
        print(f"Epoch {epoch}, Loss: {loss.item():.4f}")

`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`.

Loss: tensor(7.9402, device='cuda:0', grad_fn=<NllLossBackward0>)
Epoch 0, Loss: 7.9402
Epoch 20, Loss: 2.3153
Epoch 40, Loss: 0.2475
Epoch 60, Loss: 0.0310
Epoch 80, Loss: 0.0127
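
The slicing in the loop is easy to get wrong, so here is a minimal index-only sketch of it. It uses plain Python lists with made-up token strings (`P0`, `P1` stand for prompt slots; the sentence tokens are illustrative, not real tokenizer output) and assumes 2 prompt slots instead of 16.

```python
# Index-only sketch of the shifted next-token targets used with prepended prompts.
# Token strings are made up for illustration; real inputs are integer ids.
num_prompts = 2
prompts = [f"P{i}" for i in range(num_prompts)]
tokens = ["<s>", "A", "quick", "brown", "fox"]
sequence = prompts + tokens

# The logit at position t predicts the token at position t + 1.
# logits[:, num_prompts:-1]      -> positions num_prompts .. len - 2
# input_ids[:, num_prompts + 1:] -> the tokens those positions must predict
logit_positions = list(range(len(sequence)))[num_prompts:-1]
targets = sequence[num_prompts + 1:]

for pos, target in zip(logit_positions, targets):
    print(f"logit at position {pos} ({sequence[pos]!r}) -> target {target!r}")
```

The prompt positions contribute no loss terms: even the prediction made at the last prompt slot, whose target would be `<s>`, is skipped.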


In [10]:
# Final loss assertion
assert loss.item() <= 0.1
print("Good job!")

Good job!


In [11]:
prompt = 'A quick brown fox'
batch = tokenizer(prompt, return_tensors='pt', return_token_type_ids=False).to(device)
batch['input_ids'] = torch.cat([space_for_prompts, batch['input_ids']], dim=1)
batch['attention_mask'] = torch.cat([torch.ones_like(space_for_prompts), batch['attention_mask']], dim=1)


for i in range(15):
    next_token = model(**batch).logits[0, -1].argmax(-1).reshape(1, 1)
    batch['input_ids'] = torch.cat([batch['input_ids'], next_token], dim=-1)
    batch['attention_mask'] = torch.cat([batch['attention_mask'], torch.ones_like(next_token)], dim=-1)

print("\nOutput:", tokenizer.decode(batch['input_ids'][0, num_prompts:].cpu().numpy().tolist()))

# if you did everything right, the model will deny that the fox jumped over the lazy dog


Output: <s>A quick brown fox did not jump over the lazy dog. Besides, that dog deserved it


### Using HuggingFace PEFT 



In [None]:
from transformers import BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
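    bnb_4bit_compute_dtype=torch.float16,  # dtype used for matrix multiplications after dequantization
    bnb_4bit_quant_type="nf4",             # 4-bit NormalFloat quantization
    bnb_4bit_use_double_quant=True,        # also quantize the quantization constants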
)

model = transformers.AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map={"": 0},
    low_cpu_mem_usage=True,
    quantization_config=quantization_config,
)

In [17]:
import peft
assert isinstance(model.model.embed_tokens, nn.Embedding), "please reload the model"

peft_config = peft.PromptTuningConfig(task_type=peft.TaskType.CAUSAL_LM, num_virtual_tokens=16)
model = peft.get_peft_model(model, peft_config)  # note: for most peft methods, this line also modifies model in-place
print("Trainable parameters:", sum(p.numel() for p in model.parameters() if p.requires_grad))
print("Total parameters (excluding quantization):", sum(p.numel() for p in model.parameters()))

Trainable parameters: 65536
Total parameters (excluding quantization): 3500478464
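
The total above is roughly half of the model's nominal size because bitsandbytes packs two 4-bit weights into each stored element, so `numel()` on a quantized layer reports half of its logical weight count. The sketch below reproduces the printed number under that assumption, using the Llama-7B shape (hidden size 4096, MLP size 11008, 32 layers, vocabulary 32000) and assuming that the embeddings, the output head and the norm weights stay unquantized. None of these details is stated in the notebook itself, so treat it as a consistency check rather than a specification.

```python
# Reproduce the reported parameter counts for a 4-bit Llama-7B with 16 virtual tokens.
# Assumed architecture (not stated in the notebook): hidden 4096, MLP 11008,
# 32 decoder layers, vocabulary 32000; two 4-bit weights are packed per stored element.
hidden, mlp, num_layers, vocab = 4096, 11008, 32, 32000
num_virtual_tokens = 16

attention = 4 * hidden * hidden          # q_proj, k_proj, v_proj, o_proj
feed_forward = 3 * hidden * mlp          # gate_proj, up_proj, down_proj
quantized_logical = num_layers * (attention + feed_forward)
quantized_stored = quantized_logical // 2    # packed 4-bit storage

embeddings = vocab * hidden              # input embeddings, assumed unquantized
output_head = vocab * hidden             # lm_head, assumed unquantized
norms = (2 * num_layers + 1) * hidden    # two RMSNorms per layer plus the final one
prompts = num_virtual_tokens * hidden    # trainable prompt embeddings

total_reported = quantized_stored + embeddings + output_head + norms + prompts
print("Trainable parameters:", prompts)
print("Total parameters as reported:", total_reported)
```

Both printed values agree with the cell output above, which supports the packing explanation for the "missing" half of the parameters.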


In [19]:
# Define the ground truth sentence
the_truth = "A quick brown fox did not jump over the lazy dog. Besides, that dog deserved it anyway!"
batch = tokenizer(the_truth, return_tensors="pt", return_token_type_ids=False).to(device)

In [20]:
# Training Configuration
loss_threshold = 0.1  # Desired loss threshold
num_epochs = 100  # Max number of epochs
learning_rate = 3e-3  # Learning rate

# Define the optimizer for trainable parameters (PEFT prompts)
optimizer = torch.optim.AdamW(
    model.parameters(),       # PEFT marks only the new prompt parameters as requires_grad=True
    lr=learning_rate,
    weight_decay=0.01         
)

# Gradient clipping threshold
max_grad_norm = 1.0

# Define the ground truth
the_truth = "A quick brown fox did not jump over the lazy dog. Besides, that dog deserved it anyway!"
batch = tokenizer(the_truth, return_tensors="pt", return_token_type_ids=False).to(device)

# Training Loop
for epoch in range(num_epochs):

    num_virtual = model.peft_config["default"].num_virtual_tokens

    # Forward pass
    outputs =  model(**batch)

    # PEFT prepends the virtual tokens inside the model, so the logits have num_virtual extra leading positions.
    # Drop those positions and the last one, whose next token is unknown
    next_word_logits = outputs.logits[:, num_virtual:-1]
    true_next_tokens = batch["input_ids"][:, 1:]  # Shift ground truth tokens by one

    # Sanity check: logits and targets must cover the same number of positions
    assert next_word_logits.shape[1] == true_next_tokens.shape[1], \
        f"Shape mismatch: logits {next_word_logits.shape} vs targets {true_next_tokens.shape}"

    # Compute the loss
    loss = F.cross_entropy(
        next_word_logits.reshape(-1, next_word_logits.size(-1)),  
        true_next_tokens.reshape(-1)
    )

    # Backpropagation
    optimizer.zero_grad()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_grad_norm)  # gradient clipping
    optimizer.step()

    # Print loss for tracking
    print(f"Epoch {epoch + 1}/{num_epochs}, Loss: {loss.item()}")

    # Stop training if loss is below threshold
    if loss.item() < loss_threshold:
        print("Loss threshold reached. Stopping training.")
        break
else:
    print("Maximum epochs reached without meeting the loss threshold.")

Epoch 1/100, Loss: 7.7810378074646
Epoch 2/100, Loss: 7.4751691818237305
Epoch 3/100, Loss: 7.222239971160889
Epoch 4/100, Loss: 7.017329692840576
Epoch 5/100, Loss: 6.850759506225586
Epoch 6/100, Loss: 6.703004360198975
Epoch 7/100, Loss: 6.572248935699463
Epoch 8/100, Loss: 6.4522318840026855
Epoch 9/100, Loss: 6.337976455688477
Epoch 10/100, Loss: 6.224721908569336
Epoch 11/100, Loss: 6.1108293533325195
Epoch 12/100, Loss: 5.9946818351745605
Epoch 13/100, Loss: 5.8746418952941895
Epoch 14/100, Loss: 5.755471229553223
Epoch 15/100, Loss: 5.641081809997559
Epoch 16/100, Loss: 5.527312755584717
Epoch 17/100, Loss: 5.416666030883789
Epoch 18/100, Loss: 5.302328109741211
Epoch 19/100, Loss: 5.1896071434021
Epoch 20/100, Loss: 5.080301761627197
Epoch 21/100, Loss: 4.97166633605957
Epoch 22/100, Loss: 4.864184856414795
Epoch 23/100, Loss: 4.765018463134766
Epoch 24/100, Loss: 4.665083408355713
Epoch 25/100, Loss: 4.564681529998779
Epoch 26/100, Loss: 4.464593887329102
Epoch 27/100, Loss: 4

In [21]:
# Final assertion to ensure loss is below threshold
assert loss.item() < loss_threshold, "Training failed to reduce loss below threshold."
print("Training successful! Loss is below 0.1.")

Training successful! Loss is below 0.1.


In [22]:
prompt = "A quick brown fox"
batch = tokenizer(prompt, return_tensors="pt", return_token_type_ids=False).to(device)

# Greedily generate 15 tokens
for i in range(15):
    # Forward pass to get the logits
    outputs = model(**batch)
    next_token = outputs.logits[0, -1].argmax(-1).reshape(1, 1)

    # Append the next token to input_ids
    batch["input_ids"] = torch.cat([batch["input_ids"], next_token], dim=-1)

    # Update the attention_mask to match the new input_ids length
    new_attention_mask = torch.ones_like(next_token, dtype=batch["attention_mask"].dtype).to(device)
    batch["attention_mask"] = torch.cat([batch["attention_mask"], new_attention_mask], dim=-1)

# Decode the generated sequence
# No slicing is needed: PEFT adds the virtual tokens inside the model, so input_ids holds only real tokens
decoded_output = tokenizer.decode(batch["input_ids"][0].cpu().numpy().tolist(), skip_special_tokens=True)
print("\nOutput:", decoded_output)



Output: A quick brown fox did not jump over the lazy dog. Besides, that dog deserved it


### Parameter-efficient finetuning with LoRA

In [None]:
# re-load the model to remove any previous PEFT tuners
model = transformers.AutoModelForCausalLM.from_pretrained(
    model_name, 
    device_map={"": 0}, 
    low_cpu_mem_usage=True, 
    offload_state_dict=True,
    load_in_4bit=True, 
    torch_dtype=torch.float32,  # weights are 4-bit; layernorms and activations are fp32
)
for param in model.parameters():
    param.requires_grad=False
model.gradient_checkpointing_enable()
model.enable_input_require_grads()

In [27]:
class LoRALayer(nn.Module):
    """Wraps an existing (frozen) linear layer with a LoRA-style low-rank adapter."""
    def __init__(self, module: nn.Linear, rank: int):
        super().__init__()
        self.module = module  # pre-trained (frozen) linear layer
        self.adapter_A = nn.Parameter(torch.empty(module.in_features, rank, device=module.weight.device))
        nn.init.kaiming_uniform_(self.adapter_A, a=5 ** 0.5)
        self.adapter_B = nn.Parameter(torch.zeros(rank, module.out_features, device=module.weight.device))

    def forward(self, input):
        # Apply self.module and LoRA adapter, return the sum (self.module outputs + adapter outputs)
        original_output = self.module(input)
        lora_output = input @ self.adapter_A @ self.adapter_B

        return original_output + lora_output

In [28]:
# test your implementation
test_linear = nn.Linear(128, 128)
test_linear.weight.data[...] = torch.eye(128)
test_adapter = LoRALayer(test_linear, rank=8)

assert torch.allclose(test_adapter(torch.ones(1, 1, 128)), test_linear.bias + 1), "please check your forward pass"

test_adapter.adapter_A.data[...] = torch.linspace(0.1, -0.5, 128 * 8).view(128, 8)
test_adapter.adapter_B.data[...] = torch.linspace(0.5, -0.1, 128 * 8).view(8, 128)
test_linear.bias.data[...] = torch.linspace(1., -1., 128)

dummy_loss = F.mse_loss(test_adapter(torch.ones(1, 128) / 128).squeeze(), torch.linspace(-1, 1, 128))
assert torch.allclose(dummy_loss, torch.tensor(1.3711389), rtol=0, atol=1e-4)
dummy_loss.backward()
assert all(w.grad is not None for w in [test_adapter.adapter_A, test_adapter.adapter_B]), "some adapter weights have no grad"
assert torch.allclose(test_adapter.adapter_A.grad.sum(), torch.tensor(-0.60158), rtol=0, atol=1e-4), "bad grad w.r.t. A"
assert torch.allclose(test_adapter.adapter_B.grad.sum(), torch.tensor(0.9931), rtol=0, atol=1e-4), "bad grad w.r.t. B"
# note: bad grad means that your code is different from LoRA paper OR that your code is not autograd-friendly (e.g. no_grad)
del dummy_loss, test_linear, test_adapter
print("All tests passed!")

All tests passed!
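
In the notation of `LoRALayer` above, with the row-vector convention `input @ adapter_A @ adapter_B`, the forward pass and the equivalent merged weight can be written as follows. Here $W$ and $b$ are the weight and bias of the frozen `nn.Linear` (with $W$ stored as `[out_features, in_features]`) and $r$ is the rank; the merged form is a standard consequence of linearity, not something the notebook implements.

$$
y = x W^\top + b + x A B, \qquad A \in \mathbb{R}^{d_\text{in} \times r},\quad B \in \mathbb{R}^{r \times d_\text{out}}
$$

$$
y = x \left( W + (A B)^\top \right)^\top + b
$$

Because $B$ is initialised to zero, $AB = 0$ at the start, so the wrapped layer initially reproduces the frozen layer exactly; this is what the first assertion in the test cell checks. Each wrapped layer adds $r\,(d_\text{in} + d_\text{out})$ trainable parameters. Assuming 4096-dimensional attention projections (the Llama-7B value, not stated in the notebook) and $r = 8$, that is $8 \cdot (4096 + 4096) = 65{,}536$ per layer.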


### Apply LoRA to the model

In [29]:
lora_rank = 8

for name, module in model.model.layers.named_modules():
    if 'LlamaDecoderLayer' in repr(type(module)):
        module.self_attn.q_proj = LoRALayer(module.self_attn.q_proj, rank=lora_rank).to(device)
        module.self_attn.k_proj = LoRALayer(module.self_attn.k_proj, rank=lora_rank).to(device)
        module.self_attn.v_proj = LoRALayer(module.self_attn.v_proj, rank=lora_rank).to(device)

assert sum(isinstance(module, LoRALayer) for module in model.modules()) == 96  # for Llama-7B

In [30]:
batch = tokenizer("This model wants to share its greatest secret:", return_tensors='pt', return_token_type_ids=False).to(device)
# test a single training step, make sure we get meaningful gradients
with torch.cuda.amp.autocast(dtype=torch.float32):
    out = model.forward(**batch)
    (out.logits.norm() / 100).backward()

for i, module in enumerate(model.modules()):
    if isinstance(module, LoRALayer):
        assert module.adapter_B.grad is not None
        assert module.adapter_B.grad.norm().item() > 0

model.zero_grad(set_to_none=True)
print("Grad check successful, well done!")



Grad check successful, well done!


### Train the model


In [3]:
import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader
import transformers
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
from datasets import load_dataset
from tqdm.auto import tqdm
from IPython.display import HTML, display
import os

os.environ["CUDA_LAUNCH_BLOCKING"] = "1"
device = "cuda" if torch.cuda.is_available() else "cpu"

In [4]:
# Load the model
model_name = 'Enoch/llama-7b-hf'

tokenizer = transformers.LlamaTokenizer.from_pretrained(model_name)
tokenizer.pad_token_id = tokenizer.eos_token_id

model = transformers.AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map={"": 0},
    low_cpu_mem_usage=True,
    offload_state_dict=True,
    load_in_4bit=True,
    torch_dtype=torch.float32,
)



In [5]:
# Freeze all pre-trained weights
for param in model.parameters():
    param.requires_grad = False

model.gradient_checkpointing_enable()
model.enable_input_require_grads()

In [6]:
# My LoRALayer (same implementation as above)
class LoRALayer(nn.Module):
    def __init__(self, module: nn.Linear, rank: int):
        super().__init__()
        self.module = module  
        self.adapter_A = nn.Parameter(torch.empty(module.in_features, rank, device=module.weight.device))
        nn.init.kaiming_uniform_(self.adapter_A, a=5 ** 0.5)
        self.adapter_B = nn.Parameter(torch.zeros(rank, module.out_features, device=module.weight.device))

    def forward(self, input):
        original_output = self.module(input)
        lora_output = input @ self.adapter_A @ self.adapter_B

        return original_output + lora_output

In [7]:
lora_rank = 8
for name, module in model.model.layers.named_modules():
    if 'LlamaDecoderLayer' in repr(type(module)):
        try:
            # Wrap the three MLP projections of every decoder layer with LoRA adapters
            if hasattr(module, 'mlp') and isinstance(module.mlp.gate_proj, nn.Linear):
                module.mlp.gate_proj = LoRALayer(module.mlp.gate_proj, rank=lora_rank).to(device)
                module.mlp.up_proj = LoRALayer(module.mlp.up_proj, rank=lora_rank).to(device)
                module.mlp.down_proj = LoRALayer(module.mlp.down_proj, rank=lora_rank).to(device)
        except AttributeError:
            continue

# Note: this counts the wrapped LoRA layers, not individual trainable parameters
total_lora_layers = sum(isinstance(module, LoRALayer) for module in model.modules())
print(f"LoRA layers: {total_lora_layers:,}")

LoRA layers: 96


In [8]:
# Prepare the dataset
block_size = 256

dataset = load_dataset("codeparrot/codeparrot-clean", split="train", streaming=True)
dataset = dataset.shuffle(seed=42, buffer_size=1000).take(650)

all_text = " ".join(item['content'][:512] for item in dataset if isinstance(item['content'], str))
tokenized = tokenizer(all_text, truncation=False, add_special_tokens=False)
input_ids = tokenized['input_ids']

# Split the token stream into full blocks of block_size tokens; the trailing remainder is dropped
chunks = [
    input_ids[i : i + block_size]
    for i in range(0, len(input_ids) - block_size + 1, block_size)
]

print(f"Total number of chunks: {len(chunks)}")

Total number of chunks: 417
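
The chunking step can be isolated into a small helper. The sketch below is a stand-alone illustration with a hypothetical `chunk_tokens` function and a toy block size; it keeps only full blocks and drops the trailing remainder, which is the behaviour the list comprehension above aims for.

```python
from typing import List, Sequence


def chunk_tokens(token_ids: Sequence[int], block_size: int) -> List[List[int]]:
    """Split a token stream into consecutive full blocks; drop the remainder."""
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    num_full_blocks = len(token_ids) // block_size
    return [
        list(token_ids[i * block_size:(i + 1) * block_size])
        for i in range(num_full_blocks)
    ]


toy_ids = list(range(10))
toy_chunks = chunk_tokens(toy_ids, block_size=4)
print(toy_chunks)  # [[0, 1, 2, 3], [4, 5, 6, 7]]
```

Ids 8 and 9 do not fill a block and are dropped. When the stream length is an exact multiple of the block size, the last block is kept.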


In [9]:
class ChunkDataset(Dataset):
    def __init__(self, chunks):
        self.chunks = chunks
    def __len__(self):
        return len(self.chunks)
    def __getitem__(self, idx):
        return torch.tensor(self.chunks[idx], dtype=torch.long)

train_dataset = ChunkDataset(chunks)
data_loader = DataLoader(train_dataset, batch_size=1, shuffle=True)

In [10]:
prompts = ['import', 'from', 'while', 'try', 'if', 'for', 'torch', 'def']

def generate_samples(model, prompts, max_new_tokens=80):
    """
    Sample one continuation for each prompt and return the decoded texts.
    """
    model.eval()
    results = []
    with torch.no_grad():
        for p in prompts:
            inputs = tokenizer(p, return_tensors="pt").to(device)
            outputs = model.generate(
                **inputs,
                max_new_tokens=max_new_tokens,
                do_sample=True,
                top_p=0.95,
                temperature=0.7,
                pad_token_id=tokenizer.eos_token_id
            )
            gen = tokenizer.decode(outputs[0], skip_special_tokens=True)
            results.append(gen)
    return results

In [11]:
print("Generating before training...")
before_samples = generate_samples(model, prompts)

Generating before training...


Starting from v4.46, the `logits` model output will have the same type as the model (except at train time, where it will always be FP32)


In [12]:
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
model.train()

data_iter = iter(data_loader)
steps = 0
max_steps = 500
grad_accum = 4

pbar = tqdm(total=max_steps, desc="Training LoRA on Python code")

while steps < max_steps:
    optimizer.zero_grad()
    total_loss = 0.0

    for _ in range(grad_accum):
        try:
            batch = next(data_iter).to(device)
        except StopIteration:
            data_iter = iter(data_loader)
            batch = next(data_iter).to(device)

        if batch.shape[1] > block_size:
            batch = batch[:, :block_size]

        outputs = model(input_ids=batch, labels=batch)
        loss = outputs.loss / grad_accum
        loss.backward()
        total_loss += loss.item()

    optimizer.step()
    steps += 1
    pbar.set_postfix({'loss': total_loss})
    pbar.update(1)

pbar.close()
print("Training completed!")

Training LoRA on Python code:   0%|          | 0/500 [00:00<?, ?it/s]

`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`.


Training completed!
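
Dividing each micro-batch loss by `grad_accum` before `backward()` makes the accumulated gradient equal to the gradient of the mean loss over all micro-batches. The sketch below checks this on a toy one-parameter least-squares problem with hand-written gradients; the data and the `grad` helper are made up for illustration and do not involve the model.

```python
# Toy check: accumulating gradients of (loss_i / k) equals the gradient of the mean loss.
# Model: prediction = w * x, loss_i = (w * x_i - y_i) ** 2. The data is made up.
w = 0.5
micro_batches = [(1.0, 2.0), (2.0, 3.0), (3.0, 7.0), (4.0, 9.0)]
grad_accum = len(micro_batches)


def grad(w: float, x: float, y: float) -> float:
    """Analytic d/dw of (w * x - y) ** 2."""
    return 2.0 * (w * x - y) * x


# What the training loop does: scale each micro-batch loss, then sum the gradients.
accumulated = 0.0
for x, y in micro_batches:
    accumulated += grad(w, x, y) / grad_accum

# Reference: gradient of the mean loss over the same samples.
mean_loss_grad = sum(grad(w, x, y) for x, y in micro_batches) / grad_accum

print(accumulated, mean_loss_grad)
assert abs(accumulated - mean_loss_grad) < 1e-12
```

With `batch_size=1` and `grad_accum = 4`, each optimizer step in the loop therefore behaves like one step on a batch of four equally sized chunks, without holding four chunks' activations in memory at once.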


In [13]:
model.config.use_cache = True

print("Generating after training...")
after_samples = generate_samples(model, prompts)

Generating after training...


In [16]:
# This template helps to compare generated code samples in pretty table form
# feel free to present your work in other forms

table_template = """<table style="border:1px solid black" >
  <tr>
    <th style="text-align: center; border:1px solid black">PROMPT</th>
    <th style="text-align: center; border:1px solid black">BEFORE</th>
    <th style="text-align: center; border:1px solid black">AFTER</th>
  </tr>
{}
</table>"""

row_template = '''  <tr>
    <td style="width:20%; border:1px solid black"><pre align="left">`{}`</pre></td>
    <td style="width:40%; border:1px solid black"><pre align="left">{}</pre></td>
    <td style="width:40%; border:1px solid black"><pre align="left">{}</pre></td>
  </tr>'''

rows = []

for i, prompt in enumerate(prompts):
    before = before_samples[i].replace(prompt, "", 1).strip()
    after = after_samples[i].replace(prompt, "", 1).strip()
    rows.append(row_template.format(
        prompt,
        before[:200] + ("..." if len(before) > 200 else ""),
        after[:200] + ("..." if len(after) > 200 else "")
    ))

display(HTML(table_template.format('\n'.join(rows))))

| PROMPT | BEFORE | AFTER |
|---|---|---|
| `import` | `math from . import Common, Datatype class Value(Common.Value):  def __init__(self, name, value, source):  Common.Value.__init__(self, name, value, source)  def __repr__(self):  ...` | `re import urllib2 import urllib2 import urllib2 import urllib2 import urllib2 import urllib2 import urllib2 import urllib2 import urllib2 import urllib2 import urllib2 import urllib2 import urllib2 im...` |
| `from` | `oslo_logs import context from oslo_log import importutils from oslo_log import log as logging from oslo_log import resources from oslo_log.strategy import Strategy import six class MultiLog(resourc...` | `django.conf import settings from django.db import migrations, models import django.db.models.deletion class Migration(migrations.Migration):  dependencies = [  ('accounting', '0012_remove...` |
| `while` | `the second is a list of 200+ names from the 1820 US Federal Census. I am not a genealogist, nor am I related to any of the names in the two lists. The first list is a list of names from the 1820 US Fe...` | `(1) import time import datetime now = datetime.datetime.now() print now while 1:  print 'Checking if it is now', now  time.sleep(300)  now = datetime.datetime.now()  print now \end{code}...` |
| `try` | `again. if you are having issues, please contact support@stupeflix.com with your account name and a link to your video.` | `:  from _ast import * except ImportError:  pass try:  from _ast import Tuple except ImportError:  pass class Node(object):  def __init__(self, value, line=None, col=None, expanded=Fals...` |
| `if` | `(var_is_null())  {  return;  }  var_ = var_;  if (var_.is_null())  {  return;  }  if (var_.get_type() == Var::VARCHAR)  {  var_.to_string(str);  }` | `(typeof require === 'function') {  var Rx = require('../../../../../src/core/util/core').Rx; } describe('Observable#take', function () {  it('should take the first values from the observable s...` |
| `for` | `the 2019-2020 school year SPECIAL AWARD TIME! All students should be in the bleachers by 8:45 AM. The Pledge of Allegiance will be recited at 8:50 AM. The awards will begin at 9:00 AM. 2019` | `PYTHON_ARCHIVES The Python Software Foundation ("the Foundation") is a 501(c)(3) non-profit corporation incorporated in the Commonwealth of Massachusetts. The Foundation is the legal home of the Pytho...` |
| `torch` | `-bearer - a person who passes on a tradition, custom, or duty To be a torchbearer for the Gospel is to be a torchbearer for the kingdom of God. The Gospel is the Good News that the kingdom of God is h...` | `.frontend.Terminal -- Copyright (c) 2013-2017 The Chancellor Master and Fellows of the University of Cambridge -- -- This file is part of torch. -- -- Torch is free software: you can redistribute it ...` |
| `def` | `ixion 2017-07-03 03:35:55 UTC #1 I’ve been using the 360cam for a while now and have always been disappointed with the video quality. I’m using the 360cam app on a Samsung galaxy s7. I’ve noticed that...` | `get_current_user():  """  Returns the current user from the session.  If there is no session, or the session is empty, it will return None.  :returns: None or the string id of t...` |
