# HW3 - PEFT

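In this notebook, we fine-tune GPT-2 medium (`openai-community/gpt2-medium`) on the [WikiText](https://huggingface.co/datasets/Salesforce/wikitext#wikitext-2-v1) dataset (`wikitext-2-v1`) using three approaches: full fine-tuning, Prefix Tuning, and LoRA, including a from-scratch LoRA implementation.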

Parameter-Efficient Fine-Tuning (PEFT) is a technique that enables the adaptation of large pre-trained models to specific tasks while modifying only a small subset of their parameters, significantly reducing computational and memory costs. Instead of updating all model parameters, PEFT methods, such as LoRA (Low-Rank Adaptation), Adapter layers, and Prefix-Tuning, introduce lightweight trainable modules that are inserted into the model or modify activations in a structured way. This approach retains the general knowledge of the base model while efficiently adapting to new tasks, making it particularly useful for fine-tuning large-scale models like LLMs and vision-language models on resource-constrained hardware.

## Install required libraries

In [1]:
!pip install datasets


## Import required libraries

In [2]:
import gc
import torch
import torch.nn as nn
from transformers import AutoTokenizer, AutoModelForCausalLM, Trainer, TrainingArguments
from datasets import load_dataset
from peft import LoraConfig, PrefixTuningConfig, get_peft_model, PeftModel, TaskType

## Setup

In [3]:
gpt_2_medium_model_name = "openai-community/gpt2-medium"

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(gpt_2_medium_model_name)
tokenizer.pad_token = tokenizer.eos_token

# Tokenize the dataset
def tokenizing_preprocess(examples):
    inputs =  tokenizer(examples['text'], truncation=True, padding='max_length', max_length=128)
    inputs['labels'] = inputs['input_ids'].copy()
    return inputs


# Define training arguments
training_args = TrainingArguments(
    output_dir='./gpt2',
    eval_strategy='no',
    save_strategy="no",
    num_train_epochs=1,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    learning_rate=2e-5,
    warmup_steps=500,
    weight_decay=0.01,
    report_to="none"
)
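
One detail of `tokenizing_preprocess` is worth noting. Because `pad_token` is set to `eos_token` and every sequence is padded to 128 tokens, copying `input_ids` into `labels` means the padded positions also contribute to the loss. A common convention in Hugging Face causal-LM training is to set those label positions to `-100`, which the loss ignores. The following is a minimal sketch of that masking on plain Python lists. `mask_pad_labels` is a hypothetical helper that is not part of this notebook's pipeline, the non-padding token ids are arbitrary, and the losses reported below were obtained without it.

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss by default


def mask_pad_labels(input_ids, attention_mask, ignore_index=IGNORE_INDEX):
    """Copy input_ids into labels, replacing padded positions with ignore_index."""
    return [
        [tok if keep == 1 else ignore_index for tok, keep in zip(ids, mask)]
        for ids, mask in zip(input_ids, attention_mask)
    ]


# Toy batch: 50256 is GPT-2's EOS id, used here as padding; other ids are arbitrary
batch = {
    "input_ids": [[464, 3290, 50256, 50256], [1212, 318, 257, 1332]],
    "attention_mask": [[1, 1, 0, 0], [1, 1, 1, 1]],
}
labels = mask_pad_labels(batch["input_ids"], batch["attention_mask"])
print(labels)
```

Masking on `attention_mask` rather than on the token id keeps any genuine EOS token inside the text as a valid target.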


## Load dataset (5 pt)

In [4]:
# TODO: Load the wikitext-2-v1 version of wikitext
dataset = load_dataset("wikitext", "wikitext-2-v1")


In [5]:
# TODO Select 1000 data for train and 500 data for validation
train_data = dataset['train'].select(range(1000))
eval_data = dataset['validation'].select(range(500))

# Apply tokenization preprocess on datasets
train_dataset = train_data.map(tokenizing_preprocess, batched=True)
eval_dataset = eval_data.map(tokenizing_preprocess, batched=True)


## Full Fine-Tuning (5 pt)

In [6]:
# Load the model
ff_model = AutoModelForCausalLM.from_pretrained(gpt_2_medium_model_name)


In [7]:
# Initialize Trainer
trainer = Trainer(
    model=ff_model,
    args=training_args,
    train_dataset=train_dataset,
)

In [8]:
# Zero-Shot evaluation of model

# TODO: Evaluate model on eval_dataset
eval_output = trainer.evaluate(eval_dataset=eval_dataset)

print(f"eval_loss = {eval_output['eval_loss']:.4f}")



eval_loss = 6.9568


In [9]:
# TODO: Get reserved memory from cuda
gpu_memory_before = torch.cuda.memory_reserved() if torch.cuda.is_available() else 0
# TODO: Train the model using trainer
train_output = trainer.train()
# TODO: Get reserved memory from cuda
gpu_memory_after = torch.cuda.memory_reserved() if torch.cuda.is_available() else 0

# Report the training time and gpu memory consumption
print(f"Training time: {train_output.metrics['train_runtime']:.4f} seconds")
print(f"GPU memory used: {gpu_memory_after - gpu_memory_before:.4f} bytes")



Training time: 124.6486 seconds
GPU memory used: 5542772736.0000 bytes


In [10]:
# TODO: Evaluate model on eval_dataset
eval_output = trainer.evaluate(eval_dataset=eval_dataset)

print(f"eval_loss = {eval_output['eval_loss']:.4f}")

eval_loss = 1.0830
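
Cross-entropy loss is often easier to interpret as perplexity, which is `exp(loss)` when the loss is the mean per-token negative log-likelihood in nats. The sketch below applies that conversion to the two `eval_loss` values printed above; the helper name `perplexity` is introduced here for illustration. Since the labels in this notebook include padded positions, the resulting numbers are only meaningful relative to each other, not to published WikiText perplexities.

```python
import math


def perplexity(mean_nll):
    """Perplexity for a mean per-token negative log-likelihood in nats."""
    return math.exp(mean_nll)


zero_shot_loss = 6.9568   # eval_loss before training
fine_tuned_loss = 1.0830  # eval_loss after full fine-tuning

print(f"zero-shot perplexity:  {perplexity(zero_shot_loss):.1f}")
print(f"fine-tuned perplexity: {perplexity(fine_tuned_loss):.2f}")
```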


In [11]:
# Delete the model
del ff_model
del trainer

In [12]:
# Empty the GPU memory (Run this cell twice if the GPU RAM is not close to zero(~0.2))
gc.collect()
torch.cuda.empty_cache()

## Prefix Tuning (20 pt)

TODO: Briefly explain Prefix Tuning\
Prefix Tuning is a parameter-efficient fine-tuning technique for large language models that is particularly effective for natural language generation tasks. Instead of updating the parameters of the pre-trained model, it keeps the original weights frozen and learns a short sequence of continuous, task-specific vectors called a "prefix."

The prefix behaves like a set of virtual tokens placed in front of the input. It is not limited to the input embeddings: in the usual formulation, learned key and value vectors are prepended to the attention of every transformer layer. During fine-tuning, only the prefix parameters are optimized for the downstream task. The frozen model then processes the input together with the learned prefix, which lets it adapt its behavior to the new task.

The core idea is that this small set of learnable parameters can steer the large, frozen model toward the desired task without the computational and storage costs of fine-tuning the entire model. This makes Prefix Tuning considerably cheaper in memory and computation than full fine-tuning, while often achieving comparable performance, especially in low-data settings.
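
To make the mechanism concrete, here is a minimal pure-Python sketch of a single attention query with and without a prefix. It assumes the formulation in which the prefix enters a layer as extra key/value vectors. The function `attend` and all numbers are illustrative and are not taken from GPT-2 or from the peft library.

```python
import math


def attend(query, keys, values):
    """Single-query scaled dot-product attention over lists of vectors."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    peak = max(scores)  # subtract the max for a numerically stable softmax
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    weights = [w / total for w in weights]
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]


# Keys/values the frozen model computes from the real input tokens
token_keys = [[1.0, 0.0], [0.0, 1.0]]
token_values = [[1.0, 0.0], [0.0, 1.0]]

# Trainable prefix: extra key/value vectors prepended at this layer
prefix_keys = [[2.0, 2.0]]
prefix_values = [[5.0, 5.0]]

query = [1.0, 1.0]
without_prefix = attend(query, token_keys, token_values)
with_prefix = attend(query, prefix_keys + token_keys, prefix_values + token_values)

print("without prefix:", without_prefix)
print("with prefix:   ", with_prefix)
```

The frozen keys and values are untouched; the output changes only because the query can now also attend to the learned prefix entries.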

In [49]:
# Load a fresh copy of the base model for prefix tuning
prefix_model = AutoModelForCausalLM.from_pretrained(gpt_2_medium_model_name)

In [50]:
# TODO: Define your prefix-tuning configuration using the PrefixTuningConfig class from the peft library
#       Set task_type to CAUSAL_LM
prefix_config = PrefixTuningConfig(
    task_type=TaskType.CAUSAL_LM,
    num_virtual_tokens=20,
    prefix_projection=True
)


# TODO: Wrap the GPT2LMHeadModel with the prefix config above using the get_peft_model function
prefix_model = get_peft_model(prefix_model, prefix_config)

# TODO: Print number of trainable parameters
prefix_model.print_trainable_parameters()

trainable params: 51,450,880 || all params: 406,274,048 || trainable%: 12.6641
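
The printed count can be reproduced by hand, which also shows why 20 virtual tokens end up costing about 12.7% of the model. The layout below (an embedding table followed by a two-layer MLP with a Tanh in between, hidden width equal to the model width) is an assumption about how `prefix_projection=True` is realized; it is used here because it reproduces the printed number exactly. The function name is hypothetical.

```python
def prefix_tuning_param_count(num_virtual_tokens, hidden_size, num_layers,
                              encoder_hidden_size=None):
    """Trainable parameters of a prefix encoder with an MLP reparameterization.

    Assumed layout: Embedding(num_virtual_tokens, hidden_size), then
    Linear(hidden_size, encoder_hidden_size) -> Tanh ->
    Linear(encoder_hidden_size, num_layers * 2 * hidden_size), both with bias.
    """
    if encoder_hidden_size is None:
        encoder_hidden_size = hidden_size  # assumed default
    embedding = num_virtual_tokens * hidden_size
    first = hidden_size * encoder_hidden_size + encoder_hidden_size
    out_dim = num_layers * 2 * hidden_size  # one key and one value vector per layer
    second = encoder_hidden_size * out_dim + out_dim
    return embedding + first + second


# GPT-2 medium: hidden size 1024, 24 blocks; 20 virtual tokens as configured above
total = prefix_tuning_param_count(num_virtual_tokens=20, hidden_size=1024, num_layers=24)
# For comparison: a direct table of per-layer key/value vectors without the MLP
direct = 20 * (24 * 2 * 1024)

print(f"with projection MLP: {total:,}")
print(f"direct prefix table: {direct:,}")
```

Under this assumption almost all of the trainable parameters sit in the second linear layer of the projection MLP, not in the prefix vectors themselves.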


In [51]:
prefix_model

PeftModelForCausalLM(
  (base_model): GPT2LMHeadModel(
    (transformer): GPT2Model(
      (wte): Embedding(50257, 1024)
      (wpe): Embedding(1024, 1024)
      (drop): Dropout(p=0.1, inplace=False)
      (h): ModuleList(
        (0-23): 24 x GPT2Block(
          (ln_1): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
          (attn): GPT2Attention(
            (c_attn): Conv1D(nf=3072, nx=1024)
            (c_proj): Conv1D(nf=1024, nx=1024)
            (attn_dropout): Dropout(p=0.1, inplace=False)
            (resid_dropout): Dropout(p=0.1, inplace=False)
          )
          (ln_2): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
          (mlp): GPT2MLP(
            (c_fc): Conv1D(nf=4096, nx=1024)
            (c_proj): Conv1D(nf=1024, nx=4096)
            (act): NewGELUActivation()
            (dropout): Dropout(p=0.1, inplace=False)
          )
        )
      )
      (ln_f): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
    )
    (lm_head): Linear(in_fea

In [52]:
# Initialize Trainer
trainer = Trainer(
    model=prefix_model,
    args=training_args,
    train_dataset=train_dataset,
)

No label_names provided for model class `PeftModelForCausalLM`. Since `PeftModel` hides base models input arguments, if label_names is not given, label_names can't be set automatically within `Trainer`. Note that empty label_names list will be used instead.


In [53]:
# TODO: Get reserved memory from cuda
gpu_memory_before = torch.cuda.memory_reserved() if torch.cuda.is_available() else 0
# TODO: Train the model
train_output = trainer.train()
# TODO: Get reserved memory from cuda
gpu_memory_after = torch.cuda.memory_reserved() if torch.cuda.is_available() else 0

# Report the training time and gpu memory consumption
print(f"Training time: {train_output.metrics['train_runtime']:.4f} seconds")
print(f"GPU memory used: {gpu_memory_after - gpu_memory_before:.4f} bytes")



Training time: 76.1505 seconds
GPU memory used: 2258632704.0000 bytes


In [54]:
# TODO: Evaluate model on eval_dataset
eval_output = trainer.evaluate(eval_dataset=eval_dataset)

print(f"eval_loss = {eval_output['eval_loss']:.4f}")

eval_loss = 1.3667


In [19]:
# Delete the model
del prefix_model
del trainer

In [48]:
# Empty the GPU memory (run this cell twice if GPU RAM usage is not at or below about 1.5 GB)
gc.collect()
torch.cuda.empty_cache()

## Fine-Tuning by LoRA (Low-Rank Adaptation) (40 pt)

TODO: Briefly explain LoRA (Low-Rank Adaptation)\
LoRA, which stands for Low-Rank Adaptation, is a parameter-efficient fine-tuning (PEFT) technique designed to efficiently adapt large language models (LLMs) to specific downstream tasks without the need to fine-tune all of the model's parameters.

The core idea behind LoRA is based on the observation that the weight updates during the fine-tuning of LLMs often have a low "intrinsic rank." Instead of directly updating the original weight matrices of the pre-trained model, LoRA introduces pairs of small, trainable low-rank decomposition matrices into specific layers of the model, typically the attention layers.

During fine-tuning, the original pre-trained weights are kept frozen and only the newly added low-rank matrices are trained on the task-specific data. The update to a frozen weight matrix is the product of the two small matrices, scaled by `alpha / r`, where `r` is the rank. One of the two matrices is initialized to zero so that training starts from the unmodified model.

This approach greatly reduces the number of trainable parameters: an update to a `d_in x d_out` weight needs only `r * (d_in + d_out)` parameters instead of `d_in * d_out`. The result is a substantial saving in computational resources (GPU memory and processing time) and in storage for each fine-tuned task, which makes it feasible to adapt very large models on less powerful hardware and to keep many task-specific adapters. LoRA often comes close to full fine-tuning performance, and because the low-rank update can be merged into the original weights after training, it adds no inference latency.
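
The claim about inference latency follows from simple algebra: the frozen path plus the scaled low-rank path gives the same output as a single merged weight matrix. The sketch below checks this on tiny integer matrices in pure Python. All values are made up for illustration, and the `(in, out)` weight orientation is chosen to mirror GPT-2's `Conv1D` layers.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def add(a, b, scale=1):
    """Return a + scale * b element-wise."""
    return [[x + scale * y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


rank, alpha = 1, 2
scaling = alpha / rank

W = [[1, 0, 2], [0, 1, 1]]  # frozen weight, shape (in=2, out=3)
A = [[1], [2]]              # trainable down-projection, shape (in=2, rank=1)
B = [[0, 1, -1]]            # trainable up-projection, shape (rank=1, out=3)
x = [[3, 4]]                # one input row

# Training-time forward: frozen path plus scaled low-rank path
adapter_out = add(matmul(x, W), matmul(matmul(x, A), B), scale=scaling)

# Inference-time forward: fold the update into a single weight matrix
W_merged = add(W, matmul(A, B), scale=scaling)
merged_out = matmul(x, W_merged)

print(adapter_out)
print(merged_out)
```

Here the adapter holds `rank * (in + out) = 5` trainable numbers against 6 in `W`; the gap becomes large once the layer dimensions are in the thousands and the rank stays small.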

In [21]:
lora_model = AutoModelForCausalLM.from_pretrained(gpt_2_medium_model_name)

In [22]:
# Print the model architecture
print(lora_model)

GPT2LMHeadModel(
  (transformer): GPT2Model(
    (wte): Embedding(50257, 1024)
    (wpe): Embedding(1024, 1024)
    (drop): Dropout(p=0.1, inplace=False)
    (h): ModuleList(
      (0-23): 24 x GPT2Block(
        (ln_1): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
        (attn): GPT2Attention(
          (c_attn): Conv1D(nf=3072, nx=1024)
          (c_proj): Conv1D(nf=1024, nx=1024)
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
        (mlp): GPT2MLP(
          (c_fc): Conv1D(nf=4096, nx=1024)
          (c_proj): Conv1D(nf=1024, nx=4096)
          (act): NewGELUActivation()
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
    )
    (ln_f): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
  )
  (lm_head): Linear(in_features=1024, out_features=50257, bias=False)
)


In [23]:

# TODO: Define your LoRA configuration using LoraConfig class from peft library
#       Apply the LoRA on Conv1D modules (c_attn and c_proj) of GPT2Attention blocks (attn).
#       Set fan_in_fan_out to True
#       Set task_type to CAUSAL_LM
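lora_config = LoraConfig(
    r=256,
    lora_alpha=256,
    # NOTE: names are matched by suffix, so 'c_proj' selects mlp.c_proj as well as
    # attn.c_proj (the parameter count printed below includes all three layers).
    # Use ['attn.c_attn', 'attn.c_proj'] to restrict LoRA to the attention block.
    target_modules=['c_attn', 'c_proj'],
    fan_in_fan_out=True,
    task_type=TaskType.CAUSAL_LM
)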

# TODO: Wrap the GPT2LMHeadModel with the LoRA config above
#       using the get_peft_model function
lora_model = get_peft_model(lora_model, lora_config)

# TODO: Print number of trainable parameters
lora_model.print_trainable_parameters()

trainable params: 69,206,016 || all params: 424,029,184 || trainable%: 16.3211
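
The trainable-parameter count above can be reproduced with a few lines of arithmetic: each adapted layer adds `rank * (in_features + out_features)` parameters, with no bias. The sketch below is a hand calculation, not output from the peft library. Matching the printed 69,206,016 requires counting `mlp.c_proj` in addition to the two attention layers, which suggests (an inference from the arithmetic) that the `'c_proj'` entry in `target_modules` matched both projections.

```python
# (in_features, out_features) of the Conv1D layers assumed to be adapted in each block
MATCHED_LAYERS = {
    "attn.c_attn": (1024, 3072),
    "attn.c_proj": (1024, 1024),
    "mlp.c_proj": (4096, 1024),
}
NUM_BLOCKS = 24  # GPT-2 medium


def lora_param_count(rank, layers=MATCHED_LAYERS, num_blocks=NUM_BLOCKS):
    """lora_A is (in x rank) and lora_B is (rank x out); neither has a bias."""
    per_block = sum(rank * (d_in + d_out) for d_in, d_out in layers.values())
    return per_block * num_blocks


for rank in (4, 16, 64, 256):
    print(f"rank={rank:<3d} trainable params: {lora_param_count(rank):,}")

# Same calculation restricted to the attention block only
attention_only = {k: v for k, v in MATCHED_LAYERS.items() if k.startswith("attn.")}
print(f"rank=256, attention only: {lora_param_count(256, attention_only):,}")
```

The count grows linearly with the rank, so going from rank 4 to rank 256 multiplies the adapter size by 64.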


In [24]:
# Print the model architecture and see the changes
print(lora_model)

PeftModelForCausalLM(
  (base_model): LoraModel(
    (model): GPT2LMHeadModel(
      (transformer): GPT2Model(
        (wte): Embedding(50257, 1024)
        (wpe): Embedding(1024, 1024)
        (drop): Dropout(p=0.1, inplace=False)
        (h): ModuleList(
          (0-23): 24 x GPT2Block(
            (ln_1): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
            (attn): GPT2Attention(
              (c_attn): lora.Linear(
                (base_layer): Conv1D(nf=3072, nx=1024)
                (lora_dropout): ModuleDict(
                  (default): Identity()
                )
                (lora_A): ModuleDict(
                  (default): Linear(in_features=1024, out_features=256, bias=False)
                )
                (lora_B): ModuleDict(
                  (default): Linear(in_features=256, out_features=3072, bias=False)
                )
                (lora_embedding_A): ParameterDict()
                (lora_embedding_B): ParameterDict()
                (lora

In [25]:
# Initialize Trainer
trainer = Trainer(
    model=lora_model,
    args=training_args,
    train_dataset=train_dataset,
)

No label_names provided for model class `PeftModelForCausalLM`. Since `PeftModel` hides base models input arguments, if label_names is not given, label_names can't be set automatically within `Trainer`. Note that empty label_names list will be used instead.


In [26]:
# TODO: Get reserved memory from cuda
gpu_memory_before = torch.cuda.memory_reserved() if torch.cuda.is_available() else 0
# TODO: Train the model using trainer
train_output = trainer.train()
# TODO: Get reserved memory from cuda
gpu_memory_after = torch.cuda.memory_reserved() if torch.cuda.is_available() else 0

# Report the training time and gpu memory consumption
print(f"Training time: {train_output.metrics['train_runtime']:.4f} seconds")
print(f"GPU memory used: {gpu_memory_after - gpu_memory_before:.4f} bytes")



Training time: 93.3378 seconds
GPU memory used: 2606759936.0000 bytes


In [27]:
# TODO: Evaluate model on eval_dataset
eval_output = trainer.evaluate(eval_dataset=eval_dataset)

print(f"eval_loss = {eval_output['eval_loss']:.4f}")

eval_loss = 1.2767


In [28]:
# Delete the model
del lora_model
del trainer

In [29]:
# Empty the GPU memory (run this cell twice if GPU RAM usage is not at or below about 1.5 GB)
gc.collect()
torch.cuda.empty_cache()

#### Run LoRA for different rank values

Fine-tune the GPT-2 model with different rank values. (Be sure to change the alpha value according to the rank so that the results are fair.)

Enter the requested items in the table.

Compare the values obtained and explain their differences.
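
The note about adjusting alpha matters because standard LoRA multiplies the low-rank update by `alpha / rank`. The sketch below contrasts two ways of choosing alpha across the ranks used here. The fixed value `alpha=16` is an arbitrary illustration, and tying alpha to the rank mirrors the `r=256, lora_alpha=256` configuration above; whether the other ranks in the table were run the same way is an assumption.

```python
RANKS = (4, 16, 64, 256)


def lora_scaling(rank, alpha):
    """Multiplier applied to the low-rank update in standard LoRA."""
    return alpha / rank


# Option 1: fixed alpha, so the update is scaled down as the rank grows
fixed_alpha = {rank: lora_scaling(rank, alpha=16) for rank in RANKS}

# Option 2: alpha tied to the rank, so the multiplier stays constant
tied_alpha = {rank: lora_scaling(rank, alpha=rank) for rank in RANKS}

print("fixed alpha=16:", fixed_alpha)
print("alpha = rank:  ", tied_alpha)
```

With a constant multiplier, differences between runs can be attributed to the rank itself instead of to a change in the effective update size.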

TODO

| Method | Training Time (s) | Training Memory (GB) | Validation Loss | # Trainable Params (M) |
|:-:|:-:|:-:|:-:|:-:|
| Zero-Shot         | ...  | ...  | 6.9568 | ... |
| Full Fine-Tuning  | 124.6486  | 5.5427  | 1.0830 | ... |
| Prefix Tuning     | 76.1505  | 2.2586 | 1.3667 | 51.45 |
| LoRA rank=4       | 77.5178  | 1.8832  | 5.4517 | 1.08 |
| LoRA rank=16      | 72.9470  | 1.9167  | 3.9039 | 4.33 |
| LoRA rank=64      | 75.4763  | 2.1244  | 1.5502 | 17.30 |
| LoRA rank=256     | 93.3378  | 2.6067  | 1.2767 | 69.21 |




TODO:

Your detailed and complete explanation\

The table shows the trade-offs between different methods for adapting a language model.

* **Zero-Shot** has the worst performance (validation loss 6.9568) because it involves no training.
* **Full Fine-Tuning** achieves the lowest validation loss (1.0830) but is the most expensive run, at about 125 s and 5.54 GB, because it updates all model parameters.
* **Prefix Tuning** and **LoRA** are **Parameter-Efficient Fine-Tuning (PEFT)** methods that train far fewer parameters than full fine-tuning. In these runs they took 73 to 93 s and 1.9 to 2.6 GB, roughly 59 to 75% of the time and 34 to 47% of the memory of full fine-tuning.
* **Prefix Tuning** offers a good balance of efficiency and performance: it reaches a validation loss of 1.3667 with about 51.5M trainable parameters and 2.26 GB.
* **LoRA's** performance and efficiency depend on the **rank**. Validation loss falls from 5.4517 at rank 4 to 1.2767 at rank 256, while trainable parameters grow from about 1.1M to 69.2M and memory from 1.88 to 2.61 GB. With this training budget (1 epoch over 1000 examples), the low-rank runs remain far from full fine-tuning, and only the higher ranks get close.

In essence, PEFT methods like Prefix Tuning and LoRA can come close to full fine-tuning (1.28 to 1.37 versus 1.08 validation loss here) with substantial savings in computational resources, provided the adapter has enough capacity.
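
The savings can be expressed relative to full fine-tuning with a short calculation over the table. This is a sketch only: the dictionary copies the time, memory, and loss columns from the table above, and `relative_to_full` is a helper name introduced here.

```python
# (training time in s, memory in GB, validation loss), copied from the table
RESULTS = {
    "Full Fine-Tuning": (124.6486, 5.5427, 1.0830),
    "Prefix Tuning": (76.1505, 2.2586, 1.3667),
    "LoRA rank=4": (77.5178, 1.8832, 5.4517),
    "LoRA rank=16": (72.9470, 1.9167, 3.9039),
    "LoRA rank=64": (75.4763, 2.1244, 1.5502),
    "LoRA rank=256": (93.3378, 2.6067, 1.2767),
}


def relative_to_full(method, results=RESULTS):
    """Return (time ratio, memory ratio, loss gap) versus full fine-tuning."""
    base_time, base_mem, base_loss = results["Full Fine-Tuning"]
    time_s, mem_gb, loss = results[method]
    return time_s / base_time, mem_gb / base_mem, loss - base_loss


for method in RESULTS:
    t, m, gap = relative_to_full(method)
    print(f"{method:<17} time x{t:.2f}  memory x{m:.2f}  loss gap {gap:+.4f}")
```

Each number comes from a single run, so small differences in time or memory between the LoRA ranks should not be over-interpreted.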

## Implement LoRA from scratch (30 pt)

In [39]:
custom_lora_model = AutoModelForCausalLM.from_pretrained(gpt_2_medium_model_name)
print(custom_lora_model)

GPT2LMHeadModel(
  (transformer): GPT2Model(
    (wte): Embedding(50257, 1024)
    (wpe): Embedding(1024, 1024)
    (drop): Dropout(p=0.1, inplace=False)
    (h): ModuleList(
      (0-23): 24 x GPT2Block(
        (ln_1): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
        (attn): GPT2Attention(
          (c_attn): Conv1D(nf=3072, nx=1024)
          (c_proj): Conv1D(nf=1024, nx=1024)
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
        (mlp): GPT2MLP(
          (c_fc): Conv1D(nf=4096, nx=1024)
          (c_proj): Conv1D(nf=1024, nx=4096)
          (act): NewGELUActivation()
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
    )
    (ln_f): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
  )
  (lm_head): Linear(in_features=1024, out_features=50257, bias=False)
)


In [40]:
class LoRALayer(nn.Module):
    def __init__(self, base_layer, rank=4, alpha=64):
        super().__init__()
        self.rank = rank
        self.alpha = alpha

        # TODO: set the base_layer and extract the input and output shape of it
        self.base_layer = base_layer
        # Conv1D weight shape: [in_features, out_features]
        weight_shape = base_layer.weight.shape
        in_features = weight_shape[0]
        out_features = weight_shape[1]

        # TODO: Define the A and B matrices
        #       Note that the B matrices must be initialized with zero (both weight and bias)
        self.lora_A = nn.Linear(in_features, rank, bias=False)
        self.lora_B = nn.Linear(rank, out_features, bias=True)
        nn.init.zeros_(self.lora_B.weight)
        nn.init.zeros_(self.lora_B.bias)

    def forward(self, x):
        # TODO: Complete the forward layer
        base_out = self.base_layer(x)
        # LoRA update on last dimension
        lora_update = self.lora_B(self.lora_A(x)) * (self.alpha / self.rank)
        return base_out + lora_update

In [41]:
# TODO: Freeze the model
for param in custom_lora_model.parameters():
    param.requires_grad = False


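transformer_model = custom_lora_model.transformer

# Wrap the attention projections of every block with LoRA layers.
# The LoRA matrices are created after the freeze above, so they are
# the only parameters left with requires_grad=True.
for block in transformer_model.h:
    block.attn.c_attn = LoRALayer(block.attn.c_attn)
    block.attn.c_proj = LoRALayer(block.attn.c_proj)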
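
As a sanity check on this setup, the number of trainable parameters added by the custom layers can be worked out by hand. The calculation below assumes the `LoRALayer` defaults (`rank=4`, `alpha=64`) and the two attention layers wrapped in each of the 24 blocks; the helper name is hypothetical and the result is arithmetic, not output from the notebook.

```python
NUM_BLOCKS = 24
RANK, ALPHA = 4, 64  # defaults of LoRALayer above


def custom_lora_params(d_in, d_out, rank=RANK):
    """lora_A: (d_in x rank), no bias; lora_B: (rank x d_out) plus a bias of size d_out."""
    return d_in * rank + rank * d_out + d_out


per_block = custom_lora_params(1024, 3072) + custom_lora_params(1024, 1024)
total = per_block * NUM_BLOCKS
scaling = ALPHA / RANK

print(f"trainable params: {total:,}")
print(f"scaling alpha/rank: {scaling}")
```

Two differences from the peft run above are visible here: `lora_B` carries a bias, and the update is multiplied by 16 instead of 1, so results from the two implementations are not directly comparable at equal rank.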



In [42]:
print(custom_lora_model)

GPT2LMHeadModel(
  (transformer): GPT2Model(
    (wte): Embedding(50257, 1024)
    (wpe): Embedding(1024, 1024)
    (drop): Dropout(p=0.1, inplace=False)
    (h): ModuleList(
      (0-23): 24 x GPT2Block(
        (ln_1): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
        (attn): GPT2Attention(
          (c_attn): LoRALayer(
            (base_layer): Conv1D(nf=3072, nx=1024)
            (lora_A): Linear(in_features=1024, out_features=4, bias=False)
            (lora_B): Linear(in_features=4, out_features=3072, bias=True)
          )
          (c_proj): LoRALayer(
            (base_layer): Conv1D(nf=1024, nx=1024)
            (lora_A): Linear(in_features=1024, out_features=4, bias=False)
            (lora_B): Linear(in_features=4, out_features=1024, bias=True)
          )
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
        (mlp)

In [43]:
# Initialize Trainer
trainer = Trainer(
    model=custom_lora_model,
    args=training_args,
    train_dataset=train_dataset,
)

In [44]:
# TODO: Get reserved memory from cuda
gpu_memory_before = torch.cuda.memory_reserved() if torch.cuda.is_available() else 0
# TODO: Train the model using trainer
train_output = trainer.train()
# TODO: Get reserved memory from cuda
gpu_memory_after = torch.cuda.memory_reserved() if torch.cuda.is_available() else 0


# Report the training time and gpu memory consumption
print(f"Training time: {train_output.metrics['train_runtime']:.4f} seconds")
print(f"GPU memory used: {gpu_memory_after - gpu_memory_before:.4f} bytes")



Training time: 73.4684 seconds
GPU memory used: 1472200704.0000 bytes


In [45]:
# TODO: Evaluate model on eval_dataset
eval_output = trainer.evaluate(eval_dataset=eval_dataset)

print(f"eval_loss = {eval_output['eval_loss']:.4f}")

eval_loss = 1.3679


In [37]:
# Delete the model
del custom_lora_model
del trainer

In [38]:
# Empty the GPU memory (run this cell twice if GPU RAM usage is not at or below about 1.5 GB)
gc.collect()
torch.cuda.empty_cache()