<a href="https://colab.research.google.com/github/JenWei0312/PEFT_methods/blob/main/PEFT_Example.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## PEFT Part Two -- Practical Examples using PEFT library

### Table of Contents

#### **PEFT (Parameter-Efficient Fine-Tuning) Workflow Overview**

#### **Hands-on Examples**

#### **Memory management tips included and thoroughly documented**

#### **Code includes**:
- Base case with memory management techniques
- LoRA fine tuning
- QLoRA fine tuning



  
### PEFT (Parameter-Efficient Fine-Tuning) Workflow Overview

Navigating the Hugging Face ecosystem for parameter-efficient fine-tuning can sometimes feel like exploring a maze. This guide provides a clear roadmap of the process, showing how different libraries work together to achieve efficient model adaptation.

The workflow follows these essential steps:

Start<br>
→ Dataset Loading/Split (datasets)<br>
→ Define Preprocessing<br>
→ Load Tokenizer (transformers.AutoTokenizer)<br>
→ Apply Tokenization to Dataset (dataset.map())<br>
→ Load Base Model (transformers.AutoModelForXXX)<br>
→ Define PEFT Config (peft.LoraConfig or similar)<br>
→ Get PEFT Model (peft.get_peft_model())<br>
→ Define Training Arguments (transformers.TrainingArguments)<br>
→ Create Data Collator (transformers.DataCollatorForXXX)<br>
→ Initialize Trainer (transformers.Trainer)<br>
→ Train (trainer.train())<br>
→ Evaluate (evaluate + custom metrics)<br>
End

**Key Concepts to Understand**:

1. **PEFT's Role in Training**: PEFT works alongside the standard training infrastructure from transformers, not as a replacement. It changes which parameters are trained (the base weights are frozen and small adapter weights are added), which is why you still use the transformers Trainer class with your PEFT-modified model.

2. **The Two-Stage Model Setup**: When working with PEFT, your model goes through two important transformations:
   - First, you load the base model from transformers (e.g., `model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")`)
   - Then, you convert it into a PEFT model (e.g., `peft_model = get_peft_model(model, peft_config)`)
   This two-stage process allows PEFT to add its efficient adaptation layers while preserving the base model's architecture.

3. **Model Compatibility**: PEFT's versatility extends beyond the models pre-defined in its library. A model can work with PEFT if it:
   - Uses PyTorch's nn.Module system
   - Contains standard layer types (Linear, Conv2d, etc.)
   - Has a structure that PEFT can traverse and modify

4. **Library Integration**: The workflow demonstrates how different Hugging Face libraries complement each other:
   - `datasets`: Handles data management and preprocessing
   - `transformers`: Provides the foundational model architecture and training infrastructure
   - `peft`: Adds efficient fine-tuning capabilities
   - `evaluate`: Offers evaluation metrics and tools
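
To make the "structure that PEFT can traverse and modify" point more concrete, here is a minimal sketch of how adapter targets might be selected by name. It is a simplification written for this guide, not PEFT's actual code: the helper `matches_target` and the module names are hypothetical, modeled on the naming style of a decoder-only transformer. The assumed rule is that a module is selected when its dotted name equals a target or ends with `.<target>`.

```python
def matches_target(module_name: str, target_modules: list[str]) -> bool:
    """Return True if the name equals a target or ends with '.<target>'."""
    return any(
        module_name == target or module_name.endswith("." + target)
        for target in target_modules
    )


# Hypothetical module names, for illustration only.
module_names = [
    "model.decoder.layers.0.self_attn.q_proj",
    "model.decoder.layers.0.self_attn.k_proj",
    "model.decoder.layers.0.self_attn.v_proj",
    "model.decoder.layers.0.self_attn.out_proj",
    "model.decoder.layers.0.fc1",
]

targets = ["q_proj", "v_proj"]  # same targets as the LoRA configs in this guide
selected = [name for name in module_names if matches_target(name, targets)]

for name in selected:
    print(name)
```

Under this assumed rule, only the query and value projections would receive adapters; every other module stays frozen and unchanged.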


### Hands-on Examples

This section provides practical examples that complement the official HuggingFace PEFT tutorial. Our examples are designed with accessibility in mind:
- All examples run on Google Colab's free tier (T4 GPU with 15GB memory)
- Each example includes memory monitoring and runtime expectations
- Code is thoroughly documented with memory management best practices


<br>**Resource Requirements & Tips**

Before starting each example, ensure your Colab environment has:
- A fresh runtime (to avoid memory fragmentation)
- A T4 GPU attached
- All required packages installed (requirements provided in each section)

Memory Management Tips:
- Use gradient checkpointing when possible
- Enable mixed precision training (fp16 or bf16)
- Monitor GPU memory usage throughout training
- Implement strategic model unloading
- Choose appropriate batch sizes and accumulation steps



### The Journey to Efficient Fine-tuning

**1. Traditional Fine-tuning Challenges**
* Training time: ~30 minutes (projected), even with memory optimizations (gradient checkpointing, mixed precision)
* Can run into OOM errors if memory/cache isn't carefully managed
* Trains all ~331M parameters of OPT-350M, so computational overhead is high<br>

**2. LoRA: A Game-Changing Solution**
* Training time: ~5 minutes (roughly 5-6x faster than the projected baseline)
* Only training ~0.2% of the parameters
* Dramatically less overhead, with comparable quality as the goal (quality is not measured in this notebook)
* More stable and predictable memory usage<br>

**3. QLoRA: Taking Efficiency Further**
* Training time: ~7 minutes
* Even lower memory footprint
* Perfect for resource-constrained environments
* Makes fine-tuning accessible to everyone<br><br>

**Key Insight**: PEFT methods like LoRA don't just solve memory issues; in these runs they also made fine-tuning faster and cheaper. A roughly 5-6x speedup means roughly 5-6x less GPU time, making previously impractical tasks both possible and affordable!
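
As a quick sanity check on the speedup figures, the short sketch below recomputes them from the times reported in this notebook. Note the assumption: the 30-minute baseline is a projection (the baseline run was interrupted), so the resulting factors are estimates rather than measurements.

```python
# Times in minutes, taken from this notebook; the baseline is a projection.
baseline_projected = 30.0
measured = {"LoRA": 5.61, "QLoRA": 7.49}

speedups = {name: baseline_projected / minutes for name, minutes in measured.items()}

for name, factor in speedups.items():
    print(f"{name}: {factor:.1f}x faster than the projected baseline")
```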

#### Base Case -- Costly, With Risk of CUDA Out-of-Memory Errors

In [6]:
!pip install --upgrade torch torchvision torchaudio
!pip install --upgrade transformers
!pip install --upgrade evaluate==0.4.0
!pip install --upgrade datasets
!pip install --upgrade peft
!pip install --upgrade trl



In [15]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, TrainingArguments
from datasets import load_dataset
from trl import SFTTrainer
import time

print("Starting baseline with aggressive memory optimization...")

# Memory monitoring
def print_gpu_memory():
    if torch.cuda.is_available():
        allocated = torch.cuda.memory_allocated() / 1024**2
        reserved = torch.cuda.memory_reserved() / 1024**2
        print(f"GPU Memory: {allocated:.2f}MB allocated, {reserved:.2f}MB reserved")

# Load dataset and tokenizer
print("\nLoading dataset and tokenizer...")
dataset = load_dataset("imdb", split="train[:1%]")
tokenizer = AutoTokenizer.from_pretrained("facebook/opt-350m")

print_gpu_memory()

# Determine available precision
if torch.cuda.is_available():
    if torch.cuda.is_bf16_supported():
        print("Using bfloat16 precision")
        torch_dtype = torch.bfloat16
    else:
        print("bfloat16 not supported, falling back to float16")
        torch_dtype = torch.float16
else:
    print("CUDA not available, using float32")
    torch_dtype = torch.float32

# Load model with memory optimizations
print("\nLoading model...")
model = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-350m",
    torch_dtype=torch_dtype,  # Half-precision weights on GPU (float32 on CPU)
    use_cache=False,  # Disable KV cache
)

# Enable gradient checkpointing
model.gradient_checkpointing_enable()
print("Memory after loading model:")
print_gpu_memory()

# Configure training with aggressive memory optimization
training_args = TrainingArguments(
    output_dir="output_dir",
    per_device_train_batch_size=4,     # Small batch size
    gradient_accumulation_steps=8,      # Gradient accumulation
    max_steps=100,
    save_strategy="steps",
    save_steps=50,
    logging_steps=1,
    learning_rate=1e-4,
    fp16=(torch_dtype == torch.float16),   # Use fp16 only if bfloat16 is not available
    bf16=(torch_dtype == torch.bfloat16),  # Use bf16 if available
    optim="adamw_torch_fused",         # Fused AdamW kernel (faster optimizer step)
    max_grad_norm=1.0,
)

# Initialize trainer
print("\nInitializing trainer...")

trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    args=training_args,
)


print("\nAttempting to start training...")
print_gpu_memory()

try:
    trainer.train()
    print("Training completed (unexpected!)")
except Exception as e:
    print(f"\nTraining failed as expected with error: {str(e)}")
finally:
    print("\nFinal memory usage:")
    print_gpu_memory()




Starting baseline with aggressive memory optimization...

Loading dataset and tokenizer...
GPU Memory: 3821.02MB allocated, 7452.00MB reserved
Using bfloat16 precision

Loading model...
Memory after loading model:
GPU Memory: 3821.02MB allocated, 7452.00MB reserved

Initializing trainer...

Attempting to start training...
GPU Memory: 4453.63MB allocated, 7452.00MB reserved


Step,Training Loss
1,27.4925



Final memory usage:
GPU Memory: 3818.99MB allocated, 7494.00MB reserved


KeyboardInterrupt: 

#### Option one -- LoRA

**Memory Profile for OPT-350M with LoRA**
- Initial load: ~0.6GB
- Peak during training: ~1GB
- Final usage: ~0.9GB

These settings enabled fine-tuning OPT-350M on a T4 GPU (15GB memory) with allocated memory stable at around 0.9GB (PyTorch reserved up to ~4.3GB by the end of the run).

**Training Time**
- ~5.5 minutes
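
The `trainable params` figure printed by `print_trainable_parameters()` in the LoRA run can be reproduced with a little arithmetic. The sketch below assumes OPT-350M's attention projections are 1024 x 1024 linear layers across 24 decoder layers; these shapes are not stated in this notebook, so treat them as assumptions. Each adapted layer gains two low-rank matrices, `A` (r x in) and `B` (out x r).

```python
# Assumed OPT-350M shapes (not stated in this notebook).
hidden_size = 1024
num_layers = 24

rank = 8               # r in LoraConfig
targets_per_layer = 2  # q_proj and v_proj

# Each adapted Linear(in, out) gains A (r x in) and B (out x r).
params_per_module = rank * (hidden_size + hidden_size)
lora_params = params_per_module * targets_per_layer * num_layers

total_params = 331_982_848  # "all params" from the logged output
trainable_pct = 100 * lora_params / total_params

print(f"trainable params: {lora_params:,} || trainable%: {trainable_pct:.4f}")
```

Under these assumptions the result agrees with the logged line (`786,432` trainable parameters, `0.2369` percent).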

In [2]:
import transformers
import torch
import time
from dataclasses import dataclass, field
from typing import Dict, List

## Utility functions for training metric monitoring
@dataclass
class TrainingMetrics:
    """Class to store training metrics"""
    start_time: float
    peak_memory: float = 0.0
    # default_factory gives each instance its own list
    memory_logs: List[Dict[str, float]] = field(default_factory=list)

    def update_peak_memory(self):
        """Update peak memory based on current GPU memory usage"""
        current_allocated = torch.cuda.memory_allocated() / 1024**2
        current_reserved = torch.cuda.memory_reserved() / 1024**2
        self.peak_memory = max(self.peak_memory, current_reserved)
        self.memory_logs.append({
            "timestamp": time.time() - self.start_time,
            "allocated": current_allocated,
            "reserved": current_reserved
        })

    def get_training_time(self):
        """Calculate total training time in minutes"""
        return (time.time() - self.start_time) / 60

def print_gpu_memory():
    """Print current GPU memory usage"""
    allocated = torch.cuda.memory_allocated() / 1024**2
    reserved = torch.cuda.memory_reserved() / 1024**2
    print(f"GPU Memory: {allocated:.2f}MB allocated, {reserved:.2f}MB reserved")


# Modify the trainer to track metrics during training
class MetricsCallback(transformers.TrainerCallback):
    def __init__(self, metrics):
        self.metrics = metrics

    def on_step_end(self, args, state, control, **kwargs):
        self.metrics.update_peak_memory()

In [4]:
import torch
import time
from peft import LoraConfig, get_peft_model
from transformers import AutoTokenizer, AutoModelForCausalLM, TrainingArguments
from trl import SFTTrainer
from datasets import load_dataset

# Start timing and CUDA setup
start_time = time.time()
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")
torch.cuda.empty_cache()

def print_gpu_memory():
    if torch.cuda.is_available():
        allocated = torch.cuda.memory_allocated() / 1024**2
        reserved = torch.cuda.memory_reserved() / 1024**2
        print(f"GPU Memory: {allocated:.2f}MB allocated, {reserved:.2f}MB reserved")

# Load dataset and tokenizer
print("\nLoading dataset and tokenizer...")
dataset = load_dataset("imdb", split="train[:1%]")  # Match QLoRA's dataset size
tokenizer = AutoTokenizer.from_pretrained("facebook/opt-350m")
print_gpu_memory()

# Load model
print("\nLoading model...")
model = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-350m",
    torch_dtype=torch.bfloat16,
    use_cache=False
).to(device)
model.gradient_checkpointing_enable()
print_gpu_memory()

# Configure LoRA - match QLoRA settings
lora_config = LoraConfig(
    r=8,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,  # Match QLoRA
    bias="none",
    task_type="CAUSAL_LM",  # Match QLoRA
)

# Create PEFT model
peft_model = get_peft_model(model, lora_config)
peft_model.print_trainable_parameters()
print_gpu_memory()

# Configure training - match QLoRA settings
training_args = TrainingArguments(
    output_dir="output_dir",
    per_device_train_batch_size=4,  # Match QLoRA
    gradient_accumulation_steps=8,   # Match QLoRA
    max_steps=100,
    save_strategy="steps",
    save_steps=50,
    logging_steps=1,
    learning_rate=1e-4,  # Match QLoRA
    fp16=True,
    optim="adamw_torch_fused",
    max_grad_norm=1.0,
)

# Initialize trainer
trainer = SFTTrainer(
    model=peft_model,
    train_dataset=dataset,
    args=training_args,
)

print("\nStarting training...")
print_gpu_memory()

try:
    trainer.train()
    print("Training completed successfully!")
except Exception as e:
    print(f"Training failed with error: {str(e)}")
finally:
    training_time = (time.time() - start_time) / 60
    print(f"\nTotal time: {training_time:.2f} minutes")
    print("\nFinal memory usage:")
    print_gpu_memory()

Using device: cuda

Loading dataset and tokenizer...
GPU Memory: 281.92MB allocated, 308.00MB reserved

Loading model...
GPU Memory: 914.53MB allocated, 938.00MB reserved
trainable params: 786,432 || all params: 331,982,848 || trainable%: 0.2369
GPU Memory: 917.53MB allocated, 940.00MB reserved

Starting training...
GPU Memory: 917.53MB allocated, 940.00MB reserved


Step,Training Loss
1,27.4729
2,28.8999
3,27.8254
4,28.6126
5,27.3506
6,27.2318
7,27.563
8,34.8995
9,27.9177
10,27.3051


Training completed successfully!

Total time: 5.61 minutes

Final memory usage:
GPU Memory: 923.58MB allocated, 4314.00MB reserved


#### Option Two -- QLoRA

**Memory Profile for OPT-350M with QLoRA**
- Initial load: ~0.3GB
- Peak during training: ~0.4GB
- Final usage: ~0.3GB

These settings enabled fine-tuning OPT-350M on a T4 GPU (15GB memory) with allocated memory well under 1GB (PyTorch reserved about 3.7GB by the end of the run).

**Training Time**
- ~7.5 minutes

**Comparison with LoRA**
- Significantly reduces initial load and peak memory usage compared to LoRA.
- Slightly longer training time, potentially incurring a slightly higher cost.
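
A rough way to see where the memory difference comes from is to estimate weight storage from bytes per parameter. The sketch below is a back-of-the-envelope estimate, not a measurement: it assumes every base parameter is stored at the given precision, and it takes the parameter counts and observed memory figures from the logged outputs in this notebook.

```python
MIB = 1024 ** 2

# Base model size: "all params" minus the LoRA parameters, both from the logs.
base_params = 331_982_848 - 786_432

bytes_per_param = {"float32": 4.0, "bfloat16": 2.0, "4-bit (NF4)": 0.5}
estimates = {name: base_params * size / MIB for name, size in bytes_per_param.items()}

for name, mib in estimates.items():
    print(f"{name:>12}: ~{mib:,.0f} MiB of weights")

# Increase in allocated memory when each model was loaded, from the logged output.
observed_bf16_load = 914.53 - 281.92
observed_4bit_load = 203.87 - 0.00
print(f"observed: bf16 load {observed_bf16_load:.0f} MiB, 4-bit load {observed_4bit_load:.0f} MiB")
```

The bfloat16 estimate lands close to the observed load. The observed 4-bit figure is higher than the naive estimate, most likely because some tensors (for example embeddings and layer norms) are not quantized and the quantization constants add overhead.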

In [3]:
import torch
from peft import prepare_model_for_kbit_training, LoraConfig, get_peft_model
from transformers import (
    AutoTokenizer,
    AutoModelForCausalLM,
    BitsAndBytesConfig,
    TrainingArguments
)
from trl import SFTTrainer
from datasets import load_dataset
import time

# Memory monitoring
print("Initial GPU memory:")
print_gpu_memory()

# 1. Load dataset and tokenizer
dataset = load_dataset("imdb", split="train[:1%]")
tokenizer = AutoTokenizer.from_pretrained("facebook/opt-350m")

print("\nAfter loading tokenizer:")
print_gpu_memory()

# 2. Configure quantization
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

# 3. Load quantized model
model = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-350m",
    quantization_config=bnb_config,
    device_map="auto"
)

print("\nAfter loading quantized model:")
print_gpu_memory()

# 4. Prepare model for training
model.gradient_checkpointing_enable()
model = prepare_model_for_kbit_training(model)

# 5. Configure LoRA
config = LoraConfig(
    r=8,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],  # Adjusted for OPT model
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)

# 6. Get PEFT model
model = get_peft_model(model, config)
model.print_trainable_parameters()

print("\nAfter PEFT model creation:")
print_gpu_memory()

# 7. Configure training
training_args = TrainingArguments(
    output_dir="output_dir",
    per_device_train_batch_size=4,  # Can try larger batch size due to 4-bit quantization
    gradient_accumulation_steps=8,
    max_steps=100,
    save_strategy="steps",
    save_steps=50,
    logging_steps=1,
    learning_rate=1e-4,
    fp16=True,
    optim="adamw_torch_fused",
    max_grad_norm=1.0,
)

# 8. Initialize trainer
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    tokenizer=tokenizer,  # recent TRL versions rename this argument to processing_class
    args=training_args,
)

print("\nStarting training:")
print_gpu_memory()

# 9. Train with timing
start_time = time.time()

try:
    trainer.train()
    print("\nTraining completed successfully!")
except Exception as e:
    print(f"\nTraining failed with error: {str(e)}")
finally:
    training_time = (time.time() - start_time) / 60
    print(f"\nTotal training time: {training_time:.2f} minutes")
    print("\nFinal memory usage:")
    print_gpu_memory()

Initial GPU memory:
GPU Memory: 0.00MB allocated, 0.00MB reserved





After loading tokenizer:
GPU Memory: 0.00MB allocated, 0.00MB reserved

After loading quantized model:
GPU Memory: 203.87MB allocated, 226.00MB reserved
trainable params: 786,432 || all params: 331,982,848 || trainable%: 0.2369

After PEFT model creation:
GPU Memory: 259.67MB allocated, 328.00MB reserved



Starting training:
GPU Memory: 259.67MB allocated, 328.00MB reserved


`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...


Step,Training Loss
1,28.3785
2,29.7761
3,28.6656
4,29.3529
5,28.0729
6,27.9259
7,28.2
8,35.6395
9,28.504
10,27.964





Training completed successfully!

Total training time: 7.49 minutes

Final memory usage:
GPU Memory: 281.92MB allocated, 3766.00MB reserved


### Appendix -- GPU Memory Management in Base Case

When fine-tuning large language models, efficient GPU memory management is crucial. Below are key techniques used in our implementation:

**1. Mixed Precision Training**
```python
model = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-350m",
    torch_dtype=torch.bfloat16,  # Halves weight memory relative to float32
    use_cache=False  # Disables KV cache
)
```

**2. Gradient Checkpointing**
```python
model.gradient_checkpointing_enable()  # Trades speed for lower memory
```

**3. Memory-Efficient Training Settings**
```python
training_args = TrainingArguments(
    output_dir="output_dir",
    per_device_train_batch_size=4,     # Small batch size
    gradient_accumulation_steps=8,     # Accumulate gradients (effective batch size 32)
    fp16=True,                         # Mixed precision
    optim="adamw_torch_fused"          # Fused AdamW kernel (faster optimizer step)
)
```


**Key Insights**
- Loading weights in bfloat16 halves parameter memory relative to float32
- Gradient checkpointing reduces memory by discarding intermediate activations and recomputing them during the backward pass
- Small batch size with gradient accumulation achieves a larger effective batch size with lower memory
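
To illustrate the gradient accumulation trade-off, the sketch below computes the effective batch size and the number of examples processed for the settings used in this guide. The helper `batch_plan` is hypothetical and assumes a single GPU; the dataset size of 250 corresponds to the 1% IMDB training split used in the examples.

```python
def batch_plan(per_device_batch: int, accumulation_steps: int,
               max_steps: int, dataset_size: int):
    """Return (effective batch size, examples seen, passes over the dataset)."""
    effective_batch = per_device_batch * accumulation_steps  # single-GPU assumption
    examples_seen = effective_batch * max_steps
    passes = examples_seen / dataset_size
    return effective_batch, examples_seen, passes


# Settings from the base case, LoRA, and QLoRA runs.
main_runs = batch_plan(4, 8, 100, 250)
# Settings from the lower-memory LoRA variant (batch size 1, 16 accumulation steps).
low_memory_run = batch_plan(1, 16, 100, 250)

print(main_runs)
print(low_memory_run)
```

Only `per_device_train_batch_size` examples are held in memory per forward/backward pass, while the optimizer step sees the gradient of the full effective batch.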

### Appendix -- Combine LoRA with Memory Saving Techniques

This variant further reduces memory, but at a cost to performance.

In [5]:
import torch
import time
from peft import LoraConfig, get_peft_model
from transformers import AutoTokenizer, AutoModelForCausalLM, TrainingArguments
from trl import SFTTrainer
from datasets import load_dataset

# Start timing
start_time = time.time()

# Initialize CUDA (if present) and clear the cache
print("Initializing CUDA...")
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    torch.cuda.init()
    torch.cuda.empty_cache()
    print(f"CUDA device: {torch.cuda.get_device_name(0)}")

# Basic memory check function
def print_gpu_memory():
    if torch.cuda.is_available():
        allocated = torch.cuda.memory_allocated() / 1024**2
        reserved = torch.cuda.memory_reserved() / 1024**2
        print(f"GPU Memory: {allocated:.2f}MB allocated, {reserved:.2f}MB reserved")

# Load dataset and tokenizer
print("\nLoading dataset and tokenizer...")
dataset = load_dataset("imdb")
tokenizer = AutoTokenizer.from_pretrained("facebook/opt-350m")

# Split the dataset before preprocessing
train_dataset = dataset["train"].select(range(250))  # Use a small subset for demonstration
eval_dataset = dataset["test"].select(range(100))   # Use a small subset for demonstration

# Add data preprocessing
def preprocess_function(examples):
    # Format the text properly for the model
    prompt = "Review: "
    texts = [prompt + str(review) for review in examples["text"]]
    return tokenizer(
        texts,
        truncation=True,
        max_length=512,
        padding="max_length",
        return_tensors=None
    )

# Process the train and eval datasets separately
train_dataset = train_dataset.map(
    preprocess_function,
    batched=True,
    remove_columns=train_dataset.column_names
)
eval_dataset = eval_dataset.map(
    preprocess_function,
    batched=True,
    remove_columns=eval_dataset.column_names
)

print_gpu_memory()

# Load model
print("\nLoading model...")
model = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-350m",
    torch_dtype=torch.bfloat16,
    use_cache=False
).to(device)  # Explicitly move to GPU
model.gradient_checkpointing_enable()
print_gpu_memory()

# Configure LoRA
print("\nConfiguring LoRA...")
lora_config = LoraConfig(
    r=8,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    init_lora_weights="gaussian",
    bias="none",
)

# Create PEFT model
peft_model = get_peft_model(model, lora_config)
peft_model.print_trainable_parameters()
print_gpu_memory()

# Configure training arguments
print("\nSetting up training arguments...")
training_args = TrainingArguments(
    output_dir="output_dir",
    per_device_train_batch_size=1, # Reduced further
    gradient_accumulation_steps=16, # Increased further
    max_steps=100,
    # Add these parameters
    learning_rate=1e-5,  # Lower learning rate
    warmup_steps=10,     # Add warmup
    weight_decay=0.01,   # Add regularization
    # Add these for debugging
    logging_steps=1,
    logging_first_step=True,
    # Evaluation
    eval_strategy="steps",  # named evaluation_strategy in older transformers releases
    eval_steps=20,
    # Same as before
    save_strategy="steps",
    save_steps=50,
    fp16=True,
    optim="adamw_torch_fused",
    max_grad_norm=1.0,
)


# Initialize trainer
print("\nInitializing trainer...")
trainer = SFTTrainer(
    model=peft_model,
    train_dataset=train_dataset,  # Pass the preprocessed train dataset
    eval_dataset=eval_dataset,    # Pass the preprocessed eval dataset
    args=training_args,
    # On recent TRL versions, the tokenizer can be passed as processing_class=tokenizer
)

# Start training with progress indication
print("\nStarting training...")
print_gpu_memory()

try:
    print("Training initiated...")
    trainer.train()
    print("Training completed!")
except Exception as e:
    print(f"Training failed with error: {str(e)}")
finally:
    # Print final timing and memory usage
    training_time = (time.time() - start_time) / 60
    print(f"\nTotal time: {training_time:.2f} minutes")
    print("\nFinal memory usage:")
    print_gpu_memory()

Initializing CUDA...
Using device: cuda
CUDA available: True
CUDA device: Tesla T4

Loading dataset and tokenizer...



GPU Memory: 635.61MB allocated, 656.00MB reserved

Loading model...
GPU Memory: 1268.23MB allocated, 1290.00MB reserved

Configuring LoRA...
trainable params: 786,432 || all params: 331,982,848 || trainable%: 0.2369
GPU Memory: 1271.23MB allocated, 1292.00MB reserved

Setting up training arguments...

Initializing trainer...





Starting training...
GPU Memory: 1271.23MB allocated, 1292.00MB reserved
Training initiated...


Step,Training Loss,Validation Loss
20,54.8673,No log
40,53.557,No log
60,55.6401,No log
80,34.261,No log
100,54.1816,No log


Training completed!

Total time: 4.11 minutes

Final memory usage:
GPU Memory: 657.91MB allocated, 3836.00MB reserved
