# Understanding LoRA
LoRA (Low-Rank Adaptation) is a parameter-efficient fine-tuning technique that freezes the pre-trained model weights and injects trainable rank decomposition matrices into the model’s layers. Instead of training all model parameters during fine-tuning, LoRA decomposes the weight updates into smaller matrices through low-rank decomposition, significantly reducing the number of trainable parameters while maintaining model performance. For example, when applied to GPT-3 175B, LoRA reduced trainable parameters by 10,000x and GPU memory requirements by 3x compared to full fine-tuning.  
LoRA works by adding pairs of rank decomposition matrices to transformer layers, typically focusing on attention weights. During inference, these adapter weights can be merged with the base model, resulting in no additional latency overhead. LoRA is particularly useful for adapting large language models to specific tasks or domains while keeping resource requirements manageable.  
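
To make the low-rank idea concrete, here is a minimal, dependency-free sketch in plain Python. The matrix values, the sizes, and the helper names `matmul` and `matvec` are illustrative assumptions and not part of any library. It checks that applying the adapter next to the frozen weight gives the same output as merging it into the weight first, and it counts parameters for one hypothetical 4096 x 4096 layer.

```python
# Toy LoRA update: W_merged = W + (alpha / r) * B @ A
# All values and sizes below are made up for illustration.

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(x * y for x, y in zip(row, v)) for row in m]


r, alpha = 1, 2
scale = alpha / r

W = [[1, 0, 2], [0, 1, 0], [3, 0, 1]]  # frozen base weight (3 x 3)
B = [[1], [0], [2]]                    # trainable matrix (3 x r)
A = [[1, 2, 0]]                        # trainable matrix (r x 3)
x = [1, 1, 1]                          # input vector

# Unmerged: the adapter path runs next to the frozen weight.
base_out = matvec(W, x)
lora_out = matvec(B, matvec(A, x))
unmerged = [b + scale * l for b, l in zip(base_out, lora_out)]

# Merged: fold the scaled update into the weight, then do a single matmul.
BA = matmul(B, A)
W_merged = [[w + scale * d for w, d in zip(w_row, d_row)] for w_row, d_row in zip(W, BA)]
merged = matvec(W_merged, x)

print(unmerged, merged)  # [9.0, 1.0, 16.0] [9.0, 1.0, 16.0]

# Parameter count for one hypothetical 4096 x 4096 layer with rank 8.
d, rank = 4096, 8
full_params = d * d
lora_params = rank * (d + d)
print(full_params // lora_params)  # 256
```

Because the merged weight produces the same output with a single matrix multiplication, a merged adapter adds no inference latency.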

Key advantages of LoRA
1. Memory Efficiency:
  * Only adapter parameters need gradients and optimizer states in GPU memory
  * Base model weights remain frozen and can be loaded in lower precision
  * Enables fine-tuning of large models on consumer GPUs

2. Training Features:
  * Native PEFT/LoRA integration with minimal setup
  * Support for QLoRA (Quantized LoRA) for even better memory efficiency

3. Adapter Management:
  * Adapter weight saving during checkpoints
  * Features to merge adapters back into base model
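
The memory argument can be sketched with a back-of-the-envelope estimate. The function below is a rough illustration under stated assumptions, not a measurement: frozen weights are stored in 2 bytes each, trainable parameters in 4 bytes, AdamW keeps two moment estimates per trainable parameter, and activations are ignored. The 1.5B model size and the 10M adapter size are hypothetical.

```python
def training_memory_gb(total_params, trainable_params, frozen_bytes=2, trainable_bytes=4):
    """Estimate memory for weights, gradients and AdamW states, ignoring activations."""
    frozen = (total_params - trainable_params) * frozen_bytes
    # Each trainable parameter keeps a weight, a gradient and two AdamW moments.
    trainable = trainable_params * trainable_bytes * 4
    return (frozen + trainable) / 1e9


total = 1_500_000_000   # hypothetical 1.5B-parameter model
adapter = 10_000_000    # hypothetical number of LoRA parameters

full_ft = training_memory_gb(total, trainable_params=total)
lora_ft = training_memory_gb(total, trainable_params=adapter)

print(f"full fine-tuning: {full_ft:.2f} GB")  # full fine-tuning: 24.00 GB
print(f"LoRA:             {lora_ft:.2f} GB")  # LoRA:             3.14 GB
```

Under these assumptions most of the saving comes from not keeping gradients and optimizer states for the frozen weights.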

# Loading LoRA Adapters with PEFT (Parameter-Efficient Fine-Tuning)
PEFT is a library that provides a unified interface for loading and managing PEFT methods, including LoRA. It allows you to easily load and switch between different PEFT methods, making it easier to experiment with different fine-tuning techniques.

Adapters can be loaded onto a pretrained model with **load_adapter()**, which is useful for trying out different adapters whose weights aren’t merged. Set the active adapter weights with the **set_adapter()** function. To return the base model, you could use **unload()** to unload all of the LoRA modules. This makes it easy to switch between different task-specific weights.

In [None]:
from transformers import AutoModelForCausalLM
from peft import PeftModel, PeftConfig

config = PeftConfig.from_pretrained("ybelkada/opt-350m-lora")
model = AutoModelForCausalLM.from_pretrained(config.base_model_name_or_path)
lora_model = PeftModel.from_pretrained(model, "ybelkada/opt-350m-lora")
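
The adapter-switching workflow described above (load_adapter(), set_adapter(), unload()) could look roughly like the sketch below. The function name `try_adapters`, the adapter names `task_a` and `task_b`, and the two adapter locations are hypothetical; only the PEFT method names come from the text. The import sits inside the function so the sketch can be defined without PEFT installed.

```python
def try_adapters(base_model, first_adapter, second_adapter):
    """Load two LoRA adapters onto one base model and switch between them.

    `first_adapter` and `second_adapter` are hypothetical local paths or Hub ids.
    """
    from peft import PeftModel

    # Wrap the base model with the first adapter under an explicit name.
    model = PeftModel.from_pretrained(base_model, first_adapter, adapter_name="task_a")

    # Attach a second adapter; its weights stay separate from the base weights.
    model.load_adapter(second_adapter, adapter_name="task_b")

    # Choose which adapter is active for subsequent forward passes.
    model.set_adapter("task_b")
    model.set_adapter("task_a")

    # Remove all LoRA modules and get the base model back.
    return model.unload()
```

Calling `try_adapters(model, "path/to/adapter_a", "path/to/adapter_b")` would return the base model once both adapters have been tried.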

# Fine-tune LLM using trl and the SFTTrainer with LoRA
The SFTTrainer from trl provides integration with LoRA adapters through the PEFT library. This means that we can fine-tune a model in the same way as we did with SFT, but use LoRA to reduce the number of parameters we need to train.

We’ll use the LoraConfig class from PEFT in our example. The setup requires just a few configuration steps:

1. Define the LoRA configuration (rank, alpha, dropout)
2. Create the SFTTrainer with PEFT config
3. Train and save the adapter weights

# Using TRL with PEFT
PEFT methods can be combined with TRL for fine-tuning to reduce memory requirements. We can pass the LoraConfig to the trainer, which wraps the model with the adapter.

In [None]:
from peft import LoraConfig

# TODO: Configure LoRA parameters
# r: rank dimension for LoRA update matrices (smaller = more compression)
rank_dimension = 6
# lora_alpha: scaling factor for LoRA layers (higher = stronger adaptation)
lora_alpha = 8
# lora_dropout: dropout probability for LoRA layers (helps prevent overfitting)
lora_dropout = 0.05

peft_config = LoraConfig(
    r=rank_dimension,  # Rank dimension - typically between 4-32
    lora_alpha=lora_alpha,  # LoRA scaling factor - typically 2x rank
    lora_dropout=lora_dropout,  # Dropout probability for LoRA layers
    bias="none",  # Bias type for LoRA: "none" keeps all bias terms frozen during training
    target_modules="all-linear",  # Which modules to apply LoRA to
    task_type="CAUSAL_LM",  # Task type for model architecture
)
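
To see how `r` and `lora_alpha` interact, the short sketch below computes the factor that scales the adapter output. It assumes the default LoRA scaling of `lora_alpha / r`; variants such as rank-stabilized LoRA scale differently and are not covered here.

```python
# Values copied from the configuration above.
rank_dimension = 6
lora_alpha = 8

# Default LoRA scaling: the adapter output is multiplied by lora_alpha / r.
scaling = lora_alpha / rank_dimension
print(f"scaling = {scaling:.3f}")  # scaling = 1.333

# With lora_alpha held fixed, a larger rank lowers the scaling factor.
scaling_by_rank = {r: lora_alpha / r for r in (4, 8, 16, 32)}
print(scaling_by_rank)  # {4: 2.0, 8: 1.0, 16: 0.5, 32: 0.25}
```

This is why `lora_alpha` is often chosen relative to the rank: changing `r` alone also changes how strongly the adapter contributes.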

When loading the model, you can pass **device_map="auto"** to automatically place it on the available device(s). You can also manually assign the model to a specific device using **device_map={"": device_index}**.

We will also need to define the **SFTTrainer** with the LoRA configuration.

In [None]:
# Create SFTTrainer with LoRA configuration
trainer = SFTTrainer(
    model=model,
    args=args,
    train_dataset=dataset["train"],
    peft_config=peft_config,  # LoRA configuration
    max_seq_length=max_seq_length,  # Maximum sequence length
    processing_class=tokenizer,
)

> ✏️ Try it out! Build on your fine-tuned model from the previous section, but fine-tune it with LoRA. Use the HuggingFaceTB/smoltalk dataset to fine-tune a deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B model, using the LoRA configuration we defined above.

# Merging LoRA Adapters
After training with LoRA, you might want to merge the adapter weights back into the base model for easier deployment. This creates a single model with the combined weights, eliminating the need to load adapters separately during inference.

The merging process requires attention to memory management and precision. Since you’ll need to load both the base model and adapter weights simultaneously, ensure sufficient GPU/CPU memory is available. Using device_map="auto" in transformers will find the correct device for the model based on your hardware.

Maintain consistent precision (e.g., float16) throughout the process, matching the precision used during training and saving the merged model in the same format for deployment.
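
One reason precision deserves attention is that a small adapter update can be rounded away when it is added to a weight in half precision. The sketch below illustrates this with the standard library only; the weight and update values are hypothetical, and `to_float16` is a helper defined here, not a library function.

```python
import struct


def to_float16(value):
    """Round a Python float to the nearest float16 value and return it as a float."""
    return struct.unpack("e", struct.pack("e", value))[0]


base_weight = 1.0
adapter_delta = 1e-4  # hypothetical small LoRA update

merged_fp32 = base_weight + adapter_delta
merged_fp16 = to_float16(base_weight + adapter_delta)

print(merged_fp32)  # 1.0001
print(merged_fp16)  # 1.0
```

Near 1.0 the spacing between float16 values is about 0.001, so an update of 0.0001 disappears after rounding.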

# Merging Implementation
After training a LoRA adapter, you can merge the adapter weights back into the base model. Here’s how to do it:

In [None]:
import torch
from transformers import AutoModelForCausalLM
from peft import PeftModel

# 1. Load the base model
base_model = AutoModelForCausalLM.from_pretrained(
    "base_model_name", torch_dtype=torch.float16, device_map="auto"
)

# 2. Load the PEFT model with adapter
peft_model = PeftModel.from_pretrained(
    base_model, "path/to/adapter", torch_dtype=torch.float16
)

# 3. Merge adapter weights with base model
merged_model = peft_model.merge_and_unload()

If you encounter size discrepancies in the saved model, ensure you’re also saving the tokenizer:

In [None]:
# Save both model and tokenizer
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("base_model_name")
merged_model.save_pretrained("path/to/save/merged_model")
tokenizer.save_pretrained("path/to/save/merged_model")

> ✏️ Try it out! Fine-tune a deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B model on the HuggingFaceTB/smoltalk dataset using the LoRA configuration we defined above, then merge the adapter weights back into the base model.

# How to Fine-Tune LLMs with LoRA Adapters using Hugging Face TRL

This notebook demonstrates how to efficiently fine-tune large language models using LoRA (Low-Rank Adaptation) adapters. LoRA is a parameter-efficient fine-tuning technique that:
- Freezes the pre-trained model weights
- Adds small trainable rank decomposition matrices to attention layers
- Typically reduces trainable parameters by ~90%
- Maintains model performance while being memory efficient

We'll cover:
1. Setup development environment and LoRA configuration
2. Create and prepare the dataset for adapter training
3. Fine-tune using `trl` and `SFTTrainer` with LoRA adapters
4. Test the model and merge adapters (optional)


## 1. Setup development environment

Our first step is to install the Hugging Face libraries and PyTorch, including trl, transformers and datasets. If you haven't heard of trl yet, don't worry. It is a library built on top of transformers and datasets that makes it easier to fine-tune open LLMs, apply RLHF, and align them.


In [None]:
# Install the requirements in Google Colab
!pip -q install transformers datasets trl huggingface_hub peft

# Authenticate to Hugging Face

from huggingface_hub import login

login()

# for convenience you can create an environment variable containing your hub token as HF_TOKEN

## 2. Load the dataset

In [None]:
# Load a sample dataset
from datasets import load_dataset

# TODO: define your dataset and config using the path and name parameters
dataset = load_dataset(path="HuggingFaceTB/smoltalk", name="everyday-conversations")
dataset

## 3. Fine-tune LLM using `trl` and the `SFTTrainer` with LoRA

The [SFTTrainer](https://huggingface.co/docs/trl/sft_trainer) from `trl` provides integration with LoRA adapters through the [PEFT](https://huggingface.co/docs/peft/en/index) library. Key advantages of this setup include:

1. **Memory Efficiency**:
   - Only adapter parameters need gradients and optimizer states in GPU memory
   - Base model weights remain frozen and can be loaded in lower precision
   - Enables fine-tuning of large models on consumer GPUs

2. **Training Features**:
   - Native PEFT/LoRA integration with minimal setup
   - Support for QLoRA (Quantized LoRA) for even better memory efficiency

3. **Adapter Management**:
   - Adapter weight saving during checkpoints
   - Features to merge adapters back into base model

We'll use plain LoRA in our example. QLoRA goes a step further by combining LoRA with 4-bit quantization to reduce memory usage even more. The setup requires just a few configuration steps:
1. Define the LoRA configuration (rank, alpha, dropout)
2. Create the SFTTrainer with PEFT config
3. Train and save the adapter weights


In [None]:
# Import necessary libraries
from transformers import AutoModelForCausalLM, AutoTokenizer
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer, setup_chat_format
import torch

device = (
    "cuda"
    if torch.cuda.is_available()
    else "mps" if torch.backends.mps.is_available() else "cpu"
)

# Load the model and tokenizer
model_name = "HuggingFaceTB/SmolLM2-135M"

model = AutoModelForCausalLM.from_pretrained(
    pretrained_model_name_or_path=model_name
).to(device)
tokenizer = AutoTokenizer.from_pretrained(pretrained_model_name_or_path=model_name)

# Set up the chat format
model, tokenizer = setup_chat_format(model=model, tokenizer=tokenizer)

# Set our name for the finetune to be saved &/ uploaded to
finetune_name = "SmolLM2-FT-Lora"
finetune_tags = ["smol-course-lora", "module_1_lora"]

The `SFTTrainer` supports a native integration with `peft`, which makes it easy to efficiently tune LLMs using, for example, LoRA. We only need to create our `LoraConfig` and provide it to the trainer.

<div style='background-color: lightblue; padding: 10px; border-radius: 5px; margin-bottom: 20px; color:black'>
    <h2 style='margin: 0;color:blue'>Exercise: Define LoRA parameters for finetuning</h2>
    <p>Take a dataset from the Hugging Face hub and finetune a model on it. </p>
    <p><b>Difficulty Levels</b></p>
    <p>🐢 Use the general parameters for an arbitrary finetune</p>
    <p>🐕 Adjust the parameters and review in Weights & Biases.</p>
    <p>🦁 Adjust the parameters and show change in inference results.</p>
</div>

In [None]:
from peft import LoraConfig

# TODO: Configure LoRA parameters
# r: rank dimension for LoRA update matrices (smaller = more compression)
rank_dimension = 6
# lora_alpha: scaling factor for LoRA layers (higher = stronger adaptation)
lora_alpha = 8
# lora_dropout: dropout probability for LoRA layers (helps prevent overfitting)
lora_dropout = 0.05

peft_config = LoraConfig(
    r=rank_dimension,  # Rank dimension - typically between 4-32
    lora_alpha=lora_alpha,  # LoRA scaling factor - typically 2x rank
    lora_dropout=lora_dropout,  # Dropout probability for LoRA layers
    bias="none",  # Bias type for LoRA: "none" keeps all bias terms frozen during training
    target_modules="all-linear",  # Which modules to apply LoRA to
    task_type="CAUSAL_LM",  # Task type for model architecture
)

Before we can start training, we need to define the hyperparameters we want to use (`SFTConfig`, a subclass of `TrainingArguments`).

In [None]:
# Training configuration
# Hyperparameters based on QLoRA paper recommendations
args = SFTConfig(
    # Output settings
    output_dir=finetune_name,  # Directory to save model checkpoints
    # Training duration
    num_train_epochs=1,  # Number of training epochs
    max_seq_length=1512, # Maximum sequence length
    packing=True,  # Enable input packing for efficiency
    # Batch size settings
    per_device_train_batch_size=2,  # Batch size per GPU
    gradient_accumulation_steps=2,  # Accumulate gradients for larger effective batch
    # Memory optimization
    gradient_checkpointing=True,  # Trade compute for memory savings
    # Optimizer settings
    optim="adamw_torch_fused",  # Use fused AdamW for efficiency
    learning_rate=2e-4,  # Learning rate (QLoRA paper)
    max_grad_norm=0.3,  # Gradient clipping threshold
    # Learning rate schedule
    warmup_ratio=0.03,  # Portion of steps for warmup
    lr_scheduler_type="constant",  # Keep learning rate constant after warmup
    # Logging and saving
    logging_steps=10,  # Log metrics every N steps
    save_strategy="epoch",  # Save checkpoint every epoch
    # Precision settings
    bf16=True,  # Use bfloat16 precision
    # Integration settings
    push_to_hub=False,  # Don't push to HuggingFace Hub
    report_to="none",  # Disable external logging
    dataset_kwargs={
        "add_special_tokens": False,  # Special tokens handled by template
        "append_concat_token": False,  # No additional separator needed
    },
)
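
The batch-size and warmup settings combine in a way that is easy to check by hand. The sketch below is an approximation of how the step counts work out, not the trainer's exact bookkeeping. The number of devices and the number of training sequences after packing are assumptions chosen for illustration.

```python
import math

# Values copied from the SFTConfig above.
per_device_train_batch_size = 2
gradient_accumulation_steps = 2
num_train_epochs = 1
warmup_ratio = 0.03

# Assumptions for illustration only.
num_devices = 1
num_training_sequences = 2260

# Sequences consumed per optimizer step.
effective_batch_size = per_device_train_batch_size * gradient_accumulation_steps * num_devices

# Optimizer steps per epoch and in total.
steps_per_epoch = math.ceil(num_training_sequences / effective_batch_size)
total_steps = steps_per_epoch * num_train_epochs

# Steps spent ramping the learning rate up before it stays constant.
warmup_steps = math.ceil(total_steps * warmup_ratio)

print(effective_batch_size)  # 4
print(total_steps)           # 565
print(warmup_steps)          # 17
```

With `lr_scheduler_type="constant"`, the learning rate would stay at `2e-4` for the remaining steps after warmup.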

We now have every building block we need to create our `SFTTrainer` and start training our model.

In [None]:
# max_seq_length = 1512  # max sequence length for model and packing of the dataset

# Create SFTTrainer with LoRA configuration
trainer = SFTTrainer(
    model=model,
    args=args,
    train_dataset=dataset["train"],
    peft_config=peft_config,  # LoRA configuration
    processing_class=tokenizer,
    # max_seq_length=max_seq_length,  # Maximum sequence length

)

Start training our model by calling the `train()` method on our `SFTTrainer` instance. This will start the training loop and train our model for 1 epoch, as set by `num_train_epochs`. Since we are using a PEFT method, we will only save the adapted model weights and not the full model.

In [None]:
# start training; checkpoints are written to the output directory (push_to_hub is disabled above)
trainer.train()

# save model
trainer.save_model()

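For reference, a run with Flash Attention for 3 epochs on a dataset of 15k samples took 4:14:36 on a `g5.2xlarge`. The instance costs `1.21$/h`, which brings the total cost to only ~`5.3$`. These figures do not correspond to the 1-epoch configuration above, so your own run time will differ.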



### Merge LoRA Adapter into the Original Model

When using LoRA, we only train adapter weights while keeping the base model frozen. During training, we save only these lightweight adapter weights (~2-10MB) rather than a full model copy. However, for deployment, you might want to merge the adapters back into the base model for:

1. **Simplified Deployment**: Single model file instead of base model + adapters
2. **Inference Speed**: No adapter computation overhead
3. **Framework Compatibility**: Better compatibility with serving frameworks


In [None]:
from peft import AutoPeftModelForCausalLM


# Load PEFT model on CPU
model = AutoPeftModelForCausalLM.from_pretrained(
    pretrained_model_name_or_path=args.output_dir,
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
)

# Merge LoRA and base model and save
merged_model = model.merge_and_unload()
merged_model.save_pretrained(
    args.output_dir, safe_serialization=True, max_shard_size="2GB"
)

## 4. Test Model and run Inference

After the training is done, we want to test our model. We will run a few sample prompts through the model and inspect the generated responses.



<div style='background-color: lightblue; padding: 10px; border-radius: 5px; margin-bottom: 20px; color:black'>
    <h2 style='margin: 0;color:blue'>Bonus Exercise: Load LoRA Adapter</h2>
    <p>Use what you learnt from the example notebook to load your trained LoRA adapter for inference.</p>
</div>

In [None]:
# free the memory again
del model
del trainer
torch.cuda.empty_cache()

In [None]:
import torch
from peft import AutoPeftModelForCausalLM
from transformers import AutoTokenizer, pipeline

# Load Model with PEFT adapter
tokenizer = AutoTokenizer.from_pretrained(finetune_name)
model = AutoPeftModelForCausalLM.from_pretrained(
    finetune_name, device_map="auto", torch_dtype=torch.float16
)
pipe = pipeline(
    "text-generation", model=merged_model, tokenizer=tokenizer, device=device
)

Let's test some prompt samples and see how the model performs.

In [None]:
prompts = [
    "What is the capital of Germany? Explain why thats the case and if it was different in the past?",
    "Write a Python function to calculate the factorial of a number.",
    "A rectangular garden has a length of 25 feet and a width of 15 feet. If you want to build a fence around the entire garden, how many feet of fencing will you need?",
    "What is the difference between a fruit and a vegetable? Give examples of each.",
]


def test_inference(prompt):
    prompt = pipe.tokenizer.apply_chat_template(
        [{"role": "user", "content": prompt}],
        tokenize=False,
        add_generation_prompt=True,
    )
    outputs = pipe(
        prompt,
    )
    return outputs[0]["generated_text"][len(prompt) :].strip()


for prompt in prompts:
    print(f"    prompt:\n{prompt}")
    print(f"    response:\n{test_inference(prompt)}")
    print("-" * 50)
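
The slice in `test_inference` relies on the text-generation pipeline returning the prompt followed by the completion. The sketch below shows that step in isolation. `fake_pipe` is a hypothetical stand-in for the real pipeline, and the ChatML-style prompt string is written by hand for illustration only.

```python
def fake_pipe(prompt):
    """Hypothetical stand-in for a text-generation pipeline: echoes the prompt, then a reply."""
    return [{"generated_text": prompt + " Berlin is the capital. "}]


def strip_prompt(prompt, outputs):
    """Drop the echoed prompt and surrounding whitespace, keeping only the new text."""
    return outputs[0]["generated_text"][len(prompt):].strip()


# Hand-written ChatML-style prompt, for illustration only.
chat_prompt = (
    "<|im_start|>user\nWhat is the capital of Germany?<|im_end|>\n"
    "<|im_start|>assistant\n"
)

reply = strip_prompt(chat_prompt, fake_pipe(chat_prompt))
print(reply)  # Berlin is the capital.
```

Slicing by `len(prompt)` only works when the generated text starts with the exact prompt string, which is why the templated prompt is used for both the call and the slice.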

# 🐕 Try out the bigcode/the-stack-smol dataset and finetune a code generation model on a specific subset data/python.

In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer, setup_chat_format
import torch

# Set device
device = (
    "cuda"
    if torch.cuda.is_available()
    else "mps" if torch.backends.mps.is_available() else "cpu"
)

# Load model and tokenizer
model_name = "HuggingFaceTB/SmolLM2-135M"
model = AutoModelForCausalLM.from_pretrained(model_name).to(device)
# model.config.attn_implementation = "flash_attention_2"
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Set up chat format (important for TRL-style fine-tuning)
model, tokenizer = setup_chat_format(model=model, tokenizer=tokenizer)
# Set our name for the finetune to be saved &/ uploaded to
finetune_name = "SmolLM2-FT-BigCode-Python-Lora"
finetune_tags = ["smol-course-bigcode-lora", "module_2-lora"]

In [None]:


# Load the dataset (Python subset only)
ds = load_dataset("bigcode/the-stack-smol", data_dir="data/python")

ds = ds["train"]

def convert_to_chat(example):
    parts = example["repository_name"].split("/")
    repo_name = parts[1] if len(parts) == 2 else parts[0]

    return {
        "messages": [
            {
                "role": "user",
                "content": f"Generate code for `{repo_name}/{example['path']}`."
            },
            {"role": "assistant", "content": example["content"]}
        ]
    }

ds = ds.map(convert_to_chat)

# 🛠️ Preprocess dataset: keep only the `messages` column created above
def preprocess(example):
    return {"messages": example["messages"]}

ds = ds.map(preprocess, remove_columns=ds.column_names)

# Shuffle and split into train/test manually
split = ds.train_test_split(test_size=0.2, seed=42)

ds = {
    "train": split["train"],
    "test": split["test"]
}
ds
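
To check what `convert_to_chat` produces without downloading the dataset, it can be run on a single hand-made record. The sample below is hypothetical. It only mirrors the three fields the function reads (`repository_name`, `path`, `content`).

```python
def convert_to_chat(example):
    """Turn one code record into a two-turn chat: a request and the file content."""
    parts = example["repository_name"].split("/")
    # Use the repository part of "owner/repo"; fall back to the first part otherwise.
    repo_name = parts[1] if len(parts) == 2 else parts[0]
    return {
        "messages": [
            {
                "role": "user",
                "content": f"Generate code for `{repo_name}/{example['path']}`.",
            },
            {"role": "assistant", "content": example["content"]},
        ]
    }


# Hypothetical record for illustration.
sample = {
    "repository_name": "octocat/hello-world",
    "path": "src/app.py",
    "content": "print('hello')\n",
}

chat = convert_to_chat(sample)
print(chat["messages"][0]["content"])  # Generate code for `hello-world/src/app.py`.
```

The user turn carries only the repository name and file path, so the model learns to produce a whole file from that short request.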


In [None]:
ds["train"] = ds["train"].select(range(2500))  # for testing setup
ds["test"] = ds["test"].select(range(300))  # for testing setup

In [None]:
ds

In [None]:
from peft import LoraConfig

# TODO: Configure LoRA parameters
# r: rank dimension for LoRA update matrices (smaller = more compression)
rank_dimension = 6
# lora_alpha: scaling factor for LoRA layers (higher = stronger adaptation)
lora_alpha = 8
# lora_dropout: dropout probability for LoRA layers (helps prevent overfitting)
lora_dropout = 0.05

peft_config = LoraConfig(
    r=rank_dimension,  # Rank dimension - typically between 4-32
    lora_alpha=lora_alpha,  # LoRA scaling factor - typically 2x rank
    lora_dropout=lora_dropout,  # Dropout probability for LoRA layers
    bias="none",  # Bias type for LoRA: "none" keeps all bias terms frozen during training
    target_modules="all-linear",  # Which modules to apply LoRA to
    task_type="CAUSAL_LM",  # Task type for model architecture
)

In [None]:
# Training configuration
# Hyperparameters based on QLoRA paper recommendations
args = SFTConfig(
    # Output settings
    output_dir=finetune_name,  # Directory to save model checkpoints
    # Training duration
    num_train_epochs=3,  # Number of training epochs
    max_seq_length=1512, # Maximum sequence length
    # max_seq_length=8192,  # Important!
    # packing=True,         # Only if flash_attention_2 is enabled
    # Batch size settings
    per_device_train_batch_size=2,  # Batch size per GPU
    gradient_accumulation_steps=2,  # Accumulate gradients for larger effective batch
    # Memory optimization
    gradient_checkpointing=True,  # Trade compute for memory savings
    # Optimizer settings
    optim="adamw_torch_fused",  # Use fused AdamW for efficiency
    learning_rate=2e-4,  # Learning rate (QLoRA paper)
    max_grad_norm=0.3,  # Gradient clipping threshold
    # Learning rate schedule
    warmup_ratio=0.03,  # Portion of steps for warmup
    lr_scheduler_type="constant",  # Keep learning rate constant after warmup
    # Logging and saving
    logging_steps=10,  # Log metrics every N steps
    # save_strategy="epoch",  # Save checkpoint every epoch
    save_steps=500,
    # Precision settings
    bf16=True,  # Use bfloat16 precision
    # Integration settings
    # push_to_hub=False,  # Don't push to HuggingFace Hub
    # report_to="none",  # Disable external logging
    report_to="wandb",
    run_name="qlora_lora6_alpha8_dropout5_epoch3",
    # eval_strategy="steps",# Evaluate the model at regular intervals
    eval_strategy="epoch",
    # eval_steps=500,# Frequency of evaluation
    # eval_steps=50,
    # dataset_kwargs={
    #     "add_special_tokens": False,  # Special tokens handled by template
    #     "append_concat_token": False,  # No additional separator needed
    # },
    hub_model_id=finetune_name,
)

In [None]:

import wandb
wandb.init(project="bigcode-finetune-lora", name="qlora-test-lora6", config=args.to_dict())

wandb.log({
    "lora_r": rank_dimension,
    "lora_alpha": lora_alpha,
    "lora_dropout": lora_dropout,
    "learning_rate": args.learning_rate,
    "batch_size": args.per_device_train_batch_size * args.gradient_accumulation_steps,
    "num_train_epochs": args.num_train_epochs
})


In [None]:
# Create SFTTrainer with LoRA configuration
trainer = SFTTrainer(
    model=model,
    args=args,
    train_dataset=ds["train"],
    eval_dataset=ds["test"],
    peft_config=peft_config,  # LoRA configuration
    processing_class=tokenizer,

)

In [None]:
# start training; checkpoints are written to the output directory (the Hub push happens in a later cell)
trainer.train()

# Save the model
trainer.save_model(f"./{finetune_name}")

In [None]:
trainer.push_to_hub(tags=finetune_tags)

In [None]:
from peft import AutoPeftModelForCausalLM


# Load PEFT model on CPU
model = AutoPeftModelForCausalLM.from_pretrained(
    pretrained_model_name_or_path=args.output_dir,
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
)

# Merge LoRA and base model and save
merged_model = model.merge_and_unload()
merged_model.save_pretrained(
    args.output_dir, safe_serialization=True, max_shard_size="2GB"
)

In [None]:
# free the memory again
del model
del trainer
torch.cuda.empty_cache()

In [None]:
import torch
from peft import AutoPeftModelForCausalLM
from transformers import AutoTokenizer, pipeline

# Load Model with PEFT adapter
tokenizer = AutoTokenizer.from_pretrained(finetune_name)
model = AutoPeftModelForCausalLM.from_pretrained(
    finetune_name, device_map="auto", torch_dtype=torch.float16
)
pipe = pipeline(
    "text-generation", model=merged_model, tokenizer=tokenizer, device=device
)

In [None]:
prompts = [
    # "Write a Python function to calculate the factorial of a number.",
    "Write a Python script to train a Decision Tree Classifier using scikit-learn."
]


def test_inference(prompt):
    prompt = pipe.tokenizer.apply_chat_template(
        [{"role": "user", "content": prompt}],
        tokenize=False,
        add_generation_prompt=True,
    )
    outputs = pipe(
        prompt,
    )
    return outputs[0]["generated_text"][len(prompt) :].strip()


for prompt in prompts:
    print(f"    prompt:\n{prompt}")
    print(f"    response:\n{test_inference(prompt)}")
    print("-" * 50)

In [None]:
# Push the merged model and its tokenizer to the same Hub repository
merged_repo = "SmolLM2-FT-BigCode-Python-Lora-merged"
merged_model.push_to_hub(merged_repo, tags=finetune_tags)
tokenizer.push_to_hub(merged_repo, tags=finetune_tags)