# LoRA (Low-Rank Adaptation)

Fine-tuning large language models is a resource-intensive process. LoRA is a technique that lets us **fine-tune large language models by training only a small number of parameters**. It works by adding small trainable matrices alongside the attention weights and optimizing only those, which typically reduces the number of trainable parameters by about 90%.

## Understanding LoRA

LoRA (Low-Rank Adaptation) is a parameter-efficient fine-tuning technique that freezes the pre-trained model weights and injects trainable rank decomposition matrices into the model’s layers.

LoRA works by adding pairs of rank decomposition matrices to transformer layers, typically focusing on attention weights. During inference, these adapter weights can be merged with the base model, resulting in no additional latency overhead. LoRA is particularly useful for adapting large language models to specific tasks or domains while keeping resource requirements manageable.
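
To make the "pair of rank decomposition matrices" concrete, here is a minimal NumPy sketch of a single LoRA-augmented linear layer. The layer sizes, the `lora_forward` helper, and the random values are illustrative assumptions rather than anything taken from a real model. The zero initialization of `B` follows the scheme described in the LoRA paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes only (assumption), not taken from any real model.
d_in, d_out, r, alpha = 16, 12, 4, 8
scale = alpha / r  # standard LoRA scaling of the low-rank update

W = rng.normal(size=(d_out, d_in))           # frozen pre-trained weight
A = rng.normal(scale=0.01, size=(r, d_in))   # trainable, small random init
B = np.zeros((d_out, r))                     # trainable, zero init


def lora_forward(x, W, A, B, scale):
    """Base projection plus the scaled low-rank correction."""
    return x @ W.T + scale * (x @ A.T) @ B.T


x = rng.normal(size=(3, d_in))

# With B = 0 the adapter contributes nothing, so training starts
# exactly from the behaviour of the base model.
assert np.allclose(lora_forward(x, W, A, B, scale), x @ W.T)

# Simulate a trained adapter with a random B: the weight update B @ A
# has the full shape of W but rank at most r.
B = rng.normal(size=(d_out, r))
delta_W = scale * (B @ A)
print(delta_W.shape, np.linalg.matrix_rank(delta_W))
```

Only `A` and `B` would receive gradients during training; `W` stays untouched, which is what keeps the number of trainable parameters small.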


## Key advantages of LoRA

1. **Memory Efficiency**:
- Only adapter parameters need gradients and optimizer states in GPU memory
- Base model weights remain frozen and can be loaded in lower precision
- Enables fine-tuning of large models on consumer GPUs

2. **Training Features**:
- Native PEFT/LoRA integration with minimal setup
- Support for QLoRA (Quantized LoRA) for even better memory efficiency

3. **Adapter Management**:
- Adapter weight saving during checkpoints
- Features to merge adapters back into base model
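
The memory argument above comes down to simple arithmetic. The sketch below counts the weights in one linear layer against the weights in its LoRA adapter. The helper name `lora_param_counts` and the 4096-wide layer are hypothetical and chosen only for illustration.

```python
def lora_param_counts(d_in: int, d_out: int, r: int) -> tuple[int, int]:
    """Return (full_weight_params, lora_params) for one linear layer."""
    full = d_in * d_out
    lora = r * (d_in + d_out)  # A is r x d_in, B is d_out x r
    return full, lora


# Hypothetical square projection of width 4096 adapted with rank 8.
full, lora = lora_param_counts(d_in=4096, d_out=4096, r=8)
print(f"full: {full:,}  lora: {lora:,}  ratio: {lora / full:.4%}")
```

For this layer the adapter holds 1/256 of the weights of the full matrix. The saving for a whole model depends on which modules are adapted and on the chosen rank, so treat the figure as an upper-end illustration rather than a general rule.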


## Loading LoRA Adapters with PEFT
[PEFT](https://github.com/huggingface/peft) is a library that provides a unified interface for loading and managing PEFT methods, including LoRA. It allows you to easily load and switch between different PEFT methods, making it easier to experiment with different fine-tuning techniques.

Adapters can be loaded onto a pretrained model with `load_adapter()`, which is useful for trying out different adapters whose weights aren’t merged. Set the active adapter weights with the `set_adapter()` function. To return the base model, you could use `unload()` to unload all of the LoRA modules. This makes it easy to switch between different task-specific weights.
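
As a rough sketch of that workflow, the function below loads two adapters onto one base model, switches between them, and finally removes the LoRA modules. The function name, the adapter names `task_a` and `task_b`, and all three arguments are placeholders. It assumes reasonably recent versions of `transformers` and `peft`.

```python
def switch_adapters(base_model_id: str, adapter_a: str, adapter_b: str):
    """Load two LoRA adapters onto one base model and switch between them.

    All three arguments are placeholders for a Hub id or a local path.
    """
    # Imported lazily so the sketch can be defined without the libraries
    # installed; actually calling it requires `transformers` and `peft`.
    from transformers import AutoModelForCausalLM
    from peft import PeftModel

    base = AutoModelForCausalLM.from_pretrained(base_model_id)

    # The first adapter is attached when the PeftModel is created.
    model = PeftModel.from_pretrained(base, adapter_a, adapter_name="task_a")

    # Further adapters are loaded side by side, without merging.
    model.load_adapter(adapter_b, adapter_name="task_b")

    model.set_adapter("task_b")  # forward passes now use task_b
    model.set_adapter("task_a")  # switch back to task_a

    # Strip all LoRA modules and return the plain base model.
    return model.unload()
```

Because the adapters are never merged here, the base weights are shared and only the small adapter matrices differ between tasks.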

![LoRA Adapters](misc/lora_adapter.png "LoRA Adapters")

## LoRA Configuration
Let’s walk through the LoRA configuration and key parameters.

|    Parameter   |                                                                                    Description                                                                                    |
|:--------------:|:---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------:|
| r (rank)       | Dimension of the low-rank matrices used for weight updates. Typically between 4-32. Lower values provide more compression but potentially less expressiveness.                    |
| lora_alpha     | Scaling factor for LoRA layers, usually set to 2x the rank value. Higher values result in stronger adaptation effects.                                                            |
| lora_dropout   | Dropout probability for LoRA layers, typically 0.05-0.1. Higher values help prevent overfitting during training.                                                                  |
| bias           | Controls training of bias terms. Options are “none”, “all”, or “lora_only”. “none” is most common for memory efficiency.                                                          |
| target_modules | Specifies which model modules to apply LoRA to. Can be “all-linear” or specific modules like “q_proj,v_proj”. More modules enable greater adaptability but increase memory usage. |
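
The interaction between `r` and `lora_alpha` is easier to see as a number. In standard LoRA the low-rank update is multiplied by `lora_alpha / r`, so the sketch below prints that multiplier for a few settings. The helper `lora_scale` is a hypothetical name, and the last pair matches the values used in the training example later in this document.

```python
def lora_scale(lora_alpha: float, r: int) -> float:
    """Multiplier applied to the low-rank update B @ A in standard LoRA."""
    return lora_alpha / r


for r, alpha in [(8, 16), (16, 32), (6, 8)]:
    print(f"r={r:<2} lora_alpha={alpha:<2} scale={lora_scale(alpha, r):.2f}")
```

Keeping `lora_alpha` at twice the rank holds the multiplier at 2 while the rank changes, which is one reason that rule of thumb is convenient when sweeping over `r`.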


> When implementing PEFT methods, start with small rank values (4-8) for LoRA and monitor training loss. Use validation sets to prevent overfitting and compare results with full fine-tuning baselines when possible. The effectiveness of different methods can vary by task, so experimentation is key.

## Fine-tune LLM using trl and the SFTTrainer with LoRA

### Load a sample dataset

In [1]:
from datasets import load_dataset

dataset = load_dataset(path="HuggingFaceTB/smoltalk", name="everyday-conversations")
dataset

DatasetDict({
    train: Dataset({
        features: ['full_topic', 'messages'],
        num_rows: 2260
    })
    test: Dataset({
        features: ['full_topic', 'messages'],
        num_rows: 119
    })
})

### Load the model

In [7]:
# Import necessary libraries
from transformers import AutoModelForCausalLM, AutoTokenizer
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer, setup_chat_format
import torch

In [13]:
device = (
    "cuda"
    if torch.cuda.is_available()
    else "mps" if torch.backends.mps.is_available() else "cpu"
)

In [4]:
# Load the model and tokenizer
model_name = "HuggingFaceTB/SmolLM2-135M"

model = AutoModelForCausalLM.from_pretrained(
    pretrained_model_name_or_path=model_name
).to(device)
tokenizer = AutoTokenizer.from_pretrained(pretrained_model_name_or_path=model_name)

# Set up the chat format
model, tokenizer = setup_chat_format(model=model, tokenizer=tokenizer)

# Set our name for the finetune to be saved
finetune_name = "SmolLM2-LoRA_FT-MyDataset"

### Generate with base model

In [5]:
# Let's test the base model before training
prompt = "Which is the capital of paris?"

# Format with template
messages = [{"role": "user", "content": prompt}]
formatted_prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)

# Generate response
inputs = tokenizer(formatted_prompt, return_tensors="pt").to(device)

outputs = model.generate(**inputs, max_new_tokens=100)
print("Before training:")
print(tokenizer.decode(outputs[0], skip_special_tokens=False))

Before training:
<|im_start|>user
Which is the capital of paris?<|im_end|>
<|im_start|>assistant
Which is the capital of paris?paris
Which is the capital of paris?paris
Which is the capital of paris?paris
Which is the capital of paris?paris
Which is the capital of paris?paris
Which is the capital of paris?paris
Which is the capital of paris?paris
Which is the capital of paris?paris
Which is the capital of paris?paris
Which


### Define LoRA parameters for finetuning

In [6]:
from peft import LoraConfig

In [7]:
# r: rank dimension for LoRA update matrices (smaller = more compression)
rank_dimension = 6
# lora_alpha: scaling factor for LoRA layers (higher = stronger adaptation)
lora_alpha = 8
# lora_dropout: dropout probability for LoRA layers (helps prevent overfitting)
lora_dropout = 0.05

peft_config = LoraConfig(
    r=rank_dimension,  # Rank dimension - typically between 4-32
    lora_alpha=lora_alpha,  # LoRA scaling factor - typically 2x rank
    lora_dropout=lora_dropout,  # Dropout probability for LoRA layers
    bias="none",  # Bias handling: "none" keeps all bias terms frozen
    target_modules="all-linear",  # Which modules to apply LoRA to
    task_type="CAUSAL_LM",  # Task type for model architecture
)

### Finetune the model

In [8]:
# Training configuration
# Hyperparameters based on QLoRA paper recommendations
args = SFTConfig(
    # Output settings
    output_dir=finetune_name,  # Directory to save model checkpoints
    # Training duration
    num_train_epochs=1,  # Number of training epochs
    # Batch size settings
    per_device_train_batch_size=2,  # Batch size per GPU
    gradient_accumulation_steps=2,  # Accumulate gradients for larger effective batch
    # Memory optimization
    gradient_checkpointing=True,  # Trade compute for memory savings
    # Optimizer settings
    optim="adamw_torch_fused",  # Use fused AdamW for efficiency
    learning_rate=2e-4,  # Learning rate (QLoRA paper)
    max_grad_norm=0.3,  # Gradient clipping threshold
    # Learning rate schedule
    warmup_ratio=0.03,  # Portion of steps for warmup
    lr_scheduler_type="constant",  # Keep learning rate constant after warmup
    # Logging and saving
    logging_steps=10,  # Log metrics every N steps
    save_strategy="epoch",  # Save checkpoint every epoch
    # Precision settings
    bf16=True,  # Use bfloat16 precision
    # Integration settings
    push_to_hub=False,  # Don't push to HuggingFace Hub
    report_to="none",  # Disable external logging
    max_seq_length=1512,  # max sequence length for model and packing of the dataset
    packing=True,  # Enable input packing for efficiency
    dataset_kwargs={
        "add_special_tokens": False,  # Special tokens handled by template
        "append_concat_token": False,  # No additional separator needed
    },
)

In [None]:
# Create SFTTrainer with LoRA configuration
trainer = SFTTrainer(
    model=model,
    args=args,
    train_dataset=dataset["train"],
    peft_config=peft_config,  # LoRA configuration
    tokenizer=tokenizer,
)

In [10]:
# Start training; checkpoints are saved to the output directory (push_to_hub=False, so nothing is uploaded to the Hub)
trainer.train()

`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`.


| Step | Training Loss |
|:----:|:-------------:|
| 10   | 1.7152        |
| 20   | 1.608         |
| 30   | 1.5464        |
| 40   | 1.4931        |
| 50   | 1.4124        |
| 60   | 1.3814        |
| 70   | 1.3537        |


TrainOutput(global_step=73, training_loss=1.4940339669789353, metrics={'train_runtime': 42.8341, 'train_samples_per_second': 6.817, 'train_steps_per_second': 1.704, 'total_flos': 286187668107264.0, 'train_loss': 1.4940339669789353})
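
The numbers in this output can be tied back to the training configuration with a little arithmetic. The sketch below assumes a single-device run, and assumes that warmup steps are computed as the warmup ratio times the total number of steps, rounded up. All other values are copied from the configuration and the output above.

```python
import math

per_device_train_batch_size = 2
gradient_accumulation_steps = 2
num_devices = 1        # assumption: single-GPU run
global_step = 73       # from the TrainOutput above
warmup_ratio = 0.03

# Sequences consumed per optimizer step.
effective_batch = (
    per_device_train_batch_size * gradient_accumulation_steps * num_devices
)
sequences_seen = global_step * effective_batch

# Cross-check against the reported throughput: runtime x samples per second.
reported = 42.8341 * 6.817

# Assumption: warmup length is ratio x total steps, rounded up.
warmup_steps = math.ceil(warmup_ratio * global_step)

print(effective_batch, sequences_seen, round(reported), warmup_steps)
```

Both routes give about 292 training sequences, far fewer than the 2,260 conversations in the dataset. That gap suggests `packing=True` concatenated the short conversations into longer sequences before training.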

In [12]:
# Save the model
trainer.save_model(f"checkpoints/{finetune_name}")

### Generate with fine-tuned model

In [13]:
# Test the fine-tuned model on the same prompt
prompt = "Which is the capital of paris?"

# Format with template
messages = [{"role": "user", "content": prompt}]
formatted_prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)

# Generate response
inputs = tokenizer(formatted_prompt, return_tensors="pt").to(device)

outputs = model.generate(**inputs, max_new_tokens=100)
print("After training:")
print(tokenizer.decode(outputs[0], skip_special_tokens=False))



After training:
<|im_start|>user
Which is the capital of paris?<|im_end|>
<|im_start|>assistant
Which is the capital of paris?
Barcelona
Barcelona
Barcelona
Barcelona
Barcelona
Barcelona
Barcelona
Barcelona
Barcelona
Barcelona
Barcelona
Barcelona
Barcelona
Barcelona
Barcelona
Barcelona
Barcelona
Barcelona
Barcelona
Barcelona
Barcelona
Barcelona
Barcelona


## Merge LoRA Adapter into the Original Model

When using LoRA, we only train adapter weights while keeping the base model frozen. During training, we save only these lightweight adapter weights (~2-10MB) rather than a full model copy. However, for deployment, you might want to merge the adapters back into the base model for:

1. **Simplified Deployment**: Single model file instead of base model + adapters
2. **Inference Speed**: No adapter computation overhead
3. **Framework Compatibility**: Better compatibility with serving frameworks
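
The reason merging costs nothing at inference time can be checked numerically. The NumPy sketch below folds a low-rank update into a weight matrix and compares the result with the unmerged computation. The shapes and random values are illustrative assumptions, and the code shows the idea rather than what PEFT does internally.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, r, alpha = 16, 12, 4, 8   # illustrative sizes only
scale = alpha / r

W = rng.normal(size=(d_out, d_in))     # frozen base weight
A = rng.normal(size=(r, d_in))         # stand-ins for trained adapter matrices
B = rng.normal(size=(d_out, r))
x = rng.normal(size=(5, d_in))

# Unmerged: base path plus a separate adapter path (two extra matmuls).
y_adapter = x @ W.T + scale * (x @ A.T) @ B.T

# Merged: fold the update into the weight once, then use a single matmul.
W_merged = W + scale * (B @ A)
y_merged = x @ W_merged.T

print(np.allclose(y_adapter, y_merged))  # True
print(W_merged.shape == W.shape)         # True
```

The merged weight has the same shape as the original, so the deployed model has the same architecture and parameter count as the base model.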


In [3]:
from peft import AutoPeftModelForCausalLM

In [11]:
tokenizer = AutoTokenizer.from_pretrained(
    pretrained_model_name_or_path=f"checkpoints/{finetune_name}"
)

In [8]:
# Load PEFT model on CPU
model = AutoPeftModelForCausalLM.from_pretrained(
    pretrained_model_name_or_path=f"checkpoints/{finetune_name}",
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
)

In [9]:
# Merge LoRA and base model and save
merged_model = model.merge_and_unload()
merged_model.save_pretrained(
    f"checkpoints/merged/{finetune_name}", safe_serialization=True, max_shard_size="2GB"
)

### Test Model and run Inference

After training, we want to test our model. We will run a handful of hand-written prompts through the merged model in a simple loop and inspect the responses by eye.



In [1]:
from transformers import pipeline

In [14]:
# Build a text-generation pipeline from the merged model
pipe = pipeline(
    "text-generation", model=merged_model, tokenizer=tokenizer, device=device
)

Device set to use cuda


In [18]:
prompts = [
    "What is the capital of Germany? Explain why thats the case and if it was different in the past?",
    "Write a Python function to calculate the factorial of a number.",
    "A rectangular garden has a length of 25 feet and a width of 15 feet. If you want to build a fence around the entire garden, how many feet of fencing will you need?",
    "What is the difference between a fruit and a vegetable? Give examples of each.",
    "What is the capital of France?"
]


def test_inference(prompt):
    prompt = pipe.tokenizer.apply_chat_template(
        [{"role": "user", "content": prompt}],
        tokenize=False,
        add_generation_prompt=True,
    )
    outputs = pipe(
        prompt,
    )
    return outputs[0]["generated_text"][len(prompt) :].strip()


for prompt in prompts:
    print(f"    prompt:\n{prompt}")
    print(f"    response:\n{test_inference(prompt)}")
    print("-" * 50)

    prompt:
What is the capital of Germany? Explain why thats the case and if it was different in the past?
    response:
The capital of Germany is Berlin. It is located in the state of Brandenburg. It is the
--------------------------------------------------
    prompt:
Write a Python function to calculate the factorial of a number.
    response:
Write a Python function to calculate the factorial of a number.
Brian
Write a Python
--------------------------------------------------
    prompt:
A rectangular garden has a length of 25 feet and a width of 15 feet. If you want to build a fence around the entire garden, how many feet of fencing will you need?
    response:
The length of the fence is 25 feet and the width is 15 feet. How
--------------------------------------------------
    prompt:
What is the difference between a fruit and a vegetable? Give examples of each.
    response:
What is the difference between a fruit and a vegetable? Give examples of each.
Breadfruit
-------------

## Resources

- [LoRA: Low-Rank Adaptation of Large Language Models](https://arxiv.org/pdf/2106.09685)
- [PEFT Documentation](https://huggingface.co/docs/peft)
- [Hugging Face blog post on PEFT](https://huggingface.co/blog/peft)