<a href="https://colab.research.google.com/github/Krishna77799/ISL-linear-regression/blob/master/SFT_SmolLM2_135M_LORA.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# How to Fine-Tune LLMs with LoRA Adapters using Hugging Face TRL

This notebook demonstrates how to efficiently fine-tune large language models using LoRA (Low-Rank Adaptation) adapters. LoRA is a parameter-efficient fine-tuning technique that:
- Freezes the pre-trained model weights
- Adds small trainable rank decomposition matrices to attention layers
- Typically reduces the number of trainable parameters by 90% or more
- Maintains model performance while being memory efficient
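
To make the parameter saving concrete, here is a minimal sketch that compares the number of weights in one dense projection with the number in its LoRA factors. The 1024 x 1024 layer size and the helper name `lora_param_counts` are illustrative assumptions, not values taken from the model used below.

```python
def lora_param_counts(d_in: int, d_out: int, r: int) -> tuple[int, int]:
    """Return (full, lora) trainable-parameter counts for one linear layer."""
    full = d_in * d_out          # dense weight W has shape (d_out, d_in)
    lora = r * (d_in + d_out)    # A has shape (r, d_in), B has shape (d_out, r)
    return full, lora


# Hypothetical layer size, chosen only for illustration
full, lora = lora_param_counts(d_in=1024, d_out=1024, r=16)
print(f"full: {full:,}  lora: {lora:,}  ratio: {lora / full:.2%}")
```

Because the LoRA count grows with `d_in + d_out` rather than `d_in * d_out`, the saving becomes larger as layers get wider.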

We'll cover:
1. Setup development environment and LoRA configuration
2. Create and prepare the dataset for adapter training
3. Fine-tune using `trl` and `SFTTrainer` with LoRA adapters
4. Test the model and merge adapters (optional)


## 1. Setup development environment

Our first step is to install the Hugging Face libraries and PyTorch, including `trl`, `transformers` and `datasets`. If you haven't heard of `trl` yet, don't worry: it is a library built on top of `transformers` and `datasets` that makes it easier to fine-tune, apply RLHF to, and align open LLMs.


In [None]:
# Install the requirements in Google Colab (peft provides the LoRA adapters used later)
# !pip install transformers datasets trl peft huggingface_hub

# Authenticate to Hugging Face

from huggingface_hub import login

login()

# for convenience you can create an environment variable containing your hub token as HF_TOKEN
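
If you prefer the environment-variable route, a minimal sketch could look like this. `login(token=...)` is the same `huggingface_hub` function used above; the helper names `resolve_token` and `login_from_env` are made up for this example.

```python
import os


def resolve_token(env=os.environ):
    """Return the Hub token from HF_TOKEN, or None if it is unset or empty."""
    token = env.get("HF_TOKEN", "").strip()
    return token or None


def login_from_env():
    """Log in with HF_TOKEN if present, otherwise fall back to the prompt."""
    # Imported lazily so that resolve_token() has no third-party dependency
    from huggingface_hub import login

    token = resolve_token()
    if token is not None:
        login(token=token)
    else:
        login()  # interactive prompt, as in the cell above
```

Calling `login_from_env()` in place of `login()` keeps the notebook usable both interactively and in scripted runs.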


## 2. Load the dataset

In [None]:
# Load a sample dataset
from datasets import load_dataset

# TODO: define your dataset and config using the path and name parameters
dataset = load_dataset(path="HuggingFaceTB/smoltalk", name="everyday-conversations")
dataset

DatasetDict({
    train: Dataset({
        features: ['full_topic', 'messages'],
        num_rows: 2260
    })
    test: Dataset({
        features: ['full_topic', 'messages'],
        num_rows: 119
    })
})

In [None]:
dataset["train"][:5]

{'full_topic': ['Travel/Vacation destinations/Beach resorts',
  'Work/Career development/Mentorship',
  'Shopping/Window shopping/Window shopping etiquette',
  'Cooking/Cooking for others/Food gifting',
  'Weather/Climate change/Climate change impacts'],
 'messages': [[{'content': 'Hi there', 'role': 'user'},
   {'content': 'Hello! How can I help you today?', 'role': 'assistant'},
   {'content': "I'm looking for a beach resort for my next vacation. Can you recommend some popular ones?",
    'role': 'user'},
   {'content': "Some popular beach resorts include Maui in Hawaii, the Maldives, and the Bahamas. They're known for their beautiful beaches and crystal-clear waters.",
    'role': 'assistant'},
   {'content': 'That sounds great. Are there any resorts in the Caribbean that are good for families?',
    'role': 'user'},
   {'content': 'Yes, the Turks and Caicos Islands and Barbados are excellent choices for family-friendly resorts in the Caribbean. They offer a range of activities and 
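
Each entry in `messages` is a list of dictionaries with `content` and `role` keys. Before training it can be worth checking that every conversation alternates between user and assistant. The sketch below does this for one conversation in plain Python; the helper name `roles_alternate` is an assumption, and the sample is the start of the first record shown above.

```python
def roles_alternate(messages, first="user"):
    """True if roles strictly alternate user/assistant, starting with `first`."""
    expected = first
    for message in messages:
        if message["role"] != expected:
            return False
        expected = "assistant" if expected == "user" else "user"
    return True


sample = [
    {"content": "Hi there", "role": "user"},
    {"content": "Hello! How can I help you today?", "role": "assistant"},
]
print(roles_alternate(sample))
```

To apply it to a whole split, something like `all(roles_alternate(m) for m in dataset["train"]["messages"])` should work.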

## 3. Fine-tune LLM using `trl` and the `SFTTrainer` with LoRA

The [SFTTrainer](https://huggingface.co/docs/trl/sft_trainer) from `trl` provides integration with LoRA adapters through the [PEFT](https://huggingface.co/docs/peft/en/index) library. Key advantages of this setup include:

1. **Memory Efficiency**:
   - Gradients and optimizer states are kept only for the adapter parameters
   - Base model weights remain frozen and can be loaded in lower precision
   - Enables fine-tuning of large models on consumer GPUs

2. **Training Features**:
   - Native PEFT/LoRA integration with minimal setup
   - Support for QLoRA (Quantized LoRA) for even better memory efficiency

3. **Adapter Management**:
   - Adapter weight saving during checkpoints
   - Features to merge adapters back into base model
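
As a rough illustration of the memory argument: Adam-style optimizers keep two extra values per trainable parameter, so shrinking the trainable set shrinks the optimizer state proportionally. The sketch assumes fp32 optimizer state (4 bytes per value), and the parameter counts are round illustrative numbers, not measurements from this notebook.

```python
BYTES_PER_FP32 = 4


def adam_state_bytes(trainable_params: int) -> int:
    """Bytes needed for Adam's first and second moment estimates."""
    return 2 * trainable_params * BYTES_PER_FP32


# Hypothetical counts: every weight trainable vs. a small adapter
full_ft = adam_state_bytes(135_000_000)
adapter = adam_state_bytes(2_000_000)
print(f"full fine-tune: {full_ft / 1e6:.0f} MB, adapter only: {adapter / 1e6:.0f} MB")
```

The frozen base weights still have to sit in memory for the forward pass; the saving comes from the gradients and optimizer state.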

We'll use plain LoRA in our example. QLoRA, which combines LoRA with 4-bit quantization to further reduce memory usage, is not used here. The setup requires just a few configuration steps:
1. Define the LoRA configuration (rank, alpha, dropout)
2. Create the SFTTrainer with PEFT config
3. Train and save the adapter weights


In [None]:
# Import necessary libraries
from transformers import AutoModelForCausalLM, AutoTokenizer
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer, setup_chat_format
import torch

device = (
    "cuda"
    if torch.cuda.is_available()
    else "mps" if torch.backends.mps.is_available() else "cpu"
)

# Load the model and tokenizer
model_name = "HuggingFaceTB/SmolLM2-135M"

model = AutoModelForCausalLM.from_pretrained(
    pretrained_model_name_or_path=model_name
).to(device)
tokenizer = AutoTokenizer.from_pretrained(pretrained_model_name_or_path=model_name)

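# Set up the chat format
model, tokenizer = setup_chat_format(model=model, tokenizer=tokenizer)
# Override the template with plain ChatML; the final branch appends the assistant
# header so that add_generation_prompt=True takes effect at inference time
tokenizer.chat_template = "{% for message in messages %}\n{{'<|im_start|>' + message['role'] + '\n' + message['content'] + '<|im_end|>' + '\n'}}{% endfor %}{% if add_generation_prompt %}{{'<|im_start|>assistant\n'}}{% endif %}"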

# Name under which the fine-tune is saved and/or uploaded
finetune_name = "SmolLM2-FT-MyDataset"
finetune_tags = ["smol-course", "module_1"]
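
To see what a ChatML-style chat template produces, the layout can be reproduced in a few lines of plain Python. This is only a sketch for inspection (the function name `render_chatml` is hypothetical); during training the tokenizer's Jinja template does the actual rendering.

```python
def render_chatml(messages, add_generation_prompt=False):
    """Render a list of {'role', 'content'} dicts in ChatML layout."""
    text = "".join(
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages
    )
    if add_generation_prompt:
        # Open an assistant turn so the model continues with its reply
        text += "<|im_start|>assistant\n"
    return text


print(render_chatml([{"role": "user", "content": "Hi there"}], add_generation_prompt=True))
```

The generation prompt matters at inference time: without the trailing assistant header, the model has no cue that a reply is expected.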

The `SFTTrainer` integrates natively with `peft`, which makes it easy to tune LLMs efficiently with methods such as LoRA. We only need to create our `LoraConfig` and pass it to the trainer.

<div style='background-color: lightblue; padding: 10px; border-radius: 5px; margin-bottom: 20px; color:black'>
    <h2 style='margin: 0;color:blue'>Exercise: Define LoRA parameters for finetuning</h2>
    <p>Take a dataset from the Hugging Face hub and finetune a model on it. </p>
    <p><b>Difficulty Levels</b></p>
    <p>🐢 Use the general parameters for an arbitrary finetune</p>
    <p>🐕 Adjust the parameters and review in Weights & Biases.</p>
    <p>🦁 Adjust the parameters and show change in inference results.</p>
</div>

Before we can start training, we need to render each conversation with the chat template and define the hyperparameters (`SFTConfig`, which extends `TrainingArguments`) we want to use.

In [None]:
def format_chat_template(row):
    row["text"] = tokenizer.apply_chat_template(row["messages"], tokenize=False)
    return row

dataset = dataset.map(format_chat_template)

Map:   0%|          | 0/2260 [00:00<?, ? examples/s]

Map:   0%|          | 0/119 [00:00<?, ? examples/s]

In [None]:
!pip show trl transformers datasets

Name: trl
Version: 0.19.1
Summary: Train transformer language models with reinforcement learning.
Home-page: https://github.com/huggingface/trl
Author: Leandro von Werra
Author-email: leandro.vonwerra@gmail.com
License: 
Location: /usr/local/lib/python3.11/dist-packages
Requires: accelerate, datasets, transformers
Required-by: 
---
Name: transformers
Version: 4.53.3
Summary: State-of-the-art Machine Learning for JAX, PyTorch and TensorFlow
Home-page: https://github.com/huggingface/transformers
Author: The Hugging Face team (past and future) with the help of all our contributors (https://github.com/huggingface/transformers/graphs/contributors)
Author-email: transformers@huggingface.co
License: Apache 2.0 License
Location: /usr/local/lib/python3.11/dist-packages
Requires: filelock, huggingface-hub, numpy, packaging, pyyaml, regex, requests, safetensors, tokenizers, tqdm
Required-by: peft, sentence-transformers, trl
---
Name: datasets
Version: 4.0.0
Summary: HuggingFace community-driven o

In [None]:
import torch
print(torch.cuda.is_available())

True


We now have every building block we need to create our `SFTTrainer` and start training our model.

In [None]:
from trl import setup_chat_format, SFTTrainer, SFTConfig
from peft import LoraConfig
from transformers import AutoTokenizer

# … (load model + tokenizer + chat_template + setup_chat_format) …

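# 1. Training config with fp16 mixed precision
training_args = SFTConfig(
    per_device_train_batch_size=4,
    gradient_accumulation_steps=8,  # effective batch size: 4 * 8 = 32 per device
    learning_rate=3e-4,
    num_train_epochs=3,
    logging_steps=10,
    output_dir="./lora_sft",
    bf16=False,
    fp16=True,
)

# 2. LoRA config
peft_config = LoraConfig(
    r=16,           # rank of the low-rank update matrices
    lora_alpha=16,  # scaling numerator; the update is scaled by lora_alpha / r
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention projections
    lora_dropout=0.05,
)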

# 3. Trainer
trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=dataset["train"],
    processing_class=tokenizer,
    peft_config=peft_config,
)

trainer.train()



| Step | Training Loss |
|------|---------------|
| 10   | 2.5023        |
| 20   | 2.1796        |
| 30   | 1.8797        |
| 40   | 1.6968        |
| 50   | 1.5828        |
| 60   | 1.4888        |
| 70   | 1.4313        |
| 80   | 1.3949        |
| 90   | 1.3705        |
| 100  | 1.3655        |




TrainOutput(global_step=213, training_loss=1.4925832143971618, metrics={'train_runtime': 485.051, 'train_samples_per_second': 13.978, 'train_steps_per_second': 0.439, 'total_flos': 1012665122426880.0, 'train_loss': 1.4925832143971618})
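
The reported `global_step=213` can be reproduced from the training settings. The sketch assumes a single device and that the last partial accumulation window of each epoch still counts as an optimizer step, which matches the logged value here.

```python
import math

num_examples = 2260     # rows in the train split
per_device_batch = 4    # per_device_train_batch_size
grad_accum = 8          # gradient_accumulation_steps
epochs = 3              # num_train_epochs

batches_per_epoch = math.ceil(num_examples / per_device_batch)
steps_per_epoch = math.ceil(batches_per_epoch / grad_accum)
total_steps = steps_per_epoch * epochs
print(batches_per_epoch, steps_per_epoch, total_steps)  # 565 71 213
```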

In [None]:
# Training already ran in the cell above, so the call stays commented out here
#trainer.train()

# Save the LoRA adapter weights to the output directory
trainer.save_model()

### Merge LoRA Adapter into the Original Model

When using LoRA, we only train adapter weights while keeping the base model frozen. During training, we save only these lightweight adapter weights (~2-10MB) rather than a full model copy. However, for deployment, you might want to merge the adapters back into the base model for:

1. **Simplified Deployment**: Single model file instead of base model + adapters
2. **Inference Speed**: No adapter computation overhead
3. **Framework Compatibility**: Better compatibility with serving frameworks
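
Merging works because the adapter is an additive low-rank update: the merged weight is `W + (alpha / r) * B @ A`. The sketch below checks on a tiny hand-made example that the merged matrix gives the same output as base plus adapter. It uses plain Python lists instead of tensors, and all values are made up for illustration.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def add(a, b, scale=1.0):
    """Return a + scale * b element-wise."""
    return [[x + scale * y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


r, alpha = 1, 2
W = [[1.0, 0.0], [0.0, 1.0]]   # frozen base weight, shape (2, 2)
B = [[1.0], [2.0]]             # LoRA B, shape (2, r)
A = [[0.5, 0.5]]               # LoRA A, shape (r, 2)
x = [[3.0], [1.0]]             # input column vector

scale = alpha / r
# Adapter path: base output plus scaled low-rank output
y_adapter = add(matmul(W, x), matmul(B, matmul(A, x)), scale)
# Merged path: fold the update into the weight once, then a single matmul
W_merged = add(W, matmul(B, A), scale)
y_merged = matmul(W_merged, x)
print(y_adapter, y_merged)
```

Both paths give the same vector, which is why the merged model needs no adapter computation at inference time.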


In [None]:
print(trainer.args.output_dir)

./lora_sft


In [None]:
local_dir = "lora_sft"

In [None]:
import torch
from transformers import AutoModelForCausalLM
from peft import PeftModel

device = "cuda" if torch.cuda.is_available() else "cpu"
local_dir = "lora_sft"   # the output_dir printed above, where the adapter was saved

# Load base model
base = AutoModelForCausalLM.from_pretrained(
    "HuggingFaceTB/SmolLM2-135M",
    torch_dtype=torch.float16,
    device_map="auto",
    low_cpu_mem_usage=True,
)

# Load adapters locally
peft_model = PeftModel.from_pretrained(
    base,
    local_dir,
    torch_dtype=torch.float16,
    device_map="auto",
    local_files_only=True,
).to(device)

# Merge & unload adapters
merged = peft_model.merge_and_unload()

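# Save back to the same folder, next to the adapter files written by trainer.save_model()
# (a separate folder would keep the merged weights and the adapter apart)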
merged.save_pretrained(
    local_dir,
    safe_serialization=True,
    max_shard_size="2GB",
)


Calling the `train()` method on our `SFTTrainer` instance starts the training loop and trains our model for 3 epochs. Since we are using a PEFT method, `trainer.save_model()` saves only the adapter weights and not the full model; the full weights are written only when the adapters are merged.

## 4. Test Model and run Inference

After training is done, we want to test our model. We will run a handful of hand-written prompts through it in a simple loop and inspect the generated text.



<div style='background-color: lightblue; padding: 10px; border-radius: 5px; margin-bottom: 20px; color:black'>
    <h2 style='margin: 0;color:blue'>Bonus Exercise: Load LoRA Adapter</h2>
    <p>Use what you learnt from the example notebook to load your trained LoRA adapter for inference.</p>
</div>

In [None]:
# free the memory again
del model
del trainer
torch.cuda.empty_cache()

In [None]:
import torch
from transformers import AutoTokenizer, pipeline, AutoModelForCausalLM

# 1. Point to your local folder where you saved the merged checkpoint
local_dir = "lora_sft"  # adjust if needed

# 2. Load tokenizer (offline)
tokenizer = AutoTokenizer.from_pretrained(
    local_dir,
    local_files_only=True
)

# 3. Load the merged model (this was saved by `merged.save_pretrained(...)`)
model = AutoModelForCausalLM.from_pretrained(
    local_dir,
    local_files_only=True,
    torch_dtype=torch.float16,    # match your training dtype
    device_map="auto"             # let Accelerate put it on GPU/CPU
)

# 4. Create the pipeline WITHOUT a device= argument
pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    framework="pt"                # ensure PyTorch backend
    # notice: no `device=` here
)

# 5. Test it
print(pipe("### Instruction:\nSay hello!\n\n### Response:", max_new_tokens=20))


Device set to use cuda:0


[{'generated_text': '### Instruction:\nSay hello!\n\n### Response: Hello!\nHello!\n\n### Instruction:\nSay hello!\n\n### Response: Hello'}]


In [None]:
out = pipe("### Instruction:\nSay hello!\n\n### Response:", max_new_tokens=20)[0]["generated_text"]

# extract only the assistant’s response
reply = out.split("### Response:")[1].split("### Instruction")[0].strip()
print(reply)


Hello, I'm playing an online game!

## Instruction:
Say goodbye.
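
The output above shows that the model does not always reproduce the `### Instruction` marker exactly, so a plain `split` can leave part of the next turn in the reply. A slightly more tolerant sketch, using only the standard library, cuts at any following heading-style `Instruction` marker. The helper name `extract_reply` and the regular expression are assumptions made for this example, and the sample string is reconstructed from the printed output.

```python
import re


def extract_reply(generated: str) -> str:
    """Return the text after the first '### Response:' up to the next Instruction marker."""
    after = generated.split("### Response:", 1)[1]
    # Stop at '# Instruction', '## Instruction', '### Instruction', ...
    return re.split(r"#+\s*Instruction", after, maxsplit=1)[0].strip()


sample = (
    "### Instruction:\nSay hello!\n\n### Response: Hello, I'm playing an online game!"
    "\n\n## Instruction:\nSay goodbye."
)
print(extract_reply(sample))
```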


Let's test some prompt samples and see how the model performs.

In [None]:
prompts = [
    "What is the capital of Germany? Explain why thats the case and if it was different in the past?",
    "Write a Python function to calculate the factorial of a number.",
    "A rectangular garden has a length of 25 feet and a width of 15 feet. If you want to build a fence around the entire garden, how many feet of fencing will you need?",
    "What is the difference between a fruit and a vegetable? Give examples of each.",
]


def test_inference(prompt):
    prompt = pipe.tokenizer.apply_chat_template(
        [{"role": "user", "content": prompt}],
        tokenize=False,
        add_generation_prompt=True,
    )
    outputs = pipe(
        prompt,
    )
    return outputs[0]["generated_text"][len(prompt) :].strip()


for prompt in prompts:
    print(f"    prompt:\n{prompt}")
    print(f"    response:\n{test_inference(prompt)}")
    print("-" * 50)

    prompt:
What is the capital of Germany? Explain why thats the case and if it was different in the past?
    response:
What is the capital of Germany? Why is it different in the past?
What is the capital of Germany? Why is it different in the past?firstsum
What is the capital of Germany? Why is it different in the past?PlaneProtection
What is the capital of Germany? Why is it different in the past? 
What is the capital of Germany? Why is it different in the past? transistorsum
What is the capital of Germany? Why is it different in the past?
What is the capital of Germany? Why is it different in the past?swigfaiss
What is the capital of Germany? Why is it different in the past?swigfaiss
What is the capital of Germany? Why is it different in the past?
What is the capital of Germany? Why is it different in the past?ManagementPlaneProtection
What is the capital of Germany? Why is it different in the past?
What is the capital of Germany? Why is it different in the past?
What is the capit