# Supervised Fine-Tuning (SFT) Qwen3-VL with QLoRA using TRL

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/trl/blob/main/examples/notebooks/sft_qwen_vl.ipynb)

![trl banner](https://huggingface.co/datasets/trl-lib/documentation-images/resolve/main/trl_banner_dark.png)

## Install dependencies

We'll install **TRL** with the **PEFT** extra, which ensures all main dependencies such as **Transformers** and **PEFT** (a package for parameter-efficient fine-tuning, e.g., LoRA/QLoRA) are included. Additionally, we'll install **trackio** to log and monitor our experiments, and **bitsandbytes** to enable quantization of LLMs, reducing memory consumption for both inference and training.

In [None]:
import sys
!{sys.executable} -m pip install -U "trl[peft]" trackio bitsandbytes

### Log in to Hugging Face


In [1]:
from huggingface_hub import notebook_login

notebook_login()


## Load dataset


We'll load the [**Mozilla/flickr30k-transformed-captions-gpt4o**](https://huggingface.co/datasets/Mozilla/flickr30k-transformed-captions-gpt4o) dataset from the Hugging Face Hub using the `datasets` library.

As its name indicates, this dataset pairs Flickr30k images with captions transformed by GPT-4o (the `alt_text` field used below). We use a processed version here for convenience. For details on configuring your own multimodal dataset for SFT training, see the [docs](https://huggingface.co/docs/trl/en/sft_trainer#training-vision-language-models). Fine-tuning Qwen3-VL on it helps refine its response style and visual understanding.

In [None]:
from datasets import load_dataset

dataset_name = "Mozilla/flickr30k-transformed-captions-gpt4o"
dataset = load_dataset(dataset_name, split="test[:100%]")


Each row carries a `split` column, so we keep only the rows marked `train`:

In [None]:
train_dataset = dataset.filter(lambda ex: ex["split"] == "train")

In [None]:
from datasets import Dataset

def build_qwen_vl_dataset(dataset, prompt="Describe this image."):
    examples = []

    for ex in dataset:
        examples.append({
            "messages": [
                {
                    "role": "user",
                    "content": [
                        {"type": "image"},
                        {"type": "text", "text": prompt}
                    ]
                },
                {
                    "role": "assistant",
                    "content": [
                        {"type": "text", "text": ex["alt_text"]}
                    ]
                }
            ],
            "images": [ex["image"]]   # PIL image
        })

    return Dataset.from_list(examples)
    
train_dataset = build_qwen_vl_dataset(train_dataset)

In [None]:
train_dataset

Dataset({
    features: ['messages', 'images'],
    num_rows: 29000
})
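
Each row built above has a `messages` list and an `images` list, with one `{"type": "image"}` placeholder marking where the image belongs in the user turn. A minimal sketch of a sanity check on that structure follows. The helper names `count_image_placeholders` and `check_example` are hypothetical, and a string stands in for the PIL image so the snippet runs without downloading the dataset.

```python
def count_image_placeholders(messages):
    """Count {"type": "image"} entries across all message contents."""
    return sum(
        1
        for message in messages
        for part in message["content"]
        if part["type"] == "image"
    )


def check_example(example):
    """Raise ValueError if placeholders and images do not line up."""
    n_placeholders = count_image_placeholders(example["messages"])
    n_images = len(example["images"])
    if n_placeholders != n_images:
        raise ValueError(
            f"{n_placeholders} image placeholder(s) but {n_images} image(s)"
        )
    return True


# A stand-in row: a string takes the place of the PIL image object.
sample = {
    "messages": [
        {"role": "user", "content": [
            {"type": "image"},
            {"type": "text", "text": "Describe this image."},
        ]},
        {"role": "assistant", "content": [
            {"type": "text", "text": "A placeholder caption."},
        ]},
    ],
    "images": ["<PIL image>"],
}

print(check_example(sample))
```

On the real data, the same check could be run on a single row with `check_example(train_dataset[0])`.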

## Load model and configure LoRA/QLoRA

This notebook can be used with two fine-tuning methods: standard **LoRA** and **QLoRA**. The cell below loads the model without quantization, so as written it trains with standard LoRA. To use QLoRA instead, build a `BitsAndBytesConfig` and pass it to `from_pretrained` through the `quantization_config` argument.

In [None]:
from transformers import Qwen3VLForConditionalGeneration, BitsAndBytesConfig
import torch

model_name = "Qwen/Qwen3-VL-2B-Instruct"

model = Qwen3VLForConditionalGeneration.from_pretrained(
    model_name,
    dtype="auto",
    device_map="cuda"
)
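
To switch the load above to QLoRA, the model can be loaded with 4-bit weights. The sketch below is one possible setup, not the notebook's recorded configuration. The function name `load_model_4bit` is hypothetical, and the NF4, double-quantization, and bfloat16 compute settings are common choices assumed here for illustration. It requires a CUDA GPU and the `bitsandbytes` package installed earlier.

```python
def load_model_4bit(model_name="Qwen/Qwen3-VL-2B-Instruct"):
    """Load the base model with 4-bit NF4 weights for QLoRA (needs a CUDA GPU)."""
    # Imports live inside the function so that defining it stays cheap;
    # nothing is downloaded or moved to the GPU until the function is called.
    import torch
    from transformers import BitsAndBytesConfig, Qwen3VLForConditionalGeneration

    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,                      # store linear-layer weights in 4 bits
        bnb_4bit_quant_type="nf4",              # NormalFloat4 quantization (assumed choice)
        bnb_4bit_use_double_quant=True,         # also quantize the quantization constants
        bnb_4bit_compute_dtype=torch.bfloat16,  # dtype used for the actual matmuls
    )
    return Qwen3VLForConditionalGeneration.from_pretrained(
        model_name,
        dtype="auto",
        device_map="cuda",
        quantization_config=bnb_config,
    )
```

Calling `model = load_model_4bit()` would take the place of the unquantized load; the LoRA configuration and trainer cells that follow stay the same.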

In [7]:
model

Qwen3VLForConditionalGeneration(
  (model): Qwen3VLModel(
    (visual): Qwen3VLVisionModel(
      (patch_embed): Qwen3VLVisionPatchEmbed(
        (proj): Conv3d(3, 1024, kernel_size=(2, 16, 16), stride=(2, 16, 16))
      )
      (pos_embed): Embedding(2304, 1024)
      (rotary_pos_emb): Qwen3VLVisionRotaryEmbedding()
      (blocks): ModuleList(
        (0-23): 24 x Qwen3VLVisionBlock(
          (norm1): LayerNorm((1024,), eps=1e-06, elementwise_affine=True)
          (norm2): LayerNorm((1024,), eps=1e-06, elementwise_affine=True)
          (attn): Qwen3VLVisionAttention(
            (qkv): Linear(in_features=1024, out_features=3072, bias=True)
            (proj): Linear(in_features=1024, out_features=1024, bias=True)
          )
          (mlp): Qwen3VLVisionMLP(
            (linear_fc1): Linear(in_features=1024, out_features=4096, bias=True)
            (linear_fc2): Linear(in_features=4096, out_features=1024, bias=True)
            (act_fn): GELUTanh()
          )
        )
      )
 

The following cell defines the LoRA configuration, which applies equally to QLoRA. When training with LoRA/QLoRA, the **base model** selected above keeps its original weights frozen; we train only a **LoRA adapter**, a small set of low-rank matrices added alongside the chosen layers, which keeps training efficient and memory-friendly. **`target_modules`** specifies which submodules (here the attention and MLP projection layers) LoRA adapts during fine-tuning.

In [None]:
from peft import LoraConfig

peft_config = LoraConfig(
    r=32,
    lora_alpha=32,
    target_modules=['down_proj','o_proj','k_proj','q_proj','gate_proj','up_proj','v_proj'],
)
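
To make the configuration above more concrete, here is a small pure-Python sketch of two ideas: how a list of `target_modules` selects layers by the last component of their name, and how many parameters a rank-`r` adapter adds to one linear layer. The helper names are hypothetical. The language-model module name and the 2048 x 2048 layer shape are assumptions for illustration, since that part of the model printout is cut off. The vision-tower names come from the printout.

```python
TARGET_MODULES = ["down_proj", "o_proj", "k_proj", "q_proj", "gate_proj", "up_proj", "v_proj"]


def is_targeted(module_name, targets=TARGET_MODULES):
    """Suffix rule: the last path component must equal one of the targets."""
    return any(
        module_name == target or module_name.endswith("." + target)
        for target in targets
    )


def lora_extra_params(in_features, out_features, r):
    """Parameters added by one adapter: A is (r x in), B is (out x r)."""
    return r * (in_features + out_features)


# Vision-tower names taken from the printed model above.
print(is_targeted("model.visual.blocks.0.attn.proj"))  # False: "proj" is not "o_proj"
print(is_targeted("model.visual.blocks.0.attn.qkv"))   # False
# Hypothetical language-model name (that part of the printout is cut off).
print(is_targeted("model.language_model.layers.0.self_attn.q_proj"))  # True

# Hypothetical 2048 x 2048 projection with the notebook's r=32.
print(lora_extra_params(2048, 2048, r=32))  # 131072, versus 4194304 frozen weights
print(32 / 32)  # scaling factor lora_alpha / r = 1.0
```

Under this rule, the vision-block layers visible in the printout (`qkv`, `proj`, `linear_fc1`, `linear_fc2`) are left untouched by this configuration, so adapting them would mean adding their names to `target_modules`.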

## Train model

We'll configure **SFT** using `SFTConfig`. Adjust these settings to match your hardware: with `per_device_train_batch_size=32`, the run recorded below peaked at about 28 GB of reserved GPU memory on an RTX 5090, so lower the batch size (and raise `gradient_accumulation_steps` to compensate) on smaller GPUs such as a free Colab instance. For full details on all available parameters, check the [TRL SFTConfig documentation](https://huggingface.co/docs/trl/sft_trainer#trl.SFTConfig).

In [None]:
from trl import SFTConfig

output_dir = "Qwen3-VL-2B-Instruct-trl-sft"

# Configure training arguments using SFTConfig
training_args = SFTConfig(
    # Training schedule / optimization
    num_train_epochs=1,                                   # Number of passes over the dataset
    # max_steps=10,                                       # Cap on total optimizer steps for a quick test run; overrides `num_train_epochs` when set
    per_device_train_batch_size=32,                       # Batch size per GPU/CPU
    gradient_accumulation_steps=1,                        # No accumulation → effective batch size = 32 * 1 = 32 per device
    warmup_steps=5,                                       # Gradually increase LR during first N steps
    learning_rate=2e-4,                                   # Learning rate for the optimizer
    optim="adamw_torch",                                  # Optimizer
    max_length=None,                                      # For VLMs, truncating may remove image tokens, leading to errors during training. max_length=None avoids it

    # Logging / reporting
    output_dir=output_dir,                                # Where to save model checkpoints and logs
    logging_steps=20,                                     # Log training metrics every N steps
    report_to="trackio",                                  # Experiment tracking tool

    # Hub integration
    push_to_hub=True,
)
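
The batch-size settings above determine how many optimizer steps one epoch takes. The arithmetic can be sketched as follows, assuming a single GPU and that the final partial batch is kept. The helper names are hypothetical, and the exact rounding of a trailing partial accumulation group may differ between Trainer versions.

```python
import math


def effective_batch_size(per_device_batch_size, grad_accum_steps=1, num_devices=1):
    """Samples contributing to each optimizer step."""
    return per_device_batch_size * grad_accum_steps * num_devices


def steps_per_epoch(num_rows, per_device_batch_size, grad_accum_steps=1, num_devices=1):
    """Approximate optimizer steps needed for one pass over the data."""
    batches = math.ceil(num_rows / (per_device_batch_size * num_devices))
    return math.ceil(batches / grad_accum_steps)


# Values from this notebook: 29000 training rows, batch size 32, no accumulation.
print(effective_batch_size(32, 1))  # 32
print(steps_per_epoch(29000, 32))   # 907
```

Halving `per_device_train_batch_size` to 16 while setting `gradient_accumulation_steps=2` keeps the effective batch size at 32 and lowers the per-step memory footprint.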

Configure the SFT trainer by passing the previously defined `training_args`. We skip the evaluation dataset to keep memory usage low, but you can add one through the `eval_dataset` argument.

In [10]:
from trl import SFTTrainer

trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    peft_config=peft_config,
)

Show memory stats before training

In [11]:
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)

print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")

GPU = NVIDIA GeForce RTX 5090. Max memory = 31.357 GB.
4.096 GB of memory reserved.


And train!

In [12]:
trainer_stats = trainer.train()

The tokenizer has new PAD/BOS/EOS tokens that differ from the model config and generation config. The model config and generation config were aligned accordingly, being updated with the tokenizer's values. Updated tokens: {'eos_token_id': 151645, 'bos_token_id': None, 'pad_token_id': 151643}.


* Trackio project initialized: huggingface
* Trackio metrics will be synced to Hugging Face Dataset: leinms/trackio-dataset
* Found existing space: https://huggingface.co/spaces/leinms/trackio
* View dashboard by going to: https://leinms-trackio.hf.space/


* Created new run: leinms-1764707047




| Step | Training Loss |
|-----:|--------------:|
| 20 | 13.3033 |
| 40 | 6.9571 |
| 60 | 6.8328 |
| 80 | 6.8115 |
| 100 | 6.8229 |
| 120 | 6.8129 |
| 140 | 6.8064 |
| 160 | 6.8081 |
| 180 | 6.804 |
| 200 | 6.791 |




* Run finished. Uploading logs to Trackio (please wait...)


In [13]:
!nvidia-smi

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


Tue Dec  2 20:45:44 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.76.05              Driver Version: 580.76.05      CUDA Version: 13.0     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  NVIDIA GeForce RTX 5090        On  |   00000000:81:00.0 Off |                  N/A |
| 30%   47C    P8             26W /  575W |   28240MiB /  32607MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+----------------------------------------------

Show memory stats after training

In [14]:
used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
used_percentage = round(used_memory / max_memory * 100, 3)
lora_percentage = round(used_memory_for_lora / max_memory * 100, 3)

print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
print(f"{round(trainer_stats.metrics['train_runtime']/60, 2)} minutes used for training.")
print(f"Peak reserved memory = {used_memory} GB.")
print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
print(f"Peak reserved memory % of max memory = {used_percentage} %.")
print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")

1230.6506 seconds used for training.
20.51 minutes used for training.
Peak reserved memory = 28.24 GB.
Peak reserved memory for training = 24.144 GB.
Peak reserved memory % of max memory = 90.06 %.
Peak reserved memory for training % of max memory = 76.997 %.


## Save fine-tuned model

In this step, we save the fine-tuned model both **locally** and to the **Hugging Face Hub** using the credentials from your account.

In [15]:
trainer.save_model(output_dir)
trainer.push_to_hub(dataset_name=dataset_name)

No files have been modified since last commit. Skipping to prevent empty commit.


CommitInfo(commit_url='https://huggingface.co/leinms/Qwen3-VL-2B-Instruct-trl-sft/commit/cd93df554c0f8744bcc887d0a1964fb9edbc91f2', commit_message='End of training', commit_description='', oid='cd93df554c0f8744bcc887d0a1964fb9edbc91f2', pr_url=None, repo_url=RepoUrl('https://huggingface.co/leinms/Qwen3-VL-2B-Instruct-trl-sft', endpoint='https://huggingface.co', repo_type='model', repo_id='leinms/Qwen3-VL-2B-Instruct-trl-sft'), pr_revision=None, pr_num=None)