<a href="https://colab.research.google.com/github/Himkeshtak/VLM-OpenCV-Course/blob/main/DPO_config_Medium.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Newer Alignment Techniques for Vision Language Models

While Supervised Fine-Tuning (SFT) is effective for teaching models to follow instructions, it relies on a single, static “correct” answer for each example. A more nuanced alignment technique, Direct Preference Optimisation (DPO), trains the model by learning from human preferences instead. This approach is particularly effective for refining model behaviour in areas like helpfulness, safety, and style, which are not easily captured by a single ground-truth label.

The core principle of DPO is to fine-tune the model on a dataset of comparisons. For a given input (image + prompt), the model is shown two responses: one that a human annotator “chose” as superior and one that was “rejected.” The training objective is to increase the likelihood of generating the chosen response while decreasing the likelihood of generating the rejected one.
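
To make the objective concrete, the standard DPO loss for a single preference pair can be written as a small function. This is a minimal sketch in plain Python, not the TRL implementation: the name dpo_loss, the argument names, and the default beta=0.1 are illustrative assumptions. The inputs are the summed log-probabilities of each response under the model being trained (the policy) and under a frozen reference model.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Sigmoid DPO loss for one (chosen, rejected) pair of responses."""
    # How much more (or less) likely each response is under the policy
    # than under the reference model, in log space.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(x)) == log(1 + exp(-x)), written in a numerically stable form.
    return max(-logits, 0.0) + math.log1p(math.exp(-abs(logits)))


# Illustrative log-probabilities (made up for this example).
loss_at_start = dpo_loss(-10.0, -12.0, -10.0, -12.0)  # policy equals reference
loss_improved = dpo_loss(-8.0, -14.0, -10.0, -12.0)   # chosen up, rejected down
print(loss_at_start, loss_improved)
```

When the policy matches the reference, the loss equals log 2; it falls as the policy raises the chosen response and lowers the rejected one relative to the reference.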

This requires a specially formatted preference dataset. A prominent example is RLAIF-V, which contains over 83,000 samples structured for this purpose; as its name suggests, its preference labels come from AI feedback rather than from human annotators. Each entry includes the components that the DPOTrainer in the TRL library expects: a list of images, a prompt, a chosen response, and a rejected response.

Below is an example of a single sample from a DPO dataset, demonstrating its structure:

{'images': [<PIL.JpegImagePlugin.JpegImageFile image mode=L size=980x812 at 0x154505570>],
 'prompt': [ { "content": [ { "text": null, "type": "image" }, { "text": "What should this catcher be using?", "type": "text" } ], "role": "user" } ],
 'rejected': [ { "content": [ { "text": "The catcher, identified by the number...", "type": "text" } ], "role": "assistant" } ],
 'chosen': [ { "content": [ { "text": "The catcher in the image should be using a baseball glove...", "type": "text" } ], "role": "assistant" } ]}
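
If your own preference data starts out as plain strings, a small helper can arrange it in the conversational layout shown above. This is a sketch rather than part of TRL: to_preference_sample is a hypothetical name, the example strings are made up, and the image is passed through untouched (in practice it would be a PIL image).

```python
def to_preference_sample(image, question, chosen_text, rejected_text):
    """Build one preference sample in the images/prompt/chosen/rejected layout."""

    def assistant_turn(text):
        return [{"role": "assistant",
                 "content": [{"type": "text", "text": text}]}]

    return {
        "images": [image],
        "prompt": [{
            "role": "user",
            "content": [
                {"type": "image", "text": None},  # placeholder for the image
                {"type": "text", "text": question},
            ],
        }],
        "chosen": assistant_turn(chosen_text),
        "rejected": assistant_turn(rejected_text),
    }


sample = to_preference_sample(
    image=None,  # stand-in for a PIL image in this illustration
    question="What should this catcher be using?",
    chosen_text="A baseball glove.",
    rejected_text="A tennis racket.",
)
print(sorted(sample))
```
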

Once the dataset is prepared, you can use the DPOConfig and DPOTrainer classes from the trl library to configure and launch the fine-tuning process.
First install the required libraries, then define the training configuration with DPOConfig:

In [2]:
pip install -U trl transformers datasets peft accelerate bitsandbytes

In [3]:
from trl import DPOConfig

training_args = DPOConfig(
    output_dir="smolvlm-instruct-trl-dpo-rlaif-v",
    bf16=True,  # requires a GPU with bfloat16 support
    gradient_checkpointing=True,  # trade extra compute for lower memory use
    per_device_train_batch_size=1,
    per_device_eval_batch_size=1,
    gradient_accumulation_steps=32,  # 32 samples per optimizer step per device
    num_train_epochs=5,
    dataset_num_proc=8,  # tokenization will use 8 processes
    dataloader_num_workers=8,
    logging_steps=10,
    report_to="tensorboard",
    push_to_hub=True,  # requires being logged in to the Hugging Face Hub
    save_steps=10,
    save_total_limit=1,  # keep only the most recent checkpoint
    eval_steps=10,  # steps interval for evaluation
    eval_strategy="steps",
)
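
The batch-related settings above interact: with a per-device batch size of 1 and 32 gradient accumulation steps, each optimizer step sees 32 preference pairs per device. The arithmetic below is a rough sketch. The single-GPU setup and the training set size of 83,000 are assumptions for illustration; the real numbers depend on your hardware and train/test split, and the trainer's exact rounding may differ.

```python
import math

per_device_train_batch_size = 1
gradient_accumulation_steps = 32
num_train_epochs = 5
num_devices = 1             # assumption: a single GPU
num_train_samples = 83_000  # assumption: approximate training set size

# Number of preference pairs that contribute to each optimizer step.
effective_batch_size = (
    per_device_train_batch_size * gradient_accumulation_steps * num_devices
)
steps_per_epoch = math.ceil(num_train_samples / effective_batch_size)
total_steps = steps_per_epoch * num_train_epochs

print(effective_batch_size)  # 32
print(steps_per_epoch)       # 2594
print(total_steps)           # 12970
```

Under these assumptions, save_steps=10 and eval_steps=10 trigger a checkpoint and an evaluation pass every 320 training pairs, so larger intervals may be worth considering for a full run.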

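Note: on a runtime without a bf16-capable GPU (for example, a CPU-only Colab session), this cell fails with:
ValueError: Your setup doesn't support bf16/gpu.
Switch to a GPU runtime that supports bfloat16, or set bf16=False.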
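
One way to sidestep the bf16 ValueError above is to choose the precision flags at runtime instead of hard-coding bf16=True. The helper below is a sketch: pick_precision is a hypothetical name, and it relies on torch.cuda.is_available() and torch.cuda.is_bf16_supported(). The resulting dictionary could then be unpacked into the configuration, for example DPOConfig(output_dir=..., **precision_flags).

```python
def pick_precision():
    """Return bf16/fp16 flags that match the current hardware."""
    try:
        import torch
    except ImportError:
        # Without PyTorch there is no GPU support to detect.
        return {"bf16": False, "fp16": False}

    if torch.cuda.is_available() and torch.cuda.is_bf16_supported():
        return {"bf16": True, "fp16": False}
    if torch.cuda.is_available():
        # Older GPUs: fall back to float16 mixed precision.
        return {"bf16": False, "fp16": True}
    # CPU-only runtime: train in full precision.
    return {"bf16": False, "fp16": False}


precision_flags = pick_precision()
print(precision_flags)
```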

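To train your model using DPOTrainer, you can optionally provide a reference model: a frozen copy of the starting model whose log-probabilities are compared with those of the model being trained to compute the reward difference. If you’re using Parameter-Efficient Fine-Tuning (PEFT), you may omit the reference model by setting ref_model=None; the trainer then uses the base model with the adapters disabled as the reference.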
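
The reward difference can be illustrated without any model. DPO treats beta times the log-probability ratio between the trained model and the reference model as an implicit reward for a response. The sketch below uses made-up log-probabilities and hypothetical function names to show how a reward margin and a preference accuracy could be computed over a batch; it mirrors the idea, not the TRL internals.

```python
def implicit_reward(policy_logp, ref_logp, beta=0.1):
    """Implicit DPO reward: scaled log-ratio of policy to reference."""
    return beta * (policy_logp - ref_logp)


def reward_stats(pairs, beta=0.1):
    """Return (mean reward margin, fraction of pairs where chosen wins)."""
    margins = []
    for p in pairs:
        chosen = implicit_reward(p["policy_chosen"], p["ref_chosen"], beta)
        rejected = implicit_reward(p["policy_rejected"], p["ref_rejected"], beta)
        margins.append(chosen - rejected)
    accuracy = sum(m > 0 for m in margins) / len(margins)
    return sum(margins) / len(margins), accuracy


# Made-up summed log-probabilities for two preference pairs.
batch = [
    {"policy_chosen": -8.0, "ref_chosen": -10.0,
     "policy_rejected": -14.0, "ref_rejected": -12.0},  # chosen is favoured
    {"policy_chosen": -11.0, "ref_chosen": -10.0,
     "policy_rejected": -9.0, "ref_rejected": -10.0},   # rejected is favoured
]
mean_margin, accuracy = reward_stats(batch)
print(mean_margin, accuracy)
```

A positive margin means the trained model prefers the chosen response more strongly than the reference does, which is the direction training pushes in.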

In [None]:
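from trl import DPOTrainer

# model, processor, train_dataset, test_dataset, and peft_config are
# assumed to have been defined in earlier cells.
trainer = DPOTrainer(
    model=model,
    ref_model=None,  # with PEFT, no separate reference model is needed
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
    peft_config=peft_config,
    processing_class=processor,  # named `tokenizer` in older TRL releases
)
trainer.train()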