## Fine-tuning Llama 3.2 Vision using Trainer on ROCm

🚨 **WARNING**: This notebook is derived from [huggingface-llama-recipes](https://github.com/huggingface/huggingface-llama-recipes/blob/main/fine_tune/Llama-Vision%20FT.ipynb).

In this recipe, we’ll demonstrate how to fine-tune a [Vision Language Model (VLM)](https://huggingface.co/blog/vlms) using the Hugging Face ecosystem.


The Transformers `Trainer` API makes it easy to fine-tune Llama-Vision models. You can also use parameter-efficient fine-tuning (PEFT) techniques out of the box, thanks to their integration with Transformers. Make sure you have the latest version of Transformers installed.


For educational purposes, we will fine-tune the model on a small split of the VQAv2 dataset, which consists of images, questions about the images, and short answers. If you prefer, you can also use a dataset in which each example contains multiple conversation turns.


# 1. Install Dependencies

Let’s start by installing the essential libraries we’ll need for fine-tuning! 🚀

We recommend using the official ROCm prebuilt Docker images with the framework pre-installed. Refer to the [ROCm documentation](https://rocm.docs.amd.com/projects/install-on-linux/en/latest/install/3rd-party/pytorch-install.html#using-docker-with-pytorch-pre-installed).

In the Docker container, check that ROCm-capable accelerators are available by running the following code.

In [1]:
import torch
print("Is a ROCm-GPU detected? ", torch.cuda.is_available())
print("How many ROCm-GPUs are detected? ", torch.cuda.device_count())

Is a ROCm-GPU detected?  True
How many ROCm-GPUs are detected?  8
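
If you want a little more detail than the two checks above, the following is a minimal sketch that also reports the HIP version of the PyTorch build and the name of each visible device. The helper name `describe_accelerators` is hypothetical, and the function deliberately returns an empty summary when PyTorch is not installed, so it can be run outside the container as well.

```python
def describe_accelerators():
    """Summarize the PyTorch build and the GPUs it can see."""
    try:
        import torch
    except ImportError:
        # PyTorch is missing, e.g. when running outside the ROCm container.
        return {"torch_installed": False, "hip_version": None, "devices": []}

    # torch.version.hip is a version string on ROCm builds and None otherwise.
    hip_version = getattr(torch.version, "hip", None)
    # ROCm devices are exposed through the torch.cuda namespace.
    devices = [
        torch.cuda.get_device_name(i) for i in range(torch.cuda.device_count())
    ]
    return {"torch_installed": True, "hip_version": hip_version, "devices": devices}


print(describe_accelerators())
```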


In [None]:
# Option 1: install `bitsandbytes` from source code for ROCm 6.0+.
# Use -DBNB_ROCM_ARCH to target a specific GPU architecture.
# `%cd` is used instead of `!cd` because each `!` command runs in its own
# subshell, so a `!cd` would not carry over to the following lines.
!git clone --recurse https://github.com/ROCm/bitsandbytes.git
%cd bitsandbytes
!git checkout rocm_enabled_multi_backend
!pip install -r requirements-dev.txt
!cmake -DBNB_ROCM_ARCH="gfx942" -DCOMPUTE_BACKEND=hip -S .
# Build the native library configured by cmake before packaging it.
!make
!python setup.py install
%cd ..

# Option 2 (alternative to building from source): install `bitsandbytes` from a prebuilt wheel.
# To avoid reinstalling the dependencies of bitsandbytes, append the `--no-deps` flag.
!pip install --force-reinstall 'https://github.com/bitsandbytes-foundation/bitsandbytes/releases/download/continuous-release_multi-backend-refactor/bitsandbytes-0.44.1.dev0-py3-none-manylinux_2_24_x86_64.whl'

# To leverage the SFTTrainer in TRL for model fine-tuning.
!pip install trl

# To leverage PEFT for efficiently adapting pre-trained language models.
!pip install peft

# Install the other dependencies.
!pip install transformers datasets huggingface-hub scipy ipywidgets wandb accelerate

# Tested with transformers==4.47.0, trl==0.12.0, datasets==3.1.0, bitsandbytes==0.44.1.dev0+9315692, peft==0.13.2, qwen-vl-utils==0.0.8, wandb==0.19.1, accelerate==1.1.1, ipywidgets==8.1.5

We must authenticate ourselves before downloading the model.

In [None]:
from huggingface_hub import notebook_login
notebook_login()

# 2. Fine-Tune the Model using PEFT

In [1]:
from datasets import load_dataset

ds = load_dataset("merve/vqav2-small", split="validation[:10%]")

In [2]:
ds

Dataset({
    features: ['multiple_choice_answer', 'question', 'image'],
    num_rows: 2144
})

We can now initialize the model and the processor; the processor will be used in our preprocessing function. We will load the 11B variant of the vision model.

The Llama authors encourage freezing the text decoder and training only the image encoder. To try this, set `FREEZE_LLM` to `True`. If your task is highly domain-specific, however, you may want to avoid this. In that case, you can either use LoRA training (set `USE_LORA` to `True`) or freeze the image encoder (set `FREEZE_IMAGE` to `True`) to save compute.
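
The three switches interact through an `if`/`elif` chain in the next cell, so their order of precedence matters: when `USE_LORA` is `True`, the two freeze flags are never consulted. The sketch below mirrors that decision logic in isolation. The function name `resolve_strategy` and the returned labels are hypothetical and only serve to illustrate the behavior.

```python
def resolve_strategy(use_lora: bool, freeze_llm: bool, freeze_image: bool) -> str:
    """Return which fine-tuning strategy the flag combination selects."""
    if use_lora:
        # LoRA is checked first, so the freeze flags are ignored in this case.
        return "lora"
    if freeze_llm and freeze_image:
        raise ValueError(
            "You cannot freeze image encoder and text decoder at the same time."
        )
    if freeze_image:
        return "freeze_image_encoder"
    if freeze_llm:
        return "freeze_text_decoder"
    return "full_fine_tuning"


# The notebook defaults: USE_LORA=True, FREEZE_LLM=False, FREEZE_IMAGE=False.
print(resolve_strategy(True, False, False))
```

Calling the helper with a few flag combinations before loading the 11B checkpoint is a cheap way to confirm that the configuration does what you expect.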


In [3]:
from transformers import MllamaForConditionalGeneration, AutoProcessor, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model
import torch

ckpt = "meta-llama/Llama-3.2-11B-Vision"
USE_LORA = True
FREEZE_LLM = False
FREEZE_IMAGE = False

if USE_LORA:
    lora_config = LoraConfig(
        r=8,
        lora_alpha=8,
        lora_dropout=0.1,
        target_modules=['down_proj','o_proj','k_proj','q_proj','gate_proj','up_proj','v_proj'],
        use_dora=True, # optional DoRA 
        init_lora_weights="gaussian"
    )

    model = MllamaForConditionalGeneration.from_pretrained(
            ckpt,
            torch_dtype=torch.bfloat16,
            device_map="auto"
    )

    model = get_peft_model(model, lora_config)
    model.print_trainable_parameters()

elif FREEZE_IMAGE:
    if FREEZE_LLM:
        raise ValueError("You cannot freeze image encoder and text decoder at the same time.")
    model = MllamaForConditionalGeneration.from_pretrained(ckpt,
        torch_dtype=torch.bfloat16, device_map="auto")
    # freeze vision model to save up on compute
    for param in model.vision_model.parameters():
        param.requires_grad = False

elif FREEZE_LLM:
    if FREEZE_IMAGE:
        raise ValueError("You cannot freeze image encoder and text decoder at the same time.")
    model = MllamaForConditionalGeneration.from_pretrained(ckpt,
        torch_dtype=torch.bfloat16, device_map="auto")
    # freeze text model, this is encouraged in paper
    for param in model.language_model.parameters():
        param.requires_grad = False
        
else: # full ft
    model = MllamaForConditionalGeneration.from_pretrained(ckpt,
        torch_dtype=torch.bfloat16, device_map="auto")

processor = AutoProcessor.from_pretrained(ckpt)

Using the `SDPA` attention implementation on multi-gpu setup with ROCM may lead to performance issues due to the FA backend. Disabling it to use alternative backends.
The model weights are not tied. Please use the `tie_weights` method before using the `infer_auto_device` function.


trainable params: 31,416,320 || all params: 10,674,357,795 || trainable%: 0.2943
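
As a quick sanity check, the percentage reported by `print_trainable_parameters()` is simply the ratio of the two counts on that line. A minimal sketch, using only the numbers printed above (the helper name `trainable_percent` is hypothetical):

```python
def trainable_percent(trainable_params: int, all_params: int) -> float:
    """Share of parameters that receive gradients, in percent."""
    return 100 * trainable_params / all_params


# Counts copied from the output above.
pct = trainable_percent(31_416_320, 10_674_357_795)
print(f"trainable%: {pct:.4f}")  # trainable%: 0.2943
```

In other words, with this LoRA configuration fewer than three parameters in every thousand are updated during training.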


For preprocessing, we combine each question with its answer. Between them we insert a conditioning phrase, in this case “Answer briefly.”, which prompts the model to answer the question concisely.
To process the images, we wrap each image in its own single-element list. The processor can accept a list of multiple images for a single text input, so this nesting tells it that each example has exactly one image.
Lastly, we set the pad tokens and image tokens in the labels to -100 so that the loss ignores them.
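
To make the text format concrete, here is a minimal sketch that builds the training string for a single example, using the same template as the preprocessing function in this notebook. The helper name `build_prompt` and the sample question and answer are made up for illustration and are not taken from the dataset.

```python
def build_prompt(question: str, answer: str) -> str:
    """Format one question/answer pair as a user turn plus an assistant turn."""
    return (
        "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n"
        # The image placeholder comes first, then the question and the
        # conditioning phrase that asks for a short answer.
        f"<|image|>{question} Answer briefly. <|eot_id|>"
        "<|start_header_id|>assistant<|end_header_id|>\n\n"
        f"{answer}<|eot_id|>"
    )


# Hypothetical example values.
print(build_prompt("What color is the bus?", "red"))
```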


In [4]:
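def process(examples):
    # Build one chat-formatted string per example: a user turn (image placeholder,
    # question, conditioning phrase) followed by the assistant turn (short answer).
    texts = [f"<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n<|image|>{example['question']} Answer briefly. <|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n{example['multiple_choice_answer']}<|eot_id|>" for example in examples]
    # Wrap each image in its own list so the processor pairs one image with one text.
    images = [[example["image"].convert("RGB")] for example in examples]

    batch = processor(text=texts, images=images, return_tensors="pt", padding=True)
    # Labels start as a copy of the input ids; -100 marks positions the loss ignores.
    labels = batch["input_ids"].clone()
    labels[labels == processor.tokenizer.pad_token_id] = -100
    labels[labels == 128256] = -100  # image token index
    batch["labels"] = labels
    # Cast floating-point tensors to bfloat16 and move the batch to the GPU.
    batch = batch.to(torch.bfloat16).to("cuda")

    return batch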
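
The label masking step can be illustrated without tensors. The sketch below applies the same rule to plain Python lists: every pad token and image token is replaced by -100, the index that PyTorch's cross-entropy loss ignores by default. The helper name `mask_labels`, the pad id `0`, and the other token ids are toy assumptions; only the image token index `128256` comes from the code above.

```python
IGNORE_INDEX = -100  # default ignore_index of PyTorch's cross-entropy loss


def mask_labels(input_ids, pad_token_id, image_token_id):
    """Copy input_ids, replacing pad and image tokens with IGNORE_INDEX."""
    masked = (pad_token_id, image_token_id)
    return [
        [IGNORE_INDEX if token in masked else token for token in row]
        for row in input_ids
    ]


# Two toy sequences; the second is padded with the (hypothetical) pad id 0.
toy_input_ids = [
    [1, 128256, 7, 8, 9],
    [1, 128256, 7, 0, 0],
]
print(mask_labels(toy_input_ids, pad_token_id=0, image_token_id=128256))
# [[1, -100, 7, 8, 9], [1, -100, 7, -100, -100]]
```

Note that, as in the notebook, the question tokens are left unmasked, so the model is trained on the full sequence rather than on the answer alone.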


We can now set up our Trainer. Before that, we define the arguments we will pass to it.

In [5]:
from transformers import TrainingArguments

args = TrainingArguments(
    num_train_epochs=2,
    remove_unused_columns=False,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,
    warmup_steps=2,
    learning_rate=2e-5,
    weight_decay=1e-6,
    adam_beta2=0.999,
    logging_steps=10,
    save_strategy="no",
    optim="adamw_hf",
    push_to_hub=False,
    save_total_limit=1,
    bf16=True,
    output_dir="./lora",
    dataloader_pin_memory=False,
)
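
It can help to translate these arguments into an effective batch size and an approximate number of optimizer steps. The sketch below does that arithmetic for the 2,144 examples loaded earlier. It assumes a single training process, which is what we would expect here because `device_map="auto"` shards one copy of the model across the GPUs rather than replicating it. The helper name `training_schedule` and its `num_processes` parameter are hypothetical, and the exact step count reported by Trainer may differ slightly when the dataset size is not a multiple of the effective batch size.

```python
def training_schedule(num_examples, per_device_batch_size, grad_accum_steps,
                      num_epochs, num_processes=1):
    """Return (effective batch size, approximate total optimizer steps)."""
    # Examples consumed per optimizer update.
    effective_batch = per_device_batch_size * grad_accum_steps * num_processes
    # One optimizer step per full accumulation cycle.
    steps_per_epoch = num_examples // effective_batch
    return effective_batch, steps_per_epoch * num_epochs


effective_batch, total_steps = training_schedule(
    num_examples=2144,        # rows in `ds`
    per_device_batch_size=1,  # per_device_train_batch_size
    grad_accum_steps=4,       # gradient_accumulation_steps
    num_epochs=2,             # num_train_epochs
)
print(effective_batch, total_steps)  # 4 1072
```

With `logging_steps=10`, this corresponds to one logged loss value for every 40 training examples.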

We can now initialize the Trainer and start training.


In [6]:
from transformers import Trainer

trainer = Trainer(
    model=model,
    train_dataset=ds,
    data_collator=process,
    args=args,
)

Call `train()` to start fine-tuning.

In [None]:
trainer.train()

[34m[1mwandb[0m: Using wandb-core as the SDK backend.  Please refer to https://wandb.me/wandb-core for more information.


[34m[1mwandb[0m: Currently logged in as: [33myahao-he[0m ([33myahao-he-Tsinghua University[0m). Use [1m`wandb login --relogin`[0m to force relogin


| Step | Training Loss |
|------|---------------|
| 10 | 3.6549 |
| 20 | 1.4411 |
| 30 | 1.0619 |
| 40 | 0.8523 |
