# Fine-tune IDEFICS3 on Visual Question Answering

In this notebook we will fine-tune IDEFICS3 on the VQAv2 dataset.

The transformers PR isn't merged yet, so we will install the branch that contains the IDEFICS3 implementation. Check out the following branch at commit e1b7c0a05ab65e4ddb62a407fe12f8ec13a916f0 and pip install it.

In [2]:
!pip install -q git+https://github.com/andimarafioti/transformers.git@idefics3

In [3]:
!pip install -q accelerate datasets peft bitsandbytes


In [4]:
!pip install -q flash-attn --no-build-isolation

We will push our model to the Hub, so we need to authenticate ourselves.

In [10]:
from huggingface_hub import notebook_login

notebook_login()


In this notebook we will not do full fine-tuning. Instead we use the QLoRA method, which trains a LoRA adapter on top of a quantized version of the model to save memory. If you want to do full fine-tuning, set both `USE_LORA` and `USE_QLORA` to False. If you want to do LoRA without quantization, set `USE_QLORA` to False and `USE_LORA` to True.
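
The three modes selected by these two flags can be summarized in a small helper. This is only an illustrative sketch: `describe_mode` is a hypothetical function, not part of the notebook or of any library. It mirrors the branching used in the next cell, where `USE_QLORA` takes effect whenever it is `True`, whatever the value of `USE_LORA`.

```python
def describe_mode(use_lora: bool, use_qlora: bool) -> str:
    """Hypothetical helper: name the training mode implied by the two flags."""
    if use_qlora:
        # Quantized base model plus a LoRA adapter.
        return "qlora"
    if use_lora:
        # Unquantized base model plus a LoRA adapter.
        return "lora"
    # No adapter: every trainable weight of the model is updated.
    return "full"


for flags in [(False, True), (True, False), (False, False)]:
    print(flags, "->", describe_mode(*flags))
```

The flag values set in the next cell (`USE_LORA = False`, `USE_QLORA = True`) correspond to the `"qlora"` case.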

In [48]:
import torch
from peft import LoraConfig
from transformers import AutoProcessor, BitsAndBytesConfig, Idefics3ForConditionalGeneration

DEVICE = "cuda"
USE_LORA = False
USE_QLORA = True
model_id = "HuggingFaceM4/Idefics3-8B-Llama3"

processor = AutoProcessor.from_pretrained(
    model_id
)


if USE_QLORA or USE_LORA:
    lora_config = LoraConfig(
        r=8,
        lora_alpha=8,
        lora_dropout=0.1,
        target_modules='.*(text_model|connector).*(down_proj|gate_proj|up_proj|k_proj|q_proj|v_proj|o_proj).*$',
        use_dora=not USE_QLORA,  # DoRA is enabled only when the model is not quantized
        init_lora_weights="gaussian"
    )
    if USE_QLORA:
        bnb_config = BitsAndBytesConfig(
            load_in_8bit=True,            
        )
    model = Idefics3ForConditionalGeneration.from_pretrained(
        model_id,
        torch_dtype=torch.bfloat16,
        quantization_config=bnb_config if USE_QLORA else None,
        _attn_implementation="flash_attention_2",
    )
    model.add_adapter(lora_config)
    model.enable_adapters()
else:
    model = Idefics3ForConditionalGeneration.from_pretrained(
        model_id,
        torch_dtype=torch.bfloat16,
        _attn_implementation="flash_attention_2",
    ).to(DEVICE)



`low_cpu_mem_usage` was None, now set to True since model is quantized.



In [49]:
# Freeze the vision encoder; its weights are not updated during training.
for param in model.model.vision_model.parameters():
    param.requires_grad = False

We will load a small version of the VQAv2 dataset. For educational purposes we load only the validation split, split it again, and train on the smaller portion (20% of the rows).

In [31]:
from datasets import load_dataset
ds = load_dataset('merve/vqav2-small', trust_remote_code=True)

In [32]:
split_ds = ds["validation"].train_test_split(test_size=0.8)
train_ds = split_ds["train"]

In [33]:
train_ds

Dataset({
    features: ['multiple_choice_answer', 'question', 'image'],
    num_rows: 4287
})

Let's write our data collating function. We apply the chat template so that each question and its answer appear together and the model can learn to answer. We then pass the formatted prompts and the images to the processor, which handles both.

In [8]:
image_token_id = processor.tokenizer.additional_special_tokens_ids[
            processor.tokenizer.additional_special_tokens.index("<image>")]

def collate_fn(examples):
  texts = []
  images = []
  for example in examples:
      image = example["image"]
      question = example["question"]
      answer = example["multiple_choice_answer"]
      messages = [
          {
              "role": "user",
              "content": [
                  {"type": "text", "text": "Answer briefly."},
                  {"type": "image"},
                  {"type": "text", "text": question}
              ]
          },
          {
              "role": "assistant",
              "content": [
                  {"type": "text", "text": answer}
              ]
          }
      ]
      text = processor.apply_chat_template(messages, add_generation_prompt=False)
      texts.append(text.strip())
      images.append([image])

  batch = processor(text=texts, images=images, return_tensors="pt", padding=True)
  labels = batch["input_ids"].clone()
  labels[labels == processor.tokenizer.pad_token_id] = image_token_id
  batch["labels"] = labels

  return batch
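
The collator above marks padding positions in `labels` with the image token id. A common alternative in causal language model fine-tuning is to mark ignored positions with `-100`, the default `ignore_index` of PyTorch's cross-entropy loss. The sketch below shows that idea on plain Python lists. `mask_labels` and the token ids are made up for illustration and are not part of the notebook or of the processor's API.

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss


def mask_labels(input_ids, ignored_token_ids):
    """Copy input_ids, replacing every id in ignored_token_ids with IGNORE_INDEX."""
    ignored = set(ignored_token_ids)
    return [
        [IGNORE_INDEX if tok in ignored else tok for tok in row]
        for row in input_ids
    ]


# Made-up ids for illustration: 0 = padding, 9 = image placeholder.
batch_input_ids = [[9, 9, 5, 6, 7], [9, 9, 5, 0, 0]]
print(mask_labels(batch_input_ids, ignored_token_ids=[0, 9]))
# [[-100, -100, 5, 6, 7], [-100, -100, 5, -100, -100]]
```

Whether this masking or the one used in `collate_fn` is the right choice depends on how the model implementation computes its loss, so treat this as a variant to try rather than a required change.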


We can now set up `TrainingArguments` and pass them to `Trainer`.

In [50]:
from transformers import TrainingArguments, Trainer

training_args = TrainingArguments(
    num_train_epochs=1,
    per_device_train_batch_size=1,
    #gradient_accumulation_steps=8,
    warmup_steps=50,
    learning_rate=1e-4,
    weight_decay=0.01,
    logging_steps=25,
    save_strategy="steps",
    save_steps=250,
    save_total_limit=1,
    optim="paged_adamw_8bit",
    #evaluation_strategy="epoch",
    bf16=True,
    output_dir="./idefics3-llama-vqav2",
    hub_model_id="idefics3-llama-vqav2",
    remove_unused_columns=False,
)
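
To get a feel for how `save_steps`, `warmup_steps` and `logging_steps` relate to the length of a run, it can help to estimate the number of optimizer updates. The helper below is a rough sketch: `estimate_optimizer_steps` is a hypothetical function, and the rounding it uses (ceiling for batches, floor division for accumulation) is an assumption, since `Trainer`'s exact rounding may differ between versions.

```python
import math


def estimate_optimizer_steps(num_examples, per_device_batch_size,
                             gradient_accumulation_steps=1, num_epochs=1,
                             num_devices=1):
    """Rough estimate of optimizer updates; Trainer's exact rounding may differ."""
    # Number of batches the dataloader yields per epoch.
    batches_per_epoch = math.ceil(
        num_examples / (per_device_batch_size * num_devices)
    )
    # One optimizer update per `gradient_accumulation_steps` batches (assumed floor).
    updates_per_epoch = max(batches_per_epoch // gradient_accumulation_steps, 1)
    return updates_per_epoch * num_epochs


print(estimate_optimizer_steps(4287, 1))                                  # 4287
print(estimate_optimizer_steps(4287, 1, gradient_accumulation_steps=8))   # 535
```

With the 4287 training rows shown earlier and a batch size of 1, one epoch is about 4287 updates, or about 535 if the commented-out `gradient_accumulation_steps=8` is enabled.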


In [51]:
trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=collate_fn,
    train_dataset=train_ds,
    #eval_dataset=test_ds,
)

Detected kernel version 5.4.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.


I trained on a remote machine with an A100 (80GB VRAM); training takes about 50GB of VRAM. I usually run training as a standalone script, so the trainer logs don't appear here.

In [None]:
trainer.train()


### Inference with Adapters

Below is the code to run inference with our trained model. Since we trained an adapter, loading is a bit different: we load the base model, load the adapter on top of it, and use the processor of the base model.

In [None]:
from transformers import Idefics3ForConditionalGeneration, AutoProcessor

peft_model_id = "idefics3-llama-vqav2/checkpoint-535"
base_model_id = "HuggingFaceM4/Idefics3-8B-Llama3"
processor = AutoProcessor.from_pretrained(base_model_id)
model = Idefics3ForConditionalGeneration.from_pretrained(base_model_id)
# load_adapter modifies the model in place and returns None, so move the model separately.
model.load_adapter(peft_model_id)
model.to("cuda")

We can now prepare our inputs. We need to use the same prompt template as in training, including our conditioning phrase "Answer briefly.".

In [None]:
from transformers.image_utils import load_image

DEVICE = "cuda"

image = load_image("https://huggingface.co/spaces/merve/OWLSAM2/resolve/main/buddha.JPG")


messages = [
          {
              "role": "user",
              "content": [
                  {"type": "text", "text": "Answer briefly."},
                  {"type": "image"},
                  {"type": "text", "text": "Which country is this located in?"}
              ]
          }
      ]

text = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=text, images=image, return_tensors="pt", padding=True).to("cuda")

We can now infer.

In [None]:
generated_ids = model.generate(**inputs, max_new_tokens=500)
generated_texts = processor.batch_decode(generated_ids, skip_special_tokens=True)
print(generated_texts)

##['User: Answer briefly.<row_1_col_1><row_1_col_2><row_1_col_3><row_1_col_4>\n<row_2_col_1>
# <row_2_col_2><row_2_col_3><row_2_col_4>\n<row_3_col_1><row_3_col_2><row_3_col_3>
# <row_3_col_4>\n\n<global-img>Which country is this located in?\nAssistant: thailand\nAssistant: thailand']
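
The decoded text above contains the prompt as well as the generated answer. If only the answer is needed, one simple option is to cut the string at the `Assistant:` marker. The helper below is a minimal sketch: `extract_answer` is a hypothetical function, and the sample string is a shortened version of the output above with the image row markers removed.

```python
def extract_answer(decoded: str, marker: str = "Assistant:") -> str:
    """Return the first line following the first `marker`, or "" if it is absent."""
    _, found, tail = decoded.partition(marker)
    if not found:
        return ""
    stripped = tail.strip()
    # Keep only the first line so repeated "Assistant:" turns are dropped.
    return stripped.splitlines()[0].strip() if stripped else ""


sample = (
    "User: Answer briefly.<global-img>Which country is this located in?\n"
    "Assistant: thailand\nAssistant: thailand"
)
print(extract_answer(sample))  # thailand
```

In the notebook this could be applied to each element of `generated_texts`.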

Find the fine-tuned model [here](https://huggingface.co/merve/idefics3llama-vqav2) and play around!