Refs:


* QLoRA example notebook: https://colab.research.google.com/drive/1VoYNfYDKcKRQRor98Zbf2-9VQTtGJ24k?usp=sharing#scrollTo=E0Nl5mWL0k2T
* LoRA SmolVLM example: https://huggingface.co/learn/cookbook/en/fine_tuning_smol_vlm_sft_trl
* QLoRA HF blog: https://huggingface.co/blog/4bit-transformers-bitsandbytes
* VisualWebBench paper: https://arxiv.org/pdf/2404.05955
* Moondream VLM HF: https://huggingface.co/vikhyatk/moondream2



In [1]:
!pip install -q -U bitsandbytes
!pip install -q -U git+https://github.com/huggingface/transformers.git
!pip install -q -U git+https://github.com/huggingface/peft.git
!pip install -q -U git+https://github.com/huggingface/accelerate.git
!pip install -q -U git+https://github.com/huggingface/trl.git
!pip install -q datasets
!pip install flash-attn --no-build-isolation
!pip install num2words



In [20]:
from datasets import load_dataset
import torch
from transformers import Idefics3ForConditionalGeneration, AutoProcessor, BitsAndBytesConfig
from peft import prepare_model_for_kbit_training, LoraConfig, get_peft_model, PeftModel
from trl import SFTConfig, SFTTrainer
from PIL import Image
import num2words

DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

# Mount Google Drive (checkpoints are written there)
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [3]:
# Free GPU RAM before reloading models
import gc
import time

def clear_memory():
    # Drop notebook-level references so the objects can be garbage collected
    for name in ("inputs", "model", "processor", "trainer", "peft_model", "bnb_config"):
        globals().pop(name, None)
    time.sleep(2)

    # Garbage collection, then release cached CUDA memory (skipped on CPU-only runtimes)
    gc.collect()
    time.sleep(2)
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
        torch.cuda.synchronize()
    time.sleep(2)
    gc.collect()
    time.sleep(2)

    if torch.cuda.is_available():
        print(f"GPU allocated memory: {torch.cuda.memory_allocated() / 1024**3:.2f} GB")
        print(f"GPU reserved memory: {torch.cuda.memory_reserved() / 1024**3:.2f} GB")


In [4]:
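vbench_webqa_ds = load_dataset("visualwebbench/VisualWebBench", "webqa")['test']
# Random 80/20 train/val split of the benchmark's "test" split.
# The seed is fixed so the split is reproducible (the value 42 is arbitrary).
splits = vbench_webqa_ds.train_test_split(test_size=0.2, seed=42)
vbench_webqa_train = splits["train"]
vbench_webqa_val = splits["test"]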

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


In [5]:
vbench_webqa_train

Dataset({
    features: ['id', 'task_type', 'website', 'image', 'image_size', 'question', 'answer'],
    num_rows: 251
})

In [6]:
def format_data(sample):
    return {
        "images": [sample["image"]],    # actual image data (PIL or array)
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "image"},  # image placeholder, NOT the pixels
                    {"type": "text", "text": sample["question"]}
                ]
            },
            {
                "role": "assistant",
                "content": [
                    {"type": "text", "text": sample["answer"]}
                ]
            }
        ]
    }
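As a quick sanity check of the structure `format_data` produces, the sketch below applies the same function to a dummy sample. The placeholder string standing in for the image and the example question/answer are made up for illustration; real samples carry a PIL image in `sample["image"]`.

```python
def format_data(sample):
    # Same function as in the cell above, repeated so this sketch is self-contained
    return {
        "images": [sample["image"]],
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "image"},
                    {"type": "text", "text": sample["question"]},
                ],
            },
            {
                "role": "assistant",
                "content": [{"type": "text", "text": sample["answer"]}],
            },
        ],
    }

# Hypothetical sample: a string stands in for the PIL image
dummy_sample = {
    "image": "<image placeholder>",
    "question": "What is the title of the page?",
    "answer": "Example title",
}

formatted = format_data(dummy_sample)
roles = [message["role"] for message in formatted["messages"]]

# Each {"type": "image"} placeholder should be matched by one entry in "images"
n_placeholders = sum(
    part["type"] == "image"
    for message in formatted["messages"]
    for part in message["content"]
)
print(roles, n_placeholders, len(formatted["images"]))
```

The number of image placeholders in the messages is expected to equal the length of `images`, which is the invariant checked here.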

In [7]:
train_dataset = [format_data(sample) for sample in vbench_webqa_train]
val_dataset = [format_data(sample) for sample in vbench_webqa_val]

In [8]:
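# model_id = "HuggingFaceTB/SmolVLM-Instruct"  # earlier choice, overridden by the line below
model_id = "HuggingFaceTB/SmolVLM2-500M-Video-Instruct"

# 4-bit NF4 quantization settings for QLoRA.
# NOTE: this config only takes effect if it is passed to from_pretrained.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = Idefics3ForConditionalGeneration.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype=torch.bfloat16,
    # quantization_config=bnb_config,  # not passed in this run, so the weights load in bf16; uncomment for 4-bit loading
    _attn_implementation="flash_attention_2" if DEVICE == "cuda" else "eager",
)

processor = AutoProcessor.from_pretrained(model_id)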

You are using a model of type `smolvlm` to instantiate a model of type `idefics3`. This may be expected if you are loading a checkpoint that shares a subset of the architecture (e.g., loading a `sam2_video` checkpoint into `Sam2Model`), but is otherwise not supported and can yield errors. Please verify that the checkpoint is compatible with the model you are instantiating.





In [9]:
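# Pre-processing before attaching adapters: enable gradient checkpointing to save
# activation memory, then let PEFT prepare the model for k-bit training
# (this freezes the base weights; see the PEFT docs for the remaining steps).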
model.gradient_checkpointing_enable()
model = prepare_model_for_kbit_training(model)

In [10]:
def print_trainable_parameters(model):
    """
    Prints the number of trainable parameters in the model.
    """
    trainable_params = 0
    all_param = 0
    for _, param in model.named_parameters():
        all_param += param.numel()
        if param.requires_grad:
            trainable_params += param.numel()
    print(
        f"trainable params: {trainable_params} || all params: {all_param} || trainable%: {100 * trainable_params / all_param}"
    )

LoRA setup

In [11]:
peft_config = LoraConfig(
    r=8,
    lora_alpha=8,
    lora_dropout=0.1,
    target_modules=['down_proj','o_proj','k_proj','q_proj','gate_proj','up_proj','v_proj'],
    use_dora=True,
    init_lora_weights="gaussian"
)
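To get a feel for how `r` and `use_dora` affect adapter size, here is a minimal sketch that counts the parameters added to a single linear layer. It assumes the usual LoRA factorization (an `r x in_features` matrix and an `out_features x r` matrix) and that DoRA adds one magnitude value per output feature. The layer dimensions are hypothetical and not read from the SmolVLM2 checkpoint; `lora_param_count` is an illustrative helper, not a PEFT function.

```python
def lora_param_count(in_features, out_features, r, use_dora=False):
    """Adapter parameters added to one linear layer (illustrative helper)."""
    lora_a = r * in_features       # down-projection A: r x in_features
    lora_b = out_features * r      # up-projection B: out_features x r
    # Assumption: DoRA adds a magnitude vector with one entry per output feature
    magnitude = out_features if use_dora else 0
    return lora_a + lora_b + magnitude

# Values from the config above
r, lora_alpha = 8, 8
scaling = lora_alpha / r  # factor applied to the low-rank update

# Hypothetical projection size, for illustration only
plain = lora_param_count(960, 960, r)
with_dora = lora_param_count(960, 960, r, use_dora=True)
print(scaling, plain, with_dora)
```

With `lora_alpha` equal to `r`, the low-rank update is applied with a scaling factor of 1.0.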

In [12]:
# Apply PEFT model adaptation
peft_model = get_peft_model(model, peft_config)

# Print trainable parameters
peft_model.print_trainable_parameters()

trainable params: 5,088,256 || all params: 512,570,560 || trainable%: 0.9927
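The reported percentage can be reproduced from the two counts, and the same numbers give a rough estimate of weight memory. This is a back-of-the-envelope sketch: it assumes 2 bytes per parameter for bf16 and 0.5 bytes per parameter for 4-bit storage, and it ignores quantization constants and any layers that would be kept unquantized.

```python
# Counts copied from the print_trainable_parameters() output above
trainable_params = 5_088_256
all_params = 512_570_560

trainable_pct = 100 * trainable_params / all_params
base_params = all_params - trainable_params  # frozen base-model parameters

GIB = 1024 ** 3
bf16_gib = base_params * 2 / GIB    # assumption: 2 bytes per bf16 weight
nf4_gib = base_params * 0.5 / GIB   # assumption: 4 bits per weight, overheads ignored

print(f"trainable%: {trainable_pct:.4f}")
print(f"base weights in bf16: ~{bf16_gib:.2f} GiB, in 4-bit: ~{nf4_gib:.2f} GiB")
```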


In [13]:
#clear_memory()

In [14]:
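training_args = SFTConfig(
    output_dir="/content/drive/MyDrive/!personalMLProject/screen_qa/smolVlm2500m_webbench_qlora",
    num_train_epochs=1,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,  # effective batch size: 4 * 4 = 16
    # NOTE: this run finishes after 16 optimizer steps (see global_step in the
    # training output), so warmup_steps=50 never completes and the 25-step
    # logging and checkpoint intervals are never reached.
    warmup_steps=50,
    learning_rate=1e-4,
    weight_decay=0.01,
    logging_steps=25,
    save_strategy="steps",
    save_steps=25,
    save_total_limit=1,
    optim="adamw_torch_fused",
    bf16=True,
    push_to_hub=False,
    report_to="none",
    max_length=None
)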
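The number of optimizer steps in one epoch follows from the dataset size, the batch size, and the gradient accumulation setting. The sketch below is a simplified estimate: it assumes a single device, that the last partial batch is kept, and that a trailing partial accumulation group still counts as one step. `optimizer_steps_per_epoch` is an illustrative helper, not a TRL or Transformers function.

```python
import math

def optimizer_steps_per_epoch(num_samples, per_device_batch_size, grad_accum_steps, num_devices=1):
    """Estimate optimizer updates per epoch (illustrative helper)."""
    # Batches per epoch, keeping the last partial batch
    batches = math.ceil(num_samples / (per_device_batch_size * num_devices))
    # One optimizer update per accumulation group, including a trailing partial group
    return max(math.ceil(batches / grad_accum_steps), 1)

# 251 training rows, batch size 4, gradient accumulation 4 (values from the cells above)
steps = optimizer_steps_per_epoch(251, 4, 4)
print(steps)  # 16
```

This agrees with `global_step=16` in the training output, and it is worth comparing against `warmup_steps`, `logging_steps`, and `save_steps` before launching a run on a dataset this small.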

In [21]:
trainer_kwargs = {
    "model": peft_model,
    "args": training_args,
    "train_dataset": train_dataset,
    "eval_dataset": val_dataset,
    "processing_class": processor,
}

# Pass peft_config only when the model is not already wrapped by PEFT.
# Here peft_model was created with get_peft_model, so it is already a
# PeftModel and the condition below evaluates to False.
if not isinstance(peft_model, PeftModel):
    trainer_kwargs["peft_config"] = peft_config

trainer = SFTTrainer(**trainer_kwargs)

In [18]:
trainer.train()

The tokenizer has new PAD/BOS/EOS tokens that differ from the model config and generation config. The model config and generation config were aligned accordingly, being updated with the tokenizer's values. Updated tokens: {'eos_token_id': 49279, 'bos_token_id': 1, 'pad_token_id': 2}.
Casting fp32 inputs back to torch.bfloat16 for flash-attn compatibility.






TrainOutput(global_step=16, training_loss=18.49785614013672, metrics={'train_runtime': 540.9813, 'train_samples_per_second': 0.464, 'train_steps_per_second': 0.03, 'total_flos': 818289409104000.0, 'train_loss': 18.49785614013672})

In [19]:
trainer.save_model(training_args.output_dir)

