# Fine-Tuning PaliGemma with QVLA

#### Author: nisan yildiz

----

PaliGemma is a pre-trained vision-language model (VLM) designed to be an efficient base model for fine-tuning on a variety of tasks in the vision-language domain. Here, we fine-tune the pre-trained PaliGemma model for an image annotation task using quantization and adapters. Adapters are small layers that are plugged into the larger model and trained during fine-tuning while the rest of the architecture remains frozen. This allows efficient fine-tuning of base models without training the entire network.
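To make the adapter idea concrete, here is a minimal sketch of a bottleneck adapter in plain PyTorch. It is an illustration only, not the implementation used by the adapters library: the class name `BottleneckAdapter`, the layer sizes, and the zero initialization of the up-projection are all assumptions chosen for the example.

```python
import torch
from torch import nn


class BottleneckAdapter(nn.Module):
    """Hypothetical minimal adapter: down-project, nonlinearity, up-project, residual."""

    def __init__(self, hidden_size: int, bottleneck_size: int):
        super().__init__()
        self.down = nn.Linear(hidden_size, bottleneck_size)
        self.act = nn.ReLU()
        self.up = nn.Linear(bottleneck_size, hidden_size)
        # Zero-init the up-projection so the adapter starts out as an identity mapping.
        nn.init.zeros_(self.up.weight)
        nn.init.zeros_(self.up.bias)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        return hidden_states + self.up(self.act(self.down(hidden_states)))


# A stand-in for one frozen layer of the base model.
frozen_layer = nn.Linear(32, 32)
for p in frozen_layer.parameters():
    p.requires_grad = False

adapter = BottleneckAdapter(hidden_size=32, bottleneck_size=4)

x = torch.randn(2, 5, 32)
out = adapter(frozen_layer(x))

# Only the adapter parameters would receive gradients during fine-tuning.
trainable = sum(p.numel() for p in adapter.parameters() if p.requires_grad)
frozen = sum(p.numel() for p in frozen_layer.parameters())
print(f"trainable adapter params: {trainable}, frozen layer params: {frozen}")
```

The bottleneck keeps the trainable parameter count small relative to the frozen layer, which is the property the rest of this notebook relies on.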

In [1]:
!git clone https://github.com/adapter-hub/adapters.git
%cd adapters
!pip install .
!pip install -U bitsandbytes
!pip install -U datasets

Cloning into 'adapters'...
remote: Enumerating objects: 126942, done.
remote: Counting objects: 100% (584/584), done.
remote: Compressing objects: 100% (427/427), done.
remote: Total 126942 (delta 383), reused 204 (delta 156), pack-reused 126358 (from 2)
Receiving objects: 100% (126942/126942), 99.40 MiB | 26.66 MiB/s, done.
Resolving deltas: 100% (96629/96629), done.
/content/adapters
Processing /content/adapters
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting transformers~=4.50.3 (from adapters==1.2.0.dev0)
  Downloading transformers-4.50.3-py3-none-any.whl.metadata (39 kB)
Downloading transformers-4.50.3-py3-none-any.whl (10.2 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 10.2/10.2 MB 95.7 MB/s eta 0:00:00
Building wheels for collected packages: adapters
  Building wheel for adapters (pyproje

In [2]:
#Connect to drive
from google.colab import drive
drive.mount('/content/drive')

%cd /content/drive/MyDrive/DI725/DI725-project


Mounted at /content/drive
/content/drive/MyDrive/DI725/DI725-project


In [3]:
import adapters
from adapters import AdapterModelInterface

In [4]:
import torch
from torch import nn

from transformers import BitsAndBytesConfig
from transformers import AutoProcessor, AutoModel, PaliGemmaForConditionalGeneration, AutoConfig

from huggingface_hub import notebook_login

from datasets import load_dataset

In [5]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model_id = "google/paligemma-3b-pt-224" # pt for pre-trained, needs fine-tuning

## Load the dataset

In [None]:
#process the dataset into jsonl files from the given captions csv

#!python3 process_dataset.py

In [15]:
dataset = load_dataset("json", data_files={'train': 'RISCM/resized/train_data.jsonl', 'test':'RISCM/resized/test_data.jsonl', 'validation':"RISCM/resized/val_data.jsonl"})

Generating train split: 0 examples [00:00, ? examples/s]

Generating test split: 0 examples [00:00, ? examples/s]

Generating validation split: 0 examples [00:00, ? examples/s]

## Fine-tuning without quantization

In [6]:
# We need to log in before using the PaliGemma model, as access is subject to a license agreement

notebook_login()


In [7]:
base_model = PaliGemmaForConditionalGeneration.from_pretrained(model_id)
processor = AutoProcessor.from_pretrained(model_id)
config = AutoConfig.from_pretrained(model_id)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


config.json:   0%|          | 0.00/1.03k [00:00<?, ?B/s]

model.safetensors.index.json:   0%|          | 0.00/62.6k [00:00<?, ?B/s]

Fetching 3 files:   0%|          | 0/3 [00:00<?, ?it/s]

model-00003-of-00003.safetensors:   0%|          | 0.00/1.74G [00:00<?, ?B/s]

model-00001-of-00003.safetensors:   0%|          | 0.00/4.95G [00:00<?, ?B/s]

model-00002-of-00003.safetensors:   0%|          | 0.00/5.00G [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/137 [00:00<?, ?B/s]

preprocessor_config.json:   0%|          | 0.00/699 [00:00<?, ?B/s]

Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model. `use_fast=True` will be the default behavior in v4.52, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with `use_fast=False`.


tokenizer_config.json:   0%|          | 0.00/40.0k [00:00<?, ?B/s]

tokenizer.model:   0%|          | 0.00/4.26M [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/17.5M [00:00<?, ?B/s]

added_tokens.json:   0%|          | 0.00/24.0 [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/607 [00:00<?, ?B/s]

In [27]:
from transformers.modeling_outputs import BaseModelOutputWithPooling
from adapters.heads import PredictionHead
# Create a custom prediction head by inheriting from the adapters PredictionHead class

class IdentityHead(PredictionHead):
    def __init__(self):
        super().__init__(name="identity_head")
        self.config = {
            "layers": 1,
            "activation_function": None,
            "use_pooler": False,
            "dropout_prob": 0.0
        }
        self.identity = nn.Identity()
        # Add the identity module
        self.add_module("0", self.identity)

    def build(self, model):
        # Override build to do nothing since we just want identity functionality
        self.train(model.training)  # make sure training mode is consistent

    def forward(self, hidden_states, **kwargs):
        # Ensure we maintain the correct dimensions
        # hidden_states shape: [batch_size, seq_len, hidden_size] or [batch_size, hidden_size]

        # Check if we need to preserve dimensions
        original_shape = hidden_states.shape

        # Apply identity transformation (maintaining the original shape)
        output = super().forward(hidden_states)

        # Ensure output has the same shape as input
        if output.shape != original_shape:
            output = output.view(original_shape)

        return output

    def get_label_names(self):
        # Override to return the expected label names
        return ["labels"]

# Add our custom identity head
base_model.heads = nn.ModuleDict({"identity_head": IdentityHead()})



### Adding VL-Adapter

The PaliGemma model is not officially supported by the adapters library. To use it with adapters, we need to create a model interface object that tells the library where the relevant submodules are located.

In [28]:
bottleneck_interface_lm = AdapterModelInterface(
    adapter_methods=["bottleneck"], # the vanilla Adapter a.k.a bottleneck adapter
    model_embeddings="language_model.model.embed_tokens",
    model_layers="language_model.model.layers",
    layer_self_attn="self_attn",
    layer_cross_attn=None,
    attn_k_proj="k_proj",
    attn_q_proj="q_proj",
    attn_v_proj="v_proj",
    attn_o_proj="o_proj",
    layer_intermediate_proj="mlp.up_proj",
    layer_output_proj="mlp.down_proj",
)
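The strings in the interface are attribute paths into the module tree. As we understand it, the model-level entries are resolved from the model itself, while the `layer_*` and `attn_*` entries are resolved relative to each layer and its attention module. The sketch below illustrates that lookup on a toy module tree. The `resolve` helper and all `Toy*` classes are hypothetical and exist only for this example; they are not part of the adapters library.

```python
from functools import reduce

from torch import nn


def resolve(root, dotted_path):
    """Follow a dotted attribute path such as 'a.b.c' starting from `root`."""
    return reduce(getattr, dotted_path.split("."), root)


class ToyAttention(nn.Module):
    def __init__(self, d):
        super().__init__()
        self.q_proj = nn.Linear(d, d)
        self.k_proj = nn.Linear(d, d)
        self.v_proj = nn.Linear(d, d)
        self.o_proj = nn.Linear(d, d)


class ToyMLP(nn.Module):
    def __init__(self, d):
        super().__init__()
        self.up_proj = nn.Linear(d, 4 * d)
        self.down_proj = nn.Linear(4 * d, d)


class ToyLayer(nn.Module):
    def __init__(self, d):
        super().__init__()
        self.self_attn = ToyAttention(d)
        self.mlp = ToyMLP(d)


class ToyDecoder(nn.Module):
    def __init__(self, d, n_layers):
        super().__init__()
        self.embed_tokens = nn.Embedding(10, d)
        self.layers = nn.ModuleList([ToyLayer(d) for _ in range(n_layers)])


class ToyLanguageModel(nn.Module):
    def __init__(self, d, n_layers):
        super().__init__()
        self.model = ToyDecoder(d, n_layers)


class ToyVLM(nn.Module):
    def __init__(self, d=8, n_layers=2):
        super().__init__()
        self.language_model = ToyLanguageModel(d, n_layers)


toy = ToyVLM()

# Model-level paths are resolved from the top of the module tree.
layers = resolve(toy, "language_model.model.layers")
# Layer-level paths are resolved from an individual layer.
first_layer = layers[0]
attn = resolve(first_layer, "self_attn")
up = resolve(first_layer, "mlp.up_proj")
print(len(layers), type(attn).__name__, up.out_features)
```

Running the same kind of lookup against the real model before calling `adapters.init` is a cheap way to catch a misspelled path early.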

In [10]:
adapters.init(base_model, interface=bottleneck_interface_lm)
base_model.add_adapter("adapter_lm", config="double_seq_bn")
base_model.set_active_adapters("adapter_lm")
print(base_model.adapter_summary())

#moving to device
#base_model.to(device)
#base_model.adapter_to("adapter_lm", device=device)



Name                     Architecture         #Param      %Param  Active   Train
--------------------------------------------------------------------------------
adapter_lm               bottleneck        9,476,352       0.324       1       1
--------------------------------------------------------------------------------
Full model                              2,923,466,480     100.000               1


### Quantization

We will use 4-bit quantization for our model, with the NF4 data type. Computations are done in 16-bit bfloat16. We also enable double quantization, which quantizes the quantization constants themselves to save additional memory.
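The effect of double quantization on storage can be estimated with a little arithmetic. The sketch below is a rough estimate, not a measurement: the block sizes (64 weights per first-level block, 256 constants per second-level block) and the 8-bit second-level constants are assumptions taken from the QLoRA paper's description, and the function name `bits_per_param` is made up for this example.

```python
def bits_per_param(
    weight_bits=4,        # NF4 stores each weight in 4 bits
    block_size=64,        # assumed number of weights sharing one scaling constant
    const_bits=32,        # scaling constants stored as float32
    double_quant=False,
    dq_const_bits=8,      # assumed: constants re-quantized to 8 bits
    dq_block_size=256,    # assumed number of constants per second-level block
):
    """Approximate storage cost per weight, including quantization constants."""
    if not double_quant:
        return weight_bits + const_bits / block_size
    return (
        weight_bits
        + dq_const_bits / block_size
        + const_bits / (block_size * dq_block_size)
    )


plain = bits_per_param()
dq = bits_per_param(double_quant=True)
print(f"NF4 only:            {plain:.3f} bits/param")
print(f"NF4 + double quant:  {dq:.3f} bits/param")
print(f"saving:              {plain - dq:.3f} bits/param")
```

Under these assumptions the overhead of the constants drops from 0.5 bits to roughly 0.127 bits per weight.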

In [34]:
from transformers import BitsAndBytesConfig

nf4_config = BitsAndBytesConfig(
   load_in_4bit=True,
   bnb_4bit_quant_type="nf4",
   bnb_4bit_compute_dtype=torch.bfloat16,
   bnb_4bit_use_double_quant=True)

base_NF4_model = PaliGemmaForConditionalGeneration.from_pretrained(model_id, quantization_config=nf4_config)
base_NF4_model.heads = nn.ModuleDict({"identity_head": IdentityHead()})

#adding the adapter
adapters.init(base_NF4_model, interface=bottleneck_interface_lm)
base_NF4_model.add_adapter("adapter_lm", config="double_seq_bn")
base_NF4_model.set_active_adapters("adapter_lm")

#cast some layers to full precision
for param in base_NF4_model.parameters():
    if param.ndim == 1:
        # cast the small parameters (e.g. layernorm) to fp32 for stability
        param.data = param.data.to(torch.float32)

# Enable gradient checkpointing to reduce required memory
base_NF4_model.gradient_checkpointing_enable()
base_NF4_model.enable_input_require_grads()

class CastOutputToFloat(torch.nn.Sequential):
    def forward(self, x): return super().forward(x).to(torch.float32)
base_NF4_model.language_model.lm_head = CastOutputToFloat(base_NF4_model.language_model.lm_head)

#moving to device
base_NF4_model.to(device)
base_NF4_model.adapter_to("adapter_lm", device=device)



`low_cpu_mem_usage` was None, now default to True since model is quantized.


Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]

Name                     Architecture         #Param      %Param  Active   Train
--------------------------------------------------------------------------------
adapter_lm               bottleneck       18,952,704       1.098       0       1
--------------------------------------------------------------------------------
Full model                              1,725,847,280     100.000               1


In [37]:
#Verifying the datatypes.
dtypes = {}
for _, p in base_NF4_model.named_parameters():
    dtype = p.dtype
    if dtype not in dtypes:
        dtypes[dtype] = 0
    dtypes[dtype] += p.numel()
total = 0
for k, v in dtypes.items():
    total += v
for k, v in dtypes.items():
    print(k, v, v / total)

torch.float16 527750656 0.3024705759052781
torch.float32 19430128 0.011136020276350484
torch.uint8 1197619200 0.6863934038183714


## Preparing and exploring the dataset

The RISCM dataset consists of captioned satellite imagery, with 5 captions provided per image. Our captions table includes all captions, as well as information about the training/test/validation splits and the original source of the images.

In [16]:
image_dir = "RISCM/resized/"
from PIL import Image
import PIL
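def collate_fn(examples):
    # Build the prompts, target captions, and RGB images for one batch
    texts = [f"<image> <bos> {example['prefix']}" for example in examples]
    labels = [example['suffix'] for example in examples]
    images = [PIL.Image.open(image_dir + example["file_name"]).convert("RGB") for example in examples]
    # Passing `suffix` makes the processor build the training labels
    tokens = processor(text=texts, images=images, suffix=labels,
                       return_tensors="pt", padding="longest")
    # Cast floating-point tensors to bfloat16 and move the batch to the device
    tokens = tokens.to(torch.bfloat16).to(device)
    return tokens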

In [17]:
input_text = f"<image> <bos> {dataset['test'][0]['prefix']}"
input_image = PIL.Image.open(image_dir + dataset["test"][0]["file_name"])

In [41]:
inputs = processor(text=input_text, images=input_image,
                  padding="longest", do_convert_rgb=True, return_tensors="pt").to(device)

In [42]:
import torch

# Assume you have these from before:
# base_NF4_model, processor, inputs

device = "cuda"
base_NF4_model.eval()
base_NF4_model.to(device)

# Prepare inputs
input_ids = inputs["input_ids"].to(device)
pixel_values = inputs.get("pixel_values", None)
if pixel_values is not None:
    pixel_values = pixel_values.to(device)

# Print input shapes
print(f"input_ids.shape: {input_ids.shape}")
if pixel_values is not None:
    print(f"pixel_values.shape: {pixel_values.shape}")

max_new_tokens = 10  # Keep small for debugging
generated = input_ids.clone()

with torch.no_grad():
    past_key_values = None
    for step in range(max_new_tokens):
        # Prepare input for this step
        if past_key_values is None:
            # First step (or no cache returned): feed the full sequence with the image
            model_inputs = {"input_ids": generated}
            if pixel_values is not None:
                model_inputs["pixel_values"] = pixel_values
        else:
            # Later steps: the cache already covers earlier tokens,
            # so feed only the newly generated token
            model_inputs = {"input_ids": next_token, "past_key_values": past_key_values}

        # Forward pass
        out = base_NF4_model(**model_inputs)
        print(f"Step {step} logits shape: {out.logits.shape}")

        # Debug past_key_values structure
        if hasattr(out, "past_key_values") and out.past_key_values is not None:
            print(f"past_key_values type: {type(out.past_key_values)}")

        # Get next token (greedy)
        next_token = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)
        print(f"Step {step} next_token.shape: {next_token.shape}")

        # Append next token to generated sequence
        generated = torch.cat([generated, next_token], dim=1)
        past_key_values = out.past_key_values

    # Decode the output sequence (skip original input)
    generated_tokens = generated[0, input_ids.shape[1]:]
    print("Generated tokens:", generated_tokens)
    decoded = processor.decode(generated_tokens, skip_special_tokens=True)
    print("Decoded output:", decoded)

input_ids.shape: torch.Size([1, 262])
pixel_values.shape: torch.Size([1, 3, 224, 224])


RuntimeError: Input type (torch.cuda.HalfTensor) and bias type (torch.cuda.FloatTensor) should be the same
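A plausible cause of this error is the loop in the quantization cell that casts every 1-D parameter to float32. Bias vectors are 1-D as well, so a convolution in the vision tower can end up with float16 weights and a float32 bias. The sketch below shows one possible alternative that casts only normalization layers and leaves biases alone. The helper name `cast_norm_params_to_fp32` and the name-based check for "norm" are assumptions for illustration; this was not verified against the quantized PaliGemma model, and mixed-precision normalization layers may still need autocast at inference time.

```python
import torch
from torch import nn


def cast_norm_params_to_fp32(model: nn.Module) -> None:
    """Cast only normalization-layer parameters to float32 (hypothetical helper)."""
    for module in model.modules():
        # Match nn.LayerNorm and custom classes whose name contains "norm".
        is_norm = isinstance(module, nn.LayerNorm) or "norm" in type(module).__name__.lower()
        if is_norm:
            for param in module.parameters(recurse=False):
                param.data = param.data.to(torch.float32)


# Toy stand-in: a conv layer (with a 1-D bias) followed by a layer norm, in half precision.
toy = nn.Sequential(nn.Conv2d(3, 8, kernel_size=2), nn.LayerNorm(8)).half()
cast_norm_params_to_fp32(toy)

for name, param in toy.named_parameters():
    print(name, param.dtype)
```

In the toy model the convolution's weight and bias keep the same dtype, so the mismatch reported above cannot arise from this cast.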

### Fine-tuning

In [43]:
from transformers import TrainingArguments
args=TrainingArguments(
            num_train_epochs=2,
            remove_unused_columns=False,
            per_device_train_batch_size=1,
            gradient_accumulation_steps=4,
            warmup_steps=2,
            learning_rate=5e-4,
            weight_decay=1e-6,
            adam_beta2=0.999,
            logging_steps=100,
            optim="paged_adamw_8bit", # paged 8-bit AdamW for the quantized model; adamw_hf is an alternative
            save_strategy="steps",
            save_steps=1000,
            save_total_limit=1,
            output_dir="paligemma_qvla",
            bf16=True,
            report_to=["tensorboard"],
            dataloader_pin_memory=False
        )
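With `per_device_train_batch_size=1` and `gradient_accumulation_steps=4`, each optimizer update sees four examples. The sketch below works out the effective batch size and an approximate number of optimizer steps. The dataset size of 1,000 is a made-up placeholder, not the size of RISCM, and the function `optimizer_steps` is ours; the Trainer's own rounding may differ slightly when the dataset size is not a multiple of the effective batch size.

```python
import math


def optimizer_steps(num_examples, per_device_batch_size, grad_accum_steps,
                    num_epochs, num_devices=1):
    """Return (effective batch size, approximate total optimizer steps)."""
    effective_batch = per_device_batch_size * grad_accum_steps * num_devices
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    return effective_batch, steps_per_epoch * num_epochs


# Values from the TrainingArguments above; num_examples is a hypothetical placeholder.
effective_batch, total_steps = optimizer_steps(
    num_examples=1_000,
    per_device_batch_size=1,
    grad_accum_steps=4,
    num_epochs=2,
)
print(f"effective batch size: {effective_batch}")
print(f"approx. optimizer steps: {total_steps}")
```

Comparing the step count with `logging_steps=100` and `save_steps=1000` gives a quick sense of how often logs and checkpoints will appear during a run.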


In [47]:
from datasets import Dataset, Image
from adapters import AdapterTrainer

In [44]:
base_model.active_head = "identity_head"
base_model._active_heads = [base_model.active_head]
base_model.train_adapter("adapter_lm")

In [None]:
base_model

In [45]:
base_NF4_model.active_head = "identity_head"
base_NF4_model._active_heads = [base_NF4_model.active_head]
base_NF4_model.train_adapter("adapter_lm")

In [None]:

trainer = AdapterTrainer(
    model=base_model,
    train_dataset=dataset["train"],
    eval_dataset=dataset["validation"],
    data_collator=collate_fn,
    args=args
)

trained = trainer.train()

Step,Training Loss


KeyboardInterrupt: 

In [48]:

trainer = AdapterTrainer(
    model=base_NF4_model,
    train_dataset=dataset["train"],
    eval_dataset=dataset["validation"],
    data_collator=collate_fn,
    args=args
)

trained = trainer.train()

`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`.


Step,Training Loss


KeyboardInterrupt: 