# Fine-Tuning PaliGemma with QVLA

#### Author: nisan yildiz

----

PaliGemma is a pre-trained vision-language model (VLM) designed to be an efficient base model for fine-tuning on a variety of vision-language (VL) tasks. Here, we fine-tune the pre-trained PaliGemma model for an image annotation task using quantization and adapters. Adapters are small layers that are plugged into the larger model and trained during fine-tuning while the rest of the architecture remains frozen. This allows efficient fine-tuning of base models without training the entire network.
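
To make the idea concrete, the following is a minimal, framework-free sketch of a bottleneck adapter: a down-projection to a small dimension, a non-linearity, an up-projection, and a residual connection. The function names, the ReLU activation, the toy sizes, and the zero-initialized up-projection are illustrative assumptions, not the adapters library's implementation.

```python
import random

def matvec(matrix, vector):
    """Multiply a (rows x cols) matrix, stored as a list of rows, by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def bottleneck_adapter(hidden, w_down, w_up):
    """Return hidden + W_up @ relu(W_down @ hidden)."""
    bottleneck = [max(0.0, v) for v in matvec(w_down, hidden)]
    update = matvec(w_up, bottleneck)
    return [h + u for h, u in zip(hidden, update)]

random.seed(0)
hidden_size, bottleneck_size = 8, 2  # toy sizes; real models are far larger

# Down-projection: (bottleneck_size x hidden_size), small random values.
w_down = [[random.gauss(0.0, 0.02) for _ in range(hidden_size)]
          for _ in range(bottleneck_size)]
# Up-projection: (hidden_size x bottleneck_size), zero-initialized (assumption).
w_up = [[0.0] * bottleneck_size for _ in range(hidden_size)]

hidden = [random.gauss(0.0, 1.0) for _ in range(hidden_size)]
output = bottleneck_adapter(hidden, w_down, w_up)

# Only w_down and w_up would be trained; the frozen model's weights are untouched.
trainable = 2 * hidden_size * bottleneck_size
```

With a zero up-projection the adapter initially leaves the hidden state unchanged, so training starts from the behavior of the frozen base model.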

In [1]:
!git clone https://github.com/adapter-hub/adapters.git
%cd adapters
!pip install .
!pip install -U bitsandbytes

Cloning into 'adapters'...
remote: Enumerating objects: 126929, done.
remote: Counting objects: 100% (555/555), done.
remote: Compressing objects: 100% (407/407), done.
remote: Total 126929 (delta 380), reused 148 (delta 148), pack-reused 126374 (from 2)
Receiving objects: 100% (126929/126929), 99.35 MiB | 8.27 MiB/s, done.
Resolving deltas: 100% (96628/96628), done.
/content/adapters
Processing /content/adapters
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting transformers~=4.50.3 (from adapters==1.2.0.dev0)
  Downloading transformers-4.50.3-py3-none-any.whl.metadata (39 kB)
Downloading transformers-4.50.3-py3-none-any.whl (10.2 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 10.2/10.2 MB 64.0 MB/s eta 0:00:00
Building wheels for collected packages: adapters
  Building wheel for adapters (pyprojec

In [9]:
#Connect to drive
from google.colab import drive
drive.mount('/content/drive')

%cd /content/drive/MyDrive/DI725/DI725-project


Mounted at /content/drive
/content/drive/MyDrive/DI725/DI725-project


In [2]:
import adapters
from adapters import AdapterModelInterface

In [3]:
import pandas as pd
import numpy as np
import torch

from transformers import BitsAndBytesConfig
from transformers import AutoProcessor, PaliGemmaForConditionalGeneration

from huggingface_hub import notebook_login

In [7]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model_id = "google/paligemma-3b-pt-224" # pt for pre-trained, needs fine-tuning

## Fine-tuning without quantization

In [5]:
# We need to log in before using the PaliGemma model, since access requires accepting its license agreement

notebook_login()

In [13]:
model = PaliGemmaForConditionalGeneration.from_pretrained(model_id)
processor = AutoProcessor.from_pretrained(model_id)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model. `use_fast=True` will be the default behavior in v4.52, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with `use_fast=False`.



### Adding VL-Adapter

The PaliGemma model is not officially supported by the adapters library, so we need to create a model interface object that tells the library where the relevant modules are located in the model.

In [14]:
bottleneck_interface_lm = AdapterModelInterface(
    adapter_methods=["bottleneck"], # the vanilla Adapter a.k.a bottleneck adapter
    model_embeddings="language_model.model.embed_tokens",
    model_layers="language_model.model.layers",
    layer_self_attn="self_attn",
    layer_cross_attn=None,
    attn_k_proj="k_proj",
    attn_q_proj="q_proj",
    attn_v_proj="v_proj",
    attn_o_proj="o_proj",
    layer_intermediate_proj="mlp.up_proj",
    layer_output_proj="mlp.down_proj",
)
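
Since the interface is built from attribute paths written by hand, a typo only surfaces later as an error. A small check such as the following can confirm that each dotted path resolves before calling `adapters.init`. The helpers `resolve_path` and `missing_paths` are hypothetical and not part of the adapters library; the stand-in object below only mimics the nesting of the real model.

```python
from functools import reduce
from types import SimpleNamespace

def resolve_path(root, dotted_path):
    """Follow a dotted attribute path (e.g. 'a.b.c') starting from root."""
    return reduce(getattr, dotted_path.split("."), root)

def missing_paths(root, dotted_paths):
    """Return the paths that do not resolve on root."""
    missing = []
    for path in dotted_paths:
        try:
            resolve_path(root, path)
        except AttributeError:
            missing.append(path)
    return missing

# Stand-in for the real model, used only to demonstrate the check.
toy_model = SimpleNamespace(
    language_model=SimpleNamespace(
        model=SimpleNamespace(embed_tokens="embeddings", layers=["layer_0"])
    )
)

paths = [
    "language_model.model.embed_tokens",
    "language_model.model.layers",
    "language_model.model.layer",  # deliberate typo to show a failing path
]
unresolved = missing_paths(toy_model, paths)
print(unresolved)
```

On the real model, the same check would be called with `model` and the paths passed as `model_embeddings` and `model_layers`. The per-layer names such as `self_attn` or `mlp.up_proj` are relative to a single layer, so they would be checked against one element of the layer list instead.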

In [15]:
adapters.init(model, interface=bottleneck_interface_lm)
model.add_adapter("adapter_lm", config="double_seq_bn")
print(model.adapter_summary())

Name                     Architecture         #Param      %Param  Active   Train
--------------------------------------------------------------------------------
adapter_lm               bottleneck       18,952,704       0.648       0       1
--------------------------------------------------------------------------------
Full model                              2,923,466,480     100.000               1
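
The adapter parameter count in the summary can be reproduced by hand. The sketch below assumes values that are not stated in this notebook: a language-model hidden size of 2048, 18 decoder layers, a reduction factor of 16 for `double_seq_bn`, two bottleneck modules per layer, and biases on both projections.

```python
def bottleneck_params(hidden_size, reduction_factor):
    """Parameters of one bottleneck module: down- and up-projection with biases."""
    bottleneck_size = hidden_size // reduction_factor
    down = hidden_size * bottleneck_size + bottleneck_size
    up = bottleneck_size * hidden_size + hidden_size
    return down + up

# Assumed configuration (not stated in the notebook output above).
hidden_size = 2048
num_layers = 18
reduction_factor = 16
modules_per_layer = 2  # "double" placement: one after attention, one after the MLP

adapter_params = num_layers * modules_per_layer * bottleneck_params(hidden_size, reduction_factor)
full_model_params = 2_923_466_480  # "Full model" row of the summary

percent = 100 * adapter_params / full_model_params
print(f"{adapter_params:,} parameters, {percent:.3f}% of the full model")
```

Under these assumptions the result agrees with the 18,952,704 parameters (0.648%) reported by `adapter_summary()`.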


In [17]:
print(model)

PaliGemmaForConditionalGeneration(
  (vision_tower): SiglipVisionModel(
    (vision_model): SiglipVisionTransformer(
      (embeddings): SiglipVisionEmbeddings(
        (patch_embedding): Conv2d(3, 1152, kernel_size=(14, 14), stride=(14, 14), padding=valid)
        (position_embedding): Embedding(256, 1152)
      )
      (encoder): SiglipEncoder(
        (layers): ModuleList(
          (0-26): 27 x SiglipEncoderLayer(
            (self_attn): SiglipSdpaAttention(
              (k_proj): Linear(in_features=1152, out_features=1152, bias=True)
              (v_proj): Linear(in_features=1152, out_features=1152, bias=True)
              (q_proj): Linear(in_features=1152, out_features=1152, bias=True)
              (out_proj): Linear(in_features=1152, out_features=1152, bias=True)
            )
            (layer_norm1): LayerNorm((1152,), eps=1e-06, elementwise_affine=True)
            (mlp): SiglipMLP(
              (activation_fn): PytorchGELUTanh()
              (fc1): Linear(in_features

### Quantization

We load the model with 4-bit quantization using the NF4 data type. Computations are performed in 16-bit bfloat16. Double quantization is also enabled, which additionally quantizes the quantization constants to save more memory.
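
As a rough illustration of what double quantization saves, the sketch below computes the storage cost per quantized weight. It assumes a block size of 64 weights per quantization constant and a second-level block of 256 constants, as described in the QLoRA paper; these values are not set explicitly in this notebook. Weights that stay unquantized (for example layer norms) are not covered by this estimate.

```python
def bits_per_weight(block_size=64, double_quant=True, second_block_size=256):
    """Approximate storage cost of one 4-bit quantized weight, in bits.

    block_size and second_block_size are assumed defaults, not values
    read from the notebook's configuration.
    """
    weight_bits = 4
    if double_quant:
        # 8-bit constants per block, plus one fp32 constant per group of blocks.
        constant_bits = 8 / block_size + 32 / (block_size * second_block_size)
    else:
        # One fp32 constant per block of weights.
        constant_bits = 32 / block_size
    return weight_bits + constant_bits

single = bits_per_weight(double_quant=False)
double = bits_per_weight(double_quant=True)
print(f"without double quantization: {single:.3f} bits per weight")
print(f"with double quantization:    {double:.3f} bits per weight")
```

In this estimate, double quantization lowers the overhead of the constants from 0.5 to about 0.127 bits per weight.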

In [8]:
from transformers import BitsAndBytesConfig

nf4_config = BitsAndBytesConfig(
   load_in_4bit=True,
   bnb_4bit_quant_type="nf4",
   bnb_4bit_compute_dtype=torch.bfloat16,
   bnb_4bit_use_double_quant=True)

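# Loading in 4-bit with bitsandbytes requires a CUDA GPU runtime
base_NF4_model = PaliGemmaForConditionalGeneration.from_pretrained(model_id, quantization_config=nf4_config)

#adding the adapter (the quantized model must be initialized for adapters first)
adapters.init(base_NF4_model, interface=bottleneck_interface_lm)
base_NF4_model.add_adapter("adapter_lm", config="double_seq_bn")
print(base_NF4_model.adapter_summary())

#cast some layers to full precision
for param in base_NF4_model.parameters():
    if param.ndim == 1:
        # cast the small parameters (e.g. layernorm) to fp32 for stability
        param.data = param.data.to(torch.float32)

# Enable gradient checkpointing to reduce required memory
base_NF4_model.gradient_checkpointing_enable()
base_NF4_model.enable_input_require_grads()

# Wrap the output head so that logits are returned in fp32
class CastOutputToFloat(torch.nn.Sequential):
    def forward(self, x): return super().forward(x).to(torch.float32)
# Wrap the quantized model's own head, not the head of the unquantized `model`.
# The head is assumed to live under `language_model`, matching the interface paths above.
base_NF4_model.language_model.lm_head = CastOutputToFloat(base_NF4_model.language_model.lm_head)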

#moving to device
base_NF4_model.adapter_to("adapter_lm", device=device)

prompt = "caption en"


RuntimeError: CUDA is required but not available for bitsandbytes. Please consider installing the multi-platform enabled version of bitsandbytes, which is currently a work in progress. Please check currently supported platforms and installation instructions at https://huggingface.co/docs/bitsandbytes/main/en/installation#multi-backend

In [None]:
#Verifying the datatypes.
dtypes = {}
for _, p in base_NF4_model.named_parameters():
    dtype = p.dtype
    if dtype not in dtypes:
        dtypes[dtype] = 0
    dtypes[dtype] += p.numel()
total = 0
for k, v in dtypes.items():
    total += v
for k, v in dtypes.items():
    print(k, v, v / total)

## Preparing the dataset

In [13]:
captions = pd.read_csv("RISCM/captions.csv")
captions.head()

| | source | split | image | caption_1 | caption_2 | caption_3 | caption_4 | caption_5 |
|---|---|---|---|---|---|---|---|---|
| 0 | NWPU | test | NWPU_31430.jpg | A gray plane on the runway and the lawn beside . | A grey plane is on the runway by the lawn . | There is an airplane on the runway with a larg... | A plane is parked on the runway next to the gr... | There is a plane on the runway beside the grass . |
| 1 | NWPU | test | NWPU_31431.jpg | Three small planes parked in a line on the air... | There are four aircraft on the open ground, Th... | There are many planes of different sizes in a ... | Four planes are parked on the runway . | Four planes of different sizes were on the mar... |
| 2 | NWPU | test | NWPU_31432.jpg | A plane parked in a line on the airport with s... | A white plane was parked on the instruction li... | An airplane parked in an open area with many c... | A plane is parked on the open space . | There is 1 plane on the ground marked . |
| 3 | NWPU | test | NWPU_31433.jpg | A small plane and a big plane parked next to b... | A white plane and a gray plane parked at the b... | Two planes of different sizes are neatly parke... | A large plane and a small plane are parked nea... | Two planes are on the marked ground . |
| 4 | NWPU | test | NWPU_31434.jpg | Two planes parked next to boarding bridges . | Two aircraft were parked at the departure gates . | Two planes of different sizes are neatly parke... | Two planes are parked next to the terminal . | Two planes are on the marked ground . |


In [29]:
captions.split.value_counts()

| split | count |
|---|---|
| train | 35614 |
| test | 4454 |
| val | 4453 |


In [30]:
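import json

test_jsonl = "RISCM/test_data.jsonl"
train_jsonl = "RISCM/train_data.jsonl"
val_jsonl = "RISCM/val_data.jsonl"
jsonl_paths = {"train": train_jsonl, "val": val_jsonl, "test": test_jsonl}
caption_columns = [f"caption_{i}" for i in range(1, 6)]

# Write one JSONL file per split, with one line per (image, caption) pair.
# Each file is opened once in write mode, so re-running the cell does not
# append duplicate lines.
for split, path in jsonl_paths.items():
  split_rows = captions[captions["split"] == split]
  with open(path, "w") as file:
    for _, row in split_rows.iterrows():
      for caption in row[caption_columns]:
        caption = caption.replace('\n', ' ')
        # json.dumps escapes quotes and backslashes that would break hand-built JSON
        record = {"image": row["image"], "prefix": "caption en", "suffix": caption}
        file.write(json.dumps(record) + "\n")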

KeyboardInterrupt: 
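
A quick way to confirm that a generated file is valid JSONL is to parse every line and check for the expected keys. The helper below is a hypothetical sketch, not part of the original notebook; it writes a small temporary file first so that it runs on its own.

```python
import json
import os
import tempfile

REQUIRED_KEYS = {"image", "prefix", "suffix"}

def validate_jsonl(path):
    """Return (number of valid records, 1-based line numbers that failed)."""
    valid, bad_lines = 0, []
    with open(path, encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            try:
                record = json.loads(line)
            except json.JSONDecodeError:
                bad_lines.append(line_number)
                continue
            if isinstance(record, dict) and REQUIRED_KEYS <= record.keys():
                valid += 1
            else:
                bad_lines.append(line_number)
    return valid, bad_lines

# Demonstration: one well-formed record and one with an unescaped quote,
# as produced when a caption containing quotes is pasted into a string template.
good = json.dumps({"image": "example.jpg", "prefix": "caption en",
                   "suffix": 'A "small" plane .'})
bad = '{"image" : "example.jpg", "prefix": "caption en", "suffix": "A "small" plane ."}'
with tempfile.NamedTemporaryFile("w", suffix=".jsonl", delete=False,
                                 encoding="utf-8") as tmp:
    tmp.write(good + "\n" + bad + "\n")

result = validate_jsonl(tmp.name)
os.remove(tmp.name)
print(result)
```

On the real data, the same function would be called on `train_jsonl`, `val_jsonl`, and `test_jsonl`.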

In [28]:
for i in captions.iloc[1, 3:8]:
  print(i)

Three small planes parked in a line on the airport and a big plane behind them .
There are four aircraft on the open ground, The largest of which is three times as large as the smallest one .
There are many planes of different sizes in a clearing .
Four planes are parked on the runway .
Four planes of different sizes were on the marked ground .


In [None]:
import tensorflow as tf  # needed for the resize below

def preprocess_image(image, size=224):
  # Model has been trained to handle images of different aspect ratios
  # resized to 224x224 in the range [-1, 1]. Bilinear and antialias resize
  # options are helpful to improve quality in some tasks.
  image = np.asarray(image)
  if image.ndim == 2:  # Grayscale image without a channel axis: replicate it into 3 channels.
    image = np.stack((image,)*3, axis=-1)
  image = image[..., :3]  # Remove alpha layer.
  assert image.shape[-1] == 3

  image = tf.constant(image)
  image = tf.image.resize(image, (size, size), method='bilinear', antialias=True)
  return image.numpy() / 127.5 - 1.0  # [0, 255]->[-1,1]

