# 🖼️ Image Captioning using PaliGemma

This notebook demonstrates how to use Google's `paligemma-3b-mix-224` vision-language model to generate captions or textual descriptions for your own uploaded image.

We use Hugging Face Transformers, PyTorch, and Pillow to load the model, process the image, and generate a descriptive response.


## 📦 Install Dependencies

We install `transformers`, `accelerate`, `torch`, and `torchvision`, which are required to load and run the PaliGemma model. Pillow, used later to open the image, is pulled in as a dependency of `torchvision`.


In [1]:
# Install required packages
!pip install -q transformers accelerate torch torchvision


In [2]:
from huggingface_hub import login

# Log in to Hugging Face. Calling login() without arguments prompts for the
# token, so it never has to be stored in the notebook.
login()
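
A token written directly into a notebook is easy to leak when the notebook is shared. As a minimal sketch, the token could instead be read from an environment variable. `HF_TOKEN` is the variable that `huggingface_hub` itself looks for; the helper names `read_hf_token` and `login_from_env` are hypothetical and not part of the original notebook.

```python
import os


def read_hf_token(env=None, var_name="HF_TOKEN"):
    """Return the token stored in an environment mapping, or None if unset."""
    env = os.environ if env is None else env
    token = env.get(var_name, "").strip()
    return token or None


def login_from_env():
    """Log in with the token from the environment (hypothetical helper)."""
    # Imported here so read_hf_token works without huggingface_hub installed.
    from huggingface_hub import login

    token = read_hf_token()
    if token is None:
        raise RuntimeError("Set the HF_TOKEN environment variable first.")
    login(token=token)
```

Calling `login_from_env()` in place of a hard-coded `login(token=...)` keeps the secret out of the saved notebook.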

## 📚 Import Python Libraries

We import essential modules including:
- `torch` for model loading and inference
- `transformers` for accessing PaliGemma and its processor
- `PIL.Image` for opening and processing your image


In [3]:
# Import necessary libraries
import torch
from transformers import AutoProcessor, AutoModelForVision2Seq
from PIL import Image

## 🔄 Load the PaliGemma Model

We load the PaliGemma vision-to-language model and its processor using Hugging Face's `AutoProcessor` and `AutoModelForVision2Seq`.

Make sure to use a GPU (if available) for faster processing.
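
The choice of device also affects which precision is sensible. The sketch below pairs the device with a dtype: half precision on GPU, as in this notebook, and full precision on CPU. The CPU fallback is an assumption on our part (half precision on CPU can be slow or unsupported for some operations, depending on the PyTorch version), and `choose_runtime` is a hypothetical helper name.

```python
def choose_runtime(cuda_available):
    """Return (device name, dtype name) for the given CUDA availability."""
    if cuda_available:
        # Half precision halves the memory needed for the weights on GPU.
        return "cuda", "float16"
    # Assumption: fall back to full precision when running on CPU.
    return "cpu", "float32"


# In the notebook this could be used as:
#   device_name, dtype_name = choose_runtime(torch.cuda.is_available())
#   dtype = getattr(torch, dtype_name)
#   model = AutoModelForVision2Seq.from_pretrained(model_id, torch_dtype=dtype)
#   model.to(device_name)
print(choose_runtime(False))  # ('cpu', 'float32')
```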


In [4]:
# Load the PaliGemma model and processor
model_id = "google/paligemma-3b-mix-224"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForVision2Seq.from_pretrained(model_id, torch_dtype=torch.float16, low_cpu_mem_usage=True)
model.eval()


PaliGemmaForConditionalGeneration(
  (vision_tower): SiglipVisionModel(
    (vision_model): SiglipVisionTransformer(
      (embeddings): SiglipVisionEmbeddings(
        (patch_embedding): Conv2d(3, 1152, kernel_size=(14, 14), stride=(14, 14), padding=valid)
        (position_embedding): Embedding(256, 1152)
      )
      (encoder): SiglipEncoder(
        (layers): ModuleList(
          (0-26): 27 x SiglipEncoderLayer(
            (layer_norm1): LayerNorm((1152,), eps=1e-06, elementwise_affine=True)
            (self_attn): SiglipAttention(
              (k_proj): Linear(in_features=1152, out_features=1152, bias=True)
              (v_proj): Linear(in_features=1152, out_features=1152, bias=True)
              (q_proj): Linear(in_features=1152, out_features=1152, bias=True)
              (out_proj): Linear(in_features=1152, out_features=1152, bias=True)
            )
            (layer_norm2): LayerNorm((1152,), eps=1e-06, elementwise_affine=True)
            (mlp): SiglipMLP(
          

In [7]:
# Set device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)


## 🖼️ Upload Your Image

Set `image_path` to the path of your own image. The image should be in `.jpg`, `.jpeg`, or `.png` format.

We use `PIL.Image.open()` to load the image and convert it to RGB format.
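
Checking the path before opening it gives a clearer error than a failure inside `Image.open()`. The following is a small sketch of such a check, limited to the three formats named above; `SUPPORTED_SUFFIXES`, `has_supported_suffix`, and `check_image_path` are hypothetical names introduced for illustration.

```python
from pathlib import Path

# The formats this notebook asks for; Pillow itself can read more.
SUPPORTED_SUFFIXES = {".jpg", ".jpeg", ".png"}


def has_supported_suffix(image_path):
    """Return True if the file extension is one of the supported formats."""
    return Path(image_path).suffix.lower() in SUPPORTED_SUFFIXES


def check_image_path(image_path):
    """Validate the extension and existence of the image, then return a Path."""
    path = Path(image_path)
    if not has_supported_suffix(path):
        raise ValueError(f"Unsupported image format: {path.suffix or '(none)'}")
    if not path.is_file():
        raise FileNotFoundError(f"No image found at {path}")
    return path
```

The returned path can be passed straight to `Image.open(...)`.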


In [8]:
# Load your own image
image_path = "/content/image.png"  # Change this to your uploaded image path
image = Image.open(image_path).convert("RGB")

## 🧠 Define Prompt for Captioning

You can provide a prompt to guide the model on what type of output you want. For example:
- `"Describe the image"`
- `"What is happening in this image?"`
- `"Write a caption for this image"`


In [22]:
# Define your task prompt
prompt = "Describe the image."
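
When the prompt has no image placeholder, the processor emits a notice recommending that `<image>` tokens be placed at the very beginning of the text. A minimal sketch of adding the placeholder explicitly is shown below; `with_image_token` is a hypothetical helper, and it assumes one `<image>` token per image.

```python
IMAGE_TOKEN = "<image>"


def with_image_token(prompt, num_images=1):
    """Prefix the prompt with one <image> token per image, unless already present."""
    if prompt.startswith(IMAGE_TOKEN):
        return prompt
    return IMAGE_TOKEN * num_images + prompt


print(with_image_token("Describe the image."))  # <image>Describe the image.
```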

## 🔄 Preprocess Image and Prompt

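We pass the image and the prompt to the `processor`, which returns a dictionary-like batch of tensors (`input_ids`, `attention_mask`, and `pixel_values`) ready for the model. The cell below also runs generation and decodes the result.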
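
To get a feel for what the vision tower receives, the arithmetic below relates the input resolution to the number of image patches. It relies on the `224` in the model name being the input resolution and on the model summary printed after loading, which shows a 14×14 patch convolution with stride 14 and a 256-entry position embedding. `num_image_patches` is a hypothetical helper name.

```python
def num_image_patches(image_size=224, patch_size=14):
    """Number of non-overlapping square patches in a square image."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be a multiple of patch_size")
    per_side = image_size // patch_size  # 224 // 14 = 16 patches per side
    return per_side * per_side


print(num_image_patches())  # 256
```

The result matches the size of the position embedding in the printed model summary.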


In [23]:
# Preprocess input
inputs = processor(image, text=prompt, return_tensors="pt").to(device, torch.float16)

# Generate output
with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=50)

# Decode output
generated_text = processor.batch_decode(outputs, skip_special_tokens=True)[0]

# Print the result
print("🔍 Model Output:", generated_text)

You are passing both `text` and `images` to `PaliGemmaProcessor`. The processor expects special image tokens in the text, as many tokens as there are images per each text. It is recommended to add `<image>` tokens in the very beginning of your text. For this call, we will infer how many images each text has and add special tokens.


🔍 Model Output: Describe the image.
page displaying different products available with other options


In [19]:
# Define your task prompt
prompt = "Can you describe the components of the image?"

## ✨ Generate Caption

We use the `generate()` method to get the model's prediction. This step performs the vision-language generation based on your image and prompt.

## 📝 Decode and Display Output

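The generated token IDs are decoded with `processor.batch_decode()` into human-readable text. Because the output sequence includes the input tokens, the decoded text starts with the prompt, followed by the model's answer on the next line.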
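
If only the answer is wanted, one common approach is to drop the prompt tokens before decoding. The sketch below shows the slicing on a plain list of token IDs; the same indexing works on a 1-D tensor. `strip_prompt_ids` is a hypothetical helper, and the notebook usage in the comments is an assumption about how it would be wired in.

```python
def strip_prompt_ids(output_ids, prompt_length):
    """Return only the token IDs generated after the prompt."""
    if prompt_length < 0 or prompt_length > len(output_ids):
        raise ValueError("prompt_length must lie within the output sequence")
    return output_ids[prompt_length:]


# In the notebook this could be used as:
#   input_len = inputs["input_ids"].shape[-1]
#   answer_ids = strip_prompt_ids(outputs[0], input_len)
#   print(processor.decode(answer_ids, skip_special_tokens=True))
print(strip_prompt_ids([101, 102, 103, 7, 8], 3))  # [7, 8]
```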


In [20]:
# Preprocess input
inputs = processor(image, text=prompt, return_tensors="pt").to(device, torch.float16)

# Generate output
with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=50)

# Decode output
generated_text = processor.batch_decode(outputs, skip_special_tokens=True)[0]

# Print the result
print("🔍 Model Output:", generated_text)


🔍 Model Output: Can you describe the components of the image?
earbuds


In [6]:
from IPython.display import clear_output
clear_output()  # Clear the output accumulated in this cell