# Using PaliGemma with 🤗 transformers

Modified from the Hugging Face notebook.

PaliGemma is a new vision language model released by Google. In this notebook, we will see how to use 🤗 transformers for PaliGemma inference.
First, install the libraries below with the upgrade flag (`-U`), since we need the latest version of 🤗 transformers along with the others.

In [1]:
!pip install -q -U accelerate bitsandbytes git+https://github.com/huggingface/transformers.git

PaliGemma requires users to accept the Gemma license, so make sure to go to [the repository]() and request access. If you have previously accepted the Gemma license, you already have access to this model. Once you have access, log in to the Hugging Face Hub by running the cell below, which calls `notebook_login()`, and enter your access token when prompted.

In [1]:
from huggingface_hub import notebook_login

notebook_login()


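Outside a notebook (for example in a script or a batch job) the interactive login widget is not available. The following is a minimal sketch of a non-interactive login, assuming the access token is stored in the `HF_TOKEN` environment variable. The helper names `resolve_hf_token` and `login_from_env` are hypothetical; `huggingface_hub.login(token=...)` is the library's programmatic login call.

```python
import os


def resolve_hf_token(env=None, var_name="HF_TOKEN"):
    """Return the access token stored in `var_name`, or None if it is unset or blank."""
    env = os.environ if env is None else env
    token = env.get(var_name, "").strip()
    return token or None


def login_from_env():
    """Log in to the Hugging Face Hub without the notebook widget (hypothetical helper)."""
    # Imported lazily so the token lookup above works even if the package is missing.
    from huggingface_hub import login

    token = resolve_hf_token()
    if token is None:
        raise RuntimeError("Set HF_TOKEN to a Hugging Face access token first.")
    login(token=token)
```

Calling `login_from_env()` once at the top of a script plays the same role as the `notebook_login()` cell above.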

In [7]:
import torch
import time
import os
import json
from huggingface_hub import notebook_login
from PIL import Image
from transformers import PaliGemmaForConditionalGeneration, PaliGemmaProcessor

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

start_time = time.time()
image_folder = "/teamspace/studios/this_studio/drone-detection-with-llm/originales-400"
results = {}

input_text = "Detect drone"
model_id = "google/paligemma-3b-pt-896"

for image_file in os.listdir(image_folder):
    if image_file.endswith((".jpg", ".jpeg", ".png")):
        image_path = os.path.join(image_folder, image_file)
        input_image = Image.open(image_path)

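        # NOTE: loading the model and processor inside the loop reloads the
        # checkpoint shards for every image, which is very slow. The next cell
        # loads them once, before the loop.
        model = PaliGemmaForConditionalGeneration.from_pretrained(model_id, torch_dtype=torch.bfloat16).to(device)
        processor = PaliGemmaProcessor.from_pretrained(model_id)

        # Move inputs to the selected device instead of hard-coding "cuda"
        inputs = processor(text=input_text, images=input_image,
                           padding="longest", do_convert_rgb=True, return_tensors="pt").to(device)
        inputs = inputs.to(dtype=model.dtype)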

        with torch.no_grad():
            output = model.generate(**inputs, max_length=4200)

        prediction = processor.decode(output[0], skip_special_tokens=True)
        print(prediction)

        results[image_file] = prediction

with open('predictions.json', 'w') as json_file:
    json.dump(results, json_file, indent=4)

end_time = time.time()
print(f"Total time: {end_time - start_time:.1f} s")


Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]

KeyboardInterrupt: 
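The loop above filters files with a case-sensitive `endswith`, so a file such as `IMG_001.JPG` would be skipped, and `os.listdir` returns entries in arbitrary order. A minimal sketch of a stricter listing step follows; the helper name `list_images` and the constant `IMAGE_SUFFIXES` are hypothetical and not part of the original notebook.

```python
from pathlib import Path

# Same extensions as the notebook's filter, compared in lower case.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}


def list_images(folder):
    """Return image paths in `folder`, sorted by name, matching suffixes case-insensitively."""
    return sorted(
        p for p in Path(folder).iterdir()
        if p.is_file() and p.suffix.lower() in IMAGE_SUFFIXES
    )
```

Iterating over `list_images(image_folder)` yields full paths in a reproducible order, so each path can be passed to `Image.open` directly and `path.name` can serve as the key in `results`.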

In [2]:
import torch
import time
import os
import json
from PIL import Image
from transformers import PaliGemmaForConditionalGeneration, PaliGemmaProcessor
from huggingface_hub import notebook_login

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

start_time = time.time()
image_folder = "/teamspace/studios/this_studio/drone-detection-with-llm/originales-400"
results = {}

input_text = "Detect drone"
model_id = "google/paligemma-3b-pt-896"

# Load model and processor once
model = PaliGemmaForConditionalGeneration.from_pretrained(model_id, torch_dtype=torch.bfloat16).to(device)
processor = PaliGemmaProcessor.from_pretrained(model_id)

for image_file in os.listdir(image_folder):
    if image_file.endswith((".jpg", ".jpeg", ".png")):
        image_path = os.path.join(image_folder, image_file)
        input_image = Image.open(image_path)

        inputs = processor(text=input_text, images=input_image,
                           padding="longest", do_convert_rgb=True, return_tensors="pt").to(device)
        inputs = inputs.to(dtype=model.dtype)

        with torch.no_grad():
            output = model.generate(**inputs, max_length=4200)

        prediction = processor.decode(output[0], skip_special_tokens=True)
        print(f"{image_file}: {prediction}")

        results[image_file] = {"predictions": prediction}

# Save the results to a JSON file
with open('predictions.json', 'w') as json_file:
    json.dump(results, json_file, indent=4)

end_time = time.time()
print(f"Total time: {end_time - start_time:.1f} s")

Using device: cuda


`config.hidden_act` is ignored, you should use `config.hidden_activation` instead.
Gemma's activation function will be set to `gelu_pytorch_tanh`. Please, use
`config.hidden_activation` if you want to override this behaviour.
See https://github.com/huggingface/transformers/pull/29402 for more details.


Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]

KeyboardInterrupt: 
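For detection prompts, PaliGemma is documented to describe each box with four `<locNNNN>` tokens followed by a label, with multiple detections separated by `;`. The following is a minimal sketch for turning a decoded prediction string into pixel coordinates. It assumes the tokens are ordered `y_min, x_min, y_max, x_max` and binned to 1024 steps, and that the `<loc…>` tokens survive decoding. The function name `parse_detections` is hypothetical.

```python
import re

# Four consecutive location tokens followed by a label (text up to the next "<" or ";").
_BOX = re.compile(r"<loc(\d{4})><loc(\d{4})><loc(\d{4})><loc(\d{4})>\s*([^<;]+)")


def parse_detections(text, image_width, image_height, bins=1024):
    """Parse '<locYYYY><locXXXX><locYYYY><locXXXX> label' groups into pixel boxes.

    Assumes the token order y_min, x_min, y_max, x_max, normalized to `bins` steps.
    Returns a list of dicts with 'label' and 'box' = (x_min, y_min, x_max, y_max).
    """
    detections = []
    for match in _BOX.finditer(text):
        y_min, x_min, y_max, x_max = (int(v) / bins for v in match.groups()[:4])
        detections.append({
            "label": match.group(5).strip(),
            "box": (x_min * image_width, y_min * image_height,
                    x_max * image_width, y_max * image_height),
        })
    return detections
```

Text without location tokens, such as the echoed prompt, is ignored, so the function returns an empty list when the model reports no box. The parsed boxes could be stored next to the raw string in `results[image_file]`, using `input_image.size` for the width and height.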