**Captioning with PaliGemma: Step-by-Step Code Walkthrough**


Install Required Libraries:
First, we need to install the necessary libraries to access the PaliGemma model. Specifically, we’ll use the keras-nlp library:

In [None]:
# Install the required Keras libraries
!pip install -q --upgrade keras-nlp
# Quote the requirement so the shell does not treat ">=3" as an output redirect
!pip install -q --upgrade "keras>=3"


Import Libraries:
Next, we import the libraries that will be used throughout the code:

In [None]:
import os
from google.colab import userdata
import keras_nlp
from IPython.display import Markdown
from keras.preprocessing.image import load_img, img_to_array
import tensorflow as tf
import matplotlib.pyplot as plt

Set Up Kaggle Credentials:
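Here, I read the Kaggle credentials from the Colab secrets and export them as environment variables so that the model download from Kaggle can authenticate. Make sure to store your Kaggle username and API key as Colab secrets first, rather than pasting them into the notebook.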

In [None]:
# Copy the Kaggle credentials from the Colab secrets into environment variables
os.environ["KAGGLE_USERNAME"] = userdata.get("KAGGLE_USERNAME")
os.environ["KAGGLE_KEY"] = userdata.get("KAGGLE_KEY")
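
A missing or empty credential tends to surface only later, as a failed model download. As a minimal sketch, a small helper can check for both values up front. The name `missing_credentials` is hypothetical and not part of Colab or KerasNLP:

```python
import os

def missing_credentials(env=None, names=("KAGGLE_USERNAME", "KAGGLE_KEY")):
    """Return the credential names that are unset or empty in `env`."""
    # Fall back to the real process environment when no mapping is given
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]

# Example with an explicit mapping instead of the real environment
print(missing_credentials({"KAGGLE_USERNAME": "someone"}))  # ['KAGGLE_KEY']
```

Calling `missing_credentials()` with no arguments right after the cell above should return an empty list when both secrets were set.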

Load the PaliGemma Model:
Now, we load the PaliGemma model. I’m using the pali_gemma_3b_mix_448 preset, a pre-trained checkpoint that expects input images at a resolution of 448x448 pixels:

In [None]:
# load paligemma from a preset
pali_gemma_lm = keras_nlp.models.PaliGemmaCausalLM.from_preset("pali_gemma_3b_mix_448")
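
The trailing number in the preset name matches the input resolution used in this walkthrough. As a sketch, and assuming that naming convention holds for whichever preset you pick, the resize target can be read from the name instead of being hard-coded. The helper `preset_image_size` is hypothetical, not a KerasNLP function:

```python
import re

def preset_image_size(preset_name):
    """Read the trailing resolution from a name such as 'pali_gemma_3b_mix_448'."""
    match = re.search(r"_(\d+)$", preset_name)
    if match is None:
        raise ValueError(f"No trailing resolution in preset name: {preset_name!r}")
    side = int(match.group(1))
    # Square input: the same value is used for height and width
    return (side, side)

PRESET = "pali_gemma_3b_mix_448"
TARGET_SIZE = preset_image_size(PRESET)
print(TARGET_SIZE)  # (448, 448)
```

The resulting tuple has the same form as the `target_size` argument that `load_img` accepts.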

Display Model Summary:
We can now display a summary of the loaded model to understand its structure and parameters:

In [None]:
pali_gemma_lm.summary()

Load and Prepare the Image:
To process an image, we first load it with Keras’s load_img function, whose target_size argument resizes it to the 448x448 input the preset expects. Then, we convert the image to a NumPy array.

Convert Image to Tensor:
Next, we transform the NumPy array into a Tensor object using TensorFlow’s convert_to_tensor function, making it ready for the PaliGemma model.

In [None]:
IMAGE_URL_HERE = 'https://storage.googleapis.com/keras-cv/models/paligemma/cow_beach_1.png'
image_path = tf.keras.utils.get_file('image_filename.jpg', IMAGE_URL_HERE)
keras_img = load_img(image_path, target_size=(448,448))
img_array = img_to_array(keras_img)
img_tensor = tf.convert_to_tensor(img_array)
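
One detail about the cell above: get_file saves the download under the name you pass in and reuses an existing file with that name, so keeping 'image_filename.jpg' while swapping the URL can return the earlier image. A minimal sketch of one workaround is to derive the local name from the URL itself. The helper `filename_from_url` is hypothetical:

```python
import posixpath
from urllib.parse import urlparse

def filename_from_url(url):
    """Return the last path component of a URL, ignoring any query string."""
    name = posixpath.basename(urlparse(url).path)
    if not name:
        raise ValueError(f"URL has no file name component: {url!r}")
    return name

IMAGE_URL = "https://storage.googleapis.com/keras-cv/models/paligemma/cow_beach_1.png"
print(filename_from_url(IMAGE_URL))  # cow_beach_1.png
```

The returned name can be passed as the first argument to get_file in place of the fixed string, which also keeps the file extension consistent with the actual image format.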

Display the Image:
Before passing the image to PaliGemma, let’s display it to see what we’re working with. The sample image used here shows a cow:

In [None]:
# Display the image using matplotlib. load_img returns a PIL image,
# which imshow accepts directly, so no indexing is needed.
plt.imshow(keras_img)
plt.axis('off')  # Hide the axes for a cleaner image display
plt.show()

Generate an Image Caption:
Now, we use PaliGemma to generate a caption. Start by defining a prompt, such as "Caption the image:". Then, pass both the image and the prompt to the generate() function:

In [None]:
# define prompt separately so we can measure its length later
prompt = "Caption the image:"
# pass images and prompts to PaliGemma
response = pali_gemma_lm.generate({"images": [img_tensor], "prompts": [prompt]})
# we're not using an instruction-trained model so we have to cut the prompt off
# the front of our output
filtered = response[0][len(prompt):]
print(filtered)
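
Slicing by len(prompt) assumes the response always begins with the exact prompt text. A slightly more defensive sketch checks for the prefix first and trims leftover whitespace. The helper `strip_prompt` is hypothetical, and the response string below is made up for illustration rather than real model output:

```python
def strip_prompt(response, prompt):
    """Remove `prompt` from the start of `response` if present, then trim whitespace."""
    if response.startswith(prompt):
        response = response[len(prompt):]
    return response.strip()

# Made-up response string, used only to illustrate the helper
example = "Caption the image:\na cow standing on a beach"
print(strip_prompt(example, "Caption the image:"))  # a cow standing on a beach
```

With the variables from the cell above, the equivalent call would be `strip_prompt(response[0], prompt)`.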