In [None]:
# Copyright 2025 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Spatial understanding with Gemini 2.0

<table align="left">
  <td style="text-align: center">
    <a href="https://colab.research.google.com/github/GoogleCloudPlatform/generative-ai/blob/main/gemini/use-cases/spatial-understanding/spatial_understanding.ipynb">
      <img width="32px" src="https://www.gstatic.com/pantheon/images/bigquery/welcome_page/colab-logo.svg" alt="Google Colaboratory logo"><br> Open in Colab
    </a>
  </td>
  <td style="text-align: center">
    <a href="https://console.cloud.google.com/vertex-ai/colab/import/https:%2F%2Fraw.githubusercontent.com%2FGoogleCloudPlatform%2Fgenerative-ai%2Fmain%2Fgemini%2Fuse-cases%2Fspatial-understanding%2Fspatial_understanding.ipynb">
      <img width="32px" src="https://lh3.googleusercontent.com/JmcxdQi-qOpctIvWKgPtrzZdJJK-J3sWE1RsfjZNwshCFgE_9fULcNpuXYTilIR2hjwN" alt="Google Cloud Colab Enterprise logo"><br> Open in Colab Enterprise
    </a>
  </td>
  <td style="text-align: center">
    <a href="https://console.cloud.google.com/vertex-ai/workbench/deploy-notebook?download_url=https://raw.githubusercontent.com/GoogleCloudPlatform/generative-ai/main/gemini/use-cases/spatial-understanding/spatial_understanding.ipynb">
      <img src="https://www.gstatic.com/images/branding/gcpiconscolors/vertexai/v1/32px.svg" alt="Vertex AI logo"><br> Open in Vertex AI Workbench
    </a>
  </td>
  <td style="text-align: center">
    <a href="https://github.com/GoogleCloudPlatform/generative-ai/blob/main/gemini/use-cases/spatial-understanding/spatial_understanding.ipynb">
      <img width="32px" src="https://upload.wikimedia.org/wikipedia/commons/9/91/Octicons-mark-github.svg" alt="GitHub logo"><br> View on GitHub
    </a>
  </td>
</table>

<div style="clear: both;"></div>


| | |
|-|-|
| Author(s) |  [Guillaume Vernade](https://github.com/Giom-V) [Holt Skinner](https://github.com/holtskinner) |

## Overview

This notebook introduces object detection and spatial understanding with the Gemini API in Vertex AI.


**YouTube Video: Building with Gemini 2.0: Spatial understanding**

<a href="https://www.youtube.com/watch?v=-XmoDzDMqj4" target="_blank">
  <img src="https://img.youtube.com/vi/-XmoDzDMqj4/maxresdefault.jpg" alt="Building with Gemini 2.0: Spatial understanding" width="500">
</a>


You'll learn how to use Gemini to perform object detection like this:

<img src="https://storage.googleapis.com/generativeai-downloads/images/cupcakes_with_bbox.png" alt="Cupcakes with Bounding box" width="500">

The examples cover several uses of object detection:

* overlaying information on an image
* searching within an image
* translating and understanding content in multiple languages
* using Gemini's reasoning abilities

**Note**

There's no "magic prompt", so feel free to experiment. Each example's dropdown offers sample prompts, and you can also write your own prompts or try your own images.


## Get started

### Install Google Gen AI SDK


In [None]:
%pip install --upgrade --quiet google-genai

### Restart runtime

To use the newly installed packages in this Jupyter runtime, you must restart the runtime. You can do this by running the cell below, which restarts the current kernel.

The restart might take a minute or longer. After it's restarted, continue to the next step.

In [None]:
import IPython

app = IPython.Application.instance()
app.kernel.do_shutdown(True)

<div class="alert alert-block alert-warning">
<b>⚠️ The kernel is going to restart. In Colab or Colab Enterprise, you might see an error message that says "Your session crashed for an unknown reason." This is expected. Wait until it's finished before continuing to the next step. ⚠️</b>
</div>


### Authenticate your notebook environment (Colab only)

If you're running this notebook on Google Colab, run the cell below to authenticate your environment.

In [None]:
import sys

if "google.colab" in sys.modules:
    from google.colab import auth

    auth.authenticate_user()

### Set Google Cloud project information and create client

To get started using Vertex AI, you must have an existing Google Cloud project and [enable the Vertex AI API](https://console.cloud.google.com/flows/enableapi?apiid=aiplatform.googleapis.com).

Learn more about [setting up a project and a development environment](https://cloud.google.com/vertex-ai/docs/start/cloud-environment).

In [2]:
import os

PROJECT_ID = "[your-project-id]"  # @param {type: "string", placeholder: "[your-project-id]", isTemplate: true}
if not PROJECT_ID or PROJECT_ID == "[your-project-id]":
    PROJECT_ID = str(os.environ.get("GOOGLE_CLOUD_PROJECT"))

LOCATION = os.environ.get("GOOGLE_CLOUD_REGION", "us-central1")
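
One caveat with the fallback above: `str(os.environ.get(...))` produces the literal string `"None"` when the variable is unset, which only surfaces later as a confusing API error. A stricter variant might look like the sketch below. `resolve_project_id` and the `DEMO_PROJECT_VAR` variable are hypothetical names used only for illustration; they are not part of any SDK.

```python
import os


def resolve_project_id(configured: str, env_var: str = "GOOGLE_CLOUD_PROJECT") -> str:
    """Return the configured project ID, falling back to an environment variable.

    Hypothetical helper: raises instead of silently returning the string "None".
    """
    placeholder = "[your-project-id]"
    if configured and configured != placeholder:
        return configured
    from_env = os.environ.get(env_var)
    if not from_env:
        raise ValueError(f"Set PROJECT_ID in the notebook or export {env_var}.")
    return from_env


# Demonstration with a throwaway environment variable (illustrative only).
os.environ["DEMO_PROJECT_VAR"] = "demo-project"
print(resolve_project_id("[your-project-id]", env_var="DEMO_PROJECT_VAR"))
print(resolve_project_id("my-project"))
```

Failing early with a clear message is a design choice; keeping the original fallback is fine when the environment variable is always set, as it is in most managed notebook environments.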

In [3]:
from google import genai

client = genai.Client(vertexai=True, project=PROJECT_ID, location=LOCATION)

### Import libraries

In [15]:
from PIL import Image, ImageColor, ImageDraw
from google.genai.types import GenerateContentConfig, Part, SafetySetting
from pydantic import BaseModel
import requests

### Load model

Spatial understanding works best with the [Gemini 2.0 Flash model](https://cloud.google.com/vertex-ai/generative-ai/docs/gemini-v2).

For more information about all AI models and APIs on Vertex AI, see [Google Models](https://cloud.google.com/vertex-ai/generative-ai/docs/learn/models#gemini-models) and [Model Garden](https://cloud.google.com/vertex-ai/generative-ai/docs/model-garden/explore-models).

In [5]:
MODEL_ID = "gemini-2.0-flash-001"  # @param {type:"string", isTemplate: true}

We'll set the configuration to include a system instruction, safety settings, and a Pydantic class for controlled generation (structured JSON output).

The system instruction keeps the prompts short, since the output format doesn't have to be repeated each time. It also tells the model how to handle similar objects by giving each one a distinct label, which leaves it room to be creative.


In [52]:
class BoundingBox(BaseModel):
    box_2d: list[int]
    label: str


config = GenerateContentConfig(
    system_instruction="""Return bounding boxes as an array with labels. Never return masks. Limit to 25 objects.
    If an object is present multiple times, give each object a unique label according to its distinct characteristics (colors, size, position, etc..).""",
    temperature=0.5,
    safety_settings=[
        SafetySetting(
            category="HARM_CATEGORY_DANGEROUS_CONTENT",
            threshold="BLOCK_ONLY_HIGH",
        ),
    ],
    response_mime_type="application/json",
    response_schema=list[BoundingBox],
)
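
With `response_schema=list[BoundingBox]`, the response text is expected to be a JSON array matching the schema, and the notebook later reads the validated objects from `response.parsed`. To make that shape concrete, here is a minimal offline sketch that validates a hand-written JSON string with Pydantic's `TypeAdapter`. The sample values are invented for illustration and are not real model output.

```python
from pydantic import BaseModel, TypeAdapter


class BoundingBox(BaseModel):
    box_2d: list[int]  # [y1, x1, y2, x2], normalized to a 0-1000 scale
    label: str


# Hypothetical response text, written by hand to show the shape of the schema.
sample_json = """
[
  {"box_2d": [380, 60, 540, 210], "label": "red sprinkles"},
  {"box_2d": [100, 450, 260, 600], "label": "pink frosting"}
]
"""

# Validate the JSON array into a list of BoundingBox objects.
boxes = TypeAdapter(list[BoundingBox]).validate_json(sample_json)
for box in boxes:
    print(box.label, box.box_2d)
```

Validation raises a `pydantic.ValidationError` if an entry is missing a field or has the wrong type, which can be a convenient way to catch malformed output early.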

### Helper functions

Create a helper function that draws the bounding boxes onto an image.

In [61]:
def plot_bounding_boxes(im: Image.Image, bounding_boxes: list[BoundingBox]) -> None:
    """
    Plots labeled bounding boxes on an image using PIL, with a different color for each box.

    Args:
        im: The PIL image to draw on (modified in place).
        bounding_boxes: A list of bounding boxes, each containing the label of the object
         and its position in [y1, x1, y2, x2] format, normalized to a 0-1000 scale.
    """

    # Draw directly on the image that was passed in
    img = im
    width, height = img.size
    print(img.size)
    # Create a drawing object
    draw = ImageDraw.Draw(img)

    # Define a list of colors
    colors = [
        "red",
        "green",
        "blue",
        "yellow",
        "orange",
        "pink",
        "purple",
        "brown",
        "gray",
        "beige",
        "turquoise",
        "cyan",
        "magenta",
        "lime",
        "navy",
        "maroon",
        "teal",
        "olive",
        "coral",
        "lavender",
        "violet",
        "gold",
        "silver",
    ] + [color_name for (color_name, _) in ImageColor.colormap.items()]

    # Iterate over the bounding boxes
    for i, bounding_box in enumerate(bounding_boxes):
        # Select a color from the list
        color = colors[i % len(colors)]

        # Convert normalized coordinates to absolute coordinates
        abs_y1 = int(bounding_box.box_2d[0] / 1000 * height)
        abs_x1 = int(bounding_box.box_2d[1] / 1000 * width)
        abs_y2 = int(bounding_box.box_2d[2] / 1000 * height)
        abs_x2 = int(bounding_box.box_2d[3] / 1000 * width)

        if abs_x1 > abs_x2:
            abs_x1, abs_x2 = abs_x2, abs_x1

        if abs_y1 > abs_y2:
            abs_y1, abs_y2 = abs_y2, abs_y1

        # Draw the bounding box
        draw.rectangle(((abs_x1, abs_y1), (abs_x2, abs_y2)), outline=color, width=4)

        # Draw the text
        label = getattr(bounding_box, "label", None)
        if label:
            draw.text((abs_x1 + 8, abs_y1 + 6), label, fill=color, size=16)

    # Display the image
    img.show()
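
The coordinate conversion is the step most worth understanding: the model returns `[y1, x1, y2, x2]` on a 0-1000 scale, while PIL expects pixel coordinates with x first. The sketch below isolates that step in a hypothetical helper, `to_pixel_box`. As a deliberate variation, it uses integer arithmetic instead of the float division in the helper above, which avoids floating-point rounding surprises.

```python
def to_pixel_box(
    box_2d: list[int], width: int, height: int
) -> tuple[int, int, int, int]:
    """Convert a normalized [y1, x1, y2, x2] box (0-1000) to pixel (left, top, right, bottom)."""
    y1, x1, y2, x2 = box_2d
    # Scale each coordinate by the matching image dimension, flooring the result.
    # sorted() guards against boxes whose corners arrive in swapped order.
    left, right = sorted((x1 * width // 1000, x2 * width // 1000))
    top, bottom = sorted((y1 * height // 1000, y2 * height // 1000))
    return left, top, right, bottom


# A hypothetical box on a 640x480 image.
print(to_pixel_box([100, 200, 500, 600], width=640, height=480))  # (128, 48, 384, 240)
```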

### Overlaying Information

Let's start by loading an image of cupcakes.

<img src="https://storage.googleapis.com/generativeai-downloads/images/Cupcakes.jpg" alt="Cupcakes" width="500">

Begin with a simple prompt that finds every cupcake in the image and labels it by its topping.

In [None]:
image = "https://storage.googleapis.com/generativeai-downloads/images/Cupcakes.jpg"
prompt = "Detect the 2d bounding boxes of the cupcakes (with `label` as topping description)"  # @param {type:"string"}

response = client.models.generate_content(
    model=MODEL_ID,
    contents=[
        prompt,
        Part.from_uri(
            image,
            mime_type="image/jpeg",
        ),
    ],
    config=config,
)

print(response.text)

In [None]:
im = Image.open(requests.get(image, stream=True).raw)

plot_bounding_boxes(im, response.parsed)

### Search within an image

Let's complicate things and search within the image for specific objects.

In [None]:
image = "https://storage.googleapis.com/generativeai-downloads/images/socks.jpg"
prompt = "Show me the positions of the socks with a face. Label according to position in the image."  # @param ["Detect all rainbow socks", "Find all socks and label them with emojis ", "Show me the positions of the socks with the face","Find the sock that goes with the one at the top"] {"allow-input":true}

response = client.models.generate_content(
    model=MODEL_ID,
    contents=[
        prompt,
        Part.from_uri(
            image,
            mime_type="image/jpeg",
        ),
    ],
    config=config,
)

print(response.text)

In [None]:
im = Image.open(requests.get(image, stream=True).raw)
plot_bounding_boxes(im, response.parsed)
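
Once a search has located an object, the same coordinates can be used for more than drawing. As one possible follow-up, the sketch below crops a detected region out of the image. `crop_detection` is a hypothetical helper, and the blank image and box are stand-ins; with real data you would pass `im` and a `bounding_box.box_2d` from `response.parsed`.

```python
from PIL import Image


def crop_detection(im: Image.Image, box_2d: list[int]) -> Image.Image:
    """Crop the region described by a normalized [y1, x1, y2, x2] box (0-1000 scale)."""
    width, height = im.size
    y1, x1, y2, x2 = box_2d
    left, right = sorted((x1 * width // 1000, x2 * width // 1000))
    top, bottom = sorted((y1 * height // 1000, y2 * height // 1000))
    # PIL's crop takes a (left, upper, right, lower) tuple in pixels.
    return im.crop((left, top, right, bottom))


# Stand-in image and box, used only to show the call.
demo = Image.new("RGB", (1000, 500), "white")
crop = crop_detection(demo, [200, 100, 600, 400])
print(crop.size)  # (300, 200)
```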

### Use Gemini reasoning capabilities

The model can also reason about the image. You can ask it about the positions of items, their utility, or, as in this example, to find the shadow of a specific item.

In [None]:
image = "https://storage.googleapis.com/generativeai-downloads/images/origamis.jpg"
prompt = "Draw a square around the fox' shadow"  # @param ["Find the two origami animals.", "Where are the origamis' shadows?","Draw a square around the fox' shadow"] {"allow-input":true}

response = client.models.generate_content(
    model=MODEL_ID,
    contents=[
        prompt,
        Part.from_uri(
            image,
            mime_type="image/jpeg",
        ),
    ],
    config=config,
)

print(response.text)

In [None]:
im = Image.open(requests.get(image, stream=True).raw)
plot_bounding_boxes(im, response.parsed)

You can also use Gemini's knowledge to enhance the returned labels. In this example, Gemini gives you advice on how to fix a little mistake.

Resizing the image (for example, to 1024px) can help the model take in the whole scene before giving advice. Note that the cell below passes the original image by URL without resizing it. There's no clear rule about when to resize, so experiment and find what works best for you.
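
If you want to try resizing, one way to do it is sketched below with PIL. `resize_to_jpeg_bytes` is a hypothetical helper, and the blank image is a stand-in for a downloaded photo; the 1024px limit simply follows the figure mentioned above.

```python
from io import BytesIO

from PIL import Image


def resize_to_jpeg_bytes(im: Image.Image, max_side: int = 1024) -> bytes:
    """Downscale so the longest side is at most `max_side`, then encode as JPEG."""
    resized = im.copy()  # thumbnail() works in place, so keep the original intact
    resized.thumbnail((max_side, max_side))  # preserves aspect ratio, never upscales
    buffer = BytesIO()
    resized.convert("RGB").save(buffer, format="JPEG")
    return buffer.getvalue()


# Stand-in for a real photo.
demo = Image.new("RGB", (2048, 1024), "white")
data = resize_to_jpeg_bytes(demo)
print(Image.open(BytesIO(data)).size)  # (1024, 512)
```

The resulting bytes could then take the place of the `Part.from_uri(...)` entry in `contents`, for example via `Part.from_bytes(data=data, mime_type="image/jpeg")`. Because bounding boxes are returned on a normalized 0-1000 scale, they can still be drawn on the original full-size image.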

In [None]:
image = "https://storage.googleapis.com/generativeai-downloads/images/spill.jpg"
prompt = "Tell me how to clean my table with an explanation as label"  # @param ["Show me where my coffee was spilled.", "Tell me how to clean my table with an explanation as label"] {"allow-input":true}

response = client.models.generate_content(
    model=MODEL_ID,
    contents=[
        prompt,
        Part.from_uri(
            image,
            mime_type="image/jpeg",
        ),
    ],
    config=config,
)

print(response.text)

In [None]:
im = Image.open(requests.get(image, stream=True).raw)
plot_bounding_boxes(im, response.parsed)

### Try with more images

Here are some more sample images to try prompting with Gemini.

- https://storage.googleapis.com/generativeai-downloads/images/vegetables.jpg
- https://storage.googleapis.com/generativeai-downloads/images/Japanese_Bento.png
- https://storage.googleapis.com/generativeai-downloads/images/fruits.jpg
- https://storage.googleapis.com/generativeai-downloads/images/cat.jpg
- https://storage.googleapis.com/generativeai-downloads/images/pumpkins.jpg
- https://storage.googleapis.com/generativeai-downloads/images/breakfast.jpg
- https://storage.googleapis.com/generativeai-downloads/images/bookshelf.jpg