##### Copyright 2025 Google LLC

In [None]:
#@title Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Workshop: Build with Gemini (Part 2)

<a target="_blank" href="https://colab.sandbox.google.com/github/markmcd/gemini-workshop/blob/main/02-multimodal-capabilities.ipynb"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This workshop teaches how to build with Gemini using the Gemini API and Python SDK.

Course outline:

- **[Part 1: Quickstart + Text prompting](https://github.com/markmcd/gemini-workshop/blob/main/01-text-prompting.ipynb)**

- **Part 2 (this notebook): Multimodal capabilities (image, video, audio, docs, code, speech generation)**
  - Image
  - Audio
  - Video
  - Documents (PDFs)
  - Code
  - Text to Speech
  - Final exercise: Analyze supermarket invoice

- **[Part 3: Thinking models + agentic capabilities (tool usage)](https://github.com/markmcd/gemini-workshop/blob/main/03-thinking-and-tools.ipynb)**

## 0. Use Google AI Studio as a playground

Explore and play with all models in [Google AI Studio](https://aistudio.google.com/apikey).

## 1. Setup

Get a free API key in [Google AI Studio](https://aistudio.google.com/apikey) and set up the [Google Gen AI Python SDK](https://github.com/googleapis/python-genai).

In [None]:
%pip install -U -q google-genai

In [None]:
from google import genai
from google.genai import types
import os
import sys
IN_COLAB = 'google.colab' in sys.modules

if IN_COLAB:
    from google.colab import userdata
    GEMINI_API_KEY = userdata.get('GEMINI_API_KEY')
else:
    GEMINI_API_KEY = os.getenv('GEMINI_API_KEY')


client = genai.Client(api_key=GEMINI_API_KEY)

# MODEL = "gemini-2.0-flash"
# MODEL = "gemini-2.5-pro"
# MODEL = "gemini-2.5-flash-lite-preview-06-17"
MODEL = "gemini-2.5-flash"

## 2. Image understanding

Gemini models are able to process and understand images, e.g., you can use Gemini to describe, caption, and answer questions about images, and you can even use it for object detection.

In [None]:
!curl -o image.jpg "https://storage.googleapis.com/generativeai-downloads/images/Cupcakes.jpg"

In [None]:
from PIL import Image
image = Image.open("image.jpg")
print(image.size)
image

For total image payload size less than 20MB, we recommend either uploading base64 encoded images or directly uploading locally stored image files.

You can use a Pillow image in your prompt:

In [None]:
response = client.models.generate_content(
    model=MODEL,
    contents=["What is this image?", image])

print(response.text)

Or you can pass the image bytes inline (the SDK takes care of the base64 encoding for the request):

In [None]:
import requests

res = requests.get("https://storage.googleapis.com/generativeai-downloads/images/Cupcakes.jpg")

response = client.models.generate_content(
    model=MODEL,
    contents=[
        "What is this image?",
        types.Part.from_bytes(data=res.content, mime_type="image/jpeg")
    ]
)

print(response.text)

You can use the File API for large payloads (>20MB).

The File API lets you store up to 20 GB of files per project, with a per-file maximum size of 2 GB. Files are stored for 48 hours. They can be accessed in that period with your API key, but cannot be downloaded from the API. It is available at no cost in all regions where the Gemini API is available.
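
The size limits above can be turned into a small pre-flight check. This is a minimal sketch, not part of the SDK: `choose_upload_strategy` and `strategy_for_path` are hypothetical helpers, and reading "20MB" and "2 GB" as binary units is an assumption.

```python
import os

# Assumed thresholds, based on the limits quoted above.
# Interpreting MB/GB as binary units is an assumption of this sketch.
INLINE_LIMIT_BYTES = 20 * 1024 * 1024            # ~20 MB total request payload
FILE_API_LIMIT_BYTES = 2 * 1024 * 1024 * 1024    # 2 GB per file


def choose_upload_strategy(size_bytes: int) -> str:
    """Return 'inline' for small payloads and 'file_api' for large ones."""
    if size_bytes < 0:
        raise ValueError("size_bytes must be non-negative")
    if size_bytes > FILE_API_LIMIT_BYTES:
        raise ValueError("File exceeds the 2 GB per-file limit of the File API")
    if size_bytes < INLINE_LIMIT_BYTES:
        return "inline"
    return "file_api"


def strategy_for_path(path: str) -> str:
    """Look up the file size on disk and pick a strategy."""
    return choose_upload_strategy(os.path.getsize(path))


print(choose_upload_strategy(300 * 1024))         # inline
print(choose_upload_strategy(50 * 1024 * 1024))   # file_api
```

Keep in mind that the 20MB figure applies to the whole request, so several small images plus a long prompt can still add up.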

In [None]:
uploaded_image = client.files.upload(file="image.jpg")
print(uploaded_image)

response = client.models.generate_content(
    model=MODEL,
    contents=["What is this image?", uploaded_image]
)

print(response.text)

## **!! Exercise !!**  Multiple image understanding

TODO: Ask Gemini to compare the images and list the key differences.

In [None]:
image_url_1 = "https://plus.unsplash.com/premium_photo-1694819488591-a43907d1c5cc?fm=jpg&q=60&w=3000&ixlib=rb-4.1.0&ixid=M3wxMjA3fDB8MHxzZWFyY2h8MXx8Y3V0ZSUyMGRvZ3xlbnwwfHwwfHx8MA%3D%3D" # Dog
image_url_2 = "https://images.pexels.com/photos/2071882/pexels-photo-2071882.jpeg?auto=compress&cs=tinysrgb&dpr=1&w=500" # Cat

image_response_req_1 = requests.get(image_url_1)
image_response_req_2 = requests.get(image_url_2)

response = client.models.generate_content(
    model=MODEL,
    # TODO: Ask Gemini to compare the images and list the key differences
)

print(response.text)

## 3. Bounding box detection

Gemini models are trained to return bounding box coordinates.

**Important**: Gemini returns bounding box coordinates in this format:

- `[y_min, x_min, y_max, x_max]`
- and normalized to `[0,1000]`
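
To make the coordinate format concrete, here is a minimal sketch of the conversion to pixel coordinates. `denormalize_box` is a hypothetical helper, and the box and image size are made-up values chosen for easy arithmetic.

```python
def denormalize_box(box_2d, width, height):
    """Convert [y_min, x_min, y_max, x_max] on a 0-1000 scale
    to pixel coordinates in (x_min, y_min, x_max, y_max) order."""
    y_min, x_min, y_max, x_max = box_2d
    return (
        int(x_min / 1000 * width),
        int(y_min / 1000 * height),
        int(x_max / 1000 * width),
        int(y_max / 1000 * height),
    )


# Made-up box on a made-up 1600x1200 image
print(denormalize_box([250, 125, 750, 500], width=1600, height=1200))
# (200, 300, 800, 900)
```

Note the swap: the model gives `y` first, while most drawing APIs (including Pillow) expect `x` first.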

**Tip**: Ask Gemini to return JSON format and configure `config={'response_mime_type': 'application/json'}`:

In [None]:
import json

prompt = """Detect the 2d bounding boxes of all cupcakes. The label should be the topping of the cupcake.
Return JSON format."""

response = client.models.generate_content(
    model=MODEL,
    contents=[prompt, image],
    config={'response_mime_type': 'application/json'}
)

bboxes = json.loads(response.text)
bboxes

Create a helper function to denormalize and draw the bounding boxes:


In [None]:
from PIL import ImageDraw, ImageFont

line_width = 4
font = ImageFont.load_default(size=16)

def draw_bounding_boxes(image, bounding_boxes):
    img = image.copy()
    width, height = img.size

    draw = ImageDraw.Draw(img)

    colors = ['blue','red','green','yellow','orange','pink','purple']
    # Build the label list from the boxes passed in, so each label maps to one color
    labels = sorted(set(box['label'] for box in bounding_boxes))

    for box in bounding_boxes:
        y_min, x_min, y_max, x_max = box['box_2d']
        label = box['label']

        # Convert normalized coordinates to absolute coordinates
        y_min = int(y_min/1000 * height)
        x_min = int(x_min/1000 * width)
        y_max = int(y_max/1000 * height)
        x_max = int(x_max/1000 * width)

        color = colors[labels.index(label) % len(colors)]
        draw.rectangle([(x_min, y_min), (x_max, y_max)], outline=color, width=line_width)

        draw.text((x_min+line_width, y_min), label, fill=color, font=font)

    display(img)

draw_bounding_boxes(image, bboxes)

## 4. Audio

You can use Gemini to process audio files. For example, you can use it to generate a transcript of an audio file or to summarize the content of an audio file.

Gemini represents each second of audio as 32 tokens; for example, one minute of audio is represented as 1,920 tokens.
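
The token rate above makes it easy to estimate how much of the context window an audio file will use. This is a rough sketch: `estimate_audio_tokens` is a hypothetical helper, and rounding up partial seconds is an assumption.

```python
import math

TOKENS_PER_SECOND = 32  # rate quoted above


def estimate_audio_tokens(duration_seconds: float) -> int:
    """Estimate input tokens for an audio clip (partial seconds rounded up)."""
    return math.ceil(duration_seconds * TOKENS_PER_SECOND)


print(estimate_audio_tokens(60))        # 1920, one minute
print(estimate_audio_tokens(45 * 60))   # 86400, a 45-minute episode
```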

For more info about technical details and supported formats, see [the docs](https://ai.google.dev/gemini-api/docs/audio#supported-formats).

In [None]:
import requests
url = 'https://raw.githubusercontent.com/markmcd/gemini-workshop/main/data/audio.mp3'
res = requests.get(url)
with open("audio.mp3", "wb") as f:
    f.write(res.content)

In [None]:
import IPython
IPython.display.Audio("audio.mp3")

In [None]:
audio_file = client.files.upload(file="audio.mp3")

prompt = """Generate a transcript of the episode. Include timestamps and identify speakers.

Speakers:
- John

eg:
[00:00] Brady: Hello there.
[00:02] Tim: Hi Brady.

It is important to include the correct speaker names. Use the names you identified earlier. If you really don't know the speaker's name, identify them with a letter of the alphabet, eg there may be an unknown speaker 'A' and another unknown speaker 'B'.

If there is music or a short jingle playing, signify like so:
[01:02] [MUSIC] or [01:02] [JINGLE]

If you can identify the name of the music or jingle playing then use that instead, eg:
[01:02] [Firework by Katy Perry] or [01:02] [The Sofa Shop jingle]

If there is some other sound playing try to identify the sound, eg:
[01:02] [Bell ringing]

Each individual caption should be quite short, a few short sentences at most.

Signify the end of the episode with [END].
"""

response = client.models.generate_content(
    model=MODEL,
    contents=[prompt, audio_file]
)
print(response.text)
     

## 5. Video

Gemini models are able to process videos. The 1M context window supports up to approximately an hour of video data.

For technical details about supported video formats, see [the docs](https://ai.google.dev/gemini-api/docs/vision#technical-details-video).

In [None]:
!curl -o Post_its.mp4 "https://storage.googleapis.com/generativeai-downloads/videos/post_its.mp4"

Use the File API to upload a video. Here we also check the processing state:

In [None]:
import time

def upload_video(video_file_name):
    video_file = client.files.upload(file=video_file_name)

    # Poll the File API until the video has finished processing
    while video_file.state == "PROCESSING":
        print('Waiting for video to be processed.')
        time.sleep(10)
        video_file = client.files.get(name=video_file.name)

    if video_file.state == "FAILED":
        raise ValueError(video_file.state)

    print(f'Video processing complete: {video_file.uri}')
    return video_file

post_its_video = upload_video('Post_its.mp4')
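
The polling loop above waits indefinitely if a file never leaves the processing state. Below is a minimal sketch of the same pattern with a timeout. `wait_until_processed` is a hypothetical helper, not part of the SDK, and the fake state sequence stands in for real File API responses so the example runs without an API key.

```python
import time


def wait_until_processed(get_state, poll_seconds=10.0, timeout_seconds=600.0,
                         sleep=time.sleep):
    """Poll get_state() until it stops returning 'PROCESSING' or the timeout expires."""
    waited = 0.0
    state = get_state()
    while state == "PROCESSING":
        if waited >= timeout_seconds:
            raise TimeoutError(f"Still processing after {waited:.0f} seconds")
        sleep(poll_seconds)
        waited += poll_seconds
        state = get_state()
    if state == "FAILED":
        raise ValueError("Processing failed")
    return state


# Stand-in for the File API: reports PROCESSING twice, then ACTIVE.
fake_states = iter(["PROCESSING", "PROCESSING", "ACTIVE"])
final_state = wait_until_processed(lambda: next(fake_states),
                                   sleep=lambda seconds: None)
print(final_state)  # ACTIVE
```

With the real client, `get_state` could wrap the `client.files.get(name=...)` call from the cell above and return its `state`.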

Now you can use the uploaded file in your prompt:

In [None]:
response = client.models.generate_content(
    model=MODEL,
    contents=[
        post_its_video,
        'Detect all sticky notes and list the names on the notes',
    ]
)

print(response.text)

#### YouTube video support

The Gemini API and AI Studio support YouTube URLs as a file data Part. You can include a YouTube URL with a prompt asking the model to summarize, translate, or otherwise interact with the video content.

In [None]:
youtube_url = "https://youtu.be/LlWDx0LSDok"

response = client.models.generate_content(
    model=MODEL,
    contents=[
        'Can you summarize this video?',
        types.Part(file_data=types.FileData(file_uri=youtube_url))
    ]
)

print(response.text)

#### **!! Exercise !!**

- Your turn! Use this video (*If I could only cook one dish for a vegan skeptic* from Rainbow Plant Life): https://youtu.be/BHRyfEbhFFU
- Ask Gemini to describe the video and to extract the recipe

In [None]:
youtube_url = "https://youtu.be/BHRyfEbhFFU"

response = client.models.generate_content(
    model=MODEL,
    # TODO: ask Gemini to generate the recipe from the youtube video
)

print(response.text)

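As a rough rule of thumb, 1 minute of speech is about 130 words, or about 170 output tokens. Assuming an output limit of 8,192 tokens, that gives 8192 / 170 ≈ 48 minutes of audio per transcript.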

You can use Gemini for transcribing, but be aware of the output token limit.
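
A short sketch of that back-of-the-envelope calculation, useful for deciding how many requests a long recording needs. The constants are the rough figures from this section, and `max_transcript_minutes` and `chunks_needed` are hypothetical helpers.

```python
import math

TOKENS_PER_MINUTE = 170     # rough figure from this section
OUTPUT_TOKEN_LIMIT = 8192   # assumed output limit; check the limit of your model


def max_transcript_minutes(output_token_limit=OUTPUT_TOKEN_LIMIT,
                           tokens_per_minute=TOKENS_PER_MINUTE):
    """Whole minutes of speech that fit into one transcript response."""
    return output_token_limit // tokens_per_minute


def chunks_needed(total_minutes):
    """Number of separate requests needed to transcribe a recording."""
    return math.ceil(total_minutes / max_transcript_minutes())


print(max_transcript_minutes())  # 48
print(chunks_needed(120))        # 3
```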

Other useful prompts you can try with audio files:
- Summarize the audio
- Refer to timestamps: `Provide a transcript of the speech from 02:30 to 03:29.`
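
If the transcript follows the `[MM:SS] Speaker: text` layout requested in the audio prompt earlier, it can be parsed into structured entries for filtering by time. This is a minimal sketch: `parse_transcript` is a hypothetical helper, and it assumes the model actually kept to that layout, which is not guaranteed.

```python
import re

# [MM:SS], then an optional "Speaker:" prefix, then the caption text
LINE_PATTERN = re.compile(r"^\[(\d{2,}):(\d{2})\]\s*(?:([^:\[\]]+):\s*)?(.*)$")


def parse_transcript(text):
    """Parse timestamped transcript lines; lines that do not match are skipped."""
    entries = []
    for line in text.splitlines():
        match = LINE_PATTERN.match(line.strip())
        if not match:
            continue
        minutes, seconds, speaker, content = match.groups()
        entries.append({
            "seconds": int(minutes) * 60 + int(seconds),
            "speaker": speaker.strip() if speaker else None,
            "text": content.strip(),
        })
    return entries


sample = """[00:00] Brady: Hello there.
[00:02] Tim: Hi Brady.
[01:02] [MUSIC]
[END]"""

entries = parse_transcript(sample)
print(len(entries))   # 3
print(entries[2])     # {'seconds': 62, 'speaker': None, 'text': '[MUSIC]'}
```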

## 6. PDFs

PDFs can also be used in the same way:

In [None]:
URL = "https://storage.googleapis.com/generativeai-downloads/data/pdf_structured_outputs/invoice.pdf"
!curl -o invoice.pdf $URL

In [None]:
uploaded_pdf = client.files.upload(file='invoice.pdf')

response = client.models.generate_content(
  model=MODEL,
  contents=[
    'Extract the date of the invoice and the total cost',
    uploaded_pdf,
  ]
)

print(response.text)

**Next step**: A cool feature I recommend is to combine it with structured outputs using Pydantic.

In [None]:
from pydantic import BaseModel, Field

class Item(BaseModel):
    description: str = Field(description="The description of the item")
    quantity: float = Field(description="The Qty of the item")
    gross_worth: float = Field(description="The gross worth of the item")

class Invoice(BaseModel):
    """Extract the invoice number, date and all list items with description, quantity and gross worth and the total gross worth."""
    invoice_number: str = Field(description="The invoice number e.g. 1234567890")
    date: str = Field(description="The date of the invoice e.g. 2024-01-01")
    items: list[Item] = Field(description="The list of items with description, quantity and gross worth")
    total_gross_worth: float = Field(description="The total gross worth of the invoice")


prompt = "Extract the structured data from the following PDF file"
response = client.models.generate_content(
    model=MODEL,
    contents=[prompt, uploaded_pdf],
    config={'response_mime_type': 'application/json',
            'response_schema': Invoice
    }
)

response.parsed

In [None]:
response.parsed.model_dump()
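
Structured output guarantees the shape of the data, not that the extracted numbers are right, so a quick consistency check can help. This is a minimal sketch on a made-up dictionary shaped like the `Invoice.model_dump()` output above. `total_matches_items` is a hypothetical helper, and treating each `gross_worth` as a line total is an assumption.

```python
import math


def total_matches_items(invoice: dict, tolerance: float = 0.01) -> bool:
    """Check that the item gross worths add up to the invoice total."""
    items_sum = sum(item["gross_worth"] for item in invoice["items"])
    return math.isclose(items_sum, invoice["total_gross_worth"], abs_tol=tolerance)


# Made-up values, only for illustration
sample_invoice = {
    "invoice_number": "0000",
    "date": "2024-01-01",
    "items": [
        {"description": "Widget", "quantity": 2.0, "gross_worth": 19.98},
        {"description": "Gadget", "quantity": 1.0, "gross_worth": 5.02},
    ],
    "total_gross_worth": 25.00,
}

print(total_matches_items(sample_invoice))  # True
```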

## 7. Code

Gemini is good at understanding and generating code.

Let's use [gitingest](https://github.com/cyclotruc/gitingest) to chat with a GitHub repo:

In [None]:
%pip install gitingest

In [None]:
from gitingest import ingest_async

summary, tree, content = await ingest_async("https://github.com/patrickloeber/snake-ai-pytorch")

In [None]:
print(summary)

In [None]:
print(tree)

In [None]:
prompt = f"""Explain what the model.py file in this code base does:

Code:
{content}
"""

chat = client.chats.create(model=MODEL)

response = chat.send_message(prompt)
print(response.text)

In [None]:
response = chat.send_message("Explain the `save` function in more detail")
print(response.text)

In [None]:
response = chat.send_message("Refactor the `save` function and use pathlib instead of os. Return only the refactored function")
print(response.text)

## 8. Text to Speech

In [None]:
import wave

response = client.models.generate_content(
   model="gemini-2.5-flash-preview-tts",
   contents="Say cheerfully: Have a wonderful day!",
   config=types.GenerateContentConfig(
      response_modalities=["AUDIO"],
      speech_config=types.SpeechConfig(
         voice_config=types.VoiceConfig(
            prebuilt_voice_config=types.PrebuiltVoiceConfig(
               voice_name='Kore',
            )
         )
      ),
   )
)

data = response.candidates[0].content.parts[0].inline_data.data

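# The response contains raw PCM audio; wrap it in a WAV container
# (mono, 16-bit samples, 24 kHz) so it can be played back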
with wave.open("out.wav", "wb") as wf:
    wf.setnchannels(1)
    wf.setsampwidth(2)
    wf.setframerate(24000)
    wf.writeframes(data)

In [None]:
import IPython
IPython.display.Audio("out.wav")
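
The WAV parameters used above also tell you how long a clip is. Here is a minimal sketch that wraps raw PCM bytes in memory and computes the duration. It assumes the same format as the cell above (mono, 16-bit, 24 kHz), uses one second of silence as stand-in data, and `pcm_to_wav_bytes` and `pcm_duration_seconds` are hypothetical helpers.

```python
import io
import wave

SAMPLE_RATE = 24000   # frames per second, as in the cell above
SAMPLE_WIDTH = 2      # bytes per sample (16-bit)
CHANNELS = 1          # mono


def pcm_to_wav_bytes(pcm: bytes) -> bytes:
    """Wrap raw PCM data in a WAV container held in memory."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wf:
        wf.setnchannels(CHANNELS)
        wf.setsampwidth(SAMPLE_WIDTH)
        wf.setframerate(SAMPLE_RATE)
        wf.writeframes(pcm)
    return buffer.getvalue()


def pcm_duration_seconds(pcm: bytes) -> float:
    """Duration of raw PCM data in seconds."""
    return len(pcm) / (SAMPLE_RATE * SAMPLE_WIDTH * CHANNELS)


silence = b"\x00\x00" * SAMPLE_RATE   # stand-in data: one second of silence
wav_bytes = pcm_to_wav_bytes(silence)
print(pcm_duration_seconds(silence))  # 1.0
```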

## Exercise: Analyze supermarket invoice

Task:
- Define a schema for a single item that contains `item_name` and `item_cost`
- Define a schema for the supermarket invoice with `items`, `date`, and `total_cost`
- Use Gemini to extract all info from the supermarket bill into the defined supermarket invoice schema.
- Ask Gemini to list a few healthy recipes based on the items. If you have dietary restrictions, tell Gemini about it!

In [None]:
import requests
url = 'https://raw.githubusercontent.com/markmcd/gemini-workshop/main/data/rewe_invoice.pdf'
res = requests.get(url)
with open("rewe_invoice.pdf", "wb") as f:
    f.write(res.content)

In [None]:
# TODO: upload the PDF

In [None]:
# TODO: Define schemas

class SupermarketItem(BaseModel):
    ...

class SupermarketInvoice(BaseModel):
    items: list[SupermarketItem] = Field(description="The list of items")
    ...


prompt = "Extract the structured data from the following PDF file"
response = client.models.generate_content(
    model=MODEL,
    contents=[...],
    config={'response_mime_type': 'application/json',
            'response_schema': ...
    }
)

response.parsed

In [None]:
response.parsed.model_dump()

Now you can ask follow-up questions using the extracted info:

In [None]:
prompt = ... # TODO: ask Gemini to list a few healthy recipes based on the items.
response = client.models.generate_content(
    model=MODEL,
    contents=[prompt],
)

print(response.text)

## Recap & Next steps

Great job, you're now an expert in working with multimodal data :)

Gemini's multimodal capabilities are powerful, and with the Python SDK you only need a few lines of code to process various media types, including text, audio, images, videos, and PDFs.

Key Takeaways:
- Use `client.files.upload` for larger payloads
- Directly include smaller files in your prompt with e.g. `types.Part.from_bytes(data=res.content, mime_type="image/jpeg")`
- For many use cases, it's helpful to constrain Gemini to respond with JSON using structured outputs.
- Use detailed prompts for generating transcripts
- Gemini can generate speech

More helpful resources:

- [Audio understanding docs](https://ai.google.dev/gemini-api/docs/audio?lang=python)
- [Vision understanding docs](https://ai.google.dev/gemini-api/docs/vision?lang=python)
- [Structured output docs](https://ai.google.dev/gemini-api/docs/structured-output?lang=python)
- [Speech generation docs](https://ai.google.dev/gemini-api/docs/speech-generation)
- [Video understanding cookbook](https://github.com/google-gemini/cookbook/blob/main/quickstarts/Video_understanding.ipynb)

Next steps:

- **[Part 3: Thinking models + agentic capabilities (tool usage)](./03-thinking-and-tools.ipynb)**
