## Multimodal Understanding
Nova 2 models provide out-of-the-box support for understanding multiple modalities: text, images, video, documents, and audio. They can interpret complex document layouts, tables, charts, and multi-page documents, and extract information more accurately from PDFs, spreadsheets, and other document formats. This notebook walks you through the high-level concepts. Each of these capabilities deserves a deep dive; follow our GitHub repo for future cookbooks.

In [None]:
# Run this cell to install the required packages if you haven't already done so.
%pip install -r requirements.txt

In [None]:
# Execute this cell to restart kernel if you executed the above cell.
from IPython.display import display_html
display_html("<script>Jupyter.notebook.kernel.restart()</script>",raw=True)

In [1]:
# Create OpenAI Client with Nova API
from openai import OpenAI
import os
from dotenv import load_dotenv

# Import json for pretty printing messages
import json

# Load variables from a local .env file (if present), then read the API key
load_dotenv()
api_key = os.getenv("NOVA_API_KEY")
base_url = "https://api.nova.amazon.com/v1"

# Create OpenAI client
client = OpenAI(api_key=api_key, base_url=base_url)

# Configure model_id
model_id = 'nova-2-lite-v1' # You can change this to any other Nova model available to you

## Image Analysis
Nova can receive images either as a URI or as base64-encoded data. Both approaches are shown below.

Images must adhere to the following constraints:
- Formats: PNG, JPEG, WEBP, GIF
- Size: Up to 25 MB per request
- Limit: Up to 10 images per request
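
The limits above can be checked locally before a request is sent. The following is a minimal sketch, not part of the Nova API: the helper name `check_image_batch`, the mapping from formats to file extensions, and the reading of "25 MB" as 25 MiB of raw file size are all assumptions made for illustration.

```python
from pathlib import Path

# Limits copied from the list above. The extension set and the MiB
# interpretation of "25 MB" are assumptions, not documented behavior.
ALLOWED_IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg", ".webp", ".gif"}
MAX_REQUEST_BYTES = 25 * 1024 * 1024
MAX_IMAGES_PER_REQUEST = 10


def check_image_batch(files):
    """Check (filename, size_in_bytes) pairs against the image limits.

    Returns a list of human-readable problems; an empty list means
    the batch passes these local checks.
    """
    files = list(files)
    problems = []

    if len(files) > MAX_IMAGES_PER_REQUEST:
        problems.append(
            f"{len(files)} images exceeds the limit of {MAX_IMAGES_PER_REQUEST}"
        )

    for name, _ in files:
        suffix = Path(name).suffix.lower()
        if suffix not in ALLOWED_IMAGE_SUFFIXES:
            problems.append(f"{name}: unsupported format {suffix or '(none)'}")

    total = sum(size for _, size in files)
    if total > MAX_REQUEST_BYTES:
        problems.append(
            f"total size {total} bytes exceeds {MAX_REQUEST_BYTES} bytes"
        )

    return problems


# Example with made-up file names and sizes
print(check_image_batch([("diagram.png", 2_000_000), ("scan.bmp", 1_000_000)]))
```

For real files, the sizes could come from `os.path.getsize(path)`. Note that base64 encoding grows the payload by roughly one third, so a stricter threshold may be appropriate if the limit applies to the encoded request.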


In [None]:
# Make the API call with an image URL
response = client.chat.completions.create(
    model=model_id,
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Analyze the architecture in the image and describe its key features."},
            {
                "type": "image_url",
                "image_url": {
                    "url": "https://d2908q01vomqb2.cloudfront.net/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59/2025/06/10/ML-18438-arch.png"
                }
            }
        ]
    }]
)

print(response.choices[0].message.content)

In [None]:
# Make the API call with a base64 encoded image
import base64

with open("assets/nova-insurance-claims-arch.png", "rb") as f:
    image_data = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model=model_id,
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", 
             "text": "Analyze the architecture in the image and describe its key features."},
            {
                "type": "image_url",
                "image_url": {
                    "url": f"data:image/png;base64,{image_data}"
                }
            }
        ]
    }]
)

print(response.choices[0].message.content)
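
The cell above hardcodes `image/png` in the data URL. For other supported formats, the media type would need to match the file. The sketch below shows one way to build the URL from a file name and its raw bytes; the helper `image_bytes_to_data_url` and the extension-to-MIME mapping are illustrative assumptions, not part of the Nova or OpenAI SDKs.

```python
import base64
from pathlib import Path

# Assumed mapping from file extension to media type for the formats
# listed in this section.
MIME_BY_SUFFIX = {
    ".png": "image/png",
    ".jpg": "image/jpeg",
    ".jpeg": "image/jpeg",
    ".webp": "image/webp",
    ".gif": "image/gif",
}


def image_bytes_to_data_url(filename, raw_bytes):
    """Build a base64 data URL whose media type matches the file extension."""
    suffix = Path(filename).suffix.lower()
    if suffix not in MIME_BY_SUFFIX:
        raise ValueError(f"unsupported image format: {suffix or '(none)'}")
    encoded = base64.b64encode(raw_bytes).decode("ascii")
    return f"data:{MIME_BY_SUFFIX[suffix]};base64,{encoded}"


# Example with placeholder bytes rather than a real image
print(image_bytes_to_data_url("photo.jpeg", b"abc"))
```

The returned string can be used as the `url` value of an `image_url` content part, in place of the hardcoded f-string.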

## Video Analysis

Nova can also understand video input. Videos must adhere to the following constraints:

- Formats: MP4, MOV, MKV, WEBM
- Size: Up to 25 MB (or 1 GB via S3)
- Duration: Up to 30 seconds
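
The two size limits suggest a simple decision: send small videos inline as base64, and route larger ones through S3. The sketch below only makes that decision; it does not upload anything, and it does not check duration. The helper name `choose_video_transport` and the binary interpretation of "25 MB" and "1 GB" are assumptions.

```python
from pathlib import Path

# Limits copied from the list above; MiB/GiB interpretation is an assumption.
ALLOWED_VIDEO_SUFFIXES = {".mp4", ".mov", ".mkv", ".webm"}
MAX_INLINE_BYTES = 25 * 1024 * 1024
MAX_S3_BYTES = 1024 * 1024 * 1024


def choose_video_transport(filename, size_bytes):
    """Return "inline" or "s3" for a video, or raise ValueError if neither fits."""
    suffix = Path(filename).suffix.lower()
    if suffix not in ALLOWED_VIDEO_SUFFIXES:
        raise ValueError(f"unsupported video format: {suffix or '(none)'}")
    if size_bytes <= MAX_INLINE_BYTES:
        return "inline"
    if size_bytes <= MAX_S3_BYTES:
        return "s3"
    raise ValueError(f"{filename}: {size_bytes} bytes exceeds the S3 limit")


# Example with a made-up size of 5 MB
print(choose_video_transport("the-sea.mp4", 5_000_000))
```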

In [None]:
# Understand video input from a base64-encoded video
import base64  # Already imported if you ran the image cell above

with open("assets/the-sea.mp4", "rb") as f:
    video_data = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model=model_id,
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe the video content in detail."},
            {
                "type": "file",
                "file": {
                    "filename": "the-sea.mp4",
                    "file_data": video_data
                }
            }
        ],
    }]
)

print(response.choices[0].message.content)

The video opens with a panoramic view of a rugged coastline, characterized by steep cliffs and a rocky outcrop jutting into the sea. The ocean waves are gently crashing against the rocks, creating a dynamic interaction between the water and the land under a clear, blue sky that suggests fair weather. The camera then transitions to a closer perspective of the coastal scene, focusing on the rocky outcrop and the crashing waves, providing a more intimate view of the natural landscape.

Subsequently, the video shifts dramatically to a close-up of a seashell resting on a sandy beach. The shell, with its intricate pattern and vibrant colors, is highlighted by the warm glow of the setting sun, creating a tranquil and picturesque scene. Gentle waves are seen lapping at the shore in the background, adding to the serene ambiance of the beach setting. This close-up shot contrasts sharply with the earlier wide-angle coastal view, offering a detailed appreciation of the natural beauty and intricate

## Audio Analysis
Nova 2 Omni supports rich audio understanding and can be used for use cases such as transcription and diarization.

- Formats: MP3, WAV, OGG
- Duration: Up to 30 seconds
- Languages: 7 supported speech languages


In [None]:
# Understand audio input with Nova Omni
# import base64 if you didn't execute the earlier cells
with open("assets/sample-multi-speaker-audio.mp3", "rb") as f:
    audio_data = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="nova-omni-v2",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": """For each speaker turn segment, transcribe the speech and assign a speaker label plus start and end timestamps.
You must follow the exact XML format shown in the example below:
<segment><transcription speaker="speaker_id" start="start_time" end="end_time">transcription_text</transcription></segment>"""
             },
            {
                "type": "input_audio",
                "input_audio": {
                    "data": audio_data,  # base64-encoded audio read above
                    "format": "mp3"
                }
            }
        ]
    }]
)

print(response.choices[0].message.content)
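
Because the prompt asks for a fixed XML shape, the reply can be turned into structured records. The following is a minimal sketch that assumes the model follows the requested format exactly; the helper `parse_segments` is hypothetical, the sample string is made up for illustration, and timestamps are kept as strings because their format is not specified here.

```python
import re

# Matches one <transcription ...>text</transcription> element in the
# format requested by the prompt.
SEGMENT_RE = re.compile(
    r'<transcription\s+speaker="([^"]*)"\s+start="([^"]*)"\s+end="([^"]*)"\s*>'
    r"(.*?)</transcription>",
    re.DOTALL,
)


def parse_segments(text):
    """Extract speaker, start, end, and text from each transcription element."""
    return [
        {"speaker": speaker, "start": start, "end": end, "text": body.strip()}
        for speaker, start, end, body in SEGMENT_RE.findall(text)
    ]


# Made-up sample in the requested format, not real model output
sample_output = (
    '<segment><transcription speaker="speaker_1" start="0.0" end="2.5">'
    "Hello there.</transcription></segment>\n"
    '<segment><transcription speaker="speaker_2" start="2.5" end="4.0">'
    "Hi!</transcription></segment>"
)

for segment in parse_segments(sample_output):
    print(segment["speaker"], segment["start"], segment["end"], segment["text"])
```

A regular expression is used instead of an XML parser so that stray text around the segments does not cause a parse failure. Segments that deviate from the requested attribute order are silently skipped.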

## Document Analysis