# Multimodal Pydantic AI Agent

This notebook shows how to convert base64-encoded images stored in Weaviate into Pydantic AI's `BinaryContent` objects.

You can then pass these `BinaryContent` inputs to the Pydantic AI `Agent` alongside a text prompt.

In [22]:
import fitz # uv add pymupdf
from pathlib import Path

PDF = "Rank1.pdf" # <-- Replace this with your PDF (If you want to use this PDF, you can get it from https://arxiv.org/abs/2502.18418)
OUT_DIR = "pdf_pages"
DPI = 300 # 300 is "print-quality" ...
FORMAT = "png"

doc = fitz.open(PDF)
Path(OUT_DIR).mkdir(parents=True, exist_ok=True)

zoom = DPI / 72.0
mat = fitz.Matrix(zoom, zoom)

for i, page in enumerate(doc, start=1):
    pix = page.get_pixmap(matrix=mat, alpha=False)
    pix.save(Path(OUT_DIR, f"page_{i:d}.{FORMAT}"))

doc.close()
print(f"Saved {len(list(Path(OUT_DIR).glob(f'*.{FORMAT}')))} images to '{OUT_DIR}'")

Saved 17 images to 'pdf_pages'
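
The `zoom = DPI / 72.0` step works because PDF page dimensions are measured in points, with 72 points per inch. The sketch below works through that arithmetic for a US Letter page (612 x 792 points), which is an assumed page size and not necessarily the size of the PDF used here. `rendered_size_px` is a hypothetical helper for illustration, and PyMuPDF's own rounding of the final pixel dimensions may differ slightly.

```python
def rendered_size_px(width_pt: float, height_pt: float, dpi: int) -> tuple[int, int]:
    """Approximate pixel dimensions of a page rendered at the given DPI."""
    zoom = dpi / 72.0  # PDF user space uses 72 points per inch
    return round(width_pt * zoom), round(height_pt * zoom)

# Assumed page size: US Letter, 612 x 792 points.
letter_at_300 = rendered_size_px(612, 792, dpi=300)
letter_at_150 = rendered_size_px(612, 792, dpi=150)
print(letter_at_300, letter_at_150)
```

Halving the DPI quarters the pixel count, which is worth keeping in mind because larger images generally cost more tokens when sent to a vision model.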


In [None]:
import re
import base64
from pathlib import Path
from typing import Tuple, List
from pydantic_ai import Agent, BinaryContent

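def decode_image_from_db(b64_or_data_url: str, fallback_media_type: str = "image/png") -> Tuple[bytes, str]:
    """Decode a bare base64 string or a data URL into (raw bytes, media type)."""
    s = b64_or_data_url.strip()
    if s.startswith("data:"):
        # Data URL form: "data:<media type>;base64,<payload>"
        header, b64 = s.split(",", 1)
        media_type = header.split(";")[0][len("data:"):] or fallback_media_type
        return base64.b64decode(b64), media_type
    # Bare base64 carries no media type, so use the fallback.
    return base64.b64decode(s), fallback_media_type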

def encode_image(image_path: str) -> str:
    with open(image_path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")

def extract_page_num(p: Path) -> int:
    """Return the page number from names like 'page_3.png'; unmatched names sort last."""
    m = re.search(r"page[_-](\d+)", p.stem, flags=re.I)
    return int(m.group(1)) if m else 10**9

def load_binarycontents(paths: List[Path]) -> List[BinaryContent]:
    items = []
    for p in paths:
        b64 = encode_image(str(p)) # simulate Weaviate base64 image storage
        img_bytes, media_type = decode_image_from_db(b64, "image/png")
        items.append(BinaryContent(data=img_bytes, media_type=media_type))
    return items
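
As a quick sanity check of the two input shapes handled above (a bare base64 string and a `data:` URL), the sketch below runs both through the decoder. The function body is repeated so the cell runs on its own, and the sample bytes are only the 8-byte PNG signature used as a stand-in, not a real image or an actual Weaviate record.

```python
import base64
from typing import Tuple

def decode_image_from_db(b64_or_data_url: str, fallback_media_type: str = "image/png") -> Tuple[bytes, str]:
    # Same logic as the helper above, repeated so this cell is self-contained.
    s = b64_or_data_url.strip()
    if s.startswith("data:"):
        header, b64 = s.split(",", 1)
        media_type = header.split(";")[0][len("data:"):] or fallback_media_type
        return base64.b64decode(b64), media_type
    return base64.b64decode(s), fallback_media_type

raw = b"\x89PNG\r\n\x1a\n"  # stand-in payload: PNG signature bytes only
plain = base64.b64encode(raw).decode("utf-8")
data_url = f"data:image/jpeg;base64,{plain}"  # media type chosen to differ from the fallback

plain_result = decode_image_from_db(plain)
url_result = decode_image_from_db(data_url)
print(plain_result[1], url_result[1])
```

The bare string falls back to `image/png`, while the data URL's own header wins over the fallback. Note that the media type is taken from the header as-is and is not checked against the actual bytes.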

In [25]:
from pydantic import BaseModel
from pydantic_ai import RunContext

class MultimodalDependencies(BaseModel):
    prompt: str
    all_images: List[Path]

agent = Agent(
    model="openai:gpt-4.1",
    deps_type=MultimodalDependencies
)

@agent.system_prompt
def get_system_prompt(ctx: RunContext[MultimodalDependencies]) -> str:
    return f"You are analyzing document pages. The user will ask questions about content across {len(ctx.deps.all_images)} pages."

@agent.tool
async def get_batch_info(ctx: RunContext[MultimodalDependencies], batch_size: int) -> str:
    """Get information about the current batch being processed."""
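    # MultimodalDependencies has no `images_dir` field, so derive the directory from the image paths.
    images_dir = ctx.deps.all_images[0].parent if ctx.deps.all_images else Path(".")
    return f"Processing batch of {batch_size} images from directory: {images_dir}"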

async def run_multimodal_qa():
    images_dir = Path("./pdf_pages")
    batch_sizes = [2, 4, 6, 8, 10]
    prompt = "How many reasoning traces from R1 were sourced to train Rank1?"
    all_pages = sorted(images_dir.glob("page_*.png"), key=extract_page_num)
    
    deps = MultimodalDependencies(
        prompt=prompt,
        all_images=all_pages
    )
    
    for k in batch_sizes:        
        batch = all_pages[:k]
        contents = load_binarycontents(batch)

        result = await agent.run([deps.prompt, *contents], deps=deps)
        
        page_nums = [extract_page_num(p) for p in batch]
        print(f"\n=== Sent {k} image(s): pages {page_nums} ===")
        print(result.output)
        tokens_used = result.usage().total_tokens
        print(f"This response took {tokens_used} tokens")

await run_multimodal_qa()


=== Sent 2 image(s): pages [1, 2] ===
Rank1 was trained on **635,264 examples** of R1 reasoning traces. This number is given towards the end of the "Data Preparation" section on page 2:

> "After all generation was done, our dataset has 635,264 examples of R1 generations..."

So, 635,264 reasoning traces from R1 were sourced to train Rank1.
This response took 1686 tokens

=== Sent 4 image(s): pages [1, 2, 3, 4] ===
The number of reasoning traces from R1 that were sourced to train Rank1 is **635,264**.

This is found on page 2 in the "Data Preparation" section:
> "After all generation was done, our dataset has 635,264 examples of R1 generations..."
This response took 3199 tokens

=== Sent 6 image(s): pages [1, 2, 3, 4, 5, 6] ===
The number of reasoning traces from R1 that were sourced to train Rank1 is **635,264**.

You can find this information on page 2, section "2.1 Data Preparation", where it states:
> "After all generation was done, our dataset has 635,264 examples of R1 generatio