Add visual search image tooling plugin#8

Draft
eric-tramel wants to merge 1 commit into main from codex/visual-search-plugin

Conversation


@eric-tramel eric-tramel commented May 5, 2026

What

Adds data-designer-visual-search, a self-contained Data Designer plugin that registers a new visual-search custom column type for VLM-driven image inspection.

The column accepts an image input column plus a prompt and model alias, then exposes a small local image-tool interface to the model:

  • open_image: open the row image and return the root image_id
  • get_image_info: inspect dimensions and lineage metadata for an image ID
  • list_images: list the current image tree
  • crop_image: create a pixel or percentage crop from any previous image ID
  • transform_image: rotate, flip, or resize any previous image ID
  • edit_color: adjust brightness, contrast, saturation, sharpness, grayscale, or inversion
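Each of these tools is exposed to the model as an OpenAI-compatible function schema. The sketch below shows what the crop_image schema might look like; the field layout follows the OpenAI tools format, but the exact descriptions and defaults here are assumptions, not the plugin's actual schema.

```python
# Illustrative OpenAI-style function schema for the crop_image tool.
# The real schemas are owned by VisualSearchToolExecutor; names of the
# parameters (image_id, x, y, width, height, unit) follow the PR text.
CROP_IMAGE_SCHEMA = {
    "type": "function",
    "function": {
        "name": "crop_image",
        "description": "Create a pixel or percentage crop from any previous image ID.",
        "parameters": {
            "type": "object",
            "properties": {
                "image_id": {"type": "string", "description": "Source image ID, e.g. img_0000."},
                "x": {"type": "number"},
                "y": {"type": "number"},
                "width": {"type": "number"},
                "height": {"type": "number"},
                "unit": {"type": "string", "enum": ["pixel", "percent"]},
            },
            "required": ["image_id", "x", "y", "width", "height"],
        },
    },
}
```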

Each tool-created image is stored in a row-local in-memory workspace and receives an ID such as img_0001. The column re-attaches tool-created images to the next model turn, so the model can crop or edit an image, inspect the result, branch from an earlier image ID, and continue reasoning. The main output column contains the final model answer, and {column_name}__image_history records the operation tree by default.

This PR also adds plugin-owned documentation under plugins/data-designer-visual-search/docs/ and generated site docs under docs/plugins/data-designer-visual-search/.

Why

Users want visual-search workflows where a model can zoom into regions, improve contrast, compare multiple crops, and recover from a bad crop by rewinding to an earlier image. A normal multimodal column can pass the initial image, but successive tool-generated image content needs extra plumbing: the output image bytes from a local tool must become model input on the next turn, while preserving IDs and lineage so the model can decide what to operate on next.

This plugin owns that column-specific loop instead of pushing image state management into generic Data Designer behavior.

Usage

import pandas as pd

from data_designer.config.config_builder import DataDesignerConfigBuilder
from data_designer.config.models import ChatCompletionInferenceParams, ModelConfig, ModelProvider
from data_designer.config.seed_source_dataframe import DataFrameSeedSource
from data_designer.interface.data_designer import DataDesigner

seed_df = pd.DataFrame(
    {
        "image_path": ["/path/to/store-shelf.png"],
        "target": ["the nutrition label on the cereal box"],
    }
)

provider = ModelProvider(
    name="nvidia",
    endpoint="https://integrate.api.nvidia.com/v1",
    api_key="NVIDIA_API_KEY",
    provider_type="openai",
)

vision_model = ModelConfig(
    alias="vision",
    model="qwen/qwen3.5-122b-a10b",
    provider="nvidia",
    inference_parameters=ChatCompletionInferenceParams(
        temperature=0,
        max_tokens=512,
        timeout=60,
    ),
)

builder = DataDesignerConfigBuilder(model_configs=[vision_model])
builder.with_seed_dataset(DataFrameSeedSource(df=seed_df))
builder.add_column(
    name="visual_answer",
    column_type="visual-search",
    image_column="image_path",
    prompt=(
        "Find {{ target }}. Use crop_image or edit_color if that helps. "
        "Return the text you can read and explain which image_id you used."
    ),
    model_alias="vision",
    max_tool_call_turns=4,
)

result = DataDesigner(
    artifact_path="artifacts",
    model_providers=[provider],
).preview(builder, num_records=1)

The resulting dataset includes:

  • visual_answer: the final VLM answer.
  • visual_answer__image_history: metadata for the image operation tree, including image_id, parent/child IDs, operation name, dimensions, and operation parameters.
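An illustrative (not verbatim) shape for one history column value, assuming one record per image node with the fields named above. The exact field names and nesting in the real side-effect column may differ.

```python
# Hypothetical visual_answer__image_history payload for a root image and
# one crop; field names follow the PR description (image_id, parent/child
# IDs, operation, dimensions, parameters).
history = [
    {"image_id": "img_0000", "parent_id": None, "children": ["img_0001"],
     "operation": "open_image", "width": 1024, "height": 768, "params": {}},
    {"image_id": "img_0001", "parent_id": "img_0000", "children": [],
     "operation": "crop_image", "width": 512, "height": 384,
     "params": {"x": 0, "y": 0, "width": 50, "height": 50, "unit": "percent"}},
]

# Given such records, the root of any image is recoverable by following
# parent links.
by_id = {rec["image_id"]: rec for rec in history}

def root_of(image_id: str) -> str:
    while by_id[image_id]["parent_id"] is not None:
        image_id = by_id[image_id]["parent_id"]
    return image_id
```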

Practical branching example:

open_image() -> img_0000
crop_image(image_id="img_0000", x=0, y=0, width=50, height=50, unit="percent") -> img_0001
edit_color(image_id="img_0001", contrast=1.5) -> img_0002
crop_image(image_id="img_0000", x=50, y=50, width=50, height=50, unit="percent") -> img_0003

The second crop branches from img_0000 instead of continuing from the edited first crop, which lets the model compare independent regions of the same source image.

Useful configuration options include:

  • allowed_tools: restrict the model to a subset of the built-in tools.
  • image_data_type and image_format: handle explicit URL or base64 image inputs.
  • image_placeholder: add a model-specific image token for endpoints that require text placeholders for attached images.
  • attach_images_after_tool_calls: control whether tool output images are sent back into model context.
  • with_trace and extract_reasoning_content: capture side-effect trace columns for debugging.
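Combined, these options might be wired into a single column definition as below. The option names come from the list above, but every value here is illustrative; consult the plugin docs for the accepted values of image_data_type, image_format, and image_placeholder.

```python
# Hedged sketch: one visual-search column using several of the options
# listed above. Values (e.g. "base64", "<image>") are assumptions.
visual_column_options = {
    "name": "shelf_reading",
    "column_type": "visual-search",
    "image_column": "image_path",
    "prompt": "Read the price tag under {{ target }}.",
    "model_alias": "vision",
    "allowed_tools": ["open_image", "crop_image", "edit_color"],
    "image_data_type": "base64",
    "image_format": "png",
    "image_placeholder": "<image>",
    "attach_images_after_tool_calls": True,
    "with_trace": True,
}
# builder.add_column(**visual_column_options)
```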

How

  • VisualSearchColumnConfig defines the public column interface and side-effect columns.
  • VisualImageWorkspace keeps the per-row in-memory image tree and exposes image IDs, data URI conversion, and history metadata.
  • VisualSearchToolExecutor provides OpenAI-compatible function schemas and executes the Pillow-backed local image tools.
  • VisualSearchColumnGenerator runs the chat/tool loop directly so it can append local tool results and attach generated image content blocks to the next model request.
  • The plugin documentation is authored in the plugin package and generated into the repository docs site with make plugin-docs / make all.
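The chat/tool loop that VisualSearchColumnGenerator runs can be sketched roughly as follows, with a stubbed model and tool executor so only the control flow is shown. Function names and message shapes here are illustrative (OpenAI-style), not the generator's actual API; the key step is re-attaching tool-produced images as model input for the next turn.

```python
import json

# Simplified sketch of the column's chat/tool loop (hypothetical code):
# call the model, execute any tool calls locally, append tool results,
# and re-attach generated images so the model can see them next turn.
def run_tool_loop(call_model, execute_tool, messages, max_turns):
    for _ in range(max_turns):
        reply = call_model(messages)
        messages.append(reply)
        tool_calls = reply.get("tool_calls")
        if not tool_calls:
            return reply["content"]  # final answer, no more tool use
        for call in tool_calls:
            result, image_uri = execute_tool(call)
            messages.append({"role": "tool", "tool_call_id": call["id"],
                             "content": json.dumps(result)})
            if image_uri is not None:
                # Attach the tool-created image to the next model request.
                messages.append({"role": "user", "content": [
                    {"type": "image_url", "image_url": {"url": image_uri}}]})
    return None  # turn budget exhausted without a final answer

# Stubbed run: one crop_image call, then a final text answer.
turns = iter([
    {"role": "assistant", "content": None,
     "tool_calls": [{"id": "t1", "name": "crop_image"}]},
    {"role": "assistant", "content": "The label reads 240 kcal."},
])
answer = run_tool_loop(
    call_model=lambda msgs: next(turns),
    execute_tool=lambda call: ({"image_id": "img_0001"},
                               "data:image/png;base64,..."),
    messages=[{"role": "user", "content": "Find the label."}],
    max_turns=4,
)
```

The max_turns cap mirrors the max_tool_call_turns column option shown in the usage example.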

Intermediate images are intentionally kept in memory per row and are not persisted as dataset artifacts. The persisted side-effect output is metadata describing the image tree, not image bytes.

Validation

  • make sync
  • make lint
  • make test-plugin PLUGIN=data-designer-visual-search
  • make all
  • Live smoke test against https://integrate.api.nvidia.com/v1 using qwen/qwen3.5-122b-a10b: one generated image row, the model called crop_image, the crop was re-attached as the next model image input, and the final answer correctly identified the crop color.

@eric-tramel eric-tramel self-assigned this May 5, 2026
@eric-tramel eric-tramel force-pushed the codex/visual-search-plugin branch from 3d4fc7e to 31506ef on May 5, 2026 at 22:01
@eric-tramel eric-tramel changed the title [codex] add visual search plugin Add visual search image tooling plugin May 5, 2026