Configurable response formatting for grounding/VQA datasets #238

@shuheng-liu

Context

#237 lands the infrastructure for PaliGemma-style location tokens — a coordinate ↔ <locNNNN> codec, an ensure_loc_tokens utility that handles both PaliGemma (promote-existing-IDs) and Gemma 3 (extend-vocab-and-resize), and the policy wiring for π0.5 and π0.6. It deliberately does NOT ship a concrete grounding dataset.
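To make the coordinate ↔ <locNNNN> mapping concrete, here is a minimal sketch of one way such a codec could quantize a coordinate. It is not the code from #237: the function names are hypothetical, and the 1024-bin range, the floor-based binning, and the bin-centre decoding are assumptions based on the PaliGemma `<loc0000>`..`<loc1023>` convention.

```python
import re

NUM_BINS = 1024  # assumption: PaliGemma-style <loc0000>..<loc1023>
_LOC_RE = re.compile(r"<loc(\d{4})>")


def coord_to_loc_token(value: float) -> str:
    """Quantize a coordinate normalized to [0, 1] into a <locNNNN> token (hypothetical helper)."""
    # Floor into a bin, then clamp so that 1.0 (and out-of-range inputs) stay valid.
    idx = min(max(int(value * NUM_BINS), 0), NUM_BINS - 1)
    return f"<loc{idx:04d}>"


def loc_token_to_coord(token: str) -> float:
    """Decode a <locNNNN> token back to the centre of its bin (hypothetical helper)."""
    match = _LOC_RE.fullmatch(token)
    if match is None:
        raise ValueError(f"not a location token: {token!r}")
    idx = int(match.group(1))
    if idx >= NUM_BINS:
        raise ValueError(f"location index out of range: {idx}")
    return (idx + 0.5) / NUM_BINS
```

Under these assumptions the round-trip error is at most half a bin (1/2048 of the image side) for in-range inputs.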

The reason is design: a class-per-source approach (PixMo-points, RefCOCO, OpenImages, …) scales poorly. Each new grounding source would require a new Python file, even though the only differences from the existing classes are (a) the HF dataset name, (b) the field names for the image / coords / label, and (c) the response format (points vs. xyxy vs. xywh).

We want the user to add a grounding source by editing config, not by writing a new dataset class.

Goal

A single generic grounding dataset class (or one for points + one for boxes — TBD) reads its source name, format, prompt template, and field mapping from DatasetConfig, applies the codec from src/opentau/datasets/grounding/loc_codec.py, and emits prompt / postfix strings that flow through the existing response_ce_loss path unchanged.

Expected user-facing config (one possible shape — final design left to the implementer):

{
  "vqa": "grounding_points",
  "vqa_kwargs": {
    "source": "allenai/pixmo-points",
    "prompt_template": "point to {label}",
    "label_field": "label",
    "points_field": "points",
    "image_field": "image_url",
    "max_points": 8
  }
}

{
  "vqa": "grounding_bbox",
  "vqa_kwargs": {
    "source": "lmms-lab/RefCOCO",
    "prompt_template": "detect {sentence}",
    "bbox_field": "bbox",
    "bbox_format": "xywh",
    "sentences_field": "sentences",
    "image_field": "image"
  }
}
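As a rough illustration of how a generic class might consume the `grounding_points` kwargs above, the sketch below turns one dataset row into a prompt / postfix pair. Everything beyond the config keys is an assumption: the helper names are hypothetical, points are assumed to arrive as `(x, y)` pairs already normalized to [0, 1], tokens are emitted in y-then-x order, and multiple points are joined with `" ; "`. The real row schema of each source would need to be checked.

```python
from typing import Any

NUM_BINS = 1024  # assumption: PaliGemma-style <loc0000>..<loc1023>


def _loc(value: float) -> str:
    """Hypothetical stand-in for the codec: normalized coordinate -> <locNNNN>."""
    idx = min(max(int(value * NUM_BINS), 0), NUM_BINS - 1)
    return f"<loc{idx:04d}>"


def format_points_sample(row: dict[str, Any], kwargs: dict[str, Any]) -> tuple[str, str]:
    """Build (prompt, postfix) for one row, driven only by config kwargs."""
    label = row[kwargs["label_field"]]
    # Truncate to max_points so the response stays within a bounded length.
    points = row[kwargs["points_field"]][: kwargs.get("max_points", 8)]
    prompt = kwargs["prompt_template"].format(label=label)
    # Assumed layout: y token then x token per point, points separated by " ; ".
    postfix = " ; ".join(_loc(y) + _loc(x) for x, y in points)
    return prompt, postfix


if __name__ == "__main__":
    kwargs = {
        "prompt_template": "point to {label}",
        "label_field": "label",
        "points_field": "points",
        "max_points": 8,
    }
    row = {"label": "mug", "points": [(0.25, 0.5), (0.75, 0.125)]}
    print(format_points_sample(row, kwargs))
```

Because the function only reads field names and the template from `kwargs`, a second points source with different column names would need a new config entry rather than new code.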

Open questions to decide as part of this work:

  • Where do the kwargs live on DatasetConfig? Options: (a) free-form vqa_kwargs: dict | None, (b) draccus subclass-config pattern (GroundingDatasetConfig as a typed union), (c) a class-level constants pattern with thin per-source subclasses (SOURCE = ..., PROMPT_TEMPLATE = ...).
  • One generic class or two (points + bbox)? Field shapes and response formats differ enough that one class is awkward; two might be cleaner.
  • Should the existing vqa: str / repo_id: str two-source XOR validator extend to a three-way XOR with grounding? Or is grounding just another vqa value?
  • PixMo-points migration path: the broken vqa/pixmo.py was deleted in #237. The first dataset to land via the new infra is the natural replacement.
  • RefCOCO bbox format: xywh is COCO's native format; the codec already has both xyxy_to_loc_tokens and xywh_to_loc_tokens.
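To weigh option (a) against a typed alternative, here is a sketch of what typed per-format configs could look like using plain dataclasses. It deliberately does not use draccus, so it only approximates option (b); the class names, defaults, and `parse_grounding_config` are hypothetical and not part of DatasetConfig today.

```python
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class GroundingPointsConfig:
    source: str
    prompt_template: str
    label_field: str = "label"
    points_field: str = "points"
    image_field: str = "image"
    max_points: int = 8


@dataclass(frozen=True)
class GroundingBBoxConfig:
    source: str
    prompt_template: str
    bbox_field: str = "bbox"
    bbox_format: str = "xyxy"
    sentences_field: str = "sentences"
    image_field: str = "image"

    def __post_init__(self) -> None:
        # Reject unsupported formats at config-parse time, not mid-training.
        if self.bbox_format not in ("xyxy", "xywh"):
            raise ValueError(f"unsupported bbox_format: {self.bbox_format!r}")


# Hypothetical registry keyed by the `vqa` value from the example configs.
_CONFIGS = {
    "grounding_points": GroundingPointsConfig,
    "grounding_bbox": GroundingBBoxConfig,
}


def parse_grounding_config(vqa: str, vqa_kwargs: dict[str, Any]):
    """Turn a free-form kwargs dict into a validated, typed config."""
    try:
        cls = _CONFIGS[vqa]
    except KeyError:
        raise ValueError(f"unknown grounding dataset: {vqa!r}") from None
    # Misspelled or unknown keys raise TypeError from the dataclass constructor.
    return cls(**vqa_kwargs)
```

The trade-off this illustrates: a free-form dict accepts a misspelled key silently, while a typed config fails fast on unknown keys and invalid `bbox_format` values, at the cost of one small class per response format.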

In scope

  1. DatasetConfig plumbing for per-dataset format kwargs (whichever shape is picked above).
  2. Generic grounding dataset class(es) under src/opentau/datasets/grounding/.
  3. PixMo-points config replacement using the new infra.
  4. RefCOCO support.
  5. Tests covering the configurable response formatting on each new source (sample loads, response strings match <loc\d{4}> regex, no JSON characters slip through).
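The response-string checks in item 5 could be expressed with a small helper like the one below. This is a sketch under assumptions: the helper name is hypothetical, the 1024-bin upper bound follows the PaliGemma convention, and "JSON characters" is interpreted here as braces, brackets, double quotes, colons, and commas.

```python
import re

LOC_TOKEN = re.compile(r"<loc(\d{4})>")
JSON_CHARS = set('{}[]":,')  # assumption: what counts as a "JSON character"


def check_grounding_response(response: str, num_bins: int = 1024) -> bool:
    """Return True if the response looks like a clean location-token string."""
    indices = [int(m) for m in LOC_TOKEN.findall(response)]
    if not indices:
        return False  # no location tokens at all
    if any(i >= num_bins for i in indices):
        return False  # index outside the assumed token range
    if response.count("<loc") != len(indices):
        return False  # some "<loc..." fragment is not a well-formed 4-digit token
    return not (JSON_CHARS & set(response))  # no JSON residue


if __name__ == "__main__":
    print(check_grounding_response("<loc0512><loc0256> ; <loc0128><loc0768>"))
    print(check_grounding_response('{"x": 0.5, "y": 0.25}'))
```

In a real test this check would run over the postfix of a few samples loaded from each configured source.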

Out of scope (separate follow-ups)

  • Eval-time decoding of <locNNNN> strings to bounding boxes for IoU / mAP. The codec already has loc_tokens_to_xyxy / loc_tokens_to_points for this; an eval/regression test that closes the round-trip is a separate task.
  • OpenImages-detect (multi-object scenes, longer responses — likely needs response_max_length bump).
  • Defensive ensure_loc_tokens calls in π0 / π0.5_mem / π0.7-paligemma. These all share the PaliGemma backbone, where the call is a no-op for vocab size, but they currently use the bare tokenizer, which fragments <loc0000> into seven pieces. Adding the call once we have a real grounding consumer would prevent silent failures.
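For orientation only, the eval-time round-trip deferred in the first bullet could look roughly like this. The sketch does not use the codec's loc_tokens_to_xyxy (its signature is not shown here); `decode_box` and `iou` are hypothetical helpers, and the y_min, x_min, y_max, x_max token order, the 1024 bins, and bin-centre decoding are assumptions following the PaliGemma convention.

```python
import re

NUM_BINS = 1024  # assumption: PaliGemma-style <loc0000>..<loc1023>
Box = tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max), normalized


def decode_box(text: str) -> Box:
    """Decode exactly four location tokens into a normalized xyxy box."""
    idx = [int(m) for m in re.findall(r"<loc(\d{4})>", text)]
    if len(idx) != 4:
        raise ValueError(f"expected 4 location tokens, found {len(idx)}")
    # Assumed token order: y_min, x_min, y_max, x_max; decode to bin centres.
    y0, x0, y1, x1 = ((i + 0.5) / NUM_BINS for i in idx)
    return (x0, y0, x1, y1)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two xyxy boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0
```

A regression test built on this would encode a ground-truth box, decode the model's response, and assert the IoU stays above a chosen threshold.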
