
feat(datapoints): support multiple images per media_context #580

Merged
LinoGiger merged 6 commits into main from
feat(datapoints)/multi-image-media-context
Apr 30, 2026

Conversation

@RapidPoseidon (Contributor)

Summary

  • Widened media_contexts from list[str] to list[str] | list[list[str]] across all order/job/validation/audience APIs
  • When a caller passes a list of lists, each inner list is uploaded and wrapped in a MultiAssetInput, so annotators see multiple reference images for that datapoint
  • Fully backward-compatible: all existing list[str] callers continue to work without changes

What changed

| File | Change |
| --- | --- |
| `_datapoint.py` | `media_context: str \| list[str] \| None` |
| `_datapoints_validator.py` | validates both shapes; rejects mixed lists |
| `_datapoint_uploader.py` | `_upload_and_map_media_context()` handles `str` or `list` |
| `_asset_upload_orchestrator.py` | extracts all URLs from list `media_context` |
| `_rapidata_dataset.py` | `_create_dataset_groups` handles list `media_context` |
| `rapids.py` | `Rapid.media_context: str \| list[str] \| None` |
| `_validation_rapid_uploader.py` | new `_upload_and_map_media_context()` |
| `rapids_manager.py` | all rapid builder `media_context` params updated |
| `validation_set_manager.py` | all `create_*_set` signatures updated |
| `rapidata_order_manager.py` | all `create_*_order` signatures updated |
| `rapidata_job_manager.py` | all `create_*_job_definition` signatures updated |
| `rapidata_job_definition.py` | `update_dataset` signature updated |
| `audience_example_handler.py` | new `_upload_and_map_media_context()`; both example methods updated |
| `rapidata_audience.py` | `add_classification_example` / `add_compare_example` signatures updated |
| `docs/job_definition_parameters.md` | updated `media_contexts` reference with multi-image example |

Test plan

  • pyright passes: 0 errors, 0 warnings, 0 informations ✅ (verified in this session)
  • Pass media_contexts=[["img1.jpg", "img2.jpg"], ["img3.jpg"]] to a classification order and confirm both images appear as context in the rapid
  • Pass existing media_contexts=["img1.jpg", "img2.jpg"] and confirm no regression

🔗 Session: https://session-c1c6a759.poseidon.rapidata.internal/

Allow media_contexts to accept list[list[str]] in addition to list[str],
so each datapoint can display multiple reference images in the rapid.

The type is widened throughout the full call stack:
- Datapoint.media_context: str | list[str] | None
- DatapointsValidator accepts list[str] | list[list[str]] | None
- DatapointUploader, ValidationRapidUploader, AudienceExampleHandler
  each grow _upload_and_map_media_context(), which uploads all URLs and
  wraps them in a MultiAssetInput (an IAssetInput) when more than one is given
- AssetUploadOrchestrator extracts all URLs from list media_context
- RapidataDataset._create_dataset_groups handles list media_context
- Rapid model, RapidsManager rapid builders, all order/job/validation
  manager methods, and RapidataAudience examples updated accordingly
- Fully backward-compatible: existing list[str] callers are unaffected

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: elias <elias@rapidata.ai>
@claude

claude Bot commented Apr 28, 2026

Code Review — PR #580: Support multiple images per media_context

Overview

This PR widens media_contexts from list[str] to list[str] | list[list[str]] uniformly across all order/job/validation/audience APIs. The implementation is backward-compatible, well-documented, and correctly propagates the type change through all layers. Overall the approach is sound. A few issues worth addressing before merge are noted below.


Issues

1. _upload_and_map_media_context is duplicated across three classes (and inlined in a fourth file)

The exact same method body appears in:

  • _datapoint_uploader.py
  • _validation_rapid_uploader.py
  • audience_example_handler.py
  • _rapidata_dataset.py (inlined, not extracted)

This is the most significant code-quality issue in the PR. Any future change to upload/mapping logic must be applied to all four places. Consider extracting it to a shared mixin, utility function, or the AssetMapper/AssetUploader layer itself:

# e.g. in asset_mapper or a small shared helper
def upload_and_map_media_context(
    asset_uploader: AssetUploader,
    asset_mapper: AssetMapper,
    media_context: str | list[str],
) -> IAssetInput:
    if isinstance(media_context, list):
        return asset_mapper.create_existing_asset_input(
            [asset_uploader.upload_asset(mc) for mc in media_context]
        )
    return asset_mapper.create_existing_asset_input(
        asset_uploader.upload_asset(media_context)
    )

2. Dead-code continue in _datapoints_validator.py

# lines 310-313
if has_list:
    for mc in media_contexts:
        if not isinstance(mc, list):
            # should not happen because of has_list/has_str check
            continue

The has_str and has_list guard above already ensures that if we reach this branch, every element is a list. The isinstance check plus continue is unreachable. The comment acknowledges this, but unreachable code is still noise. Either remove the guard entirely or use an assert if you want to document the invariant:

# replace the dead continue with:
assert isinstance(mc, list)  # guaranteed by has_str/has_list check above

3. Style inconsistency — Optional[...] vs ... | None

rapidata_order_manager.py line 598 uses Optional[list[str] | list[list[str]]], while every other changed signature in this PR uses list[str] | list[list[str]] | None. Pick one form (the project convention is X | None) and make it consistent.

4. Sequential uploads inside _create_dataset_groups

uploaded_names = [
    asset_uploader.upload_asset(mc) for mc in media_context
]

This is sequential. The single-asset path is unaffected, but for a datapoint with many context images this could be slow. If the rest of the upload layer already has a parallel/batch upload mechanism, consider using it here too. At minimum, a TODO comment would keep this visible.
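One possible shape for that follow-up, sketched with the stdlib only. `upload_media_context` and the lambda uploader below are stand-ins, not the SDK's actual upload layer, which may already offer a better batch mechanism:

```python
from concurrent.futures import ThreadPoolExecutor


def upload_media_context(upload_asset, media_context: list[str]) -> list[str]:
    """Upload all context assets concurrently, preserving input order."""
    # executor.map yields results in submission order, so the returned
    # names line up with the input URLs even when uploads finish out of order.
    with ThreadPoolExecutor(max_workers=8) as pool:
        return list(pool.map(upload_asset, media_context))


names = upload_media_context(lambda url: f"uploaded/{url}", ["a.jpg", "b.jpg"])
```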

5. No automated tests added

The PR lands meaningful new validation logic (media_context_not_empty for lists, mixed-shape rejection in DatapointsValidator). These are good candidates for fast unit tests — the validator logic is pure Python and easy to exercise in isolation. The manual test plan in the PR description does not replace automated coverage of the rejection paths (empty list, mixed list, empty inner string, etc.).
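Those rejection paths are cheap to pin down in isolation. A sketch of what such tests could assert; `validate_media_contexts` here is a stand-in reference implementation of the documented rules, not the SDK's actual DatapointsValidator:

```python
def validate_media_contexts(media_contexts):
    """Stand-in for the documented rejection rules (not the real validator)."""
    if media_contexts is None:
        return
    has_str = any(isinstance(mc, str) for mc in media_contexts)
    has_list = any(isinstance(mc, list) for mc in media_contexts)
    if has_str and has_list:
        raise ValueError("mixed list of strings and lists")
    for mc in media_contexts:
        items = [mc] if isinstance(mc, str) else mc
        if not items:
            raise ValueError("empty inner list")
        if any(not isinstance(item, str) or item == "" for item in items):
            raise ValueError("entries must be non-empty strings")


def rejects(value) -> bool:
    try:
        validate_media_contexts(value)
    except ValueError:
        return True
    return False


assert rejects(["a.jpg", ["b.jpg"]])              # mixed shapes
assert rejects([[]])                              # empty inner list
assert rejects([["a.jpg", ""]])                   # empty inner string
assert not rejects([["a.jpg"], ["b.jpg", "c.jpg"]])
```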


Minor / Non-blocking

  • _datapoints_validator.py mixed-list error message refers to "a list of strings or a list of lists of strings" — consider making the message actionable, e.g. "Wrap each single string in a list, or pass each value as a bare string."
  • _datapoint.py validator correctly handles the list case, but any(not isinstance(item, str) or item == "" for item in v) evaluates even past the first failure. Fine for typical input sizes, but next((True for item in v if not isinstance(item, str) or item == ""), False) would short-circuit.
  • The docs update is clear and the examples are helpful. One small nit: the table in job_definition_parameters.md renders \| literally in some Markdown renderers — the escaped pipe in the type cell may look odd depending on the doc renderer in use.

Summary

| Area | Status |
| --- | --- |
| Correctness | ✅ Logic is sound and backward-compatible |
| Type annotations | ✅ Consistent with the project's `from __future__ import annotations` convention |
| Code duplication | ⚠️ `_upload_and_map_media_context` duplicated 3–4× |
| Dead code | ⚠️ Unreachable `continue` in validator |
| Style consistency | ⚠️ One `Optional[...]` outlier |
| Tests | ⚠️ No automated tests for new validation paths |
| Docs | ✅ Updated and clear |

The duplication item is the most important to address before merge — everything else is low risk.

@EliasTeikari EliasTeikari requested a review from LinoGiger April 28, 2026 11:49
Per follow-up feedback: simplify to a single canonical shape and warn on
the legacy form instead of supporting both.

- Public API ``media_contexts`` is now ``list[list[str]] | None``. A flat
  ``list[str]`` is still accepted for backward compatibility but emits a
  deprecation warning and is wrapped automatically (each string -> [str]).
- Singular ``media_context`` (``Datapoint``, ``Rapid``, audience example
  helpers, ``RapidsManager`` rapid builders) is now ``list[str] | None``;
  passing a ``str`` emits the same deprecation warning and is wrapped.
- All uploaders / handlers (``DatapointUploader``,
  ``ValidationRapidUploader``, ``AudienceExampleHandler``, dataset group
  creation) drop their isinstance branching — they now always receive a
  ``list[str]`` and unwrap single-element lists into ``ExistingAssetInput``
  for backwards-compatible rendering, only using ``MultiAssetInput`` when
  there is more than one asset.
- ``DatapointsValidator`` exposes a shared ``_normalize_media_contexts``
  helper used by ``map_datapoints`` and the two ranking-style entry
  points (``RapidataOrderManager.create_ranking_order`` and the
  job-manager equivalent) and by ``ValidationSetManager``.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: elias <elias@rapidata.ai>
@claude

claude Bot commented Apr 28, 2026

Code Review — PR #580: Support multiple images per media_context

Overview

Clean, well-scoped feature. The approach — widening media_context/media_contexts to accept lists, normalising legacy strings at the boundary with deprecation warnings, and keeping list[list[str]] internally — is sound. The changes are consistent across the full surface area of the SDK. A few issues worth addressing before merging.


Issues

1. _upload_and_map_media_context is duplicated three times

Identical method bodies appear in:

  • _datapoint_uploader.py (DatapointUploader)
  • _validation_rapid_uploader.py (ValidationRapidUploader)
  • audience_example_handler.py (AudienceExampleHandler)
# all three are literally identical
uploaded_names = [self.asset_uploader.upload_asset(mc) for mc in media_context]
if len(uploaded_names) == 1:
    return self.asset_mapper.create_existing_asset_input(uploaded_names[0])
return self.asset_mapper.create_existing_asset_input(uploaded_names)

Since all three classes already share asset_uploader / asset_mapper via composition, this is a good candidate for a standalone helper function (or a shared mixin).

2. _upload_and_map_media_context has a questionable single/multi branch

Both the single-asset and multi-asset paths call the same method:

return self.asset_mapper.create_existing_asset_input(uploaded_names[0])  # str
return self.asset_mapper.create_existing_asset_input(uploaded_names)      # list[str]

Relying on create_existing_asset_input silently switching behaviour based on argument type makes the intent opaque and is fragile. Consider using an explicit create_multi_asset_input (or equivalent) for the list path so the two code paths are visibly different.
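A sketch of what the explicit split could look like. `create_multi_asset_input` is the hypothetical method suggested above, and `StubMapper` only stands in for the real AssetMapper:

```python
class StubMapper:
    """Stand-in for AssetMapper, with the suggested explicit multi method."""

    def create_existing_asset_input(self, name: str) -> dict:
        return {"kind": "existing", "name": name}

    def create_multi_asset_input(self, names: list[str]) -> dict:
        return {"kind": "multi", "names": names}


def upload_and_map_media_context(upload_asset, mapper, media_context: list[str]):
    uploaded = [upload_asset(mc) for mc in media_context]
    if len(uploaded) == 1:
        # single image keeps the legacy rendering path
        return mapper.create_existing_asset_input(uploaded[0])
    # the multi-image path is now visibly different, not an isinstance overload
    return mapper.create_multi_asset_input(uploaded)
```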

3. _normalize_media_contexts is a private symbol imported across four files

_normalize_media_contexts (underscore prefix ⇒ private) is imported by:

  • rapidata_order_manager.py
  • rapidata_job_manager.py
  • validation_set_manager.py
  • (called internally in _datapoints_validator.py)

A function used by four different modules is not private. Either remove the underscore or move it to a shared utilities module.

4. Inconsistent normalisation point for ranking vs. everything else

For non-ranking flows, normalisation happens inside DatapointsValidator.map_datapoints. For ranking it is done manually before the length check — and the import is inlined inside the function body:

# in create_ranking_order / _create_ranking_job_definition — inside a span block
from rapidata.rapidata_client.datapoints._datapoints_validator import (
    _normalize_media_contexts,
)
media_contexts = _normalize_media_contexts(media_contexts)

Inline imports inside function bodies are a code smell. Move the import to the top of the file. Better still, refactor ranking to also go through a shared normalisation path.

5. Validator duplication between Datapoint and Rapid

Datapoint.media_context_normalize (_datapoint.py) and Rapid.media_context_normalize (rapids.py) are byte-for-byte identical. Extract into a standalone validator function imported by both.

6. Rapid validator duplicates _coerce_media_context too

The audience-level _coerce_media_context function in rapidata_audience.py handles the str → list[str] coercion with the same logic as both Pydantic validators. Three implementations of the same 15-line logic is maintenance debt.

7. Validation-set/classification media_contexts check uses truthiness

media_contexts = _normalize_media_contexts(media_contexts)
if media_contexts and len(media_contexts) != len(datapoints):

After normalisation media_contexts is either None or a non-empty list[list[str]], so media_contexts and … is fine now. But is not None is more explicit and less surprising to a future reader. The ranking paths already got this right:

if media_contexts is not None and len(media_contexts) != len(datapoints):

(_datapoints_validator.py line 434 in the diff). Make it consistent everywhere.


Minor / Nits

  • Documentation type mismatch. job_definition_parameters.md now shows Optional[list[list[str]]] as the canonical type, but the backward-compatible note only appears in the constraint text. The old single-string example has been removed entirely, which may confuse users who search the docs for the old pattern before they hit the deprecation warning.

  • Asset uploads are sequential. _upload_and_map_media_context uploads multiple assets one-by-one in a list comprehension. For the common case (1 image) this is irrelevant, but for the multi-image path it could be parallelised with ThreadPoolExecutor if upload latency matters.

  • No automated test coverage. The PR adds meaningful new code paths (multi-image upload, deprecation-warning coercion, empty-list rejection, mixed-type rejection) but ships no unit tests. At minimum, the _normalize_media_contexts function and the Datapoint/Rapid validators are straightforward to unit-test without hitting the API.


Summary

The feature is well-thought-out and the backward-compat story is solid. The main asks before merging are:

  1. De-duplicate _upload_and_map_media_context (or at minimum leave a comment explaining why it is intentionally duplicated).
  2. De-duplicate the media_context Pydantic validator between Datapoint and Rapid.
  3. Remove the underscore from _normalize_media_contexts or move it somewhere explicitly shared.
  4. Use is not None guards consistently for the post-normalisation length checks.
  5. Move the inline imports in ranking methods to the top of the file.

Comment thread docs/job_definition_parameters.md Outdated
Comment on lines +188 to +190

> A flat `list[str]` is still accepted for backward compatibility — it triggers a deprecation warning and each string is automatically wrapped in a single-element list.

Collaborator: remove this comment

Comment thread docs/job_definition_parameters.md Outdated
Media URLs shown as reference context alongside each datapoint. Useful when you need to show one or more reference images / videos alongside the item being evaluated.

**Constraints:** If provided, must have the same length as `datapoints`.
**Constraints:** If provided, must have the same length as `datapoints`. Each entry is itself a list of media URLs / paths. Use a single-element inner list for one media asset per datapoint, or multiple entries to display several images / videos.
Collaborator: do not mention video as an input here; even though it is technically supported, it is not meant to be used for videos, so only mention images

data_type (str, optional): The type of the datapoint. Defaults to "media" (any form of image, video or audio).
context (str, optional): The context is text that will be shown in addition to the instruction. Defaults to None.
media_context (str, optional): The media context is a link to an image / video that will be shown in addition to the instruction (can be combined with context). Defaults to None.
media_context (list[str], optional): A list of links to images / videos that will be shown in addition to the instruction (can be combined with context). Pass a single-element list for one media asset, or multiple to display multiple images / videos. Defaults to None.
Collaborator: same as above

data_type (Literal["media", "text"], optional): The data type of the datapoints. Defaults to "media".
context (str, optional): Additional text context to display with the example. Defaults to None.
media_context (str, optional): Additional media (URL or path) to display with the example. Defaults to None.
media_context (list[str], optional): Additional media (URLs or paths) to display with the example. Pass a single-element list for one media asset, or multiple to display multiple images / videos. Defaults to None.
Collaborator: will stop commenting now, but please remove the video mention everywhere

Comment on lines +33 to +48
def _upload_and_map_media_context(
    self, media_context: list[str]
) -> IAssetInput:
    """Upload media context asset(s) and map to IAssetInput.

    ``media_context`` is always a list. A single-element list is sent as a
    plain ``ExistingAssetInput`` (preserving the legacy single-image
    rendering); two or more entries are bundled into a ``MultiAssetInput``.
    """
    uploaded_names = [
        self.asset_uploader.upload_asset(mc) for mc in media_context
    ]
    if len(uploaded_names) == 1:
        return self.asset_mapper.create_existing_asset_input(uploaded_names[0])
    return self.asset_mapper.create_existing_asset_input(uploaded_names)

Collaborator: why is this code duplicated?

Comment on lines +7 to +69
def _normalize_media_contexts(
    media_contexts: list[list[str]] | list[str] | None,
) -> list[list[str]] | None:
    """Normalize media_contexts to ``list[list[str]] | None``.

    Accepts the legacy ``list[str]`` shape (one media context per datapoint)
    by wrapping each string in a single-element list, after emitting a
    deprecation warning. This way the rest of the pipeline only ever has
    to deal with ``list[list[str]]``.
    """
    if media_contexts is None:
        return None

    if not isinstance(media_contexts, list):
        raise ValueError(
            "media_contexts must be a list of lists of strings or None, "
            f"got {type(media_contexts).__name__}."
        )

    has_str = any(isinstance(mc, str) for mc in media_contexts)
    has_list = any(isinstance(mc, list) for mc in media_contexts)
    if has_str and has_list:
        raise ValueError(
            "media_contexts must be a list of lists of strings, not a mix of strings and lists."
        )

    if has_str:
        logger.warning(
            "Passing a flat list of strings for media_contexts is deprecated; "
            "pass a list of lists of strings instead. Each string has been "
            "wrapped in a single-element list."
        )
        normalized: list[list[str]] = []
        for mc in media_contexts:
            if not isinstance(mc, str):
                raise ValueError(
                    "media_contexts must be a list of lists of strings, "
                    f"got element of type {type(mc).__name__}."
                )
            if mc == "":
                raise ValueError(
                    "media_contexts entries cannot be empty strings."
                )
            normalized.append([mc])
        return normalized

    # list[list[str]] case
    for mc in media_contexts:
        if not isinstance(mc, list):
            raise ValueError(
                "media_contexts must be a list of lists of strings, "
                f"got element of type {type(mc).__name__}."
            )
        if len(mc) == 0:
            raise ValueError(
                "Each inner media_contexts list must contain at least one string. "
                "Use None for the whole field if not needed."
            )
        if any(not isinstance(item, str) or item == "" for item in mc):
            raise ValueError(
                "Every entry in a media_contexts inner list must be a non-empty string."
            )
    return cast("list[list[str]]", media_contexts)
Collaborator: this also seems like duplicate code that already exists in the datapoint? either do it in the validator or in the datapoint itself, not both

- Drop "video" wording from docstrings and docs — media_context is for
  images only, even though the plumbing technically also handles other
  asset kinds.
- Remove the deprecation note from docs/job_definition_parameters.md.
- Stop duplicating asset upload/map. ``_upload_and_map_media_context``
  helpers were near-clones of ``_upload_and_map_asset``; both
  ``DatapointUploader`` and ``ValidationRapidUploader`` now reuse their
  existing ``_upload_and_map_asset`` for media_context. The
  ``AudienceExampleHandler`` gains one ``_upload_and_map_asset`` shared
  by all three callers (main asset, classification, compare) and the
  inline asset-upload blocks were folded into it.
- Stop duplicating coercion logic. The plural
  ``_normalize_media_contexts`` helper is gone; ``map_datapoints``
  delegates per-item coercion to ``coerce_media_context``, which now
  lives once in ``_datapoint.py`` and is imported by
  ``Datapoint``/``Rapid`` field validators and by
  ``RapidataAudience``'s public methods.
- ``_create_dataset_groups`` now calls the same shared
  ``_upload_and_map_asset`` helper instead of branching on str vs list.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: elias <elias@rapidata.ai>
@claude

claude Bot commented Apr 28, 2026

Code Review — PR #580: Support multiple images per media_context

Overview

This PR widens media_context / media_contexts from str | None / list[str] | None to list[str] | None / list[list[str]] | None across the entire SDK. A coerce_media_context() helper centralises normalisation and provides a deprecation path for the old str form. The approach is clean overall, but there are a few issues worth addressing before merge.


🐛 Critical — _validation_rapid_uploader.py: suspicious [0] indexing

context_asset = (
    self._upload_and_map_asset(rapid.media_context)[0]
    if rapid.media_context
    else None
)

The PR description says this file gets a new _upload_and_map_media_context() method, but the diff shows a call to the existing _upload_and_map_asset() with [0] subscript. The existing _upload_and_map_asset in _ValidationRapidUploader appears to return an (IAssetInput, dict) tuple (the dict is then passed to _translate_compare_truth), so [0] may be intentionally extracting the asset input.

The problem is that rapid.media_context is now list[str] while the method was designed for str | list[str] (the main asset). If the method internally handles a list by wrapping it in a MultiAssetInput, then [0] on the returned tuple is fine. But if the method iterates the list as multiple assets (as it might for compare rapids), this call may produce the wrong shape — or fail entirely.

Recommendation: explicitly verify that _upload_and_map_asset(list[str]) returns a tuple whose first element is the correct IAssetInput for the context, add a test, or add a dedicated _upload_and_map_media_context() as the PR description promises.


⚠️ Medium — Public API type signatures are a hard breaking change at the type level

Manager methods like create_classification_order changed from:

media_contexts: list[str] | None = None

to:

media_contexts: list[list[str]] | None = None

Runtime backward compatibility is preserved because coerce_media_context is applied per-element inside map_datapoints. However, existing callers that pass media_contexts=["img1.jpg", "img2.jpg"] will now get a pyright/mypy error even though the code works at runtime. This is inconsistent: the Datapoint model accepts str | list[str] and emits a deprecation warning, but the public manager APIs silently break type-checking for the same pattern.

Recommendation: Change manager signatures to list[list[str]] | list[str] | None (matching _DatapointsValidator.validate_datapoints) and add a runtime deprecation warning at the manager level when a flat list is passed, consistent with how Datapoint handles it.


⚠️ Medium — Audience method signatures changed without deprecation wrapper in type annotations

rapidata_audience.py::add_classification_example and add_compare_example changed media_context from str | None to list[str] | None. The coerce_media_context() call inside handles str at runtime, but the type annotation no longer accepts str, which is inconsistent with the deprecation approach used for Datapoint and Rapid.

Recommendation: Align these method signatures with Datapoint/Rapid — either annotate as str | list[str] | None (with the coerce_media_context call providing the deprecation) or at minimum note the breaking change in the PR.


🔍 Low — Misleading docstring in audience_example_handler.py::_upload_and_map_asset

"""A single string is sent as a plain ``ExistingAssetInput``; a list is
bundled into a ``MultiAssetInput``."""

Both branches call self._asset_mapper.create_existing_asset_input(...), not MultiAssetInput. The docstring is either wrong or create_existing_asset_input wraps multi-names internally — if the latter, the docstring should say so.


🔍 Low — PR description says "rejects mixed lists" but no such validation exists in the diff

_datapoints_validator.py description: "validates both shapes; rejects mixed lists". Looking at the diff, no code rejects ["img1.jpg", ["img2.jpg", "img3.jpg"]]. This either means the validation relies on coerce_media_context raising on non-str/list elements (it does raise on invalid types, but a mixed list[str | list[str]] would silently produce wrong results), or the mixed-list check was inadvertently left out.


🔍 Low — Uncertain invariant comment in _rapidata_dataset.py

# media_context is always a list (the Datapoint validator
# coerces). The shared upload+map helper bundles it into a
# MultiAssetInput (or ExistingAssetInput if a bare string slips
# through).

"if a bare string slips through" contradicts the stated invariant. If the Datapoint validator always coerces to list[str], a bare string can never slip through. The comment should either remove the hedge or explain under what circumstances a bare string could appear.


✅ What's done well

  • coerce_media_context() is a clean single source of truth; delegating to it from both Datapoint and Rapid field validators and audience public APIs is the right pattern.
  • The deprecation warning for bare str inputs is a good UX touch.
  • The assets.update() change in _asset_upload_orchestrator.py is correct — a list[str] media context requires iterating all URLs, not just adding the list object.
  • Documentation examples are clear and show both single- and multi-image usage.
  • The type annotation update to dict[str, tuple[str | None, list[str] | None]] in _rapidata_dataset.py::_create_dataset_groups is correct and consistent.

The same `_upload_and_map_asset` body was copy-pasted into
DatapointUploader, AudienceExampleHandler, and ValidationRapidUploader.
Move it onto AssetUploader as `upload_and_map_asset` (returns
IAssetInput) plus a `_with_mapping` variant that also returns the
original-path -> uploaded-name dict needed for compare-rapid truth
translation. All four call sites now share one implementation.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Co-Authored-By: lino <lino@rapidata.ai>
@claude

claude Bot commented Apr 28, 2026

Code Review — PR #580: Support multiple images per media_context

Overview

This PR widens media_context / media_contexts from str | None / list[str] | None to list[str] | None / list[list[str]] | None across the entire SDK. The central coercion helper coerce_media_context() is a clean single source of truth, and consolidating the duplicate _upload_and_map_asset methods into AssetUploader.upload_and_map_asset[_with_mapping] is a good refactor. Overall the approach is sound, but there are a few issues worth addressing before merging.


Potential Bugs / Breaking Behavior

1. Silent API-level breaking change for single-image callers

Before this PR, passing one image wrapped it in ExistingAssetInput; after, even a single-element list produces MultiAssetInput. If the backend treats these two wrapper types differently (e.g. different rendering, different validation), existing callers that pass a single image now get changed behavior silently — no error, no warning, just different runtime behavior.

Recommend either:

  • Keeping ExistingAssetInput when the list has exactly one element in upload_and_map_asset, or
  • Confirming the backend is agnostic to ExistingAssetInput vs. MultiAssetInput([single_item]).
# _asset_uploader.py — consider this in upload_and_map_asset_with_mapping
if isinstance(asset, list) and len(asset) == 1:
    # preserve ExistingAssetInput behavior for single-image case
    uploaded_name = self.upload_asset(asset[0])
    return AssetMapper.create_existing_asset_input(uploaded_name), {asset[0]: uploaded_name}

2. Deprecation warning spam for existing list[str] callers

The PR claims full backward compatibility, but media_contexts=["img1.jpg", "img2.jpg"] will now emit one logger.warning per item via coerce_media_context. A two-datapoint job fires two warnings; a 1000-datapoint job fires 1000 warnings. This is likely to flood production logs.

Consider emitting a single deprecation per call in DatapointsValidator.validate_datapoints (where the full list is visible) rather than per-element inside map_datapoints.
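A sketch of the call-level variant, using the stdlib logger as a stand-in for the SDK's own logger; `normalize_media_contexts` is illustrative, not the actual validator code:

```python
import logging

logger = logging.getLogger(__name__)


def normalize_media_contexts(media_contexts):
    """Wrap a legacy flat list[str], warning ONCE per call, not per item."""
    if media_contexts is None or not any(
        isinstance(mc, str) for mc in media_contexts
    ):
        return media_contexts
    logger.warning(
        "Passing a flat list[str] for media_contexts is deprecated; "
        "pass a list of lists of strings instead."
    )
    return [[mc] if isinstance(mc, str) else mc for mc in media_contexts]
```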

3. _rapidata_dataset.py comment hints at an unguarded path

# The shared upload+map helper bundles it into a
# MultiAssetInput (or ExistingAssetInput if a bare string slips through).

"If a bare string slips through" suggests the code may be reachable with a str that bypasses the Pydantic validator. If this path is truly unreachable, the comment should be removed (per project conventions). If it is reachable, the code needs an explicit guard.


Correctness

4. PR summary says "rejects mixed lists" — this is not enforced

DatapointsValidator.validate_datapoints accepts list[list[str]] | list[str] | None. A caller can pass ["img1.jpg", ["img2.jpg", "img3.jpg"]] — each element is independently coerced, so no error is raised. The claim in the PR description is misleading. Either add an explicit check or remove the claim.

# example of what should be caught if the claim is accurate:
if media_contexts and not all(isinstance(m, type(media_contexts[0])) for m in media_contexts):
    raise ValueError("media_contexts must be uniformly list[str] or list[list[str]], not mixed.")

Documentation

5. Old list[str] form is dropped from docs but still works (with warnings)

docs/job_definition_parameters.md now only shows the list[list[str]] form. Existing users with media_contexts=["original1.jpg", "original2.jpg"] will get surprise deprecation warnings and no migration guidance. At minimum, a migration note would help:

# Old form (deprecated, still accepted):
media_contexts=["original1.jpg", "original2.jpg"]

# New preferred form:
media_contexts=[["original1.jpg"], ["original2.jpg"]]

Code Style (CLAUDE.md conventions)

6. Comments explain what rather than why

Per project conventions, comments should only capture non-obvious why, not what. Several new comments describe implementation mechanics that are already clear from the code:

  • _asset_upload_orchestrator.py: # media_context is always a list of strings (the model coerces legacy single strings...) — remove
  • _datapoints_validator.py: The 4-line block comment before the list comprehension — remove
  • _rapidata_dataset.py: # media_context is always a list (the Datapoint validator coerces)... — remove

Minor / Nits

7. upload_and_map_asset_with_mapping uploads sequentially

asset_to_uploaded = {a: self.upload_asset(a) for a in asset}

For multi-image contexts this is sequential. The existing uploader already has a SingleFlightCache and threading infrastructure — parallel upload here would be consistent with how datapoint assets are handled. Not a blocker, but worth a follow-up.
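A hedged sketch of what a parallel variant could look like, using a plain `ThreadPoolExecutor` fan-out. `upload_asset` here is a stand-in for the SDK's cached uploader, and the worker count is an arbitrary choice, not an SDK default:

```python
from concurrent.futures import ThreadPoolExecutor

def upload_assets_parallel(uploader, assets, max_workers=8):
    """Upload each asset path concurrently; return a path -> uploaded-name map.

    uploader.upload_asset stands in for the SDK's cached upload method;
    only the fan-out pattern is the point of this sketch.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() preserves input order, so zip() pairs paths correctly
        uploaded = list(pool.map(uploader.upload_asset, assets))
    return dict(zip(assets, uploaded))
```

Because `upload_asset` is already cached and thread-safe infrastructure exists, this would be a small change rather than a redesign.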

8. No upper bound on inner list size

Nothing prevents media_context=["img1.jpg"] * 1000. A reasonable cap or at least a logged warning for large inner lists would prevent accidental misuse.
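For illustration, the guard could be as small as the following boundary check. Both the constant and the logging behavior are assumptions, not existing SDK behavior:

```python
import logging

logger = logging.getLogger(__name__)

MAX_CONTEXT_IMAGES = 20  # hypothetical cap, not an SDK constant

def check_context_size(media_context: list[str]) -> list[str]:
    """Warn (without failing) when an inner media_context list is suspiciously large."""
    if len(media_context) > MAX_CONTEXT_IMAGES:
        logger.warning(
            "media_context has %d images; more than %d is likely accidental",
            len(media_context),
            MAX_CONTEXT_IMAGES,
        )
    return media_context
```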

9. Optional[...] vs X | None inconsistency

rapidata_order_manager.py still uses Optional[list[list[str]]] for create_ranking_order while all other methods in the same file use list[list[str]] | None. Prefer the modern form for consistency.


Summary

| Category | Severity |
| --- | --- |
| Single-image MultiAssetInput behavior change | Medium — needs confirmation with backend team |
| Deprecation warning flood | Medium — will impact existing users |
| "Rejects mixed lists" claim not implemented | Low — misleading PR description |
| Comments violating CLAUDE.md conventions | Low — style issue |
| Missing migration docs | Low — UX improvement |

The core design is solid. The main asks before merging: (1) confirm MultiAssetInput vs ExistingAssetInput is safe for single-image case, and (2) fix the warning-per-item behavior for the backward-compat path.

The two-method API existed only because compare-rapid truth translation
needed a path -> uploaded-name dict. But `upload_asset` is always cached
(see its docstring: "Caching is always enabled... cannot be disabled"),
so the validation uploader can just call it again on `winnerId` /
`correctCombinations` entries — those calls are guaranteed cache hits,
not re-uploads.

That collapses AssetUploader back to a single `upload_and_map_asset`
returning `IAssetInput` directly, kills the
`upload_and_map_asset_with_mapping(asset)[0]` smell, and isolates the
compare-truth concern inside `_translate_compare_truth` where it
belongs.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Co-Authored-By: lino <lino@rapidata.ai>
@claude

claude Bot commented Apr 29, 2026

Code Review: feat(datapoints) — support multiple images per media_context

Overview

This PR widens media_context/media_contexts from str | None → list[str] | None (per-datapoint) and from list[str] | None → list[list[str]] | None (outer API), enabling annotators to see multiple reference images per task. The coerce_media_context() helper provides runtime backward compatibility for callers that still pass a bare string.

The overall approach is sound — centralising coercion in one helper, moving upload_and_map_asset() to AssetUploader, and eliminating duplicated upload patterns across four upload classes. Good DX refactor.


Issues

1. Breaking API change, not fully backward-compatible (type-level)

The PR summary says "fully backward-compatible: all existing list[str] callers continue to work without changes", but the public manager signatures were changed from list[str] | None to list[list[str]] | None (dropping the union). Any codebase using Pyright/mypy that passes media_contexts=["img1.jpg", "img2.jpg"] to an order/job/validation manager will now get a static type error.

The runtime is fine — coerce_media_context is called per-item inside Datapoint — but static callers are broken. Either:

  • Keep the union in public signatures: list[list[str]] | list[str] | None (as _datapoints_validator.py still does internally), or
  • Accept this is a breaking change and document it as such in the PR.

The inconsistency is visible right now: _datapoints_validator.validate_datapoints and map_datapoints still accept list[list[str]] | list[str] | None, but the callers above them (rapidata_order_manager, rapidata_job_manager, etc.) advertise only list[list[str]] | None.

2. Mixed-list claim vs. actual behaviour

The PR summary says "rejects mixed lists", but no explicit validation rejects [["a.jpg"], "b.jpg"]. At runtime each item goes through coerce_media_context, which silently wraps the bare string "b.jpg" into ["b.jpg"] (with a deprecation warning). A user who accidentally mixes shapes gets a warning but no error. If rejection is the intended contract, add an explicit check in validate_datapoints.

3. Stale/misleading comment in _rapidata_dataset.py

# media_context is always a list (the Datapoint validator
# coerces). The shared upload+map helper bundles it into a
# MultiAssetInput (or ExistingAssetInput if a bare string slips
# through).

The "or ExistingAssetInput if a bare string slips through" clause refers to a code path that can't happen once the validator always normalises to list[str]. Remove the parenthetical to avoid confusion.

4. _translate_compare_truth correctness concern

The old implementation built asset_to_uploaded with {a: upload(a) for a in asset} — a single upload pass, results reused for truth translation. The new implementation calls self.asset_uploader.upload_asset(original_winner_id) again inside the truth-translation loop. This relies on the cache always hitting for assets that were just uploaded in upload_rapid. That assumption holds today, but the method no longer makes this guarantee explicit in its contract.
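To illustrate why the second call is guaranteed to hit, here is a minimal sketch of an always-on upload cache in the spirit of the quoted docstring. This is a hypothetical class, not the SDK's actual SingleFlightCache:

```python
import threading

class CachedUploader:
    """Minimal always-on upload cache, assuming the contract quoted in
    the PR ("Caching is always enabled... cannot be disabled").
    """

    def __init__(self, do_upload):
        self._do_upload = do_upload  # the real network upload
        self._cache: dict[str, str] = {}
        self._lock = threading.Lock()

    def upload_asset(self, path: str) -> str:
        with self._lock:
            if path not in self._cache:
                self._cache[path] = self._do_upload(path)
            return self._cache[path]
```

Under this contract a repeated `upload_asset(winner_id)` is a dictionary lookup, not a re-upload; the review's point is that the method's docstring, not just its implementation, should state this.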

Additionally, the new asset_paths set is constructed from rapid.asset (the datapoints to label), but rapid.media_context isn't part of this set. If a compare_truth winner ID could ever reference a media context URL this would silently fail translation — confirm that's not possible.

5. rapidata_audience.py — double coercion

add_classification_example and add_compare_example now call coerce_media_context(media_context) before delegating to AudienceExampleHandler, which accepts list[str] | None directly. That coercion is necessary rather than redundant: the handler does not call coerce_media_context itself. The real concern is the opposite direction — a user calling rapidata_audience.add_classification_example(media_context="img.jpg") will hit a Pyright error (the signature says list[str]) even though the runtime coercion would handle it.

This is the same root issue as #1: runtime compat without type compat.

6. Documentation shows only list[list[str]] without a migration note

docs/job_definition_parameters.md now documents only the nested-list form and omits the old flat-string example. Users reading the docs have no indication that the old form still works at runtime. Add a brief backward-compat note (e.g. # list[str] also accepted but deprecated).


Minor observations

  • coerce_media_context(value: object) — the broad object annotation is correct for a Pydantic mode="before" validator; no change needed.
  • The upload_and_map_asset addition to AssetUploader is a clean consolidation; the logic was duplicated in three places before.
  • assets.update(datapoint.media_context) in _asset_upload_orchestrator.py is correct and minimal.

Summary

| Severity | Item |
| --- | --- |
| High | Public API signatures claim list[list[str]] only — breaks static type checking for existing callers |
| Medium | Mixed-list behaviour differs from documented "rejects mixed lists" claim |
| Low | Stale comment about "bare string slips through" in _rapidata_dataset.py |
| Low | Documentation missing deprecation/compat note for flat list[str] |
| Low | _translate_compare_truth reliance on cache hit for correctness is undocumented |

The core logic is well-structured. The main thing to resolve before merge is whether this is officially a breaking change (and documented as such) or whether the public type signatures should preserve the list[str] union for true backward compatibility.

@claude

claude Bot commented Apr 30, 2026

Code Review: feat(datapoints): support multiple images per media_context

Overview

This PR widens media_context / media_contexts across the entire SDK from a single string (or flat list of strings) to a list of strings (or list of lists of strings), allowing annotators to see multiple reference images per datapoint. The architecture is sensible — a single coerce_media_context helper centralizes coercion and is delegated to by Pydantic field_validators on both Datapoint and Rapid. The upload_and_map_asset method is a clean consolidation of duplicated upload+wrap logic.


Issues

1. Incomplete backward compatibility — breaking change at public API level

The PR claims full backward compat, but the public manager APIs now type media_contexts as list[list[str]] | None, dropping list[str]:

# rapidata_order_manager.py, rapidata_job_manager.py, etc.
media_contexts: list[list[str]] | None = None   # was: list[str] | None

The internal _datapoints_validator.py still accepts list[list[str]] | list[str] | None, so runtime behavior is preserved — but existing callers using media_contexts=["img1.jpg", "img2.jpg"] now get pyright / IDE type errors. That gap between the public annotation and the validator is confusing and will break strict-typed codebases upgrading to this version.

Suggestion: Either keep list[list[str]] | list[str] | None at the public API layer with a deprecation warning for the flat-list form, or explicitly document this as a breaking change in the changelog.
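One hedged sketch of the first option — normalizing the union once at the public boundary so internals only ever see list[list[str]]. The helper name and warning text are illustrative; nothing like this exists in the diff:

```python
import warnings

def normalize_media_contexts(media_contexts):
    """Accept list[list[str]] | list[str] | None at the public API and
    return list[list[str]] | None, warning on the deprecated flat form.
    """
    if media_contexts is None:
        return None
    if media_contexts and all(isinstance(m, str) for m in media_contexts):
        warnings.warn(
            "flat list[str] media_contexts is deprecated; wrap each "
            "entry in its own list",
            DeprecationWarning,
        )
        return [[m] for m in media_contexts]
    return media_contexts
```

This keeps static callers working and produces exactly one warning per call instead of one per element.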

2. "Slips through" comment undermines the invariant guarantee

In _rapidata_dataset.py:

# The shared upload+map helper bundles it into a
# MultiAssetInput (or ExistingAssetInput if a bare string slips through).

If coerce_media_context is always called before a Datapoint is constructed, a bare string should never slip through. The comment signals incomplete confidence in the invariant. Either remove it (trusting the validator) or add an assert isinstance(media_context, list) guard if the defensive concern is real.

3. Missing explicit mixed-list rejection

The PR summary says it "rejects mixed lists" but no such check exists in the diff. _datapoints_validator.py accepts list[list[str]] | list[str] and calls coerce_media_context per element, which silently coerces a flat list[str] to a list[list[str]]. A user who accidentally mixes — e.g. [["a.jpg"], "b.jpg"] — gets a confusing downstream error rather than a clear validation message. If rejection is intended, add it explicitly.

4. Audience API coercion gap

rapidata_audience.py calls coerce_media_context before delegating to AudienceExampleHandler. But AudienceExampleHandler itself now types media_context: list[str] | None with no coercion of its own. If AudienceExampleHandler is called directly (e.g. in tests or future internal callers) with a bare string, there is no safety net — the string reaches upload_and_map_asset which does handle str, but pyright would flag it as a type error and the behavior is accidental rather than intentional.

5. Docs show only the new form — no migration note

The old canonical example:

media_contexts=["original1.jpg", "original2.jpg"]

is replaced without any migration note. Users upgrading will hit a type error and have no guidance. A short callout like "Previously accepted a flat list of strings — wrap each entry in its own list: [["img.jpg"]] — or pass a multi-element list for multiple images." would help.


Minor

  • The uv.lock version jumps from 3.9.8 to 3.9.12 (4 releases). If those intermediate releases included unrelated changes that are now bundled here, it is worth calling out in the description.
  • The PR summary says the _datapoints_validator "validates both shapes; rejects mixed lists" but the actual diff only shows that it accepts both shapes. Aligning the description with the code would prevent confusion during future code archaeology.
  • No automated tests are included for the new multi-image behavior. The test plan relies on manual verification. Even a unit test for coerce_media_context edge cases (empty list, mixed types, None, str) would significantly increase confidence.
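A minimal pytest-style sketch of such edge-case tests, written against a stand-in implementation since the real helper lives in the SDK:

```python
import warnings

def coerce_media_context(value):
    """Stand-in for the SDK helper, mirroring the coercion behavior
    described in this review (exact SDK behavior may differ)."""
    if value is None:
        return None
    if isinstance(value, str):
        warnings.warn("bare media_context strings are deprecated", DeprecationWarning)
        return [value]
    return list(value)

def test_none_passthrough():
    assert coerce_media_context(None) is None

def test_bare_string_is_wrapped_with_warning():
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        assert coerce_media_context("img.jpg") == ["img.jpg"]
    assert any(issubclass(c.category, DeprecationWarning) for c in caught)

def test_list_passthrough():
    assert coerce_media_context(["a.jpg", "b.jpg"]) == ["a.jpg", "b.jpg"]

def test_empty_list():
    assert coerce_media_context([]) == []
```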

Verdict

The core design — coerce_media_context as single source of truth, upload_and_map_asset consolidating upload+wrap — is clean and the implementation is largely correct. The main concern worth resolving before merge is item 1 (the public API type signature incompatibility with the backward-compat claim) and item 5 (docs need a migration note). Items 2-4 are lower risk but worth a quick pass.

@LinoGiger LinoGiger merged commit a09abd6 into main Apr 30, 2026
3 checks passed
@LinoGiger LinoGiger deleted the feat(datapoints)/multi-image-media-context branch April 30, 2026 13:43