
feat(datapoints): support multiple images per media_context #580

Merged
LinoGiger merged 6 commits into main from
feat(datapoints)/multi-image-media-context
Apr 30, 2026

Conversation

@RapidPoseidon (Contributor)

Summary

  • Widened media_contexts from list[str] to list[str] | list[list[str]] across all order/job/validation/audience APIs
  • When a caller passes a list of lists, each inner list is uploaded and wrapped in a MultiAssetInput, so annotators see multiple reference images for that datapoint
  • Fully backward-compatible: all existing list[str] callers continue to work without changes

What changed

| File | Change |
| --- | --- |
| `_datapoint.py` | `media_context: str \| list[str] \| None` |
| `_datapoints_validator.py` | validates both shapes; rejects mixed lists |
| `_datapoint_uploader.py` | `_upload_and_map_media_context()` handles `str` or `list` |
| `_asset_upload_orchestrator.py` | extracts all URLs from list `media_context` |
| `_rapidata_dataset.py` | `_create_dataset_groups` handles list `media_context` |
| `rapids.py` | `Rapid.media_context: str \| list[str] \| None` |
| `_validation_rapid_uploader.py` | new `_upload_and_map_media_context()` |
| `rapids_manager.py` | all rapid builder `media_context` params updated |
| `validation_set_manager.py` | all `create_*_set` signatures updated |
| `rapidata_order_manager.py` | all `create_*_order` signatures updated |
| `rapidata_job_manager.py` | all `create_*_job_definition` signatures updated |
| `rapidata_job_definition.py` | `update_dataset` signature updated |
| `audience_example_handler.py` | new `_upload_and_map_media_context()`; both example methods updated |
| `rapidata_audience.py` | `add_classification_example` / `add_compare_example` signatures updated |
| `docs/job_definition_parameters.md` | updated `media_contexts` reference with multi-image example |

Test plan

  • pyright passes: 0 errors, 0 warnings, 0 informations ✅ (verified in this session)
  • Pass media_contexts=[["img1.jpg", "img2.jpg"], ["img3.jpg"]] to a classification order and confirm both images appear as context in the rapid
  • Pass existing media_contexts=["img1.jpg", "img2.jpg"] and confirm no regression

🔗 Session: https://session-c1c6a759.poseidon.rapidata.internal/

Allow media_contexts to accept list[list[str]] in addition to list[str],
so each datapoint can display multiple reference images in the rapid.

The type is widened throughout the full call stack:
- Datapoint.media_context: str | list[str] | None
- DatapointsValidator accepts list[str] | list[list[str]] | None
- DatapointUploader, ValidationRapidUploader, AudienceExampleHandler
  each grow _upload_and_map_media_context(), which uploads all URLs and
  wraps them in a MultiAssetInput (an IAssetInput) when more than one is given
- AssetUploadOrchestrator extracts all URLs from list media_context
- RapidataDataset._create_dataset_groups handles list media_context
- Rapid model, RapidsManager rapid builders, all order/job/validation
  manager methods, and RapidataAudience examples updated accordingly
- Fully backward-compatible: existing list[str] callers are unaffected

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: elias <elias@rapidata.ai>
@claude

claude Bot commented Apr 28, 2026

Code Review — PR #580: Support multiple images per media_context

Overview

This PR widens media_contexts from list[str] to list[str] | list[list[str]] uniformly across all order/job/validation/audience APIs. The implementation is backward-compatible, well-documented, and correctly propagates the type change through all layers. Overall the approach is sound. A few issues worth addressing before merge are noted below.


Issues

1. _upload_and_map_media_context is duplicated across three classes (and inlined in a fourth file)

The exact same method body appears in:

  • _datapoint_uploader.py
  • _validation_rapid_uploader.py
  • audience_example_handler.py
  • _rapidata_dataset.py (inlined, not extracted)

This is the most significant code-quality issue in the PR. Any future change to upload/mapping logic must be applied to all four places. Consider extracting it to a shared mixin, utility function, or the AssetMapper/AssetUploader layer itself:

# e.g. in asset_mapper or a small shared helper
def upload_and_map_media_context(
    asset_uploader: AssetUploader,
    asset_mapper: AssetMapper,
    media_context: str | list[str],
) -> IAssetInput:
    if isinstance(media_context, list):
        return asset_mapper.create_existing_asset_input(
            [asset_uploader.upload_asset(mc) for mc in media_context]
        )
    return asset_mapper.create_existing_asset_input(
        asset_uploader.upload_asset(media_context)
    )

2. Dead-code continue in _datapoints_validator.py

# lines 310-313
if has_list:
    for mc in media_contexts:
        if not isinstance(mc, list):
            # should not happen because of has_list/has_str check
            continue

The has_str and has_list guard above already ensures that if we reach this branch, every element is a list. The isinstance check plus continue is unreachable. The comment acknowledges this, but unreachable code is still noise. Either remove the guard entirely or use an assert if you want to document the invariant:

# replace the dead continue with:
assert isinstance(mc, list)  # guaranteed by has_str/has_list check above

3. Style inconsistency — Optional[...] vs ... | None

rapidata_order_manager.py line 598 uses Optional[list[str] | list[list[str]]], while every other changed signature in this PR uses list[str] | list[list[str]] | None. Pick one form (the project convention is X | None) and make it consistent.

4. Sequential uploads inside _create_dataset_groups

uploaded_names = [
    asset_uploader.upload_asset(mc) for mc in media_context
]

This is sequential. The single-asset path is unaffected, but for a datapoint with many context images this could be slow. If the rest of the upload layer already has a parallel/batch upload mechanism, consider using it here too. At minimum, a TODO comment would keep this visible.
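One possible shape for that follow-up, sketched with the stdlib only. `upload_media_context` and the lambda uploader below are stand-ins, not the SDK's actual upload layer, which may already offer a better batch mechanism:

```python
from concurrent.futures import ThreadPoolExecutor


def upload_media_context(upload_asset, media_context: list[str]) -> list[str]:
    """Upload all context assets concurrently, preserving input order."""
    # executor.map yields results in submission order, so the returned
    # names line up with the input URLs even when uploads finish out of order.
    with ThreadPoolExecutor(max_workers=8) as pool:
        return list(pool.map(upload_asset, media_context))


names = upload_media_context(lambda url: f"uploaded/{url}", ["a.jpg", "b.jpg"])
```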

5. No automated tests added

The PR lands meaningful new validation logic (media_context_not_empty for lists, mixed-shape rejection in DatapointsValidator). These are good candidates for fast unit tests — the validator logic is pure Python and easy to exercise in isolation. The manual test plan in the PR description does not replace automated coverage of the rejection paths (empty list, mixed list, empty inner string, etc.).
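Those rejection paths are cheap to pin down in isolation. A sketch of what such tests could assert; `validate_media_contexts` here is a stand-in reference implementation of the documented rules, not the SDK's actual DatapointsValidator:

```python
def validate_media_contexts(media_contexts):
    """Stand-in for the documented rejection rules (not the real validator)."""
    if media_contexts is None:
        return
    has_str = any(isinstance(mc, str) for mc in media_contexts)
    has_list = any(isinstance(mc, list) for mc in media_contexts)
    if has_str and has_list:
        raise ValueError("mixed list of strings and lists")
    for mc in media_contexts:
        items = [mc] if isinstance(mc, str) else mc
        if not items:
            raise ValueError("empty inner list")
        if any(not isinstance(item, str) or item == "" for item in items):
            raise ValueError("entries must be non-empty strings")


def rejects(value) -> bool:
    try:
        validate_media_contexts(value)
    except ValueError:
        return True
    return False


assert rejects(["a.jpg", ["b.jpg"]])              # mixed shapes
assert rejects([[]])                              # empty inner list
assert rejects([["a.jpg", ""]])                   # empty inner string
assert not rejects([["a.jpg"], ["b.jpg", "c.jpg"]])
```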


Minor / Non-blocking

  • _datapoints_validator.py mixed-list error message refers to "a list of strings or a list of lists of strings" — consider making the message actionable, e.g. "Wrap each single string in a list, or pass each value as a bare string."
  • _datapoint.py validator correctly handles the list case, but any(not isinstance(item, str) or item == "" for item in v) evaluates even past the first failure. Fine for typical input sizes, but next((True for item in v if not isinstance(item, str) or item == ""), False) would short-circuit.
  • The docs update is clear and the examples are helpful. One small nit: the table in job_definition_parameters.md renders \| literally in some Markdown renderers — the escaped pipe in the type cell may look odd depending on the doc renderer in use.

Summary

| Area | Status |
| --- | --- |
| Correctness | ✅ Logic is sound and backward-compatible |
| Type annotations | ✅ Consistent with the project's `from __future__ import annotations` convention |
| Code duplication | ⚠️ `_upload_and_map_media_context` duplicated 3–4× |
| Dead code | ⚠️ Unreachable `continue` in validator |
| Style consistency | ⚠️ One `Optional[...]` outlier |
| Tests | ⚠️ No automated tests for new validation paths |
| Docs | ✅ Updated and clear |

The duplication item is the most important to address before merge — everything else is low risk.

@EliasTeikari EliasTeikari requested a review from LinoGiger April 28, 2026 11:49
Per follow-up feedback: simplify to a single canonical shape and warn on
the legacy form instead of supporting both.

- Public API ``media_contexts`` is now ``list[list[str]] | None``. A flat
  ``list[str]`` is still accepted for backward compatibility but emits a
  deprecation warning and is wrapped automatically (each string -> [str]).
- Singular ``media_context`` (``Datapoint``, ``Rapid``, audience example
  helpers, ``RapidsManager`` rapid builders) is now ``list[str] | None``;
  passing a ``str`` emits the same deprecation warning and is wrapped.
- All uploaders / handlers (``DatapointUploader``,
  ``ValidationRapidUploader``, ``AudienceExampleHandler``, dataset group
  creation) drop their isinstance branching — they now always receive a
  ``list[str]`` and unwrap single-element lists into ``ExistingAssetInput``
  for backwards-compatible rendering, only using ``MultiAssetInput`` when
  there is more than one asset.
- ``DatapointsValidator`` exposes a shared ``_normalize_media_contexts``
  helper used by ``map_datapoints`` and the two ranking-style entry
  points (``RapidataOrderManager.create_ranking_order`` and the
  job-manager equivalent) and by ``ValidationSetManager``.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: elias <elias@rapidata.ai>
@claude

claude Bot commented Apr 28, 2026

Code Review — PR #580: Support multiple images per media_context

Overview

Clean, well-scoped feature. The approach — widening media_context/media_contexts to accept lists, normalising legacy strings at the boundary with deprecation warnings, and keeping list[list[str]] internally — is sound. The changes are consistent across the full surface area of the SDK. A few issues worth addressing before merging.


Issues

1. _upload_and_map_media_context is duplicated three times

Identical method bodies appear in:

  • _datapoint_uploader.py (DatapointUploader)
  • _validation_rapid_uploader.py (ValidationRapidUploader)
  • audience_example_handler.py (AudienceExampleHandler)
# all three are literally identical
uploaded_names = [self.asset_uploader.upload_asset(mc) for mc in media_context]
if len(uploaded_names) == 1:
    return self.asset_mapper.create_existing_asset_input(uploaded_names[0])
return self.asset_mapper.create_existing_asset_input(uploaded_names)

Since all three classes already share asset_uploader / asset_mapper via composition, this is a good candidate for a standalone helper function (or a shared mixin).

2. _upload_and_map_media_context has a questionable single/multi branch

Both the single-asset and multi-asset paths call the same method:

return self.asset_mapper.create_existing_asset_input(uploaded_names[0])  # str
return self.asset_mapper.create_existing_asset_input(uploaded_names)      # list[str]

Relying on create_existing_asset_input silently switching behaviour based on argument type makes the intent opaque and is fragile. Consider using an explicit create_multi_asset_input (or equivalent) for the list path so the two code paths are visibly different.
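A sketch of what the explicit split could look like. `create_multi_asset_input` is the hypothetical method suggested above, and `StubMapper` only stands in for the real AssetMapper:

```python
class StubMapper:
    """Stand-in for AssetMapper, with the suggested explicit multi method."""

    def create_existing_asset_input(self, name: str) -> dict:
        return {"kind": "existing", "name": name}

    def create_multi_asset_input(self, names: list[str]) -> dict:
        return {"kind": "multi", "names": names}


def upload_and_map_media_context(upload_asset, mapper, media_context: list[str]):
    uploaded = [upload_asset(mc) for mc in media_context]
    if len(uploaded) == 1:
        # single image keeps the legacy rendering path
        return mapper.create_existing_asset_input(uploaded[0])
    # the multi-image path is now visibly different, not an isinstance overload
    return mapper.create_multi_asset_input(uploaded)
```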

3. _normalize_media_contexts is a private symbol imported across four files

_normalize_media_contexts (underscore prefix ⇒ private) is imported by:

  • rapidata_order_manager.py
  • rapidata_job_manager.py
  • validation_set_manager.py
  • (called internally in _datapoints_validator.py)

A function used by four different modules is not private. Either remove the underscore or move it to a shared utilities module.

4. Inconsistent normalisation point for ranking vs. everything else

For non-ranking flows, normalisation happens inside DatapointsValidator.map_datapoints. For ranking it is done manually before the length check — and the import is inlined inside the function body:

# in create_ranking_order / _create_ranking_job_definition — inside a span block
from rapidata.rapidata_client.datapoints._datapoints_validator import (
    _normalize_media_contexts,
)
media_contexts = _normalize_media_contexts(media_contexts)

Inline imports inside function bodies are a code smell. Move the import to the top of the file. Better still, refactor ranking to also go through a shared normalisation path.

5. Validator duplication between Datapoint and Rapid

Datapoint.media_context_normalize (_datapoint.py) and Rapid.media_context_normalize (rapids.py) are byte-for-byte identical. Extract into a standalone validator function imported by both.

6. Rapid validator duplicates _coerce_media_context too

The audience-level _coerce_media_context function in rapidata_audience.py handles the str → list[str] coercion with the same logic as both Pydantic validators. Three implementations of the same 15-line logic is maintenance debt.

7. Validation-set/classification media_contexts check uses truthiness

media_contexts = _normalize_media_contexts(media_contexts)
if media_contexts and len(media_contexts) != len(datapoints):

After normalisation media_contexts is either None or a non-empty list[list[str]], so media_contexts and … is fine now. But is not None is more explicit and less surprising to a future reader. The ranking paths already got this right:

if media_contexts is not None and len(media_contexts) != len(datapoints):

(_datapoints_validator.py line 434 in the diff). Make it consistent everywhere.


Minor / Nits

  • Documentation type mismatch. job_definition_parameters.md now shows Optional[list[list[str]]] as the canonical type, but the backward-compatible note only appears in the constraint text. The old single-string example has been removed entirely, which may confuse users who search the docs for the old pattern before they hit the deprecation warning.

  • Asset uploads are sequential. _upload_and_map_media_context uploads multiple assets one-by-one in a list comprehension. For the common case (1 image) this is irrelevant, but for the multi-image path it could be parallelised with ThreadPoolExecutor if upload latency matters.

  • No automated test coverage. The PR adds meaningful new code paths (multi-image upload, deprecation-warning coercion, empty-list rejection, mixed-type rejection) but ships no unit tests. At minimum, the _normalize_media_contexts function and the Datapoint/Rapid validators are straightforward to unit-test without hitting the API.


Summary

The feature is well-thought-out and the backward-compat story is solid. The main asks before merging are:

  1. De-duplicate _upload_and_map_media_context (or at minimum leave a comment explaining why it is intentionally duplicated).
  2. De-duplicate the media_context Pydantic validator between Datapoint and Rapid.
  3. Remove the underscore from _normalize_media_contexts or move it somewhere explicitly shared.
  4. Use is not None guards consistently for the post-normalisation length checks.
  5. Move the inline imports in ranking methods to the top of the file.

Comment thread docs/job_definition_parameters.md Outdated
Comment on lines +188 to +190

> A flat `list[str]` is still accepted for backward compatibility — it triggers a deprecation warning and each string is automatically wrapped in a single-element list.

Collaborator: remove this comment

Comment thread docs/job_definition_parameters.md Outdated
Media URLs shown as reference context alongside each datapoint. Useful when you need to show one or more reference images / videos alongside the item being evaluated.

**Constraints:** If provided, must have the same length as `datapoints`.
**Constraints:** If provided, must have the same length as `datapoints`. Each entry is itself a list of media URLs / paths. Use a single-element inner list for one media asset per datapoint, or multiple entries to display several images / videos.
Collaborator: do not mention video as an input here; even though it is technically supported, it is not meant to be used for videos, so only mention images

data_type (str, optional): The type of the datapoint. Defaults to "media" (any form of image, video or audio).
context (str, optional): The context is text that will be shown in addition to the instruction. Defaults to None.
media_context (str, optional): The media context is a link to an image / video that will be shown in addition to the instruction (can be combined with context). Defaults to None.
media_context (list[str], optional): A list of links to images / videos that will be shown in addition to the instruction (can be combined with context). Pass a single-element list for one media asset, or multiple to display multiple images / videos. Defaults to None.
Collaborator: same as above

data_type (Literal["media", "text"], optional): The data type of the datapoints. Defaults to "media".
context (str, optional): Additional text context to display with the example. Defaults to None.
media_context (str, optional): Additional media (URL or path) to display with the example. Defaults to None.
media_context (list[str], optional): Additional media (URLs or paths) to display with the example. Pass a single-element list for one media asset, or multiple to display multiple images / videos. Defaults to None.
Collaborator: will stop commenting now, but please remove the video mention everywhere

Comment on lines +33 to +48
def _upload_and_map_media_context(
    self, media_context: list[str]
) -> IAssetInput:
    """Upload media context asset(s) and map to IAssetInput.

    ``media_context`` is always a list. A single-element list is sent as a
    plain ``ExistingAssetInput`` (preserving the legacy single-image
    rendering); two or more entries are bundled into a ``MultiAssetInput``.
    """
    uploaded_names = [
        self.asset_uploader.upload_asset(mc) for mc in media_context
    ]
    if len(uploaded_names) == 1:
        return self.asset_mapper.create_existing_asset_input(uploaded_names[0])
    return self.asset_mapper.create_existing_asset_input(uploaded_names)

Collaborator: why is this code duplicated?

Comment on lines +7 to +69
def _normalize_media_contexts(
    media_contexts: list[list[str]] | list[str] | None,
) -> list[list[str]] | None:
    """Normalize media_contexts to ``list[list[str]] | None``.

    Accepts the legacy ``list[str]`` shape (one media context per datapoint)
    by wrapping each string in a single-element list, after emitting a
    deprecation warning. This way the rest of the pipeline only ever has
    to deal with ``list[list[str]]``.
    """
    if media_contexts is None:
        return None

    if not isinstance(media_contexts, list):
        raise ValueError(
            "media_contexts must be a list of lists of strings or None, "
            f"got {type(media_contexts).__name__}."
        )

    has_str = any(isinstance(mc, str) for mc in media_contexts)
    has_list = any(isinstance(mc, list) for mc in media_contexts)
    if has_str and has_list:
        raise ValueError(
            "media_contexts must be a list of lists of strings, not a mix of strings and lists."
        )

    if has_str:
        logger.warning(
            "Passing a flat list of strings for media_contexts is deprecated; "
            "pass a list of lists of strings instead. Each string has been "
            "wrapped in a single-element list."
        )
        normalized: list[list[str]] = []
        for mc in media_contexts:
            if not isinstance(mc, str):
                raise ValueError(
                    "media_contexts must be a list of lists of strings, "
                    f"got element of type {type(mc).__name__}."
                )
            if mc == "":
                raise ValueError(
                    "media_contexts entries cannot be empty strings."
                )
            normalized.append([mc])
        return normalized

    # list[list[str]] case
    for mc in media_contexts:
        if not isinstance(mc, list):
            raise ValueError(
                "media_contexts must be a list of lists of strings, "
                f"got element of type {type(mc).__name__}."
            )
        if len(mc) == 0:
            raise ValueError(
                "Each inner media_contexts list must contain at least one string. "
                "Use None for the whole field if not needed."
            )
        if any(not isinstance(item, str) or item == "" for item in mc):
            raise ValueError(
                "Every entry in a media_contexts inner list must be a non-empty string."
            )
    return cast("list[list[str]]", media_contexts)
Collaborator: this also seems like duplicate code that already exists in the datapoint? either do it in the validator or in the datapoint itself, not both

- Drop "video" wording from docstrings and docs — media_context is for
  images only, even though the plumbing technically also handles other
  asset kinds.
- Remove the deprecation note from docs/job_definition_parameters.md.
- Stop duplicating asset upload/map. ``_upload_and_map_media_context``
  helpers were near-clones of ``_upload_and_map_asset``; both
  ``DatapointUploader`` and ``ValidationRapidUploader`` now reuse their
  existing ``_upload_and_map_asset`` for media_context. The
  ``AudienceExampleHandler`` gains one ``_upload_and_map_asset`` shared
  by all three callers (main asset, classification, compare) and the
  inline asset-upload blocks were folded into it.
- Stop duplicating coercion logic. The plural
  ``_normalize_media_contexts`` helper is gone; ``map_datapoints``
  delegates per-item coercion to ``coerce_media_context``, which now
  lives once in ``_datapoint.py`` and is imported by
  ``Datapoint``/``Rapid`` field validators and by
  ``RapidataAudience``'s public methods.
- ``_create_dataset_groups`` now calls the same shared
  ``_upload_and_map_asset`` helper instead of branching on str vs list.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: elias <elias@rapidata.ai>
@claude

claude Bot commented Apr 28, 2026

Code Review — PR #580: Support multiple images per media_context

Overview

This PR widens media_context / media_contexts from str | None / list[str] | None to list[str] | None / list[list[str]] | None across the entire SDK. A coerce_media_context() helper centralises normalisation and provides a deprecation path for the old str form. The approach is clean overall, but there are a few issues worth addressing before merge.


🐛 Critical — _validation_rapid_uploader.py: suspicious [0] indexing

context_asset = (
    self._upload_and_map_asset(rapid.media_context)[0]
    if rapid.media_context
    else None
)

The PR description says this file gets a new _upload_and_map_media_context() method, but the diff shows a call to the existing _upload_and_map_asset() with [0] subscript. The existing _upload_and_map_asset in _ValidationRapidUploader appears to return an (IAssetInput, dict) tuple (the dict is then passed to _translate_compare_truth), so [0] may be intentionally extracting the asset input.

The problem is that rapid.media_context is now list[str] while the method was designed for str | list[str] (the main asset). If the method internally handles a list by wrapping it in a MultiAssetInput, then [0] on the returned tuple is fine. But if the method iterates the list as multiple assets (as it might for compare rapids), this call may produce the wrong shape — or fail entirely.

Recommendation: explicitly verify that _upload_and_map_asset(list[str]) returns a tuple whose first element is the correct IAssetInput for the context, add a test, or add a dedicated _upload_and_map_media_context() as the PR description promises.


⚠️ Medium — Public API type signatures are a hard breaking change at the type level

Manager methods like create_classification_order changed from:

media_contexts: list[str] | None = None

to:

media_contexts: list[list[str]] | None = None

Runtime backward compatibility is preserved because coerce_media_context is applied per-element inside map_datapoints. However, existing callers that pass media_contexts=["img1.jpg", "img2.jpg"] will now get a pyright/mypy error even though the code works at runtime. This is inconsistent: the Datapoint model accepts str | list[str] and emits a deprecation warning, but the public manager APIs silently break type-checking for the same pattern.

Recommendation: Change manager signatures to list[list[str]] | list[str] | None (matching _DatapointsValidator.validate_datapoints) and add a runtime deprecation warning at the manager level when a flat list is passed, consistent with how Datapoint handles it.


⚠️ Medium — Audience method signatures changed without deprecation wrapper in type annotations

rapidata_audience.py::add_classification_example and add_compare_example changed media_context from str | None to list[str] | None. The coerce_media_context() call inside handles str at runtime, but the type annotation no longer accepts str, which is inconsistent with the deprecation approach used for Datapoint and Rapid.

Recommendation: Align these method signatures with Datapoint/Rapid — either annotate as str | list[str] | None (with the coerce_media_context call providing the deprecation) or at minimum note the breaking change in the PR.


🔍 Low — Misleading docstring in audience_example_handler.py::_upload_and_map_asset

"""A single string is sent as a plain ``ExistingAssetInput``; a list is
bundled into a ``MultiAssetInput``."""

Both branches call self._asset_mapper.create_existing_asset_input(...), not MultiAssetInput. The docstring is either wrong or create_existing_asset_input wraps multi-names internally — if the latter, the docstring should say so.


🔍 Low — PR description says "rejects mixed lists" but no such validation exists in the diff

_datapoints_validator.py description: "validates both shapes; rejects mixed lists". Looking at the diff, no code rejects ["img1.jpg", ["img2.jpg", "img3.jpg"]]. This either means the validation relies on coerce_media_context raising on non-str/list elements (it does raise on invalid types, but a mixed list[str | list[str]] would silently produce wrong results), or the mixed-list check was inadvertently left out.


🔍 Low — Uncertain invariant comment in _rapidata_dataset.py

# media_context is always a list (the Datapoint validator
# coerces). The shared upload+map helper bundles it into a
# MultiAssetInput (or ExistingAssetInput if a bare string slips
# through).

"if a bare string slips through" contradicts the stated invariant. If the Datapoint validator always coerces to list[str], a bare string can never slip through. The comment should either remove the hedge or explain under what circumstances a bare string could appear.


✅ What's done well

  • coerce_media_context() is a clean single source of truth; delegating to it from both Datapoint and Rapid field validators and audience public APIs is the right pattern.
  • The deprecation warning for bare str inputs is a good UX touch.
  • The assets.update() change in _asset_upload_orchestrator.py is correct — a list[str] media context requires iterating all URLs, not just adding the list object.
  • Documentation examples are clear and show both single- and multi-image usage.
  • The type annotation update to dict[str, tuple[str | None, list[str] | None]] in _rapidata_dataset.py::_create_dataset_groups is correct and consistent.

The same `_upload_and_map_asset` body was copy-pasted into
DatapointUploader, AudienceExampleHandler, and ValidationRapidUploader.
Move it onto AssetUploader as `upload_and_map_asset` (returns
IAssetInput) plus a `_with_mapping` variant that also returns the
original-path -> uploaded-name dict needed for compare-rapid truth
translation. All four call sites now share one implementation.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Co-Authored-By: lino <lino@rapidata.ai>
@claude

claude Bot commented Apr 28, 2026

Code Review — PR #580: Support multiple images per media_context

Overview

This PR widens media_context / media_contexts from str | None / list[str] | None to list[str] | None / list[list[str]] | None across the entire SDK. The central coercion helper coerce_media_context() is a clean single source of truth, and consolidating the duplicate _upload_and_map_asset methods into AssetUploader.upload_and_map_asset[_with_mapping] is a good refactor. Overall the approach is sound, but there are a few issues worth addressing before merging.


Potential Bugs / Breaking Behavior

1. Silent API-level breaking change for single-image callers

Before this PR, passing one image wrapped it in ExistingAssetInput; after, even a single-element list produces MultiAssetInput. If the backend treats these two wrapper types differently (e.g. different rendering, different validation), existing callers that pass a single image now get changed behavior silently — no error, no warning, just different runtime behavior.

Recommend either:

  • Keeping ExistingAssetInput when the list has exactly one element in upload_and_map_asset, or
  • Confirming the backend is agnostic to ExistingAssetInput vs. MultiAssetInput([single_item]).
# _asset_uploader.py — consider this in upload_and_map_asset_with_mapping
if isinstance(asset, list) and len(asset) == 1:
    # preserve ExistingAssetInput behavior for single-image case
    uploaded_name = self.upload_asset(asset[0])
    return AssetMapper.create_existing_asset_input(uploaded_name), {asset[0]: uploaded_name}

2. Deprecation warning spam for existing list[str] callers

The PR claims full backward compatibility, but media_contexts=["img1.jpg", "img2.jpg"] will now emit one logger.warning per item via coerce_media_context. A two-datapoint job fires two warnings; a 1000-datapoint job fires 1000 warnings. This is likely to flood production logs.

Consider emitting a single deprecation per call in DatapointsValidator.validate_datapoints (where the full list is visible) rather than per-element inside map_datapoints.
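A sketch of the call-level variant, using the stdlib logger as a stand-in for the SDK's own logger; `normalize_media_contexts` is illustrative, not the actual validator code:

```python
import logging

logger = logging.getLogger(__name__)


def normalize_media_contexts(media_contexts):
    """Wrap a legacy flat list[str], warning ONCE per call, not per item."""
    if media_contexts is None or not any(
        isinstance(mc, str) for mc in media_contexts
    ):
        return media_contexts
    logger.warning(
        "Passing a flat list[str] for media_contexts is deprecated; "
        "pass a list of lists of strings instead."
    )
    return [[mc] if isinstance(mc, str) else mc for mc in media_contexts]
```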

3. _rapidata_dataset.py comment hints at an unguarded path

# The shared upload+map helper bundles it into a
# MultiAssetInput (or ExistingAssetInput if a bare string slips through).

"If a bare string slips through" suggests the code may be reachable with a str that bypasses the Pydantic validator. If this path is truly unreachable, the comment should be removed (per project conventions). If it is reachable, the code needs an explicit guard.


Correctness

4. PR summary says "rejects mixed lists" — this is not enforced

DatapointsValidator.validate_datapoints accepts list[list[str]] | list[str] | None. A caller can pass ["img1.jpg", ["img2.jpg", "img3.jpg"]] — each element is independently coerced, so no error is raised. The claim in the PR description is misleading. Either add an explicit check or remove the claim.

# example of what should be caught if the claim is accurate:
if media_contexts and not all(isinstance(m, type(media_contexts[0])) for m in media_contexts):
    raise ValueError("media_contexts must be uniformly list[str] or list[list[str]], not mixed.")

Documentation

5. Old list[str] form is dropped from docs but still works (with warnings)

docs/job_definition_parameters.md now only shows the list[list[str]] form. Existing users with media_contexts=["original1.jpg", "original2.jpg"] will get surprise deprecation warnings and no migration guidance. At minimum, a migration note would help:

# Old form (deprecated, still accepted):
media_contexts=["original1.jpg", "original2.jpg"]

# New preferred form:
media_contexts=[["original1.jpg"], ["original2.jpg"]]

Code Style (CLAUDE.md conventions)

6. Comments explain what rather than why

Per project conventions, comments should only capture non-obvious why, not what. Several new comments describe implementation mechanics that are already clear from the code:

  • _asset_upload_orchestrator.py: # media_context is always a list of strings (the model coerces legacy single strings...) — remove
  • _datapoints_validator.py: The 4-line block comment before the list comprehension — remove
  • _rapidata_dataset.py: # media_context is always a list (the Datapoint validator coerces)... — remove

Minor / Nits

7. upload_and_map_asset_with_mapping uploads sequentially

asset_to_uploaded = {a: self.upload_asset(a) for a in asset}

For multi-image contexts this is sequential. The existing uploader already has a SingleFlightCache and threading infrastructure — parallel upload here would be consistent with how datapoint assets are handled. Not a blocker, but worth a follow-up.
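A hedged sketch of what a parallel variant could look like, using a plain `ThreadPoolExecutor` fan-out. `upload_asset` here is a stand-in for the SDK's cached uploader, and the worker count is an arbitrary choice, not an SDK default:

```python
from concurrent.futures import ThreadPoolExecutor

def upload_assets_parallel(uploader, assets, max_workers=8):
    """Upload each asset path concurrently; return a path -> uploaded-name map.

    uploader.upload_asset stands in for the SDK's cached upload method;
    only the fan-out pattern is the point of this sketch.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() preserves input order, so zip() pairs paths correctly
        uploaded = list(pool.map(uploader.upload_asset, assets))
    return dict(zip(assets, uploaded))
```

Because `upload_asset` is already cached and thread-safe infrastructure exists, this would be a small change rather than a redesign.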

8. No upper bound on inner list size

Nothing prevents media_context=["img1.jpg"] * 1000. A reasonable cap or at least a logged warning for large inner lists would prevent accidental misuse.
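For illustration, the guard could be as small as the following boundary check. Both the constant and the logging behavior are assumptions, not existing SDK behavior:

```python
import logging

logger = logging.getLogger(__name__)

MAX_CONTEXT_IMAGES = 20  # hypothetical cap, not an SDK constant

def check_context_size(media_context: list[str]) -> list[str]:
    """Warn (without failing) when an inner media_context list is suspiciously large."""
    if len(media_context) > MAX_CONTEXT_IMAGES:
        logger.warning(
            "media_context has %d images; more than %d is likely accidental",
            len(media_context),
            MAX_CONTEXT_IMAGES,
        )
    return media_context
```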

9. Optional[...] vs X | None inconsistency

rapidata_order_manager.py still uses Optional[list[list[str]]] for create_ranking_order while all other methods in the same file use list[list[str]] | None. Prefer the modern form for consistency.


Summary

| Category | Severity |
| --- | --- |
| Single-image MultiAssetInput behavior change | Medium — needs confirmation with backend team |
| Deprecation warning flood | Medium — will impact existing users |
| "Rejects mixed lists" claim not implemented | Low — misleading PR description |
| Comments violating CLAUDE.md conventions | Low — style issue |
| Missing migration docs | Low — UX improvement |

The core design is solid. The main asks before merging: (1) confirm MultiAssetInput vs ExistingAssetInput is safe for single-image case, and (2) fix the warning-per-item behavior for the backward-compat path.

The two-method API existed only because compare-rapid truth translation
needed a path -> uploaded-name dict. But `upload_asset` is always cached
(see its docstring: "Caching is always enabled... cannot be disabled"),
so the validation uploader can just call it again on `winnerId` /
`correctCombinations` entries — those calls are guaranteed cache hits,
not re-uploads.

That collapses AssetUploader back to a single `upload_and_map_asset`
returning `IAssetInput` directly, kills the
`upload_and_map_asset_with_mapping(asset)[0]` smell, and isolates the
compare-truth concern inside `_translate_compare_truth` where it
belongs.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Co-Authored-By: lino <lino@rapidata.ai>
@claude

claude Bot commented Apr 29, 2026

Code Review: feat(datapoints) — support multiple images per media_context

Overview

This PR widens media_context/media_contexts from str | None → list[str] | None (per-datapoint) and from list[str] | None → list[list[str]] | None (outer API), enabling annotators to see multiple reference images per task. The coerce_media_context() helper provides runtime backward compatibility for callers that still pass a bare string.

The overall approach is sound — centralising coercion in one helper, moving upload_and_map_asset() to AssetUploader, and eliminating duplicated upload patterns across four upload classes. Good DX refactor.


Issues

1. Breaking API change, not fully backward-compatible (type-level)

The PR summary says "fully backward-compatible: all existing list[str] callers continue to work without changes", but the public manager signatures were changed from list[str] | None to list[list[str]] | None (dropping the union). Any codebase using Pyright/mypy that passes media_contexts=["img1.jpg", "img2.jpg"] to an order/job/validation manager will now get a static type error.

The runtime is fine — coerce_media_context is called per-item inside Datapoint — but static callers are broken. Either:

  • Keep the union in public signatures: list[list[str]] | list[str] | None (as _datapoints_validator.py still does internally), or
  • Accept this is a breaking change and document it as such in the PR.

The inconsistency is visible right now: _datapoints_validator.validate_datapoints and map_datapoints still accept list[list[str]] | list[str] | None, but the callers above them (rapidata_order_manager, rapidata_job_manager, etc.) advertise only list[list[str]] | None.

2. Mixed-list claim vs. actual behaviour

The PR summary says "rejects mixed lists", but no explicit validation rejects [["a.jpg"], "b.jpg"]. At runtime each item goes through coerce_media_context, which silently wraps the bare string "b.jpg" into ["b.jpg"] (with a deprecation warning). A user who accidentally mixes shapes gets a warning but no error. If rejection is the intended contract, add an explicit check in validate_datapoints.

3. Stale/misleading comment in _rapidata_dataset.py

# media_context is always a list (the Datapoint validator
# coerces). The shared upload+map helper bundles it into a
# MultiAssetInput (or ExistingAssetInput if a bare string slips
# through).

The "or ExistingAssetInput if a bare string slips through" clause refers to a code path that can't happen once the validator always normalises to list[str]. Remove the parenthetical to avoid confusion.

4. _translate_compare_truth correctness concern

The old implementation built asset_to_uploaded with {a: upload(a) for a in asset} — a single upload pass, results reused for truth translation. The new implementation calls self.asset_uploader.upload_asset(original_winner_id) again inside the truth-translation loop. This relies on the cache always hitting for assets that were just uploaded in upload_rapid. That assumption holds today, but the method no longer makes this guarantee explicit in its contract.
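To illustrate why the second call is guaranteed to hit, here is a minimal sketch of an always-on upload cache in the spirit of the quoted docstring. This is a hypothetical class, not the SDK's actual SingleFlightCache:

```python
import threading

class CachedUploader:
    """Minimal always-on upload cache, assuming the contract quoted in
    the PR ("Caching is always enabled... cannot be disabled").
    """

    def __init__(self, do_upload):
        self._do_upload = do_upload  # the real network upload
        self._cache: dict[str, str] = {}
        self._lock = threading.Lock()

    def upload_asset(self, path: str) -> str:
        with self._lock:
            if path not in self._cache:
                self._cache[path] = self._do_upload(path)
            return self._cache[path]
```

Under this contract a repeated `upload_asset(winner_id)` is a dictionary lookup, not a re-upload; the review's point is that the method's docstring, not just its implementation, should state this.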

Additionally, the new asset_paths set is constructed from rapid.asset (the datapoints to label), but rapid.media_context isn't part of this set. If a compare_truth winner ID could ever reference a media context URL this would silently fail translation — confirm that's not possible.

5. rapidata_audience.py — double coercion

add_classification_example and add_compare_example now call coerce_media_context(media_context) before delegating to AudienceExampleHandler, which accepts list[str] | None directly. That coercion is necessary rather than redundant: the handler does not call coerce_media_context itself. The real concern is the opposite direction — a user calling rapidata_audience.add_classification_example(media_context="img.jpg") will hit a Pyright error (the signature says list[str]) even though the runtime coercion would handle it.

This is the same root issue as #1: runtime compat without type compat.

6. Documentation shows only list[list[str]] without a migration note

docs/job_definition_parameters.md now documents only the nested-list form and omits the old flat-string example. Users reading the docs have no indication that the old form still works at runtime. Add a brief backward-compat note (e.g. # list[str] also accepted but deprecated).


Minor observations

  • coerce_media_context(value: object) — the broad object annotation is correct for a Pydantic mode="before" validator; no change needed.
  • The upload_and_map_asset addition to AssetUploader is a clean consolidation; the logic was duplicated in three places before.
  • assets.update(datapoint.media_context) in _asset_upload_orchestrator.py is correct and minimal.

Summary

| Severity | Item |
| --- | --- |
| High | Public API signatures claim list[list[str]] only — breaks static type checking for existing callers |
| Medium | Mixed-list behaviour differs from documented "rejects mixed lists" claim |
| Low | Stale comment about "bare string slips through" in _rapidata_dataset.py |
| Low | Documentation missing deprecation/compat note for flat list[str] |
| Low | _translate_compare_truth reliance on cache hit for correctness is undocumented |

The core logic is well-structured. The main thing to resolve before merge is whether this is officially a breaking change (and documented as such) or whether the public type signatures should preserve the list[str] union for true backward compatibility.

@claude

claude Bot commented Apr 30, 2026

Code Review: feat(datapoints): support multiple images per media_context

Overview

This PR widens media_context / media_contexts across the entire SDK from a single string (or flat list of strings) to a list of strings (or list of lists of strings), allowing annotators to see multiple reference images per datapoint. The architecture is sensible — a single coerce_media_context helper centralizes coercion and is delegated to by Pydantic field_validators on both Datapoint and Rapid. The upload_and_map_asset method is a clean consolidation of duplicated upload+wrap logic.


Issues

1. Incomplete backward compatibility — breaking change at public API level

The PR claims full backward compat, but the public manager APIs now type media_contexts as list[list[str]] | None, dropping list[str]:

# rapidata_order_manager.py, rapidata_job_manager.py, etc.
media_contexts: list[list[str]] | None = None   # was: list[str] | None

The internal _datapoints_validator.py still accepts list[list[str]] | list[str] | None, so runtime behavior is preserved — but existing callers using media_contexts=["img1.jpg", "img2.jpg"] now get pyright / IDE type errors. That gap between the public annotation and the validator is confusing and will break strict-typed codebases upgrading to this version.

Suggestion: Either keep list[list[str]] | list[str] | None at the public API layer with a deprecation warning for the flat-list form, or explicitly document this as a breaking change in the changelog.
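One hedged sketch of the first option — normalizing the union once at the public boundary so internals only ever see list[list[str]]. The helper name and warning text are illustrative; nothing like this exists in the diff:

```python
import warnings

def normalize_media_contexts(media_contexts):
    """Accept list[list[str]] | list[str] | None at the public API and
    return list[list[str]] | None, warning on the deprecated flat form.
    """
    if media_contexts is None:
        return None
    if media_contexts and all(isinstance(m, str) for m in media_contexts):
        warnings.warn(
            "flat list[str] media_contexts is deprecated; wrap each "
            "entry in its own list",
            DeprecationWarning,
        )
        return [[m] for m in media_contexts]
    return media_contexts
```

This keeps static callers working and produces exactly one warning per call instead of one per element.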

2. "Slips through" comment undermines the invariant guarantee

In _rapidata_dataset.py:

# The shared upload+map helper bundles it into a
# MultiAssetInput (or ExistingAssetInput if a bare string slips through).

If coerce_media_context is always called before a Datapoint is constructed, a bare string should never slip through. The comment signals incomplete confidence in the invariant. Either remove it (trusting the validator) or add an assert isinstance(media_context, list) guard if the defensive concern is real.

3. Missing explicit mixed-list rejection

The PR summary says it "rejects mixed lists" but no such check exists in the diff. _datapoints_validator.py accepts list[list[str]] | list[str] and calls coerce_media_context per element, which silently coerces a flat list[str] to a list[list[str]]. A user who accidentally mixes — e.g. [["a.jpg"], "b.jpg"] — gets a confusing downstream error rather than a clear validation message. If rejection is intended, add it explicitly.

4. Audience API coercion gap

rapidata_audience.py calls coerce_media_context before delegating to AudienceExampleHandler. But AudienceExampleHandler itself now types media_context: list[str] | None with no coercion of its own. If AudienceExampleHandler is called directly (e.g. in tests or future internal callers) with a bare string, there is no safety net — the string reaches upload_and_map_asset which does handle str, but pyright would flag it as a type error and the behavior is accidental rather than intentional.

5. Docs show only the new form — no migration note

The old canonical example:

media_contexts=["original1.jpg", "original2.jpg"]

is replaced without any migration note. Users upgrading will hit a type error and have no guidance. A short callout like "Previously accepted a flat list of strings — wrap each entry in its own list: [["img.jpg"]] — or pass a multi-element list for multiple images." would help.


Minor

  • The uv.lock version jumps from 3.9.8 to 3.9.12 (4 releases). If those intermediate releases included unrelated changes that are now bundled here, it is worth calling out in the description.
  • The PR summary says the _datapoints_validator "validates both shapes; rejects mixed lists" but the actual diff only shows that it accepts both shapes. Aligning the description with the code would prevent confusion during future code archaeology.
  • No automated tests are included for the new multi-image behavior. The test plan relies on manual verification. Even a unit test for coerce_media_context edge cases (empty list, mixed types, None, str) would significantly increase confidence.
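A minimal pytest-style sketch of such edge-case tests, written against a stand-in implementation since the real helper lives in the SDK:

```python
import warnings

def coerce_media_context(value):
    """Stand-in for the SDK helper, mirroring the coercion behavior
    described in this review (exact SDK behavior may differ)."""
    if value is None:
        return None
    if isinstance(value, str):
        warnings.warn("bare media_context strings are deprecated", DeprecationWarning)
        return [value]
    return list(value)

def test_none_passthrough():
    assert coerce_media_context(None) is None

def test_bare_string_is_wrapped_with_warning():
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        assert coerce_media_context("img.jpg") == ["img.jpg"]
    assert any(issubclass(c.category, DeprecationWarning) for c in caught)

def test_list_passthrough():
    assert coerce_media_context(["a.jpg", "b.jpg"]) == ["a.jpg", "b.jpg"]

def test_empty_list():
    assert coerce_media_context([]) == []
```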

Verdict

The core design — coerce_media_context as single source of truth, upload_and_map_asset consolidating upload+wrap — is clean and the implementation is largely correct. The main concern worth resolving before merge is item 1 (the public API type signature incompatibility with the backward-compat claim) and item 5 (docs need a migration note). Items 2-4 are lower risk but worth a quick pass.

@LinoGiger LinoGiger merged commit a09abd6 into main Apr 30, 2026
3 checks passed
@LinoGiger LinoGiger deleted the feat(datapoints)/multi-image-media-context branch April 30, 2026 13:43