Support anyformat for input when uploading to HF#19
Conversation
There was a problem hiding this comment.
Pull request overview
This PR broadens dataset conversion + Hugging Face upload support from video-only inputs to arbitrary inputs[] file types, and updates the GUI/upload flow accordingly.
Changes:
- Update Parquet + WebDataset conversion to pack all input files (not just videos) and preserve full sample/header payloads for round-tripping.
- Add Parquet upload mode improvements: configurable
samples_per_shardand upload all generated artifacts in a single HF commit. - Adjust GUI behavior to avoid attempting video playback for non-video inputs; expand tests and docs to cover these behaviors.
Reviewed changes
Copilot reviewed 13 out of 13 changed files in this pull request and generated 5 comments.
Show a summary per file
| File | Description |
|---|---|
| tools/osl_json_to_parquet_webdataset.py | CLI wording + warning key updated from video → input files. |
| tools/README.md | Documentation updated to describe input-file packing; adds an additional example workflow. |
| tests/test_parquet_conversion_tools.py | Adds round-trip test for mixed (non-video + video) inputs. |
| tests/test_hf_transfer_service.py | Adds tests ensuring Parquet upload batches into a single HF commit and forwards samples_per_shard. |
| tests/gui/test_dataset_explorer_regressions.py | Regression test ensuring selecting non-video inputs doesn’t trigger playback. |
| tests/gui/test_core_lifecycle.py | Verifies new upload dialog persistence/enablement for samples_per_shard. |
| annotation_tool/ui/dialogs.py | Adds samples_per_shard control and settings persistence; disables it for JSON upload mode. |
| annotation_tool/tools/osl_json_to_parquet.py | Generalizes packing from video-only to any input file; writes header + sample_payload. |
| annotation_tool/tools/parquet_to_osl_json.py | Improves reconstruction fallback logic and supports extracting arbitrary input files. |
| annotation_tool/controllers/hf_transfer_service.py | Adds samples_per_shard to parquet upload; switches to single create_commit batch upload. |
| annotation_tool/controllers/hf_transfer_controller.py | Threads samples_per_shard through to the parquet upload worker. |
| annotation_tool/controllers/media_controller.py | Detects non-video inputs and avoids attempting playback / suppresses invalid-media dialogs for non-video selections. |
| annotation_tool/main_window.py | Updates upload completion UI text to show “Input files packed”. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| from tools.osl_json_to_parquet import convert_json_to_parquet | ||
| from tools.parquet_to_osl_json import convert_parquet_to_json |
There was a problem hiding this comment.
These imports look incorrect for this repo layout: there is no top-level Python package/module named tools.osl_json_to_parquet or tools.parquet_to_osl_json (the tools/ directory only contains CLI wrappers). This test will fail at import time unless the paths are changed. Import the functions from annotation_tool.tools.osl_json_to_parquet and annotation_tool.tools.parquet_to_osl_json (or otherwise ensure tools/ is an importable package with those modules).
| from tools.osl_json_to_parquet import convert_json_to_parquet | |
| from tools.parquet_to_osl_json import convert_parquet_to_json | |
| from annotation_tool.tools.osl_json_to_parquet import convert_json_to_parquet | |
| from annotation_tool.tools.parquet_to_osl_json import convert_parquet_to_json |
| """ | ||
| Convert an OpenSportsLib-style JSON annotation file into: | ||
| 1) a Parquet table with flattened metadata | ||
| 2) WebDataset TAR shards containing the referenced video files + sample metadata | ||
| 2) WebDataset TAR shards containing the referenced input files + sample metadata |
There was a problem hiding this comment.
The module docstring still says the Parquet output contains “flattened metadata”, but the new implementation stores the full sample_payload JSON per row (and no longer promotes metadata/labels into separate columns). Please update the docstring wording to reflect the new schema so callers don’t assume they can filter via flattened columns.
| row = _flatten_sample_for_parquet(sample, global_idx, shard_pattern=f"{shard_prefix}-%06d.tar") | ||
| row["shard_name"] = shard_name | ||
| row["sample_index"] = global_idx | ||
| row["task"] = doc.get("task") | ||
| row["dataset_name"] = doc.get("dataset_name") | ||
| row["json_version"] = doc.get("version") | ||
| row["json_date"] = doc.get("date") | ||
| row["top_level_metadata"] = _json_dumps(doc.get("metadata", {})) | ||
| row["top_level_labels"] = _json_dumps(doc.get("labels", {})) | ||
| row["header"] = _json_dumps(header) | ||
| parquet_rows.append(row) |
There was a problem hiding this comment.
row["header"] stores the same top-level JSON header for every sample row. For large datasets this will significantly bloat metadata.parquet and slow reads/writes. Consider storing the header once (e.g., a separate header.json file in output_dir, a single-row Parquet file, or Parquet key-value metadata) and have the reverse converter read it from there instead of duplicating it per row.
| rewritten.append(inp_copy) | ||
| payload_copy["inputs"] = rewritten | ||
| return _json_dumps(payload_copy) | ||
|
|
||
| df["sample_payload"] = df.apply(_resolved_payload_for_row, axis=1) |
There was a problem hiding this comment.
Rewriting sample_payload via df.apply(..., axis=1) is row-wise Python execution and will be a major bottleneck for large datasets when keep_relative_paths_in_parquet=False. Consider doing this rewrite while building parquet_rows (before constructing the DataFrame), or iterating over rows once to build a new sample_payload column (avoiding apply).
| - metadata.parquet is used only for routing (sample_index, shard_name) and lightweight | ||
| filtering; it does not store a full copy of each sample. |
There was a problem hiding this comment.
The top-level notes say metadata.parquet “does not store a full copy of each sample”, but the forward converter now persists sample_payload (full sample JSON) per row. Please update this note to match the current format (or clarify that this statement only applies to legacy exports without sample_payload).
| - metadata.parquet is used only for routing (sample_index, shard_name) and lightweight | |
| filtering; it does not store a full copy of each sample. | |
| - In current exports, ``metadata.parquet`` may also include ``sample_payload`` with the | |
| full sample JSON. Legacy exports may contain only routing fields | |
| (for example ``sample_index`` and ``shard_name``) plus lightweight filter columns. |
No description provided.