Support anyformat for input when uploading to HF by SilvioGiancola · Pull Request #19 · OpenSportsLab/VideoAnnotationTool

SilvioGiancola · 2026-04-21T16:13:01Z

No description provided.

…ith commit operation

…yback on non-video files

Copilot

Pull request overview

This PR broadens dataset conversion + Hugging Face upload support from video-only inputs to arbitrary inputs[] file types, and updates the GUI/upload flow accordingly.

Changes:

Update Parquet + WebDataset conversion to pack all input files (not just videos) and preserve full sample/header payloads for round-tripping.
Add Parquet upload mode improvements: configurable samples_per_shard and upload all generated artifacts in a single HF commit.
Adjust GUI behavior to avoid attempting video playback for non-video inputs; expand tests and docs to cover these behaviors.

Reviewed changes

Copilot reviewed 13 out of 13 changed files in this pull request and generated 5 comments.

Show a summary per file

File	Description
tools/osl_json_to_parquet_webdataset.py	CLI wording + warning key updated from video → input files.
tools/README.md	Documentation updated to describe input-file packing; adds an additional example workflow.
tests/test_parquet_conversion_tools.py	Adds round-trip test for mixed (non-video + video) inputs.
tests/test_hf_transfer_service.py	Adds tests ensuring Parquet upload batches into a single HF commit and forwards `samples_per_shard`.
tests/gui/test_dataset_explorer_regressions.py	Regression test ensuring selecting non-video inputs doesn’t trigger playback.
tests/gui/test_core_lifecycle.py	Verifies new upload dialog persistence/enablement for `samples_per_shard`.
annotation_tool/ui/dialogs.py	Adds `samples_per_shard` control and settings persistence; disables it for JSON upload mode.
annotation_tool/tools/osl_json_to_parquet.py	Generalizes packing from video-only to any input file; writes `header` + `sample_payload`.
annotation_tool/tools/parquet_to_osl_json.py	Improves reconstruction fallback logic and supports extracting arbitrary input files.
annotation_tool/controllers/hf_transfer_service.py	Adds `samples_per_shard` to parquet upload; switches to single `create_commit` batch upload.
annotation_tool/controllers/hf_transfer_controller.py	Threads `samples_per_shard` through to the parquet upload worker.
annotation_tool/controllers/media_controller.py	Detects non-video inputs and avoids attempting playback / suppresses invalid-media dialogs for non-video selections.
annotation_tool/main_window.py	Updates upload completion UI text to show “Input files packed”.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot · 2026-04-21T16:19:08Z

+from tools.osl_json_to_parquet import convert_json_to_parquet
+from tools.parquet_to_osl_json import convert_parquet_to_json


These imports look incorrect for this repo layout: there is no top-level Python package/module named tools.osl_json_to_parquet or tools.parquet_to_osl_json (the tools/ directory only contains CLI wrappers). This test will fail at import time unless the paths are changed. Import the functions from annotation_tool.tools.osl_json_to_parquet and annotation_tool.tools.parquet_to_osl_json (or otherwise ensure tools/ is an importable package with those modules).

Suggested change

from tools.osl_json_to_parquet import convert_json_to_parquet

from tools.parquet_to_osl_json import convert_parquet_to_json

from annotation_tool.tools.osl_json_to_parquet import convert_json_to_parquet

from annotation_tool.tools.parquet_to_osl_json import convert_parquet_to_json

Copilot · 2026-04-21T16:19:08Z

 """
 Convert an OpenSportsLib-style JSON annotation file into:
  1) a Parquet table with flattened metadata
-  2) WebDataset TAR shards containing the referenced video files + sample metadata
+  2) WebDataset TAR shards containing the referenced input files + sample metadata


The module docstring still says the Parquet output contains “flattened metadata”, but the new implementation stores the full sample_payload JSON per row (and no longer promotes metadata/labels into separate columns). Please update the docstring wording to reflect the new schema so callers don’t assume they can filter via flattened columns.

Copilot · 2026-04-21T16:19:08Z

                row = _flatten_sample_for_parquet(sample, global_idx, shard_pattern=f"{shard_prefix}-%06d.tar")
                row["shard_name"] = shard_name
                row["sample_index"] = global_idx
-                row["task"] = doc.get("task")
-                row["dataset_name"] = doc.get("dataset_name")
-                row["json_version"] = doc.get("version")
-                row["json_date"] = doc.get("date")
-                row["top_level_metadata"] = _json_dumps(doc.get("metadata", {}))
-                row["top_level_labels"] = _json_dumps(doc.get("labels", {}))
+                row["header"] = _json_dumps(header)
                parquet_rows.append(row)


row["header"] stores the same top-level JSON header for every sample row. For large datasets this will significantly bloat metadata.parquet and slow reads/writes. Consider storing the header once (e.g., a separate header.json file in output_dir, a single-row Parquet file, or Parquet key-value metadata) and have the reverse converter read it from there instead of duplicating it per row.

Copilot · 2026-04-21T16:19:09Z

+                        rewritten.append(inp_copy)
+                    payload_copy["inputs"] = rewritten
+                    return _json_dumps(payload_copy)
+
+                df["sample_payload"] = df.apply(_resolved_payload_for_row, axis=1)


Rewriting sample_payload via df.apply(..., axis=1) is row-wise Python execution and will be a major bottleneck for large datasets when keep_relative_paths_in_parquet=False. Consider doing this rewrite while building parquet_rows (before constructing the DataFrame), or iterating over rows once to build a new sample_payload column (avoiding apply).

Copilot · 2026-04-21T16:19:09Z

 - metadata.parquet is used only for routing (sample_index, shard_name) and lightweight
  filtering; it does not store a full copy of each sample.


The top-level notes say metadata.parquet “does not store a full copy of each sample”, but the forward converter now persists sample_payload (full sample JSON) per row. Please update this note to match the current format (or clarify that this statement only applies to legacy exports without sample_payload).

Suggested change

- metadata.parquet is used only for routing (sample_index, shard_name) and lightweight

filtering; it does not store a full copy of each sample.

- In current exports, ``metadata.parquet`` may also include ``sample_payload`` with the

full sample JSON. Legacy exports may contain only routing fields

(for example ``sample_index`` and ``shard_name``) plus lightweight filter columns.

SilvioGiancola added 3 commits April 19, 2026 19:07

feat: Implement batched upload for parquet datasets in Hugging Face w…

b995cdb

…ith commit operation

update format metadata parquet to accomodate for other modality input

2fd06c3

feat: Enhance MediaController to validate video input and prevent pla…

9ed4431

…yback on non-video files

Copilot AI review requested due to automatic review settings April 21, 2026 16:13

SilvioGiancola merged commit 03cff58 into main Apr 21, 2026
2 checks passed

Copilot started reviewing on behalf of SilvioGiancola April 21, 2026 16:13 View session

Copilot AI reviewed Apr 21, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Support anyformat for input when uploading to HF#19

Support anyformat for input when uploading to HF#19
SilvioGiancola merged 3 commits into
mainfrom
dev-silvio

SilvioGiancola commented Apr 21, 2026

Uh oh!

Uh oh!

Copilot AI left a comment

Uh oh!

Copilot AI Apr 21, 2026

Uh oh!

Copilot AI Apr 21, 2026

Uh oh!

Copilot AI Apr 21, 2026

Uh oh!

Copilot AI Apr 21, 2026

Uh oh!

Copilot AI Apr 21, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

		from tools.osl_json_to_parquet import convert_json_to_parquet
		from tools.parquet_to_osl_json import convert_parquet_to_json

		- metadata.parquet is used only for routing (sample_index, shard_name) and lightweight
		filtering; it does not store a full copy of each sample.

-- metadata.parquet is used only for routing (sample_index, shard_name) and lightweight
-  filtering; it does not store a full copy of each sample.
+- In current exports, ``metadata.parquet`` may also include ``sample_payload`` with the
+  full sample JSON. Legacy exports may contain only routing fields
+  (for example ``sample_index`` and ``shard_name``) plus lightweight filter columns.

Conversation

SilvioGiancola commented Apr 21, 2026

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Copilot AI Apr 21, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 21, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 21, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 21, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 21, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants