Skip to content

Support anyformat for input when uploading to HF#19

Merged
SilvioGiancola merged 3 commits into
mainfrom
dev-silvio
Apr 21, 2026
Merged

Support anyformat for input when uploading to HF#19
SilvioGiancola merged 3 commits into
mainfrom
dev-silvio

Conversation

@SilvioGiancola
Copy link
Copy Markdown
Collaborator

No description provided.

Copilot AI review requested due to automatic review settings April 21, 2026 16:13
@SilvioGiancola SilvioGiancola merged commit 03cff58 into main Apr 21, 2026
2 checks passed
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR broadens dataset conversion + Hugging Face upload support from video-only inputs to arbitrary inputs[] file types, and updates the GUI/upload flow accordingly.

Changes:

  • Update Parquet + WebDataset conversion to pack all input files (not just videos) and preserve full sample/header payloads for round-tripping.
  • Add Parquet upload mode improvements: configurable samples_per_shard and upload all generated artifacts in a single HF commit.
  • Adjust GUI behavior to avoid attempting video playback for non-video inputs; expand tests and docs to cover these behaviors.

Reviewed changes

Copilot reviewed 13 out of 13 changed files in this pull request and generated 5 comments.

Show a summary per file
File Description
tools/osl_json_to_parquet_webdataset.py CLI wording + warning key updated from video → input files.
tools/README.md Documentation updated to describe input-file packing; adds an additional example workflow.
tests/test_parquet_conversion_tools.py Adds round-trip test for mixed (non-video + video) inputs.
tests/test_hf_transfer_service.py Adds tests ensuring Parquet upload batches into a single HF commit and forwards samples_per_shard.
tests/gui/test_dataset_explorer_regressions.py Regression test ensuring selecting non-video inputs doesn’t trigger playback.
tests/gui/test_core_lifecycle.py Verifies new upload dialog persistence/enablement for samples_per_shard.
annotation_tool/ui/dialogs.py Adds samples_per_shard control and settings persistence; disables it for JSON upload mode.
annotation_tool/tools/osl_json_to_parquet.py Generalizes packing from video-only to any input file; writes header + sample_payload.
annotation_tool/tools/parquet_to_osl_json.py Improves reconstruction fallback logic and supports extracting arbitrary input files.
annotation_tool/controllers/hf_transfer_service.py Adds samples_per_shard to parquet upload; switches to single create_commit batch upload.
annotation_tool/controllers/hf_transfer_controller.py Threads samples_per_shard through to the parquet upload worker.
annotation_tool/controllers/media_controller.py Detects non-video inputs and avoids attempting playback / suppresses invalid-media dialogs for non-video selections.
annotation_tool/main_window.py Updates upload completion UI text to show “Input files packed”.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +5 to +6
from tools.osl_json_to_parquet import convert_json_to_parquet
from tools.parquet_to_osl_json import convert_parquet_to_json
Copy link

Copilot AI Apr 21, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These imports look incorrect for this repo layout: there is no top-level Python package/module named tools.osl_json_to_parquet or tools.parquet_to_osl_json (the tools/ directory only contains CLI wrappers). This test will fail at import time unless the paths are changed. Import the functions from annotation_tool.tools.osl_json_to_parquet and annotation_tool.tools.parquet_to_osl_json (or otherwise ensure tools/ is an importable package with those modules).

Suggested change
from tools.osl_json_to_parquet import convert_json_to_parquet
from tools.parquet_to_osl_json import convert_parquet_to_json
from annotation_tool.tools.osl_json_to_parquet import convert_json_to_parquet
from annotation_tool.tools.parquet_to_osl_json import convert_parquet_to_json

Copilot uses AI. Check for mistakes.
Comment on lines 1 to +4
"""
Convert an OpenSportsLib-style JSON annotation file into:
1) a Parquet table with flattened metadata
2) WebDataset TAR shards containing the referenced video files + sample metadata
2) WebDataset TAR shards containing the referenced input files + sample metadata
Copy link

Copilot AI Apr 21, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The module docstring still says the Parquet output contains “flattened metadata”, but the new implementation stores the full sample_payload JSON per row (and no longer promotes metadata/labels into separate columns). Please update the docstring wording to reflect the new schema so callers don’t assume they can filter via flattened columns.

Copilot uses AI. Check for mistakes.
Comment on lines 194 to 198
row = _flatten_sample_for_parquet(sample, global_idx, shard_pattern=f"{shard_prefix}-%06d.tar")
row["shard_name"] = shard_name
row["sample_index"] = global_idx
row["task"] = doc.get("task")
row["dataset_name"] = doc.get("dataset_name")
row["json_version"] = doc.get("version")
row["json_date"] = doc.get("date")
row["top_level_metadata"] = _json_dumps(doc.get("metadata", {}))
row["top_level_labels"] = _json_dumps(doc.get("labels", {}))
row["header"] = _json_dumps(header)
parquet_rows.append(row)
Copy link

Copilot AI Apr 21, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

row["header"] stores the same top-level JSON header for every sample row. For large datasets this will significantly bloat metadata.parquet and slow reads/writes. Consider storing the header once (e.g., a separate header.json file in output_dir, a single-row Parquet file, or Parquet key-value metadata) and have the reverse converter read it from there instead of duplicating it per row.

Copilot uses AI. Check for mistakes.
Comment on lines +273 to +277
rewritten.append(inp_copy)
payload_copy["inputs"] = rewritten
return _json_dumps(payload_copy)

df["sample_payload"] = df.apply(_resolved_payload_for_row, axis=1)
Copy link

Copilot AI Apr 21, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rewriting sample_payload via df.apply(..., axis=1) is row-wise Python execution and will be a major bottleneck for large datasets when keep_relative_paths_in_parquet=False. Consider doing this rewrite while building parquet_rows (before constructing the DataFrame), or iterating over rows once to build a new sample_payload column (avoiding apply).

Copilot uses AI. Check for mistakes.
Comment on lines 12 to 13
- metadata.parquet is used only for routing (sample_index, shard_name) and lightweight
filtering; it does not store a full copy of each sample.
Copy link

Copilot AI Apr 21, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The top-level notes say metadata.parquet “does not store a full copy of each sample”, but the forward converter now persists sample_payload (full sample JSON) per row. Please update this note to match the current format (or clarify that this statement only applies to legacy exports without sample_payload).

Suggested change
- metadata.parquet is used only for routing (sample_index, shard_name) and lightweight
filtering; it does not store a full copy of each sample.
- In current exports, ``metadata.parquet`` may also include ``sample_payload`` with the
full sample JSON. Legacy exports may contain only routing fields
(for example ``sample_index`` and ``shard_name``) plus lightweight filter columns.

Copilot uses AI. Check for mistakes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants