Skip to content

Adding script to parse subtask.json and add response field to parquet…#163

Merged
shuheng-liu merged 3 commits into
mainfrom
feat/subtask_addition_script
Apr 22, 2026
Merged

Adding script to parse subtask.json and add response field to parquet…#163
shuheng-liu merged 3 commits into
mainfrom
feat/subtask_addition_script

Conversation

@akshay18iitg
Copy link
Copy Markdown
Collaborator

@akshay18iitg akshay18iitg commented Apr 16, 2026

What this does

Adds script for adding subtask annotation to parquet files

How it was tested

added subtask to Icelemonde dataset using the above script

How to checkout & try? (for the reviewer)

python src/opentau/scripts/add_subtask_response.py
--config_path configs/examples/add_subtask_response.json

Checklist

  • I have added Google-style docstrings to important functions and ensured function parameters are typed.
  • My PR includes policy-related changes.
    • If the above is checked: I have run the GPU pytests (pytest -m "gpu") and regression tests.

Note: Before submitting this PR, please read the contributor guideline.

@akshay18iitg akshay18iitg self-assigned this Apr 16, 2026
Copy link
Copy Markdown
Member

@shuheng-liu shuheng-liu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review — Add subtask response script

Overview

Adds src/opentau/scripts/add_subtask_response.py, which reads per-episode subtask JSONs, maps time-based boundaries to frame indices via dataset FPS, and writes a response column into each episode's parquet file. Also adds a docs section and an example config. Nice, focused feature.

Correctness

  • Module docstring references a non-existent config. add_subtask_response.py points users to configs/examples/pi05_subtask.json, but the file added in this PR is configs/examples/add_subtask_response.json. Please make them match.
  • Non-atomic parquet rewrite. pq.write_table(table, parquet_path) overwrites the source file in place. If the process is killed mid-write the original parquet is destroyed. Since this script mutates user data, please write to a sibling temp file and os.replace() at the end.
  • int(entry["time"] * fps) floors boundaries. Floating-point drift (e.g. 2.5 * 30 = 74.99999…) can shift a boundary by one frame. round(...) is usually safer for boundary mapping.
  • Silently overwrites an existing response column. A prior run or external pipeline would be replaced with no warning. Consider at least a logger.warning on overwrite.

Project conventions

  • Logging setup is non-idiomatic. logging.basicConfig(...) at module scope runs at import time and has project-wide side effects. The dominant convention (12/14 scripts in src/opentau/scripts/) is from opentau.utils.utils import init_logging called inside the main function — see calculate_value.py, train.py, inference.py. Please switch.
  • warnings.warn vs logger.warning. The script uses warnings.warn for operational events (missing files, length mismatches). The codebase generally uses logger.warning for that; warnings.warn is usually reserved for deprecations / API misuse. You already have a logger — prefer it.
  • Example config ships a personal path. configs/examples/add_subtask_response.json has "root": "/home/ashah/Documents/IceLemonade", which won't work for anyone else. Use a placeholder (e.g. /path/to/local/dataset) matching the docs.

Code quality / style

  • _build_response_array inner loop is a Python row-by-row assignment. Slice assignment is cleaner and faster: responses[start_frame:end_frame] = [entry["subtask"]] * (end_frame - start_frame).
  • subtasks: list[dict] is vague — list[dict[str, float | str]] communicates the shape.
  • The hardcoded response feature entry {"dtype": "string", "shape": (1,), "names": None} is worth sanity-checking against how an existing string feature (e.g. task) is represented in info.json, so readers on the consumer side stay consistent.

Tests

No tests added. _build_response_array has non-trivial logic — sorting, boundary clamping, beyond-episode skip, last-subtask extension, empty subtasks, pre-first-subtask gap. tests/scripts/test_pi_mem_data_generator.py is a good template; a small parameterized test would catch regressions cheaply.

Risk

The script mutates datasets on disk without a dry-run or backup. Combined with the non-atomic write above, a bad config or malformed subtask JSON could silently corrupt a dataset. Consider a --dry-run and/or writing into a new parquet path that's only swapped once everything succeeds.

Suggested priorities

  1. Fix the wrong config filename in the module docstring.
  2. Replace module-level logging.basicConfig with init_logging() inside the main function.
  3. Remove the personal path from configs/examples/add_subtask_response.json.
  4. Make the parquet write atomic (temp file + os.replace).
  5. Add a focused unit test for _build_response_array.

Generated by Claude Code

Co-authored-by: Claude <noreply@anthropic.com>
@shuheng-liu shuheng-liu self-requested a review April 22, 2026 16:24
@shuheng-liu shuheng-liu merged commit cdfc9a8 into main Apr 22, 2026
5 checks passed
@shuheng-liu shuheng-liu deleted the feat/subtask_addition_script branch April 22, 2026 16:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants