Adding script to parse subtask.json and add response field to parquet… by akshay18iitg · Pull Request #163 · TensorAuto/OpenTau

akshay18iitg · 2026-04-16T20:33:05Z

What this does

Adds a script that writes subtask annotations into parquet files.

How it was tested

Added subtask annotations to the IceLemonade dataset using the above script.

How to checkout & try? (for the reviewer)

python src/opentau/scripts/add_subtask_response.py --config_path configs/examples/add_subtask_response.json

Checklist

I have added Google-style docstrings to important functions and ensured function parameters are typed.
My PR includes policy-related changes.
- If the above is checked: I have run the GPU pytests (pytest -m "gpu") and regression tests.

Note: Before submitting this PR, please read the contributor guideline.

shuheng-liu

Review — Add subtask response script

Overview

Adds src/opentau/scripts/add_subtask_response.py, which reads per-episode subtask JSONs, maps time-based boundaries to frame indices via dataset FPS, and writes a response column into each episode's parquet file. Also adds a docs section and an example config. Nice, focused feature.

Correctness

Module docstring references a non-existent config. add_subtask_response.py points users to configs/examples/pi05_subtask.json, but the file added in this PR is configs/examples/add_subtask_response.json. Please make them match.
Non-atomic parquet rewrite. pq.write_table(table, parquet_path) overwrites the source file in place. If the process is killed mid-write the original parquet is destroyed. Since this script mutates user data, please write to a sibling temp file and os.replace() at the end.
int(entry["time"] * fps) truncates boundaries. Floating-point error can leave the product just below an integer (e.g. 0.29 * 100 evaluates to 28.999999999999996, which int() truncates to 28), shifting a boundary by one frame. round(...) is usually safer for boundary mapping.
Silently overwrites an existing response column. A prior run or external pipeline would be replaced with no warning. Consider at least a logger.warning on overwrite.
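
To make the truncation concern concrete, here is a minimal sketch comparing int() and round() when mapping a time to a frame index. The helper name time_to_frame is hypothetical and is not taken from the script under review; the sample values are chosen only because the double-precision product falls just below an integer.

```python
def time_to_frame(time_s: float, fps: float) -> int:
    """Map a time in seconds to the nearest frame index (hypothetical helper)."""
    return round(time_s * fps)


# 0.29 * 100 evaluates to 28.999999999999996 in IEEE-754 double precision,
# so truncation and rounding disagree by one frame.
truncated = int(0.29 * 100)
rounded = time_to_frame(0.29, 100)

print(truncated, rounded)
```

One caveat: Python's round() resolves exact .5 ties to the nearest even integer, so a boundary that falls exactly halfway between two frames may need an explicit policy.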

Project conventions

Logging setup is non-idiomatic. logging.basicConfig(...) at module scope runs at import time and has project-wide side effects. The dominant convention (12/14 scripts in src/opentau/scripts/) is from opentau.utils.utils import init_logging called inside the main function — see calculate_value.py, train.py, inference.py. Please switch.
warnings.warn vs logger.warning. The script uses warnings.warn for operational events (missing files, length mismatches). The codebase generally uses logger.warning for that; warnings.warn is usually reserved for deprecations / API misuse. You already have a logger — prefer it.
Example config ships a personal path. configs/examples/add_subtask_response.json has "root": "/home/ashah/Documents/IceLemonade", which won't work for anyone else. Use a placeholder (e.g. /path/to/local/dataset) matching the docs.
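
The logging points above can be sketched with the standard library alone. Here setup_logging is a hypothetical stand-in for the project's init_logging helper (whose signature is not shown in this review), and process_episode is an illustrative function, not the script's actual code. The idea is that importing the module configures nothing, and operational events go through the module logger.

```python
import logging
from pathlib import Path

# Creating a logger at module scope is fine: it attaches no handlers on import.
logger = logging.getLogger(__name__)


def setup_logging() -> None:
    """Hypothetical stand-in for the project's init_logging helper."""
    logging.basicConfig(level=logging.INFO, format="%(levelname)s %(message)s")


def process_episode(subtask_path: Path) -> bool:
    """Return False (and log a warning) when the subtask file is missing."""
    if not subtask_path.is_file():
        # Operational event: use the logger rather than warnings.warn.
        logger.warning("Subtask file missing, skipping: %s", subtask_path)
        return False
    return True


def main() -> None:
    # Logging is configured only when the script is actually run.
    setup_logging()
    process_episode(Path("/path/to/local/dataset/subtasks/episode_000000.json"))


if __name__ == "__main__":
    main()
```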

Code quality / style

_build_response_array inner loop is a Python row-by-row assignment. Slice assignment is cleaner and faster: responses[start_frame:end_frame] = [entry["subtask"]] * (end_frame - start_frame).
subtasks: list[dict] is vague — list[dict[str, float | str]] communicates the shape.
The hardcoded response feature entry {"dtype": "string", "shape": (1,), "names": None} is worth sanity-checking against how an existing string feature (e.g. task) is represented in info.json, so readers on the consumer side stay consistent.
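
As a rough illustration of the slice-assignment suggestion, the sketch below labels a half-open frame range in one step. fill_segment is a hypothetical helper and assumes responses is a plain Python list; the clamping is included because assigning a replacement of a different length to a list slice changes the list's length.

```python
def fill_segment(responses: list[str], start_frame: int, end_frame: int, subtask: str) -> None:
    """Label frames [start_frame, end_frame) with `subtask` via slice assignment."""
    # Clamp to the episode so the slice and its replacement have equal length.
    start_frame = max(0, start_frame)
    end_frame = min(len(responses), end_frame)
    if end_frame <= start_frame:
        return  # segment lies entirely outside the episode
    responses[start_frame:end_frame] = [subtask] * (end_frame - start_frame)


responses = [""] * 6
fill_segment(responses, 0, 2, "reach")
fill_segment(responses, 2, 10, "grasp")  # end is clamped to len(responses)
print(responses)
```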

Tests

No tests added. _build_response_array has non-trivial logic — sorting, boundary clamping, beyond-episode skip, last-subtask extension, empty subtasks, pre-first-subtask gap. tests/scripts/test_pi_mem_data_generator.py is a good template; a small parameterized test would catch regressions cheaply.

Risk

The script mutates datasets on disk without a dry-run or backup. Combined with the non-atomic write above, a bad config or malformed subtask JSON could silently corrupt a dataset. Consider a --dry-run and/or writing into a new parquet path that's only swapped once everything succeeds.
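
One way to sketch the temp-file-and-swap approach, using only the standard library. atomic_write and its write_fn callback are hypothetical names; for parquet the callback could be something like lambda p: pq.write_table(table, p), which is not imported here so that the example stays self-contained.

```python
import os
import tempfile
from pathlib import Path
from typing import Callable


def atomic_write(target: Path, write_fn: Callable[[Path], None]) -> None:
    """Write to a sibling temp file, then swap it over `target`."""
    target = Path(target)
    # Create the temp file next to the target so os.replace stays on one filesystem.
    fd, tmp_name = tempfile.mkstemp(
        dir=target.parent, prefix=target.name + ".", suffix=".tmp"
    )
    os.close(fd)
    tmp_path = Path(tmp_name)
    try:
        write_fn(tmp_path)
        # The original file is only replaced after the write has fully succeeded.
        os.replace(tmp_path, target)
    except BaseException:
        # On any failure, drop the temp file and leave the original untouched.
        tmp_path.unlink(missing_ok=True)
        raise
```

If write_fn raises, the original file is left as it was and the temp file is cleaned up. A dry-run mode could reuse the same helper and simply skip the os.replace step.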

Suggested priorities

Fix the wrong config filename in the module docstring.
Replace module-level logging.basicConfig with init_logging() inside the main function.
Remove the personal path from configs/examples/add_subtask_response.json.
Make the parquet write atomic (temp file + os.replace).
Add a focused unit test for _build_response_array.

Generated by Claude Code

Co-authored-by: Claude <noreply@anthropic.com>

Adding script to parse subtask.json and add response field to parquet…

ad66856

… field

akshay18iitg self-assigned this Apr 16, 2026

akshay18iitg requested a review from shuheng-liu April 16, 2026 20:33

Adding documentation on how to use add_subtask_response script

a2c3083

shuheng-liu reviewed Apr 18, 2026

shuheng-liu mentioned this pull request Apr 22, 2026

Address review feedback on add_subtask_response script #173

Merged

3 tasks

Address review feedback on add_subtask_response script (#173)

80403d3

Co-authored-by: Claude <noreply@anthropic.com>

shuheng-liu self-requested a review April 22, 2026 16:24

shuheng-liu approved these changes Apr 22, 2026

shuheng-liu merged commit cdfc9a8 into main Apr 22, 2026
5 checks passed

shuheng-liu deleted the feat/subtask_addition_script branch April 22, 2026 16:28

shuheng-liu mentioned this pull request Apr 25, 2026

Merge claude/verify-precision-issue-SjtJf into claude/wonderful-fermat-0831fd #186

Closed

5 tasks

shuheng-liu merged 3 commits into main from feat/subtask_addition_script
