Adding script to parse subtask.json and add response field to parquet…#163
Conversation
shuheng-liu
left a comment
There was a problem hiding this comment.
Review — Add subtask response script
Overview
Adds src/opentau/scripts/add_subtask_response.py, which reads per-episode subtask JSONs, maps time-based boundaries to frame indices via dataset FPS, and writes a response column into each episode's parquet file. Also adds a docs section and an example config. Nice, focused feature.
Correctness
- Module docstring references a non-existent config.
add_subtask_response.pypoints users toconfigs/examples/pi05_subtask.json, but the file added in this PR isconfigs/examples/add_subtask_response.json. Please make them match. - Non-atomic parquet rewrite.
pq.write_table(table, parquet_path)overwrites the source file in place. If the process is killed mid-write the original parquet is destroyed. Since this script mutates user data, please write to a sibling temp file andos.replace()at the end. int(entry["time"] * fps)floors boundaries. Floating-point drift (e.g.2.5 * 30 = 74.99999…) can shift a boundary by one frame.round(...)is usually safer for boundary mapping.- Silently overwrites an existing
responsecolumn. A prior run or external pipeline would be replaced with no warning. Consider at least alogger.warningon overwrite.
Project conventions
- Logging setup is non-idiomatic.
logging.basicConfig(...)at module scope runs at import time and has project-wide side effects. The dominant convention (12/14 scripts insrc/opentau/scripts/) isfrom opentau.utils.utils import init_loggingcalled inside the main function — seecalculate_value.py,train.py,inference.py. Please switch. warnings.warnvslogger.warning. The script useswarnings.warnfor operational events (missing files, length mismatches). The codebase generally useslogger.warningfor that;warnings.warnis usually reserved for deprecations / API misuse. You already have alogger— prefer it.- Example config ships a personal path.
configs/examples/add_subtask_response.jsonhas"root": "/home/ashah/Documents/IceLemonade", which won't work for anyone else. Use a placeholder (e.g./path/to/local/dataset) matching the docs.
Code quality / style
_build_response_arrayinner loop is a Python row-by-row assignment. Slice assignment is cleaner and faster:responses[start_frame:end_frame] = [entry["subtask"]] * (end_frame - start_frame).subtasks: list[dict]is vague —list[dict[str, float | str]]communicates the shape.- The hardcoded response feature entry
{"dtype": "string", "shape": (1,), "names": None}is worth sanity-checking against how an existing string feature (e.g.task) is represented ininfo.json, so readers on the consumer side stay consistent.
Tests
No tests added. _build_response_array has non-trivial logic — sorting, boundary clamping, beyond-episode skip, last-subtask extension, empty subtasks, pre-first-subtask gap. tests/scripts/test_pi_mem_data_generator.py is a good template; a small parameterized test would catch regressions cheaply.
Risk
The script mutates datasets on disk without a dry-run or backup. Combined with the non-atomic write above, a bad config or malformed subtask JSON could silently corrupt a dataset. Consider a --dry-run and/or writing into a new parquet path that's only swapped once everything succeeds.
Suggested priorities
- Fix the wrong config filename in the module docstring.
- Replace module-level
logging.basicConfigwithinit_logging()inside the main function. - Remove the personal path from
configs/examples/add_subtask_response.json. - Make the parquet write atomic (temp file +
os.replace). - Add a focused unit test for
_build_response_array.
Generated by Claude Code
Co-authored-by: Claude <noreply@anthropic.com>
What this does
Adds script for adding subtask annotation to parquet files
How it was tested
added subtask to Icelemonde dataset using the above script
How to checkout & try? (for the reviewer)
python src/opentau/scripts/add_subtask_response.py
--config_path configs/examples/add_subtask_response.json
Checklist
Note: Before submitting this PR, please read the contributor guideline.