feat: add script to segment lerobot dataset #127

Merged
shuheng-liu merged 8 commits into main from feat/segment-lerobot-dataset-v21 on Mar 2, 2026

@shuheng-liu shuheng-liu commented Feb 27, 2026

What this does

Add functionality to segment one or more lerobot episodes into multiple episodes in a new dataset.

How it was tested

  1. Ran CPU tests and
python src/opentau/scripts/segment_lerobot_dataset.py ~/.cache/huggingface/lerobot/physical-intelligence/libero /tmp/libero_segment /tmp/segments.json
python src/opentau/scripts/segment_lerobot_dataset.py ~/.cache/huggingface/lerobot/lerobot/droid_100/ /tmp/droid_segment /tmp/segments.json

where /tmp/segments.json reads

{"0": [[5, 15], [10, 20]], "1": [[0, 30]]}
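
As a rough illustration of how a file in this shape could be read, here is a minimal sketch that parses the mapping of episode id to `[start, end]` pairs. The function name `parse_segments` is hypothetical and is not taken from the script. The check that `end` is greater than `start` is an assumption; the PR description does not say whether `end` is inclusive or exclusive. Overlapping ranges such as `[5, 15]` and `[10, 20]` are accepted, since the example above uses them.

```python
import json


def parse_segments(text: str) -> dict[int, list[tuple[int, int]]]:
    """Parse a JSON mapping of episode id -> list of [start, end] frame pairs.

    Hypothetical helper for illustration; not the script's actual parser.
    """
    raw = json.loads(text)
    if not isinstance(raw, dict):
        raise ValueError("top-level JSON value must be an object")

    segments: dict[int, list[tuple[int, int]]] = {}
    for episode_key, ranges in raw.items():
        episode_id = int(episode_key)  # JSON object keys are always strings
        parsed: list[tuple[int, int]] = []
        for pair in ranges:
            if len(pair) != 2:
                raise ValueError(f"episode {episode_id}: expected [start, end], got {pair!r}")
            start, end = int(pair[0]), int(pair[1])
            # Assumption: a segment must be non-empty and start at a valid frame.
            if start < 0 or end <= start:
                raise ValueError(f"episode {episode_id}: invalid range [{start}, {end}]")
            parsed.append((start, end))
        segments[episode_id] = parsed
    return segments


segments = parse_segments('{"0": [[5, 15], [10, 20]], "1": [[0, 30]]}')
print(segments)  # {0: [(5, 15), (10, 20)], 1: [(0, 30)]}
```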

How to checkout & try? (for the reviewer)

Run CPU tests and

python src/opentau/scripts/segment_lerobot_dataset.py ~/.cache/huggingface/lerobot/physical-intelligence/libero /tmp/libero_segment /tmp/segments.json
python src/opentau/scripts/segment_lerobot_dataset.py ~/.cache/huggingface/lerobot/lerobot/droid_100/ /tmp/droid_segment /tmp/segments.json

where /tmp/segments.json reads

{"0": [[5, 15], [10, 20]], "1": [[0, 30]]}

Checklist

  • I have added Google-style docstrings to important functions and ensured function parameters are typed.
  • My PR includes policy-related changes.
    • If the above is checked: I have run the GPU pytests (pytest -m "gpu") and regression tests.

Copilot AI review requested due to automatic review settings February 27, 2026 00:47
@shuheng-liu shuheng-liu self-assigned this Feb 27, 2026

Copilot AI left a comment

Pull request overview

This pull request adds a new script to segment a LeRobot dataset episode into multiple smaller episodes in a new output dataset. The implementation supports both v2.0 and v2.1 input formats and always outputs v2.1 format.

Changes:

  • Added src/opentau/scripts/segment_lerobot_dataset.py script to segment episodes by frame ranges
  • Added comprehensive test suite in tests/datasets/test_segment_lerobot_dataset.py with three test cases covering v2.1 input, v2.0->v2.1 conversion, and edge cases

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 5 comments.

| File | Description |
| --- | --- |
| src/opentau/scripts/segment_lerobot_dataset.py | Main segmentation script with CLI argument parsing, dataset validation, parquet data slicing, metadata creation, and video/task handling |
| tests/datasets/test_segment_lerobot_dataset.py | Test suite with three comprehensive tests validating segmentation logic, version conversion, and edge cases like overlapping/non-consecutive segments |
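
To make the "data slicing" step in the overview more concrete, here is a minimal sketch of cutting one frame range out of an episode and re-indexing it as a new episode. It works on plain Python dictionaries rather than parquet files, and it is not the script's implementation. The function name `slice_episode`, the half-open `[start, end)` convention, the `action` payload column, and the assumption that per-frame rows carry `episode_index`, `frame_index`, `index`, and `timestamp` fields are all illustrative assumptions.

```python
def slice_episode(rows, start, end, new_episode_index, first_global_index, fps):
    """Copy frames [start, end) of one episode into a new, re-indexed episode.

    Hypothetical helper: assumes `rows` is sorted so that rows[i] is frame i.
    """
    out = []
    for new_frame, row in enumerate(rows[start:end]):
        new_row = dict(row)  # copy so the source episode is left untouched
        new_row["episode_index"] = new_episode_index
        new_row["frame_index"] = new_frame  # frames restart at 0 in each new episode
        new_row["index"] = first_global_index + new_frame  # running index across the new dataset
        new_row["timestamp"] = new_frame / fps  # timestamps restart at 0 as well
        out.append(new_row)
    return out


FPS = 10
episode = [
    {"episode_index": 0, "frame_index": i, "index": i, "timestamp": i / FPS, "action": i}
    for i in range(40)
]

# Two overlapping segments of the same source episode become two new episodes.
first = slice_episode(episode, 5, 15, new_episode_index=0, first_global_index=0, fps=FPS)
second = slice_episode(episode, 10, 20, new_episode_index=1, first_global_index=len(first), fps=FPS)

print(len(first), len(second))  # 10 10
print(second[0]["frame_index"], second[0]["index"], second[0]["action"])  # 0 10 10
```

Because each segment is copied independently, overlapping or non-consecutive ranges need no special handling in this sketch; frames in the overlap simply appear in both new episodes.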

Comment thread src/opentau/scripts/segment_lerobot_dataset.py Outdated
Comment thread src/opentau/scripts/segment_lerobot_dataset.py Outdated
Comment thread src/opentau/scripts/segment_lerobot_dataset.py Outdated
Comment thread tests/datasets/test_segment_lerobot_dataset.py
Comment thread src/opentau/scripts/segment_lerobot_dataset.py Outdated

Copilot AI left a comment

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.


Comment thread src/opentau/scripts/segment_lerobot_dataset.py Outdated
Comment thread src/opentau/scripts/segment_lerobot_dataset.py

@WilliamYue37 WilliamYue37 left a comment

Code looks good to me, but I think it would be better to have the user pass a file listing the segment start:end ranges. For long recordings, adding an extra command-line argument for every segment gets tedious and overwhelming.
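
For illustration only, a file-based interface along these lines could be sketched with `argparse` using three positional arguments, matching the shape of the commands in the PR description (input dataset, output dataset, segments file). The argument names `input_dir`, `output_dir`, and `segments_file` are hypothetical and may not match the script.

```python
import argparse
from pathlib import Path


def build_parser() -> argparse.ArgumentParser:
    """Build a parser with three positionals (names are illustrative assumptions)."""
    parser = argparse.ArgumentParser(
        description="Segment LeRobot episodes into a new dataset."
    )
    parser.add_argument("input_dir", type=Path, help="root of the source dataset")
    parser.add_argument("output_dir", type=Path, help="where the new dataset is written")
    parser.add_argument(
        "segments_file",
        type=Path,
        help="JSON file mapping episode id to a list of [start, end] pairs",
    )
    return parser


# Parse an explicit argument list so the sketch runs without a real command line.
args = build_parser().parse_args(["/data/in", "/tmp/out", "/tmp/segments.json"])
print(args.segments_file.name)  # segments.json
```

Keeping the ranges in a file means the command line stays the same length no matter how many segments a long recording needs.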

@WilliamYue37 WilliamYue37 added the feature New feature or request label Feb 27, 2026

Copilot AI left a comment

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.


Comment thread src/opentau/scripts/segment_lerobot_dataset.py Outdated
Comment thread src/opentau/scripts/segment_lerobot_dataset.py
Comment thread src/opentau/scripts/segment_lerobot_dataset.py

@WilliamYue37 WilliamYue37 left a comment

Maybe we should have a segments.json example file in the examples folder.

@WilliamYue37 WilliamYue37 self-requested a review March 2, 2026 20:49
@shuheng-liu shuheng-liu merged commit a7881b5 into main Mar 2, 2026
5 checks passed
@shuheng-liu shuheng-liu deleted the feat/segment-lerobot-dataset-v21 branch March 2, 2026 21:06