fix: Respect PipelineTrainer max_batch_size by vivekkalyan · Pull Request #622 · OpenPipe/ART

vivekkalyan · 2026-03-18T22:53:33Z

Goal

Make PipelineTrainer.max_batch_size actually do what it says.

Today ART exposes max_batch_size, but _collect_batch() keeps draining the queue after it has already reached the configured cap. That makes the knob misleading and causes the trainer to train on oversized batches.
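
The failure mode can be illustrated with a minimal, self-contained sketch. This is not ART's actual code: `drain_uncapped` and the string group names are hypothetical stand-ins. The only behavior taken from the description above is that the drain loop keeps calling `get_nowait()` until the queue is empty, without ever checking the cap.

```python
import asyncio


def drain_uncapped(queue: asyncio.Queue, batch: list, max_batch_size: int) -> list:
    """Hypothetical stand-in for the buggy drain loop.

    max_batch_size is accepted but never consulted, so the loop only
    stops when the queue is empty.
    """
    while True:
        try:
            batch.append(queue.get_nowait())
        except asyncio.QueueEmpty:
            return batch


# Three groups are already waiting when collection starts.
queue: asyncio.Queue = asyncio.Queue()
for group in ("g1", "g2", "g3"):
    queue.put_nowait(group)

oversized = drain_uncapped(queue, [], max_batch_size=2)
print(len(oversized))  # 3, even though the cap is 2
```

Whenever rollouts arrive faster than the trainer consumes them, a loop of this shape makes the batch size depend on queue depth rather than on the configured limit.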

Direction

Keep this PR very small:

prove the current behavior is wrong with a focused regression test
fix only _collect_batch()
do not change stale-drop behavior, zero-variance handling, or any broader scheduling policy

What Changed

added a regression test that queues three valid groups with max_batch_size=2 and proves collection should happen across two batches, not one oversized batch
changed _collect_batch() so the opportunistic get_nowait() drain loop stops once len(batch) == self.max_batch_size
verified that the next _collect_batch() call returns the remaining group and then sees the sentinel normally
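
The scenario in the regression test can be sketched as follows. This is an illustrative approximation under stated assumptions, not the code from this PR: `collect_batch`, the `SENTINEL` marker, and the `(batch, saw_sentinel)` return shape are hypothetical, and stale-drop and zero-variance handling are omitted entirely. The part that mirrors the fix is the drain loop condition, which stops as soon as the batch reaches `max_batch_size`.

```python
import asyncio

SENTINEL = None  # hypothetical end-of-stream marker


async def collect_batch(queue: asyncio.Queue, min_batch_size: int, max_batch_size: int):
    """Return (batch, saw_sentinel) for one collection pass."""
    batch = []

    # Phase 1: block until min_batch_size items have arrived (or the stream ends).
    while len(batch) < min_batch_size:
        item = await queue.get()
        if item is SENTINEL:
            return batch, True
        batch.append(item)

    # Phase 2: opportunistic drain, capped at max_batch_size.
    while len(batch) < max_batch_size:
        try:
            item = queue.get_nowait()
        except asyncio.QueueEmpty:
            break
        if item is SENTINEL:
            return batch, True
        batch.append(item)

    return batch, False


async def demo():
    # Three groups followed by the sentinel, collected with max_batch_size=2.
    queue: asyncio.Queue = asyncio.Queue()
    for item in ("g1", "g2", "g3", SENTINEL):
        queue.put_nowait(item)
    first = await collect_batch(queue, min_batch_size=1, max_batch_size=2)
    second = await collect_batch(queue, min_batch_size=1, max_batch_size=2)
    return first, second


first, second = asyncio.run(demo())
print(first)   # (['g1', 'g2'], False)
print(second)  # (['g3'], True)
```

The first call stops at two groups and leaves the third in the queue. The second call picks up the remaining group and then reaches the sentinel, which matches the two-batch behavior described above.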

Why This Shape

The bug is not in the initial blocking wait for min_batch_size. It is in the follow-up drain loop, which ignored max_batch_size entirely.

So the fix stays exactly there. No queue rewrite, no new scheduler knobs, no policy changes.

Testing

Sky CPU unit run

ssh art-pipeline-batching-tests 'cd ~/sky_workdir && ~/.local/bin/uv run pytest tests/unit/test_pipeline_trainer_batching.py tests/unit/test_pipeline_trainer_metrics.py -q'
result: 2 passed, 8 warnings in 6.33s

Notes

This PR is stacked on top of PR #621 (feat: Support PipelineTrainer with dedicated LocalBackend).
Follow-up scheduling work can build on this once batch collection semantics are correct.

vivekkalyan requested review from bradhilton and corbt March 18, 2026 22:55

vivekkalyan mentioned this pull request Mar 20, 2026

feat: Add KL support to PipelineTrainer backends #624

Open

vivekkalyan requested a review from angkywilliam March 20, 2026 22:52

vivekkalyan changed the base branch from feat/pipeline-localbackend to main March 21, 2026 00:59

vivekkalyan changed the base branch from main to feat/pipeline-localbackend March 21, 2026 01:05

vivekkalyan added 2 commits March 20, 2026 18:07

test: Add max batch size regression coverage

c656608

fix: Respect max batch size in PipelineTrainer

d7a9a06

vivekkalyan force-pushed the fix/pipeline-max-batch branch from e748071 to d7a9a06 Compare March 21, 2026 01:07

vivekkalyan changed the base branch from feat/pipeline-localbackend to main March 21, 2026 01:09

angkywilliam approved these changes Mar 21, 2026


vivekkalyan merged commit 621e82b into main Mar 21, 2026
5 checks passed

vivekkalyan merged 2 commits into main from fix/pipeline-max-batch
