Skip to content

fix: Respect PipelineTrainer max_batch_size#622

Merged
vivekkalyan merged 2 commits intomainfrom
fix/pipeline-max-batch
Mar 21, 2026
Merged

fix: Respect PipelineTrainer max_batch_size#622
vivekkalyan merged 2 commits intomainfrom
fix/pipeline-max-batch

Conversation

@vivekkalyan
Copy link
Collaborator

@vivekkalyan vivekkalyan commented Mar 18, 2026

Goal

Make PipelineTrainer.max_batch_size actually do what it says.

Today ART exposes max_batch_size, but _collect_batch() keeps draining the queue after it has already reached the configured cap. That makes the knob misleading and causes the trainer to train on oversized batches.

Direction

Keep this PR very small:

  • prove the current behavior is wrong with a focused regression test
  • fix only _collect_batch()
  • do not change stale-drop behavior, zero-variance handling, or any broader scheduling policy

What Changed

  • added a regression test that queues three valid groups with max_batch_size=2 and proves collection should happen across two batches, not one oversized batch
  • changed _collect_batch() so the opportunistic get_nowait() drain loop stops once len(batch) == self.max_batch_size
  • verified that the next _collect_batch() call returns the remaining group and then sees the sentinel normally

Why This Shape

The bug is not in the initial blocking wait for min_batch_size. It is in the follow-up drain loop, which ignored max_batch_size entirely.

So the fix stays exactly there. No queue rewrite, no new scheduler knobs, no policy changes.

Testing

Sky CPU unit run

  • ssh art-pipeline-batching-tests 'cd ~/sky_workdir && ~/.local/bin/uv run pytest tests/unit/test_pipeline_trainer_batching.py tests/unit/test_pipeline_trainer_metrics.py -q'
  • result: 2 passed, 8 warnings in 6.33s

Notes

@vivekkalyan vivekkalyan requested review from bradhilton and corbt March 18, 2026 22:55
@vivekkalyan vivekkalyan changed the base branch from feat/pipeline-localbackend to main March 21, 2026 00:59
@vivekkalyan vivekkalyan changed the base branch from main to feat/pipeline-localbackend March 21, 2026 01:05
@vivekkalyan vivekkalyan force-pushed the fix/pipeline-max-batch branch from e748071 to d7a9a06 Compare March 21, 2026 01:07
@vivekkalyan vivekkalyan changed the base branch from feat/pipeline-localbackend to main March 21, 2026 01:09
@vivekkalyan vivekkalyan merged commit 621e82b into main Mar 21, 2026
5 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants