Replace rampup batch size scheduler with custom step batch size schedules #4411

Merged
ko3n1g merged 2 commits into NVIDIA:main from deepakn94:deepakn94/step_batch_schedule_v2
Apr 21, 2026

@deepakn94 deepakn94 commented Apr 21, 2026

What does this PR do?

Re-lands #3779 (reverted in #4404) with backwards compatibility for the deprecated rampup_batch_size parameter.

Step batch size schedule

The new --step-batch-size-schedule argument accepts a string of THRESHOLD:BATCH_SIZE pairs that define arbitrary step-wise batch size changes during training. Thresholds support K/M/B/T suffixes and are interpreted as tokens when --seq-length is provided, otherwise as samples.

Example:

--step-batch-size-schedule "0:768 250B:1536 500B:3072 750B:6144"
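As a rough illustration of the format described above, the following sketch parses a schedule string and looks up the batch size for a given number of consumed samples. It is not the code from this PR: the function names (`parse_threshold`, `parse_schedule`, `batch_size_at`) are hypothetical, the token-to-sample conversion by integer division with `seq_length` is an assumption, and only whole-number thresholds are handled.

```python
import bisect

# Multipliers for the K/M/B/T suffixes mentioned in the PR description.
_SUFFIXES = {"K": 10**3, "M": 10**6, "B": 10**9, "T": 10**12}


def parse_threshold(text):
    """Turn '250B' into 250_000_000_000; plain integers pass through."""
    text = text.strip().upper()
    if text and text[-1] in _SUFFIXES:
        return int(text[:-1]) * _SUFFIXES[text[-1]]
    return int(text)


def parse_schedule(spec, seq_length=None):
    """Parse 'THRESHOLD:BATCH_SIZE ...' into sorted (threshold, batch_size) pairs.

    Assumption: when seq_length is given, thresholds are token counts and are
    converted to samples by integer division. Otherwise they are samples.
    """
    pairs = []
    for item in spec.split():
        threshold_text, batch_text = item.split(":")
        threshold = parse_threshold(threshold_text)
        if seq_length is not None:
            threshold //= seq_length  # tokens -> samples (assumed conversion)
        pairs.append((threshold, int(batch_text)))
    pairs.sort()
    return pairs


def batch_size_at(schedule, consumed_samples):
    """Return the batch size of the last threshold that has been reached."""
    thresholds = [threshold for threshold, _ in schedule]
    index = bisect.bisect_right(thresholds, consumed_samples) - 1
    if index < 0:
        raise ValueError("no schedule entry covers this sample count")
    return schedule[index][1]


schedule = parse_schedule("0:768 250B:1536 500B:3072 750B:6144", seq_length=4096)
print(batch_size_at(schedule, 0))  # 768
```

Keeping the schedule as a sorted list makes each lookup a binary search, which matters little for four entries but keeps the lookup simple to reason about.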

This is implemented via a new StepBatchsizeNumMicroBatchesCalculator class.
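For context on what such a calculator produces, here is a minimal sketch of the usual relation between global batch size and the number of micro-batches per step. It assumes the global batch size must be divisible by `micro_batch_size * data_parallel_size`; the helper name `num_microbatches` is hypothetical and the class in this PR may handle edge cases differently.

```python
def num_microbatches(global_batch_size, micro_batch_size, data_parallel_size):
    """Micro-batches each data-parallel rank runs per optimizer step."""
    samples_per_pass = micro_batch_size * data_parallel_size
    if global_batch_size % samples_per_pass != 0:
        raise ValueError(
            f"global batch size {global_batch_size} is not divisible by "
            f"micro_batch_size * data_parallel_size = {samples_per_pass}"
        )
    return global_batch_size // samples_per_pass


# With the example schedule above, each step change doubles the count.
for global_batch_size in (768, 1536, 3072, 6144):
    print(global_batch_size, num_microbatches(global_batch_size, 2, 64))
```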

Rampup batch size removal

The old RampupBatchsizeNumMicroBatchesCalculator and --rampup-batch-size argument are removed. Existing example configs are converted to equivalent step batch size schedules.

Backwards compatibility

The rampup_batch_size parameter is re-added to init_num_microbatches_calculator and reconfigure_num_microbatches_calculator as a deprecated optional kwarg (default None). If passed, a deprecation warning is logged on rank 0. The parameter is otherwise ignored. All internal callers are updated to use keyword arguments.
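The deprecation pattern described here can be sketched as follows. This is an illustration only: `init_calculator_sketch`, its keyword-only signature, its return value, and the logger name are hypothetical and do not reproduce the real `init_num_microbatches_calculator` signature.

```python
import logging

logger = logging.getLogger("step_schedule_sketch")


def init_calculator_sketch(
    *,
    rank,
    global_batch_size,
    micro_batch_size,
    data_parallel_size,
    rampup_batch_size=None,  # deprecated: accepted so old callers keep working
):
    """Accept the deprecated kwarg, warn once on rank 0, then ignore it."""
    if rampup_batch_size is not None and rank == 0:
        logger.warning(
            "rampup_batch_size is deprecated and ignored; "
            "use a step batch size schedule instead."
        )
    # The deprecated value plays no part in the returned settings.
    return {
        "global_batch_size": global_batch_size,
        "micro_batch_size": micro_batch_size,
        "data_parallel_size": data_parallel_size,
    }
```

Making the arguments keyword-only in the sketch mirrors the reason internal callers were switched to keyword arguments: a re-added optional parameter cannot silently shift the meaning of positional arguments.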

Files changed

  • megatron/core/num_microbatches_calculator.py: Added StepBatchsizeNumMicroBatchesCalculator, removed RampupBatchsizeNumMicroBatchesCalculator, added deprecated rampup_batch_size kwarg for backwards compatibility.
  • megatron/training/config/training_config.py: Replaced rampup_batch_size field with step_batch_size_schedule.
  • megatron/training/arguments.py: Removed rampup-related assertions and help text.
  • megatron/training/training.py: Updated update_train_iters to handle step schedule for sample-based training.
  • megatron/training/global_vars.py: Updated init_num_microbatches_calculator call to use keyword arguments.
  • examples/, tests/functional_tests/: Converted rampup configs to step batch size schedules.
  • tests/unit_tests/test_num_microbatches_calculator.py: Replaced rampup tests with step schedule tests, updated all calls to use keyword arguments.
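To illustrate why sample-based training needs schedule-aware iteration counting, the sketch below counts how many steps it takes to consume a target number of samples under a step schedule. It is not the logic in `update_train_iters`: the function name `count_train_iters` is hypothetical, the schedule is assumed to be sorted `(threshold_in_samples, batch_size)` pairs starting at threshold 0, and the final partial batch is assumed to count as a full iteration.

```python
def count_train_iters(schedule, train_samples):
    """Count steps needed to consume train_samples under a step schedule."""
    consumed = 0
    iterations = 0
    for index, (_, batch_size) in enumerate(schedule):
        # A segment ends at the next threshold or at the end of training.
        if index + 1 < len(schedule):
            segment_end = min(schedule[index + 1][0], train_samples)
        else:
            segment_end = train_samples
        if consumed >= segment_end:
            continue  # an earlier step already overshot this segment
        remaining = segment_end - consumed
        steps = -(-remaining // batch_size)  # ceiling division
        iterations += steps
        consumed += steps * batch_size
    return iterations


# 13 steps of 8 samples reach 104 (past the threshold at 100),
# then 6 steps of 16 samples reach 200.
print(count_train_iters([(0, 8), (100, 16)], 200))  # 19
```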

…ules (NVIDIA#3779)

Co-authored-by: Deepak Narayanan <dnarayanan@nvidia.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
deepakn94 requested review from a team as code owners April 21, 2026 20:31
svcnvidia-nemo-ci marked this pull request as draft April 21, 2026 20:31
@github-actions

This PR has been automatically converted to draft because all PRs must start as drafts.

When you are ready for review, click Ready for Review to begin the review process. This will:

  1. Add the oncall reviewer (optional reviewer)
  2. Add required review teams based on your changes

See the contribution guide for more details.

copy-pr-bot Bot commented Apr 21, 2026

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

deepakn94 changed the title from "Deepakn94/step batch schedule v2" to "Replace rampup batch size scheduler with custom step batch size schedules" Apr 21, 2026
deepakn94 marked this pull request as ready for review April 21, 2026 20:38
svcnvidia-nemo-ci requested a review from a team April 21, 2026 20:39
Re-add rampup_batch_size as a deprecated optional parameter to
init_num_microbatches_calculator and reconfigure_num_microbatches_calculator.
The parameter is ignored but logs a deprecation warning on rank 0.
Switch all internal callers to keyword arguments.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
deepakn94 force-pushed the deepakn94/step_batch_schedule_v2 branch from 43b2f0a to f916fe8 April 21, 2026 20:50
ko3n1g merged commit 532ad92 into NVIDIA:main Apr 21, 2026
29 of 31 checks passed
deepakn94 deleted the deepakn94/step_batch_schedule_v2 branch April 21, 2026 22:54