fix: correct gradient accumulation off-by-one and lr_scheduler over-stepping (#82)
Merged
anxiangsir merged 3 commits into main (Feb 10, 2026)
Conversation
…accumulation lr_scheduler total_iters was set to micro-step count (total_steps), but after moving lr_scheduler.step() to only fire on optimizer steps, the scheduler would only traverse 1/backward_passes_per_step of its budget. Divide total_iters by backward_passes_per_step so the full LR curve (warmup + polynomial decay) completes over the actual optimizer steps. No-op when backward_passes_per_step=1 (Stage-1).
Summary
Two correctness bugs in the training loop at `training/train.py` (lines 655-669) affect runs with `backward_passes_per_step > 1` (i.e. Stage-2 video training, where it is set to 4).

Bug 1: `lr_scheduler.step()` called on every micro-step instead of every optimizer step

Before (broken)
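The PR's code snippet is not preserved in this text. A minimal runnable simulation of the broken pattern described below; counter variables stand in for the real `opt.step()` / `lr_scheduler.step()` calls, and the loop skeleton is an assumption, not the actual `train.py` code:

```python
backward_passes_per_step = 4
num_micro_steps = 12  # hypothetical loop length

optimizer_steps = 0   # stands in for opt.step()
scheduler_steps = 0   # stands in for lr_scheduler.step()

for global_step in range(num_micro_steps):
    # backward pass happens on every micro-step (omitted)
    is_accumulation_step = global_step % backward_passes_per_step != 0  # Bug 2
    if not is_accumulation_step:
        optimizer_steps += 1
    scheduler_steps += 1  # Bug 1: fires every micro-step, not only on updates

print(optimizer_steps, scheduler_steps)  # → 3 12: the scheduler runs 4x ahead
```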
`lr_scheduler.step()` is called unconditionally on every loop iteration (every micro-step), but `opt.step()` is only called on actual optimizer-step iterations. When `backward_passes_per_step=4`, the scheduler advances 4x faster than the optimizer.

Concrete impact
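For intuition about the scheduler budget, here is a generic warmup + polynomial-decay curve of the kind described; the formula and parameter names are hypothetical, not the actual `PolynomialLRWarmup` implementation in `training/lr_scheduler.py`:

```python
def warmup_poly_lr(step, base_lr, warmup_iters, total_iters, power=2.0):
    """Generic linear warmup then polynomial decay; a sketch, not the repo's code."""
    if step < warmup_iters:
        return base_lr * (step + 1) / warmup_iters          # linear warmup
    span = max(1, total_iters - warmup_iters)
    remaining = max(0.0, 1.0 - (step - warmup_iters) / span)
    return base_lr * remaining ** power                      # polynomial decay

# Once `step` reaches `total_iters`, the curve sits at its terminal value
# (0.0 in this sketch). If the scheduler is stepped on every micro-step,
# it reaches that point after only 1/4 of the optimizer updates when
# backward_passes_per_step = 4.
lr_at_end = warmup_poly_lr(1000, base_lr=1e-3, warmup_iters=100, total_iters=1000)
```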
The `PolynomialLRWarmup` scheduler (see `training/lr_scheduler.py`) is initialized with `total_iters = total_steps`, where `total_steps = num_sampled_data / batch_size / world_size`. This `total_steps` counts micro-steps (loop iterations), not optimizer steps.

With `backward_passes_per_step=4`:

- the loop runs `total_steps` micro-step iterations;
- `opt.step()` fires `total_steps / 4` times;
- `lr_scheduler.step()` fires `total_steps` times, which matches the `total_iters` it was given, so the schedule shape is correct in terms of LR trajectory over the full training run.

However, the scheduler should still only step when the optimizer steps, because:

- the scheduler's `last_epoch` counter tracks optimizer updates, not micro-steps. Calling it on non-update steps means the LR changes between micro-batches within the same accumulation window: the 4 micro-batches that contribute to one optimizer step see 4 different learning rates, which is semantically wrong;
- if `total_steps` is meant to count optimizer steps (as the variable name suggests), the 4x over-stepping causes the scheduler to exhaust its budget in 1/4 of the training run, after which LR stays at the terminal value.

Fix
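The PR's diff is not reproduced in this text. A sketch of the realigned control flow, with the loop skeleton and names assumed from the description:

```python
backward_passes_per_step = 4
num_micro_steps = 12  # hypothetical loop length

optimizer_steps = 0   # stands in for opt.step()
scheduler_steps = 0   # stands in for lr_scheduler.step()

for global_step in range(num_micro_steps):
    is_accumulation_step = global_step % backward_passes_per_step != 0
    if is_accumulation_step:
        pass  # gradients just accumulate on this micro-step
    else:
        optimizer_steps += 1  # opt.step(); opt.zero_grad()
        scheduler_steps += 1  # lr_scheduler.step() moved here: one step per update

print(optimizer_steps == scheduler_steps)  # → True
```

This sketch intentionally keeps the original accumulation condition, since the off-by-one in that condition is a separate bug.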
Move `lr_scheduler.step()` inside the `else` branch, so it only fires when `opt.step()` fires.

Note: If the intent is to have the scheduler span the full micro-step count, the `total_steps` passed to the scheduler should be divided by `backward_passes_per_step`. But the simpler and more standard fix is to align scheduler stepping with optimizer stepping.

Bug 2: Off-by-one in accumulation step detection causes first optimizer step to use only 1 micro-batch
Before (broken)
When `global_step=0` (first iteration) and `backward_passes_per_step=4`: `0 % 4 == 0` → `is_accumulation_step = False`. This means the first optimizer update uses gradients from only 1 micro-batch instead of 4. All subsequent accumulation windows are also shifted by 1.
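The shifted windows can be seen by listing which micro-batches feed each optimizer update under the broken condition; a small illustration, not the repo's code:

```python
backward_passes_per_step = 4

# Group micro-steps into optimizer updates under the broken condition
# `global_step % backward_passes_per_step != 0` (an update fires when False).
windows, current = [], []
for global_step in range(8):
    current.append(global_step)
    if global_step % backward_passes_per_step == 0:  # broken: fires at 0, 4, ...
        windows.append(current)
        current = []

print(windows)  # → [[0], [1, 2, 3, 4]]: the first update sees only 1 micro-batch
```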
Concrete impact with `backward_passes_per_step=4`

| `global_step` | `step % 4 != 0` (`is_accumulation_step`) |
|---|---|
| 0 | `False` (0 % 4 = 0) |
| 1 | `True` (1 % 4 = 1) |
| 2 | `True` (2 % 4 = 2) |
| 3 | `True` (3 % 4 = 3) |
| 4 | `False` (4 % 4 = 0) |

Fix
Now the optimizer steps at `global_step = 3, 7, 11, ...`, i.e. after every 4 micro-batches.

Changes
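The diff itself is not preserved in this text. A runnable sketch of a condition that produces the stated update points; `(global_step + 1) % backward_passes_per_step` is an assumption consistent with the description, and the exact expression in the PR may differ:

```python
backward_passes_per_step = 4

# Hypothetical corrected condition: the optimizer fires on the last
# micro-step of each window, i.e. at global_step = 3, 7, 11, ...
update_steps = [
    g for g in range(12)
    if (g + 1) % backward_passes_per_step == 0  # not an accumulation step
]

print(update_steps)  # → [3, 7, 11]: every update now sees 4 micro-batches
```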
Risk Assessment
- Checkpoints store `global_step` and `lr_scheduler` state. Resuming from a checkpoint saved by the old code will shift the accumulation window by 1 step (negligible impact), and the scheduler will be correctly aligned going forward.
- Stage-1 training uses `backward_passes_per_step=1`, where both bugs are no-ops (`1 % 1 == 0` always, and scheduler steps every iteration = every optimizer step).
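The no-op claim for `backward_passes_per_step=1` can be checked directly; a small illustration using the accumulation condition from the description:

```python
# With backward_passes_per_step = 1, g % 1 == 0 for every g, so every
# micro-step is an optimizer step: 'scheduler steps every iteration' and
# 'scheduler steps only on optimizer steps' coincide, and the off-by-one
# condition never marks any iteration as an accumulation step.
bpps = 1
accumulation_flags = [g % bpps != 0 for g in range(10)]

print(all(flag is False for flag in accumulation_flags))  # → True
```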