fix: steps_per_epoch should count forward passes, not optimizer steps #38
steps_per_epoch was computed by dividing num_samples by (batch_size * accumulate_steps * world_size), which yields the number of optimizer updates. However, train_epoch() runs one generate_batch() per loop iteration, so the loop count must be num_samples divided by (batch_size * world_size) only. With the default accumulate_steps=2, every epoch was silently processing only 50% of the intended data.
Signed-off-by: huaweil <huaweil@nvidia.com>
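The arithmetic behind that 50% figure can be sketched as follows. This is a minimal illustration, not the code in train.py: the two helper names are hypothetical, and the values (8,388,608 samples, batch size 512, world size 1, accumulate_steps 2) are the ones used in the reproduction in this PR.

```python
def optimizer_updates_per_epoch(num_samples, batch_size, world_size, accumulate_steps):
    """Old divisor (hypothetical helper): counts optimizer updates, not loop iterations."""
    return num_samples // (batch_size * accumulate_steps * world_size)


def forward_passes_per_epoch(num_samples, batch_size, world_size):
    """Fixed divisor (hypothetical helper): one loop iteration per generated batch."""
    return num_samples // (batch_size * world_size)


# Values taken from the reproduction in this PR.
num_samples, batch_size, world_size, accumulate_steps = 8_388_608, 512, 1, 2

old_steps = optimizer_updates_per_epoch(num_samples, batch_size, world_size, accumulate_steps)
new_steps = forward_passes_per_epoch(num_samples, batch_size, world_size)

# Fraction of the dataset consumed when the loop runs for each step count.
old_coverage = old_steps * batch_size * world_size / num_samples
new_coverage = new_steps * batch_size * world_size / num_samples

print(f"old: {old_steps} steps, {old_coverage:.0%} of the data")
print(f"new: {new_steps} steps, {new_coverage:.0%} of the data")
```

Because each loop iteration consumes exactly one batch, using the optimizer-update count as the loop bound shrinks the data seen per epoch by a factor of accumulate_steps.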
@ivanbasov @kvmto can you please investigate and let me know if this is a problem inherited from the original repo or if it was introduced as part of our updates?
It seems it was introduced in the original repo on 10/30/2025 with commit aad4b94.
Signed-off-by: Ivan Basov <ibasov@nvidia.com>
FYI @jolle-ag
/bot ok to test
/copy-pr-bot ok to test
/ok to test
@ivanbasov, there was an error processing your request: See the following link for more information: https://docs.gha-runners.nvidia.com/cpr/e/1/
/ok to test 97bda50
Thanks, @huaweil-nv! Great catch!
…#38)
* fix: steps_per_epoch should count forward passes, not optimizer steps
steps_per_epoch was computed by dividing num_samples by (batch_size * accumulate_steps * world_size), which yields the number of optimizer updates. However, train_epoch() runs one generate_batch() per loop iteration, so the loop count must be num_samples divided by (batch_size * world_size) only. With the default accumulate_steps=2, every epoch was silently processing only 50% of the intended data.
Signed-off-by: huaweil <huaweil@nvidia.com>
* ci: trigger CI validation
Signed-off-by: Ivan Basov <ibasov@nvidia.com>
---------
Signed-off-by: huaweil <huaweil@nvidia.com>
Signed-off-by: Ivan Basov <ibasov@nvidia.com>
Co-authored-by: Ivan Basov <ibasov@nvidia.com>
Summary
- steps_per_epoch was computed by dividing num_samples by batch_size * accumulate_steps * world_size, which yields the optimizer update count, not the forward pass count
- train_epoch() runs one generate_batch() per loop iteration, so the loop count should be num_samples // (batch_size * world_size)
- With the default accumulate_steps=2, every training epoch silently processed only 50% of the intended data
Reproduction
Observe the log output:
Expected steps_per_epoch = 8,388,608 // (512 × 1) = 16384, but got 8,388,608 // (512 × 2 × 1) = 8192. Each step calls generate_batch() once, so only 8192 × 512 = 4,194,304 samples (50%) are seen per epoch.
After the fix:
Root cause
train.py lines 1297-1298:
The gradient accumulation logic at line 482 (if (step+1) % accumulate_steps == 0) only controls when optimizer.step() runs; it does not generate additional batches. So accumulate_steps should not be in the steps_per_epoch divisor.
Fix
One-line change, removing accumulate_steps from the divisor:
Test plan
- Starting 16384 batches with default config (was 8192)
- test_training_utils.py
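A regression check for this behaviour could look roughly like the sketch below. It is an assumption-laden toy, not the contents of test_training_utils.py: run_epoch is a hypothetical stand-in for train_epoch() that only mirrors the structure described under Root cause (one batch per iteration, optimizer step gated by accumulate_steps), and the test name is illustrative.

```python
def run_epoch(steps_per_epoch, batch_size, accumulate_steps):
    """Toy stand-in for train_epoch() (hypothetical): counts samples and optimizer steps."""
    samples_seen = 0
    optimizer_steps = 0
    for step in range(steps_per_epoch):
        # Stands in for one generate_batch() call per loop iteration.
        samples_seen += batch_size
        # Same gating condition as the accumulation logic described in the PR.
        if (step + 1) % accumulate_steps == 0:
            # Stands in for optimizer.step().
            optimizer_steps += 1
    return samples_seen, optimizer_steps


def test_epoch_covers_all_samples():
    num_samples, batch_size, world_size, accumulate_steps = 8_388_608, 512, 1, 2
    # Fixed formula: accumulate_steps is not part of the divisor.
    steps = num_samples // (batch_size * world_size)
    samples_seen, optimizer_steps = run_epoch(steps, batch_size, accumulate_steps)
    assert samples_seen * world_size == num_samples
    assert optimizer_steps == steps // accumulate_steps


test_epoch_covers_all_samples()
```

Asserting on samples seen rather than on the step count alone keeps the check tied to the property that matters here: a full pass over the data per epoch, regardless of how often the optimizer updates.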