Skip to content

fix: steps_per_epoch should count forward passes, not optimizer steps#38

Merged
ivanbasov merged 2 commits into
mainfrom
fix/steps-per-epoch-undercount
Apr 2, 2026
Merged

fix: steps_per_epoch should count forward passes, not optimizer steps#38
ivanbasov merged 2 commits into
mainfrom
fix/steps-per-epoch-undercount

Conversation

@huaweil-nv
Copy link
Copy Markdown
Collaborator

Summary

  • steps_per_epoch was computed by dividing num_samples by batch_size * accumulate_steps * world_size, which yields optimizer update count, not forward pass count
  • train_epoch() runs one generate_batch() per loop iteration, so the loop count should be num_samples // (batch_size * world_size)
  • With the default accumulate_steps=2, every training epoch silently processed only 50% of the intended data

Reproduction

PYTHONPATH=code .venv/bin/python -u code/workflows/run.py \
  --config-name=config_local_test \
  workflow.task=train

Observe the log output:

Auto-scaled train.num_samples=8,388,608
[Epoch 0] Effective batch size: 1024 (512 × 2 × 1)
train_epoch: Starting 8192 batches...

Expected steps_per_epoch = 8,388,608 // (512 × 1) = 16384, but got 8,388,608 // (512 × 2 × 1) = 8192. Each step calls generate_batch() once, so only 8192 × 512 = 4,194,304 samples (50%) are seen per epoch.

After the fix:

train_epoch: Starting 16384 batches...

Root cause

train.py line 1297-1298:

effective_batch_size = per_device_batch_size * accumulate_steps * dist.world_size
steps_per_epoch = effective_num_samples // effective_batch_size

The gradient accumulation logic at line 482 (if (step+1) % accumulate_steps == 0) only controls when optimizer.step() runs — it does not generate additional batches. So accumulate_steps should not be in the steps_per_epoch divisor.

Fix

One-line change — remove accumulate_steps from the divisor:

steps_per_epoch = effective_num_samples // (per_device_batch_size * dist.world_size)

Test plan

  • Verified fix: Starting 16384 batches with default config (was 8192)
  • All existing tests pass (2 pre-existing failures unrelated to this change)
  • Added regression tests in test_training_utils.py

steps_per_epoch was computed by dividing num_samples by
(batch_size * accumulate_steps * world_size), which yields the number
of optimizer updates.  However train_epoch() runs one generate_batch()
per loop iteration, so the loop count must be num_samples divided by
(batch_size * world_size) only.  With the default accumulate_steps=2,
every epoch was silently processing only 50% of the intended data.

Signed-off-by: huaweil <huaweil@nvidia.com>
@copy-pr-bot
Copy link
Copy Markdown

copy-pr-bot Bot commented Apr 2, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@huaweil-nv huaweil-nv requested a review from bmhowe23 April 2, 2026 07:50
@bmhowe23
Copy link
Copy Markdown
Collaborator

bmhowe23 commented Apr 2, 2026

@ivanbasov @kvmto can you please investigate and let me know if this is a problem inherited from the original repo or if it was introduced as part of our updates?

@bmhowe23 bmhowe23 requested review from ivanbasov and kvmto April 2, 2026 13:57
@ivanbasov
Copy link
Copy Markdown
Member

@ivanbasov @kvmto can you please investigate and let me know if this is a problem inherited from the original repo or if it was introduced as part of our updates?

it seems it was introduced in the original repo on 10/30/2025 with commit aad4b94

Signed-off-by: Ivan Basov <ibasov@nvidia.com>
@bmhowe23
Copy link
Copy Markdown
Collaborator

bmhowe23 commented Apr 2, 2026

@ivanbasov @kvmto can you please investigate and let me know if this is a problem inherited from the original repo or if it was introduced as part of our updates?

it seems it was introduced in the original repo on 10/30/2025 with commit aad4b94

FYI @jolle-ag

@ivanbasov
Copy link
Copy Markdown
Member

/bot ok to test

@ivanbasov
Copy link
Copy Markdown
Member

/copy-pr-bot ok to test

@ivanbasov
Copy link
Copy Markdown
Member

/ok to test

@copy-pr-bot
Copy link
Copy Markdown

copy-pr-bot Bot commented Apr 2, 2026

/ok to test

@ivanbasov, there was an error processing your request: E1

See the following link for more information: https://docs.gha-runners.nvidia.com/cpr/e/1/

@ivanbasov
Copy link
Copy Markdown
Member

/ok to test 97bda50

@ivanbasov
Copy link
Copy Markdown
Member

Thanks, @huaweil-nv ! Great catch!

@ivanbasov ivanbasov merged commit 31c60dd into main Apr 2, 2026
13 checks passed
@ivanbasov ivanbasov deleted the fix/steps-per-epoch-undercount branch April 2, 2026 16:39
ivanbasov added a commit that referenced this pull request Apr 10, 2026
…#38)

* fix: steps_per_epoch should count forward passes, not optimizer steps

steps_per_epoch was computed by dividing num_samples by
(batch_size * accumulate_steps * world_size), which yields the number
of optimizer updates.  However train_epoch() runs one generate_batch()
per loop iteration, so the loop count must be num_samples divided by
(batch_size * world_size) only.  With the default accumulate_steps=2,
every epoch was silently processing only 50% of the intended data.

Signed-off-by: huaweil <huaweil@nvidia.com>

* ci: trigger CI validation

Signed-off-by: Ivan Basov <ibasov@nvidia.com>

---------

Signed-off-by: huaweil <huaweil@nvidia.com>
Signed-off-by: Ivan Basov <ibasov@nvidia.com>
Co-authored-by: Ivan Basov <ibasov@nvidia.com>
ivanbasov added a commit that referenced this pull request Apr 10, 2026
…#38)

* fix: steps_per_epoch should count forward passes, not optimizer steps

steps_per_epoch was computed by dividing num_samples by
(batch_size * accumulate_steps * world_size), which yields the number
of optimizer updates.  However train_epoch() runs one generate_batch()
per loop iteration, so the loop count must be num_samples divided by
(batch_size * world_size) only.  With the default accumulate_steps=2,
every epoch was silently processing only 50% of the intended data.

Signed-off-by: huaweil <huaweil@nvidia.com>

* ci: trigger CI validation

Signed-off-by: Ivan Basov <ibasov@nvidia.com>

---------

Signed-off-by: huaweil <huaweil@nvidia.com>
Signed-off-by: Ivan Basov <ibasov@nvidia.com>
Co-authored-by: Ivan Basov <ibasov@nvidia.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants