fix: steps_per_epoch should count forward passes, not optimizer steps by huaweil-nv · Pull Request #38 · NVIDIA/Ising-Decoding

huaweil-nv · 2026-04-02T07:49:29Z

Summary

steps_per_epoch was computed by dividing num_samples by batch_size * accumulate_steps * world_size, which yields optimizer update count, not forward pass count
train_epoch() runs one generate_batch() per loop iteration, so the loop count should be num_samples // (batch_size * world_size)
With the default accumulate_steps=2, every training epoch silently processed only 50% of the intended data

Reproduction

PYTHONPATH=code .venv/bin/python -u code/workflows/run.py \
  --config-name=config_local_test \
  workflow.task=train

Observe the log output:

Auto-scaled train.num_samples=8,388,608
[Epoch 0] Effective batch size: 1024 (512 × 2 × 1)
train_epoch: Starting 8192 batches...

Expected steps_per_epoch = 8,388,608 // (512 × 1) = 16384, but got 8,388,608 // (512 × 2 × 1) = 8192. Each step calls generate_batch() once, so only 8192 × 512 = 4,194,304 samples (50%) are seen per epoch.

After the fix:

train_epoch: Starting 16384 batches...

Root cause

train.py line 1297-1298:

effective_batch_size = per_device_batch_size * accumulate_steps * dist.world_size
steps_per_epoch = effective_num_samples // effective_batch_size

The gradient accumulation logic at line 482 (if (step+1) % accumulate_steps == 0) only controls when optimizer.step() runs — it does not generate additional batches. So accumulate_steps should not be in the steps_per_epoch divisor.

Fix

One-line change — remove accumulate_steps from the divisor:

steps_per_epoch = effective_num_samples // (per_device_batch_size * dist.world_size)

Test plan

Verified fix: Starting 16384 batches with default config (was 8192)
All existing tests pass (2 pre-existing failures unrelated to this change)
Added regression tests in test_training_utils.py

steps_per_epoch was computed by dividing num_samples by (batch_size * accumulate_steps * world_size), which yields the number of optimizer updates. However train_epoch() runs one generate_batch() per loop iteration, so the loop count must be num_samples divided by (batch_size * world_size) only. With the default accumulate_steps=2, every epoch was silently processing only 50% of the intended data. Signed-off-by: huaweil <huaweil@nvidia.com>

copy-pr-bot · 2026-04-02T07:49:33Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

bmhowe23 · 2026-04-02T13:57:17Z

@ivanbasov @kvmto can you please investigate and let me know if this is a problem inherited from the original repo or if it was introduced as part of our updates?

ivanbasov · 2026-04-02T15:55:23Z

@ivanbasov @kvmto can you please investigate and let me know if this is a problem inherited from the original repo or if it was introduced as part of our updates?

it seems it was introduced in the original repo on 10/30/2025 with commit aad4b94

Signed-off-by: Ivan Basov <ibasov@nvidia.com>

bmhowe23 · 2026-04-02T15:58:12Z

@ivanbasov @kvmto can you please investigate and let me know if this is a problem inherited from the original repo or if it was introduced as part of our updates?

it seems it was introduced in the original repo on 10/30/2025 with commit aad4b94

FYI @jolle-ag

ivanbasov · 2026-04-02T16:07:35Z

/bot ok to test

ivanbasov · 2026-04-02T16:09:40Z

/copy-pr-bot ok to test

ivanbasov · 2026-04-02T16:13:26Z

/ok to test

copy-pr-bot · 2026-04-02T16:13:30Z

/ok to test

@ivanbasov, there was an error processing your request: E1

See the following link for more information: https://docs.gha-runners.nvidia.com/cpr/e/1/

ivanbasov · 2026-04-02T16:14:35Z

/ok to test 97bda50

ivanbasov · 2026-04-02T16:38:53Z

Thanks, @huaweil-nv ! Great catch!

…#38) * fix: steps_per_epoch should count forward passes, not optimizer steps steps_per_epoch was computed by dividing num_samples by (batch_size * accumulate_steps * world_size), which yields the number of optimizer updates. However train_epoch() runs one generate_batch() per loop iteration, so the loop count must be num_samples divided by (batch_size * world_size) only. With the default accumulate_steps=2, every epoch was silently processing only 50% of the intended data. Signed-off-by: huaweil <huaweil@nvidia.com> * ci: trigger CI validation Signed-off-by: Ivan Basov <ibasov@nvidia.com> --------- Signed-off-by: huaweil <huaweil@nvidia.com> Signed-off-by: Ivan Basov <ibasov@nvidia.com> Co-authored-by: Ivan Basov <ibasov@nvidia.com>

huaweil-nv requested a review from bmhowe23 April 2, 2026 07:50

bmhowe23 requested review from ivanbasov and kvmto April 2, 2026 13:57

ci: trigger CI validation

97bda50

Signed-off-by: Ivan Basov <ibasov@nvidia.com>

ivanbasov approved these changes Apr 2, 2026

View reviewed changes

ivanbasov merged commit 31c60dd into main Apr 2, 2026
13 checks passed

ivanbasov deleted the fix/steps-per-epoch-undercount branch April 2, 2026 16:39

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: steps_per_epoch should count forward passes, not optimizer steps#38

fix: steps_per_epoch should count forward passes, not optimizer steps#38
ivanbasov merged 2 commits into
mainfrom
fix/steps-per-epoch-undercount

huaweil-nv commented Apr 2, 2026

Uh oh!

copy-pr-bot Bot commented Apr 2, 2026

Uh oh!

bmhowe23 commented Apr 2, 2026

Uh oh!

ivanbasov commented Apr 2, 2026

Uh oh!

bmhowe23 commented Apr 2, 2026

Uh oh!

ivanbasov commented Apr 2, 2026

Uh oh!

ivanbasov commented Apr 2, 2026

Uh oh!

ivanbasov commented Apr 2, 2026

Uh oh!

copy-pr-bot Bot commented Apr 2, 2026

Uh oh!

ivanbasov commented Apr 2, 2026

Uh oh!

ivanbasov commented Apr 2, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

huaweil-nv commented Apr 2, 2026

Summary

Reproduction

Root cause

Fix

Test plan

Uh oh!

copy-pr-bot Bot commented Apr 2, 2026

Uh oh!

bmhowe23 commented Apr 2, 2026

Uh oh!

ivanbasov commented Apr 2, 2026

Uh oh!

bmhowe23 commented Apr 2, 2026

Uh oh!

ivanbasov commented Apr 2, 2026

Uh oh!

ivanbasov commented Apr 2, 2026

Uh oh!

ivanbasov commented Apr 2, 2026

Uh oh!

copy-pr-bot Bot commented Apr 2, 2026

Uh oh!

ivanbasov commented Apr 2, 2026

Uh oh!

ivanbasov commented Apr 2, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants