Skip to content

Enable timing minibatch#66

Merged
michaelmckinsey1 merged 5 commits into
LBANN:mainfrom
michaelmckinsey1:per-minibatch
May 28, 2026
Merged

Enable timing minibatch#66
michaelmckinsey1 merged 5 commits into
LBANN:mainfrom
michaelmckinsey1:per-minibatch

Conversation

@michaelmckinsey1
Copy link
Copy Markdown
Collaborator

@michaelmckinsey1 michaelmckinsey1 commented May 7, 2026

Time minibatch to evaluate weak scaling, as the number of samples processed per rank is constant when increasing ranks. The current timers for epoch time and total time don't encapsulate this.

@michaelmckinsey1 michaelmckinsey1 self-assigned this May 7, 2026
@michaelmckinsey1 michaelmckinsey1 mentioned this pull request May 27, 2026
2 tasks
Copy link
Copy Markdown
Collaborator

@PatrickRMiles PatrickRMiles left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this can go in as-is, but I would suggest one of two things:

  1. Measure the perf impact of the cuda synchronize in a few scenarios, or
  2. Make the minibatch timing toggleable in the config file, so if we find down the line that this synchronize is noticeably hampering performance, we can disable it temporarily while we work on a better solution.

Comment thread ScaFFold/utils/trainer.py
Comment on lines +731 to +737
if time_minibatch:
# This sync has some potential performance impact
# TODO: Would be better to measure this with Caliper, which uses CUDA events.
torch.cuda.synchronize(self.device)
minibatch_time_s = (
time.perf_counter() - minibatch_start_time
)
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have you measured the perf impacts of this sync?

Copy link
Copy Markdown
Collaborator Author

@michaelmckinsey1 michaelmckinsey1 May 28, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the FOM, I'm changing the timer to CUDA events in #71, which do not need the synchronize, and I am aggregating over all ranks there. Aggregation happens after the epoch such that no perf impact.

So #71 will be superseding these exact issues.

FWIW the performance impact in this PR is negligible even with the sync.

@michaelmckinsey1 michaelmckinsey1 merged commit 2e70a3a into LBANN:main May 28, 2026
1 check passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants