Enable timing minibatch by michaelmckinsey1 · Pull Request #66 · LBANN/ScaFFold

michaelmckinsey1 · 2026-05-07T17:41:21Z

Time each minibatch to evaluate weak scaling: the number of samples processed per rank stays constant as the number of ranks increases, so per-minibatch time is the quantity to compare across runs. The current timers for epoch time and total time don't capture this.
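As a rough illustration of how per-minibatch times feed a weak-scaling comparison, the sketch below computes efficiency as the baseline minibatch time divided by the time at each larger rank count. The function name and the timing numbers are hypothetical and are not measurements from ScaFFold.

```python
def weak_scaling_efficiency(minibatch_time_s):
    """Map rank count -> efficiency relative to the smallest rank count.

    `minibatch_time_s` maps a rank count to the mean per-minibatch time in
    seconds. With a fixed per-rank workload, ideal weak scaling keeps that
    time flat, which corresponds to an efficiency of 1.0.
    """
    base_ranks = min(minibatch_time_s)
    base_time = minibatch_time_s[base_ranks]
    return {
        ranks: base_time / t for ranks, t in sorted(minibatch_time_s.items())
    }


# Illustrative numbers only (assumption), not results from this PR.
measured = {1: 0.50, 2: 0.52, 4: 0.55, 8: 0.625}
eff = weak_scaling_efficiency(measured)
for ranks, e in eff.items():
    print(f"{ranks} ranks: efficiency {e:.2f}")
```

Epoch time does not work as the input here because the number of minibatches per epoch can change with the rank count, while the per-rank minibatch workload does not.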

PatrickRMiles

I think this can go in as-is, but I would suggest one of two things:

- Measure the perf impact of the CUDA synchronize in a few scenarios, or
- Make the minibatch timing toggleable in the config file, so that if we find down the line that this synchronize is noticeably hampering performance, we can disable it temporarily while we work on a better solution.
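One way the second suggestion could look is sketched below. `MinibatchTimer` and the `config` dict are hypothetical; the key name mirrors the `time_minibatch` variable in the diff, and whether ScaFFold's config file exposes it under that name is an assumption. The synchronize call is passed in as a callable so the sketch runs without a GPU.

```python
import time
from typing import Callable, List, Optional


class MinibatchTimer:
    """Optional per-minibatch wall-clock timer (hypothetical helper)."""

    def __init__(self, enabled: bool, sync_fn: Optional[Callable[[], None]] = None):
        self.enabled = enabled
        # On a GPU run this could be: lambda: torch.cuda.synchronize(device)
        self.sync_fn = sync_fn
        self.times_s: List[float] = []
        self._start: Optional[float] = None

    def start(self) -> None:
        if self.enabled:
            self._start = time.perf_counter()

    def stop(self) -> None:
        if not self.enabled or self._start is None:
            return  # timing disabled: no sync, no clock read
        if self.sync_fn is not None:
            self.sync_fn()  # wait for queued device work before reading the clock
        self.times_s.append(time.perf_counter() - self._start)
        self._start = None


# Hypothetical config; timing defaults to off if the key is missing.
config = {"time_minibatch": True}
timer = MinibatchTimer(enabled=config.get("time_minibatch", False))

for _ in range(3):  # stand-in for the training loop
    timer.start()
    time.sleep(0.001)  # stand-in for forward, backward, and optimizer step
    timer.stop()

print(timer.times_s)
```

With the flag set to false, `stop()` returns before calling `sync_fn`, so the synchronize is skipped entirely rather than just ignored.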

PatrickRMiles · 2026-05-28T18:19:16Z

+                        if time_minibatch:
+                            # This sync has some potential performance impact
+                            # TODO: Would be better to measure this with Caliper, which uses CUDA events.
+                            torch.cuda.synchronize(self.device)
+                            minibatch_time_s = (
+                                time.perf_counter() - minibatch_start_time
+                            )


Have you measured the perf impacts of this sync?

michaelmckinsey1 · May 28, 2026

For the FOM, I'm changing the timer to CUDA events in #71, which do not need the synchronize, and I'm aggregating over all ranks there. The aggregation happens after the epoch, so it has no perf impact.

So #71 will supersede this and address these exact issues.

FWIW, the performance impact in this PR is negligible even with the sync.
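For reference, event-based timing in PyTorch can be sketched as follows. This is not the code from #71; `time_minibatches` and `step_fn` are hypothetical names. The events are recorded on the stream without blocking the host, but they still have to complete before `elapsed_time` can be read, so this sketch defers a single synchronize to the end of the loop. It falls back to `time.perf_counter` when PyTorch or CUDA is unavailable.

```python
import time

try:
    import torch
except ImportError:  # keep the sketch runnable without PyTorch
    torch = None


def time_minibatches(step_fn, num_minibatches):
    """Return per-minibatch times in seconds for repeated calls of step_fn."""
    use_cuda = torch is not None and torch.cuda.is_available()
    if not use_cuda:
        # CPU fallback: plain wall-clock timing, no events involved.
        times = []
        for _ in range(num_minibatches):
            t0 = time.perf_counter()
            step_fn()
            times.append(time.perf_counter() - t0)
        return times

    events = []
    for _ in range(num_minibatches):
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        start.record()  # enqueued on the current stream, host keeps going
        step_fn()
        end.record()
        events.append((start, end))

    # One synchronization after the loop instead of one per minibatch.
    torch.cuda.synchronize()
    # Event.elapsed_time reports milliseconds; convert to seconds.
    return [s.elapsed_time(e) / 1000.0 for s, e in events]


# Stand-in workload (assumption); a real step would run a training minibatch.
times = time_minibatches(lambda: sum(range(1000)), 4)
print(times)
```

Reading the timings once per epoch is what lets any cross-rank aggregation happen outside the minibatch loop.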

fix dtypes for torch

e583d85

michaelmckinsey1 self-assigned this May 7, 2026

Add per minibatch timer

3dfbd13

michaelmckinsey1 force-pushed the per-minibatch branch from 040c4f1 to 3dfbd13 Compare May 7, 2026 17:42

michaelmckinsey1 added 2 commits May 7, 2026 10:42

Merge remote-tracking branch 'origin/fix-dtypes' into per-minibatch

47a4812

cleanup

c9ef075

michaelmckinsey1 requested a review from PatrickRMiles May 7, 2026 21:53

Merge remote-tracking branch 'origin/main' into per-minibatch

ed246f5

michaelmckinsey1 mentioned this pull request May 27, 2026

Add FOM #71

Open

2 tasks

PatrickRMiles approved these changes May 28, 2026

michaelmckinsey1 merged commit 2e70a3a into LBANN:main May 28, 2026
1 check passed

michaelmckinsey1 merged 5 commits into LBANN:main from michaelmckinsey1:per-minibatch
