SAM Callback DDP and Multi Data Fixes. #187

melo-gonzo · 2024-04-18T22:30:35Z

Two bugs are addressed in this PR:

1. In DDP, the PyTorch Lightning trainer API is different, so the original trainer.model.optimizers lookup does not work properly. This PR switches the logic to use the task module to get the optimizer names instead, which is consistent across training pipelines.
2. In multi-data training, the datasets do not necessarily have the same number of samples. When the sample counts are imbalanced, the _compute_losses() function only returns losses for datasets that are still being processed, while optimizer_names still contains mappings for datasets with no samples left to process. This caused problems when gathering losses and computing the "global loss". This PR adds a few checks to extract_optimizer_specific_loss to fix these issues.
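For illustration, the kind of check described for the multi-data case could look roughly like the sketch below. This is not the code from this PR: the function name `extract_loss_for_optimizer`, the (dataset, subtask) layout of `optimizer_names`, and the nested loss dictionary are all assumptions made for the example, and plain floats stand in for loss tensors.

```python
from typing import Optional

# Hypothetical helper, NOT the actual extract_optimizer_specific_loss from matsciml.
# Assumes optimizer_names is a list of (dataset, subtask) pairs, one per optimizer,
# and losses maps dataset name -> {subtask name -> loss value}.
def extract_loss_for_optimizer(
    optimizer_index: int,
    optimizer_names: list[tuple[str, str]],
    losses: dict[str, dict[str, float]],
) -> Optional[float]:
    """Return the loss for one optimizer, or None if its dataset is absent."""
    dataset, subtask = optimizer_names[optimizer_index]
    # A dataset that has run out of samples contributes no entry to the losses,
    # so check for it before indexing instead of raising a KeyError.
    dataset_losses = losses.get(dataset)
    if dataset_losses is None:
        return None
    return dataset_losses.get(subtask)


# Two optimizers are registered, but only dataset_a still yields batches.
optimizer_names = [("dataset_a", "energy"), ("dataset_b", "force")]
losses = {"dataset_a": {"energy": 0.25}}

present = extract_loss_for_optimizer(0, optimizer_names, losses)
missing = extract_loss_for_optimizer(1, optimizer_names, losses)
```

In this sketch a caller would skip the optimizer step when the result is None, rather than fail while gathering losses for a dataset that is already exhausted.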


laserkelvin

Just one comment - merge after you've added

laserkelvin · 2024-04-18T23:46:15Z

matsciml/lightning/callbacks.py

@@ -779,11 +783,12 @@ def on_before_optimizer_step(
            org_weights = self._first_step(optimizer)
        with torch.enable_grad():
            loss = task._compute_losses(self.batch)
-            if len(trainer.optimizers) > 1:
-                loss = self.extract_optimizer_specific_loss(trainer, optimizer, loss)
+            if len(task.optimizers()) > 1:


Could you just throw a comment in here just to say this is the multitask case?

adding check to see if the optimizers associated dataset is present in the loss, adding fix for ddp where trainer class switches (using task module to get opt names now).

cc84b90

melo-gonzo requested a review from laserkelvin April 18, 2024 22:30

laserkelvin added the bug Something isn't working label Apr 18, 2024

laserkelvin approved these changes Apr 18, 2024


update sam callback with comment about number of optimizers in multitask

6b2718f

melo-gonzo merged commit 46c1737 into IntelLabs:main Apr 19, 2024
2 of 3 checks passed

melo-gonzo deleted the sam-multidata-and-ddp-fix branch April 19, 2024 15:54

laserkelvin mentioned this pull request Apr 19, 2024

Noisy node positions pretraining task #191

Merged
