Closed
Labels
bug (Something isn't working) · help wanted (Open to be worked on) · priority: 0 (High priority task)
Description
Environment
- PyTorch Lightning Version: 1.4.8 (This happens in all 1.4.x)
- PyTorch Version: 1.9.1
- Python version: 3.8
- OS: Linux
- CUDA/cuDNN version: 11.4
- GPU models and configuration: MNIST, ddp
- How you installed PyTorch: pip (when installing pytorch-lightning)
- Any other relevant information: everything worked up to 1.3.8
🐛 Bug
Hi, I had been using PL 1.3.x all along; when I updated to 1.4.x (I have tried 1.4.0 through 1.4.8) I started getting weird values for the validation loss/metric. Training uses 2 GPUs with ddp and 2 dataloaders for validation.
In validation_epoch_end I aggregate (average) the results of dataloader_idx_0 and dataloader_idx_1, but the values printed by self.log don't add up.
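For context, with two validation dataloaders Lightning passes validation_epoch_end a list of per-dataloader lists, so the aggregation has to average twice: once within each dataloader and once across them. A minimal sketch with plain floats standing in for the logged loss tensors:

```python
# With two val dataloaders, outputs arrive as [outputs_dl0, outputs_dl1],
# where each inner list holds one entry per validation batch.
val_outputs = [
    [0.2, 0.4],  # dataloader_idx_0: per-batch losses
    [0.6, 0.8],  # dataloader_idx_1: per-batch losses
]

# Average within each dataloader, then across dataloaders.
per_loader = [sum(losses) / len(losses) for losses in val_outputs]
tot_loss = sum(per_loader) / len(per_loader)
print(tot_loss)  # 0.5
```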
Aggregate method used:

```python
def aggregate_validation_metrics(self, val_outputs, loss_name):
    tot_loss: torch.FloatTensor = torch.tensor(0.0, device=self.device)
    # multi data loader: val_outputs is a list of per-dataloader lists
    if isinstance(val_outputs[0], list):
        for loss in val_outputs:
            tot_loss += sum(loss) / len(loss)
        tot_loss = tot_loss / len(val_outputs)
    # single data loader
    else:
        tot_loss += sum(val_outputs) / len(val_outputs)
    self.log(
        f"tot_{loss_name}",
        tot_loss,
        on_step=False,
        on_epoch=True,
        prog_bar=True,
        logger=True,
        sync_dist=True,
        rank_zero_only=True,
    )
```

and its results:
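The averaging arithmetic itself can be checked in isolation. The sketch below (names and plain-float inputs are mine, standing in for the torch tensors) replicates the function's two branches; since the local average comes out correct, any mismatch in the logged value would then point at the self.log reduction path (e.g. the sync_dist=True cross-rank reduction under ddp) rather than at this function:

```python
def aggregate(val_outputs):
    """Replicates the arithmetic of aggregate_validation_metrics
    without torch/Lightning, for both dataloader layouts."""
    tot_loss = 0.0
    if isinstance(val_outputs[0], list):  # multiple dataloaders
        for losses in val_outputs:
            tot_loss += sum(losses) / len(losses)
        tot_loss /= len(val_outputs)
    else:                                 # single dataloader
        tot_loss = sum(val_outputs) / len(val_outputs)
    return tot_loss

print(aggregate([[1.0, 3.0], [5.0, 7.0]]))  # 4.0 (mean of 2.0 and 6.0)
print(aggregate([1.0, 2.0, 3.0]))           # 2.0
```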
Expected behavior
Correct aggregated (averaged) values at validation_epoch_end
To Reproduce
I used an MNIST model and have attached the code.
Run: python simple_classifier.py