Getting strange validation loss/metric values when multiple data-loaders are used #9683

@raman-r-4978

Description

Environment

  • PyTorch Lightning Version: 1.4.8 (This happens in all 1.4.x)
  • PyTorch Version: 1.9.1
  • Python version: 3.8
  • OS: Linux
  • CUDA/cuDNN version: 11.4
  • GPU models and configuration: MNIST, ddp
  • How you installed PyTorch: pip when installing pytorch-lightning
  • Any other relevant information: Till 1.3.8 everything worked

🐛 Bug

Hi, I have been using PL 1.3.x all along. After updating to 1.4.x (I have tried 1.4.0 through 1.4.8), I started getting strange values for the validation loss/metric. Training uses 2 GPUs with ddp and 2 dataloaders for validation.

In validation_epoch_end I aggregate (average) the results of dataloader_idx_0 and dataloader_idx_1, but the values printed by self.log do not match that average.

Aggregation method used:

def aggregate_validation_metrics(self, val_outputs, loss_name):
    # Running total, created on this process's device.
    tot_loss: torch.FloatTensor = torch.tensor(0.0, device=self.device)
    # Multiple dataloaders: val_outputs is a list of per-dataloader lists.
    if isinstance(val_outputs[0], list):
        for losses in val_outputs:
            # Mean over the batches of one dataloader.
            tot_loss += sum(losses) / len(losses)
        # Mean of the per-dataloader means.
        tot_loss = tot_loss / len(val_outputs)
    # Single dataloader: val_outputs is a flat list of per-batch losses.
    else:
        tot_loss += sum(val_outputs) / len(val_outputs)

    self.log(
        f"tot_{loss_name}",
        tot_loss,
        on_step=False,
        on_epoch=True,
        prog_bar=True,
        logger=True,
        sync_dist=True,
        rank_zero_only=True,
    )

and its results:

[Image: PL 1.4.8 validation results]

Expected behavior

Correct aggregated (averaged) values at validation_epoch_end
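
As a sanity check, the expected value on a single process can be worked out by hand. The sketch below is a framework-free illustration of the same "mean of per-dataloader means" arithmetic as the method above. The helper names (mean, expected_total) and the sample numbers are hypothetical and are not taken from the attached code. It uses plain floats, and it does not model what sync_dist does across the two ddp processes.

```python
def mean(values):
    """Arithmetic mean of a non-empty sequence of numbers."""
    return sum(values) / len(values)


def expected_total(val_outputs):
    """Mirror the aggregation in the issue, without Lightning or torch.

    val_outputs is either a flat list of per-batch losses (one dataloader)
    or a list of such lists (several dataloaders).
    """
    if isinstance(val_outputs[0], list):
        # Mean over dataloaders of each dataloader's mean batch loss.
        return mean([mean(per_loader) for per_loader in val_outputs])
    return mean(val_outputs)


# Hypothetical per-batch losses: 2 batches on loader 0, 3 batches on loader 1.
two_loaders = [[0.25, 0.75], [1.0, 2.0, 3.0]]
single_loader = [0.5, 1.5]

print(expected_total(two_loaders))    # per-loader means are 0.5 and 2.0
print(expected_total(single_loader))
```

Note that when the dataloaders yield different numbers of batches, the mean of per-dataloader means generally differs from the mean over all batches pooled together, so the two should not be compared directly when checking logged values.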

To Reproduce

I have used MNIST model and have attached the code

  1. simple_classifier.py
  2. mnist_datamodule.py

Then run: python simple_classifier.py

Labels

bug, help wanted, priority: 0
