on_validation_batch_end() returns empty list in 'dp' mode  #11539

@allanchan339

🐛 Bug

To Reproduce

I cloned the following tutorial to my machine to test Distributed Data Parallel (DDP) and Data Parallel (DP) training.

https://colab.research.google.com/github/wandb/examples/blob/master/colabs/pytorch-lightning/Optimize_Pytorch_Lightning_models_with_Weights_%26_Biases.ipynb

The validation step returns preds, which a custom callback, LogPredictionsCallback(), uses to save images.
The behavior differs completely between DDP and DP. In DDP mode, the sanity check passes.
In DP mode, however, the "outputs" argument received by LogPredictionsCallback() is just an empty list, and the sanity check fails.
The error looks like this:


File "/home/user/Desktop/Code/ALEN/lightning_log.py", line 151, in on_validation_batch_end
    for y_i, y_pred in list(zip(y[:n], outputs[:n]))]
IndexError: dimension specified as 0 but tensor has no dimensions

Because two GPUs are required, the bug cannot be reproduced in Colab, which does not provide multiple free GPUs.
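As a debugging aid, the pairing step from the failing line in the traceback can be wrapped in a guard so the callback skips logging instead of crashing when "outputs" is empty or cannot be sliced. This is only a minimal sketch: `safe_pairs` is a hypothetical helper name that is not part of PyTorch Lightning or the tutorial, and it does not fix the DP behavior itself.

```python
from typing import Any, List, Sequence, Tuple


def safe_pairs(targets: Sequence[Any], outputs: Any, n: int) -> List[Tuple[Any, Any]]:
    """Pair up to n targets with predictions; return [] if outputs is unusable.

    Hypothetical helper for illustration, not a PyTorch Lightning API.
    """
    if outputs is None:
        return []
    try:
        size = len(outputs)
    except TypeError:
        # Objects without a usable length (for example a plain scalar)
        # cannot be sliced, so there is nothing to pair.
        return []
    if size == 0:
        # This is the case reported for DP mode: an empty list.
        return []
    return list(zip(targets[:n], outputs[:n]))


# Plain lists stand in for the batch labels and predictions here.
print(safe_pairs([1, 2, 3], [10, 20, 30], 2))
print(safe_pairs([1, 2, 3], [], 2))
```

Inside on_validation_batch_end, the callback could call `safe_pairs(y, outputs, n)` and return early when the result is empty, which keeps the sanity check running while the root cause is investigated.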

Expected behavior

The preds returned from validation_step() should be collected and passed to the callback as "outputs", just as in DDP mode.

Environment

  • CUDA:
    - GPU:
      - NVIDIA GeForce RTX 3090
      - NVIDIA GeForce RTX 3090
    - available: True
    - version: 11.2
  • Packages:
    - numpy: 1.21.5
    - pyTorch_debug: False
    - pyTorch_version: 1.10.0
    - pytorch-lightning: 1.5.8
    - tqdm: 4.62.3
  • System:
    - OS: Linux
    - architecture:
      - 64bit
      - ELF
    - processor: x86_64
    - python: 3.9.7
    - version: #202201071026-Ubuntu SMP Fri Jan 7 16:52:09 UTC 2022

Additional context

cc @justusschock @awaelchli @akihironitta @rohitgr7
