Closed
Labels
- bug (Something isn't working)
- strategy: dp (removed in pl) (DataParallel)
Description
🐛 Bug
To Reproduce
I cloned the following tutorial to my machine to test Distributed Data-Parallel (DDP) and Data Parallel (DP).
The validation step returns preds, which a custom callback, LogPredictionsCallback(), uses to save images.
The results differ completely between DDP and DP. In DDP mode, the sanity check passes.
In DP mode, however, the "outputs" received by LogPredictionsCallback() is just an empty list, and the sanity check fails.
The error looks like this:
File "/home/user/Desktop/Code/ALEN/lightning_log.py", line 151, in on_validation_batch_end
for y_i, y_pred in list(zip(y[:n], outputs[:n]))]
IndexError: dimension specified as 0 but tensor has no dimensions
Since two GPUs are required, the bug cannot be reproduced in the Colab environment, which does not provide multiple free GPUs.
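To narrow down what the callback actually receives in DP mode, a small guard could be placed in front of the failing `zip(...)` expression. The following is only a sketch in plain Python: `describe` and `pair_first_n` are hypothetical helper names, not part of PyTorch Lightning, and the sketch assumes `outputs` is either a sequence or an object with a `shape` attribute (such as a tensor).

```python
from typing import Any, List, Sequence, Tuple


def describe(obj: Any) -> str:
    """Return a short type/shape summary of whatever the callback received."""
    shape = getattr(obj, "shape", None)
    if shape is not None:
        return f"{type(obj).__name__}(shape={tuple(shape)})"
    if isinstance(obj, (list, tuple)):
        return f"{type(obj).__name__}(len={len(obj)})"
    return type(obj).__name__


def pair_first_n(y: Sequence[Any], outputs: Any, n: int) -> List[Tuple[Any, Any]]:
    """Pair the first n targets with the first n predictions.

    Raises a ValueError with a readable summary when `outputs` cannot be
    sliced per sample (None, an empty list/tuple, or a 0-dim array-like).
    """
    is_empty = isinstance(outputs, (list, tuple)) and len(outputs) == 0
    shape = getattr(outputs, "shape", None)
    is_scalar = shape is not None and len(shape) == 0
    if outputs is None or is_empty or is_scalar:
        raise ValueError(f"unusable outputs: {describe(outputs)}")
    return list(zip(y[:n], outputs[:n]))
```

Inside `on_validation_batch_end`, the expression `list(zip(y[:n], outputs[:n]))` could be swapped for `pair_first_n(y, outputs, n)`. The resulting message would then state whether DP delivered an empty list or a 0-dim value, instead of failing with the IndexError above.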
Expected behavior
The preds returned from validation_step() should be collected and passed to the callback as "outputs", just as in DDP mode.
Environment
- CUDA:
    - GPU:
        - NVIDIA GeForce RTX 3090
        - NVIDIA GeForce RTX 3090
    - available: True
    - version: 11.2
- Packages:
    - numpy: 1.21.5
    - pyTorch_debug: False
    - pyTorch_version: 1.10.0
    - pytorch-lightning: 1.5.8
    - tqdm: 4.62.3
- System:
    - OS: Linux
    - architecture:
        - 64bit
        - ELF
    - processor: x86_64
    - python: 3.9.7
    - version: #202201071026-Ubuntu SMP Fri Jan 7 16:52:09 UTC 2022
Additional context