[PyTorch/Segmentation/nnUNet] If multiple GPUs requested code will not run #1189
Labels: bug
Comments
vijaypshah changed the title from "[Model/Framework] What is the problem?" to "[PyTorch/Segmentation/nnUNet] If multiple GPUs requested code will not run" on Aug 14, 2022
Hi, I've run the command with 2 GPUs and it works fine for me. I've found that this might be a PyTorch Lightning (PTL) issue on some systems; please check Lightning-AI/pytorch-lightning#4612
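For reference, the linked Lightning thread suggests that hangs at "Initializing distributed" are often NCCL transport problems rather than bugs in the model code. Below is a minimal diagnostic sketch, assuming the hang is NCCL-related; which settings help varies by system, so treat each variable as something to try, not a confirmed fix:

```python
# Hedged sketch: standard NCCL environment variables for debugging DDP hangs.
# They must be set before distributed initialization (e.g. at the top of the script).
import os

os.environ["NCCL_DEBUG"] = "INFO"      # print NCCL init/transport logs to stdout
os.environ["NCCL_P2P_DISABLE"] = "1"   # disable GPU peer-to-peer (broken PCIe P2P is a common culprit)
os.environ["NCCL_IB_DISABLE"] = "1"    # skip the InfiniBand transport on single-node runs
```

Equivalently, these can be exported in the shell before invoking scripts/benchmark.py.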
Related to Model/Framework(s)
PyTorch/Segmentation/nnUNet
Describe the bug
I am trying to run the example provided with nnUNet. The code works fine when I use a single GPU; however, when I request 2 GPUs it does not run.
The following command works:
python scripts/benchmark.py --mode train --gpus 1 --dim 3 --batch_size 2 --amp
The following command gets stuck:
python scripts/benchmark.py --mode train --gpus 2 --dim 3 --batch_size 2 --amp
387 training, 97 validation, 484 test examples
Filters: [32, 64, 128, 256, 320, 320],
Kernels: [[3, 3, 3], [3, 3, 3], [3, 3, 3], [3, 3, 3], [3, 3, 3], [3, 3, 3]]
Strides: [[1, 1, 1], [2, 2, 2], [2, 2, 2], [2, 2, 2], [2, 2, 2], [2, 2, 2]]
/opt/conda/lib/python3.8/site-packages/torchmetrics/utilities/prints.py:36: UserWarning: Torchmetrics v0.9 introduced a new argument class property called `full_state_update` that has not been set for this class (Dice). The property determines if `update` by default needs access to the full metric state. If this is not the case, significant speedups can be achieved and we recommend setting this to `False`. We provide a checking function `from torchmetrics.utilities import check_forward_full_state_property` that can be used to check if `full_state_update=True` (old and potentially slower behaviour, default for now) or if `full_state_update=False` can be used safely.
warnings.warn(*args, **kwargs)
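(This torchmetrics warning is unrelated to the hang, but for completeness: the check it suggests can be run as below. A hedged sketch assuming torchmetrics >= 0.9; the random inputs are placeholders for real predictions and labels.)

```python
# Sketch of the check suggested by the warning, applied to the Dice metric.
import torch
from torchmetrics import Dice
from torchmetrics.utilities import check_forward_full_state_property

check_forward_full_state_property(
    Dice,
    init_args={},                            # default Dice configuration
    input_args={
        "preds": torch.randint(2, (100,)),   # dummy binary predictions
        "target": torch.randint(2, (100,)),  # dummy binary targets
    },
)  # reports whether full_state_update=False is safe for this metric
```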
Using 16bit native Automatic Mixed Precision (AMP)
Trainer already configured with model summary callbacks: [<class 'pytorch_lightning.callbacks.model_summary.ModelSummary'>]. Skipping setting a default `ModelSummary` callback.
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/configuration_validator.py:133: UserWarning: You defined a `validation_step` but have no `val_dataloader`. Skipping val loop.
rank_zero_warn("You defined a `validation_step` but have no `val_dataloader`. Skipping val loop.")
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/2
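The log stops at distributed initialization: rank 0 announces itself (MEMBER: 1/2) but the second rank never joins, which is exactly where the run hangs. To isolate whether plain torch.distributed/NCCL communication works on this machine independently of nnUNet, a minimal smoke test can help; the file name nccl_check.py and the launch line are illustrative, not part of the repo:

```python
# nccl_check.py -- hypothetical standalone diagnostic, not part of DeepLearningExamples.
# Launch with: torchrun --nproc_per_node=2 nccl_check.py
import os

import torch
import torch.distributed as dist


def main():
    dist.init_process_group(backend="nccl")     # torchrun provides MASTER_ADDR, RANK, etc.
    local_rank = int(os.environ["LOCAL_RANK"])  # set per process by torchrun
    torch.cuda.set_device(local_rank)
    t = torch.ones(1, device=f"cuda:{local_rank}")
    dist.all_reduce(t)                          # if NCCL is broken, this hangs like the benchmark
    print(f"rank {dist.get_rank()}: all_reduce -> {t.item()}")  # expect 2.0 on both ranks
    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

If this also stalls, the problem is in the NCCL/driver stack (or the container's view of it) rather than in the nnUNet code.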
To Reproduce
Steps to reproduce the behavior:
1. Clone the repository: git clone https://github.com/NVIDIA/DeepLearningExamples
2. cd DeepLearningExamples/PyTorch/Segmentation/nnUNet
3. Build the Docker image: docker build -t nnunet .
4. mkdir data results
5. Build the Singularity image: sudo singularity build nnunetMultiGPU.sif docker-daemon://nnunet:latest
6. Launch: singularity shell --nv -B ${PWD}/data:/data -B ${PWD}/results:/results -B ${PWD}:/workspace nnunetMultiGPU.sif
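Once inside the container shell, a generic PyTorch sanity check (not from the original report) confirms that both GPUs survive Singularity's --nv passthrough before blaming the training code:

```python
# Run in a Python shell inside the container; generic check, not part of the repo.
import torch

print(torch.cuda.is_available())  # expect True
print(torch.cuda.device_count())  # expect 2 for a 2-GPU run
```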
Expected behavior
Training starts on both GPUs, as in the provided example.
Environment
Please provide at least: