DDP training freezes immediately #17389

GeoffNN · 2023-04-15T00:00:41Z

Bug description

I'm trying to run a job with several GPUs. My script immediately gets stuck after outputting:

python /home/negroni/deeponet-fno/src/burgers/pytorch_deeponet.py --ngpus 3

Using backend: tensorflow.compat.v1

2023-04-14 16:56:35.997710: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX512F AVX512_VNNI FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-04-14 16:56:36.145661: I tensorflow/core/util/port.cc:104] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2023-04-14 16:56:36.609342: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory
2023-04-14 16:56:36.609396: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory
2023-04-14 16:56:36.609401: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
WARNING:tensorflow:From /home/negroni/miniconda3/envs/pde/lib/python3.9/site-packages/tensorflow/python/compat/v2_compat.py:107: disable_resource_variables (from tensorflow.python.ops.variable_scope) is deprecated and will be removed in a future version.
Instructions for updating:
non-resource variables are not supported in the long term
WARNING:tensorflow:From /home/negroni/miniconda3/envs/pde/lib/python3.9/site-packages/tensorflow/python/compat/v2_compat.py:107: disable_resource_variables (from tensorflow.python.ops.variable_scope) is deprecated and will be removed in a future version.
Instructions for updating:
non-resource variables are not supported in the long term
Enable just-in-time compilation with XLA.

WARNING:tensorflow:From /home/negroni/miniconda3/envs/pde/lib/python3.9/site-packages/deepxde/nn/initializers.py:118: The name tf.keras.initializers.he_normal is deprecated. Please use tf.compat.v1.keras.initializers.he_normal instead.

WARNING:tensorflow:From /home/negroni/miniconda3/envs/pde/lib/python3.9/site-packages/deepxde/nn/initializers.py:118: The name tf.keras.initializers.he_normal is deprecated. Please use tf.compat.v1.keras.initializers.he_normal instead.

/home/negroni/miniconda3/envs/pde/lib/python3.9/site-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: libtorch_cuda_cu.so: cannot open shared object file: No such file or directory
  warn(f"Failed to load image Python extension: {e}")

=============================
torch.cuda.is_available(): True
torch.cuda.get_device_name(0): NVIDIA RTX A6000
=============================

Namespace(batch=5, lr=0.001, lr_scheduler_step=2000, lr_scheduler_factor=0.9, ridge=0.0001, epochs=500, nsamples=500, nsamples_residual=250, Nbasis=75, ngpus=3, max_iterations=50, log_every_n_steps=1, viscosity=0.01) 

wandb: Currently logged in as: geoffnn. Use `wandb login --relogin` to force relogin
wandb: wandb version 0.14.2 is available!  To upgrade, please run:
wandb:  $ pip install wandb --upgrade
wandb: Tracking run with wandb version 0.13.10
wandb: Run data is saved locally in logs/wandb/run-20230414_165640-z1m33sa0
wandb: Run `wandb offline` to turn off syncing.
wandb: Syncing run azure-smoke-113
wandb: ⭐️ View project at https://wandb.ai/geoffnn/PDEs-Burgers
wandb: 🚀 View run at https://wandb.ai/geoffnn/PDEs-Burgers/runs/z1m33sa0
Loaded data
Loaded data
INFO:pytorch_lightning.utilities.rank_zero:GPU available: True (cuda), used: True
INFO:pytorch_lightning.utilities.rank_zero:TPU available: False, using: 0 TPU cores
INFO:pytorch_lightning.utilities.rank_zero:IPU available: False, using: 0 IPUs
INFO:pytorch_lightning.utilities.rank_zero:HPU available: False, using: 0 HPUs
INFO:lightning_fabric.utilities.distributed:Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/3
Using backend: pytorch

Using backend: pytorch

/home/negroni/miniconda3/envs/pde/lib/python3.9/site-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: libtorch_cuda_cu.so: cannot open shared object file: No such file or directory
  warn(f"Failed to load image Python extension: {e}")
/home/negroni/miniconda3/envs/pde/lib/python3.9/site-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: libtorch_cuda_cu.so: cannot open shared object file: No such file or directory
  warn(f"Failed to load image Python extension: {e}")

=============================
torch.cuda.is_available(): True
torch.cuda.get_device_name(0): NVIDIA RTX A6000
=============================

Namespace(batch=5, lr=0.001, lr_scheduler_step=2000, lr_scheduler_factor=0.9, ridge=0.0001, epochs=500, nsamples=500, nsamples_residual=250, Nbasis=75, ngpus=3, max_iterations=50, log_every_n_steps=1, viscosity=0.01) 

Loaded data
Loaded data

=============================
torch.cuda.is_available(): True
torch.cuda.get_device_name(0): NVIDIA RTX A6000
=============================

Namespace(batch=5, lr=0.001, lr_scheduler_step=2000, lr_scheduler_factor=0.9, ridge=0.0001, epochs=500, nsamples=500, nsamples_residual=250, Nbasis=75, ngpus=3, max_iterations=50, log_every_n_steps=1, viscosity=0.01) 

Initializing distributed: GLOBAL_RANK: 2, MEMBER: 3/3
Loaded data
Loaded data
Initializing distributed: GLOBAL_RANK: 1, MEMBER: 2/3
INFO:pytorch_lightning.utilities.rank_zero:----------------------------------------------------------------------------------------------------
distributed_backend=nccl
All distributed processes registered. Starting with 3 processes
----------------------------------------------------------------------------------------------------

INFO:pytorch_lightning.utilities.rank_zero:You are using a CUDA device ('NVIDIA RTX A6000') that has Tensor Cores. To properly utilize them, you should set `torch.set_float32_matmul_precision('medium' | 'high')` which will trade-off precision for performance. For more details, read https://pytorch.org/docs/stable/generated/torch.set_float32_matmul_precision.html#torch.set_float32_matmul_precision

What version are you seeing the problem on?

2.0+ and 1.9.x

How to reproduce the bug

I can't reduce to a small repro, but the code is here: https://github.com/GeoffNN/deeponet-fno/blob/main/src/burgers/pytorch_deeponet.py

Error messages and logs

# Error messages and logs here please

Environment

Current environment

CUDA:
- GPU:
- NVIDIA RTX A6000
- NVIDIA RTX A6000
- NVIDIA RTX A6000
- NVIDIA RTX A6000
- NVIDIA RTX A6000
- NVIDIA RTX A6000
- NVIDIA RTX A6000
- NVIDIA RTX A6000
- NVIDIA RTX A6000
- NVIDIA RTX A6000
- available: True
- version: 11.7
Lightning:
- lightning: 2.0.1
- lightning-cloud: 0.5.32
- lightning-utilities: 0.7.0
- pytorch-lightning: 1.9.3
- torch: 2.0.0
- torchaudio: 0.13.1
- torchmetrics: 0.11.1
- torchvision: 0.14.1
Packages:
- absl-py: 1.4.0
- aiohttp: 3.8.4
- aiosignal: 1.3.1
- altair: 4.2.2
- anyio: 3.6.2
- appdirs: 1.4.4
- arrow: 1.2.3
- asttokens: 2.2.1
- astunparse: 1.6.3
- async-timeout: 4.0.2
- attrs: 22.2.0
- backcall: 0.2.0
- backports.functools-lru-cache: 1.6.4
- beautifulsoup4: 4.12.0
- black: 23.3.0
- blessed: 1.20.0
- brotlipy: 0.7.0
- cachetools: 5.3.0
- certifi: 2022.12.7
- cffi: 1.15.1
- charset-normalizer: 2.0.4
- click: 8.1.3
- cmake: 3.26.1
- colorama: 0.4.6
- contourpy: 1.0.7
- croniter: 1.3.8
- cryptography: 38.0.4
- cycler: 0.11.0
- dateutils: 0.6.12
- debugpy: 1.5.1
- decorator: 5.1.1
- deepdiff: 6.3.0
- deepxde: 1.8.0
- dnspython: 2.3.0
- docker-pycreds: 0.4.0
- email-validator: 1.3.1
- entrypoints: 0.4
- exceptiongroup: 1.1.0
- executing: 1.2.0
- fastapi: 0.88.0
- filelock: 3.10.7
- flatbuffers: 23.1.21
- flit-core: 3.6.0
- fonttools: 4.38.0
- frozenlist: 1.3.3
- fsspec: 2023.1.0
- gast: 0.4.0
- gitdb: 4.0.10
- gitpython: 3.1.31
- google-auth: 2.16.1
- google-auth-oauthlib: 0.4.6
- google-pasta: 0.2.0
- gpustat: 1.0.0
- grpcio: 1.51.1
- h11: 0.14.0
- h5py: 3.8.0
- hcpdenn: 0.0.1
- httpcore: 0.16.3
- httptools: 0.5.0
- httpx: 0.23.3
- idna: 3.4
- importlib-metadata: 6.0.0
- importlib-resources: 5.12.0
- iniconfig: 2.0.0
- inquirer: 3.1.3
- ipykernel: 6.15.0
- ipython: 8.10.0
- itsdangerous: 2.1.2
- jax: 0.3.25
- jaxlib: 0.3.25+cuda11.cudnn82
- jedi: 0.18.2
- jinja2: 3.1.2
- joblib: 1.2.0
- jsonschema: 4.17.3
- jupyter-client: 7.0.6
- jupyter-core: 4.12.0
- keras: 2.11.0
- kiwisolver: 1.4.4
- libclang: 15.0.6.1
- lightning: 2.0.1
- lightning-cloud: 0.5.32
- lightning-utilities: 0.7.0
- lit: 16.0.0
- markdown: 3.4.1
- markdown-it-py: 2.2.0
- markupsafe: 2.1.2
- matplotlib: 3.7.0
- matplotlib-inline: 0.1.6
- mdurl: 0.1.2
- mkl-fft: 1.3.1
- mkl-random: 1.2.2
- mkl-service: 2.4.0
- ml-dtypes: 0.0.4
- mpmath: 1.3.0
- multidict: 6.0.4
- mypy-extensions: 1.0.0
- nest-asyncio: 1.5.6
- networkx: 3.0
- numpy: 1.23.5
- nvidia-cublas-cu11: 11.10.3.66
- nvidia-cuda-cupti-cu11: 11.7.101
- nvidia-cuda-nvrtc-cu11: 11.7.99
- nvidia-cuda-runtime-cu11: 11.7.99
- nvidia-cudnn-cu11: 8.5.0.96
- nvidia-cufft-cu11: 10.9.0.58
- nvidia-curand-cu11: 10.2.10.91
- nvidia-cusolver-cu11: 11.4.0.1
- nvidia-cusparse-cu11: 11.7.4.91
- nvidia-ml-py: 11.495.46
- nvidia-nccl-cu11: 2.14.3
- nvidia-nvtx-cu11: 11.7.91
- oauthlib: 3.2.2
- opt-einsum: 3.3.0
- ordered-set: 4.1.0
- orjson: 3.8.9
- packaging: 23.0
- pandas: 1.5.3
- parso: 0.8.3
- pathspec: 0.11.1
- pathtools: 0.1.2
- pexpect: 4.8.0
- pickleshare: 0.7.5
- pillow: 9.3.0
- pip: 22.3.1
- platformdirs: 3.2.0
- pluggy: 1.0.0
- pooch: 1.6.0
- prompt-toolkit: 3.0.36
- protobuf: 3.19.6
- psutil: 5.9.4
- ptyprocess: 0.7.0
- pure-eval: 0.2.2
- pyaml: 21.10.1
- pyasn1: 0.4.8
- pyasn1-modules: 0.2.8
- pybind11: 2.10.3
- pycparser: 2.21
- pydantic: 1.10.7
- pygments: 2.14.0
- pyjwt: 2.6.0
- pyopenssl: 22.0.0
- pyparsing: 3.0.9
- pyrsistent: 0.19.3
- pysocks: 1.7.1
- pytest: 7.2.1
- python-dateutil: 2.8.2
- python-dotenv: 1.0.0
- python-editor: 1.0.4
- python-multipart: 0.0.6
- pytorch-lightning: 1.9.3
- pytz: 2022.7.1
- pyyaml: 6.0
- pyzmq: 19.0.2
- readchar: 4.0.5
- requests: 2.28.1
- requests-oauthlib: 1.3.1
- rfc3986: 1.5.0
- rich: 13.3.3
- rsa: 4.9
- scienceplots: 2.0.1
- scikit-learn: 1.2.1
- scikit-optimize: 0.9.0
- scikit-sparse: 0.4.8
- scipy: 1.10.1
- seaborn: 0.12.2
- sentry-sdk: 1.16.0
- setproctitle: 1.3.2
- setuptools: 65.6.3
- six: 1.16.0
- sklearn: 0.0.post1
- smmap: 5.0.0
- sniffio: 1.3.0
- soupsieve: 2.4
- stack-data: 0.6.2
- starlette: 0.22.0
- starsessions: 1.3.0
- sympy: 1.11.1
- tensorboard: 2.11.2
- tensorboard-data-server: 0.6.1
- tensorboard-plugin-wit: 1.8.1
- tensorflow: 2.11.0
- tensorflow-addons: 0.19.0
- tensorflow-estimator: 2.11.0
- tensorflow-io-gcs-filesystem: 0.30.0
- termcolor: 2.2.0
- theseus-ai: 0.1.4
- threadpoolctl: 3.1.0
- tomli: 2.0.1
- toolz: 0.12.0
- torch: 2.0.0
- torchaudio: 0.13.1
- torchmetrics: 0.11.1
- torchvision: 0.14.1
- tornado: 6.2
- tqdm: 4.64.1
- traitlets: 5.9.0
- triton: 2.0.0
- typeguard: 2.13.3
- typing-extensions: 4.4.0
- ujson: 5.7.0
- urllib3: 1.26.14
- uvicorn: 0.21.1
- uvloop: 0.17.0
- wandb: 0.13.10
- watchfiles: 0.19.0
- wcwidth: 0.2.6
- websocket-client: 1.5.1
- websockets: 10.4
- werkzeug: 2.2.3
- wheel: 0.38.4
- wrapt: 1.14.1
- yarl: 1.8.2
- zipp: 3.14.0
System:
- OS: Linux
- architecture:
- 64bit
- ELF
- processor: x86_64
- python: 3.9.16
- version: #76-Ubuntu SMP Fri Mar 17 17:19:29 UTC 2023

```
#- Lightning Component (e.g. Trainer, LightningModule, LightningApp, LightningWork, LightningFlow): Trainer, LightningModule
#- PyTorch Lightning Version (e.g., 1.5.0): 2.0.1
#- Lightning App Version (e.g., 0.5.2):
#- PyTorch Version (e.g., 2.0): 2.0.0
#- Python version (e.g., 3.9): 3.9.16
#- OS (e.g., Linux): Linux
#- CUDA/cuDNN version: 11.7
#- GPU models and configuration:
#- How you installed Lightning(`conda`, `pip`, source): pip
#- Running environment of LightningApp (e.g. local, cloud): server
```

More info

No response

cc @justusschock @awaelchli


ryan597 · 2023-04-17T18:48:58Z

Are you able to get other DDP jobs to run?
Try the script below.

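import lightning as L
from lightning.pytorch.demos.boring_classes import BoringModel

# Number of GPUs to train on.
ngpus = 3

# BoringModel is Lightning's built-in demo model; it needs no external data.
model = BoringModel()
trainer = L.Trainer(max_epochs=10, devices=ngpus)

# If this also hangs, the problem is unlikely to be in the original training script.
trainer.fit(model)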

GeoffNN · 2023-04-18T17:27:47Z

Ah. Thanks for the reduction. No, this doesn't seem to work either. Again, I get

python ~/deeponet-fno/src/burgers/toy_ddp.py
~/miniconda3/envs/pde/lib/python3.9/site-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: libtorch_cuda_cu.so: cannot open shared object file: No such file or directory
  warn(f"Failed to load image Python extension: {e}")
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/2
~/miniconda3/envs/pde/lib/python3.9/site-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: libtorch_cuda_cu.so: cannot open shared object file: No such file or directory
  warn(f"Failed to load image Python extension: {e}")
Initializing distributed: GLOBAL_RANK: 1, MEMBER: 2/2
----------------------------------------------------------------------------------------------------
distributed_backend=nccl
All distributed processes registered. Starting with 2 processes
----------------------------------------------------------------------------------------------------

You are using a CUDA device ('NVIDIA RTX A6000') that has Tensor Cores. To properly utilize them, you should set `torch.set_float32_matmul_precision('medium' | 'high')` which will trade-off precision for performance. For more details, read https://pytorch.org/docs/stable/generated/torch.set_float32_matmul_precision.html#torch.set_float32_matmul_precision

and then nothing.

ryan597 · 2023-04-18T20:01:01Z

I don't understand why torchvision is outputting an error here as it wasn't in the script.

Did you install PyTorch using Miniconda or pip?
Try setting up a clean environment:

conda create -n testenv python=3.9 
conda activate testenv
pip install torch torchvision lightning
python -c "import torch; print(torch.__version__)"
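
The one-liner above prints only the torch version. A slightly broader check, sketched below with the standard library's importlib.metadata, reports the versions of all three packages installed by the pip command. The helper name installed_versions is hypothetical and chosen only for illustration. It may be worth comparing the output with the environment dump earlier in the thread, which lists torch 2.0.0 next to torchvision 0.14.1 and torchaudio 0.13.1. Those two releases were built for the torch 1.13 series, so a mismatch of that kind is a plausible source of the libtorch_cuda_cu.so warning, although the thread never confirms it.

```python
from importlib import metadata


def installed_versions(names):
    """Return {package name: installed version, or None if not installed}."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    # Packages installed by the pip command above.
    for name, version in installed_versions(["torch", "torchvision", "lightning"]).items():
        print(f"{name}: {version or 'not installed'}")
```

This reads package metadata only and does not import torch, so it still works when an import would fail.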

Also run export NCCL_DEBUG=INFO and then try the test script again in the new environment.
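
One way to apply this without changing the shell session is to launch the test script from Python with NCCL_DEBUG set only for that run. This is a minimal sketch: the helper names are hypothetical, and toy_ddp.py is the file name used earlier in the thread for the test script. Child processes inherit the environment by default, so NCCL's log lines should appear for the other ranks as well.

```python
import os
import subprocess
import sys


def with_nccl_debug(base_env=None):
    """Return a copy of the environment with NCCL_DEBUG=INFO added."""
    env = dict(os.environ if base_env is None else base_env)
    env["NCCL_DEBUG"] = "INFO"
    return env


def run_with_nccl_debug(script_path):
    """Run the script with the current interpreter and return its exit code."""
    completed = subprocess.run([sys.executable, script_path], env=with_nccl_debug())
    return completed.returncode


# Example (not executed here): run_with_nccl_debug("toy_ddp.py")
```

If the job still hangs, the call blocks until it is interrupted, but the NCCL output printed up to that point shows how far initialization got.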

shoang22 · 2023-06-24T03:35:12Z

Hi, was this ever fixed? I'm running into the same issue using BoringModel.

ryan597 · 2023-06-24T07:51:09Z

@shoang22 the exact problem wasn't really identified. It looks like a problem in the installation. Have you tried creating a clean environment with the above steps?

shoang22 · 2023-06-25T00:16:46Z

I did try a clean install, but the problem persisted. I was, however, able to solve it. I was running my script on a SLURM cluster, and it turns out that I needed to launch the script with srun in my bash file; submitting it with sbatch alone wasn't enough.
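
To see what changes between launching with and without srun, it can help to print the SLURM variables each process inherits. The sketch below uses only the standard library, and the helper name slurm_snapshot is hypothetical. SLURM_JOB_ID, SLURM_NTASKS, SLURM_PROCID, and SLURM_LOCALID are variables that SLURM itself documents. Launched through srun, each task should print its own SLURM_PROCID; a script started once from the batch file prints a single line.

```python
import os

SLURM_KEYS = ("SLURM_JOB_ID", "SLURM_NTASKS", "SLURM_PROCID", "SLURM_LOCALID")


def slurm_snapshot(environ=None):
    """Return the selected SLURM variables; unset ones map to None."""
    environ = os.environ if environ is None else environ
    return {key: environ.get(key) for key in SLURM_KEYS}


if __name__ == "__main__":
    print(slurm_snapshot())
```

Outside a SLURM job every value is None, which makes the script safe to run on a workstation as well.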

GeoffNN added the "bug" and "needs triage" labels on Apr 15, 2023

Borda added the "ver: 2.0.x", "ver: 1.9.x", and "ver: 2.1.x" labels on May 3, 2023

awaelchli added the "strategy: ddp" and "repro needed" labels and removed the "needs triage" label on Nov 25, 2023

awaelchli mentioned this issue Mar 10, 2024

[WIP] Basic system check for troubleshooting multi-GPU issues #19609

Draft
