DDP training freezes immediately #17389

GeoffNN opened this issue Apr 15, 2023 · 6 comments

Labels: bug, repro needed, strategy: ddp, ver: 1.9.x, ver: 2.0.x, ver: 2.1.x



GeoffNN commented Apr 15, 2023

Bug description

I'm trying to run a job with several GPUs. My script immediately gets stuck after outputting:

python /home/negroni/deeponet-fno/src/burgers/ --ngpus 3

Using backend: tensorflow.compat.v1

2023-04-14 16:56:35.997710: I tensorflow/core/platform/] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX512F AVX512_VNNI FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-04-14 16:56:36.145661: I tensorflow/core/util/] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2023-04-14 16:56:36.609342: W tensorflow/compiler/xla/stream_executor/platform/default/] Could not load dynamic library ''; dlerror: cannot open shared object file: No such file or directory
2023-04-14 16:56:36.609396: W tensorflow/compiler/xla/stream_executor/platform/default/] Could not load dynamic library ''; dlerror: cannot open shared object file: No such file or directory
2023-04-14 16:56:36.609401: W tensorflow/compiler/tf2tensorrt/utils/] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
WARNING:tensorflow:From /home/negroni/miniconda3/envs/pde/lib/python3.9/site-packages/tensorflow/python/compat/ disable_resource_variables (from tensorflow.python.ops.variable_scope) is deprecated and will be removed in a future version.
Instructions for updating:
non-resource variables are not supported in the long term
WARNING:tensorflow:From /home/negroni/miniconda3/envs/pde/lib/python3.9/site-packages/tensorflow/python/compat/ disable_resource_variables (from tensorflow.python.ops.variable_scope) is deprecated and will be removed in a future version.
Instructions for updating:
non-resource variables are not supported in the long term
Enable just-in-time compilation with XLA.

WARNING:tensorflow:From /home/negroni/miniconda3/envs/pde/lib/python3.9/site-packages/deepxde/nn/ The name tf.keras.initializers.he_normal is deprecated. Please use tf.compat.v1.keras.initializers.he_normal instead.

WARNING:tensorflow:From /home/negroni/miniconda3/envs/pde/lib/python3.9/site-packages/deepxde/nn/ The name tf.keras.initializers.he_normal is deprecated. Please use tf.compat.v1.keras.initializers.he_normal instead.

/home/negroni/miniconda3/envs/pde/lib/python3.9/site-packages/torchvision/io/ UserWarning: Failed to load image Python extension: cannot open shared object file: No such file or directory
  warn(f"Failed to load image Python extension: {e}")

torch.cuda.is_available(): True
torch.cuda.get_device_name(0): NVIDIA RTX A6000

Namespace(batch=5, lr=0.001, lr_scheduler_step=2000, lr_scheduler_factor=0.9, ridge=0.0001, epochs=500, nsamples=500, nsamples_residual=250, Nbasis=75, ngpus=3, max_iterations=50, log_every_n_steps=1, viscosity=0.01) 

wandb: Currently logged in as: geoffnn. Use `wandb login --relogin` to force relogin
wandb: wandb version 0.14.2 is available!  To upgrade, please run:
wandb:  $ pip install wandb --upgrade
wandb: Tracking run with wandb version 0.13.10
wandb: Run data is saved locally in logs/wandb/run-20230414_165640-z1m33sa0
wandb: Run `wandb offline` to turn off syncing.
wandb: Syncing run azure-smoke-113
wandb: ⭐️ View project at
wandb: 🚀 View run at
Loaded data
Loaded data
INFO:pytorch_lightning.utilities.rank_zero:GPU available: True (cuda), used: True
INFO:pytorch_lightning.utilities.rank_zero:TPU available: False, using: 0 TPU cores
INFO:pytorch_lightning.utilities.rank_zero:IPU available: False, using: 0 IPUs
INFO:pytorch_lightning.utilities.rank_zero:HPU available: False, using: 0 HPUs
INFO:lightning_fabric.utilities.distributed:Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/3
Using backend: pytorch

Using backend: pytorch

/home/negroni/miniconda3/envs/pde/lib/python3.9/site-packages/torchvision/io/ UserWarning: Failed to load image Python extension: cannot open shared object file: No such file or directory
  warn(f"Failed to load image Python extension: {e}")
/home/negroni/miniconda3/envs/pde/lib/python3.9/site-packages/torchvision/io/ UserWarning: Failed to load image Python extension: cannot open shared object file: No such file or directory
  warn(f"Failed to load image Python extension: {e}")

torch.cuda.is_available(): True
torch.cuda.get_device_name(0): NVIDIA RTX A6000

Namespace(batch=5, lr=0.001, lr_scheduler_step=2000, lr_scheduler_factor=0.9, ridge=0.0001, epochs=500, nsamples=500, nsamples_residual=250, Nbasis=75, ngpus=3, max_iterations=50, log_every_n_steps=1, viscosity=0.01) 

Loaded data
Loaded data

torch.cuda.is_available(): True
torch.cuda.get_device_name(0): NVIDIA RTX A6000

Namespace(batch=5, lr=0.001, lr_scheduler_step=2000, lr_scheduler_factor=0.9, ridge=0.0001, epochs=500, nsamples=500, nsamples_residual=250, Nbasis=75, ngpus=3, max_iterations=50, log_every_n_steps=1, viscosity=0.01) 

Initializing distributed: GLOBAL_RANK: 2, MEMBER: 3/3
Loaded data
Loaded data
Initializing distributed: GLOBAL_RANK: 1, MEMBER: 2/3
All distributed processes registered. Starting with 3 processes

INFO:pytorch_lightning.utilities.rank_zero:You are using a CUDA device ('NVIDIA RTX A6000') that has Tensor Cores. To properly utilize them, you should set `torch.set_float32_matmul_precision('medium' | 'high')` which will trade-off precision for performance. For more details, read

What version are you seeing the problem on?

2.0+ and 1.9.x

How to reproduce the bug

I can't reduce to a small repro, but the code is here:

Error messages and logs

# Error messages and logs here please


Current environment
  • CUDA:
    - GPU:
    - NVIDIA RTX A6000
    - NVIDIA RTX A6000
    - NVIDIA RTX A6000
    - NVIDIA RTX A6000
    - NVIDIA RTX A6000
    - NVIDIA RTX A6000
    - NVIDIA RTX A6000
    - NVIDIA RTX A6000
    - NVIDIA RTX A6000
    - NVIDIA RTX A6000
    - available: True
    - version: 11.7
  • Lightning:
    - lightning: 2.0.1
    - lightning-cloud: 0.5.32
    - lightning-utilities: 0.7.0
    - pytorch-lightning: 1.9.3
    - torch: 2.0.0
    - torchaudio: 0.13.1
    - torchmetrics: 0.11.1
    - torchvision: 0.14.1
  • Packages:
    - absl-py: 1.4.0
    - aiohttp: 3.8.4
    - aiosignal: 1.3.1
    - altair: 4.2.2
    - anyio: 3.6.2
    - appdirs: 1.4.4
    - arrow: 1.2.3
    - asttokens: 2.2.1
    - astunparse: 1.6.3
    - async-timeout: 4.0.2
    - attrs: 22.2.0
    - backcall: 0.2.0
    - backports.functools-lru-cache: 1.6.4
    - beautifulsoup4: 4.12.0
    - black: 23.3.0
    - blessed: 1.20.0
    - brotlipy: 0.7.0
    - cachetools: 5.3.0
    - certifi: 2022.12.7
    - cffi: 1.15.1
    - charset-normalizer: 2.0.4
    - click: 8.1.3
    - cmake: 3.26.1
    - colorama: 0.4.6
    - contourpy: 1.0.7
    - croniter: 1.3.8
    - cryptography: 38.0.4
    - cycler: 0.11.0
    - dateutils: 0.6.12
    - debugpy: 1.5.1
    - decorator: 5.1.1
    - deepdiff: 6.3.0
    - deepxde: 1.8.0
    - dnspython: 2.3.0
    - docker-pycreds: 0.4.0
    - email-validator: 1.3.1
    - entrypoints: 0.4
    - exceptiongroup: 1.1.0
    - executing: 1.2.0
    - fastapi: 0.88.0
    - filelock: 3.10.7
    - flatbuffers: 23.1.21
    - flit-core: 3.6.0
    - fonttools: 4.38.0
    - frozenlist: 1.3.3
    - fsspec: 2023.1.0
    - gast: 0.4.0
    - gitdb: 4.0.10
    - gitpython: 3.1.31
    - google-auth: 2.16.1
    - google-auth-oauthlib: 0.4.6
    - google-pasta: 0.2.0
    - gpustat: 1.0.0
    - grpcio: 1.51.1
    - h11: 0.14.0
    - h5py: 3.8.0
    - hcpdenn: 0.0.1
    - httpcore: 0.16.3
    - httptools: 0.5.0
    - httpx: 0.23.3
    - idna: 3.4
    - importlib-metadata: 6.0.0
    - importlib-resources: 5.12.0
    - iniconfig: 2.0.0
    - inquirer: 3.1.3
    - ipykernel: 6.15.0
    - ipython: 8.10.0
    - itsdangerous: 2.1.2
    - jax: 0.3.25
    - jaxlib: 0.3.25+cuda11.cudnn82
    - jedi: 0.18.2
    - jinja2: 3.1.2
    - joblib: 1.2.0
    - jsonschema: 4.17.3
    - jupyter-client: 7.0.6
    - jupyter-core: 4.12.0
    - keras: 2.11.0
    - kiwisolver: 1.4.4
    - libclang:
    - lightning: 2.0.1
    - lightning-cloud: 0.5.32
    - lightning-utilities: 0.7.0
    - lit: 16.0.0
    - markdown: 3.4.1
    - markdown-it-py: 2.2.0
    - markupsafe: 2.1.2
    - matplotlib: 3.7.0
    - matplotlib-inline: 0.1.6
    - mdurl: 0.1.2
    - mkl-fft: 1.3.1
    - mkl-random: 1.2.2
    - mkl-service: 2.4.0
    - ml-dtypes: 0.0.4
    - mpmath: 1.3.0
    - multidict: 6.0.4
    - mypy-extensions: 1.0.0
    - nest-asyncio: 1.5.6
    - networkx: 3.0
    - numpy: 1.23.5
    - nvidia-cublas-cu11:
    - nvidia-cuda-cupti-cu11: 11.7.101
    - nvidia-cuda-nvrtc-cu11: 11.7.99
    - nvidia-cuda-runtime-cu11: 11.7.99
    - nvidia-cudnn-cu11:
    - nvidia-cufft-cu11:
    - nvidia-curand-cu11:
    - nvidia-cusolver-cu11:
    - nvidia-cusparse-cu11:
    - nvidia-ml-py: 11.495.46
    - nvidia-nccl-cu11: 2.14.3
    - nvidia-nvtx-cu11: 11.7.91
    - oauthlib: 3.2.2
    - opt-einsum: 3.3.0
    - ordered-set: 4.1.0
    - orjson: 3.8.9
    - packaging: 23.0
    - pandas: 1.5.3
    - parso: 0.8.3
    - pathspec: 0.11.1
    - pathtools: 0.1.2
    - pexpect: 4.8.0
    - pickleshare: 0.7.5
    - pillow: 9.3.0
    - pip: 22.3.1
    - platformdirs: 3.2.0
    - pluggy: 1.0.0
    - pooch: 1.6.0
    - prompt-toolkit: 3.0.36
    - protobuf: 3.19.6
    - psutil: 5.9.4
    - ptyprocess: 0.7.0
    - pure-eval: 0.2.2
    - pyaml: 21.10.1
    - pyasn1: 0.4.8
    - pyasn1-modules: 0.2.8
    - pybind11: 2.10.3
    - pycparser: 2.21
    - pydantic: 1.10.7
    - pygments: 2.14.0
    - pyjwt: 2.6.0
    - pyopenssl: 22.0.0
    - pyparsing: 3.0.9
    - pyrsistent: 0.19.3
    - pysocks: 1.7.1
    - pytest: 7.2.1
    - python-dateutil: 2.8.2
    - python-dotenv: 1.0.0
    - python-editor: 1.0.4
    - python-multipart: 0.0.6
    - pytorch-lightning: 1.9.3
    - pytz: 2022.7.1
    - pyyaml: 6.0
    - pyzmq: 19.0.2
    - readchar: 4.0.5
    - requests: 2.28.1
    - requests-oauthlib: 1.3.1
    - rfc3986: 1.5.0
    - rich: 13.3.3
    - rsa: 4.9
    - scienceplots: 2.0.1
    - scikit-learn: 1.2.1
    - scikit-optimize: 0.9.0
    - scikit-sparse: 0.4.8
    - scipy: 1.10.1
    - seaborn: 0.12.2
    - sentry-sdk: 1.16.0
    - setproctitle: 1.3.2
    - setuptools: 65.6.3
    - six: 1.16.0
    - sklearn: 0.0.post1
    - smmap: 5.0.0
    - sniffio: 1.3.0
    - soupsieve: 2.4
    - stack-data: 0.6.2
    - starlette: 0.22.0
    - starsessions: 1.3.0
    - sympy: 1.11.1
    - tensorboard: 2.11.2
    - tensorboard-data-server: 0.6.1
    - tensorboard-plugin-wit: 1.8.1
    - tensorflow: 2.11.0
    - tensorflow-addons: 0.19.0
    - tensorflow-estimator: 2.11.0
    - tensorflow-io-gcs-filesystem: 0.30.0
    - termcolor: 2.2.0
    - theseus-ai: 0.1.4
    - threadpoolctl: 3.1.0
    - tomli: 2.0.1
    - toolz: 0.12.0
    - torch: 2.0.0
    - torchaudio: 0.13.1
    - torchmetrics: 0.11.1
    - torchvision: 0.14.1
    - tornado: 6.2
    - tqdm: 4.64.1
    - traitlets: 5.9.0
    - triton: 2.0.0
    - typeguard: 2.13.3
    - typing-extensions: 4.4.0
    - ujson: 5.7.0
    - urllib3: 1.26.14
    - uvicorn: 0.21.1
    - uvloop: 0.17.0
    - wandb: 0.13.10
    - watchfiles: 0.19.0
    - wcwidth: 0.2.6
    - websocket-client: 1.5.1
    - websockets: 10.4
    - werkzeug: 2.2.3
    - wheel: 0.38.4
    - wrapt: 1.14.1
    - yarl: 1.8.2
    - zipp: 3.14.0
  • System:
    - OS: Linux
    - architecture:
    - 64bit
    - ELF
    - processor: x86_64
    - python: 3.9.16
    - version: #76-Ubuntu SMP Fri Mar 17 17:19:29 UTC 2023
```
#- Lightning Component (e.g. Trainer, LightningModule, LightningApp, LightningWork, LightningFlow): Trainer, LightningModule
#- PyTorch Lightning Version (e.g., 1.5.0): 2.0.1
#- Lightning App Version (e.g., 0.5.2):
#- PyTorch Version (e.g., 2.0): 2.0.0
#- Python version (e.g., 3.9): 3.9.16
#- OS (e.g., Linux): Linux
#- CUDA/cuDNN version: 11.7
#- GPU models and configuration:
#- How you installed Lightning(`conda`, `pip`, source): pip
#- Running environment of LightningApp (e.g. local, cloud): server
```

More info

No response

cc @justusschock @awaelchli

GeoffNN added the "bug" and "needs triage" labels on Apr 15, 2023

ryan597 commented Apr 17, 2023

Are you able to get other ddp jobs to run?
Try the below script.

import lightning as L
from lightning.pytorch.demos.boring_classes import BoringModel

ngpus = 3

model = BoringModel()
trainer = L.Trainer(max_epochs=10,
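
The script above is cut off after `max_epochs=10,` in this copy of the thread. Below is a minimal sketch of a complete smoke test, assuming the missing arguments selected GPU training with the DDP strategy on `ngpus` devices. `accelerator`, `devices`, and `strategy` are real `Trainer` parameters, but the values here are guesses rather than what was originally posted, and the `--run` flag is made up for this sketch.

```python
import sys

# Hypothetical completion of the truncated test script; the Trainer
# arguments after max_epochs are assumptions, not the original post.


def trainer_kwargs(ngpus: int) -> dict:
    """Keyword arguments for a small multi-GPU DDP smoke test."""
    return {
        "max_epochs": 10,
        "accelerator": "gpu",
        "devices": ngpus,
        "strategy": "ddp",
    }


def main(ngpus: int = 3) -> None:
    # Imported lazily so the helper above can be inspected without Lightning.
    import lightning as L
    from lightning.pytorch.demos.boring_classes import BoringModel

    model = BoringModel()
    trainer = L.Trainer(**trainer_kwargs(ngpus))
    trainer.fit(model)  # BoringModel provides its own dataloaders


# Training only starts when the file is run with --run (a flag invented
# for this sketch), so loading the file has no side effects.
if __name__ == "__main__" and "--run" in sys.argv:
    main()
```

If this also stalls right after "All distributed processes registered", the hang is probably not specific to the user's model or data.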


GeoffNN commented Apr 18, 2023

Ah. Thanks for the reduction. No, this doesn't seem to work either. Again, I get

python ~/deeponet-fno/src/burgers/
~/miniconda3/envs/pde/lib/python3.9/site-packages/torchvision/io/ UserWarning: Failed to load image Python extension: cannot open shared object file: No such file or directory
  warn(f"Failed to load image Python extension: {e}")
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/2
~/miniconda3/envs/pde/lib/python3.9/site-packages/torchvision/io/ UserWarning: Failed to load image Python extension: cannot open shared object file: No such file or directory
  warn(f"Failed to load image Python extension: {e}")
Initializing distributed: GLOBAL_RANK: 1, MEMBER: 2/2
All distributed processes registered. Starting with 2 processes

You are using a CUDA device ('NVIDIA RTX A6000') that has Tensor Cores. To properly utilize them, you should set `torch.set_float32_matmul_precision('medium' | 'high')` which will trade-off precision for performance. For more details, read

and then nothing.


ryan597 commented Apr 18, 2023

I don't understand why torchvision is emitting a warning here, since the script doesn't import it.

Did you install PyTorch using Miniconda or pip?
Try setting up a clean env:

conda create -n testenv python=3.9 
conda activate testenv
pip install torch torchvision lightning
python -c "import torch; print(torch.__version__)"
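
The environment listing earlier in the thread shows torch 2.0.0 next to torchvision 0.14.1 and torchaudio 0.13.1, along with both lightning 2.0.1 and pytorch-lightning 1.9.3. Those torchvision and torchaudio releases were built against torch 1.13, which would be consistent with the "Failed to load image Python extension" warning, though the thread never confirms it as the cause of the hang. Below is a small sketch for printing what is installed in an environment, using only the standard library. The package list is an assumption about what is worth checking.

```python
from importlib import metadata

# Packages whose versions should be mutually compatible (illustrative list).
PACKAGES = ("torch", "torchvision", "torchaudio", "lightning", "pytorch-lightning")


def installed_versions(packages=PACKAGES) -> dict:
    """Map each package name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


for name, version in installed_versions().items():
    print(f"{name}: {version or 'not installed'}")
```

Running it in both the old and the clean environment makes any mismatch easy to spot side by side.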

Also run `export NCCL_DEBUG=INFO` and try the test script again in the new env.
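
The same setting can be applied from Python when editing the shell environment is inconvenient. This is a minimal sketch, assuming the variable is set before anything initializes NCCL (in practice, before the `Trainer` is created). `NCCL_DEBUG` is a documented NCCL variable; the helper name is made up for illustration.

```python
import os


def enable_nccl_debug(env=os.environ, level: str = "INFO"):
    """Set NCCL_DEBUG unless the caller already chose a level."""
    env.setdefault("NCCL_DEBUG", level)
    return env


enable_nccl_debug()
# ... build the model and Trainer after this point ...
```

Using `setdefault` means a level exported in the shell still takes precedence over the one set in the script.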


Hi, was this ever fixed? I'm running into the same issue using BoringModel


ryan597 commented Jun 24, 2023

@shoang22 the exact problem wasn't really identified. It looks like a problem in the installation. Have you tried creating a clean environment with the above steps?


I did try a clean install, but the problem persisted. I was, however, able to solve it. I was running my script on a SLURM cluster, and it turns out that I needed to launch it with srun inside my bash file; submitting with sbatch alone wasn't enough.
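
A rough way to see what the SLURM fix above changes is to print the SLURM variables each process sees. This sketch assumes the usual `SLURM_JOB_ID`, `SLURM_NTASKS`, `SLURM_PROCID`, and `SLURM_LOCALID` variables. Which of them are set inside a batch script without `srun` depends on the cluster and the `sbatch` options, so treat the output as a hint rather than a verdict.

```python
import os

SLURM_VARS = ("SLURM_JOB_ID", "SLURM_NTASKS", "SLURM_PROCID", "SLURM_LOCALID")


def slurm_summary(env=os.environ) -> dict:
    """Collect the SLURM variables relevant to multi-process launches."""
    return {name: env.get(name) for name in SLURM_VARS}


def describe(env=os.environ) -> str:
    """Summarize the SLURM context in one line for logging."""
    info = slurm_summary(env)
    if info["SLURM_JOB_ID"] is None:
        return "not running inside a SLURM job"
    pairs = ", ".join(f"{key}={value}" for key, value in info.items())
    return f"SLURM job detected: {pairs}"


print(describe())
```

Running this once directly in the batch script and once through `srun` shows whether each task is started with its own `SLURM_PROCID`.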

awaelchli added the "strategy: ddp" and "repro needed" labels and removed the "needs triage" label on Nov 25, 2023