
Multi-node Training with DDP stuck at "Initialize distributed..." on SLURM cluster #19817

Open
OswaldHe opened this issue Apr 25, 2024 · 4 comments
Labels: bug (Something isn't working), needs triage (Waiting to be triaged by maintainers)

Comments

@OswaldHe

OswaldHe commented Apr 25, 2024

Bug description

I'm working on a SLURM cluster with 8 AMD MI100 GPUs distributed across 2 nodes (4 GPUs per node). I followed the instructions (https://lightning.ai/docs/pytorch/stable/clouds/cluster_advanced.html) to submit a multi-node training job, but the job gets stuck at "Initializing distributed: ...". I checked all related issues and none of them solved the problem.

What version are you seeing the problem on?

v2.2

How to reproduce the bug

Training Script:

import os
from torch import optim, nn, utils, Tensor
from torchvision.datasets import MNIST
from torchvision.transforms import ToTensor
import lightning as L

# define any number of nn.Modules (or use your current ones)
encoder = nn.Sequential(nn.Linear(28 * 28, 64), nn.ReLU(), nn.Linear(64, 3))
decoder = nn.Sequential(nn.Linear(3, 64), nn.ReLU(), nn.Linear(64, 28 * 28))


# define the LightningModule
class LitAutoEncoder(L.LightningModule):
    def __init__(self, encoder, decoder):
        super().__init__()
        self.encoder = encoder
        self.decoder = decoder

    def training_step(self, batch, batch_idx):
        # training_step defines the train loop.
        # it is independent of forward
        x, y = batch
        x = x.view(x.size(0), -1)
        z = self.encoder(x)
        x_hat = self.decoder(z)
        loss = nn.functional.mse_loss(x_hat, x)
        # Logging to TensorBoard (if installed) by default
        self.log("train_loss", loss)
        return loss

    def configure_optimizers(self):
        optimizer = optim.Adam(self.parameters(), lr=1e-3)
        return optimizer


# init the autoencoder
autoencoder = LitAutoEncoder(encoder, decoder)

# setup data
dataset = MNIST(os.getcwd(), download=True, transform=ToTensor())
train_loader = utils.data.DataLoader(dataset)

# train the model (hint: here are some helpful Trainer arguments for rapid idea iteration)
trainer = L.Trainer(limit_train_batches=100, max_epochs=1, num_nodes=2, devices=4, strategy="ddp")
trainer.fit(model=autoencoder, train_dataloaders=train_loader)

SLURM batch script:

#!/bin/bash

#SBATCH -p mi1004x
#SBATCH --nodes=2             # This needs to match Trainer(num_nodes=...)
#SBATCH --ntasks-per-node=4   # This needs to match Trainer(devices=...)
#SBATCH --time=0-00:30:00
#SBATCH -e slurm-%j.err

source ~/miniconda3/bin/activate pylight
# run script from above
srun python train.py
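
To isolate whether the hang is inside Lightning's rendezvous or in the backend itself, a bare torch.distributed smoke test can be launched with the same batch script. A minimal sketch (test_dist.py is a hypothetical file name; MASTER_ADDR and MASTER_PORT would have to be exported in the batch script, e.g. the first hostname in $SLURM_JOB_NODELIST and any free port, since Lightning normally sets these itself):

# test_dist.py -- standalone rendezvous check, independent of Lightning
import datetime
import os

import torch
import torch.distributed as dist

# srun exports these per task; MASTER_ADDR/MASTER_PORT must be set in the batch script
rank = int(os.environ["SLURM_PROCID"])
world_size = int(os.environ["SLURM_NTASKS"])
local_rank = int(os.environ["SLURM_LOCALID"])

torch.cuda.set_device(local_rank)
dist.init_process_group(
    backend="nccl",  # maps to RCCL on ROCm builds of PyTorch
    init_method="env://",
    rank=rank,
    world_size=world_size,
    timeout=datetime.timedelta(seconds=120),
)
x = torch.ones(1, device="cuda")
dist.all_reduce(x)  # every rank should end up holding the world size
print(f"rank {rank}/{world_size}: all_reduce -> {x.item()}")
dist.destroy_process_group()

If this also hangs, the problem is in the NCCL/RCCL rendezvous itself rather than in Lightning.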

Error messages and logs

Initializing distributed: GLOBAL_RANK: 1, MEMBER: 2/8
Initializing distributed: GLOBAL_RANK: 2, MEMBER: 3/8
Initializing distributed: GLOBAL_RANK: 3, MEMBER: 4/8
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
You are using a CUDA device ('AMD Instinct MI100') that has Tensor Cores. To properly utilize them, you should set `torch.set_float32_matmul_precision('medium' | 'high')` which will trade-off precision for performance. For more details, read https://pytorch.org/docs/stable/generated/torch.set_float32_matmul_precision.html#torch.set_float32_matmul_precision
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/8
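
The log stops after the rank-0 line above. More detail on where the rendezvous stalls can usually be obtained with PyTorch's standard distributed debug variables (nothing Lightning-specific; on ROCm, RCCL honors the NCCL_* names). A sketch, to be set before torch is imported, or exported in the batch script instead:

# at the very top of train.py, before `import torch` (or export these in the SLURM script)
import os

os.environ.setdefault("TORCH_CPP_LOG_LEVEL", "INFO")        # surface c10d log messages
os.environ.setdefault("TORCH_DISTRIBUTED_DEBUG", "DETAIL")  # verbose process-group setup info
os.environ.setdefault("NCCL_DEBUG", "INFO")                 # RCCL reads the NCCL_* names on ROCm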

Environment

Current environment
  • CUDA:
    • GPU:
      • AMD Instinct MI100
      • AMD Instinct MI100
      • AMD Instinct MI100
      • AMD Instinct MI100
    • available: True
    • version: None
  • Lightning:
    • lightning: 2.2.1
    • lightning-utilities: 0.11.2
    • pytorch-lightning: 2.2.1
    • pytorch-triton-rocm: 2.2.0
    • torch: 2.2.0+rocm5.6
    • torchaudio: 2.2.0+rocm5.6
    • torchmetrics: 1.3.2
    • torchvision: 0.17.0+rocm5.6
  • Packages:
    • absl-py: 2.1.0
    • aiohttp: 3.9.3
    • aiosignal: 1.3.1
    • annotated-types: 0.6.0
    • async-timeout: 4.0.3
    • attrs: 23.2.0
    • certifi: 2022.12.7
    • charset-normalizer: 2.1.1
    • deepspeed: 0.14.0
    • filelock: 3.9.0
    • frozenlist: 1.4.1
    • fsspec: 2023.4.0
    • future: 1.0.0
    • grpcio: 1.62.1
    • hjson: 3.1.0
    • idna: 3.4
    • imageio: 2.34.0
    • jinja2: 3.1.2
    • lightning: 2.2.1
    • lightning-utilities: 0.11.2
    • markdown: 3.6
    • markupsafe: 2.1.3
    • mpmath: 1.3.0
    • multidict: 6.0.5
    • networkx: 3.2.1
    • ninja: 1.11.1.1
    • numpy: 1.26.3
    • packaging: 24.0
    • pandas: 2.2.1
    • pillow: 10.2.0
    • pip: 23.3.1
    • protobuf: 5.26.1
    • psutil: 5.9.8
    • py-cpuinfo: 9.0.0
    • pydantic: 2.7.0
    • pydantic-core: 2.18.1
    • pynvml: 11.5.0
    • python-dateutil: 2.9.0.post0
    • pytorch-lightning: 2.2.1
    • pytorch-triton-rocm: 2.2.0
    • pytz: 2024.1
    • pyyaml: 6.0.1
    • requests: 2.28.1
    • setuptools: 68.2.2
    • six: 1.16.0
    • sympy: 1.12
    • tensorboard: 2.16.2
    • tensorboard-data-server: 0.7.2
    • test-tube: 0.7.5
    • torch: 2.2.0+rocm5.6
    • torchaudio: 2.2.0+rocm5.6
    • torchmetrics: 1.3.2
    • torchvision: 0.17.0+rocm5.6
    • tqdm: 4.66.2
    • typing-extensions: 4.8.0
    • tzdata: 2024.1
    • urllib3: 1.26.13
    • werkzeug: 3.0.1
    • wheel: 0.41.2
    • yarl: 1.9.4
  • System:
    • OS: Linux
    • architecture:
      • 64bit
      • ELF
    • processor: x86_64
    • python: 3.10.14
    • release: 5.14.0-162.18.1.el9_1.x86_64
    • version: SMP PREEMPT_DYNAMIC Wed Mar 1 22:02:24 UTC 2023

More info

No response

OswaldHe added the bug and needs triage labels on Apr 25, 2024
@jaydeepradeJD

Try using "srun python3 train.py". python --> python3

@OswaldHe
Author

I tried python3, but the issue still remains.

@FelixBrakel

I have the same issue. It works fine when launched directly with srun, but it hangs when submitted as a job with sbatch.
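
Lightning picks the ranks and world size up from the variables srun exports to each task, so it can help to compare what the environment looks like in the two launch modes. A minimal check (check_env.py is just a placeholder name; these are plain SLURM variables, not Lightning's own code):

# run with `srun python check_env.py` in both launch modes and diff the output
import os

for key in (
    "SLURM_JOB_ID", "SLURM_NTASKS", "SLURM_NNODES", "SLURM_NODEID",
    "SLURM_PROCID", "SLURM_LOCALID", "SLURM_JOB_NODELIST",
    "MASTER_ADDR", "MASTER_PORT",
):
    print(f"{key}={os.environ.get(key, '<unset>')}")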

@Furkan9015

This is a real bottleneck, especially if your environment only allows sbatch and you cannot run srun interactively.
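
Even with sbatch-only access, the tasks inside the batch script are still started by srun, so Lightning's SLURM detection should in principle still apply. One thing that can be tried to rule out detection problems is passing the cluster environment to the Trainer explicitly instead of relying on auto-detection; a sketch assuming Lightning 2.x, not a confirmed fix:

from lightning.pytorch.plugins.environments import SLURMEnvironment
import lightning as L

trainer = L.Trainer(
    num_nodes=2,
    devices=4,
    strategy="ddp",
    # hand Lightning the SLURM environment explicitly; auto_requeue=False just
    # disables the automatic requeue signal handling and is optional here
    plugins=[SLURMEnvironment(auto_requeue=False)],
)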
