Bug description
I'm working on a SLURM cluster with 8 AMD MI100 GPUs distributed across 2 nodes (4 GPUs per node). I followed the instructions at https://lightning.ai/docs/pytorch/stable/clouds/cluster_advanced.html to submit a multi-node training job, but the job gets stuck at "Initializing distributed: ...". I checked all related issues and none of them solved the problem.
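To isolate whether the hang comes from Lightning or from the underlying process-group rendezvous, a bare torch.distributed initialization can be launched under the same srun allocation. The following is a minimal sketch, assuming the SLURM rank variables are present and that MASTER_ADDR/MASTER_PORT are exported in the batch script; the file name check_dist.py, the 120-second timeout, and the nccl backend (RCCL on ROCm builds) are illustrative choices, not part of the setup above.

# check_dist.py -- minimal process-group smoke test (illustrative sketch)
import os
from datetime import timedelta

import torch
import torch.distributed as dist

rank = int(os.environ["SLURM_PROCID"])         # global rank assigned by srun
world_size = int(os.environ["SLURM_NTASKS"])   # should be 8 for 2 nodes x 4 tasks
local_rank = int(os.environ["SLURM_LOCALID"])  # GPU index within the node

torch.cuda.set_device(local_rank)              # ROCm builds expose HIP via torch.cuda
dist.init_process_group(
    backend="nccl",                            # maps to RCCL on ROCm
    rank=rank,
    world_size=world_size,
    timeout=timedelta(seconds=120),            # fail fast instead of hanging
)
print(f"rank {rank}/{world_size} joined the process group", flush=True)
dist.barrier()
dist.destroy_process_group()

If this also stalls, the problem is below Lightning (network interface selection, a firewall between nodes, or RCCL); if it completes, the issue is more likely in how the Trainer derives the rendezvous settings from SLURM.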
What version are you seeing the problem on?
v2.2
How to reproduce the bug
Training Script:
import os

from torch import optim, nn, utils, Tensor
from torchvision.datasets import MNIST
from torchvision.transforms import ToTensor
import lightning as L

# define any number of nn.Modules (or use your current ones)
encoder = nn.Sequential(nn.Linear(28 * 28, 64), nn.ReLU(), nn.Linear(64, 3))
decoder = nn.Sequential(nn.Linear(3, 64), nn.ReLU(), nn.Linear(64, 28 * 28))


# define the LightningModule
class LitAutoEncoder(L.LightningModule):
    def __init__(self, encoder, decoder):
        super().__init__()
        self.encoder = encoder
        self.decoder = decoder

    def training_step(self, batch, batch_idx):
        # training_step defines the train loop.
        # it is independent of forward
        x, y = batch
        x = x.view(x.size(0), -1)
        z = self.encoder(x)
        x_hat = self.decoder(z)
        loss = nn.functional.mse_loss(x_hat, x)
        # Logging to TensorBoard (if installed) by default
        self.log("train_loss", loss)
        return loss

    def configure_optimizers(self):
        optimizer = optim.Adam(self.parameters(), lr=1e-3)
        return optimizer


# init the autoencoder
autoencoder = LitAutoEncoder(encoder, decoder)

# setup data
dataset = MNIST(os.getcwd(), download=True, transform=ToTensor())
train_loader = utils.data.DataLoader(dataset)

# train the model
trainer = L.Trainer(limit_train_batches=100, max_epochs=1, num_nodes=2, devices=4, strategy="ddp")
trainer.fit(model=autoencoder, train_dataloaders=train_loader)
SLURM batch script:
#!/bin/bash
#SBATCH -p mi1004x
#SBATCH --nodes=2             # This needs to match Trainer(num_nodes=...)
#SBATCH --ntasks-per-node=4   # This needs to match Trainer(devices=...)
#SBATCH --time=0-00:30:00
#SBATCH -e slurm-%j.err

source ~/miniconda3/bin/activate pylight
# run script from above
srun python train.py
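Since Lightning's SLURM detection derives ranks and the rendezvous address from environment variables, it can also help to print what each task actually sees. Below is a small sketch, assuming it is launched under the same #SBATCH header with srun python env_report.py; the file name and the variable list are illustrative.

# env_report.py -- print the SLURM/rendezvous variables visible to each task
import os
import socket

keys = [
    "SLURM_JOB_ID", "SLURM_NNODES", "SLURM_NTASKS", "SLURM_NTASKS_PER_NODE",
    "SLURM_NODEID", "SLURM_PROCID", "SLURM_LOCALID",
    "MASTER_ADDR", "MASTER_PORT", "NCCL_SOCKET_IFNAME", "NCCL_DEBUG",
]
print(socket.gethostname(), {k: os.environ.get(k) for k in keys}, flush=True)

All eight tasks should report SLURM_NTASKS=8 and distinct SLURM_PROCID values from 0 to 7; if the tasks on the second node print nothing or a different world size, the hang is explained before PyTorch is even involved.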
Error messages and logs
Initializing distributed: GLOBAL_RANK: 1, MEMBER: 2/8
Initializing distributed: GLOBAL_RANK: 2, MEMBER: 3/8
Initializing distributed: GLOBAL_RANK: 3, MEMBER: 4/8
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
You are using a CUDA device ('AMD Instinct MI100') that has Tensor Cores. To properly utilize them, you should set `torch.set_float32_matmul_precision('medium' | 'high')` which will trade-off precision for performance. For more details, read https://pytorch.org/docs/stable/generated/torch.set_float32_matmul_precision.html#torch.set_float32_matmul_precision
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/8
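In the log above only four of the eight members (global ranks 0-3) ever report in, which suggests the processes on the second node never reach the rendezvous, and the default 30-minute process-group timeout leaves the job silently stuck. One way to surface an actual error instead of a hang is to pass an explicit DDPStrategy with a short timeout; a sketch, assuming the same Trainer arguments as in the training script above (the 120-second value is arbitrary):

# Sketch: make the stuck initialization fail loudly instead of hanging,
# by giving the process group a short timeout.
from datetime import timedelta

import lightning as L
from lightning.pytorch.strategies import DDPStrategy

trainer = L.Trainer(
    limit_train_batches=100,
    max_epochs=1,
    num_nodes=2,
    devices=4,
    strategy=DDPStrategy(process_group_backend="nccl", timeout=timedelta(seconds=120)),
)

With this in place the job should abort with a timeout error instead of hanging indefinitely, which is easier to report and act on.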
Environment
Current environment
CUDA:
GPU:
AMD Instinct MI100
AMD Instinct MI100
AMD Instinct MI100
AMD Instinct MI100
available: True
version: None
Lightning:
lightning: 2.2.1
lightning-utilities: 0.11.2
pytorch-lightning: 2.2.1
pytorch-triton-rocm: 2.2.0
torch: 2.2.0+rocm5.6
torchaudio: 2.2.0+rocm5.6
torchmetrics: 1.3.2
torchvision: 0.17.0+rocm5.6
Packages:
absl-py: 2.1.0
aiohttp: 3.9.3
aiosignal: 1.3.1
annotated-types: 0.6.0
async-timeout: 4.0.3
attrs: 23.2.0
certifi: 2022.12.7
charset-normalizer: 2.1.1
deepspeed: 0.14.0
filelock: 3.9.0
frozenlist: 1.4.1
fsspec: 2023.4.0
future: 1.0.0
grpcio: 1.62.1
hjson: 3.1.0
idna: 3.4
imageio: 2.34.0
jinja2: 3.1.2
lightning: 2.2.1
lightning-utilities: 0.11.2
markdown: 3.6
markupsafe: 2.1.3
mpmath: 1.3.0
multidict: 6.0.5
networkx: 3.2.1
ninja: 1.11.1.1
numpy: 1.26.3
packaging: 24.0
pandas: 2.2.1
pillow: 10.2.0
pip: 23.3.1
protobuf: 5.26.1
psutil: 5.9.8
py-cpuinfo: 9.0.0
pydantic: 2.7.0
pydantic-core: 2.18.1
pynvml: 11.5.0
python-dateutil: 2.9.0.post0
pytorch-lightning: 2.2.1
pytorch-triton-rocm: 2.2.0
pytz: 2024.1
pyyaml: 6.0.1
requests: 2.28.1
setuptools: 68.2.2
six: 1.16.0
sympy: 1.12
tensorboard: 2.16.2
tensorboard-data-server: 0.7.2
test-tube: 0.7.5
torch: 2.2.0+rocm5.6
torchaudio: 2.2.0+rocm5.6
torchmetrics: 1.3.2
torchvision: 0.17.0+rocm5.6
tqdm: 4.66.2
typing-extensions: 4.8.0
tzdata: 2024.1
urllib3: 1.26.13
werkzeug: 3.0.1
wheel: 0.41.2
yarl: 1.9.4
System:
OS: Linux
architecture:
64bit
ELF
processor: x86_64
python: 3.10.14
release: 5.14.0-162.18.1.el9_1.x86_64
version: SMP PREEMPT_DYNAMIC Wed Mar 1 22:02:24 UTC 2023
More info
No response