
cuda device with a number ignored #301

Open
bernstei opened this issue Jan 18, 2024 · 1 comment

Comments

@bernstei
Collaborator

bernstei commented Jan 18, 2024

From what we can tell from experimenting, passing a device such as cuda:2 to run_train.py doesn't seem to work: it appears to still use device 0. (Note that I had to patch the CLI argument parser to accept strings like cuda:N, which I'd be happy to share.) I'd have expected to see torch.cuda.set_device(N) somewhere, e.g. in

def init_device(device_str: str) -> torch.device:

Instead, it looks like the device string, including the :N suffix, is passed directly to various torch calls throughout the code.
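A minimal sketch of the handling being suggested, assuming a hypothetical helper name parse_device_index; only the string parsing is shown concretely, and the torch.cuda.set_device call is indicated in a comment since it requires a CUDA-enabled torch build:

```python
def parse_device_index(device_str: str) -> int:
    """Extract the device index from a string like 'cuda:2' (default 0)."""
    if ":" in device_str:
        _, index = device_str.split(":", 1)
        return int(index)
    return 0

# Inside init_device, one could then make that index the current device:
#   if device_str.startswith("cuda"):
#       torch.cuda.set_device(parse_device_index(device_str))
#   return torch.device(device_str)
```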

Has anyone actually tested this functionality?

Note that setting CUDA_VISIBLE_DEVICES before running run_train is sufficient for us, so maybe this isn't important and the issue can be closed, but having code that silently does the wrong thing seems bad.
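The workaround above can be sketched as follows; the exact run_train.py flags are illustrative, not necessarily the script's actual interface:

```shell
# Restrict visibility to physical GPU 2; torch then sees a single
# device and addresses it as cuda:0, so no cuda:N parsing is needed.
CUDA_VISIBLE_DEVICES=2 python run_train.py --device cuda
```

This sidesteps the parsing question entirely, at the cost of having to manage device selection outside the script.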

@chiku-parida

Could you please share the specific tags and modifications needed to run multi-GPU training?
