
cuda device with a number ignored #301

Open
bernstei opened this issue Jan 18, 2024 · 1 comment

Comments

@bernstei
Collaborator

bernstei commented Jan 18, 2024

From what we can tell from experimenting, passing a device such as cuda:2 to run_train.py doesn't seem to work: it appears to still use device 0. (Note that I had to patch the CLI argument parser to accept strings like cuda:N, which I'd be happy to share.) I'd have expected to see torch.cuda.set_device(N) somewhere, e.g. in

def init_device(device_str: str) -> torch.device:

Instead, it looks like the device string, including the :N suffix, is passed directly to various torch calls throughout the code.
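A minimal sketch of the handling being suggested, assuming a hypothetical helper name parse_device_index; only the string parsing is shown concretely, and the torch.cuda.set_device call is indicated in a comment since it requires a CUDA-enabled torch build:

```python
def parse_device_index(device_str: str) -> int:
    """Extract the device index from a string like 'cuda:2' (default 0)."""
    if ":" in device_str:
        _, index = device_str.split(":", 1)
        return int(index)
    return 0

# Inside init_device, one could then make that index the current device:
#   if device_str.startswith("cuda"):
#       torch.cuda.set_device(parse_device_index(device_str))
#   return torch.device(device_str)
```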

Has anyone actually tested this functionality?

Note that setting CUDA_VISIBLE_DEVICES before running run_train is sufficient for us, so maybe this isn't important and the issue can be closed, but having code that silently does the wrong thing seems bad.
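The workaround above can be sketched as follows; the exact run_train.py flags are illustrative, not necessarily the script's actual interface:

```shell
# Restrict visibility to physical GPU 2; torch then sees a single
# device and addresses it as cuda:0, so no cuda:N parsing is needed.
CUDA_VISIBLE_DEVICES=2 python run_train.py --device cuda
```

This sidesteps the parsing question entirely, at the cost of having to manage device selection outside the script.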

@chiku-parida

Could you please share the specific tags and modifications needed to run multi-GPU training?
