
Unable to run Neuralangelo; NVML not supported #15

Closed · mitdave95 opened this issue Aug 14, 2023 · 3 comments
Labels
bug Something isn't working

Comments

mitdave95 commented Aug 14, 2023

I'm getting the error below when running:

torchrun --nproc_per_node=1 train.py --logdir=logs/sample/toy_example --config=projects/neuralangelo/configs/custom/toy_example.yaml --show_pbar
Traceback (most recent call last):
  File "train.py", line 104, in <module>
    main()
  File "train.py", line 46, in main
    set_affinity(args.local_rank)
  File "/data/imaginaire/utils/gpu_affinity.py", line 74, in set_affinity
    os.sched_setaffinity(0, dev.get_cpu_affinity())
  File "/data/imaginaire/utils/gpu_affinity.py", line 50, in get_cpu_affinity
    for j in pynvml.nvmlDeviceGetCpuAffinity(self.handle, Device._nvml_affinity_elements):
  File "/usr/local/lib/python3.8/dist-packages/pynvml/nvml.py", line 1745, in nvmlDeviceGetCpuAffinity
    _nvmlCheckReturn(ret)
  File "/usr/local/lib/python3.8/dist-packages/pynvml/nvml.py", line 765, in _nvmlCheckReturn
    raise NVMLError(ret)
pynvml.nvml.NVMLError_NotSupported: Not Supported
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 442) of binary: /usr/bin/python
Traceback (most recent call last):
  File "/usr/local/bin/torchrun", line 33, in <module>
    sys.exit(load_entry_point('torch==2.1.0a0+fe05266', 'console_scripts', 'torchrun')())
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/run.py", line 794, in main
    run(args)
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
train.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-08-14_16:29:36
  host      : c7c816135a1c
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 442)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

Running on Windows 11 with an RTX 4090, from WSL Ubuntu 22.04.02, with the --gpus all flag.

chenhsuanlin (Contributor) commented

Hi @mitdave95, could you try commenting out this line? This is an optional function that sets the processor affinity. If this resolves your issue, I can push a hotfix. Thanks!
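
For reference, a minimal sketch of what that change amounts to, assuming the train.py from the traceback above (where main() calls set_affinity(args.local_rank)); wrapping the call is an alternative to commenting it out, since the affinity setup is optional:

    # Sketch only, not the actual hotfix: make the optional CPU-affinity setup
    # non-fatal on platforms where NVML queries are unsupported (e.g. WSL).
    # Import paths are assumed from the file paths shown in the traceback.
    import pynvml
    from imaginaire.utils.gpu_affinity import set_affinity

    try:
        set_affinity(args.local_rank)
    except pynvml.NVMLError:
        # pynvml raises NVMLError_NotSupported here on WSL; skip affinity setup.
        print("CPU affinity setup not supported on this platform; skipping.")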

mitdave95 (Author) commented

@chenhsuanlin it worked, thanks! I also needed to comment out this line in extract_mesh.py.

chenhsuanlin (Contributor) commented

Fixed in 3b1b95f! Please feel free to reopen if the issue persists.
