
Unable to run Neuralangelo; NVML not supported #15

Closed · mitdave95 opened this issue Aug 14, 2023 · 3 comments
Labels
bug Something isn't working

Comments

mitdave95 commented Aug 14, 2023

I'm getting the error below when running:

torchrun --nproc_per_node=1 train.py --logdir=logs/sample/toy_example --config=projects/neuralangelo/configs/custom/toy_example.yaml --show_pbar
Traceback (most recent call last):
  File "train.py", line 104, in <module>
    main()
  File "train.py", line 46, in main
    set_affinity(args.local_rank)
  File "/data/imaginaire/utils/gpu_affinity.py", line 74, in set_affinity
    os.sched_setaffinity(0, dev.get_cpu_affinity())
  File "/data/imaginaire/utils/gpu_affinity.py", line 50, in get_cpu_affinity
    for j in pynvml.nvmlDeviceGetCpuAffinity(self.handle, Device._nvml_affinity_elements):
  File "/usr/local/lib/python3.8/dist-packages/pynvml/nvml.py", line 1745, in nvmlDeviceGetCpuAffinity
    _nvmlCheckReturn(ret)
  File "/usr/local/lib/python3.8/dist-packages/pynvml/nvml.py", line 765, in _nvmlCheckReturn
    raise NVMLError(ret)
pynvml.nvml.NVMLError_NotSupported: Not Supported
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 442) of binary: /usr/bin/python
Traceback (most recent call last):
  File "/usr/local/bin/torchrun", line 33, in <module>
    sys.exit(load_entry_point('torch==2.1.0a0+fe05266', 'console_scripts', 'torchrun')())
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/run.py", line 794, in main
    run(args)
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
train.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-08-14_16:29:36
  host      : c7c816135a1c
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 442)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

Running on Windows 11 with an RTX 4090, from WSL Ubuntu 22.04.02, with the --gpus all flag.

chenhsuanlin (Contributor) commented

Hi @mitdave95, could you try commenting out this line? This is an optional function that sets the processor affinity. If this resolves your issue, I can push a hotfix. Thanks!
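
For reference, a minimal sketch of what that change amounts to, assuming the train.py from the traceback above (where main() calls set_affinity(args.local_rank)); wrapping the call is an alternative to commenting it out, since the affinity setup is optional:

    # Sketch only, not the actual hotfix: make the optional CPU-affinity setup
    # non-fatal on platforms where NVML queries are unsupported (e.g. WSL).
    # Import paths are assumed from the file paths shown in the traceback.
    import pynvml
    from imaginaire.utils.gpu_affinity import set_affinity

    try:
        set_affinity(args.local_rank)
    except pynvml.NVMLError:
        # pynvml raises NVMLError_NotSupported here on WSL; skip affinity setup.
        print("CPU affinity setup not supported on this platform; skipping.")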

mitdave95 (Author) commented

@chenhsuanlin it worked, thanks! I also needed to comment out this line in extract_mesh.py.

chenhsuanlin (Contributor) commented

Fixed in 3b1b95f! Please feel free to reopen if the issue persists.
