When deploying LLaMA-Factory in a container, we select which GPUs to run on by setting the `CUDA_VISIBLE_DEVICES` environment variable. If the container is allocated 2 GPUs and `CUDA_VISIBLE_DEVICES` is explicitly set to a non-zero card, the memory usage reported by `nvidia-smi` is always attributed to GPU 0, when it should actually appear on GPU 1.
```
root@chenweiyi-ed43f-0:/mnt/chenweiyi/LLaMA-Factory# nvidia-smi
Fri May 24 17:59:35 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03             Driver Version: 535.129.03   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A100-SXM4-80GB         Off  | 00000000:1A:00.0 Off |                  Off |
| N/A   83C    P0            146W / 400W  |   6858MiB /  8192MiB |     99%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA A100-SXM4-80GB         Off  | 00000000:88:00.0 Off |                    0 |
| N/A   86C    P0            235W / 400W  |      0MiB /  8192MiB |     97%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
+---------------------------------------------------------------------------------------+
```
From the host, however, the workload is visibly running on GPU 1.
Analysis: CUDA reads the `CUDA_VISIBLE_DEVICES` environment variable to establish the logical device order 0, 1, 2, 3, ..., but the NVML library does not consult this variable when queried. The HAMi interception library determines the current device index via `cuCtxGetDevice(&dev);`, so in this situation the index is offset.
Should the device order be determined by device UUID in this case?
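To make the offset concrete, here is a minimal sketch of the remapping involved: with `CUDA_VISIBLE_DEVICES=1`, the CUDA runtime's logical device 0 (what `cuCtxGetDevice` returns) is actually physical GPU 1, the index NVML and `nvidia-smi` use. The `logical_to_physical` helper below is hypothetical, purely illustrative, and not HAMi code; it only handles integer ordinals in the variable, not UUID entries.

```python
import os

def logical_to_physical(logical_index, env=None):
    """Map a CUDA logical device ordinal (as returned by cuCtxGetDevice)
    to the physical index that NVML / nvidia-smi reports.

    Hypothetical helper for illustration: handles only integer ordinals
    in CUDA_VISIBLE_DEVICES, not UUID-style entries.
    """
    visible = env if env is not None else os.environ.get("CUDA_VISIBLE_DEVICES", "")
    if not visible.strip():
        # No masking: logical and physical orders coincide.
        return logical_index
    order = [int(tok) for tok in visible.split(",") if tok.strip().isdigit()]
    # CUDA renumbers the listed devices 0, 1, 2, ... in list order.
    return order[logical_index]

# With CUDA_VISIBLE_DEVICES=1, CUDA logical device 0 is physical GPU 1,
# which is why accounting done with the logical index lands on GPU 0:
print(logical_to_physical(0, env="1"))  # -> 1
```

A UUID-based lookup, as suggested above, would sidestep this mapping entirely, since NVML can resolve a device handle directly from the UUID regardless of `CUDA_VISIBLE_DEVICES`.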
Thanks for submitting this issue, we will check it soon.