
Logic bug in CUDA memory interception #320

Open
coldzerofear opened this issue May 24, 2024 · 1 comment
Comments

@coldzerofear

When deploying LLaMA-Factory in a container, the CUDA_VISIBLE_DEVICES environment variable is used to select which GPUs the workload runs on. If the container is allocated 2 GPUs and CUDA_VISIBLE_DEVICES is explicitly set to a non-zero card, the memory usage reported by nvidia-smi is always charged to GPU 0, when it should actually appear on GPU 1.

root@chenweiyi-ed43f-0:/mnt/chenweiyi/LLaMA-Factory# nvidia-smi 
Fri May 24 17:59:35 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03             Driver Version: 535.129.03   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A100-SXM4-80GB          Off | 00000000:1A:00.0 Off |                  Off |
| N/A   83C    P0             146W / 400W |   6858MiB /  8192MiB |     99%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA A100-SXM4-80GB          Off | 00000000:88:00.0 Off |                    0 |
| N/A   86C    P0             235W / 400W |      0MiB /  8192MiB |     97%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
+---------------------------------------------------------------------------------------+

From the host, you can see that the process is actually running on GPU 1.

Analysis: CUDA reads the CUDA_VISIBLE_DEVICES environment variable to determine the logical device order (0, 1, 2, 3, ...), whereas the NVML library ignores this variable when querying devices.
The HAMi interception library determines the current device ordinal via cuCtxGetDevice(&dev); under these conditions that logical ordinal is offset from the physical one.

Shouldn't the device order be determined by the device UUID in this case?

@archlitchi
Collaborator

Thanks for submitting this issue, we will look into it soon.
