The limitations on gpumem and gpucores are not working correctly. #332
Comments
What is the output of 'nvidia-smi' inside the container?
nvidia-smi
nvidia-smi -a
I see. Does the GPU utilization fluctuate between 0 and 100 during execution, or is it stable at 100?
It's always stable at 100%.
Could you try the tensorflow/pytorch benchmarks (https://github.com/tensorflow/benchmarks)? HAMi-core implements the gpucores limitation by blocking new kernels from launching, so if a very large kernel has already been submitted, we can't do anything to limit its utilization.
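For example, the TensorFlow benchmark can be started inside the container roughly like this (a sketch only; it assumes Python and TensorFlow are available in the image, and the flags are the ones from the benchmarks README):
git clone https://github.com/tensorflow/benchmarks.git
cd benchmarks/scripts/tf_cnn_benchmarks
# Launches many short-lived kernels, which gives the gpucores limiter a chance to throttle between launches.
python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=32 --model=resnet50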
I'll try. Additionally, when setting
the log is:
I copied these logs from inside the pod. I don't know how to capture the full entries, because they don't show up in the pod log (via the console), so I can't write them to a log file.
@thungrac
or
This way, the output.txt file will contain all output from your program.
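A sketch of such a redirection (only the output.txt name is taken from the comment above; the exact commands are assumed):
./gpu_burn 1000 > output.txt 2>&1
# or, to keep the output on the console while also saving it:
./gpu_burn 1000 2>&1 | tee output.txt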
@haitwang-cloud Many thanks. Here is the output when I run
1. Issue or feature description
Hi HAMi team,
I configured a pod with the oguzpastirmaci/gpu-burn image to test GPU burn, but when the burn runs it still drives the GPU to 100% utilization (against the 20-core configuration) and exceeds the memory limit (1332 MiB against the 1000 MiB configuration).
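For reference, a pod spec along these lines expresses that configuration. The resource names follow HAMi's documented extended resources; the pod/container names and the sleep command are only assumptions for illustration:
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: gpu-burn-test               # hypothetical pod name
spec:
  containers:
  - name: gpu-burn
    image: oguzpastirmaci/gpu-burn
    command: ["sleep", "infinity"]  # keep the container alive so the burn can be started by hand
    resources:
      limits:
        nvidia.com/gpu: 1           # one vGPU slice
        nvidia.com/gpucores: 20     # cap SM utilization at roughly 20%
        nvidia.com/gpumem: 1000     # device memory cap matching the 1000 MiB configuration
EOF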
2. Steps to reproduce the issue
Deploy the pod (for example with a spec like the sketch above).
Then run the burn:
./gpu_burn 1000
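To watch whether the limits take effect while the burn runs, something like this can be used (the pod name gpu-burn-test is the hypothetical one from the spec sketch above, and the path to gpu_burn depends on the image):
kubectl exec -it gpu-burn-test -- ./gpu_burn 1000
# In a second terminal, check utilization and memory as seen from inside the pod:
kubectl exec -it gpu-burn-test -- nvidia-smi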
3. Information to attach (optional if deemed irrelevant)
Common error checking:
The output of nvidia-smi -a on your host
Docker configuration file (e.g. /etc/docker/daemon.json)
Additional information that might help better understand your environment and reproduce the bug:
k8s version: v1.27.12+rke2r1
linux OS version: Ubuntu 22.04.4 LTS
linux kernel version: 5.15.0-102-generic
container runtime: containerd://1.7.11-k3s2
nvidia-container-runtime: runc version 1.1.12
nvidia driver: 550.67
cuda version: 12.4