The parameters resourceCores/resourceMem/resourceName do not work #356
Comments
Thanks, we will fix that soon.
@archlitchi are you working on fixing this issue?
@jxfruit can you try again using the YAML content below to test?

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: xlabfe73ef20d3cc329522779f35dd1ebaa4
spec:
  restartPolicy: OnFailure
  containers:
    - name: xlabfe73ef20d3cc329522779f35dd1ebaa4
      image: "swr.cn-east-3.myhuaweicloud.com/senseyu/exam-gpu:dev_1.0"
      command: ["bash", "-c", "sleep 100000"]
      resources:
        limits:
          nvidia.com/gpu: 1
          xx/vcuda-core: 1
```
@lengrongfu it still failed when creating the pod, with an error. By the way, the doc (https://github.com/Project-HAMi/HAMi/blob/master/docs/config.md) says the install parameter 'devicePlugin.deviceSplitCount' will generate N vGPUs, but after setting it, creating a pod that uses a vGPU still fails. Is there something I missed?
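For reference, the split count mentioned in the doc is set at install time. A minimal values sketch, assuming the key names from the HAMi config doc linked above (the numbers here are illustrative only):

```yaml
# Illustrative values for `helm install hami hami-charts/hami -f values.yaml`.
# Key name follows the HAMi config doc; the value is an example, not a recommendation.
devicePlugin:
  deviceSplitCount: 10   # each physical GPU is advertised as 10 vGPUs
```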
Hoping you can provide the info below:
What is the value of the `nvidia.com/gpu: 10` field in the limits config? I see one pod with the value 10 and another pod with the value 1. This field in limits cannot be more than the number of native devices on the node.
@lengrongfu the two pods are different situations; I was just responding to your questions. When only `nvidia.com/gpu: 10` is used, the pod stays Pending because not enough resources can be allocated. When `xx/vcuda-core` is added, the pod throws the error `Pod Allocate failed due to can't allocate unregistered device xx/vcuda-core, which is unexpected`.
I tested this case and could not reproduce it:

```yaml
kind: Deployment
apiVersion: apps/v1
metadata:
  name: vgpu-deployment
  namespace: default
spec:
  replicas: 1
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
        - name: vgpu-test
          image: chrstnhntschl/gpu_burn
          args:
            - '6000'
          resources:
            limits:
              nvidia.com/vgpu: '1'
              test/vcuda-core: '50'
              test/vcuda-memory: '4680'
```
Can you share your config.yaml or install command?
Sorry, deploying with your values.yaml failed; the chart is incomplete. Can you share the complete Helm charts used for installation? By the way, in my testing, the resourceName field does work with a custom value. However, it cannot allocate more vGPUs than the current node provides.
The template below is mostly useful for bug reports and support questions. Feel free to remove anything which doesn't apply to you and add more information where it makes sense.
1. Issue or feature description
As the title describes, none of the self-defined resource values work.
2. Steps to reproduce the issue
Deployed with:

```shell
helm install hami hami-charts/hami --set resourceCores=xx/vcuda-core --set resourceMem=xx/vcuda-memory --set scheduler.kubeScheduler.imageTag=v1.22.8 -n kube-system
```
yaml:
The task above got this error:

```
Error: endpoint not found in cache for a registered resource: xx/vcuda-core
Allocate failed due to can't allocate unregistered device xx/vcuda-core, which is unexpected
```
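For what it's worth, the "unregistered device" error suggests the limit keys in the pod spec do not match the resource names the device plugin registered. A sketch of how the install flags and pod limits are expected to line up (names copied from the helm command above; the numbers are illustrative):

```yaml
# Pod-level resource block; keys must match the names passed at install time.
resources:
  limits:
    nvidia.com/gpu: 1      # resourceName (left at its default here)
    xx/vcuda-core: 50      # must match --set resourceCores=xx/vcuda-core
    xx/vcuda-memory: 4680  # must match --set resourceMem=xx/vcuda-memory
```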
ConfigMap for hami-scheduler-newversion:
3. Information to attach (optional if deemed irrelevant)
HAMi image version: both v2.3.12 and the latest failed.
Common error checking:
- The output of `nvidia-smi -a` on your host
- Your docker configuration file (e.g. /etc/docker/daemon.json)
- The kubelet logs on the node (e.g. `sudo journalctl -r -u kubelet`)

Additional information that might help better understand your environment and reproduce the bug:
- Docker version from `docker version`
- Kernel version from `uname -a`
- Any relevant kernel output lines from `dmesg`