The parameters resourceCores/resourceMem/resourceName do not work #356

Open

jxfruit opened this issue Jun 16, 2024 · 12 comments

@jxfruit commented Jun 16, 2024


1. Issue or feature description

As the title says, none of the self-defined resource names take effect.

2. Steps to reproduce the issue

Deployed with:
helm install hami hami-charts/hami --set resourceCores=xx/vcuda-core --set resourceMem=xx/vcuda-memory --set scheduler.kubeScheduler.imageTag=v1.22.8 -n kube-system

Pod YAML:

apiVersion: v1
kind: Pod
metadata:
  name: xlabfe73ef20d3cc329522779f35dd1ebaa4
spec:
  restartPolicy: OnFailure
  containers:
    - name: xlabfe73ef20d3cc329522779f35dd1ebaa4
      image: "swr.cn-east-3.myhuaweicloud.com/senseyu/exam-gpu:dev_1.0"
      command: ["bash", "-c", "sleep 100000"]
      resources:
        limits:
          #nvidia.com/gpu: 2
          #nvidia.com/gpumem: 3000
          #nvidia.com/gpucores: 33
          #xx/vcuda-memory: 1
          xx/vcuda-core: 1
          #nvidia.com/gpumem: 3000

The pod failed with the following errors:
Error: endpoint not found in cache for a registered resource: xx/vcuda-core
Allocate failed due to can't allocate unregistered device xx/vcuda-core, which is unexpected
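
Note that the scheduler ConfigMap below does list xx/vcuda-core and xx/vcuda-memory under managedResources, so the rename reached the scheduler; the admission error comes from the kubelet, which only accepts resources a device plugin has registered with it. Checking the device-plugin logs for registration failures may help (a sketch; the label selector is an assumption and may differ in your install):

# Look for resource-registration errors in the HAMi device plugin
# (label selector assumed; adjust it to match your deployment)
kubectl logs -n kube-system -l app.kubernetes.io/component=hami-device-plugin --tail=100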

The hami-scheduler-newversion ConfigMap:

apiVersion: v1
data:
  config.yaml: |
    apiVersion: kubescheduler.config.k8s.io/v1beta2
    kind: KubeSchedulerConfiguration
    leaderElection:
      leaderElect: false
    profiles:
    - schedulerName: hami-scheduler
    extenders:
    - urlPrefix: "https://127.0.0.1:443"
      filterVerb: filter
      bindVerb: bind
      nodeCacheCapable: true
      weight: 1
      httpTimeout: 30s
      enableHTTPS: true
      tlsConfig:
        insecure: true
      managedResources:
      - name: nvidia.com/gpu
        ignoredByScheduler: true
      - name: xx/vcuda-memory
        ignoredByScheduler: true
      - name: xx/vcuda-core
        ignoredByScheduler: true
      - name: nvidia.com/gpumem-percentage
        ignoredByScheduler: true
      - name: nvidia.com/priority
        ignoredByScheduler: true
      - name: cambricon.com/vmlu
        ignoredByScheduler: true
      - name: hygon.com/dcunum
        ignoredByScheduler: true
      - name: hygon.com/dcumem
        ignoredByScheduler: true
      - name: hygon.com/dcucores
        ignoredByScheduler: true
      - name: iluvatar.ai/vgpu
        ignoredByScheduler: true
      - name: huawei.com/Ascend910-memory
        ignoredByScheduler: true
      - name: huawei.com/Ascend910
        ignoredByScheduler: true
kind: ConfigMap
metadata:
  annotations:
    meta.helm.sh/release-name: hami
    meta.helm.sh/release-namespace: kube-system
  creationTimestamp: "2024-06-16T12:41:49Z"
  labels:
    app.kubernetes.io/component: hami-scheduler
    app.kubernetes.io/instance: hami
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: hami
    app.kubernetes.io/version: 0.0.2
    helm.sh/chart: hami-2.0.0
  name: hami-scheduler-newversion
  namespace: kube-system
  resourceVersion: "72005391"
  uid: 7c8697ce-114c-4e16-9732-d4ccf7290e6b

3. Information to attach (optional if deemed irrelevant)

HAMi image version: both v2.3.12 and the latest fail.
Common error checking:

  • The output of nvidia-smi -a on your host
  • Your docker or containerd configuration file (e.g: /etc/docker/daemon.json)
  • The vgpu-device-plugin container logs
  • The vgpu-scheduler container logs
  • The kubelet logs on the node (e.g: sudo journalctl -r -u kubelet)

Additional information that might help better understand your environment and reproduce the bug:

  • Docker version from docker version
  • Docker command, image and tag used
  • Kernel version from uname -a
  • Any relevant kernel output lines from dmesg
@archlitchi (Collaborator) commented:

Thanks, we will fix that soon.

@lengrongfu (Member) commented:

@archlitchi are you working on a fix for this issue?

@lengrongfu (Member) commented:

@jxfruit can you try again with the YAML below?

apiVersion: v1
kind: Pod
metadata:
  name: xlabfe73ef20d3cc329522779f35dd1ebaa4
spec:
  restartPolicy: OnFailure
  containers:
    - name: xlabfe73ef20d3cc329522779f35dd1ebaa4
      image: "swr.cn-east-3.myhuaweicloud.com/senseyu/exam-gpu:dev_1.0"
      command: ["bash", "-c", "sleep 100000"]
      resources:
        limits:
          nvidia.com/gpu: 1
          xx/vcuda-core: 1

@jxfruit (Author) commented Jun 18, 2024

@lengrongfu it still fails when the pod is created, with the error:
Node didn't have enough resource: xx/vcuda-core, requested: 1, used: 0, capacity: 0

By the way, the docs (https://github.com/Project-HAMi/HAMi/blob/master/docs/config.md) say the install parameter devicePlugin.deviceSplitCount will generate N vGPUs, but after setting it, creating a pod that uses those vGPUs still fails.

So is there something I missed?
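
The capacity: 0 in that error suggests the node never advertised xx/vcuda-core at all. A quick way to confirm what the node actually registers (a sketch; substitute your node name):

# Registered extended resources such as xx/vcuda-core should show up
# under both capacity and allocatable if the device plugin is healthy
kubectl get node <node-name> -o jsonpath='{.status.capacity}{"\n"}{.status.allocatable}{"\n"}'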

@lengrongfu (Member) commented:

Please provide the following information:

  • The node YAML: does the node register nvidia.com/gpu: 10 under allocatable and capacity?
  • The HAMi version you are using.
  • Which component emits the error Node didn't have enough resource: xx/vcuda-core, requested: 1, used: 0, capacity: 0?

@jxfruit (Author) commented Jun 18, 2024

  • [image: node resource screenshot]
    apiVersion: v1
    kind: Pod
    metadata:
      name: xlabfe73ef20d3cc329522779f35dd1ebaa4
    spec:
      restartPolicy: OnFailure
      containers:
        - name: xlabfe73ef20d3cc329522779f35dd1ebaa4
          image: "swr.cn-east-3.myhuaweicloud.com/senseyu/exam-gpu:dev_1.0"
          command: ["bash", "-c", "sleep 100000"]
          resources:
            limits:
              nvidia.com/gpu: 10

  • HAMi image version: both v2.3.12 and the latest fail.

  • I am confused. I installed HAMi with:
    helm install hami hami-charts/hami --set resourceCores=xx/vcuda-core --set resourceMem=xx/vcuda-memory --set scheduler.kubeScheduler.imageTag=v1.22.8 -n kube-system
    and created the pod with:
    apiVersion: v1
    kind: Pod
    metadata:
      name: xlabfe73ef20d3cc329522779f35dd1ebaa4
    spec:
      restartPolicy: OnFailure
      containers:
        - name: xlabfe73ef20d3cc329522779f35dd1ebaa4
          image: "swr.cn-east-3.myhuaweicloud.com/senseyu/exam-gpu:dev_1.0"
          command: ["bash", "-c", "sleep 100000"]
          resources:
            limits:
              nvidia.com/gpu: 1
              xx/vcuda-core: 1

kubectl describe pod xlabfe73ef20d3cc329522779f35dd1ebaa4 shows the error:
Name: xlabfe73ef20d3cc329522779f35dd1ebaa4
Namespace: default
Priority: 0
Node: inter-app2/
Start Time: Tue, 18 Jun 2024 15:36:08 +0800
Labels:
Annotations: hami.io/bind-phase: allocating
hami.io/bind-time: 1718696168
hami.io/vgpu-devices-allocated: GPU-5a6f5eeb-ad9e-7ac9-a40f-30ba7bbc5713,NVIDIA,15109,1:;
hami.io/vgpu-devices-to-allocate: GPU-5a6f5eeb-ad9e-7ac9-a40f-30ba7bbc5713,NVIDIA,15109,1:;
hami.io/vgpu-node: inter-app2
hami.io/vgpu-time: 1718696168
Status: Failed
Reason: UnexpectedAdmissionError
Message: Pod Allocate failed due to can't allocate unregistered device xx/vcuda-core, which is unexpected

The message then changed. After a few more attempts I got:
Warning FailedScheduling 25s hami-scheduler binding rejected: node inter-app2 has been locked within 5 minutes
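
The "locked within 5 minutes" message comes from HAMi's node-lock mechanism, which serializes bindings by annotating the node; the failed admissions above can leave a stale lock behind. Inspecting the node annotations shows whether one is present (a sketch; the exact lock annotation key, e.g. hami.io/mutex.lock, is an assumption and may vary by version):

# Dump the node's annotations; a leftover HAMi lock entry
# (key assumed to be hami.io/mutex.lock) points at a stale lock
kubectl get node inter-app2 -o jsonpath='{.metadata.annotations}'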

@lengrongfu (Member) commented Jun 18, 2024

What is nvidia.com/gpu: 10 in your limits supposed to be? I see one pod with a value of 10 and another with a value of 1.

This field in the limits cannot exceed the number of physical devices on the node; see the sketch below.
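
For example, on a node with a single physical GPU, a per-container request like the following stays within the physical device count while still slicing cores and memory (a minimal sketch using the renamed resources from this thread; the core and memory values are illustrative):

resources:
  limits:
    nvidia.com/gpu: 1      # must not exceed the node's physical GPU count
    xx/vcuda-core: 50      # illustrative: 50% of one GPU's cores
    xx/vcuda-memory: 3000  # illustrative: device memory in MB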

@jxfruit (Author) commented Jun 19, 2024

@lengrongfu the two pods are separate cases; I was just answering your questions. With only nvidia.com/gpu: 10, the pod stays Pending because there are not enough resources to allocate. When xx/vcuda-core is added, the pod fails with: Pod Allocate failed due to can't allocate unregistered device xx/vcuda-core, which is unexpected

@lengrongfu (Member) commented Jun 20, 2024

I tested this case and could not reproduce it:

[image]

kind: Deployment
apiVersion: apps/v1
metadata:
  name: vgpu-deployment
  namespace: default
spec:
  replicas: 1
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
        - name: vgpu-test
          image: chrstnhntschl/gpu_burn
          args:
            - '6000'
          resources:
            limits:
              nvidia.com/vgpu: '1'
              test/vcuda-core: '50'
              test/vcuda-memory: '4680'

@jxfruit (Author) commented Jun 21, 2024

> I tested this case and could not reproduce it: [image + deployment YAML quoted above]

Can you share your config.yaml or the install command? I'll try it again.

@lengrongfu (Member) commented:

helm -n nvidia-vgpu get values nvidia-vgpu prints the values YAML below, which you can refer to; I could not find the original helm install command (a reconstructed equivalent follows the values).
hami:
  ascendResourceMem: huawei.com/Ascend910-memory
  ascendResourceName: huawei.com/Ascend910
  dcuResourceCores: hygon.com/dcucores
  dcuResourceMem: hygon.com/dcumem
  dcuResourceName: hygon.com/dcunum
  devicePlugin:
    deviceCoreScaling: 1
    deviceMemoryScaling: 1
    deviceSplitCount: 10
    disablecorelimit: "false"
    extraArgs:
    - -v=false
    hygonImageRepository: 4pdosc/vdcu-device-plugin
    hygonImageTag: v1.0
    hygondriver: /root/dcu-driver/dtk-22.10.1-vdcu
    hygonimage: 4pdosc/vdcu-device-plugin:v1.0
    hygonnodeSelector:
      dcu: "on"
    imagePullPolicy: IfNotPresent
    libPath: /usr/local/vgpu
    migStrategy: none
    mlunodeSelector:
      mlu: "on"
    monitorctrPath: /usr/local/vgpu/containers
    monitorimage: projecthami/hami
    nvidianodeSelector:
      nvidia.com/gpu.deploy.container-toolkit: "true"
      nvidia.com/vgpu.deploy.device-plugin: "true"
    pluginPath: /var/lib/kubelet/device-plugins
    podAnnotations: {}
    registry: docker.m.daocloud.io
    repository: projecthami/hami
    runtimeClassName: ""
    service:
      httpPort: 31992
    tolerations: []
  fullnameOverride: ""
  global:
    annotations: {}
    gpuHookPath: /usr/local
    labels: {}
  iluvatarResourceCore: iluvatar.ai/vcuda-core
  iluvatarResourceMem: iluvatar.ai/vcuda-memory
  iluvatarResourceName: iluvatar.ai/vgpu
  imagePullSecrets: []
  mluResourceCores: cambricon.com/mlu.smlu.vcore
  mluResourceMem: cambricon.com/mlu.smlu.vmemory
  mluResourceName: cambricon.com/vmlu
  nameOverride: ""
  podSecurityPolicy:
    enabled: false
  resourceCores: test/vcuda-core
  resourceMem: test/vcuda-memory
  resourceMemPercentage: nvidia.com/gpumem-percentage
  resourceName: nvidia.com/vgpu
  resourcePriority: nvidia.com/priority
  resources:
    limits:
      cpu: 500m
      memory: 720Mi
    requests:
      cpu: 100m
      memory: 128Mi
  scheduler:
    customWebhook:
      enabled: false
      host: 127.0.0.1
      path: /webhook
      port: 31998
    defaultCores: 0
    defaultGPUNum: 1
    defaultMem: 0
    defaultSchedulerPolicy:
      gpuSchedulerPolicy: spread
      nodeSchedulerPolicy: binpack
    extender:
      extraArgs:
      - --debug
      - -v=4
      imagePullPolicy: IfNotPresent
      registry: docker.m.daocloud.io
      repository: projecthami/hami
    kubeScheduler:
      enabled: true
      extraArgs:
      - --policy-config-file=/config/config.json
      - -v=4
      extraNewArgs:
      - --config=/config/config.yaml
      - -v=4
      imagePullPolicy: IfNotPresent
      imageTag: v1.24.0
      registry: k8s-gcr.m.daocloud.io
      repository: kubernetes/kube-scheduler
    leaderElect: true
    metricsBindAddress: :9395
    mutatingWebhookConfiguration:
      failurePolicy: Ignore
    nodeName: ""
    nodeSelector:
      nvidia.com/gpu.deploy.container-toolkit: "true"
    patch:
      imagePullPolicy: IfNotPresent
      newRepository: liangjw/kube-webhook-certgen
      newTag: v1.1.1
      nodeSelector: {}
      podAnnotations: {}
      priorityClassName: ""
      registry: docker.io
      repository: jettech/kube-webhook-certgen
      runAsUser: 2000
      tag: v1.5.2
      tolerations: []
    podAnnotations: {}
    service:
      annotations: {}
      httpPort: 443
      labels: {}
      monitorPort: 31993
      schedulerPort: 31998
    serviceMonitor:
      enable: false
    tolerations: []
  schedulerName: hami-scheduler
  version: v2.3.11
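
For reference, the values above should correspond roughly to an install command along these lines (a reconstruction, not the original command; the release name and namespace are taken from the helm get values invocation above):

helm install nvidia-vgpu hami-charts/hami -n nvidia-vgpu \
  --set resourceName=nvidia.com/vgpu \
  --set resourceCores=test/vcuda-core \
  --set resourceMem=test/vcuda-memory \
  --set devicePlugin.deviceSplitCount=10 \
  --set scheduler.kubeScheduler.imageTag=v1.24.0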

@jxfruit (Author) commented Jun 22, 2024

Sorry, deploying with your values.yaml failed; the chart is incomplete. Can you share the complete Helm chart?

By the way, in my testing the resourceName field does work with a custom value. However, it cannot allocate more vGPUs than the current node actually has.
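
On the last point: devicePlugin.deviceSplitCount sets how many tasks may share one physical GPU, so a node with one physical GPU and deviceSplitCount: 10 should advertise 10 schedulable vGPUs, yet a single container still cannot request more vGPUs than there are physical devices. A sketch of the expected node status under those assumptions (illustrative values, assuming resourceName is set to nvidia.com/vgpu as in the values above):

# Expected on a node with 1 physical GPU and deviceSplitCount: 10
status:
  capacity:
    nvidia.com/vgpu: "10"
  allocatable:
    nvidia.com/vgpu: "10"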
