The parameters resourceCores/resourceMem/resourceName do not work #356

Open

jxfruit opened this issue Jun 16, 2024 · 12 comments

@jxfruit commented Jun 16, 2024


1. Issue or feature description

As the title says, none of the self-defined resource names take effect.

2. Steps to reproduce the issue

Deployed with:
helm install hami hami-charts/hami --set resourceCores=xx/vcuda-core --set resourceMem=xx/vcuda-memory --set scheduler.kubeScheduler.imageTag=v1.22.8 -n kube-system

Pod YAML:

apiVersion: v1
kind: Pod
metadata:
  name: xlabfe73ef20d3cc329522779f35dd1ebaa4
spec:
  restartPolicy: OnFailure
  containers:
    - name: xlabfe73ef20d3cc329522779f35dd1ebaa4
      image: "swr.cn-east-3.myhuaweicloud.com/senseyu/exam-gpu:dev_1.0"
      command: ["bash", "-c", "sleep 100000"]
      resources:
        limits:
          #nvidia.com/gpu: 2
          #nvidia.com/gpumem: 3000
          #nvidia.com/gpucores: 33
          #xx/vcuda-memory: 1
          xx/vcuda-core: 1
          #nvidia.com/gpumem: 3000

The pod failed with the following errors:
Error: endpoint not found in cache for a registered resource: xx/vcuda-core
Allocate failed due to can't allocate unregistered device xx/vcuda-core, which is unexpected
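
Note that the scheduler ConfigMap below does list xx/vcuda-core and xx/vcuda-memory under managedResources, so the rename reached the scheduler; the admission error comes from the kubelet, which only accepts resources a device plugin has registered with it. Checking the device-plugin logs for registration failures may help (a sketch; the label selector is an assumption and may differ in your install):

# Look for resource-registration errors in the HAMi device plugin
# (label selector assumed; adjust it to match your deployment)
kubectl logs -n kube-system -l app.kubernetes.io/component=hami-device-plugin --tail=100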

The hami-scheduler-newversion ConfigMap:

apiVersion: v1
data:
  config.yaml: |
    apiVersion: kubescheduler.config.k8s.io/v1beta2
    kind: KubeSchedulerConfiguration
    leaderElection:
      leaderElect: false
    profiles:
    - schedulerName: hami-scheduler
    extenders:
    - urlPrefix: "https://127.0.0.1:443"
      filterVerb: filter
      bindVerb: bind
      nodeCacheCapable: true
      weight: 1
      httpTimeout: 30s
      enableHTTPS: true
      tlsConfig:
        insecure: true
      managedResources:
      - name: nvidia.com/gpu
        ignoredByScheduler: true
      - name: xx/vcuda-memory
        ignoredByScheduler: true
      - name: xx/vcuda-core
        ignoredByScheduler: true
      - name: nvidia.com/gpumem-percentage
        ignoredByScheduler: true
      - name: nvidia.com/priority
        ignoredByScheduler: true
      - name: cambricon.com/vmlu
        ignoredByScheduler: true
      - name: hygon.com/dcunum
        ignoredByScheduler: true
      - name: hygon.com/dcumem
        ignoredByScheduler: true
      - name: hygon.com/dcucores
        ignoredByScheduler: true
      - name: iluvatar.ai/vgpu
        ignoredByScheduler: true
      - name: huawei.com/Ascend910-memory
        ignoredByScheduler: true
      - name: huawei.com/Ascend910
        ignoredByScheduler: true
kind: ConfigMap
metadata:
  annotations:
    meta.helm.sh/release-name: hami
    meta.helm.sh/release-namespace: kube-system
  creationTimestamp: "2024-06-16T12:41:49Z"
  labels:
    app.kubernetes.io/component: hami-scheduler
    app.kubernetes.io/instance: hami
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: hami
    app.kubernetes.io/version: 0.0.2
    helm.sh/chart: hami-2.0.0
  name: hami-scheduler-newversion
  namespace: kube-system
  resourceVersion: "72005391"
  uid: 7c8697ce-114c-4e16-9732-d4ccf7290e6b

3. Information to attach (optional if deemed irrelevant)

HAMi image version: both v2.3.12 and the latest fail.
Common error checking:

  • The output of nvidia-smi -a on your host
  • Your docker or containerd configuration file (e.g: /etc/docker/daemon.json)
  • The vgpu-device-plugin container logs
  • The vgpu-scheduler container logs
  • The kubelet logs on the node (e.g: sudo journalctl -r -u kubelet)

Additional information that might help better understand your environment and reproduce the bug:

  • Docker version from docker version
  • Docker command, image and tag used
  • Kernel version from uname -a
  • Any relevant kernel output lines from dmesg
@archlitchi (Collaborator) commented:

Thanks, we will fix that soon.

@lengrongfu (Member) commented:

@archlitchi are you working on a fix for this issue?

@lengrongfu (Member) commented:

@jxfruit can you try again with the YAML below?

apiVersion: v1
kind: Pod
metadata:
  name: xlabfe73ef20d3cc329522779f35dd1ebaa4
spec:
  restartPolicy: OnFailure
  containers:
    - name: xlabfe73ef20d3cc329522779f35dd1ebaa4
      image: "swr.cn-east-3.myhuaweicloud.com/senseyu/exam-gpu:dev_1.0"
      command: ["bash", "-c", "sleep 100000"]
      resources:
        limits:
          nvidia.com/gpu: 1
          xx/vcuda-core: 1

@jxfruit (Author) commented Jun 18, 2024

@lengrongfu it still fails when the pod is created, with the error:
Node didn't have enough resource: xx/vcuda-core, requested: 1, used: 0, capacity: 0

By the way, the docs (https://github.com/Project-HAMi/HAMi/blob/master/docs/config.md) say the install parameter devicePlugin.deviceSplitCount will generate N vGPUs, but after setting it, creating a pod that uses those vGPUs still fails.

So is there something I missed?
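
The capacity: 0 in that error suggests the node never advertised xx/vcuda-core at all. A quick way to confirm what the node actually registers (a sketch; substitute your node name):

# Registered extended resources such as xx/vcuda-core should show up
# under both capacity and allocatable if the device plugin is healthy
kubectl get node <node-name> -o jsonpath='{.status.capacity}{"\n"}{.status.allocatable}{"\n"}'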

@lengrongfu (Member) commented:

Please provide the following information:

  • The node YAML: does the node register nvidia.com/gpu: 10 under allocatable and capacity?
  • The HAMi version you are using.
  • Which component emits the error Node didn't have enough resource: xx/vcuda-core, requested: 1, used: 0, capacity: 0?

@jxfruit (Author) commented Jun 18, 2024

  • [image: node resource screenshot]
    apiVersion: v1
    kind: Pod
    metadata:
      name: xlabfe73ef20d3cc329522779f35dd1ebaa4
    spec:
      restartPolicy: OnFailure
      containers:
        - name: xlabfe73ef20d3cc329522779f35dd1ebaa4
          image: "swr.cn-east-3.myhuaweicloud.com/senseyu/exam-gpu:dev_1.0"
          command: ["bash", "-c", "sleep 100000"]
          resources:
            limits:
              nvidia.com/gpu: 10

  • HAMi image version: both v2.3.12 and the latest fail.

  • I am confused. I installed HAMi with:
    helm install hami hami-charts/hami --set resourceCores=xx/vcuda-core --set resourceMem=xx/vcuda-memory --set scheduler.kubeScheduler.imageTag=v1.22.8 -n kube-system
    and created the pod with:
    apiVersion: v1
    kind: Pod
    metadata:
      name: xlabfe73ef20d3cc329522779f35dd1ebaa4
    spec:
      restartPolicy: OnFailure
      containers:
        - name: xlabfe73ef20d3cc329522779f35dd1ebaa4
          image: "swr.cn-east-3.myhuaweicloud.com/senseyu/exam-gpu:dev_1.0"
          command: ["bash", "-c", "sleep 100000"]
          resources:
            limits:
              nvidia.com/gpu: 1
              xx/vcuda-core: 1

kubectl describe pod xlabfe73ef20d3cc329522779f35dd1ebaa4 shows the error:
Name: xlabfe73ef20d3cc329522779f35dd1ebaa4
Namespace: default
Priority: 0
Node: inter-app2/
Start Time: Tue, 18 Jun 2024 15:36:08 +0800
Labels:
Annotations: hami.io/bind-phase: allocating
hami.io/bind-time: 1718696168
hami.io/vgpu-devices-allocated: GPU-5a6f5eeb-ad9e-7ac9-a40f-30ba7bbc5713,NVIDIA,15109,1:;
hami.io/vgpu-devices-to-allocate: GPU-5a6f5eeb-ad9e-7ac9-a40f-30ba7bbc5713,NVIDIA,15109,1:;
hami.io/vgpu-node: inter-app2
hami.io/vgpu-time: 1718696168
Status: Failed
Reason: UnexpectedAdmissionError
Message: Pod Allocate failed due to can't allocate unregistered device xx/vcuda-core, which is unexpected

The message then changed. After a few more attempts I got:
Warning FailedScheduling 25s hami-scheduler binding rejected: node inter-app2 has been locked within 5 minutes
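
The "locked within 5 minutes" message comes from HAMi's node-lock mechanism, which serializes bindings by annotating the node; the failed admissions above can leave a stale lock behind. Inspecting the node annotations shows whether one is present (a sketch; the exact lock annotation key, e.g. hami.io/mutex.lock, is an assumption and may vary by version):

# Dump the node's annotations; a leftover HAMi lock entry
# (key assumed to be hami.io/mutex.lock) points at a stale lock
kubectl get node inter-app2 -o jsonpath='{.metadata.annotations}'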

@lengrongfu (Member) commented Jun 18, 2024

What is nvidia.com/gpu: 10 in your limits supposed to be? I see one pod with a value of 10 and another with a value of 1.

This field in the limits cannot exceed the number of physical devices on the node; see the sketch below.
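
For example, on a node with a single physical GPU, a per-container request like the following stays within the physical device count while still slicing cores and memory (a minimal sketch using the renamed resources from this thread; the core and memory values are illustrative):

resources:
  limits:
    nvidia.com/gpu: 1      # must not exceed the node's physical GPU count
    xx/vcuda-core: 50      # illustrative: 50% of one GPU's cores
    xx/vcuda-memory: 3000  # illustrative: device memory in MB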

@jxfruit (Author) commented Jun 19, 2024

@lengrongfu the two pods are separate cases; I was just answering your questions. With only nvidia.com/gpu: 10, the pod stays Pending because there are not enough resources to allocate. When xx/vcuda-core is added, the pod fails with: Pod Allocate failed due to can't allocate unregistered device xx/vcuda-core, which is unexpected

@lengrongfu (Member) commented Jun 20, 2024

I tested this case and could not reproduce it:

[image]

kind: Deployment
apiVersion: apps/v1
metadata:
  name: vgpu-deployment
  namespace: default
spec:
  replicas: 1
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
        - name: vgpu-test
          image: chrstnhntschl/gpu_burn
          args:
            - '6000'
          resources:
            limits:
              nvidia.com/vgpu: '1'
              test/vcuda-core: '50'
              test/vcuda-memory: '4680'

@jxfruit (Author) commented Jun 21, 2024

> I tested this case and could not reproduce it: [image + deployment YAML quoted above]

Can you share your config.yaml or the install command? I'll try it again.

@lengrongfu (Member) commented:

helm -n nvidia-vgpu get values nvidia-vgpu prints the values YAML below, which you can refer to; I could not find the original helm install command (a reconstructed equivalent follows the values).
hami:
  ascendResourceMem: huawei.com/Ascend910-memory
  ascendResourceName: huawei.com/Ascend910
  dcuResourceCores: hygon.com/dcucores
  dcuResourceMem: hygon.com/dcumem
  dcuResourceName: hygon.com/dcunum
  devicePlugin:
    deviceCoreScaling: 1
    deviceMemoryScaling: 1
    deviceSplitCount: 10
    disablecorelimit: "false"
    extraArgs:
    - -v=false
    hygonImageRepository: 4pdosc/vdcu-device-plugin
    hygonImageTag: v1.0
    hygondriver: /root/dcu-driver/dtk-22.10.1-vdcu
    hygonimage: 4pdosc/vdcu-device-plugin:v1.0
    hygonnodeSelector:
      dcu: "on"
    imagePullPolicy: IfNotPresent
    libPath: /usr/local/vgpu
    migStrategy: none
    mlunodeSelector:
      mlu: "on"
    monitorctrPath: /usr/local/vgpu/containers
    monitorimage: projecthami/hami
    nvidianodeSelector:
      nvidia.com/gpu.deploy.container-toolkit: "true"
      nvidia.com/vgpu.deploy.device-plugin: "true"
    pluginPath: /var/lib/kubelet/device-plugins
    podAnnotations: {}
    registry: docker.m.daocloud.io
    repository: projecthami/hami
    runtimeClassName: ""
    service:
      httpPort: 31992
    tolerations: []
  fullnameOverride: ""
  global:
    annotations: {}
    gpuHookPath: /usr/local
    labels: {}
  iluvatarResourceCore: iluvatar.ai/vcuda-core
  iluvatarResourceMem: iluvatar.ai/vcuda-memory
  iluvatarResourceName: iluvatar.ai/vgpu
  imagePullSecrets: []
  mluResourceCores: cambricon.com/mlu.smlu.vcore
  mluResourceMem: cambricon.com/mlu.smlu.vmemory
  mluResourceName: cambricon.com/vmlu
  nameOverride: ""
  podSecurityPolicy:
    enabled: false
  resourceCores: test/vcuda-core
  resourceMem: test/vcuda-memory
  resourceMemPercentage: nvidia.com/gpumem-percentage
  resourceName: nvidia.com/vgpu
  resourcePriority: nvidia.com/priority
  resources:
    limits:
      cpu: 500m
      memory: 720Mi
    requests:
      cpu: 100m
      memory: 128Mi
  scheduler:
    customWebhook:
      enabled: false
      host: 127.0.0.1
      path: /webhook
      port: 31998
    defaultCores: 0
    defaultGPUNum: 1
    defaultMem: 0
    defaultSchedulerPolicy:
      gpuSchedulerPolicy: spread
      nodeSchedulerPolicy: binpack
    extender:
      extraArgs:
      - --debug
      - -v=4
      imagePullPolicy: IfNotPresent
      registry: docker.m.daocloud.io
      repository: projecthami/hami
    kubeScheduler:
      enabled: true
      extraArgs:
      - --policy-config-file=/config/config.json
      - -v=4
      extraNewArgs:
      - --config=/config/config.yaml
      - -v=4
      imagePullPolicy: IfNotPresent
      imageTag: v1.24.0
      registry: k8s-gcr.m.daocloud.io
      repository: kubernetes/kube-scheduler
    leaderElect: true
    metricsBindAddress: :9395
    mutatingWebhookConfiguration:
      failurePolicy: Ignore
    nodeName: ""
    nodeSelector:
      nvidia.com/gpu.deploy.container-toolkit: "true"
    patch:
      imagePullPolicy: IfNotPresent
      newRepository: liangjw/kube-webhook-certgen
      newTag: v1.1.1
      nodeSelector: {}
      podAnnotations: {}
      priorityClassName: ""
      registry: docker.io
      repository: jettech/kube-webhook-certgen
      runAsUser: 2000
      tag: v1.5.2
      tolerations: []
    podAnnotations: {}
    service:
      annotations: {}
      httpPort: 443
      labels: {}
      monitorPort: 31993
      schedulerPort: 31998
    serviceMonitor:
      enable: false
    tolerations: []
  schedulerName: hami-scheduler
  version: v2.3.11
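
For reference, the values above should correspond roughly to an install command along these lines (a reconstruction, not the original command; the release name and namespace are taken from the helm get values invocation above):

helm install nvidia-vgpu hami-charts/hami -n nvidia-vgpu \
  --set resourceName=nvidia.com/vgpu \
  --set resourceCores=test/vcuda-core \
  --set resourceMem=test/vcuda-memory \
  --set devicePlugin.deviceSplitCount=10 \
  --set scheduler.kubeScheduler.imageTag=v1.24.0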

@jxfruit (Author) commented Jun 22, 2024

Sorry, deploying with your values.yaml failed; the chart is incomplete. Can you share the complete Helm chart?

By the way, in my testing the resourceName field does work with a custom value. However, it cannot allocate more vGPUs than the current node actually has.
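
On the last point: devicePlugin.deviceSplitCount sets how many tasks may share one physical GPU, so a node with one physical GPU and deviceSplitCount: 10 should advertise 10 schedulable vGPUs, yet a single container still cannot request more vGPUs than there are physical devices. A sketch of the expected node status under those assumptions (illustrative values, assuming resourceName is set to nvidia.com/vgpu as in the values above):

# Expected on a node with 1 physical GPU and deviceSplitCount: 10
status:
  capacity:
    nvidia.com/vgpu: "10"
  allocatable:
    nvidia.com/vgpu: "10"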
