vgpu-scheduler-extender device error, node xxxxx not found and device-plugin get node error resource name may not be empty #284

betterxu · 2024-04-22T08:09:58Z

I checked the hami-device-plugin YAML, and the NODE_NAME env variable is set:

spec:
  containers:
  - command:
    - nvidia-device-plugin
    - --resource-name=nvidia.com/gpu
    - --mig-strategy=none
    - --device-memory-scaling=1
    - --device-cores-scaling=1
    - --device-split-count=10
    - --disable-core-limit=false
    - -v=false
    env:
    - name: NODE_NAME
      valueFrom:
        fieldRef:
          apiVersion: v1
          fieldPath: spec.nodeName
    - name: NVIDIA_MIG_MONITOR_DEVICES
      value: all
    - name: HOOK_PATH
      value: /usr/local
    image: projecthami/hami:v2.3.9
    imagePullPolicy: IfNotPresent
    lifecycle:
      postStart:
        exec:
          command:
          - /bin/sh
          - -c
          - cp -f /k8s-vgpu/lib/nvidia/* /usr/local/vgpu/

1. Issue or feature description

kubectl logs -f -n kube-system hami-device-plugin-gt9h5 -c device-plugin
E0422 08:03:18.033017 27021 register.go:171] get node error resource name may not be empty
E0422 08:03:18.033309 27021 register.go:193] Failed to register annotation: resource name may not be empty

kubectl logs -f -n kube-system hami-scheduler-774d56586c-98hjg vgpu-scheduler-extender
E0422 08:02:30.014651 1 scheduler.go:289] get node xxxxx device error, node xxxxx not found
E0422 08:02:30.014749 1 scheduler.go:289] get node xxxxx device error, node xxxxx not found
E0422 08:02:30.014774 1 scheduler.go:289] get node xxxxx device error, node xxxxx not found
E0422 08:02:30.014781 1 scheduler.go:289] get node xxxxx device error, node xxxxx not found
E0422 08:02:30.014789 1 scheduler.go:289] get node xxxxx device error, node xxxxx not found
E0422 08:02:30.014799 1 scheduler.go:289] get node xxxxx device error, node xxxxx not found

github-actions · 2024-04-22T08:10:11Z

Hi @betterxu,
Thanks for opening an issue!
We will look into it as soon as possible.

TymonLee · 2024-04-23T07:46:36Z

I ran into the same problem before. In my case, the cause was that the env name "NODE_NAME" is not recognized by some versions. After I added the env below, the problem was resolved.

    - name: NodeName
      valueFrom:
        fieldRef:
          apiVersion: v1
          fieldPath: spec.nodeName
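
For a DaemonSet that has to work with more than one image version, a minimal sketch is to define both names side by side. This assumes the plugin simply ignores whichever variable it does not read, which this thread does not confirm, so check it against the release you actually run.

```yaml
# Sketch only: expose the node name under both spellings seen in this thread.
# Which spelling a given HAMi image reads is an assumption to verify.
env:
- name: NODE_NAME        # name already present in the original DaemonSet
  valueFrom:
    fieldRef:
      apiVersion: v1
      fieldPath: spec.nodeName
- name: NodeName         # name that resolved the problem in my case
  valueFrom:
    fieldRef:
      apiVersion: v1
      fieldPath: spec.nodeName
```

If the device-plugin was in fact starting with an empty node name, that would fit the "resource name may not be empty" message in its log, since the node lookup would have had no name to query.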

wawa0210 · 2024-04-26T03:01:41Z

You can try the latest version of the image and charts; that should work. There was a breaking change earlier.
