vgpu-scheduler-extender device error, node xxxxx not found and device-plugin get node error resource name may not be empty #284

Open
1 of 9 tasks
betterxu opened this issue Apr 22, 2024 · 3 comments

@betterxu

betterxu commented Apr 22, 2024

I checked the hami-device-plugin YAML, and the NODE_NAME env variable is present:

spec:
  containers:
  - command:
    - nvidia-device-plugin
    - --resource-name=nvidia.com/gpu
    - --mig-strategy=none
    - --device-memory-scaling=1
    - --device-cores-scaling=1
    - --device-split-count=10
    - --disable-core-limit=false
    - -v=false
    env:
    - name: NODE_NAME
      valueFrom:
        fieldRef:
          apiVersion: v1
          fieldPath: spec.nodeName
    - name: NVIDIA_MIG_MONITOR_DEVICES
      value: all
    - name: HOOK_PATH
      value: /usr/local
    image: projecthami/hami:v2.3.9
    imagePullPolicy: IfNotPresent
    lifecycle:
      postStart:
        exec:
          command:
          - /bin/sh
          - -c
          - cp -f /k8s-vgpu/lib/nvidia/* /usr/local/vgpu/

1. Issue or feature description

kubectl logs -f -n kube-system hami-device-plugin-gt9h5 -c device-plugin
E0422 08:03:18.033017 27021 register.go:171] get node error resource name may not be empty
E0422 08:03:18.033309 27021 register.go:193] Failed to register annotation: resource name may not be empty

kubectl logs -f -n kube-system hami-scheduler-774d56586c-98hjg vgpu-scheduler-extender
E0422 08:02:30.014651 1 scheduler.go:289] get node xxxxx device error, node xxxxx not found
E0422 08:02:30.014749 1 scheduler.go:289] get node xxxxx device error, node xxxxx not found
E0422 08:02:30.014774 1 scheduler.go:289] get node xxxxx device error, node xxxxx not found
E0422 08:02:30.014781 1 scheduler.go:289] get node xxxxx device error, node xxxxx not found
E0422 08:02:30.014789 1 scheduler.go:289] get node xxxxx device error, node xxxxx not found
E0422 08:02:30.014799 1 scheduler.go:289] get node xxxxx device error, node xxxxx not found

2. Steps to reproduce the issue

3. Information to attach (optional if deemed irrelevant)

Common error checking:

  • The output of nvidia-smi -a on your host
  • Your docker or containerd configuration file (e.g: /etc/docker/daemon.json)
  • The vgpu-device-plugin container logs
  • The vgpu-scheduler container logs
  • The kubelet logs on the node (e.g: sudo journalctl -r -u kubelet)

Additional information that might help better understand your environment and reproduce the bug:

  • Docker version from docker version
  • Docker command, image and tag used
  • Kernel version from uname -a
  • Any relevant kernel output lines from dmesg

Hi @betterxu,
Thanks for opening an issue!
We will look into it as soon as possible.

@TymonLee

I encountered the same situation before. In my case, the cause was that the env name "NODE_NAME" is not valid in some versions. I added the env below and the problem was resolved.

    - name: NodeName
      valueFrom:
        fieldRef:
          apiVersion: v1
          fieldPath: spec.nodeName
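
The mismatch described in this comment can be sketched in Go as follows. This is not HAMi's code: `resolveNodeName`, `candidateEnvNames`, and the error text are hypothetical names made up for illustration, and which variable a given release actually reads is an assumption. The idea is that a process reading only one variable name gets an empty string when the manifest sets the other one, and an empty node name handed to a Kubernetes client lookup would be consistent with the "resource name may not be empty" message in the device-plugin log.

```go
package main

import (
	"errors"
	"fmt"
	"os"
)

// candidateEnvNames lists the two variable names mentioned in this thread.
// Which one a given release reads is an assumption, so the sketch tries both.
var candidateEnvNames = []string{"NodeName", "NODE_NAME"}

// errEmptyNodeName is a hypothetical error used only in this sketch.
var errEmptyNodeName = errors.New("node name is empty: none of the expected env variables is set")

// resolveNodeName returns the first non-empty value found among names,
// reading each one through lookup (for example os.Getenv).
func resolveNodeName(lookup func(string) string, names []string) (string, error) {
	for _, name := range names {
		if v := lookup(name); v != "" {
			return v, nil
		}
	}
	return "", errEmptyNodeName
}

func main() {
	node, err := resolveNodeName(os.Getenv, candidateEnvNames)
	if err != nil {
		// Failing here, before any API call, points at the manifest
		// rather than at the cluster.
		fmt.Println("error:", err)
		return
	}
	fmt.Println("node name:", node)
}
```

Checking for an empty value before any API call makes a misnamed env variable visible as a configuration problem instead of a registration error.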

@wawa0210
Member

> I encountered the same situation before. In my case, the cause was that the env name "NODE_NAME" is not valid in some versions. I added the env below and the problem was resolved.
>
>     - name: NodeName
>       valueFrom:
>         fieldRef:
>           apiVersion: v1
>           fieldPath: spec.nodeName

You can try the latest version of the image and charts; it should work. There was a breaking change earlier.
