
[ENHANCEMENT] give clearer error message if there are no nodes eligible for nvidia-driver-toolkit addon #5773

Open
SlavikCA opened this issue May 8, 2024 · 4 comments
Labels
question Further information is requested

Comments


SlavikCA commented May 8, 2024

I installed Harvester 1.3.0 on a single node:

  • DELL T7920
  • 2x Xeon Gold 5218
  • 64GB RAM
  • nVidia Quadro

I know that this nVidia model may not be supported, but I wanted to try vGPU.

I enabled the pcidevices-controller addon. I can see the nVidia device:

[Screenshot: the nVidia device listed under PCI Devices]

Then I enabled the nvidia-driver-toolkit addon:

[Screenshot: the nvidia-driver-toolkit addon being enabled]

And nothing happens...
All I see is:

This resource is currently in a transitioning state, but there isn't a detailed message available.

How can I troubleshoot? Where can I see logs?

SlavikCA added the question label May 8, 2024

SlavikCA commented May 8, 2024

ibrokethecloud (Contributor) commented:

The addon creates a daemonset nvidia-driver-runtime in the harvester-system namespace, where the logs are available.
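
For example (purely illustrative commands, assuming the namespace and daemonset name mentioned above), something like this should show the daemonset and surface the logs from one of its pods:

kubectl -n harvester-system get daemonset nvidia-driver-runtime
kubectl -n harvester-system logs daemonset/nvidia-driver-runtime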


SlavikCA commented May 9, 2024

kubectl get daemonset --all-namespaces

NAMESPACE           NAME                    DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR 
harvester-system    nvidia-driver-runtime   0         0         0       0            0           sriovgpu.harvesterhci.io/driver-needed=true

Looks like the issue is the NODE SELECTOR, set to sriovgpu.harvesterhci.io/driver-needed=true, a label which my node doesn't have:

kubectl describe node t7920
...
Annotations:        alpha.kubernetes.io/provided-node-ip: 10.0.4.144
                    cluster.x-k8s.io/cluster-name: local
                    cluster.x-k8s.io/cluster-namespace: fleet-local
                    cluster.x-k8s.io/labels-from-machine: 
                    cluster.x-k8s.io/machine: custom-c07fb39e1e4b
                    csi.volume.kubernetes.io/nodeid: {"driver.longhorn.io":"t7920"}
                    etcd.rke2.cattle.io/local-snapshots-timestamp: 2024-05-09T00:00:02Z
                    etcd.rke2.cattle.io/node-address: 10.0.4.144
                    etcd.rke2.cattle.io/node-name: t7920-e02038ff
                    flannel.alpha.coreos.com/backend-data: {"VNI":1,"VtepMAC":"4e:47:3a:87:00:1e"}
                    flannel.alpha.coreos.com/backend-type: vxlan
                    flannel.alpha.coreos.com/kube-subnet-manager: true
                    flannel.alpha.coreos.com/public-ip: 10.0.4.144
                    kubevirt.io/heartbeat: 2024-05-09T03:13:19Z
                    kubevirt.io/ksm-handler-managed: false
                    management.cattle.io/pod-limits:
                      {"cpu":"17265m","devices.kubevirt.io/kvm":"1","devices.kubevirt.io/tun":"1","devices.kubevirt.io/vhost-net":"1","memory":"32488056576","nv...
                    management.cattle.io/pod-requests:
                      {"cpu":"12275m","devices.kubevirt.io/kvm":"1","devices.kubevirt.io/tun":"1","devices.kubevirt.io/vhost-net":"1","ephemeral-storage":"50M",...
                    node.alpha.kubernetes.io/ttl: 0
                    node.harvesterhci.io/ntp-service: {"ntpSyncStatus":"synced","currentNtpServers":"0.suse.pool.ntp.org"}
                    rke2.io/encryption-config-hash: start-1174704689d140fc0c4b7c929cd6c01ee40648a9925783e82f7950f07089bf4c
                    rke2.io/hostname: t7920
                    rke2.io/internal-ip: 10.0.4.144
                    rke2.io/node-args:
                      ["server","--disable","rke2-snapshot-controller","--disable","rke2-snapshot-controller-crd","--disable","rke2-snapshot-validation-webhook"...
                    rke2.io/node-config-hash: YY52OZWQSA5XJVVKBENDUHPM4DPSBL3HDCYYII6OCK7QBFMG3HPQ====
                    rke2.io/node-env: {}
                    volumes.kubernetes.io/controller-managed-attach-detach: true
...
Capacity:
  cpu:                                                64
  devices.kubevirt.io/kvm:                            1k
  devices.kubevirt.io/tun:                            1k
  devices.kubevirt.io/vhost-net:                      1k
  ephemeral-storage:                                  153707984Ki
  hugepages-1Gi:                                      0
  hugepages-2Mi:                                      0
  memory:                                             65515140Ki
  nvidia.com/GP104GL_QUADRO_P5000:                    1
  nvidia.com/GP104_HIGH_DEFINITION_AUDIO_CONTROLLER:  1
  pods:                                               200
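
A quick way to confirm which nodes (if any) carry the label is an illustrative command like:

kubectl get nodes -L sriovgpu.harvesterhci.io/driver-needed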

So, can I suggest an improvement: when enabling that addon, if there are no nodes with the sriovgpu.harvesterhci.io/driver-needed=true label, give the user a warning about that?
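
A rough sketch of the kind of pre-check that could produce such a warning (purely illustrative shell, using the label above):

if [ -z "$(kubectl get nodes -l sriovgpu.harvesterhci.io/driver-needed=true -o name)" ]; then
  echo "warning: no nodes are currently eligible for the nvidia-driver-toolkit addon"
fi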

After I clicked ENABLE, it got stuck: it can't finish the installation and I can't disable it anymore. It stays in the ENABLING state forever.

The root cause seems to be that even though I have an nVidia card, it is not SR-IOV capable, so the label was never added.

SlavikCA changed the title from "[Question] How to find logs for nvidia-driver-toolkit installation?" to "[ENHANCEMENT] give clearer error message if there are no nodes eligible for nvidia-driver-toolkit addon" May 9, 2024
ibrokethecloud (Contributor) commented:

The label is added by the controller when it finds a supported GPU. This ensures the driver is not installed on nodes where it is not needed, as the driver install will likely fail when no supported devices are present.
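
For reference, the label can in principle be applied by hand (illustrative command, node name taken from the output above), though as noted the driver install will likely fail without a supported GPU present:

kubectl label node t7920 sriovgpu.harvesterhci.io/driver-needed=true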

rebeccazzzz added this to New in Community Issue Review via automation May 14, 2024