Unable to set proper tolerations with GPU-Operator #270

@larivierec

1. Quick Debug Checklist

  • [-] Are you running on an Ubuntu 18.04 node? No, 20.04 (but it works for the node using a GTX 1070)
  • Are you running Kubernetes v1.13+?
  • [-] Are you running Docker (>= 18.06) or CRIO (>= 1.13+)? No, containerd
  • Do you have i2c_core and ipmi_msghandler loaded on the nodes?
  • Did you apply the CRD (kubectl describe clusterpolicies --all-namespaces)

1. Issue or feature description

One of my nodes has a legacy GPU that is required for the machine to boot. I am unable to exclude this node from receiving the full GPU-Operator installation.

This is a MicroK8s cluster running Kubernetes 1.22 and the latest NVIDIA gpu-operator. The node with the GTX 1070 works just fine.

2. Steps to reproduce the issue

Set up a multi-node cluster with 2 nodes:

  1. one with a legacy GPU
  2. one with a newer GPU (a GTX 1070 in my case)

Label the legacy node using `kubectl label node <legacy_node> legacy-gpu=true`

Install the gpu-operator Helm chart with the following values:

operator:
  defaultRuntime: containerd
  runtimeClass: nvidia-container-runtime
  tolerations:
    - key: "legacy-gpu"
      operator: "Exists"
      effect: "NoSchedule"

The operator itself is installed on the proper node, but all other components are deployed to both nodes.
I tried adding this toleration to each section of the Helm chart values, and it has no effect: the components keep being installed on every node, even after uninstalling and reinstalling the chart.
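
For context on the toleration above: in Kubernetes, a toleration only permits a pod to land on a node carrying a matching taint. It does not keep the pod away from untainted nodes, and `kubectl label` adds a label, not a taint. The sketch below is a simplified Python model of that matching rule, not Kubernetes or gpu-operator code. The class and function names (`Taint`, `Toleration`, `tolerates`, `schedulable`) are hypothetical, and the model ignores node selectors, affinity, and `PreferNoSchedule`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Taint:
    key: str
    effect: str          # e.g. "NoSchedule" or "NoExecute"
    value: str = ""


@dataclass(frozen=True)
class Toleration:
    key: str = ""        # empty key with operator "Exists" matches any key
    operator: str = "Equal"
    value: str = ""
    effect: str = ""     # empty effect matches any effect


def tolerates(tol: Toleration, taint: Taint) -> bool:
    """Simplified version of the taint/toleration matching rule."""
    if tol.effect and tol.effect != taint.effect:
        return False
    if tol.key and tol.key != taint.key:
        return False
    if tol.operator == "Exists":
        return True
    return tol.value == taint.value


def schedulable(node_taints, pod_tolerations) -> bool:
    """A pod passes the taint check if every hard taint is tolerated."""
    hard = [t for t in node_taints if t.effect in ("NoSchedule", "NoExecute")]
    return all(any(tolerates(tol, t) for tol in pod_tolerations) for t in hard)


# The toleration from the Helm values above.
legacy_tol = Toleration(key="legacy-gpu", operator="Exists", effect="NoSchedule")

# Assumption: a node that was only labeled carries no taints at all.
labeled_only_node = []
# Hypothetical node that was tainted instead of labeled.
tainted_node = [Taint(key="legacy-gpu", effect="NoSchedule", value="true")]

print(schedulable(labeled_only_node, [legacy_tol]))  # untainted: toleration is irrelevant
print(schedulable(labeled_only_node, []))            # untainted: accepted without it too
print(schedulable(tainted_node, [legacy_tol]))       # tainted: toleration lets the pod in
print(schedulable(tainted_node, []))                 # tainted: pod without toleration is kept out
```

Under this model the toleration never excludes a node. It can only widen the set of nodes a pod may run on, which would be consistent with the components still appearing on both nodes.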

3. Information to attach (optional if deemed irrelevant)

  • kubernetes pods status: kubectl get pods --all-namespaces
NAME                                   READY   STATUS                   RESTARTS      AGE     IP             NODE          NOMINATED NODE   READINESS GATES
nvidia-driver-daemonset-bkcbr          1/1     Running                  0             7m57s   10.1.134.241   k8s-gpu       <none>           <none>
gpu-feature-discovery-9slwv            1/1     Running                  0             7m54s   10.1.134.201   k8s-gpu       <none>           <none>
nvidia-cuda-validator-gf9f9            0/1     Completed                0             6m17s   10.1.134.249   k8s-gpu       <none>           <none>
nvidia-device-plugin-daemonset-dp9jp   1/1     Running                  0             7m56s   10.1.134.255   k8s-gpu       <none>           <none>
nvidia-device-plugin-validator-cv4sj   0/1     Completed                0             6m6s    10.1.134.203   k8s-gpu       <none>           <none>
nvidia-operator-validator-jqx6v        1/1     Running                  0             7m57s   10.1.134.199   k8s-gpu       <none>           <none>
nvidia-dcgm-hk6pl                      1/1     Running                  0             7m55s   10.1.134.230   k8s-gpu       <none>           <none>
nvidia-dcgm-exporter-trtxs             1/1     Running                  0             7m55s   10.1.134.239   k8s-gpu       <none>           <none>
nvidia-driver-daemonset-klg6x          0/1     Init:CrashLoopBackOff    6 (74s ago)   7m57s   10.1.11.56     k8s-cluster   <none>           <none>
nvidia-dcgm-ks99z                      0/1     Init:RunContainerError   3 (13s ago)   74s     10.1.11.28     k8s-cluster   <none>           <none>
gpu-feature-discovery-j9wh9            0/1     Init:RunContainerError   3 (12s ago)   74s     10.1.11.51     k8s-cluster   <none>           <none>
nvidia-device-plugin-daemonset-f59vn   0/1     Init:RunContainerError   3 (11s ago)   74s     10.1.11.39     k8s-cluster   <none>           <none>
nvidia-dcgm-exporter-jq8n9             0/1     Init:CrashLoopBackOff    3 (16s ago)   74s     10.1.11.62     k8s-cluster   <none>           <none>
nvidia-operator-validator-4876k        0/1     Init:CrashLoopBackOff    3 (13s ago)   74s     10.1.11.1      k8s-cluster   <none>           <none>

If there's any other missing info, let me know.
