gpu operator does not reconcile cluster policy to update managed daemonsets #186

@amrmahdi

Description

Quick Debug Checklist

  • Are you running on an Ubuntu 18.04 node?
  • Are you running Kubernetes v1.13+?
  • Did you apply the CRD (kubectl describe clusterpolicies --all-namespaces)?

1. Issue or feature description

The gpu-operator does not reconcile ClusterPolicy updates applied to the cluster. For example, if you update the Helm chart values to specify a different image for the dcgm-exporter, the exporter DaemonSet is not updated with the supplied image. The same is true for the other DaemonSets managed by the operator.
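
A quick way to see the mismatch after an upgrade is to compare the ClusterPolicy field with the image the DaemonSet is actually running (the jsonpath field names assume the ClusterPolicy spec layout shown further below):

$ kubectl get clusterpolicy cluster-policy -o jsonpath='{.spec.dcgmExporter.version}'
$ kubectl get daemonset nvidia-dcgm-exporter -n gpu-operator-resources -o jsonpath='{.spec.template.spec.containers[0].image}'

The first command reflects the new value from helm; the second still shows the old tag.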

2. Steps to reproduce the issue

  1. Install the operator using the Helm chart (latest from master), with the driver image also latest from master.
  2. Do a helm upgrade with a different dcgm-exporter image (see the example command after this list).
  3. Observe that the dcgm-exporter DaemonSet does not get reconciled and updated with the new image.
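
For step 2, a helm upgrade of this form (release name taken from the meta.helm.sh annotations below; the chart path assumes a local checkout of the gpu-operator repo; dcgmExporter.* are the chart values that feed the ClusterPolicy spec):

$ helm upgrade nvidia-gpu-operator ./deployments/gpu-operator \
    --set dcgmExporter.version=2.1.8-2.4.0-rc.2-ubuntu20.04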

3. Information to attach (optional if deemed irrelevant)

I also tried restarting the gpu-operator pod to see whether it would pick up the changes; it did not.

$ kubectl get daemonsets.apps -n gpu-operator-resources
NAME                                 DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR                                      AGE
gpu-feature-discovery                2         2         2       2            2           nvidia.com/gpu.deploy.gpu-feature-discovery=true   8h
nvidia-container-toolkit-daemonset   2         2         2       2            2           nvidia.com/gpu.deploy.container-toolkit=true       8h
nvidia-dcgm-exporter                 2         2         2       2            2           nvidia.com/gpu.deploy.dcgm-exporter=true           8h
nvidia-device-plugin-daemonset       2         2         2       2            2           nvidia.com/gpu.deploy.device-plugin=true           8h
nvidia-driver-daemonset              2         2         2       2            2           nvidia.com/gpu.deploy.driver=true                  8h
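
The listing above does not include images; -o wide adds CONTAINERS and IMAGES columns, which is how I confirmed the dcgm-exporter DaemonSet is still running the old tag:

$ kubectl get daemonset nvidia-dcgm-exporter -n gpu-operator-resources -o wide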
$ kubectl describe clusterpolicy
Name:         cluster-policy
Namespace:
Labels:       app.kubernetes.io/component=gpu-operator
              app.kubernetes.io/managed-by=Helm
Annotations:  meta.helm.sh/release-name: nvidia-gpu-operator
              meta.helm.sh/release-namespace: default
API Version:  nvidia.com/v1
Kind:         ClusterPolicy
Metadata:
  Creation Timestamp:  2021-05-07T12:58:52Z
  Generation:          6
  Resource Version:  597411
  Self Link:         /apis/nvidia.com/v1/clusterpolicies/cluster-policy
  UID:               1f6fdfbc-ec07-4ec9-b17e-6d77ae09168b
Spec:
  Dcgm Exporter:
    Args:
      -f
      /etc/dcgm-exporter/dcp-metrics-included.csv
    Image:              dcgm-exporter
    Image Pull Policy:  IfNotPresent
    Node Selector:
      nvidia.com/gpu.deploy.dcgm-exporter:  true
    Priority Class Name:                    system-node-critical
    Repository:                             nvcr.io/nvidia/k8s
    Version:                                2.1.8-2.4.0-rc.2-ubuntu20.04
  Device Plugin:
    Env:
      Name:             PASS_DEVICE_SPECS
      Value:            true
      Name:             FAIL_ON_INIT_ERROR
      Value:            true
      Name:             DEVICE_LIST_STRATEGY
      Value:            envvar
      Name:             DEVICE_ID_STRATEGY
      Value:            uuid
      Name:             NVIDIA_VISIBLE_DEVICES
      Value:            all
      Name:             NVIDIA_DRIVER_CAPABILITIES
      Value:            all
    Image:              k8s-device-plugin
    Image Pull Policy:  IfNotPresent
    Node Selector:
      nvidia.com/gpu.deploy.device-plugin:  true
    Priority Class Name:                    system-node-critical
    Repository:                             nvcr.io/nvidia
    Security Context:
      Privileged:  true
    Version:       v0.9.0-ubi8
  Driver:
    Enabled:            true
    Image:              driver
    Image Pull Policy:  IfNotPresent
    Licensing Config:
      Config Map Name:
    Node Selector:
      nvidia.com/gpu.deploy.driver:  true
    Priority Class Name:             system-node-critical
    Repo Config:
      Config Map Name:
      Destination Dir:
    Repository:         nvcr.io/nvidia
    Security Context:
      Privileged:  true
      Se Linux Options:
        Level:  s0
    Tolerations:
      Effect:    NoSchedule
      Key:       nvidia.com/gpu
      Operator:  Exists
    Version:     460.73.01
  Gfd:
    Env:
      Name:             GFD_SLEEP_INTERVAL
      Value:            60s
      Name:             GFD_FAIL_ON_INIT_ERROR
      Value:            true
    Image:              gpu-feature-discovery
    Image Pull Policy:  IfNotPresent
    Node Selector:
      nvidia.com/gpu.deploy.gpu-feature-discovery:  true
    Priority Class Name:                            system-node-critical
    Repository:                                     nvcr.io/nvidia
    Tolerations:
      Effect:    NoSchedule
      Key:       nvidia.com/gpu
      Operator:  Exists
    Version:     v0.4.1
  Mig:
    Strategy:  single
  Operator:
    Default Runtime:  containerd
    Init Container:
      Image:              cuda
      Image Pull Policy:  IfNotPresent
      Repository:         nvcr.io/nvidia
      Version:            11.2.1-base-ubi8
    Validator:
      Image:              cuda-sample
      Image Pull Policy:  IfNotPresent
      Repository:         nvcr.io/nvidia/k8s
      Version:            vectoradd-cuda10.2
  Toolkit:
    Image:              container-toolkit
    Image Pull Policy:  IfNotPresent
    Node Selector:
      nvidia.com/gpu.deploy.container-toolkit:  true
    Priority Class Name:                        system-node-critical
    Repository:                                 nvcr.io/nvidia/k8s
    Security Context:
      Privileged:  true
      Se Linux Options:
        Level:  s0
    Tolerations:
      Effect:    NoSchedule
      Key:       nvidia.com/gpu
      Operator:  Exists
    Version:     1.5.0-ubuntu18.04
Status:
  State:  ready
Events:   <none>
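
Note that Generation: 6 above confirms the spec edits from helm upgrade did reach the API server; only the propagation to the managed DaemonSets is missing. A possible workaround (untested; it assumes the operator recreates missing resources on its next reconcile) would be to delete the stale DaemonSet and then restart the operator pod again:

$ kubectl delete daemonset nvidia-dcgm-exporter -n gpu-operator-resources
$ kubectl -n default delete pod -l app.kubernetes.io/component=gpu-operator   # operator pod label assumed from the chart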
