1. Quick Debug Checklist
- Are you running on an Ubuntu 18.04 node?
- Are you running Kubernetes v1.13+?
- Did you apply the CRD (`kubectl describe clusterpolicies --all-namespaces`)?
1. Issue or feature description
The gpu-operator does not reconcile ClusterPolicy updates applied to the cluster. For example, if you update the Helm chart values to specify a different image for the dcgm-exporter, the exporter DaemonSet is not updated with the supplied image. The same is true for the other DaemonSets managed by the operator.
2. Steps to reproduce the issue
- Install the operator using the Helm chart (latest from master), with the driver image also latest from master
- Run a `helm upgrade` with a different dcgm-exporter image
- Observe that the dcgm-exporter DaemonSet does not get reconciled and updated
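For reference, the upgrade in the steps above can be sketched with a values override along these lines. This is an illustrative sketch only: the `dcgmExporter` key block mirrors the chart's values layout, the release name and namespace are taken from the Helm annotations on the ClusterPolicy further down, and the chart path is an assumption.

```yaml
# values-override.yaml -- hypothetical override; key names assume the
# gpu-operator chart's dcgmExporter values block
dcgmExporter:
  repository: nvcr.io/nvidia/k8s
  image: dcgm-exporter
  version: 2.1.8-2.4.0-rc.2-ubuntu20.04   # swap in the image tag under test
```

applied with something like `helm upgrade nvidia-gpu-operator <chart> -n default -f values-override.yaml`. After the upgrade, the new tag shows up in the ClusterPolicy spec (see the `kubectl describe clusterpolicy` output below) but is never rolled out to the `nvidia-dcgm-exporter` DaemonSet.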
3. Information to attach (optional if deemed irrelevant)
I've also tried restarting the gpu-operator pod to see if it would pick up the changes; it did not.
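To rule Helm out entirely, one could also patch the CR directly; a hedged sketch, where the `dcgmExporter.version` field path follows the spec dump below and the tag is a placeholder:

```shell
# Hypothetical direct patch of the ClusterPolicy, bypassing Helm;
# field names assume the spec layout shown in the describe output below
kubectl patch clusterpolicy cluster-policy --type merge \
  -p '{"spec":{"dcgmExporter":{"version":"2.1.8-2.4.0-rc.2-ubuntu20.04"}}}'
```

If reconciliation were working, the `nvidia-dcgm-exporter` DaemonSet would roll out the new tag shortly after the patch.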
```
$ kubectl get daemonsets.apps -n gpu-operator-resources
NAME                                 DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR                                      AGE
gpu-feature-discovery                2         2         2       2            2           nvidia.com/gpu.deploy.gpu-feature-discovery=true   8h
nvidia-container-toolkit-daemonset   2         2         2       2            2           nvidia.com/gpu.deploy.container-toolkit=true       8h
nvidia-dcgm-exporter                 2         2         2       2            2           nvidia.com/gpu.deploy.dcgm-exporter=true           8h
nvidia-device-plugin-daemonset       2         2         2       2            2           nvidia.com/gpu.deploy.device-plugin=true           8h
nvidia-driver-daemonset              2         2         2       2            2           nvidia.com/gpu.deploy.driver=true                  8h

$ kubectl describe clusterpolicy
```
```
Name:         cluster-policy
Namespace:
Labels:       app.kubernetes.io/component=gpu-operator
              app.kubernetes.io/managed-by=Helm
Annotations:  meta.helm.sh/release-name: nvidia-gpu-operator
              meta.helm.sh/release-namespace: default
API Version:  nvidia.com/v1
Kind:         ClusterPolicy
Metadata:
  Creation Timestamp:  2021-05-07T12:58:52Z
  Generation:          6
  Resource Version:    597411
  Self Link:           /apis/nvidia.com/v1/clusterpolicies/cluster-policy
  UID:                 1f6fdfbc-ec07-4ec9-b17e-6d77ae09168b
Spec:
  Dcgm Exporter:
    Args:
      -f
      /etc/dcgm-exporter/dcp-metrics-included.csv
    Image:              dcgm-exporter
    Image Pull Policy:  IfNotPresent
    Node Selector:
      nvidia.com/gpu.deploy.dcgm-exporter:  true
    Priority Class Name:                    system-node-critical
    Repository:                             nvcr.io/nvidia/k8s
    Version:                                2.1.8-2.4.0-rc.2-ubuntu20.04
  Device Plugin:
    Env:
      Name:   PASS_DEVICE_SPECS
      Value:  true
      Name:   FAIL_ON_INIT_ERROR
      Value:  true
      Name:   DEVICE_LIST_STRATEGY
      Value:  envvar
      Name:   DEVICE_ID_STRATEGY
      Value:  uuid
      Name:   NVIDIA_VISIBLE_DEVICES
      Value:  all
      Name:   NVIDIA_DRIVER_CAPABILITIES
      Value:  all
    Image:              k8s-device-plugin
    Image Pull Policy:  IfNotPresent
    Node Selector:
      nvidia.com/gpu.deploy.device-plugin:  true
    Priority Class Name:                    system-node-critical
    Repository:                             nvcr.io/nvidia
    Security Context:
      Privileged:  true
    Version:       v0.9.0-ubi8
  Driver:
    Enabled:            true
    Image:              driver
    Image Pull Policy:  IfNotPresent
    Licensing Config:
      Config Map Name:
    Node Selector:
      nvidia.com/gpu.deploy.driver:  true
    Priority Class Name:             system-node-critical
    Repo Config:
      Config Map Name:
      Destination Dir:
    Repository:  nvcr.io/nvidia
    Security Context:
      Privileged:  true
      Se Linux Options:
        Level:  s0
    Tolerations:
      Effect:    NoSchedule
      Key:       nvidia.com/gpu
      Operator:  Exists
    Version:     460.73.01
  Gfd:
    Env:
      Name:   GFD_SLEEP_INTERVAL
      Value:  60s
      Name:   GFD_FAIL_ON_INIT_ERROR
      Value:  true
    Image:              gpu-feature-discovery
    Image Pull Policy:  IfNotPresent
    Node Selector:
      nvidia.com/gpu.deploy.gpu-feature-discovery:  true
    Priority Class Name:                            system-node-critical
    Repository:                                     nvcr.io/nvidia
    Tolerations:
      Effect:    NoSchedule
      Key:       nvidia.com/gpu
      Operator:  Exists
    Version:     v0.4.1
  Mig:
    Strategy:  single
  Operator:
    Default Runtime:  containerd
    Init Container:
      Image:              cuda
      Image Pull Policy:  IfNotPresent
      Repository:         nvcr.io/nvidia
      Version:            11.2.1-base-ubi8
    Validator:
      Image:              cuda-sample
      Image Pull Policy:  IfNotPresent
      Repository:         nvcr.io/nvidia/k8s
      Version:            vectoradd-cuda10.2
  Toolkit:
    Image:              container-toolkit
    Image Pull Policy:  IfNotPresent
    Node Selector:
      nvidia.com/gpu.deploy.container-toolkit:  true
    Priority Class Name:                        system-node-critical
    Repository:                                 nvcr.io/nvidia/k8s
    Security Context:
      Privileged:  true
      Se Linux Options:
        Level:  s0
    Tolerations:
      Effect:    NoSchedule
      Key:       nvidia.com/gpu
      Operator:  Exists
    Version:     1.5.0-ubuntu18.04
Status:
  State:  ready
Events:   <none>
```