Cannot enable GDRcopy using Nvidia driver CRD due to wrong indentation in 0500_daemonset.yaml #713

Closed
age9990 opened this issue May 3, 2024 · 3 comments
age9990 commented May 3, 2024

1. Quick Debug Information

  • OS/Version (e.g. RHEL8.6, Ubuntu22.04): Ubuntu22.04
  • Kernel Version: 6.2.0-39
  • Container Runtime Type/Version (e.g. Containerd, CRI-O, Docker): CRI-O
  • K8s Flavor/Version (e.g. K8s, OCP, Rancher, GKE, EKS): K8s
  • GPU Operator Version: v24.3.0

2. Issue or feature description

When GDRcopy is enabled in the NVIDIA driver CR, the driver daemonset is not updated and the GPU Operator pod logs the following error:
{"level":"error","ts":"2024-05-03T06:29:33.398Z","msg":"Error while syncing state","controller":"nvidia-driver-controller","object":{"name":"default"},"namespace":"","name":"default","reconcileID":"a902d530-65d4-480e-8157-0e0c21d0a332","error":"failed to create k8s objects from manifests: failed to render kubernetes manifests: error rendering file /opt/gpu-operator/manifests/state-driver/0500_daemonset.yaml: failed to unmarshal manifest /opt/gpu-operator/manifests/state-driver/0500_daemonset.yaml: error converting YAML to JSON: yaml: line 195: did not find expected key"}
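
The error field in this entry appears to be a chain of wrapped messages, with the YAML parser's complaint at the very end. As a rough sketch for reading such an entry, the Go program below parses an abbreviated copy of the log line and prints one layer per line. The `entry` type and the `errorChain` helper are hypothetical names made up for this illustration, and splitting on ": " is only a heuristic that assumes no layer contains that separator itself.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// logLine is an abbreviated copy of the log entry above.
// Only the fields read by this sketch are kept.
const logLine = `{"level":"error","msg":"Error while syncing state","controller":"nvidia-driver-controller","error":"failed to create k8s objects from manifests: failed to render kubernetes manifests: error rendering file /opt/gpu-operator/manifests/state-driver/0500_daemonset.yaml: failed to unmarshal manifest /opt/gpu-operator/manifests/state-driver/0500_daemonset.yaml: error converting YAML to JSON: yaml: line 195: did not find expected key"}`

// entry is a hypothetical type that declares just the fields used here.
type entry struct {
	Level      string `json:"level"`
	Controller string `json:"controller"`
	Error      string `json:"error"`
}

// errorChain parses one JSON log line and splits its error message
// on ": " so that each wrapped layer becomes one element.
func errorChain(line string) ([]string, error) {
	var e entry
	if err := json.Unmarshal([]byte(line), &e); err != nil {
		return nil, err
	}
	return strings.Split(e.Error, ": "), nil
}

func main() {
	chain, err := errorChain(logLine)
	if err != nil {
		panic(err)
	}
	// Indent each layer one step further than the previous one.
	for i, layer := range chain {
		fmt.Printf("%s%s\n", strings.Repeat("  ", i), layer)
	}
}
```

The last element of the chain is the parser message, which is the part that points at the indentation problem.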

Looking at this file, the indentation is incorrect: lines L493 to L496 are each missing two spaces.

volumeMounts:
  - name: run-nvidia
    mountPath: /run/nvidia
    mountPropagation: HostToContainer
  - name: var-log
    mountPath: /var/log
  - name: dev-log
    mountPath: /dev/log
    readOnly: true
  {{- if and (.Openshift) (.Runtime.OpenshiftDriverToolkitEnabled) }}
  - name: shared-nvidia-driver-toolkit
    mountPath: /mnt/shared-nvidia-driver-toolkit
  {{- end}}
  {{- if and .AdditionalConfigs .AdditionalConfigs.VolumeMounts }}
  {{- range .AdditionalConfigs.VolumeMounts }}
- name: {{ .Name }}
  mountPath: {{ .MountPath }}
  subPath: {{ .SubPath }}
  readOnly: {{ .ReadOnly }}
  {{- end }}
  {{- end }}

Once I fixed the indentation and rebuilt the image, GDRcopy could be enabled without errors.
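
To illustrate why two missing spaces matter, here is a minimal sketch in Go using only the standard library. It renders a reduced version of the `volumeMounts` template twice, once with the `range` body two spaces short and once aligned with the static entries, then compares the indentation of the resulting list items. The `VolumeMount` type, the reduced templates, the `extra` mount, and the helper functions are all hypothetical stand-ins; this is not the operator's actual rendering code, and the indentation check is a simple heuristic, not a YAML parser.

```go
package main

import (
	"bytes"
	"fmt"
	"strings"
	"text/template"
)

// VolumeMount is a hypothetical stand-in that mirrors only the
// fields read by the template snippet in this issue.
type VolumeMount struct {
	Name      string
	MountPath string
	SubPath   string
	ReadOnly  bool
}

// misindented: the range body sits two spaces left of the static entry.
const misindented = `volumeMounts:
  - name: run-nvidia
    mountPath: /run/nvidia
  {{- range .VolumeMounts }}
- name: {{ .Name }}
  mountPath: {{ .MountPath }}
  subPath: {{ .SubPath }}
  readOnly: {{ .ReadOnly }}
  {{- end }}
`

// corrected: the range body is aligned with the static entry.
const corrected = `volumeMounts:
  - name: run-nvidia
    mountPath: /run/nvidia
  {{- range .VolumeMounts }}
  - name: {{ .Name }}
    mountPath: {{ .MountPath }}
    subPath: {{ .SubPath }}
    readOnly: {{ .ReadOnly }}
  {{- end }}
`

// render executes a template string against the given mounts.
func render(src string, mounts []VolumeMount) (string, error) {
	tmpl, err := template.New("daemonset").Parse(src)
	if err != nil {
		return "", err
	}
	var buf bytes.Buffer
	data := struct{ VolumeMounts []VolumeMount }{mounts}
	if err := tmpl.Execute(&buf, data); err != nil {
		return "", err
	}
	return buf.String(), nil
}

// itemIndents returns the leading-space count of every "- " list item line.
func itemIndents(rendered string) []int {
	var indents []int
	for _, line := range strings.Split(rendered, "\n") {
		trimmed := strings.TrimLeft(line, " ")
		if strings.HasPrefix(trimmed, "- ") {
			indents = append(indents, len(line)-len(trimmed))
		}
	}
	return indents
}

// consistent reports whether all list items share one indentation.
func consistent(indents []int) bool {
	for _, n := range indents {
		if n != indents[0] {
			return false
		}
	}
	return true
}

func main() {
	mounts := []VolumeMount{{Name: "extra", MountPath: "/extra", SubPath: "sub", ReadOnly: true}}
	variants := []struct{ name, src string }{
		{"misindented", misindented},
		{"corrected", corrected},
	}
	for _, v := range variants {
		out, err := render(v.src, mounts)
		if err != nil {
			panic(err)
		}
		indents := itemIndents(out)
		fmt.Printf("%s: item indents %v, consistent=%v\n", v.name, indents, consistent(indents))
	}
}
```

In this sketch, both variants render identically when `mounts` is empty, because the `range` body is never emitted. That is consistent with a template bug that stays hidden until the affected branch is actually rendered.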

cdesiniotis added the bug label (Issue/PR to expose/discuss/fix a bug) on May 6, 2024
cdesiniotis self-assigned this on May 6, 2024
cdesiniotis (Contributor) commented

@age9990 thanks for reporting this issue.

cdesiniotis (Contributor) commented

I have a fix out for this here: https://gitlab.com/nvidia/kubernetes/gpu-operator/-/merge_requests/1083

cdesiniotis (Contributor) commented

GPU Operator 24.6.0 has been released and contains the fix for this issue.
