gpu-operator fails to start due to deletion of nonexistent resources #484

Closed
4 of 5 tasks
xknight opened this issue Feb 6, 2023 · 8 comments

Comments

@xknight

xknight commented Feb 6, 2023

1. Quick Debug Checklist

  • Are you running on an Ubuntu 18.04 node?
  • Are you running Kubernetes v1.13+?
  • Are you running Docker (>= 18.06) or CRIO (>= 1.13+)?
  • Do you have i2c_core and ipmi_msghandler loaded on the nodes?
  • Did you apply the CRD (kubectl describe clusterpolicies --all-namespaces)

1. Issue or feature description

Cleanup of PodSecurityPolicy (PSP) resources fails on k8s 1.25.4 / OKD 4.12 (tested with gpu-operator 22.9.2):

{"level":"info","ts":1675677539.5370145,"logger":"controllers.ClusterPolicy","msg":"Couldn't delete","PodSecurityPolicies":"gpu-operator-privileged","Error":"no matches for kind \"PodSecurityPolicy\" in version \"policy/v1beta1\""}
{"level":"error","ts":1675677539.5370853,"msg":"Reconciler error","controller":"clusterpolicy-controller","object":{"name":"cluster-policy"},"namespace":"","name":"cluster-policy","reconcileID":"2a1a3aa8-0a42-4425-95fd-0bd83f2d6ad7","error":"no matches for kind \"PodSecurityPolicy\" in version \"policy/v1beta1\""}

It seems the code here should treat a non-existent resource definition as "not found" and pass the condition, but it fails instead.
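The behavior requested above can be sketched as follows. This is only an illustration in Python, not the operator's actual code: `NotFoundError`, `NoKindMatchError`, `delete_ignoring_absent`, and `fake_delete_psp` are hypothetical names standing in for whatever errors and calls the client library really uses. The idea is that "the object is gone" and "the API no longer serves this kind" both mean there is nothing left to delete.

```python
class NotFoundError(Exception):
    """Hypothetical error: the named object does not exist."""


class NoKindMatchError(Exception):
    """Hypothetical error: the API server does not serve this kind/version."""


def delete_ignoring_absent(delete_fn, name):
    """Call delete_fn(name).

    Returns True if something was deleted, and False if the object or its
    whole resource kind is absent. Any other error still propagates.
    """
    try:
        delete_fn(name)
    except (NotFoundError, NoKindMatchError):
        return False
    return True


def fake_delete_psp(name):
    # Stand-in for a cluster where PodSecurityPolicy has been removed.
    raise NoKindMatchError(
        'no matches for kind "PodSecurityPolicy" in version "policy/v1beta1"'
    )


deleted = delete_ignoring_absent(fake_delete_psp, "gpu-operator-privileged")
print(deleted)
```

With handling along these lines, reconciliation could continue instead of returning a reconciler error when the PSP kind is missing.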

2. Steps to reproduce the issue

  1. Install the gpu-operator in an OKD 4.12 cluster
  2. Observe the logs for the gpu-operator pod

3. Other information

Commenting out the code block above allows the operator to start normally.

@shivamerla
Contributor

@xknight thanks for reporting this. If k8s version is >= 1.25.x we avoid creating/deleting PSP here. Need to check why that didn't work in your case. Which version of operator are you using?

@xknight
Author

xknight commented Feb 7, 2023

@shivamerla I tested this with the latest release (22.9.2). Is that codepath also for deleting resources, or only for adding them? I'm asking because I didn't receive any other panics around creating resources.

I'm happy to make a stack trace or test a patch if it helps.

Anecdotally, the operator appears to choose OpenShift codepaths automatically even when setting openshift: false or use_ocp_driver_toolkit: false (I tried these while trying to understand the issue better). OKD users can't use NFD (because it points to Red Hat registry images) or install the GPU Operator from the Operator Hub (because it's not there), so Helm needs to be used. I prefer Helm anyway, as it keeps things the same as on non-OKD/OpenShift clusters.

xknight closed this as completed Feb 8, 2023
xknight reopened this Feb 8, 2023
@xknight
Author

xknight commented Feb 8, 2023

(I closed the issue, then reopened it after realizing that the problem still persists.)

@shivamerla
Contributor

@xknight somehow the k8s version check seems to be failing in your case, and we end up adding PSP manifests for creation/deletion. Can you double-check whether you see this message in the operator logs?

PodSecurityPolicy no longer supported by API. Skipping...

@xknight
Author

xknight commented Feb 9, 2023

@shivamerla no, that's not in the log.

@xknight
Author

xknight commented Feb 9, 2023

The log reads:

{"level":"info","ts":1675919974.3011851,"logger":"controllers.ClusterPolicy","msg":"Kubernetes version detected","version":"v1.25.0-2653+a34b9e9499e6c3-dirty"}

This is the version provided by OKD 4.12.0-0.okd-2023-01-21-055900; there is one newer version available now, but I doubt that the "structure" of the semantic version will be different.

@shivamerla
Contributor

@xknight that explains it. We need to change the constraint in the version check to v1.25.0-0 to handle these cases. Will change it as part of the next patch release.
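To illustrate why the `-0` suffix matters, here is a minimal sketch of Semantic Versioning 2.0.0 precedence in plain Python. It is not the operator's code or its constraint library, which may apply additional rules for pre-release versions; `precedence_key` and `satisfies_at_least` are names made up for this example. Under semver ordering, a version with a pre-release tag such as `-2653` sorts below the plain release `1.25.0`, while `1.25.0-0` is the lowest possible pre-release and therefore sorts below it.

```python
import re

SEMVER_RE = re.compile(
    r"^v?(\d+)\.(\d+)\.(\d+)"      # major.minor.patch, optional leading "v"
    r"(?:-([0-9A-Za-z.-]+))?"      # optional pre-release identifiers
    r"(?:\+([0-9A-Za-z.-]+))?$"    # optional build metadata (ignored in ordering)
)


def precedence_key(version):
    """Return a tuple that sorts in semver 2.0.0 precedence order."""
    m = SEMVER_RE.match(version)
    if m is None:
        raise ValueError(f"not a semantic version: {version!r}")
    major, minor, patch = int(m[1]), int(m[2]), int(m[3])
    pre = m[4]
    if pre is None:
        # A plain release ranks above every pre-release of the same core version.
        return (major, minor, patch, 1, ())
    # Numeric identifiers compare as integers and rank below alphanumeric ones.
    ids = tuple(
        (0, int(part), "") if part.isdigit() else (1, 0, part)
        for part in pre.split(".")
    )
    return (major, minor, patch, 0, ids)


def satisfies_at_least(version, minimum):
    """True if version >= minimum in semver precedence."""
    return precedence_key(version) >= precedence_key(minimum)


detected = "v1.25.0-2653+a34b9e9499e6c3-dirty"  # version string from the log above
print(satisfies_at_least(detected, "1.25.0"))    # pre-release sorts below the release
print(satisfies_at_least(detected, "1.25.0-0"))  # "-0" is the lowest pre-release
```

In this sketch the first check fails and the second passes, which matches the behavior described in the thread: the OKD version string carries a pre-release tag, so a `>= 1.25.0` style check does not recognize it as 1.25 or newer.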

@shivamerla
Contributor

@xknight this has been fixed with v23.3.0 of the operator.
