Deviation from expected behavior:
If a user creates an invalid ObjectBucketClaim with an integer value for .spec.additionalConfig.maxObjects, it is accepted into the cluster. When the operator starts, it repeatedly logs a warning about the invalid bucket claim:
E0328 13:00:32.118449 14728 reflector.go:147] pkg/mod/k8s.io/client-go@v0.28.4/tools/cache/reflector.go:229: Failed to watch *v1alpha1.ObjectBucketClaim: failed to list *v1alpha1.ObjectBucketClaim: json: cannot unmarshal number into Go struct field ObjectBucketClaimSpec.items.spec.additionalConfig of type string
and eventually times out on startup with a final error:
2024-03-28 13:02:52.055424 C | rookcmd: failed to run operator: gave up to run the operator manager: failed to run the controller-runtime manager: [failed to wait for ceph-block-pool-controller caches to sync: timed out waiting for cache to be synced for Kind *v1.CephBlockPool, failed waiting for all runnables to end within grace period of 30s: context deadline exceeded]
The Rook operator then exits and the pod is restarted (CrashLoopBackOff).
Expected behavior:
The ObjectBucketClaim should be rejected at admission time, and/or the operator should be able to skip the broken resource and start correctly, continuing to log warnings.
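For the first option, a rough sketch of what schema validation on the OBC CRD could look like (illustrative excerpt only; the shipped CRD and its field layout may differ), so that the apiserver itself rejects an integer maxObjects:

apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: objectbucketclaims.objectbucket.io
spec:
  group: objectbucket.io
  names:
    kind: ObjectBucketClaim
    listKind: ObjectBucketClaimList
    plural: objectbucketclaims
    singular: objectbucketclaim
  scope: Namespaced
  versions:
    - name: v1alpha1
      served: true
      storage: true
      schema:
        openAPIV3Schema:
          type: object
          properties:
            spec:
              type: object
              # excerpt: other spec fields (generateBucketName, storageClassName, ...) omitted
              properties:
                additionalConfig:
                  type: object
                  # every value must be a string, matching the operator's map[string]string field
                  additionalProperties:
                    type: string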
How to reproduce it (minimal and precise):
Create an ObjectBucketClaim like this:
apiVersion: objectbucket.io/v1alpha1
kind: ObjectBucketClaim
metadata:
  name: this-breaks-cluster
spec:
  generateBucketName: sample
  storageClassName: general-s3
  additionalConfig:
    maxObjects: 1000 # should be maxObjects: "1000"
Restart the operator and observe the logs and the CrashLoopBackOff.
The string value works, of course: if you adjust the OBC to have maxObjects: "100", then on the next start of the operator the crash loop goes away and the operator runs.
The problem is that any user of our cluster can create broken YAML that is accepted (not validated) by the CRD. This then kills the operator on its next restart and breaks all operator operations for all users until we discover and fix it. Worse, the log doesn't mention which OBC has the problem, so finding the offending namespace and OBC is a manual search. Our initial search today was especially difficult because the offending OBC was created weeks ago and the operator was only restarted recently.
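In case it helps others, a search along these lines (assuming jq is available) can list the OBCs whose maxObjects was set as a number rather than a string:

# list namespace/name of OBCs where maxObjects is a JSON number
kubectl get objectbucketclaims -A -o json \
  | jq -r '.items[]
           | select((.spec.additionalConfig.maxObjects | type) == "number")
           | .metadata.namespace + "/" + .metadata.name'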
We've put a Gatekeeper rule in place as a band-aid for now, but this should be handled gracefully so that it doesn't take the operator down.
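For reference, the band-aid is a constraint roughly along these lines (illustrative, not our exact policy; the names are made up): it denies OBCs whose additionalConfig.maxObjects isn't a string.

# ConstraintTemplate with Rego that rejects non-string maxObjects values
apiVersion: templates.gatekeeper.sh/v1
kind: ConstraintTemplate
metadata:
  name: obcmaxobjectsstring
spec:
  crd:
    spec:
      names:
        kind: OBCMaxObjectsString
  targets:
    - target: admission.k8s.gatekeeper.sh
      rego: |
        package obcmaxobjectsstring

        violation[{"msg": msg}] {
          value := input.review.object.spec.additionalConfig.maxObjects
          not is_string(value)
          msg := "spec.additionalConfig.maxObjects must be a quoted string, e.g. \"1000\""
        }
---
# Constraint instance that applies the template to ObjectBucketClaims
apiVersion: constraints.gatekeeper.sh/v1beta1
kind: OBCMaxObjectsString
metadata:
  name: obc-maxobjects-must-be-string
spec:
  match:
    kinds:
      - apiGroups: ["objectbucket.io"]
        kinds: ["ObjectBucketClaim"]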
This is tested on:

Environment:
Kernel (uname -a): 6.5 / 6.6
Rook version (rook version inside of a Rook Pod): 1.13.5
Ceph version (ceph -v): 17.2.7
Kubernetes version (kubectl version): 1.26.12
Ceph health (ceph health in the Rook Ceph toolbox): ok