
Applying cluster.yaml on v1.13.8: failed calling webhook "cephcluster-wh-rook-ceph-admission-controller-rook-ceph.rook.io": connect: connection refused #14116

Closed
maon-fp opened this issue Apr 23, 2024 · 23 comments

maon-fp commented Apr 23, 2024

I've upgraded Rook from v1.10.11 to v1.13.8 step by step (v1.10.11 -> v1.11.11 -> v1.12.11 -> v1.13.8). At https://rook.github.io/docs/rook/v1.13/Upgrade/rook-upgrade/ I read that the admission controller is gone (it was enabled in my setup via ROOK_DISABLE_ADMISSION_CONTROLLER: "false"), so I changed this to ROOK_DISABLE_ADMISSION_CONTROLLER: "true" while still running v1.12.11.
The upgrade to v1.13.8 went smoothly. Now I want to upgrade to Reef, but applying the cluster.yaml gives me:

rook $ kaf 04-cluster-prod.yaml
Error from server (InternalError): error when applying patch:
{"metadata":{"annotations":{"kubectl.kubernetes.io/last-applied-configuration":"{\"apiVersion\":\"ceph.rook.io/v1\",\"kind\":\"CephCluster\",\"metadata\":{\"annotations\":{},\"name\":\"rook-ceph\",\"namespace\":\"rook-ceph\"},\"spec\":{\"annotations\":null,\"cephVersion\":{\"allowUnsupported\":false,\"image\":\"quay.io/ceph/ceph:v18.2.2\"},\"cleanupPolicy\":{\"allowUninstallWithVolumes\":false,\"confirmation\":\"\",\"sanitizeDisks\":{\"dataSource\":\"zero\",\"iteration\":1,\"method\":\"quick\"}},\"continueUpgradeAfterChecksEvenIfNotHealthy\":false,\"crashCollector\":{\"disable\":false},\"csi\":{\"cephfs\":null,\"readAffinity\":{\"enabled\":false}},\"dashboard\":{\"enabled\":true,\"ssl\":true},\"dataDirHostPath\":\"/var/lib/rook\",\"disruptionManagement\":{\"managePodBudgets\":true,\"osdMaintenanceTimeout\":30,\"pgHealthCheckTimeout\":0},\"healthCheck\":{\"daemonHealth\":{\"mon\":{\"disabled\":false,\"interval\":\"45s\"},\"osd\":{\"disabled\":false,\"interval\":\"60s\"},\"status\":{\"disabled\":false,\"interval\":\"60s\"}},\"livenessProbe\":{\"mgr\":{\"disabled\":false},\"mon\":{\"disabled\":false},\"osd\":{\"disabled\":false}},\"startupProbe\":{\"mgr\":{\"disabled\":false},\"mon\":{\"disabled\":false},\"osd\":{\"disabled\":false}}},\"labels\":null,\"logCollector\":{\"enabled\":true,\"maxLogSize\":\"500M\",\"periodicity\":\"daily\"},\"mgr\":{\"allowMultiplePerNode\":true,\"count\":2,\"modules\":null},\"mon\":{\"allowMultiplePerNode\":true,\"count\":3},\"monitoring\":{\"enabled\":false,\"metricsDisabled\":false},\"network\":{\"connections\":{\"compression\":{\"enabled\":false},\"encryption\":{\"enabled\":false},\"requireMsgr2\":false}},\"priorityClassNames\":{\"mgr\":\"system-cluster-critical\",\"mon\":\"system-node-critical\",\"osd\":\"system-node-critical\"},\"removeOSDsIfOutAndSafeToRemove\":false,\"resources\":null,\"skipUpgradeChecks\":false,\"storage\":{\"config\":null,\"nodes\":[{\"devices\":[{\"config\":{\"osdsPerDevice\":\"4\"},\"name\":\"nvme0n1\"},{\"config\":{\"osdsPerDevice\":\"4\"},\"name\":\"nvme1n1\"},{\"config\":{\"osdsPerDevice\":\"4\"},\"name\":\"nvme3n1\"}],\"name\":\"storage1.<redacted>\"},{\"devices\":[{\"config\":{\"osdsPerDevice\":\"4\"},\"name\":\"nvme0n1\"},{\"config\":{\"osdsPerDevice\":\"4\"},\"name\":\"nvme2n1\"},{\"config\":{\"osdsPerDevice\":\"4\"},\"name\":\"nvme3n1\"}],\"name\":\"storage2.<redacted>\"}],\"onlyApplyOSDPlacement\":false,\"useAllDevices\":false,\"useAllNodes\":false},\"waitTimeoutForHealthyOSDInMinutes\":10}}\n"}},"spec":{"cephVersion":{"image":"quay.io/ceph/ceph:v18.2.2"},"csi":{"cephfs":null,"readAffinity":{"enabled":false}},"mgr":{"modules":null}}}
to:
Resource: "ceph.rook.io/v1, Resource=cephclusters", GroupVersionKind: "ceph.rook.io/v1, Kind=CephCluster"
Name: "rook-ceph", Namespace: "rook-ceph"
for: "04-cluster-prod.yaml": error when patching "04-cluster-prod.yaml": Internal error occurred: failed calling webhook "cephcluster-wh-rook-ceph-admission-controller-rook-ceph.rook.io": failed to call webhook: Post "https://rook-ceph-admission-controller.rook-ceph.svc:443/validate-ceph-rook-io-v1-cephcluster?timeout=5s": dial tcp 10.99.221.127:443: connect: connection refused

Environment:

  • OS (e.g. from /etc/os-release): Ubuntu 20.04.6 LTS (Focal Fossa)
  • Kernel (e.g. uname -a): 5.15.0-105-generic
  • Cloud provider or hardware configuration:
  • Rook version (use rook version inside of a Rook Pod): v1.13.8
  • Storage backend version (e.g. for ceph do ceph -v): ceph version 17.2.6 (d7ff0d10654d2280e08f1ab989c7cdf3064446a5) quincy (stable)
  • Kubernetes version (use kubectl version): v1.29.2
  • Kubernetes cluster type (e.g. Tectonic, GKE, OpenShift):
  • Storage backend status (e.g. for Ceph use ceph health in the Rook Ceph toolbox): HEALTH_OK
maon-fp added the bug label Apr 23, 2024

maon-fp commented Apr 23, 2024

My cluster.yaml: 04-cluster-prod.txt

maon-fp commented Apr 23, 2024

Operator shows no errors or warnings.

travisn commented Apr 23, 2024

@subhamkrai What are the steps to manually disable the admission controller? I can't seem to find it from previous issues.

subhamkrai commented:

> @subhamkrai What are the steps to manually disable the admission controller? I can't seem to find it from previous issues.

I don't remember exactly, but setting the value to true should work. If that doesn't work, try deleting the validating webhook rook-ceph-webhook.

@maon-fp

maon-fp commented Apr 25, 2024

Thank you for your replies.

> I don't remember exactly, but setting the value to true should work. If that doesn't work, try deleting the validating webhook rook-ceph-webhook.

Are those supposed to be pods? I don't have any of those. I'm currently at v1.13.8, and there is no ROOK_DISABLE_ADMISSION_CONTROLLER setting anymore. How can I set it to true now?

subhamkrai commented:

It was there until 1.12, and we removed it in 1.13: https://github.com/rook/rook/blob/release-1.12/deploy/examples/operator.yaml#L509
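
For reference, a sketch of how that setting looked in the v1.12 manifests (the exact location, the operator Deployment's env section or the rook-ceph-operator-config ConfigMap, depends on how Rook was installed):

# v1.12 and earlier only; the setting (and the admission controller itself) is gone in v1.13
- name: ROOK_DISABLE_ADMISSION_CONTROLLER
  value: "true"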

validating webhook rook-ceph-webhook is not a pod; it's a Kubernetes resource. Try kubectl get validatingwebhookconfigurations rook-ceph-webhook.

maon-fp commented Apr 25, 2024

@subhamkrai Thank you for pointing me in the right direction. I can see those resources:

$ kubectl api-resources --verbs=list -n rook-ceph | grep hook
mutatingwebhookconfigurations                       admissionregistration.k8s.io/v1   false        MutatingWebhookConfiguration
validatingwebhookconfigurations                     admissionregistration.k8s.io/v1   false        ValidatingWebhookConfiguration
$ kubectl api-resources --verbs=list -n rook-ceph | grep val
validatingwebhookconfigurations                     admissionregistration.k8s.io/v1   false        ValidatingWebhookConfiguration

So none of the ones you mentioned, right?

> It was there until 1.12, and we removed it in 1.13: https://github.com/rook/rook/blob/release-1.12/deploy/examples/operator.yaml#L509

So there's no way to set it to true now?

subhamkrai commented:

@maon-fp could you also share the service list in the rook-ceph namespace?

subhamkrai commented:

Also, could you share the first 10 lines of the Rook operator pod's logs?

maon-fp commented Apr 25, 2024

Yes, of course.

List of services:

$ kgs                                                                                                                                                         production:rook-ceph 
NAME                             TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)             AGE
csi-rbdplugin-metrics            ClusterIP   10.104.212.46    <none>        8080/TCP,8081/TCP   3y104d
rook-ceph-admission-controller   ClusterIP   10.99.221.127    <none>        443/TCP             2y2d
rook-ceph-mgr                    ClusterIP   10.109.30.124    <none>        9283/TCP            3y104d
rook-ceph-mgr-dashboard          ClusterIP   10.107.242.106   <none>        8443/TCP            3y104d
rook-ceph-mon-a                  ClusterIP   10.101.39.245    <none>        6789/TCP,3300/TCP   3y104d
rook-ceph-mon-c                  ClusterIP   10.110.130.143   <none>        6789/TCP,3300/TCP   3y104d
rook-ceph-mon-d                  ClusterIP   10.110.86.107    <none>        6789/TCP,3300/TCP   3y104d

First lines of operator log:

$ kl rook-ceph-operator-9f688fcc5-v2q6j | head -n 10                                                                                                          production:rook-ceph 
2024/04/23 14:00:19 maxprocs: Leaving GOMAXPROCS=24: CPU quota undefined
2024-04-23 14:00:19.215493 I | rookcmd: starting Rook v1.13.8 with arguments '/usr/local/bin/rook ceph operator'
2024-04-23 14:00:19.215514 I | rookcmd: flag values: --enable-machine-disruption-budget=false, --help=false, --kubeconfig=, --log-level=INFO
2024-04-23 14:00:19.215519 I | cephcmd: starting Rook-Ceph operator
2024-04-23 14:00:19.322061 I | cephcmd: base ceph version inside the rook operator image is "ceph version 18.2.2 (531c0d11a1c5d39fbfe6aa8a521f023abf3bf3e2) reef (stable)"
2024-04-23 14:00:19.332548 I | op-k8sutil: ROOK_CURRENT_NAMESPACE_ONLY="false" (env var)
2024-04-23 14:00:19.332558 I | operator: watching all namespaces for Ceph CRs
2024-04-23 14:00:19.332604 I | operator: setting up schemes
2024-04-23 14:00:19.335083 I | operator: setting up the controller-runtime manager
2024-04-23 14:00:19.335422 I | ceph-cluster-controller: successfully started

subhamkrai commented:

The logs didn't help much, but yes, delete the following resources in the rook-ceph namespace (probably):

  • Certificate rook-admission-controller-cert
  • Issuer selfsigned-issuer
  • Service rook-ceph-admission-controller

Also, could you share the -o yaml output of the Certificate and Issuer mentioned above, to make sure you're deleting the right resources? But yes, we need to clean up those three resources.
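
In kubectl terms, that inspection and cleanup would look roughly like this (resource names as given above; the Certificate and Issuer commands assume cert-manager's CRDs, which this setup uses):

kubectl -n rook-ceph get certificate rook-admission-controller-cert -o yaml
kubectl -n rook-ceph get issuer selfsigned-issuer -o yaml
kubectl -n rook-ceph delete certificate rook-admission-controller-cert
kubectl -n rook-ceph delete issuer selfsigned-issuer
kubectl -n rook-ceph delete service rook-ceph-admission-controller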

maon-fp commented Apr 26, 2024

rook-admission-controller-cert:

apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  creationTimestamp: "2022-04-23T18:45:33Z"
  generation: 1
  name: rook-admission-controller-cert
  namespace: rook-ceph
  resourceVersion: "301286319"
  uid: 22aa348f-e223-4f98-870e-aab4ef1f71a9
spec:
  dnsNames:
  - rook-ceph-admission-controller
  - rook-ceph-admission-controller.rook-ceph.svc
  - rook-ceph-admission-controller.rook-ceph.svc.cluster.local
  issuerRef:
    kind: Issuer
    name: selfsigned-issuer
  secretName: rook-ceph-admission-controller
status:
  conditions:
  - lastTransitionTime: "2022-04-23T18:45:34Z"
    message: Certificate is up to date and has not expired
    observedGeneration: 1
    reason: Ready
    status: "True"
    type: Ready
  notAfter: "2024-07-11T18:45:34Z"
  notBefore: "2024-04-12T18:45:34Z"
  renewalTime: "2024-06-11T18:45:34Z"
  revision: 13

selfsigned-issuer:

apiVersion: cert-manager.io/v1
kind: Issuer
metadata:
  creationTimestamp: "2022-04-23T18:45:32Z"
  generation: 1
  name: selfsigned-issuer
  namespace: rook-ceph
  resourceVersion: "138597982"
  uid: 68162730-aade-4670-b830-1cf97005ef5c
spec:
  selfSigned: {}
status:
  conditions:
  - lastTransitionTime: "2022-04-23T18:45:32Z"
    observedGeneration: 1
    reason: IsReady
    status: "True"
    type: Ready

rook-ceph-admission-controller:

apiVersion: v1
kind: Service
metadata:
  creationTimestamp: "2022-04-23T18:45:34Z"
  name: rook-ceph-admission-controller
  namespace: rook-ceph
  resourceVersion: "214711462"
  uid: b62cac4d-ce0c-4f3d-aa19-ff2f9d9d553c
spec:
  clusterIP: 10.99.221.127
  clusterIPs:
  - 10.99.221.127
  internalTrafficPolicy: Cluster
  ipFamilies:
  - IPv4
  ipFamilyPolicy: SingleStack
  ports:
  - port: 443
    protocol: TCP
    targetPort: 9443
  selector:
    app: rook-ceph-operator
  sessionAffinity: None
  type: ClusterIP
status:
  loadBalancer: {}

maon-fp commented Apr 26, 2024

I deleted those resources but still get (a slightly different) error:

Error from server (InternalError): error when applying patch:
{"metadata":{"annotations":{"kubectl.kubernetes.io/last-applied-configuration":"{\"apiVersion\":\"ceph.rook.io/v1\",\"kind\":\"CephCluster\",\"metadata\":{\"annotations\":{},\"name\":\"rook-ceph\",\"namespace\":\"rook-ceph\"},\"spec\":{\"annotations\":null,\"cephVersion\":{\"allowUnsupported\":false,\"image\":\"quay.io/ceph/ceph:v18.2.2\"},\"cleanupPolicy\":{\"allowUninstallWithVolumes\":false,\"confirmation\":\"\",\"sanitizeDisks\":{\"dataSource\":\"zero\",\"iteration\":1,\"method\":\"quick\"}},\"continueUpgradeAfterChecksEvenIfNotHealthy\":false,\"crashCollector\":{\"disable\":false},\"csi\":{\"cephfs\":null,\"readAffinity\":{\"enabled\":false}},\"dashboard\":{\"enabled\":true,\"ssl\":true},\"dataDirHostPath\":\"/var/lib/rook\",\"disruptionManagement\":{\"managePodBudgets\":true,\"osdMaintenanceTimeout\":30,\"pgHealthCheckTimeout\":0},\"healthCheck\":{\"daemonHealth\":{\"mon\":{\"disabled\":false,\"interval\":\"45s\"},\"osd\":{\"disabled\":false,\"interval\":\"60s\"},\"status\":{\"disabled\":false,\"interval\":\"60s\"}},\"livenessProbe\":{\"mgr\":{\"disabled\":false},\"mon\":{\"disabled\":false},\"osd\":{\"disabled\":false}},\"startupProbe\":{\"mgr\":{\"disabled\":false},\"mon\":{\"disabled\":false},\"osd\":{\"disabled\":false}}},\"labels\":null,\"logCollector\":{\"enabled\":true,\"maxLogSize\":\"500M\",\"periodicity\":\"daily\"},\"mgr\":{\"allowMultiplePerNode\":true,\"count\":2,\"modules\":null},\"mon\":{\"allowMultiplePerNode\":true,\"count\":3},\"monitoring\":{\"enabled\":false,\"metricsDisabled\":false},\"network\":{\"connections\":{\"compression\":{\"enabled\":false},\"encryption\":{\"enabled\":false},\"requireMsgr2\":false}},\"priorityClassNames\":{\"mgr\":\"system-cluster-critical\",\"mon\":\"system-node-critical\",\"osd\":\"system-node-critical\"},\"removeOSDsIfOutAndSafeToRemove\":false,\"resources\":null,\"skipUpgradeChecks\":false,\"storage\":{\"config\":null,\"nodes\":[{\"devices\":[{\"config\":{\"osdsPerDevice\":\"4\"},\"name\":\"nvme0n1\"},{\"config\":{\"osdsPerDevice\":\"4\"},\"name\":\"nvme1n1\"},{\"config\":{\"osdsPerDevice\":\"4\"},\"name\":\"nvme3n1\"}],\"name\":\"storage1.<redacted>\"},{\"devices\":[{\"config\":{\"osdsPerDevice\":\"4\"},\"name\":\"nvme0n1\"},{\"config\":{\"osdsPerDevice\":\"4\"},\"name\":\"nvme2n1\"},{\"config\":{\"osdsPerDevice\":\"4\"},\"name\":\"nvme3n1\"}],\"name\":\"storage2.<redacted>\"}],\"onlyApplyOSDPlacement\":false,\"useAllDevices\":false,\"useAllNodes\":false},\"waitTimeoutForHealthyOSDInMinutes\":10}}\n"}},"spec":{"cephVersion":{"image":"quay.io/ceph/ceph:v18.2.2"},"csi":{"cephfs":null,"readAffinity":{"enabled":false}},"mgr":{"modules":null}}}
to:
Resource: "ceph.rook.io/v1, Resource=cephclusters", GroupVersionKind: "ceph.rook.io/v1, Kind=CephCluster"
Name: "rook-ceph", Namespace: "rook-ceph"
for: "04-cluster-prod.yaml": error when patching "04-cluster-prod.yaml": Internal error occurred: failed calling webhook "cephcluster-wh-rook-ceph-admission-controller-rook-ceph.rook.io": failed to call webhook: Post "https://rook-ceph-admission-controller.rook-ceph.svc:443/validate-ceph-rook-io-v1-cephcluster?timeout=5s": service "rook-ceph-admission-controller" not found

I've also listed all resources in the namespace (list_rook_ceph.txt) and can still find some admission controller resources:

$ grep admission list_rook_ceph.txt
secret/rook-ceph-admission-controller               kubernetes.io/tls                     3      2y3d
secret/rook-ceph-admission-controller-token-s47d8   kubernetes.io/service-account-token   3      3y105d
serviceaccount/rook-ceph-admission-controller   1         3y105d

subhamkrai commented:

Try deleting the resources mentioned above.

maon-fp commented Apr 26, 2024

As stated before: those resources are already deleted. But now it complains about service "rook-ceph-admission-controller" not found instead of connection refused.

subhamkrai commented Apr 26, 2024

Try kubectl get validatingwebhookconfigurations -A (search for it across all namespaces once). Also, I'm on holiday today, so I will look on Monday.

Edit: I hope this isn't blocking you.

maon-fp commented Apr 26, 2024

Thank you. Enjoy your time off! I'm not really blocked.

$ kubectl get validatingwebhookconfigurations -A
NAME                            WEBHOOKS   AGE
cert-manager-webhook            1          3y116d
ingress-nginx-admission         1          432d
metallb-webhook-configuration   7          432d
rook-ceph-webhook               5          2y3d

subhamkrai commented:

> Thank you. Enjoy your time off! I'm not really blocked.
>
> $ kubectl get validatingwebhookconfigurations -A
> NAME                            WEBHOOKS   AGE
> cert-manager-webhook            1          3y116d
> ingress-nginx-admission         1          432d
> metallb-webhook-configuration   7          432d
> rook-ceph-webhook               5          2y3d

I see the issue: you need to delete the rook-ceph-webhook (I forgot that webhook configurations are cluster-scoped resources). Here is the code:

func deleteWebhookResources(ctx context.Context, certMgrClient *cs.CertmanagerV1Client, clusterdContext *clusterd.Context) {
	logger.Infof("deleting validating webhook %s", webhookConfigName)
	err := clusterdContext.Clientset.AdmissionregistrationV1().ValidatingWebhookConfigurations().Delete(ctx, webhookConfigName, metav1.DeleteOptions{})
	if err != nil && !apierrors.IsNotFound(err) {
		logger.Errorf("failed to delete validating webhook %s. %v", webhookConfigName, err)
	}
	logger.Infof("deleting webhook cert manager Certificate %s", certificateName)
	err = certMgrClient.Certificates(namespace).Delete(ctx, certificateName, metav1.DeleteOptions{})
	if err != nil && !apierrors.IsNotFound(err) {
		logger.Errorf("failed to delete webhook cert manager Certificate %s. %v", certificateName, err)
	}
	logger.Infof("deleting webhook cert manager Issuer %q", issuerName)
	err = certMgrClient.Issuers(namespace).Delete(ctx, issuerName, metav1.DeleteOptions{})
	if err != nil && !apierrors.IsNotFound(err) {
		logger.Errorf("failed to delete webhook cert manager Issuer %s. %v", issuerName, err)
	}
	logger.Infof("deleting validating webhook service %q", admissionControllerAppName)
	err = clusterdContext.Clientset.CoreV1().Services(namespace).Delete(ctx, admissionControllerAppName, metav1.DeleteOptions{})
	if err != nil && !apierrors.IsNotFound(err) {
		logger.Errorf("failed to delete validating webhook service %s. %v", admissionControllerAppName, err)
	}
}
That deletes everything related to the webhook in Rook.
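
For anyone not reading Go: a manual kubectl equivalent of what that function deletes, assuming the default resource names used in this thread (the Certificate, Issuer, and Service were already removed above, so only the first command is still needed here):

# the webhook configuration is cluster-scoped, so no namespace flag
kubectl delete validatingwebhookconfiguration rook-ceph-webhook
kubectl -n rook-ceph delete certificate rook-admission-controller-cert
kubectl -n rook-ceph delete issuer selfsigned-issuer
kubectl -n rook-ceph delete service rook-ceph-admission-controller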

maon-fp commented Apr 26, 2024

Alright. I'm not into Go but I'll figure it out. Thank you for your help!

maon-fp commented Apr 29, 2024

Just to be 100% sure: are you asking me to run

kubectl delete validatingwebhookconfigurations rook-ceph-webhook

? I'm a bit worried, as I can see 5 webhooks listed there.

subhamkrai commented:

> Just to be 100% sure: are you asking me to run
>
> kubectl delete validatingwebhookconfigurations rook-ceph-webhook
>
> ? I'm a bit worried, as I can see 5 webhooks listed there.

Yes, delete rook-ceph-webhook only. The "5" is the number of webhook entries inside that single configuration, all of them Rook's.
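
Once it is deleted, a quick check should show that only Rook's webhook configuration is gone, and the apply should then go through:

kubectl get validatingwebhookconfigurations   # rook-ceph-webhook should no longer be listed
kubectl apply -f 04-cluster-prod.yaml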

maon-fp commented Apr 30, 2024

It worked. Thanks a lot for the quick and competent answers! 🙇

maon-fp closed this as completed Apr 30, 2024
subhamkrai commented:

Good to know it's working now, @maon-fp.
