kops v1.26 upgrade fails due to serviceAccountIssuer changes #16488

Closed
elliotdobson opened this issue Apr 23, 2024 · 6 comments
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments

@elliotdobson
Contributor

/kind bug

1. What kops version are you running? The command kops version will display
this information.

Client version: 1.26.6 (git-v1.26.6)
or
Client version: 1.25.4 (git-v1.25.4)

2. What Kubernetes version are you running? kubectl version will print the
version if a cluster is running or provide the Kubernetes version specified as
a kops flag.

Server Version: v1.25.16

3. What cloud provider are you using?
AWS

4. What commands did you run? What is the simplest way to reproduce this issue?
Upgrading from kops v1.25.4 to v1.26.6:

  1. Use a custom masterInternalName in the kops cluster spec.
  2. Run kops update cluster.
  3. Roll the first control-plane instance with kops rolling-update cluster (illustrative commands sketched below).
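
For reference, those steps roughly correspond to commands like the following (a sketch only; the cluster name matches the redacted manifest below and the instance group name is a placeholder, not from this report):

  kops update cluster --name cluster.domain --yes
  kops rolling-update cluster --name cluster.domain --instance-group control-plane-1 --yes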

5. What happened after the commands executed?
The cluster fails to validate since calico-node fails to start on the new control-plane instance.

The install-cni init container in the calico-node pod fails with an error message:

2024-04-22 23:06:36.650 [INFO][1] cni-installer/<nil> <nil>: CNI plugin version: v3.24.5

2024-04-22 23:06:36.650 [INFO][1] cni-installer/<nil> <nil>: /host/secondary-bin-dir is not writeable, skipping
W0422 23:06:36.651145       1 client_config.go:617] Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.
2024-04-22 23:06:36.664 [ERROR][1] cni-installer/<nil> <nil>: Unable to create token for CNI kubeconfig error=Unauthorized
2024-04-22 23:06:36.664 [FATAL][1] cni-installer/<nil> <nil>: Unable to create token for CNI kubeconfig error=Unauthorized

Other pods on the new control-plane node that depend on the Kubernetes API also log similar Unauthorized errors.

6. What did you expect to happen?
The upgrade from kops v1.25.4 to v1.26.6 to complete successfully.

7. Please provide your cluster manifest. Execute
kops get --name my.example.com -o yaml to display your cluster manifest.
You may want to remove your cluster name and other sensitive information.

apiVersion: kops.k8s.io/v1alpha2
kind: Cluster
metadata:
  name: cluster.domain
spec:
  api:
    loadBalancer:
      class: Classic
      idleTimeoutSeconds: 3600
      type: Internal
  authorization:
    rbac: {}
...
  iam:
    allowContainerRegistry: true
    legacy: false
  kubeAPIServer:
    authorizationMode: Node,RBAC
    featureGates:
      HPAScaleToZero: "true"
  kubeControllerManager:
    horizontalPodAutoscalerDownscaleStabilization: 15m0s
  kubeDNS:
    nodeLocalDNS:
      enabled: true
    provider: CoreDNS
  kubeProxy:
    metricsBindAddress: 0.0.0.0
  kubeScheduler: {}
  kubelet:
    anonymousAuth: false
    authenticationTokenWebhook: true
    authorizationMode: Webhook
  kubernetesVersion: 1.25.16
  masterInternalName: master.cluster.domain
  masterPublicName: api.cluster.domain
...

The cluster spec update diff from kops update cluster shows the masterInternalName being removed and changes to the serviceAccountIssuer & serviceAccountJWKSURI fields:

ManagedFile/cluster-completed.spec
Contents            
...
	- X-Remote-User
	securePort: 443
+     serviceAccountIssuer: https://api.internal.cluster.domain
-     serviceAccountIssuer: https://master.cluster.domain
+     serviceAccountJWKSURI: https://api.internal.cluster.domain/openid/v1/jwks
-     serviceAccountJWKSURI: https://master.cluster.domain/openid/v1/jwks
	serviceClusterIPRange: 100.64.0.0/13
	storageBackend: etcd3
...
kubernetesVersion: 1.25.16
-   masterInternalName: master.cluster.domain
masterKubelet:
	anonymousAuth: false
...

8. Please run the commands with most verbose logging by adding the -v 10 flag.
Paste the logs into this report, or in a gist and provide the gist link here.

9. Anything else do we need to know?
It seems that in kops v1.26 the ability to specify masterInternalName was removed (#14507), which is why we noticed this problem during the upgrade.

However, I can reproduce the same issue by simply removing masterInternalName from our cluster spec, applying it to the cluster, and rolling the first control-plane node (all using kops v1.25.4). That change produces the same update diff as posted above.

To move forward it seems the change to the serviceAccountIssuer & serviceAccountJWKSURI fields is inevitable, but it is not clear what steps to take to roll it out. Is anyone able to advise?

@k8s-ci-robot added the kind/bug label Apr 23, 2024
@hakman
Member

hakman commented Apr 23, 2024

@elliotdobson You definitely need to set serviceAccountIssuer and serviceAccountJWKSURI to the old values.
I would add that you may want to set additionalSANs (or additionalSans) to master.cluster.domain.
My suggestion would be to try this on a new, small test cluster first.
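
For example, pinning the old values could look something like this in the cluster spec (a sketch only; the exact placement of additionalSANs may vary by kops version, and the hostnames are the ones from this report):

  spec:
    api:
      additionalSANs:
        - master.cluster.domain
    kubeAPIServer:
      serviceAccountIssuer: https://master.cluster.domain
      serviceAccountJWKSURI: https://master.cluster.domain/openid/v1/jwks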

@elliotdobson
Contributor Author

Thanks for your response @hakman.

We decided to move forward with the change to serviceAccountIssuer & serviceAccountJWKSURI rather than fight it by specifying our existing value.

Based on this comment we were able to come up with the following procedure that allowed us to migrate the Service Account Issuer (SAI) non-disruptively on two clusters:

  1. Add additionalSans with the masterInternalName value to the kops cluster spec
  2. Remove masterInternalName from the kops cluster spec
  3. Add the modify-kube-api-manifest hook (existing SAI as primary) to the master instancegroups:

  hooks:
  - name: modify-kube-api-manifest
    before:
      - kubelet.service
    manifest: |
      User=root
      Type=oneshot
      ExecStart=/bin/bash -c "until [ -f /etc/kubernetes/manifests/kube-apiserver.manifest ];do sleep 5;done;sed -i '/- --service-account-issuer=https:\/\/api.internal.cluster.domain/i\ \ \ \ - --service-account-issuer=https:\/\/master.cluster.domain' /etc/kubernetes/manifests/kube-apiserver.manifest"

  4. Apply the changes to the cluster
  5. Roll the masters
  6. Manually create the master.cluster.domain DNS record pointing at the masters' IP addresses
  7. If the kubernetes-services-endpoint configMap exists (and you are using the calico CNI), update the KUBERNETES_SERVICE_HOST value to api.internal.cluster.domain in that configMap (kube-system namespace) and roll the calico daemonset (see the sketch after this list)
  8. Update the modify-kube-api-manifest hook (switching the primary/secondary SAI) on the master instancegroups:

  hooks:
  - name: modify-kube-api-manifest
    before:
      - kubelet.service
    manifest: |
      User=root
      Type=oneshot
      ExecStart=/bin/bash -c "until [ -f /etc/kubernetes/manifests/kube-apiserver.manifest ];do sleep 5;done;sed -i '/- --service-account-issuer=https:\/\/api.internal.cluster.domain/a\ \ \ \ - --service-account-issuer=https:\/\/master.cluster.domain' /etc/kubernetes/manifests/kube-apiserver.manifest"

  9. Apply the changes to the cluster
  10. Roll the masters (remember to update the master.cluster.domain DNS record with the new IP addresses of the masters)
  11. Roll all the nodes in the cluster
  12. Wait 24 hours until the dynamic SA tokens have refreshed
  13. Remove the modify-kube-api-manifest hook from the master instancegroups
  14. Remove additionalSans from the kops cluster spec
  15. Apply the changes to the cluster
  16. Roll the masters
  17. Delete the master.cluster.domain DNS record
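
For step 7, the configMap update and calico restart can be done with something like the following (illustrative only; it assumes the calico DaemonSet is named calico-node in kube-system):

  kubectl -n kube-system patch configmap kubernetes-services-endpoint --type merge -p '{"data":{"KUBERNETES_SERVICE_HOST":"api.internal.cluster.domain"}}'
  kubectl -n kube-system rollout restart daemonset/calico-node

The ordering produced by the two hooks matters because kube-apiserver accepts --service-account-issuer multiple times: the first value is used to sign new tokens, while all values are accepted for verification, which is what lets the old and new issuers coexist during the migration.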

Care needs to be taken with the master.cluster.domain DNS record: until the migration is complete, its absence can cause unexpected problems (kubelet unable to contact the API, etc.).

I think the same procedure could be used to enable IRSA non-disruptively (which we will try in the future).

Do you think it is worth adding this to the kops documentation and/or updating the kops cluster spec to allow multiple serviceAccountIssuer values to be specified?

@hakman
Member

hakman commented Apr 30, 2024

Thank you for the detailed guide @elliotdobson. It would be useful to add this to the kOps docs.
Regarding multiple service account issuers, see #16497. Would that be what you were looking for?

@elliotdobson
Contributor Author

Where would a procedure like that live in the kOps docs? Under Operations perhaps?

Yes, #16497 is the addition to the cluster spec that would've been required for this migration. Will that feature be back-ported to previous kOps releases?

@hakman
Member

hakman commented Apr 30, 2024

Where would a procedure like that live in the kOps docs? Under Operations perhaps?

Yes, under Operations.

Yes, #16497 is the addition to the cluster spec that would've been required for this migration. Will that feature be back-ported to previous kOps releases?

Seems pretty simple to back-port. Probably it will be in kOps 1.28+.

@elliotdobson
Contributor Author

Closing this issue as we've worked around it and documented the workaround above.
