
Missing secret after operator configuration update #444

Closed
megabreit opened this issue Sep 8, 2020 · 2 comments

Comments

@megabreit

Describe the bug
After updating the tridentprovisioner CR to add silenceAutosupport: true, the configured trident-csi-token secret went missing. It's unclear how this happened. The trident-csi pods kept running but complained about the missing secret:

2m6s        Warning   FailedMount   pod/trident-csi-74d6n        MountVolume.SetUp failed for volume "trident-csi-token-qkdm2" : secret "trident-csi-token-qkdm2" not found

The token was no longer present, but two other tokens existed, one of them unused.
It turned out that trident-csi-token-29c5f was the "correct" one, used after the pods were killed later:

trident-csi-token-29c5f                    kubernetes.io/service-account-token   4      2d16h
trident-csi-token-jjsjt                    kubernetes.io/service-account-token   4      2d16h

See the discussion in the NetApp Slack: https://netapppub.slack.com/archives/C1E3QH84C/p1599221156108100

Trident was effectively inoperable because the main Trident pod could no longer communicate with the CSI node pods.

Manual workaround: delete all daemonset pods. They are recreated with the correct secret.
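A minimal sketch of that workaround. The namespace and the daemonset's pod label are assumptions here, not taken from the report; verify them against your install before deleting anything:

```shell
# Locate the trident-csi daemonset pods (namespace and label are assumptions):
kubectl get pods -n trident -o wide

# Delete the daemonset pods; the daemonset controller recreates them,
# and the new pods mount the service account's current token secret.
kubectl delete pod -n trident -l app=node.csi.trident.netapp.io
```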

Environment

  • Trident version: 20.07.0
  • Trident installation flags used: CR: debug: true
  • Container runtime: cri-o
  • Kubernetes version: 1.18
  • Kubernetes orchestrator: OpenShift v4.5.7
  • Kubernetes enabled feature gates: -
  • OS: RHCOS
  • NetApp backend types: ontap-nas, ontap-nas-economy, ontap-san
  • Other:

To Reproduce
Unsure. Possible causes: deletion of the secret or the deployment. See the Slack discussion above.

Expected behavior
In general: the operator should be able to handle and correct such events.
Suggested in Slack:
The operator should re-create the daemonset pods as well as the deployment pods if the service account is re-created.
Alternatively, in each reconcile loop the operator should verify that the secrets mounted in the pods of the daemonset and the deployment match the service account's secrets.
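The second suggestion can be sketched as a simple set comparison: collect the token secrets each pod actually mounts and flag any pod whose secret no longer belongs to the service account. This is only an illustration of the check, not the operator's real code; the pod and service-account shapes below are simplified stand-ins for the Kubernetes API objects.

```python
def pod_token_secrets(pod):
    """Names of the secrets a (simplified) pod object mounts."""
    return {
        v["secret"]["secretName"]
        for v in pod.get("volumes", [])
        if "secret" in v
    }

def pods_out_of_sync(pods, service_account_secrets):
    """Return names of pods whose mounted token secret is no longer
    owned by the service account, i.e. candidates for re-creation."""
    valid = set(service_account_secrets)
    return [p["name"] for p in pods if not pod_token_secrets(p) <= valid]

# Example mirroring the report: the re-created service account owns
# trident-csi-token-29c5f, but a running pod still mounts the stale
# trident-csi-token-qkdm2.
sa_secrets = ["trident-csi-token-29c5f", "trident-csi-token-jjsjt"]
pods = [
    {"name": "trident-csi-74d6n",
     "volumes": [{"secret": {"secretName": "trident-csi-token-qkdm2"}}]},
    {"name": "trident-csi-ok",
     "volumes": [{"secret": {"secretName": "trident-csi-token-29c5f"}}]},
]
print(pods_out_of_sync(pods, sa_secrets))  # → ['trident-csi-74d6n']
```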

@megabreit megabreit added the bug label Sep 8, 2020
@gnarl gnarl added the tracked label Sep 8, 2020
@rohit-arora-dev
Contributor

Thank you @megabreit for posting the issue here.

Just want to add more context here:
If a service account is re-created, its tokens are not automatically updated on already-created pods. So, in the event of a service account re-creation, the service account token is never refreshed on the Trident pods, which can lead to the above issue.

I am not sure what led to this issue in the customer environment, where the service account token was not updated on the daemonset. However, I was able to reproduce it by deleting the service account: the operator re-created the service account as part of its auto-heal functionality but did not update the deployment or daemonset pods.

As part of each reconcile loop, the operator should also start recognizing when the service account secrets do not match the secrets mounted in the daemonset or deployment pods, and act on it.

@gnarl
Contributor

gnarl commented Oct 28, 2020

This fix will be included in the Trident 20.10.0 release.
