trident-csi pods stuck in ContainerCreating after node reboots #585

Closed
gorantornqvist opened this issue Jun 11, 2021 · 10 comments
@gorantornqvist

Describe the bug
Multiple trident-csi pods are stuck in ContainerCreating after node reboots, with the following errors:
Generated from kubelet on node:
2 times in the last 3 minutes
(combined from similar events): Unable to attach or mount volumes: unmounted volumes=[trident-csi-token-z59nn], unattached volumes=[pods-mount-dir dev-dir host-dir trident-tracking-dir plugin-dir sys-dir certs trident-csi-token-z59nn plugins-mount-dir registration-dir]: timed out waiting for the condition

Generated from kubelet on node
19 times in the last 23 minutes
MountVolume.SetUp failed for volume "trident-csi-token-z59nn" : secret "trident-csi-token-z59nn" not found

If I delete the pod, a new trident-csi pod is created and starts OK, but without manual intervention the original pod hangs forever and other pods on that node that use Trident persistent storage fail to start.

I also noted that when it hangs, the pod references a secret trident-csi-token-z59nn that doesn't exist; after I manually delete the pod and it starts up, the new pod references another secret that actually exists.
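
Roughly, the check and the manual workaround look like this (a sketch; the namespace and pod name are placeholders):

# Confirm that the token secret the stuck pod references is gone
kubectl -n <trident namespace> get secret trident-csi-token-z59nn

# Delete the stuck pod so it is recreated with the current token secret
kubectl -n <trident namespace> delete pod <stuck_trident-csi_pod>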

Environment

  • Trident version: 21.04.0
  • Trident installation flags used: default install using helm
  • Kubernetes version: v1.20.0+df9c838
  • Kubernetes orchestrator: Openshift 4.7.13
  • NetApp backend types: ONTAP-NAS

To Reproduce
Steps to reproduce the behavior:
Reboot node

Expected behavior
The trident-csi pod should start using the correct secret.


@rohit-arora-dev
Contributor

rohit-arora-dev commented Jun 11, 2021

Hello @gorantornqvist

Thanks for reporting this issue. To give you some background: the token secret trident-csi-token-z59nn is created when Trident creates a service account named trident-csi. The Trident deployment and daemonset pods use the service account token for API authentication.
The behaviour in Kubernetes is that if a service account is re-created, the corresponding token is refreshed, but pods still using the old token are not automatically updated. To handle this, Trident automatically re-creates its deployment and daemonset pods when the service account is recreated.

In your case, I am trying to understand:

  1. Was the service account trident-csi re-created? (See the sketch after this list for one way to check.)
    a. If yes, was it before, during, or after the node reboot?
    b. If not, can you consistently reproduce the behaviour, and does it involve just rebooting the Kubernetes node?
  2. The Trident operator logs may also give some insight; sharing them here, on Slack, or via a support case would help as well. Use kubectl -n <trident installation namespace> logs <trident_operator_pod>.
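
For question 1, one rough way to check (a sketch; the namespace and pod name are placeholders):

# creationTimestamp shows whether the service account was re-created;
# .secrets lists the token secret it currently references
kubectl -n <trident installation namespace> get serviceaccount trident-csi -o yaml

# Shows which token secret the stuck pod is still trying to mount
kubectl -n <trident installation namespace> get pod <stuck_trident-csi_pod> -o yaml | grep -A1 secretName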

Please let us know.

Thank you!

@megabreit

Reincarnation of #444 ?

@gnarl
Contributor

gnarl commented Jun 14, 2021

@gorantornqvist, can you provide more information based on @ntap-arorar's comments?

@gorantornqvist
Author

Hi,
Nothing was really done with the trident configuration before this.
I actually encountered the same issue on 2 different clusters, but after restarting the pods the issue was resolved.
I tried restarting each node in these 2 clusters and the issue didn't occur again - it can't be reproduced on demand.

So I guess this could be hard to troubleshoot.

I am OK with closing this, and if it occurs again I will gather all logs from the Trident operator ...

@gnarl
Contributor

gnarl commented Jun 21, 2021

@gorantornqvist, thanks for the feedback. We will reopen this issue if you encounter the problem again.

@gnarl gnarl closed this as completed Jun 21, 2021
@gorantornqvist
Author

Hi,
We encountered this issue again today when updating 2 different openshift clusters.
If I deleted a trident-csi pod, it started working (no need for an operator pod restart).

Attaching operator logs from one of the clusters

trident-operator-86c5b968cb-gz6p9.log

@gnarl gnarl reopened this Jul 20, 2021
@gnarl
Contributor

gnarl commented Oct 22, 2021

@gorantornqvist, we looked at the provided logs and it seems that the cluster was already in a bad state. Our team hasn't been able to reproduce this issue yet. Please let us know if you are still concerned about it.

@gnarl
Contributor

gnarl commented Feb 21, 2023

@gorantornqvist, were you able to resolve your issue?

@gorantornqvist
Author

We haven't encountered this problem again, so this issue can be closed :)

@gnarl
Contributor

gnarl commented Feb 22, 2023

Thanks for the update!
