etcd Pending status (volume mount timed out) #280

Closed
garloff opened this issue Sep 18, 2022 · 3 comments
Labels
bug (Something isn't working) · Container (Issues or pull requests relevant for Team 2: Container Infra and Tooling) · on hold (Is on hold) · Sprint Montreal (2023, cwk 40+41)

Comments

garloff (Contributor) commented Sep 18, 2022

I have seen the etcd container stay in Pending status for extended periods (hours), with the event log indicating that its two volumes could not be mounted. That is ridiculous for

  etcd-certs:
    Type:          HostPath (bare host directory volume)
    Path:          /etc/kubernetes/pki/etcd
    HostPathType:  DirectoryOrCreate

and for etcd-data, which is /var/lib/etcd.
It only happens occasionally. If we're lucky, the etcd container fails a liveness probe, gets restarted, and then works just fine. But we can't rely on that.
I'll look into reproducing this so I can add proper logs here.

garloff (Contributor, Author) commented Sep 18, 2022

Here's what I see in the error case:

kubectl describe -n kube-system pod etcd-$PREFIX-${CLUSTER_NAME}-control-plane-genc01-XXXX
[...]
  Type     Reason       Age   From     Message
  ----     ------       ----  ----     -------
  Warning  FailedMount  65s   kubelet  Unable to attach or mount volumes: unmounted volumes=[etcd-data etcd-certs], unattached volumes=[etcd-data etcd-certs]: timed out waiting for the condition

garloff (Contributor, Author) commented Sep 18, 2022

Restarting the kubelet after kubeadm init avoids the issue, though this feels more like a workaround than a real fix.
On the other hand, if we tweak the heartbeat intervals, the restart is required anyway for the new parameters to take effect, so in that sense it's the right thing to do.

garloff added a commit that referenced this issue Sep 18, 2022
We need to make sure the new parameters take effect.
We do it unconditionally for now; as a side effect, this also takes
care of issue #280, where the etcd pod ridiculously remains Pending
due to an allegedly failed mount (on a DirectoryOrCreate HostPath!).

Signed-off-by: Kurt Garloff <kurt@garloff.de>
garloff added the bug and Container labels Sep 18, 2022
garloff (Contributor, Author) commented Sep 20, 2022

Thanks to merging #282, we no longer tweak etcd's manifest after etcd has (potentially) started, so there is no longer a real reason to restart the kubelet. We still keep the restart, however, as a workaround for this issue.
TODO:

  • Is it really still required?
  • If so, debug the reason and search for a real fix, likely somewhere upstream.
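One way to approach the first TODO without dropping the workaround entirely would be to make the restart conditional: wait for the etcd static pod to report Ready and restart the kubelet only if it does not. The following is a minimal sketch, not what the repository does. It assumes that the etcd pod carries kubeadm's default `component=etcd` label and that the kubelet is managed by systemd. The function name `restart_kubelet_if_etcd_stuck` and the 300s default timeout are hypothetical and chosen only for illustration.

```shell
# Hypothetical helper (not from this repository): restart the kubelet only if
# the etcd static pod does not become Ready within the given timeout.
# Assumes kubeadm's default label component=etcd and a systemd-managed kubelet.
restart_kubelet_if_etcd_stuck() {
  wait_timeout="${1:-300s}"
  # kubectl wait exits non-zero if the condition is not met before the
  # timeout expires (or if no matching pod exists yet).
  if kubectl -n kube-system wait --for=condition=Ready pod \
      -l component=etcd --timeout="$wait_timeout"; then
    echo "etcd ready"
    return 0
  fi
  echo "etcd not Ready after $wait_timeout; restarting kubelet" >&2
  systemctl restart kubelet
}
```

It would be called once `kubeadm init` has returned, e.g. `restart_kubelet_if_etcd_stuck 300s`. Logging how often the restart path is actually taken would indicate whether the workaround is still needed.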

jschoone added the on hold label Oct 10, 2023
jschoone reopened this Oct 11, 2023
jschoone closed this as not planned (won't fix, can't repro, duplicate, stale) Oct 11, 2023
jschoone added the Sprint Montreal label Feb 28, 2024
Projects: Archived in project
Development: No branches or pull requests
2 participants