EOF Error from AWS api while validating cluster which was in running state #16548

teocrispy91 · 2024-05-09T07:16:40Z

We have a kops cluster on version 1.15.2, and everything was working fine until I ran a helm upgrade deployment to one of the namespaces in the cluster. After that I can't run kubectl commands; they fail with "unable to connect to server EOF". I also have a Kubernetes dashboard hosted at example.com/dashboard, and that page now shows a 502 nginx error. When I checked the ELB in AWS, it shows out of service, but my master node is running. Since we are not able to connect to the cluster, we couldn't identify the issue.

When I run kops validate cluster I get the error below:
unexpected error during validation: error listing nodes: Get https://MY_LOAD_BALANCER_DNS_NAME.us-west-2.elb.amazonaws.com/api/v1/nodes: EOF
with MY_LOAD_BALANCER_DNS_NAME replaced by the value under the "DNS name" field on the AWS console

Also, I can still browse my applications hosted in the kops cluster, so I'm not sure whether the apiserver is down or there is some issue with the master.

It would be a great help if someone could assist with this.

hakman · 2024-05-10T05:23:41Z

@teocrispy91 Could you share why you are using kOps v1.15.2, which is 4-5 years old, when creating new clusters?
Please try to go through https://kops.sigs.k8s.io/operations/troubleshoot/. Should help understand where the problem comes from.

teocrispy91 · 2024-05-10T06:31:28Z

@hakman the cluster was created 4 years back and was running without much problems.

hakman · 2024-05-10T06:43:35Z

> @hakman the cluster was created 4 years back and was running without much problems.

The title says "while validating new cluster" 😄

teocrispy91 · 2024-05-10T07:19:37Z

@hakman sorry for the typo i have edited the same

hakman · 2024-05-10T07:23:22Z

No worries, the suggestion still stands, you need to look on the master nodes for logs.
Generally speaking, certs expire. Nodes have to be rotated once in a while at least.

teocrispy91 · 2024-05-10T07:26:47Z

@hakman since I am new to kops, will a restart of the master node cause any issues? Also, are you talking about the API server cert?

hakman · 2024-05-10T08:58:01Z

I don't think that restarting the master node will do any damage, but probably it will not help much either.
Unless you SSH to the node and look for the issue in logs, this is just guesswork.
You should read the troubleshooting guide and check what happened.

teocrispy91 · 2024-05-16T10:50:39Z

@hakman I just logged into my master node, and running kubectl get ns or kubectl get pods shows that the connection to the server localhost:8080 was refused. When I run netstat, I can see that neither 443 nor 8080 is open on the master node. Could that be the cause? Running docker logs, I could see my api-server pod restarting and going to the exited state continuously.

This is some of the log I can see inside the api-server pod. I have checked the certs and they are valid:
= "transport: authentication handshake failed: x509: certificate has expired or is not yet valid". Reconnecting...
W0515 13:17:42.270070 1 clientconn.go:1251] grpc: addrConn.createTransport failed to connect to {127.0.0.1:4001 0 }. Err :connection error: desc = "transport: authentication handshake failed: x509: certificate has expired or is not yet valid". Reconnecting...
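
As a rough aid for reading logs like the one above, the following sketch counts the expired-certificate errors in a log stream and lists which endpoints the API server failed to reach. The function names are hypothetical helpers, not part of kOps or Kubernetes, and the pattern matching assumes the log lines keep the format shown in this comment.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for scanning kube-apiserver container logs.
# Both read log text on stdin.

# Print how many lines report an expired (or not yet valid) x509 certificate.
count_expired_cert_errors() {
  # grep -c exits non-zero when the count is 0; keep the function's status clean.
  grep -c 'x509: certificate has expired or is not yet valid' || true
}

# Print the unique host:port targets from "failed to connect to {...}" lines.
failing_endpoints() {
  sed -n 's/.*failed to connect to {\([^ ]*\) .*/\1/p' | sort -u
}
```

On the master this could be fed from the container logs, for example `docker logs <apiserver-container-id> 2>&1 | failing_endpoints`, where the container ID is a placeholder. For the log line above, the endpoint reported is 127.0.0.1:4001.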

hakman · 2024-05-20T14:08:16Z

Most likely the etcd certs expired and API server cannot connect to it anymore.
This might have helped https://github.com/kubernetes/kops/blob/master/docs/advisories/etcd-manager-certificate-expiration.md.

teocrispy91 · 2024-05-20T14:14:51Z

@hakman But the etcd container seems to be running. Would it run if the cert has expired?

When I run the command below, it shows a validity up to March 28th, 2024. But the image version seems to be kopeio/etcd-manager:3.0.20200429, which is higher than the one mentioned.

find /mnt/ -type f -name me.crt -print -exec openssl x509 -enddate -noout -in {} \;

hakman · 2024-05-20T14:43:08Z

Seems so, but you have to do rolling updates from time to time on the cluster.
There is no mechanism dealing with cert rotation automatically.

teocrispy91 · 2024-05-20T14:46:56Z

When I ran the command find /mnt/ -type f -name me.crt -print -exec openssl x509 -enddate -noout -in {} \; I could see that the certs have expired. This is the result I get:

find /mnt/ -type f -name me.crt -print -exec openssl x509 -enddate -noout -in {} \;
/mnt/master-vol-01399aaec42e241cd/pki/etcd-cluster-token-etcd-events/peers/me.crt
notAfter=Mar 28 06:06:18 2024 GMT
/mnt/master-vol-0e73c716447126a30/pki/etcd-cluster-token-etcd/peers/me.crt
notAfter=Mar 28 06:07:30 2024 GMT

So how can I renew these? What would be the next steps?
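
The check above can be turned into a small script that labels each certificate as expired or not. This is only a sketch: it assumes GNU date (for `date -d`) and openssl are available on the master, and the function names `is_expired` and `check_certs` are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: flag expired etcd-manager certificates found under a directory.
# Assumes GNU date and openssl; function names are illustrative only.

# is_expired LINE: LINE is an openssl "notAfter=..." line.
# Returns 0 if that date is in the past, 1 if not, 2 if it cannot be parsed.
is_expired() {
  local not_after="${1#notAfter=}"
  local expiry now
  expiry=$(date -u -d "$not_after" +%s 2>/dev/null) || return 2
  now=$(date -u +%s)
  [ "$expiry" -le "$now" ]
}

# check_certs [DIR]: report the status of every me.crt under DIR (default /mnt).
check_certs() {
  local dir="${1:-/mnt}" crt end
  find "$dir" -type f -name me.crt -print 2>/dev/null |
    while IFS= read -r crt; do
      end=$(openssl x509 -enddate -noout -in "$crt") || continue
      is_expired "$end"
      case $? in
        0) echo "EXPIRED $crt ($end)" ;;
        1) echo "ok      $crt ($end)" ;;
        *) echo "UNKNOWN $crt ($end)" ;;
      esac
    done
}
```

Running `check_certs /mnt` on the master would print one line per certificate. With the dates shown above, both peer certificates would be reported as expired.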

hakman · 2024-05-20T15:05:08Z

This may work:

kops rolling-update cluster --instance-group-roles=Master --force --cloudonly

teocrispy91 · 2024-05-21T13:55:32Z

@hakman this will recreate a new master node and won't upgrade the cluster, right? Also, where do I need to run this from? I doubt whether the kops command will work on the master.

hakman · 2024-05-21T14:41:18Z

> @hakman this will recreate a new master node and won't upgrade the cluster, right? Also, where do I need to run this from? I doubt whether the kops command will work on the master.

You need to run it from your computer that has admin permissions on the AWS account that hosts the cluster, using the kOps v1.15.2 binary. It will destroy and re-create the master.
Similarly, you can terminate the master instance and it will be re-created.
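
A minimal sketch of that sequence, run from an admin workstation rather than from the master. The cluster name and state store below are placeholders and not values from this thread; the `run` wrapper and `roll_masters` function are hypothetical. As far as I know, kops only previews a rolling update unless `--yes` is passed, so the sketch shows both a preview and an apply step.

```shell
#!/usr/bin/env bash
# Sketch only: CLUSTER_NAME and KOPS_STATE_STORE are placeholder values.
CLUSTER_NAME="${CLUSTER_NAME:-cluster.example.com}"
export KOPS_STATE_STORE="${KOPS_STATE_STORE:-s3://example-kops-state}"

# With DRY_RUN=1 (the default) commands are printed instead of executed.
DRY_RUN="${DRY_RUN:-1}"
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

roll_masters() {
  # Confirm the local binary matches the cluster's kOps version (v1.15.2 here).
  run kops version
  # Without --yes this is a preview of which instances would be replaced.
  run kops rolling-update cluster --name "$CLUSTER_NAME" \
    --instance-group-roles=Master --force --cloudonly
  # With --yes the master instance is destroyed and re-created.
  run kops rolling-update cluster --name "$CLUSTER_NAME" \
    --instance-group-roles=Master --force --cloudonly --yes
}
```

Calling `roll_masters` as-is only prints the commands. Setting `DRY_RUN=0` would execute them one after another, so in practice it is safer to review the preview output before running the `--yes` step by hand.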

teocrispy91 · 2024-05-21T16:10:41Z

@hakman so if I terminate the control plane EC2 instance, it will be re-created automatically by the autoscaling group, right?

hakman · 2024-05-21T16:12:33Z

> @hakman so if I terminate the control plane EC2 instance, it will be re-created automatically by the autoscaling group, right?

yes

teocrispy91 · 2024-05-22T16:23:35Z

@hakman Thanks a ton. After running the command you mentioned, the cluster seems to be up now. Also, in kube-system my aws-iam-authenticator pod is in ImagePullBackOff (do I need to update to the latest image?), and the metrics pod is in CrashLoopBackOff. Any idea why?

teocrispy91 changed the title ~~EOF Error from AWS api while validating new cluster~~ EOF Error from AWS api while validating cluster which was in running state May 10, 2024

teocrispy91 closed this as not planned May 20, 2024
