
no connection for cached dial! for eks cluster #2677

Open
exinos-git opened this issue Apr 25, 2024 · 15 comments
Comments

@exinos-git
Describe the bug
Cannot connect to an EKS cluster after credentials expire and are refreshed; k9s reports "no connection for cached dial!".

To Reproduce
Steps to reproduce the behavior:

  1. connect to an EKS cluster
  2. wait for credentials to expire
  3. login to AWS again to refresh creds
  4. try to connect to EKS cluster with k9s

Historical Documents
1932 9:00AM INF ✅ Kubernetes connectivity
1933 9:00AM ERR Fail to load global/context configuration error="the server has asked for the client to provide credentials\nk9s config file "/home/someuser/.config/k9s/config.yaml" load failed:\nAdditional property fullScreen is not allowed\ncannot connect to context: arn:aws:eks:someregion::cluster/blahblah\nk8s connection failed for context: arn:aws:eks:somregeion::cluster/blahblah"
1934 9:00AM ERR Load cluster resources - No API server connection
1935 9:00AM ERR failed to list contexts error="no connection"
1936 9:00AM WRN Unable to dial discovery API error="no connection to dial"
1937 9:00AM ERR can't connect to cluster error="the server has asked for the client to provide credentials"
1938 9:00AM ERR Load cluster resources - No API server connection
1939 9:00AM WRN Unable to dial discovery API error="no connection to dial"
1940 9:00AM ERR Context switch failed error="no connection to cached dial"
1941 9:00AM ERR no connection to cached dial
1942 9:00AM ERR Context switch failed error="no connection to cached dial"
1943 9:00AM ERR no connection to cached dial
[entries 1944–1975 repeat the same two messages: "Context switch failed error=\"no connection to cached dial\"" and "no connection to cached dial"]

Expected behavior
k9s refreshes the connection with the new credentials.

Versions (please complete the following information):

  • OS: WSL2
  • K9s: v0.32.4
  • K8s: 1.27.12

Additional context
The only way I could work around this was to move the cached cluster data aside: mv /home/someuser/.local/share/k9s/clusters /home/someuser/.local/share/k9s/clustersbad
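The workaround above can be sketched as a small script. This is a hedged sketch, not an official k9s procedure: it assumes the Linux XDG default data path quoted in this thread (on macOS the data directory is ~/Library/Application Support/k9s instead, as noted in a later comment).

```shell
#!/bin/sh
# Hedged sketch of the workaround: move k9s's cached cluster state aside
# so it is rebuilt on the next launch. Path is the Linux XDG default.
K9S_DATA="${XDG_DATA_HOME:-$HOME/.local/share}/k9s"
if [ -d "$K9S_DATA/clusters" ]; then
  mv "$K9S_DATA/clusters" "$K9S_DATA/clustersbad"
fi
```

Moving (rather than deleting) the folder keeps the old state around in case it needs to be inspected or restored.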

@pdfrod

pdfrod commented Apr 30, 2024

A few weeks ago I also started getting "no connection for cached dial" errors all of a sudden. I've used k9s for more than a year and never had that problem before. In my case I'm connecting to GKE clusters.

If I try to reach the clusters using kubectl it works perfectly, but for some reason I need to do a lot of retries in k9s before it will let me access the clusters. I tried upgrading to the latest k9s version, but the issue persists.

@exinos-git
Author

@pdfrod did you try the workaround I mentioned: mv /home/$USER/.local/share/k9s/clusters /home/$USER/.local/share/k9s/clustersbad

@pdfrod

pdfrod commented Apr 30, 2024

Just tried it, but it didn't make any difference for me unfortunately.

@cablekevin

Unfortunately I'm also running into this same issue.

After sourcing my new AWS temp credentials with MFA, if I start k9s I have to wait several seconds for the context to load properly before it starts working. However, sometimes it doesn't load properly and I'm stuck with "no connection to cached dial".

Version: v0.32.4
Commit: d3027c8
Date: 2024-03-20T19:16:59Z

@olivierlacan

Having the same issue. In some cases k9s appears to reload itself and the issue somehow resolves, but I'm not quite sure how to trigger that. I tried switching between clusters and hitting ctrl + r.

I even tried to re-authenticate outside of k9s but the UI eventually seemed to refresh on its own after several seconds. It might be helpful to be able to trigger whatever refresh process seemingly happens in the background manually either when refreshing with ctrl + r or with another command.

@eric-gt

eric-gt commented May 7, 2024

I ran into this problem today with clusters in both EKS and GKE, and here's how I solved it:

  1. rename the current k9s clusters data folder to clustersbad with @exinos-git's mv command, or delete it; your choice
    a. N.B.: on macOS, the default k9s config directory is ~/Library/Application\ Support/k9s
  2. re-authenticate to your clusters out-of-band and update the kubeconfig
    a. for EKS: aws eks update-kubeconfig --name {cluster name} --region {cluster region}
    b. for GKE: gcloud container clusters get-credentials {cluster name} --region {cluster region}
  3. run K9s

After following these three steps, k9s automatically boots into the last context I connected to.
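Step 2 above can be sketched as a small dispatch helper. The aws and gcloud subcommands are the real ones quoted in the steps; the function name and the cluster/region values are placeholders for illustration.

```shell
#!/bin/sh
# Hedged sketch of step 2: refresh the kubeconfig out-of-band,
# dispatching on the cluster's provider.
refresh_kubeconfig() {
  case "$1" in
    eks) aws eks update-kubeconfig --name "$2" --region "$3" ;;
    gke) gcloud container clusters get-credentials "$2" --region "$3" ;;
    *)   echo "unknown provider: $1" >&2; return 1 ;;
  esac
}

# refresh_kubeconfig eks my-cluster eu-west-1   # placeholder values
```

Both commands rewrite the matching entry in ~/.kube/config, so k9s picks up fresh credentials the next time it dials the cluster.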

I believe what happened in my case is that I renamed my contexts directly in my ~/.kube/config file instead of renaming them in k9s, and that broke the mappings between my kubeconfig contexts and the cluster configurations k9s keeps.

@wolffberg

Most likely a duplicate of #2651

@pdfrod

pdfrod commented May 15, 2024

Most likely a duplicate of #2651

Yes, in my case #2651 was exactly the problem I was having. Setting a current-context fixed the problem for me, although it would be nice to not have to set one, as I have multiple clusters and I prefer to be explicit about the cluster I'm currently using.
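For reference, the #2651 fix boils down to pinning an explicit current-context in the kubeconfig. A minimal sketch, assuming the default kubeconfig path; "my-context" is a placeholder name:

```shell
#!/bin/sh
# Hedged sketch: ensure the kubeconfig has a current-context set,
# which is the fix reported in #2651. "my-context" is a placeholder.
CFG="${KUBECONFIG:-$HOME/.kube/config}"
if ! grep -q '^current-context:' "$CFG"; then
  printf 'current-context: %s\n' "my-context" >> "$CFG"
fi
```

With kubectl available, kubectl config use-context my-context achieves the same thing more safely than editing the file by hand.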

@syselement

Most likely a duplicate of #2651

Yes, in my case #2651 was exactly the problem I was having. Setting a current-context fixed the problem for me, although it would be nice to not have to set one, as I have multiple clusters and I prefer to be explicit about the cluster I'm currently using.

+1

@zolv

zolv commented May 31, 2024

Maybe not related, but we are using GCP and what helped me was:

gcloud components install kubectl

@anvy2

anvy2 commented Jul 15, 2024

Same issue here with GCP and AZ clusters

@heamaral

Same issue here with AZ and Openshift clusters

@jose-lpa

jose-lpa commented Aug 21, 2024

Most likely a duplicate of #2651

Actually not, since I am having the problem as well and I do have current-context set in my ~/.kube/config. This bug seems to happen in so many different places, with different configurations, and people seem to be "resolving" it in different ways (probably it just disappeared for some random reason too, and people think they resolved it). In my case, I have a bunch of clusters there, Google Cloud and AWS. I tried every "solution" mentioned here and nothing works.

Edit: I even moved my ~/.kube/config to ~/kube.config.bak and re-connected with a single cluster in AWS, so a brand new ~/.kube/config file with only one entry, and it still fails, same error.

@marchchad

FWIW, I just ran into this. I was deleting a cluster while I had k9s open so I could watch the nodes being drained, and I hadn't exited k9s by the time the cluster was deleted.

I tried the workarounds listed above, but ultimately I had to switch the current-context from the just-deleted cluster to a valid context.

@RealOrangeOne

It seems "no connection to cached dial" can happen for a number of reasons. I eventually narrowed down the cause by looking in k9s's debug logs (~/.local/state/k9s/k9s.log). The issue was glaringly obvious then!

My instance of this error was caused by DNS failing to resolve through my SOCKS proxy (it's always DNS). It's clearly different for different people.
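Checking the debug log, as suggested above, can be sketched as a one-liner. This assumes the Linux XDG default log path quoted in the comment; the path differs on macOS.

```shell
#!/bin/sh
# Hedged sketch: surface the underlying cause from k9s's debug log
# instead of guessing from the "no connection to cached dial" banner.
LOG="${XDG_STATE_HOME:-$HOME/.local/state}/k9s/k9s.log"
tail -n 100 "$LOG" | grep -iE 'ERR|WRN' || echo "no recent errors"
```

The real failure (expired credentials, DNS, proxy) usually appears as an ERR or WRN entry a few lines before the cached-dial messages start repeating.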
