
az aks and kubectl issue: Unable to connect to the server: net/http: TLS handshake timeout #164

Closed
chetanku opened this issue Feb 4, 2018 · 52 comments


@chetanku

chetanku commented Feb 4, 2018

I am trying to connect to my existing Kubernetes cluster from Windows Server 2016.
1st issue:
On the Windows Server 2016 machine, I run az aks get-credentials with the resource group and cluster name, and I see an issue with the casing.
If I specify my cluster name as --name=testscluster and run it, it somehow changes the case to testScluster.

2nd Issue:
Then, after I run kubectl get nodes, I get Unable to connect to the server: net/http: TLS handshake timeout. I manually corrected the case and tried again, but I still see the same issue.

When I try to do the same from an Ubuntu server, it works completely fine.
Question: does AKS support Windows, or is it Linux-only?
I am on the latest az version, 2.0.26.

Any help will be great!!

@matkam

matkam commented Feb 6, 2018

Running into the same issue on Mac OS. The kubectl command was working fine just yesterday, and now throws an error:

~ $ kubectl get nodes
Unable to connect to the server: net/http: TLS handshake timeout

I've tried upgrading the cluster, but that also throws an error and puts the cluster into a failed state:

~ $ az aks upgrade --resource-group my-resource-group --name my-aks-cluster --kubernetes-version 1.8.7
Kubernetes may be unavailable during cluster upgrades.
Are you sure you want to perform this operation? (y/n): y
Deployment failed. Correlation ID: some-id-here. Internal server error

@chetanku
Author

chetanku commented Feb 6, 2018

~ $ kubectl get nodes
Unable to connect to the server: net/http: TLS handshake timeout
Can you try deleting the context and getting it again? This helped me fix it.
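For anyone following along, the context reset can be sketched roughly like this (myResourceGroup / myAKSCluster are placeholder names, not the actual cluster):

```shell
# Drop the cached context and cluster entry for the AKS cluster
# from ~/.kube/config (names below are placeholders).
kubectl config delete-context myAKSCluster
kubectl config delete-cluster myAKSCluster

# Pull fresh credentials from Azure and retry.
az aks get-credentials --resource-group myResourceGroup --name myAKSCluster
kubectl get nodes
```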

@matkam

matkam commented Feb 6, 2018

Still the same:

~ $ rm /Users/matt/.kube/config
~ $ az aks get-credentials --resource-group xxx --name xxx
Merged "xxx" as current context in /Users/xxx/.kube/config
~ $ kubectl get nodes
Unable to connect to the server: net/http: TLS handshake timeout

@chetanku
Author

chetanku commented Feb 6, 2018

Is your cluster up and functional, or is it in a failed state?
Can you also check the FQDN of the cluster in the config and match it against the Azure portal?

@jchauncey
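One way to compare the two values (placeholder names; the jsonpath expression assumes a single cluster remains after --minify):

```shell
# API server address kubectl will use, taken from the current context.
kubectl config view --minify -o jsonpath='{.clusters[0].cluster.server}'; echo

# FQDN Azure reports for the managed cluster; the two should match.
az aks show --resource-group myResourceGroup --name myAKSCluster --query fqdn --output tsv
```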

@matkam

matkam commented Feb 6, 2018

It is in a failed state, but I can't tell if that's from the failed upgrade or if it was failing earlier. The previously set-up services/pods still seem to work.

@jchauncey

@matkam what region is this in?

cc @mboersma

@matkam

matkam commented Feb 6, 2018

@jchauncey Central US

@chetanku
Author

chetanku commented Feb 6, 2018

Can you try upgrading it again?

https://kubernetes.io/docs/tasks/administer-cluster/kubeadm-upgrade-1-8/#recovering-from-a-bad-state

If kubeadm upgrade somehow fails and fails to roll back, due to an unexpected shutdown during execution for instance, you may run kubeadm upgrade again as it is idempotent and should eventually make sure the actual state is the desired state you are declaring.

You can use kubeadm upgrade to change a running cluster with x.x.x --> x.x.x with --force, which can be used to recover from a bad state.

@matkam

matkam commented Feb 6, 2018

How does kubeadm work with AKS? I have been using az aks upgrade to perform upgrades, but there is no --force option there.

@chetanku
Author

chetanku commented Feb 6, 2018

Sorry, can you try upgrading it again with az aks upgrade?

az aks upgrade --name myAKSCluster --resource-group myResourceGroup --kubernetes-version 1.8.7

@matkam

matkam commented Feb 6, 2018

I've tried the upgrade several times with the same result:

~ $ az aks upgrade --resource-group xxx --name xxx --kubernetes-version 1.8.7
Kubernetes may be unavailable during cluster upgrades.
Are you sure you want to perform this operation? (y/n): y
Deployment failed. Correlation ID: 068b7fae-6d54-41d7-8b2d-0ee308e22674. Internal server error

@chetanku
Author

chetanku commented Feb 6, 2018

Is your subscription good? Do you have enough credits?

https://stackoverflow.com/questions/48443320/upgrade-failed-azure-aks-from-1-8-1-to-1-8-6

@matkam

matkam commented Feb 6, 2018

My subscription is good, with plenty of credits. I was able to start another AKS cluster.

Output from your linked StackOverflow:

~ $ az aks show --name xxx --resource-group xxx --output table
Name            Location    ResourceGroup              KubernetesVersion    ProvisioningState    Fqdn
--------------  ----------  -------------------------  -------------------  -------------------  -------------------------------------------------------------------
xxx  centralus   xxx  1.8.7                Failed               xxx.hcp.centralus.azmk8s.io
~ $ az aks get-versions --name xxx --resource-group xxx --output table
Name     ResourceGroup              MasterVersion    MasterUpgrades    NodePoolVersion    NodePoolUpgrades
-------  -------------------------  ---------------  ----------------  -----------------  ------------------
default  xxx  1.8.7            1.8.7             1.8.7              1.8.7

@chetanku
Author

chetanku commented Feb 6, 2018

Can you try this in PowerShell and see if there are any logs?
Get-AzureRmLog -CorrelationId "068b7fae-6d54-41d7-8b2d-0ee308e22674"

@matkam

matkam commented Feb 6, 2018

Unfortunately not:

PS Azure:\> Get-AzureRmLog -CorrelationId "068b7fae-6d54-41d7-8b2d-0ee308e22674"
WARNING: [Get-AzureRmLog] Parameter deprecation: The DetailedOutput parameter will be deprecated in a future breaking change release.
WARNING: [Get-AzureRmLog] Parameter name change: The parameter plural names for the parameters will be deprecated in a future breaking change release in favor of the singular versions of the same names.
WARNING: [Get-AzureRmLog] Output change: The field EventChannels from the EventData object is being deprecated in the release 5.0.0 - November 2017 - since it now returns a constant value (Admin,Operation)
Azure:\

@matkam

matkam commented Feb 7, 2018

I found this on one of the Kubernetes nodes, /var/log/syslog:

Feb  6 23:53:29 aks-nodepool1-12029634-1 docker[1781]: I0206 23:53:29.889440    1867 kubelet_node_status.go:83] Attempting to register node aks-nodepool1-12029634-1
Feb  6 23:53:31 aks-nodepool1-12029634-1 docker[1781]: E0206 23:53:31.329906    1867 reflector.go:205] k8s.io/kubernetes/pkg/kubelet/config/apiserver.go:47: Failed to list *v1.Pod: Get https://evieaksclu-microservicereso-51c2f9-4f6fe944.hcp.centralus.azmk8s.io:443/api/v1/pods?fieldSelector=spec.nodeName%3Daks-nodepool1-12029634-1&resourceVersion=0: net/http: TLS handshake timeout
Feb  6 23:53:31 aks-nodepool1-12029634-1 docker[1781]: E0206 23:53:31.330167    1867 reflector.go:205] k8s.io/kubernetes/pkg/kubelet/kubelet.go:422: Failed to list *v1.Node: Get https://evieaksclu-microservicereso-51c2f9-4f6fe944.hcp.centralus.azmk8s.io:443/api/v1/nodes?fieldSelector=metadata.name%3Daks-nodepool1-12029634-1&resourceVersion=0: net/http: TLS handshake timeout
Feb  6 23:53:31 aks-nodepool1-12029634-1 docker[1781]: E0206 23:53:31.332125    1867 reflector.go:205] k8s.io/kubernetes/pkg/kubelet/kubelet.go:413: Failed to list *v1.Service: Get https://evieaksclu-microservicereso-51c2f9-4f6fe944.hcp.centralus.azmk8s.io:443/api/v1/services?resourceVersion=0: net/http: TLS handshake timeout
Feb  6 23:53:32 aks-nodepool1-12029634-1 docker[1781]: E0206 23:53:32.964445    1867 eviction_manager.go:238] eviction manager: unexpected err: failed to get node info: node 'aks-nodepool1-12029634-1' not found
Feb  6 23:53:32 aks-nodepool1-12029634-1 docker[1781]: E0206 23:53:32.985870    1867 kubelet.go:2095] Container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: Kubenet does not have netConfig. This is most likely due to lack of PodCIDR
Feb  6 23:53:37 aks-nodepool1-12029634-1 docker[1781]: E0206 23:53:37.986996    1867 kubelet.go:2095] Container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: Kubenet does not have netConfig. This is most likely due to lack of PodCIDR
Feb  6 23:53:39 aks-nodepool1-12029634-1 docker[1781]: E0206 23:53:39.894470    1867 kubelet_node_status.go:107] Unable to register node "aks-nodepool1-12029634-1" with API server: Post https://evieaksclu-microservicereso-51c2f9-4f6fe944.hcp.centralus.azmk8s.io:443/api/v1/nodes: net/http: TLS handshake timeout

@chetanku
Author

chetanku commented Feb 7, 2018

What version of az are you using? Can you update it to the latest and try again?

@matkam

matkam commented Feb 7, 2018

Version 2.0.26, installed via Homebrew.

@chetanku
Author

chetanku commented Feb 7, 2018

Throwing in some thoughts: what does az aks list give? Are you able to telnet to fqdn:443?

@jchauncey @mboersma
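A quick reachability check against the API server might look like this (the FQDN below is a placeholder, and nc/openssl stand in for telnet since they are more script-friendly):

```shell
# Placeholder FQDN; substitute the value from `az aks show --query fqdn`.
FQDN=myakscluster-12345678.hcp.centralus.azmk8s.io

# Raw TCP reachability on 443 with a 5-second timeout.
nc -vz -w 5 "$FQDN" 443

# Attempt the TLS handshake itself; hanging or aborting here matches the
# "net/http: TLS handshake timeout" that kubectl reports.
openssl s_client -connect "$FQDN:443" -servername "$FQDN" </dev/null
```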

@matkam

matkam commented Feb 7, 2018

~ $ az aks list
[
  {
    "agentPoolProfiles": [
      {
        "count": 3,
        "name": "nodepool1",
        "osType": "Linux",
        "storageProfile": "ManagedDisks",
        "vmSize": "Standard_D1_v2"
      }
    ],
    "dnsPrefix": "evieAksClu-microserviceReso-51c2f9",
    "fqdn": "evieaksclu-microservicereso-51c2f9-4f6fe944.hcp.centralus.azmk8s.io",
    "id": "/subscriptions/xxx/resourcegroups/microserviceResourceGroup/providers/Microsoft.ContainerService/managedClusters/evieAksCluster",
    "kubernetesVersion": "1.8.7",
    "linuxProfile": {
      "adminUsername": "xxx",
      "ssh": {
        "publicKeys": [
          {
            "keyData": "xxx"
          }
        ]
      }
    },
    "location": "centralus",
    "name": "evieAksCluster",
    "provisioningState": "Failed",
    "resourceGroup": "microserviceResourceGroup",
    "servicePrincipalProfile": {
      "clientId": "a4c334fc-c226-4e6d-a43d-050e60c6a23e"
    },
    "type": "Microsoft.ContainerService/ManagedClusters"
  }
]
~ $ curl -v https://evieaksclu-microservicereso-51c2f9-4f6fe944.hcp.centralus.azmk8s.io:443
* Rebuilt URL to: https://evieaksclu-microservicereso-51c2f9-4f6fe944.hcp.centralus.azmk8s.io:443/
*   Trying 52.173.89.237...
* TCP_NODELAY set
* Connected to evieaksclu-microservicereso-51c2f9-4f6fe944.hcp.centralus.azmk8s.io (52.173.89.237) port 443 (#0)
* ALPN, offering h2
* ALPN, offering http/1.1
* Cipher selection: ALL:!EXPORT:!EXPORT40:!EXPORT56:!aNULL:!LOW:!RC4:@STRENGTH
* successfully set certificate verify locations:
*   CAfile: /etc/ssl/cert.pem
  CApath: none
* TLSv1.2 (OUT), TLS handshake, Client hello (1):
* LibreSSL SSL_connect: SSL_ERROR_SYSCALL in connection to evieaksclu-microservicereso-51c2f9-4f6fe944.hcp.centralus.azmk8s.io:443 
* stopped the pause stream!
* Closing connection 0
curl: (35) LibreSSL SSL_connect: SSL_ERROR_SYSCALL in connection to evieaksclu-microservicereso-51c2f9-4f6fe944.hcp.centralus.azmk8s.io:443 

@danielcoman

Same issue... Created a cluster yesterday in West Europe. Can't connect.
Unable to connect to the server: net/http: TLS handshake timeout

@matkam

matkam commented Feb 7, 2018

Now I'm not able to create a new cluster at all in CentralUS. The deployment fails without creating the second resource group (or any AKS nodes).

@kamoljan

Having the same issue in Central US :(
If I remember correctly, I had the same issue in another US region as well. ;(

@matkam

matkam commented Feb 13, 2018

After a week of contacting Azure support, they could not tell me what was wrong and recommended deleting and recreating the AKS cluster 👎

@amanohar

@matkam @kamoljan regarding the deployment failing on new creates while creating the second RG: can you share your resource group and resource name?

@matkam

matkam commented Feb 13, 2018

@amanohar upgrades and new deployments no longer seem to fail. However, the upgraded and newly deployed clusters are not functioning.

Old upgraded cluster, still shows TLS handshake timeout:
RG: microserviceResourceGroup
Name: evieAksCluster

Newly created cluster info, which shows a new error, Error: forwarding ports: error upgrading connection: error dialing backend: dial tcp 10.240.0.4:10250: getsockopt: connection timed out:
RG: galleon-group
Name: galleon-aks-cluster

@joukojoutomies

joukojoutomies commented Feb 13, 2018 via email

@amanohar

@matkam thanks for providing the details. A follow up question to failure on:

RG: galleon-group
Name: galleon-aks-cluster

Were you able to connect to the cluster after create, and it stopped working after a scale? Or was it not working right after create as well?

@matkam

matkam commented Feb 13, 2018

@amanohar It was not working after create.

@shrutir25

@matkam - I have resolved the TLS handshake timeout error on your old cluster with resource group microserviceResourceGroup. I am still looking into the new cluster.

@danielcoman

My issue went away after about one day in West Europe. Is this still related to capacity?

@matkam

matkam commented Feb 14, 2018

Thanks for looking into it @shrutir25, but I'm still seeing the same error:

~ $ az aks get-credentials --resource-group microserviceResourceGroup --name evieAksCluster
Merged "evieAksCluster" as current context in /Users/matt/.kube/config
~ $ kubectl get nodes
Unable to connect to the server: net/http: TLS handshake timeout

As for the new cluster, it looks like the provisioning scripts are either exiting early, or not provisioning everything they need to. I'm seeing missing routes on the routing table.

@mjrousos
Member

Same issue in East US for me, too. Worked a couple days ago, doesn't work now. Hosted services appear to still be up but kubectl commands (get nodes, proxy, etc.) are failing as described above.

In case it's useful:
RG: mjr-aks
Name: mjr-aks
Location: eastus

@mjrousos
Member

My East US AKS instance seems to be healthy again. I didn't do anything, so I'm guessing it was just some transient issue with the master.

@derekperkins
Contributor

@matkam Did you ever figure this out? Error: forwarding ports: error upgrading connection: error dialing backend: dial tcp 10.240.0.4:10250: getsockopt: connection timed out

I have a cluster that has been working for a month, and I can run kubectl commands, but I can't use kubectl proxy or helm without seeing that error.

@matkam

matkam commented Feb 19, 2018

@derekperkins I have not figured it out. Azure support could only recommend starting a new AKS cluster (which I am now able to do).

@derekperkins
Contributor

@matkam thanks for the quick response. I tried doing that, but I'm currently running 1.9.1 and can only create a 1.8.x cluster, and my app requires 1.9. Hopefully they make that available soon. :(

@Nirmalyasen

Is there a solution for:
Error: forwarding ports: error upgrading connection: error dialing backend: dial tcp 10.240.0.4:10250: getsockopt: connection timed out

I started getting it yesterday, and I'm getting it even on newly created clusters.

@jchauncey

Submit a support ticket so our on-call engineers can take a look. This error can be caused by different problems, so we need to investigate what's happening before providing a mitigation.

@kkorsakov

kkorsakov commented Apr 10, 2018

The same thing today.
TB_RG_KUBE_PROD
containerservice-TB_RG_KUBE_PROD

@dharmeshkakadia

I am seeing the same thing in East US. Our cluster has been running fine for some time, and suddenly we are seeing "Unable to connect to the server: net/http: TLS handshake timeout" today.

@Jagdeep1

Unable to connect to the server: net/http: TLS handshake timeout for west europe

@jchauncey

If you are seeing issues, please submit a support ticket so our on-call engineers can take a look.

@hholle

hholle commented Apr 24, 2018

We encounter exactly the same issue (Unable to connect to the server: net/http: TLS handshake timeout) in West Europe. Our subscription does not allow us to open support tickets. Do we have any other option?

We have already connected via SSH to a cluster node. The managed control/master node seems to be down/broken.

@dazdaz

dazdaz commented Apr 25, 2018

Re-deployed AKS, which was in East US, into Central US, and everything works again.

@hholle

hholle commented Apr 26, 2018

Azure Support fixed the "Unable to connect to the server: net/http: TLS handshake timeout" error for our cluster :-)

@snekcz

snekcz commented May 23, 2018

We have the same problem with an AKS cluster in West Europe today. It worked in the morning, but it has been returning the "unable to connect" message for several hours.

@chetanku
Author

I am facing the same issue with my AKS cluster in East US. It is intermittent.

@necevil

necevil commented Jun 6, 2018

@chetanku @snekcz @hholle @dazdaz @Jagdeep1 @dharmeshkakadia @Nirmalyasen @kkorsakov @matkam @mjrousos @danielcoman I am starting to collect info on this issue in a question over on StackOverflow:
https://stackoverflow.com/questions/50726534/why-cant-kubectl-connect-to-the-azure-aks-server-managing-my-cluster-net-http

If you could answer a couple of these (here or over on StackOverflow):

  1. Have you had the TLS issue impact your Clusters more than once?
  2. How long did it take before your Cluster was back up and could be connected to (what was the outage duration)?
  3. Did it 'heal' itself / resolve / come back to life on its own, or did you have to post a support ticket and wait for MS to fix it?
  4. If support, how long was that ticket posted for before the problem was solved on your Cluster?
  5. Did you notice anything weird with your CPU / Network IO dropping but your disk usage jumping higher?
  6. Would you be in favor of allowing AKS preview customers to create higher Severity Help Tickets (For this specific issue only) regardless of their support plan in order to achieve a more timely solution?

Feel free to respond here or (as mentioned) hit StackOverflow and I will collect the responses.

@qiangli

qiangli commented Jun 12, 2018

Just for info here: for me it was a combination of problems with a corporate firewall, the Azure service, and the Go http library's handling of self-signed certificates (which kubectl depends on).

"-v=9" will print more debugging info, e.g.: kubectl -v=9 get nodes
In the output there is a curl equivalent of the failing request; you can simply run that curl command and investigate.
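For example (the FQDN is a placeholder, and -k only skips CA verification to isolate where the handshake fails; it is not a fix):

```shell
# Dump every HTTP round trip kubectl makes, including a curl-equivalent line
# for each request it issues.
kubectl -v=9 get nodes

# Replay the failing request outside kubectl's Go TLS stack.
curl -kv https://<your-cluster-fqdn>:443/api/v1/nodes
```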

@seanknox
Contributor

seanknox commented Aug 2, 2018

Closing due to inactivity. Feel free to re-open if this is still an issue.

@seanknox seanknox closed this as completed Aug 2, 2018
@tbithell

I was able to get this to work after running az aks get-credentials --resource-group rsgnamehere --name clusternamehere. My error was different, though: Unable to connect to the server: dial tcp [::1]:8080: connectex: No connection could be made because the target machine actively refused it

Really just putting this here in case someone else runs into it.
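For reference, that dial tcp [::1]:8080 error usually just means kubectl found no kubeconfig/current context and fell back to the default local address; checking the context makes it obvious (resource group / cluster names are placeholders):

```shell
# Fails with "current-context is not set" when no kubeconfig is active.
kubectl config current-context

# Fetch credentials so a context exists, then retry.
az aks get-credentials --resource-group rsgnamehere --name clusternamehere
kubectl get nodes
```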

@Azure Azure locked as resolved and limited conversation to collaborators Aug 11, 2020