
AKS (with http addon + custom VNet) can't get public IP for LB #427

Closed
WeidongZhou opened this issue Jun 13, 2018 · 22 comments

@WeidongZhou commented Jun 13, 2018

Hi,
I am using the example code, Kafka-aks-test, from this URL: https://docs.microsoft.com/en-us/azure/hdinsight/kafka/apache-kafka-azure-container-services. Everything looks fine except getting an external IP. Initially we ran into an error like "autorest/azure: Service returned an error. Status=403". After some research, we added the SP to the subnet and VNet for the AKS cluster, and that error went away. But we still have an issue getting an external IP. Currently we see the following warning message:
Error creating load balancer (will retry): failed to ensure load balancer for service default/kafka-aks-test2: timed out waiting for the condition.
I made several different deployments using the same code (just changing the app name). The interesting part is that one deployment got an external IP after 22 hours in a pending state. The other deployments are still pending. We are using the latest AKS release and created the AKS cluster less than a week ago. Here is the output from the kubectl get service command:
[root@exa-dev01-ue1-kfclient1-vm Kafka-AKS-Test]# kubectl get service
NAME                TYPE           CLUSTER-IP       EXTERNAL-IP    PORT(S)          AGE
azure-vote-back     ClusterIP      192.168.126.17   <none>         6379/TCP         1d
azure-vote-front    LoadBalancer   192.168.98.14    <pending>      80:32062/TCP     17h
foundationservice   LoadBalancer   192.168.45.233   <pending>      3000:32046/TCP   17h
kafka-aks-test      LoadBalancer   192.168.130.97   23.100.17.58   80:32656/TCP     1d
kafka-aks-test2     LoadBalancer   192.168.120.61   <pending>      80:30686/TCP     54m
kafka-aks-test3     LoadBalancer   192.168.241.28   10.2.1.97      80:30877/TCP     11h
kubernetes          ClusterIP      192.168.0.1      <none>         443/TCP          18h

Any recommendations?
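For anyone else digging into this, the warning above comes from the service events; a quick way to pull them (using one of the pending service names from my output) is something like:

# Show the LoadBalancer provisioning events for one of the pending services
kubectl describe service kafka-aks-test2

# Or dump recent cluster events and look for CreatingLoadBalancerFailed
kubectl get events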

@VincentSurelle commented Jun 13, 2018

Hi,
Looks like #422

@WeidongZhou (Author) commented Jun 13, 2018

@VincentSurelle Thank you for your reply. I checked out #422. I actually tried that trick of using annotations. It worked to get a load balancer with an internal IP address, but not an external one. See the output of the kafka-aks-test3 service: it got a 10.2.x.x IP, which is in the AKS subnet. What I need is an external IP that can be accessed from the internet. Thanks.
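For reference, a sketch of what that annotation trick from #422 looks like on the service spec (it only requests an internal Azure load balancer, which is why kafka-aks-test3 ended up with a 10.2.x.x address):

apiVersion: v1
kind: Service
metadata:
  name: kafka-aks-test3
  annotations:
    # Requests an internal Azure load balancer instead of a public one,
    # so the service only gets an IP from the AKS subnet.
    service.beta.kubernetes.io/azure-load-balancer-internal: "true"
spec:
  type: LoadBalancer
  ports:
  - port: 80
  selector:
    app: kafka-aks-test3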

@VincentSurelle commented Jun 13, 2018

@WeidongZhou Sorry, I didn't see it at first.
Can you provide your yml file?

@WeidongZhou (Author) commented Jun 13, 2018

@VincentSurelle
Here it is:
[root@exa-dev01-ue1-kfclient1-vm Kafka-AKS-Test]# cat k2.yaml
apiVersion: apps/v1beta1
kind: Deployment
metadata:
  name: kafka-aks-test2
spec:
  replicas: 1
  strategy:
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 1
  minReadySeconds: 5
  template:
    metadata:
      labels:
        app: kafka-aks-test2
    spec:
      containers:
      - name: kafka-aks-test2
        image: foundationcontainerregistry.azurecr.io/kafka-aks-test:v1
        ports:
        - containerPort: 80
        resources:
          requests:
            cpu: 250m
          limits:
            cpu: 500m
---
apiVersion: v1
kind: Service
metadata:
  name: kafka-aks-test2
spec:
  type: LoadBalancer
  ports:
  - port: 80
  selector:
    app: kafka-aks-test2

@VincentSurelle commented Jun 13, 2018

Did you check the Http application routing box on cluster creation?

[screenshot: Http application routing option]

It seems that for the last 2-3 days it has been difficult to work with external IPs.

@WeidongZhou (Author) commented Jun 13, 2018

Yes, I selected Yes for Http application routing and Advanced for Network configuration. We do have a few subnets under the same VNet.
Also, even in the one case where I got an external IP after 22 hrs in a pending state, I can see this IP is a Public IP type resource. I would expect it to be something like a Load Balancer type resource. In my deployment, I have only one replica. If I need multiple replicas in my deployment, I don't know how this single public IP is going to work.

@brusMX commented Jun 13, 2018

I'm also experiencing a problem like this. Apparently when you check both Http Application routing and a custom VNet, you can't get public IPs. I get these logs:

k get all --all-namespaces

kube-system   pod/addon-http-application-routing-nginx-ingress-controller-64mdqfn   0/1       CrashLoopBackOff   597        2d

kube-system   service/addon-http-application-routing-nginx-ingress          LoadBalancer   10.0.99.228   <pending>     80:31312/TCP,443:31735/TCP   2d

kube-system   deployment.extensions/addon-http-application-routing-nginx-ingress-controller   1         1         1            0           2d

k describe pod/addon-http-application-routing-nginx-ingress-controller-64mdqfn -n kube-system

Name:           addon-http-application-routing-nginx-ingress-controller-64mdqfn
Namespace:      kube-system
Node:           aks-agentpool-19807500-1/10.1.0.4
Start Time:     Mon, 11 Jun 2018 11:46:02 -0700
Labels:         app=addon-http-application-routing-nginx-ingress
                pod-template-hash=2005983595
Annotations:    <none>
Status:         Running
IP:             10.1.0.11
Controlled By:  ReplicaSet/addon-http-application-routing-nginx-ingress-controller-6449fd79f9
Containers:
  addon-http-application-routing-nginx-ingress-controller:
    Container ID:  docker://19dc0ec2fe04c77d943eb8016fe1c205c53ae13726757c1f4a31cfca840a6941
    Image:         quayio.azureedge.net/kubernetes-ingress-controller/nginx-ingress-controller:0.13.0
    Image ID:      docker-pullable://quayio.azureedge.net/kubernetes-ingress-controller/nginx-ingress-controller@sha256:8f3a3bf373e64d8b29e502faf58dd1b212ceb2a69627ccc1add5b9aca24e273b
    Ports:         80/TCP, 443/TCP
    Host Ports:    0/TCP, 0/TCP
    Args:
      /nginx-ingress-controller
      --ingress-class=addon-http-application-routing
      --default-backend-service=$(POD_NAMESPACE)/addon-http-application-routing-default-http-backend
      --configmap=$(POD_NAMESPACE)/addon-http-application-routing-nginx-configuration
      --tcp-services-configmap=$(POD_NAMESPACE)/addon-http-application-routing-tcp-services
      --udp-services-configmap=$(POD_NAMESPACE)/addon-http-application-routing-udp-services
      --annotations-prefix=nginx.ingress.kubernetes.io
      --publish-service=$(POD_NAMESPACE)/addon-http-application-routing-nginx-ingress
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Error
      Exit Code:    255
      Started:      Wed, 13 Jun 2018 15:35:07 -0700
      Finished:     Wed, 13 Jun 2018 15:35:08 -0700
    Ready:          False
    Restart Count:  606
    Liveness:       http-get http://:10254/healthz delay=10s timeout=1s period=10s #success=1 #failure=3
    Readiness:      http-get http://:10254/healthz delay=0s timeout=1s period=10s #success=1 #failure=3
    Environment:
      POD_NAME:       addon-http-application-routing-nginx-ingress-controller-64mdqfn (v1:metadata.name)
      POD_NAMESPACE:  kube-system (v1:metadata.namespace)
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from addon-http-application-routing-nginx-ingress-serviceaccoun4cmsx (ro)
Conditions:
  Type           Status
  Initialized    True 
  Ready          False 
  PodScheduled   True 
Volumes:
  addon-http-application-routing-nginx-ingress-serviceaccoun4cmsx:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  addon-http-application-routing-nginx-ingress-serviceaccoun4cmsx
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type     Reason   Age                  From                               Message
  ----     ------   ----                 ----                               -------
  Warning  BackOff  4m (x14145 over 2d)  kubelet, aks-agentpool-19807500-1  Back-off restarting failed container

k logs pod/addon-http-application-routing-nginx-ingress-controller-64mdqfn -n kube-system

-------------------------------------------------------------------------------
NGINX Ingress controller
  Release:    0.13.0
  Build:      git-4bc943a
  Repository: https://github.com/kubernetes/ingress-nginx
-------------------------------------------------------------------------------

I0613 21:48:35.281783       8 flags.go:162] Watching for ingress class: addon-http-application-routing
W0613 21:48:35.281957       8 flags.go:165] only Ingress with class "addon-http-application-routing" will be processed by this ingress controller
W0613 21:48:35.282410       8 client_config.go:533] Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.
I0613 21:48:35.282652       8 main.go:181] Creating API client for https://10.0.0.1:443
I0613 21:48:35.330053       8 main.go:225] Running in Kubernetes Cluster version v1.9 (v1.9.6) - git (clean) commit 9f8ebd171479bec0ada837d7ee641dec2f8c6dd1 - platform linux/amd64
I0613 21:48:35.339903       8 main.go:84] validated kube-system/addon-http-application-routing-default-http-backend as the default backend
F0613 21:48:35.355390       8 main.go:102] service kube-system/addon-http-application-routing-nginx-ingress does not (yet) have ingress points

@WeidongZhou (Author) commented Jun 14, 2018

@brusMX Interesting, I got a similar result to yours:

kube-system pod/addon-http-application-routing-nginx-ingress-controller-647t6s4 0/1 CrashLoopBackOff 1519 5d
kube-system service/addon-http-application-routing-nginx-ingress LoadBalancer 192.168.205.83 <pending> 80:32704/TCP,443:32663/TCP 5d

I created this cluster 5 days ago, so this issue has been there since the cluster was stood up. It has restarted 1,519 times in that 5-day timeframe. It looks like a bug to me.

@WeidongZhou (Author) commented Jun 14, 2018

@brusMX I have been on a call with Microsoft support for about 8 hrs on this issue. We finally have a resolution. The trick of adding the SP to the AKS cluster's VNet or subnet was not working. The key to resolving this issue is to add the SP to the cluster's MC_* resource group as Owner. By default, the AKS cluster installation adds this SP as Contributor; it has to be Owner. We also built a new cluster during the call.

By the way, Microsoft AKS technical support is excellent. I got a call within 10 minutes of opening the ticket, and the support engineer was with me the whole time trying out different workarounds. It was a great support experience.
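For anyone who prefers the CLI, a rough sketch of that role change (I did it through the portal; the MC_* resource group name below is a placeholder, substitute whatever node resource group your cluster created):

# Client ID of the SP the cluster runs under
CLIENT_ID=$(az aks show -n exa-aksc2 -g exa-dev01-ue1-aksc2-vnet2-rg \
  --query servicePrincipalProfile.clientId -o tsv)

# Full resource ID of the cluster's MC_* resource group (placeholder name, use your own)
MC_RG_ID=$(az group show -n "MC_<resource-group>_<cluster-name>_<region>" --query id -o tsv)

# Grant that SP the Owner role on the MC_* resource group
az role assignment create --assignee "$CLIENT_ID" --role Owner --scope "$MC_RG_ID"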

@brusMX commented Jun 14, 2018

Thank you @WeidongZhou for your response.
Can you please update the title of this issue to something more informative?
AKS (with http addon + custom VNet) can't get public IP for LB

OK, so according to your comment: can this be achieved today in the portal (without the CLI)?
Basically, the team in charge of the portal deployment flow should make sure this case doesn't happen: either only let you check one of those two boxes at a time (disable the other), or catch this case and create an SP with Owner access on the fly.

WeidongZhou changed the title from "AKS can not get external IP in the service" to "AKS (with http addon + custom VNet) can't get public IP for LB" on Jun 14, 2018

@WeidongZhou (Author) commented Jun 14, 2018

@brusMX I updated the title to what you recommended.

Yes, you can achieve this from the portal. Here are the steps:

  1. Identify the SP used by your cluster. If you create the cluster from the portal, you cannot pick the SP name; a new one is created for you. Run the following command, where -n is your AKS cluster name and -g is the resource group used when the cluster was created:
    [root@ Kafka-AKS-Test]# az aks show -n exa-aksc2 -g exa-dev01-ue1-aksc2-vnet2-rg | grep clientId
    "clientId": "27ae6273-9706-4156-b546-607279623990"

Then click Azure Active Directory and use this client ID to find out exactly which SP the cluster uses. Unfortunately, I can't remember the exact steps to look up the SP.

  2. In case you don't know the SP, it is still OK. Click Resource Groups, find the resource group whose name starts with MC_ and contains your cluster name, and click it. Then click Access control (IAM) in the menu; you should see an SP named something like SP-201806... in the Contributor list. Click Add, select Owner for the role, then pick that same SP from the list and click Save. That's it.

Good luck.
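If you prefer the CLI to clicking through Azure Active Directory, looking up the SP from that client ID is something along these lines:

# Resolve the SP's display name (it should match the SP-201806... entry in the IAM list)
az ad sp show --id 27ae6273-9706-4156-b546-607279623990 --query displayName -o tsv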

@nphmuller commented Jun 14, 2018

Seems to be the same problem (and workaround) that I mentioned in this issue a couple of weeks ago: #357

@WeidongZhou (Author) commented Jun 14, 2018

Yes, similar. In our case, it seems adding the SP to the VNet and subnet was not enough, and we got the same issue. Only after we added the same SP to the MC_* resource group with the Owner role was the issue fixed. But to your credit, the link above did give us the idea to add the SP with the Owner role. Microsoft support also helped us resolve some other issues in our environment and applications, which is why the call took so long.

@nphmuller commented Jun 14, 2018

It's always in the small details. Good to know! 👍

@WeidongZhou (Author) commented Jun 14, 2018

:)

@sukrit007 commented Jul 25, 2018

@WeidongZhou We were having the same issue. After adding the SP to the MC_* resource group, did you have to recreate the cluster?

@WeidongZhou (Author) commented Jul 25, 2018

@sukrit007 I did recreate the cluster during the test; however, I got the exact same issue in the new cluster. I described the solution on my blog at
https://weidongzhou.wordpress.com/2018/06/27/could-not-get-external-ip-for-load-balancer-on-azure-aks/

Please note that even with my workaround, it can still take 10 minutes or so to get an IP for the load balancer, whether it's an internal cluster IP or an external IP. Just be patient and wait for the IP.
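One simple way to wait for it (using the service name from my earlier output):

# Re-prints the service line whenever it changes, until EXTERNAL-IP flips from <pending> to a real address
kubectl get service kafka-aks-test2 --watch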

@sukrit007 commented Jul 26, 2018

@WeidongZhou Thanks for the blog post. That worked.

@EamonKeane commented Jul 31, 2018

Thanks @WeidongZhou, this got me half-way there. As I had a subnet, I also had to add the cluster as an owner of it. I debugged this by using kubectl get events and saw something similar to the below. Deleting the nginx service and running helm upgrade --install mynginx nginx did the trick to kick-start the installation.

23s         1m           3         cluster-svc-nginx-ingress-controller.15467bc11fee1504                   Service                                               
   Warning   CreatingLoadBalancerFailed   service-controller           
    Error creating load balancer (will retry): failed to ensure load balancer for service cluster-svc/cluster-svc-nginx-ingress-controller: 
    [ensure(cluster-svc/cluster-svc-nginx-ingress-controller): backendPoolID(/subscriptions/<SUBSCRIPTION-ID>
    /resourceGroups/MC_squareroute-develop_squareroute-develop_westeurope/providers/Microsoft.Network/loadBalancers/kubernetes/backendAddressPools/kubernetes) 
    - failed to ensure host inpool: "network.InterfacesClient#CreateOrUpdate: Failure responding to request: StatusCode=403 -- 
    Original Error: autorest/azure: Service returned an error. Status=403 Code=\"LinkedAuthorizationFailed\" Message=\"The client 
    '7bad65c0-cb9f-4be9-9715-8ef865156338' with object id '7bad65c0-cb9f-4be9-9715-8ef865156338' has permission to perform 
    action 'Microsoft.Network/networkInterfaces/write' on scope 
    '/subscriptions/<SUBSCRIPTION-ID>/resourceGroups/MC_squareroute-develop_squareroute-develop_westeurope/providers/Microsoft.Network/networkInterfaces/aks-nodepool1-47278868-nic-1';
     however, it does not have permission to perform action 'Microsoft.Network/virtualNetworks/subnets/join/action' on the linked scope(s) 
     '/subscriptions/<SUBSCRIPTION-ID>/resourceGroups/squareroute-develop/providers/Microsoft.Network/virtualNetworks/squareroute-develop/subnets/squareroute-develop'.\"", 
     ensure(cluster-svc/cluster-svc-nginx-ingress-controller): 
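For reference, a rough CLI sketch of granting the cluster's SP rights on that subnet (identifiers taken from the error message above; Owner is what worked for me, a narrower network role may also be enough):

# Resource ID of the subnet named in the LinkedAuthorizationFailed error above
SUBNET_ID=$(az network vnet subnet show -g squareroute-develop \
  --vnet-name squareroute-develop -n squareroute-develop --query id -o tsv)

# Give the cluster's service principal (client ID from the error) rights on the subnet
# so it can perform Microsoft.Network/virtualNetworks/subnets/join/action
az role assignment create --assignee 7bad65c0-cb9f-4be9-9715-8ef865156338 \
  --role Owner --scope "$SUBNET_ID"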

@WeidongZhou (Author) commented Jul 31, 2018

@EamonKeane Interesting tips, thanks for sharing.

@lgomezgonz commented Aug 2, 2018

Thanks all for your help.
I followed your tips, but still got a 401 error when the nginx service tried to create the LB.
Finally I found the cause: the key for the service principal I was using had expired one day earlier (the cluster was one week old).
I tried to find a way to update the key in the cluster using the console, the az CLI... but there seem to be no available options to do this.
This product is so new that very few people (only those reusing an old SP, like me) are likely to have hit this issue.
I hope Microsoft adds a way to update the SP keys in clusters, because otherwise, in one year many clusters will have to be deleted in order to set up new load balancers.
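A quick way to check whether your cluster's SP key has expired (the client ID placeholder below is whatever az aks show reports for your cluster):

# List the SP's credentials and check their end dates
az ad sp credential list --id <your-cluster-client-id>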
