CreatingLoadBalancerFailed on AKS cluster with advanced networking #357

Closed
nphmuller opened this Issue May 9, 2018 · 28 comments

nphmuller commented May 9, 2018

kubectl version

Client Version: version.Info{Major:"1", Minor:"10", GitVersion:"v1.10.2", GitCommit:"81753b10df112992bf51bbc2c2f85208aad78335", GitTreeState:"clean", BuildDate:"2018-04-27T09:22:21Z", GoVersion:"go1.9.3", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"9", GitVersion:"v1.9.6", GitCommit:"9f8ebd171479bec0ada837d7ee641dec2f8c6dd1", GitTreeState:"clean", BuildDate:"2018-03-21T15:13:31Z", GoVersion:"go1.9.3", Compiler:"gc", Platform:"linux/amd64"}

Repro:

  • Deploy new AKS cluster.
    • Used latest version at the time (1.9.6)
    • In networking tab pick advanced
  • kubectl run nginx --image=nginx --replicas=1 --port=80
  • kubectl expose deployment nginx --port=80 --target-port=80 --type=LoadBalancer
  • kubectl get service nginx -w: EXTERNAL-IP stuck at <pending>
  • kubectl describe service nginx will show the following events:
  Type     Reason                      Age               From                Message
  ----     ------                      ----              ----                -------
  Normal   EnsuringLoadBalancer        1m (x9 over 16m)  service-controller  Ensuring load balancer
  Warning  CreatingLoadBalancerFailed  1m (x9 over 16m)  service-controller  Error creating load balancer (will retry): failed to ensure load balancer for service default/nginx: ensure(default/nginx): lb(kubernetes) - failed to ensure host in pool: "network.InterfacesClient#CreateOrUpdate: Failure responding to request: StatusCode=403 -- Original Error: autorest/azure: Service returned an error. Status=403 Code=\"LinkedAuthorizationFailed\" Message=\"The client 'XXX' with object id 'XXX' has permission to perform action 'Microsoft.Network/networkInterfaces/write' on scope '/subscriptions/XXX/resourceGroups/MC_XXX/providers/Microsoft.Network/networkInterfaces/aks-agentpool-XXX-nic-0'; however, it does not have permission to perform action 'Microsoft.Network/virtualNetworks/subnets/join/action' on the linked scope(s) '/subscriptions/XXX/resourceGroups/XXX-OTHER/providers/Microsoft.Network/virtualNetworks/XXX/subnets/XXX'.\""
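The event message itself names the scope that is missing the permission (the "linked scope(s)" part). As a rough sketch, assuming the message format shown above stays the same, a small filter can pull that scope out of the event text. The function name `linked_scopes` is made up for this example.

```shell
#!/usr/bin/env bash
# Print each distinct "linked scope" named in LinkedAuthorizationFailed
# messages read from stdin (for example, `kubectl describe service` output).
# Assumes the scope is wrapped in single quotes, as in the events above.
linked_scopes() {
  grep -oE "linked scope\(s\) '[^']+'" \
    | sed -E "s/^linked scope\(s\) '//; s/'\$//" \
    | sort -u
}

# Usage (service name taken from the repro steps):
#   kubectl describe service nginx | linked_scopes
```

The printed resource ID is the subnet the service principal needs access to.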

Workaround:

Manually give Owner permission (Contributor doesn't work) to the service principal for the subnet.

Author

nphmuller commented May 11, 2018

Someone asked me to elaborate on the workaround. I guess they figured it out, since they deleted their comment, but here are the more detailed steps anyway:

  • In the Azure portal, go to Virtual Networks and select the VN (not the generated aks-vnet-xxx VN).
  • Go to Subnets and select the subnet you chose in the advanced network option during creation of your AKS cluster.
  • Pick ‘Manage Users’ and add a user.
  • Pick the role Owner (Contributor won’t resolve the error). Update: Network Contributor is sufficient.
  • Select the Service Principal that was created for your AKS cluster. It doesn’t appear in the list by default, but you can search by the first few chars of your AKS cluster’s name.
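For anyone who prefers the command line, the portal steps above could be sketched roughly like this with the Azure CLI. The helper names `subnet_scope` and `grant_subnet_role` are made up for this example, and every resource name in the usage comments is a placeholder. The default role follows the "Network Contributor" update above.

```shell
#!/usr/bin/env bash
# Build the ARM resource ID of a subnet from its parts.
subnet_scope() {
  local subscription="$1" vnet_rg="$2" vnet="$3" subnet="$4"
  printf '/subscriptions/%s/resourceGroups/%s/providers/Microsoft.Network/virtualNetworks/%s/subnets/%s\n' \
    "$subscription" "$vnet_rg" "$vnet" "$subnet"
}

# Assign a role (default: Network Contributor) to the service principal,
# scoped to the subnet only.
grant_subnet_role() {
  local sp_client_id="$1" scope="$2" role="${3:-Network Contributor}"
  az role assignment create --assignee "$sp_client_id" --role "$role" --scope "$scope"
}

# Usage (all names are placeholders):
#   SP_ID=$(az aks show --resource-group myResourceGroup --name myCluster \
#             --query servicePrincipalProfile.clientId -o tsv)
#   grant_subnet_role "$SP_ID" "$(subnet_scope "$SUBSCRIPTION_ID" myVnetRG myVnet mySubnet)"
```

Scoping the assignment to the subnet keeps the grant as narrow as the error message requires.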

Member

slack commented May 17, 2018

Thanks for catching this, we overlooked it in docs. Will update instructions for existing VNet.

sabbour commented May 24, 2018

I'm also experiencing this behavior on an AKS cluster with Advanced Networking (using a new VNet).

JunSun17 commented Jun 4, 2018

@nphmuller @sabbour

I tried this and can reproduce the issue. My setup was:

  1. a new default service principal is created, and
  2. a new VNet/subnet is created in place.

I also checked the events and noticed the following:

  • a new default SP is created (in my case, xyz-azureSP-20180603214911). This is expected, and this SP is used to create the new external LB.
  • the subnet is created by a different SP, which seems to belong to a resource group named cleanupservice. This seems strange.

Btw, another mitigation that works for me is to provide a pre-created SP instead of creating a new SP in place.

I will continue to investigate this and post an update.

ppadial commented Jun 4, 2018

I'm having the same issue here, the command kubectl describe service myservice says:

  Normal   EnsuringLoadBalancer        29s (x5 over 1m)  service-controller  Ensuring load balancer
  Warning  CreatingLoadBalancerFailed  28s               service-controller  Error creating load balancer (will retry): failed to ensure load balancer for service default/nginx-demo: [ensure(default/nginx-demo): lb(kubernetes) - failed to ensure host in pool: "network.InterfacesClient#CreateOrUpdate: Failure responding to request: StatusCode=403 -- Original Error: autorest/azure: Service returned an error. Status=403 Code=\"LinkedAuthorizationFailed\" Message=\"The client '-----' with object id '----' has permission to perform action 'Microsoft.Network/networkInterfaces/write' on scope '/subscriptions/--------/resourceGroups/MC_myresourcegroup_myclustername_westeurope/providers/Microsoft.Network/networkInterfaces/aks-agentpool-xxxxx-nic-0'; however, it does not have permission to perform action 'Microsoft.Network/virtualNetworks/subnets/join/action' on the linked scope(s) '/subscriptions/-------/resourceGroups/myresourcegroup/providers/Microsoft.Network/virtualNetworks/MyVNETNAME/subnets/k8s'.\"", ensure(default/nginx-demo): lb(kubernetes) - failed to ensure host in pool: "network.InterfacesClient#CreateOrUpdate: Failure responding to request: StatusCode=403 -- Original Error: autorest/azure: Service returned an error. 
Status=403 Code=\"LinkedAuthorizationFailed\" Message=\"The client '0-------' with object id '0--------' has permission to perform action 'Microsoft.Network/networkInterfaces/write' on scope '/subscriptions/------/resourceGroups/MC_myresoruecegroup_myclustername_westeurope/providers/Microsoft.Network/networkInterfaces/aks-agentpool------nic-2'; however, it does not have permission to perform action 'Microsoft.Network/virtualNetworks/subnets/join/action' on the linked scope(s) '/subscriptions/------/resourceGroups/myresourecegroup/providers/Microsoft.Network/virtualNetworks/MyVMNet/subnets/k8s'.\"", ensure(default/nginx-demo): lb(kubernetes) - failed to ensure host in pool: "network.InterfacesClient#CreateOrUpdate: Failure responding to request: StatusCode=403 -- Original Error: autorest/azure: Service returned an error. Status=403 Code=\"LinkedAuthorizationFailed\" Message=\"The client '-----' with object id '-----' has permission to perform action 'Microsoft.Network/networkInterfaces/write' on scope '/subscriptions/------/resourceGroups/MC_RGName_ClusterName_westeurope/providers/Microsoft.Network/networkInterfaces/aks-agentpool------nic-1'; however, it does not have permission to perform action 'Microsoft.Network/virtualNetworks/subnets/join/action' on the linked scope(s) '/subscriptions/-----/resourceGroups/myresourcegroup/providers/Microsoft.Network/virtualNetworks/MyVNet/subnets/k8s'.\""]

Update: I fixed my issue by granting the Contributor role on the subnet of the VNet to the app account nameofyoucluster-somenumbers. After that, the load balancer works fine: a new public IP is generated and the service is deployed correctly (even the NSG is updated properly).

JunSun17 commented Jun 5, 2018

Ok, I believe the issue is that if "(new) default service principal" is chosen when creating the AKS cluster, the newly created SP only has the "Contributor" role on the created AKS resource group "MC_xxx". There is no additional role assignment for this SP on the VNet/subnet resource. When creating the external LB, this SP does not have permission to interact with the subnet, which results in the permission error we saw.

I see two solutions to correct this issue programmatically:

  1. Add a role assignment to the AKS creation logic so that this SP has the "Contributor" role on the subnet, just like the mitigation pointed out by @nphmuller. The downside is that it requires additional logic in the AKS client code, including the Portal and the Azure CLI.
  2. When creating this SP, assign it a larger scope, say at the subscription level (which includes the VNet/subnet). This should be easy to do, but the downside is that it somewhat violates the principle of least privilege. Then again, we already allow customers to use a pre-created SP, so is that rule already violated?

I will bring this to the team and discuss for a solution.
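To make the scope reasoning above concrete: a role assignment applies to its own scope and everything nested beneath it, so an assignment on the MC_xxx resource group never reaches a subnet that lives in a different resource group. The sketch below is only an illustration; `list_sp_scopes` and `scope_covers` are hypothetical helper names, and the check is a plain string-prefix comparison that treats resource IDs as case-insensitive.

```shell
#!/usr/bin/env bash
# List role name and scope for every assignment held by the service principal.
list_sp_scopes() {
  az role assignment list --assignee "$1" --all \
    --query "[].[roleDefinitionName, scope]" -o tsv
}

# Succeed if an assignment at scope $1 also applies to resource $2,
# i.e. $2 is the same scope or is nested underneath it.
scope_covers() {
  local a t
  a=$(printf '%s' "${1%/}" | tr '[:upper:]' '[:lower:]')
  t=$(printf '%s' "$2" | tr '[:upper:]' '[:lower:]')
  [ "$t" = "$a" ] && return 0
  case "$t" in
    "$a"/*) return 0 ;;
  esac
  return 1
}

# Usage (placeholders): check whether any listed scope covers the subnet.
#   list_sp_scopes "$SP_ID"
#   scope_covers "/subscriptions/$SUB/resourceGroups/MC_xxx" "$SUBNET_ID" || echo "not covered"
```

This also shows why the wider grants reported in this thread (resource group or whole VNet) make the error go away: those scopes are prefixes of the subnet's resource ID.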

JunSun17 commented Jun 7, 2018

We decided to go with solution 1 from the comment above. The requests have been sent to the Portal and CLI teams. For now, please follow @nphmuller's mitigation steps: for the subnet, add the "Contributor" role for the newly created SP.

damadei commented Jun 19, 2018

+1 here having this issue.

maniSbindra commented Jun 19, 2018

When using the portal to create an AKS cluster, and using advanced networking and specifying a custom subnet, it will help if the tooltip explicitly mentions that the SP provided when creating the cluster needs to have contributor rights on the subnet.

jalberto commented Jun 20, 2018

I have the same issue. I followed @nphmuller's workaround (with the Owner role), deleted and recreated the service, and now I get this:

get services
vpn-openvpn-pub   LoadBalancer   10.41.85.190   1.1.1.1   443:32161/TCP   19m

describe services vpn-openvpn-pub
[...]
  Type     Reason                      Age                From                Message
  ----     ------                      ----               ----                -------
  Warning  CreatingLoadBalancerFailed  17m                service-controller  Error creating load balancer (will retry): failed to ensure load balancer for service default/vpn-openvpn-pub: ensure(default/vpn-openvpn-pub): lb(kubernetes) - failed to ensure host in pool: "network.InterfacesClient#CreateOrUpdate: Failure responding to request: StatusCode=403 -- Original Error: autorest/azure: Service returned an error. Status=403 Code=\"LinkedAuthorizationFailed\" Message=\"The client 'db76e0a3-ba1a-4825-a36e-67fed3b9f827' with object id 'db76e0a3-ba1a-4825-a36e-67fed3b9f827' has permission to perform action 'Microsoft.Network/networkInterfaces/write' on scope '/subscriptions/3dd693dd-25f9-471b-a404-19e064398e85/resourceGroups/MC_vlaks01_vlaks01_westeurope/providers/Microsoft.Network/networkInterfaces/aks-agentpool-96971622-nic-0'; however, it does not have permission to perform action 'Microsoft.Network/virtualNetworks/subnets/join/action' on the linked scope(s) '/subscriptions/3dd693dd-25f9-471b-a404-19e064398e85/resourceGroups/vlcommon/providers/Microsoft.Network/virtualNetworks/VlCommonVNET/subnets/vlaks01'.\""
  Normal   EnsuringLoadBalancer        17m (x2 over 19m)  service-controller  Ensuring load balancer
  Normal   EnsuredLoadBalancer         16m                service-controller  Ensured load balancer

jalberto commented Jun 20, 2018

Adding contributor permission to whole VNet (not only subnet) seems to work (no more warnings) but I still cannot contact the service exposed (timeout).

Maybe AKS adds some NSG that needs to be modified?

I checked NSG created by AKS (in MC* RG) and I can see an entry allowing the traffic to that IP in correct port.

Tried several services, in different namespaces, even allowing AKS to create the public IP, still I am not able to communicate with the service.

Note: "HTTP application routing" was disabled at AKS creation time.

jalberto commented Jun 20, 2018

So, I destroyed the AKS cluster, created it again, followed the workaround (before creating any service), and now it works.

Timing issue?

lmcarreiro commented Jun 20, 2018

@jalberto

Adding contributor permission to whole VNet (not only subnet) seems to work (no more warnings) but I still cannot contact the service exposed (timeout).

I put the Contributor permission to the whole resource group where I created the AKS Cluster and the VNet. It is working now.

When you run kubectl get service, does the EXTERNAL-IP column show the IP address that you specified in the yml of your service?

When you run kubectl describe service <your-service-name>, does it show any errors in the events?
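Those two checks can also be scripted. This is a minimal sketch, assuming the service lives in the current namespace; the function names `external_ip` and `lb_failures` are made up for this example.

```shell
#!/usr/bin/env bash
# Print the external IP assigned to a LoadBalancer service.
# Prints nothing while EXTERNAL-IP is still <pending>.
external_ip() {
  kubectl get service "$1" -o jsonpath='{.status.loadBalancer.ingress[0].ip}'
}

# List the load balancer failure events recorded for the service.
lb_failures() {
  kubectl get events \
    --field-selector "involvedObject.kind=Service,involvedObject.name=$1,reason=CreatingLoadBalancerFailed"
}

# Usage:
#   external_ip nginx
#   lb_failures nginx
```

An empty result from `external_ip` together with events from `lb_failures` matches the symptom described at the top of this issue.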

lmcarreiro commented Jun 20, 2018

@nphmuller

Manually give Owner permission (Contributor doesn't work) to the service principal for the subnet.

It worked for me with Contributor, but I put this permission in the resource group where my cluster and vnet are. Not just the virtual subnet like you described.

Author

nphmuller commented Jun 21, 2018

@lmcarreiro
I'd rather give the SP Owner permission to the VNet than Contributor permissions to the entire Resource Group. Bit more Principle of Least Privilege-esque (although the owner permission is still bad).

maetthu commented Jun 29, 2018

Assigning SP the Owner role for the subnet didn't work for me, I still got the same permission error. Assigning Contributor to the Resource Group containing the vnet did. Since the vnet is the only resource in this RG, it might also work to assign Contributor to the vnet itself, though I didn't try that.

JunSun17 commented Jul 6, 2018

The Portal has added a fix that automatically adds the SP as Contributor on the subnet used when creating the AKS cluster. I have confirmed the fix and will close this issue now.

JunSun17 closed this Jul 6, 2018

sukrit007 commented Jul 25, 2018

Just ran into this issue with the aks CLI. I assigned the Owner permission to the service principal for the MC_* group and the subnet, but that did not seem to work for our existing cluster, and the addon-http-application-routing-nginx-ingress-controller pods still seem to be in CrashLoop. (Note: we did not pass a service principal to the CLI; az cli created the principal for us. azure-cli (2.0.42))

iMartyn commented Aug 23, 2018

@JunSun17 Surely the issue should remain open until the CLI has been updated as well?

Author

nphmuller commented Aug 23, 2018

@iMartyn I've created a new AKS cluster via the CLI (version 2.0.44) yesterday and it seems the permission was set automatically. So I think the original issue is fixed. Maybe you're running into another issue?

You can check it via Azure Portal by going to Virtual Networks -> YourVnet -> Subnets -> YourAksVnet -> Manage Users. The Service Principal should be in that list under Network Contributor.
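The same check can be done from the Azure CLI. This is a rough sketch; `roles_on_scope` and `has_role_on_scope` are hypothetical helper names, and `$SP_ID` and `$SUBNET_ID` in the usage comment are placeholders for your service principal's client ID and the subnet's resource ID.

```shell
#!/usr/bin/env bash
# Print the role names assigned to the service principal at the given scope.
roles_on_scope() {
  local sp_client_id="$1" scope="$2"
  az role assignment list --assignee "$sp_client_id" --scope "$scope" \
    --query "[].roleDefinitionName" -o tsv
}

# Succeed if one of those roles is exactly the expected one
# (default: Network Contributor).
has_role_on_scope() {
  roles_on_scope "$1" "$2" | grep -qxF "${3:-Network Contributor}"
}

# Usage (placeholders):
#   has_role_on_scope "$SP_ID" "$SUBNET_ID" && echo "role assignment present"
```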

iMartyn commented Aug 23, 2018

I can confirm that it does not give the SP Owner permission - there is even an error message, whenever the CLI is asked to create a cluster with Advanced Networking :
AAD role propagation done[############################################] 100.0000%Operation failed:
Then a message about timeout. (I don't have it in my scrollback to paste).

This was observed from two different machines on two different days and two different OS' (mac and linux) so I'm pretty sure it's not actually a timeout.

It does seem very similar to Azure/azure-cli#5190, which was closed, but other people are saying that bug is back.

Author

nphmuller commented Aug 23, 2018

Thanks. Seems like something broke since yesterday. But it looks like a different issue than this one.

I recommend creating a new GitHub issue so the appropriate person can look at it. (Or even creating an Azure support ticket or a Twitter message. You'll probably get a quicker response that way.)

iMartyn commented Aug 24, 2018

I highly doubt that, my experience of Azure support is not to the level that I would expect any kind of useful response.
This did not break yesterday, it has definitely been this way since at least Wednesday last week.

JunSun17 commented Aug 24, 2018

Hi all, I will pass this request to the CLI devs, and will check back a bit later to verify the issue and then the fix. Thanks!

zqingqing1 referenced this issue Sep 4, 2018

Merged

[AKS] role-assignment-fix: #357 #7222

2 of 2 tasks complete

Xorima commented Sep 14, 2018

@JunSun17 Which release will this fix be in?

JunSun17 commented Sep 14, 2018

The fix should already be deployed worldwide. Please report back if your issue is not resolved.

logcorner commented Sep 16, 2018

[Image: loadbalancererr]

JunSun17 commented Sep 19, 2018

@logcorner Are you using advanced networking? If so, have you tried the steps in: #357 (comment)

If not, can you file an incident through support?
