
Azure Portal Experience - Scaling cluster failed & cluster entered "failed state" due to core quota #542

Closed
tomkerkhove opened this issue Jul 19, 2018 · 12 comments

@tomkerkhove
Member

What happened:
When trying to add a new node via the Azure Portal, the operation failed and reported that the cluster had gone into a failed state. After contacting Azure Support, it turned out that this was because we had reached our core quota for that VM family.

What you expected to happen:
The portal should provide the same information as the az aks scale command does:

az aks scale --resource-group <resource-group-name> --name <cluster-name> --node-count 5
Deployment failed. Correlation ID: <correlation-id> exceeding quota limits of Core. Maximum allowed: 10, Current in use: 10, Additional requested: 2. Please read more about quota increase at http://aka.ms/corequotaincrease

This would make it easier to mitigate the problem without going through support, as the quota usage is clearly indicated here:
[screenshot: core-quota usage in the portal]

But you need to already know that this was the issue.
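For reference, the same quota usage is visible from the CLI without opening a support case; a minimal sketch, with <location> as a placeholder:

```bash
# Show per-VM-family core usage and limits for the region the cluster
# runs in; the row for your node VM family reveals whether the quota
# is exhausted. <location> is a placeholder, e.g. westeurope.
az vm list-usage --location <location> --output table
```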

How to reproduce it (as minimally and precisely as possible):

  1. Create a new AKS cluster
  2. Scale it up by enough nodes to exceed the core quota on your current subscription (a CLI sketch follows below)
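A minimal CLI version of the repro, as a sketch: resource names are placeholders, and it assumes 2-core nodes (e.g. Standard_DS2_v2) against the default 10-core family quota, so scaling past 5 nodes trips the limit.

```bash
# Create a small cluster, then scale it past the core quota.
az aks create --resource-group <resource-group> --name <cluster-name> \
  --node-count 3 --generate-ssh-keys
az aks scale --resource-group <resource-group> --name <cluster-name> \
  --node-count 8
```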
@tomkerkhove tomkerkhove changed the title Azure Portal Experience - Scaling cluster failed due to core quota but cluster entered "failed state" Azure Portal Experience - Scaling cluster failed & cluster entered "failed state" due to core quota Jul 19, 2018
@weinong
Contributor

weinong commented Jul 22, 2018

@tomkerkhove The portal should have gotten the same error message as the CLI does. Interesting. Can you send details like the subscription ID, resource group, and AKS name to aks-help@service.microsoft.com so I can take a look?

@weinong weinong self-assigned this Jul 22, 2018
@endforward

I encountered the same issue after trying to upgrade the Kubernetes version (I guess the upgrade happens by provisioning new nodes with the newer version). The portal showed the same error message: "Container service in failed state contact support". When I contacted support, they informed me I was maxing out our core quota. The cluster was set up with the old D2 series, so they also wouldn't allow a quota increase, and thus I had to destroy and redeploy the entire cluster. Some sort of pre-condition check in the portal, as well as a more informative error message, would greatly improve the UX.

I have already deleted the cluster and redeployed with D2v3. @weinong Let me know if it is still helpful for you to receive the details of the old cluster.

@rocketraman

rocketraman commented Sep 17, 2018

Like @endforward, I also encountered the same issue during an upgrade.

Three things:

  1. The cluster manager should not even start the upgrade unless the quota is available to complete it.

  2. Azure AKS support needs to be aware of this issue and how to recover a cluster from the Failed state. So far, they have told me (incorrectly) that a scale operation will fix the cluster state. It does not:

Failed to save container service 'my-cluster-name'. Error: Operation is not allowed while cluster is being upgrading or failed in upgrade

  3. When doing the upgrade from the portal rather than the command line, the fact that quota is the underlying reason for the error is not obvious at all. There seems to be no way to see it, in fact, unless one looks at the notifications, and even there it is buried in a wall of error output. (A CLI workaround for surfacing the error follows below.)
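For anyone hitting point 3: the underlying error is retrievable from the CLI even when the portal buries it. A sketch, with <resource-group> and <cluster-name> as placeholders:

```bash
# Check whether the cluster is stuck in the Failed provisioning state.
az aks show --resource-group <resource-group> --name <cluster-name> \
  --query provisioningState --output tsv

# List recent failed operations on the resource group; the quota error
# from the underlying deployment shows up here in full.
az monitor activity-log list --resource-group <resource-group> \
  --status Failed --output table
```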

@Karishma-Tiwari-MSFT

@weinong Is there an update on this issue? Customers are still seeing issues with upgrading the cluster, resulting in "The container service is in failed state". In the Azure portal, there is no other detail provided on what that error means. Can you please share the solution in this case? Do we have to create a new cluster and redeploy?

@jnoller
Contributor

jnoller commented Jan 16, 2019

@Tiwari-MSFT I've labeled this as a known issue (i.e., the Azure portal does not make the error explicit). The ultimate fix is to ensure that any operation does a pre-flight check to confirm that the subscription's capacity allows the action. Additionally, errors resulting from quota limits need to be shown with the same specificity as in the CLI.

Currently, as noted, because the upgrade fails due to the underlying quota issue, it "locks" the cluster to the pending action. This means an upgrade/scale/etc. operation will not function.

I will leave this open, communicate it to support and file an AKS backlog item to track fixes. This issue will be closed when the fixes land.
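Until that pre-flight check lands, something equivalent can be done by hand. A sketch under stated assumptions: the node pool uses 2-core Standard_DS2_v2 nodes (quota family standardDSv2Family); adjust the variables for your cluster.

```bash
# Hand-rolled pre-flight quota check (a sketch of what AKS should do
# server-side). Assumes 2-core Standard_DS2_v2 nodes; adjust REGION,
# FAMILY, CORES_PER_NODE, and EXTRA_NODES for your cluster.
REGION=westeurope
FAMILY=standardDSv2Family
CORES_PER_NODE=2
EXTRA_NODES=2

read -r USED LIMIT < <(az vm list-usage --location "$REGION" \
  --query "[?name.value=='$FAMILY'] | [0].[currentValue,limit]" \
  --output tsv)

NEEDED=$((CORES_PER_NODE * EXTRA_NODES))
if (( USED + NEEDED > LIMIT )); then
  echo "Would exceed quota: $USED/$LIMIT cores used, $NEEDED more requested."
else
  echo "Quota OK: $USED/$LIMIT cores used, $NEEDED more requested."
fi
```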

@Karishma-Tiwari-MSFT

@jnoller Thanks for sharing this information. It is helpful to know that this is a known issue. I will communicate that to the customers.

@BertusVanZyl

I updated from 1.11.4 to 1.11.5 and ran into exactly this problem. This was reported in July 2018. This is not a minor issue, because I was told by support that to fix it I have to recreate the service. Why is this still not fixed? At the very least, put a warning message on the portal page where you do the upgrade stating "some error messages are not reported yet, we suggest using the CLI to do upgrades". Or just disable it until it actually works properly.

@frohikey

frohikey commented Feb 24, 2019

Exactly the same thing happened to me today when upgrading from 1.12.4 to 1.12.5 in the portal, with the infamous error: Failed to save container service 'xxxxxx'. Error: resources.DeploymentsClient#CreateOrUpdate: Failure sending request: StatusCode=400 -- Original Error: Code="InvalidTemplateDeployment" Message="The template deployment is not valid according to the validation procedure. ... Operation results in exceeding quota limits of Core.

AKS remained in the failed state. No possibility to upgrade/scale/whatever.

Out of curiosity, I decided to try out the CLI (after the support request for a quota increase had been completed).

The cluster was still in the failed state. Note that the available upgrade version reported is the same as the current version:

az aks get-upgrades --resource-group xxxxxx --name xxxxxx --output table

Name     ResourceGroup    MasterVersion    NodePoolVersion    Upgrades
-------  ---------------  ---------------  -----------------  ----------
default  xxxxxx           1.12.5           1.12.5             1.12.5

I tried the upgrade anyway:

az aks upgrade --resource-group xxx --name xxx --kubernetes-version 1.12.5

And I ended up with a working cluster; the failed state was resolved.

az aks get-upgrades --resource-group xxxxxx --name xxxxxx --output table

Name     ResourceGroup    MasterVersion    NodePoolVersion    Upgrades
-------  ---------------  ---------------  -----------------  --------------
default  xxxxxx           1.12.5           1.12.5             None available
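In other words, the recovery that worked here (assuming the quota increase has already been granted) is to re-run the upgrade to the version the cluster already reports, which re-triggers reconciliation. A condensed sketch with placeholders:

```bash
# Re-run the upgrade to the currently reported version to clear the
# failed state (only after the quota increase has been granted).
CURRENT=$(az aks show --resource-group <resource-group> --name <cluster-name> \
  --query kubernetesVersion --output tsv)
az aks upgrade --resource-group <resource-group> --name <cluster-name> \
  --kubernetes-version "$CURRENT"
```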

@nodeselector

@jnoller I think this one was taken care of with this: https://github.com/Azure/AKS/blob/master/CHANGELOG.md#release-2019-03-07

Should this be closed?

@jnoller
Contributor

jnoller commented Apr 2, 2019

AKS now confirms all quota availability prior to allowing a given action to be executed. Closing as Done.

@mhtsharma948

I'm getting the same error: The container service is in failed state. Click here to open new support request. This error occurred when I was trying to upgrade my k8s version.
I figured out that this issue is supposed to be resolved in https://github.com/Azure/AKS/blob/master/CHANGELOG.md#release-2019-03-21, but I'm still facing the same issue.

@jnoller
Contributor

jnoller commented May 14, 2019

@mhtsharma948 Please open a support request; these failures are cluster-specific.

ghost locked as resolved and limited conversation to collaborators Aug 3, 2020