
Azure Portal Experience - Scaling cluster failed & cluster entered "failed state" due to core quota #542

Closed
tomkerkhove opened this issue Jul 19, 2018 · 12 comments

@tomkerkhove
Member

What happened:
When trying to add a new node via the Azure Portal, the operation failed and reported that the cluster had gone into a failed state. After contacting Azure Support, it turned out that this was because we had reached our core quota for that VM family.

What you expected to happen:
The portal should provide the same information as the az aks scale command does:

az aks scale --resource-group <resource-group-name> --name <cluster-name> --node-count 5
Deployment failed. Correlation ID: <correlation-id> exceeding quota limits of Core. Maximum allowed: 10, Current in use: 10, Additional requested: 2. Please read more about quota increase at http://aka.ms/corequotaincrease

This would make it easier to mitigate the problem without going through support, as the quota usage is clearly indicated here:
[screenshot: core-quota usage in the portal]

But you need to already know that this was the issue.
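For reference, the same quota usage is visible from the CLI without opening a support case; a minimal sketch, with <location> as a placeholder:

```bash
# Show per-VM-family core usage and limits for the region the cluster
# runs in; the row for your node VM family reveals whether the quota
# is exhausted. <location> is a placeholder, e.g. westeurope.
az vm list-usage --location <location> --output table
```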

How to reproduce it (as minimally and precisely as possible):

  1. Create a new AKS cluster
  2. Scale it up by enough nodes to exceed the core quota on your current subscription (a CLI sketch follows below)
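A minimal CLI version of the repro, as a sketch: resource names are placeholders, and it assumes 2-core nodes (e.g. Standard_DS2_v2) against the default 10-core family quota, so scaling past 5 nodes trips the limit.

```bash
# Create a small cluster, then scale it past the core quota.
az aks create --resource-group <resource-group> --name <cluster-name> \
  --node-count 3 --generate-ssh-keys
az aks scale --resource-group <resource-group> --name <cluster-name> \
  --node-count 8
```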
@tomkerkhove tomkerkhove changed the title Azure Portal Experience - Scaling cluster failed due to core quota but cluster entered "failed state" Azure Portal Experience - Scaling cluster failed & cluster entered "failed state" due to core quota Jul 19, 2018
@weinong
Contributor

weinong commented Jul 22, 2018

@tomkerkhove The portal should have gotten the same error message as the CLI does. Interesting. Can you send details like the subscription ID, resource group, and AKS name to aks-help@service.microsoft.com so I can take a look?

@weinong weinong self-assigned this Jul 22, 2018
@endforward

I encountered the same issue after trying to upgrade the Kubernetes version (I guess the upgrade happens by provisioning new nodes with the newer version). The portal showed the same error message: "Container service in failed state contact support". When I contacted support, they informed me I was maxing out our core quota. The cluster was set up with the old D2 series, so they also wouldn't allow a quota increase, and thus I had to destroy and redeploy the entire cluster. Some sort of pre-condition check in the portal, as well as a more informative error message, would greatly improve the UX.

I have already deleted the cluster and redeployed with D2v3. @weinong Let me know if it is still helpful for you to receive the details of the old cluster.

@rocketraman

rocketraman commented Sep 17, 2018

Like @endforward, I also encountered the same issue during an upgrade.

Three things:

  1. The cluster manager should not even start the upgrade unless the quota is available to complete it.

  2. Azure AKS support needs to be aware of this issue and how to recover a cluster from the Failed state. So far, they have told me (incorrectly) that a scale operation will fix the cluster state. It does not:

Failed to save container service 'my-cluster-name'. Error: Operation is not allowed while cluster is being upgrading or failed in upgrade

  3. When doing the upgrade from the portal rather than the command line, the fact that quota is the underlying reason for the error is not obvious at all. There seems to be no way to see it, in fact, unless one looks at the notifications, and even there it is buried in a wall of error output. (A CLI workaround for surfacing the error follows below.)
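For anyone hitting point 3: the underlying error is retrievable from the CLI even when the portal buries it. A sketch, with <resource-group> and <cluster-name> as placeholders:

```bash
# Check whether the cluster is stuck in the Failed provisioning state.
az aks show --resource-group <resource-group> --name <cluster-name> \
  --query provisioningState --output tsv

# List recent failed operations on the resource group; the quota error
# from the underlying deployment shows up here in full.
az monitor activity-log list --resource-group <resource-group> \
  --status Failed --output table
```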

@Karishma-Tiwari-MSFT

@weinong Is there an update on this issue? Customers are still seeing issues with upgrading the cluster, resulting in "The container service is in failed state". In the Azure portal, there is no other detail provided on what that error means. Can you please share the solution in this case? Do we have to create a new cluster and redeploy?

@jnoller
Contributor

jnoller commented Jan 16, 2019

@Tiwari-MSFT I've labeled this as a known issue (i.e., the Azure portal does not make the error explicit). The ultimate fix is to ensure that any operation does a pre-flight check to confirm that the subscription's capacity allows the action. Additionally, errors resulting from quota limits need to be shown with the same specificity as in the CLI.

Currently, as noted, because the upgrade fails due to the underlying quota issue, it "locks" the cluster to the pending action. This means an upgrade/scale/etc. operation will not function.

I will leave this open, communicate it to support and file an AKS backlog item to track fixes. This issue will be closed when the fixes land.
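Until that pre-flight check lands, something equivalent can be done by hand. A sketch under stated assumptions: the node pool uses 2-core Standard_DS2_v2 nodes (quota family standardDSv2Family); adjust the variables for your cluster.

```bash
# Hand-rolled pre-flight quota check (a sketch of what AKS should do
# server-side). Assumes 2-core Standard_DS2_v2 nodes; adjust REGION,
# FAMILY, CORES_PER_NODE, and EXTRA_NODES for your cluster.
REGION=westeurope
FAMILY=standardDSv2Family
CORES_PER_NODE=2
EXTRA_NODES=2

read -r USED LIMIT < <(az vm list-usage --location "$REGION" \
  --query "[?name.value=='$FAMILY'] | [0].[currentValue,limit]" \
  --output tsv)

NEEDED=$((CORES_PER_NODE * EXTRA_NODES))
if (( USED + NEEDED > LIMIT )); then
  echo "Would exceed quota: $USED/$LIMIT cores used, $NEEDED more requested."
else
  echo "Quota OK: $USED/$LIMIT cores used, $NEEDED more requested."
fi
```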

@Karishma-Tiwari-MSFT

@jnoller Thanks for sharing this information. It is helpful to know that this is a known issue. I will communicate that to the customers.

@BertusVanZyl

I updated from 1.11.4 to 1.11.5 and ran into exactly this problem. This was reported in July 2018. This is not a minor issue, because I was told by support that to fix it I have to recreate the service. Why is this still not fixed? At the very least, put a warning message on the portal page where you do the upgrade stating "some error messages are not reported yet, we suggest using the CLI to do upgrades". Or just disable it until it actually works properly.

@frohikey

frohikey commented Feb 24, 2019

Exactly the same thing happened to me today when upgrading from 1.12.4 to 1.12.5 in the portal, with the infamous error: Failed to save container service 'xxxxxx'. Error: resources.DeploymentsClient#CreateOrUpdate: Failure sending request: StatusCode=400 -- Original Error: Code="InvalidTemplateDeployment" Message="The template deployment is not valid according to the validation procedure. ... Operation results in exceeding quota limits of Core.

AKS remained in the failed state. No possibility to upgrade/scale/whatever.

Out of curiosity, I decided to try out the CLI (after the support request for a quota increase had been completed).

The cluster was still in the failed state. Note that the available upgrade version reported is the same as the current version:

az aks get-upgrades --resource-group xxxxxx --name xxxxxx --output table

Name     ResourceGroup    MasterVersion    NodePoolVersion    Upgrades
-------  ---------------  ---------------  -----------------  ----------
default  xxxxxx           1.12.5           1.12.5             1.12.5

I tried the upgrade anyway:

az aks upgrade --resource-group xxx --name xxx --kubernetes-version 1.12.5

And I ended up with a working cluster; the failed state was resolved.

az aks get-upgrades --resource-group xxxxxx --name xxxxxx --output table

Name     ResourceGroup    MasterVersion    NodePoolVersion    Upgrades
-------  ---------------  ---------------  -----------------  --------------
default  xxxxxx           1.12.5           1.12.5             None available
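In other words, the recovery that worked here (assuming the quota increase has already been granted) is to re-run the upgrade to the version the cluster already reports, which re-triggers reconciliation. A condensed sketch with placeholders:

```bash
# Re-run the upgrade to the currently reported version to clear the
# failed state (only after the quota increase has been granted).
CURRENT=$(az aks show --resource-group <resource-group> --name <cluster-name> \
  --query kubernetesVersion --output tsv)
az aks upgrade --resource-group <resource-group> --name <cluster-name> \
  --kubernetes-version "$CURRENT"
```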

@nodeselector

@jnoller I think this one was taken care of with this: https://github.com/Azure/AKS/blob/master/CHANGELOG.md#release-2019-03-07

Should this be closed?

@jnoller
Contributor

jnoller commented Apr 2, 2019

AKS now confirms all quota availability prior to allowing a given action to be executed. Closing as Done.

@mhtsharma948

I'm getting the same error: The container service is in failed state. Click here to open new support request. This error occurred when I was trying to upgrade my k8s version.
I figured out that this issue is supposed to be resolved in https://github.com/Azure/AKS/blob/master/CHANGELOG.md#release-2019-03-21, but I'm still facing the same issue.

@jnoller
Contributor

jnoller commented May 14, 2019

@mhtsharma948 Please open a support request; these failures are cluster-specific.

ghost locked as resolved and limited conversation to collaborators Aug 3, 2020