Azure Portal Experience - Scaling cluster failed & cluster entered "failed state" due to core quota #542
Comments
@tomkerkhove The portal should have shown the same error message as the CLI does. Interesting. Can you send details such as the subscription ID, resource group, and AKS name to aks-help@service.microsoft.com so I can take a look?
I encountered the same issue after trying to upgrade the Kubernetes version (I guess the upgrade happens by provisioning new nodes with the newer version). The portal showed the same error message, "Container service in failed state contact support". When I contacted support, they informed me I was maxing out our core quota. The cluster was set up with the old D2 series, so they also wouldn't allow a quota increase, and thus I had to destroy and redeploy the entire cluster. So some sort of pre-condition check in the portal, as well as a more informative error message, would greatly improve the UX. I have already deleted the cluster and redeployed with D2v3. @weinong Let me know if it is still helpful for you to receive the details of the old cluster.
Like @endforward, I also encountered the same issue during an upgrade. Three things:
@weinong Is there an update on this issue? Customers are still seeing issues with upgrading the cluster, resulting in "The container service is in failed state". The Azure portal provides no other detail on what that error means. Can you please share the solution in this case? Do we have to create a new cluster and deploy?
@Tiwari-MSFT I've labeled this as a known issue (i.e. the Azure portal does not make the error explicit). The ultimate fix is to ensure that any operation does a pre-flight check to verify the subscription has the capacity to allow the action. Additionally, errors resulting from quota issues need to be shown with the same specificity as the CLI. Currently, as noted, because the upgrade fails due to the underlying quota issue, it "locks" the cluster to the pending action, which means an upgrade/scale/etc. operation will not function. I will leave this open, communicate it to support, and file an AKS backlog item to track fixes. This issue will be closed when the fixes land.
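Until such a pre-flight check ships, you can approximate it manually before scaling or upgrading. A minimal sketch, assuming a hypothetical region and VM family (substitute your own); `az vm list-usage` reports per-family vCPU consumption against the regional quota:

```bash
# Show current vCPU usage vs. limit for every VM family in the region.
az vm list-usage --location westeurope --output table

# Narrow it down to the family your node pool uses, e.g. Dv3.
az vm list-usage --location westeurope --output table | grep -i "Dv3"
```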
@jnoller Thanks for sharing this information. It is helpful to know that this is a known issue. I will communicate that to the customers.
I updated from 1.11.4 to 1.11.5 and ran into exactly this problem. This was reported in July 2018. It is not a minor issue, because I was told by support that the only fix is to recreate the service. Why is this still not fixed? At the very least, put a warning message on the portal upgrade page stating "some error messages are not reported yet; we suggest using the CLI to do upgrades". Or just disable it until it actually works properly.
Exactly the same thing happened to me today when upgrading from 1.12.4 to 1.12.5 in the portal, with the infamous error: Failed to save container service 'xxxxxx'. Error: resources.DeploymentsClient#CreateOrUpdate: Failure sending request: StatusCode=400 -- Original Error: Code="InvalidTemplateDeployment" Message="The template deployment is not valid according to the validation procedure. ... Operation results in exceeding quota limits of Core. AKS remained in the failed state, with no possibility to upgrade, scale, or anything else. Out of curiosity, I decided to try the CLI (after the support request for an increased quota had been completed). The cluster was still in the failed state, and the available upgrade version was reported as the same version already running. I tried the upgrade anyway, and I ended up with a working cluster; the failed state was resolved. (See the CLI sketch below.)
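For anyone hitting the same state, the steps above translate roughly to the following CLI calls. This is a sketch with hypothetical resource group and cluster names; substitute your own:

```bash
# Check which upgrade versions the control plane reports as available.
az aks get-upgrades --resource-group myResourceGroup --name myAKSCluster --output table

# Re-run the upgrade; in the case above, retrying the pending operation
# on the failed cluster cleared the failed state.
az aks upgrade --resource-group myResourceGroup --name myAKSCluster \
    --kubernetes-version 1.12.5

# Confirm the cluster left the failed state.
az aks show --resource-group myResourceGroup --name myAKSCluster \
    --query provisioningState --output tsv
```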
@jnoller I think this one was taken care of with this: https://github.com/Azure/AKS/blob/master/CHANGELOG.md#release-2019-03-07 Should this be closed?
AKS now confirms all quota availability prior to allowing a given action to be executed. Closing as Done.
I'm getting the same error.
@mhtsharma948 Please open a support request; these failures are cluster-specific.
What happened:
When trying to add a new node via the Azure Portal, the operation failed and the portal mentioned that the cluster went into a failed state. After contacting Azure Support, it turned out that this was because we had reached our core quota for that VM family.
What you expected to happen:
The portal should provide the same information as the az aks scale command does. This would make it easier to mitigate the problem instead of going through support, as the cause is clearly indicated here:
![core-quota](https://user-images.githubusercontent.com/4345663/42926032-722f8088-8b30-11e8-99ac-aa3e16c811ca.jpg)
But you need to know that this was the issue.
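For reference, the CLI call that surfaces the quota error shown in the screenshot above looks like this (resource group and cluster names here are hypothetical):

```bash
# Scaling via the CLI fails fast with an explicit core-quota message
# when the subscription's vCPU quota would be exceeded.
az aks scale --resource-group myResourceGroup --name myAKSCluster --node-count 5
```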
How to reproduce it (as minimally and precisely as possible):