Upgrade process for Kubernetes is not working (acs-engine not ready for production workloads) #2567
Comments
Is it possible to run cloud-init manually to recreate what is missing?
etcd logs:
kubelet logs:
so I changed every reference to "2.5.2" to "2.3.7" in my
Hi @jalberto, thanks for your courage. What's the first thing we can look at?
@jackfrancis thanks for your time. IMHO:
#1 and #3 sound like longer-term improvements to the upgrade implementation, so thanks for your patience while those are not yet in place (and the burden remains on the user to reconcile cluster state). Let's figure out why the
when running the upgrade command on a working cluster with etcd 2.3.7, the new acs-engine creates the file
clearly the curl command fails to fetch that version, but the script keeps running to the end, so acs-engine reports it was successful. @jackfrancis is there a way to run a command on the master to recreate the initial provisioning steps? Why? Because if I manually go into each master, change that value, and re-run
BTW, point 1 can be achieved just by exiting the script on any error; at least that will stop the process
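The fail-fast behavior suggested above is a few lines of bash. This is a minimal sketch, not the actual acs-engine provision script; the failing `false` command just stands in for a download that errors out:

```shell
#!/usr/bin/env bash
# Sketch of fail-fast provisioning: 'set -euo pipefail' aborts on the
# first failed command, unset variable, or failed pipeline stage.
# The risky steps run in a subshell so this demo can report the result.
(
  set -euo pipefail
  false                # stands in for a failing download; subshell exits here
  echo "not reached"   # never runs
)
rc=$?
echo "provisioning step exited with code: $rc"
```

With `curl` specifically, adding `-f`/`--fail` makes HTTP errors (such as a 404 for a missing etcd release) return a non-zero exit code, so `set -e` stops the script instead of letting it continue silently.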
There is no way to recreate an original provisioning, no. The only semi-notion of cluster "state" (quotation marks intentional) lives in the api model on the client side, which, as you've discovered, is only a fractional representation of the actual cluster, especially w/ respect to the original api model representation vs a newer version of acs-engine.
@jackfrancis this may be related: kubernetes/kubernetes#54918
@jackfrancis I understand it should come from there, but before I ran the upgrade I edited my. Any suggestion? My cluster has been down for many hours now
@jackfrancis more context:
All of this suggests an api model that cannot easily be reconciled with current versions of acs-engine. (Of course this is not ideal, just a reflection of the current limitations of what acs-engine does reliably.) Are you able to build a new cluster and install your workloads on it?
@jackfrancis only if I am able to move the data from the PVs to the new cluster
@jackfrancis this is my
@jackfrancis I generated a new
So it seems the problem is not related to my cluster configuration
more logs (cloud-init output):
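For anyone else digging through a provisioning failure like this, the cloud-init and Azure agent logs live in standard locations on the node. These paths are cloud-init/waagent defaults, not anything acs-engine specific, and a small loop run on the master shows which are present:

```shell
#!/usr/bin/env bash
# Check the standard provisioning log locations on an Azure Linux node.
# (Run on the node itself; root may be required to read some of them.)
checked=0
for f in /var/log/cloud-init-output.log /var/log/cloud-init.log /var/log/waagent.log; do
  checked=$((checked + 1))
  if [ -r "$f" ]; then
    echo "present: $f"
  else
    echo "missing or unreadable: $f"
  fi
done
echo "checked $checked log locations"
```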
@jackfrancis I managed to upgrade to 1.9.6 with the correct etcd 2.3.7, but it's still not working: no cni config found, plus the errors in the previous comment.
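As a quick diagnostic sketch: the kubelet's "no cni config found" message usually means the CNI configuration directory has no conf files. `/etc/cni/net.d` is the conventional kubelet default location (an assumption about this cluster's layout, so verify on your nodes):

```shell
#!/usr/bin/env bash
# "no cni config found" usually means /etc/cni/net.d has no conf files.
CNI_CONF_DIR="/etc/cni/net.d"
if [ -d "$CNI_CONF_DIR" ] && [ -n "$(ls -A "$CNI_CONF_DIR" 2>/dev/null)" ]; then
  cni_state="populated"
else
  cni_state="empty or missing"
fi
echo "CNI conf dir ($CNI_CONF_DIR) is: $cni_state"
```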
I ran an ad hoc test against your api model (acs-engine:v0.14.5) and couldn't get a working cluster:
About manually setting up Calico: make sure Calico does not currently work w/ CNI here, so your kubelet runtime config should be
at least you have connectivity :) must I change Calico for Azure?
the volume flags are present, but host
If you're using Calico for k8s networkPolicy you have to use kubenet for IPAM. So change to
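For reference, the api model knob for this in acs-engine of that era was `kubernetesConfig.networkPolicy`. A minimal sketch follows; field names are per the acs-engine docs of the time, so verify them against the schema of your acs-engine version:

```json
{
  "properties": {
    "orchestratorProfile": {
      "orchestratorType": "Kubernetes",
      "kubernetesConfig": {
        "networkPolicy": "calico"
      }
    }
  }
}
```

Per the comment above, with `calico` as the networkPolicy the kubelet is expected to use kubenet for IPAM rather than a CNI IPAM plugin.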
@jackfrancis already changed, and rebooted the VM; nothing new. On the other hand, I cannot create a new cluster because there is a lack of VMs in westeu; I already tried 3 sizes
@jackfrancis I don't mind removing Calico if it solves the issue
Here is the provision script we run, for reference, if you want to try replaying things manually: https://github.com/Azure/acs-engine/blob/master/parts/k8s/kubernetesmastercustomscript.sh (also in
returns nothing, and that's just the beginning of the script. I really wonder why this script doesn't just exit if any required value is missing, instead of blindly running commands. I suspect there is another script that sets up all these empty vars before this one runs. Any clue?
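The guard being asked for is one line of POSIX shell per variable. A sketch, where `ETCD_VERSION` is an illustrative name rather than necessarily the one the real provision script uses:

```shell
#!/usr/bin/env bash
# Abort early, with a clear message, if a required variable is unset or empty.
# ${VAR:?message} is POSIX: it stops the script when VAR has no value.
ETCD_VERSION="2.3.7"   # normally injected by the template/provisioning layer

: "${ETCD_VERSION:?ETCD_VERSION is required but empty; aborting provisioning}"
echo "provisioning etcd v${ETCD_VERSION}"
```

If `ETCD_VERSION` were empty, the script would stop at the `:` line with the given message instead of running later commands against an empty value.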
every key in
I opened a critical case in Azure; I just cannot keep having a down cluster with no solution. Thanks for your help @jackfrancis. I am really upset with acs-engine; I hope the same code is not used in AKS, or it will be scary to use
acs-engine comes with no operational guarantees. This is the contract users make w/ this tool. I understand why your experience is upsetting, and we take this as valuable feedback that our documentation does not adequately express the limits of what will and will not work w/ respect to cluster lifecycle management driven by acs-engine alone.

It is precisely these kinds of limitations in acs-engine that inform AKS's value proposition above and beyond, and inform the additional orchestration implementations that separate AKS from acs-engine. AKS uses a subset of acs-engine as an SDK-like library dependency, not the entire codebase as-is, which is the workflow you're currently suffering through.

To repeat and emphasize for clarity and transparency (not because it is what you want to hear): acs-engine is a collaborative, open source project to facilitate rapid development of Kubernetes (and other container cluster kit) on Azure. The only support model is the PR process: either the issue process, advocacy, and project maintainer response in the form of a PR + release, or the submission and acceptance of your own PR. By design, this model produces changes in code (and potentially against your existing clusters, or onto novel clusters) over the course of days/weeks/months, but not in an appropriate amount of time to support a production operations response.

Once Azure identifies this as a customer-built and maintained cluster, they will deprioritize the issue you've opened. Again, sorry that there is no good news here w/ respect to your current situation. I hope this transparency is helpful, especially in the long term as it pertains to whether or not acs-engine is an appropriate tool for your Kubernetes cluster management toolkit.
@jackfrancis I understand what you say, and I understand the risk of using acs-engine. That said, what I don't understand is how an MS/Azure product/project (because it is under the MS umbrella) is in this shape. At the same time, acs-engine declares "this is a non-supported product" while saying "this is community driven and we listen to your feedback and implement it". If you check my open issues here, you will see I spent a considerable amount of time giving feedback and tangible solutions, but nothing gets implemented or fixed.

What I learnt today is: acs-engine is not supported, but it is not community driven either, as it follows a roadmap not decided by the community. So it's a "community project driven by Azure product interest", and that never works. Please don't mistake my frustration with how MS/Azure/ACS deals with all these problems for a lack of appreciation for the work of the team, but the ACS team needs clearer direction and a separation of community-driven vs product-driven work.

Final feedback: put in big/bold/shiny on the 1st line of the README: "Don't use this for production workloads under any circumstance"
These are valid criticisms and reflect immature aspects of this project:
This project started out as a "let's see what happens when we open source the Azure ARM template conveniences to the OSS community on Azure that is interested in prototyping container orchestrator clusters"; i.e., the intent of the open source aspects of the project was intrinsically experimental, rather than a purposeful project with a specific Microsoft-desired outcome. The intent of doing this in the open was to empower folks who were impatient with the maturation process of SLA-backed Azure service offerings (e.g., AKS, which is not yet GA) but whose business goals aligned with this particular tech stack category (e.g., Kubernetes, docker).

Arguably, we can do more to engage community contributions and to address the above criticisms. We take that feedback seriously. Consider, though, that the primary objective of this project is to enable folks to iterate and build upon each others' ideas and work to produce novel cluster deployments in Azure. To that end this project continues to add value, with the risk associated with all the above-mentioned caveats.

I would accept the feedback that more disclaimer material would be valuable to warn folks about the support model, but I would push back on your representation of what's dangerous. It's not "production workloads" that acs-engine operates against: it's the Kubernetes API and the way it is configured that matter. The hard work is rationalizing the Azure API w/ Kubernetes-supporting IaaS + Kubernetes runtime config. The intention of acs-engine is to provide tooling for the user to achieve the outcome of a working Kubernetes cluster on Azure. Once that outcome has been achieved, whether or not production workloads should be scheduled to a particular cluster really depends upon the viability of that cluster's configuration as compared to the requirements of the workloads that may land there, including the configuration of the IaaS underneath.
This is not an acs-engine problem: acs-engine merely aims to ease the process of defining, declaring, and applying these IaaS + k8s configurations onto Azure. I would agree, however, that upgrade + scale functionality in acs-engine in its current state is not an acceptable cluster lifecycle management dependency for a production cluster. Whether its limitations are more or less reliable than a hand-rolled cluster lifecycle toolkit is up to the discretion of each user. That reality can be better documented, and we will do so. Thanks for your continued feedback!
Thanks for your time and sincerity @jackfrancis. I totally understand the complexity of the project, but I also expect a high quality outcome from an MS-driven project. I think the main issue is in these words: "The intention of acs-engine is to provide tooling for the user to achieve the outcome of a working Kubernetes cluster on Azure." The meaning of "working Kubernetes cluster on Azure" can be different for each person. For me it includes maintenance: not necessarily "major upgrade" support, but at least proper troubleshooting options and basic maintenance tasks to ensure a "working Kubernetes cluster on Azure" over time, and not only "once". Maybe listing which features are stable and which are not would help, so at least expectations are managed. Thanks for your time
@jackfrancis here is another example: #1961. The change is justified, but it is a breaking change, not properly documented and with no upgrade path attached. This change made me waste 3 days trying to figure out why gitlab-runner was not working anymore
@jackfrancis this table is an accurate visualisation of my frustration with acs-engine: the only way to "fix" a 1.8.4 cluster is to upgrade to at least 1.8.5 so azure-file works as expected (that is what I was trying to fix with this upgrade), but if you jump "too far", to 1.9.0, it breaks again! Notice how I need to go to the official MS/Azure docs to find information about the community-driven acs-engine
Hi @jalberto, for the azure file fileMode and dirMode issue, there have been design changes back and forth. I would suggest using azure file mountOptions to set what you want: https://github.com/andyzhangx/demo/blob/master/linux/azurefile/azurefile-mountoptions.md
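The mountOptions approach from that link boils down to a StorageClass along these lines (requires k8s 1.8.5+, per the follow-up comment; the class name, modes, and skuName here are example values):

```yaml
kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: azurefile
provisioner: kubernetes.io/azure-file
mountOptions:
  - dir_mode=0777
  - file_mode=0777
parameters:
  skuName: Standard_LRS
```

PVCs that reference this StorageClass get azure-file volumes mounted with the given dir_mode/file_mode, independent of the in-tree driver's changing defaults.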
@andyzhangx agreed; the problem is you need at least k8s 1.8.5 to use
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contribution. Note that acs-engine is deprecated--see https://github.com/Azure/aks-engine instead. |
Is this a request for help?:
YES & a BUG REPORT
Is this an ISSUE or FEATURE REQUEST? (choose one):
ISSUE
What version of acs-engine?:
0.14.5
Orchestrator and version (e.g. Kubernetes, DC/OS, Swarm)
kubernetes 1.8.4 to 1.8.10
What happened:
ran the `upgrade` command following #2062 and had lots of troubles:
What you expected to happen:
to work
How to reproduce it (as minimally and precisely as possible):
just try to upgrade an existing cluster
Anything else we need to know:
this is really critical, as my prod cluster is down right now