RFC design docs for Cluster Federation/Ubernetes. #19313
Labelling this PR as size/XL
GCE e2e test build/test passed for commit a50d469ddd4b9f81ad0e487778c2643423da71ea.
Wove the diagrams in, and fixed the worst of the formatting errors.
GCE e2e build/test failed for commit fc055d6a93de7831405d37c2831cea93e7742f66.
Labelling this PR as size/XXL
GCE e2e test build/test passed for commit 9faf858fc4d3a04d258a4192d1334e6750677a89.
GCE e2e test build/test passed for commit 8223c8cc483a6c030bb88178ad8d67cbcab13aee.
1. **Internal discovery and connection**: Pods/containers (running in a Kubernetes cluster) must be able to easily discover and connect to endpoints for Kubernetes services on which they depend, in a consistent way, irrespective of whether those services exist in a different Kubernetes cluster within the same cluster federation. Such clients are henceforth referred to as "cluster-internal clients", or simply "internal clients".
1. **External discovery and connection**: External clients (running outside a Kubernetes cluster) must be able to discover and connect to endpoints for Kubernetes services on which they depend.
1. **External clients predominantly speak HTTP(S)**: External clients are most often, but not always, web browsers, or at least speak HTTP(S) (TBD: list notable exceptions. DNS servers? SIP servers? Database servers?)
1. **Find the "best" endpoint:** Upon initial discovery and connection, both internal and external clients should ideally find "the best" endpoint if multiple eligible endpoints exist. "Best" in this context implies the closest (by network topology) endpoint that is both operational (as defined by some positive health check) and not overloaded (by some published load metric). For example:
Is this an MVP feature or just nice-to-have?
@lavalamp By "this" are you referring to "not overloaded" specifically, or "best endpoint" generally? The former we can definitely live without, although in practice it is extremely useful in reducing human operational toil and making capacity planning way, way simpler. So I think we'd definitely want to add it over time. The latter ("best endpoint") we can also technically do without, although it has pretty major latency (and hence performance) and reliability implications, so it would be a higher priority than "not overloaded".
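Neither the quoted doc text nor the thread pins down a selection algorithm; as a rough sketch of the "best endpoint" idea being debated (the `healthy`/`load`/`distance` fields and the overload threshold are hypothetical, not from the design doc):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Endpoint:
    addr: str
    healthy: bool    # result of some positive health check
    load: float      # published load metric, 0.0 (idle) to 1.0 (saturated)
    distance: int    # network-topology distance from the client

def best_endpoint(endpoints, overload_threshold=0.8) -> Optional[Endpoint]:
    """Pick the closest endpoint that is both operational and not overloaded."""
    eligible = [e for e in endpoints
                if e.healthy and e.load < overload_threshold]
    if not eligible:
        return None
    return min(eligible, key=lambda e: e.distance)
```

Dropping the "not overloaded" filter (the MVP question above) just means removing the `e.load` check; the proximity-based choice still stands on its own.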
Sorry, I made some comments on an outdated revision because I didn't hit refresh, but I think they are all still valid comments. Overall, I like this a lot more than I expected to!
The above 3 are the primary canonical use cases.
1. **Resilience to some correlated failures:** Kubernetes clusters which span multiple availability zones in a region should by default be resilient to complete failure of one entire availability zone.
why "one" instead of N-1? (i concede that even if you meant N-1, "one" is still a true statement. but if you meant N-1, it's better to say N-1.) for example, if I have a cluster that spans 3 zones, why should it behave any differently if 2 zones fail than if 1 zone fails?
@davidopp Because of quorums. A 3-zone cluster where 2 zones have failed does not have a majority quorum remaining, while with only 1 failed zone it does. Note that "N-1" usually refers to the number of zones that remain (i.e. if we have N zones and one fails, we have N-1 zones remaining). If what you're actually asking is whether we want to be resilient to two simultaneous availability zone failures (i.e. "N-2"), the answer is that we could be, provided that N>=5 (i.e. we can always be resilient to ceil(N/2)-1 failures).
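The ceil(N/2)-1 bound above is just majority-quorum arithmetic, and is easy to check for a few zone counts:

```python
import math

def tolerated_zone_failures(n_zones: int) -> int:
    """Maximum simultaneous zone failures such that a strict majority
    of zones (a quorum) survives: ceil(N/2) - 1."""
    return math.ceil(n_zones / 2) - 1

# 3 zones tolerate 1 failure; 5 zones tolerate 2 ("N-2" resilience
# therefore requires N >= 5, as stated above).
for n in (2, 3, 4, 5, 7):
    print(n, tolerated_zone_failures(n))
```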
1. If an RC specifies _N_ acceptable clusters in the clusterSelector, all replicas will be evenly distributed among these clusters.
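As a minimal sketch of the even-distribution rule quoted above (the function name and the round-robin tie-breaking for the remainder are assumptions, not specified by the doc):

```python
def distribute_replicas(total: int, clusters: list) -> dict:
    """Spread `total` replicas as evenly as possible across the given
    clusters; when the count does not divide evenly, the first
    `total % len(clusters)` clusters each receive one extra replica."""
    base, extra = divmod(total, len(clusters))
    return {c: base + (1 if i < extra else 0)
            for i, c in enumerate(clusters)}
```

For example, 10 replicas across 3 acceptable clusters would land as 4/3/3.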
@quinton-hoole just before the merge, I want to raise one query.
When we talk about Ubernetes, it means a federation of multiple clusters. Ideally we want a single cluster to span all the available machines in a single AZ. But in our data center we have over 2000 hypervisors in a single AZ, and considering the scalability of current Kubernetes, we would need to build multiple clusters in a single AZ.
Following the scheduling algorithm described above, there will be circumstances where, if two replicas are defined for a single pod and the user does not specify any cluster, the two replicas will be placed on the two least loaded clusters. If the two least loaded clusters are in a single AZ, then even though the replicas of the pod were split across multiple clusters, they are still in a single AZ, which is not fault tolerant at the AZ level.
I raised this comment to discuss the following approaches:
- Will Ubernetes provide a best practice for deployment, like one cluster per AZ, so we can treat K8S clusters and AZs equally? If we choose this approach, how can we support large scale in a single AZ?
- If we suggest multiple clusters in a single AZ, then the Ubernetes scheduler might need to consider splitting the replicas across different AZs, not only different clusters.
@mfanjie That's a great question, and some of us have recently been giving exactly that some additional thought. Here is a summary of my opinion on the matter (and @davidopp @wojtek-t purely FYI regarding scheduling and scalability respectively):
- Our scaling goals for Kubernetes v1.3 (targeted around 07/2016) are 2000-5000 nodes per cluster, and Google will be investing considerable engineering effort in getting there in the next few months. Supporting that scale is explicitly within the top three priority items for v1.3 on our product roadmap. So your scaling goal of 2,000 nodes in a single cluster and AZ should be achievable without federation by mid 2016 anyway.
- Notwithstanding the above, one of the valid use cases of Cluster Federation/Ubernetes is to accommodate very large availability zones (tens of thousands of nodes), not because of any particular scaling limitations of Kubernetes, but rather because, based on our experience at Google, that's the right way to build large, highly available applications. Concretely, a cluster upgrade gone wrong, or a cluster control plane malfunction, when the cluster size is enormous, can be very, very bad. The solution that we support to address this challenge is building, managing and federating multiple clusters. So we need to address your questions anyway.
- Phase 1 of this project, as outlined in this design doc, is intended to be an absolutely minimum viable product. So please don't see this as the long-term, complete plan for cluster federation. It is explicitly a first implementation step.
- There are a few possible solutions to the specific limitation that you mention, at least one of which we plan to implement soon after phase 1. One solution (as you suggest) would be for the cross-cluster scheduler to explicitly spread across zones as well as clusters (in the same way that for multi-zone "ubernetes lite" clusters, our scheduler spreads across nodes as well as zones, making sensible choices between putting too many replicas on one node vs putting too many replicas in one zone, when forced to choose between the two). I think that this is fairly easy, and a good solution. Another solution (which is already supported in the above design) would be to specify an appropriate NodeSelector to restrict placement to clusters in non-overlapping zones (although this has the downside of artificially introducing hard placement constraints, which has the potential to strand resources).
I will soon be putting together a detailed design document for review, which will cover a variety of pod placement and movement/rescheduling scenarios and designs, where we can figure out all the details, and the best practices that you mention. I will be sure to keep you in the loop there.
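One way to picture the first solution sketched above (spreading across zones as well as clusters) is a greedy placement loop that prioritizes zone-level spread, then cluster-level spread within the chosen zone. This is purely illustrative; the actual cross-cluster scheduler design is deferred to the follow-up doc:

```python
from collections import Counter

def spread(replicas: int, clusters: dict) -> Counter:
    """clusters maps cluster name -> availability zone. Place each
    replica in the zone with the fewest replicas so far, then in the
    least-loaded cluster within that zone, so zone spread takes
    priority over cluster spread. Ties break alphabetically."""
    per_cluster = Counter()
    zones = sorted(set(clusters.values()))
    for _ in range(replicas):
        zone_load = {z: 0 for z in zones}
        for c, z in clusters.items():
            zone_load[z] += per_cluster[c]
        target_zone = min(zones, key=lambda z: (zone_load[z], z))
        candidates = [c for c, z in clusters.items() if z == target_zone]
        target = min(candidates, key=lambda c: (per_cluster[c], c))
        per_cluster[target] += 1
    return per_cluster
```

With two clusters in az1 and one in az2, two replicas land in different AZs (one in az1, one in az2) rather than on the two least loaded clusters of the same AZ, which is exactly the failure mode @mfanjie raised.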
Quinton, thanks for your explanation. That makes sense to me.
Our k8s cluster is not built on bare metal but on VMs; we may deploy 3 or more VMs on a single hypervisor as k8s minions, so there will be up to 6000+ minions in a single AZ.
I agree it is the right approach to build multiple clusters in a single AZ. In addition, from my perspective, federation should have the ability to split the sub-RCs and replicas across different AZs, since when someone defines an Ubernetes application, as an end user he may have no information about the underlying infrastructure, such as AZs, and he may not care which AZ to deploy to; high availability is the only need. So the Ubernetes scheduler should consider this and make the right decision.
Looking forward to your detailed design doc, thank you!
Thanks for cc-ing me @quinton-hoole - the above explanation SGTM
@mfanjie Agreed - user should only specify HA as a high level requirement, not be required to specify which zones to deploy into.
One other thing - why do you run multiple smaller VMs on a single hypervisor/host, rather than a single large VM on that host? We generally favor the latter as the most scalable way to achieve clusters with large amounts of computing resources, i.e. for large clusters, it's better to run smaller numbers of larger nodes than larger numbers of smaller nodes.
@davidopp Can haz LGTM? I think that all comments are adequately addressed. If any more feedback comes in, I think that it can be dealt with in follow-up PRs.
(consul, zookeeper) will probably be developed and supported over time.

## Ubernetes Scheduler
This section could be made significantly clearer by saying something like "Whereas the Kubernetes scheduler schedules pending pods to nodes, the Ubernetes scheduler creates per-cluster API objects corresponding to Ubernetes control plane API objects."
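The reviewer's framing above (federation-level objects fan out into per-cluster API objects, rather than pods being bound to nodes) can be sketched roughly like this; the function and field names are illustrative, not the actual federation API:

```python
def to_cluster_objects(federated_obj: dict, clusters: list) -> list:
    """Derive one per-cluster API object per target cluster from a
    single Ubernetes control-plane object. Each derived object carries
    a back-reference to the federation-level object that owns it."""
    return [
        {**federated_obj, "clusterName": c, "ownerRef": federated_obj["name"]}
        for c in clusters
    ]
```

So where the Kubernetes scheduler's output is a pod-to-node binding, the Ubernetes scheduler's output is a set of ordinary Kubernetes API objects, one per chosen underlying cluster.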
@k8s-bot test this Tests are more than 48 hours old. Re-running tests.
LGTM I'd like to go back and re-read more carefully some time soon, but this is certainly fine for now.
GCE e2e build/test passed for commit 227710d.
Automatic merge from submit-queue
Auto commit by PR queue bot
@quinton-hoole
Does this mean that the Service's ClusterIP is global across the clusters in an Ubernetes federation?
No. See docs/design/federated-services.md
Automatic merge from submit-queue duplicate kube-apiserver to federated-apiserver duplicate the kube-apiserver source code to ube-apiserver and update references cluster specific api objects will be in separate PRs #19313, #21190 #21190 (comment) @nikhiljindal
@quinton-hoole
Are we expecting the first cut of this by Jul 2016, or will it go beyond that?
Automatic merge from submit-queue Federation apiobject cluster add federation api group add cluster api object and registry ~~generate cluster client~~ moved to #24117 update scripts to generate files for /federation #19313 #23653 #23554 @nikhiljindal @quinton-hoole, @deepak-vij, @XiaoningDing, @alfred-huangjian @mfanjie @huangyuqi @colhom
Automatic merge from submit-queue Move install of version handler to genericapiserver This is to satisfy kubectl verification Please review only the last commit. #19313 #23653 @nikhiljindal @quinton-hoole, @deepak-vij, @XiaoningDing, @alfred-huangjian @mfanjie @huangyuqi @colhom
…nsions-replicaset Automatic merge from submit-queue Add extensions/replicaset to federation-apiserver Add extensions/replicaset for federated scheduler (#24038) as all k8s api objects were removed in #23959 Please review only the very last one commit. #19313 #23653 @nikhiljindal @quinton-hoole, @deepak-vij, @XiaoningDing, @alfred-huangjian @mfanjie @huangyuqi @colhom
…-design-docs Auto commit by PR queue bot
A first round of comments and feedback was solicited from a subset of the Federation SIG via Google Docs. Moving this here now for wider review.
Still needs some improved aesthetics (weave the diagrams in, and improve some formatting) but that can be done while addressing other review feedback.
@justinsb @mikedanese @davidopp @lavalamp @thockin @bprashanth