RFC design docs for Cluster Federation/Ubernetes. #19313
Labelling this PR as size/XL
GCE e2e test build/test passed for commit a50d469ddd4b9f81ad0e487778c2643423da71ea.
Wove the diagrams in, and fixed the worst of the formatting errors.
GCE e2e build/test failed for commit fc055d6a93de7831405d37c2831cea93e7742f66.
Labelling this PR as size/XXL
GCE e2e test build/test passed for commit 9faf858fc4d3a04d258a4192d1334e6750677a89.
GCE e2e test build/test passed for commit 8223c8cc483a6c030bb88178ad8d67cbcab13aee.
1. **Internal discovery and connection**: Pods/containers (running in a Kubernetes cluster) must be able to easily discover and connect to endpoints for Kubernetes services on which they depend, in a consistent way, irrespective of whether those services exist in a different Kubernetes cluster within the same cluster federation. Such clients are henceforth referred to as "cluster-internal clients", or simply "internal clients".
1. **External discovery and connection**: External clients (running outside a Kubernetes cluster) must be able to discover and connect to endpoints for Kubernetes services on which they depend.
1. **External clients predominantly speak HTTP(S)**: External clients are most often, but not always, web browsers, or at least speak HTTP(S) (TBD: list notable exceptions. DNS servers? SIP servers? Database servers?)
1. **Find the "best" endpoint:** Upon initial discovery and connection, both internal and external clients should ideally find "the best" endpoint if multiple eligible endpoints exist. "Best" in this context implies the closest (by network topology) endpoint that is both operational (as defined by some positive health check) and not overloaded (by some published load metric). For example:
Is this an MVP feature or just nice-to-have?
@lavalamp By "this" are you referring to "not overloaded" specifically, or "best endpoint" generally? The former we can definitely live without, although in practice it is extremely useful in reducing human operational toil and making capacity planning way, way simpler. So I think we'd definitely want to add it over time. The latter ("best endpoint") we can also technically do without, although it has pretty major latency (and hence performance) and reliability implications, so it would be a higher priority than "not overloaded".
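Neither the quoted doc text nor the thread pins down a selection algorithm; as a rough sketch of the "best endpoint" idea being debated (the `healthy`/`load`/`distance` fields and the overload threshold are hypothetical, not from the design doc):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Endpoint:
    addr: str
    healthy: bool    # result of some positive health check
    load: float      # published load metric, 0.0 (idle) to 1.0 (saturated)
    distance: int    # network-topology distance from the client

def best_endpoint(endpoints, overload_threshold=0.8) -> Optional[Endpoint]:
    """Pick the closest endpoint that is both operational and not overloaded."""
    eligible = [e for e in endpoints
                if e.healthy and e.load < overload_threshold]
    if not eligible:
        return None
    return min(eligible, key=lambda e: e.distance)
```

Dropping the "not overloaded" filter (the MVP question above) just means removing the `e.load` check; the proximity-based choice still stands on its own.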
Sorry, I made some comments on an outdated revision because I didn't hit refresh, but I think they are all still valid comments. Overall, I like this a lot more than I expected to!
The above 3 are the primary canonical use cases.
1. **Resilience to some correlated failures:** Kubernetes clusters which span multiple availability zones in a region should by default be resilient to complete failure of one entire availability zone.
why "one" instead of N-1? (i concede that even if you meant N-1, "one" is still a true statement. but if you meant N-1, it's better to say N-1.) for example, if I have a cluster that spans 3 zones, why should it behave any differently if 2 zones fail than if 1 zone fails?
@davidopp Because of quorums. A 3-zone cluster where 2 zones have failed does not have a majority quorum remaining, while with only 1 failed zone it does. Note that "N-1" usually refers to the number of zones that remain (i.e. if we have N zones and one fails, we have N-1 zones remaining). If what you're actually asking is whether we want to be resilient to two simultaneous availability zone failures (i.e. "N-2"), the answer is that we could be, provided that N>=5 (i.e. we can always be resilient to ceil(N/2)-1 failures).
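The ceil(N/2)-1 bound above is just majority-quorum arithmetic, and is easy to check for a few zone counts:

```python
import math

def tolerated_zone_failures(n_zones: int) -> int:
    """Maximum simultaneous zone failures such that a strict majority
    of zones (a quorum) survives: ceil(N/2) - 1."""
    return math.ceil(n_zones / 2) - 1

# 3 zones tolerate 1 failure; 5 zones tolerate 2 ("N-2" resilience
# therefore requires N >= 5, as stated above).
for n in (2, 3, 4, 5, 7):
    print(n, tolerated_zone_failures(n))
```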
1. If an RC specifies _N_ acceptable clusters in the clusterSelector, all replicas will be evenly distributed among these clusters.
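As a minimal sketch of the even-distribution rule quoted above (the function name and the round-robin tie-breaking for the remainder are assumptions, not specified by the doc):

```python
def distribute_replicas(total: int, clusters: list) -> dict:
    """Spread `total` replicas as evenly as possible across the given
    clusters; when the count does not divide evenly, the first
    `total % len(clusters)` clusters each receive one extra replica."""
    base, extra = divmod(total, len(clusters))
    return {c: base + (1 if i < extra else 0)
            for i, c in enumerate(clusters)}
```

For example, 10 replicas across 3 acceptable clusters would land as 4/3/3.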
@quinton-hoole just before the merge, I want to raise one query.
When we talk about Ubernetes, it means a federation of multiple clusters. Ideally we want a single cluster to span all the available machines in a single AZ. But in our data center we have over 2000 hypervisors in a single AZ, and considering the scalability of current Kubernetes, we would need to build multiple clusters in a single AZ.
Following the scheduling algorithm described above, there will be circumstances where, if two replicas are defined for a single pod and the user does not specify any cluster, the two replicas will be placed on the two least loaded clusters. If the two least loaded clusters are in a single AZ, then even though the replicas of the pod were split across multiple clusters, they are still in a single AZ, which is not fault tolerant at the AZ level.
I raised this comment to discuss the following approaches:
- Will Ubernetes provide a best practice for deployment, like one cluster per AZ, so we can treat K8S clusters and AZs equally? If we choose this approach, how can we support large scale in a single AZ?
- If we suggest multiple clusters in a single AZ, then the Ubernetes scheduler might need to consider splitting the replicas across different AZs, not only different clusters.
@mfanjie That's a great question, and some of us have recently been giving exactly that some additional thought. Here is a summary of my opinion on the matter (and @davidopp @wojtek-t purely FYI regarding scheduling and scalability respectively):
- Our scaling goals for Kubernetes v1.3 (targeted around 07/2016) are 2000-5000 nodes per cluster, and Google will be investing considerable engineering effort in getting there in the next few months. Supporting that scale is explicitly within the top three priority items for v1.3 on our product roadmap. So your scaling goal of 2,000 nodes in a single cluster and AZ should be achievable without federation by mid 2016 anyway.
- Notwithstanding the above, one of the valid use cases of Cluster Federation/Ubernetes is to accommodate very large availability zones (tens of thousands of nodes), not because of any particular scaling limitations of Kubernetes, but rather because, based on our experience at Google, that's the right way to build large, highly available applications. Concretely, a cluster upgrade gone wrong, or a cluster control plane malfunction, when the cluster size is enormous, can be very, very bad. The solution that we support to address this challenge is building, managing and federating multiple clusters. So we need to address your questions anyway.
- Phase 1 of this project, as outlined in this design doc, is intended to be an absolutely minimum viable product. So please don't see this as the long-term, complete plan for cluster federation. It is explicitly a first implementation step.
- There are a few possible solutions to the specific limitation that you mention, at least one of which we plan to implement soon after phase 1. One solution (as you suggest) would be for the cross-cluster scheduler to explicitly spread across zones as well as clusters (in the same way that for multi-zone "ubernetes lite" clusters, our scheduler spreads across nodes as well as zones, making sensible choices between putting too many replicas on one node vs putting too many replicas in one zone, when forced to choose between the two). I think that this is fairly easy, and a good solution. Another solution (which is already supported in the above design) would be to specify an appropriate NodeSelector to restrict placement to clusters in non-overlapping zones (although this has the downside of artificially introducing hard placement constraints, which has the potential to strand resources).
I will soon be putting together a detailed design document for review, which will cover a variety of pod placement and movement/rescheduling scenarios and designs, where we can figure out all the details, and the best practices that you mention. I will be sure to keep you in the loop there.
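One way to picture the first solution sketched above (spreading across zones as well as clusters) is a greedy placement loop that prioritizes zone-level spread, then cluster-level spread within the chosen zone. This is purely illustrative; the actual cross-cluster scheduler design is deferred to the follow-up doc:

```python
from collections import Counter

def spread(replicas: int, clusters: dict) -> Counter:
    """clusters maps cluster name -> availability zone. Place each
    replica in the zone with the fewest replicas so far, then in the
    least-loaded cluster within that zone, so zone spread takes
    priority over cluster spread. Ties break alphabetically."""
    per_cluster = Counter()
    zones = sorted(set(clusters.values()))
    for _ in range(replicas):
        zone_load = {z: 0 for z in zones}
        for c, z in clusters.items():
            zone_load[z] += per_cluster[c]
        target_zone = min(zones, key=lambda z: (zone_load[z], z))
        candidates = [c for c, z in clusters.items() if z == target_zone]
        target = min(candidates, key=lambda c: (per_cluster[c], c))
        per_cluster[target] += 1
    return per_cluster
```

With two clusters in az1 and one in az2, two replicas land in different AZs (one in az1, one in az2) rather than on the two least loaded clusters of the same AZ, which is exactly the failure mode @mfanjie raised.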
Quinton, thanks for your explanation. That makes sense to me.
Our k8s cluster is not built on bare metal but on VMs; we may deploy 3 or more VMs on a single hypervisor as k8s minions, so there will be up to 6000+ minions in a single AZ.
I agree it is the right approach to build multiple clusters in a single AZ. In addition, from my perspective, federation should have the ability to split the sub-RCs and replicas across different AZs, since when someone defines an Ubernetes application, as an end user he may have no information about the underlying infrastructure, such as AZs, and he may not care which AZ to deploy to; high availability is the only need. So the Ubernetes scheduler should consider this and make the right decision.
Looking forward to your detailed design doc, thank you!
Thanks for cc-ing me @quinton-hoole - the above explanation SGTM
@mfanjie Agreed - user should only specify HA as a high level requirement, not be required to specify which zones to deploy into.
One other thing - why do you run multiple smaller VMs on a single hypervisor/host, rather than a single large VM on that host? We generally favor the latter as the most scalable way to achieve clusters with large amounts of computing resources, i.e. for large clusters, it's better to run smaller numbers of larger nodes than larger numbers of smaller nodes.
@davidopp Can haz LGTM? I think that all comments are adequately addressed. If any more feedback comes in, I think that it can be dealt with in follow-up PRs.
(consul, zookeeper) will probably be developed and supported over time.

## Ubernetes Scheduler
This section could be made significantly clearer by saying something like "Whereas the Kubernetes scheduler schedules pending pods to nodes, the Ubernetes scheduler creates per-cluster API objects corresponding to Ubernetes control plane API objects."
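The reviewer's framing above (federation-level objects fan out into per-cluster API objects, rather than pods being bound to nodes) can be sketched roughly like this; the function and field names are illustrative, not the actual federation API:

```python
def to_cluster_objects(federated_obj: dict, clusters: list) -> list:
    """Derive one per-cluster API object per target cluster from a
    single Ubernetes control-plane object. Each derived object carries
    a back-reference to the federation-level object that owns it."""
    return [
        {**federated_obj, "clusterName": c, "ownerRef": federated_obj["name"]}
        for c in clusters
    ]
```

So where the Kubernetes scheduler's output is a pod-to-node binding, the Ubernetes scheduler's output is a set of ordinary Kubernetes API objects, one per chosen underlying cluster.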
@k8s-bot test this Tests are more than 48 hours old. Re-running tests.
LGTM I'd like to go back and re-read more carefully some time soon, but this is certainly fine for now.
GCE e2e build/test passed for commit 227710d.
Automatic merge from submit-queue
Auto commit by PR queue bot
@quinton-hoole
Does this mean that the Service's ClusterIP is global across the clusters in an Ubernetes federation?
No. See docs/design/federated-services.md
Automatic merge from submit-queue duplicate kube-apiserver to federated-apiserver duplicate the kube-apiserver source code to ube-apiserver and update references cluster specific api objects will be in separate PRs #19313, #21190 #21190 (comment) @nikhiljindal
@quinton-hoole
Are we expecting the first cut of this by Jul 2016, or will it go beyond that?
Automatic merge from submit-queue Federation apiobject cluster add federation api group add cluster api object and registry ~~generate cluster client~~ moved to #24117 update scripts to generate files for /federation #19313 #23653 #23554 @nikhiljindal @quinton-hoole, @deepak-vij, @XiaoningDing, @alfred-huangjian @mfanjie @huangyuqi @colhom
Automatic merge from submit-queue Move install of version handler to genericapiserver This is to satisfy kubectl verification Please review only the last commit. #19313 #23653 @nikhiljindal @quinton-hoole, @deepak-vij, @XiaoningDing, @alfred-huangjian @mfanjie @huangyuqi @colhom
…nsions-replicaset Automatic merge from submit-queue Add extensions/replicaset to federation-apiserver Add extensions/replicaset for federated scheduler (#24038) as all k8s api objects were removed in #23959 Please review only the very last one commit. #19313 #23653 @nikhiljindal @quinton-hoole, @deepak-vij, @XiaoningDing, @alfred-huangjian @mfanjie @huangyuqi @colhom
…-design-docs Auto commit by PR queue bot
A first round of comments and feedback was solicited from a subset of the Federation SIG via Google Docs. Moving this here now for wider review.
Still needs some improved aesthetics (weave the diagrams in, and improve some formatting) but that can be done while addressing other review feedback.
@justinsb @mikedanese @davidopp @lavalamp @thockin @bprashanth