
RFC design docs for Cluster Federation/Ubernetes. #19313

Merged · 1 commit merged into kubernetes:master on Mar 15, 2016

Conversation

@ghost commented Jan 6, 2016

A first round of comments and feedback was solicited from a subset of the Federation SIG via Google Docs. Moving this here now for wider review.

Still needs some improved aesthetics (weave the diagrams in, and improve some formatting) but that can be done while addressing other review feedback.

@justinsb @mikedanese @davidopp @lavalamp @thockin @bprashanth

@ghost added the priority/backlog and team/control-plane labels Jan 6, 2016
@k8s-github-robot:

Labelling this PR as size/XL

@k8s-github-robot added the size/XL label Jan 6, 2016
@mikedanese mikedanese assigned davidopp and unassigned thockin Jan 6, 2016
@k8s-bot commented Jan 6, 2016

GCE e2e test build/test passed for commit a50d469ddd4b9f81ad0e487778c2643423da71ea.

@ghost (Author) commented Jan 6, 2016

Wove the diagrams in, and fixed the worst of the formatting errors.

@k8s-bot commented Jan 6, 2016

GCE e2e build/test failed for commit fc055d6a93de7831405d37c2831cea93e7742f66.

@k8s-github-robot added the size/XXL label and removed the size/XL label Jan 7, 2016
@k8s-github-robot:

Labelling this PR as size/XXL

@k8s-bot commented Jan 7, 2016

GCE e2e test build/test passed for commit 9faf858fc4d3a04d258a4192d1334e6750677a89.

@k8s-bot commented Jan 8, 2016

GCE e2e test build/test passed for commit 8223c8cc483a6c030bb88178ad8d67cbcab13aee.

1. **Internal discovery and connection**: Pods/containers (running in a Kubernetes cluster) must be able to easily discover and connect to endpoints for Kubernetes services on which they depend, in a consistent way, irrespective of whether those services exist in a different Kubernetes cluster within the same cluster federation. Henceforth referred to as "cluster-internal clients", or simply "internal clients".
1. **External discovery and connection**: External clients (running outside a Kubernetes cluster) must be able to discover and connect to endpoints for Kubernetes services on which they depend.
1. **External clients predominantly speak HTTP(S)**: External clients are most often, but not always, web browsers, or at least speak HTTP(S). (TBD: list notable exceptions. DNS servers? SIP servers? Database servers?)
1. **Find the "best" endpoint:** Upon initial discovery and connection, both internal and external clients should ideally find "the best" endpoint if multiple eligible endpoints exist. "Best" in this context implies the closest (by network topology) endpoint that is both operational (as defined by some positive health check) and not overloaded (by some published load metric). For example:
@lavalamp (Member) commented:

Is this an MVP feature or just nice-to-have?

@ghost (Author) replied:

@lavalamp By "this" are you referring to "not overloaded" specifically, or "best endpoint" generally? The former we can definitely live without, although in practice it is extremely useful in reducing human operational toil and making capacity planning way, way simpler, so I think we'd definitely want to add it over time. The latter ("best endpoint") we can also technically do without, although it has pretty major latency (and hence performance) and reliability implications, so it would be a higher priority than "not overloaded".
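
Since the thread keeps returning to what "best" means, here is a minimal sketch of that preference order (healthy first, then not overloaded, then closest by network topology). The types and field names are hypothetical, purely editorial, and not from the design doc:

```go
package main

import (
	"fmt"
	"sort"
)

// Endpoint is a hypothetical view of one service endpoint as seen by a
// federated service-discovery layer. The fields are illustrative only.
type Endpoint struct {
	Addr       string
	Healthy    bool    // passes its health check
	Overloaded bool    // above its published load threshold
	DistanceMS float64 // network distance from the client, e.g. measured RTT
}

// bestEndpoint applies the preference order discussed above: healthy
// endpoints before failed ones, non-overloaded before overloaded, and
// closest (by network topology) first within each group.
func bestEndpoint(eps []Endpoint) (Endpoint, bool) {
	if len(eps) == 0 {
		return Endpoint{}, false
	}
	sorted := append([]Endpoint(nil), eps...) // don't reorder the caller's slice
	sort.Slice(sorted, func(i, j int) bool {
		a, b := sorted[i], sorted[j]
		if a.Healthy != b.Healthy {
			return a.Healthy
		}
		if a.Overloaded != b.Overloaded {
			return !a.Overloaded
		}
		return a.DistanceMS < b.DistanceMS
	})
	return sorted[0], sorted[0].Healthy
}

func main() {
	eps := []Endpoint{
		{Addr: "10.0.1.5:80", Healthy: true, Overloaded: true, DistanceMS: 2},
		{Addr: "10.0.2.7:80", Healthy: true, Overloaded: false, DistanceMS: 35},
		{Addr: "10.0.3.9:80", Healthy: false, Overloaded: false, DistanceMS: 1},
	}
	best, ok := bestEndpoint(eps)
	fmt.Println(best.Addr, ok) // 10.0.2.7:80 true
}
```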

@lavalamp (Member) commented Jan 8, 2016

Sorry, I made some comments on an outdated revision because I didn't hit refresh, but I think they are all still valid comments.

Overall, I like this a lot more than I expected to!

above 3 are the primary canonical use cases.
1. **Resilience to some correlated failures:** Kubernetes clusters
which span multiple availability zones in a region should by
default be resilient to complete failure of one entire availability
@davidopp (Member) commented:

Why "one" instead of N-1? (I concede that even if you meant N-1, "one" is still a true statement; but if you meant N-1, it's better to say N-1.) For example, if I have a cluster that spans 3 zones, why should it behave any differently if 2 zones fail than if 1 zone fails?

@ghost (Author) replied:

@davidopp Because of quorums. A 3-zone cluster where 2 zones have failed does not have a majority quorum remaining, while with only 1 failed zone it does. Note that "N-1" usually refers to the number of zones that remain (i.e. if we have N zones and one fails, we have N-1 zones remaining). If what you're actually asking is whether we want to be resilient to two simultaneous availability zone failures (i.e. "N-2"), the answer is that we could be, provided that N>=5 (i.e. we can always be resilient to ceil(N/2)-1 failures).
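
A quick illustration of that arithmetic (an editorial sketch, not part of the doc), assuming a simple majority quorum spread evenly across zones:

```go
package main

import "fmt"

// tolerableZoneFailures returns how many complete zone failures a
// majority quorum spread evenly over n zones can survive:
// ceil(n/2) - 1, which is the same as floor((n-1)/2).
func tolerableZoneFailures(n int) int {
	if n <= 0 {
		return 0
	}
	return (n - 1) / 2
}

func main() {
	for _, n := range []int{3, 4, 5, 7} {
		fmt.Printf("%d zones -> survives %d complete zone failure(s)\n", n, tolerableZoneFailures(n))
	}
	// 3 -> 1, 4 -> 1, 5 -> 2, 7 -> 3: a 3-zone cluster keeps quorum with
	// one zone down, but not with two.
}
```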

1. If an RC specifies _N_ acceptable clusters in the
clusterSelector, all replicas will be evenly distributed among
these clusters.
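
As a small worked example of the "evenly distributed" rule above (an editorial sketch, not code from the proposal): splitting R replicas across the N selected clusters gives each cluster floor(R/N) replicas, with the remainder handed out one apiece.

```go
package main

import "fmt"

// spreadEvenly splits `replicas` desired replicas across `clusters`
// selected clusters as evenly as possible; the first replicas%clusters
// clusters receive one extra replica. Purely illustrative.
func spreadEvenly(replicas, clusters int) []int {
	counts := make([]int, clusters)
	for i := range counts {
		counts[i] = replicas / clusters
		if i < replicas%clusters {
			counts[i]++
		}
	}
	return counts
}

func main() {
	fmt.Println(spreadEvenly(10, 3)) // [4 3 3]
}
```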

@mfanjie commented:

@quinton-hoole Just before the merge, I want to raise one query.
When we talk about Ubernetes, it means multi-cluster federation. Ideally we want a single cluster to span all the available machines in a single AZ. But in our data center we have over 2000 hypervisors in a single AZ, so considering the scalability of current Kubernetes, we would need to build multiple clusters within a single AZ.
Following the scheduling algorithm described above, there will be circumstances where, if two replicas are defined for a single pod and the user does not specify any cluster, the two replicas are placed on the two least-loaded clusters. If those two least-loaded clusters are in a single AZ, then even though the pod's replicas were split across multiple clusters, they still sit in a single AZ, so there is no fault tolerance at the AZ level.
I raised this comment to discuss two possible approaches:

  1. Will Ubernetes provide a best practice for the deployment model, such as one cluster per AZ, so that we can treat a Kubernetes cluster and an AZ as equivalent? If we choose this approach, how can we support large scale within a single AZ?
  2. If we suggest multiple clusters within a single AZ, then the Ubernetes scheduler might need to consider splitting the replicas across different AZs, not only across different clusters.

@kevin-wangzefeng @alfred-huangjian

@ghost (Author) replied:

@mfanjie That's a great question, and some of us have recently been giving exactly that some additional thought. Here is a summary of my opinion on the matter (and @davidopp @wojtek-t purely FYI regarding scheduling and scalability respectively):

  1. Our scaling goals for Kubernetes v1.3 (targeted around 07/2016) are 2000-5000 nodes per cluster, and Google will be investing considerable engineering effort in getting there in the next few months. Supporting that scale is explicitly within the top three priority items for v1.3 on our product roadmap. So your scaling goal of 2,000 nodes in a single cluster and AZ should be achievable without federation by mid 2016 anyway.
  2. Notwithstanding the above, one of the valid use cases of Cluster Federation/Ubernetes is to accommodate very large availability zones (tens of thousands of nodes), not because of any particular scaling limitations of Kubernetes, but rather because, based on our experience at Google, that's the right way to build large, highly available applications. Concretely, a cluster upgrade gone wrong, or a cluster control plane malfunction, when the cluster size is enormous, can be very, very bad. The solution that we support to address this challenge is building, managing and federating multiple clusters. So we need to address your questions anyway.
  3. Phase 1 of this project, as outlined in this design doc, is intended to be an absolutely minimum viable product. So please don't see this as the long-term, complete plan for cluster federation. It is explicitly a first implementation step.
  4. There are a few possible solutions to the specific limitation that you mention, at least one of which we plan to implement soon after phase 1. One solution (as you suggest) would be for the cross-cluster scheduler to explicitly spread across zones as well as clusters (in the same way that for multi-zone "ubernetes lite" clusters, our scheduler spreads across nodes as well as zones, making sensible choices between putting too many replicas on one node vs putting too many replicas in one zone, when forced to choose between the two). I think that this is fairly easy, and a good solution. Another solution (which is already supported in the above design) would be to specify an appropriate NodeSelector to restrict placement to clusters in non-overlapping zones (although this has the downside of artificially introducing hard placement constraints, which has the potential to strand resources).

I will soon be putting together a detailed design document for review, which will cover a variety of pod placement and movement/rescheduling scenarios and designs, where we can figure out all the details and the best practices that you mention. I will be sure to keep you in the loop there.
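
To make the zone-and-cluster spreading mentioned in item 4 above concrete, here is a rough, hypothetical sketch (none of these types or names come from the design): prefer the candidate cluster whose zone currently holds the fewest replicas, breaking ties by the least-populated cluster.

```go
package main

import "fmt"

// candidate is a hypothetical view of a cluster that the cross-cluster
// scheduler could place the next replica into.
type candidate struct {
	Cluster  string
	Zone     string
	Replicas int // replicas of this workload already placed in the cluster
}

// pickNext prefers the cluster whose zone currently holds the fewest
// replicas of the workload, then the least-loaded cluster within that
// zone; that is, spread across zones first and across clusters second.
// cands must be non-empty.
func pickNext(cands []candidate) candidate {
	perZone := map[string]int{}
	for _, c := range cands {
		perZone[c.Zone] += c.Replicas
	}
	best := cands[0]
	for _, c := range cands[1:] {
		zc, zb := perZone[c.Zone], perZone[best.Zone]
		if zc < zb || (zc == zb && c.Replicas < best.Replicas) {
			best = c
		}
	}
	return best
}

func main() {
	cands := []candidate{
		{Cluster: "az1-cluster-1", Zone: "az1", Replicas: 1},
		{Cluster: "az1-cluster-2", Zone: "az1", Replicas: 0},
		{Cluster: "az2-cluster-1", Zone: "az2", Replicas: 0},
	}
	// An empty zone beats an empty cluster in an already-used zone.
	fmt.Println(pickNext(cands).Cluster) // az2-cluster-1
}
```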

@mfanjie replied:

Quinton, thanks for your explanation. That makes sense to me.
Our k8s clusters are not built on bare metal but on VMs; we may deploy 3 or more VMs on a single hypervisor as k8s minions, so there will be up to 6000+ minions in a single AZ.
I agree it is the right approach to build multiple clusters within a single AZ. In addition, from my perspective, federation should be able to split the sub-RCs and replicas across different AZs: when someone defines an Ubernetes application, as an end user they may have no information about the underlying infrastructure, such as AZs, and they may not care which AZ it deploys to; high availability is the only need. So the Ubernetes scheduler should take this into account and make the right decision.
Looking forward to your detailed design doc, thank you!

A Member commented:

Thanks for cc-ing me @quinton-hoole - the above explanation SGTM

@ghost (Author) replied:

@mfanjie Agreed - a user should only specify HA as a high-level requirement, not be required to specify which zones to deploy into.

One other thing - why do you run multiple smaller VMs on a single hypervisor/host, rather than a single large VM on that host? We generally favor the latter as the most scalable way to achieve clusters with large amounts of computing resources, i.e. for large clusters it's better to run smaller numbers of larger nodes than larger numbers of smaller nodes.

@ghost (Author) commented Mar 11, 2016

@davidopp Can haz LGTM? I think that all comments are adequately addressed. If any more feedback comes in, I think it can be dealt with in follow-up PRs.

(consul, zookeeper) will probably be developed and supported over
time.

## Ubernetes Scheduler
A Member commented:

This section could be made significantly clearer by saying something like "Whereas the Kubernetes scheduler schedules pending pods to nodes, the Ubernetes scheduler creates per-cluster API objects corresponding to Ubernetes control plane API objects."
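
Restating that suggestion as a hedged sketch (hypothetical types and names; the real control plane is considerably more involved): the federation-level scheduler's output is not a pod-to-node binding but a set of per-cluster API objects derived from the federation-level object.

```go
package main

import "fmt"

// FederatedRC is a hypothetical federation-level replication controller:
// a name, a total replica count, and the clusters it may run in.
type FederatedRC struct {
	Name     string
	Replicas int
	Clusters []string
}

// ClusterRC is the per-cluster object the federation scheduler would
// create in each underlying Kubernetes cluster's API server.
type ClusterRC struct {
	Cluster  string
	Name     string
	Replicas int
}

// schedule turns one federation-level object into per-cluster objects,
// spreading replicas evenly. Unlike the Kubernetes scheduler, which binds
// pending pods to nodes, this layer only emits API objects; each cluster's
// own control plane then schedules the resulting pods onto its nodes.
func schedule(frc FederatedRC) []ClusterRC {
	n := len(frc.Clusters)
	out := make([]ClusterRC, 0, n)
	for i, cluster := range frc.Clusters {
		count := frc.Replicas / n
		if i < frc.Replicas%n {
			count++
		}
		out = append(out, ClusterRC{Cluster: cluster, Name: frc.Name, Replicas: count})
	}
	return out
}

func main() {
	frc := FederatedRC{Name: "web", Replicas: 7, Clusters: []string{"cluster-a", "cluster-b", "cluster-c"}}
	for _, rc := range schedule(frc) {
		fmt.Printf("create RC %q with %d replicas in %s\n", rc.Name, rc.Replicas, rc.Cluster)
	}
}
```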

@davidopp added the e2e-not-required and lgtm labels and removed the priority/important-soon label Mar 15, 2016
@k8s-github-robot:

@k8s-bot test this

Tests are more than 48 hours old. Re-running tests.

@davidopp (Member) commented:

LGTM

I'd like to go back and re-read more carefully some time soon, but this is certainly fine for now.

@k8s-bot commented Mar 15, 2016

GCE e2e build/test passed for commit 227710d.

@k8s-github-robot:

Automatic merge from submit-queue

k8s-github-robot pushed a commit that referenced this pull request Mar 15, 2016
@k8s-github-robot k8s-github-robot merged commit 536a30f into kubernetes:master Mar 15, 2016
@wulonghui (Contributor) commented:

@quinton-hoole
This is a great thing. But I have a question:

Pods are able to discover and connect to services hosted in other clusters (in cases where inter-cluster networking is necessary, desirable and implemented).

Does this mean that the Service's ClusterIP is global across the clusters in an Ubernetes federation?
If so, should kube-proxy listen to all the clusters?

@ghost (Author) commented Mar 31, 2016

No. See docs/design/federated-services.md

k8s-github-robot pushed a commit that referenced this pull request Apr 2, 2016
Automatic merge from submit-queue

duplicate kube-apiserver to federated-apiserver

duplicate the kube-apiserver source code to ube-apiserver and update references
cluster specific api objects will be in separate PRs

#19313,  #21190
#21190 (comment)
@nikhiljindal
@dalanlan (Contributor) commented:

@quinton-hoole
This is really impressive, yet I cannot really find the exact cloudbursting (a.k.a. cross-cluster) idea as specified in the federation design doc. Is it deprecated, or did I miss anything?

@yogeshmsharma commented:

Are we expecting the first cut of this by July 2016, or will it go beyond that?

k8s-github-robot pushed a commit that referenced this pull request Apr 27, 2016
Automatic merge from submit-queue

Federation apiobject cluster

add federation api group
add cluster api object and registry
~~generate cluster client~~ moved to #24117
update scripts to generate files for /federation

#19313 #23653 #23554
@nikhiljindal  @quinton-hoole, @deepak-vij, @XiaoningDing, @alfred-huangjian @mfanjie @huangyuqi @colhom
k8s-github-robot pushed a commit that referenced this pull request Apr 27, 2016
Automatic merge from submit-queue

Move install of version handler to genericapiserver

This is to satisfy kubectl verification

Please review only the last commit.

#19313 #23653
@nikhiljindal @quinton-hoole, @deepak-vij, @XiaoningDing, @alfred-huangjian @mfanjie @huangyuqi @colhom
k8s-github-robot pushed a commit that referenced this pull request Jul 19, 2016
…nsions-replicaset

Automatic merge from submit-queue

Add extensions/replicaset to federation-apiserver

Add extensions/replicaset for federated scheduler (#24038) as all k8s api objects were removed in #23959

Please review only the very last commit.

#19313 #23653 
@nikhiljindal @quinton-hoole, @deepak-vij, @XiaoningDing, @alfred-huangjian @mfanjie @huangyuqi @colhom
xingzhou pushed a commit to xingzhou/kubernetes that referenced this pull request Dec 15, 2016
…-design-docs

Auto commit by PR queue bot