scheduler and storage provision (PV controller) coordination #43504

Closed
jingxu97 opened this issue Mar 22, 2017 · 52 comments
Labels: kind/feature, priority/important-soon, sig/scheduling, sig/storage

Comments

@jingxu97 (Contributor) commented Mar 22, 2017

There are several ongoing discussion threads related to this topic, so I am opening this issue to summarize the relevant discussions and, hopefully, reach a conclusion.

Currently, pod scheduling and the PV controller (PVC/PV binding and dynamic provisioning) are implemented as completely separate controllers. This separation has benefits, such as keeping the scheduler pluggable, better isolation, and better performance, since PV/PVC binding can be performed asynchronously before pod scheduling. But the two controllers make decisions independently, without considering each other's choices, so their node/zone selections might conflict.

The goal is to largely keep this separate-controller design while making the modifications necessary to overcome the problem of conflicting decisions. There are four scenarios to consider.

  1. Single zone, network-attached storage
    In this case, a PV/PVC is not tied to any node selection, and the binding can be performed independently. No change is needed to the current code.

  2. Single zone, local storage
    In this case, a PV is tied to a specific node. Once a PVC is bound to that PV, it is effectively bound to the node, and that decision might not be compatible with the pod scheduler's decision. Normally the pod scheduler ranks nodes based on predefined policies and picks the node with the highest score, while the PV controller independently searches for the best match when binding PVC to PV.

  3. Multiple zones, network-attached storage
    The pod scheduler gives each zone a score that is used to rank nodes, but the PV controller does not consider zone information when choosing a PV for a PVC; it chooses the same way in every case. Only for dynamic provisioning can the user specify zone information, in which case the PV is created in that zone and the pod scheduler has a predicate that picks nodes only from the zone where the PV is. For StatefulSets, there is a hacky way to make sure volumes are spread across zones at creation time, by using the PVC name as an indicator of the StatefulSet. In this situation it is possible that the PV controller picks a zone in which no node has enough resources (CPU/memory) for the pod. See more discussion at Fix StatefulSet volume provisioning "magic" #41598.

  4. Multiple zones, local storage
    The PV controller could find PV candidates in different zones and on different nodes. Similar to case 2, the PV controller might pick a zone and node that do not have enough resources for the pod.

To solve these problems, I think it would be good for the storage, scheduling, and workload teams to get together and agree on the outcome we want to deliver.

Proposal:
[@vishh] Move the binding selection and decision from the PV controller to the scheduler. The PV controller will still be around to take care of dangling claims and/or volumes, roll back incomplete transactions (necessary when a pod requests multiple local PVs), reclaim PVs, etc.

@jingxu97 (Contributor, Author):

@kubernetes/sig-storage-misc
@kubernetes/sig-scheduling-misc
@kubernetes/sig-apps-misc

@jingxu97 added this to the v1.7 milestone Mar 22, 2017
@davidopp (Member):

ref/ kubernetes/community#306 (for 2 and 4)

@jsafrane (Member):

As the author of the PV controller, I admit it's quite stupid when it comes to matching PVs to PVCs: it ignores pod requirements entirely. It's a very simple process; however, it's complicated by our database not allowing us transactions. It would save us a lot of pain (and code!) if we could update two objects atomically in a single write, and such an operation could easily be done in the pod scheduler.

The PV controller would remain to coordinate provisioning, deletion, and such.

[I know etcd allows transactions, however we intentionally don't use this feature].

@thockin (Member) commented Mar 22, 2017

This started in email, so I'll bring in some of my notes from there.

In reality, the only difference between a zone and a node is cardinality. A node is just a tiny little zone with one choice. If we fold PV binding into scheduling, we get a more holistic sense of resources, which would be good. What I don't want is the (somewhat tricky) PV binding being done in multiple places.

Another consideration: provisioning. If a PVC is pending and provisioning is invoked, we really should decide the zone first and tell the provisioner what zone we want. But so far that's optional (and opaque). As long as provisioners provision in whatever zone they feel like, we STILL have split-brain. For net-attached storage we get away with it because cardinality is usually > 1, whereas for local storage it is not.

I think the right answer might be to join these scheduling decisions and to be more prescriptive with topology wrt provisioning.

@kow3ns (Member) commented Mar 22, 2017

/ref #41598

@0xmichalis (Contributor):

[I know etcd allows transactions, however we intentionally don't use this feature].

This is the issue about transactions: #27548

@msau42 (Member) commented Apr 26, 2017

Here is my rough idea for making the scheduler storage-topology aware. It is similar to the topologyKey for pod affinity/anti-affinity, except you specify it in the StorageClass instead of the pod. The sequence could look something like this:

  1. PVC-PV binding has to be delayed until there is a pod associated with the PVC.
  2. The scheduler has to look into the StorageClass of the PVC and pull out the topologyKey.
  3. It filters the existing available PVs based on the topologyKey value, which also has to match on the node it's evaluating. The predicate returns true if there are enough available PVs.
  4. If no pre-existing PVs are available, ask the provisioner if it can provision in the topologyKey value of the node. The predicate returns true if the provisioner says it can.
  5. The scheduler picks a node for the pod out of the remaining choices based on some ranking.
  6. The kubelet waits until the PVCs are bound.
  7. Once the pod is assigned to a node, do the PVC-PV binding/provisioning.
  8. If the binding fails, the kubelet has to reject the pod.
  9. The scheduler retries with a different node. Go back to 6).
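For reference, the delayed binding described in step 1 is essentially what later shipped as the volumeBindingMode field on StorageClass (beta in 1.12). A minimal sketch, with an illustrative class name and provisioner:

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: topology-aware-ssd             # illustrative name
provisioner: kubernetes.io/gce-pd      # illustrative provisioner
volumeBindingMode: WaitForFirstConsumer  # delay PVC binding until a consuming pod is scheduled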

@vishh (Contributor) commented Apr 26, 2017

I have an alternate idea for dealing with topology. I think we can use labels to express topology. Here is the algorithm that we discussed:

  1. PVs expose a single topology label based on the storage they represent:

kind: PersistentVolume
metadata:
  labels:
    topology.kubernetes.io/node: foo
spec:
  localStorage: ...

or

kind: PersistentVolume
metadata:
  labels:
    topology.kubernetes.io/zone: bar
spec:
  gcePersistentDisk: ...
  2. PVCs can already select using this topology label. StorageClasses should include a label selector that can specify topology constraints:

kind: StorageClass
metadata:
  name: local-fast-storage
spec:
  topologySelector:
    - key: "topology.kubernetes.io/node"

or

kind: StorageClass
metadata:
  name: durable-slow
spec:
  topologySelector:
    - key: "topology.kubernetes.io/zone"
      operator: In
      values:
      - bar-zone
  3. Nodes will expose all aspects of topology via consistent label keys:

kind: Node
metadata:
  labels:
    topology.kubernetes.io/node: foo
    topology.kubernetes.io/zone: bar
  4. The scheduler can then combine the NodeSelector on the pod with the selector on the StorageClass when identifying nodes that can meet the storage locality requirements.

This method would require using consistent label keys across nodes and PVs. I hope that's not a non-starter.

@vishh (Contributor) commented Apr 26, 2017

The kubelet already exposes failure-domain labels that indicate the zone and region. Here is an example from GKE:

Labels:
    failure-domain.beta.kubernetes.io/region=us-central1
    failure-domain.beta.kubernetes.io/zone=us-central1-b
    kubernetes.io/hostname=gke-ssd-default-pool-ef225ddf-xfrk

We can consider re-purposing the existing labels too.

@jsafrane (Member):

@vishh, matching existing PVs is IMO not the issue here; the problem is dynamic provisioning. You must know, before provisioning, in which zone / region / host / arbitrary topology domain you want to provision the volume. And @msau42 proposes that this decision be made during pod scheduling.

@msau42, technically this could work; however, it will break external provisioners. You can't ask them whether it's possible to provision a volume for a specific node in order to filter the nodes; you can only ask them to provision a volume, and they either succeed or fail.

@msau42 (Member) commented Apr 27, 2017

Yes, the sequence I am suggesting will require changes to the current provisioning protocol to support this additional request.

@vishh (Contributor) commented Apr 27, 2017

You must know, before provisioning, in which zone / region / host / arbitrary topology domain you want to provision the volume

I was assuming that provisioning will be triggered by the scheduler in the future, at which point the zone/region/rack or the specific node a pod will land on will be known prior to provisioning.

@msau42 (Member) commented Apr 27, 2017

I also think that the filtering is more of an optimization and can be optional. There are only a handful of zones, so we could try to provision in one zone and, if that fails, try another zone until it succeeds.

But for the node case, being able to pre-filter available nodes will be important. It's not a scalable solution if we have to retry hundreds of nodes until we find one that succeeds.

@vishh (Contributor) commented Apr 27, 2017

I also think that the filtering is more of an optimization and can be optional.

Storage should reside where pods are. If pods have a specific spreading constraint, then storage allocation ideally has to meet that constraint. The scenario you described is OK for pods that do not have any specific spreading constraints.

It's not a scalable solution if we have to retry hundreds of nodes until we find one that succeeds.

Define success? For local PVs it's only a matter of applying label filters and performing capacity checks, right?

@msau42 (Member) commented Apr 27, 2017

I'm referring to a dynamic provisioning scenario, where the scheduler decides which node the pod should be on and then triggers provisioning on that node. But the scheduler should know some information beforehand about whether a node has enough provisionable capacity, so that it can pre-filter nodes.

@smarterclayton (Contributor) commented Apr 28, 2017 via email

@vishh (Contributor) commented Apr 29, 2017 via email

@davidopp (Member):

For remote PVs, if dynamic provisioning fails in the rack/zone/region, then
the pod cannot be scheduled. The scheduler should not be changing its pod
spreading policy based on storage availability.

I'm not sure I understand this. It's important to distinguish between predicates (hard constraints) and priority functions (soft constraints/preferences). The scheduler's spreading policy (assuming you're not talking about explicit requiredDuringScheduling anti-affinity) is in the latter category. So if there is a node where the pod can fit (and where storage is available or can be provisioned), it should always schedule, even if it "violates" the pod spreading policy.

BTW I like the idea of the StorageClass exposing status that indicates the number and shape of PVs it can allocate, so that the scheduler can use this information, plus its knowledge of the available PVs that have already been created, when making the assignment decision. I agree we probably need an alternative for "legacy provisioners" that don't expose this StorageClass status.

@vishh (Contributor) commented May 1, 2017

So if there is a node where the pod can fit (and where storage is available or can be provisioned), it should always schedule, even if it "violates" the pod spreading policy.

It is possible for storage constraints to violate pod scheduling constraints. What if a StatefulSet wants to use a StorageClass that is accessible only from a single zone, but the pods using that StorageClass are expected to be spread across zones? I feel this is an invalid configuration and scheduling should fail. If the scheduler were to (incorrectly) notice that storage is available in only one zone and then place all pods in the same zone, that would violate user expectations.

To be clear, local PVs can have a predicate. Local PVs are statically provisioned and, from a scheduling standpoint, are similar to "cpu" or "memory".
It is dynamic provisioning that will require an additional scheduling step, one that runs after a sorted list of nodes is available for each pod in the scheduler.
@davidopp thoughts?

@davidopp (Member) commented May 1, 2017

There are two kinds of spreading: constraint/hard requirement (called requiredDuringScheduling) and preference/soft requirement (called preferredDuringScheduling). I was just saying that it's OK to violate the second kind due to storage.

I think it's hard to pin down exactly what "user expectations" are for priority functions. We have a weighting scheme but there are so many factors that unless you manually adjust the weights yourself, you can't really have strong expectations.

@vishh (Contributor) commented May 1, 2017

I was just saying that it's OK to violate the second kind due to storage.

Got it. If storage availability can be exposed in a portable manner across deployments, then it can definitely be a "soft constraint" as you mentioned. The easiest path forward for now is to perform dynamic provisioning of storage once a list of nodes is available.

I think it's hard to pin down exactly what "user expectations" are for priority functions.

Got it. I was referring to "hard constraints" specifically.

@spiffxp (Member) commented May 31, 2017

@kubernetes/sig-storage-misc @kubernetes/sig-scheduling-misc @kubernetes/sig-apps-misc do you want this in for v1.7? Which is the correct SIG to own this?

@k8s-ci-robot added the sig/storage, sig/scheduling, and sig/apps labels May 31, 2017
@msau42 (Member) commented May 31, 2017

We're targeting 1.8

@0xmichalis removed the sig/apps label Jun 2, 2017
@deitch (Contributor) commented Nov 7, 2017

We're targeting 1.8

Did any of it make it into 1.8? I don't see PRs linked here, but I might have missed them.

@msau42 (Member) commented Nov 7, 2017

@deitch No, unfortunately not. Scheduler improvements for static PV binding are targeted for 1.9; dynamic provisioning will come after. The general feature tracker is at kubernetes/enhancements#490.

@deitch (Contributor) commented Nov 7, 2017

@msau42 looks like we are talking on 2 separate issues about the same... "issue"? :-)

So 1.9 for static, 1.10+ or 2.0+ for dynamic?

@msau42 (Member) commented Nov 7, 2017

Yes

@wu105 commented Feb 3, 2018

Did we cover the use case where two (or more) pods use the same PV? If the underlying infrastructure does not allow the PV to attach to more than one node, the second pod should be scheduled on the same node as the first pod.

@msau42 (Member) commented Feb 3, 2018

Pods that use local PVs will be scheduled to the same node, but that won't work for zonal PVs: pods that use zonal PVs will only be scheduled to the same zone.

@wu105 commented Feb 4, 2018

@msau42 that would require the user to specify the node name for the pod, which is not desirable: Kubernetes already has the information needed to schedule the second pod to the correct node, yet the user is burdened with selecting nodes.

On a different topic (I hope it is not off topic for this thread): node and PV zones with the OpenStack cloud provider. When the cloud provider is OpenStack, the node and PV zones seem to be copied from Nova and Cinder respectively. The node zones are network security zones, while PVs get a single zone from Cinder that serves multiple network zones, which is not suitable for Kubernetes scheduling. The OpenStack Nova and Cinder zones just do not seem to support Kubernetes scheduling. It would be more helpful if the Kubernetes admin could easily configure the node zones and the PV zones on OpenStack. PVs come and go, so it may help to add a PV zone override to the PV claim.

@msau42 (Member) commented Feb 4, 2018

@wu105 are you referring to local or zonal PVs? The design goal of local PVs is that the pod does not need to specify any node name; it's all contained in the PV information.
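For illustration, a local PV carries its node constraint inside the PV object itself, so the pod needs no node name. A minimal sketch, with an illustrative node name and disk path:

apiVersion: v1
kind: PersistentVolume
metadata:
  name: example-local-pv               # illustrative
spec:
  capacity:
    storage: 100Gi
  accessModes:
  - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  storageClassName: local-storage
  local:
    path: /mnt/disks/ssd1              # illustrative disk path
  nodeAffinity:                        # pins the PV (and any pod using it) to one node
    required:
      nodeSelectorTerms:
      - matchExpressions:
        - key: kubernetes.io/hostname
          operator: In
          values:
          - node-1                     # illustrative node name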

The problem with node enforcement for zonal PVs is that you also need to take access mode into account. A multi-writer zonal PV does not have a node-attachment restriction the way ReadWriteOnce PVs do. I think the best way to solve the node problem for zonal PVs is to do access mode enforcement, instead of trying to conflate it with PV node affinity.

I'm not sure I completely understand your issue with OpenStack zone labelling. At least for GCE and AWS volumes, I know we have admission controllers that already label the PV with the correct zone information. I imagine you can do the same for OpenStack.
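For reference, a sketch of what such an admission controller produces for a GCE PD; the volume name, disk, and zone are illustrative:

apiVersion: v1
kind: PersistentVolume
metadata:
  name: pv-gce-example                 # illustrative
  labels:
    # added automatically by the PersistentVolumeLabel admission controller
    failure-domain.beta.kubernetes.io/zone: us-central1-b
    failure-domain.beta.kubernetes.io/region: us-central1
spec:
  capacity:
    storage: 200Gi
  accessModes:
  - ReadWriteOnce
  gcePersistentDisk:
    pdName: my-data-disk               # illustrative
    fsType: ext4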

@msau42 (Member) commented Feb 5, 2018

I just realized that, as a workaround, you could use pod affinity to get two pods sharing the same zonal single-attach PVC scheduled on the same node.
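A sketch of that workaround, assuming both pods carry an illustrative app: shared-pvc-app label and share a claim named shared-claim:

apiVersion: v1
kind: Pod
metadata:
  name: second-pod                     # illustrative
  labels:
    app: shared-pvc-app                # both pods carry this label
spec:
  affinity:
    podAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app: shared-pvc-app
        topologyKey: kubernetes.io/hostname   # co-locate on the same node
  containers:
  - name: app
    image: busybox
    command: ["sleep", "3600"]
    volumeMounts:
    - name: data
      mountPath: /data
  volumes:
  - name: data
    persistentVolumeClaim:
      claimName: shared-claim          # the zonal single-attach PVC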

@wu105 commented Feb 5, 2018 via email

@msau42 (Member) commented Feb 5, 2018

@bsalamat is there anything we can do about the scenario where you specify podAffinity and it's the first pod (which will have no pods matching the selector yet)?

@bsalamat (Member) commented Feb 5, 2018

@msau42 if two pending pods had affinity to one another, they would never be scheduled. Affinity is a way of specifying dependency, and two pods having affinity to one another represents a circular dependency, which is an anti-pattern IMO.

@wu105 commented Feb 5, 2018

@bsalamat Maybe we can add a rule like requiredDuringSchedulingIgnoredDuringExecutionOrFirstPod, or introduce a special weight, e.g. 101, for a preferredDuringSchedulingIgnoredDuringExecution rule that would make it behave like requiredDuringSchedulingIgnoredDuringExecution when there is at least one matching pod.
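To make the first suggestion concrete, a purely hypothetical sketch; this field does not exist in the API:

# Hypothetical -- requiredDuringSchedulingIgnoredDuringExecutionOrFirstPod is
# not a real field; it would behave like the required rule, except that the
# first pod (with no matches yet) would schedule anyway.
affinity:
  podAffinity:
    requiredDuringSchedulingIgnoredDuringExecutionOrFirstPod:
    - labelSelector:
        matchLabels:
          app: shared-pvc-app          # illustrative label
      topologyKey: kubernetes.io/hostname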

@bsalamat (Member) commented Feb 6, 2018

Maybe we can add a rule like requiredDuringSchedulingIgnoredDuringExecutionOrFirstPod,

so, the "OrFirstPod" part causes the pod to be scheduled even if the affinity rule cannot be satisfied? It could work, but I have to think about possible performance implications of this. Affinity/anti-affinity already causes performance issues in medium and large clusters and we are thinking about stream-lining the design. We must be very careful about adding new features which could worsen the situation.

or introduce a special weight, e.g. 101, for a preferredDuringSchedulingIgnoredDuringExecution rule that would make it behave like requiredDuringSchedulingIgnoredDuringExecution when there is at least one matching pod.

This is a hack. I wouldn't consider this as an option.

@wu105 commented Feb 6, 2018 via email

@msau42 (Member) commented Feb 6, 2018

@wu105 as I mentioned earlier, I think the proper solution will be access mode enforcement. The PV NodeAffinity feature does not help here, as it is unrelated to volume attachment and access modes. We cannot assume that all PVs are attachable to only a single node at a time. There have been quite a few other issues discussing this and its challenges: #26567, #30085, #47333

@wu105 commented Feb 6, 2018 via email

@msau42 (Member) commented Feb 6, 2018

Agreed. I think some new access mode API is needed to handle this case. Let's use #26567 to continue the discussion, since that issue has the most history regarding access modes.

@msau42 (Member) commented Feb 27, 2018

Dynamic provisioning topology design proposal is here: kubernetes/community#1857

@fejta-bot:

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label May 28, 2018
@deitch (Contributor) commented May 28, 2018

/remove-lifecycle stale

@k8s-ci-robot removed the lifecycle/stale label May 28, 2018
@fejta-bot:

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label Aug 26, 2018
@aalubin commented Sep 5, 2018

/remove-lifecycle stale

@k8s-ci-robot removed the lifecycle/stale label Sep 5, 2018
@msau42 (Member) commented Sep 5, 2018

Topology-aware dynamic provisioning will be available in beta in 1.12. In-tree GCE, AWS, and Azure block disks are supported; local and CSI volumes will be supported in a future release.
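For reference, a minimal sketch of this beta API: a StorageClass that both delays binding and restricts provisioning to specific zones (the provisioner and zones are illustrative):

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: topology-aware-standard        # illustrative
provisioner: kubernetes.io/gce-pd      # illustrative in-tree provisioner
volumeBindingMode: WaitForFirstConsumer
allowedTopologies:                     # restrict where volumes may be provisioned
- matchLabelExpressions:
  - key: failure-domain.beta.kubernetes.io/zone
    values:
    - us-central1-a                    # illustrative zones
    - us-central1-b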

/close

@k8s-ci-robot (Contributor):

@msau42: Closing this issue.

In response to this:

Topology-aware dynamic provisioning will be available in beta in 1.12. In-tree GCE, AWS, and Azure block disks are supported; local and CSI volumes will be supported in a future release.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
