scheduler and storage provision (PV controller) coordination #43504

Closed
jingxu97 opened this issue Mar 22, 2017 · 52 comments
Labels: kind/feature, priority/important-soon, sig/scheduling, sig/storage

Comments

@jingxu97 (Contributor) commented Mar 22, 2017

There are several ongoing discussion threads related to this topic, so I am opening this issue to summarize the relevant discussions and, hopefully, reach a conclusion.

Currently, pod scheduling and the PV controller (PVC/PV binding and dynamic provisioning) are implemented as completely separate controllers. This separation has benefits, such as keeping the scheduler pluggable, better isolation, and better performance, since PV/PVC binding can be performed asynchronously before pod scheduling. But the two controllers make decisions independently, without considering each other's choices, so their node/zone selections might conflict.

The goal is to largely keep this separate-controller design while making the modifications necessary to overcome the problem of conflicting decisions. There are four scenarios to consider.

  1. Single zone, network-attached storage
    In this case, a PV/PVC is not tied to any node selection, and the binding can be performed independently. No change is needed to the current code.

  2. Single zone, local storage
    In this case, a PV is tied to a specific node. Once a PVC is bound to that PV, it is effectively bound to the node, and that decision might not be compatible with the pod scheduler's decision. Normally the pod scheduler ranks nodes based on predefined policies and picks the node with the highest score, while the PV controller independently searches for the best match when binding PVC to PV.

  3. Multiple zones, network-attached storage
    The pod scheduler gives each zone a score that is used to rank nodes, but the PV controller does not consider zone information when choosing a PV for a PVC; it chooses the same way in every case. Only for dynamic provisioning can the user specify zone information, in which case the PV is created in that zone and the pod scheduler has a predicate that picks nodes only from the zone where the PV is. For StatefulSets, there is a hacky way to make sure volumes are spread across zones at creation time, by using the PVC name as an indicator of the StatefulSet. In this situation it is possible that the PV controller picks a zone in which no node has enough resources (CPU/memory) for the pod. See more discussion at Fix StatefulSet volume provisioning "magic" #41598.

  4. Multiple zones, local storage
    The PV controller could find PV candidates in different zones and on different nodes. Similar to case 2, the PV controller might pick a zone and node that do not have enough resources for the pod.

To solve these problems, I think it would be good for the storage, scheduling, and workload teams to get together and agree on the outcome we want to deliver.

Proposal:
[@vishh] Move the binding selection and decision from the PV controller to the scheduler. The PV controller will still be around to take care of dangling claims and/or volumes, roll back incomplete transactions (necessary when a pod requests multiple local PVs), reclaim PVs, etc.

@jingxu97 (Contributor, Author):

@kubernetes/sig-storage-misc
@kubernetes/sig-scheduling-misc
@kubernetes/sig-apps-misc

@jingxu97 added this to the v1.7 milestone Mar 22, 2017
@davidopp (Member):

ref/ kubernetes/community#306 (for 2 and 4)

@jsafrane (Member):

As the author of the PV controller, I admit it's quite stupid when it comes to matching PVs to PVCs: it ignores pod requirements entirely. It's a very simple process; however, it's complicated by our database not allowing us transactions. It would save us a lot of pain (and code!) if we could update two objects atomically in a single write, and such an operation could easily be done in the pod scheduler.

The PV controller would remain to coordinate provisioning, deletion, and such.

[I know etcd allows transactions, however we intentionally don't use this feature].

@thockin (Member) commented Mar 22, 2017

This started in email, so I'll bring in some of my notes from there.

In reality, the only difference between a zone and a node is cardinality. A node is just a tiny little zone with one choice. If we fold PV binding into scheduling, we get a more holistic sense of resources, which would be good. What I don't want is the (somewhat tricky) PV binding being done in multiple places.

Another consideration: provisioning. If a PVC is pending and provisioning is invoked, we really should decide the zone first and tell the provisioner what zone we want. But so far that's optional (and opaque). As long as provisioners provision in whatever zone they feel like, we STILL have split-brain. For net-attached storage we get away with it because cardinality is usually > 1, whereas for local storage it is not.

I think the right answer might be to join these scheduling decisions and to be more prescriptive with topology wrt provisioning.

@kow3ns (Member) commented Mar 22, 2017

/ref #41598

@0xmichalis (Contributor):

[I know etcd allows transactions, however we intentionally don't use this feature].

This is the issue about transactions: #27548

@msau42 (Member) commented Apr 26, 2017

Here is my rough idea for making the scheduler storage-topology aware. It is similar to the topologyKey for pod affinity/anti-affinity, except you specify it in the StorageClass instead of the pod. The sequence could look something like this:

  1. PVC-PV binding has to be delayed until there is a pod associated with the PVC.
  2. The scheduler has to look into the StorageClass of the PVC and pull out the topologyKey.
  3. It filters the existing available PVs based on the topologyKey value, which also has to match on the node it's evaluating. The predicate returns true if there are enough available PVs.
  4. If no pre-existing PVs are available, ask the provisioner if it can provision in the topologyKey value of the node. The predicate returns true if the provisioner says it can.
  5. The scheduler picks a node for the pod out of the remaining choices based on some ranking.
  6. The kubelet waits until the PVCs are bound.
  7. Once the pod is assigned to a node, do the PVC-PV binding/provisioning.
  8. If the binding fails, the kubelet has to reject the pod.
  9. The scheduler retries with a different node. Go back to 6).
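For reference, the delayed binding described in step 1 is essentially what later shipped as the volumeBindingMode field on StorageClass (beta in 1.12). A minimal sketch, with an illustrative class name and provisioner:

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: topology-aware-ssd             # illustrative name
provisioner: kubernetes.io/gce-pd      # illustrative provisioner
volumeBindingMode: WaitForFirstConsumer  # delay PVC binding until a consuming pod is scheduled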

@vishh (Contributor) commented Apr 26, 2017

I have an alternate idea for dealing with topology. I think we can use labels to express topology. Here is the algorithm that we discussed:

  1. PVs expose a single topology label based on the storage they represent:

kind: PersistentVolume
metadata:
  labels:
    topology.kubernetes.io/node: foo
spec:
  localStorage: ...

or

kind: PersistentVolume
metadata:
  labels:
    topology.kubernetes.io/zone: bar
spec:
  gcePersistentDisk: ...
  2. PVCs can already select using this topology label. StorageClasses should include a label selector that can specify topology constraints:

kind: StorageClass
metadata:
  name: local-fast-storage
spec:
  topologySelector:
    - key: "topology.kubernetes.io/node"

or

kind: StorageClass
metadata:
  name: durable-slow
spec:
  topologySelector:
    - key: "topology.kubernetes.io/zone"
      operator: In
      values:
      - bar-zone
  3. Nodes will expose all aspects of topology via consistent label keys:

kind: Node
metadata:
  labels:
    topology.kubernetes.io/node: foo
    topology.kubernetes.io/zone: bar
  4. The scheduler can then combine the NodeSelector on the pod with the selector on the StorageClass when identifying nodes that can meet the storage locality requirements.

This method would require using consistent label keys across nodes and PVs. I hope that's not a non-starter.

@vishh (Contributor) commented Apr 26, 2017

The kubelet already exposes failure-domain labels that indicate the zone and region. Here is an example from GKE:

Labels:
    failure-domain.beta.kubernetes.io/region=us-central1
    failure-domain.beta.kubernetes.io/zone=us-central1-b
    kubernetes.io/hostname=gke-ssd-default-pool-ef225ddf-xfrk

We can consider re-purposing the existing labels too.

@jsafrane (Member):

@vishh, matching existing PVs is IMO not the issue here; the problem is dynamic provisioning. You must know, before provisioning, in which zone / region / host / arbitrary topology domain you want to provision the volume. And @msau42 proposes that this decision be made during pod scheduling.

@msau42, technically this could work; however, it will break external provisioners. You can't ask them whether it's possible to provision a volume for a specific node in order to filter the nodes; you can only ask them to provision a volume, and they either succeed or fail.

@msau42 (Member) commented Apr 27, 2017

Yes, the sequence I am suggesting will require changes to the current provisioning protocol to support this additional request.

@vishh (Contributor) commented Apr 27, 2017

You must know, before provisioning, in which zone / region / host / arbitrary topology domain you want to provision the volume

I was assuming that provisioning will be triggered by the scheduler in the future, at which point the zone/region/rack or the specific node a pod will land on will be known prior to provisioning.

@msau42 (Member) commented Apr 27, 2017

I also think that the filtering is more of an optimization and can be optional. There are only a handful of zones, so we could try to provision in one zone and, if that fails, try another zone until it succeeds.

But for the node case, being able to pre-filter available nodes will be important. It's not a scalable solution if we have to retry hundreds of nodes until we find one that succeeds.

@vishh (Contributor) commented Apr 27, 2017

I also think that the filtering is more of an optimization and can be optional.

Storage should reside where pods are. If pods have a specific spreading constraint, then storage allocation ideally has to meet that constraint. The scenario you described is OK for pods that do not have any specific spreading constraints.

It's not a scalable solution if we have to retry hundreds of nodes until we find one that succeeds.

Define success? For local PVs it's only a matter of applying label filters and performing capacity checks, right?

@msau42 (Member) commented Apr 27, 2017

I'm referring to a dynamic provisioning scenario, where the scheduler decides which node the pod should be on and then triggers provisioning on that node. But the scheduler should know some information beforehand about whether a node has enough provisionable capacity, so that it can pre-filter nodes.

@smarterclayton (Contributor) commented Apr 28, 2017 via email

@vishh (Contributor) commented Apr 29, 2017 via email

@davidopp (Member):

For remote PVs, if dynamic provisioning fails in the rack/zone/region, then
the pod cannot be scheduled. The scheduler should not be changing its pod
spreading policy based on storage availability.

I'm not sure I understand this. It's important to distinguish between predicates (hard constraints) and priority functions (soft constraints/preferences). The scheduler's spreading policy (assuming you're not talking about explicit requiredDuringScheduling anti-affinity) is in the latter category. So if there is a node where the pod can fit (and where storage is available or can be provisioned), it should always schedule, even if it "violates" the pod spreading policy.

BTW I like the idea of the StorageClass exposing status that indicates the number and shape of PVs it can allocate, so that the scheduler can use this information, plus its knowledge of the available PVs that have already been created, when making the assignment decision. I agree we probably need an alternative for "legacy provisioners" that don't expose this StorageClass status.

@vishh (Contributor) commented May 1, 2017

So if there is a node where the pod can fit (and where storage is available or can be provisioned), it should always schedule, even if it "violates" the pod spreading policy.

It is possible for storage constraints to violate pod scheduling constraints. What if a StatefulSet wants to use a StorageClass that is accessible only from a single zone, but the pods using that StorageClass are expected to be spread across zones? I feel this is an invalid configuration and scheduling should fail. If the scheduler were to (incorrectly) notice that storage is available in only one zone and then place all pods in the same zone, that would violate user expectations.

To be clear, local PVs can have a predicate. Local PVs are statically provisioned and, from a scheduling standpoint, are similar to "cpu" or "memory".
It is dynamic provisioning that will require an additional scheduling step, one that runs after a sorted list of nodes is available for each pod in the scheduler.
@davidopp thoughts?

@davidopp (Member) commented May 1, 2017

There are two kinds of spreading: constraint/hard requirement (called requiredDuringScheduling) and preference/soft requirement (called preferredDuringScheduling). I was just saying that it's OK to violate the second kind due to storage.

I think it's hard to pin down exactly what "user expectations" are for priority functions. We have a weighting scheme but there are so many factors that unless you manually adjust the weights yourself, you can't really have strong expectations.

@vishh (Contributor) commented May 1, 2017

I was just saying that it's OK to violate the second kind due to storage.

Got it. If storage availability can be exposed in a portable manner across deployments, then it can definitely be a "soft constraint" as you mentioned. The easiest path forward for now is to perform dynamic provisioning of storage once a list of nodes is available.

I think it's hard to pin down exactly what "user expectations" are for priority functions.

Got it. I was referring to "hard constraints" specifically.

@spiffxp (Member) commented May 31, 2017

@kubernetes/sig-storage-misc @kubernetes/sig-scheduling-misc @kubernetes/sig-apps-misc do you want this in for v1.7? Which is the correct SIG to own this?

@k8s-ci-robot added the sig/storage, sig/scheduling, and sig/apps labels May 31, 2017
@msau42 (Member) commented May 31, 2017

We're targeting 1.8

@0xmichalis removed the sig/apps label Jun 2, 2017
@deitch (Contributor) commented Nov 7, 2017

We're targeting 1.8

Did any of it make it into 1.8? I don't see PRs linked here, but I might have missed them.

@msau42 (Member) commented Nov 7, 2017

@deitch No, unfortunately not. Scheduler improvements for static PV binding are targeted for 1.9; dynamic provisioning will come after. The general feature tracker is at kubernetes/enhancements#490.

@deitch (Contributor) commented Nov 7, 2017

@msau42 looks like we are talking on 2 separate issues about the same... "issue"? :-)

So 1.9 for static, 1.10+ or 2.0+ for dynamic?

@msau42 (Member) commented Nov 7, 2017

Yes

@wu105 commented Feb 3, 2018

Did we cover the use case where two (or more) pods use the same PV? If the underlying infrastructure does not allow the PV to attach to more than one node, the second pod should be scheduled on the same node as the first pod.

@msau42 (Member) commented Feb 3, 2018

Pods that use local PVs will be scheduled to the same node, but that won't work for zonal PVs: pods that use zonal PVs will only be scheduled to the same zone.

@wu105 commented Feb 4, 2018

@msau42 that would require the user to specify the node name for the pod, which is not desirable: Kubernetes already has the information needed to schedule the second pod to the correct node, yet the user is burdened with selecting nodes.

On a different topic (I hope it is not off topic for this thread): node and PV zones with the OpenStack cloud provider. When the cloud provider is OpenStack, the node and PV zones seem to be copied from Nova and Cinder respectively. The node zones are network security zones, while PVs get a single zone from Cinder that serves multiple network zones, which is not suitable for Kubernetes scheduling. The OpenStack Nova and Cinder zones just do not seem to support Kubernetes scheduling. It would be more helpful if the Kubernetes admin could easily configure the node zones and the PV zones on OpenStack. PVs come and go, so it may help to add a PV zone override to the PV claim.

@msau42 (Member) commented Feb 4, 2018

@wu105 are you referring to local or zonal PVs? The design goal of local PVs is that the pod does not need to specify any node name; it's all contained in the PV information.
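For illustration, a local PV carries its node constraint inside the PV object itself, so the pod needs no node name. A minimal sketch, with an illustrative node name and disk path:

apiVersion: v1
kind: PersistentVolume
metadata:
  name: example-local-pv               # illustrative
spec:
  capacity:
    storage: 100Gi
  accessModes:
  - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  storageClassName: local-storage
  local:
    path: /mnt/disks/ssd1              # illustrative disk path
  nodeAffinity:                        # pins the PV (and any pod using it) to one node
    required:
      nodeSelectorTerms:
      - matchExpressions:
        - key: kubernetes.io/hostname
          operator: In
          values:
          - node-1                     # illustrative node name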

The problem with node enforcement for zonal PVs is that you also need to take access mode into account. A multi-writer zonal PV does not have a node-attachment restriction the way ReadWriteOnce PVs do. I think the best way to solve the node problem for zonal PVs is to do access mode enforcement, instead of trying to conflate it with PV node affinity.

I'm not sure I completely understand your issue with OpenStack zone labelling. At least for GCE and AWS volumes, I know we have admission controllers that already label the PV with the correct zone information. I imagine you can do the same for OpenStack.
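For reference, a sketch of what such an admission controller produces for a GCE PD; the volume name, disk, and zone are illustrative:

apiVersion: v1
kind: PersistentVolume
metadata:
  name: pv-gce-example                 # illustrative
  labels:
    # added automatically by the PersistentVolumeLabel admission controller
    failure-domain.beta.kubernetes.io/zone: us-central1-b
    failure-domain.beta.kubernetes.io/region: us-central1
spec:
  capacity:
    storage: 200Gi
  accessModes:
  - ReadWriteOnce
  gcePersistentDisk:
    pdName: my-data-disk               # illustrative
    fsType: ext4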

@msau42 (Member) commented Feb 5, 2018

I just realized that, as a workaround, you could use pod affinity to get two pods sharing the same zonal single-attach PVC scheduled on the same node.
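A sketch of that workaround, assuming both pods carry an illustrative app: shared-pvc-app label and share a claim named shared-claim:

apiVersion: v1
kind: Pod
metadata:
  name: second-pod                     # illustrative
  labels:
    app: shared-pvc-app                # both pods carry this label
spec:
  affinity:
    podAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app: shared-pvc-app
        topologyKey: kubernetes.io/hostname   # co-locate on the same node
  containers:
  - name: app
    image: busybox
    command: ["sleep", "3600"]
    volumeMounts:
    - name: data
      mountPath: /data
  volumes:
  - name: data
    persistentVolumeClaim:
      claimName: shared-claim          # the zonal single-attach PVC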

@wu105 commented Feb 5, 2018 via email

@msau42 (Member) commented Feb 5, 2018

@bsalamat is there anything we can do about the scenario where you specify podAffinity and it's the first pod (which will have no pods matching the selector yet)?

@bsalamat (Member) commented Feb 5, 2018

@msau42 if two pending pods had affinity to one another, they would never be scheduled. Affinity is a way of specifying dependency, and two pods having affinity to one another represents a circular dependency, which is an anti-pattern IMO.

@wu105 commented Feb 5, 2018

@bsalamat Maybe we can add a rule like requiredDuringSchedulingIgnoredDuringExecutionOrFirstPod, or introduce a special weight, e.g. 101, for a preferredDuringSchedulingIgnoredDuringExecution rule that would make it behave like requiredDuringSchedulingIgnoredDuringExecution when there is at least one matching pod.
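To make the first suggestion concrete, a purely hypothetical sketch; this field does not exist in the API:

# Hypothetical -- requiredDuringSchedulingIgnoredDuringExecutionOrFirstPod is
# not a real field; it would behave like the required rule, except that the
# first pod (with no matches yet) would schedule anyway.
affinity:
  podAffinity:
    requiredDuringSchedulingIgnoredDuringExecutionOrFirstPod:
    - labelSelector:
        matchLabels:
          app: shared-pvc-app          # illustrative label
      topologyKey: kubernetes.io/hostname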

@bsalamat (Member) commented Feb 6, 2018

Maybe we can add a rule like requiredDuringSchedulingIgnoredDuringExecutionOrFirstPod,

so, the "OrFirstPod" part causes the pod to be scheduled even if the affinity rule cannot be satisfied? It could work, but I have to think about possible performance implications of this. Affinity/anti-affinity already causes performance issues in medium and large clusters and we are thinking about stream-lining the design. We must be very careful about adding new features which could worsen the situation.

or introduce a special weight, e.g. 101, for a preferredDuringSchedulingIgnoredDuringExecution rule that would make it behave like requiredDuringSchedulingIgnoredDuringExecution when there is at least one matching pod.

This is a hack. I wouldn't consider this as an option.

@wu105 commented Feb 6, 2018 via email

@msau42 (Member) commented Feb 6, 2018

@wu105 as I mentioned earlier, I think the proper solution will be access mode enforcement. The PV NodeAffinity feature does not help here, as it is unrelated to volume attachment and access modes. We cannot assume that all PVs are attachable to only a single node at a time. There have been quite a few other issues discussing this and its challenges: #26567, #30085, #47333

@wu105 commented Feb 6, 2018 via email

@msau42 (Member) commented Feb 6, 2018

Agreed. I think some new access mode API is needed to handle this case. Let's use #26567 to continue the discussion, since that issue has the most history regarding access modes.

@msau42 (Member) commented Feb 27, 2018

Dynamic provisioning topology design proposal is here: kubernetes/community#1857

@fejta-bot:

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label May 28, 2018
@deitch (Contributor) commented May 28, 2018

/remove-lifecycle stale

@k8s-ci-robot removed the lifecycle/stale label May 28, 2018
@fejta-bot:

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label Aug 26, 2018
@aalubin commented Sep 5, 2018

/remove-lifecycle stale

@k8s-ci-robot removed the lifecycle/stale label Sep 5, 2018
@msau42 (Member) commented Sep 5, 2018

Topology-aware dynamic provisioning will be available in beta in 1.12. In-tree GCE, AWS, and Azure block disks are supported; local and CSI volumes will be supported in a future release.
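For reference, a minimal sketch of this beta API: a StorageClass that both delays binding and restricts provisioning to specific zones (the provisioner and zones are illustrative):

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: topology-aware-standard        # illustrative
provisioner: kubernetes.io/gce-pd      # illustrative in-tree provisioner
volumeBindingMode: WaitForFirstConsumer
allowedTopologies:                     # restrict where volumes may be provisioned
- matchLabelExpressions:
  - key: failure-domain.beta.kubernetes.io/zone
    values:
    - us-central1-a                    # illustrative zones
    - us-central1-b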

/close

@k8s-ci-robot (Contributor):

@msau42: Closing this issue.

In response to this:

Topology-aware dynamic provisioning will be available in beta in 1.12. In-tree GCE, AWS, and Azure block disks are supported; local and CSI volumes will be supported in a future release.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
