Dynamic volume provisioning creates EBS volume in the wrong availability zone #39178
Comments
I'm seeing this too, but in our case we have nodes in every availability zone and Kubernetes still doesn't manage to place the pod and the volume in the same AZ. The cluster was set up with kubeadm, with modifications to make it work with AWS.
Never heard of kaws, but kubeadm is still in alpha and cloud provider integration is not supported (i.e. expected to be broken). You should probably be using kops to install on AWS.
kaws is our own installation system we've been using from the start. I don't think this bug has anything to do with the cluster creation tool. The correct cloud provider flags are passed to each Kubernetes component and other AWS-specific cloud provider functionality works.
If someone can point me to the area of the codebase where the decision about where to provision a dynamic volume in AWS is made, I can try to figure this out myself!
I think I see where the issue is. The docs for EBS provisioning say:
However, I have not found any logic that chooses the zone that way. `aws.Cloud.CreateDisk` calls its own `aws.Cloud.getAllZones` method to populate the list of zones to choose from when creating a disk when the storage class/PVC doesn't request a specific zone. But
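To make the failure mode concrete: once the candidate zone list is populated, the provisioner makes a deterministic pick from it. The sketch below is an illustrative Python approximation, not the actual Go implementation — the hash function and names are assumptions — but it shows why an unfiltered zone list produces volumes in zones with no nodes.

```python
import zlib

def choose_zone_for_volume(zones, pvc_name):
    """Deterministically pick a zone for a new volume by hashing the
    claim name over the sorted candidate zones. If `zones` includes
    zones that contain no cluster nodes, the pick can land on a
    nodeless zone -- which is the bug discussed in this issue."""
    ordered = sorted(zones)
    return ordered[zlib.crc32(pvc_name.encode("utf-8")) % len(ordered)]

# The same claim name always maps to the same zone from a given set:
zone = choose_zone_for_volume({"us-east-1a", "us-east-1e"}, "data-mongo-0")
```

If the candidate set were filtered to zones that actually contain cluster nodes, the pick could never land outside the cluster.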
+1, happened to me too while trying to set up MongoDB from a Helm chart. I have a few K8s test nodes, all in the same zone, but the volumes it provisioned are in another zone, so creation of pods got stuck. Is the only way to overcome this for now to have minions in all zones, so that one of them can accept your dynamic volume along with the pod?

It's still a problem on AWS: if I store a few TB of data on a volume and trust it to be migrated during failover to another node in the cluster (where the failed pod will be re-created), I will be surprised to see it stuck, because K8s will try to launch the pod on any other node with no regard for its AZ. But this issue is related more to the cloud provider than to K8s itself... maybe on GCE it won't happen.
btw, I also tried setting a default storage class in the same availability zone where I have the nodes, with these settings:
But the 3 volumes got created in all 3 AZs (us-west-2a, 2b, and 2c). Weird... I have only this one storage class, so why would it create the volumes in 3 AZs even when explicitly told to use 'zone: us-west-2a' in the storage class?
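For reference, a zone-pinned StorageClass of the kind described above looks roughly like this (the class name is illustrative; `zone` is the aws-ebs provisioner parameter being discussed):

```yaml
kind: StorageClass
apiVersion: storage.k8s.io/v1beta1
metadata:
  name: gp2-us-west-2a
provisioner: kubernetes.io/aws-ebs
parameters:
  type: gp2
  zone: us-west-2a
```

Note that a PVC only gets these parameters if it actually resolves to this class; a chart that hard-codes a storage-class annotation on its PVCs can bypass it entirely.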
@jimmycuadra, you correctly found the place in the code where the zones are chosen. So, tag all your AWS instances that are part of your cluster with "KubernetesCluster=jimmy" (incl. masters!) and restart Kubernetes. It should create volumes only in zones where there is an instance with that tag. You can run multiple clusters under one AWS project, as long as they have different values of the KubernetesCluster tag. @justinsb, btw, is this documented anywhere?
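The tagging itself can be done with the AWS CLI; a sketch (the instance IDs and the cluster value are placeholders):

```shell
# Tag every instance that belongs to the cluster, masters included,
# then restart the Kubernetes components so the cloud provider
# re-reads the tags.
aws ec2 create-tags \
  --resources i-0123456789abcdef0 i-0fedcba9876543210 \
  --tags Key=KubernetesCluster,Value=jimmy
```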
@Dmitry1987, that's odd, your PVs should respect the parameters of the storage class. Show me your PVC. It should ignore the storage class only if you use the alpha annotation
Hi @jsafrane, yes, it was the alpha annotation in that mongodb helm chart that I used; I noticed it only after some time. Thanks!
It worked well when I forked the chart and changed it to "beta", then used it in Helm from a file instead of the repo link. I have this all working now :)
@Dmitry1987, thanks for the confirmation. @jimmycuadra, did you try the KubernetesCluster tag on your AWS instances? Can we close this issue?
@jsafrane No, I have not confirmed that that works yet. In either case, this should be left open until the need for a KubernetesCluster tag is documented, as that is apparently critical for custom clusters. This was supposed to be tracked in #11884, but it was closed and no one has responded to my questions, which were not answered before the issue was closed.
A tag with this specific key is expected by Kubernetes cloud provider logic and used to determine which cluster a given AWS resource belongs to, even if the clusters are isolated by VPC. See kubernetes/kubernetes#39178 and kubernetes/kubernetes#11884.
I'm afraid I can't confirm that adding the
Would it be possible for the cloud provider to just use the Kubernetes labels on nodes to determine a zone, rather than using AWS API calls to try to determine which nodes should be used? It would need to look for any schedulable nodes (i.e. not
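For what it's worth, the zone information is already on each Node object as a label, so a provisioner could consult the API server instead of AWS. You can see it with (`failure-domain.beta.kubernetes.io/zone` was the label key in use at the time; it was later renamed `topology.kubernetes.io/zone`):

```shell
kubectl get nodes -L failure-domain.beta.kubernetes.io/zone
```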
@justinsb, this idea of using Kubernetes nodes + labels instead of loading instances from AWS looks interesting to me. Could it fit into your attempt to add a caching layer to the AWS provider?
Any updates from the storage and/or AWS teams on this? We're currently unable to use dynamic volume provisioning because of this problem.
@jimmycuadra Are you adding the label to a node of an already started cluster? I'm not sure that will be effective. You might have to add the label before your kube-system namespace is started for the first time. I'm not sure if you're using kubeadm or something else to start your cluster, but I definitely hit this issue immediately on the first try for my AWS project team, which has instances in multiple AZs. But if I add the KubernetesCluster label before bringing the cluster up with
in my logs of the kube-controller-manager-ip-172-29-151-151.ec2.internal pod. Now my PVCs seem to be creating PVs dynamically without issue, in the right AZ every time.
Something else changes after I bring up my nodes; it might be DNS propagation delay. When my EC2 host comes up and I've configured it using kubeadm, it's a node with one hostname at first.

This cluster is private-only and exists inside of a VPC, which is why I'm using private addresses. This DNS name does not seem to resolve at any time before or after I notice this change. I haven't been able to narrow down why this happens, or what else changes to make the node change its mind about what its actual hostname is, but after whatever it is happens, I've noticed surprises in some of the pod logs to the effect of "couldn't find the node you're talking about", with references to the new/old name. That might be resulting in permission errors for the backend processes that are supposed to be creating and destroying ELBs, PVs, and SGs in my AWS team, since they look up the FQDN version of the node's hostname and can't find a node with that name.

All of this confusion has basically convinced me that I don't want to use an alpha version of kubeadm for any serious production deployment, or anything else custom, and I'll be using kops or kube-aws to build my permanent AWS deployment instead (as the official documentation recommends).
We use an in-house cluster deploy tool called kaws. I made sure to restart the Kubernetes system components (e.g. controller manager) after applying the EC2 tags, so they should be visible for the cloud provider logic. Our nodes have always been named with the EC2 internal DNS. No problems with the cloud provider logic there—it's just about dynamic volume placement.
Under k8s 1.5.2, we have found that tagging EC2 instances with KubernetesCluster resolves this. With instance tags correctly set, we see the expected behavior: PVCs using the aws-ebs storage provisioner are bound to PVs created in AZs containing cluster nodes.
It looks like the PVC logic will also create the volume in the wrong availability zone if you specify a nodeSelector on the pod that will attach it. There must be a better way to select the availability zone than querying the AWS API for the KubernetesCluster tag, so that Kubernetes pod placement logic is actually considered in the process.
I think using the information the API server has about the nodes is a more resilient approach. See my previous comment: #39178 (comment)
I am seeing the same underlying behavior (full unfiltered list of zones being passed to ChooseZoneForVolume, volume placed in non-k8s zone) with GKE on Google Cloud. Can we treat this issue as platform-agnostic, or should I create a separate issue?
/sig aws |
@msau42 In my experience, it isn't simply that volumes are provisioned in availability zones where nodes do not exist, but also that pods are scheduled independently from PVs (on statefulset creation), and then PVs are created without regard for scheduled pod locations. Later, if a pod is reaped, there's no guarantee that it will be rescheduled in an AZ that matches the existing PV. My best-effort workaround has been to create custom storage classes to pin the statefulset to a zone, which negates the benefits of a cluster that spans multiple AZs.
@StephanX agree, but because the solutions for the two are completely different, I want to split them out into separate issues and track them separately.
Hello all, |
I'm seeing this issue as well, but the only workaround when dealing with large clusters across all AZs is to use an EFS mount, which is less than ideal.
Issues go stale after 90d of inactivity. If this issue is safe to close now please do so with `/close`. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/remove-lifecycle stale |
Just in case somebody wants to cross-check: After removing the
Found this issue whilst researching a curious problem: I have a free node where my stateful set instance could be created, but Kubernetes keeps reporting that it can't schedule the instance because there are no nodes with sufficient memory available (there is one, in zone-a), while the PVC creates a volume in zone-c. The result is the conclusion that there aren't any nodes available for this instance.
@edwardsmit Topology-aware volume provisioning in 1.12 should help with provisioning volumes in zones that can meet your Pod scheduling requirements.
Thank you for the response @msau42, do you know of an issue I can follow?
Feature issue is here: kubernetes/enhancements#490
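For context, topology-aware provisioning is opted into per StorageClass via `volumeBindingMode`. A sketch of what that looks like (the class name is illustrative):

```yaml
kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: gp2-topology-aware
provisioner: kubernetes.io/aws-ebs
parameters:
  type: gp2
volumeBindingMode: WaitForFirstConsumer
```

With `WaitForFirstConsumer`, volume creation is delayed until a pod using the PVC is scheduled, so the EBS volume is created in the zone of the node the scheduler actually picked rather than a hash-chosen zone.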
Issues go stale after 90d of inactivity. If this issue is safe to close now please do so with `/close`. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
@msau42 this one looks resolved
@msau42 Does the new feature handle the case where a node with an existing pod/volume goes down and the pod gets moved to a node in a different AZ? Specifically, does Kubernetes handle copying the existing volume to a new volume in the new AZ? If not, when the pod gets redeployed in a new AZ, it would no longer have access to any data from the old volume. I reviewed both the blog post and the official documentation, but I didn't see anything that addressed this specific case. Thanks!
No, data migration is not part of this feature. The feature only handles initial provisioning of a volume. Once the volume is provisioned, it must always be scheduled to a node in the same zone. If you need to handle zone outages, you will need to use a storage system that does cross-zone replication.
Stale issues rot after 30d of inactivity. If this issue is safe to close now please do so with `/close`. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
Rotten issues close after 30d of inactivity. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
@fejta-bot: Closing this issue. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
We have different node types for different workloads, accomplished by using node taints. Only looking at the zones of nodes where the pod can actually be scheduled, when selecting the zone for the PV, would be great!
This appears to still be an ongoing issue.
Still happens on EKS 1.21.
Still running into this issue on 1.21.5.
So the issue has been solved as mentioned in #39178 (comment). You need to have a KubernetesCluster tag on your instances.
What keywords did you search in Kubernetes issues before filing this one? (If you have found any duplicates, you should instead reply there.): dynamic volume provisioning
Is this a BUG REPORT or FEATURE REQUEST? (choose one): bug report
Kubernetes version (use `kubectl version`):

Environment:
- Kernel (e.g. `uname -a`): Linux ip-10-0-1-121.ec2.internal 4.7.3-coreos-r3 #1 SMP Wed Dec 7 09:29:55 UTC 2016 x86_64 Intel(R) Xeon(R) CPU E5-2676 v3 @ 2.40GHz GenuineIntel GNU/Linux

What happened:
Created a stateful set with a persistent volume claim. Dynamic volume provisioning created an EBS volume in the us-east-1a availability zone, despite all the masters and nodes in the cluster being in us-east-1e. I tried twice, with the same result both times.
The PVC:
The PV created by dynamic provisioning:
The stateful set:
The stateful set's pod, pending due to the volume being in the wrong zone:
The nodes, all in us-east-1e:
What you expected to happen:
Dynamic volume provisioning should have created the required volume in the us-east-1e availability zone.
How to reproduce it (as minimally and precisely as possible):
Add the following storage class to the cluster:
Create the following stateful set and service: