Support for Teleport on AKS #1785

Open
palma21 opened this issue Aug 12, 2020 · 42 comments

@palma21
Member

palma21 commented Aug 12, 2020

Support for container teleportation on AKS:
https://stevelasker.blog/2019/10/29/azure-container-registry-teleportation/

@zhiweiv

zhiweiv commented Aug 12, 2020

@palma21
This issue is categorized as In Progress. Does this mean MS is actively working on it? Do you have an ETA for the preview?

@palma21
Member Author

palma21 commented Aug 12, 2020

Yes, we are. There is no concrete ETA yet; we will be able to provide one in a couple of weeks. It will definitely happen before the end of the year.

@jeanfrancoislarente

Poke / prod @palma21

This one would be great in supporting our use case.

We have

  • over 12 clusters spread across various Azure regions
  • mix of Windows/Linux nodepools
  • nodepools set to scale up/down
  • deployments (install/remove) happen fairly frequently (dev, QA, demos, etc.)
  • deployments are 7 pods (4 linux and 3 Windows)

Image pull time is killing us on a scale up

Thanks in advance for the update!

@PixelRobots
Collaborator

Any update on this?

@palma21
Member Author

palma21 commented Oct 14, 2020

We're targeting private preview in November.

@bplasmeijer

> We're targeting private preview in November.

@jeanfrancoislarente and I would really like to test the preview, @palma21.

@sanderaernouts

sanderaernouts commented Oct 20, 2020

@palma21 we would definitely want to participate in the private preview, provided it includes support for Windows containers as well. Our setup is quite similar to what the author of #1532 describes. We run Windows Docker workloads on demand on our AKS cluster and see that the pull time when updating images or scaling nodes is long (10-20 minutes depending on the exact image). Sounds like ACR Teleport is perfect for us 👍

@jeanfrancoislarente

@palma21 - this is just my regular ~30 day ping. Have you guys been able to come up with an estimated timeline for the preview?

Thanks!

@bplasmeijer

@palma21 Scaling Windows nodes down and up, and then pulling a Windows Server Core image, can take ~8-10 minutes or longer.

@bplasmeijer

Any update on the preview release, @palma21?

@andrewsali

Is it correct that Teleport will be based on Azure Fileshare premium as listed in the first comment on this issue?

If so, what kind of performance can be expected?

According to many reports (#223 (comment)), Azure Fileshare is not performant enough when it comes to small file / metadata intensive operations, which might be a limitation to mounting large container images.

Some guidance on the expected performance would be useful to know if it's worthwhile to plan to use this feature once available in preview.

@JJ11teen

Any ETA on when this will reach public preview?

@kishorerv25

Is there a roadmap for the rollout? We have a lot of Windows containers running with image sizes of 10+ GB. We use the autoscale option, but the image download takes a lot of time. It's becoming one of our main issues.

@miwithro
Contributor

Public Preview is planned for Sept. 2021.

@PixelRobots
Collaborator

@miwithro Does that include Windows image support or just Linux?

@miwithro
Contributor

@PixelRobots it is for both.

@PixelRobots
Collaborator

@miwithro We are nearly halfway through September. Any update on when this will be released to public preview?

@miwithro
Contributor

This has been pushed to October.

@PixelRobots
Collaborator

Sad times. Can you share why?

@miwithro
Contributor

Staffing.

@bplasmeijer

> This has been pushed to October.

Let's make it happen.

@InDieTasten

I can confirm that auto-scaling with large images (18Gi) is borderline impossible, as scaling up takes up to 30 minutes per node. During that period, the previous resources are completely overloaded, and errors occur due to CPU maxing out.

@miwithro Any updates on the staffing end?

@damienpontifex

It seems GCP is trying to solve the same problem: https://cloud.google.com/blog/products/containers-kubernetes/introducing-container-image-streaming-in-gke
Apologies for the tangent, but I thought it good to be knowledgeable about different solutions to this problem.

@PixelRobots
Collaborator

Hey @miwithro any news on this? I could really do with it for some of my customers.

@PixelRobots
Collaborator

Any update on this? I could really do with it. I'm having to pre-pull over 200 images at the moment and not having fun.

@george-zubrienko

Awesome feature, totally +1. Regarding staffing, the only thing I can say is that I can live without extended Windows container support, application ingress gateway, and AAD integration. But features like this are liquid gold, literally, considering we can't really utilize a new node that is busy pulling images, yet we still pay for it from the first second of its availability :)

@cailyoung

We're keen for this. We scale up from zero with 'large' (10+ GB) containers running workloads. Anything to speed it up!

@johannordincab

@miwithro October is now 5 months past. Do you have an updated ETA?

@mschumacher-syntellis

Any updates?

@ocdi

ocdi commented Apr 12, 2022

I've been trying to work out a way to speed up scaling of Windows nodes. A fresh node takes 20 minutes to come online, which, as other people have said, happens while the existing servers are busy or overloaded. Not great. It means I need to scale up aggressively in anticipation that load may increase further.

Really would like to see this being a possibility as this seems like a great solution.

@bplasmeijer

hi @palma21

Please prioritize this work item.

Windows containers get smaller with every release, but ACR image pulls for Windows containers on AKS still need improvement.

cc: @gkaleta @weijuans-msft @richlander @brasmith-ms @brendandburns

@guidemetothemoon

I got a chance to reach out to the ACR and AKS team directly regarding this and their answer is unfortunately that there is no concrete ETA for this and it's not clear when a new ETA will be available 😟
"The item is still accurate, work in progress not further details at the moment. Once we have a date/more details [we] will add it [to the GitHub issue]."

@InDieTasten

For anyone interested in working around this issue, Amazon EKS (Elastic Kubernetes Service) provides a way to inject pre-pulled/extracted images into node images. When you scale up, the new nodes can have a number of pulled images already present on the machine. Source: https://aws.amazon.com/blogs/containers/speeding-up-windows-container-launch-times-with-ec2-image-builder-and-image-cache-strategy/

I would really like a similar option in AKS as well, where you could specify as part of the node pool that certain images/layers should already be present on machines in that pool.

@ender1598

Happy 2 year anniversary of the issue! Hopefully this third year of progress is the most productive. 😎

@efzn

efzn commented Aug 12, 2022

> Happy 2 year anniversary of the issue! Hopefully this third year of progress is the most productive. 😎

😂

waiting

@sumitkute

One of my customers is facing the same issue with a 15 GB image. They want to cache it or store it in the node image so that a new VM coming up in the VMSS already has the Docker image cached. AWS has this; we need an alternative in AKS.

@jrauschenbusch

While this is far from an ideal solution, in the meantime someone could use the following approach.

But AKS support for Teleportation would definitely be the better option.

@InDieTasten

> While this is far from an ideal solution, in the meantime someone could use the following approach.

@jrauschenbusch That works only in cases where nodes exist long before any pods need to be scheduled there, which isn't the case for most of us. We want the teleportation feature because we are scaling out new nodes and want to deploy large-image pods immediately. I don't want to pay for nodes to sit around and wait for load to increase. If I have the node lying around, I could just as easily scale out my ReplicaSet or StatefulSet directly and force the node to pull the image that way.

If we have multiple deployments utilizing these nodes, I see how this could help. If deployment A needs a lot of resources, it can schedule on some nodes. The nodes also pull images for deployment B. If load shifts from A to B, then the B deployment can scale quickly, but only if A also decreases. Seems like an uncommon use case :/

@InDieTasten

As of today, the missing support for teleportation undermines the entire node auto-scaling feature for Windows nodes, which is arguably one of the most important features for a cloud-based managed k8s cluster that's supposed to support Windows.

@jrauschenbusch

jrauschenbusch commented Nov 29, 2022

@InDieTasten This is of course true to a certain degree. But it eases the pain a bit when you have a lot of daemon set pods running on the nodes that are a prerequisite for your workloads. Then the pre-pull daemon set pod can already start pulling the image before your workload is scheduled on the node. As I already mentioned, this is far from the ideal solution, but it may help some people as long as Teleport support is not ready to use.
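
For readers unfamiliar with the pre-pull daemon set idea, here is a minimal sketch of what such a manifest could look like, generated with Python. It is not necessarily the approach linked earlier in the thread. The function name `prepull_daemonset`, the registry and image names, the pause image tag, and the no-op command are all assumptions for illustration. Each image to cache is listed as an init container that exits immediately, so the kubelet pulls it as soon as the daemon set pod lands on a new node; a pause container then keeps the pod alive.

```python
import json


def prepull_daemonset(name, images, pause_image, noop_command):
    """Build a DaemonSet manifest (as a dict) that pre-pulls `images` on every node.

    Hypothetical helper: each image becomes an init container that runs
    `noop_command` and exits, so the node pulls the image without running it.
    """
    labels = {"app": name}
    init_containers = [
        {"name": f"prepull-{i}", "image": image, "command": list(noop_command)}
        for i, image in enumerate(images)
    ]
    return {
        "apiVersion": "apps/v1",
        "kind": "DaemonSet",
        "metadata": {"name": name},
        "spec": {
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {
                    "initContainers": init_containers,
                    # The pause container keeps the pod running after the pulls finish.
                    "containers": [{"name": "pause", "image": pause_image}],
                },
            },
        },
    }


manifest = prepull_daemonset(
    name="image-prepull",
    # Hypothetical image names; replace with the images your workloads use.
    images=["myregistry.azurecr.io/app-a:1.0", "myregistry.azurecr.io/app-b:1.0"],
    # Assumed pause image; any small long-running image would do.
    pause_image="mcr.microsoft.com/oss/kubernetes/pause:3.6",
    # Assumes the images contain a shell; a Windows image would need
    # something like ["cmd", "/c", "exit 0"] instead.
    noop_command=["sh", "-c", "exit 0"],
)

# kubectl accepts JSON manifests as well as YAML.
print(json.dumps(manifest, indent=2))
```

The output could be saved to a file and applied with `kubectl apply -f`. As discussed above, this only helps when the node is up for a while before the real workload is scheduled on it; it does not shorten the pull itself.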

@jrauschenbusch

@palma21 Will there be any progress on this topic or is it cancelled?

@vaibhav-dhawan

For anyone else who had a hard time finding this, https://github.com/Azure/acr/blob/main/docs/teleport/aks-getting-started.md has a guide for requesting access to the private preview.
