Feature request: custom VHD or a way to prepull docker images offline #1532

Open
Timvissers opened this issue Mar 30, 2020 · 28 comments
Labels
action-required feature-request Requested Features Scale and Performance Use this for any AKS scale or performance related issue

Comments

@Timvissers

Request: a way to prepull custom Docker images, so that they are already available on the worker node and do not have to be pulled after the node is started by a scale-up event.

Context:
We have huge Docker images (>10 GB) which have already been optimized in size.
I tested pulling a Docker image (from a premium ACR with geo-replication in place), triggered by a Kubernetes job; it takes about 6 minutes.
We run Kubernetes jobs for which it's crucial to start as soon as possible. We can live with the +/- 2 minute scale-up time for a new node, but we cannot afford to lose 6 extra minutes on the Docker pull.

I saw that the current VHD Packer scripts already prepull Docker images. This request is to make that capability available to customers.

@zhiweiv

zhiweiv commented Mar 31, 2020

We'd like this feature too; we have a similar use case.

@0x53A

0x53A commented Mar 31, 2020

This is especially important for Windows nodes

@jluk jluk added the feature-request Requested Features label Mar 31, 2020
@jluk
Contributor

jluk commented Mar 31, 2020

@Timvissers thanks for opening this request. Do I read correctly that the root problem is that image pull time is too long for your scenario? If so, could I retitle your request as "Reduce image pull-time on AKS scale"?

We have options to address that, such as integrating Project Teleport.
https://azure.microsoft.com/en-us/resources/videos/azure-friday-how-to-expedite-container-startup-with-project-teleport-and-azure-container-registry/

@zhiweiv

zhiweiv commented Apr 1, 2020

I think for now, especially for Windows containers, pre-caching base images is the easiest and most stable way.

@Timvissers
Author

@jluk Thanks for your comment. I will investigate Teleport; I was unaware of it.

I would suggest not renaming the request to 'reduce pull-time'. Maybe it should be called 'customisable worker nodes' or similar?
Custom worker nodes have other uses besides offline prepulling of Docker images: installing extra exporters (e.g. Prometheus), Filebeat collectors for logging, or other software that teams could use on worker nodes.

I think AKS is running a bit behind other big cloud providers' managed Kubernetes solutions when it comes to worker node customization.
I do see that AKS Engine already uses Packer, so the effort would be just to bring this to the customer.

@jluk
Contributor

jluk commented Apr 2, 2020

@Timvissers the teleport integration requires a dependency chain to be unblocked, but it is a path we're investigating to reduce image pull time issues.

To clarify my previous ask to rename: I would like to understand the specific customization needs behind the ask for a BYO image scenario. Often the items needing customization at the OS level have alternative solutions on the existing OS, or we already plan to address the root problem (like slow image pull time via Teleport).

As for custom OS images for worker nodes managed by cloud providers, there are none to my knowledge that will give you actual support/management of customized nodes. AKS is quite clear on this by only offering a managed node which qualifies for true Azure support / on-call. Any full customization needed can be done with AKS-Engine, which does not provide support but does offer the full suite of customization you could hope for.

The support experience you will face varies widely if you try to get help on an unmanaged "BYO image" node from any provider. That being said, if you are comfortable with no support on a BYO image and acknowledge you only get support on the control plane, are your requirements still met?

@zhiweiv

zhiweiv commented Apr 3, 2020

Our requirement is relatively simple: pre-cache the .NET/ASP.NET base images to improve the startup time of pods on scaled-up Windows nodes. Our workloads are all based on .NET Framework, and it takes a long time to pull and extract the base image.

The best approach: AKS provides additional Windows image SKUs with these base images out of the box, and we can choose the SKU when creating a Windows pool.

The second approach: AKS provides BYO image support, and we build images based on the official AKS images. Only the control plane is supported by Azure; we take care of the worker nodes ourselves.

@Timvissers
Author

Thank you for the extra insights, including those about AKS-Engine. But currently I don't think AKS-Engine is the best option for me, for the following reasons:

  • we will be running quite a few clusters in different regions, and if I understand correctly, this is not a managed solution. We currently lack knowledge of the control plane, so we are going for the managed control plane.
  • this does not benefit from the cost-free master plane
  • we have everything in Terraform, not ARM.

So, yes, I'm OK with no support on the data plane, but I'm not yet at the point where I'm OK with no support on the master plane.
In this case, I would be taking a supported base image and just adding some docker pull statements in a Packer file, so the changes are minor. We have been doing this for 1.5 years on another cloud provider. We are planning to migrate to Azure, hence this feature request.

I am open to alternative solutions, but there seems to be no easy one for me:

  • Teleport is in preview
  • overprovisioning and using init-containers to pull images, to have hot standby nodes for when customer jobs come in. Unfortunately, we are supporting expensive GPU nodes, so this would be unnecessarily expensive.
  • smaller Docker images. But these are already heavily optimized, so there is not much to gain here.
  • starting the nodes earlier to gain 1 minute. Context: the customer uploads data. Once it is uploaded, we could determine the type of node (group) needed and already start a node. This would save about 1 minute of the time the new node spends booting and joining the cluster. But the effort is quite big for the possible gain.
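
For readers unfamiliar with the init-container option in the list above, here is a minimal sketch of one common way to do it: a DaemonSet whose init containers reference the large images, so every node that is already running pulls them into its local cache. It is an illustration, not something confirmed by anyone in this thread. The DaemonSet name, the image name `myregistry.azurecr.io/big-job:1.0`, and the pause image tag are hypothetical, and the `sh -c true` command assumes the image ships a shell. Note that it only warms nodes that already exist, which is why it has to be combined with overprovisioning.

```python
import json


def prepull_daemonset(name, images, pause_image="registry.k8s.io/pause:3.9"):
    """Build a DaemonSet manifest whose init containers pull each image onto the node."""
    init_containers = [
        {
            "name": f"prepull-{i}",
            "image": image,
            # Assumption: the image contains a shell. The container exits at once,
            # but the pulled layers stay in the node's image cache.
            "command": ["sh", "-c", "true"],
        }
        for i, image in enumerate(images)
    ]
    labels = {"app": name}
    return {
        "apiVersion": "apps/v1",
        "kind": "DaemonSet",
        "metadata": {"name": name},
        "spec": {
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {
                    "initContainers": init_containers,
                    # A tiny long-running container keeps the pod alive after the pulls.
                    "containers": [{"name": "pause", "image": pause_image}],
                },
            },
        },
    }


# Hypothetical image name, for illustration only.
manifest = prepull_daemonset("image-prepull", ["myregistry.azurecr.io/big-job:1.0"])
print(json.dumps(manifest, indent=2))
```

The printed JSON can be saved to a file and applied with `kubectl apply -f`, since kubectl accepts JSON manifests as well as YAML.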

Other options:

  • not migrating to Azure if we consider this to be a blocker
  • My previous test results were for a Docker image of 13.5 GB, pulled on a node in westeurope from a geo-replicated location (premium ACR) that was fully synced. It took 6 minutes, which is a really long time. If it took 1 minute, we probably wouldn't care and wouldn't have this request. Maybe I should open a support request on why this takes so long (though I know that it's not only about network traffic, it's also about decompressing)?
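
To put the numbers from this test in perspective, the short calculation below derives the effective end-to-end rate. It is a rough sketch only: it assumes the image is about 13.5 GB, uses decimal units (1 GB = 1000 MB), and treats the whole 6 minutes as one rate even though that time covers both download and decompression.

```python
def effective_rate_mb_s(image_gb: float, seconds: float) -> float:
    """End-to-end rate in MB/s, using decimal units (1 GB = 1000 MB)."""
    return image_gb * 1000 / seconds


IMAGE_GB = 13.5  # assumed image size

observed = effective_rate_mb_s(IMAGE_GB, 6 * 60)  # measured: about 6 minutes
needed = effective_rate_mb_s(IMAGE_GB, 1 * 60)    # wished for: about 1 minute

print(f"observed: {observed:.1f} MB/s")
print(f"needed:   {needed:.1f} MB/s")
print(f"speed-up: {needed / observed:.1f}x")
```

Under these assumptions, reaching the 1 minute target would need a sixfold improvement across pull and extraction combined, which more network bandwidth alone may not deliver.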

@jluk
Contributor

jluk commented Apr 3, 2020

Thanks for all the feedback - @mikkelhegn as FYI on the Windows caching requests from @zhiweiv. @zhiweiv if you were provided a BYO image scenario, would zero support of the data plane also be acceptable?

@Timvissers I'm assuming you're running quite a large Linux image, or is it Windows? A 13GB image taking ~5 minutes is about what I would expect; you are correct that the wait is incurred by both pull time and decompression.

Thanks for confirming that no support of the data plane is acceptable to you if you bring your own nodes; this is something we're open to discussing. Would you mind sharing other generic requirements you may have for customizing OS nodes? I read that you mentioned additional OS logging/binaries.

@zhiweiv

zhiweiv commented Apr 4, 2020

We are OK with zero support of the data plane in a BYO image scenario.

@Timvissers
Author

@jluk We use Linux on Standard_F8s_v2 with a 100 GB disk.

@github-actions

This issue has been automatically marked as stale because it has not had activity in 90 days. It will be closed if no further activity occurs. Thank you!

@github-actions github-actions bot added the stale Stale issue label Jul 20, 2020
@zhiweiv

zhiweiv commented Jul 20, 2020

Any update?

@palma21 palma21 removed the stale Stale issue label Jul 20, 2020
@palma21
Member

palma21 commented Aug 12, 2020

It seems this thread is leaning a bit towards BYO image support, which is not something we're planning for the foreseeable future.

I've created a specific issue for Teleport support, which is being worked on: #1785

@ghost

ghost commented Feb 13, 2021

Action required from @Azure/aks-pm

@ghost ghost added the Needs Attention 👋 Issues needs attention/assignee/owner label Feb 13, 2021
@ghost

ghost commented Feb 28, 2021

Issue needing attention of @Azure/aks-leads

11 similar comments from ghost (Mar 19, 2021 through Aug 18, 2021)

@EPinci

EPinci commented Sep 1, 2021

Hey, as an update to this, there are two features coming up.
The first is "Scale down mode" #2061, which will allow you to "turn off" nodes without decommissioning the image (and thus not lose the pulled images), and the second is "Teleport" #1785, which caches image layers, as already mentioned in the thread.
You can look at the mentioned issues for details.

Thank you.

@EPinci EPinci removed action-required Needs Attention 👋 Issues needs attention/assignee/owner labels Sep 1, 2021
@Azure Azure deleted a comment from zhiweiv Feb 11, 2022
@pavneeta pavneeta self-assigned this Feb 16, 2022
@pavneeta pavneeta added the Scale and Performance Use this for any AKS scale or performance related issue label Feb 16, 2022
@ghost ghost added the action-required label Aug 15, 2022

7 participants