AKS should cache container images that are repeatedly requested #2594

Closed
eugen-nw opened this issue Oct 12, 2021 · 13 comments
Labels: Feedback, Scale and Performance, stale

@eugen-nw

Downloading the same image from ACR time and again is a waste of cycles. Keep the cached images for only one day, so that the ACR secret's expiration can kick in and disable further ACR downloads. Our solution does a lot of on-demand scale-out, and the suggested improvement would help the scaled-out containers start faster.

@ghost added the triage label Oct 12, 2021
@ghost commented Oct 12, 2021

Hi eugen-nw, AKS bot here 👋
Thank you for posting on the AKS Repo, I'll do my best to get a kind human from the AKS team to assist you.

I might be just a bot, but I'm told my suggestions are normally quite good, as such:

  1. If this case is urgent, please open a Support Request so that our 24/7 support team may help you faster.
  2. Please abide by the AKS repo Guidelines and Code of Conduct.
  3. If you're having an issue, is it covered in the AKS Troubleshooting guides or AKS Diagnostics?
  4. Make sure you're subscribed to the AKS Release Notes to keep up to date with all that's new on AKS.
  5. Make sure there isn't a duplicate of this issue already reported. If there is, feel free to close this one and '+1' the existing issue.
  6. If you have a question, do take a look at our AKS FAQ. We place the most common ones there!

@hieumoscow

Hi @eugen-nw, can you describe your use case further? Currently the container image does get cached on the node after the first pull, until that node is restarted by an upgrade.
From your message it looks like you are referring to scale-out. What is the size of your image? A new node that has never pulled the image still requires an initial pull; from then on, the image is cached on that specific node.
There is an upcoming Teleport feature from ACR that will enable AKS to pull images much faster: #1785
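
This node-level cache is what imagePullPolicy: IfNotPresent leans on. A minimal sketch, assuming a placeholder image reference:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: worker
spec:
  containers:
    - name: worker
      # placeholder image reference; substitute your own ACR image
      image: myregistry.azurecr.io/worker:1.0
      # IfNotPresent: contact ACR only when the image is not already
      # cached on the node; later starts on the same node skip the pull
      imagePullPolicy: IfNotPresent
```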

@ghost ghost removed the triage label Oct 14, 2021
@eugen-nw

Thanks very much for having looked into this issue! It is fantastic to learn that AKS already caches the container images. As I suggested earlier, the caching timeout would ideally be only 24 hours, so that the ACR secret's expiration can kick in and disable those downloads.

Our scenario is a bit different from the norm. We run only the Virtual Kubelet on AKS, and it creates our Windows containers in ACI. Recently we had a problem with containers sitting for weeks in a "Failed" state in ACI because the pull from ACR could not complete within N minutes. We use Kubernetes' HPA to scale out the container count based on the number of messages received in Azure message queues. And we do plenty of scaling out, several times a day, up to 50 containers running. We keep 7 containers running at all times in order to respond quickly to demand.
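
For context, pods reach a Virtual Kubelet node via a node selector and toleration. A minimal sketch following the Azure virtual-node samples, with a placeholder image; a specific Windows/ACI setup like the one described may use different values:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: aci-worker
spec:
  containers:
    - name: worker
      # placeholder image reference
      image: myregistry.azurecr.io/worker:1.0
  # steer the pod onto the Virtual Kubelet node so it runs in ACI;
  # selector/toleration values follow the Azure virtual-node samples
  nodeSelector:
    type: virtual-kubelet
  tolerations:
    - key: virtual-kubelet.io/provider
      operator: Exists
```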

@marwanad

Another slightly different alternative to consider is the deallocate scale-down mode with the cluster autoscaler. The cluster autoscaler will respond to pending-pod pressure from HPA and start/deallocate VMs as necessary. Deallocated VMs keep your images preserved.

https://azure.microsoft.com/en-us/updates/public-preview-scaledown-mode-in-aks/
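
A minimal sketch of enabling it on an existing node pool, assuming placeholder resource-group, cluster, and pool names:

```
# Deallocate (rather than delete) VMs on scale-down so their
# cached images survive; all names below are placeholders
az aks nodepool update \
  --resource-group myResourceGroup \
  --cluster-name myAKSCluster \
  --name nodepool1 \
  --scale-down-mode Deallocate
```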

@ghost removed the action-required label Oct 14, 2021
@eugen-nw

eugen-nw commented Oct 14, 2021 via email

@eugen-nw

This will not work for us because we do not run the containers on AKS nodes; we use the Virtual Kubelet to run them in ACI.

@alexeldeib

@miwithro @justindavies I don't think we can fix this on the AKS side; perhaps one of you could relay this ask to the ACI/Virtual Kubelet folks?

@eugen-nw

eugen-nw commented Dec 11, 2021

I kind of think that the caching should happen on the ACI side. It would also require a bit of coordination with the Virtual Kubelet on this matter.

@huzefaqubbawala

Hi,

The image caching is not working because the Virtual Kubelet uses ACI to run the containers, and ACI does not do any caching; it pulls the image from ACR every time, which makes the whole process slow.

We faced exactly the same issue while using ACI directly (not via the Virtual Kubelet). We are now considering a move to AKS + KEDA to achieve faster start times. Project Teleport is also coming to AKS first, with no plans for ACI as of today.

I am not sure why you are using the Virtual Kubelet if you always want 7 containers running. A better approach would be to keep a few nodes running in AKS and scale out on demand.

@eugen-nw

eugen-nw commented Jan 24, 2022

@huzefaqubbawala Thanks very much for having looked into this! As I wrote above, we scale out from 7 containers running constantly to up to 50 containers at a time.

Our reasons for running all the containers in ACI are:

  1. I do not know yet how to run the 7 permanent containers in AKS and only the scale-out containers in ACI. At any rate, that setup still would not solve the problem I raised: AKS/ACI downloading the image for every scale-out container is rarely necessary.

  2. We are not willing to pay for AKS VMs capable of accommodating our large scale-out demands. If we were willing to pay for that, we would run 50 containers on the AKS VMs all the time and not bother with the scale-out complications at all.

I have not used KEDA yet. As far as I know, it creates a new container in response to each incoming request. If the KEDA-generated containers are hosted on AKS VM nodes, what happens when the count of incoming calls is so high that AKS runs out of resources and can no longer create containers for KEDA? Do those calls get queued up, or are they lost?

@huzefaqubbawala


First,

For scale-out scenarios, caching is not possible even in AKS, since new VMs are spun up dynamically and a first-time pull from ACR is required. Project Teleport will eventually address this by reducing the time it takes to pull from the registry in AKS.

Second,

Why do you need your scaled-out containers to run on ACI? You can use AKS alone to run them with the cluster autoscaler, and it will also scale down when there are no messages (you only pay for what you use).

Assuming you need 7 containers running in AKS at all times and the rest on load, you can configure your AKS like below (see the sketch after this list):

Node pool: cluster autoscaler with min 2 nodes, max 10 nodes (depending on what your containers need).
KEDA configuration: spin up one container per message using a KEDA ScaledObject.

In this setup, when there are thousands of messages in Azure Service Bus, the cluster will scale up to the autoscaler maximum of 10 nodes.
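
A minimal sketch of such a ScaledObject, assuming a hypothetical Deployment name, queue name, and an environment variable on the target container holding the Service Bus connection string (KEDA's azure-servicebus scaler):

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: worker-scaler
spec:
  scaleTargetRef:
    name: worker            # hypothetical Deployment to scale
  minReplicaCount: 7        # keep 7 containers running at all times
  maxReplicaCount: 50       # cap at the 50-container peak
  triggers:
    - type: azure-servicebus
      metadata:
        queueName: work-items   # placeholder queue name
        messageCount: "1"       # one replica per pending message
        # read the connection string from an env var on the target
        # container (e.g. projected from a Kubernetes Secret)
        connectionFromEnv: SERVICEBUS_CONNECTION
```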

Even with ACI, you have a quota that must be respected, so you cannot create containers without limit; plus, you would need an external orchestrator to create the containers in ACI.

@pavneeta self-assigned this Feb 16, 2022
@pavneeta added the Scale and Performance label Feb 16, 2022
@ghost added the stale label Apr 17, 2022
@ghost commented Apr 17, 2022

This issue has been automatically marked as stale because it has not had any activity for 60 days. It will be closed if no further activity occurs within 15 days of this comment.

@ghost closed this as completed Apr 24, 2022
@ghost commented Apr 24, 2022

This issue will now be closed because it hasn't had any activity for 7 days after being marked stale. @eugen-nw, feel free to comment again within the next 7 days to reopen it, or open a new issue after that time if you still have a question, issue, or suggestion.

@ghost locked as resolved and limited conversation to collaborators May 24, 2022