AKS should cache container images that are repeatedly requested #2594

Closed
eugen-nw opened this issue Oct 12, 2021 · 13 comments
Labels: Feedback, Scale and Performance, stale

@eugen-nw

Downloading the same image from ACR time and again is a waste of cycles. Keep the cached images for only one day, so that the ACR secret's expiration can kick in and disable further ACR downloads. Our solution does a lot of on-demand scale-out, and the suggested improvement would help the scaled-out containers start faster.

@ghost added the triage label Oct 12, 2021
@ghost commented Oct 12, 2021

Hi eugen-nw, AKS bot here 👋
Thank you for posting on the AKS Repo, I'll do my best to get a kind human from the AKS team to assist you.

I might be just a bot, but I'm told my suggestions are normally quite good, as such:

  1. If this case is urgent, please open a Support Request so that our 24/7 support team may help you faster.
  2. Please abide by the AKS repo Guidelines and Code of Conduct.
  3. If you're having an issue, is it covered in the AKS Troubleshooting guides or AKS Diagnostics?
  4. Make sure you're subscribed to the AKS Release Notes to keep up to date with all that's new on AKS.
  5. Make sure there isn't a duplicate of this issue already reported. If there is, feel free to close this one and '+1' the existing issue.
  6. If you have a question, do take a look at our AKS FAQ. We place the most common ones there!

@hieumoscow

Hi @eugen-nw, can you describe your use case further? Currently the container image does get cached on the node after the first pull, until that node is restarted by an upgrade.
From your message it looks like you are referring to scale-out. What is the size of your image? A new node that has never pulled the image still requires an initial pull; from then on, the image is cached on that specific node.
There is an upcoming Teleport feature from ACR that will enable AKS to pull images much faster: #1785
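
This node-level cache is what imagePullPolicy: IfNotPresent leans on. A minimal sketch, assuming a placeholder image reference:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: worker
spec:
  containers:
    - name: worker
      # placeholder image reference; substitute your own ACR image
      image: myregistry.azurecr.io/worker:1.0
      # IfNotPresent: contact ACR only when the image is not already
      # cached on the node; later starts on the same node skip the pull
      imagePullPolicy: IfNotPresent
```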

@ghost ghost removed the triage label Oct 14, 2021
@eugen-nw

Thanks very much for having looked into this issue! It is fantastic to learn that AKS already caches the container images. As I suggested earlier, the caching timeout would ideally be only 24 hours, so that the ACR secret's expiration can kick in and disable those downloads.

Our scenario is a bit different from the norm. We run only the Virtual Kubelet on AKS, and it creates our Windows containers in ACI. Recently we had a problem with containers sitting for weeks in a "Failed" state in ACI because the pull from ACR could not complete within N minutes. We use Kubernetes' HPA to scale out the container count based on the number of messages received in Azure message queues. And we do plenty of scaling out, several times a day, up to 50 containers running. We keep 7 containers running at all times in order to respond quickly to demand.
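
For context, pods reach a Virtual Kubelet node via a node selector and toleration. A minimal sketch following the Azure virtual-node samples, with a placeholder image; a specific Windows/ACI setup like the one described may use different values:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: aci-worker
spec:
  containers:
    - name: worker
      # placeholder image reference
      image: myregistry.azurecr.io/worker:1.0
  # steer the pod onto the Virtual Kubelet node so it runs in ACI;
  # selector/toleration values follow the Azure virtual-node samples
  nodeSelector:
    type: virtual-kubelet
  tolerations:
    - key: virtual-kubelet.io/provider
      operator: Exists
```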

@marwanad

Another slightly different alternative to consider is the deallocate scale-down mode with the cluster autoscaler. The cluster autoscaler will respond to pending-pod pressure from HPA and start/deallocate VMs as necessary. Deallocated VMs keep your images preserved.

https://azure.microsoft.com/en-us/updates/public-preview-scaledown-mode-in-aks/
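
A minimal sketch of enabling it on an existing node pool, assuming placeholder resource-group, cluster, and pool names:

```
# Deallocate (rather than delete) VMs on scale-down so their
# cached images survive; all names below are placeholders
az aks nodepool update \
  --resource-group myResourceGroup \
  --cluster-name myAKSCluster \
  --name nodepool1 \
  --scale-down-mode Deallocate
```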

@ghost removed the action-required label Oct 14, 2021
@eugen-nw

eugen-nw commented Oct 14, 2021 via email

@eugen-nw

This will not work for us because we do not run the containers on AKS nodes; we use the Virtual Kubelet to run them in ACI.

@alexeldeib

@miwithro @justindavies I don't think we can fix this on the AKS side; perhaps one of you could relay this ask to the ACI/Virtual Kubelet folks?

@eugen-nw

eugen-nw commented Dec 11, 2021

I kind of think that the caching should happen on the ACI side. It would also require a bit of coordination with the Virtual Kubelet on this matter.

@huzefaqubbawala

Hi,

The image caching is not working because the Virtual Kubelet uses ACI to run the containers, and ACI does not do any caching; it pulls the image from ACR every time, which makes the whole process slow.

We faced exactly the same issue while using ACI directly (not via the Virtual Kubelet). We are now considering a move to AKS + KEDA to achieve faster start times. Project Teleport is also coming to AKS first, with no plans for ACI as of today.

I am not sure why you are using the Virtual Kubelet if you always want 7 containers running. A better approach would be to keep a few nodes running in AKS and scale out on demand.

@eugen-nw

eugen-nw commented Jan 24, 2022

@huzefaqubbawala Thanks very much for having looked into this! As I wrote above, we scale out from 7 containers running constantly to up to 50 containers at a time.

Our reasons for running all the containers in ACI are:

  1. I do not know yet how to run the 7 permanent containers in AKS and only the scale-out containers in ACI. At any rate, that setup still would not solve the problem I raised: AKS/ACI downloading the image for every scale-out container is rarely necessary.

  2. We are not willing to pay for AKS VMs capable of accommodating our large scale-out demands. If we were willing to pay for that, we would run 50 containers on the AKS VMs all the time and not bother with the scale-out complications at all.

I have not used KEDA yet. As far as I know, it creates a new container in response to each incoming request. If the KEDA-generated containers are hosted on AKS VM nodes, what happens when the count of incoming calls is so high that AKS runs out of resources and can no longer create containers for KEDA? Do those calls get queued up, or are they lost?

@huzefaqubbawala


First,

For scale-out scenarios, caching is not possible even in AKS, since new VMs are spun up dynamically and a first-time pull from ACR is required. Project Teleport will eventually address this by reducing the time it takes to pull from the registry in AKS.

Second,

Why do you need your scaled-out containers to run on ACI? You can use AKS alone to run them with the cluster autoscaler, and it will also scale down when there are no messages (you only pay for what you use).

Assuming you need 7 containers running in AKS at all times and the rest on load, you can configure your AKS like below (see the sketch after this list):

Node pool: cluster autoscaler with min 2 nodes, max 10 nodes (depending on what your containers need).
KEDA configuration: spin up one container per message using a KEDA ScaledObject.

In this setup, when there are thousands of messages in Azure Service Bus, the cluster will scale up to the autoscaler maximum of 10 nodes.
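
A minimal sketch of such a ScaledObject, assuming a hypothetical Deployment name, queue name, and an environment variable on the target container holding the Service Bus connection string (KEDA's azure-servicebus scaler):

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: worker-scaler
spec:
  scaleTargetRef:
    name: worker            # hypothetical Deployment to scale
  minReplicaCount: 7        # keep 7 containers running at all times
  maxReplicaCount: 50       # cap at the 50-container peak
  triggers:
    - type: azure-servicebus
      metadata:
        queueName: work-items   # placeholder queue name
        messageCount: "1"       # one replica per pending message
        # read the connection string from an env var on the target
        # container (e.g. projected from a Kubernetes Secret)
        connectionFromEnv: SERVICEBUS_CONNECTION
```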

Even with ACI, you have a quota that must be respected, so you cannot create containers without limit; plus, you would need an external orchestrator to create the containers in ACI.

@pavneeta self-assigned this Feb 16, 2022
@pavneeta added the Scale and Performance label Feb 16, 2022
@ghost added the stale label Apr 17, 2022
@ghost commented Apr 17, 2022

This issue has been automatically marked as stale because it has not had any activity for 60 days. It will be closed if no further activity occurs within 15 days of this comment.

@ghost closed this as completed Apr 24, 2022
@ghost commented Apr 24, 2022

This issue will now be closed because it hasn't had any activity for 7 days after being marked stale. @eugen-nw, feel free to comment again within the next 7 days to reopen it, or open a new issue after that time if you still have a question, issue, or suggestion.

@ghost locked as resolved and limited conversation to collaborators May 24, 2022