AKS Latency and performance/availability issues due to IO saturation and throttling under load #1373
Issue shortlink: https://aka.ms/aks/io-throttle-issue
AKS Engineering has identified an issue leading to customers reporting service, workload and networking instability when running under load or with large numbers of ephemeral, periodic events (jobs). These failures (covered below) are the result of Disk IO saturation and throttling at the file operation (IOPS) level.
Worker node VMs running customer workloads are regularly disk IO throttled/saturated on all VM operating system disks due to the underlying quota of the storage device, potentially leading to cluster and workload failure.
This issue should be investigated (as documented below) if you are seeing worker node/workload or API server unavailability. This issue can lead to NodeNotReady and loss of cluster availability in extreme cases.
Customers may be tempted to jump ahead to the common errors and mitigations; we recommend not doing so, as this issue is a common IaaS pitfall worth understanding.
Most cloud providers, Azure included, default to network-attached storage supplied by the storage service (Azure Blob) for the operating system / local storage of a given virtual machine for many VM classes.
Physical storage devices have limitations in terms of bandwidth and total number of file operations (IOPS) but this is usually constrained by the physical device itself. Cloud provisioned block and file storage devices also have limitations due to architecture or service limits (quotas).
These service limits/quotas enforced by the storage service are layered - VMs have specific quotas, as do network cards, disks, etc. When these limits are exceeded the service itself (storage, compute, etc) pushes back (throttles) the offending entity.
Examining the customer reported failures, AKS engineering identified that customer workloads were exceeding quotas set by Azure Storage to the operating system disk of cluster worker nodes (Max IOPS). Due to the nature of the failure customers would not be aware to monitor for these metrics on worker nodes.
AKS has identified this issue as contributing significantly to the following common error / failure reports:
The root cause of this issue is resource starvation/saturation on the worker nodes. The trigger of the failures is IO overload of the worker node OS disks. This leads to the OS disk (from the perspective of the kernel) becoming unresponsive, blocked in IOWait. As everything on Linux is a file (including network sockets), CNI, Docker, and other services that perform network I/O will also fail as they are unable to read off of the disk.
For Linux system engineers: if you strace the affected processes you will see them locked in fsync, fwait, lock acquisition (pthread) and other operations.
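As a hedged illustration (the PID variable and the syscall list below are assumptions for the sketch, not taken from a specific AKS node), inspecting one of these blocked processes looks like:

```shell
# Hypothetical sketch: summarize which syscalls a suspect process is stuck in.
# SUSPECT_PID is a placeholder for a real process ID on the node. Under IO
# throttling, the fsync/fdatasync and futex (pthread lock) rows dominate the
# time column of the summary strace prints on exit.
strace -c -f -p "$SUSPECT_PID" -e trace=fsync,fdatasync,futex,openat
```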
The events that can trigger this throttle include high volumes of Docker containers running on the nodes (Docker IO is shared on the OS disk), custom or 3rd party tools (security, monitoring, logging) running on the OS disk, node failover events and periodic jobs. As load increases or the pods are scaled, this throttling occurs more frequently until all nodes go NotReady while the IO completes/backs off.
Key takeaways:
When an AKS cluster is provisioned, the worker node VMs are all assigned 100 GiB operating system disks. For this example, we will use a DS3_v2 SKU - per the documentation this SKU has the following limits:
The Max IOPS value shown here is the total for the VM shared across all storage devices. For example, this sku has a maximum of 12800 IOPS. Assuming I wanted four devices on my VM, each device would have a maximum IOPS of 3200 (12800 / 4) or a single device with a maximum IOPS value of 12800.
Azure, using network-attached storage, maps these to specific disk classes/types - AKS defaults to Premium SSD storage. If we look at the storage quotas/limits we see:
AKS uses 100 GiB OS disks for worker nodes by default - in the chart above, this is a P10 tier disk with a Max IOPS value of 500.
This means our example DS3_v2 SKU (VM Max IOPS 12800) has an OS disk Max IOPS of 500 (P10 class Max IOPS 500). You cannot exceed these values without the VM / storage hosts pushing back at the VM layer.
When quotas are stacked across VM, networking and storage, the lesser of the quotas applies first. If your VM has a maximum IOPS of 12800 and the OS disk has 500, the maximum VM OS disk IOPS is 500; exceeding this will result in throttling of the VM and its storage until the load backs off.
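The quota stacking above reduces to a simple minimum. A quick sketch using the DS3_v2 / P10 numbers from this example:

```shell
# The effective IOPS for any one disk is the lesser of the VM-level quota
# and that disk's own quota (values from the DS3_v2 / P10 example above).
vm_max_iops=12800      # DS3_v2 VM-level Max IOPS
os_disk_max_iops=500   # P10 (100 GiB Premium SSD) Max IOPS
effective_iops=$(( vm_max_iops < os_disk_max_iops ? vm_max_iops : os_disk_max_iops ))
echo "Effective OS disk IOPS: $effective_iops"   # prints 500
```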
These periods of throttling, caused by the mismatch between IaaS resources (VM/disk) and workload, directly impact the runtime stability and performance of your AKS clusters.
For more reading on Azure / VM and storage quotas, see "Azure VM storage performance and throttling demystified".
Note: These limits and quotas cannot be expanded or extended for any specific storage device or VM.
AKS engineering is working on deploying fleet wide analysis and metrics collection for these failures and issues, including enhancements to Azure Container Insights and other tools and utilities for customers and other product changes.
While those are in progress the following instructions will help customers identify these failures and other performance bottlenecks in their clusters.
Metrics / Monitoring Data
The USE method metrics are a critical component, as many customer applications run a mixture of application services and back-end services. The USE metrics specifically help you identify bottlenecks at the system level, which will impact the runtime of both.
For more information on the metrics/signals see:
Additionally, for a good look at the USE metric(s) and performance testing AKS specifically, see: "Disk performance on Azure Kubernetes Service (AKS) - Part 1: Benchmarking".
Identification using the prometheus operator (recommended)
The prometheus operator project provides a best practice set of monitoring and metrics for Kubernetes that covers all of the metrics above and more.
We recommend the operator as it provides both a simple (helm) based installation as well as all of the prometheus monitoring, grafana charts, configuration and default metrics critical to understanding performance, latency and stability issues such as this.
Additionally the prometheus operator deployment is specifically designed to be highly available - this helps significantly in availability scenarios that could risk missing metrics due to container/cluster outages.
Customers are encouraged to examine and implement this using their own metrics/monitoring pipeline, copying the USE (Utilization and Saturation) metrics/dashboard as well as the pod-level, namespace-level and node-level utilization reports from the operator. Additionally, the node reports clearly display OS disk saturation leading to high levels of system latency and degraded application/cluster performance.
Please Note: Installation of the Prometheus Operator requires that the authentication webhook is enabled on the worker nodes. Currently this is not enabled by default. We will be releasing a change to enable this in early 2020. Until then, customers can execute the following command on the AKS worker nodes (using SSH, vmss commands, etc) - this change will not persist through an upgrade or scale-out.
Please test, verify and re-verify all commands in test systems/clusters to ensure they match your configuration
Using the Azure CLI (VMSS cluster):
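Since the exact command is not reproduced here, the following is a hedged sketch of one way to apply it fleet-wide; the resource group and VMSS names are placeholders, and the kubelet flags file path (/etc/default/kubelet) is an assumption you should verify on your own nodes before running anything:

```shell
# Hedged sketch: run a script on each VMSS instance that adds the kubelet
# authentication webhook flag and restarts kubelet. All names are placeholders.
RG="MC_myResourceGroup_myCluster_eastus"   # assumption: your node resource group
VMSS="aks-nodepool1-12345678-vmss"         # assumption: your nodepool VMSS name

for id in $(az vmss list-instances -g "$RG" -n "$VMSS" --query "[].instanceId" -o tsv); do
  az vmss run-command invoke -g "$RG" -n "$VMSS" --instance-id "$id" \
    --command-id RunShellScript \
    --scripts 'sed -i "s/--authorization-mode/--authentication-token-webhook=true --authorization-mode/" /etc/default/kubelet && systemctl restart kubelet'
done
```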
After this is done, you can install the operator using helm: https://github.com/helm/charts/tree/master/stable/prometheus-operator
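A minimal sketch using helm 2 syntax (current when this issue was filed); the release and namespace names below are arbitrary choices, not requirements:

```shell
# Hedged sketch: install the prometheus-operator chart from the stable repo
# (helm 2 syntax; helm 3 drops the --name flag).
helm repo update
helm install stable/prometheus-operator \
  --name prometheus-operator \
  --namespace monitoring
```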
Warning: Deployments such as this with lots of containers (istio as well) may fail temporarily; if this occurs, delete the namespace and re-run the installation. This happens due to existing IO saturation/throttling; large deployments / deployment load will also trigger this issue.
The issue investigation walkthrough below uses the prometheus-operator.
Identification using Azure Monitor
Customers can enable the following metrics at the AKS cluster, nodepool and VM instance level. These do not show all of the metrics and data exposed by the prometheus operator, or what exactly is failing, but they do indicate that the problem is occurring and how severe the load is.
Below is an example view - all spikes in the OS disk queue depth were the result of IO throttling:
Customers can set an alert for the OS disks that:
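As a hedged sketch (the metric name, threshold, window and every resource identifier below are assumptions to adapt, not prescriptions), a queue-depth alert on the node VMSS could look like:

```shell
# Hedged sketch: alert when OS disk queue depth stays elevated, which in this
# issue correlated with IOPS throttling. All names/IDs are placeholders, and
# you should confirm which metrics your VM SKU actually emits.
az monitor metrics alert create \
  --name "aks-os-disk-queue-depth" \
  --resource-group "MC_myResourceGroup_myCluster_eastus" \
  --scopes "/subscriptions/<subscription-id>/resourceGroups/MC_myResourceGroup_myCluster_eastus/providers/Microsoft.Compute/virtualMachineScaleSets/aks-nodepool1-12345678-vmss" \
  --condition "avg 'OS Disk Queue Depth' > 8" \
  --window-size 5m --evaluation-frequency 1m \
  --description "Sustained OS disk queue depth; likely IOPS throttling"
```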
For more on Linux block device settings and queues, see "Performance Tuning on Linux — Disk I/O" and "Improving Linux System Performance with I/O Scheduler Tuning".
What follows is an example root-cause investigation using the prometheus operator data.
For this test, the cluster is 5 nodes, with the prometheus operator installed, as well as istio (default configuration). Symptoms including API server disconnects, lost pods/containers and system latency were identified between 2019-12-16 22:00 and 2019-12-17 01:00 UTC.
From the USE Cluster grafana chart:
We can see that load was pretty spiky, but there were periodic peaks well above the average. This is reflected in the Disk IO charts at the bottom - Disk IO Saturation specifically shows periodic heavy spikes - this is the saturation/throttle scenario.
Going to the first node in the USE Node chart:
Now we correlate that window of time in the Node chart (host metrics):
The impact is shown in the Kubelet chart:
Those spikes should not be occurring.
Examining the same cluster for a different window of time (all events follow the same pattern), but zooming in on the IO / operation view:
As you can see, a spike in the throttle/disk saturation causes significant latency in container_sync and other critical container lifecycle events. Pod sandboxes, liveness probes, etc also begin to time out and fail.
These failures ripple through the entire stack above - from the view of a running application this would look like widespread deployment/container timeouts, slow performance, etc.
Shifting to the pod utilization/load per node, there's an additional jump and churn in the utilization of the
The prometheus operator offers a consistent set of defaults and metrics that allows a deeper look into possible operational issues impacting worker nodes. We will continue to iterate on additional support and other prometheus-operator extensions (such as alertmanager configurations).
This error is misleading; it usually follows one of the above. It is an Azure CNI/Calico/etc failure caused by OS disk saturation/throttling.
Isolating IO paths can be difficult in complex systems. OS disk load comes from the Kernel, the Linux user-space, system daemons, logging and more. The metrics details above will help with issue identification and the impact - but collecting this data requires that your cluster / application is under actual load.
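When you can get onto a node (SSH or a privileged debug pod) while the cluster is under load, standard Linux tools help attribute the IO. A hedged sketch (device names and tool availability vary by node image):

```shell
# Hedged sketch: observe disk saturation and attribute it to processes.
iostat -x 1 5           # watch %util, await and queue size on the OS disk (e.g. sda)
pidstat -d 1 5          # per-process read/write rates to find the heavy IO producers
cat /proc/pressure/io   # IO pressure-stall information (newer kernels only)
```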
Additionally, the container
Currently, there is no complete mitigation for OS disk saturation. Each of these suggestions minimizes or removes a portion of the IO; customers must also have monitoring, as shown above, in place.
Isolate the Docker IO path
In examining this issue AKS identified that docker container IO was a major contributor to the saturation of the OS disks. We are looking at options for customers and examining the default cluster configurations to help with this. These changes will begin to be rolled out and tested in 2020.
For customers having consistent failures/issues as outlined, our team has developed a small tool that, when run, will move your worker nodes' Docker data-root directory to the VM's temporary storage disk.
Azure VMs all have temporary storage available - this storage is ephemeral, and its IO is isolated because it is a second attached device (it does have its own quotas). The size of this disk depends on the VM SKU used.
Moving the Docker data-root to this storage changes the available space for your docker images. For example, the default DS2v3 has 16 GiB of temporary (SSD) storage.
Customers can test this configuration for workload compatibility and stability by using the knode tool.
Knode creates a privileged daemonset that moves Docker's data-root to /temp - this process takes some time as the tool validates/waits between each stage. These changes will be lost if the cluster is upgraded and must be re-applied.
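For illustration only, the effect of such a tool can be sketched as the following node-local steps. The paths are assumptions (/mnt is where Azure typically mounts the Linux temporary disk, /var/lib/docker is Docker's default data-root), and this sketch clobbers any existing /etc/docker/daemon.json:

```shell
# Hedged sketch of a knode-style data-root move, run as root on a worker node.
systemctl stop docker
mkdir -p /mnt/docker                       # /mnt: Azure temporary disk (assumption)
rsync -a /var/lib/docker/ /mnt/docker/     # copy existing images/containers over
echo '{ "data-root": "/mnt/docker" }' > /etc/docker/daemon.json
systemctl start docker
docker info --format '{{ .DockerRootDir }}'   # should now report /mnt/docker
```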
Would using ephemeral OS disks #1370 alleviate this?
when the current default nodepool is
if moving the docker-root with knode
(ephemeral IOPS numbers plucked from)
@djeeg not exactly. The IOPS are still capped - for example, the same effect is achieved if you use a 2 TB regular blob disk. All you do by moving it is raise your ceiling until the failure occurs.
This is why we are adding metrics/monitoring/alerting and we recommend customers do the same in addition to isolating data paths for things like Docker IO and container logging off of the single IO path, ephemeral or not.
The container/kube stack doesn’t have hard guarantees around these resources, so the failures when they occur look like something else entirely
AKS is moving to ephemeral OS disks, and as I noted, we are making other changes to make this clear, obvious and consistent in behavior.
Can you confirm that this statement is correct "Assuming I wanted four devices on my VM, each device would have a maximum IOPS of 3200 (12800*4) or a single device with a maximum IOPS value of 12800." If I had a DS3v2 with 3x P10 disks and 1x P40 disk, I would expect the P40 to perform up to its IOPS limit of 7500 IOPS.
@Ginginho there are multiple quotas in play here. If you have 4 drives it doesn’t matter what their individual quota is (you’re right if you slam the p40 disk it will max for that device).
If you saturate the p40 but not the other drives up to the VM SKU IOPS limit, you'll hit the p40 limit first, then the VM IOPS limit. You can hit both.
So yes, a p40 will max out at 7500 IOPS. You then have to subtract that 7500 IOPS from the total quota shared by all devices on the VM.
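In other words, the per-disk quotas are budgeted out of the VM total. A quick sketch with the numbers from this thread (DS3_v2 with one P40 and three P10s):

```shell
# Budgeting per-disk IOPS out of the VM-level quota.
vm_max_iops=12800    # DS3_v2 VM-level Max IOPS
p40_iops=7500        # P40 disk Max IOPS
p10_iops=500         # P10 disk Max IOPS
left_for_others=$(( vm_max_iops - p40_iops ))   # 5300 remains for the P10s
three_p10_demand=$(( 3 * p10_iops ))            # 1500, comfortably under 5300
echo "$left_for_others $three_p10_demand"       # prints: 5300 1500
```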
I have a couple of clarifications I would like to check:
Could you clarify that? As I understand it, in Linux sockets are modelled like a file but do not actually operate as files. In other words, even if I create local sockets, the data is transferred across memory and does not involve physical disk I/O.
It's important to clarify that, because if we think socket operations involve disk I/O and you have a very socket-based service in a container (such as a high throughput Web server that doesn't reuse sockets well) then there would be a direct correlation between your throughput and your choice of disk IOPS. And I believe that shouldn't be the case, but am happy to be proven wrong.
Should that be 12800 / 4?
Thank you for such a detailed write-up, it is very helpful.
@asos-andrewpotts everything is modeled as a file - e.g. sockets live on the filesystem. The data path for the device is offloaded by the kernel from the primary disk data path (not in all cases - samba/cifs kernel modules can be single threaded), but the socket itself remains a file on the filesystem.
If a service cannot open, stat() or otherwise read its socket, the system is still dead and throttled. This is why, if IO doesn't abate, the hosts will kernel panic.
You won't see the data path (actual bytes to the store) count against the quota, but you will see the socket open/close/read/write operations all count against you and fail. This is a gotcha when running Linux hot/under load.
And that’s not a typo - I meant 4x devices, but ordering is hard
@jnoller - OK, I get it. So open(), stat() etc, will be counted against the Azure Storage Max IOPS, even though the socket send and receive don't persist to disk. And if the Max IOPS is exceeded then those operations will fail resulting in your container app being unable to make an outbound API call, or service a new incoming socket. Thank you!
@asos-andrewpotts Yes, but - This is true for all Linux disk devices regardless of cloud / vendor - open, close, stat, etc on the socket will trigger IO operations (IOPS).
Yes, in this example - it is one source of the total IO which is why the Docker container IO can 'starve' the socket files on the OS disk - once the IOPS peak is hit, all file calls to /dev/sda including writing the disk journal can time out. Your example of this causing the API call to fail is precisely correct, but networking itself isn't a large driver of IO
Possible IO sources on the root disk are the kernel, system logging, sockets on disk, docker / container / daemon IO paths, single threaded network attached disk or other kernel drivers, spikes in overall system load causing all of those (eg failover) to hit at once.
@edernucci So temp disk IOPS are special (and actually pretty great because ephemeral)
"it depends" - and it's really not clear:
On most charts, you want to look at "Max cached and temp storage throughput: IOPS / MBps (cache size in GiB)"
Here's a screenie
https://docs.microsoft.com/en-us/azure/azure-resource-manager/management/azure-subscription-service-limits#virtual-machine-disk-limits - good read
@edernucci But, this answers a secondary question: why not just use ephemeral OS disks? Ephemeral disks are limited to the max IOPS above and the cache size. So, it's a different, much higher limit - but it is still going to be exhausted.
@jnoller My production cluster has 100 GB SSD OS disks (limited to 500 IOPS) on F8s (v1) machines. Looking at that table I get 32000/256 (96) for Max cached and temp storage throughput: IOPS / MBps.
If I understood correctly, I can really get a lot more IOPS (something near 32000 versus my current 500) just by moving my Docker temp files to the temp disk.
Is this statement right?
@infa-ddeore Cache size is the cache size, in gigabytes, of the virtual machine (each machine type has a different cache size). IOPS is as the main article states - IO operations per second: every file open, read, write, close and so on. MBps is megabytes per second - a measure of bandwidth. Just as a hard drive has a maximum capacity (say, two terabytes), it also has a maximum number of IOPS it can support and a maximum amount of data it can read or write per second.
So what that means is (peak file operations) / (peak storage bandwidth), with the cache size in gigabytes being the cache size for the VM SKU.
It could allow a better overall experience for users without particular needs that require persistent node OS storage. It would also be a great benefit for SMEs/NPOs, where OS disk cost represents 5%-20% of the total, and where using aks-engine would add unnecessary administrative overhead for an already limited staff.
@gioamato ephemeral disks have similar limitations. Even using ephemeral for the os means you need to isolate the correct data paths.
I can show this failure using ephemeral disks and even physical hosts and disks. This is a design gap in the kernel/cgroups/container stack - there are lots of options to mitigate and isolate the data paths to maintain QoS, but the fact is the entire stack up to kube will fail in surprising ways when IO limits and saturation occur.
What’s worse is that IO saturation orphans cgroups and other containers, that leads to extreme memory and cpu pressure.
@jnoller all right, isolating IO paths seems like the best option right now. And knode represents a big help.
Ephemeral OS disks (to me, maybe i'm wrong) seems like a "cost friendly" alternative to your mitigation tip n.1:
Surely it must be applied together with the other suggestions.
Plus I'm sure it represents an excellent opportunity for the use case/business I talked about in my previous comments. I know it's off-topic, but what do you think about it?
@jnoller a slightly separate issue, but please can I emphasise that this really isn't an option for many of us and yet it seems to be the default stance for changes to clusters.
The complexity of doing on online migration, or the amount of downtime required for an offline migration, is simply unacceptable for those of us working in small businesses. Frankly, if we're making the effort to move between clusters we'd be looking at moving out of Azure to a different provider.
I appreciate it's difficult for Microsoft to develop migration steps, but I really feel this should be a requirement for all but the most fundamental changes to clusters (to use VMSS as an example, I struggle to see why it wouldn't be possible to add a VMSS to an existing cluster then remove the existing nodes).
I'm happy to open this as a separate issue if you'd rather track it elsewhere.