
AKS Latency and performance/availability issues due to IO saturation and throttling under load #1373

Closed
@jnoller

Description


edit: Please contact Azure support to assess if you are impacted.

Issue summary

AKS Engineering has identified an issue leading to customers reporting service, workload and networking instability when running under load or with large numbers of ephemeral, periodic events (jobs). These failures (covered below) are the result of Disk IO saturation and throttling at the file operation (IOPS) level.

Worker node VMs running customer workloads are regularly IO throttled/saturated on their operating system disks due to the quota of the underlying storage device, potentially leading to cluster and workload failure.

This issue should be investigated (as documented below) if you are seeing worker node/workload or API server unavailability. This issue can lead to NodeNotReady and loss of cluster availability in extreme cases.

Contents:

Customers may jump ahead to the common errors and mitigation sections, but we recommend reading in order, as this issue is a common IaaS pitfall.

Issue Description

Like most cloud providers, Azure by default uses Azure Disks to provide the operating system / local storage for many VM classes.

Physical storage devices have limits on bandwidth and total number of file operations (IOPS), usually imposed by the device itself. Cloud-provisioned block and file storage devices have additional limits imposed by architecture or service limits (quotas).

These service limits/quotas are layered - there are separate quotas for the VM, network interfaces, disks, and so on. When a limit is exceeded, the service itself (storage, compute, etc.) pushes back on (throttles) the offending entity.

Examining the customer-reported failures, AKS engineering identified that customer workloads were exceeding the quotas Azure Storage applies to the operating system disk of cluster worker nodes (Max IOPS). Due to the nature of the failure, customers would not know to monitor these metrics on their worker nodes.

AKS has identified this issue as contributing significantly to the following common error / failure reports:

removed

The root cause of this issue is resource starvation / saturation on the worker nodes. The trigger of the failures is IO overload of the worker node OS disks. This leads to the OS disk (from the perspective of the kernel) becoming unresponsive, blocked in IOWait. As everything on Linux is a file (including network sockets), CNI, Docker, and other services that perform network I/O will also fail, as they are unable to read from the disk.

For Linux system engineers: if you strace() the processes you will see them locked in fsync, fwait, lock acquisition (pthread) and other operations.
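As a rough illustration (assuming SSH access to a worker node; the dockerd PID below is a placeholder and the exact processes involved will vary):

# Processes stuck in uninterruptible IO wait (state D) are a strong signal of a saturated disk.
ps -eo pid,state,wchan:32,comm | awk '$2 == "D"'

# Attach to a suspect process (for example dockerd) and watch for blocked fsync/fdatasync calls.
strace -f -e trace=fsync,fdatasync -p <dockerd-pid>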

The events that can trigger this throttle include high volumes of Docker containers running on the nodes (Docker IO is shared on the OS disk), custom or 3rd party tools (security, monitoring, logging) running on the OS disk, node failover events and periodic jobs. As load increases or the pods are scaled, this throttling occurs more frequently until all nodes go NotReady while the IO completes/backs off.

Key takeaways:

  • Any OS disk queue depth spike or sustained queue depth >= 8 can result in component / workload failure (see the example check after this list)
  • Disk queue length is an indicator that the OS disk for your worker nodes is being throttled
  • These events will cause slow container start times, NodeNotReady, networking and DNS failures, and latency.
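A minimal sketch of checking this on a node with iostat (from the sysstat package; run over SSH on a worker node):

# Watch the OS disk row (commonly sda on AKS nodes - verify with lsblk).
# aqu-sz (avgqu-sz on older versions) is the average queue length; sustained values >= 8
# with high %util match the failure pattern described above.
iostat -xd 5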

Quotas leading to failure

When an AKS cluster is provisioned, the worker node VMs are all assigned 100 GiB operating system disks. For this example we will use the DS3_v2 SKU - per the documentation, this SKU has the following limits:

[Image: DS3_v2 VM size limits table]

The Max IOPS value shown here is the total for the VM, shared across all storage devices. For example, this SKU has a maximum of 12800 IOPS: with four devices attached, each device could draw at most 3200 IOPS (12800 / 4), while a single device could use up to the full 12800.

Azure, using network-attached storage, maps these to specific disk classes / types - AKS defaults to Premium SSD disks. Additionally, these disks are remote, network-attached volumes rather than local devices. If we look at the storage quotas/limits we see:

[Image: Premium SSD managed disk tiers (size / Max IOPS)]

AKS uses 100 GiB OS disks for worker nodes by default - in the chart above, this is a P10 tier disk with a Max IOPS value of 500.

This means our example DS3_v2 SKU (VM Max IOPS 12800) has an OS disk Max IOPS of 500 (P10 class). Your application - or rather, the software running within the virtual machine - cannot exceed these values; doing so will result in host throttling.

When VM, networking and storage quotas are stacked, the lesser of the quotas applies first. If your VM has a maximum of 12800 IOPS and the OS disk has 500, the effective maximum for the OS disk is 500; exceeding this will result in throttling of the VM and its storage until the load backs off.
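A trivial sketch of that rule using the example numbers above (the values are illustrative):

# The effective ceiling for a single disk is the smaller of the VM cap and the disk tier cap.
vm_max_iops=12800    # DS3_v2 VM-level cap
os_disk_iops=500     # P10 (100 GiB Premium SSD) cap
echo "Effective OS disk IOPS: $(( vm_max_iops < os_disk_iops ? vm_max_iops : os_disk_iops ))"    # -> 500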

These periods of VM throttling, caused by the mismatch between IaaS resources (VM/disk) and workload, directly impact the runtime stability and performance of your AKS clusters.

For more reading on Azure / VM and storage quotas, see "Azure VM storage performance and throttling demystified".

Note: These limits and quotas cannot be expanded or extended for any specific storage device or VM.

Issue Identification

AKS engineering is working on deploying fleet-wide analysis and metrics collection for these failures, including enhancements to Azure Container Insights, other tools and utilities for customers, and other product changes.

While that work is in progress, the following instructions will help customers identify these failures and other performance bottlenecks in their clusters.

Metrics / Monitoring Data

AKS recommends that customers add monitoring / telemetry for the Utilization, Saturation and Errors (USE) method metrics, in addition to the recommended SRE "Golden Signals", for all worker nodes.

The USE method metrics are a critical component because many customer clusters run a mixture of application services and back-end services; the USE metrics help you identify bottlenecks at the system level that impact the runtime of both.

For more information on these metrics/signals, and for a good look at the USE metrics and performance testing on AKS specifically, see "Disk performance on Azure Kubernetes Service (AKS) - Part 1: Benchmarking".
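As a hedged illustration of what such a benchmark looks like, the fio sketch below measures what the OS disk can actually sustain. Only run this on a disposable test node - it generates exactly the kind of load that triggers the throttling described here - and note that the file path and parameters are arbitrary:

# Random 4k read/write test against a file on the root filesystem (the OS disk).
# Do NOT use /mnt - on Azure VMs that is the temporary local disk, not the OS disk.
# Compare the reported IOPS against the disk tier's quota (e.g. 500 for P10).
fio --name=osdisk-iops --filename=/var/tmp/fio-testfile --size=1G --direct=1 \
    --rw=randrw --bs=4k --ioengine=libaio --iodepth=32 \
    --runtime=60 --time_based --group_reporting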

Identification using the prometheus operator (recommended)

The prometheus operator project provides a best practice set of monitoring and metrics for Kubernetes that covers all of the metrics above and more.

We recommend the operator because it provides a simple helm-based installation as well as the Prometheus monitoring, Grafana dashboards, configuration and default metrics critical to understanding performance, latency and stability issues such as this one.

Additionally, the prometheus operator deployment is designed to be highly available, which helps significantly in scenarios where container/cluster outages could otherwise cause missing metrics.

Customers using their own metrics/monitoring pipeline are encouraged to replicate the USE (Utilization, Saturation, Errors) metrics/dashboards, as well as the pod-level and namespace/node-level utilization reports from the operator. The node reports clearly display OS disk saturation leading to high system latency and degraded application/cluster performance.
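For reference, a minimal sketch of querying the underlying node_exporter disk saturation metric directly from the operator's Prometheus. The service name below assumes the chart was installed with the release name "monitoring" (as in the installation section) and may differ in your cluster - check kubectl -n monitoring get svc:

# Port-forward the Prometheus service locally, then query weighted IO time
# (the metric behind the USE dashboard's disk saturation panel).
kubectl -n monitoring port-forward svc/monitoring-prometheus-oper-prometheus 9090:9090 &
# The device selector is an assumption - drop it to see all disks.
curl -s 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=rate(node_disk_io_time_weighted_seconds_total{device="sda"}[5m])' | jq .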

Installation

Please note: installation of the Prometheus Operator requires the authentication webhook to be enabled on the worker nodes. Currently this is not enabled by default; we will be releasing a change to enable it in early 2020. Until then, customers can execute the following command on the AKS worker nodes (using SSH, VMSS run-commands, etc.) - this change will not persist through an upgrade or scale-out.

Please test, verify and re-verify all commands in test systems/clusters to ensure they match your configuration

sed -i 's/--authorization-mode=Webhook/--authorization-mode=Webhook --authentication-token-webhook=true/g' /etc/default/kubelet
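If you apply the change directly over SSH, a quick sanity check before restarting kubelet (not an official step, just a verification):

# Should print the kubelet flags line including --authentication-token-webhook=true
grep 'authentication-token-webhook=true' /etc/default/kubelet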

Using the Azure CLI (VMSS cluster):

command="sed -i 's/--authorization-mode=Webhook/--authorization-mode=Webhook --authentication-token-webhook=true/g' /etc/default/kubelet"
az vmss run-command invoke -g "${cluster_resource_group}" \
  -n "${cluster_scaleset}" \
  --instance "${vmss_instance_id}" \
  --command-id RunShellScript -o json --scripts "${command}" | jq -r '.value[].message'

After this is done, run systemctl restart kubelet on the worker nodes.
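If you are using the run-command approach rather than SSH, the restart can be issued the same way (a sketch reusing the variables from the command above):

command="systemctl restart kubelet"
az vmss run-command invoke -g "${cluster_resource_group}" \
  -n "${cluster_scaleset}" \
  --instance "${vmss_instance_id}" \
  --command-id RunShellScript -o json --scripts "${command}" | jq -r '.value[].message'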

You can install the operator using helm: https://github.com/helm/charts/tree/master/stable/prometheus-operator

kubectl create namespace monitoring
helm install monitoring stable/prometheus-operator --namespace=monitoring
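To verify the install and reach the bundled Grafana dashboards locally (the Grafana service name is derived from the release name "monitoring" and may differ in your cluster):

kubectl -n monitoring get pods
# Grafana listens on port 80 in-cluster; browse to http://localhost:3000 once this is running.
kubectl -n monitoring port-forward svc/monitoring-grafana 3000:80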

Warning: Deployments such as this with many containers (istio as well) may fail transiently; if this occurs, delete the namespace and re-run the installation. This happens due to existing IO saturation/throttling - large deployments / deployment load will also trigger this issue.

For more information, please see: https://github.com/helm/charts/tree/master/stable/prometheus-operator.

The issue investigation walkthrough below uses the prometheus-operator.

Identification using Azure Monitor

Customers can enable the following metrics at the AKS cluster node pool and VM instance level. These do not show all of the metrics and data exposed by the prometheus operator, or what exactly is failing, but they do indicate that the problem is occurring and how severe the load is. A CLI sketch for retrieving these metrics follows the list.

  • OS Disk Queue Depth (Preview)
  • OS Disk Read Operations/sec (Preview)
  • OS Disk Write Operations/sec (Preview)
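A hedged sketch of pulling these from the Azure CLI - the exact metric names may vary while they are in preview, so list the metric definitions first and use the names it reports:

# Resolve the node pool scale set's resource ID, list the available metric names, then query one of them.
vmss_id=$(az vmss show -g "${cluster_resource_group}" -n "${cluster_scaleset}" --query id -o tsv)
az monitor metrics list-definitions --resource "${vmss_id}" -o table
az monitor metrics list --resource "${vmss_id}" --metric "OS Disk Read Operations/Sec" -o table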

UPDATE 02/06/2020: Linux OS disk queue depth as represented in the portal - and in general - is not a valid indicator of this issue. We have invalidated this metric and are working on deploying eBPF tracing and other metrics to assist customers with identification. For now we recommend using the prometheus-operator USE Method (Node) and container operation latency reports:

[Image: USE Method (Node) Grafana dashboard]

Below is an example view - all spikes in the OS disk queue depth were the result of IO throttling:

[Image: OS disk queue depth chart - spikes correspond to IO throttling]

For more on Linux block device settings and queues, see "Performance Tuning on Linux — Disk I/O" and "Improving Linux System Performance with I/O Scheduler Tuning".

Investigation using prometheus-operator

What follows is an example root-cause investigation using the prometheus operator data.

For this test, the cluster has 5 nodes, with the prometheus operator installed, as well as istio (default configuration). Symptoms including API server disconnects, lost pods/containers and system latency were identified between 2019-12-16 22:00 and 2019-12-17 01:00 UTC.

From the USE Cluster grafana chart:

[Image: USE Method (Cluster) Grafana dashboard]

Load was spiky overall, with periodic peaks well above the average. This is reflected in the Disk IO charts at the bottom - Disk IO Saturation in particular shows periodic heavy spikes; this is the saturation/throttle scenario.

Going to the first node in the USE Node chart:

[Image: USE Method (Node) Grafana dashboard for the first node]

Zooming in:

[Image: zoomed-in view of the node's disk saturation spikes]

Now we correlate that window of time in the Node chart (host metrics):

[Image: Node (host metrics) dashboard for the same time window]

The impact is shown in the Kubelet chart:

[Image: Kubelet dashboard showing operation latency spikes]

Those spikes should not be occurring.

Examining the same cluster for a different window of time (all events follow the same pattern), but zooming in on the IO / operation view:

[Image: disk IO / container operation latency view during a throttling event]

As you can see, a spike in throttling/disk saturation causes significant latency in container_sync and other critical container lifecycle operations. Pod sandbox creation, liveness probes, etc. also begin to time out and fail.

These failures ripple through the entire stack above - from the view of a running application this would look like widespread deployment/container timeouts, slow performance, etc.

Shifting to the pod utilization/load per node, there is an additional jump and churn in the utilization of the tunnelfront pod. These pods act as the bridge between your worker nodes and the API server; during this period the API server would appear slow to respond or timed out because tunnelfront was unable to read from its socket:

[Image: per-node pod utilization showing tunnelfront churn]

The prometheus operator offers a consistent set of defaults and metrics that allow a deeper look into operational issues impacting worker nodes. We will continue to iterate on additional support and other prometheus-operator extensions (such as alertmanager configurations).

Please contact and work with Azure support to identify whether you are impacted by this issue.
