diff --git a/content/learning-paths/servers-and-cloud-computing/disk-io-benchmark/_index.md b/content/learning-paths/servers-and-cloud-computing/disk-io-benchmark/_index.md index 90cb9d9675..810ea8905b 100644 --- a/content/learning-paths/servers-and-cloud-computing/disk-io-benchmark/_index.md +++ b/content/learning-paths/servers-and-cloud-computing/disk-io-benchmark/_index.md @@ -1,21 +1,17 @@ --- -title: Microbenchmark Storage Performance with fio - -draft: true -cascade: - draft: true +title: Microbenchmark storage performance with fio on Arm minutes_to_complete: 30 -who_is_this_for: This is an introductory topic for developers seeking to optimize storage costs and performance, identify bottlenecks, and navigate storage considerations during application migration across platforms. +who_is_this_for: This is an introductory topic for developers looking to optimize storage performance, reduce costs, identify bottlenecks, and evaluate storage options when migrating applications across platforms. learning_objectives: - - Understand the flow of data for storage devices. - - Use basic observability utilities such as iostat, iotop and pidstat. - - Understand how to run fio for microbenchmarking a block storage device. + - Describe data flow through storage devices. + - Monitor storage performance using tools like iostat, iotop, and pidstat. + - Run fio to microbenchmark a block storage device. prerequisites: - - An [Arm based instance](/learning-paths/servers-and-cloud-computing/csp/) from a cloud service provider or an Arm Linux server. + - An [Arm-based instance](/learning-paths/servers-and-cloud-computing/csp/) from a cloud service provider or an Arm Linux server. - Familiarity with Linux. author: Kieran Hejmadi diff --git a/content/learning-paths/servers-and-cloud-computing/disk-io-benchmark/characterising-workload.md b/content/learning-paths/servers-and-cloud-computing/disk-io-benchmark/characterising-workload.md index af58524576..53c1150036 100644 --- a/content/learning-paths/servers-and-cloud-computing/disk-io-benchmark/characterising-workload.md +++ b/content/learning-paths/servers-and-cloud-computing/disk-io-benchmark/characterising-workload.md @@ -1,37 +1,39 @@ --- -title: Characterizing a workload +title: Analyzing I/O behavior with real workloads weight: 3 ### FIXED, DO NOT MODIFY layout: learningpathall --- -## Basic Characteristics +## Workload attributes -The basic attributes of a given workload are the following. +The basic attributes of a given workload are the following: -- IOPS -- I/O Size -- Throughput -- Read to Write Ratio -- Random vs Sequential access +- IOPS. +- I/O size. +- Throughput. +- Read-to-write ratio. +- Random vs. sequential access. -There are many more characteristics to observe, such as latency, but since this is an introductory topic you will mostly stick to the high-level metrics listed above. +While latency is also an important factor, this section focuses on these high-level metrics to establish a foundational understanding. -## Run an Example Workload +## Run an example workload Connect to an Arm-based server or cloud instance. -As an example workload, you can use the media manipulation tool, FFMPEG, on an AWS `t4g.medium` instance. The `t4g.medium` is an Arm-based (AWS Graviton2) virtual machine with 2 vCPUs, 4 GiB of memory, and is designed for general-purpose workloads with a balance of compute, memory, and network resources. +As an example workload, use the media manipulation tool, FFMPEG on an AWS `t4g.medium` instance. 
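If you prefer to work from the command line rather than the console, you can launch an instance of this type with the AWS CLI. The command below is a sketch only: the AMI ID, key pair, security group, and subnet are placeholders that you must replace with values from your own account.

```bash
# Launch a t4g.medium (Graviton2) instance -- all IDs below are placeholders
aws ec2 run-instances \
  --image-id ami-0123456789abcdef0 \
  --instance-type t4g.medium \
  --key-name my-key-pair \
  --security-group-ids sg-0123456789abcdef0 \
  --subnet-id subnet-0123456789abcdef0 \
  --count 1
```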
-First, install the required tools. +This is an Arm-based (AWS Graviton2) virtual machine with two vCPUs and 4 GiB of memory, designed for general-purpose workloads with a balance of compute, memory, and network resources. + +First, install the required tools: ```bash sudo apt update sudo apt install ffmpeg iotop -y ``` -Download the popular reference video for transcoding, `BigBuckBunny.mp4`, which is available under the [Creative Commons 3.0 License](https://creativecommons.org/licenses/by/3.0/). +Download the sample video `BigBuckBunny.mp4`, available under the [Creative Commons Attribution 3.0 License](https://creativecommons.org/licenses/by/3.0/). ```bash cd ~ @@ -39,13 +41,15 @@ mkdir src && cd src wget http://commondatastorage.googleapis.com/gtv-videos-bucket/sample/BigBuckBunny.mp4 ``` -Run the following command to begin transcoding the video and audio using the `H.264` and `aac` transcoders respectively. The `-flush_packets` flag forces FFMPEG to write each chunk of video data from memory to storage immediately, rather than buffering it in memory. This reduces the risk of data loss in case of a crash and allows you to observe more frequent disk writes during the transcoding process. +Run the following command to begin transcoding the video and audio using the `H.264` and `aac` transcoders respectively. The `-flush_packets` flag forces FFMPEG to write each chunk of video data from memory to storage immediately, rather than buffering it in memory. + +This reduces the risk of data loss in case of a crash and allows disk write activity to be more observable during monitoring, making it easier to study write behavior in real-time. ```bash ffmpeg -i BigBuckBunny.mp4 -c:v libx264 -preset fast -crf 23 -c:a aac -b:a 128k -flush_packets 1 output_video.mp4 ``` -### Observe Disk Usage +### Observe disk usage While the transcoding is running, you can use the `pidstat` command to see the disk statistics of that specific process. @@ -73,7 +77,7 @@ Linux 6.8.0-1024-aws (ip-10-248-213-118) 04/15/25 _aarch64_ In this simple example, since you are interacting with a file on the mounted filesystem, you are also observing the behavior of the filesystem. {{% /notice %}} -There may be other processes or background services that are writing to this disk. You can use the `iotop` command for inspection. As shown in the output below, the `ffmpeg` process has the highest disk utilization. +There might be other processes or background services that are writing to this disk. You can use the `iotop` command for inspection. As shown in the output below, the `ffmpeg` process has the highest disk utilization. ```bash sudo iotop @@ -88,7 +92,13 @@ Current DISK READ: 0.00 B/s | Current DISK WRITE: 0.00 B/s 2 be/4 root 0.00 B/s 0.00 B/s [kthreadd] ``` -Using the input/output statistics command (`iostat`), you can observe the system-wide metrics from the `nvme0n1` drive. Please note that you are using a snapshot of this workload; more accurate characteristics can be obtained by measuring the distribution of a workload. +Using the input/output statistics command (`iostat`), you can observe the system-wide metrics from the `nvme0n1` drive. + +{{% notice Note%}} +Be mindful of the fact that you are using a snapshot of this workload; more accurate characteristics can be obtained by measuring the distribution of a workload. 
+{{% /notice %}} + + ```bash watch -n 0.1 iostat -z nvme0n1 @@ -113,7 +123,7 @@ Device r/s rkB/s rrqm/s %rrqm r_await rareq-sz w/s wkB nvme0n1 0.66 29.64 0.24 26.27 0.73 44.80 2.92 203.88 3.17 52.01 2.16 69.70 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.01 0.15 ``` -### Basic Characteristics of the Example Workload +### Basic attributes of the example workload This is a simple transcoding workload with flushed writes, where most data is processed and stored in memory. Disk I/O is minimal, with an IOPS of just 3.81, low throughput (248.71 kB/s), and an average IO depth of 0.01 — all summarized in very low disk utilization. The 52% write merge rate and low latencies further suggest sequential, infrequent disk access, reinforcing that the workload is primarily memory-bound. diff --git a/content/learning-paths/servers-and-cloud-computing/disk-io-benchmark/introduction.md b/content/learning-paths/servers-and-cloud-computing/disk-io-benchmark/introduction.md index f01b16b6a9..7eb99aa93d 100644 --- a/content/learning-paths/servers-and-cloud-computing/disk-io-benchmark/introduction.md +++ b/content/learning-paths/servers-and-cloud-computing/disk-io-benchmark/introduction.md @@ -8,40 +8,66 @@ layout: learningpathall ## Introduction -Ideally, your system's storage activity should be zero—meaning all application data and instructions are available in memory or cache, with no reads or writes to hard disk drives (HDDs) or solid-state drives (SSDs) required. However, due to physical capacity limits, data volatility, and the need to store large amounts of data, most applications frequently access storage media. +Performance-sensitive application data - such as frequently-accessed configuration files, logs, or transactional state - should ideally reside in system memory (RAM) or CPU cache, where data access latency is measured in nanoseconds to microseconds. These are the fastest tiers in the memory hierarchy, enabling rapid read and write operations that reduce latency and improve throughput. -## High-Level Flow of Data +However, random-access memory (RAM) has the following constraints: -The diagram below provides a high-level overview of how data is written to or read from a storage device. It illustrates a multi-disk I/O architecture, where each disk (Disk 1 to Disk N) has its own I/O queue and optional disk cache, communicating with a central CPU via a disk controller. Memory, not explicitly shown, sits between the CPU and storage, offering fast but volatile access. File systems, also not depicted, operate at the OS/kernel level to handle file access metadata and provide a user-friendly interface through files and directories. +* It is volatile - data is lost on power down. +* It is limited in capacity. +* It is more expensive per gigabyte than other storage types. -![disk i/o](./diskio.jpeg) +For these reasons, most applications also rely on solid-state drives (SSDs) or hard disk drives (HDDs). + +## High-level view of data flow + +The diagram below shows a high-level view of how data moves to and from storage in a multi-disk I/O architecture. Each disk (Disk 1 to Disk N) has its own I/O queue and optional disk cache, communicating with a central CPU through a disk controller. + +While memory is not shown, it plays a central role in providing fast temporary access between the CPU and persistent storage. Likewise, file systems (also not depicted) run in the OS kernel and manage metadata, access permissions, and user-friendly file abstractions. 
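You can see parts of this stack on a running Linux system. The commands below are a quick sketch that assumes a device named `nvme0n1`; substitute a block device name from your own machine.

```bash
# Every block device the kernel manages appears under /sys/block
ls /sys/block/

# List block devices with their size and whether they are rotational (HDD) or not (SSD)
lsblk -o NAME,TYPE,SIZE,ROTA,MOUNTPOINT

# Per-device request queue settings (assuming the device is nvme0n1)
cat /sys/block/nvme0n1/queue/nr_requests   # depth of the OS-side request queue
cat /sys/block/nvme0n1/queue/write_cache   # whether a volatile write cache is enabled
```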
+ +This architecture has the following benefits: + +* It enables parallelism in I/O operations. +* It improves throughput. +* It supports scalability across multiple storage devices. + +![disk i/o alt-text#center](./diskio.jpeg "A high-level view of data flow in a multi-disk I/O architecture.") ## Key Terms #### Sectors and Blocks -Sectors are the basic physical units on a storage device. Traditional hard drives typically use a sector size of 512 bytes, while many modern disks use 4096 bytes (4K sectors) for improved error correction and efficiency. +* *Sectors* are the smallest physical storage units, typically 512 or 4096 bytes. Many modern drives use 4K sectors for better error correction and efficiency. + +* *Blocks* are logical groupings of one or more sectors used by file systems, typically 4096 bytes in size. A block might span multiple 512-byte sectors or align directly with 4K physical sectors if supported by the device. -Blocks are logical groupings of one or more sectors used by filesystems for data organization. A common filesystem block size is 4096 bytes, meaning each block might consist of eight 512-byte sectors, or map directly to a 4096-byte physical sector if supported by the disk. #### Input/Output Operations per Second (IOPS) -IOPS measures how many random read or write requests your storage system can handle per second. IOPS can vary by block size and storage medium (e.g., flash drives). Traditional HDDs often do not specify IOPS; for example, AWS does not show IOPS values for HDD volumes. +IOPS measures how many random read/write requests your storage system can perform per second. It depends on the block size or device type. For example, AWS does not show IOPS values for traditional HDD volumes, as shown in the image below: + +![iops_hdd alt-text#center](./IOPS.png "Example where IOPS values are not provided.") + +#### Throughput and bandwidth -![iops_hdd](./IOPS.png) +* *Throughput* is the data transfer rate, usually measured in MB/s. -#### Throughput and Bandwidth +* *Bandwidth* is the maximum potential transfer rate of a connection. -Throughput is the data transfer rate, usually measured in MB/s. Bandwidth specifies the maximum amount of data a connection can transfer. You can calculate storage throughput as IOPS × block size. +You can calculate storage `throughput as IOPS × block size`. -#### Queue Depth +#### Queue depth -Queue depth is the number of simultaneous I/O operations that can be pending on a device. Consumer SSDs typically have a queue depth of 32–64, while enterprise-class NVMe drives can support hundreds or thousands of concurrent requests per queue. Higher queue depth allows more parallelism and can improve I/O performance. +*Queue depth* is the number of I/O operations a device can process concurrently. Consumer SSDs typically support a queue depth of 32–64, while enterprise-class NVMe drives can support hundreds to thousands of concurrent requests per queue. Higher queue depths allow more operations to be handled simultaneously, which can significantly boost throughput on high-performance drives — especially NVMe SSDs with advanced queuing capabilities. + +#### I/O engine -#### I/O Engine +The I/O engine is the software layer in Linux that manages I/O requests between applications and storage. For example, the Linux kernel's block I/O scheduler queues and dispatches requests to device drivers, using multiple queues to optimize disk access. 
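You can check which scheduler is active for a given device. The device name `nvme0n1` below is an assumption; adjust it for your system. The scheduler shown in square brackets is the one currently in use; on many NVMe SSDs this is `none`, because fast devices need little request reordering from the kernel.

```bash
# Show the available I/O schedulers for the device; the active one is in brackets
cat /sys/block/nvme0n1/queue/scheduler
```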
Benchmarking tools like `fio` let you choose different I/O engines: -The I/O engine is the software component in Linux that manages I/O requests between applications and the storage subsystem. For example, the Linux kernel's block I/O scheduler queues and dispatches requests to device drivers, using multiple queues to optimize disk access. In benchmarking tools like fio, you can select I/O engines such as sync (synchronous I/O), `libaio` (Linux native asynchronous I/O), or `io_uring` (which uses newer Linux kernel features for asynchronous I/O). +* `sync` – Performs blocking I/O operations using standard system calls. Simple and portable, but less efficient under high concurrency. +* `libaio` – Uses Linux's native asynchronous I/O interface (`io_submit`/`io_getevents`) for non-blocking operations with lower overhead than `sync`. +* `io_uring` – A modern, high-performance async I/O API introduced in Linux 5.1. It minimizes syscalls and context switches, and supports advanced features like buffer selection and batched submissions. -#### I/O Wait + +#### I/O wait -I/O wait is the time a CPU core spends waiting for I/O operations to complete. +I/O wait is the time a CPU core spends waiting for I/O operations to complete. Tools like `pidstat`, `top`, and `iostat` can help identify storage-related CPU bottlenecks. diff --git a/content/learning-paths/servers-and-cloud-computing/disk-io-benchmark/using-fio.md b/content/learning-paths/servers-and-cloud-computing/disk-io-benchmark/using-fio.md index 6e7e0a40ed..76c18c664e 100644 --- a/content/learning-paths/servers-and-cloud-computing/disk-io-benchmark/using-fio.md +++ b/content/learning-paths/servers-and-cloud-computing/disk-io-benchmark/using-fio.md @@ -1,5 +1,5 @@ --- -title: Using fio +title: Benchmarking block storage performance with fio weight: 4 ### FIXED, DO NOT MODIFY @@ -8,37 +8,44 @@ layout: learningpathall ## Install fio -You can use the same `t4g.medium` instance from the previous section with 2 different types of SSD-based block storage devices as per the console screenshot below. +You can use the same `t4g.medium` instance from the previous section with two different types of SSD-based block storage devices as shown in the console screenshot below. +### Attach and Identify Block Devices To add the required EBS volumes to your EC2 instance: -1. In the AWS Console, navigate to EC2 > Volumes > Create Volume +1. In the AWS Console, navigate to **EC2** > **Volumes** > **Create Volume**. + 2. Create a volume with the following settings: - - Volume Type: io2 (Provisioned IOPS SSD) - - Size: 8 GiB - - IOPS: 400 - - Availability Zone: Same as your EC2 instance + - Volume Type: io2 (Provisioned IOPS SSD). + - Size: 8 GiB. + - IOPS: 400. + - Availability Zone: The same as your EC2 instance + 3. Create another volume with the following settings: - - Volume Type: gp2 (General Purpose SSD) - - Size: 8 GiB - - Availability Zone: Same as your EC2 instance -4. Once created, select each volume and choose Actions > Attach Volume -5. Select your t4g.medium instance from the dropdown and attach each volume + - Volume Type: gp2 (General Purpose SSD). + - Size: 8 GiB. + - Availability Zone: The same as your EC2 instance. + +4. Once created, select each volume and choose **Actions** > **Attach Volume**. -Both block devices have the same, 8GiB capacity but the `io2` is geared towards throughput as opposed to the general purpose SSD `gp2`. +5. Select your t4g.medium instance from the dropdown and attach each volume. 
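If you would rather script this setup, you can create and attach the same volumes with the AWS CLI. The commands below are a sketch only: the Availability Zone, volume IDs, instance ID, and device name are placeholders for your own values.

```bash
# Create an 8 GiB io2 volume with 400 provisioned IOPS (adjust the Availability Zone)
aws ec2 create-volume --volume-type io2 --size 8 --iops 400 --availability-zone us-east-1a

# Create an 8 GiB gp2 volume in the same Availability Zone
aws ec2 create-volume --volume-type gp2 --size 8 --availability-zone us-east-1a

# Attach each volume to the instance (replace the volume and instance IDs)
aws ec2 attach-volume --volume-id vol-0123456789abcdef0 \
  --instance-id i-0123456789abcdef0 --device /dev/sdf
```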
-![EBS](./EBS.png) +Both block devices have the same 8 GiB capacity, but the `io2` is optimized for throughput, while `gp2` is general-purpose. -In this section you will observe what the real-world performance for your workload is so that it can inform your selection. +![EBS alt-text#center](./EBS.png "Multi-volume storage information.") -Flexible I/O (fio) is a command-line tool to generate a synthetic workload with specific I/O characteristics. This serves as a simpler alternative to full record and replay testing. Fio is available through most Linux distribution packages, please refer to the [documentation](https://github.com/axboe/fio) for the binary package availability. +In this section, you’ll measure real-world performance to help guide your storage selection. + +Flexible I/O (fio) is a command-line tool to generate a synthetic workload with specific I/O characteristics. This serves as a simpler alternative to full record and replay testing. + +Fio is available through most Linux distribution packages, see the [documentation](https://github.com/axboe/fio) for package availability. ```bash sudo apt update sudo apt install fio -y ``` -Confirm installation with the following commands. +Confirm installation with the following command: ```bash fio --version @@ -50,9 +57,9 @@ The version is printed: fio-3.37 ``` -## Locate Device +## Identify Device Names for Benchmarking -Fio allows you to microbenchmark either the block device or a mounted filesystem. Use the disk free, `df` command to confirm your EBS volumes are not mounted. Writing to drives that hold critical information may cause issues. Hence you are writing to blank, unmounted block storage device. +Fio allows you to microbenchmark either the block device or a mounted filesystem. Use the disk free, `df` command to confirm your EBS volumes are not mounted. Writing to drives containing critical data can result in data loss. In this tutorial, you're writing to blank, unmounted block devices. Use the `lsblk` command to view the EBS volumes attached to the server (`nvme1n1` and `nvme2n1`). The immediate number appended to `nvme`, e.g., `nvme0`, shows it is a physically separate device. `nvme1n1` corresponds to the faster `io2` block device and `nvme2n1` corresponds to the slower `gp2` block device. @@ -70,23 +77,26 @@ nvme0n1 259:1 0 8G 0 disk nvme2n1 259:2 0 8G 0 disk ``` -{{% notice Please Note%}} -If you have more than 1 block volumes attached to an instance, the `sudo nvme list` command from the `nvme-cli` package can be used to differentiate between volumes +{{% notice Note%}} +If you have more than one block volume attached to an instance, the `sudo nvme list` command from the `nvme-cli` package can be used to differentiate between volumes {{% /notice %}} -## Generating a Synthetic Workload +## Generating a synthetic workload -Suppose you want to simulate a fictional logging application with the following characteristics observed using the tools from the previous section. +Let’s define a synthetic workload that mimics the behavior of a logging application, using metrics observed earlier. {{% notice Workload%}} -The logging workload has light sequential read and write characteristics. The system write throughput per thread is 5 MB/s with 83% writes. There are infrequent bursts of reads for approximately 5 seconds, operating at up to 16MB/s per thread. The workload can scale the infrequent reads and writes to use up to 16 threads each. 
The block size for the writes and reads are 64KiB and 256KiB respectively (as opposed to the standard 4KiB Page size). +This workload involves light, sequential reads and writes. The system write throughput per thread is 5 MB/s with 83% writes. There are infrequent bursts of reads for approximately 5 seconds, operating at up to 16MB/s per thread. The workload can scale the infrequent reads and writes to use up to 16 threads each. The block size for the writes and reads are 64KiB and 256KiB respectively (as opposed to the standard 4KiB Page size). Further, the application is latency sensitive and given it holds critical information, needs to write directly to non-volatile storage through direct IO. {{% /notice %}} -The fio tool uses simple configuration `jobfiles` to describe the characteristics of your synthetic workload. Parameters under the `[global]` option are shared among jobs. From the example below, you can create 2 jobs to represent the steady write and infrequent reads. Please refer to the official [documentation](https://fio.readthedocs.io/en/latest/fio_doc.html#job-file-format) for more details. +The fio tool uses simple configuration files - called `jobfiles` - to describe the characteristics of your synthetic workload. Parameters under the `[global]` option are shared among jobs. From the example below, you can create 2 jobs to represent the steady write and infrequent reads. Please refer to the official [documentation](https://fio.readthedocs.io/en/latest/fio_doc.html#job-file-format) for more details. + +### Create fio Job Files -Copy and paste the configuration file below into 2 files named `nvme.fio`. Replace the `` with the block devices you are comparing and adjust the `filename` parameter accordingly. + +Create two job files, one for each device, by copying the configuration below and adjusting the filename parameter (`/dev/nvme1n1` or `/dev/nvme2n1`): ```ini ; -- start job file including.fio -- @@ -111,17 +121,20 @@ bs=64k ; Block size of 64KiB (default block size of 4 KiB) name=burst_read rw=read bs=256k ; Block size of 256KiB for reads (default is 4KiB) -startdelay=10 ; simulate infrequent reads (5 seconds out 30) +startdelay=25 ; simulate a 5-second read burst at the end of a 30-second window runtime=5 ; -- end job file including.fio -- ``` +## Run the Benchmarks {{% notice Note %}} Running fio directly on block devices requires root privileges (hence the use of `sudo`). Be careful: writing to the wrong device can result in data loss. Always ensure you are targeting a blank, unmounted device. {{% /notice %}} -Run the following commands to run each test back to back. + + +Run the following commands to execute each test sequentially: ```bash sudo NUM_JOBS=16 IO_DEPTH=64 fio nvme1.fio @@ -157,6 +170,10 @@ Disk stats (read/write): nvme2n1: ios=1872/28855, sectors=935472/3693440, merge=0/0, ticks=159753/1025104, in_queue=1184857, util=89.83% ``` +{{% notice Note%}} +fio reports bandwidth in MiB/s (mebibytes per second). MB/s (megabytes per second) is included for comparison. 1 MiB = 1,048,576 bytes, while 1 MB = 1,000,000 bytes. +{{% /notice %}} + Here you can see that the faster `io2` block storage (`nvme1`) is able to meet the throughput requirement of 80MB/s for steady writes when all 16 write threads are running (5MB/s per thread). However, `gp2` saturates at 60.3 MiB/s with over 89.8% SSD utilization. Suppose your fictional logging application is sensitive to operation latency. 
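To compare latency in more detail than the human-readable summary, you can ask fio for machine-readable results. The sketch below assumes fio 3.x JSON output and that `jq` is installed; exact field names can vary between fio versions.

```bash
# Re-run one job with JSON output written to a file
sudo NUM_JOBS=16 IO_DEPTH=64 fio --output-format=json --output=nvme1.json nvme1.fio

# Print each job's 99th-percentile write completion latency (in nanoseconds)
jq '.jobs[] | {job: .jobname, p99_write_clat_ns: .write.clat_ns.percentiles."99.000000"}' nvme1.json
```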
The latency percentiles in the output below show that over ~35% of operations on `nvme2` complete with a latency above one second, compared with roughly 7% on `nvme1`. High latency percentiles like these can significantly impact application responsiveness, especially for latency-sensitive workloads such as logging.