diff --git a/content/learning-paths/servers-and-cloud-computing/disk-io-benchmark/_index.md b/content/learning-paths/servers-and-cloud-computing/disk-io-benchmark/_index.md
index 71682169c2..90cb9d9675 100644
--- a/content/learning-paths/servers-and-cloud-computing/disk-io-benchmark/_index.md
+++ b/content/learning-paths/servers-and-cloud-computing/disk-io-benchmark/_index.md
@@ -1,5 +1,5 @@
 ---
-title: Microbenchmark Storage Performance with Fio
+title: Microbenchmark Storage Performance with fio
 
 draft: true
 cascade:
@@ -7,16 +7,16 @@ cascade:
 
 minutes_to_complete: 30
 
-who_is_this_for: A cloud developer who wants to optimize storage cost or performance of their application. Developers who want to uncover potential storage-bound bottlenecks or changes when migrating an application to a different platform.
+who_is_this_for: This is an introductory topic for developers seeking to optimize storage costs and performance, identify bottlenecks, and navigate storage considerations during application migration across platforms.
 
 learning_objectives:
-  - Understand the flow of data for storage devices
-  - Use basic observability utilities such as iostat, iotop and pidstat
-  - Understand how to run fio for microbenchmarking a block storage device
+  - Understand the flow of data for storage devices.
+  - Use basic observability utilities such as iostat, iotop, and pidstat.
+  - Understand how to run fio for microbenchmarking a block storage device.
 
 prerequisites:
-  - Access to an Arm-based server
-  - Basic understanding of Linux
+  - An [Arm-based instance](/learning-paths/servers-and-cloud-computing/csp/) from a cloud service provider or an Arm Linux server.
+  - Familiarity with Linux.
 
 author: Kieran Hejmadi
 
@@ -31,7 +31,6 @@ tools_software_languages:
 
 operatingsystems:
     - Linux
 
-
 further_reading:
     - resource:
         title: Fio documentation
diff --git a/content/learning-paths/servers-and-cloud-computing/disk-io-benchmark/characterising-workload.md b/content/learning-paths/servers-and-cloud-computing/disk-io-benchmark/characterising-workload.md
index 06116a2a22..af58524576 100644
--- a/content/learning-paths/servers-and-cloud-computing/disk-io-benchmark/characterising-workload.md
+++ b/content/learning-paths/servers-and-cloud-computing/disk-io-benchmark/characterising-workload.md
@@ -1,5 +1,5 @@
 ---
-title: Characterising a Workload
+title: Characterizing a workload
 weight: 3
 
 ### FIXED, DO NOT MODIFY
@@ -16,42 +16,44 @@ The basic attributes of a given workload are the following.
 - Read to Write Ratio
 - Random vs Sequential access
 
-There are many more characteristics to observe, just as latency but since this is an introductory topic we will mostly stick to the high-level metrics listed above.
+There are many more characteristics to observe, such as latency, but since this is an introductory topic you will mostly stick to the high-level metrics listed above.
 
-## Running an Example Workload
+## Run an Example Workload
 
-Connect to an Arm-based cloud instance. As an example workload, we will be using the media manipulation tool, FFMPEG on an AWS `t4g.medium` instance.
+Connect to an Arm-based server or cloud instance.
 
-First install the prequistite tools.
+As an example workload, you can use the media manipulation tool, FFmpeg, on an AWS `t4g.medium` instance. The `t4g.medium` is an Arm-based (AWS Graviton2) virtual machine with 2 vCPUs, 4 GiB of memory, and is designed for general-purpose workloads with a balance of compute, memory, and network resources.
+
+First, install the required tools.
 
 ```bash
 sudo apt update
 sudo apt install ffmpeg iotop -y
 ```
 
-Download the popular reference video for transcoding, `BigBuckBunny.mp4` which is available under the [Creative Commons 3.0 License](https://creativecommons.org/licenses/by/3.0/).
+Download the popular reference video for transcoding, `BigBuckBunny.mp4`, which is available under the [Creative Commons 3.0 License](https://creativecommons.org/licenses/by/3.0/).
 
 ```bash
 cd ~
-mkdir src
-cd src
+mkdir src && cd src
 wget http://commondatastorage.googleapis.com/gtv-videos-bucket/sample/BigBuckBunny.mp4
 ```
 
-Run the following command to begin transcoding the video and audio using the `H.264` and `aac` transcoders respectively. We use the `-flush_packets` flag to write each chunk of video back to storage from memory.
+Run the following command to begin transcoding the video and audio using the `H.264` and `aac` transcoders, respectively. The `-flush_packets` flag forces FFmpeg to write each chunk of video data from memory to storage immediately, rather than buffering it in memory. This reduces the risk of data loss in case of a crash and allows you to observe more frequent disk writes during the transcoding process.
 
 ```bash
 ffmpeg -i BigBuckBunny.mp4 -c:v libx264 -preset fast -crf 23 -c:a aac -b:a 128k -flush_packets 1 output_video.mp4
 ```
 
-### Observing Disk Usage
-Whilst the transcoding is running, we can use the `pidstat` command to see the disk statistics of that specific process.
+### Observe Disk Usage
+
+While the transcoding is running, you can use the `pidstat` command to see the disk statistics of that specific process.
 
 ```bash
 pidstat -d -p $(pgrep ffmpeg) 1
 ```
 
-Since this example `151MB` video fits within memory, we observe no `kB_rd/s` for the storage device after the initial read. However, since we are flushing to storage we observe period ~275 `kB_wr/s`.
+
+Since this example video (151 MB) fits within memory, you observe no `kB_rd/s` for the storage device after the initial read. However, because you are flushing to storage, you observe periodic writes of approximately 275 `kB_wr/s`.
 
 ```output
 Linux 6.8.0-1024-aws (ip-10-248-213-118)     04/15/25     _aarch64_    (2 CPU)
@@ -67,11 +69,11 @@ Linux 6.8.0-1024-aws (ip-10-248-213-118)     04/15/25     _aarch64_
 10:01:32     1000     24250      0.00    344.00      0.00       0  ffmpeg
 ```
 
-{{% notice Please Note%}}
-In this simple example, since we are interacting with a file on the mounted filesystem, we are also observing the behaviour of the filesystem.
+{{% notice Note %}}
+In this simple example, since you are interacting with a file on the mounted filesystem, you are also observing the behavior of the filesystem.
 {{% /notice %}}
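+
+You can also let `pidstat` collect a fixed number of samples and average the per-process write rate. The short script below is a sketch that assumes the output format shown above, with `kB_wr/s` in the fifth column; adjust the column index if your version of `sysstat` prints different fields.
+
+```bash
+# Collect 30 one-second samples for the running ffmpeg process
+pidstat -d -p "$(pgrep ffmpeg)" 1 30 > pidstat.log
+
+# Average the kB_wr/s column (5th field) across the per-second samples
+awk '/ffmpeg$/ && $1 != "Average:" { wr += $5; n++ } END { if (n) printf "average kB_wr/s: %.1f over %d samples\n", wr/n, n }' pidstat.log
+```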
 
-Of course, there may be other processes or background services that are writing to this disk. We can use `iotop` command for inspection. As per the output below, the `ffmpeg` process has the greatest disk utilisation.
+There may be other processes or background services that are writing to this disk. You can use the `iotop` command for inspection. As shown in the output below, the `ffmpeg` process has the highest disk utilization.
 
 ```bash
 sudo iotop
 ```
@@ -86,33 +88,34 @@ Current DISK READ:         0.00 B/s | Current DISK WRITE:         0.00 B/s
     2 be/4 root        0.00 B/s    0.00 B/s  [kthreadd]
 ```
 
-Using the input, output statistics command (`iostat`) we can observe the system-wide metrics from the `nvme0n1` drive. Please Note that we are using a snapshot of this workload, more accurate characteristics can be obtained by measuring the distribution of a workload.
+Using the input/output statistics command (`iostat`), you can observe the system-wide metrics from the `nvme0n1` drive. Note that this is a snapshot of the workload; more accurate characteristics can be obtained by measuring the workload over a longer period and examining the distribution of values.
 
 ```bash
 watch -n 0.1 iostat -z nvme0n1
 ```
 
-You should see output similar to that below.
+You see output similar to that below.
 
 ```output
 Device             tps    kB_read/s    kB_wrtn/s    kB_dscd/s    kB_read    kB_wrtn    kB_dscd
 nvme0n1           3.81        31.63       217.08         0.00     831846    5709210          0
 ```
 
-To observe the more detailed metrics we can run `iostat` with the `-x` option.
+To observe more detailed metrics, you can run `iostat` with the `-x` option.
 
 ```bash
 iostat -xz nvme0n1
 ```
 
+The output is similar to:
+
 ```output
 Device            r/s     rkB/s   rrqm/s  %rrqm r_await rareq-sz     w/s     wkB/s   wrqm/s  %wrqm w_await wareq-sz     d/s     dkB/s   drqm/s  %drqm d_await dareq-sz     f/s f_await  aqu-sz  %util
 nvme0n1          0.66     29.64     0.24  26.27    0.73    44.80    2.92    203.88     3.17  52.01    2.16    69.70    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00    0.01   0.15
 ```
 
-### Basic Characteristics of our Example Workload
-
-This is a simple transcoding workload with flushed writes, where most data is processed and stored in memory. Disk I/O is minimal, with an IOPS of just 3.81, low throughput (248.71 kB/s), and an average IO depth of 0.01 — all summarised in very low disk utilization. The 52% write merge rate and low latencies further suggest sequential, infrequent disk access, reinforcing that the workload is primarily memory-bound.
+### Basic Characteristics of the Example Workload
+
+This is a simple transcoding workload with flushed writes, where most data is processed and stored in memory. Disk I/O is minimal, with an IOPS of just 3.81, low throughput (248.71 kB/s), and an average IO depth of 0.01 — all summarized in very low disk utilization. The 52% write merge rate and low latencies further suggest sequential, infrequent disk access, reinforcing that the workload is primarily memory-bound.
 
 | Metric             | Calculation Explanation                                                                                       | Value         |
 |--------------------|---------------------------------------------------------------------------------------------------------------|---------------|
@@ -124,9 +127,8 @@ This is a simple transcoding workload with flushed writes, where most data is pr
 | Read Ratio         | Read throughput ÷ total throughput: 31.63 / 248.71                                                           | ~13%          |
 | Write Ratio        | Write throughput ÷ total throughput: 217.08 / 248.71                                                         | ~87%          |
 | IO Depth           | Taken directly from `aqu-sz` (average number of in-flight I/Os)                                              | 0.01          |
-| Access Pattern     | Based on cache hits, merge rates, and low wait times. 52% of writes were merged (`wrqm/s` = 3.17, `w/s` = 2.92) → suggests mostly sequential access | Sequential-ish (52.01% merged) |
-
+| Access Pattern     | 52% of writes were merged (`wrqm/s` = 3.17, `w/s` = 2.92), indicating mostly sequential disk access with low wait times and frequent cache hits | Sequential (52.01% merged) |
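+
+If you want to sanity-check the derived values, the short calculation below reproduces them from the sample `iostat` figures above (31.63 kB/s read, 217.08 kB/s written, 3.81 tps); substitute your own measurements as needed.
+
+```bash
+awk 'BEGIN {
+  read_kbs = 31.63; write_kbs = 217.08; tps = 3.81   # values from the iostat sample above
+  total = read_kbs + write_kbs
+  printf "Total throughput     : %.2f kB/s\n", total
+  printf "Read ratio           : %.0f%%\n", 100 * read_kbs / total
+  printf "Write ratio          : %.0f%%\n", 100 * write_kbs / total
+  printf "Average request size : %.1f kB (throughput / IOPS)\n", total / tps
+}'
+```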
 
-{{% notice Please Note%}}
-If you have access to the workloads source code, the expected access patterns can more easily be observed.
+{{% notice Note %}}
+If you have access to the workload's source code, you can more easily observe the expected access patterns.
 {{% /notice %}}
diff --git a/content/learning-paths/servers-and-cloud-computing/disk-io-benchmark/introduction.md b/content/learning-paths/servers-and-cloud-computing/disk-io-benchmark/introduction.md
index 08a9a873d5..fac623a0d1 100644
--- a/content/learning-paths/servers-and-cloud-computing/disk-io-benchmark/introduction.md
+++ b/content/learning-paths/servers-and-cloud-computing/disk-io-benchmark/introduction.md
@@ -1,5 +1,5 @@
 ---
-title: Fundamentals of Storage Systems
+title: Fundamentals of storage systems
 weight: 2
 
 ### FIXED, DO NOT MODIFY
@@ -8,39 +8,40 @@ layout: learningpathall
 
 ## Introduction
 
-The ideal storage activity of your system is 0. In this situation all of your application data and instructions are available in memory or caches with no reads or writes to a spinning hard-disk drive or solid-state SSD required. However, due to physical capacity limitations, data volatility and need to store large amounts of data, many applications require frequent access to storage media.
+Ideally, your system's storage activity should be zero—meaning all application data and instructions are available in memory or cache, with no reads or writes to hard disk drives (HDDs) or solid-state drives (SSDs) required. However, due to physical capacity limits, data volatility, and the need to store large amounts of data, most applications frequently access storage media.
 
 ## High-Level Flow of Data
 
-The diagram below is a high-level overview of how data can be written or read from a storage device. This diagram illustrates a multi-disk I/O architecture where each disk (Disk 1 to Disk N) has an I/O queue and optional disk cache, communicating with a central CPU via a disk controller. Memory is not explicitly shown but resides between the CPU and storage, offering fast access times with the tradeoff of volatile. File systems, though not depicted, operate at the OS/kernel level to handling file access metadata and offer a friendly way to interact through files and directories.
+The diagram below provides a high-level overview of how data is written to or read from a storage device. It illustrates a multi-disk I/O architecture, where each disk (Disk 1 to Disk N) has its own I/O queue and optional disk cache, communicating with a central CPU via a disk controller. Memory, not explicitly shown, sits between the CPU and storage, offering fast but volatile access. File systems, also not depicted, operate at the OS/kernel level to handle file access metadata and provide a user-friendly interface through files and directories.
 
 ![disk i/o](./diskio.jpeg)
 
-
 ## Key Terms
 
 #### Sectors and Blocks
 
-Sectors are the basic physical units on a storage device. For instance, traditional hard drives typically use a sector size of 512 bytes, while many modern disks use 4096 bytes (or 4K sectors) to improve error correction and efficiency.
+Sectors are the basic physical units on a storage device. Traditional hard drives typically use a sector size of 512 bytes, while many modern disks use 4096 bytes (4K sectors) for improved error correction and efficiency.
+
+Blocks are logical groupings of one or more sectors used by filesystems for data organization. A common filesystem block size is 4096 bytes, meaning each block might consist of eight 512-byte sectors, or map directly to a 4096-byte physical sector if supported by the disk.
 
-Blocks are the logical grouping of one or more sectors used by filesystems for data organization. A common filesystem block size is 4096 bytes, meaning that each block might consist of 8 of the 512-byte sectors, or simply map directly to a 4096-byte physical sector layout if the disk supports it.
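+
+You can check the sector and block sizes in use on your own system. The commands below are a sketch; the device name `/dev/nvme0n1` is an assumption, so substitute the device reported by `lsblk` on your instance.
+
+```bash
+# Logical and physical sector sizes reported for each block device
+lsblk -o NAME,LOG-SEC,PHY-SEC
+
+# Logical sector size and physical block size for a specific device
+sudo blockdev --getss --getpbsz /dev/nvme0n1
+
+# Block size of the filesystem mounted at /
+stat -f -c "Filesystem block size: %s bytes" /
+```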
 
+#### Input/Output Operations per Second (IOPS)
 
-#### Input Output Operations per second (IOPS)
-IOPS is a measure of how much random read or write requests your storage system can manage. It is worth noting that IOPS can vary by block size depending on the storage medium (e.g., flash drives). Importantly, traditional hard disk drives (HDDs) often don't specify the IOPS. For example the IOPS value for HDD volume on AWS is not shown.
+IOPS measures how many random read or write requests your storage system can handle per second. IOPS can vary by block size and storage medium (e.g., flash drives). Traditional HDDs often do not specify IOPS; for example, AWS does not show IOPS values for HDD volumes.
 
 ![iops_hdd](./IOPS.png)
 
-#### Throughput / Bandwidth
-Throughput is the data transfer rate normally in MB/s with bandwidth specifying the maximum amount that a connection can transfer. IOPS x block size can be used to calculate the storage throughput of your application.
+#### Throughput and Bandwidth
+
+Throughput is the data transfer rate, usually measured in MB/s. Bandwidth specifies the maximum amount of data a connection can transfer. You can calculate storage throughput as IOPS × block size.
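+
+As a quick worked example, the loop below applies the throughput formula for a 4 KiB block size at a few illustrative IOPS values; the numbers are not taken from any specific device.
+
+```bash
+# Throughput = IOPS x block size, shown here for a 4 KiB block size
+for iops in 3000 16000 64000; do
+  awk -v iops="$iops" 'BEGIN {
+    block_bytes = 4 * 1024
+    printf "%6d IOPS x 4 KiB = %.1f MB/s\n", iops, iops * block_bytes / 1e6
+  }'
+done
+```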
diff --git a/content/learning-paths/servers-and-cloud-computing/disk-io-benchmark/using-fio.md b/content/learning-paths/servers-and-cloud-computing/disk-io-benchmark/using-fio.md index bc50025ac4..6e7e0a40ed 100644 --- a/content/learning-paths/servers-and-cloud-computing/disk-io-benchmark/using-fio.md +++ b/content/learning-paths/servers-and-cloud-computing/disk-io-benchmark/using-fio.md @@ -1,17 +1,36 @@ --- -title: Using FIO +title: Using fio weight: 4 ### FIXED, DO NOT MODIFY layout: learningpathall --- -## Setup and Install Fio +## Install fio -I will be using the same `t4g.medium` instance from the previous section with 2 different types of SSD-based block storage devices as per the console screenshot below. Both block devices have the same, 8GiB capacity but the `io1` is geared towards throughput as opposed to the general purpose SSD `gp2`. In this section we want to observe what the real-world performance for our workload is so that it can inform our selection. +You can use the same `t4g.medium` instance from the previous section with 2 different types of SSD-based block storage devices as per the console screenshot below. + +To add the required EBS volumes to your EC2 instance: + +1. In the AWS Console, navigate to EC2 > Volumes > Create Volume +2. Create a volume with the following settings: + - Volume Type: io2 (Provisioned IOPS SSD) + - Size: 8 GiB + - IOPS: 400 + - Availability Zone: Same as your EC2 instance +3. Create another volume with the following settings: + - Volume Type: gp2 (General Purpose SSD) + - Size: 8 GiB + - Availability Zone: Same as your EC2 instance +4. Once created, select each volume and choose Actions > Attach Volume +5. Select your t4g.medium instance from the dropdown and attach each volume + +Both block devices have the same, 8GiB capacity but the `io2` is geared towards throughput as opposed to the general purpose SSD `gp2`. ![EBS](./EBS.png) +In this section you will observe what the real-world performance for your workload is so that it can inform your selection. + Flexible I/O (fio) is a command-line tool to generate a synthetic workload with specific I/O characteristics. This serves as a simpler alternative to full record and replay testing. Fio is available through most Linux distribution packages, please refer to the [documentation](https://github.com/axboe/fio) for the binary package availability. ```bash @@ -25,15 +44,17 @@ Confirm installation with the following commands. fio --version ``` +The version is printed: + ```output -fio-3.36 +fio-3.37 ``` ## Locate Device -`Fio` allows us to microbenchmark either the block device or a mounted filesystem. The disk free, `df` command to confirm our EBS volumes are not mounted. Writing to drives that hold critical information may cause issues. Hence we are writing to blank, unmounted block storage device. +Fio allows you to microbenchmark either the block device or a mounted filesystem. Use the disk free, `df` command to confirm your EBS volumes are not mounted. Writing to drives that hold critical information may cause issues. Hence you are writing to blank, unmounted block storage device. -Using the `lsblk` command to view the EBS volumes attached to the server (`nvme1n1` and `nvme2n1`). The immediate number appended to `nvme`, e.g., `nvme0`, shows it is a physically separate device. `nvme1n1` corresponds to the faster `io2` block device and `nvme2n1` corresponds to the slower `gp2` block device. +Use the `lsblk` command to view the EBS volumes attached to the server (`nvme1n1` and `nvme2n1`). 
+
+Both block devices have the same 8 GiB capacity, but the `io2` volume (Provisioned IOPS SSD) is geared towards higher, more consistent throughput than the general-purpose SSD `gp2`.
 
 ![EBS](./EBS.png)
 
+In this section you will observe what the real-world performance for your workload is so that it can inform your selection.
+
 Flexible I/O (fio) is a command-line tool to generate a synthetic workload with specific I/O characteristics. This serves as a simpler alternative to full record and replay testing. Fio is available through most Linux distribution packages, please refer to the [documentation](https://github.com/axboe/fio) for the binary package availability.
 
 ```bash
@@ -25,15 +44,17 @@ Confirm installation with the following commands.
 fio --version
 ```
 
+The version is printed:
+
 ```output
-fio-3.36
+fio-3.37
 ```
 
 ## Locate Device
 
-`Fio` allows us to microbenchmark either the block device or a mounted filesystem. The disk free, `df` command to confirm our EBS volumes are not mounted. Writing to drives that hold critical information may cause issues. Hence we are writing to blank, unmounted block storage device.
+Fio allows you to microbenchmark either a block device or a mounted filesystem. Use the disk free (`df`) command to confirm your EBS volumes are not mounted. Writing to drives that hold critical information may cause issues, so you write to a blank, unmounted block storage device.
 
-Using the `lsblk` command to view the EBS volumes attached to the server (`nvme1n1` and `nvme2n1`). The immediate number appended to `nvme`, e.g., `nvme0`, shows it is a physically separate device. `nvme1n1` corresponds to the faster `io2` block device and `nvme2n1` corresponds to the slower `gp2` block device.
+Use the `lsblk` command to view the EBS volumes attached to the server (`nvme1n1` and `nvme2n1`). The number immediately after `nvme` (for example, `nvme0`) identifies a physically separate device. `nvme1n1` corresponds to the faster `io2` block device and `nvme2n1` corresponds to the slower `gp2` block device.
 
 ```bash
 lsblk -e 7
 ```
@@ -50,22 +71,22 @@ nvme2n1       259:2    0    8G  0 disk
 ```
 
 {{% notice Please Note%}}
-If you have more than 1 block volumes attached to an instance, the `sudo nvme list` command from the `nvme-cli` package and be used to differentiate between volumes
+If you have more than one block volume attached to an instance, you can use the `sudo nvme list` command from the `nvme-cli` package to differentiate between volumes.
 {{% /notice %}}
 
 ## Generating a Synthetic Workload
 
-Let us say we want to simulate a fictional logging application with the following characteristics observed using the tools from the previous section.
+Suppose you want to simulate a fictional logging application with the following characteristics, observed using the tools from the previous section.
 
 {{% notice Workload%}}
 The logging workload has light sequential read and write characteristics. The system write throughput per thread is 5 MB/s with 83% writes. There are infrequent bursts of reads for approximately 5 seconds, operating at up to 16MB/s per thread. The workload can scale the infrequent reads and writes to use up to 16 threads each. The block size for the writes and reads are 64KiB and 256KiB respectively (as opposed to the standard 4KiB Page size).
 
-Further, the application latency sensitive and given it holds critical information, needs to write directly to non-volatile storage through direct IO.
+Further, the application is latency-sensitive and, because it holds critical information, needs to write directly to non-volatile storage through direct I/O.
 {{% /notice %}}
 
-The fio tool uses simple configuration `jobfiles` to describe the characterisics of your synthetic workload. Parameters under the `[global]` option are shared among jobs. From the example below, we have created 2 jobs to represent the steady write and infrequent reads. Please refer to the official [documentation](https://fio.readthedocs.io/en/latest/fio_doc.html#job-file-format) for more details.
+The fio tool uses simple configuration `jobfiles` to describe the characteristics of your synthetic workload. Parameters under the `[global]` option are shared among jobs. The example below creates two jobs to represent the steady writes and infrequent reads. Refer to the official [documentation](https://fio.readthedocs.io/en/latest/fio_doc.html#job-file-format) for more details.
 
-Copy and paste the configuration file below into 2 files named `nvme.fio`. Replace the `` with the block devices we are comparing and just the `filename` parameter accordingly.
+Copy and paste the configuration file below into two files named `nvme1.fio` and `nvme2.fio`, then set the `filename` parameter in each file to the block device it should test (`/dev/nvme1n1` and `/dev/nvme2n1` respectively).
 
 ```ini
 ; -- start job file including.fio --
@@ -89,19 +110,24 @@ bs=64k ; Block size of 64KiB (default block size of 4 KiB)
 [burst_read]
 name=burst_read
 rw=read
-bs=256k ; adjust the block size to 64KiB writes (default is 4KiB)
+bs=256k ; Block size of 256KiB for reads (default is 4KiB)
 startdelay=10 ; simulate infrequent reads (5 seconds out 30)
 runtime=5
 ; -- end job file including.fio --
 ```
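+
+One way to produce the two job files is to save the configuration above once and copy it, pointing each copy at a different device. The snippet below is a sketch: it assumes you saved the file as `nvme.fio` and that the `[global]` section contains a `filename=` line, so adjust it to match your actual job file.
+
+```bash
+# Make one copy per device under test
+cp nvme.fio nvme1.fio
+cp nvme.fio nvme2.fio
+
+# Point each copy at its block device (the io2 and gp2 volumes respectively)
+sed -i 's|^filename=.*|filename=/dev/nvme1n1|' nvme1.fio
+sed -i 's|^filename=.*|filename=/dev/nvme2n1|' nvme2.fio
+```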
+
+{{% notice Note %}}
+Running fio directly on block devices requires root privileges (hence the use of `sudo`). Be careful: writing to the wrong device can result in data loss. Always ensure you are targeting a blank, unmounted device.
+{{% /notice %}}
+
 Run the following commands to run each test back to back.
 
 ```bash
 sudo NUM_JOBS=16 IO_DEPTH=64 fio nvme1.fio
 ```
 
-Then
+Then run the second test with the following command:
 
 ```bash
 sudo NUM_JOBS=16 IO_DEPTH=64 fio nvme2.fio
 ```
@@ -131,10 +157,9 @@ Disk stats (read/write):
   nvme2n1: ios=1872/28855, sectors=935472/3693440, merge=0/0, ticks=159753/1025104, in_queue=1184857, util=89.83%
 ```
 
-Here we can see that the faster `io2` block storage (`nvme1`) is able to meet the throughput requirement of 80MB/s for steady writes when all 16 write threads are running (5MB/s per thread). However `gp2` saturates at 60.3 MiB/s with over 89.8% SSD utilisation.
-
-We are told the fictional logging application is sensitive to operation latency. The output belows highlights that over ~35% operations have a latency above 1s on nvme2 compared to ~7% on nvme1.
+Here you can see that the faster `io2` block storage (`nvme1`) is able to meet the throughput requirement of 80 MB/s for steady writes when all 16 write threads are running (5 MB/s per thread). However, `gp2` saturates at 60.3 MiB/s with over 89.8% SSD utilization.
+
+Recall that your fictional logging application is sensitive to operation latency. The output below highlights that over ~35% of operations have a latency above 1s on nvme2 compared to ~7% on nvme1. High latency percentiles can significantly impact application responsiveness, especially for latency-sensitive workloads like logging.
 
 ```output
@@ -153,14 +178,14 @@ We are told the fictional logging application is sensitive to operation latency.
    lat (msec)   : 2000=3.62%, >=2000=2.38%
 ```
 
-This insights above suggest the SSD designed for throughput, `io2` is more suitable than the general purpose `gp2` storage to meet the requirements of our logging application.
+These insights suggest the SSD designed for throughput, `io2`, is more suitable than the general-purpose `gp2` storage to meet the requirements of your logging application.
 
 {{% notice Tip%}}
-If the text output is hard to follow, you can use the `fio2gnuplot` package to plot the data graphically or use the visualisations available from the cloud service provider's dashboard. See image below for an example.
+If the text output is hard to follow, you can use the `fio2gnuplot` package to plot the data graphically or use the visualizations available from your cloud service provider's dashboard. See the image below for an example.
 
 ![plot](./visualisations.png)
 {{% /notice %}}
 
-The insights gathered by microbenchmarking with fio above can lead to more informed decisions about which block storage to connect to your Arm-based instance.
+The insights gathered by microbenchmarking with fio above can lead to more informed decisions about which block storage to connect to your Arm-based instance.
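+
+If you would rather compare the two runs programmatically than read the text report, fio can also emit machine-readable JSON. The commands below are a sketch: they assume the job files created earlier and that `jq` is installed, and the exact JSON field layout can vary slightly between fio versions.
+
+```bash
+# Re-run each test, writing JSON results to a file
+sudo NUM_JOBS=16 IO_DEPTH=64 fio --output-format=json --output=nvme1.json nvme1.fio
+sudo NUM_JOBS=16 IO_DEPTH=64 fio --output-format=json --output=nvme2.json nvme2.fio
+
+# Print per-job write bandwidth (KiB/s) and IOPS for each device
+for f in nvme1.json nvme2.json; do
+  echo "== $f =="
+  jq -r '.jobs[] | "\(.jobname): write bw=\(.write.bw) KiB/s, write iops=\(.write.iops)"' "$f"
+done
+```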