All FIO threads spawn on the same cpu core #50

Open
theofficialgman opened this issue Mar 19, 2021 · 8 comments
Labels
bug (Something isn't working), confirmed

Comments

@theofficialgman

  • Linux-distro (kernel version): ubuntu bionic 4.9.140
  • Desktop Environment (KDE/GNOME etc.): GNOME
  • Qt Version: 5.9.5
  • KDiskMark Version: current bionic apt repo version
  • FIO Version: fio-3.1-1

Description:

I am testing this on a relatively low-powered SBC (tegra x1). All the FIO threads spawn on the same CPU core, which pins one core at 100% on the heavy I/O tests (like the 4K reads/writes). This results in much worse performance than the storage drive is capable of (I get 6,600 IOPS on the tegra x1 and 40,000 IOPS using the same drive on a MUCH more powerful Windows machine). Note: CrystalDiskMark appears to use all CPU cores (even if it doesn't show in Task Manager).
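(For reference, per-core load can be watched while the benchmark runs, assuming the sysstat package is installed; a single saturated core shows up immediately:)

    mpstat -P ALL 1    # per-core CPU utilization, refreshed every second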

@theofficialgman theofficialgman added the bug (Something isn't working) and unconfirmed labels Mar 19, 2021
@theofficialgman
Author

theofficialgman commented Mar 19, 2021

I get 14,000 IOPS when I run that fio command. The fio command does still pin one of the CPU threads at 100%, which I guess is what keeps me below the 40,000 from my more powerful machine. Results are repeatable for both the fio and the KDiskMark test (and yes, I am making sure to run RND4K Q32 T16 in KDiskMark).

CPU speeds are the same between both tests.

@theofficialgman
Author

@JonMagon any ideas?
I can consistently reproduce these numbers and am available to troubleshoot.

@theofficialgman
Author

output.txt
Here is the output from the fio command you asked me to run before. While running, it stayed around 14,000 IOPS, and if I add all the numbers up at the end of the benchmark it comes out to about that as well.

Note: I only did a 12M test here, but I also ran a full 128M test and got the same results.

@polarathene

polarathene commented Sep 1, 2021

The disk you're testing is likely external via USB? There are quite a few differences between an SBC and a desktop that would affect performance between the two systems.

But that only covers the hardware differences... it appears you also have software differences (OS and benching tool), which further complicate a fair comparison between two systems.

Usually single-threaded workloads aren't pinned to a specific core. It may appear that one core is under full load, but the workload is often being load-balanced across cores (switching cores), unless you explicitly pinned it.


You should also try to compare the same OS and kernel on the desktop system. Your SBC is running a rather old kernel (4.9, released in Dec 2016); notable changes have been made since then, especially to disk I/O schedulers from the 5.0 release in early 2019. Or perhaps you can update what your SBC is running.

As you're testing against Windows on the desktop for comparison, try testing the desktop with a distro that has a more recent kernel, and perhaps switch the fio --ioengine from libaio to io_uring (requires kernel 5.1 minimum, but it was initially a bit buggy, so prefer as new a kernel as possible).
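A minimal sketch of such a comparison run (the job parameters here are illustrative, not necessarily what KDiskMark passes, and the file path is a placeholder):

    fio --name=rnd4k-read --filename=/path/to/testfile --size=128M \
        --rw=randread --bs=4k --iodepth=32 --numjobs=16 \
        --direct=1 --group_reporting --ioengine=libaio
    # then repeat the same job with --ioengine=io_uring (kernel 5.1+ and a recent fio) and compare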

Just in the past year, for example, an Intel Optane Gen2 disk was found to achieve 2.58M IOPS via fio, but with all the new optimizations since, the upcoming 5.15 kernel is presently achieving 3.5M IOPS on the same device. libaio performs far worse than io_uring in comparison, AFAIK.


Additionally... like the kernel, you're using a rather old version of fio (3.1, released in Sep 2017); fixes and improvements have been made since then, especially to adapt to advances elsewhere (e.g. io_uring). The current release is 3.27 (May 2021).

It's probably not relevant, but one user noticed a reporting output difference between fio v3.0 and a newer release from early 2019 (v3.14); the before/after difference is conveyed here, which tests two jobs with group reporting enabled (originally I thought this was about --numjobs, but I seem to have been mistaken).

This fio issue from 2018 identifies the performance issue as caused not by fio itself, but by too much overhead queuing up in the kernel, which reduces performance, as the response details. They suggest making the I/O less intensive and scaling it up until you notice where the CPU starts to bottleneck. This might be relevant to you, since that issue is primarily focused on why CPU usage is too high. You might want to use iotop or similar to see disk latency due to I/O wait, as they mention earlier in the issue.
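For example, something like this run alongside the benchmark should show whether the time is going into device latency or into CPU (iostat needs the sysstat package):

    iostat -x 1      # watch r_await/w_await (per-I/O latency) and %util for the device
    sudo iotop -o    # only list processes currently doing I/O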

Also keep point 3 from this response in mind. From your info so far it's probably unrelated, but when using USB devices (or enclosures), the disk-to-USB bridge chipset sometimes constrains a device's capabilities; for example, some SATA bridges support a much lower queue depth, among other SATA features. USB itself can likewise impose restrictions in a variety of ways, some of which are system specific (USB controller chipset, USB version, power supplied, USB driver, kernel, filesystem driver, etc.).

Consider looking over the end of this response, which investigates similar performance/CPU issues, again noting the kernel as an issue:

we see submission latencies (time to submit the I/O and have the kernel tell us it's queued it up for sending) in the 1-6 microsecond range (which is fine).

Unfortunately your completion latencies (time from when kernel accepted it for queuing until it got a reply back from the underlying disk saying the I/O completed and then we notice the kernel saying to us the I/O completed) are in the 1-108 millisecond range with an average of 11.2ms (this is not fine for fast speeds).

In short fio is telling you that I/Os to /dev/sdc have a high latency.

So your maximum speed is being limited by the (high) latency of your I/Os and NOT by fio.

Related to that is this issue comment, noting that queue depth multiplied by the number of jobs raises the effective queue depth (to 512 in this case). This can contribute to the issue, but is unlikely to be the cause of the difference between the two systems; the block I/O scheduler or the CPU governor might be a more relevant difference. Note that the kernel can also be built with settings that favor throughput vs responsiveness (lower latency for interactive feedback at the expense of throughput). These variables all contribute towards performance.
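For illustration, the RND4K Q32 T16 preset mentioned above maps to roughly these fio flags (a sketch, not the exact command KDiskMark builds):

    --iodepth=32 --numjobs=16    # 32 outstanding I/Os per job × 16 jobs = an effective queue depth of 512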

As for differences between KDiskMark and the direct fio command you ran earlier, some settings KDiskMark uses were omitted, such as --end_fsync (EDIT: I'm tired and was mistaken, ignore that), which may contribute to that, as noted here. Earlier in the issue it's also mentioned that you could try adjusting the CPUs-allowed policy; that links to the docs for the cpus_allowed_policy param, which states shared is the default value (all jobs share the CPU set), but it could be changed to split (each job gets a unique CPU). I'm not sure whether this reference to a "job" is the same as --numjobs or a separate "job"/test (I don't have much experience with fio).

This is further supported by this StackOverflow answer:

stop using fsync because then each I/O will bypass Linux's page cache and won't return until the disk has acknowledged it received the I/O.

Thus each I/O will have the "latency to disk" built into its time but as stated this does NOT imply points 2 or 3.
Also note not all filesystems support this option (which is a pity).

Another answer on that SO link also mentions fsync time measurements in newer fio releases (3.5+).
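Going back to the cpus_allowed_policy point above, a sketch of what trying the split policy might look like (assuming a 4-core CPU set; the other parameters are illustrative and the file path is a placeholder):

    fio --name=rnd4k-split --filename=/path/to/testfile --size=128M \
        --rw=randread --bs=4k --iodepth=32 --numjobs=16 \
        --direct=1 --group_reporting --ioengine=libaio \
        --cpus_allowed=0-3 --cpus_allowed_policy=split   # spread jobs across CPUs 0-3 instead of sharing one set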


One last likely influence: since this is an external disk shared between Linux and Windows, are you testing against the same filesystem? exFAT or NTFS, for example? On your old kernel, you're likely using a FUSE userspace driver. These have significant overhead vs in-kernel filesystem drivers. Modern kernels have in-kernel support for exFAT these days, with NTFS still a WIP.

It's quite likely that this is causing a big difference for you, as it will slow down the total throughput you can achieve and, IIRC, such transfers would use up considerable CPU whenever I did them. This would be the first concern to test and verify, followed by using newer versions of fio and the kernel.
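A quick way to check which driver is actually handling the mount (the path is a placeholder; fuseblk in the output would indicate a FUSE-based driver):

    findmnt -no FSTYPE,OPTIONS /path/to/mountpoint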

@theofficialgman
Author

theofficialgman commented Sep 1, 2021

@polarathene thanks for the detailed response.
I have a few things to say... I am aware there are many differences between the low-powered SBC and my powerful desktop, and I wasn't expecting them to have the same IOPS performance. BUT, I do expect to get the same performance from KDiskMark on the tegra x1 as from native fio on the tegra x1, which is what this issue is about. KDiskMark gets 6,600 IOPS vs fio at 14,000 IOPS running the command that JonMagon gave me.
I'll try out the shared vs split cpus_allowed_policy setting and see how that affects things... I'll get back with more answers to everything, but I just wanted to say thank you.

Filesystem is ext4, and I'll likely move to testing on Linux on both the tegra-x1 and my more powerful desktop for a fairer comparison, instead of CrystalDiskMark (which was just used since this tool is somewhat a clone of its functionality). Newer version of the kernel is not an option for this SBC.

@polarathene

BUT, I do expect to get the same performance from KDiskMark on the tegra x1 as from native fio on the tegra x1, which is what this issue is about.

Yeah, I only realized at the very end of my response that I had slipped up reading that command and that it was matching what KDiskMark does internally 😬

KDiskMark gets 6,600 IOPS vs fio at 14,000 IOPS running the command that JonMagon gave me.

They're definitely using the same fio? I assume they are; I would be interested to know if your results change when you try a more recent version of fio.
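For example (assuming KDiskMark launches fio as a child process), you could compare the system binary against whatever it spawns:

    fio --version              # version of the fio on PATH
    ps -o pid,args -C fio      # run during a KDiskMark benchmark to see which fio binary and arguments it launched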

just wanted to say thank you.

You're welcome :) Hopefully one of those suggestions makes a difference!

Filesystem is ext4

I wasn't expecting that 😅

How well does that run via Windows? Or are you using it via WSL? Whatever driver you use on Windows is probably newer than the ext4 driver in the 4.9 kernel, and there may be different mount settings too.

Newer version of the kernel is not an option for this SBC

That's unfortunate :(

I did have a look around, and it seems it is possible but quite a hassle. Back between 2016-2018 there was more activity around getting newer kernels or customizations built, but there seems to be little after that.

The most promising might be this Debian guide, which was also last updated mid 2018, but it mentions working with the upstream/mainline kernel that Debian uses (e.g. 4.14 or 4.16 are mentioned at the bottom). Not sure how many of the issues it points out remain, as that page doesn't look maintained anymore.

For the purpose of trying a new kernel to perform some I/O with fio, it might be viable, but probably not worth the hassle 😅

Really surprised that NVIDIA continues to make releases of L4T but keeps it stuck on the 4.9 kernel :\

@theofficialgman
Author

theofficialgman commented Sep 2, 2021

I did have a look around, and it seems it is possible but quite a hassle. Back between 2016-2018 there was more activity around getting newer kernels or customizations built, but there seems to be little after that.

I didn't specify before, but we actually do technically have a custom kernel (just based on the L4T kernel with lots of fixes and patches), and this is running on a Nintendo Switch (tegra-x1 based system).

We have mainline running as well, but we're stuck with nouveau drivers there (which is less than ideal given their performance and lack of Vulkan), hence sticking with L4T is the best option right now. I'll see if I can figure out whether KDiskMark is using the system fio (I assume it is, but I just need to check).

I wasn't expecting that 😅

How well does that run via Windows? Or are you using it via WSL? Whatever driver you use on Windows is probably newer than the ext4 driver in the 4.9 kernel, and there may be different mount settings too.

Yeah, I think I had it mounted with WSL2... anyway, I'll be testing on Linux on the same PC in the future.

I'll see about trying out a newer fio version soon enough, but it probably won't be for a few days.

@BugReporterZ

BugReporterZ commented Dec 26, 2021

I think I am observing the same issue. Compared to command-line fio, results barely change in KDiskMark when varying the number of threads in the benchmark.

Using OpenSUSE Tumbleweed (rolling release) with 5.18.5 kernel, fio 3.29, KDiskMark 2.3.0.

I verified this also using the command suggested in the second post: #50 (comment)

With it and a WD SN850 NVMe SSD I get 1050 kIOPS for read performance, but with KDiskMark and similar settings only about half this value, with the number of threads set in the program having only a very limited influence on the results.
