Slower reads using iouring than os for ReadFile type operations at all file sizes #27

johnsiilver · 2023-08-20T20:49:57Z

Code discussed here can be found at:
https://github.com/johnsiilver/iouringfile

Expectations:
I expected faster Reads with io_uring, but found it to be slower in every benchmark. This could be due to multiple factors:

I don't know what I'm doing, which is easily possible
Single buffer reads, no matter what the size, have no speed advantage in io_uring vs epoll
Go's concurrency model and io_uring aren't good fits
There are bugs in this package

I could not get WithEQPoll() to work as I documented in another bug. Most benchmarks on io_uring show significant savings without it and multipliers if enabled. I found no examples in this repo that used it and using sourcegraph I could find no external repos that used it: https://sourcegraph.com/search?q=context%3Aglobal+WithSQPoll%28%29&patternType=standard&sm=1&groupBy=repo

Here are various benchmarks I ran on an Amazon VM running Ubnutu (I don't have any native linux boxes on bare metal).

Initially used runtime.LockOSThread() for my event loop, which yielded these results:

goos: linux
goarch: amd64
pkg: github.com/johnsiilver/iouringfile
BenchmarkReadFile/OS_1KiB-2 108146 11022 ns/op
BenchmarkReadFile/OS_16KiB-2 72103 16457 ns/op
BenchmarkReadFile/OS_32KiB-2 51840 23956 ns/op
BenchmarkReadFile/OS_64KiB-2 33656 36336 ns/op
BenchmarkReadFile/OS_128KiB-2 18320 66011 ns/op
BenchmarkReadFile/OS_512KiB-2 5438 239014 ns/op
BenchmarkReadFile/OS_1MiB-2 4478 277964 ns/op
BenchmarkReadFile/OS_10MiB-2 326 3560091 ns/op
BenchmarkReadFile/OS_100MiB-2 24 45336144 ns/op
BenchmarkReadFile/OS_1GiB-2 2 533741636 ns/op
BenchmarkReadFile/IOURING_1KiB_-2 20305 50978 ns/op <- Almost 5x slower
BenchmarkReadFile/IOURING_16KiB-2 19050 59990 ns/op
BenchmarkReadFile/IOURING_32KiB-2 18846 67469 ns/op
BenchmarkReadFile/IOURING_64KiB-2 14314 81480 ns/op
BenchmarkReadFile/IOURING_128KiB-2 10000 162159 ns/op
BenchmarkReadFile/IOURING_512KiB-2 2743 456180 ns/op
BenchmarkReadFile/IOURING_1MiB-2 1564 757464 ns/op
BenchmarkReadFile/IOURING_10MiB-2 278 4013378 ns/op
BenchmarkReadFile/IOURING_100MiB-2 25 42532022 ns/op
BenchmarkReadFile/IOURING_1GiB-2 2 627497596 ns/op <- Still slower, but at least not 5x

Improved with removal of the the lock, and got:

goos: linux
goarch: amd64
pkg: github.com/johnsiilver/iouringfile
cpu: Intel(R) Xeon(R) CPU E5-2676 v3 @ 2.40GHz
BenchmarkReadFile/OS_1KiB-2 104364 11190 ns/op
BenchmarkReadFile/OS_16KiB-2 70800 16552 ns/op
BenchmarkReadFile/OS_32KiB-2 48690 24940 ns/op
BenchmarkReadFile/OS_64KiB-2 29878 37414 ns/op
BenchmarkReadFile/OS_128KiB-2 17282 72963 ns/op
BenchmarkReadFile/OS_512KiB-2 4719 227736 ns/op
BenchmarkReadFile/OS_1MiB-2 3877 285898 ns/op
BenchmarkReadFile/OS_10MiB-2 326 3557621 ns/op
BenchmarkReadFile/OS_100MiB-2 24 44912433 ns/op
BenchmarkReadFile/OS_1GiB-2 2 552437373 ns/op
BenchmarkReadFile/IOURING_1KiB_-2 56310 21982 ns/op <- Almost 2x slower
BenchmarkReadFile/IOURING_16KiB-2 43414 28685 ns/op
BenchmarkReadFile/IOURING_32KiB-2 34561 34932 ns/op
BenchmarkReadFile/IOURING_64KiB-2 24840 45259 ns/op
BenchmarkReadFile/IOURING_128KiB-2 17257 84012 ns/op
BenchmarkReadFile/IOURING_512KiB-2 2972 375026 ns/op
BenchmarkReadFile/IOURING_1MiB-2 --- FAIL: BenchmarkReadFile/IOURING_1MiB-2
benchmark_test.go:148: bad file descriptor
BenchmarkReadFile/IOURING_10MiB-2 273 4144841 ns/op
BenchmarkReadFile/IOURING_100MiB-2 25 42300929 ns/op
BenchmarkReadFile/IOURING_1GiB-2 2 632643157 ns/op

NOTE: For this one run I got a bad file descriptor error, which I suspect is a bug in iouring-go, but it hasn't happened again so I can't confirm it is. It is also just as likely I did something silly.

I removed the event loop and pre-allocation of return channels and just did the iouring calls inside ReadFile itself, which yielded worse results:

goos: linux
goarch: amd64
pkg: github.com/johnsiilver/iouringfile
cpu: Intel(R) Xeon(R) CPU E5-2676 v3 @ 2.40GHz
BenchmarkReadFile/OS_1KiB-2 105922 11093 ns/op
BenchmarkReadFile/OS_16KiB-2 73557 16656 ns/op
BenchmarkReadFile/OS_32KiB-2 53299 22797 ns/op
BenchmarkReadFile/OS_64KiB-2 36957 36118 ns/op
BenchmarkReadFile/OS_128KiB-2 19062 66646 ns/op
BenchmarkReadFile/OS_512KiB-2 5625 234702 ns/op
BenchmarkReadFile/OS_1MiB-2 3667 318104 ns/op
BenchmarkReadFile/OS_10MiB-2 325 3617478 ns/op
BenchmarkReadFile/OS_100MiB-2 25 43581482 ns/op
BenchmarkReadFile/OS_1GiB-2 2 552909954 ns/op
BenchmarkReadFile/IOURING_1KiB_-2 11348 97162 ns/op <- 9x slower, so by far my worse idea
BenchmarkReadFile/IOURING_16KiB-2 10000 111489 ns/op
BenchmarkReadFile/IOURING_32KiB-2 10104 109561 ns/op
BenchmarkReadFile/IOURING_64KiB-2 6470 186679 ns/op
BenchmarkReadFile/IOURING_128KiB-2 7903 220638 ns/op
BenchmarkReadFile/IOURING_512KiB-2 2329 614333 ns/op
BenchmarkReadFile/IOURING_1MiB-2 1212 896992 ns/op
BenchmarkReadFile/IOURING_10MiB-2 255 4715659 ns/op
BenchmarkReadFile/IOURING_100MiB-2 25 43412745 ns/op
BenchmarkReadFile/IOURING_1GiB-2 2 631509550 ns/op
PASS
ok github.com/johnsiilver/iouringfile 47.412s

So I went back to the previous and measured the allocs:

goos: linux
goarch: amd64
pkg: github.com/johnsiilver/iour cpu: Intel(R) Xeon(R) CPU BenchmarkReadFile/OS_1KiB-2 BenchmarkReadFile/OS_16KiB-2 BenchmarkReadFile/OS_32KiB-2 BenchmarkReadFile/OS_64KiB-2 BenchmarkReadFile/OS_128KiB-2 BenchmarkReadFile/OS_512KiB-2 BenchmarkReadFile/OS_1MiB-2 BenchmarkReadFile/OS_10MiB-2 BenchmarkReadFile/OS_100MiB-2 BenchmarkReadFile/OS_1GiB-2 BenchmarkReadFile/IOURING_1KiB_-2 BenchmarkReadFile/IOURING_16KiB-2 BenchmarkReadFile/IOURING_32KiB-2 BenchmarkReadFile/IOURING_64KiB-2 BenchmarkReadFile/IOURING_128KiB-2 BenchmarkReadFile/IOURING_512KiB-2 BenchmarkReadFile/IOURING_1MiB-2 BenchmarkReadFile/IOURING_10MiB-2 BenchmarkReadFile/IOURING_100MiB-2 BenchmarkReadFile/IOURING_1GiB-2 ingfile
E5-2676 v3 @ 2.40GHz
106183 11050 ns/op 1560 B/op 6 allocs/op
76435 15818 ns/op 18840 B/op 6 allocs/op
54651 23067 ns/op 41368 B/op 6 allocs/op
37386 32897 ns/op 74136 B/op 6 allocs/op
20600 67160 ns/op 139672 B/op 6 allocs/op
5833 213830 ns/op 532888 B/op 6 allocs/op
3462 339948 ns/op 1057176 B/op 6 allocs/op
332 3739868 ns/op 10494360 B/op 6 allocs/op
24 44693581 ns/op 104866200 B/op 6 allocs/op
2 556463702 ns/op 1073750424 B/op 6 allocs/op
54560 21808 ns/op 1904 B/op 12 allocs/op
39284 30109 ns/op 17264 B/op 12 allocs/op
32830 34880 ns/op 33648 B/op 12 allocs/op
24984 45831 ns/op 66416 B/op 12 allocs/op
15892 70437 ns/op 131952 B/op 12 allocs/op
5031 278818 ns/op 525169 B/op 12 allocs/op
1681 725118 ns/op 1049459 B/op 12 allocs/op
313 3600860 ns/op 10486647 B/op 12 allocs/op
25 43821402 ns/op 104858486 B/op 12 allocs/op
2 629058212 ns/op 1073742704 B/op 12 allocs/op

There is some lower byte allocations, but nothing to call home about, once you get over 1KiB.

I noticed that another open issue states that the examples directory doesn't have any examples that run faster than the system tools. Since the system tools are C related and have been optimized for years, io_uring may not be the contributing factor.

But in this case it is Go (epoll) and Go (io_uring).

If you see something here that is an obvious mistake on my part or a knowledge gap on my part, that would be great. And I completely understand if you are either no longer interested in this or don't have time to work on it. If so, I'll be glad to go away :)

Cheers and thanks for making your implementation available and open source!

eatonphil · 2023-10-16T20:50:47Z

I am seeing similar slower writes using this library compared to Go's builtin WriteAt/pwrite support.

I've got a benchmark that just writes 100_000 4KB byte chunks to disk. io_uring can get close but not as good as just having 10 worker goroutines write out 4KB chunks at a time.

Source: https://github.com/eatonphil/go-io-playground

┌───────────────────────────────────────────┬───────────────────────┬────────────┐
│                  method                   │       avg_time        │ throughput │
│                  varchar                  │        varchar        │  varchar   │
├───────────────────────────────────────────┼───────────────────────┼────────────┤
│ 10_goroutines_io_uring_pwrite_1_entry     │ 0.1633701s            │ 2.5GB/s    │
│ 10_goroutines_io_uring_pwrite_100_entries │ 0.0394018s            │ 10.4GB/s   │
│ buf                                       │ 0.257824s             │ 1.5GB/s    │
│ 10_goroutines_pwrite                      │ 0.025590199999999997s │ 16.0GB/s   │
│ nobuf                                     │ 0.2549947s            │ 1.6GB/s    │
└───────────────────────────────────────────┴───────────────────────┴────────────┘

eatonphil · 2023-10-17T18:45:04Z

I'm no longer able to reproduce this. I'm currently getting:

┌───────────────────────────────────────────┬──────────────────────┬────────────┐
│                  method                   │       avg_time       │ throughput │
│                  varchar                  │       varchar        │  varchar   │
├───────────────────────────────────────────┼──────────────────────┼────────────┤
│ 10_goroutines_io_uring_pwrite_1_entry     │ 0.5154280000000001s  │ 794.7MB/s  │
│ blocking                                  │ 0.2821061s           │ 1.4GB/s    │
│ 10_goroutines_pwrite                      │ 0.32600510000000005s │ 1.2GB/s    │
│ 10_goroutines_io_uring_pwrite_512_entries │ 0.35256730000000003s │ 1.1GB/s    │
└───────────────────────────────────────────┴──────────────────────┴────────────┘

io_uring is still lower than even blocking io. I'll keep looking into it.

I was able to find some more bugs in my code while porting this to Zig.

eatonphil · 2023-10-17T21:20:11Z

I ported my Go benchmark from Go to Zig and I get basically the same results from Zig and from Go.

So I don't think this is the fault of this library. Not that OP was saying that either necessarily.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Slower reads using iouring than os for ReadFile type operations at all file sizes #27

Slower reads using iouring than os for ReadFile type operations at all file sizes #27

johnsiilver commented Aug 20, 2023

eatonphil commented Oct 16, 2023 •

edited

Loading

eatonphil commented Oct 17, 2023 •

edited

Loading

eatonphil commented Oct 17, 2023

Slower reads using iouring than os for ReadFile type operations at all file sizes #27

Slower reads using iouring than os for ReadFile type operations at all file sizes #27

Comments

johnsiilver commented Aug 20, 2023

eatonphil commented Oct 16, 2023 • edited Loading

eatonphil commented Oct 17, 2023 • edited Loading

eatonphil commented Oct 17, 2023

eatonphil commented Oct 16, 2023 •

edited

Loading

eatonphil commented Oct 17, 2023 •

edited

Loading