Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Slower reads using iouring than os for ReadFile type operations at all file sizes #27

Open
johnsiilver opened this issue Aug 20, 2023 · 3 comments

Comments

@johnsiilver
Copy link

Code discussed here can be found at:
https://github.com/johnsiilver/iouringfile

Expectations:
I expected faster Reads with io_uring, but found it to be slower in every benchmark. This could be due to multiple factors:

  • I don't know what I'm doing, which is easily possible
  • Single buffer reads, no matter what the size, have no speed advantage in io_uring vs epoll
  • Go's concurrency model and io_uring aren't good fits
  • There are bugs in this package

I could not get WithEQPoll() to work as I documented in another bug. Most benchmarks on io_uring show significant savings without it and multipliers if enabled. I found no examples in this repo that used it and using sourcegraph I could find no external repos that used it: https://sourcegraph.com/search?q=context%3Aglobal+WithSQPoll%28%29&patternType=standard&sm=1&groupBy=repo

Here are various benchmarks I ran on an Amazon VM running Ubnutu (I don't have any native linux boxes on bare metal).

Initially used runtime.LockOSThread() for my event loop, which yielded these results:

goos: linux
goarch: amd64
pkg: github.com/johnsiilver/iouringfile
BenchmarkReadFile/OS_1KiB-2 108146 11022 ns/op
BenchmarkReadFile/OS_16KiB-2 72103 16457 ns/op
BenchmarkReadFile/OS_32KiB-2 51840 23956 ns/op
BenchmarkReadFile/OS_64KiB-2 33656 36336 ns/op
BenchmarkReadFile/OS_128KiB-2 18320 66011 ns/op
BenchmarkReadFile/OS_512KiB-2 5438 239014 ns/op
BenchmarkReadFile/OS_1MiB-2 4478 277964 ns/op
BenchmarkReadFile/OS_10MiB-2 326 3560091 ns/op
BenchmarkReadFile/OS_100MiB-2 24 45336144 ns/op
BenchmarkReadFile/OS_1GiB-2 2 533741636 ns/op
BenchmarkReadFile/IOURING_1KiB_-2 20305 50978 ns/op <- Almost 5x slower
BenchmarkReadFile/IOURING_16KiB-2 19050 59990 ns/op
BenchmarkReadFile/IOURING_32KiB-2 18846 67469 ns/op
BenchmarkReadFile/IOURING_64KiB-2 14314 81480 ns/op
BenchmarkReadFile/IOURING_128KiB-2 10000 162159 ns/op
BenchmarkReadFile/IOURING_512KiB-2 2743 456180 ns/op
BenchmarkReadFile/IOURING_1MiB-2 1564 757464 ns/op
BenchmarkReadFile/IOURING_10MiB-2 278 4013378 ns/op
BenchmarkReadFile/IOURING_100MiB-2 25 42532022 ns/op
BenchmarkReadFile/IOURING_1GiB-2 2 627497596 ns/op <- Still slower, but at least not 5x

Improved with removal of the the lock, and got:

goos: linux
goarch: amd64
pkg: github.com/johnsiilver/iouringfile
cpu: Intel(R) Xeon(R) CPU E5-2676 v3 @ 2.40GHz
BenchmarkReadFile/OS_1KiB-2 104364 11190 ns/op
BenchmarkReadFile/OS_16KiB-2 70800 16552 ns/op
BenchmarkReadFile/OS_32KiB-2 48690 24940 ns/op
BenchmarkReadFile/OS_64KiB-2 29878 37414 ns/op
BenchmarkReadFile/OS_128KiB-2 17282 72963 ns/op
BenchmarkReadFile/OS_512KiB-2 4719 227736 ns/op
BenchmarkReadFile/OS_1MiB-2 3877 285898 ns/op
BenchmarkReadFile/OS_10MiB-2 326 3557621 ns/op
BenchmarkReadFile/OS_100MiB-2 24 44912433 ns/op
BenchmarkReadFile/OS_1GiB-2 2 552437373 ns/op
BenchmarkReadFile/IOURING_1KiB_-2 56310 21982 ns/op <- Almost 2x slower
BenchmarkReadFile/IOURING_16KiB-2 43414 28685 ns/op
BenchmarkReadFile/IOURING_32KiB-2 34561 34932 ns/op
BenchmarkReadFile/IOURING_64KiB-2 24840 45259 ns/op
BenchmarkReadFile/IOURING_128KiB-2 17257 84012 ns/op
BenchmarkReadFile/IOURING_512KiB-2 2972 375026 ns/op
BenchmarkReadFile/IOURING_1MiB-2 --- FAIL: BenchmarkReadFile/IOURING_1MiB-2
benchmark_test.go:148: bad file descriptor
BenchmarkReadFile/IOURING_10MiB-2 273 4144841 ns/op
BenchmarkReadFile/IOURING_100MiB-2 25 42300929 ns/op
BenchmarkReadFile/IOURING_1GiB-2 2 632643157 ns/op

NOTE: For this one run I got a bad file descriptor error, which I suspect is a bug in iouring-go, but it hasn't happened again so I can't confirm it is. It is also just as likely I did something silly.

I removed the event loop and pre-allocation of return channels and just did the iouring calls inside ReadFile itself, which yielded worse results:

goos: linux
goarch: amd64
pkg: github.com/johnsiilver/iouringfile
cpu: Intel(R) Xeon(R) CPU E5-2676 v3 @ 2.40GHz
BenchmarkReadFile/OS_1KiB-2 105922 11093 ns/op
BenchmarkReadFile/OS_16KiB-2 73557 16656 ns/op
BenchmarkReadFile/OS_32KiB-2 53299 22797 ns/op
BenchmarkReadFile/OS_64KiB-2 36957 36118 ns/op
BenchmarkReadFile/OS_128KiB-2 19062 66646 ns/op
BenchmarkReadFile/OS_512KiB-2 5625 234702 ns/op
BenchmarkReadFile/OS_1MiB-2 3667 318104 ns/op
BenchmarkReadFile/OS_10MiB-2 325 3617478 ns/op
BenchmarkReadFile/OS_100MiB-2 25 43581482 ns/op
BenchmarkReadFile/OS_1GiB-2 2 552909954 ns/op
BenchmarkReadFile/IOURING_1KiB_-2 11348 97162 ns/op <- 9x slower, so by far my worse idea
BenchmarkReadFile/IOURING_16KiB-2 10000 111489 ns/op
BenchmarkReadFile/IOURING_32KiB-2 10104 109561 ns/op
BenchmarkReadFile/IOURING_64KiB-2 6470 186679 ns/op
BenchmarkReadFile/IOURING_128KiB-2 7903 220638 ns/op
BenchmarkReadFile/IOURING_512KiB-2 2329 614333 ns/op
BenchmarkReadFile/IOURING_1MiB-2 1212 896992 ns/op
BenchmarkReadFile/IOURING_10MiB-2 255 4715659 ns/op
BenchmarkReadFile/IOURING_100MiB-2 25 43412745 ns/op
BenchmarkReadFile/IOURING_1GiB-2 2 631509550 ns/op
PASS
ok github.com/johnsiilver/iouringfile 47.412s

So I went back to the previous and measured the allocs:

goos: linux
goarch: amd64
pkg: github.com/johnsiilver/iouringfile
cpu: Intel(R) Xeon(R) CPU E5-2676 v3 @ 2.40GHz
BenchmarkReadFile/OS_1KiB-2 106183 11050 ns/op 1560 B/op 6 allocs/op
BenchmarkReadFile/OS_16KiB-2 76435 15818 ns/op 18840 B/op 6 allocs/op
BenchmarkReadFile/OS_32KiB-2 54651 23067 ns/op 41368 B/op 6 allocs/op
BenchmarkReadFile/OS_64KiB-2 37386 32897 ns/op 74136 B/op 6 allocs/op
BenchmarkReadFile/OS_128KiB-2 20600 67160 ns/op 139672 B/op 6 allocs/op
BenchmarkReadFile/OS_512KiB-2 5833 213830 ns/op 532888 B/op 6 allocs/op
BenchmarkReadFile/OS_1MiB-2 3462 339948 ns/op 1057176 B/op 6 allocs/op
BenchmarkReadFile/OS_10MiB-2 332 3739868 ns/op 10494360 B/op 6 allocs/op
BenchmarkReadFile/OS_100MiB-2 24 44693581 ns/op 104866200 B/op 6 allocs/op
BenchmarkReadFile/OS_1GiB-2 2 556463702 ns/op 1073750424 B/op 6 allocs/op
BenchmarkReadFile/IOURING_1KiB_-2 54560 21808 ns/op 1904 B/op 12 allocs/op
BenchmarkReadFile/IOURING_16KiB-2 39284 30109 ns/op 17264 B/op 12 allocs/op
BenchmarkReadFile/IOURING_32KiB-2 32830 34880 ns/op 33648 B/op 12 allocs/op
BenchmarkReadFile/IOURING_64KiB-2 24984 45831 ns/op 66416 B/op 12 allocs/op
BenchmarkReadFile/IOURING_128KiB-2 15892 70437 ns/op 131952 B/op 12 allocs/op
BenchmarkReadFile/IOURING_512KiB-2 5031 278818 ns/op 525169 B/op 12 allocs/op
BenchmarkReadFile/IOURING_1MiB-2 1681 725118 ns/op 1049459 B/op 12 allocs/op
BenchmarkReadFile/IOURING_10MiB-2 313 3600860 ns/op 10486647 B/op 12 allocs/op
BenchmarkReadFile/IOURING_100MiB-2 25 43821402 ns/op 104858486 B/op 12 allocs/op
BenchmarkReadFile/IOURING_1GiB-2 2 629058212 ns/op 1073742704 B/op 12 allocs/op

There is some lower byte allocations, but nothing to call home about, once you get over 1KiB.

I noticed that another open issue states that the examples directory doesn't have any examples that run faster than the system tools. Since the system tools are C related and have been optimized for years, io_uring may not be the contributing factor.

But in this case it is Go (epoll) and Go (io_uring).

If you see something here that is an obvious mistake on my part or a knowledge gap on my part, that would be great. And I completely understand if you are either no longer interested in this or don't have time to work on it. If so, I'll be glad to go away :)

Cheers and thanks for making your implementation available and open source!

@eatonphil
Copy link

eatonphil commented Oct 16, 2023

I am seeing similar slower writes using this library compared to Go's builtin WriteAt/pwrite support.

I've got a benchmark that just writes 100_000 4KB byte chunks to disk. io_uring can get close but not as good as just having 10 worker goroutines write out 4KB chunks at a time.

Source: https://github.com/eatonphil/go-io-playground

┌───────────────────────────────────────────┬───────────────────────┬────────────┐
│                  method                   │       avg_time        │ throughput │
│                  varchar                  │        varchar        │  varchar   │
├───────────────────────────────────────────┼───────────────────────┼────────────┤
│ 10_goroutines_io_uring_pwrite_1_entry     │ 0.1633701s            │ 2.5GB/s    │
│ 10_goroutines_io_uring_pwrite_100_entries │ 0.0394018s            │ 10.4GB/s   │
│ buf                                       │ 0.257824s             │ 1.5GB/s    │
│ 10_goroutines_pwrite                      │ 0.025590199999999997s │ 16.0GB/s   │
│ nobuf                                     │ 0.2549947s            │ 1.6GB/s    │
└───────────────────────────────────────────┴───────────────────────┴────────────┘

@eatonphil
Copy link

eatonphil commented Oct 17, 2023

I'm no longer able to reproduce this. I'm currently getting:

┌───────────────────────────────────────────┬──────────────────────┬────────────┐
│                  method                   │       avg_time       │ throughput │
│                  varchar                  │       varchar        │  varchar   │
├───────────────────────────────────────────┼──────────────────────┼────────────┤
│ 10_goroutines_io_uring_pwrite_1_entry     │ 0.5154280000000001s  │ 794.7MB/s  │
│ blocking                                  │ 0.2821061s           │ 1.4GB/s    │
│ 10_goroutines_pwrite                      │ 0.32600510000000005s │ 1.2GB/s    │
│ 10_goroutines_io_uring_pwrite_512_entries │ 0.35256730000000003s │ 1.1GB/s    │
└───────────────────────────────────────────┴──────────────────────┴────────────┘

io_uring is still lower than even blocking io. I'll keep looking into it.

I was able to find some more bugs in my code while porting this to Zig.

@eatonphil
Copy link

I ported my Go benchmark from Go to Zig and I get basically the same results from Zig and from Go.

So I don't think this is the fault of this library. Not that OP was saying that either necessarily.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants