Excessive futex syscalling with no-mmap #54

Closed
riptl opened this issue Feb 3, 2020 · 18 comments

@riptl commented Feb 3, 2020

Problem:
b3sum --no-mmap is using a tiny read buffer (8 KB) and calls read(2) and futex(2) aggressively. I find that hashing at only 110 MB/s maxes out all 24 cores, doing about 10k syscalls per second (measured with tail -f strace.log | pv -l >/dev/null).

top

CPU% MEM%   TIME+  Command                                                                                                                                                                                                           
2025  0.0 22:58.81 b3sum --no-mmap idsab.tar.zst

strace.log

futex(0x563438b1804c, FUTEX_WAIT_PRIVATE, 0, NULL) = -1 EAGAIN (Resource temporarily unavailable)
futex(0x563438b17ff0, FUTEX_WAKE_PRIVATE, 1) = 0
read(3, "\3\352\t\225\213r\371\3421\231\361\303\306\237\252\362\274\212>\33\t\f&\240u\31\252\331\310\242\265T"..., 8192) = 8192
futex(0x563438b14028, FUTEX_WAKE_PRIVATE, 2147483647) = 2
futex(0x563438b13fd0, FUTEX_WAKE_PRIVATE, 1) = 1
read(3, "9\22\373\5\17\357\35\0364\35\371{\324\220\275J~B0$\273\332\264\20\v\352\304\23\301l\316\320"..., 8192) = 8192
futex(0x563438b18048, FUTEX_WAIT_PRIVATE, 0, NULL) = -1 EAGAIN (Resource temporarily unavailable)
futex(0x563438b17ff0, FUTEX_WAKE_PRIVATE, 1) = 0
read(3, "\225/\247\216\367Q\375,E\357\305\367\323\321\2\223f\266\n\222\257\305\260qLJ\f\247\370\202\6\306"..., 8192) = 8192
futex(0x563438b1402c, FUTEX_WAKE_PRIVATE, 2147483647) = 1
futex(0x563438b13fd0, FUTEX_WAKE_PRIVATE, 1) = 1
read(3, "7\365\30n\370K\f\265\200\363r7\35E\321\347_\226\351-\331\24\232\200\352\233\316W\20\2121\264"..., 8192) = 8192
futex(0x563438b14028, FUTEX_WAKE_PRIVATE, 2147483647) = 2
futex(0x563438b13fd0, FUTEX_WAKE_PRIVATE, 1) = 1
read(3, "\234\235\250\352\315\343*\t\215\205o(\231\230\325\f\222\257o\302A3\315z\240}\364'\327\33Wj"..., 8192) = 8192
... and so on

Environment:

  • Built with cargo install b3sum (as of 2020-02-03)
  • Stable cargo 1.41.0 (626f0f40e 2019-12-03)
  • ZFS on Linux
  • Ubuntu 20.04
  • Kernel 5.4.0-9-generic
@riptl (Author) commented Feb 3, 2020

Possibly related: #32 (should ask the OP about CPU usage)

@oconnor663 (Member)

Note that 8 KiB is the default buffer size of std::io::copy. Not sure about the locking off the top of my head.

@oconnor663 (Member)

What size is your test file? And how does that compare to your available RAM?

@riptl (Author) commented Feb 3, 2020

@oconnor663 My test file is about 100 GB and my RAM is 128 GB. Page cache or not, b3sum will still do thousands of syscalls per second, destroying performance. Perhaps put a simple https://doc.rust-lang.org/std/io/struct.BufReader.html in between the file and the hasher? Not optimal, since the bytes are copied twice, but it would solve the problem.
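
Something along these lines (an untested sketch; the 1 MiB buffer capacity is an arbitrary choice of mine):

use std::fs::File;
use std::io::{self, BufReader};

fn hash_file(path: &str) -> io::Result<blake3::Hash> {
    let file = File::open(path)?;
    // A larger userspace buffer means each read(2) moves more bytes.
    let mut reader = BufReader::with_capacity(1 << 20, file);
    // blake3::Hasher implements std::io::Write, so io::copy works directly.
    let mut hasher = blake3::Hasher::new();
    io::copy(&mut reader, &mut hasher)?;
    Ok(hasher.finalize())
}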

@riptl (Author) commented Feb 3, 2020

Note that 8 KiB is the default buffer size of std::io::copy.

Can you point me to the source? I was looking for an issue like this. IMO this should be filed as a Rust issue; no stdlib should behave that badly.

@riptl (Author) commented Feb 4, 2020

@oconnor663 Sorry, it seems my diagnosis was wrong. 8 KB reads are totally fine (no significant CPU load):

dd if=idsac.tar.zst of=/dev/null bs=8192
1294801+0 records in
1294800+0 records out
10607001600 bytes (11 GB, 9.9 GiB) copied, 46.4309 s, 228 MB/s

I think the big problem is the futex spam.

riptl changed the title from "Excessive syscalling" to "Excessive futex syscalling with no-mmap" on Feb 4, 2020
@oconnor663 (Member) commented Feb 4, 2020

I'm not seeing any futex calls at all on my machine, using a file that's bigger than RAM (though not as big as yours):

$ ls -lh /var/tmp/big_f
-rw-r--r-- 1 jacko jacko 8.0G Feb  3 21:58 /var/tmp/big_f
$ strace b3sum /var/tmp/big_f --no-mmap 2>&1 | wc -l
1041464
$ strace b3sum /var/tmp/big_f --no-mmap 2>&1 | grep futex | wc -l
0

I do see futex calls if I re-enable mmapping, though there are very few system calls overall:

$ strace b3sum /var/tmp/big_f 2>&1 | wc -l             
181
$ strace b3sum /var/tmp/big_f 2>&1 | grep futex | wc -l
49

Could you give me a complete series of commands that gives the result you saw? Does it only happen for gigantic files, or can you get a repro with something smaller?

@riptl (Author) commented Feb 4, 2020

I think I found something: I don't get any futex calls when compiling without default features. Maybe the rayon crate has something to do with this?

# mkdir default no-default-feat
# cargo install --root=default/ b3sum
# cargo install --root=no-default-feat/ --no-default-features b3sum
# dd if=/dev/zero bs=1M count=10 2>/dev/null | strace ./default/bin/b3sum 2>&1 | grep futex | wc -l
2566
# dd if=/dev/zero bs=1M count=10 2>/dev/null | strace ./no-default-feat/bin/b3sum 2>&1 | grep futex | wc -l
0

Summary of environments with confirmed futex spam:

  • Ubuntu 18.04, 19.10, and 20.04
  • The rust Docker image
  • Pipes, ext4, and ZFS
  • A machine with 2 NUMA nodes and a single-processor one
  • Latest Rust stable, latest nightly, and a nightly build from 2019-11
  • All of the above with cargo install b3sum

Does it only happen for gigantic files, or can you get a repro with something smaller?

Test commands fyi:

$ dd if=/dev/zero bs=1K count=1000 2>/dev/null | strace b3sum 2>&1 | grep futex | wc -l
252

$ dd if=/dev/zero bs=1M count=1 2>/dev/null | strace b3sum 2>&1 | grep futex | wc -l
257

$ dd if=/dev/zero bs=1M count=10 2>/dev/null | strace b3sum 2>&1 | grep futex | wc -l
2560

$ dd if=/dev/zero bs=1M count=100 2>/dev/null | strace b3sum 2>&1 | grep futex | wc -l
25664

EDIT: Benchmarking shows a performance improvement, too:

$ dd if=/dev/zero bs=1M count=1000 | b3sum
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB, 1000 MiB) copied, 2.23411 s, 469 MB/s

$ dd if=/dev/zero bs=1M count=1000 | b3sum_no_default_features
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB, 1000 MiB) copied, 0.928055 s, 1.1 GB/s

@riptl (Author) commented Feb 4, 2020

The default b3sum binary maxes out all available processor cores, with a lot of kernel time. The single-threaded version uses only a single core, with very little kernel time. It seems that multithreading (probably via rayon) is detrimental to performance when used with small buffer sizes (8 KB).
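
As a toy illustration (my own micro-test, not b3sum code): pushing tiny work units through rayon::join pays thread-pool wakeup costs that can dwarf the work itself.

use std::time::Instant;

fn main() {
    let buf = [1u8; 8192];
    let start = Instant::now();
    for _ in 0..100_000 {
        // Splitting every 8 KB buffer across the pool wakes sleeping
        // workers (that's the futex traffic) for only ~4 KB of work each.
        let (a, b) = rayon::join(
            || buf[..4096].iter().map(|&x| x as u64).sum::<u64>(),
            || buf[4096..].iter().map(|&x| x as u64).sum::<u64>(),
        );
        assert_eq!(a + b, 8192);
    }
    println!("100k tiny joins: {:?}", start.elapsed());
}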

@oconnor663 (Member)

Hmm, normally with --no-mmap and the 8 KiB buffer size, rayon::join should never get invoked. That buffer size is just enough to feed AVX2 on a single thread. But... I wonder if your processor might not support AVX2? That could explain some weirdness here. What's the output of cat /proc/cpuinfo?

@oconnor663 (Member)

Specifically, if you have simd_degree() * CHUNK_LEN bytes or less, we do this instead of this.
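
Working through the numbers (CHUNK_LEN is 1024 bytes; AVX2 compresses 8 chunks in parallel, SSE4.1 only 4):

with AVX2:    threshold = 8 * 1024 = 8192 bytes, so an 8 KiB buffer stays on the serial path
without AVX2: threshold = 4 * 1024 = 4096 bytes, so every 8 KiB buffer recurses into rayon::join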

@sneves (Collaborator) commented Feb 5, 2020

Hardcoding simd_degree() to 4 does result in an enormous amount of futex calls, as reported.

@oconnor663 (Member)

That must be it. I think the right answer here is to figure out a solution to #25 and then make the std::io::copy path explicitly single-threaded. Nothing technically difficult; it's just a question of the right Rust API for controlling which join implementation gets used.
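
Rough shape of what I have in mind (a sketch, not a final API): make the tree recursion generic over a Join trait, with a serial implementation and a rayon one, so b3sum can pick the serial path explicitly.

pub trait Join {
    fn join<A, B, RA, RB>(oper_a: A, oper_b: B) -> (RA, RB)
    where
        A: FnOnce() -> RA + Send,
        B: FnOnce() -> RB + Send,
        RA: Send,
        RB: Send;
}

/// Runs both closures on the calling thread, in order.
pub enum SerialJoin {}

impl Join for SerialJoin {
    fn join<A, B, RA, RB>(oper_a: A, oper_b: B) -> (RA, RB)
    where
        A: FnOnce() -> RA + Send,
        B: FnOnce() -> RB + Send,
        RA: Send,
        RB: Send,
    {
        (oper_a(), oper_b())
    }
}

/// Delegates to the rayon thread pool.
pub enum RayonJoin {}

impl Join for RayonJoin {
    fn join<A, B, RA, RB>(oper_a: A, oper_b: B) -> (RA, RB)
    where
        A: FnOnce() -> RA + Send,
        B: FnOnce() -> RB + Send,
        RA: Send,
        RB: Send,
    {
        rayon::join(oper_a, oper_b)
    }
}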

@riptl (Author) commented Feb 5, 2020

But...I wonder if your processor might not support AVX2? That could explain some weirdness here. What's the output of cat /proc/cpuinfo?

This is the one thing I didn't consider :D No, neither of the machines I tested supports AVX2.

@oconnor663 (Member)

No worries. This is a dumb bug, and I'm ashamed of it :-D

oconnor663 added a commit that referenced this issue Feb 6, 2020
This is a new interface that allows the caller to provide a
multi-threading implementation. It's defined in terms of a new `Join`
trait, for which we provide two implementations, `SerialJoin` and
`RayonJoin`. This lets the caller control when multi-threading is used,
rather than the previous all-or-nothing design of the "rayon" feature.

Although existing callers should keep working, this is a compatibility
break, because callers who were relying on automatic multi-threading
before will now be single-threaded. Thus the next release of this crate
will need to be version 0.2.

See #25 and
#54.
@oconnor663 (Member)

I just landed fc219f4 above. It adds a generic way to control multi-threading, and it also fixes b3sum so that it's always single-threaded when --no-mmap is used. (Similarly, when standard input is used.) Could you try building at that commit and see if it fixes your issue?

Tests are failing on the latest nightly compiler at the moment because of rust-lang/rust#68905. Hopefully that'll get fixed soon. In the meantime, you might need to stick to stable or beta.

@riptl (Author) commented Feb 9, 2020

Could you try building at that commit and see if it fixes your issue?

@oconnor663 No more futex calls with i7-3770 on latest master! Thanks for the quick fix.

riptl closed this as completed on Feb 9, 2020
@oconnor663 (Member)

Awesome. This change will be part of our 0.2 release. There's some unrelated work to finish before we ship it, but hopefully it'll land sometime in the next few days.
