Improve DecodeBuffer::repeat performance again #18
Comments
I'd like to look at this too. It sounds like a great weekend project where I can micro-optimize the hell out of it via flamegraphs and Compiler Explorer 😄 |
Sure, go for it :) It might also be worth getting some people from the std lib to look at this when you're done. Vec::extend_from_within was motivated by use cases like this crate's. I think they'd accept an implementation of it for VecDeque too. |
https://doc.rust-lang.org/src/alloc/collections/vec_deque/mod.rs.html#288 This is roughly what I had in mind, but it solves an even more general problem; I think we can do something simpler. As an inspiration for edge cases it is very likely a good source, though ;) |
I think this can be implemented completely branchless, which would probably be huge! Maybe guarding against zero-length copy_nonoverlapping calls would be even better, but that is for benchmarks on real-world workloads to show.
// input: which region do we want to copy
let len: usize = 0;
let start: usize = 0;
// data slices in raw parts
let s1_ptr: *const u8 = std::ptr::null();
let s1_len: usize = 0;
let s2_ptr: *const u8 = std::ptr::null();
let s2_len: usize = 0;
debug_assert!(len <= s1_len + s2_len);
// calc the actually wanted slices in raw parts
let start_in_s1 = usize::min(s1_len, start);
let end_in_s1 = usize::min(s1_len, start + len);
let m1_ptr = unsafe { s1_ptr.add(start_in_s1) };
let m1_len = end_in_s1 - start_in_s1;
debug_assert!(end_in_s1 <= s1_len);
debug_assert!(start_in_s1 <= s1_len);
let start_in_s2 = start.saturating_sub(s1_len);
let end_in_s2 = start_in_s2 + (len - m1_len);
let m2_ptr = unsafe { s2_ptr.add(start_in_s2) };
let m2_len = end_in_s2 - start_in_s2;
debug_assert!(start_in_s2 <= s2_len);
debug_assert!(end_in_s2 <= s2_len);
debug_assert_eq!(len, m1_len + m2_len);
// the free slices, must hold: f1_len + f2_len >= m1_len + m2_len
let f1_ptr: *mut u8 = std::ptr::null_mut();
let f1_len: usize = 0;
let f2_ptr: *mut u8 = std::ptr::null_mut();
let f2_len: usize = 0;
debug_assert!(f1_len + f2_len >= m1_len + m2_len);
// calc how many from where bytes go where
let m1_in_f1 = usize::min(m1_len, f1_len);
let m1_in_f2 = m1_len - m1_in_f1;
let m2_in_f1 = usize::min(f1_len - m1_in_f1, m2_len);
let m2_in_f2 = m2_len - m2_in_f1;
debug_assert_eq!(m1_len, m1_in_f1 + m1_in_f2);
debug_assert_eq!(m2_len, m2_in_f1 + m2_in_f2);
debug_assert!(f1_len >= m1_in_f1 + m2_in_f1);
debug_assert!(f2_len >= m1_in_f2 + m2_in_f2);
debug_assert_eq!(len, m1_in_f1 + m2_in_f1 + m1_in_f2 + m2_in_f2);
debug_assert!(m1_in_f2 == 0 || m2_in_f1 == 0); // at most one of the two "cross" copies can be non-empty
unsafe {
    f1_ptr.copy_from_nonoverlapping(m1_ptr, m1_in_f1);
    f1_ptr
        .add(m1_in_f1)
        .copy_from_nonoverlapping(m2_ptr, m2_in_f1);
    f2_ptr.copy_from_nonoverlapping(m1_ptr.add(m1_in_f1), m1_in_f2);
    f2_ptr
        .add(m1_in_f2)
        .copy_from_nonoverlapping(m2_ptr.add(m2_in_f1), m2_in_f2);
} |
If I had to guess, it's faster to check that the copy isn't zero-length than to incur the cost of calling the function, but we'll see. I'll try to find some time to work on it. |
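For illustration only, such a guard could look like the hypothetical helper below; copy_if_nonempty is not part of the crate, just a sketch of the check being discussed.
#[inline(always)]
unsafe fn copy_if_nonempty(dst: *mut u8, src: *const u8, len: usize) {
    // Skip the memcpy call entirely when there is nothing to copy,
    // at the price of one (usually well-predicted) branch.
    if len > 0 {
        dst.copy_from_nonoverlapping(src, len);
    }
}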
The function below seems to be compiled into a jump table, so computing a small integer for each case would be enough to stay branchless and get fewer memcpy invocations! This could probably be packed into an even smaller range because c2 and c3 cannot both be 1, but a range of [0..16) should be small enough, and being smarter would probably involve branching, which is what we are trying to avoid ^^
let c1 = m1_in_f1 > 0;
let c2 = m2_in_f1 > 0;
let c3 = m1_in_f2 > 0;
let c4 = m2_in_f2 > 0;
let case = (c1 as usize) | ((c2 as usize) << 1) | ((c3 as usize) << 2) | ((c4 as usize) << 3);
Then we pick the cases that are actually possible in a match and mark all others as unreachable to the compiler. Obviously this is just one of the options we need to benchmark; if a more readable version has similar performance I'd gladly use that instead.
pub fn is_jump_table(x1: *mut usize, x2: *mut usize, x3: *mut usize, y1: *const usize, y2: *const usize, y3: *const usize, len: usize, len2: usize, len3: usize, case: usize) {
    unsafe {
        match case {
            0 => {}
            1 => {
                x1.copy_from_nonoverlapping(y1, len);
            }
            2 => {
                x1.copy_from_nonoverlapping(y2, len);
                x2.copy_from_nonoverlapping(y3, len2);
            }
            3 => {
                x1.copy_from_nonoverlapping(y3, len);
                x3.copy_from_nonoverlapping(y1, len2);
            }
            15 => {
                x1.copy_from_nonoverlapping(y1, len);
                x2.copy_from_nonoverlapping(y2, len2);
                x3.copy_from_nonoverlapping(y3, len3);
            }
            _ => std::hint::unreachable_unchecked(),
        }
    }
} |
Considering how good the std |
I've sent rust-lang/rust#95904 to start improving VecDeque so that one day we'll be able to go back to it instead of having to rely on our own ring buffer forever. The improvements are very big, so it should reduce the gap caused by the missing extend_from_within implementation |
Nice! Yeah, I think it would be nice to use VecDeque once it has this API. I don't think it will take that long; it is pretty clear what the API needs to do, so stabilization should be fast. Still fun to tinker with this stuff, though :) |
I hate it when I come up with clever things and the benchmarks say no :( Anyway, at least on my machine, decoding enwik9, just doing the checks is the fastest option, coming in at about 5.7x what the real zstd is able to do. That program is irritatingly well optimized. |
Yeah |
Yeah, at this point I am just playing around. What worries me is that A) there are still bugs in this code |
I think part of the blame might be the fact that you make a new allocation instead of extending the existing one. |
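As a rough sketch of the general point (buf and extra are just illustrative names, not the benchmarked code): reallocating copies every existing byte again, while growing the existing allocation only copies the new bytes.
fn main() {
    let mut buf: Vec<u8> = vec![1, 2, 3];
    let extra: &[u8] = &[4, 5, 6];

    // A fresh allocation has to copy all previously stored bytes again.
    let mut fresh = Vec::with_capacity(buf.len() + extra.len());
    fresh.extend_from_slice(&buf);
    fresh.extend_from_slice(extra);

    // Extending in place only copies the new bytes
    // (plus the occasional amortized reallocation Vec handles internally).
    buf.reserve(extra.len());
    buf.extend_from_slice(extra);

    assert_eq!(fresh, buf);
}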
Before we forget it: this returns a null pointer if there's an error (zstd-rs/src/decoding/ringbuffer.rs, line 35 in d225dff)
|
That is checked in the line below. But you are right, this should either panic or return an error in the case that |
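For reference, the usual shape of such a check is sketched below; this is an assumption about what the allocation path could look like, not the crate's actual code, and allocate_buffer is a hypothetical name.
use std::alloc::{alloc, handle_alloc_error, Layout};

fn allocate_buffer(capacity: usize) -> *mut u8 {
    // Assumes capacity > 0: allocating with a zero-sized layout is undefined behaviour.
    let layout = Layout::array::<u8>(capacity).expect("capacity overflows Layout");
    // SAFETY: layout has a non-zero size.
    let ptr = unsafe { alloc(layout) };
    if ptr.is_null() {
        // Diverges with a proper allocation-failure report instead of
        // handing a null pointer back to the caller.
        handle_alloc_error(layout);
    }
    ptr
}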
FYI smoltcp has a working ring buffer implementation, too: https://github.com/smoltcp-rs/smoltcp/blob/master/src/storage/ring_buffer.rs |
For anyone interested, work on this is happening in the ringbuffer branch |
The EDIT: it will be in tomorrow's nightly (2022-04-28) |
That's just for the extend, not the extend_from_within operation, right? |
Right |
But it does make |
Benchmarks on a big file with a high compression ratio
|
That's very nice! I am actually a bit surprised that VecDeque did not have this optimization yet; I think you did a lot of people a big favor with that. Nonetheless, I'd say your benchmarks speak for keeping the custom ringbuffer. A 1.2x speedup is pretty significant in my opinion. Do you plan on doing an extend_from_within for VecDeque? |
Yes. I've already looked into it and it doesn't seem like the implementation would be too hard. |
I'm trying to see if there's a way of doing it without having to implement
This got me really curious about how the memcpy optimizer works in LLVM. I'll take a look at https://github.com/llvm/llvm-project/blob/main/llvm/lib/Transforms/Scalar/MemCpyOptimizer.cpp |
I may have found a way to convince LLVM, if only it reliably eliminated |
This is a pretty interesting thing: https://github.com/gnzlbg/slice_deque/ I don't think we should use it (it seems to rely heavily on Linux syscalls), but I thought anyone interested in this issue might also be interested in that crate. It looks very cool; I might play with it out of interest and see how it performs. |
I merged the work from the ringbuffer branch into master. It will need more testing before I feel comfortable making a new release with it, but I think (at least for now) this is the way to go. |
With the pull request from @paolobarbolini, draining the decode buffer has improved a lot (it no longer needs the memmove that was part of Vec::drain(..)). As a consequence we now have a 2x increase in memcpys in the sequence execution.
It should be possible to get the best of both worlds. I will have to implement a custom ringbuffer though.
Most operations this needs to support are pretty trivial. The only non-trivial part is the optimization we need in the sequence execution (which is implemented in DecodeBuffer::repeat).
We need an equivalent of Vec's extend_from_within, which is a bit harder to do for a ring buffer.
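As a minimal illustration of the semantics we need (assuming nothing about the eventual ringbuffer API), this is the copy-from-self-and-append operation on a contiguous Vec. In a ring buffer both the source range and the free space can be split by the wrap-around, which is why the same operation turns into up to four copy_nonoverlapping calls, as sketched in the comments above.
fn main() {
    let mut decoded: Vec<u8> = b"abcdef".to_vec();
    // Copy the 4 bytes at offsets 2..6 and append them at the end of the buffer.
    decoded.extend_from_within(2..6);
    assert_eq!(decoded, b"abcdefcdef");
}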