Improve DecodeBuffer::repeat performance again #18

Closed
KillingSpark opened this issue Apr 5, 2022 · 28 comments

Comments

@KillingSpark
Owner

With the pull request from @paolobarbolini the draining of the decodebuffer has improved a lot (it no longer needs the memmove that was part of Vec::drain(..)). As a consequence we now have a 2x increase in memcpys in the sequence execution.

It should be possible to get the best of both worlds. I will have to implement a custom ringbuffer though.

Most operations this needs to support are pretty trivial:

  1. push()
  2. extend()
  3. len()/capacity()
  4. get_slices()
  5. clear()
  6. reserve()
  7. as_slices()
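
A rough skeleton of such a ringbuffer, just to fix the vocabulary (this is a sketch; the Vec backing, the head/tail layout and all names are assumptions, not the final implementation). The key property is that the live data occupies at most two contiguous regions, which is what as_slices() exposes and what extend_from_within below has to work with:

struct RingBuffer {
    buf: Vec<u8>,
    head: usize, // index of the oldest byte
    tail: usize, // one past the newest byte, may be smaller than head after wrapping
}

impl RingBuffer {
    fn len(&self) -> usize {
        if self.tail >= self.head {
            self.tail - self.head
        } else {
            self.buf.len() - self.head + self.tail
        }
    }

    fn as_slices(&self) -> (&[u8], &[u8]) {
        if self.tail >= self.head {
            // data is contiguous, the second slice is empty
            (&self.buf[self.head..self.tail], &[])
        } else {
            // data wraps around the end of the allocation
            (&self.buf[self.head..], &self.buf[..self.tail])
        }
    }

    // push(), extend(), capacity(), clear() and reserve() are the usual
    // index/Vec bookkeeping; a real implementation also has to disambiguate
    // the "empty" and "completely full" cases (e.g. by never filling up entirely).
}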

The only non-trivial part is the optimization we need in the sequence execution (which is implemented in DecodeBuffer::repeat).
We need an equivalent of Vec's extend_from_within, which is a bit harder to do for a ringbuffer.

fn extend_from_within(&mut self, offset: usize, match_length: usize) {
    // This is expected to not need to do anything after the first few calls.
    // The decodebuffer should grow to a certain capacity
    // at which it can hold all the decoded data until the next drain.
    self.reserve(match_length);

    let (s1, s2) = self.as_slices();
    let (match1, match2) = figure_out_which_regions_need_copying(offset, match_length, s1, s2);

    let (f1, f2) = self.free_capacity();
    // Now the complicated part that I am too lazy to prototype this evening:
    // copy match1 and match2 into f1 and f2.
    // With the call to reserve these are guaranteed to have enough space to hold them.
    // In the expected case all that happens here is a memcpy from match1 into f1 and we are done.
    //
    // But since this is a ringbuffer we need to deal with the occasional fragmentation of these regions.
    // In the worst case this results in 4 memcpys, but that should be quite rare.
}
@paolobarbolini
Contributor

paolobarbolini commented Apr 5, 2022

I'll take a look at this too. It sounds like a great weekend project where I can micro-optimize the hell out of it via flamegraphs and Compiler Explorer 😄

@KillingSpark
Owner Author

Sure go for it :)

Might also be worth getting some people from the std lib to look at this when you're done. Vec::extend_from_within was motivated by usecases like this crate. I think they'd accept an implementation of it for VecDeque too.
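
For context, this is roughly the pattern the Vec-backed buffer relies on today and that a VecDeque/ringbuffer equivalent would have to match (a sketch with assumed names, not the crate's actual code; the chunking is needed for overlapping matches, because extend_from_within requires its source range to be in bounds at the time of the call):

// `offset` is the match offset measured back from the end of the decoded data.
fn repeat(buffer: &mut Vec<u8>, offset: usize, mut match_length: usize) {
    debug_assert!(offset > 0 && offset <= buffer.len());
    while match_length > 0 {
        let start = buffer.len() - offset;
        // at most `offset` bytes at a time, so the source range is always in bounds
        let chunk = usize::min(match_length, offset);
        buffer.extend_from_within(start..start + chunk);
        match_length -= chunk;
    }
}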

@KillingSpark
Owner Author

https://doc.rust-lang.org/src/alloc/collections/vec_deque/mod.rs.html#288

This is roughly what I had in mind, but I think it solves an even more general problem. I think we can do something simpler. But as an inspiration for edge cases this is very likely a good source ;)

@KillingSpark
Owner Author

KillingSpark commented Apr 7, 2022

I think this can be implemented completely branchless, which would probably be huge!

Maybe checks against zero-length copy_nonoverlapping calls could be better, but that is for benchmarks on real-world workloads to show.

    // input: which region do we want to copy
    let len: usize = 0;
    let start: usize = 0;

    // data slices in raw parts
    let s1_ptr: *const u8 = std::ptr::null();
    let s1_len: usize = 0;
    let s2_ptr: *const u8 = std::ptr::null();
    let s2_len: usize = 0;

    debug_assert!(len <= s1_len + s2_len);

    // calc the actually wanted slices in raw parts
    let start_in_s1 = usize::min(s1_len, start);
    let end_in_s1 = usize::min(s1_len, start + len);
    let m1_ptr = unsafe { s1_ptr.add(start_in_s1) };
    let m1_len = end_in_s1 - start_in_s1;

    debug_assert!(end_in_s1 <= s1_len);
    debug_assert!(start_in_s1 <= s1_len);

    let start_in_s2 = start.saturating_sub(s1_len);
    let end_in_s2 = start_in_s2 + (len - m1_len);
    let m2_ptr = unsafe { s2_ptr.add(start_in_s2) };
    let m2_len = end_in_s2 - start_in_s2;

    debug_assert!(start_in_s2 <= s2_len);
    debug_assert!(end_in_s2 <= s2_len);

    debug_assert_eq!(len, m1_len + m2_len);

    // the free slices, must hold: f1_len + f2_len >= m1_len + m2_len
    let f1_ptr: *mut u8 = std::ptr::null_mut();
    let f1_len: usize = 0;
    let f2_ptr: *mut u8 = std::ptr::null_mut();
    let f2_len: usize = 0;

    debug_assert!(f1_len + f2_len >= m1_len + m2_len);

    // calc how many from where bytes go where
    let m1_in_f1 = usize::min(m1_len, f1_len);
    let m1_in_f2 = m1_len - m1_in_f1;
    let m2_in_f1 = usize::min(f1_len - m1_in_f1, m2_len);
    let m2_in_f2 = m2_len - m2_in_f1;

    debug_assert_eq!(m1_len, m1_in_f1 + m1_in_f2);
    debug_assert_eq!(m2_len, m2_in_f1 + m2_in_f2);
    debug_assert!(f1_len >= m1_in_f1 + m2_in_f1);
    debug_assert!(f2_len >= m1_in_f2 + m2_in_f2);
    debug_assert_eq!(len, m1_in_f1 + m2_in_f1 + m1_in_f2 + m2_in_f2);

    // m1 can only spill into f2 if it completely fills f1, so these can never both be non-zero
    debug_assert!(!(m1_in_f2 > 0 && m2_in_f1 > 0));

    unsafe {
        f1_ptr.copy_from_nonoverlapping(m1_ptr, m1_in_f1);
        f1_ptr
            .add(m1_in_f1)
            .copy_from_nonoverlapping(m2_ptr, m2_in_f1);

        f2_ptr.copy_from_nonoverlapping(m1_ptr.add(m1_in_f1), m1_in_f2);
        f2_ptr
            .add(m1_in_f2)
            .copy_from_nonoverlapping(m2_ptr.add(m2_in_f1), m2_in_f2);
    }

@paolobarbolini
Contributor

If I had to guess, it's faster to check that the copy isn't zero length than to incur the cost of calling the function, but we'll see. I'll try to find some time to work on it.
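
For reference, the guarded variant of the final copy block would look roughly like this (a sketch reusing the m*/f* raw parts from the snippet above):

// Skip zero-length copies instead of unconditionally calling copy_from_nonoverlapping.
unsafe {
    if m1_in_f1 > 0 {
        f1_ptr.copy_from_nonoverlapping(m1_ptr, m1_in_f1);
    }
    if m2_in_f1 > 0 {
        f1_ptr.add(m1_in_f1).copy_from_nonoverlapping(m2_ptr, m2_in_f1);
    }
    if m1_in_f2 > 0 {
        f2_ptr.copy_from_nonoverlapping(m1_ptr.add(m1_in_f1), m1_in_f2);
    }
    if m2_in_f2 > 0 {
        f2_ptr.add(m1_in_f2).copy_from_nonoverlapping(m2_ptr.add(m2_in_f1), m2_in_f2);
    }
}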

@KillingSpark
Owner Author

KillingSpark commented Apr 8, 2022

The function below seems to be compiled into a jump table, so coming up with a small integer for each case would be enough to stay branchless and get fewer memcpy invocations!

This could probably be converted to target even smaller integers because c2 and c3 cannot both be 1, but having a range of [0..16) should be small enough. And being smarter would probably involve branching, which is what we are trying to avoid ^^

let c1 = m1_in_f1 > 0;
let c2 = m2_in_f1 > 0;
let c3 = m1_in_f2 > 0;
let c4 = m2_in_f2 > 0;
let case = (c1 as usize) | ((c2 as usize) << 1) | ((c3 as usize) << 2) | ((c4 as usize) << 3);

Then we pick the ones that are possible in a match and mark all others as unreachable to the compiler.

Obviously this is just one of the options we need to benchmark. If a more readable version has similar performance I'd gladly use that instead.

pub fn is_jump_table(x1: *mut usize, x2: *mut usize, x3: *mut usize, y1: *const usize, y2: *const usize, y3: *const usize, len: usize, len2: usize, len3: usize, case: usize){
    unsafe {
        match case {
            0 => {}
            1 => {
                x1.copy_from_nonoverlapping(y1, len);
            }
            2 => {
                x1.copy_from_nonoverlapping(y2, len);
                x2.copy_from_nonoverlapping(y3, len2);
            }
            3 => {
                x1.copy_from_nonoverlapping(y3, len);
                x3.copy_from_nonoverlapping(y1, len2);
            }
            15 => {
                x1.copy_from_nonoverlapping(y1, len);
                x2.copy_from_nonoverlapping(y2, len2);
                x3.copy_from_nonoverlapping(y3, len3);
            }
            _ => { std::hint::unreachable_unchecked() }
        }
    }
}

@paolobarbolini
Contributor

paolobarbolini commented Apr 10, 2022

Considering how good the std VecDeque is, I wonder how long it'd take to get VecDeque::extend_from_within accepted as an unstable API and subsequently stabilized 🤔. Anyway, I've started improving the implementation and I'll send a PR in the following days.

@paolobarbolini
Contributor

paolobarbolini commented Apr 11, 2022

I've sent rust-lang/rust#95904 to start improving VecDeque so that one day we'll be able to go back to it instead of having to rely on our own ring buffer forever. The improvements are very big, so they should reduce the gap caused by the missing extend_from_within implementation.

@KillingSpark
Owner Author

KillingSpark commented Apr 11, 2022

Nice! Yeah, I think it would be nice to use VecDeque once it has this API. I don't think it will take that long: it is pretty clear what the API needs to do, so stabilization should be fast.

Still fun to tinker with this stuff tho :)

@KillingSpark
Owner Author

KillingSpark commented Apr 11, 2022

I hate when I come up with clever things and benchmarks say no :(

Anyways, at least on my machine and decoding enwik9, just doing the checks is the fastest option.

It comes in at about 5.7x the time the real zstd takes. That program is irritatingly well optimized.

@paolobarbolini
Contributor

paolobarbolini commented Apr 12, 2022

Yeah, #[inline(always)] is usually a bad idea (I think I saw you adding those from the GH feed). On the opposite side, #[cold] and #[inline(never)] can help a lot sometimes 😃
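
A tiny illustration of that pattern (hypothetical functions, not code from this crate): keep the rare slow path behind #[cold]/#[inline(never)] so the hot path stays small and inline-friendly:

#[cold]
#[inline(never)]
fn grow(buf: &mut Vec<u8>, additional: usize) {
    // rare path: reallocating the buffer
    buf.reserve(additional);
}

#[inline]
fn ensure_capacity(buf: &mut Vec<u8>, additional: usize) {
    // hot path: a single comparison, the cold call is kept out of line
    if buf.capacity() - buf.len() < additional {
        grow(buf, additional);
    }
}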

@KillingSpark
Owner Author

KillingSpark commented Apr 12, 2022

Yeah at this point I am just playing around.

What worries me is that:

A) there are still bugs in this code
B) this is less performant than the version implemented with Vec (at least on my machine, with enwik9: ~5.5x vs ~4.6x slower)

@paolobarbolini
Contributor

This is less performant than the version implemented with Vec (at least on my machine, with enwik9: ~5.5x vs ~4.6x slower)

I think part of the blame might be the fact that you make a new allocation instead of extending the existing one.

@paolobarbolini
Contributor

paolobarbolini commented Apr 13, 2022

Before we forget it: this returns a null pointer if there's an error

let new_buf = unsafe { std::alloc::alloc(new_layout) };

@KillingSpark
Owner Author

Before we forget it: this returns a null pointer if there's an error

That is checked in the line below. But you are right, this should either panic or return an error in the case that new_buf == ptr::null_mut()
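
A minimal sketch of the panic/abort option, using the standard allocation-error hook (the function and its name are illustrative, not the crate's code):

use std::alloc::{alloc, handle_alloc_error, Layout};

// Divert to the global allocation-error hook (which aborts by default),
// so callers never observe a null pointer. new_layout must have a non-zero
// size, as required by alloc().
unsafe fn alloc_or_abort(new_layout: Layout) -> *mut u8 {
    let new_buf = alloc(new_layout);
    if new_buf.is_null() {
        handle_alloc_error(new_layout); // diverges
    }
    new_buf
}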

@pothos

pothos commented Apr 20, 2022

FYI smoltcp has a working ring buffer implementation, too: https://github.com/smoltcp-rs/smoltcp/blob/master/src/storage/ring_buffer.rs

@KillingSpark
Owner Author

For anyone interested, work on this is happening in the ringbuffer branch

@paolobarbolini
Contributor

paolobarbolini commented Apr 27, 2022

The VecDeque::extend optimization is moving forward. I hope it lands in 1.62 (rust-lang/rust#95904). The benchmarks are promising.

EDIT: it will be in tomorrow's nightly (2022-04-28)

@KillingSpark
Owner Author

That's just for the extend, not the extend_from_within operation, right?

@paolobarbolini
Contributor

That's just for the extend, not the extend_from_within operation, right?

Right

@paolobarbolini
Contributor

paolobarbolini commented Apr 28, 2022

But it does make VecDeque::extend actually do a memcpy instead of a very slow unoptimized byte-by-byte copy.

@paolobarbolini
Contributor

paolobarbolini commented Apr 29, 2022

Benchmarks on a big file with a high compression ratio

zstd-rs commit              | nightly version                             | results | time spent doing memcpy
f604d48 (Vec)               | 2022-04-28                                  | 6.5s    | 77%
66d02e4 (VecDeque)          | 2022-04-27                                  | 2.8s    | 40% (including all of the time spent in the non-optimized Extend impl)
66d02e4 (VecDeque)          | 2022-04-28 (thanks to rust-lang/rust#95904) | 2.2s    | 22%
25cf7a6 #23 (VecDeque)      | 2022-04-28                                  | 2.18s   | 22%
30e65d7 (custom ringbuffer) | 2022-04-28                                  | 1.8s    | 27%
zstd 1.5.2                  | -                                           | 0.5s    | Untested

@KillingSpark
Owner Author

That's very nice! I am actually a bit surprised that VecDeque did not have this optimization yet. I think you did a lot of people a big favor with that.

Nonetheless I'd say your benchmarks speak towards keeping the custom ringbuffer. A 1.2x speedup is pretty significant in my opinion. Do you plan on doing an extend_from_within for VecDeque?

@paolobarbolini
Contributor

Do you plan on doing an extend_from_within for VecDeque?

Yes. I've already looked into it and it doesn't seem like the implementation would be too hard.

@paolobarbolini
Contributor

paolobarbolini commented May 1, 2022

I'm trying to see if there's a way of doing it without having to implement extend_from_within in VecDeque, by relying on LLVM to optimize away useless intermediate memcpys, but it doesn't work.

This got me really curious about how the memcpy optimizer works in LLVM. I'll take a look at https://github.com/llvm/llvm-project/blob/main/llvm/lib/Transforms/Scalar/MemCpyOptimizer.cpp

@paolobarbolini
Contributor

I may have found a way to convince LLVM, if only it reliably eliminated memcpys of uninit memory 😞

https://rust.godbolt.org/z/sef536cPn

@KillingSpark
Owner Author

This is a pretty interesting thing: https://github.com/gnzlbg/slice_deque/

I don't think we should use that (because they seem to rely heavily on Linux syscalls), but I thought anyone interested in this issue might also be interested in this crate. It looks very cool. I might play with it out of interest and see how it performs.

@KillingSpark
Owner Author

I merged the work from the ringbuffer branch into master. It will need more testing before I feel comfortable making a new release with it, but I think (at least for now) this is the way to go.
