RFC Decode/Encode with multiple bytes per chunk #84

lwus · 2022-05-11T07:43:15Z

I haven't spent any time thinking about the API, so I'm looking for guidance on where the library wants to go with it. This is still an O(n^2) algorithm, but with a better constant factor (see https://gmplib.org/manual/Radix-to-Binary). We could potentially implement a better algorithm for very large numbers, but that's overkill for my use case.
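For readers following along, here is a minimal sketch of the multiple-digits-per-chunk idea, not the code in this PR. It assumes the Bitcoin alphabet, little-endian `u32` limbs, and 5 digits per chunk (chosen because 58^5 fits in a `u32`); the names `ALPHABET`, `digit`, and `decode_chunked` are hypothetical.

```rust
/// Bitcoin base58 alphabet (assumed here; the crate supports others).
const ALPHABET: &[u8; 58] = b"123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz";

/// Map one ASCII byte to its base58 digit value (linear scan, for brevity).
fn digit(c: u8) -> Option<u64> {
    ALPHABET.iter().position(|&a| a == c).map(|p| p as u64)
}

/// Decode base58 text, absorbing up to 5 digits per pass over the limbs
/// instead of one digit per pass.
fn decode_chunked(input: &str) -> Option<Vec<u8>> {
    let bytes = input.as_bytes();
    // Each leading '1' encodes one leading zero byte.
    let zeros = bytes.iter().take_while(|&&c| c == b'1').count();
    // Big number stored as little-endian u32 limbs.
    let mut limbs: Vec<u32> = Vec::new();
    for chunk in bytes[zeros..].chunks(5) {
        // Fold the chunk into a single value and the matching power of 58.
        let mut value: u64 = 0;
        let mut mul: u64 = 1;
        for &c in chunk {
            value = value * 58 + digit(c)?;
            mul *= 58;
        }
        // limbs = limbs * mul + value; mul <= 58^5 < 2^30, so u64 cannot overflow.
        let mut carry = value;
        for limb in limbs.iter_mut() {
            let t = u64::from(*limb) * mul + carry;
            *limb = t as u32; // low 32 bits
            carry = t >> 32; // high bits move to the next limb
        }
        if carry != 0 {
            limbs.push(carry as u32);
        }
    }
    // Serialize big-endian, skipping the zero padding of the top limb.
    let mut out = vec![0u8; zeros];
    let mut started = false;
    for limb in limbs.iter().rev() {
        for b in limb.to_be_bytes() {
            if started || b != 0 {
                started = true;
                out.push(b);
            }
        }
    }
    Some(out)
}
```

The inner loop still touches every limb once per chunk, so the cost stays quadratic, but it runs roughly 5 x 4 = 20 times fewer limb updates than a one-digit, one-byte loop.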

Benchmarks on my M1 laptop are roughly:

  32_bytes/decode_bs58          time:   [414.08 ns 415.13 ns 416.33 ns]
  32_bytes/decode_bs58_unsafe   time:   [99.003 ns 99.456 ns 99.953 ns]
  256_bytes/decode_bs58         time:   [26.940 us 27.022 us 27.118 us]
  256_bytes/decode_bs58_unsafe  time:   [1.3279 us 1.3282 us 1.3286 us]
...
  256_bytes/encode_bs58_vec         time:   [86.649 us 86.698 us 86.763 us]
  256_bytes/encode_bs58_vec_unsafe  time:   [3.5615 us 3.5681 us 3.5762 us]
  32_bytes/encode_bs58_vec          time:   [1.1578 us 1.1612 us 1.1649 us]
  32_bytes/encode_bs58_vec_unsafe   time:   [236.90 ns 237.01 ns 237.13 ns]

src/decode.rs

Nemo157 · 2022-05-11T10:29:12Z

The speedup is nice, but I wouldn't want to introduce any additional unsafe code to get it. I wonder what the speedup would be like using just bytemuck::pod_align_to_mut and not changing the allocation size. That would allow applying the faster algorithm for the inner portion of a longer run (or the whole thing if the allocation happens to be aligned) while maintaining the current API.
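To illustrate the prefix/middle/suffix split being suggested, here is a small dependency-free sketch. It uses the standard library's `slice::align_to_mut`, which returns the same three-part shape as `bytemuck::pod_align_to_mut` but requires an `unsafe` block; with bytemuck the call would be safe. The function `fill_split` is hypothetical and only fills memory, standing in for "byte-wise work on the edges, word-wise work on the aligned inner portion".

```rust
/// Fill `buf` with `byte`, working word-at-a-time on the aligned inner
/// portion and byte-at-a-time on the unaligned edges.
/// Returns (prefix bytes, middle words, suffix bytes).
fn fill_split(buf: &mut [u8], byte: u8) -> (usize, usize, usize) {
    // SAFETY: every bit pattern is a valid u32 and u8 has no padding or
    // invalid values, so viewing an aligned run of bytes as u32 is sound.
    let (prefix, middle, suffix) = unsafe { buf.align_to_mut::<u32>() };
    prefix.fill(byte); // unaligned head, one byte at a time
    middle.fill(u32::from_ne_bytes([byte; 4])); // aligned words
    suffix.fill(byte); // leftover tail shorter than one word
    (prefix.len(), middle.len(), suffix.len())
}
```

Note that `align_to_mut` does not promise the shortest possible prefix, so callers should only rely on the three parts covering the whole slice, not on specific lengths.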

lwus · 2022-05-11T13:40:21Z

I removed the UB from_raw_parts in 1382308. Bytemuck seems to just defer to unsafe { vals.align_to_mut... } anyway, so I'm not sure there's much advantage (since we're only going between u8 and u32).

My machine shows no practical difference in the speedup. I imagine that almost all allocation schemes ultimately use memory aligned to at least the word size, but maybe the Rust runtime does something more specific...

Nemo157 · 2022-05-11T14:04:55Z

The reason I'd prefer to use bytemuck is so that there's one place proving the safety of the cast. I would like to eventually make this crate forbid(unsafe_code) by pulling out the &mut str helper into another crate.

lwus · 2022-05-12T04:21:28Z

Sounds good, I made the change. Any thoughts on the API for exposing this?

lwus · 2022-05-17T14:32:39Z

@Nemo157 any thoughts?

Also removed the manual unrolling, which seemed to both improve perf and simplify the code.

Nemo157 · 2022-05-23T11:53:11Z

My thought was that by not relying on the alignment of the allocation, this algorithm could replace the existing single-byte algorithm and give speedups in all cases.

One additional requirement for that would be adding tests with a misaligned buffer, to ensure it does not get broken.
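One way such a test could obtain a deliberately misaligned output buffer is sketched below. The helper `misaligned_window` is hypothetical, and it assumes `u32` has an alignment greater than 1 on the target, which holds on common platforms.

```rust
/// Return a `len`-byte window of `backing` whose start address is exactly
/// one byte past a u32-aligned address, i.e. deliberately misaligned.
/// Panics if `backing` is shorter than `len + align_of::<u32>()`.
fn misaligned_window(backing: &mut [u8], len: usize) -> &mut [u8] {
    let align = std::mem::align_of::<u32>();
    let addr = backing.as_ptr() as usize;
    // Distance to the next aligned address, plus one byte to step past it.
    let offset = (align - addr % align) % align + 1;
    &mut backing[offset..offset + len]
}
```

A test would then decode into `misaligned_window(&mut backing, n)` and compare the result against the same decode into an ordinary buffer.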

lwus · 2022-05-23T20:17:02Z

I think the current API of allowing arbitrary slices as output buffers makes that tricky. Specifically, it's significantly faster for short buffers to decode everything in word-sized chunks, but we would need to switch to byte-sized chunks for the slop. We could add some extra checks and do a special-case dispatch in that case, but that also sounds awkward.

mina86 · 2022-09-04T16:36:01Z

I haven’t looked at the code, so this may be a dumb comment, but alignment could be solved by copying data to and from a properly aligned buffer on the stack. Hopefully no one uses base58 for large inputs, so a fixed-size stack buffer should handle all sane cases.
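A rough sketch of that copy-in/copy-out idea follows, under stated assumptions: the capacity `STACK_LIMBS` and the helper `with_aligned_copy` are hypothetical, limbs are packed little-endian, and inputs that do not fit simply return `None` so a caller could fall back to the byte-wise path.

```rust
/// Hypothetical capacity: 32 limbs = 128 bytes of output.
const STACK_LIMBS: usize = 32;

/// Run `work` on an aligned stack copy of `out`, then copy the result back.
/// Returns `None` if `out` does not fit in the fixed-size stack buffer.
fn with_aligned_copy<R>(out: &mut [u8], work: impl FnOnce(&mut [u32]) -> R) -> Option<R> {
    let n_limbs = (out.len() + 3) / 4;
    if n_limbs > STACK_LIMBS {
        return None;
    }
    // [u32; N] is always aligned for u32, whatever the alignment of `out`.
    let mut limbs = [0u32; STACK_LIMBS];
    // Copy in: pack bytes into little-endian limbs, zero-padding the tail.
    for (limb, chunk) in limbs.iter_mut().zip(out.chunks(4)) {
        let mut word = [0u8; 4];
        word[..chunk.len()].copy_from_slice(chunk);
        *limb = u32::from_le_bytes(word);
    }
    let result = work(&mut limbs[..n_limbs]);
    // Copy out: unpack limbs back into the caller's bytes.
    for (limb, chunk) in limbs.iter().zip(out.chunks_mut(4)) {
        let n = chunk.len();
        chunk.copy_from_slice(&limb.to_le_bytes()[..n]);
    }
    Some(result)
}
```

The two copies are linear in the buffer size, so for the quadratic limb arithmetic they should be a small cost, though that would need benchmarking on short inputs.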

lwus added 4 commits May 11, 2022 15:17

Decode with multiple bytes per chunk

6a32d57

Fix main loop ending condition

90b79c0

Do remaining bytes in 1 multiply loop

c8b5a90

Add benchmarks for decode::into_vec_unsafe

19d7487

lwus force-pushed the prīmum branch from 72ef0ea to 19d7487 Compare May 11, 2022 10:13

Nemo157 reviewed May 11, 2022

lwus added 2 commits May 11, 2022 16:56

Implement batched encode

58e8a6c

Handle unaligned output instead of using UB unsafe allocs

1382308

- Vec::from_raw_parts is much stricter than anticipated and requires deallocation to happen with the same alignment

lwus changed the title ~~RFC Decode with multiple bytes per chunk~~ RFC Decode/Encode with multiple bytes per chunk May 11, 2022

Encode with 5 base58 bytes per limb

6ae3c07

- slightly more complex expansion from limb to output bytes but pretty solid perf gains

Use bytemuck instead of raw unsafe align

38468ad

lwus added 3 commits May 17, 2022 09:24

Small slice indexing optimization

f6e175c

Add sanity check for encoded output buffer length

faaa09c

Simplify en/decoding chunks loops

5926ef2

- unrolling apparently does not help with performance. this simplifies a lot of the indexing and is also seemingly faster
