
Performance #5

Open
Frommi opened this issue Aug 22, 2017 · 26 comments

Comments

@Frommi
Owner

Frommi commented Aug 22, 2017

@oyvindln, can you please give a link to your performance tests?

  • adler32 (crc) vectorization. I rewrote the shortest version.
  • Also, when rewriting I skipped copying fields from the compressor/decompressor to the stack, as is done in miniz. Check if it matters.
  • In find_match, check if bit magic with u64 (instead of u16) can boost speed by finding the first differing spot in the xor using a leading-zeros count (see the sketch below).
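
A minimal sketch of that u64 idea (first_mismatch is a made-up helper, not the actual find_match code): XOR two 8-byte loads and use the bit count to locate the first differing byte. With little-endian loads the first difference sits in the low bits, so trailing_zeros does the job here; the big-endian variant would use leading_zeros.

```rust
use std::convert::TryInto;

/// Hypothetical helper: index of the first byte where `a` and `b` differ,
/// comparing 8 bytes at a time.
fn first_mismatch(a: &[u8], b: &[u8]) -> Option<usize> {
    let len = a.len().min(b.len());
    let mut i = 0;
    while i + 8 <= len {
        let x = u64::from_le_bytes(a[i..i + 8].try_into().unwrap());
        let y = u64::from_le_bytes(b[i..i + 8].try_into().unwrap());
        let diff = x ^ y;
        if diff != 0 {
            // Each byte is 8 bits; the lowest set bit of the xor marks the
            // first differing byte within this chunk.
            return Some(i + (diff.trailing_zeros() / 8) as usize);
        }
        i += 8;
    }
    // Byte-by-byte scan for the remaining tail.
    (i..len).find(|&j| a[j] != b[j])
}
```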
@oyvindln
Collaborator

The benchmarks I posted in the flate2 issue were [these] from my deflate encoder. I simply replaced flate2 with flate2-oxide. (I don't think there is an easy way of using two versions of the same crate at the same time without renaming one of them.)

I also have this project which I've used. We could add miniz-oxide to that, though for decompression it would probably be nice to alter it to do several runs for each file and take the average, as the noise tends to be a bit high.

For adler32, there is the adler32 crate we could use.

@oyvindln
Collaborator

oyvindln commented Aug 23, 2017

It turns out that directly comparing with miniz-backed flate2 is a bit complicated, as the compiler doesn't like that there are two crates linking to two versions of a C library with the same name. Maybe we could hide the parts that still link to C code behind a feature.

Though at least it's quite simple to compare against other pure-Rust libraries with the compression-tester binary by simply swapping flate2 for flate2-oxide.

@Frommi
Owner Author

Frommi commented Aug 23, 2017

I faced the same issue when I was writing the fuzzing setup: I solved it using objcopy, adding the prefix c_ to every symbol in one version of miniz. See: https://github.com/Frommi/miniz_oxide/blob/master/build_fuzz.sh

@oyvindln
Collaborator

Ah, that's nice. Then we could do the same for performance comparisons.

@oyvindln
Collaborator

oyvindln commented Aug 27, 2017

I managed to improve match copying in tinfl_oxide a bit. Loading/storing u64 and other larger types didn't seem to help, but adding a special case for length-3 matches helped a little. Using a local variable as a loop counter for overlapping matches also helped a bit, so I think it might be worth making local copies of more values, like in the C code, so the compiler can put them in registers when it finds that beneficial.
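
For reference, a minimal sketch of those two ideas (copy_match is a made-up function, not the actual tinfl_oxide code, and extend_from_within merely stands in for whatever bulk copy is used in the non-overlapping case):

```rust
/// Hypothetical match copy into the output vector: `dist` bytes back, `len`
/// bytes long. Special-cases length-3 matches and uses a local loop counter
/// for overlapping copies so the compiler can keep it in a register.
fn copy_match(out: &mut Vec<u8>, dist: usize, len: usize) {
    let start = out.len() - dist;
    if len == 3 {
        // The shortest (and very common) match length gets its own path.
        for j in 0..3 {
            let b = out[start + j];
            out.push(b);
        }
        return;
    }
    if dist >= len {
        // Source and destination don't overlap: copy the whole range at once.
        out.extend_from_within(start..start + len);
    } else {
        // Overlapping match (dist < len): copy byte by byte, driving the loop
        // with a local counter instead of a struct field.
        let mut i = 0;
        while i < len {
            let b = out[start + i];
            out.push(b);
            i += 1;
        }
    }
}
```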

@Frommi
Owner Author

Frommi commented Aug 30, 2017

decompress_oxide's main loop isn't very optimal: I looked at the asm and there are, for example, two table jumps for Action::None with some other instructions sprinkled in, when only one jump to the start of the state is needed, and in some cases that jump could go even further into the state's code.

This loop improves performance noticeably, as it is essentially a for-loop, but it reduces readability.

The main path of DECODE_LITLEN --> HUFF_DECODE_OUTER_LOOP1 --> [READ_EXTRA_BITS_LITLEN] --> DECODE_DISTANCE --> [READ_EXTRA_BITS_DISTANCE] --> HUFF_DECODE_OUTER_LOOP2 certainly has huge overhead for small matches. The main literal writing in DECODE_LITLEN suffers as well.

@oyvindln
Collaborator

Yeah, C code can jump to arbitrary places using switch and goto, something miniz and many C/C++ decoders make use of. We need to find some way around that in Rust.

@Frommi
Owner Author

Frommi commented Aug 30, 2017

We could change Action::None to continue, Action::Next to break in loops over states, and Action::End to break 'outer, if it weren't for the closures. So, maybe change them to macros?
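
Roughly, the loop-based shape could look like this hypothetical skeleton (state names made up): `continue` plays the role of Action::None, assigning the next state plays Action::Next, and `break 'outer` plays Action::End.

```rust
#[derive(Copy, Clone)]
enum State {
    DecodeLitlen,
    DecodeDistance,
    Done,
}

fn run_states() {
    let mut state = State::DecodeLitlen;
    'outer: loop {
        match state {
            State::DecodeLitlen => {
                // ... decode a literal/length code ...
                state = State::DecodeDistance; // Action::Next
                continue;                      // Action::None is just `continue`
            }
            State::DecodeDistance => {
                // ... decode a distance code ...
                state = State::Done;
            }
            State::Done => break 'outer,       // Action::End
        }
    }
}
```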

@oyvindln
Collaborator

Yeah we could probably abuse loops a bit.

@oyvindln
Collaborator

Seems like we're still a bit behind on 32-bit (tested on an old Core (1) Duo laptop):


     Running target/release/deps/bench-c6ee7e382af50fa6

running 13 tests
test compress_default                   ... bench:  38,803,194 ns/iter (+/- 167,060)
test compress_fast                      ... bench:  10,655,691 ns/iter (+/- 21,273)
test compress_high                      ... bench:  68,385,352 ns/iter (+/- 125,997)
test compress_mem_to_heap_default_miniz ... bench:  30,819,206 ns/iter (+/- 83,438)
test compress_mem_to_heap_default_oxide ... bench:  38,946,724 ns/iter (+/- 128,007)
test compress_mem_to_heap_fast_miniz    ... bench:  11,878,490 ns/iter (+/- 101,661)
test compress_mem_to_heap_fast_oxide    ... bench:  15,257,433 ns/iter (+/- 168,803)
test compress_mem_to_heap_high_miniz    ... bench:  51,414,234 ns/iter (+/- 154,822)
test compress_mem_to_heap_high_oxide    ... bench:  68,431,216 ns/iter (+/- 186,737)
test decompress                         ... bench:   4,383,233 ns/iter (+/- 11,595)
test decompress_mem_to_heap_miniz       ... bench:   2,684,700 ns/iter (+/- 25,511)
test zlib_compress_fast                 ... bench:  10,932,038 ns/iter (+/- 42,275)
test zlib_decompress                    ... bench:   4,592,048 ns/iter (+/- 13,105)

@Frommi
Owner Author

Frommi commented Aug 30, 2017

Well, yes. We use 64-bit buffers and even 64-bit magic regardless of the machine's bitness. There are ifdefs in miniz to fall back to byte-by-byte-style functions on 32-bit.
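
For illustration, a cfg-based fallback along the lines of the miniz ifdefs could look something like this (BitBuffer is a hypothetical name, not an existing type in the crate):

```rust
// Match the bit-buffer width to the target, mirroring the #ifdef fallback in
// miniz for 32-bit builds.
#[cfg(target_pointer_width = "64")]
type BitBuffer = u64;

#[cfg(not(target_pointer_width = "64"))]
type BitBuffer = u32;

// How many bits the buffer holds on this target.
const BIT_BUFFER_SIZE: u32 = (std::mem::size_of::<BitBuffer>() * 8) as u32;
```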

@oyvindln
Collaborator

oyvindln commented Aug 30, 2017

I don't think decompression uses 64-bit buffers much, so I'm not sure why the difference is so large there. Compression is more as expected (actually, I expected more of a difference given that we use some 64-bit stuff there).

Also interesting is that using RUSTFLAGS="-C target-cpu=native" made a significant difference for the default/high compression (from 68M to 54M ns/iter for compress_mem_to_heap_high_oxide), but not a huge difference for the other tests.

@Frommi
Owner Author

Frommi commented Aug 30, 2017

Added some loops and local vars. Halfway to the miniz decompression time.

@oyvindln
Collaborator

Nice work! It helped a fair bit on the 32-bit machine too:


running 3 tests
test decompress                         ... bench:   3,832,808 ns/iter (+/- 18,652)
test decompress_mem_to_heap_miniz       ... bench:   2,696,180 ns/iter (+/- 33,146)
test zlib_decompress                    ... bench:   4,079,228 ns/iter (+/- 23,350)


@oyvindln
Collaborator

oyvindln commented Sep 4, 2017

I don't think decompression uses much 64-bit buffers

Actually, I just realized the Action enum is 64-bit (as status and state are 32-bit values). I don't know whether this makes an impact on 32-bit or not.
EDIT: Changing data types to make it smaller didn't seem to make any difference on 32-bit.
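
A small illustration of the size issue (the payload types here are hypothetical, just mirroring "status and state are 32-bit values"): a 32-bit payload plus the discriminant rounds the enum up to 8 bytes.

```rust
enum Action {
    None,
    Jump(u32), // e.g. a 32-bit state value
    End(i32),  // e.g. a 32-bit status code
}

fn main() {
    // Prints 8 on common targets: 4 bytes of payload plus a discriminant,
    // padded out to the payload's 4-byte alignment.
    println!("{}", std::mem::size_of::<Action>());
}
```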

@oyvindln
Collaborator

oyvindln commented Sep 15, 2017

With my last commit, the decompression benchmark is now faster than miniz on my machine!

running 4 tests
test decompress                         ... bench:     740,342 ns/iter (+/- 50,835)
test decompress_mem_to_heap_miniz       ... bench:     822,743 ns/iter (+/- 34,919)
test decompress_mem_to_heap_oxide       ... bench:     739,122 ns/iter (+/- 38,805)
test zlib_decompress                    ... bench:     838,384 ns/iter (+/- 12,919)

It also helped a fair bit on my rpi3 (still not quite as fast here though):

running 4 tests
test decompress                         ... bench:   4,280,570 ns/iter (+/- 10,271)
test decompress_mem_to_heap_miniz       ... bench:   3,249,473 ns/iter (+/- 10,988)
test decompress_mem_to_heap_oxide       ... bench:   4,309,158 ns/iter (+/- 25,855)
test zlib_decompress                    ... bench:   4,597,016 ns/iter (+/- 30,478)

@jrmuizel
Contributor

Have you looked at libdeflate? It has the fastest decompression that I've seen: https://quixdb.github.io/squash-benchmark/unstable

@matklad
Contributor

matklad commented Sep 15, 2017

This is somewhat off topic, but the impl period is looming (next Monday), and I think miniz-oxide could be a perfect crate to contribute to: it should be possible to write comprehensive tests and benchmarks, so that cargo test and cargo bench guarantee that it works correctly and fast, and then you can easily throw the whole Rust community at optimizing the bench numbers while keeping the tests green.

I unfortunately don't have time to do the work required to make this crate super-friendly to contributors myself :(

@oyvindln
Collaborator

@jrmuizel I am aware of the library, it's probably worth looking into for potential ideas for improving performance further. Granted, libdeflate doesn't support streaming, so the use case is a bit different.

@matklad Yeah, that sounds like an idea, or at the very least we could put something in the looking-for-contributors thread on the forums.

@Frommi
Owner Author

Frommi commented Sep 15, 2017

@oyvindln these are very cool results! I will have time in the near future, so I will be able to hop back in and help. I have quite a history of changes to go through first, though :)

@Frommi
Owner Author

Frommi commented Oct 2, 2017

Sadly, benches on Travis CI are very unstable. For example, x 1.52 changed to x 1.21 even though decompression wasn't touched, while local benches vary only within a 2-3% range for me.

@Frommi
Owner Author

Frommi commented Mar 4, 2018

Last few commits improved decompression performance by another ~5% over miniz:

 name                         miniz:: ns/iter  oxide:: ns/iter  diff ns/iter   diff %  speedup 
 compress_bin_lvl_1           1,578,045        1,634,729              56,684    3.59%   x 0.97 
 compress_bin_lvl_6           9,193,429        9,616,389             422,960    4.60%   x 0.96 
 compress_bin_lvl_9           21,811,997       21,582,578           -229,419   -1.05%   x 1.01 
 compress_code_lvl_1          1,199,268        1,129,718             -69,550   -5.80%   x 1.06 
 compress_code_lvl_6          5,185,837        5,340,112             154,275    2.97%   x 0.97 
 compress_code_lvl_9          7,042,597        7,159,632             117,035    1.66%   x 0.98 
 compress_compressed_lvl_1    240,810          281,148                40,338   16.75%   x 0.86 
 compress_compressed_lvl_6    1,448,752        1,569,350             120,598    8.32%   x 0.92 
 compress_compressed_lvl_9    3,139,587        3,224,519              84,932    2.71%   x 0.97 
 compress_short_lvl_1         4,552            20,976                 16,424  360.81%   x 0.22 
 compress_short_lvl_6         4,639            21,112                 16,473  355.10%   x 0.22 
 decompress_bin_lvl_1         970,568          806,123              -164,445  -16.94%   x 1.20 
 decompress_bin_lvl_6         810,642          648,963              -161,679  -19.94%   x 1.25 
 decompress_bin_lvl_9         802,212          637,111              -165,101  -20.58%   x 1.26 
 decompress_code_lvl_1        732,376          610,690              -121,686  -16.62%   x 1.20 
 decompress_code_lvl_6        521,172          395,393              -125,779  -24.13%   x 1.32 
 decompress_code_lvl_9        519,140          392,498              -126,642  -24.39%   x 1.32 
 decompress_compressed_lvl_1  169,651          120,782               -48,869  -28.81%   x 1.40 
 decompress_compressed_lvl_6  240,749          193,696               -47,053  -19.54%   x 1.24 
 decompress_compressed_lvl_9  241,462          193,670               -47,792  -19.79%   x 1.25 
 decompress_short_lvl_1       5,211            3,376                  -1,835  -35.21%   x 1.54 

The main changes are that decompress_fast no longer uses read_bytes/read_bits to check for the end of input; instead it checks at the start that at least the theoretical maximum number of needed bytes is available, and then fills the bit_buffer when needed without those checks. There is also a special case for matches of length 3.

I have an idea for the next improvement, though: right now up to 14 bytes are read from the input buffer per decompress_fast cycle. They are read as 32-bit (or 16-bit on 32-bit systems) words, totaling a maximum of 4 (or 7) reads. Maybe if we read 16 (const) bytes with copy_from_slice it would optimize to a single 128-bit read from the buffer (probably in the heap) straight onto the stack, to apply bit magic to. Then there would also be no if-checks to fill the bit_buffer, but there would be a bit of a dance to return the unused bits.
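
A rough sketch of that refill idea (peek_128 is a made-up helper; it assumes the caller has already verified, once before the fast loop, that at least 16 bytes remain):

```rust
/// Hypothetical refill: read the next 16 input bytes as one 128-bit value.
/// Caller guarantees `pos + 16 <= input.len()`.
fn peek_128(input: &[u8], pos: usize) -> u128 {
    let mut chunk = [0u8; 16];
    // A fixed-size copy_from_slice like this can compile down to a single
    // wide load from the input buffer onto the stack.
    chunk.copy_from_slice(&input[pos..pos + 16]);
    u128::from_le_bytes(chunk)
}
```

Any bits that end up unused would then have to be handed back (by rewinding the input position) when the fast loop exits, which is the "dance" mentioned above.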

@oyvindln
Collaborator

oyvindln commented Mar 5, 2018

Neat.

I added some extra checks to avoid overflow/out-of-bounds issues that showed up in fuzzing and from people using this library after zip-rs adopted it as the default. Some of them may have been redundant, though I was focused on fixing the bugs first. I was worried that they might have impacted performance, but it seems not.

As for optimizing more, I guess you just have to try it out, it may help.

@dralley

dralley commented Jan 13, 2022

Hey, so I'm writing a library that handles parsing some compressed XML files and noticed that miniz_oxide is responsible for the great majority of the runtime of my testcase program. This is pretty surprising, as the rest of the processing does a fair amount of work (and plenty of allocations; it's not really optimized much at all). I don't have a lot to compare it against, so maybe it's not fully avoidable (it could just be down to the algorithm and not the implementation), but here's some performance data that I collected.

miniz_oxide::compress::core::compress_inner is responsible for 90% of the L1 cache misses in the program:

Screenshot from 2022-01-13 12-24-09

And for 32% of branch mispredictions (there are a lot of branches in my code too, but nothing else comes close in terms of proportion):

Screenshot from 2022-01-13 12-23-18

@dralley

dralley commented Feb 21, 2022

I just recompiled my application with the flate2 "cloudflare_zlib" feature instead of the miniz_oxide backend, and it knocked nearly 30% off the runtime. Possibly a good place to mine for optimizations.

@dralley

dralley commented Feb 21, 2022

The system zlib backend was ~15% faster, but this is measuring my whole application including all the XML parsing I'm doing, so the actual differences are probably bigger.
