FemtoZip vs VCDiff and Zstd

FemtoZip vs VCDiff

FemtoZip is in many ways spiritually similar to vcdiff (http://code.google.com/p/open-vcdiff/), which is based on Bentley and McIlroy's algorithm described in "Data Compression Using Long Common Strings". In both cases the goal is to achieve improved compression ratios via a precomputed shared dictionary that is used by both the compressor and decompressor. Some differences:

  • vcdiff is designed to work well with very large dictionaries (hundreds of KB or more), while FemtoZip has a maximum dictionary size of 64 KB. During compression, both algorithms hash substrings of the shared dictionary as well as of the input string as it is scanned. In FemtoZip's case this leads to n + m entries in the hash table, where n is the length of the dictionary and m is the length of the input. For very large dictionaries this overhead (both in storage and in the time to build the hash table) can be prohibitive. vcdiff supports large dictionaries by hashing non-overlapping string fragments, by default one every 32 bytes, which cuts the hash table's space and time overhead to (n + m)/32. For example, a 1 MB dictionary would cost roughly a million hash entries if every position were hashed, but only about 32,000 under vcdiff's scheme.

  • Because it is optimized for large dictionaries, the vcdiff compressor only recognizes long repetitions in the document. For example, by default vcdiff discards matches shorter than 32 bytes (meaning repetitions of fewer than 32 bytes in the content are not compressed). FemtoZip compresses matches of 4 bytes or more, which is closer to gzip's behavior (gzip's minimum match is 3 bytes).

  • For the above reason, vcdiff is intended to be used in conjunction with gzip, not as a replacement for it. FemtoZip is intended as a replacement for gzip.

  • vcdiff can recognize substrings that occur arbitrarily far apart in an input string or dictionary. For example, as demonstrated in Bentley and McIlroy's paper, if you take a large document like the Bible, concatenate it with itself, and run it through the compressor, the second copy is compressed away almost entirely. FemtoZip will only recognize repetitions that occur within 64 KB of each other, which is closer to gzip's behavior of matching within a 32 KB window.

  • The vcdiff dictionary is just a collection of substrings, while the FemtoZip dictionary contains both substrings and a Huffman coding tree. With FemtoZip you therefore get an optimally computed Huffman tree, and it is stored out of band in the dictionary rather than adding to the size of each compressed document (as is the case with gzip). With vcdiff+gzip you either use no Huffman coding, a default (and almost certainly suboptimal) Huffman tree, or a tailored Huffman tree that must be stored in the compressed document. You also pay the CPU cost of gzip encoding with both a custom-computed Huffman tree and the default tree in order to decide which is better.

  • vcdiff does not include a way to compute a dictionary from sample data, while FemtoZip does. In fact, FemtoZip can be used to compute vcdiff dictionaries of any size (even above the 64 KB limit of the FemtoZip compressor) for use in other applications (e.g. the now-deprecated SDCH compression scheme). A sketch of this dictionary-building workflow appears after this list.
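
As a concrete illustration of the dictionary-computation point above, here is a minimal sketch of training and using a FemtoZip model from sample documents via the Java API. The class and method names (ArrayDocumentList, CompressionModel.buildOptimalModel, compress, decompress) follow the project's Tutorial as best understood; verify the exact signatures against the Tutorial before relying on them.

```java
import org.toubassi.femtozip.ArrayDocumentList;
import org.toubassi.femtozip.CompressionModel;

public class FemtoZipDictionaryExample {
    public static void main(String[] args) throws Exception {
        // Sample documents standing in for real payloads (e.g. similar small
        // JSON responses). Real training sets should be larger.
        byte[][] samples = {
            "{\"user\":\"alice\",\"active\":true}".getBytes("UTF-8"),
            "{\"user\":\"bob\",\"active\":false}".getBytes("UTF-8"),
            "{\"user\":\"carol\",\"active\":true}".getBytes("UTF-8"),
        };

        // Build a model (shared dictionary plus Huffman coding) from the samples.
        // buildOptimalModel/ArrayDocumentList are as described in the Tutorial.
        CompressionModel model =
            CompressionModel.buildOptimalModel(new ArrayDocumentList(samples));

        // Compress and round-trip a new payload using the shared model.
        byte[] payload = "{\"user\":\"dave\",\"active\":true}".getBytes("UTF-8");
        byte[] compressed = model.compress(payload);
        byte[] restored = model.decompress(compressed);

        System.out.println(payload.length + " bytes -> " + compressed.length + " bytes");
        System.out.println("round trip ok: "
            + new String(restored, "UTF-8").equals(new String(payload, "UTF-8")));
    }
}
```

Note that the model built here must be shipped to (or rebuilt identically by) the decompressing side, since both ends need the same dictionary and coding tables.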

FemtoZip vs Zstd

FemtoZip was released in 2011, when no existing open source tool supported dictionary computation and compression. Zstd, released in 2016, is a general purpose compression tool that also supports dictionary computation and compression. Given Zstd's robust support and wide availability, the discerning engineer should prefer it. However, you should test both Zstd and FemtoZip for your use case and see which performs better, both in compression ratio and in CPU usage/speed. FemtoZip wins handily in the synthetic benchmark used in the Tutorial (a benchmark created before Zstd existed). Beyond this, the author hasn't deeply researched Zstd and the more nuanced tradeoffs between it and FemtoZip for the "shared dictionary/small payload" use case.
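
For a head-to-head test, the equivalent dictionary workflow with Zstd looks roughly like the following, using the zstd-jni Java binding. The ZstdDictTrainer, ZstdCompressCtx, and ZstdDecompressCtx usage here is an assumption based on zstd-jni's published API (check its javadoc), and dictionary training in practice wants far more samples than shown.

```java
import com.github.luben.zstd.ZstdCompressCtx;
import com.github.luben.zstd.ZstdDecompressCtx;
import com.github.luben.zstd.ZstdDictTrainer;

public class ZstdDictionaryExample {
    public static void main(String[] args) throws Exception {
        // Train a zstd dictionary (here capped at 64 KB to mirror FemtoZip)
        // from sample payloads. Real training wants hundreds or thousands of
        // samples; three are shown only for brevity.
        ZstdDictTrainer trainer = new ZstdDictTrainer(1024 * 1024, 64 * 1024);
        trainer.addSample("{\"user\":\"alice\",\"active\":true}".getBytes("UTF-8"));
        trainer.addSample("{\"user\":\"bob\",\"active\":false}".getBytes("UTF-8"));
        trainer.addSample("{\"user\":\"carol\",\"active\":true}".getBytes("UTF-8"));
        byte[] dict = trainer.trainSamples();

        byte[] payload = "{\"user\":\"dave\",\"active\":true}".getBytes("UTF-8");

        // Compress and decompress a new payload against the trained dictionary.
        try (ZstdCompressCtx cctx = new ZstdCompressCtx();
             ZstdDecompressCtx dctx = new ZstdDecompressCtx()) {
            cctx.loadDict(dict).setLevel(19);
            byte[] compressed = cctx.compress(payload);

            dctx.loadDict(dict);
            byte[] restored = dctx.decompress(compressed, payload.length);
            System.out.println(payload.length + " bytes -> " + compressed.length + " bytes");
            System.out.println("round trip ok: " + (restored.length == payload.length));
        }
    }
}
```

Running both sketches over a representative corpus of your own payloads, and comparing compressed sizes and throughput, is the kind of test the paragraph above recommends before committing to either tool.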