
pyzstd module

Introduction

The pyzstd module provides classes and functions for compressing and decompressing data using Facebook's Zstandard (zstd for short) algorithm.

The API style is similar to Python's bz2/lzma/zlib modules.

  • Includes the latest zstd library source code
  • Can also dynamically link to the zstd library provided by the system, see this note<build_pyzstd>.
  • Has a CFFI implementation that can work with PyPy
  • Supports sub-interpreters on CPython 3.12+
  • :pyZstdFile class has C-language-level performance
  • Supports the Zstandard Seekable Format
  • Has a command-line interface: python -m pyzstd --help

Links: GitHub page, PyPI page.

Features of zstd:

  • Fast compression and decompression speed.
  • With multi-threaded compression<mt_compression>, the compression speed improves significantly.
  • With a pre-trained dictionary<zstd_dict>, the compression ratio on small data (a few KiB) improves dramatically.
  • Frames and blocks<frame_block> allow flexible use, suitable for many scenarios.
  • Can be used as a patching engine<patching_engine>.

Note

Two other zstd modules on PyPI:

Exception

Simple compression/decompression

This section contains:

  • function :pycompress
  • function :pydecompress

Hint

If there are a large number of individual data items of the same type, reusing these objects may eliminate the small overhead of creating a context / setting parameters / loading a dictionary.

  • :pyZstdCompressor
  • :pyRichMemZstdCompressor

python

# int compression level
compressed_dat = compress(raw_dat, 10)

# dict option, use 6 threads to compress, and append a 4-byte checksum.
option = {CParameter.compressionLevel : 10,
          CParameter.nbWorkers : 6,
          CParameter.checksumFlag : 1}
compressed_dat = compress(raw_dat, option)

Rich memory compression

Compress data using rich memory mode<rich_mem>. This mode allocates more memory for the output buffer, and is faster in some cases.

This section contains:

  • function :pyrichmem_compress
  • class :pyRichMemZstdCompressor, a reusable compressor.

Streaming compression

This section contains:

  • function :pycompress_stream, a fast and convenient function.
  • class :pyZstdCompressor, similar to compressors in Python standard library.

It helps to know a bit about zstd data; see frame and block<frame_block>.

Streaming decompression

This section contains:

  • function :pydecompress_stream, a fast and convenient function.
  • class :pyZstdDecompressor, similar to decompressors in Python standard library.
  • class :pyEndlessZstdDecompressor, a decompressor accepts multiple concatenated frames<frame_block>.

Dictionary

This section contains:

  • class :pyZstdDict
  • function :pytrain_dict
  • function :pyfinalize_dict

Note

If a pre-trained zstd dictionary is used, the compression ratio achievable on small data (a few KiB) improves dramatically.

Background

The smaller the amount of data to compress, the more difficult it is to compress. This problem is common to all compression algorithms; the reason is that compression algorithms learn from past data how to compress future data. But at the beginning of a new data set, there is no "past" to build upon.

Zstd's training mode can be used to tune the algorithm for a selected type of data. Training is achieved by providing it with a few samples (one file per sample). The result of this training is stored in a file called a "dictionary", which must be loaded before compression and decompression.

See the FAQ in this file for details.

Attention

  1. If you lose a zstd dictionary, the corresponding data can't be decompressed.
  2. A zstd dictionary has negligible effect on large data (multi-MiB) compression. If you want to use large dictionary content, see prefix (:pyZstdDict.as_prefix).
  3. There is a possibility that the dictionary content could be maliciously tampered with by a third party.

Advanced dictionary training

Pyzstd module only uses the zstd library's stable API. The stable API exposes only two dictionary training functions, corresponding to :pytrain_dict and :pyfinalize_dict.

If you want to adjust advanced training parameters, you may use zstd's CLI program (not pyzstd module's CLI); it has entries to the zstd library's experimental API.

Module-level functions

This section contains:

  • function :pyget_frame_info, get frame information from a frame header.
  • function :pyget_frame_size, get a frame's size.

python

>>> pyzstd.get_frame_info(compressed_dat[:20])
frame_info(decompressed_size=687379, dictionary_id=1040992268)

python

>>> pyzstd.get_frame_size(compressed_dat)
252874

Module-level variables

This section contains:

  • :pyzstd_version, a str.
  • :pyzstd_version_info, a tuple.
  • :pycompressionLevel_values, some values defined by the underlying zstd library.
  • :pyzstd_support_multithread, whether the underlying zstd library supports multi-threaded compression.

python

>>> pyzstd.zstd_version
'1.4.5'

python

>>> pyzstd.zstd_version_info
(1, 4, 5)

python

>>> pyzstd.compressionLevel_values  # 131072 = 128*1024
values(default=3, min=-131072, max=22)

(Added in version 0.15.1.)

python

>>> pyzstd.zstd_support_multithread
True

ZstdFile class and open() function

This section contains:

  • class :pyZstdFile, open a zstd-compressed file in binary mode.
  • function :pyopen, open a zstd-compressed file in binary or text mode.

In writing modes (compression), these methods are available:

  • .write(b)
  • .flush(mode=ZstdFile.FLUSH_BLOCK), flush to the underlying stream:

    1. The mode argument can be ZstdFile.FLUSH_BLOCK or ZstdFile.FLUSH_FRAME.
    2. Invoking this method consecutively with .FLUSH_FRAME will not generate empty frames.
    3. Abusing this method will reduce the compression ratio; use it only when necessary.
    4. If the program is interrupted afterwards, all data can be recovered. To ensure the data is saved to disk, os.fsync(fd) is also needed.

    (Added in version 0.15.1, added mode argument in version 0.15.9.)

In both reading and writing modes, these methods and properties are available:

SeekableZstdFile class

This section contains facilities supporting the Zstandard Seekable Format:

  • exception :pySeekableFormatError
  • class :pySeekableZstdFile

Advanced parameters

This section contains the classes :pyCParameter, :pyDParameter, and :pyStrategy; they are subclasses of IntEnum, used for setting advanced parameters.

Attributes of :pyCParameter class:

  • Compression level (:py~CParameter.compressionLevel)
  • Compress algorithm parameters (:py~CParameter.windowLog, :py~CParameter.hashLog, :py~CParameter.chainLog, :py~CParameter.searchLog, :py~CParameter.minMatch, :py~CParameter.targetLength, :py~CParameter.strategy)
  • Long distance matching (:py~CParameter.enableLongDistanceMatching, :py~CParameter.ldmHashLog, :py~CParameter.ldmMinMatch, :py~CParameter.ldmBucketSizeLog, :py~CParameter.ldmHashRateLog)
  • Misc (:py~CParameter.contentSizeFlag, :py~CParameter.checksumFlag, :py~CParameter.dictIDFlag)
  • Multi-threaded compression (:py~CParameter.nbWorkers, :py~CParameter.jobSize, :py~CParameter.overlapLog)

Attribute of :pyDParameter class:

  • Decompression parameter (:py~DParameter.windowLogMax)

Attributes of :pyStrategy class:

:py~Strategy.fast, :py~Strategy.dfast, :py~Strategy.greedy, :py~Strategy.lazy, :py~Strategy.lazy2, :py~Strategy.btlazy2, :py~Strategy.btopt, :py~Strategy.btultra, :py~Strategy.btultra2.

Informative notes

Compression level

Note

Compression level

Compression level is an integer:

  • 1 to 22 (currently), regular levels. Levels >= 20, labeled ultra, should be used with caution, as they require more memory.
  • 0 means use the default level, which is currently 3 as defined by the underlying zstd library.
  • -131072 to -1, negative levels extend the range of speed vs. ratio preferences. The lower the level, the faster the speed, at the cost of compression ratio. (131072 = 128*1024.)

:pycompressionLevel_values are some values defined by the underlying zstd library.

For advanced user

Compression levels are just numbers that map to a set of compression parameters, see this table for an overview. The parameters may be adjusted by the underlying zstd library after it gathers some information, such as the data size, or whether a dictionary is used.

Setting a compression level does not reset all other compression parameters<CParameter> to default. Setting a level dynamically affects the compression parameters that have not been set manually; the manually set ones will "stick".

Frame and block

Note

Frame and block

Frame

Zstd data consists of one or more independent "frames". The decompressed content of multiple concatenated frames is the concatenation of each frame's decompressed content.

A frame is completely independent, has a frame header, and a set of parameters which tells the decoder how to decompress it.

In addition to normal frames, there are skippable frames that can contain any user-defined data; a skippable frame will be decompressed to b''.

Block

A frame encapsulates one or multiple "blocks". A block has a guaranteed maximum size (3-byte block header + 128 KiB); the actual maximum size depends on frame parameters.

Unlike independent frames, each block depends on previous blocks for proper decoding, but doesn't need the following blocks; a complete block can be fully decompressed. So flushing blocks may be used in communication scenarios, see :pyZstdCompressor.FLUSH_BLOCK.

Attention

In some language bindings, the decompress() function doesn't support multiple frames, and/or doesn't support a frame with unknown content size<content_size>; pay attention when compressing data for other language bindings.

Multi-threaded compression

Note

Multi-threaded compression

Zstd library supports multi-threaded compression. Set the :pyCParameter.nbWorkers parameter to >= 1 to enable multi-threaded compression; 1 means "1-thread multi-threaded mode".

The threads are spawned by the underlying zstd library, not by pyzstd module.

python

# use 4 threads to compress
option = {CParameter.nbWorkers : 4}
compressed_dat = compress(raw_dat, option)

The data will be split into portions and compressed in parallel. The portion size can be specified by the :pyCParameter.jobSize parameter, and the overlap size by the :pyCParameter.overlapLog parameter; usually these don't need to be set.

The multi-threaded output will differ from the single-threaded output. However, both are deterministic: the multi-threaded output is the same no matter how many threads are used.

The multi-threaded output is a single frame<frame_block>, and it's a little larger. Compressing 520.58 MiB of data, the single-threaded output is 273.55 MiB, while the multi-threaded output is 274.33 MiB.

Hint

Using the number of physical CPU cores as the thread count may be the fastest; getting that number requires a third-party module. os.cpu_count() can only get the number of logical CPU cores (with hyper-threading).

Rich memory mode

Note

Rich memory mode

pyzstd module has a "rich memory mode" for compression. It allocates more memory for the output buffer, and is faster in some cases. It is suitable for extremely fast compression scenarios.

There are a :pyrichmem_compress function and a :pyRichMemZstdCompressor class.

Currently it won't be faster when using zstd multi-threaded compression<mt_compression>; it will issue a ResourceWarning in this case.

Effects:

  • The output buffer is a little larger than the input data.
  • If the input data is larger than ~31.8 KB, it is up to 22% faster. The lower the compression level, the faster it usually is.

When not using this mode, the output buffer grows gradually, in order not to allocate too much memory. The negative effect is that pyzstd module usually needs to call the underlying zstd library's compress function multiple times.

When using this mode, the size of the output buffer is provided by the ZSTD_compressBound() function, which is a little larger than the input data (the maximum compressed size in the worst-case single-pass scenario). For 100 MiB of input data, the allocated output buffer is (100 MiB + 400 KiB). The underlying zstd library avoids an extra memory copy for this output buffer size.
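As an illustration, ZSTD_compressBound()'s formula can be sketched in Python; this mirrors the C macro, it is not a call into the library:

```python
KiB = 1024
MiB = 1024 * 1024

def compress_bound(src_size):
    # mirrors zstd's ZSTD_COMPRESSBOUND() macro: worst-case
    # single-pass compressed size; small inputs get an extra margin
    margin = ((128*KiB - src_size) >> 11) if src_size < 128*KiB else 0
    return src_size + (src_size >> 8) + margin

# for a 100 MiB input, the bound is (100 MiB + 400 KiB)
assert compress_bound(100 * MiB) == 100*MiB + 400*KiB
```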

python

# use richmem_compress() function
compressed_dat = richmem_compress(raw_dat)

# reuse RichMemZstdCompressor object
c = RichMemZstdCompressor()
frame1 = c.compress(raw_dat1)
frame2 = c.compress(raw_dat2)

Compressing 520.58 MiB of data, it accelerates from 5.40 seconds to 4.62 seconds.

Use with tarfile module

Note

Use with tarfile module

Python's tarfile module supports arbitrary compression algorithms by providing a file object.

This code encapsulates a ZstdTarFile class using :pyZstdFile; it can be used like the tarfile.TarFile class:

python

import tarfile
from pyzstd import ZstdFile

# When using read mode (decompression), the level_or_option parameter
# can only be a dict object representing decompression options. It
# doesn't support an int compression level in this case.
class ZstdTarFile(tarfile.TarFile):
    def __init__(self, name, mode='r', *, level_or_option=None,
                 zstd_dict=None, **kwargs):
        self.zstd_file = ZstdFile(name, mode,
                                  level_or_option=level_or_option,
                                  zstd_dict=zstd_dict)
        try:
            super().__init__(fileobj=self.zstd_file, mode=mode, **kwargs)
        except:
            self.zstd_file.close()
            raise

    def close(self):
        try:
            super().close()
        finally:
            self.zstd_file.close()

# write .tar.zst file (compression)
with ZstdTarFile('archive.tar.zst', mode='w', level_or_option=5) as tar:
    ...  # do something

# read .tar.zst file (decompression)
with ZstdTarFile('archive.tar.zst', mode='r') as tar:
    ...  # do something

When the above code is used in read mode (decompression) and files are selectively read multiple times, it may seek to a position before the current position; the decompression then has to be restarted from zero. If this slows down the operations, you can:

  1. Use :pySeekableZstdFile class to create/read .tar.zst file.
  2. Decompress the archive to a temporary file, and read from it. This code encapsulates the process:

python

import contextlib
import io
import tarfile
import tempfile
from pyzstd import decompress_stream

@contextlib.contextmanager
def ZstdTarReader(name, *, zstd_dict=None, option=None, **kwargs):
    with tempfile.TemporaryFile() as tmp_file:
        with io.open(name, 'rb') as ifh:
            decompress_stream(ifh, tmp_file,
                              zstd_dict=zstd_dict, option=option)
        tmp_file.seek(0)
        with tarfile.TarFile(fileobj=tmp_file, **kwargs) as tar:
            yield tar

with ZstdTarReader('archive.tar.zst') as tar:
    ...  # do something

Zstd dictionary ID

Note

Zstd dictionary ID

Dictionary ID is a 32-bit unsigned integer value. The decoder uses it to check whether the correct dictionary is used.

According to zstd dictionary format specification, if a dictionary is going to be distributed in public, the following ranges are reserved for future registrar and shall not be used:

  • low range: <= 32767
  • high range: >= 2^31

Outside of these ranges, any value (32767 < v < 2^31) can be used freely, even in a public environment.

In the zstd frame header, the Dictionary_ID field can be 0/1/2/4 bytes. If the value is small, this can save 2~3 bytes. Or don't write the ID at all by setting the :pyCParameter.dictIDFlag parameter.

pyzstd module doesn't currently support specifying the ID when training a dictionary. If you want to specify the ID, modify the dictionary content according to the format specification, and take the corresponding risks.

Attention

In the :pyZstdDict class, a :pyZstdDict.dict_id attribute of 0 means the dictionary is a "raw content" dictionary, free of any format restriction, used by advanced users. Non-zero means it's an ordinary dictionary, created by zstd functions, following the format specification.

In the :pyget_frame_info function, dictionary_id == 0 means the dictionary ID was not recorded in the frame header; the frame may or may not need a dictionary to be decoded, and the ID of such a dictionary is not specified.

Use zstd as a patching engine

Note

Use zstd as a patching engine

Zstd can be used as a great patching engine, although it has some limitations.

In this particular scenario, pass the :pyZstdDict.as_prefix attribute as the zstd_dict argument. A "prefix" is similar to a "raw content" dictionary, but zstd handles them differently internally, see this issue.

Essentially, the prefix is in effect placed before the data to be compressed. See "ZSTD_c_deterministicRefPrefix" in this file.

1, Generating a patch (compress)

Assuming VER_1 and VER_2 are two versions.

Let the "window" cover the longer version by setting :pyCParameter.windowLog, and enable "long distance matching" by setting :pyCParameter.enableLongDistanceMatching to 1. The --patch-from option of the zstd CLI also uses other parameters, but these two matter the most.

The valid range of windowLog is [10,30] in 32-bit builds and [10,31] in 64-bit builds. So in a 64-bit build, there is a 2 GiB length limit. Strictly speaking, the limit is (2 GiB - ~100 KiB). When this limit is exceeded, the patch becomes very large and loses the point of being a patch.

python

# use VER_1 as prefix
v1 = ZstdDict(VER_1, is_raw=True)

# let the window cover the longer version, clamp windowLog to the
# valid range (the upper bound is 30 in 32-bit builds), and enable
# "long distance matching".
windowLog = max(len(VER_1), len(VER_2)).bit_length()
windowLog = min(max(windowLog, 10), 31)
option = {CParameter.windowLog: windowLog,
          CParameter.enableLongDistanceMatching: 1}

# get a small PATCH
PATCH = compress(VER_2, level_or_option=option, zstd_dict=v1.as_prefix)

2, Applying the patch (decompress)

A prefix is not a dictionary, so the frame header doesn't record a dictionary id<dict_id>. When decompressing, you must use the same prefix as when compressing; otherwise a ZstdError exception may be raised, with a message like "Data corruption detected".

Decompressing requires a window of the same size as when compressing; this may be a problem for devices with small RAM. If the window is larger than 128 MiB, :pyDParameter.windowLogMax needs to be set explicitly to allow a larger window.

python

# use VER_1 as prefix
v1 = ZstdDict(VER_1, is_raw=True)

# allow a large window, the actual windowLog comes from the frame header.
option = {DParameter.windowLogMax: 31}

# get VER_2 from (VER_1 + PATCH)
VER_2 = decompress(PATCH, zstd_dict=v1.as_prefix, option=option)

Build pyzstd module with options

Note

Build pyzstd module with options

1️⃣ If the --avx2 build option is provided, pyzstd will be built with AVX2/BMI2 instructions. In an MSVC build (static link), this brings some performance improvement. GCC/Clang builds already dynamically dispatch some functions for BMI2 instructions, so there is no significant improvement, or it may even be worse.

shell

# 🟠 pyzstd 0.15.4+ and pip 22.1+ support PEP-517:
# build and install
pip install --config-settings="--build-option=--avx2" -v pyzstd-0.15.4.tar.gz
# build a redistributable wheel
pip wheel --config-settings="--build-option=--avx2" -v pyzstd-0.15.4.tar.gz

# 🟠 legacy commands:
# build and install
python setup.py install --avx2
# build a redistributable wheel
python setup.py bdist_wheel --avx2

2️⃣ Pyzstd module supports:

  • Dynamically link to the zstd library (provided by the system or a DLL); the zstd source code in the zstd folder will then be ignored.
  • Provide a CFFI implementation that can work with PyPy.

On CPython, provide these build options:

  1. no option: C implementation, statically link to zstd library.
  2. --dynamic-link-zstd: C implementation, dynamically link to zstd library.
  3. --cffi: CFFI implementation (slower), statically link to zstd library.
  4. --cffi --dynamic-link-zstd: CFFI implementation (slower), dynamically link to zstd library.

On PyPy, only CFFI implementation can be used, so --cffi is added implicitly. --dynamic-link-zstd is optional.

shell

# 🟠 pyzstd 0.15.4+ and pip 22.1+ support PEP-517:
# build and install
pip3 install --config-settings="--build-option=--dynamic-link-zstd" -v pyzstd-0.15.4.tar.gz
# build a redistributable wheel
pip3 wheel --config-settings="--build-option=--dynamic-link-zstd" -v pyzstd-0.15.4.tar.gz
# specify more than one option
pip3 wheel --config-settings="--build-option=--dynamic-link-zstd --cffi" -v pyzstd-0.15.4.tar.gz

# 🟠 legacy commands:
# build and install
python3 setup.py install --dynamic-link-zstd
# build a redistributable wheel
python3 setup.py bdist_wheel --dynamic-link-zstd

Some notes:

  • The wheels on PyPI use static linking; the packages on Anaconda use dynamic linking.
  • Whether static or dynamic linking, pyzstd module requires zstd v1.4.0+.
  • Static linking: uses zstd's official release without any changes. To upgrade or downgrade the zstd library, just replace the zstd folder.
  • Dynamic linking: if a newer zstd API is used at compile time, linking to a lower-version runtime zstd library will fail. New v1.5.0 APIs are used if possible.

On Windows, there is no system-wide zstd library. Pyzstd module can dynamically link to a DLL; modify setup.py:

python

# The E:\zstd_dll folder has zstd.h / zdict.h / libzstd.lib,
# along with libzstd.dll
if DYNAMIC_LINK:
    kwargs = {
        'include_dirs': ['E:\\zstd_dll'],   # .h directory
        'library_dirs': ['E:\\zstd_dll'],   # .lib directory
        'libraries': ['libzstd'],           # lib name, not filename, for the linker.
        ...

And put libzstd.dll into one of these directories:

  • Directory added by os.add_dll_directory() function. (The unit-tests and the CLI can't utilize this)
  • Python's root directory that has python.exe.
  • %SystemRoot%\System32

Note that the above list doesn't include the current working directory and %PATH% directories.

3️⃣ Disable mremap output buffer on CPython+Linux.

On CPython (3.5~3.12) + Linux, pyzstd uses alternative output-buffer code that can utilize the mremap mechanism, which brings some performance improvement. If this causes problems, you may use the --no-mremap build option to disable this code.