pyzstd module
The pyzstd module provides classes and functions for compressing and decompressing data using Facebook's Zstandard (zstd for short) algorithm.
The API style is similar to Python's bz2/lzma/zlib modules.
- Includes the latest zstd library source code.
- Can also dynamically link to the zstd library provided by the system; see the build notes below.
- Has a CFFI implementation that can work with PyPy.
- Supports sub-interpreters on CPython 3.12+.
- The `ZstdFile` class has C-language-level performance.
- Supports the Zstandard Seekable Format.
- Has a command line interface: `python -m pyzstd --help`.
Links: GitHub page, PyPI page.
Features of zstd:
- Fast compression and decompression speed.
- With multi-threaded compression, the compression speed improves significantly.
- With a pre-trained dictionary, the compression ratio on small data (a few KiB) improves dramatically.
- Frames and blocks allow more flexible use, suitable for many scenarios.
- Can be used as a patching engine.
This section contains:
- function `compress`
- function `decompress`

Hint
If there is a large number of individual pieces of data of the same type, reusing the following objects may eliminate the small overhead of creating a context / setting parameters / loading a dictionary:
- class `ZstdCompressor`
- class `RichMemZstdCompressor`
```python
# int compression level
compressed_dat = compress(raw_dat, 10)

# dict option: use 6 threads to compress, and append a 4-byte checksum
option = {CParameter.compressionLevel: 10,
          CParameter.nbWorkers: 6,
          CParameter.checksumFlag: 1}
compressed_dat = compress(raw_dat, option)
```
Compress data using rich memory mode. This mode allocates more memory for the output buffer, so it's faster in some cases.

This section contains:
- function `richmem_compress`
- class `RichMemZstdCompressor`, a reusable compressor.
This section contains:
- function `compress_stream`, a fast and convenient function.
- class `ZstdCompressor`, similar to the compressors in the Python standard library.

It helps to know some background about zstd data; see frame and block below.
This section contains:
- function `decompress_stream`, a fast and convenient function.
- class `ZstdDecompressor`, similar to the decompressors in the Python standard library.
- class `EndlessZstdDecompressor`, a decompressor that accepts multiple concatenated frames.
This section contains:
- class `ZstdDict`
- function `train_dict`
- function `finalize_dict`

Note
With a pre-trained zstd dictionary, the compression ratio achievable on small data (a few KiB) improves dramatically.
Background
The smaller the amount of data to compress, the more difficult it is to compress. This problem is common to all compression algorithms, and the reason is that compression algorithms learn from past data how to compress future data. But at the beginning of a new data set, there is no "past" to build upon.
Zstd's training mode can be used to tune the algorithm for a selected type of data. Training is achieved by providing it with a few samples (one file per sample). The result of this training is stored in a file called a "dictionary", which must be loaded before compression and decompression.
See the FAQ in this file for details.
Attention
- If you lose a zstd dictionary, you can't decompress the corresponding data.
- A zstd dictionary has negligible effect on large data (multi-MiB) compression. If you want to use large dictionary content, see prefix (`ZstdDict.as_prefix`).
- There is a possibility that the dictionary content could be maliciously tampered with by a third party.
Advanced dictionary training
The pyzstd module only uses the zstd library's stable API. The stable API only exposes two dictionary training functions, corresponding to `train_dict` and `finalize_dict`.
If you want to adjust advanced training parameters, you may use zstd's CLI program (not the pyzstd module's CLI); it has entries to the zstd library's experimental API.
This section contains:
- function `get_frame_info`, get frame information from a frame header.
- function `get_frame_size`, get a frame's size.
```python
>>> pyzstd.get_frame_info(compressed_dat[:20])
frame_info(decompressed_size=687379, dictionary_id=1040992268)

>>> pyzstd.get_frame_size(compressed_dat)
252874
```
This section contains:
- `zstd_version`, a str.
- `zstd_version_info`, a tuple.
- `compressionLevel_values`, some values defined by the underlying zstd library.
- `zstd_support_multithread`, whether the underlying zstd library supports multi-threaded compression.
```python
>>> pyzstd.zstd_version
'1.4.5'

>>> pyzstd.zstd_version_info
(1, 4, 5)

>>> pyzstd.compressionLevel_values  # 131072 = 128*1024
values(default=3, min=-131072, max=22)

>>> pyzstd.zstd_support_multithread  # added in version 0.15.1
True
```
This section contains:
- class `ZstdFile`, open a zstd-compressed file in binary mode.
- function `open`, open a zstd-compressed file in binary or text mode.
In writing modes (compression), these methods are available:
- .write(b)
- .flush(mode=ZstdFile.FLUSH_BLOCK), flush to the underlying stream:
  - The mode argument can be ZstdFile.FLUSH_BLOCK or ZstdFile.FLUSH_FRAME.
  - Contiguously invoking this method with .FLUSH_FRAME will not generate empty content frames.
  - Abusing this method will reduce the compression ratio; use it only when necessary.
  - If the program is interrupted afterwards, all flushed data can be recovered. To ensure it is saved to disk, you also need os.fsync(fd).
  (Added in version 0.15.1; the mode argument was added in version 0.15.9.)
In both reading and writing modes, these methods and properties are available:
- .close()
- .tell(), return the current position of uncompressed content. In append mode, the initial position is 0.
- .fileno()
- .closed (a property attribute)
- .writable()
- .readable()
- .seekable()
This section contains facilities supporting the Zstandard Seekable Format:
- exception `SeekableFormatError`
- class `SeekableZstdFile`
This section contains the `CParameter`, `DParameter`, and `Strategy` classes. They are subclasses of IntEnum, used for setting advanced parameters.

Attributes of the `CParameter` class:
- Compression level (compressionLevel)
- Compression algorithm parameters (windowLog, hashLog, chainLog, searchLog, minMatch, targetLength, strategy)
- Long distance matching (enableLongDistanceMatching, ldmHashLog, ldmMinMatch, ldmBucketSizeLog, ldmHashRateLog)
- Misc (contentSizeFlag, checksumFlag, dictIDFlag)
- Multi-threaded compression (nbWorkers, jobSize, overlapLog)

Attribute of the `DParameter` class:
- Decompression parameter (windowLogMax)

Attributes of the `Strategy` class: fast, dfast, greedy, lazy, lazy2, btlazy2, btopt, btultra, btultra2.
Note
Compression level
Compression level is an integer:
- 1 to 22 (currently): regular levels. Levels >= 20, labeled ultra, should be used with caution, as they require more memory.
- 0: use the default level, which is currently 3, defined by the underlying zstd library.
- -131072 to -1: negative levels extend the range of speed vs ratio preferences. The lower the level, the faster the speed, but at the cost of compression ratio. (131072 = 128*1024)

`compressionLevel_values` holds some values defined by the underlying zstd library.
For advanced users
Compression levels are just numbers that map to a set of compression parameters; see this table for an overview. The parameters may be adjusted by the underlying zstd library after gathering some information, such as the data size, or whether a dictionary is used.
Setting a compression level does not reset the other compression parameters to their defaults. Setting the level dynamically affects the compression parameters that have not been set manually; the manually set ones will "stick".
Note
Frame and block
Frame
Zstd data consists of one or more independent "frames". The decompressed content of multiple concatenated frames is the concatenation of each frame's decompressed content.
A frame is completely independent: it has a frame header and a set of parameters that tell the decoder how to decompress it.
In addition to normal frames, there are skippable frames that can contain any user-defined data; a skippable frame decompresses to b''.
Block
A frame encapsulates one or multiple "blocks". A block has a guaranteed maximum size (3-byte block header + 128 KiB); the actual maximum size depends on frame parameters.
Unlike independent frames, each block depends on the previous blocks for proper decoding, but doesn't need the following blocks; a complete block can be fully decompressed. So flushing blocks may be used in communication scenarios; see `ZstdCompressor.FLUSH_BLOCK`.
Attention
In some language bindings, the decompress() function doesn't support multiple frames, and/or doesn't support a frame with unknown content size. Pay attention when compressing data for other language bindings.
Note
Multi-threaded compression
The zstd library supports multi-threaded compression. Set the `CParameter.nbWorkers` parameter >= 1 to enable multi-threaded compression; 1 means "1-thread multi-threaded mode".
The threads are spawned by the underlying zstd library, not by pyzstd module.
```python
# use 4 threads to compress
option = {CParameter.nbWorkers: 4}
compressed_dat = compress(raw_dat, option)
```
The data will be split into portions and compressed in parallel. The portion size can be specified with the `CParameter.jobSize` parameter, and the overlap size with the `CParameter.overlapLog` parameter; usually there is no need to set these.
The multi-threaded output is different from the single-threaded output. However, both are deterministic, and the multi-threaded output produces the same compressed data no matter how many threads are used.
The multi-threaded output is a single frame, and it's a little larger. Compressing 520.58 MiB of data, the single-threaded output is 273.55 MiB, while the multi-threaded output is 274.33 MiB.
Hint
Using the number of physical CPU cores as the thread count may be the fastest; getting that number requires a third-party module. os.cpu_count() can only get the number of logical CPU cores (i.e. it counts hyper-threading).
Note
Rich memory mode
The pyzstd module has a "rich memory mode" for compression. It allocates more memory for the output buffer, and is faster in some cases. Suitable for extremely fast compression scenarios.
There are a `richmem_compress` function and a `RichMemZstdCompressor` class.
Currently it won't be faster when using zstd multi-threaded compression; it will issue a ResourceWarning in this case.

Effects:
- The output buffer is a little larger than the input data.
- If the input data is larger than ~31.8 KB, it is up to 22% faster. The lower the compression level, the faster it usually is.

When not using this mode, the output buffer grows gradually so as not to allocate too much memory. The negative effect is that the pyzstd module usually needs to call the underlying zstd library's compress function multiple times.
When using this mode, the size of the output buffer is provided by the ZSTD_compressBound() function, which is a little larger than the input data (the maximum compressed size in the worst-case single-pass scenario). For 100 MiB of input data, the allocated output buffer is (100 MiB + 400 KiB). The underlying zstd library avoids an extra memory copy for this output buffer size.
```python
# use richmem_compress() function
compressed_dat = richmem_compress(raw_dat)

# reuse a RichMemZstdCompressor object
c = RichMemZstdCompressor()
frame1 = c.compress(raw_dat1)
frame2 = c.compress(raw_dat2)
```
Compressing 520.58 MiB of data, it accelerates from 5.40 seconds to 4.62 seconds.
Note
Use with tarfile module
Python's tarfile module supports arbitrary compression algorithms by accepting a file object.
This code implements a ZstdTarFile class using `ZstdFile`; it can be used like the tarfile.TarFile class:
```python
import tarfile
from pyzstd import ZstdFile

class ZstdTarFile(tarfile.TarFile):
    # When using read mode (decompression), the level_or_option parameter
    # can only be a dict object that represents a decompression option. It
    # doesn't support an int compression level in this case.
    def __init__(self, name, mode='r', *, level_or_option=None,
                 zstd_dict=None, **kwargs):
        self.zstd_file = ZstdFile(name, mode,
                                  level_or_option=level_or_option,
                                  zstd_dict=zstd_dict)
        try:
            super().__init__(fileobj=self.zstd_file, mode=mode, **kwargs)
        except:
            self.zstd_file.close()
            raise

    def close(self):
        try:
            super().close()
        finally:
            self.zstd_file.close()

# write a .tar.zst file (compression)
with ZstdTarFile('archive.tar.zst', mode='w', level_or_option=5) as tar:
    ...  # do something

# read a .tar.zst file (decompression)
with ZstdTarFile('archive.tar.zst', mode='r') as tar:
    ...  # do something
```
When the above code is in read mode (decompression) and selectively reads files multiple times, it may seek to a position before the current position; then the decompression has to be restarted from zero. If this slows down the operations, you can:
- Use the `SeekableZstdFile` class to create/read the .tar.zst file.
- Decompress the archive to a temporary file, and read from that. This code encapsulates the process:
```python
import contextlib
import io
import tarfile
import tempfile
from pyzstd import decompress_stream

@contextlib.contextmanager
def ZstdTarReader(name, *, zstd_dict=None, option=None, **kwargs):
    with tempfile.TemporaryFile() as tmp_file:
        with io.open(name, 'rb') as ifh:
            decompress_stream(ifh, tmp_file,
                              zstd_dict=zstd_dict, option=option)
        tmp_file.seek(0)
        with tarfile.TarFile(fileobj=tmp_file, **kwargs) as tar:
            yield tar

with ZstdTarReader('archive.tar.zst') as tar:
    ...  # do something
```
Note
Zstd dictionary ID
The dictionary ID is a 32-bit unsigned integer value. The decoder uses it to check whether the correct dictionary is being used.
According to the zstd dictionary format specification, if a dictionary is going to be distributed in public, the following ranges are reserved for a future registrar and shall not be used:
- low range: <= 32767
- high range: >= 2^31
Outside of these ranges, any value (32767 < v < 2^31) can be used freely, even in a public environment.
In the zstd frame header, the Dictionary_ID field can be 0/1/2/4 bytes. If the value is small, this can save 2~3 bytes. Alternatively, don't write the ID at all by setting the `CParameter.dictIDFlag` parameter.
The pyzstd module doesn't currently support specifying the ID when training a dictionary. If you want to specify the ID, modify the dictionary content according to the format specification, and take the corresponding risks.
Attention
In the `ZstdDict` class, a `ZstdDict.dict_id` attribute of 0 means the dictionary is a "raw content" dictionary, free of any format restriction, used by advanced users. Non-zero means it's an ordinary dictionary that was created by zstd functions and follows the format specification.
In the `get_frame_info` function, dictionary_id == 0 means the dictionary ID was not recorded in the frame header; the frame may or may not need a dictionary to be decoded, and the ID of such a dictionary is not specified.
Note
Use zstd as a patching engine
Zstd can be used as a great patching engine, although it has some limitations.
In this particular scenario, pass the `ZstdDict.as_prefix` attribute as the zstd_dict argument. A "prefix" is similar to a "raw content" dictionary, but zstd handles them differently internally; see this issue.
Essentially, a prefix acts as if it were placed before the data to be compressed. See "ZSTD_c_deterministicRefPrefix" in this file.
1, Generating a patch (compress)
Assume VER_1 and VER_2 are two versions.
Let the "window" cover the longest version by setting `CParameter.windowLog`, and enable "long distance matching" by setting `CParameter.enableLongDistanceMatching` to 1. The --patch-from option of the zstd CLI also uses other parameters, but these two matter the most.
The valid range of windowLog is [10, 30] in 32-bit builds, [10, 31] in 64-bit builds. So in a 64-bit build, there is a 2 GiB length limit. Strictly speaking, the limit is (2 GiB - ~100 KiB). When this limit is exceeded, the patch becomes very large and loses the point of being a patch.
```python
# use VER_1 as prefix
v1 = ZstdDict(VER_1, is_raw=True)

# let the window cover the longest version.
# don't forget to clamp windowLog to the valid range.
# enable "long distance matching".
windowLog = max(len(VER_1), len(VER_2)).bit_length()
option = {CParameter.windowLog: windowLog,
          CParameter.enableLongDistanceMatching: 1}

# get a small PATCH
PATCH = compress(VER_2, level_or_option=option, zstd_dict=v1.as_prefix)
```
2, Applying the patch (decompress)
A prefix is not a dictionary, so the frame header doesn't record a dictionary ID. When decompressing, you must use the same prefix as when compressing; otherwise a ZstdError exception may be raised, with a message like "Data corruption detected".
Decompressing requires a window of the same size as when compressing; this may be a problem for devices with little RAM. If the window is larger than 128 MiB, you need to explicitly set `DParameter.windowLogMax` to allow a larger window.
```python
# use VER_1 as prefix
v1 = ZstdDict(VER_1, is_raw=True)

# allow a large window; the actual windowLog comes from the frame header.
option = {DParameter.windowLogMax: 31}

# get VER_2 from (VER_1 + PATCH)
VER_2 = decompress(PATCH, zstd_dict=v1.as_prefix, option=option)
```
Note
Build pyzstd module with options
1️⃣ If the --avx2 build option is provided, pyzstd is built with AVX2/BMI2 instructions. In the MSVC build (static link), this brings some performance improvements. GCC/Clang builds already dynamically dispatch some functions to BMI2 instructions, so there is no significant improvement, or it may even be worse.
```shell
# 🟠 pyzstd 0.15.4+ and pip 22.1+ support PEP-517:
# build and install
pip install --config-settings="--build-option=--avx2" -v pyzstd-0.15.4.tar.gz
# build a redistributable wheel
pip wheel --config-settings="--build-option=--avx2" -v pyzstd-0.15.4.tar.gz

# 🟠 legacy commands:
# build and install
python setup.py install --avx2
# build a redistributable wheel
python setup.py bdist_wheel --avx2
```
2️⃣ The pyzstd module supports:
- Dynamically linking to the zstd library (provided by the system or by a DLL); in that case the zstd source code in the zstd folder is ignored.
- A CFFI implementation that can work with PyPy.

On CPython, these build options are provided:
- no option: C implementation, statically linked to the zstd library.
- --dynamic-link-zstd: C implementation, dynamically linked to the zstd library.
- --cffi: CFFI implementation (slower), statically linked to the zstd library.
- --cffi --dynamic-link-zstd: CFFI implementation (slower), dynamically linked to the zstd library.

On PyPy, only the CFFI implementation can be used, so --cffi is added implicitly; --dynamic-link-zstd is optional.
```shell
# 🟠 pyzstd 0.15.4+ and pip 22.1+ support PEP-517:
# build and install
pip3 install --config-settings="--build-option=--dynamic-link-zstd" -v pyzstd-0.15.4.tar.gz
# build a redistributable wheel
pip3 wheel --config-settings="--build-option=--dynamic-link-zstd" -v pyzstd-0.15.4.tar.gz
# specify more than one option
pip3 wheel --config-settings="--build-option=--dynamic-link-zstd --cffi" -v pyzstd-0.15.4.tar.gz

# 🟠 legacy commands:
# build and install
python3 setup.py install --dynamic-link-zstd
# build a redistributable wheel
python3 setup.py bdist_wheel --dynamic-link-zstd
```
Some notes:
- The wheels on PyPI use static linking; the packages on Anaconda use dynamic linking.
- Whether statically or dynamically linked, the pyzstd module requires zstd v1.4.0+.
- Static linking: uses zstd's official release without any changes. To upgrade or downgrade the zstd library, just replace the zstd folder.
- Dynamic linking: if a newer zstd API is used at compile time, linking to a lower-version runtime zstd library will fail. Use the v1.5.0 new API if possible.
On Windows, there is no system-wide zstd library. The pyzstd module can dynamically link to a DLL; modify setup.py:

```python
# The E:\zstd_dll folder has zstd.h / zdict.h / libzstd.lib,
# along with libzstd.dll
if DYNAMIC_LINK:
    kwargs = {
        'include_dirs': ['E:\\zstd_dll'],  # .h directory
        'library_dirs': ['E:\\zstd_dll'],  # .lib directory
        'libraries': ['libzstd'],          # lib name, not filename, for the linker.
        ...
```
And put libzstd.dll into one of these directories:
- A directory added by the os.add_dll_directory() function. (The unit tests and the CLI can't utilize this.)
- Python's root directory, the one that contains python.exe.
- %SystemRoot%\System32

Note that the above list doesn't include the current working directory or %PATH% directories.
3️⃣ Disable the mremap output buffer on CPython+Linux.
On CPython (3.5~3.12) + Linux, pyzstd uses different output buffer code that can utilize the mremap mechanism, which brings some performance improvements. If this causes problems, you may use the --no-mremap build option to disable this code.