Parallelize decoding #3
What I am missing interface-wise is the size of the last block. I can get all other sizes from the differences of neighboring block boundaries, but not for the last block. Strictly speaking, if we are talking about block boundaries, then the end of the last block should also be returned, even if it is not the beginning of a new block; I could then infer the decompressed size of the last block from it. Edit: Ok, I can simply use
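The workaround described above can be sketched as follows. All names here are hypothetical: `block_starts` stands for the block boundaries a file object would report, and `total_size` for the total decompressed size (obtainable, e.g., via `seek(0, io.SEEK_END)`).

```python
def block_sizes(block_starts, total_size):
    """Derive per-block decompressed sizes from block start offsets.

    The size of each block is the difference to the next boundary; the
    last block's size follows from the total decompressed size.
    """
    ends = list(block_starts[1:]) + [total_size]
    return [end - start for start, end in zip(block_starts, ends)]

print(block_sizes([0, 100, 300], 4096))  # → [100, 200, 3796]
```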
After some thinking, I believe that this issue is complex and would add too much complexity to this project: there are many ways to do parallelism in Python (threading, asyncio, gevent, etc.), and I feel like this would be too much for a single use case (parallel decompression). This is in the same spirit as my choice not to support other features, such as decompressing a stream on the fly. So I would suggest creating your own library to support this specific use case. If you decide to base it on

With that being said, I agree that the API to get the blocks and their positions/sizes is not optimal at all. I copied that from the

I feel like a better API would be an attribute returning all the blocks as a list of objects, and from each object you would be able to get the size and the position. I can see two possible APIs; let me know what you think:
Option 1:
>>> for pos, block in file.blocks.items():
...     print(pos, len(block))
0 100
100 200
Option 2:
>>> for block in file.blocks:
...     print(block.something, len(block))  # attribute name `something` to be decided
0 100
100 200
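The second option could be backed by a small value object. This is only a sketch for discussion; the names `Block` and `uncompressed_offset` are placeholders, not an actual python-xz API.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Block:
    """One block of the compressed file, as seen in the decompressed stream."""
    uncompressed_offset: int
    size: int

    def __len__(self):
        return self.size

# What `file.blocks` might return for the example above:
blocks = [Block(0, 100), Block(100, 200)]
for block in blocks:
    print(block.uncompressed_offset, len(block))
```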
For the time being you can use
I have made do with what is currently available, so it isn't that pressing anymore. Feel free to close this issue if you want. I improved and benchmarked the previously posted sketch and added it to (the public interface of)

But about the proposed interfaces: based on the getter name
@mxmlnkn @Rogdham @pauldmccarthy, seeing the conversation above: is there any appetite to unify the indexed_gzip + indexed_bzip2 + python-xz approach? I mean in the sense of a unified API solution to random seeking. Python's built

And maybe even submit that API back to CPython, to replace the

In my imagination, the benefits would be a) lower maintenance / less duplicated effort, and b) greater reach for the outstanding work done on your libraries. But maybe I'm way off; please let me know.
I want to interject that I'm still heavily working on pragzip, which is the same code base as

But the underlying infrastructure for parallel block-based decompression with random-seek capabilities is sufficiently generic to work with both bzip2 and gzip already, and it probably can also be extended to work with xz and zstandard. The interfaces of the pragzip and indexed_bzip2 Python modules are also mostly similar. But I'm only a single person doing this mostly in my free time, so the speed of progress is limited.
I have zero experience with that... It sounds difficult, and (just speculating) it might not work because C++17, which indexed_bzip2 and pragzip heavily use, is not allowed inside CPython. Also, other Python interpreters like PyPy, Codon(?), and others would still require the packages to be maintained.
a) For the single-threaded random-seek support alone, I don't think it would save that much effort. Each compression format's index requires slightly different data, and the main problem is using the existing compression libraries correctly (the zstandard library, Python's lzma, ...), i.e., highly divergent glue code. As for a generic infrastructure for random seeking plus block-parallel decompression, see my first paragraph. b) I would like greater reach and adoption :). But, to be honest, I would also like some attribution and direct contact with users. And if it's inside CPython, then it becomes basically invisible inside the whole amalgamated CPython code base. Some unified library, like libarchive is, would be nice though. In an ideal world, it might even be integrated into libarchive, but starting a separate project is always easier to get done, and libarchive is written in C, which I don't want to write.
Hi @piskvorky @mxmlnkn @Rogdham, in principle I like the idea of a unified library, but unfortunately I just don't have the time to contribute beyond discussion/review.
I have nothing against a unified library either, and I agree there are many things that could be factored out (e.g. the python-xz code for file objects is already generic). The good thing is that having a few different file formats to consider would make it pretty clear what can be factored out and what cannot. Feel free to ping me if my knowledge about the XZ file format can be of any help!
…odule rapidgzip was split off from indexed_bzip2 because it adds many dependencies and makes the build process more error-prone. The other way around, indexed_bzip2 is fully self-contained and relatively small because of fewer precomputed lookup tables. Ergo, there is basically no downside to including it with rapidgzip, and it has the upside that, in the future, ratarmount might only have to depend on rapidgzip when/if it gets zstd or even lz4 support.

This adds ~15% overhead to the compressed and uncompressed precompiled wheels and binaries:

zipinfo on rapidgzip 0.10.3 wheel:
35230512 rapidgzip.cpython-310-x86_64-linux-gnu.so
8 files, 35266211 bytes uncompressed, 7523567 bytes compressed: 78.7%

zipinfo on rapidgzip wheel with indexed_bzip2:
39876504 rapidgzip.cpython-310-x86_64-linux-gnu.so
8 files, 39915164 bytes uncompressed, 8936562 bytes compressed: 77.6%

zipinfo on indexed_bzip2 1.5.0:
12061016 indexed_bzip2.cpython-310-x86_64-linux-gnu.so
8 files, 12088983 bytes uncompressed, 3795195 bytes compressed: 68.6%

Notably, this merge saves ~15% compared to having both installed, which is helpful for the AppImage. The version of indexed_bzip2 is not easily accessible via the rapidgzip module; it can be thought of as the one shipped by rapidgzip. The rapidgzip.open method does not (yet) recognize bzip2 files. This is one step towards the unification mentioned in: Rogdham/python-xz#3
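The quoted percentages can be checked directly from the zipinfo figures above (uncompressed wheel sizes, in bytes):

```python
# Wheel sizes quoted from the zipinfo output above.
rapidgzip_only = 35_266_211       # rapidgzip 0.10.3 wheel
rapidgzip_with_bz2 = 39_915_164   # rapidgzip wheel with indexed_bzip2 merged in
indexed_bzip2 = 12_088_983        # standalone indexed_bzip2 1.5.0 wheel

# Overhead of the merged wheel over rapidgzip alone.
overhead = rapidgzip_with_bz2 / rapidgzip_only - 1
# Saving of the merged wheel versus installing both wheels separately.
saving = 1 - rapidgzip_with_bz2 / (rapidgzip_only + indexed_bzip2)

print(f"overhead: {overhead:.1%}, saving vs. both wheels: {saving:.1%}")
# → overhead: 13.2%, saving vs. both wheels: 15.7%
```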
Xz is pretty slow compared to other compression formats. It would be really cool if python-xz could be parallelized such that it prefetches the next blocks and decodes them in parallel. I think this would be a helpful feature and unique selling point for python-xz. I don't think there is a parallelized XZ decoder for Python at all, or is there?
I'm doing something similar in indexed_bzip2. But I am aware that this adds complexity and problems: xz can compress in parallel, so maybe that could also be possible.

I implemented a very rudimentary sketch on top of python-xz using multiprocessing.pool.Pool. It has the same design as indexed_bzip2, which is:

With this, I was able to speed up the decompression of a 3.1 GiB xz file (4 GiB decompressed) consisting of 171 blocks by a factor of ~7 on an 8-core CPU (16 virtual cores):
However, at this point I'm becoming uncertain whether this might be easier to implement inside python-xz itself or whether the wrapper is a sufficient ad-hoc solution. It only uses public methods and members of XZFile, so it should be stable across non-major version changes.
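The core idea of the scheme, decoding independently decodable blocks in parallel and reassembling them in order, can be illustrated in miniature with the standard library alone. Each small xz stream below stands in for one block; this is a toy, not the actual wrapper. The sketch above used multiprocessing.pool.Pool; ThreadPool has the same interface, avoids pickling, and still scales because lzma releases the GIL during decompression.

```python
import lzma
from multiprocessing.pool import ThreadPool

def compress_blocks(data: bytes, block_size: int) -> list:
    """Compress `data` as independent xz streams of `block_size` bytes each."""
    return [lzma.compress(data[i:i + block_size])
            for i in range(0, len(data), block_size)]

def parallel_decompress(blocks, workers: int = 4) -> bytes:
    """Decompress all blocks in parallel; pool.map preserves block order."""
    with ThreadPool(workers) as pool:
        return b"".join(pool.map(lzma.decompress, blocks))

data = bytes(range(256)) * 4096              # 1 MiB of sample data
blocks = compress_blocks(data, 256 * 1024)   # four independent streams
assert parallel_decompress(blocks) == data
```

A real implementation would additionally need the block index to know where each block starts in the compressed file, plus prefetching and an eviction policy for decoded blocks.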
Rudimentary unfinished sketch / proof of work:
decompress-xz-parallel.py
parallel_xz_decoder.py
Manual Shell Execution