Parallelize decoding #3
What I am missing interface-wise is the size of the last block. I can get all other sizes from the differences of neighboring block boundaries, but not for the last block. Strictly speaking, if we are talking about block boundaries, then the end of the last block should also be returned, even if it is not the beginning of a new block; I could then infer the decompressed size of the last block from it. Edit: Ok, I can simply use
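The workaround described above can be sketched as follows. All names here are hypothetical: `block_starts` stands for the block boundaries a file object would report, and `total_size` for the total decompressed size (obtainable, e.g., via `seek(0, io.SEEK_END)`).

```python
def block_sizes(block_starts, total_size):
    """Derive per-block decompressed sizes from block start offsets.

    The size of each block is the difference to the next boundary; the
    last block's size follows from the total decompressed size.
    """
    ends = list(block_starts[1:]) + [total_size]
    return [end - start for start, end in zip(block_starts, ends)]

print(block_sizes([0, 100, 300], 4096))  # → [100, 200, 3796]
```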
After some thinking, I believe that this issue is complex and would add too much complexity to this project: there are many ways to do parallelism in Python (threading, asyncio, gevent, etc.), and I feel like this would be too much for a single use case (parallel decompression). This is in the same spirit as my choice not to support other features, such as decompressing a stream on the fly. So I would suggest creating your own library to support this specific use case. If you decide to base it on

With that being said, I agree that the API to get the blocks and their positions/sizes is not optimal at all. I copied that from the

I feel like a better API would be an attribute returning all the blocks as a list of objects, and from each object you would be able to get the size and the position. I can see two possible APIs; let me know what you think:
Option 1:
>>> for pos, block in file.blocks.items():
...     print(pos, len(block))
0 100
100 200
Option 2:
>>> for block in file.blocks:
...     print(block.something, len(block))  # attribute name `something` to be decided
0 100
100 200
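The second option could be backed by a small value object. This is only a sketch for discussion; the names `Block` and `uncompressed_offset` are placeholders, not an actual python-xz API.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Block:
    """One block of the compressed file, as seen in the decompressed stream."""
    uncompressed_offset: int
    size: int

    def __len__(self):
        return self.size

# What `file.blocks` might return for the example above:
blocks = [Block(0, 100), Block(100, 200)]
for block in blocks:
    print(block.uncompressed_offset, len(block))
```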
For the time being you can use
I have made do with what is currently available, so it isn't that pressing anymore. Feel free to close this issue if you want. I improved and benchmarked the previously posted sketch and added it to (the public interface of)

But about the proposed interfaces: based on the getter name
@mxmlnkn @Rogdham @pauldmccarthy, seeing the conversation above: is there any appetite to unify the indexed_gzip + indexed_bzip2 + python-xz approach? I mean in the sense of a unified API solution to random seeking. Python's built

And maybe even submit that API back to CPython, to replace the

In my imagination, the benefits would be a) lower maintenance / less duplicated effort, and b) greater reach for the outstanding work done on your libraries. But maybe I'm way off; please let me know.
I want to interject that I'm still heavily working on pragzip, which is the same code base as

But the underlying infrastructure for parallel block-based decompression with random-seek capabilities is sufficiently generic to work with both bzip2 and gzip already, and it probably can also be extended to work with xz and zstandard. The interfaces of the pragzip and indexed_bzip2 Python modules are also mostly similar. But I'm only a single person doing this mostly in my free time, so the speed of progress is limited.
I have zero experience with that... It sounds difficult, and (just speculating) it might not work because C++17, which indexed_bzip2 and pragzip heavily use, is not allowed inside CPython. Also, other Python interpreters like PyPy, Codon(?), and others would still require the packages to be maintained.
a) For the single-threaded random-seek support alone, I don't think it would save that much effort. Each compression format's index requires slightly different data, and the main problem is using the existing compression libraries correctly (the zstandard library, Python's lzma, ...), i.e., highly divergent glue code. As for a generic infrastructure for random seeking plus block-parallel decompression, see my first paragraph. b) I would like greater reach and adoption :). But, to be honest, I would also like some attribution and direct contact with users. And if it's inside CPython, then it becomes basically invisible inside the whole amalgamated CPython code base. Some unified library, like libarchive is, would be nice though. In an ideal world, it might even be integrated into libarchive, but starting a separate project is always easier to get done, and libarchive is written in C, which I don't want to write.
Hi @piskvorky @mxmlnkn @Rogdham, in principle I like the idea of a unified library, but unfortunately I just don't have the time to contribute beyond discussion/review.
I have nothing against a unified library either, and I agree there are many things that could be factored out (e.g. the python-xz code for file objects is already generic). The good thing is that having a few different file formats to consider would make it pretty clear what can be factored out and what cannot. Feel free to ping me if my knowledge about the XZ file format can be of any help!
…odule rapidgzip was split off from indexed_bzip2 because it adds many dependencies and makes the build process more error-prone. The other way around, indexed_bzip2 is fully self-contained and relatively small because of fewer precomputed lookup tables. Ergo, there is basically no downside to including it with rapidgzip, and it has the upside that, in the future, ratarmount might only have to depend on rapidgzip when/if it gets zstd or even lz4 support.

This adds ~15% overhead to the compressed and uncompressed precompiled wheels and binaries:

zipinfo on rapidgzip 0.10.3 wheel:
35230512 rapidgzip.cpython-310-x86_64-linux-gnu.so
8 files, 35266211 bytes uncompressed, 7523567 bytes compressed: 78.7%

zipinfo on rapidgzip wheel with indexed_bzip2:
39876504 rapidgzip.cpython-310-x86_64-linux-gnu.so
8 files, 39915164 bytes uncompressed, 8936562 bytes compressed: 77.6%

zipinfo on indexed_bzip2 1.5.0:
12061016 indexed_bzip2.cpython-310-x86_64-linux-gnu.so
8 files, 12088983 bytes uncompressed, 3795195 bytes compressed: 68.6%

Notably, this merge saves ~15% compared to having both installed, which is helpful for the AppImage. The version of indexed_bzip2 is not easily accessible via the rapidgzip module; it can be thought of as the one shipped by rapidgzip. The rapidgzip.open method does not (yet) recognize bzip2 files. This is one step towards the unification mentioned in: Rogdham/python-xz#3
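The quoted percentages can be checked directly from the zipinfo figures above (uncompressed wheel sizes, in bytes):

```python
# Wheel sizes quoted from the zipinfo output above.
rapidgzip_only = 35_266_211       # rapidgzip 0.10.3 wheel
rapidgzip_with_bz2 = 39_915_164   # rapidgzip wheel with indexed_bzip2 merged in
indexed_bzip2 = 12_088_983        # standalone indexed_bzip2 1.5.0 wheel

# Overhead of the merged wheel over rapidgzip alone.
overhead = rapidgzip_with_bz2 / rapidgzip_only - 1
# Saving of the merged wheel versus installing both wheels separately.
saving = 1 - rapidgzip_with_bz2 / (rapidgzip_only + indexed_bzip2)

print(f"overhead: {overhead:.1%}, saving vs. both wheels: {saving:.1%}")
# → overhead: 13.2%, saving vs. both wheels: 15.7%
```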
Xz is pretty slow compared to other compression formats. It would be really cool if python-xz could be parallelized such that it prefetches the next blocks and decodes them in parallel. I think this would be a helpful feature and unique selling point for python-xz. I don't think there is a parallelized XZ decoder for Python at all, or is there?
I'm doing something similar in indexed_bzip2. But I am aware that this adds complexity and problems: xz can compress in parallel, so maybe that could also be possible.

I implemented a very rudimentary sketch on top of python-xz using multiprocessing.pool.Pool. It has the same design as indexed_bzip2, which is:

With this, I was able to speed up the decompression of a 3.1 GiB xz file (4 GiB decompressed) consisting of 171 blocks by a factor of ~7 on an 8-core CPU (16 virtual cores):
However, at this point I'm becoming uncertain whether this might be easier to implement inside python-xz itself or whether the wrapper is a sufficient ad-hoc solution. It only uses public methods and members of XZFile, so it should be stable across non-major version changes.
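The core idea of the scheme, decoding independently decodable blocks in parallel and reassembling them in order, can be illustrated in miniature with the standard library alone. Each small xz stream below stands in for one block; this is a toy, not the actual wrapper. The sketch above used multiprocessing.pool.Pool; ThreadPool has the same interface, avoids pickling, and still scales because lzma releases the GIL during decompression.

```python
import lzma
from multiprocessing.pool import ThreadPool

def compress_blocks(data: bytes, block_size: int) -> list:
    """Compress `data` as independent xz streams of `block_size` bytes each."""
    return [lzma.compress(data[i:i + block_size])
            for i in range(0, len(data), block_size)]

def parallel_decompress(blocks, workers: int = 4) -> bytes:
    """Decompress all blocks in parallel; pool.map preserves block order."""
    with ThreadPool(workers) as pool:
        return b"".join(pool.map(lzma.decompress, blocks))

data = bytes(range(256)) * 4096              # 1 MiB of sample data
blocks = compress_blocks(data, 256 * 1024)   # four independent streams
assert parallel_decompress(blocks) == data
```

A real implementation would additionally need the block index to know where each block starts in the compressed file, plus prefetching and an eviction policy for decoded blocks.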
Rudimentary unfinished sketch / proof of work:
decompress-xz-parallel.py
parallel_xz_decoder.py
Manual Shell Execution