Proposed new virtual file system vsilz4 #2201

Open
bjornharrtell opened this issue Jan 31, 2020 · 15 comments

Comments

@bjornharrtell
Contributor

bjornharrtell commented Jan 31, 2020

The rationale is that it would allow for random access to compressed data. Looks like xz is the only mainstream format that has random access built into the standard. I previously thought lz4 had the capability, but it seems it's a non-standard proof of concept. Looks like both xz and lz4 can do it, but I think lz4 is more attractive due to its impressive speed.

I'm interested in making an attempt to implement this.

@rouault
Member

rouault commented Jan 31, 2020

where you need random access via network.

Can you efficiently seek to a given offset of the uncompressed stream without having to decompress the stream from the beginning up to that offset?

@bjornharrtell
Contributor Author

bjornharrtell commented Jan 31, 2020

Yes, from what I understand that is the purpose of the frames. https://tukaani.org/xz/format.html describes it as "The data can be split into independently compressed blocks. Every .xz file contains an index of the blocks, which makes limited random-access reading possible when the block size is small enough."

So there is a trade-off: smaller blocks give more efficient random access at the cost of a lower compression ratio.
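
A rough sketch of that trade-off, using Python's standard `lzma` module. It rests on a simplifying assumption: each slice is compressed as its own complete .xz stream, which is not the same thing as blocks inside a single .xz file with an index. The helper name `compress_in_blocks` and the sample data are made up for illustration.

```python
import lzma


def compress_in_blocks(data: bytes, block_size: int) -> list[bytes]:
    """Compress every block_size slice of data independently.

    Assumption: each slice becomes a separate .xz stream, so any
    slice can be decompressed without touching the others.
    """
    return [
        lzma.compress(data[start:start + block_size])
        for start in range(0, len(data), block_size)
    ]


# Hypothetical sample input: repetitive text records.
data = b"".join(
    b"feature %06d: POINT (%d %d)\n" % (i, i % 360, i % 180)
    for i in range(50_000)
)

# Total compressed size for a few block sizes.
sizes = {}
for block_size in (4 * 1024, 64 * 1024, 1024 * 1024):
    blocks = compress_in_blocks(data, block_size)
    sizes[block_size] = sum(len(b) for b in blocks)
    print(f"block size {block_size:>8}: {len(blocks):>4} blocks, "
          f"{sizes[block_size]} compressed bytes")
```

Smaller blocks mean less data to decompress per random read, but each block starts with an empty dictionary and carries its own header overhead, so the total compressed size goes up.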

@bjornharrtell
Contributor Author

If I understand correctly zstd might get official framing support one day (see facebook/zstd#395).

@bjornharrtell
Contributor Author

bjornharrtell commented Jan 31, 2020

Looks like I have misunderstood lz4; according to the discussion at lz4/lz4#187 it does not support random access (out of the box).

@bjornharrtell
Contributor Author

Looks like xz shows more promise, so I'll switch my efforts to that.

@bjornharrtell bjornharrtell changed the title Proposed new virtual file system vsilz4 Proposed new virtual file system vsixz Feb 1, 2020
@rouault
Member

rouault commented Feb 1, 2020

The downside of xz is that it isn't particularly fast. Too bad there is not yet standardized framing support for zstd.

@rouault
Member

rouault commented Feb 1, 2020

FYI some criticism of the xz format: https://www.nongnu.org/lzip/xz_inadequate.html

@rwmjones

xz certainly supports random access without seeking from the start, see: http://libguestfs.org/nbdkit-xz-filter.1.html

@bjornharrtell
Contributor Author

Revisiting this, I'm again considering looking into lz4. I think I misunderstood twice... and that it does support random access when compressing with independent blocks, which does seem to be the default.

@bjornharrtell bjornharrtell changed the title Proposed new virtual file system vsixz Proposed new virtual file system vsilz4 Dec 2, 2020
@bjornharrtell
Contributor Author

Still interested in this... and I note liblz4 is now available in GDAL for other purposes.

@bjornharrtell
Contributor Author

bjornharrtell commented Nov 21, 2021

@bjornharrtell
Contributor Author

Hmm, and again I'm back to having probably misunderstood the lz4 frame format. Quote from https://github.com/lz4/lz4/blob/dev/doc/lz4_Frame_format.md#introduction - "The data format defined by this specification does not attempt to allow random access to compressed data."

@rouault
Copy link
Member

rouault commented Apr 21, 2022

"The data format defined by this specification does not attempt to allow random access to compressed data."

Yeah, I doubt any compression method has a standardized way of encoding a table that maps uncompressed offsets to the offset of the start of a new frame (or equivalent mechanism).
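
To make the kind of table meant here concrete, below is a minimal sketch in Python. It is an application-level toy, not any standardized format: `build_frames` and `read_at` are hypothetical names, each frame is assumed to be an independent .xz stream produced by the standard `lzma` module, and the index is kept in memory rather than encoded in the file.

```python
import bisect
import lzma


def build_frames(data: bytes, frame_size: int):
    """Return (blob, index) for data split into independent frames.

    index[i] = (uncompressed_offset, compressed_offset) of frame i.
    """
    blob = bytearray()
    index = []
    for start in range(0, len(data), frame_size):
        index.append((start, len(blob)))
        blob += lzma.compress(data[start:start + frame_size])
    return bytes(blob), index


def read_at(blob: bytes, index, offset: int, length: int) -> bytes:
    """Read length uncompressed bytes at offset, decompressing only
    the frames that overlap the requested range."""
    if offset < 0:
        raise ValueError("offset must be non-negative")
    if not index:
        return b""
    # Find the last frame whose uncompressed start is <= offset.
    starts = [u for u, _ in index]
    i = bisect.bisect_right(starts, offset) - 1
    skip = offset - index[i][0]  # bytes to drop from the first frame
    out = bytearray()
    while i < len(index) and len(out) < skip + length:
        c_start = index[i][1]
        c_end = index[i + 1][1] if i + 1 < len(index) else len(blob)
        out += lzma.decompress(blob[c_start:c_end])
        i += 1
    return bytes(out[skip:skip + length])


# Usage: read 20 000 bytes from the middle without decompressing
# the frames before the requested offset.
data = bytes(range(256)) * 1000
blob, index = build_frames(data, 10_000)
chunk = read_at(blob, index, 25_000, 20_000)
```

The lookup itself is just a binary search over the uncompressed offsets; the hard part discussed in this thread is agreeing on where and how such a table is stored alongside the compressed data.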

@rwmjones

rwmjones commented Apr 21, 2022

That's exactly what xz has, and the existence proof of this is: https://gitlab.com/nbdkit/nbdkit/-/tree/master/filters/xz We use this routinely to randomly access the content of xz-compressed disk images without scanning or (fully) uncompressing them.

@bjornharrtell
Contributor Author

I would pursue vsixz but the decompression speed is no fun. Seems everyone is moving away from xz these days. But it's definitely sad that random access has moved to application level when xz proved it could be part of the general compression format.
