[Request] Support games compressed in .xz format #1964

Closed
tony971 opened this issue Jun 7, 2017 · 20 comments

@tony971

tony971 commented Jun 7, 2017

PCSX2 has a shiny new .xz module. Any chance of users being able to shrink down their game library because of it?

@gregory38
Contributor

I would love it.

Some info for a future implementer:

  • index information is built into the xz format
  • the xz-utils code can be used as an example of how to get the index information

@gregory38
Contributor

gregory38 commented Aug 6, 2017

A future version of xz (5.3) adds a new API to parse block information: lzma_file_info_decoder. This function decodes all the header data and creates a block list structure (type lzma_index). Note that xz calls this block list an index.

Then you can create an iterator with lzma_index_iter_init. (The next element can be fetched with lzma_index_iter_next, but I don't think that is needed here.)

You can go directly to the right block with lzma_index_iter_locate, which positions the iterator on the block containing the uncompressed address you want to decode.

Summary:

  • lzma_file_info_decoder(&stream, &index, ...): outputs the index
  • lzma_index_iter_init(&iter, index): outputs the iterator
  • lzma_index_iter_locate(&iter, uncompressed_address): positions the iterator on the right block
  • You can then use iter.block.compressed_file_offset, iter.block.uncompressed_file_offset and iter.block.uncompressed_size
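To make the locate step concrete, here is a minimal sketch of the lookup it performs conceptually. This is not liblzma code: the block_info struct and locate_block function are hypothetical names, and the array is assumed to be sorted by uncompressed offset with no gaps.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-block record, holding the same three values that the
 * iterator exposes through iter.block.* in the summary above. */
typedef struct {
    uint64_t compressed_file_offset;
    uint64_t uncompressed_file_offset;
    uint64_t uncompressed_size;
} block_info;

/* Binary search for the block that contains the uncompressed offset
 * `target`. Returns the block's position in the array, or -1 when the
 * target lies past the end of the last block. */
static ptrdiff_t locate_block(const block_info *blocks, size_t count,
                              uint64_t target)
{
    assert(count == 0 || blocks != NULL);

    size_t lo = 0;
    size_t hi = count;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        const block_info *b = &blocks[mid];
        if (target < b->uncompressed_file_offset)
            hi = mid;                 /* target is in an earlier block */
        else if (target >= b->uncompressed_file_offset + b->uncompressed_size)
            lo = mid + 1;             /* target is in a later block */
        else
            return (ptrdiff_t)mid;    /* target falls inside this block */
    }
    return -1;
}
```

Once the block is found, the read position inside the decoded block is target minus that block's uncompressed_file_offset.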

The API is not clear to me. There are lzma_block/lzma_index/lzma_index_iter objects. It seems that the iterator above should be used to decode the block header info.

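while (!lzma_index_iter_next(&iter, LZMA_INDEX_ITER_BLOCK)) {
    lzma_block block;
    // pseudocode: the first byte of the block encodes the header size
    uint8_t first_byte = /* fread of 1 byte at iter.block.compressed_file_offset */;
    block.header_size = lzma_block_header_size_decode(first_byte);
    // block.version and block.check must be set before decoding the header;
    // the check type comes from the stream flags
    block.version = 0;
    block.check = iter.stream.flags->check;
    // block.filters must also point to an array of LZMA_FILTERS_MAX + 1 entries
    // compressed_buffer holds block.header_size bytes read from the same offset
    lzma_block_header_decode(&block, ..., compressed_buffer);
    all_info_struct.push(block);
}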
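The header-size step in the loop above follows a simple rule from the .xz file format specification: the real header size is (encoded byte + 1) * 4. A minimal sketch of that rule, with block_header_size_from_byte as a hypothetical stand-in for the liblzma macro:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper that mirrors the header-size rule of the .xz format.
 * A first byte of 0x00 is not a block header at all: it marks the start
 * of the index, so callers are expected to check for it beforehand. */
static uint32_t block_header_size_from_byte(uint8_t first_byte)
{
    assert(first_byte != 0x00);
    return ((uint32_t)first_byte + 1) * 4;
}
```

This gives header sizes from 8 bytes (encoded 0x01) up to 1024 bytes (encoded 0xFF), always a multiple of four.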

With the block and file info, you can directly use lzma_block_buffer_decode:

lzma_ret lzma_block_buffer_decode(lzma_block *block, const lzma_allocator *allocator,
		const uint8_t *in, size_t *in_pos, size_t in_size,
		uint8_t *out, size_t *out_pos, size_t out_size);

@gregory38
Contributor

gregory38 commented Aug 6, 2017

So I read the xz file format specification. Information is duplicated for redundancy and corruption checking.

So the full story is:

  • A block contains a header plus data. The header contains the header size and may (depending on the compression flags) contain the compressed/uncompressed size of the block.
  • The index is a list of block records. Each record contains the unpadded size and the uncompressed size.

NOTE: I checked a binary on my computer and the sizes aren't present in the block header.

Conclusion: we first need to decode the index to get the offsets of the blocks. The iterator allows us to iterate over the block records.
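As an illustration of how block offsets fall out of the index records, here is a minimal sketch. It assumes a file with a single stream and no stream padding, and it relies on two facts from the .xz format: the stream header is 12 bytes, and each block is padded on disk to a multiple of four bytes. The index_record and block_offsets types and the compute_block_offsets function are hypothetical names, not part of liblzma.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define XZ_STREAM_HEADER_SIZE 12u /* fixed size of the .xz stream header */

/* Hypothetical mirror of one index record. */
typedef struct {
    uint64_t unpadded_size;     /* header + compressed data + check */
    uint64_t uncompressed_size;
} index_record;

/* Hypothetical result: where a block starts in each address space. */
typedef struct {
    uint64_t compressed_file_offset;
    uint64_t uncompressed_file_offset;
} block_offsets;

/* Walk the records in order and accumulate both offsets.
 * Assumption: one stream, so the first block follows the stream header. */
static void compute_block_offsets(const index_record *records, size_t count,
                                  block_offsets *out)
{
    assert(count == 0 || (records != NULL && out != NULL));

    uint64_t comp = XZ_STREAM_HEADER_SIZE;
    uint64_t uncomp = 0;
    for (size_t i = 0; i < count; i++) {
        out[i].compressed_file_offset = comp;
        out[i].uncompressed_file_offset = uncomp;
        /* Round the unpadded size up to the next multiple of four. */
        comp += (records[i].unpadded_size + 3) & ~(uint64_t)3;
        uncomp += records[i].uncompressed_size;
    }
}
```

With liblzma itself this bookkeeping should not be needed, since the iterator already exposes the resulting offsets.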

@gregory38
Contributor

gregory38 commented Aug 6, 2017

@turtleli
I would need a newer version of xz to do some tests (an unreleased one, actually). How can I sync https://github.com/PCSX2/xz.git with the latest upstream git?

Edit: actually, don't bother. I pulled something into a local branch. It should be enough.

@avih
Contributor

avih commented Aug 6, 2017

While xz is definitely popular, if possible I'd suggest also examining/playing with newer compression formats like zstd and brotli, or maybe some of the LZ* family. In my experience with the gzip implementation, random-access decompression speed is the key to avoiding a lot of pitfalls, workarounds and caches.

Also, it's best to avoid creating an index, and instead stick to formats/configurations which provide their own index as part of the standard, and require users to use only these configurations (I don't know how true this is for the formats I mentioned).

@gregory38
Contributor

So I'm rather close to a working prototype (based on the latest xz git). I managed to uncompress a couple of blocks, and the cdvd format seems to be detected correctly. But it fails later. Maybe an issue with block boundaries. I need to double-check the logic.

@gregory38
Contributor

Good news: I managed to boot a game. The issue was in the blocksize/blockcount management. Honestly, the logic should be moved into the base class. Anyway, the XZ stuff is done 👍

@gregory38
Contributor

gregory38 commented Aug 10, 2017

As a side note, xz could also be a neat replacement for save states. I saved 30% by repacking a savestate.

@tony971
Author

tony971 commented Sep 14, 2017

Is this waiting on a new XZ release?

@gregory38
Contributor

Yeah, a new XZ release would help. We would need to release 1.6 too. I don't want to require an alpha release of XZ for our release. I'm also waiting to have free time to merge the code.

@Zero3K

Zero3K commented Dec 31, 2017

Any news regarding it?

@MrCK1
Member

MrCK1 commented Dec 31, 2017

Nope, everything you see in the pull requests section is what's currently being worked on.

@Zero3K

Zero3K commented Dec 31, 2017

I hope that it gets added soon.

@tony971
Author

tony971 commented Jan 5, 2018

I've just been checking for new XZ releases. This is the biggest gap between releases they've had in a long while, so hopefully it's soon.

@tony971
Author

tony971 commented Apr 30, 2018

Looks like an alpha build of xz utils was published with the lzma_file_info_decoder() API:

https://git.tukaani.org/?p=xz.git;a=blob_plain;f=NEWS;hb=114cab97af766b21e0fc8620479202fb1e7a5e41

@Quest79

Quest79 commented Sep 18, 2020

What's happening (or happened) with this? I've just done a ton of tests to compress my PS2 library, and out of gz, zip, 7z, rar, cso and xz, xz had the smallest file size and did it suspiciously fast: 10MB/s with the strongest compression. Gz was doing about 2MB/s (fewer cores used). I'm using the one built into the current 7z.

Space savings are roughly 8-15% over gz. That's several hundred GB saved for larger collections.
Theoretically~ if we keep implementing new better compression every few years, in 200 years or so a massive Ps2 collections will be under 10KB

@refractionpcsx2
Member

Nothing has really happened, I'm afraid. xz got added to PCSX2 for making GS dumps, but there's no support for loading games yet; it's not really been much of a priority.

Theoretically~ if we keep implementing new better compression every few years, in 200 years or so a massive Ps2 collections will be under 10KB

That's big brain right there, but I don't think that's how compression works.

@gregory38
Contributor

gregory38 commented Sep 18, 2020

By the way, I found this codec recently. The promise is a compression ratio similar to lzma, but a much faster decompression speed.

https://github.com/richgel999/lzham_codec

However, I don't know if we can chunk the bitstream for random access.

@tony971
Author

tony971 commented Sep 18, 2020

This is the most promising I've found.

https://github.com/aaru-dps/Aaru

@lightningterror
Contributor

Closing as trivial, and we support other formats.
A PR was opened to implement it in #2424; if someone wishes, they can resume work to get it into a mergeable state.
