
Faster lookup of entries in large file-count Zip #147

Open
yairlenga opened this issue May 6, 2023 · 0 comments

Comments

@yairlenga

Looking for feedback for the following problem:

I have a large hierarchy (>1M files, each ~10K compressed) zipped into a logical "dataset". The individual files represent semi-structured simulation results. (Side note: I've tried storing the data in a Parquet file, but performance for retrieving a subset of the data was poor.)

When reading the Zip file, the code spends a lot of time reading the central directory. When retrieving a large number of experiments (e.g., 100+), the one-time cost of the central directory read is reasonable (it is amortized across all reads). However, when looking up a few experiments (or just one), the cost of reading the central directory (measured at double-digit MB) outweighs the cost of reading the file itself, resulting in poor performance.

I did some research, and I understand that there is no "generic" solution, as the central directory must be read sequentially (variable-length entries, no "block markers", unsorted content). I'm hoping for feedback/ideas on whether it is possible to build something more efficient, leveraging the "virtualization" of zziplib, to speed up processing. Basic idea:

  1. Take the original Zip file.
  2. Create an "alternative" directory structure that can be binary-searched (sort entries by name, make every entry fixed size, create an index).
  3. Store the "alternative" directory as an entry in the Zip file.
  4. Somehow (how?) avoid reading the "real" central dir, and instead binary-search the "alternative" directory to locate the entry information (see the sketch after this list).
  5. Use the entry information to extract/inflate the real experiment.
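
For concreteness, here is a minimal sketch of what such an "alternative" directory could look like. The record layout, field names, and the `idx_lookup` helper are hypothetical (this is not an existing zziplib format), and `fseeko` assumes a POSIX toolchain:

```c
/* Sketch of a possible "alternative directory": one fixed-size record per
 * entry, sorted by name, stored as an extra (ideally uncompressed) member of
 * the same zip.  Layout and names are assumptions for illustration only. */
#include <stdio.h>
#include <stdint.h>
#include <string.h>

#define IDX_NAME_MAX 120           /* fixed-size slot for the entry name   */

typedef struct {
    char     name[IDX_NAME_MAX];   /* zero-padded, sorted ascending        */
    uint64_t lfh_offset;           /* offset of the entry's local header   */
    uint64_t comp_size;            /* compressed size of the entry data    */
    uint64_t uncomp_size;          /* uncompressed size                    */
} idx_entry;                       /* 144 bytes per entry                  */

/* Binary search over the index region stored at idx_offset inside the zip
 * file itself; touches only ~log2(count) records, never the whole index.
 * Returns 0 on hit (record in *out), 1 on miss, -1 on I/O error. */
int idx_lookup(FILE *zf, uint64_t idx_offset, uint64_t count,
               const char *name, idx_entry *out)
{
    uint64_t lo = 0, hi = count;
    while (lo < hi) {
        uint64_t mid = lo + (hi - lo) / 2;
        if (fseeko(zf, (off_t)(idx_offset + mid * sizeof(idx_entry)), SEEK_SET) != 0)
            return -1;
        if (fread(out, sizeof *out, 1, zf) != 1)
            return -1;
        int cmp = strncmp(name, out->name, IDX_NAME_MAX);
        if (cmp == 0) return 0;    /* found */
        if (cmp < 0) hi = mid; else lo = mid + 1;
    }
    return 1;                      /* not found */
}
```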

The basic idea is that the file stays compatible with standard zip tools, but has a "secret" path for fast access.

In theory, 1M entries would require reading less than 200K of the "alternate" directory (a binary search over fixed-size records touches only ~log2(1,000,000) ≈ 20 of them), instead of the ~40MB that a regular "unzip" reads.

Any ideas/feedback on how to extend/leverage zziplib using ext-io to achieve the above speed?
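
Once the index lookup yields a local-header offset and compressed size, the entry data can be inflated directly with zlib, skipping the central directory entirely. Below is a hedged sketch of that read path, which could be used standalone or wired behind zziplib's plugin/ext-io hooks; the `extract_at` name is hypothetical, and it assumes the entry is stored with the DEFLATE method:

```c
/* Sketch: given the local-header offset and compressed size obtained from the
 * index above, inflate one entry without touching the central directory.
 * Error handling is trimmed; fseeko assumes a POSIX toolchain. */
#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <zlib.h>

static uint16_t rd16(const unsigned char *p) { return (uint16_t)(p[0] | (p[1] << 8)); }

int extract_at(FILE *zf, uint64_t lfh_offset, uint64_t comp_size,
               unsigned char *out, size_t out_cap)
{
    unsigned char lfh[30];                       /* fixed part of local header */
    if (fseeko(zf, (off_t)lfh_offset, SEEK_SET) != 0) return -1;
    if (fread(lfh, 1, sizeof lfh, zf) != sizeof lfh) return -1;

    /* Skip the variable-length file name and extra field of the local header
     * (lengths live at offsets 26 and 28 of the 30-byte fixed header). */
    uint16_t name_len  = rd16(lfh + 26);
    uint16_t extra_len = rd16(lfh + 28);
    if (fseeko(zf, name_len + extra_len, SEEK_CUR) != 0) return -1;

    unsigned char *comp = malloc(comp_size);
    if (!comp || fread(comp, 1, comp_size, zf) != comp_size) { free(comp); return -1; }

    /* Raw deflate stream: negative window bits tell zlib there is no zlib header. */
    z_stream zs;
    memset(&zs, 0, sizeof zs);
    if (inflateInit2(&zs, -MAX_WBITS) != Z_OK) { free(comp); return -1; }
    zs.next_in  = comp;  zs.avail_in  = (uInt)comp_size;
    zs.next_out = out;   zs.avail_out = (uInt)out_cap;
    int rc = inflate(&zs, Z_FINISH);
    inflateEnd(&zs);
    free(comp);
    return (rc == Z_STREAM_END) ? (int)zs.total_out : -1;
}
```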
