
Filtering on compressed data #21426

Closed
alexey-milovidov opened this issue Mar 3, 2021 · 4 comments
Labels
feature performance st-wontfix Known issue, no plans to fix it currently

Comments

@alexey-milovidov
Member

alexey-milovidov commented Mar 3, 2021

LZ4 is a compression method from the LZ77 family. It is byte-oriented: during decompression, bytes are only copied around; there is no bit twiddling or arithmetic transform (in contrast to ZSTD). The minimal match length is 4. More details here: https://habr.com/en/company/yandex/blog/457612/

We can make the following observations:

Compressed data contains every byte value present in the uncompressed data (and possibly some others). For example, if the source data contains the byte a, the compressed data will also contain the byte a.

If the source data contains a substring abcdefghij, the compressed data is guaranteed to contain one of the following sets of substrings:

  • a, bcde, fghi, j;
  • ab, cdef, ghij;
  • abc, defg, hij;
  • abcd, efgh, ij.

We can run a fast SIMD multi-substring search over the compressed data; if none of the variants holds, the whole compressed block can be filtered out.
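The check described above can be sketched as follows. This is an illustrative sketch, not ClickHouse code; `partitions` and `may_contain` are hypothetical names, and a plain `find()` stands in for the SIMD multi-substring search:

```cpp
#include <algorithm>
#include <string>
#include <string_view>
#include <vector>

// Enumerate the four phase-aligned ways a pattern can be split into pieces
// of at most 4 bytes (LZ4's minimal match length).
static std::vector<std::vector<std::string>> partitions(std::string_view pattern)
{
    std::vector<std::vector<std::string>> variants;
    for (std::size_t phase = 1; phase <= 4; ++phase)
    {
        std::vector<std::string> pieces;
        std::size_t first = std::min<std::size_t>(phase, pattern.size());
        pieces.emplace_back(pattern.substr(0, first));
        for (std::size_t pos = first; pos < pattern.size(); pos += 4)
            pieces.emplace_back(pattern.substr(pos, std::min<std::size_t>(4, pattern.size() - pos)));
        variants.push_back(std::move(pieces));
    }
    return variants;
}

// Check whether the compressed block contains every piece of at least one
// variant. If not, the block cannot contain the pattern and can be skipped.
static bool may_contain(std::string_view compressed, std::string_view pattern)
{
    for (const auto & variant : partitions(pattern))
    {
        bool all_found = true;
        for (const auto & piece : variant)
            if (compressed.find(piece) == std::string_view::npos)
            {
                all_found = false;
                break;
            }
        if (all_found)
            return true;   // the block may contain the pattern
    }
    return false;          // no variant matches: the block definitely lacks it
}
```

Note that this is a one-sided filter: a `true` result only means the block cannot be ruled out and must still be decompressed and checked normally.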

This can be applied to optimize WHERE conditions like x = const and s LIKE '%substring%'.

We can figure out which filters can be pushed down, push them down to IDataType::deserializeBinaryBulk, and then to the ReadBuffer if it is a CompressedReadBufferBase, calling some method for "filtered decompression". If the compressed block is filtered out, this method returns a flag instead of decompressing the data. IDataType::deserializeBinaryBulk can likewise return a flag, or return a column filled with zeros instead of real data.

Alternatives

A more "direct" approach would be to fuse the decompression loop with the filtering and analyze the data while decompressing it. But it would be less performant, because the decompressed data must still be reconstructed in memory, and throughput is limited by decompression speed (several GB/s).
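A toy sketch of the fused approach, assuming a simplified op model (`Op` mimics LZ4's literal/back-reference structure, not its actual wire format):

```cpp
#include <cstddef>
#include <string>
#include <string_view>
#include <vector>

// Toy model of LZ4's structure: each op is either a run of literal bytes
// or a back-reference (offset, length) into the already-decompressed output.
struct Op
{
    std::string literal;
    std::size_t offset = 0;
    std::size_t length = 0;
};

// Decompress op by op and search only the tail that overlaps the newly
// produced bytes; return true (and stop early) once the pattern is found.
// Assumes a non-empty pattern.
static bool decompress_and_search(const std::vector<Op> & ops, std::string_view pattern, std::string & out)
{
    for (const auto & op : ops)
    {
        std::size_t before = out.size();
        if (!op.literal.empty())
            out += op.literal;
        else
            for (std::size_t i = 0; i < op.length; ++i)   // overlapping copies must go byte by byte
                out += out[out.size() - op.offset];
        // A new occurrence must end in the new bytes, so it is enough to
        // start the search pattern.size() - 1 bytes before them.
        std::size_t from = before >= pattern.size() - 1 ? before - (pattern.size() - 1) : 0;
        if (out.find(pattern, from) != std::string::npos)
            return true;
    }
    return false;
}
```

This illustrates the cost argument: even with early exit and incremental search, every byte of the block must be materialized in `out` before the pattern can be ruled out, so the filter cannot run faster than decompression itself.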

Caveats

Filtered-out data will not be accounted for in the progress bar.

@danlark1
Contributor

The substring statement is not always correct: for data < 64 KB, LZ4 chooses 2-byte hashes, which leads to:

  • a, bc, de, fg, hi, j
  • ab, cd, ef, gh, ij
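The same enumeration with the piece length as a parameter (an illustrative sketch with a hypothetical name, not ClickHouse code; k = 2 reproduces the variants listed above, k = 4 the ones from the issue description):

```cpp
#include <algorithm>
#include <string>
#include <string_view>
#include <vector>

// Phase-aligned partitions of `pattern` into pieces of at most `k` bytes.
static std::vector<std::vector<std::string>> partitions_k(std::string_view pattern, std::size_t k)
{
    std::vector<std::vector<std::string>> variants;
    for (std::size_t phase = 1; phase <= k; ++phase)
    {
        std::vector<std::string> pieces;
        std::size_t first = std::min(phase, pattern.size());
        pieces.emplace_back(pattern.substr(0, first));
        for (std::size_t pos = first; pos < pattern.size(); pos += k)
            pieces.emplace_back(pattern.substr(pos, std::min(k, pattern.size() - pos)));
        variants.push_back(std::move(pieces));
    }
    return variants;
}
```

Shorter pieces make the filter much weaker: 2-byte substrings occur in almost any block, so far fewer blocks can be ruled out.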

I am in the process of prototyping; first results soon.

@danlark1
Contributor

danlark1 commented Sep 24, 2021

So, overall, the 4-gram observation is not correct.

Example: original file
html.txt
Compressed file
html.lz4_blocked.txt

Decompression description:
decompression_description.txt

The most important lines for searching for the string options[:

Literal 540 \n<TITLE>Micro Achat : Ordinateurs, PDA -  Toute l\'informatique avec 01I
Match at 614 of length 7 copied to 1644 formati
Match at 1644 of length 8 copied to 1857 formatid
Match at 1865 of length 7 copied to 4306 formati
Literal 4313 ons
Literal 65643 .op
Match at 4311 of length 5 copied to 65646 tions
Match at 65632 of length 19 copied to 65822 departement.options
Literal 65841 [

And we get the string options[ correctly decoded without any of its 4-grams appearing: ti is from a literal, ons is from a literal, .op is from a literal, [ is from a literal.

This happens because matches are extended after the hash-table hit during compression, so there is no reason for 4-grams of the pattern to appear in the compressed data.
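A toy illustration of this effect (hypothetical ops, much simpler than the real trace above): the pattern abcde is assembled entirely from back-references, while the literal bytes contain none of its 4-grams:

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Toy model (not real LZ4): ops are literal runs or (offset, length)
// back-references into the already-decompressed output.
struct Op
{
    std::string literal;
    std::size_t offset = 0;
    std::size_t length = 0;
};

static std::string toy_decompress(const std::vector<Op> & ops)
{
    std::string out;
    for (const auto & op : ops)
    {
        if (!op.literal.empty())
            out += op.literal;
        else
            for (std::size_t i = 0; i < op.length; ++i)
                out += out[out.size() - op.offset];
    }
    return out;
}
```

Decoding the literal abXcdeY, then a match of length 2 at offset 7 (copies ab) followed by a match of length 3 at offset 6 (copies cde), yields abXcdeYabcde: the output contains abcde even though the literal bytes abXcdeY contain neither abcd nor bcde, and the matches are encoded as numbers, not bytes.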

@danlark1
Contributor

danlark1 commented Sep 24, 2021

We should go back to the drawing board if we want pattern matching on compressed data. I am not aware of any attempts with modern compression algorithms, only some specialized ones.

For now, this probably should be closed as Won't Fix

@alexey-milovidov
Member Author

Thank you! It was not easy to understand why it does not work.
But yes - the idea failed miserably :)

@alexey-milovidov alexey-milovidov added the st-wontfix Known issue, no plans to fix it currently label Sep 24, 2021