Filtering on compressed data #21426
Comments
The substring statement is not always correct: for data < 64 KB, LZ4 chooses 2-byte hashes, and it will lead to …

I am in the process of prototyping; first results soon.
So, overall, the observation about 4-grams is not correct.

Example (contents omitted): original file; decompression description; most important lines for searching the string.

And we get the string. This happens when matches are extended after the hash-table hit during compression, so there is no guarantee that 4-grams of the pattern appear in the compressed data.
We should go back to the drawing board if we want to have some pattern matching in compressed data. I am not aware of any attempts with modern compression algorithms, only some specialized ones. For now, this probably should be closed as Won't Fix.
Thank you! It was not easy to understand why it does not work.
LZ4 is a compression method of the LZ77 family. It is byte-oriented: bytes are only copied around during decompression; there is no bit twiddling or arithmetic transform (in contrast to ZSTD). The minimal match size is 4 bytes. More details here: https://habr.com/en/company/yandex/blog/457612/
There are the following observations:
Compressed data contains all the byte values from the uncompressed data (and possibly some other byte values). For example, if the source data has the byte `a`, then the compressed data will also contain the byte `a`.

If the source data contains a substring `abcdefghij`, we can say that the compressed data will contain all fragments of at least one of the following alignments:

- `a`, `bcde`, `fghi`, `j`;
- `ab`, `cdef`, `ghij`;
- `abc`, `defg`, `hij`;
- `abcd`, `efgh`, `ij`.

We can apply a quick SIMD multiple-substring search algorithm to the compressed data, and if none of these variants holds, the whole compressed block can be filtered out.
This can be applied to optimize WHERE conditions like `x = const` and `s LIKE '%substring%'`.

We can figure out some possible filters to push down. Then push them down to `IDataType::deserializeBinaryBulk`, then to the `ReadBuffer` if it is a `CompressedReadBufferBase`, to call some method for "filtered decompression". If the compressed block is filtered out, the method will return a flag instead of decompressing the data. The method `IDataType::deserializeBinaryBulk` can also return a flag, or return a column filled with zeros instead of real data.

Alternatives
A more "direct" approach would be to fuse the decompression loop with the filtering and analyze the data while decompressing it. But it will be less performant, because the decompressed data must be reconstructed in memory, and it is limited by decompression speed (several GB/sec).
Caveats
Filtered-out data will not be accounted for in the progress bar.