Filtering on compressed data #21426
Comments
The substring statement is not always correct: for data < 64 KB, LZ4 chooses 2-byte hashes, and it will lead to …

I am in the process of prototyping; first results soon.
So, overall, the observation about 4-grams is not correct.

Example (contents omitted): original file; decompression description; most important lines for searching the string.

And we get the string. This happens when matches are extended after the hash-table hit during compression, so there is no guarantee that 4-grams of the pattern appear in the compressed data.
We should go back to the drawing board if we want to have some pattern matching in compressed data. I am not aware of any attempts with modern compression algorithms, only some specialized ones. For now, this probably should be closed as Won't Fix.
Thank you! It was not easy to understand why it does not work.
LZ4 is a compression method of the LZ77 family. It is byte-oriented: bytes are only copied around during decompression; there is no bit twiddling or arithmetic transform (in contrast to ZSTD). The minimal match size is 4 bytes. More details here: https://habr.com/en/company/yandex/blog/457612/
There are the following observations:
Compressed data contains all the byte values from the uncompressed data (and possibly some other byte values). For example, if the source data has the byte `a`, then the compressed data will also contain the byte `a`.

If the source data contains a substring `abcdefghij`, we can say that the compressed data will contain all fragments of at least one of the following alignments:

- `a`, `bcde`, `fghi`, `j`;
- `ab`, `cdef`, `ghij`;
- `abc`, `defg`, `hij`;
- `abcd`, `efgh`, `ij`.

We can apply a quick SIMD multiple-substring search algorithm to the compressed data, and if none of these variants holds, the whole compressed block can be filtered out.
This can be applied to optimize WHERE conditions like `x = const` and `s LIKE '%substring%'`.

We can figure out some possible filters to push down. Then push them down to `IDataType::deserializeBinaryBulk`, then to the `ReadBuffer` if it is a `CompressedReadBufferBase`, to call some method for "filtered decompression". If the compressed block is filtered out, the method will return a flag instead of decompressing the data. The method `IDataType::deserializeBinaryBulk` can also return a flag, or return a column filled with zeros instead of real data.

Alternatives
A more "direct" approach would be to fuse the decompression loop with the filtering and analyze the data while decompressing it. But it will be less performant, because the decompressed data must be reconstructed in memory, and it is limited by decompression speed (several GB/sec).
Caveats
Filtered-out data will not be accounted for in the progress bar.