Fix bloom filters for String (data skipping indices) #11638
Conversation
The bloom filter was broken for the first element if all of the following conditions are satisfied:
- it is created on INSERT (in this case bloom filter hashing uses offsets; in case of OPTIMIZE it does not, since it already has granules);
- the record is not the first in the block;
- the record is the first per index_granularity (not to be confused with the data skipping index GRANULARITY);
- the type of the indexed field is String (not FixedString).

In this case both the length and the *data* for that string were incorrect.
      UInt64 city_hash = CityHash_v1_0_2::CityHash64(
-         reinterpret_cast<const char *>(&data[current_offset]), offsets[index + pos] - current_offset - 1);
+         reinterpret_cast<const char *>(&data[current_offset]), length);
Speaking of this, there are tons of such places, but there is getDataAt, and the asm looks the same (at least after a quick look), so maybe the complexity is not worth it?
I understand that this may be because getDataAt is a virtual function.
Good point! But AFAICS getDataAt may be marked final, and I guess it is always used after casting to a concrete column type, so the compiler will optimize this (like here).
Although there may be cases where this does not happen, and in those cases writing it out explicitly is preferable.
Anyway, this is just a thought, not related to this bugfix.
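As a hypothetical illustration of the devirtualization point above (the class and function names here are invented, not the real ClickHouse interfaces): when the most-derived class is marked final and the call site sees that concrete type, the compiler knows exactly which override runs and can devirtualize and inline the call.

```cpp
#include <cstddef>
#include <string_view>

// Invented stand-in for an IColumn-like interface.
struct IColumnLike
{
    virtual std::string_view getDataAt(size_t n) const = 0;
    virtual ~IColumnLike() = default;
};

// `final` tells the compiler no further overrides of getDataAt can exist.
struct StringColumnLike final : IColumnLike
{
    std::string_view data;
    std::string_view getDataAt(size_t) const override { return data; }
};

size_t firstLength(const StringColumnLike & col)
{
    // Static type is the final class, so no vtable dispatch is required here;
    // the compiler can resolve and inline the call.
    return col.getDataAt(0).size();
}
```

This is why the generated asm can look identical to hand-written direct access in the common case, while still falling back to virtual dispatch through an IColumn pointer.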
The expected values were incorrect: for strings we have 1 and 10, and there will be at least two index granules, hence 12 rows.
Cc: @zhang2014
LGTM
…g_multi_granulas. This better reflects the covered case.
Build failure and perftest error are CI problems.
Fix bloom filters for String (data skipping indices) (cherry picked from commit e460d7c)
Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):
Fix bloom filters for String (data skipping indices)
Detailed description / Documentation draft:
The bloom filter was broken for the first element if all of the following conditions are satisfied:
- it is created on INSERT (in this case bloom filter hashing uses offsets; in case of OPTIMIZE it does not, since it already has granules);
- the record is not the first in the block;
- the record is the first per index_granularity (not to be confused with the data skipping index GRANULARITY);
- the type of the indexed field is String (not FixedString).

Because in this case there was an incorrect length and data for that string.
Fixes: #11634
Cc: @filimonov