Bloom filter indices #4499

Merged: 59 commits, Mar 20, 2019

@nikvas0 nikvas0 commented Feb 25, 2019

I hereby agree to the terms of the CLA available at: https://yandex.ru/legal/cla/?lang=en

Category (leave one):

  • New Feature

Short description (up to few sentences):
A new type of data skipping index based on Bloom filters (usable with the equals, in, and like functions).

  • ngrambf(n, bloom_filter_size_in_bytes, hash_functions_count, random_seed)
  • tokenbf(bloom_filter_size_in_bytes, hash_functions_count, random_seed) (stores whole tokens, so the query WHERE s LIKE '%cats%' falls back to a full scan, while the query WHERE s LIKE '%.cats.%' uses the 'cats' token)
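
To make the difference between the two index types concrete, here is a minimal sketch of how n-grams and tokens could be extracted, and why a token index cannot help with '%cats%'. It assumes tokens are maximal alphanumeric runs and n-grams are byte-based; the function names are hypothetical and escape sequences in LIKE patterns are ignored. It is not the code from this PR.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// All overlapping byte n-grams of s (what an ngrambf-style index would hash).
// Assumption: bytes, not Unicode code points.
std::vector<std::string> extractNgrams(const std::string & s, std::size_t n)
{
    assert(n > 0);
    std::vector<std::string> result;
    for (std::size_t i = 0; i + n <= s.size(); ++i)
        result.push_back(s.substr(i, n));
    return result;
}

// Maximal alphanumeric runs of s (what a tokenbf-style index would hash).
std::vector<std::string> extractTokens(const std::string & s)
{
    std::vector<std::string> result;
    std::string current;
    for (unsigned char c : s)
    {
        if (std::isalnum(c))
        {
            current.push_back(static_cast<char>(c));
        }
        else if (!current.empty())
        {
            result.push_back(current);
            current.clear();
        }
    }
    if (!current.empty())
        result.push_back(current);
    return result;
}

// Tokens of a LIKE pattern that must appear as whole tokens in every match.
// A run that touches a wildcard ('%' or '_') may be only part of a longer
// token in the data, so it is dropped.
std::vector<std::string> tokensFromLikePattern(const std::string & pattern)
{
    std::vector<std::string> result;
    std::string current;
    bool complete = true; // false while the current run started right after a wildcard
    for (unsigned char c : pattern)
    {
        if (std::isalnum(c))
        {
            current.push_back(static_cast<char>(c));
            continue;
        }
        const bool wildcard = (c == '%' || c == '_');
        if (!current.empty() && complete && !wildcard)
            result.push_back(current);
        current.clear();
        complete = !wildcard;
    }
    if (!current.empty() && complete)
        result.push_back(current);
    return result;
}
```

Under these assumptions, `tokensFromLikePattern("%cats%")` yields no tokens (hence the full scan), `tokensFromLikePattern("%.cats.%")` yields the single token `cats`, and `extractNgrams("cats", 3)` yields `cat` and `ats`, which is why an n-gram index can still serve the first query.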

@nikvas0 nikvas0 changed the title Bloom filter indices [WIP] [WIP] Bloom filter indices Feb 25, 2019

nikvas0 commented Mar 6, 2019

A comparison of ngrambf(3, 512, 2, 0), tokenbf(512, 2, 0), and no index:
https://gist.github.com/nikvas0/a6d66e833c0adaa42762fdf77a026e2b
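
For readers unfamiliar with the parameters, the sketch below shows a toy Bloom filter with the same three knobs (size in bytes, number of hash functions, seed) and the standard false-positive estimate (1 - e^(-kn/m))^k for m bits, k hash functions, and n distinct items. The class name, the double-hashing scheme built on `std::hash`, and the item count used in the usage note are assumptions for illustration, not what this PR implements.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <string>
#include <vector>

// Toy Bloom filter; the hashing scheme is an assumption, not the PR's.
class SimpleBloomFilter
{
public:
    SimpleBloomFilter(std::size_t size_in_bytes, std::size_t hash_count, std::uint64_t seed_)
        : bits(size_in_bytes * 8, false), hashes(hash_count), seed(seed_)
    {
        assert(size_in_bytes > 0 && hash_count > 0);
    }

    void add(const std::string & item)
    {
        for (std::size_t i = 0; i < hashes; ++i)
            bits[position(item, i)] = true;
    }

    // false: the item was definitely never added; true: it may have been added.
    bool mayContain(const std::string & item) const
    {
        for (std::size_t i = 0; i < hashes; ++i)
            if (!bits[position(item, i)])
                return false;
        return true;
    }

private:
    // Double hashing: the i-th position is (h1 + i * h2) mod number_of_bits.
    std::size_t position(const std::string & item, std::size_t i) const
    {
        const std::uint64_t h1 = std::hash<std::string>{}(item) ^ seed;
        const std::uint64_t h2 = std::hash<std::string>{}(item + '#') | 1;
        return static_cast<std::size_t>((h1 + i * h2) % bits.size());
    }

    std::vector<bool> bits;
    std::size_t hashes;
    std::uint64_t seed;
};

// Standard estimate (1 - e^(-k*n/m))^k of the false-positive probability.
double estimatedFalsePositiveRate(std::size_t size_in_bytes, std::size_t hash_count, std::size_t distinct_items)
{
    assert(size_in_bytes > 0);
    const double m = static_cast<double>(size_in_bytes) * 8.0;
    const double k = static_cast<double>(hash_count);
    const double n = static_cast<double>(distinct_items);
    return std::pow(1.0 - std::exp(-k * n / m), k);
}
```

As a rough illustration, a 512-byte filter with 2 hash functions holding a hypothetical 1000 distinct n-grams gives an estimated false-positive rate of about 15%; the real rate depends on how many distinct n-grams or tokens each indexed block actually contains.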


/// Builds reverse polish notation
template <typename RPNElement>
class RPNBuilder

@nikvas0 nikvas0 Mar 8, 2019

RPNBuilder could also be used in KeyCondition (it is simply a copy-paste of some of KeyCondition's functions), but that change fails the performance tests: https://clickhouse-test-reports.s3.yandex.net/4499/fcb82ba901651b73229a0be1bbd71fba308a4d57/performance_test.html
(For BloomFilterIndex, I have not noticed any performance difference between the copy-pasted functions and RPNBuilder.)
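
For context, a condition in reverse Polish notation is typically evaluated with a stack, tracking for each subexpression whether it can be true and whether it can be false on the block being checked. The sketch below illustrates that idea only; the type and function names are hypothetical and do not come from RPNBuilder or KeyCondition.

```cpp
#include <cassert>
#include <utility>
#include <vector>

// What is known about a subexpression on one block of rows.
struct Mask
{
    bool can_be_true;
    bool can_be_false;
};

enum class Op { Atom, Not, And, Or };

struct Element
{
    Op op;
    Mask atom; // only meaningful when op == Op::Atom
};

// Evaluates the RPN with a stack; returns false only when the whole
// condition cannot be true, i.e. the block can be skipped.
bool mayBeTrue(const std::vector<Element> & rpn)
{
    std::vector<Mask> stack;
    for (const Element & e : rpn)
    {
        if (e.op == Op::Atom)
        {
            stack.push_back(e.atom);
        }
        else if (e.op == Op::Not)
        {
            assert(!stack.empty());
            Mask & top = stack.back();
            std::swap(top.can_be_true, top.can_be_false);
        }
        else
        {
            assert(stack.size() >= 2);
            const Mask rhs = stack.back();
            stack.pop_back();
            Mask & lhs = stack.back();
            if (e.op == Op::And)
                lhs = {lhs.can_be_true && rhs.can_be_true, lhs.can_be_false || rhs.can_be_false};
            else
                lhs = {lhs.can_be_true || rhs.can_be_true, lhs.can_be_false && rhs.can_be_false};
        }
    }
    assert(stack.size() == 1);
    return stack.back().can_be_true;
}
```

In this model, an atom backed by a Bloom filter would be `{false, true}` when the filter reports the value as absent and `{true, true}` otherwise, since a Bloom filter can prove absence but never presence.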

nikvas0 commented Mar 10, 2019

Comparison between no index and the ngram index (50 runs each):
https://github.com/nikvas0/CHDataSkippingTest/blob/master/indices_test.ipynb

[Images: index1, index2, index3]

@nikvas0 nikvas0 changed the title Bloom filter indices [wip] Bloom filter indices Mar 10, 2019
@nikvas0 nikvas0 changed the title [wip] Bloom filter indices Bloom filter indices Mar 12, 2019

nikvas0 commented Mar 14, 2019

Comparison for insert

# insert into ... select * from datasets.hits_v1

no index: 0 rows in set. Elapsed: 20.463 sec. Processed 8.87 million rows, 8.46 GB (433.65 thousand rows/s., 413.45 MB/s.)

3 x ngrambf(3, 512, 2, 0) (URLDomain, SearchPhrase, Title): 0 rows in set. Elapsed: 42.752 sec. Processed 8.87 million rows, 8.46 GB (207.57 thousand rows/s., 197.90 MB/s.)
3 x ngrambf(4, 512, 1, 0) (URLDomain, SearchPhrase, Title): 0 rows in set. Elapsed: 39.391 sec. Processed 8.87 million rows, 8.46 GB (225.28 thousand rows/s., 214.78 MB/s.)

1 x ngrambf(4, 512, 1, 0) (Title): 0 rows in set. Elapsed: 35.820 sec. Processed 8.87 million rows, 8.46 GB (247.74 thousand rows/s., 236.19 MB/s.)
1 x ngrambf(4, 512, 1, 0) (URLDomain): 0 rows in set. Elapsed: 25.047 sec. Processed 8.87 million rows, 8.46 GB (354.29 thousand rows/s., 337.79 MB/s.)

1 x tokenbf(512, 1, 0) (Title): 0 rows in set. Elapsed: 27.339 sec. Processed 8.87 million rows, 8.46 GB (324.59 thousand rows/s., 309.47 MB/s.)
1 x tokenbf(512, 1, 0) (URLDomain): 0 rows in set. Elapsed: 22.537 sec. Processed 8.87 million rows, 8.46 GB (393.74 thousand rows/s., 375.40 MB/s.)

https://gist.github.com/nikvas0/c60ecb9c37d4a61b4cd924090ff6a806
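
To read the insert timings as relative overhead, one can divide each elapsed time by the no-index baseline. The helper below is a small sketch of that arithmetic; the struct and function names are hypothetical, and the numbers are copied from the measurements in this comment.

```cpp
#include <cassert>
#include <string>
#include <vector>

struct InsertRun
{
    std::string setup;
    double elapsed_sec;
};

// Elapsed time of the insert without any index, from the measurements above.
const double kBaselineSec = 20.463;

// Elapsed times of the indexed inserts, from the measurements above.
std::vector<InsertRun> measuredRuns()
{
    return {
        {"3 x ngrambf(3, 512, 2, 0)", 42.752},
        {"3 x ngrambf(4, 512, 1, 0)", 39.391},
        {"1 x ngrambf(4, 512, 1, 0) (Title)", 35.820},
        {"1 x ngrambf(4, 512, 1, 0) (URLDomain)", 25.047},
        {"1 x tokenbf(512, 1, 0) (Title)", 27.339},
        {"1 x tokenbf(512, 1, 0) (URLDomain)", 22.537},
    };
}

// Ratio of an indexed insert's elapsed time to the baseline.
double slowdown(double elapsed_sec, double baseline_sec)
{
    assert(baseline_sec > 0.0);
    return elapsed_sec / baseline_sec;
}
```

With these figures, the three ngrambf(3, 512, 2, 0) indices come out at roughly 2.09x the baseline, while a single tokenbf(512, 1, 0) on URLDomain is roughly 1.10x.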

@alexey-milovidov alexey-milovidov merged commit 2b33e9b into ClickHouse:master Mar 20, 2019
@abyss7 abyss7 added the pr-feature Pull request with new product feature label Apr 10, 2019
@filimonov filimonov added the comp-skipidx Data skipping indices label May 11, 2019