
Secondary index cache #54101

Open · wants to merge 5 commits into master

Conversation

@al13n321 (Member) commented Aug 31, 2023:

Changelog category (leave one):

  • Performance Improvement

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):

Added a cache for deserialized secondary index granules. This should make repeated queries that use a secondary index faster, if the index for the whole table fits in the cache. The size of the new cache is controlled by the server setting secondary_index_cache_size.

Pretty straightforward. A lot of copypasta from UncompressedCache.

Should help a lot with usearch ANN indexes, when they all fit in memory. Making it fast when the index doesn't fit in memory would take more work, maybe even changing usearch's serialization format to have more locality.

@robot-clickhouse-ci-1 added the pr-performance (Pull request with some performance improvements) label Aug 31, 2023
@robot-clickhouse-ci-1 (Contributor) commented Aug 31, 2023:

This is an automated comment for commit 1dbe1d0 with a description of existing statuses. It's updated for the latest CI run.

❌ Click here to open a full report in a separate page

Successful checks

| Check name | Description | Status |
|---|---|---|
| ClickBench | Runs [ClickBench](https://github.com/ClickHouse/ClickBench/) with an instant-attach table | ✅ success |
| ClickHouse build check | Builds ClickHouse in various configurations for use in further steps. You have to fix the builds that fail. Build logs often have enough information to fix the error, but you might have to reproduce the failure locally. The cmake options can be found in the build log by grepping for cmake. Use these options and follow the general build process | ✅ success |
| Compatibility check | Checks that the clickhouse binary runs on distributions with old libc versions. If it fails, ask a maintainer for help | ✅ success |
| Docker keeper image | The check to build and optionally push the mentioned image to Docker Hub | ✅ success |
| Docker server image | The check to build and optionally push the mentioned image to Docker Hub | ✅ success |
| Docs check | Builds and tests the documentation | ✅ success |
| Fast test | Normally this is the first check that is run for a PR. It builds ClickHouse and runs most of the stateless functional tests, omitting some. If it fails, further checks are not started until it is fixed. Look at the report to see which tests fail, then reproduce the failure locally as described here | ✅ success |
| Flaky tests | Checks whether newly added or modified tests are flaky by running them repeatedly, in parallel, with more randomization. Functional tests are run 100 times with address sanitizer and additional randomization of thread scheduling. Integration tests are run up to 10 times. If a new test fails at least once, or takes too long, this check will be red. We don't allow flaky tests; read the doc | ✅ success |
| Install packages | Checks that the built packages are installable in a clean environment | ✅ success |
| Performance Comparison | Measures changes in query performance. The performance test report is described in detail here. In square brackets are the optional part/total tests | ✅ success |
| SQLTest | There's no description for the check yet; please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS | ✅ success |
| SQLancer | Fuzzing tests that detect logical bugs with the SQLancer tool | ✅ success |
| Sqllogic | Runs clickhouse on the sqllogic test set against sqlite and checks that all statements pass | ✅ success |
| Style check | Runs a set of checks to keep the code style clean. If some checks fail, see the related log from the report | ✅ success |
| Unit tests | Runs the unit tests for different release types | ✅ success |
| Upgrade check | Runs stress tests on the server version from the last release and then tries to upgrade it to the version from the PR. It checks whether the new server can start up without errors, crashes, or sanitizer asserts | ✅ success |
Failed and pending checks

| Check name | Description | Status |
|---|---|---|
| A Sync | There's no description for the check yet; please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS | ❌ failure |
| AST fuzzer | Runs randomly generated queries to catch program errors. The build type is optionally given in parentheses. If it fails, ask a maintainer for help | ❌ failure |
| CI running | A meta-check that indicates the running CI. Normally it's in a success or pending state. A failed status indicates a problem with the PR | ⏳ pending |
| Integration tests | The integration tests report. In parentheses the package type is given, and in square brackets are the optional part/total tests | ❌ failure |
| Mergeable Check | Checks whether all other necessary checks are successful | ❌ failure |
| Stateful tests | Runs stateful functional tests for ClickHouse binaries built in various configurations -- release, debug, with sanitizers, etc. | ❌ failure |
| Stateless tests | Runs stateless functional tests for ClickHouse binaries built in various configurations -- release, debug, with sanitizers, etc. | ❌ failure |
| Stress test | Runs stateless functional tests concurrently from several clients to detect concurrency-related errors | ❌ failure |

@UnamedRus (Contributor) commented:

ClickHouse already has a skip index cache; how will this be different?

#27961

@al13n321 (Member, Author) commented:

Hm, right, maybe we should remove the uncompressed cache for the index if this new cache works out.

The idea is to avoid deserializing the index for each query. If the index allows doing the lookup faster than a full scan, then deserializing the whole index in O(n) time on each query partially defeats the purpose.

(Alternatively, we could say that the index should be able to work without deserialization, directly from a SeekableReadBuffer. But that would take considerable extra work to support, in particular for usearch, at least currently. Maybe we'll do it anyway, though, to avoid caching the whole index when we only access small parts of it.)
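To make that concrete, here's a minimal self-contained sketch of the getOrSet pattern this relies on (stand-in types and names, not the PR's actual interfaces; a real cache also bounds memory, uses an eviction policy, and avoids holding the lock while loading):

#include <functional>
#include <memory>
#include <mutex>
#include <string>
#include <unordered_map>

struct IndexGranule {};                           // stand-in for a deserialized granule
using GranulePtr = std::shared_ptr<IndexGranule>;

class GranuleCache
{
public:
    // key identifies (data part, index, mark); load deserializes on a miss.
    GranulePtr getOrSet(const std::string & key, const std::function<GranulePtr()> & load)
    {
        std::lock_guard lock(mutex);
        auto it = cache.find(key);
        if (it != cache.end())
            return it->second;                    // hit: no deserialization at all
        GranulePtr granule = load();              // miss: deserialize once
        cache.emplace(key, granule);
        return granule;
    }

private:
    std::mutex mutex;
    std::unordered_map<std::string, GranulePtr> cache;
};

On a repeated query, every granule lookup then becomes a hash-map hit instead of a deserializeBinary() call.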

@al13n321 (Member, Author) commented:

This PR now undoes #55683. It seems impossible to have both: if granules are inserted into the cache, they can't be reused.
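To see roughly why the two optimizations conflict, here is a sketch with stand-in types (not the PR's actual interfaces): #55683 reused one mutable granule object, overwriting its buffers for each mark, but a cached granule may still be referenced by the cache and by concurrent queries, so it has to stay immutable, and each mark needs a fresh allocation.

#include <memory>
#include <vector>

struct Granule { std::vector<int> data; };

// #55683-style reuse: one granule whose buffers are overwritten for each mark.
void readWithReuse(Granule & g, int mark) { g.data.assign(1, mark); }

// Cache-friendly style: a fresh granule per mark, shared immutably between
// the cache and any number of concurrent queries.
std::shared_ptr<const Granule> readForCache(int mark)
{
    auto g = std::make_shared<Granule>();
    g->data.assign(1, mark);
    return g;   // safe to cache; nobody mutates it afterwards
}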

@jkartseva (Contributor) left a comment:

Overall, LGTM, please see the inline comments.

Do I correctly understand that there are currently no tests (functional and integration) for the change?

Resolved inline comments (outdated):
src/Interpreters/BloomFilter.h
src/Interpreters/GinFilter.h
src/Storages/MergeTree/MergeTreeIndexHypothesis.h
src/Storages/MergeTree/MergeTreeIndexMinMax.h
granule = index->createIndexGranule();
auto load_func = [&] {
initStreamIfNeeded();
if (stream_mark != mark)
@jkartseva (Contributor) commented Dec 6, 2023:

What happens if mark is less than stream_mark? That wasn't possible before the change, but now the interface of MergeTreeIndexReader allows it. Should there be an assert?

@al13n321 (Member, Author) replied:

Not sure what you mean; the previous interface MergeTreeIndexReader::seek() allowed moving backwards too, and it should work fine AFAICT. (I don't know whether callers currently do that.)

src/Storages/MergeTree/SecondaryIndexCache.h (resolved)
@@ -71,7 +71,11 @@ namespace DB
     M(String, index_mark_cache_policy, DEFAULT_INDEX_MARK_CACHE_POLICY, "Secondary index mark cache policy name.", 0) \
     M(UInt64, index_mark_cache_size, DEFAULT_INDEX_MARK_CACHE_MAX_SIZE, "Size of cache for secondary index marks. Zero means disabled.", 0) \
     M(Double, index_mark_cache_size_ratio, DEFAULT_INDEX_MARK_CACHE_SIZE_RATIO, "The size of the protected queue in the secondary index mark cache relative to the cache's total size.", 0) \
-    M(UInt64, mmap_cache_size, DEFAULT_MMAP_CACHE_MAX_SIZE, "A cache for mmapped files.", 0) \
+    M(UInt64, mmap_cache_size, DEFAULT_MMAP_CACHE_MAX_SIZE, "Maximum number of files to keep in the mmapped file cache.", 0) \
+    M(String, secondary_index_cache_policy, DEFAULT_SECONDARY_INDEX_CACHE_POLICY, "Secondary index cache policy name.", 0) \
A contributor commented:

Can there be a use case for the primary and secondary index caches using different eviction policies? Just curious :)

@al13n321 (Member, Author) replied:

¯\_(ツ)_/¯ I just added the same few settings as the caches above. IIRC the primary index doesn't have a cache; it's always kept in memory for as long as the data part is loaded. For the other caches above, I guess the only way this would come up is if someone notices (e.g. sees it in a profiler) that one of the caches isn't working well and tries different settings for just that cache.

@jrdi (Contributor) commented Dec 7, 2023:

We already have a mark cache and an uncompressed cache for secondary indexes. What does this cache do differently?

M(String, index_uncompressed_cache_policy, DEFAULT_INDEX_UNCOMPRESSED_CACHE_POLICY, "Secondary index uncompressed cache policy name.", 0) \
M(UInt64, index_uncompressed_cache_size, DEFAULT_INDEX_UNCOMPRESSED_CACHE_MAX_SIZE, "Size of cache for uncompressed blocks of secondary indices. Zero means disabled.", 0) \
M(Double, index_uncompressed_cache_size_ratio, DEFAULT_INDEX_UNCOMPRESSED_CACHE_SIZE_RATIO, "The size of the protected queue in the secondary index uncompressed cache relative to the cache's total size.", 0) \
M(String, index_mark_cache_policy, DEFAULT_INDEX_MARK_CACHE_POLICY, "Secondary index mark cache policy name.", 0) \
M(UInt64, index_mark_cache_size, DEFAULT_INDEX_MARK_CACHE_MAX_SIZE, "Size of cache for secondary index marks. Zero means disabled.", 0) \
M(Double, index_mark_cache_size_ratio, DEFAULT_INDEX_MARK_CACHE_SIZE_RATIO, "The size of the protected queue in the secondary index mark cache relative to the cache's total size.", 0) \

@alexey-milovidov (Member) commented:

@jrdi, the index is still deserialized, and that can take a lot of time.
The indices can be large (larger than the data) in the case of vector search (HNSW), and it makes sense to just keep the deserialized data structure in memory.

Although this scenario is limited, vector search is often used on minuscule datasets (under one billion records).

@azat (Collaborator) commented Jan 26, 2024:

FWIW, here are some numbers; in short, it gives a significant boost for ANN indexes. I tested on a table with 33'114'778 rows with embeddings (arrays of 768 elements):

  • without an index: 216 sec (4 sec with data in the page cache)
  • annoy before: 100 sec (it is just too suboptimal, even in memory)
  • usearch before: 33 sec
  • usearch with this patch: 0.064 sec

@azat (Collaborator) commented Jan 26, 2024:

But the funny thing is that it may actually slow down regular indexes.

Input
CREATE TABLE data
(
    `part` Int32,
    `key` Int32,
    `v1` Int32,
    `s` SimpleAggregateFunction(sum, Int64),
    INDEX v1_index v1 TYPE minmax GRANULARITY 1
)
ENGINE = AggregatingMergeTree
PARTITION BY part
ORDER BY key
SETTINGS index_granularity = 1;

insert into data (part, key, v1) select number%100, number, number from numbers(1e6);

Before:

EXPLAIN PIPELINE
SELECT *
FROM data
FINAL
WHERE v1 > 1
SETTINGS max_threads = 2, max_final_threads = 2, force_data_skipping_indices = 'v1_index', use_skip_indexes_if_final = 0;
-- 16 rows in set. Elapsed: 0.396 sec. 

EXPLAIN PIPELINE
SELECT *
FROM data
FINAL
WHERE v1 > 1
SETTINGS max_threads = 2, max_final_threads = 2, force_data_skipping_indices = 'v1_index', use_skip_indexes_if_final = 1;
-- 16 rows in set. Elapsed: 0.756 sec. 

After:

EXPLAIN PIPELINE
SELECT *
FROM data
FINAL
WHERE v1 > 1
SETTINGS max_threads = 2, max_final_threads = 2, force_data_skipping_indices = 'v1_index', use_skip_indexes_if_final = 0;
-- 16 rows in set. Elapsed: 0.307 sec. 

EXPLAIN PIPELINE
SELECT *
FROM data
FINAL
WHERE v1 > 1
SETTINGS max_threads = 2, max_final_threads = 2, force_data_skipping_indices = 'v1_index', use_skip_indexes_if_final = 1;
-- 16 rows in set. Elapsed: 1.496 sec. 

So now it is 2x slower; perf points to SLRUCachePolicy.

Perf use_skip_indexes_if_final=0

Samples: 5K of event 'cycles:P', 4000 Hz, Event count (approx.): 5139605993 lost: 0/0 drop: 0/0
Overhead  Shared Object     Symbol
  15.11%  clickhouse        [.] DB::MergeTreeIndexGranuleMinMax::deserializeBinary(DB::ReadBuffer&, unsigned char)
   9.47%  clickhouse        [.] DB::buildPipesForReadingByPKRanges(DB::KeyDescription const&, std::__1::shared_ptr<DB::ExpressionActions>, DB::RangesInDataParts, unsigned long, std::__1::shared_ptr<DB::Context const>, std::__1::function<DB::Pipe (DB::RangesInDataParts)>&&)
   4.29%  clickhouse        [.] DB::Field::~Field()

Perf use_skip_indexes_if_final=1 before

Samples: 2K of event 'cycles:P', 4000 Hz, Event count (approx.): 2420778913 lost: 0/0 drop: 0/523
Overhead  Shared Object     Symbol
  26.88%  clickhouse        [.] DB::buildPipesForReadingByPKRanges(DB::KeyDescription const&, std::__1::shared_ptr<DB::ExpressionActions>, DB::RangesInDataParts, unsigned long, std::__1::shared_ptr<DB::Context const>, std::__1::function<DB::Pipe (DB::RangesInDataParts)>&&)
   7.50%  clickhouse        [.] auto DB::Field::dispatch<DB::applyVisitor<DB::FieldVisitorAccurateLess, DB::Field const&, DB::Field const&>(DB::FieldVisitorAccurateLess&&, DB::Field const&, DB::Field const&)::{lambda(auto:1&)#1}::operator()<long const>(DB::FieldVisitorAccurateLess&) const::{lam
   7.04%  clickhouse        [.] auto DB::Field::dispatch<DB::applyVisitor<DB::FieldVisitorAccurateLess, DB::Field const&, DB::Field const&>(DB::FieldVisitorAccurateLess&&, DB::Field const&, DB::Field const&)::{lambda(auto:1&)#1}, DB::Field const&>(DB::FieldVisitorAccurateLess&&, DB::Field const
   6.38%  clickhouse        [.] void std::__1::__sift_up[abi:v15000]<std::__1::_ClassicAlgPolicy, std::__1::less<(anonymous namespace)::split(DB::RangesInDataParts, unsigned long)::PartsRangesIterator>&, std::__1::__wrap_iter<(anonymous namespace)::split(DB::RangesInDataParts, unsigned long)::P
   4.56%  clickhouse        [.] malloc

Perf use_skip_indexes_if_final=1 after

Samples: 30K of event 'cycles', 4000 Hz, Event count (approx.): 12978901988 lost: 0/0 drop: 27/36                                                                                             
Overhead  Shared Object     Symbol                                                                                                                                                            
  17.11%  clickhouse        [.] DB::SLRUCachePolicy<wide::integer<128ul, unsigned int>, DB::SecondaryIndexCacheCell, UInt128TrivialHash, DB::SecondaryIndexCacheWeightFunction>::get          
  11.48%  clickhouse        [.] DB::MergeTreeIndexReader::read                                                                                                                                
   5.86%  clickhouse        [.] DB::buildPipesForReadingByPKRanges                                                                                                                            

@KrisAnTis-Group commented:
We really need this 515x speedup (33 sec → 0.064 sec) in vector search 🥹

@mcproger commented:
Hey @azat, awesome numbers!

Could you share a few more details on how you ran this benchmark? I'm trying to set up a table with an ANN index (I already tried both annoy and usearch) but I'm having problems: the index doesn't speed up my SELECT queries. Even worse, queries without the index perform better (though still not very well, which is why I'm looking for ways to optimize). Another strange thing is that the index size is almost equal to the embedding column size (even with compression).

My table has a relatively simple structure, and I already tried different parameters for the index, including granularity and compression. Query execution time for this table with ~3 million rows can be around 3-5 seconds, which is too slow for such a data volume as far as I know. The embedding dimension I use is 1536 (I also tried reducing it but didn't notice much of a difference).

CREATE TABLE default.entity_embeddings_table
(
    `date` Date,
    `__account_id` String,
    `__row_hash` UInt64 CODEC(NONE),
    `__row_id` UInt16,
    `__key_hash` UInt64 CODEC(NONE),
    `__sign` Int8,
    `__insert_date` DateTime,
    `__workspace_id` Int32,
    `__sql_view_name` LowCardinality(String),
    `collection_name` LowCardinality(String),
    `column_name` String,
    `table_name` LowCardinality(String),
    `entity_metadata` String,
    `entity_embedding` Array(Float32) CODEC(ZSTD(3)),
    `entity_embeddable_string` String,
    `destination_uuid` String,
    `schema_updated_at` DateTime64(3),
    `entity_id` String,
    INDEX ix_entity_embedding_usearch entity_embedding TYPE usearch('cosineDistance')
)
ENGINE = CollapsingMergeTree(__sign)
PARTITION BY table_name
ORDER BY (collection_name, destination_uuid, table_name, column_name, date, __account_id, __row_hash, __row_id)
SETTINGS index_granularity = 8192

Could you please share, if possible, the CH config and table structure for your benchmark so I can compare them on my side? I'm probably doing something wrong with the CH settings or data structure, but I'm out of ideas at the moment.

Thank you!

@azat (Collaborator) commented Feb 11, 2024:

@mcproger are you using binaries from this PR?

@mcproger replied:
@azat no, not yet. I was just looking for ways to optimize my setup on v23.8.7.24 and found this discussion.

I wonder whether I'm doing something wrong with my current setup, or whether this cache fix will simply fix query performance and there's no point in trying to optimize further at the moment.

@azat (Collaborator) commented Feb 11, 2024:

this cache fix will simply fix query performance

Yes, you need this PR to get this improvement.
But it's better to wait until it's merged (though if you have a dev environment, you can try it at your own risk).

@al13n321 (Member, Author) replied:

But the funny thing is that it may actually slow down regular indexes.

Hm, yeah, it's plausible that 1M cache lookups add 800 ms; each lookup does ~2 unordered_map lookups under a global mutex. Not sure what to do about it, other than to say it's worth it. I did try replacing unordered_map with absl::flat_hash_map, but it had little effect.
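For intuition about where the time goes, here is a simplified self-contained sketch of an SLRU get() (stand-in types, not ClickHouse's actual SLRUCachePolicy): every lookup takes the global mutex, probes up to two hash maps (protected, then probationary), and maintains the LRU lists.

#include <list>
#include <mutex>
#include <string>
#include <unordered_map>

// Simplified SLRU lookup: a probationary and a protected segment, each an LRU.
class SlruSketch
{
    using LruList = std::list<std::string>;
    using Map = std::unordered_map<std::string, LruList::iterator>;

public:
    bool get(const std::string & key)
    {
        std::lock_guard lock(mutex);                  // global mutex, every lookup
        if (auto it = protected_map.find(key); it != protected_map.end())
        {
            // refresh position in the protected LRU
            protected_lru.splice(protected_lru.begin(), protected_lru, it->second);
            return true;
        }
        if (auto it = probationary_map.find(key); it != probationary_map.end())
        {
            // promote to the protected segment (evictions elided)
            probationary_lru.erase(it->second);
            probationary_map.erase(it);
            protected_lru.push_front(key);
            protected_map[key] = protected_lru.begin();
            return true;
        }
        return false;                                 // miss: caller deserializes
    }

private:
    std::mutex mutex;
    LruList probationary_lru, protected_lru;
    Map probationary_map, protected_map;
};

Several hundred nanoseconds of locking, hashing, and pointer-chasing per lookup is roughly consistent with the ~800 ms over 1M lookups seen above, once contention is included.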

@al13n321 (Member, Author) replied:

Do I correctly understand that there are currently no tests (functional and integration) for the change?

Yes. Added one now.

Should be ready for review again now.

@amosbird (Collaborator) commented:

This PR now undoes #55683. It seems impossible to have both: if granules are inserted into the cache, they can't be reused.

Hmm, how do we deal with this perf regression then? https://s3.amazonaws.com/clickhouse-test-reports/54101/1dbe1d0f24ffce7329f333f48118c74be6b2fcc8/performance_comparison_%5B1_4%5D/report.html

Labels: pr-performance (Pull request with some performance improvements)

10 participants