
Secondary index cache #54101

Open · wants to merge 5 commits into master

Conversation

@al13n321 (Member) commented Aug 31, 2023:

Changelog category (leave one):

  • Performance Improvement

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):

Added a cache for deserialized secondary index granules. This should make repeated queries that use a secondary index faster, if the index for the whole table fits in the cache. The size of the new cache is controlled by the server setting secondary_index_cache_size.

Pretty straightforward. A lot of copypasta from UncompressedCache.

Should help a lot with usearch ANN indexes, when they all fit in memory. Making it fast when the index doesn't fit in memory would take more work, maybe even changing usearch's serialization format to have more locality.

@robot-clickhouse-ci-1 added the pr-performance (Pull request with some performance improvements) label Aug 31, 2023
@robot-clickhouse-ci-1 (Contributor) commented Aug 31, 2023:

This is an automated comment for commit 1dbe1d0 with a description of existing statuses. It's updated for the latest CI run.

❌ Click here to open a full report in a separate page

Successful checks

| Check name | Description | Status |
|---|---|---|
| ClickBench | Runs [ClickBench](https://github.com/ClickHouse/ClickBench/) with an instant-attach table | ✅ success |
| ClickHouse build check | Builds ClickHouse in various configurations for use in further steps. You have to fix the builds that fail. Build logs often have enough information to fix the error, but you might have to reproduce the failure locally. The cmake options can be found in the build log by grepping for cmake. Use these options and follow the general build process | ✅ success |
| Compatibility check | Checks that the clickhouse binary runs on distributions with old libc versions. If it fails, ask a maintainer for help | ✅ success |
| Docker keeper image | The check to build and optionally push the mentioned image to Docker Hub | ✅ success |
| Docker server image | The check to build and optionally push the mentioned image to Docker Hub | ✅ success |
| Docs check | Builds and tests the documentation | ✅ success |
| Fast test | Normally this is the first check that is run for a PR. It builds ClickHouse and runs most of the stateless functional tests, omitting some. If it fails, further checks are not started until it is fixed. Look at the report to see which tests fail, then reproduce the failure locally as described here | ✅ success |
| Flaky tests | Checks whether newly added or modified tests are flaky by running them repeatedly, in parallel, with more randomization. Functional tests are run 100 times with address sanitizer and additional randomization of thread scheduling. Integration tests are run up to 10 times. If a new test fails at least once, or takes too long, this check will be red. We don't allow flaky tests; read the doc | ✅ success |
| Install packages | Checks that the built packages are installable in a clean environment | ✅ success |
| Performance Comparison | Measures changes in query performance. The performance test report is described in detail here. In square brackets are the optional part/total tests | ✅ success |
| SQLTest | There's no description for the check yet; please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS | ✅ success |
| SQLancer | Fuzzing tests that detect logical bugs with the SQLancer tool | ✅ success |
| Sqllogic | Runs clickhouse on the sqllogic test set against sqlite and checks that all statements pass | ✅ success |
| Style check | Runs a set of checks to keep the code style clean. If some checks fail, see the related log from the report | ✅ success |
| Unit tests | Runs the unit tests for different release types | ✅ success |
| Upgrade check | Runs stress tests on the server version from the last release and then tries to upgrade it to the version from the PR. It checks whether the new server can start up without errors, crashes, or sanitizer asserts | ✅ success |
Failed and pending checks

| Check name | Description | Status |
|---|---|---|
| A Sync | There's no description for the check yet; please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS | ❌ failure |
| AST fuzzer | Runs randomly generated queries to catch program errors. The build type is optionally given in parentheses. If it fails, ask a maintainer for help | ❌ failure |
| CI running | A meta-check that indicates the running CI. Normally it's in a success or pending state. A failed status indicates a problem with the PR | ⏳ pending |
| Integration tests | The integration tests report. In parentheses the package type is given, and in square brackets are the optional part/total tests | ❌ failure |
| Mergeable Check | Checks whether all other necessary checks are successful | ❌ failure |
| Stateful tests | Runs stateful functional tests for ClickHouse binaries built in various configurations -- release, debug, with sanitizers, etc. | ❌ failure |
| Stateless tests | Runs stateless functional tests for ClickHouse binaries built in various configurations -- release, debug, with sanitizers, etc. | ❌ failure |
| Stress test | Runs stateless functional tests concurrently from several clients to detect concurrency-related errors | ❌ failure |

@UnamedRus (Contributor) commented:

ClickHouse already has a skip index cache; how will this be different?

#27961

@al13n321 (Member, Author) commented:

Hm, right, maybe we should remove the uncompressed cache for the index if this new cache works out.

The idea is to avoid deserializing the index for each query. If the index allows doing the lookup faster than a full scan, then deserializing the whole index in O(n) time on each query partially defeats the purpose.

(Alternatively, we could say that the index should be able to work without deserialization, directly from a SeekableReadBuffer. But that would take considerable extra work to support, in particular for usearch, at least currently. Maybe we'll do it anyway, though, to avoid caching the whole index when we only access small parts of it.)
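To make that concrete, here's a minimal self-contained sketch of the getOrSet pattern this relies on (stand-in types and names, not the PR's actual interfaces; a real cache also bounds memory, uses an eviction policy, and avoids holding the lock while loading):

#include <functional>
#include <memory>
#include <mutex>
#include <string>
#include <unordered_map>

struct IndexGranule {};                           // stand-in for a deserialized granule
using GranulePtr = std::shared_ptr<IndexGranule>;

class GranuleCache
{
public:
    // key identifies (data part, index, mark); load deserializes on a miss.
    GranulePtr getOrSet(const std::string & key, const std::function<GranulePtr()> & load)
    {
        std::lock_guard lock(mutex);
        auto it = cache.find(key);
        if (it != cache.end())
            return it->second;                    // hit: no deserialization at all
        GranulePtr granule = load();              // miss: deserialize once
        cache.emplace(key, granule);
        return granule;
    }

private:
    std::mutex mutex;
    std::unordered_map<std::string, GranulePtr> cache;
};

On a repeated query, every granule lookup then becomes a hash-map hit instead of a deserializeBinary() call.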

@al13n321 (Member, Author) commented:

This PR now undoes #55683. It seems impossible to have both: if granules are inserted into the cache, they can't be reused.
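To see roughly why the two optimizations conflict, here is a sketch with stand-in types (not the PR's actual interfaces): #55683 reused one mutable granule object, overwriting its buffers for each mark, but a cached granule may still be referenced by the cache and by concurrent queries, so it has to stay immutable, and each mark needs a fresh allocation.

#include <memory>
#include <vector>

struct Granule { std::vector<int> data; };

// #55683-style reuse: one granule whose buffers are overwritten for each mark.
void readWithReuse(Granule & g, int mark) { g.data.assign(1, mark); }

// Cache-friendly style: a fresh granule per mark, shared immutably between
// the cache and any number of concurrent queries.
std::shared_ptr<const Granule> readForCache(int mark)
{
    auto g = std::make_shared<Granule>();
    g->data.assign(1, mark);
    return g;   // safe to cache; nobody mutates it afterwards
}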

@jkartseva (Contributor) left a comment:

Overall, LGTM, please see the inline comments.

Do I correctly understand that there are currently no tests (functional and integration) for the change?

Resolved inline comments (outdated):
src/Interpreters/BloomFilter.h
src/Interpreters/GinFilter.h
src/Storages/MergeTree/MergeTreeIndexHypothesis.h
src/Storages/MergeTree/MergeTreeIndexMinMax.h
granule = index->createIndexGranule();
auto load_func = [&] {
initStreamIfNeeded();
if (stream_mark != mark)
@jkartseva (Contributor) commented Dec 6, 2023:

What happens if mark is less than stream_mark? That wasn't possible before the change, but now the interface of MergeTreeIndexReader allows it. Should there be an assert?

@al13n321 (Member, Author) replied:

Not sure what you mean; the previous interface MergeTreeIndexReader::seek() allowed moving backwards too, and it should work fine AFAICT. (I don't know whether callers currently do that.)

src/Storages/MergeTree/SecondaryIndexCache.h (resolved)
@@ -71,7 +71,11 @@ namespace DB
     M(String, index_mark_cache_policy, DEFAULT_INDEX_MARK_CACHE_POLICY, "Secondary index mark cache policy name.", 0) \
     M(UInt64, index_mark_cache_size, DEFAULT_INDEX_MARK_CACHE_MAX_SIZE, "Size of cache for secondary index marks. Zero means disabled.", 0) \
     M(Double, index_mark_cache_size_ratio, DEFAULT_INDEX_MARK_CACHE_SIZE_RATIO, "The size of the protected queue in the secondary index mark cache relative to the cache's total size.", 0) \
-    M(UInt64, mmap_cache_size, DEFAULT_MMAP_CACHE_MAX_SIZE, "A cache for mmapped files.", 0) \
+    M(UInt64, mmap_cache_size, DEFAULT_MMAP_CACHE_MAX_SIZE, "Maximum number of files to keep in the mmapped file cache.", 0) \
+    M(String, secondary_index_cache_policy, DEFAULT_SECONDARY_INDEX_CACHE_POLICY, "Secondary index cache policy name.", 0) \
A contributor commented:

Can there be a use case for the primary and secondary index caches using different eviction policies? Just curious :)

@al13n321 (Member, Author) replied:

¯\_(ツ)_/¯ I just added the same few settings as the caches above. IIRC the primary index doesn't have a cache; it's always kept in memory for as long as the data part is loaded. For the other caches above, I guess the only way this would come up is if someone notices (e.g. sees it in a profiler) that one of the caches isn't working well and tries different settings for just that cache.

@jrdi (Contributor) commented Dec 7, 2023:

We already have a mark cache and an uncompressed cache for secondary indexes. What does this cache do differently?

M(String, index_uncompressed_cache_policy, DEFAULT_INDEX_UNCOMPRESSED_CACHE_POLICY, "Secondary index uncompressed cache policy name.", 0) \
M(UInt64, index_uncompressed_cache_size, DEFAULT_INDEX_UNCOMPRESSED_CACHE_MAX_SIZE, "Size of cache for uncompressed blocks of secondary indices. Zero means disabled.", 0) \
M(Double, index_uncompressed_cache_size_ratio, DEFAULT_INDEX_UNCOMPRESSED_CACHE_SIZE_RATIO, "The size of the protected queue in the secondary index uncompressed cache relative to the cache's total size.", 0) \
M(String, index_mark_cache_policy, DEFAULT_INDEX_MARK_CACHE_POLICY, "Secondary index mark cache policy name.", 0) \
M(UInt64, index_mark_cache_size, DEFAULT_INDEX_MARK_CACHE_MAX_SIZE, "Size of cache for secondary index marks. Zero means disabled.", 0) \
M(Double, index_mark_cache_size_ratio, DEFAULT_INDEX_MARK_CACHE_SIZE_RATIO, "The size of the protected queue in the secondary index mark cache relative to the cache's total size.", 0) \

@alexey-milovidov (Member) commented:

@jrdi, the index is still deserialized, and that can take a lot of time.
The indices can be large (larger than the data) in the case of vector search (HNSW), and it makes sense to just keep the deserialized data structure in memory.

Although this scenario is limited, vector search is often used on minuscule datasets (under one billion records).

@azat (Collaborator) commented Jan 26, 2024:

FWIW, here are some numbers; in short, it gives a significant boost for ANN indexes. I tested on a table with 33'114'778 rows with embeddings (arrays of 768 elements):

  • without an index: 216 sec (4 sec with data in the page cache)
  • annoy before: 100 sec (it is just too suboptimal, even in memory)
  • usearch before: 33 sec
  • usearch with this patch: 0.064 sec

@azat (Collaborator) commented Jan 26, 2024:

But the funny thing is that it may actually slow down regular indexes.

Input
CREATE TABLE data
(
    `part` Int32,
    `key` Int32,
    `v1` Int32,
    `s` SimpleAggregateFunction(sum, Int64),
    INDEX v1_index v1 TYPE minmax GRANULARITY 1
)
ENGINE = AggregatingMergeTree
PARTITION BY part
ORDER BY key
SETTINGS index_granularity = 1;

insert into data (part, key, v1) select number%100, number, number from numbers(1e6);

Before:

EXPLAIN PIPELINE
SELECT *
FROM data
FINAL
WHERE v1 > 1
SETTINGS max_threads = 2, max_final_threads = 2, force_data_skipping_indices = 'v1_index', use_skip_indexes_if_final = 0;
-- 16 rows in set. Elapsed: 0.396 sec. 

EXPLAIN PIPELINE
SELECT *
FROM data
FINAL
WHERE v1 > 1
SETTINGS max_threads = 2, max_final_threads = 2, force_data_skipping_indices = 'v1_index', use_skip_indexes_if_final = 1;
-- 16 rows in set. Elapsed: 0.756 sec. 

After:

EXPLAIN PIPELINE
SELECT *
FROM data
FINAL
WHERE v1 > 1
SETTINGS max_threads = 2, max_final_threads = 2, force_data_skipping_indices = 'v1_index', use_skip_indexes_if_final = 0;
-- 16 rows in set. Elapsed: 0.307 sec. 

EXPLAIN PIPELINE
SELECT *
FROM data
FINAL
WHERE v1 > 1
SETTINGS max_threads = 2, max_final_threads = 2, force_data_skipping_indices = 'v1_index', use_skip_indexes_if_final = 1;
-- 16 rows in set. Elapsed: 1.496 sec. 

So now it is 2x slower; perf points to SLRUCachePolicy.

Perf use_skip_indexes_if_final=0

Samples: 5K of event 'cycles:P', 4000 Hz, Event count (approx.): 5139605993 lost: 0/0 drop: 0/0
Overhead  Shared Object     Symbol
  15.11%  clickhouse        [.] DB::MergeTreeIndexGranuleMinMax::deserializeBinary(DB::ReadBuffer&, unsigned char)
   9.47%  clickhouse        [.] DB::buildPipesForReadingByPKRanges(DB::KeyDescription const&, std::__1::shared_ptr<DB::ExpressionActions>, DB::RangesInDataParts, unsigned long, std::__1::shared_ptr<DB::Context const>, std::__1::function<DB::Pipe (DB::RangesInDataParts)>&&)
   4.29%  clickhouse        [.] DB::Field::~Field()

Perf use_skip_indexes_if_final=1 before

Samples: 2K of event 'cycles:P', 4000 Hz, Event count (approx.): 2420778913 lost: 0/0 drop: 0/523
Overhead  Shared Object     Symbol
  26.88%  clickhouse        [.] DB::buildPipesForReadingByPKRanges(DB::KeyDescription const&, std::__1::shared_ptr<DB::ExpressionActions>, DB::RangesInDataParts, unsigned long, std::__1::shared_ptr<DB::Context const>, std::__1::function<DB::Pipe (DB::RangesInDataParts)>&&)
   7.50%  clickhouse        [.] auto DB::Field::dispatch<DB::applyVisitor<DB::FieldVisitorAccurateLess, DB::Field const&, DB::Field const&>(DB::FieldVisitorAccurateLess&&, DB::Field const&, DB::Field const&)::{lambda(auto:1&)#1}::operator()<long const>(DB::FieldVisitorAccurateLess&) const::{lam
   7.04%  clickhouse        [.] auto DB::Field::dispatch<DB::applyVisitor<DB::FieldVisitorAccurateLess, DB::Field const&, DB::Field const&>(DB::FieldVisitorAccurateLess&&, DB::Field const&, DB::Field const&)::{lambda(auto:1&)#1}, DB::Field const&>(DB::FieldVisitorAccurateLess&&, DB::Field const
   6.38%  clickhouse        [.] void std::__1::__sift_up[abi:v15000]<std::__1::_ClassicAlgPolicy, std::__1::less<(anonymous namespace)::split(DB::RangesInDataParts, unsigned long)::PartsRangesIterator>&, std::__1::__wrap_iter<(anonymous namespace)::split(DB::RangesInDataParts, unsigned long)::P
   4.56%  clickhouse        [.] malloc

Perf use_skip_indexes_if_final=1 after

Samples: 30K of event 'cycles', 4000 Hz, Event count (approx.): 12978901988 lost: 0/0 drop: 27/36                                                                                             
Overhead  Shared Object     Symbol                                                                                                                                                            
  17.11%  clickhouse        [.] DB::SLRUCachePolicy<wide::integer<128ul, unsigned int>, DB::SecondaryIndexCacheCell, UInt128TrivialHash, DB::SecondaryIndexCacheWeightFunction>::get          
  11.48%  clickhouse        [.] DB::MergeTreeIndexReader::read                                                                                                                                
   5.86%  clickhouse        [.] DB::buildPipesForReadingByPKRanges                                                                                                                            

@KrisAnTis-Group commented:
We really need this 515x speedup (33 sec → 0.064 sec) in vector search 🥹

@mcproger commented:
Hey @azat, awesome numbers!

Could you share a few more details on how you ran this benchmark? I'm trying to set up a table with an ANN index (I already tried both annoy and usearch) but I'm having problems: the index doesn't speed up my SELECT queries. Even worse, queries without the index perform better (though still not very well, which is why I'm looking for ways to optimize). Another strange thing is that the index size is almost equal to the embedding column size (even with compression).

My table has a relatively simple structure, and I already tried different parameters for the index, including granularity and compression. Query execution time for this table with ~3 million rows can be around 3-5 seconds, which is too slow for such a data volume as far as I know. The embedding dimension I use is 1536 (I also tried reducing it but didn't notice much of a difference).

CREATE TABLE default.entity_embeddings_table
(
    `date` Date,
    `__account_id` String,
    `__row_hash` UInt64 CODEC(NONE),
    `__row_id` UInt16,
    `__key_hash` UInt64 CODEC(NONE),
    `__sign` Int8,
    `__insert_date` DateTime,
    `__workspace_id` Int32,
    `__sql_view_name` LowCardinality(String),
    `collection_name` LowCardinality(String),
    `column_name` String,
    `table_name` LowCardinality(String),
    `entity_metadata` String,
    `entity_embedding` Array(Float32) CODEC(ZSTD(3)),
    `entity_embeddable_string` String,
    `destination_uuid` String,
    `schema_updated_at` DateTime64(3),
    `entity_id` String,
    INDEX ix_entity_embedding_usearch entity_embedding TYPE usearch('cosineDistance')
)
ENGINE = CollapsingMergeTree(__sign)
PARTITION BY table_name
ORDER BY (collection_name, destination_uuid, table_name, column_name, date, __account_id, __row_hash, __row_id)
SETTINGS index_granularity = 8192

Could you please share, if possible, the CH config and table structure for your benchmark so I can compare them on my side? I'm probably doing something wrong with the CH settings or data structure, but I'm out of ideas at the moment.

Thank you!

@azat (Collaborator) commented Feb 11, 2024:

@mcproger are you using binaries from this PR?

@mcproger replied:
@azat no, not yet. I was just looking for ways to optimize my setup on v23.8.7.24 and found this discussion.

I wonder whether I'm doing something wrong with my current setup, or whether this cache fix will simply fix query performance and there's no point in trying to optimize further at the moment.

@azat (Collaborator) commented Feb 11, 2024:

this cache fix will simply fix query performance

Yes, you need this PR to get this improvement.
But it's better to wait until it's merged (though if you have a dev environment, you can try it at your own risk).

@al13n321 (Member, Author) replied:

But the funny thing is that it may actually slow down regular indexes.

Hm, yeah, it's plausible that 1M cache lookups add 800 ms; each lookup does ~2 unordered_map lookups under a global mutex. Not sure what to do about it, other than to say it's worth it. I did try replacing unordered_map with absl::flat_hash_map, but it had little effect.
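For intuition about where the time goes, here is a simplified self-contained sketch of an SLRU get() (stand-in types, not ClickHouse's actual SLRUCachePolicy): every lookup takes the global mutex, probes up to two hash maps (protected, then probationary), and maintains the LRU lists.

#include <list>
#include <mutex>
#include <string>
#include <unordered_map>

// Simplified SLRU lookup: a probationary and a protected segment, each an LRU.
class SlruSketch
{
    using LruList = std::list<std::string>;
    using Map = std::unordered_map<std::string, LruList::iterator>;

public:
    bool get(const std::string & key)
    {
        std::lock_guard lock(mutex);                  // global mutex, every lookup
        if (auto it = protected_map.find(key); it != protected_map.end())
        {
            // refresh position in the protected LRU
            protected_lru.splice(protected_lru.begin(), protected_lru, it->second);
            return true;
        }
        if (auto it = probationary_map.find(key); it != probationary_map.end())
        {
            // promote to the protected segment (evictions elided)
            probationary_lru.erase(it->second);
            probationary_map.erase(it);
            protected_lru.push_front(key);
            protected_map[key] = protected_lru.begin();
            return true;
        }
        return false;                                 // miss: caller deserializes
    }

private:
    std::mutex mutex;
    LruList probationary_lru, protected_lru;
    Map probationary_map, protected_map;
};

Several hundred nanoseconds of locking, hashing, and pointer-chasing per lookup is roughly consistent with the ~800 ms over 1M lookups seen above, once contention is included.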

@al13n321 (Member, Author) replied:

Do I correctly understand that there are currently no tests (functional and integration) for the change?

Yes. Added one now.

Should be ready for review again now.

@amosbird (Collaborator) commented:

This PR now undoes #55683. It seems impossible to have both: if granules are inserted into the cache, they can't be reused.

Hmm, how do we deal with this perf regression then? https://s3.amazonaws.com/clickhouse-test-reports/54101/1dbe1d0f24ffce7329f333f48118c74be6b2fcc8/performance_comparison_%5B1_4%5D/report.html

Labels: pr-performance (Pull request with some performance improvements)

10 participants