Add selective replication for ReplicatedMergeTree #102244
zoomxi wants to merge 1 commit into ClickHouse:master
Workflow [PR], commit [eb3f0d9] Summary: ❌
AI Review
Summary: This PR adds selective replication for …
Findings: ❌ Blockers
Tests
ClickHouse Rules
Final Verdict
    String shared_log_path = fs::path(storage.getZooKeeperPath()) / "log/log-";
    for (const auto & part_name : meta.source_parts_snapshot)
startClone defines SelectiveReplication::GET_PART_BATCH_SIZE, but still issues one zk->create RPC per part in a tight loop. For large partitions this can generate thousands of sequential Keeper writes and significantly delay migration or trigger Keeper timeouts under load.
Please batch log entry creation in chunks (for example, GET_PART_BATCH_SIZE) using tryMulti and retry partial batch failures. That keeps migration throughput predictable and reduces Keeper pressure.
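The batching suggested above could look roughly like the following. This is a minimal, self-contained sketch, not the PR's code: `createLogEntriesBatched` and the `MultiCreateFn` callback are hypothetical names, and the callback stands in for whatever atomic multi-request the Keeper client offers (the review names `tryMulti`). It assumes a failed batch is reported definitively and leaves no entries behind, so the whole batch can be retried.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// Hypothetical stand-in for an atomic Keeper multi-request: creates one
// sequential node under `path_prefix` per payload and returns true only if
// the whole batch committed.
using MultiCreateFn =
    std::function<bool(const std::string & path_prefix, const std::vector<std::string> & payloads)>;

// Creates one log entry per part, `batch_size` entries per round trip.
// Each batch is attempted up to `max_retries + 1` times. Returns the number
// of entries committed; stops at the first batch that keeps failing.
size_t createLogEntriesBatched(
    const std::vector<std::string> & part_names,
    const std::string & shared_log_path,
    size_t batch_size,
    size_t max_retries,
    const MultiCreateFn & multi_create)
{
    assert(batch_size > 0);
    size_t committed = 0;

    for (size_t begin = 0; begin < part_names.size(); begin += batch_size)
    {
        const size_t end = std::min(begin + batch_size, part_names.size());
        const std::vector<std::string> batch(
            part_names.begin() + static_cast<std::ptrdiff_t>(begin),
            part_names.begin() + static_cast<std::ptrdiff_t>(end));

        bool ok = false;
        for (size_t attempt = 0; attempt <= max_retries && !ok; ++attempt)
            ok = multi_create(shared_log_path, batch);

        if (!ok)
            return committed; // Caller decides whether to resume later.
        committed += batch.size();
    }
    return committed;
}
```

With N parts and a batch size of B, this issues about N / B round trips instead of N. A connection loss where the outcome of a batch is unknown would need extra handling that the sketch leaves out.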
LLVM Coverage Report
Changed lines: 74.48% (1643/2206) | lost baseline coverage: 58 line(s) · Uncovered code |
1. Keeper Data Structure

Under each table's ZK path ( …

Key points:
2. Read Path

When …

Depth guard: Sub-queries arrive with …

Cache consistency: The 60-second TTL means a recently-migrated partition may route to the old replica until the cache expires. This is an acceptable trade-off for avoiding per-query ZK lookups.
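The cache trade-off described under "Cache consistency" can be illustrated with a small TTL cache. This is a sketch under assumptions, not the PR's implementation: `AssignmentCache`, its `Loader` callback (standing in for a Keeper lookup), and the explicit `now` parameter are all hypothetical, and the class is not thread-safe.

```cpp
#include <cassert>
#include <chrono>
#include <functional>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

using Clock = std::chrono::steady_clock;

// Hypothetical partition -> replicas cache with a fixed TTL.
class AssignmentCache
{
public:
    // Stand-in for the Keeper lookup that returns the replicas of a partition.
    using Loader = std::function<std::vector<std::string>(const std::string &)>;

    AssignmentCache(Clock::duration ttl_, Loader loader_)
        : ttl(ttl_), loader(std::move(loader_)) {}

    // Returns the cached replicas, reloading only when the entry is missing
    // or at least `ttl` old. `now` is passed in so expiry is easy to test.
    std::vector<std::string> get(const std::string & partition_id, Clock::time_point now)
    {
        auto it = entries.find(partition_id);
        if (it == entries.end() || now - it->second.loaded_at >= ttl)
        {
            Entry fresh{loader(partition_id), now};
            it = entries.insert_or_assign(partition_id, std::move(fresh)).first;
        }
        return it->second.replicas;
    }

private:
    struct Entry
    {
        std::vector<std::string> replicas;
        Clock::time_point loaded_at;
    };

    Clock::duration ttl;
    Loader loader;
    std::unordered_map<std::string, Entry> entries;
};
```

With a 60-second TTL, a lookup made 59 seconds after the last load still returns the pre-migration replica; the first lookup at or after 60 seconds reloads and sees the new owner.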
3. Write Path

INSERT queries are forwarded to assigned replicas with CAS-protected assignment verification:

Queue filtering: …
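As a rough illustration of what CAS-protected assignment verification means, the sketch below models the assignment as a versioned value and commits an insert only if the version read earlier is still current. All names (`AssignmentNode`, `VersionedAssignment`, `commitIfAssignmentUnchanged`) are hypothetical. In a real Keeper-backed implementation the version check and the commit would presumably be part of one atomic multi-request; here they are two steps in single-threaded code.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical snapshot of a partition's assignment and its node version.
struct VersionedAssignment
{
    std::vector<std::string> replicas;
    int version = 0;
};

// In-memory stand-in for a versioned Keeper node.
class AssignmentNode
{
public:
    VersionedAssignment read() const { return state; }

    // Compare-and-set: replaces the replicas only if the caller saw the
    // current version; every successful write bumps the version.
    bool compareAndSet(int expected_version, std::vector<std::string> new_replicas)
    {
        if (state.version != expected_version)
            return false;
        state.replicas = std::move(new_replicas);
        ++state.version;
        return true;
    }

    bool versionIs(int expected_version) const { return state.version == expected_version; }

private:
    VersionedAssignment state;
};

// Runs `commit` only if the assignment is unchanged since `snapshot` was read.
// Returns false when the caller should re-read the assignment and re-route.
bool commitIfAssignmentUnchanged(
    const AssignmentNode & node,
    const VersionedAssignment & snapshot,
    const std::function<void()> & commit)
{
    if (!node.versionIs(snapshot.version))
        return false;
    commit();
    return true;
}
```

If a migration rewrites the assignment between the read and the commit, the version no longer matches and the insert is rejected rather than landing on a replica that no longer owns the partition.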
4. Rebalance & Migration

Migration State Machine
This PR is labeled … Could someone help approve this? Thanks! CC @azat
By default, every replica in a ReplicatedMergeTree shard stores a full copy of all data. This PR introduces selective replication: a new replication_factor table setting that controls how many replicas store each partition's data.

This reduces storage costs and write amplification while maintaining read availability through automatic partition-to-replica assignment, query routing, and background rebalancing.

This implementation is inspired by #58132.
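To make "each partition stored on replication_factor replicas" concrete, here is a small sketch of one possible way to pick the replicas for a partition. The description does not say how the PR chooses them; rendezvous (highest-random-weight) hashing is used here purely as an assumption, and `pickReplicas` is a hypothetical name. Replica names are assumed to be unique.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// Picks min(replication_factor, replicas.size()) replicas for a partition.
// Every replica gets a score derived from (partition, replica); the highest
// scores win, so the result does not depend on the input order.
std::vector<std::string> pickReplicas(
    const std::string & partition_id,
    std::vector<std::string> replicas,
    size_t replication_factor)
{
    const size_t n = std::min(replication_factor, replicas.size());
    const std::hash<std::string> hasher;

    auto score = [&](const std::string & replica)
    {
        return hasher(partition_id + '\0' + replica);
    };

    std::partial_sort(
        replicas.begin(),
        replicas.begin() + static_cast<std::ptrdiff_t>(n),
        replicas.end(),
        [&](const std::string & a, const std::string & b)
        {
            const auto sa = score(a);
            const auto sb = score(b);
            return sa != sb ? sa > sb : a < b; // Name breaks score ties.
        });

    replicas.resize(n);
    return replicas;
}
```

With four replicas and a replication_factor of 2, each partition lands on two of the four replicas, so the shard stores two copies of the data instead of four.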
Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):
Add selective replication for ReplicatedMergeTree, allowing each partition to be stored on only replication_factor replicas instead of all replicas in a shard. Partitions are automatically rebalanced in the background. Closes #45766.
Documentation entry for user-facing changes