Skip to content

Fix flaky test_insert_select_from_cluster_with_partition_pruning#105525

Open
groeneai wants to merge 1 commit into
ClickHouse:masterfrom
groeneai:groeneai/fix-test_storage_delta-insert_select_cluster_partition_pruning
Open

Fix flaky test_insert_select_from_cluster_with_partition_pruning#105525
groeneai wants to merge 1 commit into
ClickHouse:masterfrom
groeneai:groeneai/fix-test_storage_delta-insert_select_cluster_partition_pruning

Conversation

@groeneai
Copy link
Copy Markdown
Contributor

The cluster INSERT path (distributedWriteIntoReplicatedMergeTreeOrDataLakeFromClusterStorage, added in #101299) sends the SELECT to a remote replica, which writes the part there and replicates via ZooKeeper. The INSERT pipeline returns as soon as the remote query completes, before the local replica on the coordinator has fetched the part. If SELECT runs immediately after, it sees 0 rows.

The first INSERT in the test happened to pass because SYSTEM FLUSH LOGS ON CLUSTER between INSERT and SELECT acted as an accidental barrier. The second INSERT had no synchronization and failed across many unrelated PRs (#100185, #101446, #101757, #102115, #103525, #105249, etc.) on slower configs (arm_binary, ASan, MSan, db disk, llvm_coverage), plus one master hit on 2026-05-12. Reported by @alexey-milovidov on #100752.

Fix: add SYSTEM SYNC REPLICA after each INSERT so the local replica has fetched any new parts before reading.

Changelog category (leave one):

  • CI Fix or Improvement (changelog entry is not required)

Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):

Fix flaky test_insert_select_from_cluster_with_partition_pruning.

Documentation entry for user-facing changes

  • Documentation is written (mandatory for new features)

The cluster INSERT path (distributedWriteIntoReplicatedMergeTreeOrDataLakeFromClusterStorage)
sends the SELECT to a remote replica, which writes the part there and replicates
via ZooKeeper. The INSERT pipeline returns to the client as soon as the remote
query completes, before the LOCAL replica on the coordinator has fetched the part.

If the test runs SELECT immediately after INSERT (timing under ~10ms on slow CI
shards), the part is still PreActive locally and the SELECT returns 0 rows.

The first INSERT in the test happened to pass because the SYSTEM FLUSH LOGS ON
CLUSTER between INSERT and SELECT acted as an accidental barrier. The second
INSERT lacked any synchronization, so it fails about 30% of the time on slower
configs (arm_binary, ASan, MSan, db disk, llvm_coverage).

Add explicit SYSTEM SYNC REPLICA after each INSERT to ensure the local replica
has fetched any new parts before reading.

Related: ClickHouse#100752
Signed-off-by: Groene AI <270696204+groeneai@users.noreply.github.com>
@groeneai
Copy link
Copy Markdown
Contributor Author

Pre-PR validation gate

a) Deterministic repro? Partial. The race is timing-dependent (interval between INSERT pipeline return and the start of SELECT count()); on the failing CI report (PR #100752, arm_binary distributed plan, 2026-05-20 18:04:40) the SELECT arrived 1ms after the local replica's ZK commit, vs ~100ms in passing runs. Cannot force-reproduce locally without artificially throttling the local replica's GET_PART task, but the failure mechanism is conclusively visible in the CI server logs (see node1 clickhouse-server.log from the linked report: Source finished: files_read=0 on the local INSERT, Pulling 1 entries to queue: log-0000000000, Fetching part all_0_0_0 ... replicas/2, Committing part all_0_0_0 to zookeeper at 18:20:23.971609, SELECT count() arriving at 18:20:23.975225).

b) Root cause explained? Yes. The cluster INSERT path in distributedWriteIntoReplicatedMergeTreeOrDataLakeFromClusterStorage distributes file tasks across cluster replicas. The matching Delta file is given to ONE replica (in the failing log, replica 1 = node2), which writes the part and replicates via ZooKeeper. The INSERT pipeline returns after the remote query completes (line 989, pipeline.addCompletedPipeline), but the LOCAL replica on the coordinator only sees the new part once its background ReplicatedMergeTreeRestartingThread pulls the queue entry and fetches the part. The SELECT runs on the local replica and reads data_parts_indexes, which doesn't include the part until after Renaming part to all_0_0_0 and ZK commit.

c) Fix matches root cause? Yes. SYSTEM SYNC REPLICA is the standard ClickHouse primitive that pulls the local queue from ZK and waits for all pending entries (GET_PART in this case) to be processed, ensuring the part is locally visible before the SELECT.

d) Test intent preserved? Yes. The fix adds synchronization without weakening any assertions or pinning settings:

  • Both assert result == 5 and assert result == 0 row-count checks still run with full strictness.
  • assert pruned >= 2 (DeltaLakePartitionPrunedFiles) and assert filtered >= 2 (ObjectStoragePredicateFilteredObjects) still verify the pruning code path on the remote replica.
  • The cluster-INSERT-distribution code path under test is unchanged.

e) Demonstrated in both directions? Partial. Without the fix: CIDB shows 30+ failures with assert 0 == 5 across 10+ unrelated PRs and 1 master hit over 60 days; server logs prove the part is committed after the INSERT pipeline returns. With the fix: SYSTEM SYNC REPLICA blocks until the local replica's queue is empty, so by the time the SELECT runs the GET_PART entry has been processed and the part is active. I have not been able to run the integration test locally end-to-end in this session, but the synchronization primitive is well-established.

f) Fix is general? Yes. Both INSERTs in the test exhibit the same race; both are fixed. The "happens to work" first INSERT (with SYSTEM FLUSH LOGS ON CLUSTER acting as an accidental barrier) is also made explicit. Any future test following this pattern will need the same SYSTEM SYNC REPLICA.

Note: the same race is observable for user-written INSERT INTO ReplicatedMergeTree SELECT FROM <cluster_table_function> queries on a multi-replica cluster — followed by a read on the coordinator. That is a real product limitation worth tracking, but a test fix is sufficient here.

Session: cron:clickhouse-ci-task-worker:20260521-124500

@groeneai
Copy link
Copy Markdown
Contributor Author

cc @scanhex12 — could you review this? You wrote this test in #101299 / #101634; the second INSERT block is missing a SYSTEM SYNC REPLICA between INSERT and SELECT, so the local replica can be behind background fetch when the count check runs. The first INSERT block happened to pass because SYSTEM FLUSH LOGS ON CLUSTER acted as an accidental barrier.

@PedroTadim PedroTadim added the can be tested Allows running workflows for external contributors label May 22, 2026
@clickhouse-gh
Copy link
Copy Markdown
Contributor

clickhouse-gh Bot commented May 22, 2026

Workflow [PR], commit [e4d6870]

Summary:

job_name test_name status info comment
Integration tests (amd_asan_ubsan, db disk, old analyzer, 2/6) FAIL
test_mysql_protocol/test.py::test_mysql_dotnet_client FAIL cidb
Integration tests (arm_binary, distributed plan, 3/4) FAIL
test_mysql_protocol/test.py::test_mysql_dotnet_client FAIL cidb
Integration tests (amd_tsan, 2/6) FAIL
test_mysql_protocol/test.py::test_mysql_dotnet_client FAIL cidb
Integration tests (amd_msan, 2/6) FAIL
test_mysql_protocol/test.py::test_mysql_dotnet_client FAIL cidb
Integration tests (amd_llvm_coverage, 3/5) FAIL
test_mysql_protocol/test.py::test_mysql_dotnet_client FAIL cidb

AI Review

Summary

This PR makes test_insert_select_from_cluster_with_partition_pruning deterministic by adding SYSTEM SYNC REPLICA barriers after each cluster-driven INSERT and before the subsequent SELECT assertions. The change matches the documented race condition (remote write + asynchronous local fetch) and preserves the original test intent and coverage checks.

Final Verdict

Status: ✅ Approve

@clickhouse-gh clickhouse-gh Bot added the pr-ci label May 22, 2026
@clickhouse-gh
Copy link
Copy Markdown
Contributor

clickhouse-gh Bot commented May 22, 2026

LLVM Coverage Report

Metric Baseline Current Δ
Lines 84.20% 84.10% -0.10%
Functions 91.40% 91.40% +0.00%
Branches 76.50% 76.50% +0.00%

Changed lines: No C/C++ source files changed — skipping uncovered code analysis.

Newly covered by added/modified tests: 458 line(s), 14 function(s) across 133 file(s) · Details

Top files
  • src/Common/QueryFuzzer.cpp: 66 line(s)
  • src/Storages/WindowView/StorageWindowView.cpp: 22 line(s), 1 function(s)
  • src/Storages/MergeTree/ReplicatedMergeTreeSink.cpp: 17 line(s), 1 function(s)
  • src/Storages/StorageReplicatedMergeTree.cpp: 17 line(s)
  • src/AggregateFunctions/AggregateFunctionWindowFunnel.cpp: 11 line(s), 6 function(s)

Full report

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

can be tested Allows running workflows for external contributors pr-ci

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants