Skip to content

Fix flaky test_write_failover in test_merge_tree_s3_failover#96716

Merged
alexey-milovidov merged 1 commit intomasterfrom
fix-flaky-s3-failover-test-put-filter
Feb 12, 2026
Merged

Fix flaky test_write_failover in test_merge_tree_s3_failover#96716
alexey-milovidov merged 1 commit intomasterfrom
fix-flaky-s3-failover-test-put-filter

Conversation

@alexey-milovidov
Copy link
Copy Markdown
Member

@alexey-milovidov alexey-milovidov commented Feb 12, 2026

The S3 proxy endpoint's fail counter was consumed by background GET/DELETE requests (e.g. cleanup of partially written parts from failed INSERTs), causing the INSERT to succeed earlier than expected.

Added optional HTTP method filtering to the proxy's fail_request endpoint so test_write_failover counts only PUT requests. Other tests in the same suite are unaffected (they use unfiltered fail_request).

CI report: https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=96705&sha=8a70737430373209132a005dc88f94f9bd2962af&name_0=PR&name_1=Integration%20tests%20%28arm_binary%2C%20distributed%20plan%2C%202%2F4%29
#96705

Closes #96649

Changelog category (leave one):

  • CI Fix or Improvement (changelog entry is not required)

Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):

...

Documentation entry for user-facing changes

  • Documentation is written (mandatory for new features)

The S3 proxy endpoint counted all HTTP methods (GET, POST, PUT, DELETE)
toward its fail counter. When a previous INSERT failed partway through
writing part files, ClickHouse would clean up the partially written
files via DELETE requests. These background DELETE/GET requests could
race with the next iteration's `fail_request()` call, consuming the
counter before the INSERT's PUT requests, causing the INSERT to succeed
earlier than expected.

The fix adds optional HTTP method filtering to the proxy's
`fail_request` endpoint (`/fail_request/<N>/<method>`). The
`test_write_failover` test now uses `fail_request(N, "PUT")` to only
count PUT requests, so background cleanup operations no longer
interfere with the counter.

Other tests (`test_move_failover`, `test_retry_loading_parts`,
`test_throttle_retry`) are unaffected as they continue using the
unfiltered `fail_request(N)`.

CI report: https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=96705&sha=8a70737430373209132a005dc88f94f9bd2962af&name_0=PR&name_1=Integration%20tests%20%28arm_binary%2C%20distributed%20plan%2C%202%2F4%29

Closes #96649

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@clickhouse-gh
Copy link
Copy Markdown
Contributor

clickhouse-gh Bot commented Feb 12, 2026

Workflow [PR], commit [9052190]

Summary:

@clickhouse-gh clickhouse-gh Bot added the pr-ci label Feb 12, 2026
@alexey-milovidov alexey-milovidov merged commit c038184 into master Feb 12, 2026
262 of 264 checks passed
@alexey-milovidov alexey-milovidov deleted the fix-flaky-s3-failover-test-put-filter branch February 12, 2026 07:12
@alexey-milovidov alexey-milovidov self-assigned this Feb 12, 2026
@robot-ch-test-poll2 robot-ch-test-poll2 added the pr-synced-to-cloud The PR is synced to the cloud repo label Feb 12, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

pr-ci pr-synced-to-cloud The PR is synced to the cloud repo

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Flaky test: test_merge_tree_s3_failover/test.py::test_write_failover[0-15-2]

2 participants