Fix crash caused by stale ZooKeeper session in UDF retry loop #102059

Merged: fm4v merged 3 commits into master from fix-udf-stale-zk-retry on Apr 9, 2026

Conversation

@fm4v (Member) commented Apr 8, 2026


The ZooKeeperRetriesControl retry loop introduced in #101891 reused the same expired ZooKeeper session on every retry iteration, without calling renewZooKeeper as every other retry loop in the codebase does. This caused repeated requests on a finalized session, leading to crashes on canary deploys (https://github.com/ClickHouse/clickhouse-core-incidents/issues/1644, https://github.com/ClickHouse/clickhouse-core-incidents/issues/1645).

Fix: renew the ZooKeeper session via zookeeper_getter on each retry iteration (matching the pattern used by backup coordination code), and move getObjectNamesAndSetWatch inside the retry loop so that the object list and watches are re-established on the fresh session.

Changelog category (leave one):

  • Critical Bug Fix (crash, data loss, RBAC)

Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):

Fix crash in UDF refresh caused by ZooKeeperRetriesControl retrying on a stale (expired) ZooKeeper session without renewing it.

Documentation entry for user-facing changes

  • Documentation is written (mandatory for new features)

The `ZooKeeperRetriesControl` retry loop introduced in #101891 reused
the same expired ZooKeeper session on every retry iteration, without
calling `renewZooKeeper` as every other retry loop in the codebase does.
This caused repeated requests on a finalized session, leading to crashes
(clickhouse-core-incidents #1644, #1645).

Remove the broken retry loop. The `tryLoadObject` split-catch that
re-throws Keeper hardware errors is sufficient: the exception propagates
to `processWatchQueue`'s catch-all, which resets the session and retries
from scratch with a fresh connection. This preserves the original fix's
goal — `setAllObjects` is never called with a partial set — without the
stale-session misuse.
@clickhouse-gh Bot (Contributor) commented Apr 8, 2026

Workflow [PR], commit [b20f2d8]

Summary:

  • Stress test (arm_msan): failure
      • Server died: FAIL (cidb, issue: ISSUE EXISTS)
      • MemorySanitizer: use-of-uninitialized-value (STID: 4179-5154): FAIL (cidb, issue: ISSUE EXISTS)
  • Finish Workflow: failure
      • python3 ./ci/jobs/scripts/workflow_hooks/new_tests_check.py: failure (IGNORED)
  • Fast test (arm_darwin): error (IGNORED)

AI Review

Summary

This PR fixes a real reliability issue in UDF refresh: retries now renew the ZooKeeper session via zookeeper_getter and re-fetch object names and watches on the fresh session before reloading objects. The core fix is correct and aligned with existing ZooKeeperRetriesControl usage patterns. I did not find new correctness, safety, concurrency, or performance blockers in the current diff.

Tests

⚠️ No regression test was added for the stale-session retry path in refreshObjects. This gap has already been raised in existing inline review discussion, so I am not posting a duplicate inline comment.

ClickHouse Rules

  • Deletion logging
  • Serialization versioning
  • Core-area scrutiny
  • No test removal
  • Experimental gate
  • No magic constants
  • Backward compatibility
  • SettingsChangesHistory.cpp
  • PR metadata quality
  • Safe rollout
  • Compilation time
  • No large/binary files
Final Verdict

Status: ⚠️ Request changes

Minimum required action:

  • Add a focused regression test that reproduces a Keeper session-expiration/hardware-error condition during UDF refresh/retry and verifies recovery on a fresh session without leaving a partial UDF set.

@clickhouse-gh Bot added the pr-critical-bugfix and pr-must-backport (Pull request should be backported intentionally. Use this label with great care!) labels Apr 8, 2026
Inline review comment (Contributor) on:

```cpp
ZooKeeperRetriesInfo{max_retries, initial_backoff_ms, max_backoff_ms, /*query_status=*/nullptr});

retries_ctl.retryLoop([&]
for (const auto & function_name : object_names)
```
Please add a regression test for this path. The original issue was a production exception caused by retrying on an expired Keeper session; this change removes the local retry loop, so we need a test that forces a Keeper hardware error during refreshObjects and verifies recovery happens via processWatchQueue retry (fresh session) without leaving a partial UDF set.

Instead of removing the retry loop entirely, keep it but fix the root
cause: renew the ZooKeeper session via zookeeper_getter on each retry
iteration (matching the pattern used by backup coordination code).
Also move getObjectNamesAndSetWatch inside the loop so that the object
list and watches are re-established on the fresh session.
@alexey-milovidov (Member) commented:

The 02346_text_index_bug101913 test failure is fixed by #102108 (already merged). Please rebase or merge master to pick up the fix.

@clickhouse-gh Bot (Contributor) commented Apr 9, 2026

LLVM Coverage Report

| Metric    | Baseline | Current | Δ      |
|-----------|----------|---------|--------|
| Lines     | 84.10%   | 84.00%  | -0.10% |
| Functions | 90.90%   | 90.90%  | +0.00% |
| Branches  | 76.60%   | 76.50%  | -0.10% |

Changed lines: 95.24% (20/21) | lost baseline coverage: 2 line(s) · Uncovered code

Full report · Diff report

@alexey-milovidov (Member) commented:

The MSan stress test failure (MemorySanitizer: use-of-uninitialized-value, STID 4179-5154 or 4148-3044) is a known pre-existing issue unrelated to this PR. Fix: #102158

@fm4v fm4v added this pull request to the merge queue Apr 9, 2026
Merged via the queue into master with commit ad69fbe Apr 9, 2026
160 of 164 checks passed
@fm4v fm4v deleted the fix-udf-stale-zk-retry branch April 9, 2026 05:32
@robot-ch-test-poll added the pr-must-backport-synced label (The `*-must-backport` labels are synced into the cloud sync PR) Apr 9, 2026
robot-clickhouse added a commit that referenced this pull request Apr 9, 2026
@alexey-milovidov (Member) commented:

@fm4v, our CI checks that every bug fix is accompanied by a test. But you ignored it... Maybe there is still a chance to add a test?

Inline review comment (Member) on:

```cpp
{
    /// Renew the session on retry — the previous one may have expired.
    if (retries_ctl.isRetry())
        current_zookeeper = zookeeper_getter.getZooKeeper().first;
```

Other places issue a SYNC to Keeper when a new session is created; we probably also need it here, since other Keeper nodes may not see some objects that were visible from this session.

Also, why were this and #101891 merged without review?

@fm4v (Member, Author) replied:

My bad, I tested it on a single customer replica before merging: https://github.com/ClickHouse/clickhouse-private/pull/54946

@alexey-milovidov (Member) replied:

> SYNC to keeper in case of new session

@fm4v please also check this:

```cpp
/// Checks that our session was not killed, and allows to avoid applying a request from an old lost session.
/// Imagine a "connection-loss-on-commit" situation like this:
/// - We have written some write requests to the socket and immediately disconnected (e.g. due to "Operation timeout")
/// - The requests were sent, but the destination [Zoo]Keeper host will receive it later (it doesn't know about our requests yet)
/// - We don't know the status of our requests
/// - We connect to another [Zoo]Keeper replica with a new session, and do some reads
///   to find out the status of our requests. We see that they were not committed.
/// - The packets from our old session finally arrive to the destination [Zoo]Keeper host. The requests get processed.
/// - Changes are committed (although, we have already seen that they are not)
///
/// We need a way to reject requests from old sessions somehow.
///
/// So we update the version of /clickhouse/sessions/server_uuid node when starting a new session.
/// And there's an option to check this version when committing something.
void addCheckSessionOp(Coordination::Requests & requests) const;
```

fm4v added a commit that referenced this pull request Apr 9, 2026
Backport #102059 to 26.3: Fix crash caused by stale ZooKeeper session in UDF retry loop
fm4v added a commit that referenced this pull request Apr 9, 2026
Backport #102059 to 26.2: Fix crash caused by stale ZooKeeper session in UDF retry loop
fm4v added a commit that referenced this pull request Apr 9, 2026
Backport #102059 to 26.1: Fix crash caused by stale ZooKeeper session in UDF retry loop
@fm4v (Member, Author) commented Apr 9, 2026

@alexey-milovidov I added a test in the first PR #101891, but it was unreliable due to network fault injection timing. It was tested manually on a custom replica of the affected customer's instance.

@robot-clickhouse-ci-2 added the pr-backports-created label (Backport PRs are successfully created; it won't be processed by the CI script anymore) Apr 9, 2026

Labels

pr-backports-created, pr-critical-bugfix, pr-must-backport, pr-must-backport-synced, pr-synced-to-cloud, v25.10-must-backport, v25.12-must-backport, v26.2-must-backport, v26.3-must-backport
