Fix crash caused by stale ZooKeeper session in UDF retry loop #102059

Merged: fm4v merged 3 commits into master from fix-udf-stale-zk-retry on Apr 9, 2026

Conversation

@fm4v (Member) commented Apr 8, 2026


The ZooKeeperRetriesControl retry loop introduced in #101891 reused the same expired ZooKeeper session on every retry iteration, without calling renewZooKeeper as every other retry loop in the codebase does. This caused repeated requests on a finalized session, leading to crashes on canary deploys (https://github.com/ClickHouse/clickhouse-core-incidents/issues/1644, https://github.com/ClickHouse/clickhouse-core-incidents/issues/1645).

Fix: renew the ZooKeeper session via zookeeper_getter on each retry iteration (matching the pattern used by backup coordination code), and move getObjectNamesAndSetWatch inside the retry loop so that the object list and watches are re-established on the fresh session.

Changelog category (leave one):

  • Critical Bug Fix (crash, data loss, RBAC)

Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):

Fix crash in UDF refresh caused by ZooKeeperRetriesControl retrying on a stale (expired) ZooKeeper session without renewing it.

Documentation entry for user-facing changes

  • Documentation is written (mandatory for new features)

The `ZooKeeperRetriesControl` retry loop introduced in #101891 reused
the same expired ZooKeeper session on every retry iteration, without
calling `renewZooKeeper` as every other retry loop in the codebase does.
This caused repeated requests on a finalized session, leading to crashes
(clickhouse-core-incidents #1644, #1645).

Remove the broken retry loop. The `tryLoadObject` split-catch that
re-throws Keeper hardware errors is sufficient: the exception propagates
to `processWatchQueue`'s catch-all, which resets the session and retries
from scratch with a fresh connection. This preserves the original fix's
goal — `setAllObjects` is never called with a partial set — without the
stale-session misuse.
@clickhouse-gh Bot (Contributor) commented Apr 8, 2026

Workflow [PR], commit [b20f2d8]

Summary:

  • Stress test (arm_msan): failure
      • Server died: FAIL (cidb, issue: ISSUE EXISTS)
      • MemorySanitizer: use-of-uninitialized-value (STID: 4179-5154): FAIL (cidb, issue: ISSUE EXISTS)
  • Finish Workflow: failure
      • python3 ./ci/jobs/scripts/workflow_hooks/new_tests_check.py: failure (IGNORED)
  • Fast test (arm_darwin): error (IGNORED)

AI Review

Summary

This PR fixes a real reliability issue in UDF refresh: retries now renew the ZooKeeper session via zookeeper_getter and re-fetch object names and watches on the fresh session before reloading objects. The core fix is correct and aligned with existing ZooKeeperRetriesControl usage patterns. I did not find new correctness, safety, concurrency, or performance blockers in the current diff.

Tests

⚠️ No regression test was added for the stale-session retry path in refreshObjects. This gap has already been raised in existing inline review discussion, so I am not posting a duplicate inline comment.

ClickHouse Rules

  • Deletion logging
  • Serialization versioning
  • Core-area scrutiny
  • No test removal
  • Experimental gate
  • No magic constants
  • Backward compatibility
  • SettingsChangesHistory.cpp
  • PR metadata quality
  • Safe rollout
  • Compilation time
  • No large/binary files
Final Verdict

Status: ⚠️ Request changes

Minimum required action:

  • Add a focused regression test that reproduces a Keeper session-expiration/hardware-error condition during UDF refresh/retry and verifies recovery on a fresh session without leaving a partial UDF set.

@clickhouse-gh Bot added the pr-critical-bugfix and pr-must-backport (Pull request should be backported intentionally. Use this label with great care!) labels Apr 8, 2026
Inline review comment (Contributor) on:

```cpp
ZooKeeperRetriesInfo{max_retries, initial_backoff_ms, max_backoff_ms, /*query_status=*/nullptr});

retries_ctl.retryLoop([&]
for (const auto & function_name : object_names)
```
Please add a regression test for this path. The original issue was a production exception caused by retrying on an expired Keeper session; this change removes the local retry loop, so we need a test that forces a Keeper hardware error during refreshObjects and verifies recovery happens via processWatchQueue retry (fresh session) without leaving a partial UDF set.

Instead of removing the retry loop entirely, keep it but fix the root
cause: renew the ZooKeeper session via zookeeper_getter on each retry
iteration (matching the pattern used by backup coordination code).
Also move getObjectNamesAndSetWatch inside the loop so that the object
list and watches are re-established on the fresh session.
@alexey-milovidov (Member) commented:

The 02346_text_index_bug101913 test failure is fixed by #102108 (already merged). Please rebase or merge master to pick up the fix.

@clickhouse-gh Bot (Contributor) commented Apr 9, 2026

LLVM Coverage Report

| Metric    | Baseline | Current | Δ      |
|-----------|----------|---------|--------|
| Lines     | 84.10%   | 84.00%  | -0.10% |
| Functions | 90.90%   | 90.90%  | +0.00% |
| Branches  | 76.60%   | 76.50%  | -0.10% |

Changed lines: 95.24% (20/21) | lost baseline coverage: 2 line(s) · Uncovered code

Full report · Diff report

@alexey-milovidov (Member) commented:

The MSan stress test failure (MemorySanitizer: use-of-uninitialized-value, STID 4179-5154 or 4148-3044) is a known pre-existing issue unrelated to this PR. Fix: #102158

@fm4v fm4v added this pull request to the merge queue Apr 9, 2026
Merged via the queue into master with commit ad69fbe Apr 9, 2026
160 of 164 checks passed
@fm4v fm4v deleted the fix-udf-stale-zk-retry branch April 9, 2026 05:32
@robot-ch-test-poll added the pr-must-backport-synced label (The `*-must-backport` labels are synced into the cloud sync PR) Apr 9, 2026
robot-clickhouse added a commit that referenced this pull request Apr 9, 2026
@alexey-milovidov (Member) commented:

@fm4v, our CI checks that every bug fix is accompanied by a test. But you ignored it... Maybe there is still a chance to add a test?

Inline review comment (Member) on:

```cpp
{
    /// Renew the session on retry — the previous one may have expired.
    if (retries_ctl.isRetry())
        current_zookeeper = zookeeper_getter.getZooKeeper().first;
```

Other places issue a SYNC to Keeper when a new session is created; we probably also need it here, since other Keeper nodes may not see some objects that were visible from this session.

Also, why were this and #101891 merged without review?

@fm4v (Member, Author) replied:

My bad, I tested it on a single customer replica before merging: https://github.com/ClickHouse/clickhouse-private/pull/54946

@alexey-milovidov (Member) replied:

> SYNC to keeper in case of new session

@fm4v please also check this:

```cpp
/// Checks that our session was not killed, and allows to avoid applying a request from an old lost session.
/// Imagine a "connection-loss-on-commit" situation like this:
/// - We have written some write requests to the socket and immediately disconnected (e.g. due to "Operation timeout")
/// - The requests were sent, but the destination [Zoo]Keeper host will receive it later (it doesn't know about our requests yet)
/// - We don't know the status of our requests
/// - We connect to another [Zoo]Keeper replica with a new session, and do some reads
///   to find out the status of our requests. We see that they were not committed.
/// - The packets from our old session finally arrive to the destination [Zoo]Keeper host. The requests get processed.
/// - Changes are committed (although, we have already seen that they are not)
///
/// We need a way to reject requests from old sessions somehow.
///
/// So we update the version of /clickhouse/sessions/server_uuid node when starting a new session.
/// And there's an option to check this version when committing something.
void addCheckSessionOp(Coordination::Requests & requests) const;
```

fm4v added a commit that referenced this pull request Apr 9, 2026
Backport #102059 to 26.3: Fix crash caused by stale ZooKeeper session in UDF retry loop
fm4v added a commit that referenced this pull request Apr 9, 2026
Backport #102059 to 26.2: Fix crash caused by stale ZooKeeper session in UDF retry loop
fm4v added a commit that referenced this pull request Apr 9, 2026
Backport #102059 to 26.1: Fix crash caused by stale ZooKeeper session in UDF retry loop
@fm4v (Member, Author) commented Apr 9, 2026

@alexey-milovidov I added a test in the first PR #101891, but it was unreliable due to network fault injection timing. It was tested manually on a custom replica of the affected customer's instance.

@robot-clickhouse-ci-2 added the pr-backports-created label (Backport PRs are successfully created; it won't be processed by the CI script anymore) Apr 9, 2026

Labels

pr-backports-created, pr-critical-bugfix, pr-must-backport, pr-must-backport-synced, pr-synced-to-cloud, v25.10-must-backport, v25.12-must-backport, v26.2-must-backport, v26.3-must-backport
