Fix crash caused by stale ZooKeeper session in UDF retry loop#102059
The `ZooKeeperRetriesControl` retry loop introduced in #101891 reused the same expired ZooKeeper session on every retry iteration, without calling `renewZooKeeper` as every other retry loop in the codebase does. This caused repeated requests on a finalized session, leading to crashes (clickhouse-core-incidents #1644, #1645). Remove the broken retry loop. The `tryLoadObject` split-catch that re-throws Keeper hardware errors is sufficient: the exception propagates to `processWatchQueue`'s catch-all, which resets the session and retries from scratch with a fresh connection. This preserves the original fix's goal — `setAllObjects` is never called with a partial set — without the stale-session misuse.
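The propagation path described above can be sketched with a toy model. Everything here is hypothetical (`FakeLoader`, `HardwareError`, `refreshAll`, `processQueue` are illustrative stand-ins, not the ClickHouse classes): a hardware error thrown while loading one object aborts the whole refresh before anything is published, and the outer catch-all counts a session reset and starts over.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a Keeper hardware error (connection loss, expired session).
struct HardwareError : std::runtime_error
{
    using std::runtime_error::runtime_error;
};

// Toy loader; not the real UDF loader.
struct FakeLoader
{
    explicit FakeLoader(int fail_at_call_) : fail_at_call(fail_at_call_) {}

    int fail_at_call;                    // 1-based index of the load call that fails
    int calls = 0;
    int session_resets = 0;
    std::vector<std::string> published;  // what readers see; replaced only as a whole

    std::string loadOne(const std::string & name)
    {
        if (++calls == fail_at_call)
            throw HardwareError("connection loss");
        return name;
    }

    // Builds the full set locally and publishes it only if every load succeeded.
    void refreshAll(const std::vector<std::string> & names)
    {
        std::vector<std::string> loaded;
        for (const auto & name : names)
            loaded.push_back(loadOne(name));  // a hardware error propagates out of here
        published = std::move(loaded);
    }

    // Outer catch-all: on any exception, count a session reset and retry from scratch.
    void processQueue(const std::vector<std::string> & names, int max_rounds)
    {
        assert(max_rounds > 0);
        for (int round = 0; round < max_rounds; ++round)
        {
            try
            {
                refreshAll(names);
                return;
            }
            catch (const std::exception &)
            {
                ++session_resets;
            }
        }
    }
};
```

With `FakeLoader loader(2)` and two names, the first round fails on the second load and publishes nothing; the second round succeeds and publishes both names after one reset.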
|
Workflow [PR], commit [b20f2d8] Summary: ❌
AI Review
Summary: This PR fixes a real reliability issue in UDF refresh: retries now renew ZooKeeper session via …
|
| ZooKeeperRetriesInfo{max_retries, initial_backoff_ms, max_backoff_ms, /*query_status=*/nullptr}); | ||
|
|
||
| retries_ctl.retryLoop([&] | ||
| for (const auto & function_name : object_names) |
Please add a regression test for this path. The original issue was a production exception caused by retrying on an expired Keeper session; this change removes the local retry loop, so we need a test that forces a Keeper hardware error during refreshObjects and verifies recovery happens via processWatchQueue retry (fresh session) without leaving a partial UDF set.
Instead of removing the retry loop entirely, keep it but fix the root cause: renew the ZooKeeper session via zookeeper_getter on each retry iteration (matching the pattern used by backup coordination code). Also move getObjectNamesAndSetWatch inside the loop so that the object list and watches are re-established on the fresh session.
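A minimal sketch of the suggested pattern, under loud assumptions: `FakeSession`, `SessionGetter`, and `refreshWithRenewal` are hypothetical names invented for illustration, and the loop below is a plain `for` loop rather than `ZooKeeperRetriesControl`. It only shows the two points made above: the session is re-fetched before every retry, and the listing call sits inside the loop so it always runs against the current session.

```cpp
#include <cassert>
#include <functional>
#include <memory>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical stand-in for a session-expired error.
struct SessionExpired : std::runtime_error
{
    SessionExpired() : std::runtime_error("session expired") {}
};

// Toy session; not the real ZooKeeper client.
struct FakeSession
{
    int id = 0;
    bool expired = false;

    // Lists object names; throws if the session is already finalized.
    std::vector<std::string> listObjects() const
    {
        if (expired)
            throw SessionExpired();
        return {"f1", "f2"};
    }
};

using SessionPtr = std::shared_ptr<FakeSession>;
using SessionGetter = std::function<SessionPtr()>;

// Retries the listing, asking the getter for a fresh session before every retry.
// `attempts` reports how many iterations were needed.
std::vector<std::string> refreshWithRenewal(
    SessionPtr current, const SessionGetter & getter, int max_retries, int & attempts)
{
    assert(max_retries > 0);
    for (attempts = 1;; ++attempts)
    {
        try
        {
            if (attempts > 1)
                current = getter();         // renew: the previous session may have expired
            return current->listObjects();  // listing happens on the current session
        }
        catch (const SessionExpired &)
        {
            if (attempts >= max_retries)
                throw;
        }
    }
}
```

Starting from an expired session, the first attempt throws, the second attempt obtains a new session from the getter and succeeds. Without the renewal line, every attempt would hit the same finalized session.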
LLVM Coverage Report
Changed lines: 95.24% (20/21) | lost baseline coverage: 2 line(s) · Uncovered code |
|
The MSan stress test failure (MemorySanitizer: use-of-uninitialized-value, STID 4179-5154 or 4148-3044) is a known pre-existing issue unrelated to this PR. Fix: #102158 |
|
@fm4v, our CI checks that every bug fix is accompanied by a test. But you ignored it... Maybe there is still a chance to add a test? |
| { | ||
| /// Renew the session on retry — the previous one may have expired. | ||
| if (retries_ctl.isRetry()) | ||
| current_zookeeper = zookeeper_getter.getZooKeeper().first; |
Other places issue a SYNC to Keeper when a new session is created; we probably need it here as well, since other Keeper nodes may not yet see some objects that were visible from the previous session.
Also, why were this and #101891 merged without review?
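To illustrate the SYNC concern raised above, here is a toy model, not the Keeper protocol and not ClickHouse code: `FakeCluster`, `FollowerSession`, and `seesAfterReconnect` are hypothetical. It assumes a new session may land on a follower that lags behind the leader, and that a sync makes that follower catch up before the first read.

```cpp
#include <cassert>
#include <set>
#include <string>

// Toy model: the leader holds the committed state, the follower may lag behind it.
struct FakeCluster
{
    std::set<std::string> leader_nodes;
    std::set<std::string> follower_nodes;

    void commit(const std::string & path) { leader_nodes.insert(path); }
    void replicate() { follower_nodes = leader_nodes; }
};

// Hypothetical session attached to the (possibly lagging) follower.
class FollowerSession
{
public:
    explicit FollowerSession(FakeCluster & cluster_) : cluster(cluster_) {}

    // Models a SYNC request: afterwards the follower has caught up with the leader.
    void sync() { cluster.replicate(); }

    bool exists(const std::string & path) const { return cluster.follower_nodes.count(path) > 0; }

private:
    FakeCluster & cluster;
};

// Opens a new session and optionally syncs before the first read.
bool seesAfterReconnect(FakeCluster & cluster, const std::string & path, bool sync_first)
{
    FollowerSession session(cluster);
    if (sync_first)
        session.sync();
    return session.exists(path);
}

// Usage: a node committed on the leader is invisible to a fresh session on a
// lagging follower until that session syncs.
inline void syncExample()
{
    FakeCluster cluster;
    cluster.commit("/udf/f1");
    assert(!seesAfterReconnect(cluster, "/udf/f1", /*sync_first=*/false));
    assert(seesAfterReconnect(cluster, "/udf/f1", /*sync_first=*/true));
}
```

In this model, skipping the sync after reconnecting can make the refreshed object list smaller than what the old session saw, which is the stale-read risk the comment points at.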
My bad, I tested it on a single customer replica before merging: https://github.com/ClickHouse/clickhouse-private/pull/54946
SYNC to keeper in case of new session
@fm4v please also check this:
ClickHouse/src/Common/ZooKeeper/ZooKeeper.h
Lines 604 to 618 in b95ea45
Backport #102059 to 26.3: Fix crash caused by stale ZooKeeper session in UDF retry loop
Backport #102059 to 26.2: Fix crash caused by stale ZooKeeper session in UDF retry loop
Backport #102059 to 26.1: Fix crash caused by stale ZooKeeper session in UDF retry loop
|
@alexey-milovidov I've added a test in the first PR #101891, but it was unreliable due to network fault injection timing. Tested manually on a custom replica of the affected customer instance |
Linked issues
The `ZooKeeperRetriesControl` retry loop introduced in #101891 reused the same expired ZooKeeper session on every retry iteration, without calling `renewZooKeeper` as every other retry loop in the codebase does. This caused repeated requests on a finalized session, leading to crashes on canary deploys (https://github.com/ClickHouse/clickhouse-core-incidents/issues/1644, https://github.com/ClickHouse/clickhouse-core-incidents/issues/1645).

Fix: renew the ZooKeeper session via `zookeeper_getter` on each retry iteration (matching the pattern used by backup coordination code), and move `getObjectNamesAndSetWatch` inside the retry loop so that the object list and watches are re-established on the fresh session.

Changelog category (leave one):

Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):
Fix crash in UDF refresh caused by `ZooKeeperRetriesControl` retrying on a stale (expired) ZooKeeper session without renewing it.

Documentation entry for user-facing changes