
Fix "Unexpected refresh lock in keeper" exception in RefreshTask #103849

Open
groeneai wants to merge 1 commit into ClickHouse:master from
groeneai:groeneai/fix-refresh-task-unexpected-keeper-lock

Conversation

@groeneai
Contributor

@groeneai groeneai commented May 1, 2026

Removes the `DEBUG_OR_SANITIZER_BUILD` abort that fired
`Logical error: 'Unexpected refresh lock in keeper'` from
`DB::RefreshTask::refreshTask()` at src/Storages/MaterializedView/RefreshTask.cpp:523.
The recovery branch that was already in place for release builds now runs
unconditionally.

Why this is not a bug in RefreshTask

The assertion was added in #103427 as a defensive guard, but the state it
asserts against (/running exists in Keeper with our replica name in its data,
and this RefreshTask instance is in Scheduling) is reachable through normal
operation:

  1. Server kill+restart inside the Keeper session timeout window. The
    ephemeral /running znode created by the previous Keeper session of this
    server is still observable to the new session until Keeper expires the old
    session (~30s by default). The new RefreshTask instance reads /running
    with our replica's name in its data and wrongly concludes it is itself
    running a refresh. This is the dominant cause in stress tests with
    RandomQueryKiller.

  2. DETACH+ATTACH while Keeper is unreachable during shutdown.
    RefreshTask::shutdown does not proactively call removeRunningZnodeIfMine;
    it relies on the in-flight cleanup path
    updateCoordinationState(running=false). If Keeper is unreachable when
    shutdown runs, the cleanup cannot run, the znode lingers, and on ATTACH the
    new task observes the leftover.
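The session-expiry mechanics behind scenario 1 can be modeled with a toy sketch. This is not ClickHouse or Keeper code: a small Python class stands in for Keeper, and all names are illustrative. The point it demonstrates is that ephemeral znodes die with their *session*, not with the process, so a restarted server (new session) still observes the old session's `/running` until Keeper expires it.

```python
# Toy model of Keeper ephemeral-znode lifetime. A restarted server opens a
# new session, but the previous session's ephemeral znodes stay visible
# until that session expires (~30s by default in real Keeper).
class FakeKeeper:
    def __init__(self):
        self.znodes = {}  # path -> (data, owning session id)

    def create_ephemeral(self, session_id, path, data):
        self.znodes[path] = (data, session_id)

    def get(self, path):
        node = self.znodes.get(path)
        return node[0] if node else None

    def expire_session(self, session_id):
        # Keeper drops all ephemerals owned by the expired session.
        self.znodes = {p: v for p, v in self.znodes.items() if v[1] != session_id}

keeper = FakeKeeper()
keeper.create_ephemeral(session_id=1, path="/running", data="replica_1")
# Server killed and restarted: the new session still sees its own name.
assert keeper.get("/running") == "replica_1"
keeper.expire_session(1)  # old session times out later
assert keeper.get("/running") is None
```

The window between restart and expiry is exactly when the new RefreshTask instance reads the leftover znode and finds its own replica name in it.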

The release-build recovery branch is correct in both cases:

  • removeRunningZnodeIfMine does tryGet + data match against this replica's
    name + tryRemove — all session-agnostic and safe across session boundaries.
  • schedule_keeper_retry requeues a Keeper re-read in 5s; the next iteration
    observes the cleaned-up state and proceeds normally.
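The check-then-remove pattern described above can be sketched in a few lines. This is a hedged stand-in, not the real `removeRunningZnodeIfMine`: a plain dict models Keeper, and the function name and signature are illustrative only.

```python
# Minimal sketch of the session-agnostic cleanup pattern: tryGet, compare the
# stored data against this replica's name, then tryRemove. Safe regardless of
# which session created the znode, because only the data is compared.
def remove_running_znode_if_mine(keeper: dict, path: str, my_replica: str) -> bool:
    data = keeper.get(path)       # tryGet: an absent znode is fine
    if data != my_replica:        # only touch a znode tagged with our own name
        return False
    keeper.pop(path, None)        # tryRemove: idempotent if already gone
    return True

# A leftover from a previous session of the same replica is cleaned up:
keeper = {"/running": "replica_1"}
assert remove_running_znode_if_mine(keeper, "/running", "replica_1")
assert "/running" not in keeper

# A znode owned by another replica is left alone:
keeper = {"/running": "replica_2"}
assert not remove_running_znode_if_mine(keeper, "/running", "replica_1")
assert keeper["/running"] == "replica_2"
```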

This change drops the `#ifdef DEBUG_OR_SANITIZER_BUILD` gate so the recovery
runs in every build, and demotes the misleading `Likely a bug` `LOG_ERROR` to a
`LOG_WARNING` that explains the actual cause.

User-visible behavior on stable (release) builds does not change — release
builds already ran the recovery path. The fix only stops debug/sanitizer
builds from aborting on a state that was always reachable. Per the bot's
review (#103849 (comment)),
the category is therefore CI Fix or Improvement rather than Bug Fix.

Reproduction / regression test

The new integration test
test_refreshable_mv/test.py::test_refresh_recovers_from_stale_running_znode
plants a stale /running znode through a separate Kazoo session — exactly the
way an old Keeper session's leftover would look from the new session — and
verifies SYSTEM REFRESH VIEW + SYSTEM WAIT VIEW complete without server
abort and the leftover znode is cleaned up.

Verified locally in a debug build: test FAILS without this fix
(Logical error: 'Unexpected refresh lock in keeper' aborts node1) and PASSES
with it.
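The test's plant-then-recover flow can be condensed into a runnable stand-in. This is not the actual integration test (which uses a real cluster and a separate Kazoo session): a dict models Keeper, and the function below only mirrors the recovery loop in spirit.

```python
# Stand-in for the regression test's idea: plant a stale /running entry "from
# another session", then check one scheduling pass clears it and the next
# pass refreshes normally. All names are illustrative.
def refresh_iteration(keeper: dict, my_replica: str) -> str:
    """One scheduling pass of a hypothetical refresh task."""
    if keeper.get("/running") == my_replica:
        # Leftover lock from a previous session of this replica: clean it up
        # and requeue (mirrors removeRunningZnodeIfMine + schedule_keeper_retry).
        keeper.pop("/running", None)
        return "retried"
    keeper["/running"] = my_replica   # take the lock
    keeper.pop("/running", None)      # refresh done, release
    return "refreshed"

keeper = {"/running": "replica_1"}    # planted by the "old session"
assert refresh_iteration(keeper, "replica_1") == "retried"
assert "/running" not in keeper       # leftover cleaned up
assert refresh_iteration(keeper, "replica_1") == "refreshed"
```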

CIDB evidence

30-day window: ~22 hits across 7 unrelated PRs (STIDs 2508-3e24 and
2508-4754). Primary failure modes:

| pull_request_number | check_name | hits |
|---|---|---|
| 101757 | Integration tests (amd_asan_ubsan, targeted) | 7 |
| 101757 | Integration tests (amd_msan, 5/6) | 5 |
| 101757 | Integration tests (amd_asan_ubsan, db disk, old analyzer, 5/6) | 4 |
| 90740 | Stress test (amd_msan) | 1 |
| 93757 | Integration tests (amd_msan, 2/6) | 1 |
| 98533 | Stress test (amd_tsan) | 1 |
| 103737 | Stress test (arm_ubsan) | 1 |
| 103347 | Stress test (amd_tsan) | 1 |
| 101757 | Integration tests (amd_tsan, 5/6) | 1 |

Per @alexey-milovidov's directive on
#103737

Changelog category (leave one):

  • CI Fix or Improvement (changelog entry is not required)

Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):

Documentation entry for user-facing changes

  • Documentation is written (mandatory for new features)

The `chassert` / `abortOnFailedAssertion` at RefreshTask.cpp:523 was added in
PR ClickHouse#103427 as a defensive guard, but the state it asserts against is reachable
through normal operation:

1. Server kill+restart inside the Keeper session timeout window: the ephemeral
   `/running` znode created by the previous Keeper session is still observable
   to the new session (with our replica name in its data) until Keeper expires
   the old session, ~30s by default. Dominant cause in stress tests with
   `RandomQueryKiller`.

2. DETACH+ATTACH while Keeper is unreachable during shutdown: `RefreshTask::shutdown`
   does not proactively call `removeRunningZnodeIfMine`; it relies on the
   in-flight cleanup path `updateCoordinationState(running=false)`, which cannot
   run if Keeper is unreachable. After ATTACH the new task observes the
   leftover znode and aborts.

The release-build recovery branch (clear local stale flag, call
`removeRunningZnodeIfMine`, reschedule a Keeper re-read) is correct for both
scenarios — `tryGet` + data match against this replica's name + `tryRemove`
are session-agnostic. This change drops the `#ifdef DEBUG_OR_SANITIZER_BUILD`
gate so the recovery runs unconditionally, and demotes the misleading
`Likely a bug` `LOG_ERROR` to a `LOG_WARNING` that explains the actual cause.

Adds an integration test that plants a stale `/running` znode through a
separate Kazoo session — the same way an old session's leftover would look
from a new session — and verifies `SYSTEM REFRESH VIEW` + `SYSTEM WAIT VIEW`
complete without server abort.

CIDB: ~22 hits across 7 unrelated PRs over 30 days (STIDs 2508-3e24, 2508-4754).

Per @alexey-milovidov directive on
ClickHouse#103737

Session: cron:clickhouse-ci-task-worker:20260501-091500

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@groeneai
Contributor Author

groeneai commented May 1, 2026

Pre-PR validation gate (per the worker procedure):

a) Deterministic repro? Yes — the new integration test plants /running via Kazoo (separate session) and the abort fires 100% of runs against the unfixed binary.

b) Root cause explained? Yes. The chassert was added in #103427 as a defensive guard, but the state it asserts against is reachable in two normal paths: (1) old keeper session's ephemeral /running survives a kill+restart inside the keeper session timeout window, and (2) DETACH+ATTACH while keeper was unreachable during shutdown leaves /running behind. In both cases running_znode_exists=true + last_attempt_replica==me + state=Scheduling is observable on a brand-new task instance — not a bug in RefreshTask itself.

c) Fix matches root cause? Yes. The release-build branch already had a correct recovery (clear local stale flag, `removeRunningZnodeIfMine` — i.e. tryGet + data-match + tryRemove, all session-agnostic — then schedule_keeper_retry). Dropping the `#ifdef DEBUG_OR_SANITIZER_BUILD` gate runs the same recovery in every build. Also demotes the misleading `Likely a bug` `LOG_ERROR` to a `LOG_WARNING` that names the actual cause.

d) Test intent preserved / new tests added? Yes — added test_refresh_recovers_from_stale_running_znode that deterministically triggers the scenario, asserts the refresh completes, and asserts the leftover znode is cleaned up. The existing tests are unchanged.

e) Both directions demonstrated? Yes — confirmed locally with a debug build:

  • Without fix (stale ci/tmp/clickhouse): pytest test_refresh_recovers_from_stale_running_znode fails (1 error in 18.38s); node1 aborts with `Logical error: 'Unexpected refresh lock in keeper'` from `DB::abortOnFailedAssertion`.
  • With fix (re-built binary): pytest test_refresh_recovers_from_stale_running_znode passes (1 passed in 17.93s). The `LOG_WARNING` fires, `removeRunningZnodeIfMine` clears the leftover, and SYSTEM REFRESH VIEW + SYSTEM WAIT VIEW complete cleanly.

f) Fix is general, not a narrow patch? Yes — the fix targets the root cause (the assertion's assumption that this state is impossible) rather than guarding against the symptom. The recovery path handles every code path that can produce the state, not just the two enumerated scenarios. No defensive null/empty guards were added.

Session: cron:clickhouse-ci-task-worker:20260501-091500

@groeneai
Contributor Author

groeneai commented May 1, 2026

cc @al13n321 @tuanpach @alexey-milovidov — could you review this? It removes the Logical error: 'Unexpected refresh lock in keeper' debug-only abort from RefreshTask::refreshTask and unifies it with the existing release-build recovery, which already handles the two real scenarios (kill+restart inside keeper session timeout; DETACH+ATTACH with keeper unreachable). Includes a deterministic regression test (~22 hits across 7 unrelated PRs in 30 days, STIDs 2508-3e24 / 2508-4754). Per @alexey-milovidov directive on #103737.

@al13n321
Member

al13n321 commented May 1, 2026

(I'm changing how this lock works, will likely obsolete this PR soon.)

@groeneai
Contributor Author

groeneai commented May 1, 2026

Thanks @al13n321 — happy to defer to your redesign. A couple of options depending on your timeline:

  • If your refresh-lock redesign is close to landing, leave this open and I'll close it once yours merges (no rush from my side).
  • If it'll take a while, this could land first as a stopgap — the chassert is firing pretty regularly: ~22 hits across 7 unrelated PRs in the last 30 days (STIDs 2508-3e24 / 2508-4754), so dev branches keep tripping it on stress runs and DETACH+ATTACH paths.

Either way works for me. Just let me know if you'd prefer I close this now in favor of a draft of yours, or wait. Not blocking on a reply — flagging so you have the option.

@alexey-milovidov added the `can be tested` label (Allows running workflows for external contributors) on May 1, 2026
@clickhouse-gh
Contributor

clickhouse-gh Bot commented May 1, 2026

Workflow [PR], commit [0a7174c]

Summary:

| job_name | test_name | status | info |
|---|---|---|---|
| Bugfix validation (integration tests) | | FAIL | |
| | test_refreshable_mv/test.py::test_refreshable_mv_in_replicated_db | FAIL | cidb |
| | test_refreshable_mv/test.py::test_refreshable_mv_in_system_db | FAIL | cidb |
| | test_refreshable_mv/test.py::test_refreshable_mv_in_read_only_node | FAIL | cidb |
| | test_refreshable_mv/test.py::test_refreshable_mv_in_read_only_node_no_ddl | FAIL | cidb |
| | test_refreshable_mv/test.py::test_refresh_vs_shutdown_smoke | FAIL | cidb |
| | test_refreshable_mv/test.py::test_pause | FAIL | cidb |
| | test_refreshable_mv/test.py::test_backup_outer_table | FAIL | cidb |
| | test_refreshable_mv/test.py::test_backup_inner_table | FAIL | cidb |
| | test_refreshable_mv/test.py::test_adding_replica | FAIL | cidb |
| | test_refreshable_mv/test.py::test_replicated_db_startup_race | FAIL | cidb |
| | (2 more test cases not shown) | | |
| Unit tests (msan, function_prop_fuzzer) | | FAIL | |
| Upgrade check (amd_release) | | FAIL | |
| | Error message in clickhouse-server.log (see upgrade_error_messages.txt) | FAIL | cidb |

AI Review

Summary

This PR removes a debug/sanitizer-only abortOnFailedAssertion path in RefreshTask::refreshTask and always uses the existing stale-/running recovery logic, with an integration regression test that reproduces and validates recovery from a leftover keeper lock state. The code changes themselves look correct and targeted, but the PR metadata currently uses a changelog category that does not match the actual user impact.

PR Metadata
  • ⚠️ Changelog category is not semantically correct: this change affects debug/sanitizer behavior and CI/stress reliability, not user-visible behavior in an official stable release.
  • Changelog entry is present and specific, but it should be paired with a matching category.
  • Suggested replacement category: CI Fix or Improvement (changelog entry is not required)
ClickHouse Rules
| Item | Status | Notes |
|---|---|---|
| Deletion logging | | |
| Serialization versioning | | |
| Core-area scrutiny | | |
| No test removal | | |
| Experimental gate | | |
| No magic constants | | |
| Backward compatibility | | |
| SettingsChangesHistory.cpp | | |
| PR metadata quality | ⚠️ | Changelog category should be CI Fix or Improvement, not Bug Fix (user-visible misbehavior in an official stable release) |
| Safe rollout | | |
| Compilation time | | |
| No large/binary files | | |
Final Verdict
  • Status: ⚠️ Request changes
  • Minimum required actions:
    • Update PR template Changelog category to CI Fix or Improvement (changelog entry is not required).

clickhouse-gh Bot added the `pr-bugfix` label (Pull request with bugfix, not backported by default) on May 1, 2026
```cpp
#ifdef DEBUG_OR_SANITIZER_BUILD
abortOnFailedAssertion("Unexpected refresh lock in keeper");
#else
/// The `/running` znode is ephemeral and tagged with this replica's name in
```
Contributor


This change removes a DEBUG_OR_SANITIZER_BUILD-only abort and makes debug/sanitizer builds follow the same recovery path that release builds already used. Because user-visible behavior in stable builds does not change, the PR template Changelog category should be CI Fix or Improvement (or Build/Testing/Packaging Improvement) rather than Bug Fix (user-visible misbehavior in an official stable release).

Contributor Author


Thanks @clickhouse-gh — you're right. Updated the PR description: changelog category is now CI Fix or Improvement (changelog entry is not required) and the changelog entry is removed.

Reasoning: release builds already ran the recovery path; only debug/sanitizer builds were aborting on a state that was always reachable. So the user-visible behavior on stable builds does not change — this only stops master CI sanitizer/debug stress runs from aborting on a state the recovery path was always meant to handle.

@clickhouse-gh
Contributor

clickhouse-gh Bot commented May 1, 2026

LLVM Coverage Report

| Metric | Baseline | Current | Δ |
|---|---|---|---|
| Lines | 84.10% | 84.10% | +0.00% |
| Functions | 91.10% | 91.10% | +0.00% |
| Branches | 76.60% | 76.60% | +0.00% |

Changed lines: 11.11% (1/9) · Uncovered code

Full report · Diff report

