
Fix "Unexpected refresh lock in keeper" exception in RefreshTask #103849

Open
groeneai wants to merge 1 commit into ClickHouse:master from
groeneai:groeneai/fix-refresh-task-unexpected-keeper-lock

Conversation

@groeneai
Contributor

@groeneai groeneai commented May 1, 2026

Removes the `DEBUG_OR_SANITIZER_BUILD` abort that fired
`Logical error: 'Unexpected refresh lock in keeper'` from
`DB::RefreshTask::refreshTask()` at src/Storages/MaterializedView/RefreshTask.cpp:523.
The recovery branch that was already in place for release builds now runs
unconditionally.

Why this is not a bug in RefreshTask

The assertion was added in #103427 as a defensive guard, but the state it
asserts against (/running exists in Keeper with our replica name in its data,
and this RefreshTask instance is in Scheduling) is reachable through normal
operation:

  1. Server kill+restart inside the Keeper session timeout window. The
    ephemeral /running znode created by the previous Keeper session of this
    server is still observable to the new session until Keeper expires the old
    session (~30s by default). The new RefreshTask instance reads /running
    with our replica's name in its data and wrongly concludes it is itself
    running a refresh. This is the dominant cause in stress tests with
    RandomQueryKiller.

  2. DETACH+ATTACH while Keeper is unreachable during shutdown.
    RefreshTask::shutdown does not proactively call removeRunningZnodeIfMine;
    it relies on the in-flight cleanup path
    updateCoordinationState(running=false). If Keeper is unreachable when
    shutdown runs, the cleanup cannot run, the znode lingers, and on ATTACH the
    new task observes the leftover.
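The session-expiry mechanics behind scenario 1 can be modeled with a toy sketch. This is not ClickHouse or Keeper code: a small Python class stands in for Keeper, and all names are illustrative. The point it demonstrates is that ephemeral znodes die with their *session*, not with the process, so a restarted server (new session) still observes the old session's `/running` until Keeper expires it.

```python
# Toy model of Keeper ephemeral-znode lifetime. A restarted server opens a
# new session, but the previous session's ephemeral znodes stay visible
# until that session expires (~30s by default in real Keeper).
class FakeKeeper:
    def __init__(self):
        self.znodes = {}  # path -> (data, owning session id)

    def create_ephemeral(self, session_id, path, data):
        self.znodes[path] = (data, session_id)

    def get(self, path):
        node = self.znodes.get(path)
        return node[0] if node else None

    def expire_session(self, session_id):
        # Keeper drops all ephemerals owned by the expired session.
        self.znodes = {p: v for p, v in self.znodes.items() if v[1] != session_id}

keeper = FakeKeeper()
keeper.create_ephemeral(session_id=1, path="/running", data="replica_1")
# Server killed and restarted: the new session still sees its own name.
assert keeper.get("/running") == "replica_1"
keeper.expire_session(1)  # old session times out later
assert keeper.get("/running") is None
```

The window between restart and expiry is exactly when the new RefreshTask instance reads the leftover znode and finds its own replica name in it.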

The release-build recovery branch is correct in both cases:

  • removeRunningZnodeIfMine does tryGet + data match against this replica's
    name + tryRemove — all session-agnostic and safe across session boundaries.
  • schedule_keeper_retry requeues a Keeper re-read in 5s; the next iteration
    observes the cleaned-up state and proceeds normally.
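The check-then-remove pattern described above can be sketched in a few lines. This is a hedged stand-in, not the real `removeRunningZnodeIfMine`: a plain dict models Keeper, and the function name and signature are illustrative only.

```python
# Minimal sketch of the session-agnostic cleanup pattern: tryGet, compare the
# stored data against this replica's name, then tryRemove. Safe regardless of
# which session created the znode, because only the data is compared.
def remove_running_znode_if_mine(keeper: dict, path: str, my_replica: str) -> bool:
    data = keeper.get(path)       # tryGet: an absent znode is fine
    if data != my_replica:        # only touch a znode tagged with our own name
        return False
    keeper.pop(path, None)        # tryRemove: idempotent if already gone
    return True

# A leftover from a previous session of the same replica is cleaned up:
keeper = {"/running": "replica_1"}
assert remove_running_znode_if_mine(keeper, "/running", "replica_1")
assert "/running" not in keeper

# A znode owned by another replica is left alone:
keeper = {"/running": "replica_2"}
assert not remove_running_znode_if_mine(keeper, "/running", "replica_1")
assert keeper["/running"] == "replica_2"
```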

This change drops the `#ifdef DEBUG_OR_SANITIZER_BUILD` gate so the recovery
runs in every build, and demotes the misleading `Likely a bug` `LOG_ERROR` to a
`LOG_WARNING` that explains the actual cause.

User-visible behavior on stable (release) builds does not change — release
builds already ran the recovery path. The fix only stops debug/sanitizer
builds from aborting on a state that was always reachable. Per the bot's
review (#103849 (comment)),
the category is therefore CI Fix or Improvement rather than Bug Fix.

Reproduction / regression test

The new integration test
test_refreshable_mv/test.py::test_refresh_recovers_from_stale_running_znode
plants a stale /running znode through a separate Kazoo session — exactly the
way an old Keeper session's leftover would look from the new session — and
verifies SYSTEM REFRESH VIEW + SYSTEM WAIT VIEW complete without server
abort and the leftover znode is cleaned up.

Verified locally in a debug build: test FAILS without this fix
(Logical error: 'Unexpected refresh lock in keeper' aborts node1) and PASSES
with it.
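The test's plant-then-recover flow can be condensed into a runnable stand-in. This is not the actual integration test (which uses a real cluster and a separate Kazoo session): a dict models Keeper, and the function below only mirrors the recovery loop in spirit.

```python
# Stand-in for the regression test's idea: plant a stale /running entry "from
# another session", then check one scheduling pass clears it and the next
# pass refreshes normally. All names are illustrative.
def refresh_iteration(keeper: dict, my_replica: str) -> str:
    """One scheduling pass of a hypothetical refresh task."""
    if keeper.get("/running") == my_replica:
        # Leftover lock from a previous session of this replica: clean it up
        # and requeue (mirrors removeRunningZnodeIfMine + schedule_keeper_retry).
        keeper.pop("/running", None)
        return "retried"
    keeper["/running"] = my_replica   # take the lock
    keeper.pop("/running", None)      # refresh done, release
    return "refreshed"

keeper = {"/running": "replica_1"}    # planted by the "old session"
assert refresh_iteration(keeper, "replica_1") == "retried"
assert "/running" not in keeper       # leftover cleaned up
assert refresh_iteration(keeper, "replica_1") == "refreshed"
```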

CIDB evidence

30-day window: ~22 hits across 7 unrelated PRs (STIDs 2508-3e24 and
2508-4754). Primary failure modes:

| pull_request_number | check_name | hits |
|---|---|---|
| 101757 | Integration tests (amd_asan_ubsan, targeted) | 7 |
| 101757 | Integration tests (amd_msan, 5/6) | 5 |
| 101757 | Integration tests (amd_asan_ubsan, db disk, old analyzer, 5/6) | 4 |
| 90740 | Stress test (amd_msan) | 1 |
| 93757 | Integration tests (amd_msan, 2/6) | 1 |
| 98533 | Stress test (amd_tsan) | 1 |
| 103737 | Stress test (arm_ubsan) | 1 |
| 103347 | Stress test (amd_tsan) | 1 |
| 101757 | Integration tests (amd_tsan, 5/6) | 1 |

Per @alexey-milovidov's directive on
#103737

Changelog category (leave one):

  • CI Fix or Improvement (changelog entry is not required)

Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):

Documentation entry for user-facing changes

  • Documentation is written (mandatory for new features)

The `chassert` / `abortOnFailedAssertion` at RefreshTask.cpp:523 was added in
PR ClickHouse#103427 as a defensive guard, but the state it asserts against is reachable
through normal operation:

1. Server kill+restart inside the Keeper session timeout window: the ephemeral
   `/running` znode created by the previous Keeper session is still observable
   to the new session (with our replica name in its data) until Keeper expires
   the old session, ~30s by default. Dominant cause in stress tests with
   `RandomQueryKiller`.

2. DETACH+ATTACH while Keeper is unreachable during shutdown: `RefreshTask::shutdown`
   does not proactively call `removeRunningZnodeIfMine`; it relies on the
   in-flight cleanup path `updateCoordinationState(running=false)`, which cannot
   run if Keeper is unreachable. After ATTACH the new task observes the
   leftover znode and aborts.

The release-build recovery branch (clear local stale flag, call
`removeRunningZnodeIfMine`, reschedule a Keeper re-read) is correct for both
scenarios — `tryGet` + data match against this replica's name + `tryRemove`
are session-agnostic. This change drops the `#ifdef DEBUG_OR_SANITIZER_BUILD`
gate so the recovery runs unconditionally, and demotes the misleading
`Likely a bug` `LOG_ERROR` to a `LOG_WARNING` that explains the actual cause.

Adds an integration test that plants a stale `/running` znode through a
separate Kazoo session — the same way an old session's leftover would look
from a new session — and verifies `SYSTEM REFRESH VIEW` + `SYSTEM WAIT VIEW`
complete without server abort.

CIDB: ~22 hits across 7 unrelated PRs over 30 days (STIDs 2508-3e24, 2508-4754).

Per @alexey-milovidov directive on
ClickHouse#103737

Session: cron:clickhouse-ci-task-worker:20260501-091500

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@groeneai
Contributor Author

groeneai commented May 1, 2026

Pre-PR validation gate (per the worker procedure):

a) Deterministic repro? Yes — the new integration test plants /running via Kazoo (separate session) and the abort fires 100% of runs against the unfixed binary.

b) Root cause explained? Yes. The chassert was added in #103427 as a defensive guard, but the state it asserts against is reachable in two normal paths: (1) old keeper session's ephemeral /running survives a kill+restart inside the keeper session timeout window, and (2) DETACH+ATTACH while keeper was unreachable during shutdown leaves /running behind. In both cases running_znode_exists=true + last_attempt_replica==me + state=Scheduling is observable on a brand-new task instance — not a bug in RefreshTask itself.

c) Fix matches root cause? Yes. The release-build branch already had a correct recovery (clear local stale flag, `removeRunningZnodeIfMine` — i.e. tryGet + data-match + tryRemove, all session-agnostic — then schedule_keeper_retry). Dropping the `#ifdef DEBUG_OR_SANITIZER_BUILD` gate runs the same recovery in every build. Also demotes the misleading `Likely a bug` `LOG_ERROR` to a `LOG_WARNING` that names the actual cause.

d) Test intent preserved / new tests added? Yes — added test_refresh_recovers_from_stale_running_znode that deterministically triggers the scenario, asserts the refresh completes, and asserts the leftover znode is cleaned up. The existing tests are unchanged.

e) Both directions demonstrated? Yes — confirmed locally with a debug build:

  • Without fix (stale ci/tmp/clickhouse): pytest test_refresh_recovers_from_stale_running_znode fails (1 error in 18.38s); node1 aborts with `Logical error: 'Unexpected refresh lock in keeper'` from `DB::abortOnFailedAssertion`.
  • With fix (re-built binary): pytest test_refresh_recovers_from_stale_running_znode passes (1 passed in 17.93s). The `LOG_WARNING` fires, `removeRunningZnodeIfMine` clears the leftover, and SYSTEM REFRESH VIEW + SYSTEM WAIT VIEW complete cleanly.

f) Fix is general, not a narrow patch? Yes — the fix targets the root cause (the assertion's assumption that this state is impossible) rather than guarding against the symptom. The recovery path handles every code path that can produce the state, not just the two enumerated scenarios. No defensive null/empty guards were added.

Session: cron:clickhouse-ci-task-worker:20260501-091500

@groeneai
Contributor Author

groeneai commented May 1, 2026

cc @al13n321 @tuanpach @alexey-milovidov — could you review this? It removes the Logical error: 'Unexpected refresh lock in keeper' debug-only abort from RefreshTask::refreshTask and unifies it with the existing release-build recovery, which already handles the two real scenarios (kill+restart inside keeper session timeout; DETACH+ATTACH with keeper unreachable). Includes a deterministic regression test (~22 hits across 7 unrelated PRs in 30 days, STIDs 2508-3e24 / 2508-4754). Per @alexey-milovidov directive on #103737.

@al13n321
Member

al13n321 commented May 1, 2026

(I'm changing how this lock works, will likely obsolete this PR soon.)

@groeneai
Contributor Author

groeneai commented May 1, 2026

Thanks @al13n321 — happy to defer to your redesign. A couple of options depending on your timeline:

  • If your refresh-lock redesign is close to landing, leave this open and I'll close it once yours merges (no rush from my side).
  • If it'll take a while, this could land first as a stopgap — the chassert is firing pretty regularly: ~22 hits across 7 unrelated PRs in the last 30 days (STIDs 2508-3e24 / 2508-4754), so dev branches keep tripping it on stress runs and DETACH+ATTACH paths.

Either way works for me. Just let me know if you'd prefer I close this now in favor of a draft of yours, or wait. Not blocking on a reply — flagging so you have the option.

@alexey-milovidov added the `can be tested` label (Allows running workflows for external contributors) on May 1, 2026
@clickhouse-gh
Contributor

clickhouse-gh Bot commented May 1, 2026

Workflow [PR], commit [0a7174c]

Summary:

| job_name | test_name | status | info |
|---|---|---|---|
| Bugfix validation (integration tests) | | FAIL | |
| | test_refreshable_mv/test.py::test_refreshable_mv_in_replicated_db | FAIL | cidb |
| | test_refreshable_mv/test.py::test_refreshable_mv_in_system_db | FAIL | cidb |
| | test_refreshable_mv/test.py::test_refreshable_mv_in_read_only_node | FAIL | cidb |
| | test_refreshable_mv/test.py::test_refreshable_mv_in_read_only_node_no_ddl | FAIL | cidb |
| | test_refreshable_mv/test.py::test_refresh_vs_shutdown_smoke | FAIL | cidb |
| | test_refreshable_mv/test.py::test_pause | FAIL | cidb |
| | test_refreshable_mv/test.py::test_backup_outer_table | FAIL | cidb |
| | test_refreshable_mv/test.py::test_backup_inner_table | FAIL | cidb |
| | test_refreshable_mv/test.py::test_adding_replica | FAIL | cidb |
| | test_refreshable_mv/test.py::test_replicated_db_startup_race | FAIL | cidb |
| | (2 more test cases not shown) | | |
| Unit tests (msan, function_prop_fuzzer) | | FAIL | |
| Upgrade check (amd_release) | | FAIL | |
| | Error message in clickhouse-server.log (see upgrade_error_messages.txt) | FAIL | cidb |

AI Review

Summary

This PR removes a debug/sanitizer-only abortOnFailedAssertion path in RefreshTask::refreshTask and always uses the existing stale-/running recovery logic, with an integration regression test that reproduces and validates recovery from a leftover keeper lock state. The code changes themselves look correct and targeted, but the PR metadata currently uses a changelog category that does not match the actual user impact.

PR Metadata
  • ⚠️ Changelog category is not semantically correct: this change affects debug/sanitizer behavior and CI/stress reliability, not user-visible behavior in an official stable release.
  • Changelog entry is present and specific, but it should be paired with a matching category.
  • Suggested replacement category: CI Fix or Improvement (changelog entry is not required)
ClickHouse Rules
| Item | Status | Notes |
|---|---|---|
| Deletion logging | | |
| Serialization versioning | | |
| Core-area scrutiny | | |
| No test removal | | |
| Experimental gate | | |
| No magic constants | | |
| Backward compatibility | | |
| SettingsChangesHistory.cpp | | |
| PR metadata quality | ⚠️ | Changelog category should be CI Fix or Improvement, not Bug Fix (user-visible misbehavior in an official stable release) |
| Safe rollout | | |
| Compilation time | | |
| No large/binary files | | |
Final Verdict
  • Status: ⚠️ Request changes
  • Minimum required actions:
    • Update PR template Changelog category to CI Fix or Improvement (changelog entry is not required).

clickhouse-gh Bot added the `pr-bugfix` label (Pull request with bugfix, not backported by default) on May 1, 2026
```cpp
#ifdef DEBUG_OR_SANITIZER_BUILD
abortOnFailedAssertion("Unexpected refresh lock in keeper");
#else
/// The `/running` znode is ephemeral and tagged with this replica's name in
```
Contributor


This change removes a DEBUG_OR_SANITIZER_BUILD-only abort and makes debug/sanitizer builds follow the same recovery path that release builds already used. Because user-visible behavior in stable builds does not change, the PR template Changelog category should be CI Fix or Improvement (or Build/Testing/Packaging Improvement) rather than Bug Fix (user-visible misbehavior in an official stable release).

Contributor Author


Thanks @clickhouse-gh — you're right. Updated the PR description: changelog category is now CI Fix or Improvement (changelog entry is not required) and the changelog entry is removed.

Reasoning: release builds already ran the recovery path; only debug/sanitizer builds were aborting on a state that was always reachable. So the user-visible behavior on stable builds does not change — this only stops master CI sanitizer/debug stress runs from aborting on a state the recovery path was always meant to handle.

@clickhouse-gh
Contributor

clickhouse-gh Bot commented May 1, 2026

LLVM Coverage Report

| Metric | Baseline | Current | Δ |
|---|---|---|---|
| Lines | 84.10% | 84.10% | +0.00% |
| Functions | 91.10% | 91.10% | +0.00% |
| Branches | 76.60% | 76.60% | +0.00% |

Changed lines: 11.11% (1/9) · Uncovered code

Full report · Diff report

