Skip to content

Fix deadlock with CancellationChecker#84203

Merged
antonio2368 merged 2 commits intomasterfrom
fix-cancellation-checker-deadlock
Jul 28, 2025
Merged

Fix deadlock with CancellationChecker#84203
antonio2368 merged 2 commits intomasterfrom
fix-cancellation-checker-deadlock

Conversation

@antonio2368
Copy link
Copy Markdown
Member

Changelog category (leave one):

  • Bug Fix (user-visible misbehavior in an official stable release)

Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):

Fix deadlock caused by background cancellation checker thread.

Deadlock can happen in following situation:
T1: ProcessListEntry in its destructor takes ProcessList::mutex and wants to mark task as done in CancellationChecker but CancellationChecker::mutex is taken by T2.
T2: CancellationChecker::workerFunction took CancellationChecker::mutex and started cancelling tasks. Cancellation waits on ExecutingGraph::mutex to cancel the task but it's taken by T3.
T3: ExecutingGraph::cancel was called and it took ExecutingGraph::mutex. Inside cancel, OverCommitTracker is triggered which tries to take ProcessList::mutex but it's already taken by T1.

Documentation entry for user-facing changes

  • Documentation is written (mandatory for new features)

if (!tasks_to_cancel.empty())
{
lock.unlock();
std::ranges::for_each(tasks_to_cancel, cancelTask);
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

here is the fix, we cancel tasks without lock

@clickhouse-gh
Copy link
Copy Markdown
Contributor

clickhouse-gh bot commented Jul 22, 2025

Workflow [PR], commit [c90cc4e]

Summary:

job_name test_name status info comment
Unit tests (tsan) error
Stress test (amd_debug) failure
Server died FAIL
Hung check failed, possible deadlock found (see hung_check.log) FAIL
Killed by signal (in clickhouse-server.log) FAIL
Fatal message in clickhouse-server.log (see fatal_messages.txt) FAIL
Killed by signal (output files) FAIL
Found signal in gdb.log FAIL
Finish Workflow failure
python3 ./ci/jobs/scripts/workflow_hooks/new_tests_check.py failure

@clickhouse-gh clickhouse-gh bot added the pr-bugfix Pull request with bugfix, not backported by default label Jul 22, 2025
@antonio2368 antonio2368 requested a review from yariks5s July 24, 2025 12:54
@antonio2368 antonio2368 added the pr-must-backport Pull request should be backported intentionally. Use this label with great care! label Jul 25, 2025
@antonio2368 antonio2368 added this pull request to the merge queue Jul 28, 2025
Merged via the queue into master with commit 90e78ab Jul 28, 2025
241 of 246 checks passed
@antonio2368 antonio2368 deleted the fix-cancellation-checker-deadlock branch July 28, 2025 08:04
@robot-ch-test-poll robot-ch-test-poll added pr-backports-created-cloud deprecated label, NOOP pr-must-backport-synced The `*-must-backport` labels are synced into the cloud Sync PR labels Jul 28, 2025
robot-ch-test-poll4 added a commit that referenced this pull request Jul 28, 2025
Cherry pick #84203 to 25.7: Fix deadlock with CancellationChecker
@robot-clickhouse-ci-1 robot-clickhouse-ci-1 added the pr-synced-to-cloud The PR is synced to the cloud repo label Jul 28, 2025
robot-ch-test-poll added a commit that referenced this pull request Jul 28, 2025
Cherry pick #84203 to 25.3: Fix deadlock with CancellationChecker
robot-ch-test-poll added a commit that referenced this pull request Jul 28, 2025
Cherry pick #84203 to 25.5: Fix deadlock with CancellationChecker
robot-ch-test-poll added a commit that referenced this pull request Jul 28, 2025
Cherry pick #84203 to 25.6: Fix deadlock with CancellationChecker
@robot-ch-test-poll robot-ch-test-poll added the pr-backports-created Backport PRs are successfully created, it won't be processed by CI script anymore label Jul 28, 2025
clickhouse-gh bot added a commit that referenced this pull request Jul 28, 2025
Backport #84203 to 25.6: Fix deadlock with CancellationChecker
antonio2368 added a commit that referenced this pull request Jul 29, 2025
Backport #84203 to 25.7: Fix deadlock with CancellationChecker
antonio2368 added a commit that referenced this pull request Aug 11, 2025
Backport #84203 to 25.3: Fix deadlock with CancellationChecker
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

pr-backports-created Backport PRs are successfully created, it won't be processed by CI script anymore pr-backports-created-cloud deprecated label, NOOP pr-bugfix Pull request with bugfix, not backported by default pr-must-backport Pull request should be backported intentionally. Use this label with great care! pr-must-backport-synced The `*-must-backport` labels are synced into the cloud Sync PR pr-synced-to-cloud The PR is synced to the cloud repo

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants