
Fix DROP TABLE hang on Kafka tables after consumer heartbeat error #100388

Merged

alexey-milovidov merged 3 commits into master from fix-kafka-consumer-hang-drop-table on Mar 24, 2026

Conversation

@alexey-milovidov (Member):

When a Kafka consumer experiences a heartbeat error followed by a rebalance, DROP TABLE can hang indefinitely. The shutdown sequence waits for the streaming task to finish, but the task is stuck inside CompletedPipelineExecutor::execute because consumers are blocked in librdkafka's poll_batch during the rebalance — and the pipeline has no external cancellation mechanism.

Use CompletedPipelineExecutor::setCancelCallback so the pipeline can be cancelled promptly when shutdown_called is set. The executor polls the callback every 100 ms and cancels the pipeline as soon as it returns true, letting streamToViews return quickly so the shutdown sequence can proceed.
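
A minimal sketch of how this looks inside streamToViews, based on the hunk quoted in the review thread below; the CompletedPipelineExecutor construction and block_io are illustrative context, not part of the quoted change:

```cpp
// Inside StorageKafka::streamToViews(), once the insert pipeline is built.
// `block_io` holding the pipeline is illustrative context; the
// setCancelCallback line matches the change described above.
CompletedPipelineExecutor executor(block_io.pipeline);

/// Without this, DROP TABLE can hang waiting for the pipeline to finish
/// naturally, which may take a very long time if consumers are stuck in a
/// rebalance after a heartbeat error.
executor.setCancelCallback([this]() { return shutdown_called.load(); }, 100);

executor.execute();
```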

Skip offset commits after cancellation: the cancelled pipeline may not have fully written data to dependent views, so committing offsets would cause data loss. The consumer is marked as dirty instead, and offsets remain uncommitted.
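
A corresponding sketch of the commit gating after execute() returns; the loop over sources is illustrative, while the shutdown_called guard matches the hunk quoted in the review thread:

```cpp
// After the pipeline has finished (or was cancelled by the callback above),
// decide whether the consumed offsets may be committed.
for (auto & source : sources)
{
    /// The cancelled pipeline may not have fully written data to dependent views,
    /// so committing offsets would cause data loss. The consumer will be marked
    /// as dirty and offsets will remain uncommitted.
    if (!shutdown_called)
        source->commit();
}
```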

CI report: https://s3.amazonaws.com/clickhouse-test-reports/json.html?REF=master&sha=933424a0835a809944686507ee5c8a40f5d440e4&name_0=MasterCI&name_1=Integration%20tests%20%28amd_asan_ubsan%2C%20db%20disk%2C%20old%20analyzer%2C%202%2F6%29

Changelog category (leave one):

  • Bug Fix (user-visible misbehavior in an official stable release)

Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):

Fix DROP TABLE hanging indefinitely on Kafka engine tables when consumers are stuck in a rebalance after a heartbeat error.

Documentation entry for user-facing changes

  • Documentation is written (mandatory for new features)


clickhouse-gh Bot commented Mar 22, 2026

Workflow [PR], commit [10112ad]

AI Review

Summary

This PR addresses DROP TABLE hangs for Kafka tables by adding pipeline cancellation and skipping offset commits during shutdown. The fix direction is good, but there is one blocker: the new shutdown guard around source->commit() has a race and can still commit offsets after shutdown begins, which can reintroduce data-loss behavior the patch intends to prevent.

Missing context

  • ⚠️ No CI logs or test artifacts were reviewed beyond the PR description, so runtime validation of the new shutdown/cancellation path is not confirmed.

Findings

  • ❌ Blockers
    • [src/Storages/Kafka/StorageKafka.cpp:738] TOCTOU race in if (!shutdown_called) source->commit();.
      shutdown_called can flip to true immediately after the check, and commit may still execute after cancellation starts. That can commit offsets for data not fully flushed to dependent views.
      Suggested fix: track cancellation outcome for this execution (e.g., local cancelled_by_shutdown flag set by cancel callback) and skip commit based on that state instead of a late atomic read.
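
A hypothetical sketch of that suggestion, assuming the cancel callback and the commit loop both live in streamToViews; the cancelled_by_shutdown flag and the surrounding structure are the reviewer's proposal, not code from this PR:

```cpp
// Remember whether this particular execution was cancelled by shutdown and
// gate the commit on that local state, instead of re-reading shutdown_called
// after the fact (which can flip between the check and the commit).
std::atomic<bool> cancelled_by_shutdown{false};

executor.setCancelCallback(
    [this, &cancelled_by_shutdown]()
    {
        if (!shutdown_called.load())
            return false;
        cancelled_by_shutdown.store(true);
        return true;
    },
    /* interval_ms = */ 100);

executor.execute();

for (auto & source : sources)
{
    /// Commit only if this execution completed without being cancelled by shutdown.
    if (!cancelled_by_shutdown.load())
        source->commit();
}
```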

Tests

  • ⚠️ Add a regression test that starts Kafka ingestion with attached MVs, triggers shutdown/cancellation while processing, and verifies offsets are not committed in the cancelled batch.

ClickHouse Rules

Item                         Status  Notes
Deletion logging
Serialization versioning
Core-area scrutiny           ⚠️      Core streaming/shutdown path changed; one race remains in commit gating during cancellation.
No test removal
Experimental gate
No magic constants
Backward compatibility       ⚠️      A potential offset commit after shutdown can cause message loss for existing users.
SettingsChangesHistory.cpp
PR metadata quality
Safe rollout                 ⚠️      Rollout is unsafe until the commit/cancellation race is fixed.
Compilation time

Performance & Safety

  • Cancellation is implemented via CompletedPipelineExecutor::setCancelCallback, but safety depends on correctly suppressing the commit for a cancelled execution. The current TOCTOU around shutdown_called leaves a data-loss window.

Final Verdict

  • Status: ⚠️ Request changes
  • Minimum required actions:
    • Fix commit gating so a shutdown-triggered cancellation cannot race into source->commit().
    • Add a regression test covering shutdown cancellation vs offset commit behavior.

clickhouse-gh Bot added the pr-bugfix (Pull request with bugfix, not backported by default) label Mar 22, 2026
/// The cancelled pipeline may not have fully written data to dependent views,
/// so committing offsets would cause data loss. The consumer will be marked
/// as dirty and offsets will remain uncommitted.
if (!shutdown_called)
Member:

I think we should print some logs when shutdown_called is true.
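
For reference, a hedged sketch of what that might look like at the commit gate; LOG_DEBUG and the log member are assumed to be available in this scope, and the message text is illustrative:

```cpp
if (!shutdown_called)
{
    source->commit();
}
else
{
    /// Shutdown started, so this batch's offsets are intentionally left
    /// uncommitted and the consumer will be marked as dirty.
    LOG_DEBUG(log, "Shutdown called while streaming to views, skipping offset commit");
}
```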

Member:

This is a minor comment.

tuanpach self-assigned this Mar 23, 2026
/// The cancelled pipeline may not have fully written data to dependent views,
/// so committing offsets would cause data loss. The consumer will be marked
/// as dirty and offsets will remain uncommitted.
if (!shutdown_called)
Contributor:

❌ There is a TOCTOU race here: if (!shutdown_called) can still call source->commit() if shutdown_called flips to true right after the check.

That re-introduces the data-loss case this patch is trying to avoid (offsets get committed even though shutdown cancellation has started and the pipeline may not have fully flushed to MVs).

Please gate the commit on a local cancelled_by_shutdown flag set from the cancel callback (or another state that reflects whether this execute() call was actually cancelled), instead of a late read of shutdown_called.


clickhouse-gh Bot commented Mar 24, 2026

LLVM Coverage Report

Metric     Baseline  Current  Δ
Lines      84.10%    84.00%   -0.10%
Functions  24.50%    24.70%   +0.20%
Branches   76.60%    76.60%   +0.00%

PR changed-lines coverage: 100.00% (16/16, 0 noise lines excluded)
Diff coverage report
Uncovered code

alexey-milovidov added this pull request to the merge queue Mar 24, 2026
Merged via the queue into master with commit d3fa644 Mar 24, 2026
152 checks passed
alexey-milovidov deleted the fix-kafka-consumer-hang-drop-table branch March 24, 2026 18:08
robot-clickhouse-ci-1 added the pr-synced-to-cloud (The PR is synced to the cloud repo) label Mar 24, 2026
/// Without this, DROP TABLE can hang waiting for the pipeline to finish naturally,
/// which may take a very long time if consumers are stuck in a rebalance after
/// a heartbeat error.
executor.setCancelCallback([this]() { return shutdown_called.load(); }, 100);
Member:

While working on #101217

I realized that this may not be a good enough fix:

  • first, it may create duplicates, because commit was not called
  • secondly, this check is racy

Maybe it is better to revert this and fix Kafka differently, since after this change the tests fail more often. @alexey-milovidov @tuanpach @antaljanosbenjamin WDYT?

Also, maybe this will help with fixing DROP: #100612

I will prepare a revert for now: #101646
