Fix DROP TABLE hang on Kafka tables after consumer heartbeat error#100388
alexey-milovidov merged 3 commits into master
Conversation
When a Kafka consumer experiences a heartbeat error followed by a rebalance, `DROP TABLE` can hang indefinitely. The shutdown sequence waits for the streaming task to finish, but the task is stuck inside `CompletedPipelineExecutor::execute` because consumers are blocked in librdkafka's `poll_batch` during the rebalance, and the pipeline has no external cancellation mechanism.

Use `CompletedPipelineExecutor::setCancelCallback` to allow the pipeline to be cancelled promptly when `shutdown_called` is set. The callback checks every 100 ms and triggers pipeline cancellation, letting `streamToViews` return quickly so the shutdown sequence can proceed.

Skip offset commits after cancellation: the cancelled pipeline may not have fully written data to dependent views, so committing offsets would cause data loss. The consumer is marked as dirty instead, and offsets remain uncommitted.

CI report: https://s3.amazonaws.com/clickhouse-test-reports/json.html?REF=master&sha=933424a0835a809944686507ee5c8a40f5d440e4&name_0=MasterCI&name_1=Integration%20tests%20%28amd_asan_ubsan%2C%20db%20disk%2C%20old%20analyzer%2C%202%2F6%29

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
/// The cancelled pipeline may not have fully written data to dependent views,
/// so committing offsets would cause data loss. The consumer will be marked
/// as dirty and offsets will remain uncommitted.
if (!shutdown_called)
I think we should log a message when `shutdown_called` is true, so the skipped commit is visible.
/// The cancelled pipeline may not have fully written data to dependent views,
/// so committing offsets would cause data loss. The consumer will be marked
/// as dirty and offsets will remain uncommitted.
if (!shutdown_called)
❌ There is a TOCTOU race here: `if (!shutdown_called)` can still call `source->commit()` if `shutdown_called` flips to true right after the check.
That re-introduces the data-loss case this patch is trying to avoid (committing offsets even though shutdown cancellation has started and the pipeline may not have fully flushed to the materialized views).
Please gate the commit on a local `cancelled_by_shutdown` flag set from the cancel callback (or on other state that reflects whether this `execute` was actually cancelled), instead of on a late read of `shutdown_called`.
LLVM Coverage Report
PR changed-lines coverage: 100.00% (16/16, 0 noise lines excluded)
/// Without this, DROP TABLE can hang waiting for the pipeline to finish naturally,
/// which may take a very long time if consumers are stuck in a rebalance after
/// a heartbeat error.
executor.setCancelCallback([this]() { return shutdown_called.load(); }, 100);
While working on #101217
I realized that this may not be a good enough fix:
- first, it may create duplicates, because commit was not called
- secondly, this check is racy

Maybe it is better to revert this and fix Kafka differently, since after this change the tests fail more often. @alexey-milovidov @tuanpach @antaljanosbenjamin WDYT?
Also, maybe this will help with fixing DROP - #100612
I will prepare a revert for now: #101646
Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):
Fix `DROP TABLE` hanging indefinitely on Kafka engine tables when consumers are stuck in a rebalance after a heartbeat error.

Documentation entry for user-facing changes