Start Redpanda in stress tests, ignore StorageKafka errors in upgrade check #102287
Merged
alexey-milovidov merged 7 commits into master on Apr 26, 2026
The upgrade check environment has no Kafka broker, so `StorageKafka` tables left behind by stress tests produce spurious librdkafka connection errors (`[rdk:FAIL]`, `[rdk:ERROR]`). The existing `Connection refused` filter only matches ClickHouse's `Code: 1000` format, not librdkafka's native error format.

This was observed as a flaky failure in PR #100701:
https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=100701&sha=ad1433a4eedc1984573d106f2def96a41b52e564&name_0=PR&name_1=Upgrade%20check%20%28amd_release%29

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
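To illustrate why a filter tied to the `Code: 1000` format misses librdkafka output, here is a minimal sketch. The regular expression is a hypothetical stand-in for the existing filter (the actual pattern is not shown in this PR), and both log lines are made up for illustration rather than copied from a server log.

```shell
#!/bin/sh
# Hypothetical stand-in for the existing filter: it recognizes
# "Connection refused" only after ClickHouse's "Code: 1000" prefix.
existing_filter_matches() {
    printf '%s\n' "$1" | grep -q -E 'Code: 1000.*Connection refused'
}

# Illustrative lines only, not verbatim server output.
clickhouse_style='Code: 1000. DB::Exception: Connection refused'
librdkafka_style='StorageKafka (t): [rdk:FAIL] connect to 127.0.0.1:9092 failed: Connection refused'

for line in "$clickhouse_style" "$librdkafka_style"; do
    if existing_filter_matches "$line"; then
        echo "filtered: $line"
    else
        echo "reported: $line"
    fi
done
```

Under these assumptions the librdkafka-style line is still reported even though it contains `Connection refused`, because it lacks the `Code: 1000` prefix.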
Workflow [PR], commit [dc4ebf4]. Summary: ✅
The stress test runs Kafka engine tests (e.g. `03919_kafka_virtual_columns`) but does not start a Kafka-compatible broker. The Redpanda broker is only started in the stateless functional test job. Without a broker:
- `StorageKafka` tables are created but cannot connect to `127.0.0.1:9092`
- librdkafka spawns internal retry threads
- If the test cleanup does not run (killed by time limit), the table survives
- On post-stress server restart under TSan, these threads cause enough contention to freeze the entire server for ~18 minutes, making `SELECT 1` time out and the check fail with "Cannot start clickhouse-server"

Start Redpanda before the stress test using the same `setup_kafka.sh` script that the stateless functional tests already use.

CI report: https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=100230&sha=4b813531c8476538475475c0c3db0925fcb948cd&name_0=PR&name_1=Stress%20test%20%28arm_tsan%29
PR: #100230

Closes #101320
Closes #101322

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Previously, a `setup_kafka.sh` failure only produced a warning and the stress test proceeded without a broker. That regresses into the exact failure mode this mitigation is meant to prevent: `StorageKafka` retry threads surviving into the post-stress restart and freezing the server under sanitizers. Exit the runner on failure so that regressions are visible immediately.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
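The fail-fast behavior described in this commit could look roughly like the sketch below. The function name `start_broker_or_die` and the messages are hypothetical, and `true` / `false` stand in for the setup script so the example runs anywhere; the actual change in `stress_runner.sh` may be structured differently.

```shell
#!/bin/sh
# Hypothetical wrapper: run a broker setup command and abort the
# runner if it fails, instead of only printing a warning.
start_broker_or_die() {
    if ! "$@"; then
        echo "Broker setup failed, aborting stress test" >&2
        exit 1
    fi
    echo "Broker setup succeeded"
}

# Stand-in commands; a real runner would pass the path to its
# setup script (setup_kafka.sh in this PR) instead of true/false.
start_broker_or_die true

# Run the failing case in a subshell so this demo itself keeps going.
( start_broker_or_die false ) || echo "runner would have exited with status $?"
```

Exiting rather than warning means a broken broker setup surfaces as an immediate, attributable failure instead of a server freeze much later in the job.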
The previous `-e "StorageKafka"` pattern suppressed the entire logger, which would also mask real backward-compatibility issues originating from `StorageKafka` / `StorageKafka2`.

Narrow the filter to the specific librdkafka broker-unavailability signatures that appear when no broker is reachable:
- `[rdk:FAIL]` / `[rdk:ERROR]`: the `[rdk:<facility>]` prefix comes from the `librdkafka` log callback (see `KafkaConfigLoader.cpp`) and is emitted for broker transport failures and connection errors
- `Error during draining` / `Timeout during draining`: emitted from `StorageKafkaUtils::drainConsumer` when the consumer cannot reach the broker during shutdown

Other `StorageKafka` messages (`Couldn't start replica`, `Only errors left`, real exceptions, etc.) are no longer silenced.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
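The difference between the broad and the narrowed filter can be sketched with plain `grep`. The function names and the sample log lines below are hypothetical and only reuse the signatures named in this commit; the real upgrade check may invoke `grep` with different options.

```shell
#!/bin/sh
# Broad filter (previous behavior): drops every line that mentions the logger.
filter_whole_logger() {
    grep -v -F -e "StorageKafka" || true
}

# Narrow filter: drops only broker-unavailability signatures.
# -F treats the patterns as fixed strings, so the brackets are literal.
filter_broker_noise() {
    grep -v -F \
        -e "[rdk:FAIL]" \
        -e "[rdk:ERROR]" \
        -e "Error during draining" \
        -e "Timeout during draining" || true
}

# Illustrative lines only, not verbatim server output.
sample_log() {
    cat <<'EOF'
StorageKafka (t): [rdk:FAIL] illustrative broker transport failure
StorageKafka (t): [rdk:ERROR] illustrative connection error
StorageKafka (t): Error during draining: illustrative detail
StorageKafka (t): Timeout during draining
StorageKafka (t): Only errors left
EOF
}

echo "lines kept by broad filter: $(sample_log | filter_whole_logger | wc -l | tr -d ' ')"
echo "lines kept by narrow filter:"
sample_log | filter_broker_noise
```

With the narrow filter, the `Only errors left` line survives and would still be flagged by the check, which is the point of the change.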
The stress test runs Kafka engine tests (e.g. `03919_kafka_virtual_columns`) but does not start a Kafka-compatible broker; Redpanda is only started in the stateless functional test job. Without a broker:
- `StorageKafka` tables are created but cannot connect to `127.0.0.1:9092`
- librdkafka spawns internal retry threads
- If the test cleanup does not run (killed by time limit), the table survives
- On post-stress server restart under TSan, these threads cause enough contention to freeze the entire server for ~18 minutes, making `SELECT 1` time out and the check fail with "Cannot start clickhouse-server"

This PR:
- Starts Redpanda in `stress_runner.sh` using the same `setup_kafka.sh` that `functional_tests.py` already uses
- Ignores `StorageKafka` errors in the upgrade check log analysis (these are expected when the broker is not available during upgrade)

CI report: https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=100230&sha=4b813531c8476538475475c0c3db0925fcb948cd&name_0=PR&name_1=Stress%20test%20%28arm_tsan%29
Closes #101320
Closes #101322