Skip to content

Start Redpanda in stress tests, ignore StorageKafka errors in upgrade check#102287

Merged
alexey-milovidov merged 7 commits intomasterfrom
fix-upgrade-check-kafka-errors
Apr 26, 2026
Merged

Start Redpanda in stress tests, ignore StorageKafka errors in upgrade check#102287
alexey-milovidov merged 7 commits intomasterfrom
fix-upgrade-check-kafka-errors

Conversation

@alexey-milovidov
Copy link
Copy Markdown
Member

@alexey-milovidov alexey-milovidov commented Apr 9, 2026

The stress test runs Kafka engine tests (e.g. 03919_kafka_virtual_columns) but does not start a Kafka-compatible broker — Redpanda is only started in the stateless functional test job. Without a broker:

  • StorageKafka tables are created but cannot connect to 127.0.0.1:9092
  • librdkafka spawns internal retry threads that spin on reconnection
  • If the test cleanup does not run (killed by time limit), the table survives into the post-stress restart
  • Under TSan, these threads cause enough contention to freeze the entire server for ~18 minutes, making SELECT 1 time out and the check fail with "Cannot start clickhouse-server"

This PR:

  1. Starts Redpanda in stress_runner.sh using the same setup_kafka.sh that functional_tests.py already uses
  2. Ignores StorageKafka errors in the upgrade check log analysis (these are expected when the broker is not available during upgrade)

CI report: https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=100230&sha=4b813531c8476538475475c0c3db0925fcb948cd&name_0=PR&name_1=Stress%20test%20%28arm_tsan%29

Closes #101320
Closes #101322

Changelog category (leave one):

  • CI Fix or Improvement (changelog entry is not required)

Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):

...

Documentation entry for user-facing changes

  • Documentation is written (mandatory for new features)

The upgrade check environment has no Kafka broker, so `StorageKafka`
tables left behind by stress tests produce spurious librdkafka
connection errors (`[rdk:FAIL]`, `[rdk:ERROR]`). The existing
`Connection refused` filter only matches ClickHouse's `Code: 1000`
format, not librdkafka's native error format.

This was observed as a flaky failure in PR #100701:
https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=100701&sha=ad1433a4eedc1984573d106f2def96a41b52e564&name_0=PR&name_1=Upgrade%20check%20%28amd_release%29

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@clickhouse-gh
Copy link
Copy Markdown
Contributor

clickhouse-gh Bot commented Apr 9, 2026

Workflow [PR], commit [dc4ebf4]

Summary:


AI Review

Summary

This PR starts Redpanda in stress_runner.sh and extends the upgrade-log allowlist for Kafka-related errors. Starting Redpanda is a good fix for the stress freeze scenario, but the new broad log suppressions in upgrade_runner.sh weaken backward-compatibility signal and should be narrowed before merge.

Findings

⚠️ Majors

  • [tests/docker_scripts/upgrade_runner.sh:339-342] The new filters ([rdk:FAIL], [rdk:ERROR], Error during draining, Timeout during draining) suppress all StorageKafka error lines in upgrade validation, including potentially real compatibility regressions.
    Suggested fix: replace broad substrings with tightly scoped, known-benign patterns (for example, explicit connection-refused/no-broker messages for bootstrap hosts) and keep unexpected StorageKafka errors visible.
ClickHouse Rules
Item Status Notes
Deletion logging
Serialization versioning
Core-area scrutiny
No test removal
Experimental gate
No magic constants
Backward compatibility ⚠️ Upgrade validation now hides broad StorageKafka errors, reducing detection of real incompatibilities
SettingsChangesHistory.cpp
PR metadata quality
Safe rollout
Compilation time
No large/binary files
Final Verdict

Status: ⚠️ Request changes

Minimum required actions:

  1. Narrow the new Kafka-related ignore patterns in upgrade_runner.sh to specific known-benign messages instead of suppressing all StorageKafka error classes.

The stress test runs Kafka engine tests (e.g. `03919_kafka_virtual_columns`)
but does not start a Kafka-compatible broker. The Redpanda broker is only
started in the stateless functional test job. Without a broker:

- `StorageKafka` tables are created but cannot connect to `127.0.0.1:9092`
- librdkafka spawns internal retry threads
- If the test cleanup does not run (killed by time limit), the table survives
- On post-stress server restart under TSan, these threads cause enough
  contention to freeze the entire server for ~18 minutes, making `SELECT 1`
  time out and the check fail with "Cannot start clickhouse-server"

Start Redpanda before the stress test using the same `setup_kafka.sh` script
that the stateless functional tests already use.

CI report: https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=100230&sha=4b813531c8476538475475c0c3db0925fcb948cd&name_0=PR&name_1=Stress%20test%20%28arm_tsan%29
PR: #100230
Closes #101320
Closes #101322

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@alexey-milovidov alexey-milovidov changed the title Ignore StorageKafka errors in upgrade check Start Redpanda in stress tests, ignore StorageKafka errors in upgrade check Apr 9, 2026
Comment thread tests/docker_scripts/stress_runner.sh Outdated
Comment thread tests/docker_scripts/upgrade_runner.sh Outdated
alexey-milovidov and others added 5 commits April 19, 2026 02:31
Previously, a `setup_kafka.sh` failure only produced a warning and
the stress test proceeded without a broker. That regresses into the
exact failure mode this mitigation is meant to prevent: `StorageKafka`
retry threads surviving into the post-stress restart and freezing
the server under sanitizers.

Exit the runner on failure so that regressions are visible immediately.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The previous `-e "StorageKafka"` pattern suppressed the entire logger,
which would also mask real backward-compatibility issues originating
from `StorageKafka` / `StorageKafka2`.

Narrow the filter to the specific librdkafka broker-unavailability
signatures that appear when no broker is reachable:

- `[rdk:FAIL]` / `[rdk:ERROR]` — the `[rdk:<facility>]` prefix comes
  from the `librdkafka` log callback (see `KafkaConfigLoader.cpp`) and
  is emitted for broker transport failures and connection errors
- `Error during draining` / `Timeout during draining` — emitted from
  `StorageKafkaUtils::drainConsumer` when the consumer cannot reach
  the broker during shutdown

Other `StorageKafka` messages (`Couldn't start replica`, `Only errors
left`, real exceptions, etc.) are no longer silenced.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown
Member Author

@alexey-milovidov alexey-milovidov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok.

@alexey-milovidov alexey-milovidov self-assigned this Apr 26, 2026
@alexey-milovidov alexey-milovidov added this pull request to the merge queue Apr 26, 2026
Merged via the queue into master with commit b063ae8 Apr 26, 2026
164 checks passed
@alexey-milovidov alexey-milovidov deleted the fix-upgrade-check-kafka-errors branch April 26, 2026 11:28
@robot-ch-test-poll2 robot-ch-test-poll2 added the pr-synced-to-cloud The PR is synced to the cloud repo label Apr 26, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

pr-ci pr-synced-to-cloud The PR is synced to the cloud repo

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Check failed Cannot start clickhouse-server

2 participants