Do not fail to start server on transient Azure errors during disk initialization #100701
Conversation
When `container_already_exists` is not set in config, `getContainerClient` calls `containerExists`, which does an HTTP `GetProperties` request to Azure. If the endpoint is unreachable (DNS failure, connection refused, etc.), our HTTP client wraps the error as `InternalServerError` (500). `containerExists` only handled `NotFound` (404) and rethrew everything else, so the exception propagated through `DiskSelector::initialize` and killed the server.

The fix: treat `InternalServerError` in `containerExists` as "assume the container exists". If it doesn't actually exist, subsequent I/O operations will fail with a clear error rather than preventing the server from starting.

Unlike S3, which never makes network calls during object storage construction, Azure is the only backend that eagerly connects during init via `getContainerClient` → `containerExists` → `GetProperties`.

ClickHouse#100448

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
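A minimal sketch of the resulting control flow, with stand-in types rather than the actual ClickHouse or Azure SDK classes (`HttpStatus`, `StorageException`, and the `probe` callable are all illustrative): only a 404 means "the container is missing"; a transport failure wrapped as a 500 is treated as "assume it exists" so startup can proceed.

```cpp
#include <stdexcept>

// Stand-ins so the sketch is self-contained; the real code uses the
// Azure SDK's StorageException and HTTP status codes.
enum class HttpStatus { Ok = 200, NotFound = 404, InternalServerError = 500 };

struct StorageException : std::runtime_error
{
    HttpStatus status;
    explicit StorageException(HttpStatus s)
        : std::runtime_error("storage error"), status(s) {}
};

// `probe` stands in for the GetProperties round trip to Azure.
template <typename Probe>
bool containerExists(Probe && probe)
{
    try
    {
        probe();
        return true;
    }
    catch (const StorageException & e)
    {
        if (e.status == HttpStatus::NotFound)
            return false;  // container genuinely absent
        if (e.status == HttpStatus::InternalServerError)
            return true;   // transport failure wrapped as 500: assume it
                           // exists; later I/O will fail with a clear error
        throw;             // any other status still aborts startup
    }
}
```

If the container does not actually exist, the first read or write against it surfaces the real Azure error, which is a much clearer failure mode than refusing to start the server.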
---
Workflow [PR], commit [ad1433a] Summary: ❌
AI Review Summary: this PR changes Azure container existence probing during object-storage disk initialization.
Sections: Missing context · ClickHouse Rules · Final Verdict
---
LLVM Coverage Report
Changed lines: 100.00% (13/13)
The upgrade check environment has no Kafka broker, so `StorageKafka` tables left behind by stress tests produce spurious librdkafka connection errors (`[rdk:FAIL]`, `[rdk:ERROR]`). The existing `Connection refused` filter only matches ClickHouse's `Code: 1000` format, not librdkafka's native error format.

This was observed as a flaky failure in PR #100701: https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=100701&sha=ad1433a4eedc1984573d106f2def96a41b52e564&name_0=PR&name_1=Upgrade%20check%20%28amd_release%29

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
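The two log formats can be contrasted with a standalone sketch (an illustration only, not the actual upgrade-check filter, which lives in the test scripts): a filter keyed on ClickHouse's own `Code: 1000` wrapping misses librdkafka's native `[rdk:FAIL]` / `[rdk:ERROR]` lines, so both shapes have to be matched.

```cpp
#include <regex>
#include <string>

// Illustrative noise filter: returns true for connection errors that are
// expected in a broker-less environment and should not fail the check.
bool isIgnorableKafkaNoise(const std::string & line)
{
    // ClickHouse-formatted errors, e.g. "Code: 1000. ... Connection refused"
    static const std::regex clickhouse_format(R"(Code: 1000.*Connection refused)");
    // librdkafka's native prefixes, e.g. "[rdk:FAIL] ..." / "[rdk:ERROR] ..."
    static const std::regex librdkafka_format(R"(\[rdk:(FAIL|ERROR)\])");
    return std::regex_search(line, clickhouse_format)
        || std::regex_search(line, librdkafka_format);
}
```

With only the first regex in place, a line like `[rdk:FAIL] [thrd:kafka:9092/bootstrap]: Connect to broker failed` slips through and is reported as a real error, which is exactly the flakiness described above.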
---
The upgrade check failure was caused by the spurious librdkafka connection errors described in the commit above.
---
UBSan error fixed here: #100086
---
Kafka is also fixed. |
---
Hi — this PR may need backporting.

Affected code:
Why:
Other supported branches:

If this should be backported, consider adding the backport label.
---
@clickgapai No, it's fine. We only backport critical bugfixes and this isn't. |
---
@Algunenano Understood, thanks for the clarification — I'll keep that bar in mind for future backport suggestions. |
When `container_already_exists` is not set in config, `getContainerClient` calls `containerExists`, which does an HTTP `GetProperties` request to Azure. If the endpoint is unreachable (DNS failure, connection refused, etc.), our HTTP client wraps the error as `InternalServerError` (500). `containerExists` only handled `NotFound` (404) and rethrew everything else, making the exception propagate through `DiskSelector::initialize`, preventing the server from starting.

The fix: treat `InternalServerError` in `containerExists` as "assume the container exists". If it doesn't actually exist, subsequent I/O operations will fail with a clear error rather than preventing the server from starting.

Unlike S3, which never makes network calls during object storage construction, Azure is the only backend that eagerly connects during init via `getContainerClient` → `containerExists` → `GetProperties`.

Failure report: https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=100448&sha=ff1348d3d36895a9a96e88da282779203cbc5f57&name_0=PR&name_1=Upgrade%20check%20%28amd_release%29
#100448
Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):
Server no longer fails to start when an Azure blob storage disk is configured but the endpoint is temporarily unreachable (e.g. DNS failure).
Documentation entry for user-facing changes