
Do not fail to start server on transient Azure errors during disk initialization #100701

Merged
alexey-milovidov merged 4 commits into ClickHouse:master from Algunenano:fix-azure-dns-crash-v2 on Apr 9, 2026

Conversation

@Algunenano
Member

@Algunenano Algunenano commented Mar 25, 2026

When container_already_exists is not set in config, getContainerClient calls containerExists which does an HTTP GetProperties request to Azure. If the endpoint is unreachable (DNS failure, connection refused, etc.), our HTTP client wraps the error as InternalServerError (500). containerExists only handled NotFound (404) and rethrew everything else, making the exception propagate through DiskSelector::initialize — preventing the server from starting.

The fix: treat InternalServerError in containerExists as "assume the container exists". If it doesn't actually exist, subsequent I/O operations will fail with a clear error rather than preventing the server from starting.

Unlike S3, which never makes network calls during object storage construction, Azure is the only backend that eagerly connects during init via `getContainerClient` → `containerExists` → `GetProperties`.

Failure report: https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=100448&sha=ff1348d3d36895a9a96e88da282779203cbc5f57&name_0=PR&name_1=Upgrade%20check%20%28amd_release%29
#100448

Changelog category (leave one):

  • Bug Fix (user-visible misbehavior in an official stable release)

Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):

Server no longer fails to start when an Azure blob storage disk is configured but the endpoint is temporarily unreachable (e.g. DNS failure).

Documentation entry for user-facing changes

  • Documentation is written (mandatory for new features)


Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@clickhouse-gh
Contributor

clickhouse-gh Bot commented Mar 25, 2026

Workflow [PR], commit [ad1433a]

Summary:

| job_name | test_name | status | comment |
|---|---|---|---|
| Integration tests (amd_binary, 2/5) | test_storage_kafka/test_batch_slow_0.py::test_kafka_formats_with_broken_message[generate_old_create_table_query] | FAIL | cidb |
| Upgrade check (amd_release) | Error message in clickhouse-server.log (see upgrade_error_messages.txt) | FAIL | cidb |
| AST fuzzer (arm_asan_ubsan) | UndefinedBehaviorSanitizer: undefined behavior (STID: 2527-3af3) | FAIL | cidb, issue |
| Fast test (arm_darwin) | | error | |

AI Review

Summary

This PR changes Azure container existence probing during object-storage disk initialization: when GetProperties raises InternalServerError, startup now continues by assuming the container exists, and a new integration test verifies the server can restart with an unreachable Azure endpoint. I did not find a high-confidence correctness/safety regression in the patched paths.

Missing context
  • ⚠️ No CI run results/logs were provided in this review context, so I could not validate runtime behavior beyond static analysis.
ClickHouse Rules
  • Deletion logging
  • Serialization versioning
  • Core-area scrutiny
  • No test removal
  • Experimental gate
  • No magic constants
  • Backward compatibility
  • SettingsChangesHistory.cpp
  • PR metadata quality
  • Safe rollout
  • Compilation time
  • No large/binary files
Final Verdict
Final Verdict
  • Status: ✅ Approve

@clickhouse-gh clickhouse-gh Bot added the pr-bugfix Pull request with bugfix, not backported by default label Mar 25, 2026
@Algunenano Algunenano changed the title Do not crash server on transient Azure errors during disk initialization Do not fail to start server on transient Azure errors during disk initialization Mar 25, 2026
@clickhouse-gh
Contributor

clickhouse-gh Bot commented Apr 9, 2026

LLVM Coverage Report

| Metric | Baseline | Current | Δ |
|---|---|---|---|
| Lines | 84.00% | 84.10% | +0.10% |
| Functions | 90.90% | 90.90% | +0.00% |
| Branches | 76.50% | 76.60% | +0.10% |

Changed lines: 100.00% (13/13) · Uncovered code

Full report · Diff report

alexey-milovidov added a commit that referenced this pull request Apr 9, 2026
The upgrade check environment has no Kafka broker, so `StorageKafka`
tables left behind by stress tests produce spurious librdkafka
connection errors (`[rdk:FAIL]`, `[rdk:ERROR]`). The existing
`Connection refused` filter only matches ClickHouse's `Code: 1000`
format, not librdkafka's native error format.

This was observed as a flaky failure in PR #100701:
https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=100701&sha=ad1433a4eedc1984573d106f2def96a41b52e564&name_0=PR&name_1=Upgrade%20check%20%28amd_release%29

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@alexey-milovidov
Member

The upgrade check failure was caused by StorageKafka errors from stress test leftovers (no Kafka broker in the upgrade environment). Fixed in #102287.

@alexey-milovidov
Member

UBSan error fixed here: #100086

@alexey-milovidov
Member

Kafka is also fixed.

@alexey-milovidov alexey-milovidov self-assigned this Apr 9, 2026
@alexey-milovidov alexey-milovidov merged commit 9b31f35 into ClickHouse:master Apr 9, 2026
158 of 164 checks passed
@robot-ch-test-poll1 robot-ch-test-poll1 added the pr-synced-to-cloud The PR is synced to the cloud repo label Apr 9, 2026
@clickgapai
Contributor

Hi — this PR may need backporting to 26.3 (LTS), 26.2, 26.1, but no backport label was found.

Affected code: containerExists in AzureBlobStorageCommon.cpp — core code present in all supported branches.

Why: The containerExists function and the PocoHTTPClient wrapping transport-level failures as InternalServerError have existed since before all supported release branches. This is a server-startup-blocking bug affecting any user with Azure Blob Storage disks configured when the Azure endpoint is transiently unreachable.

Other supported branches (25.8) are not affected — the changed code does not exist there.

If this should be backported, consider adding pr-must-backport or a version-specific label (e.g. v26.3-must-backport). Ignore this if backporting is not applicable.

@Algunenano
Member Author

@clickgapai No, it's fine. We only backport critical bugfixes and this isn't.

@clickgapai
Contributor

@Algunenano Understood, thanks for the clarification — I'll keep that bar in mind for future backport suggestions.

