
Do not fail to start server on transient Azure errors during disk initialization #100701

Merged
alexey-milovidov merged 4 commits into ClickHouse:master from Algunenano:fix-azure-dns-crash-v2 on Apr 9, 2026

Conversation

@Algunenano
Member

@Algunenano Algunenano commented Mar 25, 2026

When container_already_exists is not set in config, getContainerClient calls containerExists which does an HTTP GetProperties request to Azure. If the endpoint is unreachable (DNS failure, connection refused, etc.), our HTTP client wraps the error as InternalServerError (500). containerExists only handled NotFound (404) and rethrew everything else, making the exception propagate through DiskSelector::initialize — preventing the server from starting.

The fix: treat InternalServerError in containerExists as "assume the container exists". If it doesn't actually exist, subsequent I/O operations will fail with a clear error rather than preventing the server from starting.

Unlike S3, which never makes network calls during object storage construction, Azure is the only backend that eagerly connects during init via `getContainerClient` → `containerExists` → `GetProperties`.

Failure report: https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=100448&sha=ff1348d3d36895a9a96e88da282779203cbc5f57&name_0=PR&name_1=Upgrade%20check%20%28amd_release%29
#100448

Changelog category (leave one):

  • Bug Fix (user-visible misbehavior in an official stable release)

Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):

Server no longer fails to start when an Azure blob storage disk is configured but the endpoint is temporarily unreachable (e.g. DNS failure).

Documentation entry for user-facing changes

  • Documentation is written (mandatory for new features)


Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@clickhouse-gh
Contributor

clickhouse-gh Bot commented Mar 25, 2026

Workflow [PR], commit [ad1433a]

Summary:

| job_name | test_name | status | comment |
|---|---|---|---|
| Integration tests (amd_binary, 2/5) | test_storage_kafka/test_batch_slow_0.py::test_kafka_formats_with_broken_message[generate_old_create_table_query] | FAIL | cidb |
| Upgrade check (amd_release) | Error message in clickhouse-server.log (see upgrade_error_messages.txt) | FAIL | cidb |
| AST fuzzer (arm_asan_ubsan) | UndefinedBehaviorSanitizer: undefined behavior (STID: 2527-3af3) | FAIL | cidb, issue |
| Fast test (arm_darwin) | | error | |

AI Review

Summary

This PR changes Azure container existence probing during object-storage disk initialization: when GetProperties raises InternalServerError, startup now continues by assuming the container exists, and a new integration test verifies the server can restart with an unreachable Azure endpoint. I did not find a high-confidence correctness/safety regression in the patched paths.

Missing context
  • ⚠️ No CI run results/logs were provided in this review context, so I could not validate runtime behavior beyond static analysis.
ClickHouse Rules
  • Deletion logging
  • Serialization versioning
  • Core-area scrutiny
  • No test removal
  • Experimental gate
  • No magic constants
  • Backward compatibility
  • SettingsChangesHistory.cpp
  • PR metadata quality
  • Safe rollout
  • Compilation time
  • No large/binary files
Final Verdict
Final Verdict
  • Status: ✅ Approve

@clickhouse-gh clickhouse-gh Bot added the pr-bugfix Pull request with bugfix, not backported by default label Mar 25, 2026
@Algunenano Algunenano changed the title Do not crash server on transient Azure errors during disk initialization Do not fail to start server on transient Azure errors during disk initialization Mar 25, 2026
@clickhouse-gh
Contributor

clickhouse-gh Bot commented Apr 9, 2026

LLVM Coverage Report

| Metric | Baseline | Current | Δ |
|---|---|---|---|
| Lines | 84.00% | 84.10% | +0.10% |
| Functions | 90.90% | 90.90% | +0.00% |
| Branches | 76.50% | 76.60% | +0.10% |

Changed lines: 100.00% (13/13) · Uncovered code

Full report · Diff report

alexey-milovidov added a commit that referenced this pull request Apr 9, 2026
The upgrade check environment has no Kafka broker, so `StorageKafka`
tables left behind by stress tests produce spurious librdkafka
connection errors (`[rdk:FAIL]`, `[rdk:ERROR]`). The existing
`Connection refused` filter only matches ClickHouse's `Code: 1000`
format, not librdkafka's native error format.

This was observed as a flaky failure in PR #100701:
https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=100701&sha=ad1433a4eedc1984573d106f2def96a41b52e564&name_0=PR&name_1=Upgrade%20check%20%28amd_release%29

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@alexey-milovidov
Member

The upgrade check failure was caused by StorageKafka errors from stress test leftovers (no Kafka broker in the upgrade environment). Fixed in #102287.

@alexey-milovidov
Member

UBSan error fixed here: #100086

@alexey-milovidov
Member

Kafka is also fixed.

@alexey-milovidov alexey-milovidov self-assigned this Apr 9, 2026
@alexey-milovidov alexey-milovidov merged commit 9b31f35 into ClickHouse:master Apr 9, 2026
158 of 164 checks passed
@robot-ch-test-poll1 robot-ch-test-poll1 added the pr-synced-to-cloud The PR is synced to the cloud repo label Apr 9, 2026
@clickgapai
Contributor

Hi — this PR may need backporting to 26.3 (LTS), 26.2, 26.1, but no backport label was found.

Affected code: containerExists in AzureBlobStorageCommon.cpp — core code present in all supported branches.

Why: The containerExists function and the PocoHTTPClient wrapping transport-level failures as InternalServerError have existed since before all supported release branches. This is a server-startup-blocking bug affecting any user with Azure Blob Storage disks configured when the Azure endpoint is transiently unreachable.

Other supported branches (25.8) are not affected — the changed code does not exist there.

If this should be backported, consider adding pr-must-backport or a version-specific label (e.g. v26.3-must-backport). Ignore this if backporting is not applicable.

@Algunenano
Member Author

@clickgapai No, it's fine. We only backport critical bugfixes and this isn't.

@clickgapai
Contributor

@Algunenano Understood, thanks for the clarification — I'll keep that bar in mind for future backport suggestions.

