[CI Proof] constructLocation Azure branch uses find("/" + bucket) which matches inside the URL host — drops container from URL when bucket name is a prefix of the storage-account hostname#104201
Conversation
|
Workflow [PR], commit [3b948b1] Summary: ❌
AI ReviewSummaryThis PR adds regression tests for an Azure DataLake URL-construction bug ( Findings❌ Blockers
ClickHouse Rules
Final Verdict
|
| metadata.setLocation("abfss://data@datalake.dfs.core.windows.net/some/path"); | ||
| metadata.setEndpoint("https://datalake.dfs.core.windows.net"); | ||
|
|
||
| EXPECT_EQ( |
There was a problem hiding this comment.
This PR adds regression assertions that expose a real bug in TableMetadata::constructLocation, but it does not include the corresponding production fix. As written, this test should fail while location.find("/" + bucket) is still used in the Azure branch, so the PR is not mergeable by itself.
Please either:
- include the fix in
src/Databases/DataLake/ICatalog.cppin this PR, or - keep this as a non-merge proof PR and close it after CI evidence is collected.
There was a problem hiding this comment.
You're right — this is a non-merge proof PR. The two assertions are designed to fail against current master to demonstrate the regression at src/Databases/DataLake/ICatalog.cpp:170 (where location.find("/" + bucket) matches inside the host portion //datalake…, dropping the container segment). The corresponding non-Azure branch at line 201 still uses the safer location.ends_with(bucket), which is what the Azure branch also used pre-#104120.
Closing here so the finding refiles as an issue for the production fix in ICatalog.cpp (suggested: restore ends_with, or check only the path component via Poco::URI::getPath()).
ClickGapAI · Confidence: HIGH
|
CI completed with failures in |
|
The bot has tried to address CI failures on this PR 5 times without a successful push (logs unavailable / context insufficient / plan rejected). Stopping further auto-fix attempts on |
|
✅ |
Found via ClickGap automated review. Please close or comment if this is incorrect or needs adjustment.
Suspected bug:
constructLocationAzure branch usesfind("/" + bucket)which matches inside the URL host — drops container from URL when bucket name is a prefix of the storage-account hostnameRoot cause: src/Databases/DataLake/ICatalog.cpp:170 uses
location.find("/" + bucket)to detect 'bucket already in URL', butfindmatches inside the URL hostname (after//), not only in the path. The corresponding non-Azure branch on line 201 uses the saferlocation.ends_with(bucket). The pre-PR Azure branch also usedlocation.ends_with(bucket), so this is a regression introduced by switchingends_with→find.Affected locations:
src/Databases/DataLake/ICatalog.cpp:170—constructLocationAzure branch — false-positive bucket-already-present checkWhy CI is needed: Silent wrong-path reads/writes for Azure DataLake catalogs (Iceberg REST, OneLake, Glue, Unity, Paimon REST) whenever the catalog returns a table whose container is a prefix of the storage-account hostname. Affects
getCreateTableQueryImpl(DatabaseDataLake.cpp:923) andtryGetTableImpl(DatabaseDataLake.cpp:540): the storage configuration is initialized with a URL that points tohttps://<account>.dfs.core.windows.net/<path>/instead ofhttps://<account>.dfs.core.windows.net/<container>/<path>/. SubsequentSELECT/INSERT/SHOW CREATE TABLEeither return wrong data (if a directory with the same prefix exists in another container) or fail with a 404/Not Found. The user has no obvious indication that the bucket was dropped — error messages reference the wrong URL.Suggested fix: Replace
location.find("/" + bucket) != std::string::nposat line 170 with a check that tests only the URL path component (not the host). Either (a) parse withPoco::URIand inspectgetPath(), or (b) restorelocation.ends_with(bucket)as on line 201, or (c) useendsWith("/" + bucket) || contains("/" + bucket + "/"). The previous testTableMetadataSetLocationAzureAbfssWithEndpointhappens to pass becausemycontainer/mystorageaccountonly share 2 characters before diverging; add a regression test wherebucketis a strict prefix of the storage account host (e.g.bucket="data",account="datalake.dfs.core.windows.net").Try it on ClickHouse Fiddle
Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):
Not applicable — temporary CI proof PR, will be closed automatically.
Documentation entry for user-facing changes