Found via ClickGap automated review. Please close or comment if this is incorrect or needs adjustment.
Describe what's wrong
When an Iceberg/Glue/Unity DataLake catalog returns an Azure ABFSS table whose container name happens to be a prefix of the storage-account hostname (e.g. container data in account datalake.dfs.core.windows.net, or any container whose name matches the start of the account hostname), TableMetadata::getLocationWithEndpoint("https://<account>.dfs.core.windows.net") silently drops the container from the constructed HTTPS URL. With bucket=data, path=some/path, endpoint=https://datalake.dfs.core.windows.net, the function returns https://datalake.dfs.core.windows.net/some/path/ instead of the correct https://datalake.dfs.core.windows.net/data/some/path/. This URL is then passed to the underlying Azure object-storage configuration and reads/writes go to the wrong location (or fail with a not-found error).
Root cause: src/Databases/DataLake/ICatalog.cpp:170 uses location.find("/" + bucket) to detect 'bucket already in URL', but find matches inside the URL hostname (after //), not only in the path. The corresponding non-Azure branch on line 201 uses the safer location.ends_with(bucket). The pre-PR Azure branch also used location.ends_with(bucket), so this is a regression introduced by switching ends_with → find.
Why we believe this is a bug: src/Databases/DataLake/DatabaseDataLake.cpp:480 (getStorageEndpointForTable → getLocationWithEndpoint) → src/Databases/DataLake/ICatalog.cpp:148 (getLocationWithEndpoint) → constructLocation line 170: the predicate !force_add_bucket && location.find("/" + bucket) != std::string::npos returns true whenever the literal /<bucket> appears anywhere in location, including immediately after the protocol's // (i.e. as a prefix of the hostname). For bucket="data", location="https://datalake.dfs.core.windows.net", find("/data") returns position 7 (matching the second / and the start of datalake), so the function returns location / path / "" and the bucket name is never inserted into the path. Risk: silent wrong-path reads/writes for any Azure data lake user whose container name is a prefix of their storage-account hostname (a common naming convention).
Affected locations:
src/Databases/DataLake/ICatalog.cpp:170 — constructLocation Azure branch — false-positive bucket-already-present check
Impact: Silent wrong-path reads/writes for Azure DataLake catalogs (Iceberg REST, OneLake, Glue, Unity, Paimon REST) whenever the catalog returns a table whose container is a prefix of the storage-account hostname. Affects getCreateTableQueryImpl (DatabaseDataLake.cpp:923) and tryGetTableImpl (DatabaseDataLake.cpp:540): the storage configuration is initialized with a URL that points to https://<account>.dfs.core.windows.net/<path>/ instead of https://<account>.dfs.core.windows.net/<container>/<path>/. Subsequent SELECT/INSERT/SHOW CREATE TABLE either return wrong data (if a directory with the same prefix exists in another container) or fail with a 404/Not Found. The user has no obvious indication that the bucket was dropped — error messages reference the wrong URL.
Does it reproduce on most recent release?
Likely yes — see testability note in additional context.
How to reproduce
This is a code-level bug identified through source analysis. See root cause and affected locations above for the specific code paths involved.
Expected behavior
https://datalake.dfs.core.windows.net/data/some/path/
Error message and/or stacktrace
https://datalake.dfs.core.windows.net/some/path/
Additional context
Open risks:
- If a future user's container name contains the substring
/<bucket> accidentally (e.g. shared-prefix tenant names), the same false-positive triggers without any host overlap.
- The PR also removed Polaris-style path stripping entirely; tables stored in catalogs that return
abfss://c@account/c/path now produce doubled paths https://account.dfs.core.windows.net/c/c/path/ whether force_add_bucket is true or false. Setting force_add_bucket=true does not strip the duplicate inside path, so there is no longer any way to obtain the un-doubled URL.
Suggested fix: Replace location.find("/" + bucket) != std::string::npos at line 170 with a check that tests only the URL path component (not the host). Either (a) parse with Poco::URI and inspect getPath(), or (b) restore location.ends_with(bucket) as on line 201, or (c) use endsWith("/" + bucket) || contains("/" + bucket + "/"). The previous test TableMetadataSetLocationAzureAbfssWithEndpoint happens to pass because mycontainer/mystorageaccount only share 2 characters before diverging; add a regression test where bucket is a strict prefix of the storage account host (e.g. bucket="data", account="datalake.dfs.core.windows.net").
Analysis details: Confidence HIGH | Severity P1 | Testability: THEORETICAL
Found during automated review of PR #104120.
CI Proof
Bug confirmed by Unit tests (tsan) on proof PR #104201.
Reproducer test
DROP TABLE IF EXISTS 04060_test;
CREATE TABLE 04060_test
(
key String,
str_col String,
arr_col Array(String),
tuple_col Tuple(x String, y UInt32),
map_col Map(String, UInt32),
update_column Nullable(String)
)
ENGINE = CoalescingMergeTree(update_column)
ORDER BY key;
INSERT INTO 04060_test (key, str_col, arr_col, tuple_col, map_col)
VALUES ('k', 'first', ['a', 'b'], ('hello', 42), {'x': 1, 'y': 2});
INSERT INTO 04060_test (key, update_column)
VALUES ('k', 'v2');
-- Wrap `map_col` in `mapSort` so the output is deterministic regardless of
-- the on-disk map serialization version (`map_serialization_version_for_zero_level_parts`),
-- which CI randomizes (`with_buckets` reorders keys by hash bucket).
SELECT str_col, arr_col, tuple_col, mapSort(map_col), update_column
FROM 04060_test FINAL;
OPTIMIZE TABLE 04060_test FINAL;
SELECT str_col, arr_col, tuple_col, mapSort(map_col), update_column
FROM 04060_test;
DROP TABLE 04060_test;
ClickGapAI · Confidence: HIGH · Severity: P1 · Finding: h_pr104120_001
Found via ClickGap automated review. Please close or comment if this is incorrect or needs adjustment.
Describe what's wrong
When an Iceberg/Glue/Unity DataLake catalog returns an Azure ABFSS table whose container name happens to be a prefix of the storage-account hostname (e.g. container
datain accountdatalake.dfs.core.windows.net, or any container whose name matches the start of the account hostname),TableMetadata::getLocationWithEndpoint("https://<account>.dfs.core.windows.net")silently drops the container from the constructed HTTPS URL. Withbucket=data,path=some/path,endpoint=https://datalake.dfs.core.windows.net, the function returnshttps://datalake.dfs.core.windows.net/some/path/instead of the correcthttps://datalake.dfs.core.windows.net/data/some/path/. This URL is then passed to the underlying Azure object-storage configuration and reads/writes go to the wrong location (or fail with a not-found error).Root cause: src/Databases/DataLake/ICatalog.cpp:170 uses
location.find("/" + bucket)to detect 'bucket already in URL', butfindmatches inside the URL hostname (after//), not only in the path. The corresponding non-Azure branch on line 201 uses the saferlocation.ends_with(bucket). The pre-PR Azure branch also usedlocation.ends_with(bucket), so this is a regression introduced by switchingends_with→find.Why we believe this is a bug: src/Databases/DataLake/DatabaseDataLake.cpp:480 (
getStorageEndpointForTable→getLocationWithEndpoint) → src/Databases/DataLake/ICatalog.cpp:148 (getLocationWithEndpoint) →constructLocationline 170: the predicate!force_add_bucket && location.find("/" + bucket) != std::string::nposreturnstruewhenever the literal/<bucket>appears anywhere inlocation, including immediately after the protocol's//(i.e. as a prefix of the hostname). Forbucket="data",location="https://datalake.dfs.core.windows.net",find("/data")returns position 7 (matching the second/and the start ofdatalake), so the function returnslocation / path / ""and the bucket name is never inserted into the path. Risk: silent wrong-path reads/writes for any Azure data lake user whose container name is a prefix of their storage-account hostname (a common naming convention).Affected locations:
src/Databases/DataLake/ICatalog.cpp:170—constructLocationAzure branch — false-positive bucket-already-present checkImpact: Silent wrong-path reads/writes for Azure DataLake catalogs (Iceberg REST, OneLake, Glue, Unity, Paimon REST) whenever the catalog returns a table whose container is a prefix of the storage-account hostname. Affects
getCreateTableQueryImpl(DatabaseDataLake.cpp:923) andtryGetTableImpl(DatabaseDataLake.cpp:540): the storage configuration is initialized with a URL that points tohttps://<account>.dfs.core.windows.net/<path>/instead ofhttps://<account>.dfs.core.windows.net/<container>/<path>/. SubsequentSELECT/INSERT/SHOW CREATE TABLEeither return wrong data (if a directory with the same prefix exists in another container) or fail with a 404/Not Found. The user has no obvious indication that the bucket was dropped — error messages reference the wrong URL.Does it reproduce on most recent release?
Likely yes — see testability note in additional context.
How to reproduce
This is a code-level bug identified through source analysis. See root cause and affected locations above for the specific code paths involved.
Expected behavior
Error message and/or stacktrace
Additional context
Open risks:
/<bucket>accidentally (e.g. shared-prefix tenant names), the same false-positive triggers without any host overlap.abfss://c@account/c/pathnow produce doubled pathshttps://account.dfs.core.windows.net/c/c/path/whetherforce_add_bucketis true or false. Settingforce_add_bucket=truedoes not strip the duplicate insidepath, so there is no longer any way to obtain the un-doubled URL.Suggested fix: Replace
location.find("/" + bucket) != std::string::nposat line 170 with a check that tests only the URL path component (not the host). Either (a) parse withPoco::URIand inspectgetPath(), or (b) restorelocation.ends_with(bucket)as on line 201, or (c) useendsWith("/" + bucket) || contains("/" + bucket + "/"). The previous testTableMetadataSetLocationAzureAbfssWithEndpointhappens to pass becausemycontainer/mystorageaccountonly share 2 characters before diverging; add a regression test wherebucketis a strict prefix of the storage account host (e.g.bucket="data",account="datalake.dfs.core.windows.net").Analysis details: Confidence HIGH | Severity P1 | Testability:
THEORETICALFound during automated review of PR #104120.
CI Proof
Bug confirmed by
Unit tests (tsan)on proof PR #104201.Reproducer test
ClickGapAI · Confidence: HIGH · Severity: P1 · Finding:
h_pr104120_001