
constructLocation Azure branch uses find("/" + bucket) which matches inside the URL host — drops container from URL when bucket name is a prefix of the storage-account hostname #104249

@clickgapai

Description

Found via ClickGap automated review. Please close or comment if this is incorrect or needs adjustment.

Describe what's wrong

When an Iceberg/Glue/Unity DataLake catalog returns an Azure ABFSS table whose container name happens to be a prefix of the storage-account hostname (e.g. container data in account datalake.dfs.core.windows.net, or any container whose name matches the start of the account hostname), TableMetadata::getLocationWithEndpoint("https://<account>.dfs.core.windows.net") silently drops the container from the constructed HTTPS URL. With bucket=data, path=some/path, endpoint=https://datalake.dfs.core.windows.net, the function returns https://datalake.dfs.core.windows.net/some/path/ instead of the correct https://datalake.dfs.core.windows.net/data/some/path/. This URL is then passed to the underlying Azure object-storage configuration and reads/writes go to the wrong location (or fail with a not-found error).

Root cause: src/Databases/DataLake/ICatalog.cpp:170 uses location.find("/" + bucket) to detect "bucket already in URL", but find matches anywhere in the string, including inside the URL hostname (after //), not only in the path. The corresponding non-Azure branch on line 201 uses the safer location.ends_with(bucket). The pre-PR Azure branch also used location.ends_with(bucket), so this is a regression introduced by switching from ends_with to find.

Why we believe this is a bug: src/Databases/DataLake/DatabaseDataLake.cpp:480 (getStorageEndpointForTable, which calls getLocationWithEndpoint) → src/Databases/DataLake/ICatalog.cpp:148 (getLocationWithEndpoint) → constructLocation line 170: the predicate !force_add_bucket && location.find("/" + bucket) != std::string::npos returns true whenever the literal /<bucket> appears anywhere in location, including immediately after the protocol's // (i.e. as a prefix of the hostname). For bucket="data", location="https://datalake.dfs.core.windows.net", find("/data") returns position 7 (matching the second / and the start of datalake), so the function returns location / path / "" and the bucket name is never inserted into the path. Risk: silent wrong-path reads/writes for any Azure data lake user whose container name is a prefix of their storage-account hostname (a common naming convention).

Affected locations:

  • src/Databases/DataLake/ICatalog.cpp:170, constructLocation Azure branch — false-positive bucket-already-present check

Impact: Silent wrong-path reads/writes for Azure DataLake catalogs (Iceberg REST, OneLake, Glue, Unity, Paimon REST) whenever the catalog returns a table whose container is a prefix of the storage-account hostname. Affects getCreateTableQueryImpl (DatabaseDataLake.cpp:923) and tryGetTableImpl (DatabaseDataLake.cpp:540): the storage configuration is initialized with a URL that points to https://<account>.dfs.core.windows.net/<path>/ instead of https://<account>.dfs.core.windows.net/<container>/<path>/. Subsequent SELECT/INSERT/SHOW CREATE TABLE either return wrong data (if a directory with the same prefix exists in another container) or fail with a 404/Not Found. The user has no obvious indication that the bucket was dropped — error messages reference the wrong URL.

Does it reproduce on most recent release?

Likely yes — see testability note in additional context.

How to reproduce

This is a code-level bug identified through source analysis. See root cause and affected locations above for the specific code paths involved.

Expected behavior

The constructed URL should include the container: https://datalake.dfs.core.windows.net/data/some/path/

Error message and/or stacktrace

No error is raised at URL-construction time; the location is silently built without the container: https://datalake.dfs.core.windows.net/some/path/

Additional context

Open risks:

  • If the literal substring /<bucket> happens to appear anywhere else in a location (e.g. shared-prefix tenant names), the same false positive triggers even without any host overlap.
  • The PR also removed Polaris-style path stripping entirely; tables stored in catalogs that return abfss://c@account/c/path now produce doubled paths https://account.dfs.core.windows.net/c/c/path/ whether force_add_bucket is true or false. Setting force_add_bucket=true does not strip the duplicate inside path, so there is no longer any way to obtain the un-doubled URL.

Suggested fix: Replace location.find("/" + bucket) != std::string::npos at line 170 with a check that tests only the URL path component (not the host). Either (a) parse with Poco::URI and inspect getPath(), or (b) restore location.ends_with(bucket) as on line 201, or (c) use endsWith("/" + bucket) || contains("/" + bucket + "/"). The previous test TableMetadataSetLocationAzureAbfssWithEndpoint happens to pass because mycontainer/mystorageaccount only share 2 characters before diverging; add a regression test where bucket is a strict prefix of the storage account host (e.g. bucket="data", account="datalake.dfs.core.windows.net").

Analysis details: Confidence HIGH | Severity P1 | Testability: THEORETICAL

Found during automated review of PR #104120.

CI Proof

Bug confirmed by Unit tests (tsan) on proof PR #104201.

Reproducer test
DROP TABLE IF EXISTS 04060_test;

CREATE TABLE 04060_test
(
    key            String,
    str_col        String,
    arr_col        Array(String),
    tuple_col      Tuple(x String, y UInt32),
    map_col        Map(String, UInt32),
    update_column  Nullable(String)
)
ENGINE = CoalescingMergeTree(update_column)
ORDER BY key;

INSERT INTO 04060_test (key, str_col, arr_col, tuple_col, map_col)
VALUES ('k', 'first', ['a', 'b'], ('hello', 42), {'x': 1, 'y': 2});

INSERT INTO 04060_test (key, update_column)
VALUES ('k', 'v2');

-- Wrap `map_col` in `mapSort` so the output is deterministic regardless of
-- the on-disk map serialization version (`map_serialization_version_for_zero_level_parts`),
-- which CI randomizes (`with_buckets` reorders keys by hash bucket).
SELECT str_col, arr_col, tuple_col, mapSort(map_col), update_column
FROM 04060_test FINAL;

OPTIMIZE TABLE 04060_test FINAL;

SELECT str_col, arr_col, tuple_col, mapSort(map_col), update_column
FROM 04060_test;

DROP TABLE 04060_test;

ClickGapAI · Confidence: HIGH · Severity: P1 · Finding: h_pr104120_001
