Antalya 26.3: normalizing format check using lower case by zvonand · Pull Request #1803 · Altinity/ClickHouse

zvonand · 2026-05-15T19:34:42Z

Changelog category (leave one):

Not for changelog (changelog entry is not required)

Documentation entry for user-facing changes

Documentation is written (mandatory for new features)

It looks like we were not able to use the parquet metadata cache because the parquet format string gets passed many different ways and we were only looking for "Parquet". For example "PARQUET" from tools like Spark, "Parquet" internally in our code base, and "parquet" if you are following the iceberg spec.

In the future, it would be good to normalize this format string in one place, because we are dealing with this same issue in a few different places in the code.

…solution in next commit) --- Original cherry-pick message follows: Merge pull request ClickHouse#102825 from grantholly-clickhouse/fix_iceberg_parquet_lru_cache normalizing format check using lower case # Conflicts: # src/Storages/ObjectStorage/StorageObjectStorageSource.cpp

github-actions · 2026-05-15T19:36:16Z

Workflow [PR], commit [c8947da]

`TokenAccessStorage::authenticateImpl` was holding its own recursive mutex while calling `AccessChangesNotifier::sendNotifications`, which takes the notifier's mutex and then dispatches subscribers -- one of which (`processRoleChange`, subscribed to `Role` changes from the same storage) takes the storage's mutex back. The reverse order is taken by a concurrent `CREATE ROLE`, which first goes through `sendNotifications` (notifier mutex) and then into `processRoleChange` (storage mutex). TSan flagged the resulting lock-order inversion and the server aborted on first authentication, breaking all four `test_token_roles_mapping` tests in the `Integration tests (amd_tsan, 2/6)` shard. Switch the `lock_guard` to a `unique_lock` and explicitly `unlock` it right before draining notifications. The drain still has to happen -- it's why this call was added in the first place -- but it doesn't need to run under the storage mutex, and dropping the lock breaks the cycle. Addresses 4 failing test(s) in Integration tests (amd_tsan, 2/6) on #1803. After this fix the still-failing set shrank from 4 to 0. CI report: https://altinity-build-artifacts.s3.amazonaws.com/json.html?PR=1803&sha=34750777806dc0e657e750822401efc331b74ef3&name_0=PR&name_1=Integration%20tests%20%28amd_tsan%2C%202%2F6%29 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

zvonand · 2026-05-17T17:33:16Z

RelEasy `analyze-fails` — DONE

run completed at 2026-05-17T17:33:15Z

Head SHA: 3475077780 (feature/antalya-26.3/ClickHouse-ClickHouse-pr-102825)
Tests considered: 5 across 2 CI shard(s)
Outcomes: 1 done · 0 partial · 1 unrelated · 0 unresolved
Commits added by AI: 1 (✅ pushed)
Anthropic cost: $3.5285

Per-shard outcomes

⏭️ UNRELATED — `Stateless tests (arm_asan, azure, sequential, 1/2)`

1 failed test(s) considered — cost $0.3352
full report

AI narration

The test is failing with TOO_SLOW on a trivial INSERT INTO ... SELECT number FROM numbers(500). The estimated query time is reported as ~600+ seconds for 500 rows — a clear environment / runner slowness issue on ARM ASan rather than something this PR introduced. The PR doesn't change backup, tar archive, or trivial-insert code paths in any way that would slow such an INSERT. Combined with the flaky-elsewhere annotation (failing on 4 other tracked PRs), this is unrelated.

No code changes made.

Summary

03760_backup_tar_archive — [unrelated] — TOO_SLOW exception on a trivial INSERT INTO t0 SELECT number FROM numbers(500) (estimated ~604s for 500 rows). This is shard slowness on arm_asan, not a regression: the PR's diff doesn't touch numbers, partitioned-insert, backup, or tar-archive code paths in any way that would slow this query. The flaky-elsewhere annotation confirms it (failing on 4 other tracked PRs).

UNRELATED

Summary

03760_backup_tar_archive — [unrelated] — TOO_SLOW exception on a trivial INSERT INTO t0 SELECT number FROM numbers(500) (estimated ~604s for 500 rows). This is shard slowness on arm_asan, not a regression: the PR's diff doesn't touch numbers, partitioned-insert, backup, or tar-archive code paths in any way that would slow this query. The flaky-elsewhere annotation confirms it (failing on 4 other tracked PRs).

UNRELATED

✅ DONE — `Integration tests (amd_tsan, 2/6)`

4 failed test(s) considered — +1 commit(s) — cost $3.1933
full report

AI narration

The test file test_token_roles_mapping is added by this PR (89 new lines, new directory). Despite the flaky-elsewhere annotation, these failures could very well be caused by this PR. Let me investigate.
This test is newly added by this PR and exercises new code (the roles_mapping config in TokenAccessStorage). Despite flaky-elsewhere annotation, this is in scope. Let me fetch CI logs.
Found it! TSan reports a lock-order-inversion (potential deadlock) — server dies on the first request, subsequent tests get "connection refused". This is clearly caused by this PR's diff (new TokenAccessStorage / ExternalAuthenticators code). Let me get the full TSan stack to find the inversion.
Clear deadlock! TSan reports:

Thread T5 (CREATE ROLE handler): locks AccessChangesNotifier mutex (M0) at AccessChangesNotifier.cpp:89, then while still holding it, calls handler that locks TokenAccessStorage recursive_mutex (M1) at TokenAccessStorage.cpp:338 — M0→M1
Thread T6 (HTTP auth): locks M1 in authenticateImpl at TokenAccessStorage.cpp:571, then while still holding it calls sendNotifications which locks M0 at AccessChangesNotifier.cpp:89 — M1→M0

The fix needs to be in TokenAccessStorage::authenticateImpl. Let me read the relevant code.
The root cause is clear: authenticateImpl holds TokenAccessStorage::mutex (M1) at line 571 then calls sendNotifications at line 759 which acquires AccessChangesNotifier::mutex (M0). The reverse order happens on CREATE ROLE. Fix: release mutex before calling sendNotifications.

Let me verify the mutex type and look at the header.
Confirmed — sendNotifications() was added by this PR (pre-PR file at antalya-26.3 has no sendNotifications call). Now let me apply the fix.
Now build:
Build succeeded (incremental). Now run the failing tests in batch.
Good, the cd persisted. Now run praktika:
All 4 tests pass now. Let me commit and finalize.
Clean up the scratch tmp directory I created (only contains downloaded logs).
Working tree is clean. Final summary:

Summary — Integration tests (amd_tsan, 2/6) on PR #1803

Root cause: this PR's cherry-pick of ClickHouse#102825 added a call to AccessChangesNotifier::sendNotifications at the end of TokenAccessStorage::authenticateImpl while still holding the storage's recursive mutex. The notifier's mutex is taken first inside sendNotifications and then processRoleChange (a subscriber of this same storage) takes the storage mutex back — the reverse order is taken by concurrent CREATE ROLE. TSan reported lock-order-inversion on the first request and the server aborted, which is why the first test saw RemoteDisconnected and the next three saw Connection refused.

Fix: switched std::lock_guard to std::unique_lock in authenticateImpl and explicitly lock.unlock() right before the sendNotifications call (src/Access/TokenAccessStorage.cpp:571,766). Drain still happens, just outside the storage mutex.

Test results after fix (all 4 in one re-run via python3 -m ci.praktika run "integration" --test ...):

[fixed] test_token_roles_mapping/test.py::test_mapped_guid_grants_mapped_role
[fixed] test_token_roles_mapping/test.py::test_multiple_guids_grant_multiple_roles
[fixed] test_token_roles_mapping/test.py::test_unmapped_guid_is_dropped_by_filter
[fixed] test_token_roles_mapping/test.py::test_only_unmapped_guids_yield_no_roles

Note on the flaky-elsewhere annotation: the failures are real and caused by this PR's diff. The 5 other tracked PRs likely carry the same cherry-pick and trip the same TSan inversion — this fix should help them too. The test_token_roles_mapping test file itself is new in this PR.

Built once, ran once. 1/6 build attempts used.

DONE
Working tree is clean. Final summary:

Summary — Integration tests (amd_tsan, 2/6) on PR #1803

Root cause: this PR's cherry-pick of ClickHouse#102825 added a call to AccessChangesNotifier::sendNotifications at the end of TokenAccessStorage::authenticateImpl while still holding the storage's recursive mutex. The notifier's mutex is taken first inside sendNotifications and then processRoleChange (a subscriber of this same storage) takes the storage mutex back — the reverse order is taken by concurrent CREATE ROLE. TSan reported lock-order-inversion on the first request and the server aborted, which is why the first test saw RemoteDisconnected and the next three saw Connection refused.

Fix: switched std::lock_guard to std::unique_lock in authenticateImpl and explicitly lock.unlock() right before the sendNotifications call (src/Access/TokenAccessStorage.cpp:571,766). Drain still happens, just outside the storage mutex.

Test results after fix (all 4 in one re-run via python3 -m ci.praktika run "integration" --test ...):

[fixed] test_token_roles_mapping/test.py::test_mapped_guid_grants_mapped_role
[fixed] test_token_roles_mapping/test.py::test_multiple_guids_grant_multiple_roles
[fixed] test_token_roles_mapping/test.py::test_unmapped_guid_is_dropped_by_filter
[fixed] test_token_roles_mapping/test.py::test_only_unmapped_guids_yield_no_roles

Note on the flaky-elsewhere annotation: the failures are real and caused by this PR's diff. The 5 other tracked PRs likely carry the same cherry-pick and trip the same TSan inversion — this fix should help them too. The test_token_roles_mapping test file itself is new in this PR.

Built once, ran once. 1/6 build attempts used.

DONE

🤖 Posted automatically by releasy analyze-fails. Re-run the command to refresh.

grantholly-clickhouse and others added 2 commits May 15, 2026 21:31

Resolve conflicts in cherry-pick of ClickHouse#102825

3475077

zvonand added releasy Created/managed by RelEasy ai-resolved Port conflict auto-resolved by Claude labels May 15, 2026

zvonand added the antalya-26.3 label May 15, 2026

svb-alt added the backport Backport label May 16, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Antalya 26.3: normalizing format check using lower case#1803

Antalya 26.3: normalizing format check using lower case#1803
zvonand wants to merge 3 commits into
antalya-26.3from
feature/antalya-26.3/ClickHouse-ClickHouse-pr-102825

zvonand commented May 15, 2026

Uh oh!

github-actions Bot commented May 15, 2026 •

edited

Loading

Uh oh!

zvonand commented May 17, 2026

Summary

Summary

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

zvonand commented May 15, 2026

Changelog category (leave one):

Documentation entry for user-facing changes

Uh oh!

github-actions Bot commented May 15, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

zvonand commented May 17, 2026

RelEasy analyze-fails — DONE

Per-shard outcomes

⏭️ UNRELATED — Stateless tests (arm_asan, azure, sequential, 1/2)

Summary

Summary

✅ DONE — Integration tests (amd_tsan, 2/6)

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

github-actions Bot commented May 15, 2026 •

edited

Loading

RelEasy `analyze-fails` — DONE

⏭️ UNRELATED — `Stateless tests (arm_asan, azure, sequential, 1/2)`

✅ DONE — `Integration tests (amd_tsan, 2/6)`