Skip to content

Antalya 26.3: normalizing format check using lower case#1803

Open
zvonand wants to merge 3 commits into
antalya-26.3from
feature/antalya-26.3/ClickHouse-ClickHouse-pr-102825
Open

Antalya 26.3: normalizing format check using lower case#1803
zvonand wants to merge 3 commits into
antalya-26.3from
feature/antalya-26.3/ClickHouse-ClickHouse-pr-102825

Conversation

@zvonand
Copy link
Copy Markdown
Collaborator

@zvonand zvonand commented May 15, 2026

Changelog category (leave one):

  • Not for changelog (changelog entry is not required)

Cherry-picked from ClickHouse#102825.


Documentation entry for user-facing changes

  • Documentation is written (mandatory for new features)

It looks like we were not able to use the parquet metadata cache because the parquet format string gets passed many different ways and we were only looking for "Parquet". For example "PARQUET" from tools like Spark, "Parquet" internally in our code base, and "parquet" if you are following the iceberg spec.

In the future, it would be good to normalize this format string in one place, because we are dealing with this same issue in a few different places in the code.

grantholly-clickhouse and others added 2 commits May 15, 2026 21:31
…solution in next commit)

---
Original cherry-pick message follows:

Merge pull request ClickHouse#102825 from grantholly-clickhouse/fix_iceberg_parquet_lru_cache

normalizing format check using lower case

# Conflicts:
#	src/Storages/ObjectStorage/StorageObjectStorageSource.cpp
@zvonand zvonand added releasy Created/managed by RelEasy ai-resolved Port conflict auto-resolved by Claude labels May 15, 2026
@github-actions
Copy link
Copy Markdown

github-actions Bot commented May 15, 2026

Workflow [PR], commit [c8947da]

@svb-alt svb-alt added the backport Backport label May 16, 2026
`TokenAccessStorage::authenticateImpl` was holding its own recursive
mutex while calling `AccessChangesNotifier::sendNotifications`, which
takes the notifier's mutex and then dispatches subscribers -- one of
which (`processRoleChange`, subscribed to `Role` changes from the same
storage) takes the storage's mutex back. The reverse order is taken by a
concurrent `CREATE ROLE`, which first goes through `sendNotifications`
(notifier mutex) and then into `processRoleChange` (storage mutex).
TSan flagged the resulting lock-order inversion and the server aborted
on first authentication, breaking all four `test_token_roles_mapping`
tests in the `Integration tests (amd_tsan, 2/6)` shard.

Switch the `lock_guard` to a `unique_lock` and explicitly `unlock` it
right before draining notifications. The drain still has to happen --
it's why this call was added in the first place -- but it doesn't need
to run under the storage mutex, and dropping the lock breaks the cycle.

Addresses 4 failing test(s) in Integration tests (amd_tsan, 2/6) on
#1803. After this fix the
still-failing set shrank from 4 to 0.

CI report:
https://altinity-build-artifacts.s3.amazonaws.com/json.html?PR=1803&sha=34750777806dc0e657e750822401efc331b74ef3&name_0=PR&name_1=Integration%20tests%20%28amd_tsan%2C%202%2F6%29

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@zvonand
Copy link
Copy Markdown
Collaborator Author

zvonand commented May 17, 2026

RelEasy analyze-fails — DONE

run completed at 2026-05-17T17:33:15Z

  • Head SHA: 3475077780 (feature/antalya-26.3/ClickHouse-ClickHouse-pr-102825)
  • Tests considered: 5 across 2 CI shard(s)
  • Outcomes: 1 done · 0 partial · 1 unrelated · 0 unresolved
  • Commits added by AI: 1 (✅ pushed)
  • Anthropic cost: $3.5285

Per-shard outcomes

⏭️ UNRELATED — Stateless tests (arm_asan, azure, sequential, 1/2)

1 failed test(s) considered — cost $0.3352
full report

AI narration

The test is failing with TOO_SLOW on a trivial INSERT INTO ... SELECT number FROM numbers(500). The estimated query time is reported as ~600+ seconds for 500 rows — a clear environment / runner slowness issue on ARM ASan rather than something this PR introduced. The PR doesn't change backup, tar archive, or trivial-insert code paths in any way that would slow such an INSERT. Combined with the flaky-elsewhere annotation (failing on 4 other tracked PRs), this is unrelated.

No code changes made.

Summary

  • 03760_backup_tar_archive[unrelated] — TOO_SLOW exception on a trivial INSERT INTO t0 SELECT number FROM numbers(500) (estimated ~604s for 500 rows). This is shard slowness on arm_asan, not a regression: the PR's diff doesn't touch numbers, partitioned-insert, backup, or tar-archive code paths in any way that would slow this query. The flaky-elsewhere annotation confirms it (failing on 4 other tracked PRs).

UNRELATED

Summary

  • 03760_backup_tar_archive[unrelated] — TOO_SLOW exception on a trivial INSERT INTO t0 SELECT number FROM numbers(500) (estimated ~604s for 500 rows). This is shard slowness on arm_asan, not a regression: the PR's diff doesn't touch numbers, partitioned-insert, backup, or tar-archive code paths in any way that would slow this query. The flaky-elsewhere annotation confirms it (failing on 4 other tracked PRs).

UNRELATED

✅ DONE — Integration tests (amd_tsan, 2/6)

4 failed test(s) considered — +1 commit(s) — cost $3.1933
full report

AI narration

The test file test_token_roles_mapping is added by this PR (89 new lines, new directory). Despite the flaky-elsewhere annotation, these failures could very well be caused by this PR. Let me investigate.
This test is newly added by this PR and exercises new code (the roles_mapping config in TokenAccessStorage). Despite flaky-elsewhere annotation, this is in scope. Let me fetch CI logs.
Found it! TSan reports a lock-order-inversion (potential deadlock) — server dies on the first request, subsequent tests get "connection refused". This is clearly caused by this PR's diff (new TokenAccessStorage / ExternalAuthenticators code). Let me get the full TSan stack to find the inversion.
Clear deadlock! TSan reports:

  • Thread T5 (CREATE ROLE handler): locks AccessChangesNotifier mutex (M0) at AccessChangesNotifier.cpp:89, then while still holding it, calls handler that locks TokenAccessStorage recursive_mutex (M1) at TokenAccessStorage.cpp:338M0→M1
  • Thread T6 (HTTP auth): locks M1 in authenticateImpl at TokenAccessStorage.cpp:571, then while still holding it calls sendNotifications which locks M0 at AccessChangesNotifier.cpp:89M1→M0

The fix needs to be in TokenAccessStorage::authenticateImpl. Let me read the relevant code.
The root cause is clear: authenticateImpl holds TokenAccessStorage::mutex (M1) at line 571 then calls sendNotifications at line 759 which acquires AccessChangesNotifier::mutex (M0). The reverse order happens on CREATE ROLE. Fix: release mutex before calling sendNotifications.

Let me verify the mutex type and look at the header.
Confirmed — sendNotifications() was added by this PR (pre-PR file at antalya-26.3 has no sendNotifications call). Now let me apply the fix.
Now build:
Build succeeded (incremental). Now run the failing tests in batch.
Good, the cd persisted. Now run praktika:
All 4 tests pass now. Let me commit and finalize.
Clean up the scratch tmp directory I created (only contains downloaded logs).
Working tree is clean. Final summary:


Summary — Integration tests (amd_tsan, 2/6) on PR #1803

Root cause: this PR's cherry-pick of ClickHouse#102825 added a call to AccessChangesNotifier::sendNotifications at the end of TokenAccessStorage::authenticateImpl while still holding the storage's recursive mutex. The notifier's mutex is taken first inside sendNotifications and then processRoleChange (a subscriber of this same storage) takes the storage mutex back — the reverse order is taken by concurrent CREATE ROLE. TSan reported lock-order-inversion on the first request and the server aborted, which is why the first test saw RemoteDisconnected and the next three saw Connection refused.

Fix: switched std::lock_guard to std::unique_lock in authenticateImpl and explicitly lock.unlock() right before the sendNotifications call (src/Access/TokenAccessStorage.cpp:571,766). Drain still happens, just outside the storage mutex.

Test results after fix (all 4 in one re-run via python3 -m ci.praktika run "integration" --test ...):

  • [fixed] test_token_roles_mapping/test.py::test_mapped_guid_grants_mapped_role
  • [fixed] test_token_roles_mapping/test.py::test_multiple_guids_grant_multiple_roles
  • [fixed] test_token_roles_mapping/test.py::test_unmapped_guid_is_dropped_by_filter
  • [fixed] test_token_roles_mapping/test.py::test_only_unmapped_guids_yield_no_roles

Note on the flaky-elsewhere annotation: the failures are real and caused by this PR's diff. The 5 other tracked PRs likely carry the same cherry-pick and trip the same TSan inversion — this fix should help them too. The test_token_roles_mapping test file itself is new in this PR.

Built once, ran once. 1/6 build attempts used.

DONE
Working tree is clean. Final summary:


Summary — Integration tests (amd_tsan, 2/6) on PR #1803

Root cause: this PR's cherry-pick of ClickHouse#102825 added a call to AccessChangesNotifier::sendNotifications at the end of TokenAccessStorage::authenticateImpl while still holding the storage's recursive mutex. The notifier's mutex is taken first inside sendNotifications and then processRoleChange (a subscriber of this same storage) takes the storage mutex back — the reverse order is taken by concurrent CREATE ROLE. TSan reported lock-order-inversion on the first request and the server aborted, which is why the first test saw RemoteDisconnected and the next three saw Connection refused.

Fix: switched std::lock_guard to std::unique_lock in authenticateImpl and explicitly lock.unlock() right before the sendNotifications call (src/Access/TokenAccessStorage.cpp:571,766). Drain still happens, just outside the storage mutex.

Test results after fix (all 4 in one re-run via python3 -m ci.praktika run "integration" --test ...):

  • [fixed] test_token_roles_mapping/test.py::test_mapped_guid_grants_mapped_role
  • [fixed] test_token_roles_mapping/test.py::test_multiple_guids_grant_multiple_roles
  • [fixed] test_token_roles_mapping/test.py::test_unmapped_guid_is_dropped_by_filter
  • [fixed] test_token_roles_mapping/test.py::test_only_unmapped_guids_yield_no_roles

Note on the flaky-elsewhere annotation: the failures are real and caused by this PR's diff. The 5 other tracked PRs likely carry the same cherry-pick and trip the same TSan inversion — this fix should help them too. The test_token_roles_mapping test file itself is new in this PR.

Built once, ran once. 1/6 build attempts used.

DONE


🤖 Posted automatically by releasy analyze-fails. Re-run the command to refresh.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ai-resolved Port conflict auto-resolved by Claude antalya-26.3 backport Backport releasy Created/managed by RelEasy

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants