Antalya 26.3: apassos-1: combined port of 5 PRs #1685
zvonand wants to merge 11 commits into antalya-26.3
Conversation
Three new background tasks introduced by the `enable_experimental_export_merge_tree_partition_feature` forwardport call ZooKeeper without entering a component scope. With `enforce_keeper_component_tracking = true` (set in fast-test config via `zookeeper_enforce_component_name.yaml`), this triggers a logical error in `Coordination::ZooKeeper::pushRequest` the moment any `ReplicatedMergeTree` table activates the tasks on startup, aborting the server. The 247 failing fast-test replicated-table tests are all downstream effects of this abort (they surface as KEEPER_EXCEPTION / TABLE_IS_READ_ONLY).

Wrap the entry of each background task method in `Coordination::setCurrentComponent`, matching the convention used by other replicated background work (e.g. `ReplicatedMergeTreeRestartingThread`, `ReplicatedMergeTreeCleanupThread`).

Addresses 247 failing tests in the Fast test shard on #1685. After this fix the still-failing set shrank from 247 -> 0 (locally: 245 OK, 2 SKIPPED, 0 FAILED across the same input list).
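As a sketch of the pattern (the `setCurrentComponent` signature, the component identifier, and the guard semantics are assumptions, not code from this PR), each task entry point enters a component scope before its first Keeper call:

```cpp
/// Illustrative sketch only: the exact setCurrentComponent API, the component
/// identifier, and whether it returns a RAII scope object are assumptions
/// based on the PR description, not code from the patch.
void ExportPartitionTaskScheduler::run()
{
    /// Enter a Keeper component scope before the first ZooKeeper request,
    /// mirroring ReplicatedMergeTreeRestartingThread / ReplicatedMergeTreeCleanupThread.
    auto component_scope = Coordination::setCurrentComponent("ReplicatedMergeTree");

    /// ... existing task body: every pushRequest issued from this thread now
    /// sees a non-empty component, so enforce_keeper_component_tracking passes ...
}
```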
RelEasy
…etion callback

The previous fix wrapped the three periodic background tasks (`ExportPartitionTaskScheduler::run`, `ExportPartitionManifestUpdatingTask::poll`, `ExportPartitionManifestUpdatingTask::handleStatusChanges`) with a `Coordination::setCurrentComponent` guard, but two more code paths in the same feature still touch ZooKeeper without a component scope:

* `StorageReplicatedMergeTree::exportPartitionToTable`: the synchronous handler for `ALTER TABLE ... EXPORT PARTITION ... TO TABLE ...`. It calls `getZooKeeperAndAssertNotReadonly` and issues `tryGet`, `exists`, `tryRemoveRecursive`, and `tryMulti` against the table's `zookeeper_path`.
* `ExportPartitionTaskScheduler::handlePartExportCompletion`: the per-part completion callback. It is registered as a lambda in `ExportPartitionTaskScheduler::run` but actually fires from the `background_moves_assignee` thread (via `ExportPartTask::executeStep` → `manifest.completion_callback`), so the component scope set in `run` does not apply.

With `enforce_keeper_component_tracking = true` (set in stateless-test config via `zookeeper_enforce_component_name.yaml`) and `abort_on_logical_error = true` (debug build), either path triggers the logical error in `Coordination::ZooKeeper::pushRequest` and aborts the server, causing 03604_export_merge_tree_partition to fail with `server died` / `ConnectionRefusedError`.

Wrap both function entries in `Coordination::setCurrentComponent`, matching the convention used by sibling methods.

Addresses 2 failing tests in the Stateless tests (amd_debug, distributed plan, s3 storage, sequential) shard on #1685.
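A minimal sketch of the subtler completion-callback case from the list above; the argument types, the component identifier and the exact `setCurrentComponent` semantics are assumptions:

```cpp
/// Sketch only: the signature and component identifier below are assumptions
/// based on the description above, not code from the patch.
void ExportPartitionTaskScheduler::handlePartExportCompletion(const String & part_name)
{
    /// The callback is registered as a lambda in run(), but it fires from the
    /// background_moves_assignee thread (ExportPartTask::executeStep ->
    /// manifest.completion_callback), so the scope installed in run() does not
    /// cover it. Enter a scope here as well.
    auto component_scope = Coordination::setCurrentComponent("ReplicatedMergeTree");

    /// ... existing body: record the finished part in Keeper (tryMulti etc.) ...
}
```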
Two more issues surfaced after the previous Keeper component patches.

* `system.replicated_partition_exports` reads route through `StorageReplicatedMergeTree::getPartitionExportsInfo`, which calls `tryGetChildren` / `tryGet` on `zookeeper_path/exports/...` without a component scope. With `enforce_keeper_component_tracking = true` (set by the integration test helper `0_common_enforce_zookeeper_component_name.xml`) and `abort_on_logical_error` (debug build), every test that polls the system table during an export aborted the server with the `Current component is empty` `LOGICAL_ERROR` from `Coordination::ZooKeeper::pushRequest`. The release-build CI surfaced the same error as `Code: 49` returned to the client. The symmetric command path `StorageReplicatedMergeTree::killExportPartition` (used by `KILL EXPORT PARTITION`) had the same gap. Wrap both function entries in `Coordination::setCurrentComponent`, matching `exportPartitionToTable` and the background-task handlers.
* `MergeTreeData::scheduleDataMovingJob` runs the export-manifest loop unconditionally, even after `SYSTEM STOP MOVES` flips the moves blocker. Tests `test_export_partition_scheduler_skipped_when_moves_stopped` and `test_export_partition_resumes_after_stop_moves` register an export while moves are stopped and assert the status stays `PENDING` until `SYSTEM START MOVES`; without this guard the scheduler still picks up the manifest and the export completes. Skip the export-manifest loop early when the blocker is cancelled: `ExportPartTask::isCancelled` already covers the in-flight case, so this only blocks new scheduling.

Addresses 21 failing tests in the Integration tests (amd_binary, 3/5) shard on #1685. Locally the full 21-test list now passes (19 after the first patch, the remaining 2 STOP-MOVES tests after this patch).
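A minimal sketch of the moves-blocker guard from the second item above; the member name follows the usual ClickHouse `ActionBlocker` pattern but is not verbatim from the patch:

```cpp
/// Sketch: moves_blocker is the ActionBlocker flipped by SYSTEM STOP MOVES;
/// the rest of the function is elided and the return handling is a placeholder.
bool MergeTreeData::scheduleDataMovingJob(BackgroundJobsAssignee & assignee)
{
    /// Skip the export-manifest loop while moves are stopped, so a manifest
    /// registered under SYSTEM STOP MOVES stays PENDING until SYSTEM START MOVES.
    /// In-flight parts are already covered by ExportPartTask::isCancelled,
    /// so this only blocks scheduling of new work.
    if (!moves_blocker.isCancelled())
    {
        /// ... existing export-manifest loop: pick up pending manifests and
        /// schedule export jobs on the assignee ...
    }

    /// ... existing regular part-move scheduling; return value unchanged ...
    return false;
}
```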
The test asserts that on a second `SELECT *` over the cached Iceberg table, every byte that was written to or read from the filesystem cache during the first query is read back from cache (i.e. `read_from_cache_second_select == read_from_cache_first_select + written_to_cache_first_select`). It also asserts `S3GetObject == 0` on the second query.

This PR pulls in the `Poco::toLower` fix to the parquet metadata cache condition in `StorageObjectStorageSource::createReader` (PR #1631 / commit `c7fa2310ea7`). Before that fix, Iceberg manifest files report the format as `PARQUET` (uppercase), so the previous check `getFileFormat() == "Parquet"` always failed and the parquet metadata cache was never used for Iceberg tables. After the fix, the lowercased comparison matches and the cache is now active for Iceberg with `use_parquet_metadata_cache = 1` (default) and `input_format_parquet_use_native_reader_v3 = 1` (default in 26.x).

With the parquet metadata cache active, the parsed Parquet metadata (footer + row group metadata) is served from in-memory cache on the second query, so those bytes are no longer pulled through the read buffer and do not register on `CachedReadBufferReadFromCacheBytes`. The test then sees a value of about `1450` on the second select (only the Iceberg JSON / Avro metadata) instead of `1450 + 6144`, and fails as:

    assert (1450 == (1450 + 6144))

This is the same isolation pattern already used by `tests/integration/test_storage_delta/test.py` and `tests/queries/0_stateless/03723_parquet_prefetcher_read_big_at.sql`, where filesystem-cache-focused tests pass `use_parquet_metadata_cache = 0` to keep the parquet metadata cache from interfering with byte-level accounting. Add `use_parquet_metadata_cache = 0` to both `SELECT *` statements in `test_filesystem_cache[s3]` so the test exercises the filesystem cache path in isolation.

Locally the test now passes:

    test_storage_iceberg_with_spark_cache/test_filesystem_cache.py::test_filesystem_cache[s3] PASSED

Addresses 1 failing test in Integration tests (arm_binary, distributed plan, 1/4) on #1685. The other test in the shard (`test_dirty_pages_force_purge`) is listed in `tests/broken_tests.yaml` as `KNOWN: https://github.com/Altinity/ClickHouse/issues/1369` and is unrelated.
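For reference, a simplified sketch of the #1631 condition change described above; the surrounding variable names are assumptions and the real check sits inside `StorageObjectStorageSource::createReader`:

```cpp
#include <Poco/String.h>

/// Iceberg manifests report "PARQUET" in upper case, so the old exact
/// comparison against "Parquet" never matched for Iceberg tables.
/// Compare case-insensitively instead.
const bool parquet_metadata_cache_applies
    = use_parquet_metadata_cache
      && Poco::toLower(object_info->getFileFormat()) == "parquet";
```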
RelEasy
Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):
Add cache for S3 list objects calls and support for exporting MergeTree parts and partitions. Fix Apache Iceberg queries not hitting the parquet metadata cache.
Add cache for S3 list objects calls and support for exporting MergeTree parts and partitions. Fix Apache Iceberg queries not hitting the parquet metadata cache (#1405 by @arthurpassos, #1388 by @arthurpassos, #1593 by @arthurpassos, #1517 by @arthurpassos, #1631 by @arthurpassos).
CI/CD Options
Exclude tests:
Regression jobs to run:
Combined port of 5 PR(s) (group apassos-1). Cherry-picked from #1405, #1388, #1593, #1517, #1631.

#1405: Antalya 26.1 - Forward port of list objects cache #1040
Documentation entry for user-facing changes
Cache for S3 ListObjects calls
#1388: Antalya 26.1 - Forward port of export part and partition
Documentation entry for user-facing changes
Export merge tree part and partition (we still need to rebase #1177 afterwards)
#1593: Export Partition - release the part lock when the query is cancelled
During export partition, parts are locked by replicas for exports. This PR introduces a change that releases these locks when an export task is cancelled. Previously, the lock was not released. We did not catch this error before because the only cancellation cases we had tested were `KILL EXPORT PARTITION` and `DROP TABLE`. In those cases, the entire task is cancelled, so it does not matter if a replica does not release its lock. But a query can also be cancelled with `SYSTEM STOP MOVES`, and in that case the cancellation is a local operation: the lock must be released so other replicas can continue.
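A purely conceptual sketch of the behavior change, with a hypothetical helper and method name (the PR text does not show the actual code):

```cpp
/// Conceptual sketch only: the lock representation (e.g. a node under the
/// table's zookeeper_path) and the names below are not taken from the patch.
void ExportPartTask::onCancel()
{
    /// Before this PR the replica kept its part lock on a purely local
    /// cancellation such as SYSTEM STOP MOVES, so other replicas could never
    /// pick the part up. Release it explicitly; for KILL EXPORT PARTITION and
    /// DROP TABLE the whole task is removed, so the stale lock never mattered.
    releasePartLockForExport();   /// hypothetical helper
}
```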
Documentation entry for user-facing changes
...
#1517: Fix IPartitionStrategy race condition
IPartitionStrategy::computePartitionKey might be called from different threads, and it writes to cached_result concurrently without any sort of protection. It would be easier to add a mutex around it, but we can actually make it lock-free by moving the cache write to the constructor.
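A minimal sketch of the lock-free shape, with illustrative names only (the real `IPartitionStrategy` interface and key computation are more involved):

```cpp
#include <string>
#include <utility>

/// Illustrative sketch only: move the one-time cache write into the
/// constructor so later concurrent calls only read.
class PartitionStrategyExample
{
public:
    explicit PartitionStrategyExample(std::string raw_pattern)
        /// The object is constructed on a single thread, so writing the cache
        /// here is safe and computePartitionKey never has to write shared state.
        : cached_result(precompute(std::move(raw_pattern)))
    {
    }

    const std::string & computePartitionKey() const
    {
        /// Concurrent callers only read: no mutex needed, no data race.
        return cached_result;
    }

private:
    static std::string precompute(std::string raw_pattern)
    {
        /// Stand-in for the real partition-key computation.
        return raw_pattern;
    }

    const std::string cached_result;
};
```

Because `cached_result` is `const` and fully initialized before any other thread can observe the object, readers need no synchronization at all.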
Documentation entry for user-facing changes
...
#1631: Fix condition for using parquet metadata cache
Apache Iceberg queries were not hitting the parquet metadata cache because `object_info->getFileFormat()` resolves to `IcebergDataObjectInfo::getFileFormat`, which gets its return value from `IcebergObjectSerializableInfo`. That field is filled with the value from the Apache Iceberg manifest file, which is upper case by default, so it failed the ClickHouse check for parquet metadata cache usage.

Documentation entry for user-facing changes
...