Antalya 26.3: apassos-1: combined port of 5 PRs #1685
zvonand wants to merge 11 commits into antalya-26.3
Conversation
Three new background tasks introduced by the `enable_experimental_export_merge_tree_partition_feature` forwardport call ZooKeeper without entering a component scope. With `enforce_keeper_component_tracking = true` (set in fast-test config via `zookeeper_enforce_component_name.yaml`), this triggers a logical error in `Coordination::ZooKeeper::pushRequest` the moment any `ReplicatedMergeTree` table activates the tasks on startup, aborting the server. The 247 failing fast-test replicated-table tests are all downstream effects of this abort (they surface as KEEPER_EXCEPTION / TABLE_IS_READ_ONLY).

Wrap the entry of each background task method in `Coordination::setCurrentComponent`, matching the convention used by other replicated background work (e.g. `ReplicatedMergeTreeRestartingThread`, `ReplicatedMergeTreeCleanupThread`).

Addresses 247 failing tests in the Fast test shard on #1685. After this fix the still-failing set shrank from 247 -> 0 (locally: 245 OK, 2 SKIPPED, 0 FAILED across the same input list).
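As a sketch of the pattern (the `setCurrentComponent` signature, the component identifier, and the guard semantics are assumptions, not code from this PR), each task entry point enters a component scope before its first Keeper call:

```cpp
/// Illustrative sketch only: the exact setCurrentComponent API, the component
/// identifier, and whether it returns a RAII scope object are assumptions
/// based on the PR description, not code from the patch.
void ExportPartitionTaskScheduler::run()
{
    /// Enter a Keeper component scope before the first ZooKeeper request,
    /// mirroring ReplicatedMergeTreeRestartingThread / ReplicatedMergeTreeCleanupThread.
    auto component_scope = Coordination::setCurrentComponent("ReplicatedMergeTree");

    /// ... existing task body: every pushRequest issued from this thread now
    /// sees a non-empty component, so enforce_keeper_component_tracking passes ...
}
```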
RelEasy
…etion callback

The previous fix wrapped the three periodic background tasks (`ExportPartitionTaskScheduler::run`, `ExportPartitionManifestUpdatingTask::poll`, `ExportPartitionManifestUpdatingTask::handleStatusChanges`) with a `Coordination::setCurrentComponent` guard, but two more code paths in the same feature still touch ZooKeeper without a component scope:

* `StorageReplicatedMergeTree::exportPartitionToTable`: the synchronous handler for `ALTER TABLE ... EXPORT PARTITION ... TO TABLE ...`. It calls `getZooKeeperAndAssertNotReadonly` and issues `tryGet`, `exists`, `tryRemoveRecursive`, and `tryMulti` against the table's `zookeeper_path`.
* `ExportPartitionTaskScheduler::handlePartExportCompletion`: the per-part completion callback. It is registered as a lambda in `ExportPartitionTaskScheduler::run` but actually fires from the `background_moves_assignee` thread (via `ExportPartTask::executeStep` → `manifest.completion_callback`), so the component scope set in `run` does not apply.

With `enforce_keeper_component_tracking = true` (set in stateless-test config via `zookeeper_enforce_component_name.yaml`) and `abort_on_logical_error = true` (debug build), either path triggers the logical error in `Coordination::ZooKeeper::pushRequest` and aborts the server, causing 03604_export_merge_tree_partition to fail with `server died` / `ConnectionRefusedError`.

Wrap both function entries in `Coordination::setCurrentComponent`, matching the convention used by sibling methods.

Addresses 2 failing tests in the Stateless tests (amd_debug, distributed plan, s3 storage, sequential) shard on #1685.
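A minimal sketch of the subtler completion-callback case from the list above; the argument types, the component identifier and the exact `setCurrentComponent` semantics are assumptions:

```cpp
/// Sketch only: the signature and component identifier below are assumptions
/// based on the description above, not code from the patch.
void ExportPartitionTaskScheduler::handlePartExportCompletion(const String & part_name)
{
    /// The callback is registered as a lambda in run(), but it fires from the
    /// background_moves_assignee thread (ExportPartTask::executeStep ->
    /// manifest.completion_callback), so the scope installed in run() does not
    /// cover it. Enter a scope here as well.
    auto component_scope = Coordination::setCurrentComponent("ReplicatedMergeTree");

    /// ... existing body: record the finished part in Keeper (tryMulti etc.) ...
}
```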
Two more issues surfaced after the previous Keeper component patches.

* `system.replicated_partition_exports` reads route through `StorageReplicatedMergeTree::getPartitionExportsInfo`, which calls `tryGetChildren` / `tryGet` on `zookeeper_path/exports/...` without a component scope. With `enforce_keeper_component_tracking = true` (set by the integration test helper `0_common_enforce_zookeeper_component_name.xml`) and `abort_on_logical_error` (debug build), every test that polls the system table during an export aborted the server with the `Current component is empty` `LOGICAL_ERROR` from `Coordination::ZooKeeper::pushRequest`. The release-build CI surfaced the same error as `Code: 49` returned to the client. The symmetric command path `StorageReplicatedMergeTree::killExportPartition` (used by `KILL EXPORT PARTITION`) had the same gap. Wrap both function entries in `Coordination::setCurrentComponent`, matching `exportPartitionToTable` and the background-task handlers.
* `MergeTreeData::scheduleDataMovingJob` runs the export-manifest loop unconditionally, even after `SYSTEM STOP MOVES` flips the moves blocker. Tests `test_export_partition_scheduler_skipped_when_moves_stopped` and `test_export_partition_resumes_after_stop_moves` register an export while moves are stopped and assert the status stays `PENDING` until `SYSTEM START MOVES`; without this guard the scheduler still picks up the manifest and the export completes. Skip the export-manifest loop early when the blocker is cancelled: `ExportPartTask::isCancelled` already covers the in-flight case, so this only blocks new scheduling.

Addresses 21 failing tests in the Integration tests (amd_binary, 3/5) shard on #1685. Locally the full 21-test list now passes (19 after the first patch, the remaining 2 STOP-MOVES tests after this patch).
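A minimal sketch of the moves-blocker guard from the second item above; the member name follows the usual ClickHouse `ActionBlocker` pattern but is not verbatim from the patch:

```cpp
/// Sketch: moves_blocker is the ActionBlocker flipped by SYSTEM STOP MOVES;
/// the rest of the function is elided and the return handling is a placeholder.
bool MergeTreeData::scheduleDataMovingJob(BackgroundJobsAssignee & assignee)
{
    /// Skip the export-manifest loop while moves are stopped, so a manifest
    /// registered under SYSTEM STOP MOVES stays PENDING until SYSTEM START MOVES.
    /// In-flight parts are already covered by ExportPartTask::isCancelled,
    /// so this only blocks scheduling of new work.
    if (!moves_blocker.isCancelled())
    {
        /// ... existing export-manifest loop: pick up pending manifests and
        /// schedule export jobs on the assignee ...
    }

    /// ... existing regular part-move scheduling; return value unchanged ...
    return false;
}
```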
The test asserts that on a second `SELECT *` over the cached Iceberg table, every byte that was written to or read from the filesystem cache during the first query is read back from cache (i.e. `read_from_cache_second_select == read_from_cache_first_select + written_to_cache_first_select`). It also asserts `S3GetObject == 0` on the second query.

This PR pulls in the `Poco::toLower` fix to the parquet metadata cache condition in `StorageObjectStorageSource::createReader` (PR #1631 / commit `c7fa2310ea7`). Before that fix, Iceberg manifest files report the format as `PARQUET` (uppercase), so the previous check `getFileFormat() == "Parquet"` always failed and the parquet metadata cache was never used for Iceberg tables. After the fix, the lowercased comparison matches and the cache is now active for Iceberg with `use_parquet_metadata_cache = 1` (default) and `input_format_parquet_use_native_reader_v3 = 1` (default in 26.x).

With the parquet metadata cache active, the parsed Parquet metadata (footer + row group metadata) is served from in-memory cache on the second query, so those bytes are no longer pulled through the read buffer and do not register on `CachedReadBufferReadFromCacheBytes`. The test then sees a value of about `1450` on the second select (only the Iceberg JSON / Avro metadata) instead of `1450 + 6144`, and fails as:

    assert (1450 == (1450 + 6144))

This is the same isolation pattern already used by `tests/integration/test_storage_delta/test.py` and `tests/queries/0_stateless/03723_parquet_prefetcher_read_big_at.sql`, where filesystem-cache-focused tests pass `use_parquet_metadata_cache = 0` to keep the parquet metadata cache from interfering with byte-level accounting. Add `use_parquet_metadata_cache = 0` to both `SELECT *` statements in `test_filesystem_cache[s3]` so the test exercises the filesystem cache path in isolation.

Locally the test now passes:

    test_storage_iceberg_with_spark_cache/test_filesystem_cache.py::test_filesystem_cache[s3] PASSED

Addresses 1 failing test in Integration tests (arm_binary, distributed plan, 1/4) on #1685. The other test in the shard (`test_dirty_pages_force_purge`) is listed in `tests/broken_tests.yaml` as `KNOWN: https://github.com/Altinity/ClickHouse/issues/1369` and is unrelated.
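For reference, a simplified sketch of the #1631 condition change described above; the surrounding variable names are assumptions and the real check sits inside `StorageObjectStorageSource::createReader`:

```cpp
#include <Poco/String.h>

/// Iceberg manifests report "PARQUET" in upper case, so the old exact
/// comparison against "Parquet" never matched for Iceberg tables.
/// Compare case-insensitively instead.
const bool parquet_metadata_cache_applies
    = use_parquet_metadata_cache
      && Poco::toLower(object_info->getFileFormat()) == "parquet";
```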
RelEasy
Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):
Add cache for S3 list objects calls and support for exporting MergeTree parts and partitions. Fix Apache Iceberg queries not hitting the parquet metadata cache.
Add cache for S3 list objects calls and support for exporting MergeTree parts and partitions. Fix Apache Iceberg queries not hitting the parquet metadata cache (#1405 by @arthurpassos, #1388 by @arthurpassos, #1593 by @arthurpassos, #1517 by @arthurpassos, #1631 by @arthurpassos).
CI/CD Options
Exclude tests:
Regression jobs to run:
Combined port of 5 PR(s) (group apassos-1). Cherry-picked from #1405, #1388, #1593, #1517, #1631.

#1405: Antalya 26.1 - Forward port of list objects cache #1040
Documentation entry for user-facing changes
Cache for S3 ListObjects calls
#1388: Antalya 26.1 - Forward port of export part and partition
Documentation entry for user-facing changes
Export merge tree part and partition (we still need to rebase #1177 afterwards)
#1593: Export Partition - release the part lock when the query is cancelled
During export partition, parts are locked by replicas for exports. This PR introduces a change that releases these locks when an export task is cancelled. Previously, the lock was not released. We did not catch this error before because the only cancellation cases we had tested were `KILL EXPORT PARTITION` and `DROP TABLE`. In those cases, the entire task is cancelled, so it does not matter if a replica does not release its lock. But a query can also be cancelled with `SYSTEM STOP MOVES`, and in that case the cancellation is a local operation: the lock must be released so other replicas can continue.
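A purely conceptual sketch of the behavior change, with a hypothetical helper and method name (the PR text does not show the actual code):

```cpp
/// Conceptual sketch only: the lock representation (e.g. a node under the
/// table's zookeeper_path) and the names below are not taken from the patch.
void ExportPartTask::onCancel()
{
    /// Before this PR the replica kept its part lock on a purely local
    /// cancellation such as SYSTEM STOP MOVES, so other replicas could never
    /// pick the part up. Release it explicitly; for KILL EXPORT PARTITION and
    /// DROP TABLE the whole task is removed, so the stale lock never mattered.
    releasePartLockForExport();   /// hypothetical helper
}
```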
Documentation entry for user-facing changes
...
#1517: Fix IPartitionStrategy race condition
IPartitionStrategy::computePartitionKey might be called from different threads, and it writes to cached_result concurrently without any sort of protection. It would be easier to add a mutex around it, but we can actually make it lock-free by moving the cache write to the constructor.
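A minimal sketch of the lock-free shape, with illustrative names only (the real `IPartitionStrategy` interface and key computation are more involved):

```cpp
#include <string>
#include <utility>

/// Illustrative sketch only: move the one-time cache write into the
/// constructor so later concurrent calls only read.
class PartitionStrategyExample
{
public:
    explicit PartitionStrategyExample(std::string raw_pattern)
        /// The object is constructed on a single thread, so writing the cache
        /// here is safe and computePartitionKey never has to write shared state.
        : cached_result(precompute(std::move(raw_pattern)))
    {
    }

    const std::string & computePartitionKey() const
    {
        /// Concurrent callers only read: no mutex needed, no data race.
        return cached_result;
    }

private:
    static std::string precompute(std::string raw_pattern)
    {
        /// Stand-in for the real partition-key computation.
        return raw_pattern;
    }

    const std::string cached_result;
};
```

Because `cached_result` is `const` and fully initialized before any other thread can observe the object, readers need no synchronization at all.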
Documentation entry for user-facing changes
...
#1631: Fix condition for using parquet metadata cache
Apache Iceberg queries were not hitting the parquet metadata cache because `object_info->getFileFormat()` resolves to `IcebergDataObjectInfo::getFileFormat`, which gets its return value from `IcebergObjectSerializableInfo`. That field is filled with the value from the Apache Iceberg manifest file, which is upper case by default, so it failed the ClickHouse check for parquet metadata cache usage.

Documentation entry for user-facing changes
...