
26.1 Antalya port - improvements for cluster requests#1414

Merged

zvonand merged 24 commits into antalya-26.1 from frontport/antalya-26.1/rendezvous_hashing on Mar 5, 2026

Conversation


@ianton-ru ianton-ru commented Feb 17, 2026

Changelog category (leave one):

  • New Feature

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):

Frontports for Antalya 26.1

CI/CD Options

Exclude tests:

  • Fast test
  • Integration Tests
  • Stateless tests
  • Stateful tests
  • Performance tests
  • All with ASAN
  • All with TSAN
  • All with MSAN
  • All with UBSAN
  • All with Coverage
  • All with Aarch64
  • All Regression
  • Disable CI Cache

Regression jobs to run:

  • Fast suites (mostly <1h)
  • Aggregate Functions (2h)
  • Alter (1.5h)
  • Benchmark (30m)
  • ClickHouse Keeper (1h)
  • Iceberg (2h)
  • LDAP (1h)
  • Parquet (1.5h)
  • RBAC (1.5h)
  • SSL Server (1h)
  • S3 (2h)
  • Tiered Storage (2h)

@ianton-ru ianton-ru added antalya port-antalya PRs to be ported to all new Antalya releases antalya-26.1 labels Feb 17, 2026

github-actions bot commented Feb 17, 2026

Workflow [PR], commit [7e9b163]

@ianton-ru ianton-ru changed the title from "Frontport/antalya 26.1/rendezvous hashing" to "26.1 Antalya port - improvements for cluster requests" on Feb 17, 2026
@ianton-ru (Author)

@codex review

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment
💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 9bd95cb8cd


Comment on lines +311 to +312:

```cpp
replica_to_files_to_be_processed.erase(number_of_current_replica);
for (const auto & file : processed_file_list_ptr->second)
```


P1: Keep task list before removing lost replica entry

In rescheduleTasksFromReplica, the code erases replica_to_files_to_be_processed[number_of_current_replica] and then immediately iterates processed_file_list_ptr->second. That iterator points to the erased element, so this is undefined behavior on the replica-failover path. When a replica drops mid-query, this can crash the coordinator or prevent tasks from being re-queued reliably.
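The erase-before-iterate hazard can be avoided by detaching the entry from the map before walking it. The following is a minimal sketch of that idea with hypothetical, simplified types (not the actual distributor code): `std::unordered_map::extract` removes the entry while keeping the node, and the caller then owns the file list.

```cpp
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical mini-model of the distributor state. The point of the sketch:
// move the file list out of the map *before* erasing the replica entry,
// instead of erasing first and then dereferencing a dangling iterator.
using FileList = std::vector<std::string>;

inline FileList takeFilesOfLostReplica(
    std::unordered_map<size_t, FileList> & replica_to_files, size_t replica)
{
    auto node = replica_to_files.extract(replica); // detaches the entry intact
    if (node.empty())
        return {};
    return std::move(node.mapped()); // safe: caller owns the data, entry is gone
}
```

An equivalent safe pattern is `auto files = std::move(it->second); map.erase(it);` before iterating `files`.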


Comment on lines +74 to +75:

```cpp
if (read_context.packet.type == Protocol::Server::Data)
    read_context.has_data_packets = true;
```


P1: Ignore empty Data packets when deciding task rerun

The failover guard marks has_data_packets true for any Protocol::Server::Data packet, but empty Data blocks are valid and do not imply rows were processed. If a replica sends only empty Data blocks and then disconnects, the EOF handler will skip the ConnectionLost rerun path (!has_data_packets is false), so unprocessed tasks are not reassigned even with skip_unavailable_shards enabled.
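The fix direction the review implies is to track whether any rows were received rather than whether any Data packet arrived. A minimal sketch with hypothetical types (`DataPacket`, `ReadContext` are stand-ins, not the real `RemoteQueryExecutorReadContext`):

```cpp
#include <cstddef>

// Stand-in for a Protocol::Server::Data packet carrying a block.
struct DataPacket { std::size_t rows; };

struct ReadContext
{
    bool has_data = false;

    // Count progress only when the block actually carries rows:
    // empty Data blocks are legal and imply nothing was processed.
    void onDataPacket(const DataPacket & p)
    {
        if (p.rows > 0)
            has_data = true;
    }

    // Rerunning the replica's tasks is only safe when no rows were seen.
    bool safeToRerunTasks() const { return !has_data; }
};
```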


@ianton-ru (Author)

Failed stateless and integration tests seem unrelated.

```python
# Weird, but it looks like ReadFileMetadata does not use the local file cache in 26.1
# metadata.json is always downloaded in 26.1, once per query or subquery
# In 25.8 the count was equal to expected; in 26.1 it is expected * 3 + 1 for the Local case
# and expected * 3 + 4 for the Cluster case, because each subquery loads metadata.json
```
@ianton-ru (Author)

My guess: it is an effect of ClickHouse#93416.


zvonand commented Mar 5, 2026

03321_clickhouse_local_initialization_not_too_slow_even_under_sanitizers -- does not look related and/or critical

@zvonand zvonand merged commit 05010e8 into antalya-26.1 Mar 5, 2026
881 of 945 checks passed
@CarlosFelipeOR (Collaborator)

PR #1414 Audit Review

AI review comment (model: gpt-5.3-codex)

1) Scope and partitions

PR reviewed: Altinity/ClickHouse#1414

Large PR; reviewed in functional partitions:

  • Partition A (task distribution / retries / swarm): StorageObjectStorageStableTaskDistributor, StorageObjectStorageCluster, RemoteQueryExecutor*, Context, ClusterDiscovery, MultiplexedConnections
  • Partition B (task payload protocol): ClusterFunctionReadTask*, IObjectStorage::{RelativePathWithMetadata, CommandInTaskResponse}, Protocol
  • Partition C (Iceberg read optimization): IDataLakeMetadata, StorageObjectStorageSource, Range
  • Partition D (control/observability/docs): settings, access rights, profile/current metrics, SYSTEM docs/tests

2) Call graph

Main changed runtime path for cluster table functions:

  • initiator -> RemoteQueryExecutor::sendReadTaskRequestAssumeLocked()
  • worker task source -> TaskIterator::operator() (object-storage/file/url variants)
  • object-storage scheduling -> StorageObjectStorageStableTaskDistributor::getNextTask()
  • response encode/decode -> ClusterFunctionReadTaskResponse::{serialize, deserialize, getObjectInfo}
  • worker read loop -> StorageObjectStorageSource::createReader()
  • connection-loss path -> RemoteQueryExecutorReadContext::Task::run() -> synthetic Protocol::Server::ConnectionLost -> RemoteQueryExecutor::processPacket() -> TaskIterator::rescheduleTasksFromReplica()

State/caches touched:

  • distributor queues: connection_to_files, unprocessed_files, replica_to_files_to_be_processed, last_node_activity
  • swarm mode gates in Context and cluster registration in ClusterDiscovery
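The distributor's file-to-replica pinning (the branch name is `rendezvous_hashing`, and `getReplicaForFile` appears in the audit evidence) is based on rendezvous hashing. A minimal sketch of the scheme, not the ClickHouse implementation: `std::hash` stands in for the real hash function, and the replica set is assumed fixed for the query.

```cpp
#include <cstddef>
#include <functional>
#include <string>

// Rendezvous (highest-random-weight) hashing: score every (file, replica)
// pair and pick the replica with the highest score. The key property is
// that removing one replica only reassigns the files that were pinned to it;
// all other files keep their replica, which preserves cache locality.
inline std::size_t getReplicaForFile(const std::string & file_id, std::size_t replicas)
{
    std::size_t best_replica = 0;
    std::size_t best_score = 0;
    for (std::size_t r = 0; r < replicas; ++r)
    {
        std::size_t score = std::hash<std::string>{}(file_id + "#" + std::to_string(r));
        if (r == 0 || score > best_score)
        {
            best_score = score;
            best_replica = r;
        }
    }
    return best_replica;
}
```

Because the mapping is a pure function of the file identifier and the replica set, every node can compute it independently without coordination.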

3) Transition matrix

| ID | Transition | Invariant |
|----|------------|-----------|
| T1 | iterator object -> stable identifier -> queue key | same identifier function must be used on enqueue/dequeue/reschedule |
| T2 | processed-task tracking -> replica-loss reschedule | must preserve task granularity (file-bucket/archive unit) |
| T3 | serialized task path -> ObjectInfo reconstruction | plain paths must never be reinterpreted as a control command |
| T4 | retry command task -> reader loop | control command must not be treated as a data path |
| T5 | connection-loss rerun | only safe to rerun unprocessed units |

4) Logical code-path testing summary

Reviewed changed branches:

  • matched/prequeued/unprocessed selection in distributor
  • delayed stealing (lock_object_storage_task_distribution_ms) and retry command response
  • rerun-on-connection-lost when no data packet seen
  • command-in-path parsing and reader handling
  • constant-column pruning path from Iceberg metadata

Concurrency/interleavings checked:

  • multiple replicas requesting tasks with shared distributor mutex
  • replica failure while tasks already assigned but before data packet
  • retry-command assignment + subsequent replica loss

5) Fault categories and category-by-category injection results

| Fault category | Status | Outcome | Defects |
|----------------|--------|---------|---------|
| Identifier consistency across queue lifecycle | Executed | Fail | 1 |
| Retry/reschedule correctness under node loss | Executed | Fail | 1 (same root cause) |
| Command/data boundary in task protocol | Executed | Fail | 1 |
| Concurrency/race/deadlock | Executed | Pass | 0 |
| Exception/partial-update behavior | Executed | Pass | 0 |
| Error-contract consistency | Executed | Pass | 0 |
| Performance/resource risks | Executed | Pass | 0 |

6) Confirmed defects (High/Medium/Low)

High — Reschedule path uses a different task identity than normal scheduling, causing task collapse/misrouting after replica loss

  • Impact: possible data loss or incorrect re-execution granularity when a replica is lost before producing data; especially for split-file buckets and archive modes.
  • Where: StorageObjectStorageStableTaskDistributor::rescheduleTasksFromReplica
  • Trigger condition: replica connection loss before data packet + tasks containing bucket-level identity or archive-derived identity.
  • Affected transition: T1/T2.
  • Why defect: normal scheduling keys/hashes by getIdentifier() (or archive path in special mode), but reschedule rehashes/keys by getPath().

Evidence:

```cpp
// rescheduleTasksFromReplica: re-keys by plain getPath()
for (const auto & file : processed_file_list_ptr->second)
{
    auto file_replica_idx = getReplicaForFile(file->getPath());
    unprocessed_files.emplace(file->getPath(), std::make_pair(file, file_replica_idx));
    connection_to_files[file_replica_idx].push_back(file);
}
```

```cpp
// normal enqueue path: keys by the bucket/archive-aware identifier
String file_identifier;
if (send_over_whole_archive && object_info->isArchive())
    file_identifier = object_info->getPathOrPathToArchiveIfArchive();
else
    file_identifier = object_info->getIdentifier();
```

```cpp
String ObjectInfo::getIdentifier() const
{
    String result = getPath();
    if (file_bucket_info)
        result += file_bucket_info->getIdentifier();
    ...
}
```

  • Smallest logical repro: split one file into multiple file_bucket_info tasks; lose the assigned replica before the first data packet; the reschedule path inserts by the plain getPath() key, so bucket identities collide (emplace drops duplicates).
  • Likely fix direction: in reschedule, reuse the same identifier logic as the enqueue path (getIdentifier() / archive-aware rule), both for hashing and for the unprocessed_files key.
  • Regression test direction: fail one replica before data while bucket-split is enabled; assert all buckets are eventually processed exactly once.
  • Blast radius: object-storage cluster reads (s3Cluster, icebergCluster, etc.) under failure/retry conditions.
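The fix direction can be illustrated by factoring the key rule into one helper that both enqueue and reschedule call, so the two paths cannot diverge. This is a hypothetical simplification (the `ObjectInfo` fields here are stand-ins for the real types), not the actual patch:

```cpp
#include <string>

// Simplified stand-in for the real ObjectInfo: bucket_suffix models the
// extra identity contributed by file_bucket_info.
struct ObjectInfo
{
    std::string path;
    std::string bucket_suffix;
    bool is_archive = false;
    std::string archive_path;

    std::string getPath() const { return path; }
    std::string getIdentifier() const { return path + bucket_suffix; }
    std::string getPathOrPathToArchiveIfArchive() const
    {
        return is_archive ? archive_path : path;
    }
};

// Single source of truth for the queue/hash key, mirroring the enqueue-side
// rule quoted in the evidence; the reschedule path would call this too,
// instead of falling back to getPath().
inline std::string taskKey(const ObjectInfo & o, bool send_over_whole_archive)
{
    if (send_over_whole_archive && o.is_archive)
        return o.getPathOrPathToArchiveIfArchive();
    return o.getIdentifier(); // bucket-aware: distinct buckets stay distinct
}
```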

Medium — Any JSON-looking file path is interpreted as a control command and can be dropped as empty path

  • Impact: valid object/file keys that are JSON objects (or command-shaped strings) can be misparsed as control commands, leading to skipped data.
  • Where: RelativePathWithMetadata::CommandInTaskResponse parsing + RelativePathWithMetadata constructor + reader empty-path check.
  • Affected transition: T3/T4.
  • Why defect: command parser marks any parsed JSON object as valid command even without required command fields (file_path/retry_after_us), then constructor replaces relative_path with empty fallback.

Evidence:

```cpp
// CommandInTaskResponse: any string that parses as a JSON object is marked valid
auto json = parser.parse(task).extract<Poco::JSON::Object::Ptr>();
if (!json)
    return;

is_valid = true;

if (json->has("file_path"))
    file_path = json->getValue<std::string>("file_path");
if (json->has("retry_after_us"))
    retry_after_us = json->getValue<size_t>("retry_after_us");
```

```cpp
// constructor replaces the path with an empty fallback for "valid" commands
explicit RelativePathWithMetadata(String command_or_path, std::optional<ObjectMetadata> metadata_ = std::nullopt)
    : relative_path(std::move(command_or_path))
    , metadata(std::move(metadata_))
    , command(relative_path)
{
    if (command.isValid())
    {
        relative_path = command.getFilePath().value_or("");
        ...
    }
}
```

```cpp
// reader: command handling, then the empty-path early return
if (object_info->relative_path_with_metadata.getCommand().isValid())
{
    auto retry_after_us = object_info->relative_path_with_metadata.getCommand().getRetryAfterUs();
    ...
}
if (object_info->getPath().empty())
    return {};
```

  • Smallest logical repro: a task path string equals a JSON object without file_path; it becomes a valid command with an empty path, and the reader returns end-of-stream for that task.
  • Likely fix direction: require an explicit command marker/version field before treating the payload as a command; otherwise keep it as a plain path.
  • Regression test direction: a cluster task with a JSON-like literal path should still be read as a path unless the marker is present.
  • Blast radius: protocol/task parsing edge cases in cluster reads.
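The suggested fix direction, gating command interpretation on an explicit marker, can be sketched as follows. The JSON parsing itself is elided; `ParsedJson` is a hypothetical stand-in for the parser's result, and the field names are assumptions, not the real protocol:

```cpp
#include <optional>
#include <string>

// Hypothetical result of JSON parsing: whether the payload was a JSON
// object, whether it carried a dedicated command marker field, and the
// optional file_path it named.
struct ParsedJson
{
    bool is_object = false;
    bool has_command_marker = false;
    std::optional<std::string> file_path;
};

struct CommandInTaskResponse
{
    bool is_valid = false;
    std::optional<std::string> file_path;

    explicit CommandInTaskResponse(const ParsedJson & json)
    {
        // Require the explicit marker, not merely "parses as a JSON object":
        // a data path that happens to look like JSON stays a plain path.
        if (!json.is_object || !json.has_command_marker)
            return;
        is_valid = true;
        file_path = json.file_path;
    }
};
```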

7) Coverage accounting + stop-condition status

  • Call-graph coverage: reviewed all changed nodes in partitions A-C relevant to task scheduling/retry/protocol and Iceberg optimization plumbing.
  • Transitions reviewed: T1-T5 all covered.
  • Fault category completion: all listed categories executed; none deferred.
  • Coverage stop condition: met (all in-scope nodes/transitions/categories reviewed or explicitly passed).

8) Assumptions & Limits

  • Static audit only; no runtime execution in this pass.
  • Failure scenarios reasoned from code paths and state transitions, not live chaos tests.
  • No sanitizer/TSAN run performed here.

9) Confidence rating and confidence-raising evidence

  • Overall confidence: Medium-High.
  • To raise confidence to High:
    • run integration test with induced replica loss before first data packet under bucket-splitting/archive modes,
    • add protocol parsing test for JSON-like path payloads,
    • run failure-injection regression for s3Cluster/icebergCluster.

10) Residual risks and untested paths

  • Residual risk around mixed archive + bucket-split + retry-command interleavings.
  • Swarm mode register/unregister behavior under repeated toggles appears reasonable but not dynamically validated here.
  • Iceberg constant-column optimization correctness depends on metadata quality; runtime matrix not executed.

@CarlosFelipeOR (Collaborator)

QA Verification

PR #1414 is a frontport of 9 Antalya sub-PRs from 25.8 to 26.1, covering swarm mode, Iceberg read optimization, cluster task distribution improvements, and crash fixes.

Verdict: All new integration tests pass, no CI failures are related to the PR. Approved.

CI Failures Analysis (non-regression)

All builds passed. No test failures are related to PR #1414.

1. test_export_merge_tree_part_to_object_storage / 03572_export_merge_tree_part_limits_and_table_functions

Tests: test_add_column_during_export, test_drop_column_during_export_snapshot, 03572_export_merge_tree_part_limits_and_table_functions
Root cause: Introduced by PR #1388 (export part/partition forward port). Affects all antalya-26.1 PRs. Fixed by PR #1478.
Related to PR #1414: No

2. 01171_mv_select_insert_isolation_long

Job: Stateless tests (arm_asan, targeted)
Root cause: Known flaky test. Fails across multiple branches (antalya-25.6.5, 25.8, 26.1) and multiple PRs. Race condition in materialized view isolation timing.
Related to PR #1414: No

3. 03321_clickhouse_local_initialization_not_too_slow_even_under_sanitizers

Job: Stateless tests (arm_asan, targeted)
Root cause: Known flaky test under sanitizer builds. Fails across antalya-25.8, 26.1 and upstream ClickHouse CI. The test checks clickhouse-local startup time, which exceeds the threshold due to ASAN overhead.
Related to PR #1414: No

4. Integration tests (arm_binary, distributed plan, 3/4)

Status: CANCELLED / ERROR
Root cause: ARM CI infrastructure timeout issue, being fixed by PR #1466. Affects all antalya-26.1 PRs.
Related to PR #1414: No

New Integration Tests Added by PR

| Test | Coverage | CI result |
|------|----------|-----------|
| test_s3_cache_locality (3 tests) | Cache locality, rendezvous hashing, lock_object_storage_task_distribution_ms | OK on amd_binary/arm_binary; BROKEN on ASAN/TSAN |
| test_s3_cluster::test_graceful_shutdown | Graceful shutdown of S3 cluster node (SYSTEM STOP SWARM MODE) | OK on all builds |
| test_read_constant_columns_optimization (4 variants: S3/Azure × True/False) | Iceberg constant columns read optimization | OK on all builds |
| gtest_rendezvous_hashing (unit test, updated) | Rendezvous hashing algorithm | OK |

Note: Sub-PR #1201 (segfault on unexpected node shutdown) does not have a dedicated test. The scenario would be covered by the swarms/node_failure regression suite, but it cannot run yet due to the missing object_storage_cluster setting.

Regression Tests (Altinity/clickhouse-regression)

All failures are pre-existing and affect all antalya-26.1 PRs:

| Suite | x86 | arm64 | Reason |
|-------|-----|-------|--------|
| swarms | FAIL | FAIL | object_storage_cluster setting not yet forward-ported to 26.1 (will be added by PR #1390) |
| iceberg_1, iceberg_2 | FAIL | FAIL | same missing object_storage_cluster setting; swarm and iterator race condition tests use it |
| parquet | FAIL | FAIL | Antalya features still being forward-ported to 26.1 |
| settings | FAIL | FAIL | settings list/default values not updated for 26.1 yet |
The swarms and iceberg suites test functionality related to this PR, but all tests fail immediately because the object_storage_cluster setting (String, cluster name) does not exist yet in antalya-26.1. It exists in antalya-25.8 and will be forward-ported by PR #1390.

The new integration tests added by this PR (test_graceful_shutdown, test_s3_cache_locality, test_read_constant_columns_optimization) confirm the PR code works correctly.

@ianton-ru (Author)

@CarlosFelipeOR
"Reschedule path uses a different task identity": looks like a bug introduced by the frontport; it makes sense to open an issue.
"Any JSON-looking file path is interpreted as a control command": this is a compromise. We can't extend the protocol only in the Antalya branch, because that would produce conflicts with future versions when upstream extends the protocol in a different way. So this JSON is a dirty hack to send some additional info to the nodes.

@CarlosFelipeOR (Collaborator)

Issue created for "Reschedule path uses a different task identity" : #1486

@CarlosFelipeOR CarlosFelipeOR added the verified-with-issue Verified by QA and issue(s) found. label Mar 6, 2026