
26.1 Antalya port - improvements for cluster requests#1414

Merged

zvonand merged 24 commits into antalya-26.1 from frontport/antalya-26.1/rendezvous_hashing on Mar 5, 2026

Conversation


@ianton-ru ianton-ru commented Feb 17, 2026

Changelog category (leave one):

  • New Feature

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):

Frontports for Antalya 26.1

CI/CD Options

Exclude tests:

  • Fast test
  • Integration Tests
  • Stateless tests
  • Stateful tests
  • Performance tests
  • All with ASAN
  • All with TSAN
  • All with MSAN
  • All with UBSAN
  • All with Coverage
  • All with Aarch64
  • All Regression
  • Disable CI Cache

Regression jobs to run:

  • Fast suites (mostly <1h)
  • Aggregate Functions (2h)
  • Alter (1.5h)
  • Benchmark (30m)
  • ClickHouse Keeper (1h)
  • Iceberg (2h)
  • LDAP (1h)
  • Parquet (1.5h)
  • RBAC (1.5h)
  • SSL Server (1h)
  • S3 (2h)
  • Tiered Storage (2h)

@ianton-ru ianton-ru added antalya port-antalya PRs to be ported to all new Antalya releases antalya-26.1 labels Feb 17, 2026

github-actions bot commented Feb 17, 2026

Workflow [PR], commit [7e9b163]

@ianton-ru ianton-ru changed the title from "Frontport/antalya 26.1/rendezvous hashing" to "26.1 Antalya port - improvements for cluster requests" on Feb 17, 2026
@ianton-ru (Author)

@codex review

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment
💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 9bd95cb8cd


Comment on lines +311 to +312:

```cpp
replica_to_files_to_be_processed.erase(number_of_current_replica);
for (const auto & file : processed_file_list_ptr->second)
```


P1: Keep task list before removing lost replica entry

In rescheduleTasksFromReplica, the code erases replica_to_files_to_be_processed[number_of_current_replica] and then immediately iterates processed_file_list_ptr->second. That iterator points to the erased element, so this is undefined behavior on the replica-failover path. When a replica drops mid-query, this can crash the coordinator or prevent tasks from being re-queued reliably.
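The erase-before-iterate hazard can be avoided by detaching the entry from the map before walking it. The following is a minimal sketch of that idea with hypothetical, simplified types (not the actual distributor code): `std::unordered_map::extract` removes the entry while keeping the node, and the caller then owns the file list.

```cpp
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical mini-model of the distributor state. The point of the sketch:
// move the file list out of the map *before* erasing the replica entry,
// instead of erasing first and then dereferencing a dangling iterator.
using FileList = std::vector<std::string>;

inline FileList takeFilesOfLostReplica(
    std::unordered_map<size_t, FileList> & replica_to_files, size_t replica)
{
    auto node = replica_to_files.extract(replica); // detaches the entry intact
    if (node.empty())
        return {};
    return std::move(node.mapped()); // safe: caller owns the data, entry is gone
}
```

An equivalent safe pattern is `auto files = std::move(it->second); map.erase(it);` before iterating `files`.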


Comment on lines +74 to +75:

```cpp
if (read_context.packet.type == Protocol::Server::Data)
    read_context.has_data_packets = true;
```


P1: Ignore empty Data packets when deciding task rerun

The failover guard marks has_data_packets true for any Protocol::Server::Data packet, but empty Data blocks are valid and do not imply rows were processed. If a replica sends only empty Data blocks and then disconnects, the EOF handler will skip the ConnectionLost rerun path (!has_data_packets is false), so unprocessed tasks are not reassigned even with skip_unavailable_shards enabled.
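The fix direction the review implies is to track whether any rows were received rather than whether any Data packet arrived. A minimal sketch with hypothetical types (`DataPacket`, `ReadContext` are stand-ins, not the real `RemoteQueryExecutorReadContext`):

```cpp
#include <cstddef>

// Stand-in for a Protocol::Server::Data packet carrying a block.
struct DataPacket { std::size_t rows; };

struct ReadContext
{
    bool has_data = false;

    // Count progress only when the block actually carries rows:
    // empty Data blocks are legal and imply nothing was processed.
    void onDataPacket(const DataPacket & p)
    {
        if (p.rows > 0)
            has_data = true;
    }

    // Rerunning the replica's tasks is only safe when no rows were seen.
    bool safeToRerunTasks() const { return !has_data; }
};
```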


@ianton-ru (Author)

Failed stateless and integration tests seem unrelated.

```python
# Weird, but it looks like ReadFileMetadata does not use the local file cache in 26.1
# metadata.json is always downloaded in 26.1, once per query or subquery
# In 25.8 the count was equal to expected; in 26.1 it is expected * 3 + 1 for the Local case
# and expected * 3 + 4 for the Cluster case, because each subquery loads metadata.json
```
@ianton-ru (Author)

My guess: it is an effect of ClickHouse#93416.


zvonand commented Mar 5, 2026

03321_clickhouse_local_initialization_not_too_slow_even_under_sanitizers -- does not look related and/or critical

@zvonand zvonand merged commit 05010e8 into antalya-26.1 Mar 5, 2026
881 of 945 checks passed
@CarlosFelipeOR (Collaborator)

PR #1414 Audit Review

AI review comment (model: gpt-5.3-codex)

1) Scope and partitions

PR reviewed: Altinity/ClickHouse#1414

Large PR; reviewed in functional partitions:

  • Partition A (task distribution / retries / swarm): StorageObjectStorageStableTaskDistributor, StorageObjectStorageCluster, RemoteQueryExecutor*, Context, ClusterDiscovery, MultiplexedConnections
  • Partition B (task payload protocol): ClusterFunctionReadTask*, IObjectStorage::{RelativePathWithMetadata, CommandInTaskResponse}, Protocol
  • Partition C (Iceberg read optimization): IDataLakeMetadata, StorageObjectStorageSource, Range
  • Partition D (control/observability/docs): settings, access rights, profile/current metrics, SYSTEM docs/tests

2) Call graph

Main changed runtime path for cluster table functions:

  • initiator -> RemoteQueryExecutor::sendReadTaskRequestAssumeLocked()
  • worker task source -> TaskIterator::operator() (object-storage/file/url variants)
  • object-storage scheduling -> StorageObjectStorageStableTaskDistributor::getNextTask()
  • response encode/decode -> ClusterFunctionReadTaskResponse::{serialize, deserialize, getObjectInfo}
  • worker read loop -> StorageObjectStorageSource::createReader()
  • connection-loss path -> RemoteQueryExecutorReadContext::Task::run() -> synthetic Protocol::Server::ConnectionLost -> RemoteQueryExecutor::processPacket() -> TaskIterator::rescheduleTasksFromReplica()

State/caches touched:

  • distributor queues: connection_to_files, unprocessed_files, replica_to_files_to_be_processed, last_node_activity
  • swarm mode gates in Context and cluster registration in ClusterDiscovery
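The distributor's file-to-replica pinning (the branch name is `rendezvous_hashing`, and `getReplicaForFile` appears in the audit evidence) is based on rendezvous hashing. A minimal sketch of the scheme, not the ClickHouse implementation: `std::hash` stands in for the real hash function, and the replica set is assumed fixed for the query.

```cpp
#include <cstddef>
#include <functional>
#include <string>

// Rendezvous (highest-random-weight) hashing: score every (file, replica)
// pair and pick the replica with the highest score. The key property is
// that removing one replica only reassigns the files that were pinned to it;
// all other files keep their replica, which preserves cache locality.
inline std::size_t getReplicaForFile(const std::string & file_id, std::size_t replicas)
{
    std::size_t best_replica = 0;
    std::size_t best_score = 0;
    for (std::size_t r = 0; r < replicas; ++r)
    {
        std::size_t score = std::hash<std::string>{}(file_id + "#" + std::to_string(r));
        if (r == 0 || score > best_score)
        {
            best_score = score;
            best_replica = r;
        }
    }
    return best_replica;
}
```

Because the mapping is a pure function of the file identifier and the replica set, every node can compute it independently without coordination.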

3) Transition matrix

| ID | Transition | Invariant |
|----|------------|-----------|
| T1 | iterator object -> stable identifier -> queue key | same identifier function must be used on enqueue/dequeue/reschedule |
| T2 | processed-task tracking -> replica-loss reschedule | must preserve task granularity (file-bucket/archive unit) |
| T3 | serialized task path -> ObjectInfo reconstruction | plain paths must never be reinterpreted as a control command |
| T4 | retry command task -> reader loop | control command must not be treated as a data path |
| T5 | connection-loss rerun | only safe to rerun unprocessed units |

4) Logical code-path testing summary

Reviewed changed branches:

  • matched/prequeued/unprocessed selection in distributor
  • delayed stealing (lock_object_storage_task_distribution_ms) and retry command response
  • rerun-on-connection-lost when no data packet seen
  • command-in-path parsing and reader handling
  • constant-column pruning path from Iceberg metadata

Concurrency/interleavings checked:

  • multiple replicas requesting tasks with shared distributor mutex
  • replica failure while tasks already assigned but before data packet
  • retry-command assignment + subsequent replica loss

5) Fault categories and category-by-category injection results

| Fault category | Status | Outcome | Defects |
|----------------|--------|---------|---------|
| Identifier consistency across queue lifecycle | Executed | Fail | 1 |
| Retry/reschedule correctness under node loss | Executed | Fail | 1 (same root cause) |
| Command/data boundary in task protocol | Executed | Fail | 1 |
| Concurrency/race/deadlock | Executed | Pass | 0 |
| Exception/partial-update behavior | Executed | Pass | 0 |
| Error-contract consistency | Executed | Pass | 0 |
| Performance/resource risks | Executed | Pass | 0 |

6) Confirmed defects (High/Medium/Low)

High — Reschedule path uses a different task identity than normal scheduling, causing task collapse/misrouting after replica loss

  • Impact: possible data loss or incorrect re-execution granularity when a replica is lost before producing data; especially for split-file buckets and archive modes.
  • Where: StorageObjectStorageStableTaskDistributor::rescheduleTasksFromReplica
  • Trigger condition: replica connection loss before data packet + tasks containing bucket-level identity or archive-derived identity.
  • Affected transition: T1/T2.
  • Why defect: normal scheduling keys/hashes by getIdentifier() (or archive path in special mode), but reschedule rehashes/keys by getPath().

Evidence:

```cpp
// rescheduleTasksFromReplica: re-keys by plain getPath()
for (const auto & file : processed_file_list_ptr->second)
{
    auto file_replica_idx = getReplicaForFile(file->getPath());
    unprocessed_files.emplace(file->getPath(), std::make_pair(file, file_replica_idx));
    connection_to_files[file_replica_idx].push_back(file);
}
```

```cpp
// normal enqueue path: keys by the bucket/archive-aware identifier
String file_identifier;
if (send_over_whole_archive && object_info->isArchive())
    file_identifier = object_info->getPathOrPathToArchiveIfArchive();
else
    file_identifier = object_info->getIdentifier();
```

```cpp
String ObjectInfo::getIdentifier() const
{
    String result = getPath();
    if (file_bucket_info)
        result += file_bucket_info->getIdentifier();
    ...
}
```

  • Smallest logical repro: split one file into multiple file_bucket_info tasks; lose the assigned replica before the first data packet; the reschedule path inserts by the plain getPath() key, so bucket identities collide (emplace drops duplicates).
  • Likely fix direction: in reschedule, reuse the same identifier logic as the enqueue path (getIdentifier() / archive-aware rule), both for hashing and for the unprocessed_files key.
  • Regression test direction: fail one replica before data while bucket-split is enabled; assert all buckets are eventually processed exactly once.
  • Blast radius: object-storage cluster reads (s3Cluster, icebergCluster, etc.) under failure/retry conditions.
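The fix direction can be illustrated by factoring the key rule into one helper that both enqueue and reschedule call, so the two paths cannot diverge. This is a hypothetical simplification (the `ObjectInfo` fields here are stand-ins for the real types), not the actual patch:

```cpp
#include <string>

// Simplified stand-in for the real ObjectInfo: bucket_suffix models the
// extra identity contributed by file_bucket_info.
struct ObjectInfo
{
    std::string path;
    std::string bucket_suffix;
    bool is_archive = false;
    std::string archive_path;

    std::string getPath() const { return path; }
    std::string getIdentifier() const { return path + bucket_suffix; }
    std::string getPathOrPathToArchiveIfArchive() const
    {
        return is_archive ? archive_path : path;
    }
};

// Single source of truth for the queue/hash key, mirroring the enqueue-side
// rule quoted in the evidence; the reschedule path would call this too,
// instead of falling back to getPath().
inline std::string taskKey(const ObjectInfo & o, bool send_over_whole_archive)
{
    if (send_over_whole_archive && o.is_archive)
        return o.getPathOrPathToArchiveIfArchive();
    return o.getIdentifier(); // bucket-aware: distinct buckets stay distinct
}
```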

Medium — Any JSON-looking file path is interpreted as a control command and can be dropped as empty path

  • Impact: valid object/file keys that are JSON objects (or command-shaped strings) can be misparsed as control commands, leading to skipped data.
  • Where: RelativePathWithMetadata::CommandInTaskResponse parsing + RelativePathWithMetadata constructor + reader empty-path check.
  • Affected transition: T3/T4.
  • Why defect: command parser marks any parsed JSON object as valid command even without required command fields (file_path/retry_after_us), then constructor replaces relative_path with empty fallback.

Evidence:

```cpp
// CommandInTaskResponse: any string that parses as a JSON object is marked valid
auto json = parser.parse(task).extract<Poco::JSON::Object::Ptr>();
if (!json)
    return;

is_valid = true;

if (json->has("file_path"))
    file_path = json->getValue<std::string>("file_path");
if (json->has("retry_after_us"))
    retry_after_us = json->getValue<size_t>("retry_after_us");
```

```cpp
// constructor replaces the path with an empty fallback for "valid" commands
explicit RelativePathWithMetadata(String command_or_path, std::optional<ObjectMetadata> metadata_ = std::nullopt)
    : relative_path(std::move(command_or_path))
    , metadata(std::move(metadata_))
    , command(relative_path)
{
    if (command.isValid())
    {
        relative_path = command.getFilePath().value_or("");
        ...
    }
}
```

```cpp
// reader: command handling, then the empty-path early return
if (object_info->relative_path_with_metadata.getCommand().isValid())
{
    auto retry_after_us = object_info->relative_path_with_metadata.getCommand().getRetryAfterUs();
    ...
}
if (object_info->getPath().empty())
    return {};
```

  • Smallest logical repro: a task path string equals a JSON object without file_path; it becomes a valid command with an empty path, and the reader returns end-of-stream for that task.
  • Likely fix direction: require an explicit command marker/version field before treating the payload as a command; otherwise keep it as a plain path.
  • Regression test direction: a cluster task with a JSON-like literal path should still be read as a path unless the marker is present.
  • Blast radius: protocol/task parsing edge cases in cluster reads.
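The suggested fix direction, gating command interpretation on an explicit marker, can be sketched as follows. The JSON parsing itself is elided; `ParsedJson` is a hypothetical stand-in for the parser's result, and the field names are assumptions, not the real protocol:

```cpp
#include <optional>
#include <string>

// Hypothetical result of JSON parsing: whether the payload was a JSON
// object, whether it carried a dedicated command marker field, and the
// optional file_path it named.
struct ParsedJson
{
    bool is_object = false;
    bool has_command_marker = false;
    std::optional<std::string> file_path;
};

struct CommandInTaskResponse
{
    bool is_valid = false;
    std::optional<std::string> file_path;

    explicit CommandInTaskResponse(const ParsedJson & json)
    {
        // Require the explicit marker, not merely "parses as a JSON object":
        // a data path that happens to look like JSON stays a plain path.
        if (!json.is_object || !json.has_command_marker)
            return;
        is_valid = true;
        file_path = json.file_path;
    }
};
```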

7) Coverage accounting + stop-condition status

  • Call-graph coverage: reviewed all changed nodes in partitions A-C relevant to task scheduling/retry/protocol and Iceberg optimization plumbing.
  • Transitions reviewed: T1-T5 all covered.
  • Fault category completion: all listed categories executed; none deferred.
  • Coverage stop condition: met (all in-scope nodes/transitions/categories reviewed or explicitly passed).

8) Assumptions & Limits

  • Static audit only; no runtime execution in this pass.
  • Failure scenarios reasoned from code paths and state transitions, not live chaos tests.
  • No sanitizer/TSAN run performed here.

9) Confidence rating and confidence-raising evidence

  • Overall confidence: Medium-High.
  • To raise confidence to High:
    • run integration test with induced replica loss before first data packet under bucket-splitting/archive modes,
    • add protocol parsing test for JSON-like path payloads,
    • run failure-injection regression for s3Cluster/icebergCluster.

10) Residual risks and untested paths

  • Residual risk around mixed archive + bucket-split + retry-command interleavings.
  • Swarm mode register/unregister behavior under repeated toggles appears reasonable but not dynamically validated here.
  • Iceberg constant-column optimization correctness depends on metadata quality; runtime matrix not executed.

@CarlosFelipeOR (Collaborator)

QA Verification

PR #1414 is a frontport of 9 Antalya sub-PRs from 25.8 to 26.1, covering swarm mode, Iceberg read optimization, cluster task distribution improvements, and crash fixes.

Verdict: All new integration tests pass, no CI failures are related to the PR. Approved.

CI Failures Analysis (non-regression)

All builds passed. No test failures are related to PR #1414.

1. test_export_merge_tree_part_to_object_storage / 03572_export_merge_tree_part_limits_and_table_functions

Tests: test_add_column_during_export, test_drop_column_during_export_snapshot, 03572_export_merge_tree_part_limits_and_table_functions
Root cause: Introduced by PR #1388 (export part/partition forward port). Affects all antalya-26.1 PRs. Fixed by PR #1478.
Related to PR #1414: No

2. 01171_mv_select_insert_isolation_long

Job: Stateless tests (arm_asan, targeted)
Root cause: Known flaky test. Fails across multiple branches (antalya-25.6.5, 25.8, 26.1) and multiple PRs. Race condition in materialized view isolation timing.
Related to PR #1414: No

3. 03321_clickhouse_local_initialization_not_too_slow_even_under_sanitizers

Job: Stateless tests (arm_asan, targeted)
Root cause: Known flaky test under sanitizer builds. Fails across antalya-25.8, 26.1 and upstream ClickHouse CI. The test checks clickhouse-local startup time, which exceeds the threshold due to ASAN overhead.
Related to PR #1414: No

4. Integration tests (arm_binary, distributed plan, 3/4)

Status: CANCELLED / ERROR
Root cause: ARM CI infrastructure timeout issue, being fixed by PR #1466. Affects all antalya-26.1 PRs.
Related to PR #1414: No

New Integration Tests Added by PR

| Test | Coverage | CI result |
|------|----------|-----------|
| test_s3_cache_locality (3 tests) | Cache locality, rendezvous hashing, lock_object_storage_task_distribution_ms | OK on amd_binary/arm_binary; BROKEN on ASAN/TSAN |
| test_s3_cluster::test_graceful_shutdown | Graceful shutdown of S3 cluster node (SYSTEM STOP SWARM MODE) | OK on all builds |
| test_read_constant_columns_optimization (4 variants: S3/Azure × True/False) | Iceberg constant columns read optimization | OK on all builds |
| gtest_rendezvous_hashing (unit test, updated) | Rendezvous hashing algorithm | OK |

Note: Sub-PR #1201 (segfault on unexpected node shutdown) does not have a dedicated test. The scenario would be covered by the swarms/node_failure regression suite, but it cannot run yet due to the missing object_storage_cluster setting.

Regression Tests (Altinity/clickhouse-regression)

All failures are pre-existing and affect all antalya-26.1 PRs:

| Suite | x86 | arm64 | Reason |
|-------|-----|-------|--------|
| swarms | FAIL | FAIL | object_storage_cluster setting not yet forward-ported to 26.1 (will be added by PR #1390) |
| iceberg_1, iceberg_2 | FAIL | FAIL | same missing object_storage_cluster setting; swarm and iterator race condition tests use it |
| parquet | FAIL | FAIL | Antalya features still being forward-ported to 26.1 |
| settings | FAIL | FAIL | settings list/default values not updated for 26.1 yet |
The swarms and iceberg suites test functionality related to this PR, but all tests fail immediately because the object_storage_cluster setting (String, cluster name) does not exist yet in antalya-26.1. It exists in antalya-25.8 and will be forward-ported by PR #1390.

The new integration tests added by this PR (test_graceful_shutdown, test_s3_cache_locality, test_read_constant_columns_optimization) confirm the PR code works correctly.

@ianton-ru (Author)

@CarlosFelipeOR
"Reschedule path uses a different task identity": looks like a bug introduced by the frontport; it makes sense to open an issue.
"Any JSON-looking file path is interpreted as a control command": this is a compromise. We can't extend the protocol only in the Antalya branch, because that would produce conflicts with future versions when upstream extends the protocol in a different way. So this JSON is a dirty hack to send some additional info to the nodes.

@CarlosFelipeOR (Collaborator)

Issue created for "Reschedule path uses a different task identity" : #1486

@CarlosFelipeOR CarlosFelipeOR added the verified-with-issue Verified by QA and issue(s) found. label Mar 6, 2026