Skip to content

Antalya 26.1: Fix rescheduleTasksFromReplica#1568

Open
ianton-ru wants to merge 2 commits intoantalya-26.1from
bugfix/antalya-26.1/1554_fix_crash
Open

Antalya 26.1: Fix rescheduleTasksFromReplica#1568
ianton-ru wants to merge 2 commits intoantalya-26.1from
bugfix/antalya-26.1/1554_fix_crash

Conversation

@ianton-ru
Copy link
Copy Markdown

Changelog category (leave one):

  • Bug Fix (user-visible misbehavior in an official stable release)

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):

Fix rescheduleTasksFromReplica

Documentation entry for user-facing changes

Fix incorrect change from c523f29
getReplicaForFile uses replica_to_files_to_be_processed to find best replica for file, With removing lost replica after getReplicaForFile call, getReplicaForFile chooses the same replica, so rescheduling makes no sense, files will be choosen only in getAnyUnprocessedFile and executed on random replicas.
This PR fixes the order, now files are matched with new best replicas.

CI/CD Options

Exclude tests:

  • Fast test
  • Integration Tests
  • Stateless tests
  • Stateful tests
  • Performance tests
  • All with ASAN
  • All with TSAN
  • All with MSAN
  • All with UBSAN
  • All with Coverage
  • All with Aarch64
  • All Regression
  • Disable CI Cache

Regression jobs to run:

  • Fast suites (mostly <1h)
  • Aggregate Functions (2h)
  • Alter (1.5h)
  • Benchmark (30m)
  • ClickHouse Keeper (1h)
  • Iceberg (2h)
  • LDAP (1h)
  • Parquet (1.5h)
  • RBAC (1.5h)
  • SSL Server (1h)
  • S3 (2h)
  • S3 Export (2h)
  • Swarms (30m)
  • Tiered Storage (2h)

@ianton-ru
Copy link
Copy Markdown
Author

@codex review

@github-actions
Copy link
Copy Markdown

github-actions bot commented Mar 23, 2026

Workflow [PR], commit [4e8cb2b]

@chatgpt-codex-connector
Copy link
Copy Markdown

Codex Review: Didn't find any major issues. Already looking forward to the next diff.

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Copy link
Copy Markdown
Collaborator

@mkmkme mkmkme left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just for the record


AI audit note: This review comment was generated by AI.

Audit update for PR #1568 (Fix rescheduleTasksFromReplica):

Confirmed defects

No confirmed defects in reviewed scope.

The change is correct:

  • std::move(processed_file_list_ptr->second) safely transfers list ownership to a local variable; std::list move leaves the source empty and well-defined.
  • processed_file_list_ptr (a map iterator) is invalidated by the subsequent erase, but is never dereferenced after that point.
  • getReplicaForFile is called within the mutex-held region, iterating over replica_to_files_to_be_processed which now correctly excludes the lost replica, so rendezvous hashing picks the best surviving replica.
  • The replica_to_files_to_be_processed.size() < 2 guard on line 307 still executes before the erase, correctly verifying at least one other replica remains.

Coverage summary

  • Scope reviewed: rescheduleTasksFromReplica (changed function), getReplicaForFile (callee), getNextTask / getPreQueuedFile / getMatchingFileFromIterator / getAnyUnprocessedFile (concurrent accessors), RemoteQueryExecutor::processPacket (caller on ConnectionLost), class header and all member data structures.
  • Categories passed: move-semantics safety, iterator invalidation, logical correctness of erase-before-hash, RAII/resource leaks, integer types, exception safety within the function, mutex coverage within rescheduleTasksFromReplica, rollback/partial-update (single-function scope, all mutations under one lock).
  • Categories failed: none.
  • Pre-existing issue noted (not introduced by this PR): getNextTask accesses replica_to_files_to_be_processed at lines 52 and 70 without the mutex, while rescheduleTasksFromReplica mutates it under the mutex. getMatchingFileFromIterator calls getReplicaForFile (which iterates the same map) at line 197 without the mutex. These are data races that can lead to use-after-free if a replica's in-flight getNextTask overlaps with rescheduleTasksFromReplica for the same replica. This PR does not introduce or worsen this race in any meaningful way.
  • Assumptions/limits: audit is static-only; no test for rescheduleTasksFromReplica was found in the test suite (gtest_rendezvous_hashing.cpp covers getNextTask only).

@ianton-ru ianton-ru changed the title Fix rescheduleTasksFromReplica Antalya 26.1: Fix rescheduleTasksFromReplica Mar 24, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants