Try fix 'some fetches may stuck' #30346

Merged
merged 6 commits into from Oct 28, 2021

Conversation

tavplubix
Member

Changelog category (leave one):

  • Improvement

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):
Minor improvements to replica cloning and to enqueuing fetches for broken parts, which should prevent extremely rare hangs of GET_PART entries in the replication queue.

@robot-clickhouse robot-clickhouse added the pr-improvement Pull request with some product improvements label Oct 18, 2021
@alesapin alesapin self-assigned this Oct 19, 2021
Member

@alesapin alesapin left a comment

The most complex fix for our replication I've seen.

LogEntryPtr parsed_entry = {};
};

std::vector<QueueEntryInfo> source_queue;
Member

This needs a comment describing the general idea.

String path_created = dynamic_cast<const Coordination::CreateResponse &>(*results.back()).path_created;
log_entry->znode_name = path_created.substr(path_created.find_last_of('/') + 1);
queue.insert(zookeeper, log_entry);
zookeeper->multi(ops);
Member

We don't need to hack the queue here, because the new entry will be pulled in pullLogsToQueue?

Member Author

Actually, this change was incorrect, because we cannot pull entries from the queue into the queue.
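
For readers following the quoted lines: they derive `znode_name` from the path returned by the create request by keeping everything after the last `/`. A minimal standalone sketch of that string operation is below. The helper name `lastPathComponent` is hypothetical and not part of the PR.

```cpp
#include <string>

/// Hypothetical helper (not from the PR): returns the text after the last '/'.
/// If the path contains no '/', find_last_of returns npos, npos + 1 wraps to 0,
/// and the whole string is returned unchanged.
std::string lastPathComponent(const std::string & path)
{
    return path.substr(path.find_last_of('/') + 1);
}
```

Called on an illustrative path such as `/some/replica/queue/queue-0000000154`, this returns `queue-0000000154`; the path itself is made up for the example.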

/// but before we copied its active parts set. In this case we will create a GET_PART entry in our queue
/// and later will pull the original GET_PART from the replication log.
/// It should not cause any issues, but it does not allow us to get rid of duplicated entries and add an assertion.
if (created_get_parts.count(part_name))
Member

Maybe pass created_get_parts as an argument. Currently it's hard to see that should_ignore_log_entry depends on it.
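
The suggestion could look roughly like the sketch below, where the set is an explicit parameter rather than captured state. This is only an illustration: the real `should_ignore_log_entry` in the PR presumably checks more than this, and the signature shown here is an assumption.

```cpp
#include <string>
#include <unordered_set>

/// Illustrative stand-in only (signature assumed, not taken from the PR).
/// The set of already created GET_PART entries is an explicit parameter,
/// so every call site shows what the decision depends on.
const auto should_ignore_log_entry = [](
    const std::unordered_set<std::string> & created_get_parts,
    const std::string & part_name) -> bool
{
    return created_get_parts.count(part_name) != 0;
};
```

With an empty capture list, the lambda cannot silently read surrounding state, which is the readability point raised in the comment.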

@@ -324,7 +326,8 @@ class ReplicatedMergeTreeQueue
     /** Remove the action from the queue with the parts covered by part_name (from ZK and from the RAM).
       * And also wait for the completion of their execution, if they are now being executed.
       */
-    void removePartProducingOpsInRange(zkutil::ZooKeeperPtr zookeeper, const MergeTreePartInfo & part_info, const ReplicatedMergeTreeLogEntryData & current);
+    void removePartProducingOpsInRange(zkutil::ZooKeeperPtr zookeeper, const MergeTreePartInfo & part_info,
+        const std::optional<ReplicatedMergeTreeLogEntryData> & current);
Member

What is current? Maybe add a comment?

@tavplubix
Member Author

Functional stateless tests (debug) - race between executeReplaceRange and ReplicatedMergeTreePartCheckThread (a part from a dropped range may reappear after removePartAndEnqueueFetch is called from the check thread)

2021.10.26 06:06:07.076332 [ 552 ] {} <Warning> test_oxicq7.src_13 (ReplicatedMergeTreePartCheckThread): Found the missing part 6_5_10_1 at 6_5_10_1 on r1_3877579479
2021.10.26 06:06:07.082900 [ 382 ] {} <Trace> test_oxicq7.src_13 (9cb435b5-3865-4994-9cb4-35b53865b994): Renaming temporary part tmp-fetch_6_5_10_1 to 6_5_10_1.
2021.10.26 06:06:07.149418 [ 382 ] {} <Debug> test_oxicq7.src_13 (9cb435b5-3865-4994-9cb4-35b53865b994): Part 6_5_5_0 is rendered obsolete by fetching part 6_5_10_1
2021.10.26 06:06:07.149522 [ 382 ] {} <Debug> test_oxicq7.src_13 (9cb435b5-3865-4994-9cb4-35b53865b994): Part 6_6_6_0 is rendered obsolete by fetching part 6_5_10_1
2021.10.26 06:06:07.149646 [ 382 ] {} <Debug> test_oxicq7.src_13 (9cb435b5-3865-4994-9cb4-35b53865b994): Part 6_7_7_0 is rendered obsolete by fetching part 6_5_10_1
2021.10.26 06:06:07.149746 [ 382 ] {} <Debug> test_oxicq7.src_13 (9cb435b5-3865-4994-9cb4-35b53865b994): Part 6_9_9_0 is rendered obsolete by fetching part 6_5_10_1
2021.10.26 06:06:07.149897 [ 382 ] {} <Debug> test_oxicq7.src_13 (9cb435b5-3865-4994-9cb4-35b53865b994): Fetched part 6_5_10_1 from /test/01154_move_partition_long_test_oxicq7/s1/src/replicas/r1_3877579479
2021.10.26 06:06:07.150036 [ 382 ] {} <Test> test_oxicq7.src_13 (ReplicatedMergeTreeQueue): Removing successful entry queue-0000000154 from queue with type MERGE_PARTS with virtual parts [6_5_10_1]
2021.10.26 06:06:07.150133 [ 382 ] {} <Test> test_oxicq7.src_13 (ReplicatedMergeTreeQueue): Adding parts [6_5_10_1] to current parts
2021.10.26 06:06:07.150257 [ 382 ] {} <Test> test_oxicq7.src_13 (ReplicatedMergeTreeQueue): Removing part 6_5_10_1 from mutations (remove_part: false, remove_covered_parts: true)
2021.10.26 06:06:07.167028 [ 397 ] {} <Debug> test_oxicq7.src_13 (9cb435b5-3865-4994-9cb4-35b53865b994): Skipping action for part 6_8_8_0 because part 6_5_10_1 already exists.
2021.10.26 06:06:13.583962 [ 395 ] {} <Debug> test_oxicq7.src_13 (9cb435b5-3865-4994-9cb4-35b53865b994): Skipping action for part 6_10_10_0 because part 6_5_10_1 already exists.
2021.10.26 06:06:14.568762 [ 588 ] {} <Warning> test_oxicq7.src_13 (ReplicatedMergeTreePartCheckThread): Checking part 6_5_10_1
2021.10.26 06:06:14.573340 [ 409 ] {} <Trace> test_oxicq7.src_13 (9cb435b5-3865-4994-9cb4-35b53865b994): Replacing 7 parts 6_5_5_0 6_5_10_1 6_6_6_0 6_7_7_0 6_9_9_0 6_14_14_0 6_18_18_0 with 2 parts 6_20_20_0, 6_21_21_0
2021.10.26 06:06:14.577188 [ 588 ] {} <Trace> test_oxicq7.src_13 (ReplicatedMergeTreePartCheckThread): Part 6_5_10_1 in zookeeper: true, locally: false
2021.10.26 06:06:14.577297 [ 588 ] {} <Warning> test_oxicq7.src_13 (ReplicatedMergeTreePartCheckThread): Checking if anyone has a part 6_5_10_1 or covering part.
2021.10.26 06:06:14.596799 [ 588 ] {} <Warning> test_oxicq7.src_13 (ReplicatedMergeTreePartCheckThread): Found the missing part 6_5_10_1 at 6_5_10_1 on r1_3950745081
2021.10.26 06:06:14.597052 [ 588 ] {} <Warning> test_oxicq7.src_13 (ReplicatedMergeTreePartCheckThread): Part 6_5_10_1 exists in ZooKeeper but not locally and found on other replica. Removing from ZooKeeper and queueing a fetch.
2021.10.26 06:06:14.608161 [ 479 ] {} <Debug> test_oxicq7.src_13 (9cb435b5-3865-4994-9cb4-35b53865b994): There is no part 6_5_10_1 in ZooKeeper, it was only in filesystem
2021.10.26 06:06:16.087135 [ 479 ] {} <Debug> test_oxicq7.src_13 (9cb435b5-3865-4994-9cb4-35b53865b994): Removing part from filesystem 6_5_10_1
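
To read the log above, it helps to see why 6_5_10_1 makes parts such as 6_5_5_0 or 6_8_8_0 obsolete. The sketch below is a simplified illustration, not ClickHouse's actual `MergeTreePartInfo`: it assumes part names of the form `partition_minblock_maxblock_level`, ignores the optional mutation suffix, and does not handle partition IDs that themselves contain underscores. The names `PartInfo` and `parsePartName` are hypothetical.

```cpp
#include <exception>
#include <optional>
#include <sstream>
#include <string>
#include <vector>

/// Simplified, hypothetical model of a part name (not the real MergeTreePartInfo).
struct PartInfo
{
    std::string partition_id;
    long long min_block = 0;
    long long max_block = 0;
    long long level = 0;

    /// A part covers another one in the same partition when its block range
    /// includes the other range and its level is not lower.
    bool contains(const PartInfo & rhs) const
    {
        return partition_id == rhs.partition_id
            && min_block <= rhs.min_block
            && max_block >= rhs.max_block
            && level >= rhs.level;
    }
};

/// Splits the name on '_' and expects exactly four fields.
/// Returns std::nullopt for names that do not fit this simplified format.
std::optional<PartInfo> parsePartName(const std::string & name)
{
    std::vector<std::string> fields;
    std::stringstream stream(name);
    std::string field;
    while (std::getline(stream, field, '_'))
        fields.push_back(field);

    if (fields.size() != 4)
        return std::nullopt;

    try
    {
        return PartInfo{fields[0], std::stoll(fields[1]), std::stoll(fields[2]), std::stoll(fields[3])};
    }
    catch (const std::exception &)
    {
        /// std::stoll throws when a field does not start with a number or is out of range.
        return std::nullopt;
    }
}
```

Under these assumptions, 6_5_10_1 contains 6_5_5_0, 6_8_8_0 and 6_10_10_0 (blocks 5 to 10, level 1), but not 6_14_14_0, which matches the "rendered obsolete" and "Skipping action" lines in the log.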

@tavplubix tavplubix merged commit 33ffe11 into master Oct 28, 2021
@tavplubix tavplubix deleted the fix_some_fetches_may_stuck branch October 28, 2021 10:10
Labels
pr-improvement Pull request with some product improvements
3 participants