
Prepare the introduction of more keeper faults #56917

Merged
merged 29 commits on Dec 18, 2023

Conversation

@Algunenano Algunenano (Member) commented Nov 17, 2023

Changelog category (leave one):

  • Not for changelog

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):

  • Prepares the introduction of more keeper faults

Longer details:

  • Refactors ZooKeeperWithFaultInjection to prepare to introduce it in more places:

    • Implementation goes to the .cpp file.
    • Adds a bunch of new methods with faults to be used in the replicated merge tree.
    • Temporarily maintains the tracking of ephemeral nodes created with a faulty keeper. In the future, once "everything" handles retries, this should be removed in favor of forcing a session reload, which is what happens in a real environment.
    • Introduces faults in async functions too, via promises (see the sketch below).
  • It reworks how part_committed_locally_but_zookeeper (old) / resolve_uncertain_commit_stage (less old) worked so that we can keep retrying if the ZK session fails while committing. Before, the uncertain stage could only be resolved if ZK failed after the write (in which case everything was fine), but it wouldn't recover if the data wasn't written to Keeper. Now it recovers in both cases.

  • Fixes an issue found with ephemeral node cleanup under backup faults and renames the method (handleEphemeralNodeExistence -> deleteEphemeralNodeIfContentMatches) to avoid further confusion. I included it in this PR since renaming the method and introducing faults is what led to noticing the problem, and it lets us fix it only once (instead of twice: once in a different PR and again in the refactor).

AFAICS, closes #50465, as it removes the special handling of tryMulti() responses.
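
For illustration only — a minimal, standalone sketch of the promise-based fault injection described above, using toy names and types (the real implementation lives in ZooKeeperWithFaultInjection and its API differs):

```cpp
#include <future>
#include <iostream>
#include <random>
#include <stdexcept>
#include <string>

// Toy stand-in for a Keeper response type (illustrative, not the real API).
struct CreateResponse { std::string path_created; };

class FaultyAsyncKeeper
{
public:
    explicit FaultyAsyncKeeper(double fault_probability) : fault(fault_probability) {}

    // Faults are injected into the async path by settling the promise with an
    // exception instead of issuing the real call, so callers exercise exactly
    // the same error path as a genuine session failure.
    std::future<CreateResponse> asyncCreate(const std::string & path)
    {
        std::promise<CreateResponse> promise;
        auto future = promise.get_future();
        if (fault(rng))
            promise.set_exception(std::make_exception_ptr(
                std::runtime_error("Fault injected (session expired) on " + path)));
        else
            promise.set_value(CreateResponse{path}); // pretend the real call succeeded
        return future;
    }

private:
    std::mt19937 rng{std::random_device{}()};
    std::bernoulli_distribution fault;
};

int main()
{
    FaultyAsyncKeeper keeper(0.5);
    for (int i = 0; i < 4; ++i)
    {
        try
        {
            std::cout << "created " << keeper.asyncCreate("/test/node").get().path_created << '\n';
        }
        catch (const std::exception & e)
        {
            std::cout << "failed: " << e.what() << '\n';
        }
    }
}
```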

cc @devcrafter

@robot-ch-test-poll1 robot-ch-test-poll1 added the pr-not-for-changelog This PR should not be mentioned in the changelog label Nov 17, 2023
@robot-ch-test-poll1 robot-ch-test-poll1 (Contributor) commented Nov 17, 2023

This is an automated comment for commit 6cf8c9b with a description of existing statuses. It is updated for the latest CI run.


Successful checks

| Check name | Description | Status |
|---|---|---|
| CI running | A meta-check that indicates the running CI. Normally it's in success or pending state. The failed status indicates some problems with the PR | ✅ success |
| ClickBench | Runs [ClickBench](https://github.com/ClickHouse/ClickBench/) with an instant-attach table | ✅ success |
| ClickHouse build check | Builds ClickHouse in various configurations for use in further steps. You have to fix the builds that fail. Build logs often have enough information to fix the error, but you might have to reproduce the failure locally. The cmake options can be found in the build log, grepping for cmake. Use these options and follow the general build process | ✅ success |
| Compatibility check | Checks that the clickhouse binary runs on distributions with old libc versions. If it fails, ask a maintainer for help | ✅ success |
| Docker image for servers | The check to build and optionally push the mentioned image to Docker Hub | ✅ success |
| Fast test | Normally the first check run for a PR. It builds ClickHouse and runs most of the stateless functional tests, omitting some. If it fails, further checks are not started until it is fixed. Look at the report to see which tests fail, then reproduce the failure locally as described here | ✅ success |
| Flaky tests | Checks whether newly added or modified tests are flaky by running them repeatedly, in parallel, with more randomization. Functional tests are run 100 times with address sanitizer and additional randomization of thread scheduling. Integration tests are run up to 10 times. If a new test fails at least once, or runs too long, this check will be red. We don't allow flaky tests; read the doc | ✅ success |
| Install packages | Checks that the built packages are installable in a clean environment | ✅ success |
| Integration tests | The integration tests report. The package type is given in parentheses, and the optional part/total tests in square brackets | ✅ success |
| Mergeable Check | Checks whether all other necessary checks are successful | ✅ success |
| Performance Comparison | Measures changes in query performance. The performance test report is described in detail here. The optional part/total tests are in square brackets | ✅ success |
| Push to Dockerhub | The check for building and pushing the CI-related Docker images to Docker Hub | ✅ success |
| SQLTest | There's no description for the check yet; please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS | ✅ success |
| SQLancer | Fuzzing tests that detect logical bugs with the SQLancer tool | ✅ success |
| Sqllogic | Runs clickhouse on the sqllogic test set against sqlite and checks that all statements pass | ✅ success |
| Stateful tests | Runs stateful functional tests for ClickHouse binaries built in various configurations -- release, debug, with sanitizers, etc. | ✅ success |
| Stress test | Runs stateless functional tests concurrently from several clients to detect concurrency-related errors | ✅ success |
| Style Check | Runs a set of checks to keep the code style clean. If some of them fail, see the related log in the report | ✅ success |
| Unit tests | Runs the unit tests for different release types | ✅ success |
Failed checks

| Check name | Description | Status |
|---|---|---|
| AST fuzzer | Runs randomly generated queries to catch program errors. The build type is optionally given in parentheses. If it fails, ask a maintainer for help | ❌ failure |
| Stateless tests | Runs stateless functional tests for ClickHouse binaries built in various configurations -- release, debug, with sanitizers, etc. | ❌ failure |
| Upgrade check | Runs stress tests on the server version from the last release and then tries to upgrade it to the version from the PR. It checks whether the new server can start up successfully without errors, crashes, or sanitizer asserts | ❌ failure |

@Algunenano (Member Author)

It seems I will need to bring more commits from #56389. As the tests run in parallel, the moment you enable more faults (some ops had faults disabled, and the refactor got rid of that), they all clash with each other more frequently, so you see the lack of proper recovery more often, e.g. the part_committed_locally_but_zookeeper issue.

@devcrafter devcrafter self-assigned this Nov 20, 2023
@Algunenano (Member Author)

  • Fixed the broken debug build. I'm not sure why it compiled in the other builds when I used ch_assert instead of chassert.
  • The commit "fixing" 00509_extended_storage_definition_syntax_zookeeper wasn't necessary because, in the end, the failure wasn't unrelated: OPTIMIZE failed due to in-flight inserts caused by dangling ephemeral nodes left behind by keeper faults. After fixing that, the test no longer fails.

@Algunenano Algunenano (Member Author) commented Nov 22, 2023

Errors:

  • AST fuzzer (asan) -> Related. Bug in the ZooKeeperWithFaultInjection::exists logger.
  • Integration tests (asan) [1/4] -> PG leak -> Memory leak in Postgres (libpq) #56627
  • Integration tests (release) [2/4] -> test_backup_restore_on_cluster/test_concurrency.py::test_create_or_drop_tables_during_backup[Replicated-ReplicatedMergeTree]. Unrelated: the test is normally flaky and doesn't involve faults (Cannot enqueue query on this replica).
  • Stateless tests (debug) [2/5] -> 01280_ttl_where_group_by. Flaky.
  • Stateless tests (debug, s3 storage) [3/6] -> 01280_ttl_where_group_by again.
  • Stateless tests (release, analyzer) -> 01287_max_execution_speed and 01600_parts_states_metrics_long, which are flaky with the analyzer.
  • Stateless tests (tsan, s3 storage) [2/5] -> 01280_ttl_where_group_by again.

@Algunenano (Member Author)

Failures:

```cpp
{
    my_storage.enqueuePartForCheck(part_name, MAX_AGE_OF_LOCAL_PART_THAT_WASNT_ADDED_TO_ZOOKEEPER);
});
LOG_TRACE(
```
@Algunenano (Member Author)

@tavplubix Do you mind having a look at these changes, especially checking that rename_part_to_temporary() is safe in the case where the data wasn't written to Keeper (so that we retry)? Thanks!

@Algunenano (Member Author)

It seems it's not safe due to zero-copy locks, and the bug is also present when the block_id was created concurrently by a different replica.

@devcrafter devcrafter (Member) left a comment

LGTM with comments

```
@@ -43,11 +44,9 @@ RestoreCoordinationRemote::RestoreCoordinationRemote(
    if (my_is_internal)
    {
        String alive_node_path = my_zookeeper_path + "/stage/alive|" + my_current_host;
        zk->deleteEphemeralNodeIfContentMatches(alive_node_path, "");
```
@devcrafter (Member)

This is worth a good comment. And it would probably be more convenient to hide deleteEphemeralNodeIfContentMatches() inside create() with zkutil::CreateMode::Ephemeral.

@Algunenano (Member Author)

I would rather not hide it for now. In some places the ephemeral node is created as part of a transaction, and in other places we might not want to automatically recreate the ephemeral node. It also introduces extra round trips that are unnecessary when you are not recovering from an error (one to check, plus one to remove or a long wait for it to be removed).

I'll refactor it a bit
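
As a side note, here is a minimal model of the check-then-delete semantics (and of the two round trips mentioned above), with a std::map standing in for Keeper; the real method also has to handle the node disappearing between the read and the delete:

```cpp
#include <map>
#include <string>

// Toy stand-in for a Keeper session: path -> node content.
using FakeKeeper = std::map<std::string, std::string>;

/// Deletes the node only if it exists and holds the expected content, i.e.
/// it was left behind by *our* previous (possibly faulty) session.
/// Returns true if the node was removed.
bool deleteEphemeralNodeIfContentMatches(FakeKeeper & zk, const std::string & path,
                                         const std::string & expected_content)
{
    auto it = zk.find(path);                   // round trip 1: read the node
    if (it == zk.end() || it->second != expected_content)
        return false;                          // missing or someone else's node: leave it
    zk.erase(it);                              // round trip 2: remove it
    return true;
}
```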

src/Common/ZooKeeper/ZooKeeper.h (resolved)
@Algunenano Algunenano marked this pull request as draft December 11, 2023 14:24
@Algunenano (Member Author)

Working on rebasing the PR after the fix for zero-copy locks

@Algunenano (Member Author)

Reimplemented ZK retries for tryMulti(), covering failures both before and after the operation, on top of @CheSema's recent changes. So far so good. I had to bring back rollbackPartsToTemporaryState.
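
To make the retry flow concrete, here is a standalone sketch of the pattern under discussion (hypothetical helper names, not ClickHouse's actual retry machinery): when the session fails around tryMulti() we cannot know whether the ops were applied, so the uncertain state is resolved by checking ZooKeeper before deciding to retry:

```cpp
#include <functional>
#include <stdexcept>

// Thrown when the Keeper session dies; we can't tell whether the server
// applied the multi-op before the connection was lost.
struct SessionFailed : std::runtime_error { using std::runtime_error::runtime_error; };

enum class CommitState { COMMITTED, NOT_COMMITTED };

void commitWithRetries(
    const std::function<void()> & try_multi,          // the transactional commit
    const std::function<CommitState()> & check_state, // did our ops actually land?
    int max_retries)
{
    for (int attempt = 0; ; ++attempt)
    {
        try
        {
            try_multi();
            return; // committed for sure
        }
        catch (const SessionFailed &)
        {
            // Resolve the uncertainty instead of blindly retrying: the session
            // may have died either before or after the server applied the ops.
            if (check_state() == CommitState::COMMITTED)
                return;  // the write landed after all; nothing left to do
            if (attempt + 1 >= max_retries)
                throw;   // genuinely not committed and out of retries
            // Not committed: the whole multi-op is safe to retry.
        }
    }
}
```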

If there are no tables left, merge() will keep throwing errors constantly
@Algunenano Algunenano marked this pull request as ready for review December 13, 2023 09:40
@Algunenano (Member Author)

Everything is green except Integration tests (asan) [4/4], which failed because of test_postgresql_replica_database_engine_1/test.py::test_many_concurrent_queries; I don't see how it can be related.

Marking it as ready again after the last changes. cc @devcrafter in case you want to review the latest changes before merging

@Algunenano Algunenano (Member Author) commented Dec 13, 2023

Need to rebase on top of #57764, which fixes some of the same problems found and fixed in this PR.

Done

@tavplubix (Member)

> test_postgresql_replica_database_engine_1/test.py::test_many_concurrent_queries, which I don't see how it can be related.

Even if it's not related, let's not ignore it. We can start by using the CI DB to find out when it started to fail; that helps to find the reason quite easily: #57568 (comment)

@Algunenano (Member Author)

The new failures appeared with the merge with master, and they look related. I need to investigate more, but I took master's way of improving ZK retries on backup coordination instead of the one in this PR, so most likely the problem is there.

@Algunenano (Member Author)

Failures:

  • Integration tests (asan) [3/4] -> test_storage_hdfs/test.py::test_write_table. Fixed in "Fix leak in StorageHDFS" #55370.
  • Integration tests (asan) [4/4] -> test_storage_azure_blob_storage/test_cluster.py::test_unset_skip_unavailable_shards -> Fixed in "Fix test_unset_skip_unavailable_shards" #57895.
  • Integration tests (release) [4/4] -> the same test_unset_skip_unavailable_shards failure -> Fixed in #57895.
  • Stateless tests (aarch64) -> 00002_log_and_exception_messages_formatting. I'll have a look in a different PR, as it's not related to these changes.

I would love to not have to rebase this again 😉

@devcrafter (Member)

> I would love to not have to rebase this again 😉

It'd be nice to understand what has changed since the latest review.

@Algunenano (Member Author)

> It'd be nice to understand what has changed since the latest review.

The last 2 rebases:

  • 9d8d5df + 923c3b7 -> Reimplemented retries in RMTSink (commit) for when the connection to ZK fails on tryMulti(). This had already been implemented, but it was discovered that the old retries weren't handling zero-copy nodes correctly, and the function was refactored completely, so retries had to be introduced again.

  • The last one -> 3200933 -> Removes any changes vs. master under src/Backups, as the ones I had introduced have also been introduced in different PRs in the meantime.


```sql
system disable failpoint replicated_commit_zk_fail_after_op;
system disable failpoint replicated_merge_tree_commit_zk_fail_when_recovering_from_hw_fault;
```

@devcrafter (Member)

Optional, but we could check that, after disabling the failpoints, we can still insert.

```cpp
{
    if (!isEmpty())
    {
        WriteBufferFromOwnString buf;
```
@devcrafter (Member)

Shouldn't we just create a method that does this logging? (It's copy/pasted code.)

@Algunenano (Member Author)

I recovered the code by partially reverting the commit that removed it, and kept it as-is for simplicity, but it could be improved and iterated on in the future.
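
One possible shape for the extraction the reviewer suggests, with illustrative stand-in types (in the real code the buffer would be WriteBufferFromOwnString and the output would go through LOG_TRACE):

```cpp
#include <iostream>
#include <sstream>
#include <string>
#include <vector>

// Illustrative stand-in for the state whose logging is duplicated.
struct PendingOps
{
    std::vector<std::string> ops;
    bool isEmpty() const { return ops.empty(); }
};

// The duplicated "serialize and log if not empty" code becomes one helper
// that every call site can share.
void logPendingOps(const PendingOps & pending, const std::string & context)
{
    if (pending.isEmpty())
        return;
    std::ostringstream buf; // WriteBufferFromOwnString in the real code
    for (const auto & op : pending.ops)
        buf << op << ' ';
    std::cout << context << ": " << buf.str() << '\n'; // LOG_TRACE(...) in the real code
}
```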

```cpp
});

/// Independently of how many retries we had left we want to do at least one check of this inner retry
/// at least once so a) we try to verify at least once if metadata was written and b) we set the proper
```
@devcrafter (Member)

That's too many "at least once"s.

```
@@ -1019,22 +1038,6 @@ std::pair<std::vector<String>, bool> ReplicatedMergeTreeSinkImpl<async_insert>::
{
    zookeeper->setKeeper(storage.getZooKeeper());
```
@devcrafter (Member)

I'm not sure about moving this block to the lock-and-commit stage. Could you explain?

@Algunenano (Member Author)

I moved it because the table being in read-only mode only matters when we are going to add a new part / commit to ZK. If we only have checks left (for example, in resolve_duplicate_stage), it doesn't make sense to stop verifying just because the table is stopped, as we can still return a proper response to the client.

A similar thing will happen for quorum, for example (but it's not in this PR, to keep it more concise). It's fine to check for quorum if the write succeeded but the table was set to read-only after that.
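
A standalone sketch of that reasoning with illustrative names (not the actual ReplicatedMergeTreeSink code): the read-only guard protects only the stage that writes to Keeper, while verification-only stages keep running so the client still gets a definitive answer:

```cpp
#include <stdexcept>

enum class Stage { LOCK_AND_COMMIT, RESOLVE_DUPLICATE, RESOLVE_UNCERTAIN };

struct Table { bool is_readonly = false; };

void runStage(Table & table, Stage stage)
{
    switch (stage)
    {
        case Stage::LOCK_AND_COMMIT:
            // Only the committing stage needs the guard: writing a new part
            // to Keeper is impossible while the table is read-only.
            if (table.is_readonly)
                throw std::runtime_error("Table is in readonly mode");
            // ... refresh the Keeper session and commit the part ...
            break;

        case Stage::RESOLVE_DUPLICATE:
        case Stage::RESOLVE_UNCERTAIN:
            // Read-only checks against Keeper: safe to run even if the table
            // went read-only after the write succeeded.
            break;
    }
}
```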

@devcrafter devcrafter (Member) left a comment

@Algunenano My comments are mostly minor, except probably #56917 (comment). But if you think it's fine, please feel free to address them in a follow-up PR.

@Algunenano (Member Author)

Failures:

  • AST fuzzer (ubsan) -> "Sorting column wasn't found in the ActionsDAG's outputs" #57705
  • Stateless tests (debug) [3/5]
    • 01508_partition_pruning_long. It seems the queries just took too long and the test hit the 10-minute limit before finishing them all.
    • 01414_mutations_and_errors_zookeeper: the mutation was cancelled internally (due to a large sleep) before it was killed externally, so the log message is missing.
  • Upgrade check (debug): I see an OOM kill in stress.log and a logical error in gdb.log, but I can't tell whether they are related or which binary and data were being run. This test is really hard to debug/understand in general (Upgrade check: Provide better information on failure #57981).

@Algunenano Algunenano merged commit f10dae4 into ClickHouse:master Dec 18, 2023
340 of 343 checks passed
Labels
pr-not-for-changelog This PR should not be mentioned in the changelog
Development

Successfully merging this pull request may close these issues.

[23.4.2.11] [Keeper] LOGICAL_ERROR: There is no failed OpResult
4 participants