
Prepare the introduction of more keeper faults #56917

Merged
merged 29 commits on Dec 18, 2023

Conversation

@Algunenano Algunenano (Member) commented Nov 17, 2023

Changelog category (leave one):

  • Not for changelog

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):

  • Prepares the introduction of more keeper faults

Longer details:

  • Refactors ZooKeeperWithFaultInjection to prepare to introduce it in more places:

    • Implementation goes to the .cpp file.
    • Adds a bunch of new methods with faults to be used in the replicated merge tree.
    • Temporarily maintains the tracking of ephemeral nodes created with a faulty keeper. In the future, once "everything" handles retries, this should be removed in favor of forcing a session reload, which is what happens in a real environment.
    • Introduces faults in async functions too, via promises (see the sketch below).
  • It reworks how part_committed_locally_but_zookeeper (old) / resolve_uncertain_commit_stage (less old) worked so that we can keep retrying if the ZK session fails while committing. Before, the uncertain stage could only be resolved if ZK failed after the write (in which case everything was fine), but it wouldn't recover if the data wasn't written to Keeper. Now it recovers in both cases.

  • Fixes an issue found with ephemeral node cleanup under backup faults and renames the method (handleEphemeralNodeExistence -> deleteEphemeralNodeIfContentMatches) to avoid further confusion. I included it in this PR since renaming the method and introducing faults is what led to noticing the problem, and it lets us fix it only once (instead of twice: once in a different PR and again in the refactor).

AFAICS, closes #50465, as it removes the special handling of tryMulti() responses.
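
For illustration only — a minimal, standalone sketch of the promise-based fault injection described above, using toy names and types (the real implementation lives in ZooKeeperWithFaultInjection and its API differs):

```cpp
#include <future>
#include <iostream>
#include <random>
#include <stdexcept>
#include <string>

// Toy stand-in for a Keeper response type (illustrative, not the real API).
struct CreateResponse { std::string path_created; };

class FaultyAsyncKeeper
{
public:
    explicit FaultyAsyncKeeper(double fault_probability) : fault(fault_probability) {}

    // Faults are injected into the async path by settling the promise with an
    // exception instead of issuing the real call, so callers exercise exactly
    // the same error path as a genuine session failure.
    std::future<CreateResponse> asyncCreate(const std::string & path)
    {
        std::promise<CreateResponse> promise;
        auto future = promise.get_future();
        if (fault(rng))
            promise.set_exception(std::make_exception_ptr(
                std::runtime_error("Fault injected (session expired) on " + path)));
        else
            promise.set_value(CreateResponse{path}); // pretend the real call succeeded
        return future;
    }

private:
    std::mt19937 rng{std::random_device{}()};
    std::bernoulli_distribution fault;
};

int main()
{
    FaultyAsyncKeeper keeper(0.5);
    for (int i = 0; i < 4; ++i)
    {
        try
        {
            std::cout << "created " << keeper.asyncCreate("/test/node").get().path_created << '\n';
        }
        catch (const std::exception & e)
        {
            std::cout << "failed: " << e.what() << '\n';
        }
    }
}
```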

cc @devcrafter

@robot-ch-test-poll1 robot-ch-test-poll1 added the pr-not-for-changelog This PR should not be mentioned in the changelog label Nov 17, 2023
@robot-ch-test-poll1 robot-ch-test-poll1 (Contributor) commented Nov 17, 2023

This is an automated comment for commit 6cf8c9b with a description of existing statuses. It is updated for the latest CI run.


Successful checks

| Check name | Description | Status |
|---|---|---|
| CI running | A meta-check that indicates the running CI. Normally it's in success or pending state. The failed status indicates some problems with the PR | ✅ success |
| ClickBench | Runs [ClickBench](https://github.com/ClickHouse/ClickBench/) with an instant-attach table | ✅ success |
| ClickHouse build check | Builds ClickHouse in various configurations for use in further steps. You have to fix the builds that fail. Build logs often have enough information to fix the error, but you might have to reproduce the failure locally. The cmake options can be found in the build log, grepping for cmake. Use these options and follow the general build process | ✅ success |
| Compatibility check | Checks that the clickhouse binary runs on distributions with old libc versions. If it fails, ask a maintainer for help | ✅ success |
| Docker image for servers | The check to build and optionally push the mentioned image to Docker Hub | ✅ success |
| Fast test | Normally the first check run for a PR. It builds ClickHouse and runs most of the stateless functional tests, omitting some. If it fails, further checks are not started until it is fixed. Look at the report to see which tests fail, then reproduce the failure locally as described here | ✅ success |
| Flaky tests | Checks whether newly added or modified tests are flaky by running them repeatedly, in parallel, with more randomization. Functional tests are run 100 times with address sanitizer and additional randomization of thread scheduling. Integration tests are run up to 10 times. If a new test fails at least once, or runs too long, this check will be red. We don't allow flaky tests; read the doc | ✅ success |
| Install packages | Checks that the built packages are installable in a clean environment | ✅ success |
| Integration tests | The integration tests report. The package type is given in parentheses, and the optional part/total tests in square brackets | ✅ success |
| Mergeable Check | Checks whether all other necessary checks are successful | ✅ success |
| Performance Comparison | Measures changes in query performance. The performance test report is described in detail here. The optional part/total tests are in square brackets | ✅ success |
| Push to Dockerhub | The check for building and pushing the CI-related Docker images to Docker Hub | ✅ success |
| SQLTest | There's no description for the check yet; please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS | ✅ success |
| SQLancer | Fuzzing tests that detect logical bugs with the SQLancer tool | ✅ success |
| Sqllogic | Runs clickhouse on the sqllogic test set against sqlite and checks that all statements pass | ✅ success |
| Stateful tests | Runs stateful functional tests for ClickHouse binaries built in various configurations -- release, debug, with sanitizers, etc. | ✅ success |
| Stress test | Runs stateless functional tests concurrently from several clients to detect concurrency-related errors | ✅ success |
| Style Check | Runs a set of checks to keep the code style clean. If some of them fail, see the related log in the report | ✅ success |
| Unit tests | Runs the unit tests for different release types | ✅ success |
Failed checks

| Check name | Description | Status |
|---|---|---|
| AST fuzzer | Runs randomly generated queries to catch program errors. The build type is optionally given in parentheses. If it fails, ask a maintainer for help | ❌ failure |
| Stateless tests | Runs stateless functional tests for ClickHouse binaries built in various configurations -- release, debug, with sanitizers, etc. | ❌ failure |
| Upgrade check | Runs stress tests on the server version from the last release and then tries to upgrade it to the version from the PR. It checks whether the new server can start up successfully without errors, crashes, or sanitizer asserts | ❌ failure |

@Algunenano (Member Author)

It seems I will need to bring more commits from #56389. As the tests run in parallel, the moment you enable more faults (some ops had faults disabled, and the refactor got rid of that), they all clash with each other more frequently, so you see the lack of proper recovery more often, e.g. the part_committed_locally_but_zookeeper issue.

@devcrafter devcrafter self-assigned this Nov 20, 2023
@Algunenano (Member Author)

  • Fixed the broken debug build. I'm not sure why it compiled in the other builds when I used ch_assert instead of chassert.
  • The commit "fixing" 00509_extended_storage_definition_syntax_zookeeper wasn't necessary because, in the end, the failure wasn't unrelated: OPTIMIZE failed due to in-flight inserts caused by dangling ephemeral nodes left behind by keeper faults. After fixing that, the test no longer fails.

@Algunenano Algunenano (Member Author) commented Nov 22, 2023

Errors:

  • AST fuzzer (asan) -> Related. Bug in the ZooKeeperWithFaultInjection::exists logger.
  • Integration tests (asan) [1/4] -> PG leak -> Memory leak in Postgres (libpq) #56627
  • Integration tests (release) [2/4] -> test_backup_restore_on_cluster/test_concurrency.py::test_create_or_drop_tables_during_backup[Replicated-ReplicatedMergeTree]. Unrelated: the test is normally flaky and doesn't involve faults (Cannot enqueue query on this replica).
  • Stateless tests (debug) [2/5] -> 01280_ttl_where_group_by. Flaky.
  • Stateless tests (debug, s3 storage) [3/6] -> 01280_ttl_where_group_by again.
  • Stateless tests (release, analyzer) -> 01287_max_execution_speed and 01600_parts_states_metrics_long, which are flaky with the analyzer.
  • Stateless tests (tsan, s3 storage) [2/5] -> 01280_ttl_where_group_by again.

@Algunenano (Member Author)

Failures:

```cpp
{
    my_storage.enqueuePartForCheck(part_name, MAX_AGE_OF_LOCAL_PART_THAT_WASNT_ADDED_TO_ZOOKEEPER);
});
LOG_TRACE(
```
@Algunenano (Member Author)

@tavplubix Do you mind having a look at these changes, especially checking that rename_part_to_temporary() is safe in the case where the data wasn't written to Keeper (so that we retry)? Thanks!

@Algunenano (Member Author)

It seems it's not safe due to zero-copy locks, and the bug is also present when the block_id was created concurrently by a different replica.

@devcrafter devcrafter (Member) left a comment

LGTM with comments

```
@@ -43,11 +44,9 @@ RestoreCoordinationRemote::RestoreCoordinationRemote(
    if (my_is_internal)
    {
        String alive_node_path = my_zookeeper_path + "/stage/alive|" + my_current_host;
        zk->deleteEphemeralNodeIfContentMatches(alive_node_path, "");
```
@devcrafter (Member)

This is worth a good comment. And it would probably be more convenient to hide deleteEphemeralNodeIfContentMatches() inside create() with zkutil::CreateMode::Ephemeral.

@Algunenano (Member Author)

I would rather not hide it for now. In some places the ephemeral node is created as part of a transaction, and in other places we might not want to automatically recreate the ephemeral node. It also introduces extra round trips that are unnecessary when you are not recovering from an error (one to check, plus one to remove or a long wait for it to be removed).

I'll refactor it a bit
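
As a side note, here is a minimal model of the check-then-delete semantics (and of the two round trips mentioned above), with a std::map standing in for Keeper; the real method also has to handle the node disappearing between the read and the delete:

```cpp
#include <map>
#include <string>

// Toy stand-in for a Keeper session: path -> node content.
using FakeKeeper = std::map<std::string, std::string>;

/// Deletes the node only if it exists and holds the expected content, i.e.
/// it was left behind by *our* previous (possibly faulty) session.
/// Returns true if the node was removed.
bool deleteEphemeralNodeIfContentMatches(FakeKeeper & zk, const std::string & path,
                                         const std::string & expected_content)
{
    auto it = zk.find(path);                   // round trip 1: read the node
    if (it == zk.end() || it->second != expected_content)
        return false;                          // missing or someone else's node: leave it
    zk.erase(it);                              // round trip 2: remove it
    return true;
}
```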

src/Common/ZooKeeper/ZooKeeper.h (resolved)
@Algunenano Algunenano marked this pull request as draft December 11, 2023 14:24
@Algunenano (Member Author)

Working on rebasing the PR after the fix for zero-copy locks

@Algunenano (Member Author)

Reimplemented ZK retries for tryMulti(), covering failures both before and after the operation, on top of @CheSema's recent changes. So far so good. I had to bring back rollbackPartsToTemporaryState.
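
To make the retry flow concrete, here is a standalone sketch of the pattern under discussion (hypothetical helper names, not ClickHouse's actual retry machinery): when the session fails around tryMulti() we cannot know whether the ops were applied, so the uncertain state is resolved by checking ZooKeeper before deciding to retry:

```cpp
#include <functional>
#include <stdexcept>

// Thrown when the Keeper session dies; we can't tell whether the server
// applied the multi-op before the connection was lost.
struct SessionFailed : std::runtime_error { using std::runtime_error::runtime_error; };

enum class CommitState { COMMITTED, NOT_COMMITTED };

void commitWithRetries(
    const std::function<void()> & try_multi,          // the transactional commit
    const std::function<CommitState()> & check_state, // did our ops actually land?
    int max_retries)
{
    for (int attempt = 0; ; ++attempt)
    {
        try
        {
            try_multi();
            return; // committed for sure
        }
        catch (const SessionFailed &)
        {
            // Resolve the uncertainty instead of blindly retrying: the session
            // may have died either before or after the server applied the ops.
            if (check_state() == CommitState::COMMITTED)
                return;  // the write landed after all; nothing left to do
            if (attempt + 1 >= max_retries)
                throw;   // genuinely not committed and out of retries
            // Not committed: the whole multi-op is safe to retry.
        }
    }
}
```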

If there are no tables left, merge() will keep throwing errors constantly
@Algunenano Algunenano marked this pull request as ready for review December 13, 2023 09:40
@Algunenano (Member Author)

Everything is green except Integration tests (asan) [4/4], which failed because of test_postgresql_replica_database_engine_1/test.py::test_many_concurrent_queries; I don't see how it can be related.

Marking it as ready again after the last changes. cc @devcrafter in case you want to review the latest changes before merging

@Algunenano Algunenano (Member Author) commented Dec 13, 2023

Need to rebase on top of #57764, which fixes some of the same problems found and fixed in this PR.

Done

@tavplubix (Member)

> test_postgresql_replica_database_engine_1/test.py::test_many_concurrent_queries, which I don't see how it can be related.

Even if it's not related, let's not ignore it. We can start by using the CI DB to find out when it started to fail; that helps to find the reason quite easily: #57568 (comment)

@Algunenano (Member Author)

The new failures appeared with the merge with master, and they look related. I need to investigate more, but I took master's way of improving ZK retries on backup coordination instead of the one in this PR, so most likely the problem is there.

@Algunenano (Member Author)

Failures:

  • Integration tests (asan) [3/4] -> test_storage_hdfs/test.py::test_write_table. Fixed in "Fix leak in StorageHDFS" #55370.
  • Integration tests (asan) [4/4] -> test_storage_azure_blob_storage/test_cluster.py::test_unset_skip_unavailable_shards -> Fixed in "Fix test_unset_skip_unavailable_shards" #57895.
  • Integration tests (release) [4/4] -> the same test_unset_skip_unavailable_shards failure -> Fixed in #57895.
  • Stateless tests (aarch64) -> 00002_log_and_exception_messages_formatting. I'll have a look in a different PR, as it's not related to these changes.

I would love to not have to rebase this again 😉

@devcrafter (Member)

> I would love to not have to rebase this again 😉

It'd be nice to understand what has changed since the latest review.

@Algunenano (Member Author)

> It'd be nice to understand what has changed since the latest review.

The last 2 rebases:

  • 9d8d5df + 923c3b7 -> Reimplemented retries in RMTSink (commit) for when the connection to ZK fails on tryMulti(). This had already been implemented, but it was discovered that the old retries weren't handling zero-copy nodes correctly, and the function was refactored completely, so retries had to be introduced again.

  • The last one -> 3200933 -> Removes any changes vs. master under src/Backups, as the ones I had introduced have also been introduced in different PRs in the meantime.


```sql
system disable failpoint replicated_commit_zk_fail_after_op;
system disable failpoint replicated_merge_tree_commit_zk_fail_when_recovering_from_hw_fault;
```

@devcrafter (Member)

Optional, but we could check that, after disabling the failpoints, we can still insert.

```cpp
{
    if (!isEmpty())
    {
        WriteBufferFromOwnString buf;
```
@devcrafter (Member)

Shouldn't we just create a method that does this logging? (It's copy/pasted code.)

@Algunenano (Member Author)

I recovered the code by partially reverting the commit that removed it, and kept it as-is for simplicity, but it could be improved and iterated on in the future.
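
One possible shape for the extraction the reviewer suggests, with illustrative stand-in types (in the real code the buffer would be WriteBufferFromOwnString and the output would go through LOG_TRACE):

```cpp
#include <iostream>
#include <sstream>
#include <string>
#include <vector>

// Illustrative stand-in for the state whose logging is duplicated.
struct PendingOps
{
    std::vector<std::string> ops;
    bool isEmpty() const { return ops.empty(); }
};

// The duplicated "serialize and log if not empty" code becomes one helper
// that every call site can share.
void logPendingOps(const PendingOps & pending, const std::string & context)
{
    if (pending.isEmpty())
        return;
    std::ostringstream buf; // WriteBufferFromOwnString in the real code
    for (const auto & op : pending.ops)
        buf << op << ' ';
    std::cout << context << ": " << buf.str() << '\n'; // LOG_TRACE(...) in the real code
}
```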

```cpp
});

/// Independently of how many retries we had left we want to do at least one check of this inner retry
/// at least once so a) we try to verify at least once if metadata was written and b) we set the proper
```
@devcrafter (Member)

That's too many "at least once"s.

```
@@ -1019,22 +1038,6 @@ std::pair<std::vector<String>, bool> ReplicatedMergeTreeSinkImpl<async_insert>::
{
    zookeeper->setKeeper(storage.getZooKeeper());
```
@devcrafter (Member)

I'm not sure about moving this block to the lock-and-commit stage. Could you explain?

@Algunenano (Member Author)

I moved it because the table being in read-only mode only matters when we are going to add a new part / commit to ZK. If we only have checks left (for example, in resolve_duplicate_stage), it doesn't make sense to stop verifying just because the table is stopped, as we can still return a proper response to the client.

A similar thing will happen for quorum, for example (but it's not in this PR, to keep it more concise). It's fine to check for quorum if the write succeeded but the table was set to read-only after that.
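
A standalone sketch of that reasoning with illustrative names (not the actual ReplicatedMergeTreeSink code): the read-only guard protects only the stage that writes to Keeper, while verification-only stages keep running so the client still gets a definitive answer:

```cpp
#include <stdexcept>

enum class Stage { LOCK_AND_COMMIT, RESOLVE_DUPLICATE, RESOLVE_UNCERTAIN };

struct Table { bool is_readonly = false; };

void runStage(Table & table, Stage stage)
{
    switch (stage)
    {
        case Stage::LOCK_AND_COMMIT:
            // Only the committing stage needs the guard: writing a new part
            // to Keeper is impossible while the table is read-only.
            if (table.is_readonly)
                throw std::runtime_error("Table is in readonly mode");
            // ... refresh the Keeper session and commit the part ...
            break;

        case Stage::RESOLVE_DUPLICATE:
        case Stage::RESOLVE_UNCERTAIN:
            // Read-only checks against Keeper: safe to run even if the table
            // went read-only after the write succeeded.
            break;
    }
}
```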

@devcrafter devcrafter (Member) left a comment

@Algunenano My comments are mostly minor, except probably #56917 (comment). But if you think it's fine, please feel free to address them in a follow-up PR.

@Algunenano (Member Author)

Failures:

  • AST fuzzer (ubsan) -> "Sorting column wasn't found in the ActionsDAG's outputs" #57705
  • Stateless tests (debug) [3/5]
    • 01508_partition_pruning_long. It seems the queries just took too long and the test hit the 10-minute limit before finishing them all.
    • 01414_mutations_and_errors_zookeeper: the mutation was cancelled internally (due to a large sleep) before it was killed externally, so the log message is missing.
  • Upgrade check (debug): I see an OOM kill in stress.log and a logical error in gdb.log, but I can't tell whether they are related or which binary and data were being run. This test is really hard to debug/understand in general (Upgrade check: Provide better information on failure #57981).

@Algunenano Algunenano merged commit f10dae4 into ClickHouse:master Dec 18, 2023
340 of 343 checks passed
Labels
pr-not-for-changelog This PR should not be mentioned in the changelog
Development

Successfully merging this pull request may close these issues.

[23.4.2.11] [Keeper] LOGICAL_ERROR: There is no failed OpResult
4 participants