
refactor: implement the distributed lock of shard #706

Merged: 13 commits merged into apache:main on Apr 21, 2023

Conversation

ZuLiangWang (Contributor)

Which issue does this PR close?

Closes #

Rationale for this change

To ensure that user data in distributed mode is never lost, we need a mechanism that guarantees CeresDB never has multiple leaders for the same shard under any circumstances.

What changes are included in this PR?

  • Introduce an etcd client in CeresDB.
  • Implement distributed locks through etcd; the corresponding lock must be acquired before every operation on a shard (a minimal sketch of the idea follows this list).
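
The following is a minimal sketch of how such an exclusive lock can be built on etcd with the etcd-client crate. It illustrates the approach only; the key layout, function name, and TTL handling here are assumptions made for this example, not the PR's actual code.

```rust
use etcd_client::{Client, Compare, CompareOp, PutOptions, Txn, TxnOp};

/// Try to acquire an exclusive lock for one shard (illustrative only).
async fn try_grant_shard_lock(
    client: &mut Client,
    shard_id: u32,
    node_endpoint: &str,
    ttl_sec: i64,
) -> Result<bool, etcd_client::Error> {
    // A lease ties the lock to this process: if the holder stops sending
    // keep-alives, etcd deletes the key and another node may take the shard.
    let lease = client.lease_grant(ttl_sec, None).await?;
    let key = format!("/shard_lock/{shard_id}");

    // Create the key only if it does not exist yet (create_revision == 0);
    // this conditional write is what makes the grant exclusive across nodes.
    let txn = Txn::new()
        .when(vec![Compare::create_revision(
            key.clone(),
            CompareOp::Equal,
            0,
        )])
        .and_then(vec![TxnOp::put(
            key,
            node_endpoint,
            Some(PutOptions::new().with_lease(lease.id())),
        )]);

    Ok(client.txn(txn).await?.succeeded())
}
```

Because the key is bound to the lease, a crashed or partitioned holder loses the lock automatically once the lease's TTL elapses, so at most one node can own a shard at any time.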

Are there any user-facing changes?

None

How is this change tested?

All unit tests and integration tests pass.

@ZuLiangWang force-pushed the refactor_cluster_procedure branch 2 times, most recently from cb37f61 to 6ffefa2 on March 8, 2023 at 07:08

codecov-commenter commented Mar 8, 2023

Codecov Report

Merging #706 (e80dcb8) into main (d71e875) will decrease coverage by 0.41%.
The diff coverage is 16.91%.

❗ Current head e80dcb8 differs from pull request most recent head 6206b3b. Consider uploading reports for the commit 6206b3b to get more accurate results

@@            Coverage Diff             @@
##             main     #706      +/-   ##
==========================================
- Coverage   68.49%   68.08%   -0.41%     
==========================================
  Files         292      293       +1     
  Lines       45413    45644     +231     
==========================================
- Hits        31104    31076      -28     
- Misses      14309    14568     +259     
Impacted Files Coverage Δ
analytic_engine/src/table/partition.rs 0.00% <0.00%> (ø)
cluster/src/cluster_impl.rs 0.00% <0.00%> (ø)
cluster/src/lib.rs 33.33% <ø> (ø)
cluster/src/shard_lock_manager.rs 0.00% <0.00%> (ø)
src/bin/ceresdb-server.rs 1.49% <0.00%> (-0.05%) ⬇️
src/setup.rs 0.00% <0.00%> (ø)
analytic_engine/src/table/metrics.rs 84.66% <89.47%> (-5.21%) ⬇️
analytic_engine/src/instance/flush_compaction.rs 92.66% <100.00%> (-0.04%) ⬇️
analytic_engine/src/table/data.rs 91.54% <100.00%> (ø)
analytic_engine/src/table/mod.rs 83.69% <100.00%> (-0.52%) ⬇️

... and 3 files with indirect coverage changes


@ShiKaiWi (Member) left a comment:

Maybe we should create a branch for merging changes concerning the new cluster implementation.

(Outdated review threads on cluster/src/shard_lock_manager.rs, cluster/src/cluster_impl.rs, and cluster/src/lib.rs, all resolved.)

@ShiKaiWi (Member) left a comment:

Capitalize the first letter of the log message.

(Review threads on Cargo.toml, cluster/src/cluster_impl.rs, src/setup.rs, and cluster/src/shard_lock_manager.rs, all resolved.)

@ShiKaiWi force-pushed the refactor_cluster_procedure branch 2 times, most recently from 94c59e2 to 6509865 on April 20, 2023 at 10:07
@ShiKaiWi marked this pull request as ready for review on April 20, 2023 at 11:20

@ShiKaiWi (Member) left a comment:

LGTM

@ShiKaiWi added this pull request to the merge queue on Apr 21, 2023
Merged via the queue into apache:main with commit 406e156 on Apr 21, 2023
5 checks passed
(Review threads on cluster/src/shard_lock_manager.rs, cluster/src/cluster_impl.rs, and cluster/src/config.rs, resolved.)

let res = handle_close_shard(new_ctx, close_shard_req).await;
match res {
    Ok(_) => info!("Close shard success, shard_id:{shard_id}"),
    Err(e) => error!("Close shard failed, shard_id:{shard_id}, err:{e}"),
}

Contributor:

Nothing to do when the close fails? Will this leave some state dirty?

Member:

Yes. If the lease has expired, a network partition between the ceresdb-server and the etcd cluster has occurred; at that point there is nothing more we can do than hope that closing the shard succeeds.

})?;

if !granted_by_this_call {

Contributor:

When will this happen?

Member:

When retrying a close/open shard operation.
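
Below is a hypothetical sketch of that retry path. The `ShardLockManager` type exists in this PR, but `grant_lock` and the surrounding signature are assumptions made for this example:

```rust
/// Illustrative only: open a shard idempotently under the shard lock.
async fn open_shard(mgr: &mut ShardLockManager, shard_id: u32) -> Result<()> {
    let granted_by_this_call = mgr.grant_lock(shard_id).await?;
    if !granted_by_this_call {
        // The lock is already held by this node: an earlier open-shard call
        // acquired it, but the caller timed out and is now retrying. Treat
        // the retry as a no-op instead of opening the shard a second time.
        return Ok(());
    }
    // ... proceed to actually open the shard ...
    Ok(())
}
```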

(Review thread on cluster/src/shard_lock_manager.rs, resolved.)

/// Check whether lease is expired.
fn is_expired(&self) -> bool {
    let now = common_util::time::current_time_millis();

Contributor:

This is not a monotonic clock; you should use Instant instead.

Member:

I guess the expiry check does not need to be very precise, so some clock offset is tolerable here.
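
For reference, a minimal sketch of the reviewer's suggestion, tracking the deadline with the monotonic `Instant` instead of wall-clock milliseconds (the type and field names are illustrative, not the PR's):

```rust
use std::time::{Duration, Instant};

/// Illustrative only: `Instant` is monotonic, so this expiry check cannot be
/// fooled by NTP adjustments or manual changes to the system clock.
struct Lease {
    granted_at: Instant,
    ttl: Duration,
}

impl Lease {
    fn is_expired(&self) -> bool {
        self.granted_at.elapsed() >= self.ttl
    }
}
```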

(Review thread on cluster/src/shard_lock_manager.rs, resolved.)

match shard_lock {
    Some(mut v) => {
        let mut etcd_client = self.etcd_client.clone();
        v.revoke(&mut etcd_client).await?;

Contributor:

What if this revoke fails?

Member:

The error is eventually propagated up to CeresMeta; there the close-shard call fails and the subsequent operations are cancelled.

/// The temporary key in etcd
key: Bytes,
/// The value of the key in etcd
value: Bytes,

Contributor:

When will this value be used?

Member:

CeresMeta reads it to reconstruct the whole topology: shard -> node.
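
A purely illustrative sketch of what such a key/value pair might look like; the key layout and the endpoint value are assumptions, though the final commit in this PR does mention checking the root path and cluster name:

```rust
/// Illustrative only: build the etcd key and value for one shard lock.
fn shard_lock_key_value(
    root_path: &str,
    cluster_name: &str,
    shard_id: u32,
    node_endpoint: &str,
) -> (String, String) {
    // The key is temporary: it is written with the holder's etcd lease
    // attached, so etcd deletes it automatically when the lease expires.
    let key = format!("{root_path}/{cluster_name}/shard_lock/{shard_id}");
    // CeresMeta reads the value back to rebuild the shard -> node topology.
    let value = node_endpoint.to_string();
    (key, value)
}
```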

chunshao90 pushed a commit to chunshao90/ceresdb that referenced this pull request May 15, 2023
* refactor: add etcd shard lock

* feat: upgrade etcd-client to v0.10.4 and enable tls feature

* refactor: new implementation of the shard lock

* refactor: clippy compliants

* chore: use latest ceresdbproto

* chore: fix format error

* feat: operate shard lock in meta event server

* feat: enable cluster integration tests

* feat: use ceresdbproto in crates.io

* feat: use ceresdbproto in crates.io

* feat: add some basic unit tests

* refactor: rename and comment

* chore: check the root path and cluster name

---------

Co-authored-by: xikai.wxk <xikai.wxk@antgroup.com>