
refactor: implement the distributed lock of shard #706

Merged: 13 commits merged into apache:main on Apr 21, 2023

Conversation

ZuLiangWang (Contributor)

Which issue does this PR close?

Closes #

Rationale for this change

To ensure that user data in distributed mode is never lost, we need a mechanism that guarantees CeresDB never has multiple leaders for the same shard under any circumstances.

What changes are included in this PR?

  • Introduce an etcd client in CeresDB.
  • Implement distributed locks through etcd; the corresponding lock must be acquired before every operation on a shard (a minimal sketch of the idea follows this list).
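
The following is a minimal sketch of how such an exclusive lock can be built on etcd with the etcd-client crate. It illustrates the approach only; the key layout, function name, and TTL handling here are assumptions made for this example, not the PR's actual code.

```rust
use etcd_client::{Client, Compare, CompareOp, PutOptions, Txn, TxnOp};

/// Try to acquire an exclusive lock for one shard (illustrative only).
async fn try_grant_shard_lock(
    client: &mut Client,
    shard_id: u32,
    node_endpoint: &str,
    ttl_sec: i64,
) -> Result<bool, etcd_client::Error> {
    // A lease ties the lock to this process: if the holder stops sending
    // keep-alives, etcd deletes the key and another node may take the shard.
    let lease = client.lease_grant(ttl_sec, None).await?;
    let key = format!("/shard_lock/{shard_id}");

    // Create the key only if it does not exist yet (create_revision == 0);
    // this conditional write is what makes the grant exclusive across nodes.
    let txn = Txn::new()
        .when(vec![Compare::create_revision(
            key.clone(),
            CompareOp::Equal,
            0,
        )])
        .and_then(vec![TxnOp::put(
            key,
            node_endpoint,
            Some(PutOptions::new().with_lease(lease.id())),
        )]);

    Ok(client.txn(txn).await?.succeeded())
}
```

Because the key is bound to the lease, a crashed or partitioned holder loses the lock automatically once the lease's TTL elapses, so at most one node can own a shard at any time.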

Are there any user-facing changes?

None

How is this change tested?

All unit tests and integration tests pass.

@ZuLiangWang force-pushed the refactor_cluster_procedure branch 2 times, most recently from cb37f61 to 6ffefa2 on March 8, 2023 at 07:08

codecov-commenter commented Mar 8, 2023

Codecov Report

Merging #706 (e80dcb8) into main (d71e875) will decrease coverage by 0.41%.
The diff coverage is 16.91%.

❗ Current head e80dcb8 differs from pull request most recent head 6206b3b. Consider uploading reports for the commit 6206b3b to get more accurate results

@@            Coverage Diff             @@
##             main     #706      +/-   ##
==========================================
- Coverage   68.49%   68.08%   -0.41%     
==========================================
  Files         292      293       +1     
  Lines       45413    45644     +231     
==========================================
- Hits        31104    31076      -28     
- Misses      14309    14568     +259     
Impacted Files Coverage Δ
analytic_engine/src/table/partition.rs 0.00% <0.00%> (ø)
cluster/src/cluster_impl.rs 0.00% <0.00%> (ø)
cluster/src/lib.rs 33.33% <ø> (ø)
cluster/src/shard_lock_manager.rs 0.00% <0.00%> (ø)
src/bin/ceresdb-server.rs 1.49% <0.00%> (-0.05%) ⬇️
src/setup.rs 0.00% <0.00%> (ø)
analytic_engine/src/table/metrics.rs 84.66% <89.47%> (-5.21%) ⬇️
analytic_engine/src/instance/flush_compaction.rs 92.66% <100.00%> (-0.04%) ⬇️
analytic_engine/src/table/data.rs 91.54% <100.00%> (ø)
analytic_engine/src/table/mod.rs 83.69% <100.00%> (-0.52%) ⬇️

... and 3 files with indirect coverage changes


@ShiKaiWi (Member) left a comment:

Maybe we should create a branch for merging changes concerning the new cluster implementation.

(Outdated review threads on cluster/src/shard_lock_manager.rs, cluster/src/cluster_impl.rs, and cluster/src/lib.rs, all resolved.)

@ShiKaiWi (Member) left a comment:

Capitalize the first letter of the log message.

(Review threads on Cargo.toml, cluster/src/cluster_impl.rs, src/setup.rs, and cluster/src/shard_lock_manager.rs, all resolved.)

@ShiKaiWi force-pushed the refactor_cluster_procedure branch 2 times, most recently from 94c59e2 to 6509865 on April 20, 2023 at 10:07
@ShiKaiWi marked this pull request as ready for review on April 20, 2023 at 11:20

@ShiKaiWi (Member) left a comment:

LGTM

@ShiKaiWi added this pull request to the merge queue on Apr 21, 2023
Merged via the queue into apache:main with commit 406e156 on Apr 21, 2023
5 checks passed
(Review threads on cluster/src/shard_lock_manager.rs, cluster/src/cluster_impl.rs, and cluster/src/config.rs, resolved.)

let res = handle_close_shard(new_ctx, close_shard_req).await;
match res {
    Ok(_) => info!("Close shard success, shard_id:{shard_id}"),
    Err(e) => error!("Close shard failed, shard_id:{shard_id}, err:{e}"),
}

Contributor:

Nothing to do when the close fails? Will this leave some state dirty?

Member:

Yes. If the lease has expired, a network partition between the ceresdb-server and the etcd cluster has occurred; at that point there is nothing more we can do than hope that closing the shard succeeds.

})?;

if !granted_by_this_call {

Contributor:

When will this happen?

Member:

When retrying a close/open shard operation.
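
Below is a hypothetical sketch of that retry path. The `ShardLockManager` type exists in this PR, but `grant_lock` and the surrounding signature are assumptions made for this example:

```rust
/// Illustrative only: open a shard idempotently under the shard lock.
async fn open_shard(mgr: &mut ShardLockManager, shard_id: u32) -> Result<()> {
    let granted_by_this_call = mgr.grant_lock(shard_id).await?;
    if !granted_by_this_call {
        // The lock is already held by this node: an earlier open-shard call
        // acquired it, but the caller timed out and is now retrying. Treat
        // the retry as a no-op instead of opening the shard a second time.
        return Ok(());
    }
    // ... proceed to actually open the shard ...
    Ok(())
}
```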

(Review thread on cluster/src/shard_lock_manager.rs, resolved.)

/// Check whether lease is expired.
fn is_expired(&self) -> bool {
    let now = common_util::time::current_time_millis();

Contributor:

This is not a monotonic clock; you should use Instant instead.

Member:

I guess the expiry check does not need to be very precise, so some clock offset is tolerable here.
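
For reference, a minimal sketch of the reviewer's suggestion, tracking the deadline with the monotonic `Instant` instead of wall-clock milliseconds (the type and field names are illustrative, not the PR's):

```rust
use std::time::{Duration, Instant};

/// Illustrative only: `Instant` is monotonic, so this expiry check cannot be
/// fooled by NTP adjustments or manual changes to the system clock.
struct Lease {
    granted_at: Instant,
    ttl: Duration,
}

impl Lease {
    fn is_expired(&self) -> bool {
        self.granted_at.elapsed() >= self.ttl
    }
}
```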

(Review thread on cluster/src/shard_lock_manager.rs, resolved.)

match shard_lock {
    Some(mut v) => {
        let mut etcd_client = self.etcd_client.clone();
        v.revoke(&mut etcd_client).await?;

Contributor:

What if this revoke fails?

Member:

The error is eventually propagated up to CeresMeta; there the close-shard call fails and the subsequent operations are cancelled.

/// The temporary key in etcd
key: Bytes,
/// The value of the key in etcd
value: Bytes,

Contributor:

When will this value be used?

Member:

CeresMeta reads it to reconstruct the whole topology: shard -> node.
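
A purely illustrative sketch of what such a key/value pair might look like; the key layout and the endpoint value are assumptions, though the final commit in this PR does mention checking the root path and cluster name:

```rust
/// Illustrative only: build the etcd key and value for one shard lock.
fn shard_lock_key_value(
    root_path: &str,
    cluster_name: &str,
    shard_id: u32,
    node_endpoint: &str,
) -> (String, String) {
    // The key is temporary: it is written with the holder's etcd lease
    // attached, so etcd deletes it automatically when the lease expires.
    let key = format!("{root_path}/{cluster_name}/shard_lock/{shard_id}");
    // CeresMeta reads the value back to rebuild the shard -> node topology.
    let value = node_endpoint.to_string();
    (key, value)
}
```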

chunshao90 pushed a commit to chunshao90/ceresdb that referenced this pull request May 15, 2023
* refactor: add etcd shard lock

* feat: upgrade etcd-client to v0.10.4 and enable tls feature

* refactor: new implementation of the shard lock

* refactor: clippy compliants

* chore: use latest ceresdbproto

* chore: fix format error

* feat: operate shard lock in meta event server

* feat: enable cluster integration tests

* feat: use ceresdbproto in crates.io

* feat: use ceresdbproto in crates.io

* feat: add some basic unit tests

* refactor: rename and comment

* chore: check the root path and cluster name

---------

Co-authored-by: xikai.wxk <xikai.wxk@antgroup.com>