Managing Multiple Raft Groups on a Single Physical Node #47

bootjp · 2023-11-30T08:42:09Z

Hello MicroRaft Community,

Following our previous discussion (Discussion #44) on using MicroRaft for data sharding, I have a more specific question related to this topic.

I am exploring ways to efficiently operate multiple Raft groups on a single physical node (not RaftNode). Given that the current implementation of RaftNodeImpl can only hold one GroupId, there seem to be a couple of approaches to manage multiple Raft groups on a single physical node, and I would appreciate your advice on the best approach.

1. Operating Multiple RaftNode Instances: One way is to create multiple RaftNode instances on a single physical node, each belonging to a different Raft group. In this case, are there any special considerations for managing resources or coordinating between instances?
2. Modifying RaftNodeImpl and MicroRaft's Future Direction: Another approach is to modify RaftNodeImpl to allow it to hold multiple GroupIds. If we were to take this approach, what impact would it have on the existing design of MicroRaft, and what design challenges might arise? Furthermore, does the MicroRaft project have any plans to make such a change, now or in the future?

Both methods seem to have their pros and cons, but I would like to hear your opinions on which approach would be preferable, considering the design philosophy and future direction of MicroRaft.

Additionally, if there are other methods to efficiently operate multiple Raft groups on a single physical node, I would be very interested to learn about them.

I look forward to borrowing your wisdom and experience. Thank you.

metanet · 2023-12-03T01:09:12Z

Maintainer

There are a couple of points that need consideration. The list is not exhaustive, and very likely there are more points to consider with a deeper discussion / design exploration.

Managing the Raft group -> physical node mapping:

Whichever option is picked, there should be some solution for managing this mapping. Potential solutions are to run a dedicated Raft group for the replicated state machine of the Raft group -> physical node mapping, or to rely on an external CP data store. Given that MicroRaft already solves the CP store problem, implementing a state machine for this on top of MicroRaft might be a reasonable direction, and it would also eliminate an external dependency. On the other hand, systems like ZooKeeper already provide solutions for this problem, I think, so reusing an existing system might be faster to implement and validate.

Once the Raft group -> physical node mapping is maintained somewhere, there should be some subscription mechanism which will trigger physical nodes to start Raft nodes when they are added to / removed from Raft groups, when new Raft groups are defined, existing Raft groups are deleted, etc.
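One way to picture the subscription step is as a reconcile pass: a physical node compares the groups the mapping store assigns to it against the Raft nodes it currently runs, then starts or stops nodes to close the gap. The sketch below is in Python for brevity and is purely illustrative; `reconcile`, the shape of `mapping`, and the server and group names are assumptions, not part of MicroRaft.

```python
def reconcile(server, mapping, running):
    """Return (to_start, to_stop) group ids for one physical node.

    mapping: dict of group id -> set of server names (assumed shape of the
             Raft group -> physical node mapping).
    running: set of group ids this server currently hosts a Raft node for.
    """
    desired = {group for group, members in mapping.items() if server in members}
    to_start = sorted(desired - running)  # assigned but not yet running
    to_stop = sorted(running - desired)   # running but no longer assigned
    return to_start, to_stop


mapping = {
    "group-1": {"server-a", "server-b", "server-c"},
    "group-2": {"server-b", "server-c", "server-d"},
    "group-3": {"server-a", "server-c", "server-d"},
}

# server-a still runs a node for group-2, which it is no longer a member of,
# and has not yet started one for group-3.
to_start, to_stop = reconcile("server-a", mapping, running={"group-1", "group-2"})
print(to_start, to_stop)
```

The same function could run on every change notification from the mapping store, or once at startup when a node asks for its memberships.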

Isolation

I think managing multiple Raft groups outside of the existing RaftNode abstraction gives more opportunities for achieving isolation. You can run Raft groups in different threads if they are running in the same process, or you can completely isolate them by running them in different processes with some resource limits. Running multiple Raft groups in a single Raft node can make it more prone to the noisy neighbour problem. For instance, if a Raft group is committing and executing a computation-heavy task on the state machine and holds the thread for too long, it can trigger heartbeat timeouts and leader elections for the other Raft groups assigned to that thread.

Heartbeats (periodic append RPCs)

Once a physical node (i.e., server) is part of multiple Raft groups, it will receive / append heartbeats for every Raft group it is part of. For instance, if server A and server B are part of N Raft groups and if one of them is the leader, there will be N heartbeat messages and responses between these two servers. There may be a need to optimize this and just send 1 heartbeat message (i.e., empty AppendRequest RPC) that contains all N Raft groups.
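To make the difference concrete, here is a small Python sketch that counts messages in both schemes. It is only an illustration of the idea: the batched message layout, the function names, and the `groups` structure are assumptions, and MicroRaft does not ship such a batched heartbeat.

```python
from collections import defaultdict


def per_group_heartbeats(groups):
    """Unoptimized case: one heartbeat per (group, follower)."""
    return [
        (group["leader"], follower, group_id)
        for group_id, group in groups.items()
        for follower in group["followers"]
    ]


def coalesced_heartbeats(groups):
    """One message per (leader server, follower server) pair, carrying every
    group id for which that leader must heartbeat that follower."""
    batches = defaultdict(list)
    for group_id, group in groups.items():
        for follower in group["followers"]:
            batches[(group["leader"], follower)].append(group_id)
    return dict(batches)


# N = 4 Raft groups, all led by server A, with server B as a follower.
groups = {f"group-{i}": {"leader": "A", "followers": ["B"]} for i in range(4)}

print(len(per_group_heartbeats(groups)))  # one message per group
print(coalesced_heartbeats(groups))       # a single A -> B message listing all groups
```

The saving only applies to groups whose leaders sit on the same server, so it depends on how leadership is spread across the cluster.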

API evolution

If we go with option #2, at the very basic level, the RaftNode interface needs new methods like addRaftGroup(RaftGroupId), removeRaftGroup(RaftGroupId), etc. The other API methods would probably also need a new RaftGroupId parameter, and the RaftNodeBuilder interface would need to evolve. We also need to explore how the other main interfaces, such as Transport, RaftStore, StateMachine, etc., would be affected. Can we keep them as they are and have factory classes create independent components per Raft group, or can we have a single component object serve multiple Raft groups and take the RaftGroupId as a parameter in its API methods? This might bring some complexity and requires some exploration and design discussions.

There are a few OSS Raft implementations that can be checked for this:

Alternatively, in option #1, the very same complexity can be implemented outside of the RaftNode class. The advantage is that RaftNode keeps its simplicity and basically continues to abstract away the Raft algorithm implementation.
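As a rough picture of option #1, the per-group bookkeeping could live in a small host object that owns one single-group node per Raft group. The Python sketch below is hypothetical throughout: `RaftGroupHost`, `StubRaftNode`, and their methods are stand-ins invented for illustration and do not correspond to MicroRaft classes.

```python
class StubRaftNode:
    """Stand-in for a single-group Raft node; not a MicroRaft class."""

    def __init__(self, group_id):
        self.group_id = group_id
        self.running = False

    def start(self):
        self.running = True

    def terminate(self):
        self.running = False


class RaftGroupHost:
    """Keeps one node per Raft group on a physical node, outside the node itself."""

    def __init__(self, node_factory):
        self._node_factory = node_factory  # builds independent components per group
        self._nodes = {}

    def add_raft_group(self, group_id):
        if group_id in self._nodes:
            raise ValueError(f"already hosting {group_id}")
        node = self._node_factory(group_id)
        node.start()
        self._nodes[group_id] = node
        return node

    def remove_raft_group(self, group_id):
        node = self._nodes.pop(group_id)  # raises KeyError for an unknown group
        node.terminate()
        return node

    def node_for(self, group_id):
        return self._nodes[group_id]

    def group_ids(self):
        return sorted(self._nodes)


host = RaftGroupHost(StubRaftNode)
host.add_raft_group("tenant-1")
host.add_raft_group("tenant-2")
removed = host.remove_raft_group("tenant-1")
print(host.group_ids(), removed.running)
```

Incoming messages would then be routed by group id via something like `node_for`, which is the same routing that option #2 would have to do inside RaftNode.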

If you want to do more exploration on this, I would suggest adding more detail to the proposed solutions so that we can compare them more concretely.

Hope my reply helps.

Regards,

3 replies

bootjp Dec 7, 2023
Author

Thank you for your detailed response.

I appreciate the comprehensive answers to my broad query.

Managing the Raft group -> physical node mapping:

Regarding this, I am working on implementing a system equivalent to tikv/pd using etcd as the backend storage. I plan to resolve this issue soon.

Once the Raft group -> physical node mapping is maintained somewhere, there should be some subscription mechanism which will trigger physical nodes to start Raft nodes when they are added to / removed from Raft groups, when new Raft groups are defined, existing Raft groups are deleted, etc.

Similarly, I am aiming to resolve this by implementing a system where new nodes query PD about their Raft group memberships upon startup. (An interface for adding new nodes to PD will be added in the future.)

Once a physical node (i.e., server) is part of multiple Raft groups, it will receive / append heartbeats for every Raft group it is part of. For instance, if server A and server B are part of N Raft groups and if one of them is the leader, there will be N heartbeat messages and responses between these two servers. There may be a need to optimize this and just send 1 heartbeat message (i.e., empty AppendRequest RPC) that contains all N Raft groups.

This is a concern in the project I am working on.

Let me explain a bit about the use case of the system I am building. We are using MicroRaft to distribute search indexes based on Apache Lucene, similar to Elasticsearch. To shard data for each tenant, we are considering dividing RaftGroups per tenant, thus achieving sharding (similar to Elasticsearch's Index). We need at least 100 tenants currently.

In this case, with one RaftNode per Raft group (at least 100 groups), a large number of heartbeats may be generated, which is a concern.

To address this issue of massive heartbeats, it seems the only option is to make changes to MicroRaft.
(If there is another way, please let me know.)
Are there any plans to optimize heartbeat on the MicroRaft side?
Or is there room to incorporate such a pull request?

API evolution

If optimizing heartbeats and incorporating pull requests are possible on the MicroRaft side, I would like to integrate the necessary interfaces into MicroRaft's RaftNode.

bootjp Dec 21, 2023
Author

Are there any plans to optimize heartbeat on the MicroRaft side?
Or is there room to incorporate such a pull request?

@metanet How about this?
If there is any lack of information, please let me know and I will address it.
I apologize for the inconvenience during your busy schedule, but I look forward to your kind cooperation.

metanet Dec 21, 2023
Maintainer

Hi @bootjp

sorry for my delayed response. i am on a christmas break hence not checking notifications very frequently.

on how many servers will you run those 100 Raft groups? what is your group size? what will be QPS of each Raft group to replicate actual mutation operations? should we estimate such stuff and see if HBs will be an actual overhead or not?

if you are using an RPC framework like gRPC, Thrift, etc, you will have open connections across your servers maintained by the rpc framework anyway and you will be sending cheap HB payloads through them. you can tune HB timeout and HB periods and MicroRaft applies some jitter for HB'ing. and if your Raft groups are getting mutation operations every second, there will be non-empty append requests sent from leaders to followers, which are accounted for HB'ing as well.

Are there any plans to optimize heartbeat on the MicroRaft side?
Or is there room to incorporate such a pull request?

I would suggest that you do the math and run some large scale tests to see whether there is an actual bottleneck. when we see a bottleneck, we can discuss how we can tune it better or further optimize with some code work. otherwise, i think we are mostly talking about hypothetical problems.
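as a starting point for that math, a back-of-the-envelope estimate could look like the Python sketch below. every number in it is an assumed input for illustration (100 groups, 3 members per group, a 2 second heartbeat period, 5 servers), and it ignores that non-empty append requests also count as heartbeats, so it is an upper bound for idle groups.

```python
def heartbeat_requests_per_second(groups, group_size, heartbeat_period_s):
    """Cluster-wide heartbeat requests per second when every group is idle.

    Each leader sends one heartbeat to each of its (group_size - 1) followers
    once per heartbeat period.
    """
    followers_per_group = group_size - 1
    return groups * followers_per_group / heartbeat_period_s


# Assumed inputs, not measured values or MicroRaft defaults.
requests = heartbeat_requests_per_second(groups=100, group_size=3, heartbeat_period_s=2.0)
messages = 2 * requests           # every request also gets a response
per_server_sent = requests / 5    # assuming leaders are spread evenly over 5 servers

print(requests, messages, per_server_sent)
```

plugging in the real group count, group size, and configured heartbeat period gives a number that can be compared against what the RPC layer handles comfortably in a load test.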

hope my answer helps.

regards,

Answer selected by bootjp
