Master election using raft consensus algorithm #1146

rtsisyk · 2015-11-13T14:12:11Z

Adapt code for master election from bsync branch.

Part of #1146

The new options are: - raft_is_enabled - enable/disable Raft. When disabled, the node is supposed to work like if Raft does not exist. Like earlier; - raft_is_candidate - a flag whether the instance can try to become a leader. Note, it can vote for other nodes regardless of value of this option; - raft_election_timeout - how long need to wait until election end, in seconds. The options don't do anything now. They are added separately in order to keep such mundane changes from the main Raft commit, to simplify its review. Part of #1146

The patch introduces a new type of system message used to notify the followers of the instance's raft status updates. It's relay's responsibility to deliver the new system rows to its peers. The notification system reuses and extends the same row type used to persist raft state in WAL and snapshot. Part of #1146 Part of #5204

The commit is a core part of Raft implementation. It introduces the Raft state machine implementation and its integration into the instance's life cycle. The implementation follows the protocol to the letter except a few important details. Firstly, the original Raft assumes, that all nodes share the same log record numbers. In Tarantool they are called LSNs. But in case of Tarantool each node has its own LSN in its own component of vclock. That makes the election messages a bit heavier, because the nodes need to send and compare complete vclocks of each other instead of a single number like in the original Raft. But logic becomes simpler. Because in the original Raft there is a problem of uncertainty about what to do with records of an old leader right after a new leader is elected. They could be rolled back or confirmed depending on circumstances. The issue disappears when vclock is used. Secondly, leader election works differently during cluster bootstrap, until number of bootstrapped replicas becomes >= election quorum. That arises from specifics of replicas bootstrap and order of systems initialization. In short: during bootstrap a leader election may use a smaller election quorum than the configured one. See more details in the code. Part of #1146

Box.info.raft returns a table of form: { state: <string>, term: <number>, vote: <instance ID>, leader: <instance ID> } The fields correspond to the same named Raft concepts one to one. This info dump is supposed to help with the tests, first of all. And with investigation of problems in a real cluster. Part of #1146

Part of #1146

Applier is going to need its numeric ID in order to tell the future Raft module who is a sender of a Raft message. An alternative would be to add sender ID to each Raft message, but this looks like a crutch. Moreover, applier still needs to know its numeric ID in order to notify Raft about heartbeats from the peer node. Needed for #1146

An instance is writable if box.cfg.read_only is false, and it is not orphan. Update of the final read-only state of the instance needs to fire read-only update triggers, and notify the engines. These 2 flags were easy and cheap to check on each operation, and the triggers were easy to use since both flags are stored and updated inside box.cc. That is going to change when Raft is introduced. Raft will add 2 more checks: - A flag if Raft is enabled on the node. If it is not, then Raft state won't affect whether the instance is writable; - When Raft is enabled, it will allow writes on a leader only. It means a check for being read-only would look like this: is_ro || is_orphan || (raft_is_enabled() && !raft_is_leader()) This is significantly slower. Besides, Raft somehow needs to access the read-only triggers and engine API - this looks wrong. The patch introduces a new flag is_ro_summary. The flag incorporates all the read-only conditions into one flag. When some subsystem may change read-only state of the instance, it needs to call box_update_ro_summary(), and the function takes care of updating the summary flag, running the triggers, and notifying the engines. Raft will use this function when its state or config will change. Needed for #1146

Relay.cc and box.cc obtained box.cfg.wal_dir value using cfg_gets() call. To initialize WAL and create struct recovery objects. That is not only a bit dangerous (cfg_gets() uses Lua API and can throw a Lua error) and slow, but also not necessary - wal_dir parameter is constant, it can't be changed after instance start. It means, the value can be stored somewhere one time and then used without Lua. Main motivation is that the WAL directory path will be needed inside relay threads to restart their recovery iterators in the Raft patch. They can't use cfg_gets(), because Lua lives in TX thread. But can access a constant global variable, introduced in this patch (it existed before, but now has a method to get it). Needed for #1146

Struct replicaset didn't store a number of registered replicas. Only an array, which was necessary to fullscan each time when want to find the count. That is going to be needed in Raft to calculate election quorum. The patch makes the count tracked so as it could be found for constant time by simply reading an integer. Needed for #1146

The patch introduces a sceleton of Raft module and a method to persist a Raft state in snapshot, not bound to any space. Part of #1146

The new options are: - raft_is_enabled - enable/disable Raft. When disabled, the node is supposed to work like if Raft does not exist. Like earlier; - raft_is_candidate - a flag whether the instance can try to become a leader. Note, it can vote for other nodes regardless of value of this option; - raft_election_timeout - how long need to wait until election end, in seconds. The options don't do anything now. They are added separately in order to keep such mundane changes from the main Raft commit, to simplify its review. Part of #1146

The patch introduces a new type of system message used to notify the followers of the instance's raft status updates. It's relay's responsibility to deliver the new system rows to its peers. The notification system reuses and extends the same row type used to persist raft state in WAL and snapshot. Part of #1146 Part of #5204

The commit is a core part of Raft implementation. It introduces the Raft state machine implementation and its integration into the instance's life cycle. The implementation follows the protocol to the letter except a few important details. Firstly, the original Raft assumes, that all nodes share the same log record numbers. In Tarantool they are called LSNs. But in case of Tarantool each node has its own LSN in its own component of vclock. That makes the election messages a bit heavier, because the nodes need to send and compare complete vclocks of each other instead of a single number like in the original Raft. But logic becomes simpler. Because in the original Raft there is a problem of uncertainty about what to do with records of an old leader right after a new leader is elected. They could be rolled back or confirmed depending on circumstances. The issue disappears when vclock is used. Secondly, leader election works differently during cluster bootstrap, until number of bootstrapped replicas becomes >= election quorum. That arises from specifics of replicas bootstrap and order of systems initialization. In short: during bootstrap a leader election may use a smaller election quorum than the configured one. See more details in the code. Part of #1146

Box.info.raft returns a table of form: { state: <string>, term: <number>, vote: <instance ID>, leader: <instance ID> } The fields correspond to the same named Raft concepts one to one. This info dump is supposed to help with the tests, first of all. And with investigation of problems in a real cluster. Part of #1146

Part of #1146

The new options are: - election_is_enabled - enable/disable leader election (via Raft). When disabled, the node is supposed to work like if Raft does not exist. Like earlier; - election_is_candidate - a flag whether the instance can try to become a leader. Note, it can vote for other nodes regardless of value of this option; - election_timeout - how long need to wait until election end, in seconds. The options don't do anything now. They are added separately in order to keep such mundane changes from the main Raft commit, to simplify its review. Option names don't mention 'Raft' on purpose, because - Not all users know what is Raft, so they may not even know it is related to leader election; - In future the algorithm may change from Raft to something else, so better not to depend on it too much in the public API. Part of #1146

The patch introduces a new type of system message used to notify the followers of the instance's raft status updates. It's relay's responsibility to deliver the new system rows to its peers. The notification system reuses and extends the same row type used to persist raft state in WAL and snapshot. Part of #1146 Part of #5204

The commit is a core part of Raft implementation. It introduces the Raft state machine implementation and its integration into the instance's life cycle. The implementation follows the protocol to the letter except a few important details. Firstly, the original Raft assumes, that all nodes share the same log record numbers. In Tarantool they are called LSNs. But in case of Tarantool each node has its own LSN in its own component of vclock. That makes the election messages a bit heavier, because the nodes need to send and compare complete vclocks of each other instead of a single number like in the original Raft. But logic becomes simpler. Because in the original Raft there is a problem of uncertainty about what to do with records of an old leader right after a new leader is elected. They could be rolled back or confirmed depending on circumstances. The issue disappears when vclock is used. Secondly, leader election works differently during cluster bootstrap, until number of bootstrapped replicas becomes >= election quorum. That arises from specifics of replicas bootstrap and order of systems initialization. In short: during bootstrap a leader election may use a smaller election quorum than the configured one. See more details in the code. Part of #1146

Box.info.election returns a table of form: { state: <string>, term: <number>, vote: <instance ID>, leader: <instance ID> } The fields correspond to the same named Raft concepts one to one. This info dump is supposed to help with the tests, first of all. And with investigation of problems in a real cluster. The API doesn't mention 'Raft' on purpose, to keep it not depending specifically on Raft, and not to confuse users who don't know anything about Raft (even that it is about leader election and synchronous replication). Part of #1146

Part of #1146

The new option can be one of 3 values: 'off', 'candidate', 'voter'. It replaces 2 old options: election_is_enabled and election_is_candidate. These flags looked strange, that it was possible to set candidate true, but disable election at the same time. Also it would not look good if we would ever decide to introduce another mode like a data-less sentinel node, for example. Just for voting. Anyway, the single option approach looks easier to configure and to extend. - 'off' means the election is disabled on the node. It is the same as election_is_enabled = false in the old config; - 'voter' means the node can vote and is never writable. The same as election_is_enabled = true + election_is_candidate = false in the old config; - 'candidate' means the node is a full-featured cluster member, which eventually may become a leader. The same as election_is_enabled = true + election_is_candidate = true in the old config. Part of #1146

An instance is writable if box.cfg.read_only is false, and it is not orphan. Update of the final read-only state of the instance needs to fire read-only update triggers, and notify the engines. These 2 flags were easy and cheap to check on each operation, and the triggers were easy to use since both flags are stored and updated inside box.cc. We're going to make txn_limbo affect read-only state. This would be useful, since a write on a node other than limbo owner results in an error anyway. It means a check for being read-only would look like this: is_ro || is_orphan || (!txn_limbo_is_empty() && limbo->owner_id != instance_id) This is significantly slower. The patch introduces a new flag is_ro_summary. The flag incorporates all the read-only conditions into one flag. When some subsystem may change read-only state of the instance, it needs to call box_update_ro_summary(), and the function takes care of updating the summary flag, running the triggers, and notifying the engines. The limbo will use this function each time its owner changes or it gets emptied out. Needed for #1146 (cherry picked from commit 31dc4fa)

Struct replicaset didn't store a number of registered replicas. Only an array, which was necessary to fullscan each time when want to find the count. That is going to be needed in Raft to calculate election quorum. The patch makes the count tracked so as it could be found for constant time by simply reading an integer. Needed for #1146 (cherry picked from commit 764a548)

rtsisyk added the replication label Nov 13, 2015

rtsisyk self-assigned this Nov 13, 2015

rtsisyk added this to the 1.7.0 milestone Nov 13, 2015

kostja mentioned this issue Nov 13, 2015

Synchronous replication #980

Open

17 tasks

rtsisyk changed the title ~~Master election~~ Master election using raft consensus algorithm Nov 13, 2015

rtsisyk modified the milestones: 1.8.0, 1.7.0 Jun 7, 2016

kostja modified the milestones: 1.8.0, 1.8.1 Mar 31, 2017

kostja modified the milestones: 1.8.3, 1.8.4, 1.8.5 Oct 27, 2017

Gerold103 assigned Gerold103 and unassigned rtsisyk Feb 9, 2018

kostja modified the milestones: 2.1.1, 1.9.0, 2.1.2 Feb 11, 2018

kostja unassigned Gerold103 Feb 20, 2018

Gerold103 removed the replication label Sep 24, 2018

Gerold103 self-assigned this Sep 24, 2018

Gerold103 added feature A new functionality app complicated labels Sep 24, 2018

kostja modified the milestones: 2.1.1, 2.2.0 Nov 19, 2018

kyukhin removed this from the 2.2.0 milestone Apr 8, 2019

Gerold103 added a commit that referenced this issue Sep 28, 2020

raft: add tests

0c13347

Part of #1146

Gerold103 added a commit that referenced this issue Sep 29, 2020

raft: add tests

2e3c864

Part of #1146

Gerold103 added a commit that referenced this issue Sep 29, 2020

raft: introduce persistent raft state

4f0f7c8

The patch introduces a sceleton of Raft module and a method to persist a Raft state in snapshot, not bound to any space. Part of #1146

Gerold103 added a commit that referenced this issue Sep 29, 2020

raft: add tests

33aa2d3

Part of #1146

Gerold103 added a commit that referenced this issue Sep 29, 2020

raft: add tests

cf79964

Part of #1146

sergos mentioned this issue Nov 2, 2020

RAFT implementation using quorum based synchro #5202

Closed

Mons closed this as completed Dec 4, 2020

kyukhin removed this from the wishlist milestone Feb 12, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Master election using raft consensus algorithm #1146

Master election using raft consensus algorithm #1146

rtsisyk commented Nov 13, 2015

Master election using raft consensus algorithm #1146

Master election using raft consensus algorithm #1146

Comments

rtsisyk commented Nov 13, 2015