Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Master election using raft consensus algorithm #1146

Closed
rtsisyk opened this issue Nov 13, 2015 · 0 comments
Closed

Master election using raft consensus algorithm #1146

rtsisyk opened this issue Nov 13, 2015 · 0 comments
Labels
app complicated feature A new functionality

Comments

@rtsisyk
Copy link
Contributor

rtsisyk commented Nov 13, 2015

Adapt code for master election from bsync branch.

@rtsisyk rtsisyk self-assigned this Nov 13, 2015
@rtsisyk rtsisyk added this to the 1.7.0 milestone Nov 13, 2015
@kostja kostja mentioned this issue Nov 13, 2015
17 tasks
@rtsisyk rtsisyk changed the title Master election Master election using raft consensus algorithm Nov 13, 2015
@rtsisyk rtsisyk modified the milestones: 1.8.0, 1.7.0 Jun 7, 2016
@kostja kostja modified the milestones: 1.8.0, 1.8.1 Mar 31, 2017
@kostja kostja modified the milestones: 1.8.3, 1.8.4, 1.8.5 Oct 27, 2017
@Gerold103 Gerold103 assigned Gerold103 and unassigned rtsisyk Feb 9, 2018
@kostja kostja modified the milestones: 2.1.1, 1.9.0, 2.1.2 Feb 11, 2018
@Gerold103 Gerold103 self-assigned this Sep 24, 2018
@Gerold103 Gerold103 added feature A new functionality app complicated labels Sep 24, 2018
@kostja kostja modified the milestones: 2.1.1, 2.2.0 Nov 19, 2018
@kyukhin kyukhin removed this from the 2.2.0 milestone Apr 8, 2019
Gerold103 added a commit that referenced this issue Sep 28, 2020
Gerold103 added a commit that referenced this issue Sep 29, 2020
The new options are:

- raft_is_enabled - enable/disable Raft. When disabled, the node
  is supposed to work like if Raft does not exist. Like earlier;

- raft_is_candidate - a flag whether the instance can try to
  become a leader. Note, it can vote for other nodes regardless of
  value of this option;

- raft_election_timeout - how long need to wait until election
  end, in seconds.

The options don't do anything now. They are added separately in
order to keep such mundane changes from the main Raft commit, to
simplify its review.

Part of #1146
Gerold103 pushed a commit that referenced this issue Sep 29, 2020
The patch introduces a new type of system message used to notify the
followers of the instance's raft status updates.
It's relay's responsibility to deliver the new system rows to its peers.
The notification system reuses and extends the same row type used to
persist raft state in WAL and snapshot.

Part of #1146
Part of #5204
Gerold103 added a commit that referenced this issue Sep 29, 2020
The commit is a core part of Raft implementation. It introduces
the Raft state machine implementation and its integration into the
instance's life cycle.

The implementation follows the protocol to the letter except a few
important details.

Firstly, the original Raft assumes, that all nodes share the same
log record numbers. In Tarantool they are called LSNs. But in case
of Tarantool each node has its own LSN in its own component of
vclock. That makes the election messages a bit heavier, because
the nodes need to send and compare complete vclocks of each other
instead of a single number like in the original Raft. But logic
becomes simpler. Because in the original Raft there is a problem
of uncertainty about what to do with records of an old leader
right after a new leader is elected. They could be rolled back or
confirmed depending on circumstances. The issue disappears when
vclock is used.

Secondly, leader election works differently during cluster
bootstrap, until number of bootstrapped replicas becomes >=
election quorum. That arises from specifics of replicas bootstrap
and order of systems initialization. In short: during bootstrap a
leader election may use a smaller election quorum than the
configured one. See more details in the code.

Part of #1146
Gerold103 added a commit that referenced this issue Sep 29, 2020
Box.info.raft returns a table of form:

    {
        state: <string>,
        term: <number>,
        vote: <instance ID>,
        leader: <instance ID>
    }

The fields correspond to the same named Raft concepts one to one.
This info dump is supposed to help with the tests, first of all.
And with investigation of problems in a real cluster.

Part of #1146
Gerold103 added a commit that referenced this issue Sep 29, 2020
Gerold103 added a commit that referenced this issue Sep 29, 2020
Applier is going to need its numeric ID in order to tell the
future Raft module who is a sender of a Raft message. An
alternative would be to add sender ID to each Raft message, but
this looks like a crutch. Moreover, applier still needs to know
its numeric ID in order to notify Raft about heartbeats from the
peer node.

Needed for #1146
Gerold103 added a commit that referenced this issue Sep 29, 2020
An instance is writable if box.cfg.read_only is false, and it is
not orphan. Update of the final read-only state of the instance
needs to fire read-only update triggers, and notify the engines.
These 2 flags were easy and cheap to check on each operation, and
the triggers were easy to use since both flags are stored and
updated inside box.cc.

That is going to change when Raft is introduced. Raft will add 2
more checks:

  - A flag if Raft is enabled on the node. If it is not, then Raft
    state won't affect whether the instance is writable;

  - When Raft is enabled, it will allow writes on a leader only.

It means a check for being read-only would look like this:

    is_ro || is_orphan || (raft_is_enabled() && !raft_is_leader())

This is significantly slower. Besides, Raft somehow needs to
access the read-only triggers and engine API - this looks wrong.

The patch introduces a new flag is_ro_summary. The flag
incorporates all the read-only conditions into one flag. When some
subsystem may change read-only state of the instance, it needs to
call box_update_ro_summary(), and the function takes care of
updating the summary flag, running the triggers, and notifying the
engines.

Raft will use this function when its state or config will change.

Needed for #1146
Gerold103 added a commit that referenced this issue Sep 29, 2020
Relay.cc and box.cc obtained box.cfg.wal_dir value using
cfg_gets() call. To initialize WAL and create struct recovery
objects.

That is not only a bit dangerous (cfg_gets() uses Lua API and can
throw a Lua error) and slow, but also not necessary - wal_dir
parameter is constant, it can't be changed after instance start.

It means, the value can be stored somewhere one time and then used
without Lua.

Main motivation is that the WAL directory path will be needed
inside relay threads to restart their recovery iterators in the
Raft patch. They can't use cfg_gets(), because Lua lives in TX
thread. But can access a constant global variable, introduced in
this patch (it existed before, but now has a method to get it).

Needed for #1146
Gerold103 added a commit that referenced this issue Sep 29, 2020
Struct replicaset didn't store a number of registered replicas.
Only an array, which was necessary to fullscan each time when want
to find the count.

That is going to be needed in Raft to calculate election quorum.
The patch makes the count tracked so as it could be found for
constant time by simply reading an integer.

Needed for #1146
Gerold103 added a commit that referenced this issue Sep 29, 2020
The patch introduces a sceleton of Raft module and a method to
persist a Raft state in snapshot, not bound to any space.

Part of #1146
Gerold103 added a commit that referenced this issue Sep 29, 2020
The new options are:

- raft_is_enabled - enable/disable Raft. When disabled, the node
  is supposed to work like if Raft does not exist. Like earlier;

- raft_is_candidate - a flag whether the instance can try to
  become a leader. Note, it can vote for other nodes regardless of
  value of this option;

- raft_election_timeout - how long need to wait until election
  end, in seconds.

The options don't do anything now. They are added separately in
order to keep such mundane changes from the main Raft commit, to
simplify its review.

Part of #1146
Gerold103 pushed a commit that referenced this issue Sep 29, 2020
The patch introduces a new type of system message used to notify the
followers of the instance's raft status updates.
It's relay's responsibility to deliver the new system rows to its peers.
The notification system reuses and extends the same row type used to
persist raft state in WAL and snapshot.

Part of #1146
Part of #5204
Gerold103 added a commit that referenced this issue Sep 29, 2020
The commit is a core part of Raft implementation. It introduces
the Raft state machine implementation and its integration into the
instance's life cycle.

The implementation follows the protocol to the letter except a few
important details.

Firstly, the original Raft assumes, that all nodes share the same
log record numbers. In Tarantool they are called LSNs. But in case
of Tarantool each node has its own LSN in its own component of
vclock. That makes the election messages a bit heavier, because
the nodes need to send and compare complete vclocks of each other
instead of a single number like in the original Raft. But logic
becomes simpler. Because in the original Raft there is a problem
of uncertainty about what to do with records of an old leader
right after a new leader is elected. They could be rolled back or
confirmed depending on circumstances. The issue disappears when
vclock is used.

Secondly, leader election works differently during cluster
bootstrap, until number of bootstrapped replicas becomes >=
election quorum. That arises from specifics of replicas bootstrap
and order of systems initialization. In short: during bootstrap a
leader election may use a smaller election quorum than the
configured one. See more details in the code.

Part of #1146
Gerold103 added a commit that referenced this issue Sep 29, 2020
Box.info.raft returns a table of form:

    {
        state: <string>,
        term: <number>,
        vote: <instance ID>,
        leader: <instance ID>
    }

The fields correspond to the same named Raft concepts one to one.
This info dump is supposed to help with the tests, first of all.
And with investigation of problems in a real cluster.

Part of #1146
Gerold103 added a commit that referenced this issue Sep 29, 2020
Gerold103 added a commit that referenced this issue Sep 29, 2020
The new options are:

- election_is_enabled - enable/disable leader election (via
  Raft). When disabled, the node is supposed to work like if Raft
  does not exist. Like earlier;

- election_is_candidate - a flag whether the instance can try to
  become a leader. Note, it can vote for other nodes regardless
  of value of this option;

- election_timeout - how long need to wait until election end, in
  seconds.

The options don't do anything now. They are added separately in
order to keep such mundane changes from the main Raft commit, to
simplify its review.

Option names don't mention 'Raft' on purpose, because
- Not all users know what is Raft, so they may not even know it
  is related to leader election;
- In future the algorithm may change from Raft to something else,
  so better not to depend on it too much in the public API.

Part of #1146
Gerold103 pushed a commit that referenced this issue Sep 29, 2020
The patch introduces a new type of system message used to notify the
followers of the instance's raft status updates.
It's relay's responsibility to deliver the new system rows to its peers.
The notification system reuses and extends the same row type used to
persist raft state in WAL and snapshot.

Part of #1146
Part of #5204
Gerold103 added a commit that referenced this issue Sep 29, 2020
The commit is a core part of Raft implementation. It introduces
the Raft state machine implementation and its integration into the
instance's life cycle.

The implementation follows the protocol to the letter except a few
important details.

Firstly, the original Raft assumes, that all nodes share the same
log record numbers. In Tarantool they are called LSNs. But in case
of Tarantool each node has its own LSN in its own component of
vclock. That makes the election messages a bit heavier, because
the nodes need to send and compare complete vclocks of each other
instead of a single number like in the original Raft. But logic
becomes simpler. Because in the original Raft there is a problem
of uncertainty about what to do with records of an old leader
right after a new leader is elected. They could be rolled back or
confirmed depending on circumstances. The issue disappears when
vclock is used.

Secondly, leader election works differently during cluster
bootstrap, until number of bootstrapped replicas becomes >=
election quorum. That arises from specifics of replicas bootstrap
and order of systems initialization. In short: during bootstrap a
leader election may use a smaller election quorum than the
configured one. See more details in the code.

Part of #1146
Gerold103 added a commit that referenced this issue Sep 29, 2020
Box.info.election returns a table of form:

    {
        state: <string>,
        term: <number>,
        vote: <instance ID>,
        leader: <instance ID>
    }

The fields correspond to the same named Raft concepts one to one.
This info dump is supposed to help with the tests, first of all.
And with investigation of problems in a real cluster.

The API doesn't mention 'Raft' on purpose, to keep it not
depending specifically on Raft, and not to confuse users who
don't know anything about Raft (even that it is about leader
election and synchronous replication).

Part of #1146
Gerold103 added a commit that referenced this issue Sep 29, 2020
Gerold103 added a commit that referenced this issue Oct 6, 2020
The new option can be one of 3 values: 'off', 'candidate',
'voter'. It replaces 2 old options: election_is_enabled and
election_is_candidate. These flags looked strange, that it was
possible to set candidate true, but disable election at the same
time. Also it would not look good if we would ever decide to
introduce another mode like a data-less sentinel node, for
example. Just for voting.

Anyway, the single option approach looks easier to configure and
to extend.

- 'off' means the election is disabled on the node. It is the same
  as election_is_enabled = false in the old config;

- 'voter' means the node can vote and is never writable. The same
  as election_is_enabled = true + election_is_candidate = false in
  the old config;

- 'candidate' means the node is a full-featured cluster member,
  which eventually may become a leader. The same as
  election_is_enabled = true + election_is_candidate = true in the
  old config.

Part of #1146
Gerold103 added a commit that referenced this issue Oct 7, 2020
The new option can be one of 3 values: 'off', 'candidate',
'voter'. It replaces 2 old options: election_is_enabled and
election_is_candidate. These flags looked strange, that it was
possible to set candidate true, but disable election at the same
time. Also it would not look good if we would ever decide to
introduce another mode like a data-less sentinel node, for
example. Just for voting.

Anyway, the single option approach looks easier to configure and
to extend.

- 'off' means the election is disabled on the node. It is the same
  as election_is_enabled = false in the old config;

- 'voter' means the node can vote and is never writable. The same
  as election_is_enabled = true + election_is_candidate = false in
  the old config;

- 'candidate' means the node is a full-featured cluster member,
  which eventually may become a leader. The same as
  election_is_enabled = true + election_is_candidate = true in the
  old config.

Part of #1146
Gerold103 added a commit that referenced this issue Oct 12, 2020
The new option can be one of 3 values: 'off', 'candidate',
'voter'. It replaces 2 old options: election_is_enabled and
election_is_candidate. These flags looked strange, that it was
possible to set candidate true, but disable election at the same
time. Also it would not look good if we would ever decide to
introduce another mode like a data-less sentinel node, for
example. Just for voting.

Anyway, the single option approach looks easier to configure and
to extend.

- 'off' means the election is disabled on the node. It is the same
  as election_is_enabled = false in the old config;

- 'voter' means the node can vote and is never writable. The same
  as election_is_enabled = true + election_is_candidate = false in
  the old config;

- 'candidate' means the node is a full-featured cluster member,
  which eventually may become a leader. The same as
  election_is_enabled = true + election_is_candidate = true in the
  old config.

Part of #1146
sergepetrenko pushed a commit that referenced this issue Dec 2, 2020
An instance is writable if box.cfg.read_only is false, and it is
not orphan. Update of the final read-only state of the instance
needs to fire read-only update triggers, and notify the engines.
These 2 flags were easy and cheap to check on each operation, and
the triggers were easy to use since both flags are stored and
updated inside box.cc.

We're going to make txn_limbo affect read-only state. This would be
useful, since a write on a node other than limbo owner results in an
error anyway.

It means a check for being read-only would look like this:

    is_ro || is_orphan || (!txn_limbo_is_empty() && limbo->owner_id !=
                                                    instance_id)

This is significantly slower.

The patch introduces a new flag is_ro_summary. The flag
incorporates all the read-only conditions into one flag. When some
subsystem may change read-only state of the instance, it needs to
call box_update_ro_summary(), and the function takes care of
updating the summary flag, running the triggers, and notifying the
engines.

The limbo will use this function each time its owner changes or it gets
emptied out.

Needed for #1146

(cherry picked from commit 31dc4fa)
Gerold103 added a commit that referenced this issue Dec 3, 2020
An instance is writable if box.cfg.read_only is false, and it is
not orphan. Update of the final read-only state of the instance
needs to fire read-only update triggers, and notify the engines.
These 2 flags were easy and cheap to check on each operation, and
the triggers were easy to use since both flags are stored and
updated inside box.cc.

We're going to make txn_limbo affect read-only state. This would be
useful, since a write on a node other than limbo owner results in an
error anyway.

It means a check for being read-only would look like this:

    is_ro || is_orphan || (!txn_limbo_is_empty() && limbo->owner_id !=
                                                    instance_id)

This is significantly slower.

The patch introduces a new flag is_ro_summary. The flag
incorporates all the read-only conditions into one flag. When some
subsystem may change read-only state of the instance, it needs to
call box_update_ro_summary(), and the function takes care of
updating the summary flag, running the triggers, and notifying the
engines.

The limbo will use this function each time its owner changes or it gets
emptied out.

Needed for #1146

(cherry picked from commit 31dc4fa)
@Mons Mons closed this as completed Dec 4, 2020
Gerold103 added a commit that referenced this issue Dec 24, 2020
Struct replicaset didn't store a number of registered replicas.
Only an array, which was necessary to fullscan each time when want
to find the count.

That is going to be needed in Raft to calculate election quorum.
The patch makes the count tracked so as it could be found for
constant time by simply reading an integer.

Needed for #1146

(cherry picked from commit 764a548)
@kyukhin kyukhin removed this from the wishlist milestone Feb 12, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
app complicated feature A new functionality
Projects
None yet
Development

No branches or pull requests

5 participants