raft: introduce state machine
This commit is a core part of the Raft implementation. It
introduces the Raft state machine and integrates it into the
instance's life cycle.

The implementation follows the protocol to the letter except for a
few important details.

Firstly, the original Raft assumes that all nodes share the same
log record numbers. In Tarantool they are called LSNs, but each
node has its own LSN in its own component of the vclock. That makes
the election messages a bit heavier, because the nodes need to send
and compare each other's complete vclocks instead of a single
number as in the original Raft. The logic becomes simpler, though.
The original Raft is uncertain about what to do with the records of
an old leader right after a new leader is elected: they could be
rolled back or confirmed depending on circumstances. That issue
disappears when vclock is used.
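
As an illustration of what comparing complete vclocks means, here is a minimal sketch in C. It assumes a fixed-size clock with one LSN per replica ID; the type `sketch_vclock` and the function `sketch_vclock_is_up_to_date` are hypothetical names made up for this example and are not Tarantool's vclock API. The idea shown is that a voter would accept a candidate only when every component of the candidate's clock is at least as large as the voter's own.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical fixed-size vclock: one LSN per replica ID. */
enum { SKETCH_VCLOCK_MAX = 4 };

struct sketch_vclock {
	int64_t lsn[SKETCH_VCLOCK_MAX];
};

/*
 * Return true when 'candidate' has seen at least everything 'local'
 * has seen, i.e. every component of the candidate's clock is >= the
 * matching local component. Incomparable clocks (each one ahead in
 * some component) yield false.
 */
static bool
sketch_vclock_is_up_to_date(const struct sketch_vclock *candidate,
			    const struct sketch_vclock *local)
{
	assert(candidate != NULL && local != NULL);
	for (int i = 0; i < SKETCH_VCLOCK_MAX; i++) {
		if (candidate->lsn[i] < local->lsn[i])
			return false;
	}
	return true;
}
```

With a single log index the comparison is one integer test; with vclocks it is a component-wise test, which is why the election messages have to carry the whole clock.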

Secondly, leader election works differently during cluster
bootstrap, until the number of bootstrapped replicas becomes >= the
election quorum. That arises from the specifics of replica
bootstrap and the order in which systems are initialized. In short:
during bootstrap a leader election may use a smaller election
quorum than the configured one. See the code for more details.
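
The message above does not spell out the exact rule, so the following C sketch shows only one plausible reading: cap the configured quorum by the number of replicas bootstrapped so far. The function name `sketch_election_quorum` and both parameters are assumptions made for this example, not identifiers taken from the commit.

```c
#include <assert.h>

/*
 * Hypothetical helper: the quorum actually used for an election.
 * While fewer replicas are bootstrapped than the configured quorum
 * requires, the configured value cannot be reached, so this sketch
 * caps it by the number of bootstrapped replicas.
 */
static int
sketch_election_quorum(int configured_quorum, int bootstrapped_count)
{
	assert(configured_quorum > 0 && bootstrapped_count > 0);
	return configured_quorum < bootstrapped_count ?
	       configured_quorum : bootstrapped_count;
}
```

Under this assumption a single freshly bootstrapped node with a configured quorum of 3 would elect itself with a quorum of 1, and the configured value would take over once at least 3 replicas are bootstrapped.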

Part of #1146
Gerold103 committed Sep 29, 2020
1 parent 7b118a3 commit 91f2f4b
Showing 5 changed files with 1,062 additions and 36 deletions.
23 changes: 21 additions & 2 deletions src/box/applier.cc
@@ -883,6 +883,11 @@ static int
 applier_handle_raft(struct applier *applier, struct xrow_header *row)
 {
 	assert(iproto_type_is_raft_request(row->type));
+	if (applier->instance_id == 0) {
+		diag_set(ClientError, ER_PROTOCOL, "Can't apply a Raft request "
+			 "from an instance without an ID");
+		return -1;
+	}
 
 	struct raft_request req;
 	struct vclock candidate_clock;
@@ -897,8 +902,21 @@ applier_handle_raft(struct applier *applier, struct xrow_header *row)
  * Return 0 for success or -1 in case of an error.
  */
 static int
-applier_apply_tx(struct stailq *rows)
+applier_apply_tx(struct applier *applier, struct stailq *rows)
 {
+	/*
+	 * Rows received not directly from a leader are ignored. That is a
+	 * protection against the case when an old leader keeps sending data
+	 * around not knowing yet that it is not a leader anymore.
+	 *
+	 * XXX: it may be that this can be fine to apply leader transactions by
+	 * looking at their replica_id field if it is equal to leader id. That
+	 * can be investigated as an 'optimization'. Even though may not give
+	 * anything, because won't change total number of rows sent in the
+	 * network anyway.
+	 */
+	if (!raft_is_source_allowed(applier->instance_id))
+		return 0;
 	struct xrow_header *first_row = &stailq_first_entry(rows,
 					struct applier_tx_row, next)->row;
 	struct xrow_header *last_row;
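
The hunk above calls `raft_is_source_allowed()` without showing its body. As a rough sketch of the kind of check the comment describes (drop rows that do not come directly from the current leader), the C code below uses a hypothetical `sketch_raft` structure and a hypothetical `sketch_is_source_allowed()` function. The field names and the rule that a disabled Raft allows every source are assumptions for this example, not the commit's implementation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, reduced view of the Raft state. */
struct sketch_raft {
	/* Assumption: when Raft is off, replication is not filtered. */
	bool is_enabled;
	/* Instance ID of the known leader, 0 when there is none. */
	uint32_t leader;
};

/*
 * Return true when rows coming from 'source_id' may be applied:
 * either Raft is disabled, or the source is the known leader.
 */
static bool
sketch_is_source_allowed(const struct sketch_raft *raft, uint32_t source_id)
{
	assert(raft != NULL);
	if (!raft->is_enabled)
		return true;
	return raft->leader != 0 && raft->leader == source_id;
}
```

A deposed leader that keeps streaming rows would fail this check on every follower that already knows the new leader, so its rows would be skipped rather than applied.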
@@ -1238,6 +1256,7 @@ applier_subscribe(struct applier *applier)
 			struct xrow_header *first_row =
 				&stailq_first_entry(&rows, struct applier_tx_row,
 						    next)->row;
+			raft_process_heartbeat(applier->instance_id);
 			if (first_row->lsn == 0) {
 				if (unlikely(iproto_type_is_raft_request(
 							first_row->type))) {
@@ -1246,7 +1265,7 @@ applier_subscribe(struct applier *applier)
 						diag_raise();
 				}
 				applier_signal_ack(applier);
-			} else if (applier_apply_tx(&rows) != 0) {
+			} else if (applier_apply_tx(applier, &rows) != 0) {
 				diag_raise();
 			}

19 changes: 17 additions & 2 deletions src/box/box.cc
@@ -157,7 +157,7 @@ void
 box_update_ro_summary(void)
 {
 	bool old_is_ro_summary = is_ro_summary;
-	is_ro_summary = is_ro || is_orphan;
+	is_ro_summary = is_ro || is_orphan || raft_is_ro();
 	/* In 99% nothing changes. Filter this out first. */
 	if (is_ro_summary == old_is_ro_summary)
 		return;
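
To make the effect of the extra `raft_is_ro()` term concrete, here is a small C sketch of how the three read-only sources could combine. The structure `sketch_node` and both functions are hypothetical, and the rule that a node is Raft-read-only whenever Raft is enabled and the node is not the leader is an assumption of this example; the commit page does not show the body of `raft_is_ro()`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical flags describing one instance. */
struct sketch_node {
	bool is_ro;          /* read-only by configuration */
	bool is_orphan;      /* not yet synced with the replica set */
	bool raft_enabled;   /* assumption: Raft can be switched off */
	bool is_raft_leader; /* this node currently leads the term */
};

/* Assumed rule: with Raft on, only the leader may write. */
static bool
sketch_raft_is_ro(const struct sketch_node *node)
{
	assert(node != NULL);
	return node->raft_enabled && !node->is_raft_leader;
}

/* The summary is read-only if any single source says so. */
static bool
sketch_is_ro_summary(const struct sketch_node *node)
{
	return node->is_ro || node->is_orphan || sketch_raft_is_ro(node);
}
```

Under these assumptions a follower stays read-only even when its configuration allows writes, and a leader can still be read-only through configuration or orphan status.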
@@ -171,6 +171,10 @@ static int
 box_check_writable(void)
 {
 	if (is_ro_summary) {
+		/*
+		 * XXX: return a special error when the node is not a leader to
+		 * reroute to the leader node.
+		 */
 		diag_set(ClientError, ER_READONLY);
 		diag_log();
 		return -1;
@@ -2648,6 +2652,7 @@ box_init(void)
 
 	txn_limbo_init();
 	sequence_init();
+	raft_init();
 }
 
 bool
@@ -2795,8 +2800,18 @@ box_cfg_xc(void)
 	title("running");
 	say_info("ready to accept requests");
 
-	if (!is_bootstrap_leader)
+	if (!is_bootstrap_leader) {
 		replicaset_sync();
+	} else {
+		/*
+		 * When the cluster is just bootstrapped and this instance is a
+		 * leader, it makes no sense to wait for a leader appearance.
+		 * There is no one. Moreover this node *is* a leader, so it
+		 * should take the control over the situation and start a new
+		 * term immediately.
+		 */
+		raft_new_term();
+	}
 
 	/* box.cfg.read_only is not read yet. */
 	assert(box_is_ro());