-
Notifications
You must be signed in to change notification settings - Fork 1
Recipe: Basic Leader Election
Invoke and observe a successful leader election with the minimum number of peers per the specified configuration.
- Raft configuration
- Cluster size to determine the minimum number of servers needed
num-raft-servers / 2 + 1
First, bring Holon up through “Term Catch Up”. Once there, start up an additional server from the configuration bringing the total number of running servers to (N / 2 + 1)
. Perform the necessary observations and checks which are described in the remainder of this document.
While the number of running servers is <= N / 2
, no leader can be elected. Starting from the state specified in Term Catch Up, the addition of one more raft peer will allow for an election to complete. For an election to complete, the newly started server must itself undergo “Term Catch Up”. This should happen as the other nodes cycle through the candidate state and issue vote requests to the newly started server. It is possible that the newly started server will become the leader if the first election following its startup ends in a draw. A draw will occur if server-0 and server-1 have the same term value and have voted for themselves and the new server, server-2, votes for either of them. In the case of a 4 or 5 node cluster, with 3 running peers, all 3 peers (N / 2 + 1) must vote for the same candidate. When a draw occurs after server-2 has joined, it’s possible that server-2 can win the election - with one caveat - that the server-2 has a ‘newest-entry-index’ which is >= to the other peers. This condition will be true for a newly initialized cluster or in the case where servers-[0-2] were fully synchronized at the time of cluster shutdown.
{
"raft_root_entry" : [
{
"raft-uuid" : "23568354-914f-11ea-960c-90324b2d1e89",
"peer-uuid" : "2357668e-914f-11ea-94a1-90324b2d1e89",
"voted-for-uuid" : "2357668e-914f-11ea-94a1-90324b2d1e89",
"leader-uuid" : "",
"state" : "candidate",
"follower-reason" : "none",
"client-requests" : "deny-leader-not-established",
"term" : 155,
"commit-idx" : -1,
"last-applied" : -1,
"last-applied-cumulative-crc" : 0,
"newest-entry-idx" : -1,
"newest-entry-term" : 0,
"newest-entry-data-size" : 0,
"newest-entry-crc" : 0,
"dev-read-latency-usec" : {
"1" : 1,
"4" : 1
},
"dev-write-latency-usec" : {
"1024" : 4,
"2048" : 2
}
}
],
"system_info" : {
"current_time" : "Fri May 08 17:18:35 UTC 2020"
}
}
{
"raft_root_entry" : [
{
"raft-uuid" : "23568354-914f-11ea-960c-90324b2d1e89",
"peer-uuid" : "23573204-914f-11ea-a8bb-90324b2d1e89",
"voted-for-uuid" : "2357668e-914f-11ea-94a1-90324b2d1e89",
"leader-uuid" : "",
"state" : "follower",
"follower-reason" : "voted-for-peer",
"client-requests" : "redirect-to-leader",
"term" : 155,
"commit-idx" : -1,
"last-applied" : -1,
"last-applied-cumulative-crc" : 0,
"newest-entry-idx" : -1,
"newest-entry-term" : 0,
"newest-entry-data-size" : 0,
"newest-entry-crc" : 0,
"dev-read-latency-usec" : {
"1" : 1,
"4" : 1
},
"dev-write-latency-usec" : {
"1024" : 140,
"2048" : 16,
"4096" : 1
}
}
],
"system_info" : {
"current_time" : "Fri May 08 17:18:35 UTC 2020"
}
}
Above is the JSON output of 2 servers running on a fresh 5 peer raft configuration. Per the bolded items, one can see that no log / commit activity has occurred. However, we do see that writes have been done but these are explained by Term Ticker.
Below is the JSON following the start of the 3rd server. Here we can see that peer 235798b6-914f-11ea-8f26-90324b2d1e89 has joined and voted for a previously running peer (which has become the leader).
{
"raft_root_entry" : [
{
"raft-uuid" : "23568354-914f-11ea-960c-90324b2d1e89",
"peer-uuid" : "235798b6-914f-11ea-8f26-90324b2d1e89",
"voted-for-uuid" : "23573204-914f-11ea-a8bb-90324b2d1e89",
"leader-uuid" : "23573204-914f-11ea-a8bb-90324b2d1e89",
"state" : "follower",
"follower-reason" : "voted-for-peer",
"client-requests" : "redirect-to-leader",
"term" : 245,
"commit-idx" : 0,
"last-applied" : 0,
"last-applied-cumulative-crc" : 2740276030,
"newest-entry-idx" : 0,
"newest-entry-term" : 245,
"newest-entry-data-size" : 0,
"newest-entry-crc" : 2740276030,
"dev-read-latency-usec" : {
"1" : 1,
"16" : 1
},
"dev-write-latency-usec" : {
"1024" : 4
}
}
],
"system_info" : {
"current_time" : "Fri May 08 17:20:24 UTC 2020"
}
}
{
"raft_root_entry" : [
{
"raft-uuid" : "23568354-914f-11ea-960c-90324b2d1e89",
"peer-uuid" : "2357668e-914f-11ea-94a1-90324b2d1e89",
"voted-for-uuid" : "23573204-914f-11ea-a8bb-90324b2d1e89",
"leader-uuid" : "23573204-914f-11ea-a8bb-90324b2d1e89",
"state" : "follower",
"follower-reason" : "voted-for-peer",
"client-requests" : "redirect-to-leader",
"term" : 245,
"commit-idx" : 0,
"last-applied" : 0,
"last-applied-cumulative-crc" : 2740276030,
"newest-entry-idx" : 0,
"newest-entry-term" : 245,
"newest-entry-data-size" : 0,
"newest-entry-crc" : 2740276030,
"dev-read-latency-usec" : {
"1" : 1,
"4" : 1
},
"dev-write-latency-usec" : {
"1024" : 81,
"2048" : 16
}
}
],
"system_info" : {
"current_time" : "Fri May 08 17:20:24 UTC 2020"
}
}
{
"raft_root_entry" : [
{
"raft-uuid" : "23568354-914f-11ea-960c-90324b2d1e89",
"peer-uuid" : "23573204-914f-11ea-a8bb-90324b2d1e89",
"voted-for-uuid" : "23573204-914f-11ea-a8bb-90324b2d1e89",
"leader-uuid" : "23573204-914f-11ea-a8bb-90324b2d1e89",
"state" : "leader",
"follower-reason" : "none",
"client-requests" : "accept",
"term" : 245,
"commit-idx" : 0,
"last-applied" : 0,
"last-applied-cumulative-crc" : 2740276030,
"newest-entry-idx" : 0,
"newest-entry-term" : 245,
"newest-entry-data-size" : 0,
"newest-entry-crc" : 2740276030,
"dev-read-latency-usec" : {
"1" : 1,
"4" : 1
},
"dev-write-latency-usec" : {
"1024" : 220,
"2048" : 27,
"4096" : 1
},
"follower-stats" : [
{
"peer-uuid" : "2357f5ae-914f-11ea-9aac-90324b2d1e89",
"last-ack" : "Thu Jan 01 00:00:00 UTC 1970",
"next-idx" : 0,
"prev-idx-term" : 0
},
{
"peer-uuid" : "2357cdae-914f-11ea-b715-90324b2d1e89",
"last-ack" : "Thu Jan 01 00:00:00 UTC 1970",
"next-idx" : 0,
"prev-idx-term" : 0
},
{
"peer-uuid" : "235798b6-914f-11ea-8f26-90324b2d1e89",
"last-ack" : "Fri May 08 17:20:24 UTC 2020",
"next-idx" : 1,
"prev-idx-term" : 245
},
{
"peer-uuid" : "2357668e-914f-11ea-94a1-90324b2d1e89",
"last-ack" : "Fri May 08 17:20:24 UTC 2020",
"next-idx" : 1,
"prev-idx-term" : 245
}
],
"commit-latency-msec" : {},
"read-latency-msec" : {}
}
],
"system_info" : {
"current_time" : "Fri May 08 17:20:24 UTC 2020"
}
}
Leader election is a fundamental aspect of Raft and here we can see a lot of different activity in the above JSON snippet. Let’s go through the various items.
As we know that the number of running servers is 3 out of 5, or 3 == N / 2 + 1 where N == 5, then it is inferred that all running servers have voted for the same peer. Therefore, voted-for-uuid and leader-uuid should be equivalent for each peer!
Note: Holon should be tracking which peer UUIDs are leaders and followers.
Two of the three servers must report the following state following the election:
"state" : "follower",
"follower-reason" : "voted-for-peer",
"client-requests" : "redirect-to-leader",
The leader should report:
"state" : "leader",
"follower-reason" : "none",
“client-requests" : "accept",
Note that the leader may report “deny-may-be-deposed” or “"deny-determining-commit-index"” in some cases. This is probably OK for now as long as the condition is transient.
The two recipes, Term Catch Up and Term Ticker, focused on the term and how it “ticked” or was “caught up” via the logical clock property. Now that the leader has been elected, the term will stop ticking. “Basic leader election” must verify that this is the case. In fact, until the current leader is either stopped or paused, the term of these running processes should not advance.
The “commit-idx” key is now ‘0’. The pre-election value was ‘-1’, meaning that the system did not yet know the value. Since the raft instance shown here is fresh, the next value after election will be 0. However, a non-fresh raft instance would still show ‘-1’ pre-election but a value > 0 post-election. The commit index in raft never decreases. Following subsequent restarts of raft, the commit index will continue to advance one for every successful election or client write operation. To this point, a newly elected raft leader will commit a dummy transaction (aka “leader change marker” or lcm) before it can accept reads or writes from the client. In some cases this procedure may take some time, during this period the leader client-request state will be "deny-determining-commit-index" - which means that the leader has not yet committed its first transaction. For this recipe, we do not expect the system to reach the "deny-determining-commit-index" condition since there are no prior commits to replay.
As a general rule: the commit index in raft never decreases.
The “last-applied” key relates to the commit index. While the commit index tracks the transactions which have been safely written to the cluster, “last-applied” is a local variable used to track which transactions have been added to the state machine. In the case of a leader election, the transaction is very simple so there should not be a huge or even noticeable latency between the advance of the commit-idx from -1 -> 0 and the advance last-applied from -1 -> 0.
As a general rule: “last-applied” must be <= “commit-idx”.
This key is used to ensure that peers with the same “last-applied” value have done the exact same work to their state machine. In the case of this recipe, once the election has completed, each peer’s last-applied-cumulative-crc value should be equivalent.
As a general rule: Whenever the cluster has reached a stasis amongst a quorum of peers the last-applied-cumulative-crc must be equal across that set of peers.
The “newest entry” refers to written raft indices which may or may not have been committed. Each raft entry is written with a header which contains the term in which the entry was written and the size of that entry. The NIOVA raft implementation also stores a CRC for the entry.
"newest-entry-idx" : 0, "newest-entry-term" : 245, "newest-entry-data-size" : 0, “newest-entry-crc" : 2740276030,
For this recipe, “Basic Leader Election”, only a single entry is written - the entry written by the elected leader which marks the beginning of its term as leader. This entry is written during the most recent term and the term value of the entry must equal that of the alive peers. The data size is ‘0’ since this raft commit requires no application storage.
The newest-entry-crc is the checksum for this entry only (it is not a cumulative sum like the aforementioned last-applied-cumulative-crc). In the case where the commit-idx and newest-entry-idx are 0, then the newest-entry-crc will be equal to last-applied-cumulative-crc but this is probably the only time.
The leader maintains some stats for its followers:
"follower-stats" : [
{
"peer-uuid" : "2357f5ae-914f-11ea-9aac-90324b2d1e89",
"last-ack" : "Thu Jan 01 00:00:00 UTC 1970",
"next-idx" : 0,
"prev-idx-term" : 0
},
{
"peer-uuid" : "2357cdae-914f-11ea-b715-90324b2d1e89",
"last-ack" : "Thu Jan 01 00:00:00 UTC 1970",
"next-idx" : 0,
"prev-idx-term" : 0
},
{
"peer-uuid" : "235798b6-914f-11ea-8f26-90324b2d1e89",
"last-ack" : "Fri May 08 19:07:21 UTC 2020",
"next-idx" : 1,
"prev-idx-term" : 245
},
{
"peer-uuid" : "2357668e-914f-11ea-94a1-90324b2d1e89",
"last-ack" : "Fri May 08 19:07:21 UTC 2020",
"next-idx" : 1,
"prev-idx-term" : 245
}
],
First thing to notice is that the JSON array has 4 entries (N - 1), this is because the leader does not track itself in this array.
Second thing is that two out of the four followers have a “last-ack” time of "Thu Jan 01 00:00:00 UTC 1970". This means that these follower peers have never communicated with the leader since the leader has started up. In this case, the reason is that these two peers have yet to be started. Another possible reason (which will be covered in a future recipe) is where the peers are running but they exist in a separated network partition and cannot communicate with the leader.
As for the remaining, alive followers. We can see that the leader is tracking their “next-idx” and the “prev-idx-term”. For this recipe, Holon must observe that the leader’s follower stat for the given follower matches the respective value in the follower’s own stats:
leader.follower_stat[UUID]next-idx == follower@UUID.newest-entry-idx + 1
leader.follower_stat[UUID]prev-idx-term == follower@UUID.term
Follower-stats and Holon Sanity Checking Holon must ensure that the info in the follower-stats matches its own tracking information for the raft cluster. For example, Holon must know which UUIDs have been started and which have not and check this against the follower-stats data. If there’s an inconsistency then Holon should terminate the recipe and exit with an error. Dev Write Latency Histogram of the Last Server to Startup (LSS) The server process which was last to be started should have a total number of writes <= the peers which had already been running. In most cases, the LSS will have a total number of writes equaling ‘4’. The reason for this predictable value relies on the fact that the newest-entry-idx values for all servers are the same. Therefore, future recipes which have more complex scenarios will not always generate this number of writes on a follower.
"dev-write-latency-usec" : { "1024" : 4 }
Where do these 4 writes come from? Writes 1 and 2 come from superblock initialization These writes are only present if the raft log is ‘fresh’ Write 3 occurs when the peer replies ‘yes’ to a vote request from a candidate Raft peers must record whom they voted for and the term in which the vote occurred Write 4 is done on behalf of the leader writing his first ‘append-entry’ command to the cluster. ###What if the Writes Total Greater than 4? As mentioned above, this predictable value relies on the fact that the newest-entry-idx values for all servers are the same. If these values are the same then a value > 4 is still possible if both the other peers are candidates at the time when the LSS peer starts up. In this case, the LSS will still do the first 3 writes described above, however, no leader will be elected and each candidate will time out. Following the election time outs, each of the running peers, including the LSS, may now become a candidate - increasing the term and sending vote-requests to the other peers. Note that this cycle of unwinnable elections may occur many times over, especially with the minimum number of peers (N / 2 + 1) running.