Write leader instance ID in epoch record and pass on epoch's leader's ID and gossip info to leader of elections #2454

shaan1337 · 2020-04-30T13:14:14Z

Changed: Write leader's instance ID in epoch record. Pass on the epoch record's leader's instance id and each node's gossip information during elections to the leader of elections to determine more accurately if the previous leader is still alive when choosing the best leader candidate.

Fixes: #2213

Problem statement

Current algorithm

The current election algorithm works like this when choosing the best candidate:

Each node keeps in memory the last elected leader for the elections they've participated in
When a node becomes the leader of the elections, it chooses the best leader candidate as follows:
- If the previously elected leader it has stored in memory appears to be alive (according to the node's gossip info), immediately propose it as the leader
- Otherwise, choose the best candidate (by selecting the node that has the latest data) among the quorum number of nodes that have sent a PrepareOk message.

There are a few problems with this approach:

Problem 1

It is not necessary that the in-memory previously elected leader is accurate. For instance, consider this scenario in a 3 node cluster:

Nodes 1,2,3 elect node 1 as the leader
A network partition occurs which isolates node 3. Node 3's in memory previously elected leader is node 1.
Node 2 becomes the leader for some reason. (e.g node 1 has resigned, node 2 is restarted, etc.)
The partition is re-established and node 3 triggers elections.
Node 3 becomes the leader of the elections. It still thinks that node 1 is the leader and will propose it immediately although node 2 is the new leader.

Problem 2

The leader of the elections also uses only it's own gossip information to determine if the previous leader is alive or not. It may be possible that the gossip information is outdated and the node thinks that the leader is dead but it is not.

Problem 3

Restarting a node loses the in memory previously elected leader. If this node becomes leader of the elections, it may not propose the previous leader.

Proposed solution

New algorithm

The new algorithm works as follows:

Instead of keeping the last elected leader's ID in memory, we store the leader's ID in the epoch record that the leader writes as soon as it is elected. The epoch record is part of the transaction log and is replicated to each node. Each node also passes this information to the leader of the Elections in the PrepareOk message.
Instead of only using the leader of the elections's gossip information, each node sends its latest gossip information in the PrepareOk message.
The leader of the elections chooses the best candidate leader as follows:
- Start by finding the candidate that has the latest data among the quorum number of nodes that are participating in the elections. Let's call this candidate BestCandidate
- Obtain the leader's instance ID (PreviousLeader) from the epoch record in the PrepareOk message sent by BestCandidate.
- Next try to determine if PreviousLeader is alive by either seeing if we have directly received a PrepareOk message from it or if at least one node among the quorum thinks it is alive (using the gossip info sent in the PrepareOk).
- If it appears to be alive and is not resigning, then we propose PreviousLeader as best candidate
- Otherwise we propose BestCandidate as the best candidate.

Guarantees

Using the above algorithm, we guarantee that we are electing a leader that has the latest data while with the previous algorithm the in-memory value may be stale. Since data needs to be replicated to a quorum number of nodes to be considered committed, we know that if the epoch has been replicated, the chosen BestCandidate will have it independently of which subset quorum of nodes participate in the elections.
See this comment for more details: https://github.com/EventStore/EventStore/pull/2454/files#diff-65f89e62d3f64576ec45b0860d37ffc8R437
There are two possibilities when the previous leader is alive but we do not elect it:
- The previous leader writes an epoch record but it does not have enough time to be replicated to a quorum number of nodes and new elections occur. This is an uncommitted write anyway. In this case, the previous leader will need to truncate when it joins the new leader. We still guarantee that the BestCandidate node has the latest replicated data.
- All the quorum number of nodes participating in the elections think that the previous leader is dead while it's still actually alive. In this scenario, we will not be able to obtain a quorum number of Accepts anyway and the elections would fail, so by proposing the BestCandidate instead, we're still better than nothing.
- On the other hand if at least one node says that the previous leader is alive, we may still have a quorum number of Accepts if we choose this node and the other floor(N/2) nodes that have not sent their PrepareOks for the elections. So we still give the previous leader a chance to be elected.

Stress tests

Stress tests have been carried out with https://github.com/EventStore/XstreamTester with random cluster sizes between 3 and 7 nodes. On this PR, out of around 5640 elections, 7 resulted in truncation which is around 0.1%. The reason for truncation is due to one of the two reasons stated above. There have also been no truncations of committed records.

Tests have also been carried out on V6 master (ca67b9a) and in this case, the number of truncations is higher: 14/4996 elections ~= (0.3%). This has probably improved a bit compared to V5 due to #2386. There was also a case where epoch with committed records was truncated.

Tests on 5.0.8 also show a higher number of truncations and a higher number of truncation of epoch with committed records.

Branch	Number of unique elections tested	No. of elections resulting in Truncations	No. of elections resulting in Truncations of epoch with committed records
This PR	5640	7 (0.1%)	0
v6-master (`ca67b9a`)	4996	14 (0.3%)	1
v 5.0.8 (`6b871e5`)	5290	17 (0.3%)	4

jageall

Is it necessary to update the UI submodule in this PR?

It was a rebasing mistake. Fixed, thanks!

shaan1337 · 2020-05-20T13:19:15Z

Tests currently failing due to #2505

…andidate node and it's still alive, propose it as the best candidate for Leader.

…e PrepareOk message GetBestLeaderCandidate(): Use the cluster gossip information to determine if the previous leader may still be alive GetBestLeaderCandidate(): Clean up code and add debug log messages

Remove tests that are no longer relevant since we no longer rely on the value of the in-memory last elected master to determine the new leader

…rCandidate() Add TestElectedLeaderId property to ElectionsService for test visibility

…y - only change leader if the current leader is resigning

src/EventStore.Core/Services/ElectionsService.cs

…sts accordingly

Requested changes have been made

shaan1337 force-pushed the write-leader-in-epoch-record branch 4 times, most recently from 3d1b8e7 to 1e49bd7 Compare May 5, 2020 11:09

jageall requested a review from condron May 14, 2020 13:26

shaan1337 force-pushed the write-leader-in-epoch-record branch from 54fdcd7 to 317d693 Compare May 19, 2020 17:18

shaan1337 self-assigned this May 19, 2020

shaan1337 changed the title ~~Write leader instance ID in epoch record~~ Write leader instance ID in epoch record and pass on epoch's leader's ID and gossip info to leader of elections May 20, 2020

shaan1337 marked this pull request as ready for review May 20, 2020 08:59

shaan1337 requested a review from jageall May 20, 2020 09:12

jageall previously requested changes May 20, 2020

View reviewed changes

shaan1337 force-pushed the write-leader-in-epoch-record branch from 317d693 to 71a4860 Compare May 20, 2020 11:02

shaan1337 requested a review from jageall May 20, 2020 11:03

shaan1337 added 8 commits May 21, 2020 13:15

Write LeaderInstanceId in EpochRecord

fc3c513

Pass EpochLeaderInstanceId to PrepareOk/Proposal Election messages

bb70973

Pass epochLeaderId to tests in PrepareOk/Proposal Election messages

64261c9

If there's an EpochLeaderInstanceId in the epoch record of the best c…

709ffc4

…andidate node and it's still alive, propose it as the best candidate for Leader.

Update cluster.proto to pass cluster information from each node in th…

cdf3aa5

…e PrepareOk message GetBestLeaderCandidate(): Use the cluster gossip information to determine if the previous leader may still be alive GetBestLeaderCandidate(): Clean up code and add debug log messages

Pass clusterInfo to PrepareOk Election messages in tests

2b3e39a

Fix tests that are failing

a6bf9c0

Remove tests that are no longer relevant since we no longer rely on the value of the in-memory last elected master to determine the new leader

Add new election tests covering most of the scenarios in GetBestLeade…

2c3dc58

…rCandidate() Add TestElectedLeaderId property to ElectionsService for test visibility

shaan1337 force-pushed the write-leader-in-epoch-record branch from 71a4860 to 2c3dc58 Compare May 21, 2020 09:16

jageall previously approved these changes May 21, 2020

View reviewed changes

jageall requested a review from hayley-jean May 22, 2020 09:13

Do not change leader even if there's another node with higher priorit…

d8fb6ee

…y - only change leader if the current leader is resigning

shaan1337 dismissed jageall’s stale review via d8fb6ee May 26, 2020 08:02

shaan1337 requested a review from jageall May 26, 2020 08:03

hayley-jean previously requested changes May 26, 2020

View reviewed changes

src/EventStore.Core/Services/ElectionsService.cs Outdated Show resolved Hide resolved

src/EventStore.Core/Services/ElectionsService.cs Outdated Show resolved Hide resolved

Remove TestElectedLeaderId property by LeaderProposalId and rename te…

cd3760c

…sts accordingly

jageall approved these changes May 26, 2020

View reviewed changes

shaan1337 requested a review from hayley-jean May 26, 2020 08:54

hayley-jean approved these changes May 28, 2020

View reviewed changes

hayley-jean merged commit a4ee0ea into master May 28, 2020

hayley-jean deleted the write-leader-in-epoch-record branch May 28, 2020 13:45

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Write leader instance ID in epoch record and pass on epoch's leader's ID and gossip info to leader of elections #2454

Write leader instance ID in epoch record and pass on epoch's leader's ID and gossip info to leader of elections #2454

shaan1337 commented Apr 30, 2020 •

edited

jageall left a comment

shaan1337 commented May 20, 2020

Write leader instance ID in epoch record and pass on epoch's leader's ID and gossip info to leader of elections #2454

Write leader instance ID in epoch record and pass on epoch's leader's ID and gossip info to leader of elections #2454

Conversation

shaan1337 commented Apr 30, 2020 • edited

Problem statement

Current algorithm

Problem 1

Problem 2

Problem 3

Proposed solution

New algorithm

Guarantees

Stress tests

jageall left a comment

Choose a reason for hiding this comment

shaan1337 commented May 20, 2020

shaan1337 commented Apr 30, 2020 •

edited