Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Write leader instance ID in epoch record and pass on epoch's leader's ID and gossip info to leader of elections #2454

Merged
merged 10 commits into from
May 28, 2020

Conversation

shaan1337
Copy link
Member

@shaan1337 shaan1337 commented Apr 30, 2020

Changed: Write leader's instance ID in epoch record. Pass on the epoch record's leader's instance id and each node's gossip information during elections to the leader of elections to determine more accurately if the previous leader is still alive when choosing the best leader candidate.

Fixes: #2213

Problem statement

Current algorithm

The current election algorithm works like this when choosing the best candidate:

  • Each node keeps in memory the last elected leader for the elections they've participated in
  • When a node becomes the leader of the elections, it chooses the best leader candidate as follows:
    • If the previously elected leader it has stored in memory appears to be alive (according to the node's gossip info), immediately propose it as the leader
    • Otherwise, choose the best candidate (by selecting the node that has the latest data) among the quorum number of nodes that have sent a PrepareOk message.

There are a few problems with this approach:

Problem 1

It is not necessary that the in-memory previously elected leader is accurate. For instance, consider this scenario in a 3 node cluster:

  • Nodes 1,2,3 elect node 1 as the leader
  • A network partition occurs which isolates node 3. Node 3's in memory previously elected leader is node 1.
  • Node 2 becomes the leader for some reason. (e.g node 1 has resigned, node 2 is restarted, etc.)
  • The partition is re-established and node 3 triggers elections.
  • Node 3 becomes the leader of the elections. It still thinks that node 1 is the leader and will propose it immediately although node 2 is the new leader.

Problem 2

The leader of the elections also uses only it's own gossip information to determine if the previous leader is alive or not. It may be possible that the gossip information is outdated and the node thinks that the leader is dead but it is not.

Problem 3

Restarting a node loses the in memory previously elected leader. If this node becomes leader of the elections, it may not propose the previous leader.

Proposed solution

New algorithm

The new algorithm works as follows:

  • Instead of keeping the last elected leader's ID in memory, we store the leader's ID in the epoch record that the leader writes as soon as it is elected. The epoch record is part of the transaction log and is replicated to each node. Each node also passes this information to the leader of the Elections in the PrepareOk message.
  • Instead of only using the leader of the elections's gossip information, each node sends its latest gossip information in the PrepareOk message.
  • The leader of the elections chooses the best candidate leader as follows:
    • Start by finding the candidate that has the latest data among the quorum number of nodes that are participating in the elections. Let's call this candidate BestCandidate
    • Obtain the leader's instance ID (PreviousLeader) from the epoch record in the PrepareOk message sent by BestCandidate.
    • Next try to determine if PreviousLeader is alive by either seeing if we have directly received a PrepareOk message from it or if at least one node among the quorum thinks it is alive (using the gossip info sent in the PrepareOk).
    • If it appears to be alive and is not resigning, then we propose PreviousLeader as best candidate
    • Otherwise we propose BestCandidate as the best candidate.

Guarantees

  • Using the above algorithm, we guarantee that we are electing a leader that has the latest data while with the previous algorithm the in-memory value may be stale. Since data needs to be replicated to a quorum number of nodes to be considered committed, we know that if the epoch has been replicated, the chosen BestCandidate will have it independently of which subset quorum of nodes participate in the elections.
    See this comment for more details: https://github.com/EventStore/EventStore/pull/2454/files#diff-65f89e62d3f64576ec45b0860d37ffc8R437
  • There are two possibilities when the previous leader is alive but we do not elect it:
    • The previous leader writes an epoch record but it does not have enough time to be replicated to a quorum number of nodes and new elections occur. This is an uncommitted write anyway. In this case, the previous leader will need to truncate when it joins the new leader. We still guarantee that the BestCandidate node has the latest replicated data.
    • All the quorum number of nodes participating in the elections think that the previous leader is dead while it's still actually alive. In this scenario, we will not be able to obtain a quorum number of Accepts anyway and the elections would fail, so by proposing the BestCandidate instead, we're still better than nothing.
    • On the other hand if at least one node says that the previous leader is alive, we may still have a quorum number of Accepts if we choose this node and the other floor(N/2) nodes that have not sent their PrepareOks for the elections. So we still give the previous leader a chance to be elected.

Stress tests

Stress tests have been carried out with https://github.com/EventStore/XstreamTester with random cluster sizes between 3 and 7 nodes. On this PR, out of around 5640 elections, 7 resulted in truncation which is around 0.1%. The reason for truncation is due to one of the two reasons stated above. There have also been no truncations of committed records.

Tests have also been carried out on V6 master (ca67b9a) and in this case, the number of truncations is higher: 14/4996 elections ~= (0.3%). This has probably improved a bit compared to V5 due to #2386. There was also a case where epoch with committed records was truncated.

Tests on 5.0.8 also show a higher number of truncations and a higher number of truncation of epoch with committed records.

Branch Number of unique elections tested No. of elections resulting in Truncations No. of elections resulting in Truncations of epoch with committed records
This PR 5640 7 (0.1%) 0
v6-master (ca67b9a) 4996 14 (0.3%) 1
v 5.0.8 (6b871e5) 5290 17 (0.3%) 4

@shaan1337 shaan1337 force-pushed the write-leader-in-epoch-record branch 4 times, most recently from 3d1b8e7 to 1e49bd7 Compare May 5, 2020 11:09
@jageall jageall requested a review from condron May 14, 2020 13:26
@shaan1337 shaan1337 force-pushed the write-leader-in-epoch-record branch from 54fdcd7 to 317d693 Compare May 19, 2020 17:18
@shaan1337 shaan1337 self-assigned this May 19, 2020
@shaan1337 shaan1337 changed the title Write leader instance ID in epoch record Write leader instance ID in epoch record and pass on epoch's leader's ID and gossip info to leader of elections May 20, 2020
@shaan1337 shaan1337 marked this pull request as ready for review May 20, 2020 08:59
@shaan1337 shaan1337 requested a review from jageall May 20, 2020 09:12
Copy link
Contributor

@jageall jageall left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it necessary to update the UI submodule in this PR?

@shaan1337 shaan1337 force-pushed the write-leader-in-epoch-record branch from 317d693 to 71a4860 Compare May 20, 2020 11:02
@shaan1337 shaan1337 requested a review from jageall May 20, 2020 11:03
@shaan1337 shaan1337 dismissed jageall’s stale review May 20, 2020 11:04

It was a rebasing mistake. Fixed, thanks!

@shaan1337
Copy link
Member Author

Tests currently failing due to #2505

…andidate node and it's still alive, propose it as the best candidate for Leader.
…e PrepareOk message

GetBestLeaderCandidate(): Use the cluster gossip information to determine if the previous leader may still be alive
GetBestLeaderCandidate(): Clean up code and add debug log messages
Remove tests that are no longer relevant since we no longer rely on the value of the in-memory last elected master to determine the new leader
…rCandidate()

Add TestElectedLeaderId property to ElectionsService for test visibility
@shaan1337 shaan1337 force-pushed the write-leader-in-epoch-record branch from 71a4860 to 2c3dc58 Compare May 21, 2020 09:16
jageall
jageall previously approved these changes May 21, 2020
@jageall jageall requested a review from hayley-jean May 22, 2020 09:13
…y - only change leader if the current leader is resigning
src/EventStore.Core/Services/ElectionsService.cs Outdated Show resolved Hide resolved
src/EventStore.Core/Services/ElectionsService.cs Outdated Show resolved Hide resolved
@shaan1337 shaan1337 dismissed hayley-jean’s stale review May 26, 2020 14:46

Requested changes have been made

@hayley-jean hayley-jean merged commit a4ee0ea into master May 28, 2020
@hayley-jean hayley-jean deleted the write-leader-in-epoch-record branch May 28, 2020 13:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Multi-Master scenario due to node using outdated information about last elected master
3 participants