Expose resignation of Master over HTTP. (Previously known as Maintenance Mode) #1982

Merged
merged 37 commits into from Oct 23, 2019

321zer0
Member

@321zer0 321zer0 commented Aug 26, 2019

Ref: #1170

Adds the ability to issue a resignation of the master node over HTTP. This command will generally be issued after the master's priority has been set lower than that of the other nodes in the cluster.

This makes it possible to perform maintenance tasks such as scavenging on the master without restarting it: issuing a resignation causes another member of the cluster to be elected master.

Important Note

There are no UI bits for this; it is currently only exposed via HTTP.

Resigning a master is a two-step process.

Set node priority

We need to set the node priority to a lower value than that of the other nodes in the cluster. The priority is a hint for elections: candidates are selected based on a set of criteria and are ordered by node priority, amongst other things.

curl -X POST -d{} http://x.x.x.x:1114/admin/node/priority/-1 -u admin:changeit

Issue a resignation of the current running master node

When a master node is instructed to resign, it ensures that all in-flight writes are handled before it changes its state from Master to Unknown. The intermediate state is ResigningMaster. While in this state it no longer accepts incoming writes.

curl -X POST -d{} http://x.x.x.x:1114/admin/node/resign -u admin:changeit
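
For scripting the two steps together, here is a minimal Python sketch. It only builds the two requests in the right order and leaves sending them to the caller. The endpoint paths, the empty JSON body, and the default credentials are taken from the curl commands above; the helper names `admin_post` and `resignation_requests` are hypothetical and not part of Event Store.

```python
import base64
import urllib.request


def admin_post(base_url, path, user, password):
    """Build (but do not send) an authenticated POST with an empty JSON body."""
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    return urllib.request.Request(
        url=f"{base_url}{path}",
        data=b"{}",
        method="POST",
        headers={"Authorization": f"Basic {token}"},
    )


def resignation_requests(base_url, priority=-1, user="admin", password="changeit"):
    """Return the two requests in the order they should be issued."""
    return [
        # Step 1: lower the priority of the current master.
        admin_post(base_url, f"/admin/node/priority/{priority}", user, password),
        # Step 2: ask the current master to resign.
        admin_post(base_url, "/admin/node/resign", user, password),
    ]


if __name__ == "__main__":
    for req in resignation_requests("http://127.0.0.1:1114"):
        print(req.get_method(), req.full_url)
        # To actually send it: urllib.request.urlopen(req, timeout=10)
```

Keeping the priority change and the resignation in one ordered list makes it harder to resign a node while it still has the same priority as its peers.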

@avish0694 avish0694 self-requested a review August 27, 2019 06:20
@ChrisChinchilla
Contributor

@muzaffar1331 Will this need docs? If so add the label…

@avish0694 avish0694 added area/documentation Issues relating to project documentation kind/enhancement Issues which are a new feature subsystem/core database Issues relating to the core database labels Aug 28, 2019
@avish0694
Contributor

avish0694 commented Aug 29, 2019

Testing Notes

Replica status blinks on the UI page

Steps to reproduce the issue:

  • Enable maintenance mode on master node.
curl -X POST http://127.0.0.1:1113/admin/maintenance/enable -u admin:changeit -H "Content-Length: 0"

What's the expected result?
Replica status is shown consistently on the UI.

What's the actual result?
Replica status on the UI appears and disappears when maintenance mode is enabled on the master node.

Nodes shut down when maintenance mode is enabled.

Steps to reproduce the issue:

  • Enable maintenance mode on the master node while an EventStore client is writing data.
curl -X POST http://127.0.0.1:1113/admin/maintenance/enable -u admin:changeit -H "Content-Length: 0"

What's the expected result?
The master node becomes a slave and a new master is elected (no shutdown).

What's the actual result?
2 nodes shut down when an EventStore client is writing events and maintenance mode is enabled on the master node.

Nodes shut down when you launch them in a 3 node cluster

Steps to reproduce the issue:
No consistent reproduction

What's the expected result?
All 3 nodes are expected to join the cluster and work.

What's the actual result?
Nodes shut down when starting them.

Stack trace:

[ERROR] FATAL UNHANDLED EXCEPTION: System.IO.IOException: Invalid handle to path "/Users/avish/db2/chunk-000011.000000"
  at System.IO.FileStream.Dispose (System.Boolean disposing) [0x00052] in <98fac219bd4e453693d76fda7bd96ab0>:0
  at System.IO.Stream.Close () [0x00000] in <98fac219bd4e453693d76fda7bd96ab0>:0
  at System.IO.Stream.Dispose () [0x00000] in <98fac219bd4e453693d76fda7bd96ab0>:0
  at (wrapper remoting-invoke-with-check) System.IO.Stream.Dispose()
  at EventStore.Core.TransactionLog.Chunks.TFChunkBulkReader.Release () [0x0000b] in <95edd66a7030431fb85f1959116925b5>:0
  at EventStore.Core.TransactionLog.Chunks.TFChunkBulkReader.Dispose () [0x00009] in <95edd66a7030431fb85f1959116925b5>:0
  at EventStore.Core.TransactionLog.Chunks.TFChunkBulkReader.Finalize () [0x00000] in <95edd66a7030431fb85f1959116925b5>:0

Test Environment

Operating System: MacOS
Browser: Chrome

@321zer0
Member Author

321zer0 commented Aug 29, 2019

How to run the unit tests on Ubuntu 18.04:

export FrameworkPathOverride=/usr/lib/mono/4.7.1-api

cd EventStore/src/EventStore.Core.Tests

dotnet test --filter EventStore.Core.Tests.Services.ElectionsService.ChoosingMasterTests

@jageall jageall added this to the Event Store v6 RC milestone Sep 10, 2019
@pgermishuys pgermishuys self-assigned this Sep 16, 2019
@pgermishuys pgermishuys changed the base branch from master to v6-master September 17, 2019 06:30
Signed-off-by: Muzaffar Auhammud <muzaffar1331@gmail.com>
Signed-off-by: Muzaffar Auhammud <muzaffar1331@gmail.com>
Signed-off-by: Muzaffar Auhammud <muzaffar1331@gmail.com>
…hen endpoint is triggered

Signed-off-by: Muzaffar Auhammud <muzaffar1331@gmail.com>
… ElectionService

Signed-off-by: Muzaffar Auhammud <muzaffar1331@gmail.com>
…priate

Signed-off-by: Muzaffar Auhammud <muzaffar1331@gmail.com>
Signed-off-by: Muzaffar Auhammud <muzaffar1331@gmail.com>
Signed-off-by: Muzaffar Auhammud <muzaffar1331@gmail.com>
Signed-off-by: Muzaffar Auhammud <muzaffar1331@gmail.com>
…onMaster

Signed-off-by: Muzaffar Auhammud <muzaffar1331@gmail.com>
Signed-off-by: Muzaffar Auhammud <muzaffar1331@gmail.com>
… only when GossipUpdate is received and master's priority has changed

Signed-off-by: Muzaffar Auhammud <muzaffar1331@gmail.com>
Signed-off-by: Muzaffar Auhammud <muzaffar1331@gmail.com>
Signed-off-by: Muzaffar Auhammud <muzaffar1331@gmail.com>
Signed-off-by: Muzaffar Auhammud <muzaffar1331@gmail.com>
Deposing the master will kick off elections. This has a number of
benefits over kicking off elections manually.

- You won't potentially end up in a case where too many masters are
alive.
- Clients that prefer master will automatically reconnect to the new
master node as we handle this case in the state machine
(ClusterVNodeController) as the node state is now `Unknown`.
- The other nodes in the cluster will realize via their TCP connections
being dropped that the cluster has changed and will act accordingly.
@321zer0 321zer0 changed the title Add maintainance mode feature Add maintenance mode feature Oct 12, 2019
@pgermishuys pgermishuys changed the title Add maintenance mode feature Expose resignation of Master over HTTP. (Previously known as Maintenance Mode) Oct 16, 2019
Contributor

@pgermishuys pgermishuys left a comment

There's a lack of tests around most of the components we've touched.

Member

@hayley-jean hayley-jean left a comment

Test cases:

All tests done on a 3 node cluster with node A as master, unless specified otherwise.

Happy case:

  1. Constantly write data to node A.
  2. Set priority on node A to -1000.
  3. Resign node A.
  4. Node B or C is elected master; A becomes Unknown and eventually becomes a slave.
  5. No nodes taken offline for truncation.

Resigning node without changing priority

  1. Don't change priority on node A.
  2. Resign node A.
  3. Node A is not necessarily re-elected, but may be, if it is found to be the best candidate.

Changing priority and triggering elections without resigning a node

  1. Take down node C.
  2. Set priority on node A to -1000.
  3. Bring node C back up (triggering elections).
  4. Node A remains master.
    This test case is here because this scenario previously caused instability.

Resigning node when other nodes are not caught up

  1. Write a lot of data.
  2. Take down nodes B and C, delete their db folders.
  3. Bring nodes B and C back up, wait for them to go into a CatchingUp state.
  4. Set priority on node A to -1000.
  5. Resign node A.
  6. Node A is elected master again.
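
The happy case and the "not caught up" case can be read as a consequence of priority being only one of several ordering criteria. Here is a toy Python sketch of that idea. It assumes, purely for illustration, that a candidate's log position (`writer_checkpoint`) is compared before its priority; the PR does not spell out the actual criteria or their order, and `Candidate` and `best_candidate` are hypothetical names, not Event Store types.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    writer_checkpoint: int  # assumed stand-in for how caught up a node is
    priority: int


def best_candidate(candidates):
    """Pick the most caught-up node; priority only breaks ties (assumption)."""
    return max(candidates, key=lambda c: (c.writer_checkpoint, c.priority))


# Happy case: every node is caught up, so the lowered priority removes A.
caught_up = [
    Candidate("A", 1000, -1000),
    Candidate("B", 1000, 0),
    Candidate("C", 1000, 0),
]
print(best_candidate(caught_up).name)

# B and C are still catching up, so A wins despite its low priority.
lagging = [
    Candidate("A", 1000, -1000),
    Candidate("B", 10, 0),
    Candidate("C", 20, 0),
]
print(best_candidate(lagging).name)
```

Under this assumed ordering, lowering the priority is a hint rather than a guarantee, which matches the outcomes listed in the test cases above.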

@hayley-jean
Member

We are merging this PR for now, but there are improvements that need to be made to the election and gossip services in order to improve the stability of the cluster.

@hayley-jean hayley-jean merged commit d9e698a into EventStore:v6-master Oct 23, 2019
pgermishuys pushed a commit that referenced this pull request Dec 5, 2019
…nce Mode) (#1982)

Add ability to issue a resignation of the master node over http.

Allow setting the priority for a node during runtime.
Add NodePriority to ElectionMessage.Proposal.

Add new ResigningMaster state. The node will enter this state when it is told to resign.
While resigning, the master will ignore any new write requests and wait until the request queue is drained.
Once the request queue is empty, the master will enter the Unknown state.
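
As a rough illustration of the transitions described above, here is a toy Python model of Master → ResigningMaster → Unknown. It is a sketch under simplifying assumptions (a single in-memory queue, no networking or elections) and not the ClusterVNodeController implementation; the class and method names are hypothetical.

```python
from collections import deque
from enum import Enum


class NodeState(Enum):
    MASTER = "Master"
    RESIGNING_MASTER = "ResigningMaster"
    UNKNOWN = "Unknown"


class MasterNode:
    """Toy model of a master that drains its request queue before stepping down."""

    def __init__(self):
        self.state = NodeState.MASTER
        self.pending = deque()  # writes accepted but not yet completed
        self.completed = []

    def write(self, event):
        """Accept a write only while the node is a regular master."""
        if self.state is not NodeState.MASTER:
            return False
        self.pending.append(event)
        return True

    def resign(self):
        """Enter ResigningMaster; step down at once if nothing is in flight."""
        if self.state is NodeState.MASTER:
            self.state = NodeState.RESIGNING_MASTER
            self._maybe_step_down()

    def complete_next(self):
        """Finish the oldest in-flight write, then check whether the queue is drained."""
        if self.pending:
            self.completed.append(self.pending.popleft())
        self._maybe_step_down()

    def _maybe_step_down(self):
        if self.state is NodeState.RESIGNING_MASTER and not self.pending:
            self.state = NodeState.UNKNOWN
```

In this model a write submitted after `resign()` is rejected, while writes accepted earlier still complete before the node reaches `Unknown`.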

Broadcast the resigning master message to other nodes in the cluster.
This provides nodes with a hint to not re-elect the last master if it has signaled that it is resigning.
@mat-mcloughlin
Contributor

closed #1170

Labels
area/documentation Issues relating to project documentation kind/enhancement Issues which are a new feature subsystem/core database Issues relating to the core database
7 participants