Expose resignation of Master over HTTP. (Previously known as Maintenance Mode) #1982

321zer0 · 2019-08-26T06:13:57Z

Adds the ability to issue a resignation of the master node over HTTP. This command will generally be issued after the master's priority has been set lower than that of the other nodes in the cluster.

This makes it possible to perform maintenance tasks such as scavenging on the master without restarting it: issuing a resignation causes another member of the cluster to be elected master.

Important Note

There is no UI for this; it is currently only exposed over HTTP.

Resigning a master is a two-step process.

Set node priority

We need to set the node priority to a value lower than that of the other nodes in the cluster. The priority is a hint used when elections occur: candidates are selected based on a set of criteria and ordered by node priority, among other things.

curl -X POST -d{} http://x.x.x.x:1114/admin/node/priority/-1 -u admin:changeit

Issue a resignation of the current running master node

When a master node is given the instruction to resign, it ensures that all writes in flight are handled before it changes its state from Master to Unknown. The intermediate state is ResigningMaster. While in this state it no longer accepts incoming writes.

curl -X POST -d{} http://x.x.x.x:1114/admin/node/resign -u admin:changeit
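The two steps above can also be scripted. The following is a minimal sketch that only builds the two requests in the order described; the endpoint paths and the `admin:changeit` credentials come from the curl commands above, while the helper names and the example base URL are hypothetical and chosen for illustration.

```python
import base64
import urllib.request


def build_request(base_url, path, user, password):
    """Build (but do not send) an authenticated POST with an empty `{}` body."""
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}{path}",
        data=b"{}",  # mirrors `-d{}` in the curl commands
        method="POST",
        headers={"Authorization": f"Basic {token}"},
    )


def resignation_requests(base_url, priority=-1, user="admin", password="changeit"):
    """Return the two requests in the order they should be issued:
    first lower the node priority, then ask the node to resign."""
    return [
        build_request(base_url, f"/admin/node/priority/{priority}", user, password),
        build_request(base_url, "/admin/node/resign", user, password),
    ]


if __name__ == "__main__":
    # Hypothetical address; replace with the current master's HTTP endpoint.
    for req in resignation_requests("http://127.0.0.1:1114"):
        print(req.get_method(), req.full_url)
        # To actually send: urllib.request.urlopen(req)
```

Keeping the requests as data makes the ordering explicit and lets the script be dry-run before it is pointed at a live master.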

ChrisChinchilla · 2019-08-27T09:18:23Z

@muzaffar1331 Will this need docs? If so add the label…

avish0694 · 2019-08-29T08:13:44Z

Testing Notes

Replica status blinks on the UI page

Steps to reproduce the issue:

Enable maintenance mode on master node.

curl -X POST http://127.0.0.1:1113/admin/maintenance/enable -u admin:changeit -H "Content-Length: 0"

What's the expected result?
Replica status to show consistently on UI

What's the actual result?
Replica status on the UI appears and disappears while maintenance mode is enabled on the master node.

Nodes shut down when maintenance mode is enabled.

Steps to reproduce the issue:

Enable maintenance mode on master node when eventstore client is writing data.

curl -X POST http://127.0.0.1:1113/admin/maintenance/enable -u admin:changeit -H "Content-Length: 0"

What's the expected result?
The master node becomes a slave and a new master is elected (no shutdown).

What's the actual result?
Two nodes shut down when the EventStore client is writing events and maintenance mode is enabled on the master node.

Nodes shut down when launched in a 3-node cluster

Steps to reproduce the issue:
No consistent reproduction

What's the expected result?
All 3 nodes are expected to join the cluster and work.

What's the actual result?
Nodes shut down on startup.

Stack trace:

[ERROR] FATAL UNHANDLED EXCEPTION: System.IO.IOException: Invalid handle to path "/Users/avish/db2/chunk-000011.000000"
  at System.IO.FileStream.Dispose (System.Boolean disposing) [0x00052] in <98fac219bd4e453693d76fda7bd96ab0>:0
  at System.IO.Stream.Close () [0x00000] in <98fac219bd4e453693d76fda7bd96ab0>:0
  at System.IO.Stream.Dispose () [0x00000] in <98fac219bd4e453693d76fda7bd96ab0>:0
  at (wrapper remoting-invoke-with-check) System.IO.Stream.Dispose()
  at EventStore.Core.TransactionLog.Chunks.TFChunkBulkReader.Release () [0x0000b] in <95edd66a7030431fb85f1959116925b5>:0
  at EventStore.Core.TransactionLog.Chunks.TFChunkBulkReader.Dispose () [0x00009] in <95edd66a7030431fb85f1959116925b5>:0
  at EventStore.Core.TransactionLog.Chunks.TFChunkBulkReader.Finalize () [0x00000] in <95edd66a7030431fb85f1959116925b5>:0

Test Environment

Operating System: MacOS
Browser: Chrome

321zer0 · 2019-08-29T15:22:03Z

How to run the unit tests on Ubuntu 18.04:

export FrameworkPathOverride=/usr/lib/mono/4.7.1-api

cd EventStore/src/EventStore.Core.Tests

dotnet test --filter EventStore.Core.Tests.Services.ElectionsService.ChoosingMasterTests


Deposing the master will kick off elections. This has a number of benefits over kicking off elections manually:
- You won't potentially end up in a case where too many masters are alive.
- Clients that prefer master will automatically reconnect to the new master node, as we handle this case in the state machine (ClusterVNodeController) now that the node state is `Unknown`.
- The other nodes in the cluster will realize, via their TCP connections being dropped, that the cluster has changed and will act accordingly.

This provides nodes with a hint to not re-elect the last master if it has signalled that it is resigning.

pgermishuys

There's a lack of tests around most of the components we've touched.

hayley-jean

Test cases:

All tests done on a 3 node cluster with node A as master, unless specified otherwise.

Happy case:

Constantly write data to node A.
Set priority on node A to -1000.
Resign node A.
Node B or C is elected master; A becomes Unknown and eventually becomes a slave.
No nodes taken offline for truncation.

Resigning node without changing priority

Don't change priority on node A.
Resign node A.
Node A is not necessarily re-elected, but may be, if it is found to be the best candidate.

Changing priority and triggering elections without resigning a node

Take down node C.
Set priority on node A to -1000.
Bring node C back up (triggering elections).
Node A remains master.
This test case is here because this scenario previously caused instability.

Resigning node when other nodes are not caught up

Write a lot of data.
Take down nodes B and C, delete their db folders.
Bring nodes B and C back up, wait for them to go into a CatchingUp state.
Set priority on node A to -1000.
Resign node A.
Node A is elected master again.
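The outcomes of the happy case and the not-caught-up case are consistent with an election that prefers the most complete log first and only then looks at node priority. Below is a toy model of that reading. It is not the actual ElectionsService logic: the `Candidate` type, its `writer_checkpoint` field, and the tie-break on node name are all assumptions made for illustration, and the other two test cases are not modelled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    writer_checkpoint: int  # hypothetical measure of how much of the log the node holds
    node_priority: int = 0  # the value set via /admin/node/priority/{n}


def best_master_candidate(candidates):
    """Pick the node with the most complete log; priority only breaks ties
    between equally caught-up nodes (the name makes the result deterministic)."""
    return max(candidates, key=lambda c: (c.writer_checkpoint, c.node_priority, c.name))


# Happy case: everyone is caught up, A has lowered its priority.
happy = [
    Candidate("A", 1000, node_priority=-1000),
    Candidate("B", 1000),
    Candidate("C", 1000),
]

# B and C were wiped and are still catching up, so A holds the most data.
catching_up = [
    Candidate("A", 1000, node_priority=-1000),
    Candidate("B", 10),
    Candidate("C", 10),
]

if __name__ == "__main__":
    print(best_master_candidate(happy).name)        # B or C, never A
    print(best_master_candidate(catching_up).name)  # A, despite its low priority
```

In this model the priority is only a hint: it cannot promote a node that is behind on data, which matches node A being re-elected in the last test case.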

hayley-jean · 2019-10-23T12:45:22Z

We are merging this PR for now, but there are improvements that need to be made to the election and gossip services in order to improve the stability of the cluster.

…nce Mode) (#1982) Add ability to issue a resignation of the master node over http. Allow setting the priority for a node during runtime. Add NodePriority to ElectionMessage.Proposal. Add new ResigningMaster state. The node will enter this state when it is told to resign. While resigning, the master will ignore any new write requests and wait until the request queue is drained. Once the request queue is empty, the master will enter the Unknown state. Broadcast the resigning master message to other nodes in the cluster. This provides nodes with a hint to not re-elect the last master if it has signaled that it is resigning.
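The resignation flow described in this commit message can be illustrated with a small state machine. This is a simplified sketch rather than the ClusterVNodeController code: the `ToyMaster` class and its method names are hypothetical, and the request queue is modelled as a plain in-memory deque.

```python
from collections import deque
from enum import Enum


class NodeState(Enum):
    MASTER = "Master"
    RESIGNING_MASTER = "ResigningMaster"
    UNKNOWN = "Unknown"


class ToyMaster:
    def __init__(self):
        self.state = NodeState.MASTER
        self.queue = deque()  # write requests still in flight

    def write(self, event):
        """Accept a write only while the node is Master."""
        if self.state is not NodeState.MASTER:
            return False
        self.queue.append(event)
        return True

    def resign(self):
        """Enter ResigningMaster; move on at once if nothing is in flight."""
        if self.state is NodeState.MASTER:
            self.state = NodeState.RESIGNING_MASTER
            self._check_drained()

    def complete_one(self):
        """Finish the oldest in-flight write, then re-check the queue."""
        if not self.queue:
            return None
        done = self.queue.popleft()
        self._check_drained()
        return done

    def _check_drained(self):
        # A drained queue only matters while resigning; otherwise it is ignored.
        if self.state is NodeState.RESIGNING_MASTER and not self.queue:
            self.state = NodeState.UNKNOWN
```

For example, a node with two writes in flight stays in `ResigningMaster` after `resign()`, rejects any new `write()`, and reaches `Unknown` only once both writes have been completed.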

mat-mcloughlin · 2019-12-27T09:52:27Z

closed #1170

avish0694 self-requested a review August 27, 2019 06:20

avish0694 added the area/documentation, kind/enhancement, and subsystem/core database labels Aug 28, 2019

jageall added this to the Event Store v6 RC milestone Sep 10, 2019

avish0694 mentioned this pull request Sep 11, 2019

Maintenance mode UI changes EventStore/EventStore.UI#225

Closed

2 tasks

pgermishuys self-assigned this Sep 16, 2019

pgermishuys changed the base branch from master to v6-master September 17, 2019 06:30

321zer0 added 19 commits September 20, 2019 07:20

Add endpoint to enable maintainance mode

d2cdb21

Signed-off-by: Muzaffar Auhammud <muzaffar1331@gmail.com>

Add endpoint to disable maintainance mode

a4df8fc

Signed-off-by: Muzaffar Auhammud <muzaffar1331@gmail.com>

Add EnableMaintainanceMode and DisableMaintainanceMode ClientMessage

6fa6b25

Signed-off-by: Muzaffar Auhammud <muzaffar1331@gmail.com>

Publish EnableMaintainanceMode and DisableMaintainanceMode messages w…

4953bb4

…hen endpoint is triggered Signed-off-by: Muzaffar Auhammud <muzaffar1331@gmail.com>

Handle EnableMaintainanceMode and DisableMaintainanceMode messages in…

2851ee0

… ElectionService Signed-off-by: Muzaffar Auhammud <muzaffar1331@gmail.com>

Handle request to enable or disable maintainance mode only when appro…

2bc45ea

…priate Signed-off-by: Muzaffar Auhammud <muzaffar1331@gmail.com>

Send the updated node priority through gossip

d5038c4

Signed-off-by: Muzaffar Auhammud <muzaffar1331@gmail.com>

Trigger election when disabling maintainance, ignore lastElectedMaster

b788a79

Signed-off-by: Muzaffar Auhammud <muzaffar1331@gmail.com>

Modify GetBestMasterCandidate to consider _servers array only

23b15f6

Signed-off-by: Muzaffar Auhammud <muzaffar1331@gmail.com>

Remove Console logging of ClusterInfo in HandleAsMaster and HandleAsN…

f1e2b4f

…onMaster Signed-off-by: Muzaffar Auhammud <muzaffar1331@gmail.com>

Improve readability in MemberInfo.ToString()

d991142

Signed-off-by: Muzaffar Auhammud <muzaffar1331@gmail.com>

Remove rule that checks if previous master is alive, trigger election…

08427d4

… only when GossipUpdate is received and master's priority has changed Signed-off-by: Muzaffar Auhammud <muzaffar1331@gmail.com>

Attempt to fix election loop

81924b6

Signed-off-by: Muzaffar Auhammud <muzaffar1331@gmail.com>

Restore best working logic in GetBestMasterCandidate()

564864b

Signed-off-by: Muzaffar Auhammud <muzaffar1331@gmail.com>

Add tests for maintenance mode

a695a23

Add check to prevent election if there are non-caught up nodes

54cb720

Signed-off-by: Muzaffar Auhammud <muzaffar1331@gmail.com>

Add more checks when choosing a master candidiate

910d7a7

Add ElectionsServiceUnitTests

b19489a

Add Hayley's changes

0fb138e

321zer0 and others added 9 commits September 20, 2019 07:27

Working when turning maintenance on, but not off

ffcae4b

Add NodePriority to ElectionMessage.Proposal

31921cd

Use ElectionMessage.Proposal NodePriority

d275a0f

Move maintenance mode test to appropriate directory

0c58ae4

Be a little more descriptive in info message regarding maintenance mode

22999d1

Remove commented out code

e972c25

Update UI for maintenance mode

7c69227

Clear up tests

8caaef3

Rebased ontop of v6:master

5b70af5

pgermishuys mentioned this pull request Sep 25, 2019

Expose resignation of leadership over HTTP #1170

Closed

321zer0 changed the title ~~Add maintainance mode feature~~ Add maintenance mode feature Oct 12, 2019

pgermishuys added 3 commits October 13, 2019 19:38

Drain request queue

6ff4c98

Split out Setting Node Priority and Resigning a node

db47de7

Correctly set node priority

f987296

pgermishuys changed the title ~~Add maintenance mode feature~~ Expose resignation of Master over HTTP. (Previously known as Maintenance Mode) Oct 16, 2019

pgermishuys added 5 commits October 16, 2019 09:31

Formatting and remove commented out code

6baf883

Ignore Request Queue Drained if not in Resigning Master

f033df9

Broadcast the resigning master message

94ec074

This provides nodes with a hint to not re-elect the last master if it has signalled that it is resigning.

Update UI submodule

e1b739e

Need better tests

a8754f6

pgermishuys reviewed Oct 17, 2019

hayley-jean approved these changes Oct 17, 2019

hayley-jean merged commit d9e698a into EventStore:v6-master Oct 23, 2019
