Fix riak dynamic join leave #410

Merged
merged 8 commits into from
Dec 9, 2019

Conversation

albsch (Member) commented Dec 6, 2019

  • Implemented delete handoff functionality to remove nodes from a cluster.
    • This does not guarantee that interaction with inter-DC communication is preserved; that has to be checked separately. Nevertheless, removing a node now works on the riak_core side of things.
    • Updated metrics to account for log size decrease
  • Added missing function clause for the handoff commands for the materializer_vnode
    • Fixes a bug where a function clause is missing during handoff
    • This does not mean the handoff is implemented correctly: unexpected requests from processes that try to access the vnode during the handoff lifecycle can hang indefinitely, because the timeout of riak calls is infinity
  • Improved documentation for handoff
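
To illustrate the hang described in the second bullet group, here is a minimal Python sketch (not Antidote or riak_core code). `HandoffVnode` and `call` are hypothetical names; the vnode stands in for one that is mid-handoff and never replies. With `timeout=None` the caller would block forever, whereas a bounded timeout turns the hang into an error the caller can handle.

```python
import queue


class HandoffVnode:
    """Hypothetical stand-in for a vnode that is mid-handoff and never replies."""

    def request(self, msg, reply_to):
        # The request is dropped: nothing is ever put on reply_to.
        pass


def call(vnode, msg, timeout):
    """Send msg and wait for a reply; timeout=None would wait forever."""
    reply_to = queue.Queue(maxsize=1)
    vnode.request(msg, reply_to)
    try:
        return ("ok", reply_to.get(timeout=timeout))
    except queue.Empty:
        return ("error", "timeout")


# A bounded timeout (in seconds) returns instead of hanging.
result = call(HandoffVnode(), "read", timeout=0.05)
```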

albsch (Member Author) commented Dec 6, 2019

Adding/Deleting monitoring:

image

  1. Node 1 is started and receives operations
  2. Node 2 joined, node 1 transfers half of its log (half of the vnodes start handoff), log shrinks
  3. More data, this time only half of the operations arrive at node 1
  4. Node 3 joined, rebalancing all vnodes again. Node 1 loses and gains some vnodes
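
The ownership changes in steps 2 and 4 can be reproduced with a toy model. This is a sketch under a strong assumption: partitions are assigned round-robin, which is a simplification and not riak_core's actual claim algorithm. The function names and the ring size of 8 are made up for illustration.

```python
def assign_round_robin(num_partitions, nodes):
    """Assign each partition (vnode) index to a node in round-robin order."""
    return {p: nodes[p % len(nodes)] for p in range(num_partitions)}


def handoffs(before, after):
    """Return (partition, old_owner, new_owner) for every ownership change."""
    return [(p, before[p], after[p]) for p in sorted(before) if before[p] != after[p]]


ring1 = assign_round_robin(8, ["node1"])
ring2 = assign_round_robin(8, ["node1", "node2"])
ring3 = assign_round_robin(8, ["node1", "node2", "node3"])

# Node 2 joins: half of node 1's partitions are handed off.
moves_join2 = handoffs(ring1, ring2)

# Node 3 joins: node 1 both gives up and receives partitions.
moves_join3 = handoffs(ring2, ring3)
lost_by_node1 = [p for p, old, _ in moves_join3 if old == "node1"]
gained_by_node1 = [p for p, _, new in moves_join3 if new == "node1"]
```

In this toy ring, node 1 hands four of its eight partitions to node 2 on the first join, and on the second join it gives up two partitions while receiving one, which matches the shrinking and regrowing log in the graph.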

bieniusa (Contributor) commented Dec 9, 2019

Handoff and leave is a cool feature, but it interferes with the Cure protocol (which assumes a fixed one-to-one mapping of nodes on different DCs).

@preguica @marc-shapiro @shamouda Opinions? Should we try to fix the protocol accordingly? How does this relate to partial replication?

preguica commented Dec 9, 2019 via email

bieniusa (Contributor) commented Dec 9, 2019

Yes, it is about dynamicity of nodes within a DC.

albsch (Member Author) commented Dec 9, 2019

Currently, to add a node dynamically without breaking Cure, one needs to do the following manually:

  1. Block client requests
  2. Each DC forgets every other DC
  3. Change any DC configuration in any way
  4. Wait until handoff is finished (can take a long time)
  5. Reconnect all DCs
  6. Receive client requests

Once handoff is implemented correctly, step 4 can be skipped. I'm not sure whether steps 1 and 6 are needed, but I added them just in case.
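
The ordering of the six steps can be written down as a small simulation. This is only a sketch in Python: `DataCenter` and `add_node_safely` are hypothetical and do not correspond to any Antidote API, and the handoff wait in step 4 completes instantly here.

```python
from dataclasses import dataclass, field


@dataclass
class DataCenter:
    """Hypothetical model of a DC; not Antidote's actual data structures."""
    name: str
    nodes: list
    peers: set = field(default_factory=set)
    accepting_clients: bool = True
    handoff_pending: bool = False


def add_node_safely(dcs, target_name, new_node):
    """Run the six manual steps in order and return a log of what was done."""
    log = []
    saved_peers = {dc.name: set(dc.peers) for dc in dcs}
    target = next(dc for dc in dcs if dc.name == target_name)

    for dc in dcs:                      # 1. block client requests
        dc.accepting_clients = False
    log.append("block clients")

    for dc in dcs:                      # 2. each DC forgets every other DC
        dc.peers.clear()
    log.append("disconnect DCs")

    target.nodes.append(new_node)       # 3. change the DC configuration
    target.handoff_pending = True
    log.append("reconfigure " + target_name)

    target.handoff_pending = False      # 4. wait for handoff (instant here)
    log.append("handoff finished")

    for dc in dcs:                      # 5. reconnect all DCs
        dc.peers = set(saved_peers[dc.name])
    log.append("reconnect DCs")

    for dc in dcs:                      # 6. receive client requests again
        dc.accepting_clients = True
    log.append("unblock clients")
    return log


dc1 = DataCenter("dc1", ["n1"], {"dc2"})
dc2 = DataCenter("dc2", ["m1"], {"dc1"})
steps = add_node_safely([dc1, dc2], "dc1", "n2")
```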

An easier approach would be to design Inter-DC communication in such a way that each DC does not know the internal ring structure of another DC. Then changing the structure and replication inside a DC would not affect another DC, reducing complexity of the problem.
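
A rough Python sketch of that idea, assuming a DC exposes a single entry point and routes to partitions internally. The class, its methods, and the CRC32-based routing are all hypothetical choices for illustration, not how Antidote's inter-DC layer works.

```python
import zlib


class DataCenter:
    """Hypothetical DC that hides its partitioning behind deliver()/read()."""

    def __init__(self, name, num_partitions):
        self.name = name
        self._partitions = [dict() for _ in range(num_partitions)]

    def _partition_for(self, key):
        # Internal routing detail; remote DCs never see this.
        return zlib.crc32(key.encode()) % len(self._partitions)

    def deliver(self, key, value):
        """Public inter-DC entry point: the sender never names a partition."""
        self._partitions[self._partition_for(key)][key] = value

    def read(self, key):
        return self._partitions[self._partition_for(key)].get(key)

    def resize(self, num_partitions):
        """Change the internal ring size and redistribute existing data."""
        items = [kv for part in self._partitions for kv in part.items()]
        self._partitions = [dict() for _ in range(num_partitions)]
        for k, v in items:
            self.deliver(k, v)


def replicate(remote_dcs, key, value):
    """Sender side: addresses whole DCs only, independent of their ring size."""
    for dc in remote_dcs:
        dc.deliver(key, value)


dc_a = DataCenter("dc_a", 4)
dc_b = DataCenter("dc_b", 8)
replicate([dc_a, dc_b], "x", 1)
dc_b.resize(3)                      # internal change, invisible to senders
replicate([dc_a, dc_b], "y", 2)
```

Because `replicate` only addresses whole DCs, resizing `dc_b` requires no change on the sending side.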

@albsch albsch force-pushed the fix-riak-dynamic-join-leave branch from 7bd813f to 233fc2c Compare December 9, 2019 14:53
marc-shapiro commented Dec 9, 2019 via email

albsch (Member Author) commented Dec 9, 2019

That discussion is best left for another issue or meeting; this PR doesn't address proper join/leave mechanics or replication.

albsch (Member Author) commented Dec 9, 2019

Will merge this once `rebar3 as test coveralls send` stops failing the build.

@albsch albsch merged commit ae151c3 into master Dec 9, 2019
@albsch albsch deleted the fix-riak-dynamic-join-leave branch December 9, 2019 17:54
4 participants