Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Stop merging when remote node is not alive #2

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

dszoboszlay
Copy link

Sometimes unsplit_server attempts to stitch together nodes before they would become reconnected. In this case IslandB is not a list, but a badrpc error tuple that crashes intersection/2.

The patch halts the stitching when rpc fails. It assumes a new Mnesia event will be sent later on, when NodeB is reconnected. This looks like a valid assumption: I have a CT test suite that relies on unsplit resolving netsplits and it worked reliably even when unsplit_server crashed. The only symptom was that when it happened too frequently in a short time the whole application stopped, which got caught by a sanity check later on.

@uwiger
Copy link

uwiger commented Aug 29, 2013

So, wouldn't it be prudent to either abort or at least complain a bit when the error is detected?

Reasonably, we don't know the state of our database at that point.

@dszoboszlay
Copy link
Author

If abort would mean to crash the server, I don't think it would be the correct way of handling the problem. When restarted the server won't attempt to merge with NodeB again, only if it receives another Mnesia event. Stopping the merge action without crashing (as in the patch) would do just as well.

Regarding complaining: the rest of unsplit_server indeed complains when something unexpected happens, but in the form of io:fwrite/2 calls, which do not normally write to anywhere in production. If you believe it would still worth adding an error message in this form, I can accept it.

@dszoboszlay dszoboszlay reopened this Aug 31, 2013
@dszoboszlay
Copy link
Author

(Sorry, I meant to comment only, not to close&reopen...)

@uwiger
Copy link

uwiger commented Aug 31, 2013

Ok, fair enough. But I'm not quite comfortable with the "probably because NodeB is not alive". If NodeB is dead, I agree with you. Can you add a check to verify that this is the case?

@dszoboszlay
Copy link
Author

If the rpc call failed then I see no other possibility that we are not connected to NodeB. The failing rpc call is itself a check for that. However, I don't dare to say NodeB is dead. It is dead from this node's point of view, that's all we know.

And honestly, I don't know how could Mnesia figure out at this point that these two nodes are inconsistent. What I see is that without rpc working we don't have a chance to perform the unsplit operation.

@uwiger
Copy link

uwiger commented Sep 1, 2013

You're probably right, but aren't you in fact getting {badrpc, nodedown} returns?

Another possible outcome is that the node is up, but the function misbehaves. I'd rather have a specific pattern match against a known error.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
2 participants