Stop merging when remote node is not alive #2

dszoboszlay · 2013-08-26T19:35:09Z

Sometimes unsplit_server attempts to stitch nodes together before they have reconnected. In that case IslandB is not a list but a badrpc error tuple, which crashes intersection/2.

The patch halts the stitching when the rpc call fails. It assumes a new Mnesia event will be sent later on, when NodeB is reconnected. This looks like a valid assumption: I have a CT test suite that relies on unsplit resolving netsplits, and it worked reliably even when unsplit_server crashed. The only symptom was that when the crash happened too frequently within a short time, the whole application stopped, which got caught by a sanity check later on.
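As a rough illustration of the failure mode and of the kind of guard described above, here is a minimal sketch. It is not the code from the patch: the module name, `common_nodes/2`, and the return shapes are hypothetical, and the real `intersection/2` in unsplit_server may be written differently. The only facts relied on are that `rpc:call/4` returns `{badrpc, Reason}` on failure and that list operations crash when handed such a tuple.

```erlang
-module(stitch_guard_sketch).
-export([common_nodes/2, demo/0]).

%% Hypothetical helper (not from unsplit_server): intersect two islands
%% only when the remote island really is a list of nodes.
common_nodes(IslandA, IslandB) when is_list(IslandA), is_list(IslandB) ->
    {ok, [N || N <- IslandA, lists:member(N, IslandB)]};
common_nodes(_IslandA, {badrpc, Reason}) ->
    %% The rpc to the remote node failed, so there is nothing to
    %% intersect. Halt the stitching instead of crashing the server.
    {halt, Reason}.

%% Small self-check exercising both clauses.
demo() ->
    {ok, [b]} = common_nodes([a, b], [b, c]),
    {halt, nodedown} = common_nodes([a, b], {badrpc, nodedown}),
    ok.
```

With a guard of this kind the caller can simply return its unchanged state on `{halt, _}` and wait for the next Mnesia event.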

uwiger · 2013-08-29T04:52:06Z

So, wouldn't it be prudent to either abort or at least complain a bit when the error is detected?

Reasonably, we don't know the state of our database at that point.

dszoboszlay · 2013-08-31T16:11:26Z

If aborting means crashing the server, I don't think that is the correct way to handle the problem. When restarted, the server won't attempt to merge with NodeB again unless it receives another Mnesia event. Stopping the merge action without crashing (as in the patch) would do just as well.

Regarding complaining: the rest of unsplit_server does indeed complain when something unexpected happens, but in the form of io:fwrite/2 calls, whose output normally doesn't end up anywhere in production. If you believe it is still worth adding an error message in this form, I can accept that.

dszoboszlay · 2013-08-31T16:13:05Z

(Sorry, I meant to comment only, not to close&reopen...)

uwiger · 2013-08-31T21:56:41Z

Ok, fair enough. But I'm not quite comfortable with the "probably because NodeB is not alive". If NodeB is dead, I agree with you. Can you add a check to verify that this is the case?

dszoboszlay · 2013-09-01T19:38:54Z

If the rpc call failed, then I see no other possibility than that we are not connected to NodeB. The failing rpc call is itself a check for that. However, I wouldn't dare to say NodeB is dead. It is dead from this node's point of view; that's all we know.

And honestly, I don't know how Mnesia could figure out at this point that these two nodes are inconsistent. What I see is that without rpc working we don't have a chance to perform the unsplit operation.

uwiger · 2013-09-01T22:37:36Z

You're probably right, but aren't you in fact getting {badrpc, nodedown} returns?

Another possible outcome is that the node is up, but the function misbehaves. I'd rather have a specific pattern match against a known error.
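To make the distinction concrete, here is a minimal sketch of matching on the specific `{badrpc, nodedown}` result while keeping other failures visible. The module name, `classify/1`, and its return values are hypothetical and are not taken from the patch; the sketch only assumes the documented behaviour of `rpc:call/4`, which returns `{badrpc, nodedown}` when the target node cannot be reached and a different `{badrpc, Reason}` when, for example, the remote function crashes.

```erlang
-module(badrpc_match_sketch).
-export([classify/1]).

%% Hypothetical classifier for the result of an rpc:call/4 that is
%% expected to return a list of nodes.
classify(Island) when is_list(Island) ->
    %% Normal case: the remote node answered with its island.
    {merge, Island};
classify({badrpc, nodedown}) ->
    %% Known, expected failure: the remote node is unreachable from
    %% this node. Stop quietly and wait for the next Mnesia event.
    stop;
classify({badrpc, Reason}) ->
    %% Any other rpc failure (e.g. the remote function crashed):
    %% report it rather than treating it as a dead node.
    {error, {unexpected_badrpc, Reason}};
classify(Other) ->
    %% The call succeeded but returned something that is not a list.
    {error, {unexpected_reply, Other}}.
```

Only the `nodedown` clause is treated as harmless here; everything else is returned as an error so the caller can decide whether to log it or crash.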

Stop merging when remote node is not alive

a742db5

dszoboszlay closed this Aug 31, 2013

dszoboszlay reopened this Aug 31, 2013
