link/unlink is exception unsafe #286

facundominguez · 2016-05-03T14:25:27Z

Suppose I have the following code

f pid = do
  link pid
  blocking_operations_which_depend_on_health_of pid
  unlink pid

If f is interrupted with an exception, unlink might never be called. Thus I rewrite to:

f pid = bracket_ (link pid) (unlink pid) $
          blocking_operations_which_depend_on_health_of pid

But this still doesn't help, because unlink is interruptible, thus if unlink is interrupted with an exception, the process will remain linked to pid after leaving f.

How about

f pid = bracket_ (link pid) (uninterruptibleMask_ $ unlink pid) $
          blocking_operations_which_depend_on_health_of pid

?

This introduces the potential for blocking indefinitely. If pid dies while unlink is executing, the link exception cannot be thrown because uninterruptibleMask_ is in effect. However, when unlink completes, the exception cannot be thrown either, because the very unlink definition does not allow it. The only way to avoid illegal behavior would be to have unlink never return or somehow don't throw the link exception (the latter looks impossible to me).

The second attempt is the one I would expect to work. The problem is determining what tweak of unlink semantics would make it to work.

I would like unlink to be uninterruptible, but I don't see how that could be implemented. An alternative is to guarantee that when unlink runs with exceptions masked, it always unlinks before returning, but it might return with an exception if it is interrupted during its execution.

The text was updated successfully, but these errors were encountered:

hyperthunk · 2016-06-23T10:36:34Z

That last paragraph sounds like the most pragmatic approach to me?

mboes · 2016-06-23T19:04:43Z

I'm not sure what exactly the problem is here. Until unlink completes, because I called link, I accept that I'll have some exception thrown at me at any time. So if someone else throws an async exception at me, is that really a problem?

facundominguez · 2016-06-23T21:15:37Z

Yes it is, the link exception could interrupt the important clean up in:

bracket_ importantResource importantCleanup $ bracket_ (link pid) (unlink pid) ...

hm, though probably we could argue that importantCleanup should be made uninterruptible ...

Another bad example:

timeout $ ... bracket_ (link pid) (unlink pid) ...

The link exception could arrive after timeout completes if the timeout occurs just when unlink is called.

mboes · 2016-06-24T19:38:18Z

There is no need to bracketize unlink because if unlink doesn't get called that's quite okay: link is just there to signal that the current thread should have an async exception thrown at it in case a peer disappears, but if it's gotten one anyways then there's no point in calling unlink.

And aplications shouldn't make any assumptions about what order async exceptions will be delivered in, because CH provides no such guarantees.

hyperthunk · 2018-11-08T09:55:26Z

Since that's the case, can we close this issue? Should I add something about this to the documentation?

It feels to me like I should probably compile the various issues and conversations around async exceptions in Cloud Haskell - and there have been many - into a page, and publish it to the website. Or would it be better placed in the haddocks?

facundominguez · 2018-11-08T12:18:21Z

but if it's gotten one anyways then there's no point in calling unlink.

I'm not getting why there is no point in calling unlink if the original exception is caught and handled. Or if I did get it in the past, I've forgotten how :)

+1 for discussing async exceptions in the documentation.

hyperthunk · 2018-11-09T13:10:47Z

Yes actually that's a good question...

I think the answer is in the client's court. Calling unlink means I no logger want link exceptions. If I want to avoid having some other arbitrary async exception interfere with unlink, I can mask it. One might argue that you should always mask unlink, but that's another matter entirely.

Now assuming you mask unlink and miss the link exception, who cares? You were done with linking anyway. The more onerous issue is that you might've missed another async exception that mattered to you more than the link exception.

The broader design issue here, and we should document this when we update the docs around async exception behaviour in general, is don't use links if you require time to do cleanup. Instead you should use monitors, and kill your process in response to a monitor notification if that's the semantics you require.

Better still, use supervisors, and encode the dependency between running processes that way. If all those processes are managed processes based on the client-server library, they'll each get a chance to do cleanup in a structured fashion, and can even deal with their inbox without risking message loss by using the safe message filters available in the client-server or fsm libraries.

So I guess my perspective is, these issues are probably best solved higher up the value chain in libraries, rather than having complex semantics surrounding the primitives those libraries depend on.

Not many erlang programmers use messaging primitives directly all that often. 95% of production erlang code uses gen_servers and supervisors, which have well understood models for fault tolerance, built on the underlying semantics.

facundominguez · 2018-11-09T21:49:35Z

So I guess my perspective is, these issues are probably best solved higher up the value chain in libraries, rather than having complex semantics surrounding the primitives those libraries depend on.

I agree that the semantics shouldn't be complex. The primitives in the foundations should be simple to use. But, alas, unlink is not so easy. After thinking it some more I still don't see a clear way to improve it.

I'm fine promoting this task to discussing the problem in the documentation.

hyperthunk · 2018-11-09T22:36:25Z

Looking at the code, I'm not sure some of your assertions hold @facundominguez. After sending a control message to the node controller, unlink waits for a message of type DidUnlinkProcess, not an exception. So assuming we run unlink with uninterruptibleMask, why do we think it might never return?

hyperthunk · 2018-11-09T22:38:43Z

And in actual fact, ncEffectUnlink in the node controller doesn't even care if the target (i.e. linked-to process) is alive or not. So there's not even a race, because even if the the linked-to process dies before the node controller handles the signal raised by unlinkAsync in the calling process, it will still send DidUnlink back to the caller.

hyperthunk self-assigned this Nov 9, 2018

hyperthunk added the task label Nov 9, 2018

hyperthunk added this to To do in Release 0.8 Nov 21, 2018

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

link/unlink is exception unsafe #286

link/unlink is exception unsafe #286

facundominguez commented May 3, 2016 •

edited

hyperthunk commented Jun 23, 2016

mboes commented Jun 23, 2016

facundominguez commented Jun 23, 2016 •

edited

mboes commented Jun 24, 2016

hyperthunk commented Nov 8, 2018

facundominguez commented Nov 8, 2018 •

edited

hyperthunk commented Nov 9, 2018

facundominguez commented Nov 9, 2018

hyperthunk commented Nov 9, 2018 •

edited

hyperthunk commented Nov 9, 2018

link/unlink is exception unsafe #286

link/unlink is exception unsafe #286

Comments

facundominguez commented May 3, 2016 • edited

hyperthunk commented Jun 23, 2016

mboes commented Jun 23, 2016

facundominguez commented Jun 23, 2016 • edited

mboes commented Jun 24, 2016

hyperthunk commented Nov 8, 2018

facundominguez commented Nov 8, 2018 • edited

hyperthunk commented Nov 9, 2018

facundominguez commented Nov 9, 2018

hyperthunk commented Nov 9, 2018 • edited

hyperthunk commented Nov 9, 2018

facundominguez commented May 3, 2016 •

edited

facundominguez commented Jun 23, 2016 •

edited

facundominguez commented Nov 8, 2018 •

edited

hyperthunk commented Nov 9, 2018 •

edited