Root causing protonj "no current delivery" #27716

Closed
anuchandy opened this issue Mar 16, 2022 · 0 comments · Fixed by #27839


anuchandy commented Mar 16, 2022

Background

Two customer tickets reported the ProtonJ error "no current delivery" in their logs. When the error occurred in the lower layer, it activated a code path in the upper layer that triggered a reliability issue; this-pr addressed that upper-layer issue.

This ticket was created to focus on the ProtonJ error "no current delivery" itself.

Root causing

As part of this investigation, we learned that the ProtonJ object state can be mutated (e.g., disposed) after work is "scheduled" on the message-pump but before that work is actually "executed".

Sketching the control flow, such a mutation looks like the reason for "no current delivery"; the flow appears to be as follows -

  1. The ProtonJ Receiver receives a delivery. It updates its _current pointer to refer to this delivery, then hands the delivery over to ReceiveLinkHandler.
  2. ReceiveLinkHandler enqueues the delivery to deliveries-Flux.
  3. The async-chain attached to deliveries-Flux emits the delivery to ReactorReceiver.
  4. The ReactorReceiver "schedules" work on the message-pump thread to decode this delivery.
  5. The async-chain waits asynchronously for the Sink to complete. The completion happens only when the work scheduled in 4 "executes" in the future.
  6. Traffic from the service closes the amqp-receive-link, which in turn frees the ProtonJ Receiver on the message-pump thread. Note: the user-work in 4 and the proton-j work in 6 do not execute in strict FIFO order (two different pipes drive them), but serial execution is guaranteed.
  7. As part of freeing the ProtonJ Receiver, the ProtonJ library frees the delivery (the one that 4 refers to), settling the delivery and advancing the _current pointer. _current becomes null.
  8. ReceiveLinkHandler enqueues a completion-event to deliveries-Flux. Emission of the completion-event to the async-chain is deferred because the Sink for the last event (5) has yet to complete.
  9. The work scheduled in 4 executes, but since _current is now null rather than the delivery the work refers to, ProtonJ throws "no current delivery".

The exception in 9 goes to global error handling as an unhandled exception. This leads to three issues -

a. The Sink in 5 never errors
b. The completion-event in 8 never emits (because it is enqueued behind a)
c. The unhandled exception forces the parent amqp-connection to close

The good news is that nothing downstream is waiting to react to a "terminal event" from a (or b). But c is concerning: the closure of one amqp-receiver-link, combined with the race, causes the amqp-connection to close, which disconnects all other healthy amqp-receiver-links on the same amqp-connection from the service.
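The ordering in steps 1, 4, 6, 7 and 9 can be reproduced in miniature without ProtonJ. The sketch below is only an illustration under stated assumptions: `FakeReceiver`, `RaceSketch`, and the two queues are hypothetical stand-ins for the ProtonJ Receiver and the two "pipes" feeding the message-pump, not real library types.

```java
import java.util.ArrayDeque;
import java.util.Queue;

/** Hypothetical stand-in for a ProtonJ receiver; not the real API. */
final class FakeReceiver {
    private Object current; // models the _current pointer

    void onDelivery(Object delivery) {
        current = delivery;
    }

    /** Models freeing the receiver: the delivery is released and _current becomes null. */
    void free() {
        current = null;
    }

    Object recv() {
        if (current == null) {
            throw new IllegalStateException("no current delivery");
        }
        return current;
    }
}

final class RaceSketch {
    static String run() {
        // Two pipes drained by one pump: serial execution, but no FIFO order across pipes.
        Queue<Runnable> userWork = new ArrayDeque<>();
        Queue<Runnable> protonWork = new ArrayDeque<>();
        FakeReceiver receiver = new FakeReceiver();

        receiver.onDelivery("delivery-1"); // step 1
        userWork.add(receiver::recv);      // step 4: decode is scheduled, not executed
        protonWork.add(receiver::free);    // steps 6-7: link closes, receiver is freed

        try {
            protonWork.forEach(Runnable::run); // the free happens to run first
            userWork.forEach(Runnable::run);   // step 9: decode runs against a null _current
            return "decoded";
        } catch (IllegalStateException e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) {
        System.out.println(run()); // prints "no current delivery"
    }
}
```

In the sketch the exception is caught locally; in the flow described above nothing catches it, which is why it reaches the global error handler.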

Fix

The fix is as simple as this ticket: do a deferred check for disposal of the Receiver, so that the side effect is limited to that specific Receiver and does not propagate to the parent connection.
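One way to picture a deferred disposal check is shown below. This is a minimal sketch under assumptions, not the change made in the linked fix: `GuardedReceiver`, `DeferredCheckSketch`, `scheduleDecode`, and the use of `CompletableFuture` as a stand-in for the Sink are all hypothetical. The point is that the check runs when the work executes, not when it is scheduled, and a failure is routed to that receiver's own sink.

```java
import java.util.ArrayDeque;
import java.util.Queue;
import java.util.concurrent.CompletableFuture;

/** Hypothetical receiver stand-in that also tracks disposal; not the real API. */
final class GuardedReceiver {
    private Object current;
    private boolean disposed;

    void onDelivery(Object delivery) {
        current = delivery;
    }

    void free() {
        current = null;
        disposed = true;
    }

    boolean isDisposed() {
        return disposed;
    }

    Object recv() {
        if (current == null) {
            throw new IllegalStateException("no current delivery");
        }
        return current;
    }
}

final class DeferredCheckSketch {
    /** Schedules a decode; the disposal check is deferred to execution time. */
    static CompletableFuture<Object> scheduleDecode(GuardedReceiver receiver, Queue<Runnable> pump) {
        CompletableFuture<Object> sink = new CompletableFuture<>(); // stand-in for the Sink
        pump.add(() -> {
            if (receiver.isDisposed()) {
                // Fail only this receiver's sink; nothing escapes to the pump.
                sink.completeExceptionally(
                    new IllegalStateException("receiver disposed before decode"));
                return;
            }
            sink.complete(receiver.recv());
        });
        return sink;
    }

    static boolean run() {
        Queue<Runnable> pump = new ArrayDeque<>();
        GuardedReceiver receiver = new GuardedReceiver();
        receiver.onDelivery("delivery-1");

        CompletableFuture<Object> sink = scheduleDecode(receiver, pump);
        receiver.free();             // the link closes before the scheduled work executes
        pump.forEach(Runnable::run); // the work runs and does not throw

        return sink.isCompletedExceptionally();
    }

    public static void main(String[] args) {
        System.out.println(run()); // prints "true"
    }
}
```

With this shape the pump never sees an unhandled exception, so a parent connection modeled on top of it would have no reason to close.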

References

The customer tickets - #24575 and #26975

@anuchandy anuchandy added Event Hubs Service Bus pillar-reliability The issue is related to reliability, one of our core engineering pillars. (includes stress testing) amqp Label for tracking issues related to AMQP labels Mar 16, 2022
@anuchandy anuchandy added this to the [2022] April milestone Mar 16, 2022
@github-actions github-actions bot locked and limited conversation to collaborators Apr 11, 2023