Root causing protonj "no current delivery" #27716
Labels: amqp (tracks issues related to AMQP), Event Hubs, pillar-reliability (the issue is related to reliability, one of our core engineering pillars; includes stress testing), Service Bus
Background
Two customer tickets reported the ProtonJ error "no current delivery" in their logs. When the error occurred in the lower layer, it activated a code path in the upper layer that triggered a reliability issue; this-pr addressed that issue.
This ticket is created so that we can focus on the ProtonJ error "no current delivery" itself.
Root causing
As part of this investigation, we learned that the ProtonJ object state can be mutated (e.g., disposed) after the work is "scheduled" on the message-pump but before the work is actually "executed".
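The window between scheduling and execution can be illustrated with a small, self-contained model. This is only a sketch under stated assumptions: `FakeReceiver` and `MessagePump` are hypothetical stand-ins written for this illustration, not ProtonJ or SDK types, and the exception type and message are simulated to mirror the logged error.

```java
import java.util.ArrayDeque;
import java.util.Queue;

public class ScheduleThenDisposeSketch {

    /** Hypothetical stand-in for a ProtonJ-backed receiver; not the real API. */
    static final class FakeReceiver {
        private boolean disposed;

        void dispose() {
            disposed = true;
        }

        void readDelivery() {
            // Simulated failure: the state this work relied on is gone.
            if (disposed) {
                throw new IllegalStateException("no current delivery");
            }
        }
    }

    /** Hypothetical single-threaded message pump: work is queued now and run later. */
    static final class MessagePump {
        private final Queue<Runnable> queue = new ArrayDeque<>();

        void schedule(Runnable work) {
            queue.add(work);
        }

        void drain() {
            Runnable work;
            while ((work = queue.poll()) != null) {
                work.run();
            }
        }
    }

    /** Returns the message of the error raised by the scheduled work, or null if none. */
    static String run() {
        FakeReceiver receiver = new FakeReceiver();
        MessagePump pump = new MessagePump();

        pump.schedule(receiver::readDelivery); // state is valid at scheduling time
        receiver.dispose();                    // state is mutated before execution
        try {
            pump.drain();                      // work executes against the disposed state
            return null;
        } catch (IllegalStateException e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) {
        System.out.println(run());
    }
}
```

In this model nothing is wrong at the moment the work is scheduled; the failure only appears because the check on the receiver state is not repeated when the work finally runs.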
Sketching the control flow, the cause of "no current delivery" appears to be such a mutation; the flow appears to be as below:
The exception in 9 goes to global error handling as an unhandled exception. This causes three issues:
a. The Sink in 5 never errored
b. The completion-event in 8 never emits (because it is enqueued after a)
c. The unhandled exception forces the parent amqp-connection to close
The good thing is that no one downstream is waiting to react to the "terminal event" from a (or b). But c is concerning: the closure of one amqp-receiver-link, combined with the race, caused the amqp-connection to close, which disconnects all other healthy amqp-receiver-links on the same amqp-connection from the service.
Fix
Again, the fix is as simple as this ticket: do a deferred check for disposal of the Receiver, so that the side effect is limited to that specific Receiver and is not propagated to the parent connection.
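A minimal sketch of what such a deferred check could look like, assuming a simplified model rather than the actual SDK change: `FakeReceiver`, `guarded`, and the queue used as a message pump are hypothetical names introduced for this illustration. The disposal check runs when the work executes, not when it is scheduled, and a disposed receiver is failed locally instead of letting an exception escape to shared error handling.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

public class DeferredDisposalCheckSketch {

    /** Hypothetical receiver model; not a ProtonJ or SDK type. */
    static final class FakeReceiver {
        private boolean disposed;
        private final List<String> events = new ArrayList<>();

        void dispose() {
            disposed = true;
        }

        boolean isDisposed() {
            return disposed;
        }

        void readDelivery() {
            if (disposed) {
                throw new IllegalStateException("no current delivery");
            }
            events.add("delivered");
        }

        /** Terminates only this receiver; nothing is thrown to the caller. */
        void fail(String reason) {
            events.add("error: " + reason);
        }

        List<String> events() {
            return events;
        }
    }

    /** Wraps the work so the disposal check happens at execution time. */
    static Runnable guarded(FakeReceiver receiver) {
        return () -> {
            if (receiver.isDisposed()) {
                receiver.fail("receiver disposed before scheduled work ran");
                return;
            }
            receiver.readDelivery();
        };
    }

    /** Runs two receivers on one shared pump and returns their recorded events. */
    static List<String> run() {
        Queue<Runnable> pump = new ArrayDeque<>(); // stands in for the shared message-pump
        FakeReceiver closing = new FakeReceiver();
        FakeReceiver healthy = new FakeReceiver();

        pump.add(guarded(closing));
        pump.add(guarded(healthy));
        closing.dispose(); // disposed after scheduling, before execution

        Runnable work;
        while ((work = pump.poll()) != null) {
            work.run(); // no exception escapes, so the pump keeps going
        }

        List<String> out = new ArrayList<>(closing.events());
        out.addAll(healthy.events());
        return out;
    }

    public static void main(String[] args) {
        System.out.println(run());
    }
}
```

In this model the disposed receiver records its own error while the healthy receiver sharing the same pump still receives its delivery, which is the isolation property the fix is after.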
References
The customer tickets - #24575 and #26975