EventHubs stops processing outgoing messages #17568

oceansv · 2020-11-13T10:32:17Z

Describe the bug

We have a Spring Boot project that uses com.azure:azure-messaging-eventhubs version 5.3.1 and the EventProcessorClient API to listen for messages from a topic.

From time to time the listener stops receiving messages from the topic
even though our application is up.

The blue line in the graph shows incoming messages.
The orange line shows outgoing messages.

The graph shows time windows, listed below, with incoming messages
but no outgoing messages:

1:34 to 12:15: incoming messages but no outgoing messages
12:33 to 1:22: incoming messages but no outgoing messages
The same happens in other time windows.

In one scenario, the listener stopped receiving messages from the topic but kept updating the ownership last-modified time in the StorageV2 checkpoint blob storage.
At the same time, a few partitions were still receiving messages while others had stopped.
When we restarted the app, the listener started receiving messages again.
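
For anyone who wants to detect this state from inside the application, one option is a small per-partition stall detector. The following is only a minimal sketch under stated assumptions, not part of azure-messaging-eventhubs: `PartitionStallDetector`, `recordActivity`, `forget`, and `stalledPartitions` are hypothetical names, and the silence threshold is something you would have to choose for your own traffic. The idea is to record a timestamp each time the event handler runs for a partition and to report the partitions that have been silent for longer than the threshold. A partition that is simply idle (no incoming messages) would also be reported, so the result is best compared against the incoming-message metric.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical helper (not an SDK class): tracks the last activity time per partition. */
final class PartitionStallDetector {
    private final Duration maxSilence;
    private final Map<String, Instant> lastActivity = new ConcurrentHashMap<>();

    PartitionStallDetector(Duration maxSilence) {
        this.maxSilence = maxSilence;
    }

    /** Call when a partition is claimed and each time an event is handled for it. */
    void recordActivity(String partitionId, Instant now) {
        lastActivity.put(partitionId, now);
    }

    /** Call when a partition is released so that it is no longer reported. */
    void forget(String partitionId) {
        lastActivity.remove(partitionId);
    }

    /** Returns the sorted IDs of tracked partitions silent for longer than maxSilence. */
    List<String> stalledPartitions(Instant now) {
        List<String> stalled = new ArrayList<>();
        for (Map.Entry<String, Instant> entry : lastActivity.entrySet()) {
            if (Duration.between(entry.getValue(), now).compareTo(maxSilence) > 0) {
                stalled.add(entry.getKey());
            }
        }
        Collections.sort(stalled);
        return stalled;
    }
}
```

In an EventProcessorClient setup, `recordActivity` could be called from the `processEvent` handler with `eventContext.getPartitionContext().getPartitionId()`, and `forget` from the `processPartitionClose` handler. A scheduled task could then log `stalledPartitions(Instant.now())` or expose it through a health endpoint. This only surfaces the symptom described above; it does not fix the underlying receive problem.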

Exception or Stack Trace
Our logs contain several different exceptions that occur very frequently. We don't know whether any of them is the actual root cause.

Exception occurred when handling event in super.
Error occurred in XXXXX partition processor for partition 8, com.azure.core.amqp.exception.AmqpException: The connection was inactive for more than the allowed 240000 milliseconds and is closed by container '4ddce78e588743768330a20fc913f4f5_G14'., errorContext[NAMESPACE: ehns-control-tower-prod-scus-1.servicebus.windows.net, PATH: YYYYY/ConsumerGroups/XXXX/Partitions/8, REFERENCE_ID: 8_79eeb9_1605218761061, LINK_CREDIT: 473]
Error occurred in partition processor for partition 12, java.util.concurrent.TimeoutException: Did not observe any item or terminal signal within 60000ms in 'takeUntil' (and no fallback has been configured)
Error occurred in partition processor for partition 17, com.azure.core.amqp.exception.AmqpException: New receiver 'nil' with higher epoch of '0' is created hence current receiver 'nil' with epoch '0' is getting disconnected. If you are recreating the receiver, make sure a higher epoch is used. TrackingId:9424c7270001399b000071e85fadc19f_G27_B27, SystemTracker:ehns-control-tower-prod-scus-1:eventhub:XXXXX~18431|YYYY, Timestamp:2020-11-12T23:14:59, errorContext[NAMESPACE: ehns-control-tower-prod-scus-1.servicebus.windows.net, PATH: XXXXX/ConsumerGroups/YYYYYY/Partitions/17, REFERENCE_ID: 17_0989fc_1605218311041, LINK_CREDIT: 447]

joshfree · 2020-11-13T18:22:35Z

@conniey could you please follow up?

Pawelgasieniec · 2020-12-08T11:35:58Z

Hi,
We have very similar problems.
EventProcessorClient seems to lose connection to some or all partitions. Then it claims that it reconnected but never receives any events.

Exceptions which we see most often:

c.a.m.e.PartitionBasedLoadBalancer : Load balancing for event processor failed - Did not observe any item or terminal signal within 60000ms in 'flatMapMany' (and no fallback has been configured)
Did not observe any item or terminal signal within 60000ms in 'flatMapMany' (and no fallback has been configured)
c.o.ds_bridge.eventhub.EventProcessor :Did not observe any item or terminal signal within 60000ms in 'flatMapMany' (and no fallback has been configured)
c.a.c.a.i.RequestResponseChannel : Exception in RequestResponse links. Disposing and clearing unconfirmed sends.
Retry #1. Transient error occurred. Retrying after 4511 ms.
The connection was inactive for more than the allowed 300000 milliseconds and is closed by container 'LinkTracker'. TrackingId:c7c7dc00e80b46f19d0dfc9e7a447bf0_G10S2, SystemTracker:gateway5, Timestamp:2020-12-08T03:04:16, errorContext

We can't tell which of these is the real issue. Sometimes it works for days without any problem; sometimes it fails after 1 hour and never recovers.

Calling client.stop followed by client.start doesn't help. Neither does creating a new EventProcessorClient. Only a full application restart has been shown to "fix" the problem.
We tried both 5.2.0 and 5.3.1 versions of azure-messaging-eventhubs artifact with similar results.

We use Azure Kubernetes to run the application. OS is Ubuntu-18, Java version 8.

BR
Pawel

ameetkonnur · 2021-02-18T03:38:04Z

We are seeing similar issues too.
We have a Standard Event Hubs implementation with consumers written in Spring Boot and hosted in AKS.

When we scale down the number of consumer pods, the remaining consumers stop processing events. When we scale back up, processing starts again. All events between the scale-down and the scale-up are lost.
Example: 5 pods scaled down by 2 to 3 pods: doesn’t process.
To process again, scale up by at least one: 3 pods + (1 or more) starts processing.

Restarting the pods after the scale-down also restarts processing.

The com.microsoft.azure:spring-cloud-azure-eventhubs-stream-binder version is 1.2.8.

We have also seen the following intermittent errors in the consumers:
• Load balancing failed for event processing. connection aborted.
• Missing “azure_checkpointer” header

Any pointers on how to resolve this would help.

conniey · 2021-02-20T00:22:54Z

Closed as a duplicate of #18070

joshfree changed the title ~~[BUG]~~ EventHubs stops processing outgoing messages Nov 13, 2020

joshfree added the Client, Event Hubs, and pillar-reliability labels Nov 13, 2020

ghost removed the needs-triage label Nov 13, 2020

joshfree assigned conniey Nov 13, 2020

conniey mentioned this issue Feb 20, 2021: [BUG] EventHub Consumer stops consuming messages until we restart #18070 (closed)

conniey closed this as completed Feb 20, 2021

github-actions bot locked and limited conversation to collaborators Apr 12, 2023
