EventHubs stops processing outgoing messages #17568

Closed
oceansv opened this issue Nov 13, 2020 · 4 comments
Labels
Client This issue points to a problem in the data-plane of the library. customer-reported Issues that are reported by GitHub users external to the Azure organization. Event Hubs pillar-reliability The issue is related to reliability, one of our core engineering pillars. (includes stress testing) question The issue doesn't require a change to the product in order to be resolved. Most issues start as that

Comments

@oceansv

oceansv commented Nov 13, 2020

Describe the bug

We have a Spring Boot project that uses com.azure:azure-messaging-eventhubs version 5.3.1 and the EventProcessorClient API to consume messages from a topic.

From time to time, the listener stops receiving messages from the topic
even though our application is still up.

The blue line in the graph shows incoming messages.
The orange line shows outgoing messages.

There are time windows, listed below, in which we have incoming messages
but no outgoing messages.

[Image: graph of incoming and outgoing messages]

1:34 to 12:15 incoming msgs but no outgoing msgs
12:33 to 1:22 incoming msgs but no outgoing msgs
similarly other time windows.

In one scenario, the listener stopped receiving messages from the topic but kept updating the ownership last-modified time in the StorageV2 checkpoint blob storage.
At the same time, a few partitions kept receiving messages while others stopped.
When we restarted the app, the listener started receiving messages again.
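
One way to spot the symptom described above (some partitions going quiet while the application stays up) is to record the time of the last event per partition and flag partitions that have been silent for too long. The sketch below is a minimal illustration in plain Java. `PartitionStallDetector` and its methods are hypothetical names and are not part of azure-messaging-eventhubs; it assumes the event handler calls `recordEvent` for every event it receives.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical helper (not an Azure SDK class): tracks when each partition last delivered an event. */
final class PartitionStallDetector {
    private final Map<String, Instant> lastEventAt = new ConcurrentHashMap<>();
    private final Duration threshold;

    PartitionStallDetector(Duration threshold) {
        this.threshold = threshold;
    }

    /** Assumed to be called from the event handler each time an event arrives for a partition. */
    void recordEvent(String partitionId, Instant receivedAt) {
        lastEventAt.put(partitionId, receivedAt);
    }

    /** Returns the ids of partitions whose latest event is older than the threshold, sorted lexicographically. */
    List<String> stalledPartitions(Instant now) {
        List<String> stalled = new ArrayList<>();
        for (Map.Entry<String, Instant> entry : lastEventAt.entrySet()) {
            Duration silentFor = Duration.between(entry.getValue(), now);
            if (silentFor.compareTo(threshold) > 0) {
                stalled.add(entry.getKey());
            }
        }
        Collections.sort(stalled);
        return stalled;
    }
}
```

A partition that simply has no incoming traffic would also be reported, so the result is only meaningful when compared against the incoming-message metric for the same period. Partitions that never delivered an event are not tracked at all in this sketch.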

Exception or Stack Trace
We see several different exceptions in our logs very frequently, but we don't know whether they are the actual root cause.

  1. Exception occurred when handling event in super.

  2. Error occurred in XXXXX partition processor for partition 8, com.azure.core.amqp.exception.AmqpException: The connection was inactive for more than the allowed 240000 milliseconds and is closed by container '4ddce78e588743768330a20fc913f4f5_G14'., errorContext[NAMESPACE: ehns-control-tower-prod-scus-1.servicebus.windows.net, PATH: YYYYY/ConsumerGroups/XXXX/Partitions/8, REFERENCE_ID: 8_79eeb9_1605218761061, LINK_CREDIT: 473]

  3. Error occurred in partition processor for partition 12, java.util.concurrent.TimeoutException: Did not observe any item or terminal signal within 60000ms in 'takeUntil' (and no fallback has been configured)

  4. Error occurred in partition processor for partition 17, com.azure.core.amqp.exception.AmqpException: New receiver 'nil' with higher epoch of '0' is created hence current receiver 'nil' with epoch '0' is getting disconnected. If you are recreating the receiver, make sure a higher epoch is used. TrackingId:9424c7270001399b000071e85fadc19f_G27_B27, SystemTracker:ehns-control-tower-prod-scus-1:eventhub:XXXXX~18431|YYYY, Timestamp:2020-11-12T23:14:59, errorContext[NAMESPACE: ehns-control-tower-prod-scus-1.servicebus.windows.net, PATH: XXXXX/ConsumerGroups/YYYYYY/Partitions/17, REFERENCE_ID: 17_0989fc_1605218311041, LINK_CREDIT: 447]

@ghost ghost added needs-triage This is a new issue that needs to be triaged to the appropriate team. customer-reported Issues that are reported by GitHub users external to the Azure organization. question The issue doesn't require a change to the product in order to be resolved. Most issues start as that labels Nov 13, 2020
@joshfree joshfree changed the title [BUG] EventHubs stops processing outgoing messages Nov 13, 2020
@joshfree joshfree added Client This issue points to a problem in the data-plane of the library. Event Hubs pillar-reliability The issue is related to reliability, one of our core engineering pillars. (includes stress testing) labels Nov 13, 2020
@ghost ghost removed the needs-triage This is a new issue that needs to be triaged to the appropriate team. label Nov 13, 2020
@joshfree
Member

@conniey could you please follow up?

@Pawelgasieniec

Hi,
We have very similar problems.
EventProcessorClient seems to lose connection to some or all partitions. Then it claims that it reconnected but never receives any events.

Exceptions which we see most often:

  1. c.a.m.e.PartitionBasedLoadBalancer : Load balancing for event processor failed - Did not observe any item or terminal signal within 60000ms in 'flatMapMany' (and no fallback has been configured)
    Did not observe any item or terminal signal within 60000ms in 'flatMapMany' (and no fallback has been configured)
  2. c.o.ds_bridge.eventhub.EventProcessor :Did not observe any item or terminal signal within 60000ms in 'flatMapMany' (and no fallback has been configured)
  3. c.a.c.a.i.RequestResponseChannel : Exception in RequestResponse links. Disposing and clearing unconfirmed sends.
  4. Retry #1. Transient error occurred. Retrying after 4511 ms.
    The connection was inactive for more than the allowed 300000 milliseconds and is closed by container 'LinkTracker'. TrackingId:c7c7dc00e80b46f19d0dfc9e7a447bf0_G10S2, SystemTracker:gateway5, Timestamp:2020-12-08T03:04:16, errorContext

Can't tell which exactly is the real issue. Sometimes it works for days without any problem, sometimes it fails after 1 hour and never recovers.

Calling client.stop followed by client.start doesn't help, and neither does creating a new EventProcessorClient. Only a full application restart has been shown to "fix" the problem.
We tried both the 5.2.0 and 5.3.1 versions of the azure-messaging-eventhubs artifact, with similar results.

We use Azure Kubernetes to run the application. OS is Ubuntu-18, Java version 8.

BR
Pawel

@ameetkonnur

We are seeing similar issues too.
We have a Standard EventHub implementation with consumers written in Spring Boot and hosted in AKS.

When we scale down the number of consumer pods, the remaining consumers stop processing events; when we scale back up, processing starts again. All events between the scale-down and the scale-up are lost.
Example: 5 pods scaled down by 2 to 3 pods: events are not processed.
To process events again, scale up by at least one: 3 pods + (1 or more) starts processing.
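
For context on what a scale-down asks of the remaining consumers, the arithmetic can be sketched as follows, assuming partitions are spread as evenly as possible across the running processors. The partition count of 32 is a made-up figure for illustration (the thread does not state it), and `PartitionShare` is a hypothetical helper, not a library class.

```java
/** Hypothetical helper: computes an even split of partitions across processors. */
final class PartitionShare {
    private PartitionShare() {
    }

    /**
     * Returns {minPerProcessor, maxPerProcessor, processorsWithOneExtra}
     * for an even distribution of partitions over processors.
     */
    static int[] balancedShare(int partitions, int processors) {
        if (partitions < 0 || processors <= 0) {
            throw new IllegalArgumentException("partitions must be >= 0 and processors > 0");
        }
        int min = partitions / processors;          // every processor owns at least this many
        int extra = partitions % processors;        // this many processors own one more
        int max = (extra == 0) ? min : min + 1;
        return new int[] {min, max, extra};
    }
}
```

Under these assumptions, 32 partitions over 5 pods gives 6 or 7 partitions per pod, and over 3 pods gives 10 or 11. After a scale-down from 5 to 3, each remaining pod therefore has to claim several additional partitions, so a failure in load balancing at that point would leave those partitions unprocessed.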

Restarting the pods after the scale-down restarts the processing.

I checked the com.microsoft.azure:spring-cloud-azure-eventhubs-stream-binder version, and it is 1.2.8.

We have also seen the intermittent errors below in the consumers.
• Load balancing failed for event processing. connection aborted.
• Missing “azure_checkpointer” header

Any pointers on how to resolve this will help.

@conniey
Member

conniey commented Feb 20, 2021

Closed as a duplicate of #18070

@conniey conniey closed this as completed Feb 20, 2021
@github-actions github-actions bot locked and limited conversation to collaborators Apr 12, 2023

5 participants