Troubleshoot Event Hubs issues

This troubleshooting guide covers failure investigation techniques, common errors for the credential types in the Azure Event Hubs Python client library, and mitigation steps to resolve these errors.

Handle Event Hubs exceptions

All Event Hubs exceptions are wrapped in an EventHubError. They often have an underlying AMQP error code that indicates whether an error should be retried. For retryable errors (e.g., amqp:connection:forced or amqp:link:detach-forced), the client library attempts to recover based on the retry options specified when instantiating the client. To configure retry options, follow the Client Creation sample. If the error is non-retryable, there is a configuration issue that needs to be resolved.

To resolve the specific condition that an AMQP exception represents, follow the Event Hubs Messaging Exceptions guidance.

Find relevant information in exception messages

An EventHubError contains three fields which describe the error:

  • message: The underlying AMQP error message. A description of the errors can be found in the Exceptions module or the OASIS AMQP 1.0 spec.
  • error: The error condition if available.
  • details: The error details, if included in the service response.
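As a minimal sketch of how these three fields might be surfaced in application logs, the helper below reads them defensively with `getattr`, so it also works for exceptions that lack one of the fields. The names `describe_error`, `send_one`, and the logger name are hypothetical and chosen for illustration only; the SDK import is deferred into `send_one` so the helper can be used without the SDK installed.

```python
import logging

# Hypothetical logger name used only for this illustration.
logger = logging.getLogger("eventhub_troubleshooting")


def describe_error(exc: Exception) -> dict:
    """Collect the descriptive fields of an error, tolerating missing ones."""
    return {
        "type": type(exc).__name__,
        "message": getattr(exc, "message", str(exc)),
        "error": getattr(exc, "error", None),
        "details": getattr(exc, "details", None),
    }


def send_one(producer, body: str) -> None:
    """Send a single event and log the error fields if the send fails."""
    # Imported lazily so describe_error() stays usable without the SDK.
    from azure.eventhub import EventData
    from azure.eventhub.exceptions import EventHubError

    try:
        batch = producer.create_batch()
        batch.add(EventData(body))
        producer.send_batch(batch)
    except EventHubError as exc:
        logger.error("Event Hubs operation failed: %s", describe_error(exc))
        raise
```

Logging the condition in `error` alongside `message` makes it easier to match a failure against the AMQP conditions discussed in this guide.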

By default, the producer and consumer clients retry on retryable error conditions. Rather than retrying on their own, users should change the retry behavior by passing the following keyword arguments when creating the client:

  • retry_total: The total number of attempts to redo a failed operation when an error occurs. Default value is 3.
  • retry_backoff_factor: A backoff factor to apply between attempts after the second try.
  • retry_backoff_max: The maximum back off time. Default value is 120 seconds.
  • retry_mode: The delay behavior between retry attempts. Supported values are 'fixed' or 'exponential', where default is 'exponential'.
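The keyword arguments above could be passed at client creation roughly as follows. This is a sketch, not a recommended configuration: the values are arbitrary examples, and `build_retry_kwargs` and `create_producer` are hypothetical helper names. The connection string and Event Hub name are supplied by the caller.

```python
def build_retry_kwargs(total=5, backoff_factor=0.8, backoff_max=60, mode="exponential"):
    """Return retry keyword arguments for an Event Hubs client (illustrative values)."""
    if mode not in ("fixed", "exponential"):
        raise ValueError("retry_mode must be 'fixed' or 'exponential'")
    return {
        "retry_total": total,
        "retry_backoff_factor": backoff_factor,
        "retry_backoff_max": backoff_max,
        "retry_mode": mode,
    }


def create_producer(conn_str: str, eventhub_name: str):
    """Create a producer whose retry behavior is set through keyword arguments."""
    # Imported lazily so build_retry_kwargs() can be used without the SDK installed.
    from azure.eventhub import EventHubProducerClient

    return EventHubProducerClient.from_connection_string(
        conn_str,
        eventhub_name=eventhub_name,
        **build_retry_kwargs(),
    )
```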

Commonly encountered exceptions

ConnectionLostError Exception

When the connection to Event Hubs is idle, the service disconnects the client after some time and a ConnectionLostError exception is raised. The underlying AMQP conditions that cause this are amqp:connection:forced and amqp:link:detach-forced. This is not a problem, because the clients re-establish the connection when the next service operation is requested. More information can be found in the AMQP troubleshooting documentation.

Permission issues

An AuthenticationError means that the provided credentials do not permit the client to perform the requested action (receiving or sending) with Event Hubs.

The article Troubleshoot authentication and authorization issues with Event Hubs lists other possible solutions.

Connectivity issues

Timeout when connecting to service

  • Verify that the connection string or fully qualified domain name specified when creating the client is correct. Get an Event Hubs connection string demonstrates how to acquire a connection string.
  • Check the firewall and port permissions in your hosting environment and that the AMQP ports 5671 and 5672 are open.
    • Make sure that the endpoint is allowed through the firewall.
  • Try using WebSockets, which connects on port 443. See configure web sockets sample.
  • See if your network is blocking specific IP addresses.
  • If applicable, check the proxy configuration. See configure proxy sample.
  • For more information about troubleshooting network connectivity, refer to Event Hubs troubleshooting.
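A quick way to test the firewall and port items in the checklist above is a plain TCP connection attempt from the hosting environment. The sketch below uses only the standard library; `can_reach` and `check_event_hubs_ports` are hypothetical helper names, and the default port list assumes the standard AMQP ports (5671 and 5672) plus 443 for WebSockets. A successful TCP connect shows only that the port is reachable; it does not validate TLS, the proxy configuration, or credentials.

```python
import socket


def can_reach(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS resolution failures.
        return False


def check_event_hubs_ports(namespace_host: str, ports=(5671, 5672, 443)) -> dict:
    """Map each port to whether it is reachable on the given namespace host."""
    return {port: can_reach(namespace_host, port) for port in ports}
```

For example, `check_event_hubs_ports("<NAMESPACE>.servicebus.windows.net")` returns a dictionary such as `{5671: ..., 5672: ..., 443: ...}`, where the placeholder must be replaced with your own fully qualified namespace.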

SSL handshake failures

This error can occur when an intercepting proxy is used. To confirm that the proxy is the cause, we recommend testing in your hosting environment with the proxy disabled.

Socket exhaustion errors

Applications should treat the Event Hubs clients as singletons, creating and using a single instance of each through the lifetime of the application. This is important because each client type manages its own connection; creating a new Event Hubs client results in a new AMQP connection, which uses a socket. Additionally, your application is responsible for calling close() when it has finished using a client, or for using the with statement so that the client is closed automatically when execution leaves that block.
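One possible way to enforce the single-instance pattern is a small holder that creates the client lazily and closes it exactly once. This is a sketch under stated assumptions: `SharedClient` and `producer_factory` are hypothetical names, not part of the Event Hubs library, and the holder works with any object that exposes `close()`.

```python
import threading


class SharedClient:
    """Create a client once on first use and close it at most once."""

    def __init__(self, factory):
        self._factory = factory
        self._client = None
        self._lock = threading.Lock()

    def get(self):
        """Return the shared client, creating it on the first call."""
        with self._lock:
            if self._client is None:
                self._client = self._factory()
            return self._client

    def close(self):
        """Close the shared client if it was created; safe to call repeatedly."""
        with self._lock:
            if self._client is not None:
                self._client.close()
                self._client = None


def producer_factory(conn_str: str, eventhub_name: str):
    """Return a zero-argument factory that builds an EventHubProducerClient."""
    def factory():
        # Imported lazily so SharedClient can be used without the SDK installed.
        from azure.eventhub import EventHubProducerClient

        return EventHubProducerClient.from_connection_string(
            conn_str, eventhub_name=eventhub_name
        )
    return factory
```

For short-lived scripts, using the client directly in a `with` statement achieves the same cleanup without a holder.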

Connect using an IoT connection string

Because translating a connection string requires querying the IoT Hub service, the Event Hubs client library cannot use it directly. The IoT Hub Connection String sample describes how to query IoT Hub to translate an IoT connection string into one that can be used with Event Hubs.

Further reading:

Adding "TransportType=AmqpWebSockets"

To use WebSockets, pass the keyword argument transport_type=TransportType.AmqpOverWebsocket during client creation.
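A minimal sketch of this keyword argument in use is shown below. The helper name `create_websocket_consumer` is hypothetical, and the connection string, Event Hub name, and consumer group are placeholders supplied by the caller.

```python
def create_websocket_consumer(conn_str: str, eventhub_name: str, consumer_group: str = "$Default"):
    """Create a consumer client that connects with AMQP over WebSockets (port 443)."""
    # Imported lazily so this module can be loaded without the SDK installed.
    from azure.eventhub import EventHubConsumerClient, TransportType

    return EventHubConsumerClient.from_connection_string(
        conn_str,
        consumer_group=consumer_group,
        eventhub_name=eventhub_name,
        transport_type=TransportType.AmqpOverWebsocket,
    )
```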

Adding "Authentication=Managed Identity"

To authenticate with Managed Identity, see the sample client_identity_authentication.py.

For more information about the azure-identity library, check out our Authentication and the Azure SDK blog post.

Enable and configure logging

The Azure SDK for Python offers a consistent logging story to help troubleshoot application errors and expedite their resolution. The logs produced will capture the flow of an application before reaching the terminal state to help locate the root issue.

This library uses the standard logging library for logging.

  • Enable azure.eventhub logger to collect traces from the library.
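Enabling the `azure.eventhub` logger could look like the following sketch, which uses only the standard `logging` module. The helper name `enable_eventhub_logging` and the log format are illustrative choices, not requirements of the library.

```python
import logging
import sys


def enable_eventhub_logging(level=logging.DEBUG, stream=sys.stdout):
    """Attach a stream handler to the azure.eventhub logger and return the logger."""
    logger = logging.getLogger("azure.eventhub")
    logger.setLevel(level)
    handler = logging.StreamHandler(stream)
    handler.setFormatter(
        logging.Formatter("%(asctime)s %(name)s %(levelname)s %(message)s")
    )
    logger.addHandler(handler)
    return logger
```

Call the helper once at application startup, before creating any clients. Loggers below `azure.eventhub` in the hierarchy inherit the level and propagate their records to this handler.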

Enable AMQP transport logging

If enabling client logging is not enough to diagnose your issues, you can enable AMQP frame-level tracing by setting logging_enable=True when creating the client.

Troubleshoot EventHubProducerClient (Sync/Async) issues

Cannot set multiple partition keys for events in EventDataBatch

When publishing messages, the Event Hubs service supports a single partition key for each EventDataBatch. Customers who need to publish events with different partition keys can consider using the producer client in buffered mode. Otherwise, they have to manage their own batches, one per partition key.
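Managing batches manually could be sketched as follows: group the events by partition key, then create one batch (or more, if a batch fills up) per key. The helper names `group_by_partition_key` and `send_grouped` are hypothetical, and the sketch assumes the input is an iterable of `(partition_key, body)` pairs and that the producer is a synchronous client.

```python
from collections import defaultdict


def group_by_partition_key(events):
    """Group (partition_key, body) pairs into {partition_key: [body, ...]}."""
    groups = defaultdict(list)
    for key, body in events:
        groups[key].append(body)
    return dict(groups)


def send_grouped(producer, events) -> None:
    """Send events so that every batch carries exactly one partition key."""
    # Imported lazily so group_by_partition_key() works without the SDK installed.
    from azure.eventhub import EventData

    for key, bodies in group_by_partition_key(events).items():
        batch = producer.create_batch(partition_key=key)
        for body in bodies:
            try:
                batch.add(EventData(body))
            except ValueError:
                if len(batch) == 0:
                    # A single event exceeds the batch size limit; nothing to flush.
                    raise
                # The batch is full: send it and continue with a new one for this key.
                producer.send_batch(batch)
                batch = producer.create_batch(partition_key=key)
                batch.add(EventData(body))
        if len(batch) > 0:
            producer.send_batch(batch)
```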

Partition key set on EventData is not set in Kafka consumer

The partition key of the Event Hubs event is available in the Kafka record headers; the protocol-specific key is "x-opt-partition-key" in the header.

By design, Event Hubs does not promote the Kafka message key to be the Event Hubs partition key, nor the reverse. Given the same value, the Kafka client and the Event Hubs client would likely send the message to two different partitions, so setting the value across protocols could cause confusion. Exposing the property to the other protocol's client under a protocol-specific key is considered sufficient.

Troubleshoot EventHubConsumerClient issues

412 precondition failures when using an event processor

Logs reflect intermittent HTTP 412 and HTTP 409 responses from storage when the client tries to take or renew ownership of a partition, but the local version of the ownership record is outdated. This occurs when another processor instance steals partition ownership. See Partition ownership changes frequently for more information.

Partition ownership changes frequently

When the number of EventHubConsumerClient instances changes (i.e. added or removed), the running instances try to load-balance partitions between themselves. For a few minutes after the number of processors changes, partitions are expected to change owners. Once balanced, partition ownership should be stable and change infrequently. If partition ownership is changing frequently when the number of processors is constant, this likely indicates a problem. It is recommended that a GitHub issue with logs and a repro be filed in this case.

"...current receiver '<RECEIVER_NAME>' with epoch '0' is getting disconnected"

The entire error message looks something like this:

New receiver 'nil' with higher epoch of '0' is created hence current receiver 'nil' with epoch '0' is getting disconnected. If you are recreating the receiver, make sure a higher epoch is used. TrackingId:, SystemTracker::eventhub:<EVENT_HUB_NAME>|<CONSUMER_GROUP>, Timestamp:2022-01-01T12:00:00}"}

This error is expected when load balancing occurs after EventHubConsumerClient instances are added or removed. Load balancing is an ongoing process. When using the BlobCheckpointStore with your consumer, every ~30 seconds (by default), the consumer will check to see which consumers have a claim for each partition, then run some logic to determine whether it needs to 'steal' a partition from another consumer. The service mechanism used to assert exclusive ownership over a partition is known as the Epoch.

However, if no instances are being added or removed, there is an underlying issue that should be addressed. See Partition ownership changes frequently for additional information, and Filing GitHub issues for the details to report.

High CPU usage

High CPU usage usually occurs because an instance owns too many partitions. We recommend no more than three partitions per CPU core; it is better to start with 1.5 partitions per CPU core and then test increasing the number of partitions owned.
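The guidance above translates into simple arithmetic. The sketch below computes a starting and a maximum partition count for one instance and, from that, a rough number of instances needed for a given Event Hub. The function names are hypothetical, and rounding down to whole partitions (with a minimum of one) is an assumption of this example.

```python
import math


def partition_budget(cpu_cores: int, start_ratio: float = 1.5, max_ratio: float = 3.0) -> dict:
    """Return the starting and maximum partitions one instance should own."""
    if cpu_cores < 1:
        raise ValueError("cpu_cores must be at least 1")
    return {
        "start": max(1, math.floor(cpu_cores * start_ratio)),
        "max": max(1, math.floor(cpu_cores * max_ratio)),
    }


def instances_needed(partitions: int, cpu_cores: int, ratio: float = 1.5) -> int:
    """Estimate how many instances are needed to stay within the given ratio."""
    per_instance = max(1, math.floor(cpu_cores * ratio))
    return math.ceil(partitions / per_instance)
```

For a 4-core machine, `partition_budget(4)` gives a starting point of 6 partitions and an upper bound of 12. For a 32-partition Event Hub on such machines, `instances_needed(32, 4)` suggests starting with 6 instances.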

Processor client stops receiving

The processor client often runs continually in a host application for days on end. Sometimes, users notice that EventHubConsumerClient is not processing one or more partitions. Usually, this alone is not enough information to determine why processing stopped. The EventHubConsumerClient stopping is the symptom of an underlying cause (e.g., a race condition) that occurred while trying to recover from a transient error. Please see Filing GitHub issues for the information we require.

Migrate from legacy to new client library

The migration guide includes steps on migrating from the legacy client and migrating legacy checkpoints.

Get additional help

Additional information on ways to reach out for support can be found in the SUPPORT.md at the repo's root.

Filing GitHub issues

When filing GitHub issues, the following details are requested:

  • Event Hub environment
    • How many partitions?
  • EventHubConsumerClient environment
    • What are the specs of the machine(s) processing your Event Hub?
    • How many instances are running?
  • What is the average size of each EventData?
  • What is the traffic pattern like in your Event Hub? (e.g., number of messages per minute, and whether the EventHubConsumerClient is always busy or has slow traffic periods.)
  • Repro code and steps
    • This is important as we often cannot reproduce the issue in our environment.
  • Logs. We need DEBUG logs, but if that is not possible, INFO at least. Error and warning level logs do not provide enough information. The logs should cover at least 10 minutes before and after the time the issue occurred.