[Monitor OpenTelemetry exporter] Fix retry amplification storm #47002

Merged
hectorhdzg merged 1 commit into Azure:main from hectorhdzg:fix/monitor-otel-retry-storm-throttling on May 20, 2026

Conversation

@hectorhdzg
Member

During sustained 429 throttling, failed telemetry accumulates as blob files in local storage. On recovery, _transmit_from_storage() drained all blobs in a tight loop, creating a burst of requests that could immediately re-trigger throttling.

Changes:

  • Cap storage drain to 10 blobs per invocation (_MAX_STORAGE_DRAIN_BATCH) to spread retry load across export cycles
  • Stop draining immediately when a retryable failure occurs, since the service is still under pressure
  • Add tests for both drain cap and early termination behaviors
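
The two behavioral changes above can be sketched in isolation. This is a minimal illustration, not the code from `_base.py`: the `Result` enum, the `Blob` class, and the `drain_storage` helper are hypothetical stand-ins, and deleting a blob after a non-retryable failure is an assumption made for the sketch. Only the constant name and its value of 10 come from this PR.

```python
from enum import Enum
from typing import Callable, Iterable

# Name and value as stated in the PR description.
_MAX_STORAGE_DRAIN_BATCH = 10


class Result(Enum):
    # Hypothetical stand-in for the exporter's result values.
    SUCCESS = 0
    FAILED_RETRYABLE = 1
    FAILED_NOT_RETRYABLE = 2


class Blob:
    # Hypothetical in-memory stand-in for a blob file in local storage.
    def __init__(self, payload: str) -> None:
        self.payload = payload
        self.deleted = False

    def delete(self) -> None:
        self.deleted = True


def drain_storage(
    blobs: Iterable[Blob],
    transmit: Callable[[str], Result],
    max_batch: int = _MAX_STORAGE_DRAIN_BATCH,
) -> int:
    """Send at most max_batch blobs and stop at the first retryable failure.

    Returns the number of blobs that were attempted.
    """
    attempted = 0
    for blob in blobs:
        if attempted >= max_batch:
            # Cap reached: leave the rest for a later export cycle.
            break
        attempted += 1
        result = transmit(blob.payload)
        if result is Result.FAILED_RETRYABLE:
            # The service is still under pressure: keep this blob and stop.
            break
        # Assumption: success and non-retryable failure both discard the blob.
        blob.delete()
    return attempted
```

With 25 stored blobs and a `transmit` that always succeeds, one call attempts and deletes only the first 10. If the second send returns `FAILED_RETRYABLE`, the call returns after 2 attempts and the second blob is retained.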

Copilot AI review requested due to automatic review settings May 19, 2026 21:22
@github-actions github-actions Bot added the Monitor - Exporter Monitor OpenTelemetry Exporter label May 19, 2026
Contributor

Copilot AI left a comment

Pull request overview

This PR mitigates retry amplification in the Azure Monitor OpenTelemetry exporter by preventing _transmit_from_storage() from aggressively draining local offline-storage blobs after sustained throttling, reducing the likelihood of immediately re-triggering 429s on recovery.

Changes:

  • Add a per-invocation cap (_MAX_STORAGE_DRAIN_BATCH = 10) on how many stored blobs are processed in _transmit_from_storage().
  • Stop draining immediately when a retryable failure occurs while draining, to avoid creating a burst of follow-up requests.
  • Add unit tests validating early termination on retryable failure and enforcement of the drain cap.
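
To make the effect of the per-invocation cap concrete, the following sketch counts how many stored blobs would be sent in each export cycle for a given backlog. It assumes one drain per export cycle and that every send succeeds. `requests_per_cycle` is a hypothetical helper written for this illustration and is not part of the exporter.

```python
from collections import deque
from typing import Deque, List

# Name and value as stated in the PR description.
_MAX_STORAGE_DRAIN_BATCH = 10


def requests_per_cycle(backlog: int, cap: int = _MAX_STORAGE_DRAIN_BATCH) -> List[int]:
    """Return the number of blobs sent in each cycle until the backlog is empty.

    Assumes one drain per export cycle and that every send succeeds.
    """
    pending: Deque[int] = deque(range(backlog))
    per_cycle: List[int] = []
    while pending:
        sent = 0
        # Drain up to `cap` blobs, then wait for the next cycle.
        while pending and sent < cap:
            pending.popleft()
            sent += 1
        per_cycle.append(sent)
    return per_cycle


print(requests_per_cycle(35))  # [10, 10, 10, 5]
```

Without the cap, the same backlog of 35 blobs would produce 35 requests in a single cycle. With it, no cycle issues more than 10 storage retries.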

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

| File | Description |
| --- | --- |
| sdk/monitor/azure-monitor-opentelemetry-exporter/azure/monitor/opentelemetry/exporter/export/_base.py | Caps offline-storage draining per cycle and stops draining on retryable failures to prevent request bursts after throttling. |
| sdk/monitor/azure-monitor-opentelemetry-exporter/tests/test_base_exporter.py | Adds tests for early-stop behavior on retryable failures and for enforcing the per-invocation drain cap. |

@hectorhdzg hectorhdzg force-pushed the fix/monitor-otel-retry-storm-throttling branch from 4a70184 to 52d7553 Compare May 19, 2026 22:06
Member

@JacksonWeber JacksonWeber left a comment

LGTM

@hectorhdzg hectorhdzg merged commit 218656c into Azure:main May 20, 2026
19 checks passed
ninghu pushed a commit to ninghu/azure-sdk-for-python that referenced this pull request May 22, 2026
…Azure#47002)
