🐛 retry transient 503 errors in telemetry error checking#4273

Merged
thomas-lebeau merged 2 commits into main from thomas.lebeau/retry-503-telemetry-errors on Mar 9, 2026

Conversation

@thomas-lebeau (Collaborator)

Motivation

The deploy-to-prod CI job polls the Datadog Logs Analytics API every 60 seconds for 30 minutes after deployment. A transient 503 Service Unavailable response (e.g. from rate limiting) crashes the script immediately, since fetchHandlingError throws on any non-OK response and there is no retry logic.

Changes

Add retry logic in queryLogsApi (scripts/deploy/lib/checkTelemetryErrors.ts):

  • Wrap the fetchHandlingError call in a retry loop (up to 3 attempts)
  • On 503, wait RATE_LIMIT_DELAY_MS (6s) before retrying
  • Use existing findError/FetchError utilities to detect the error type
  • Log each retry with printLog for CI visibility
  • Re-throw immediately on non-503 errors or after exhausting retries
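The retry shape described above can be sketched as follows. The names FetchError, MAX_RETRIES, and RATE_LIMIT_DELAY_MS come from the PR, but the simplified FetchError class and the fetchWithRetry harness here are illustrative stand-ins, not the actual implementation (which also uses findError to unwrap nested errors):

```typescript
const MAX_RETRIES = 3
const RATE_LIMIT_DELAY_MS = 6_000

// Simplified stand-in for the FetchError thrown by fetchHandlingError.
class FetchError extends Error {
  constructor(public response: { status: number }) {
    super(`HTTP ${response.status}`)
  }
}

const timeout = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms))

// Retry a fetch-like operation on transient 503s; re-throw anything else,
// or the 503 itself once the attempts are exhausted.
async function fetchWithRetry<T>(
  doFetch: () => Promise<T>,
  delayMs: number = RATE_LIMIT_DELAY_MS
): Promise<T> {
  for (let attempt = 1; ; attempt++) {
    try {
      return await doFetch()
    } catch (error) {
      const is503 = error instanceof FetchError && error.response.status === 503
      if (!is503 || attempt === MAX_RETRIES) {
        throw error
      }
      // printLog in the real script; console.log keeps this sketch self-contained.
      console.log(`Received 503, retrying (attempt ${attempt}/${MAX_RETRIES})...`)
      await timeout(delayMs)
    }
  }
}
```

With MAX_RETRIES = 3 the API is called at most three times, so a single transient 503 during the 30-minute polling window no longer kills the job.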

Test instructions

  • yarn typecheck passes
  • yarn test:script passes

Checklist

  • Tested locally
  • Tested on staging
  • Added unit tests for this change
  • Added e2e/integration tests for this change
  • Updated documentation and/or relevant AGENTS.md file

@thomas-lebeau thomas-lebeau requested a review from a team as a code owner March 4, 2026 11:23
cit-pr-commenter-54b7da bot commented Mar 4, 2026

Bundles Sizes Evolution

| 📦 Bundle Name | Base Size  | Local Size | 𝚫   | 𝚫%    | Status |
| -------------- | ---------- | ---------- | --- | ----- | ------ |
| Rum            | 173.96 KiB | 173.96 KiB | 0 B | 0.00% |        |
| Rum Profiler   | 4.71 KiB   | 4.71 KiB   | 0 B | 0.00% |        |
| Rum Recorder   | 24.88 KiB  | 24.88 KiB  | 0 B | 0.00% |        |
| Logs           | 56.50 KiB  | 56.50 KiB  | 0 B | 0.00% |        |
| Flagging       | 944 B      | 944 B      | 0 B | 0.00% |        |
| Rum Slim       | 129.66 KiB | 129.66 KiB | 0 B | 0.00% |        |
| Worker         | 23.63 KiB  | 23.63 KiB  | 0 B | 0.00% |        |

datadog-datadog-prod-us1 bot commented Mar 4, 2026

✅ Tests

🎉 All green!

❄️ No new flaky tests detected
🧪 All tests passed

🎯 Code Coverage (details)
Patch Coverage: 100.00%
Overall Coverage: 77.18% (+0.23%)

🔗 Commit SHA: 1d0f527

The deploy-to-prod CI job polls the Datadog Logs Analytics API for 30
minutes after deployment. A transient 503 Service Unavailable response
would crash the script immediately since there was no retry logic.

Add a retry loop (up to 3 attempts with a 6s delay) around the API call
in queryLogsApi for 503 errors, using the existing RATE_LIMIT_DELAY_MS
as the backoff interval.
@thomas-lebeau force-pushed the thomas.lebeau/retry-503-telemetry-errors branch from 21b81c1 to b16c0b1 on March 4, 2026 at 11:26
Comment on lines +154 to +168
if (
!data ||
!data.data ||
!Array.isArray(data.data.buckets) ||
!data.data.buckets.every((bucket) => bucket.computes && typeof bucket.computes.c0 === 'number')
) {
throw new Error(`Unexpected response from the API: ${JSON.stringify(data)}`)
}

return data.data.buckets
} catch (error) {
const fetchError = findError(error, FetchError)
if (!fetchError || fetchError.response.status !== 503 || attempt === MAX_RETRIES) {
throw error
}
@bcaudan (Collaborator) commented Mar 4, 2026:

💬 suggestion: It seems that we are using the first throw to control the execution flow.
It could be easier to follow and maintain with a structure like:

for (let attempt = 1; attempt <= MAX_RETRIES; attempt++) {
  const response = await queryLogs()
  if (isValid(response)) {
    return response.data.buckets
  } else if (shouldRetry(response)) {
    printLog(...)
    await timeout(...)
  } else {
    throw new Error(...)
  }
}

- Replace `fetchHandlingError` wrapper with direct `undici.fetch` call
  in `queryLogsApi`, handling response status inline
- Extract `shouldRetry` and `isValidData` helpers for clarity
- Switch from for-loop retry to recursive approach
- Extract reusable `createFetchError` from `fetchHandlingError`
- Update tests to mock `globalThis.fetch` instead of `fetchHandlingError`
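The refactor described in these commits might look roughly like the sketch below. shouldRetry, isValidData, and queryLogsApi are named in the commit message, but their exact signatures are not shown in the PR, so the shapes here (including the injected doFetch callback used in place of a real undici.fetch call) are assumptions for illustration:

```typescript
const MAX_RETRIES = 3

interface LogsBucket {
  computes: { c0: number }
}

// Hypothetical shape of one fetch attempt: an HTTP status plus the parsed JSON body.
interface ApiResult {
  status: number
  body: unknown
}

function shouldRetry(status: number): boolean {
  return status === 503 // only transient 503s are worth retrying
}

// Mirrors the validation in the diff hunk above: every bucket must carry a numeric computes.c0.
function isValidData(body: any): body is { data: { buckets: LogsBucket[] } } {
  return (
    !!body?.data &&
    Array.isArray(body.data.buckets) &&
    body.data.buckets.every((b: any) => b.computes && typeof b.computes.c0 === 'number')
  )
}

// Recursive retry: one attempt per call, recursing while the status is retryable
// and attempts remain; any other failure surfaces as a single error.
async function queryLogsApi(
  doFetch: () => Promise<ApiResult>,
  attempt = 1
): Promise<LogsBucket[]> {
  const { status, body } = await doFetch()
  if (status === 200 && isValidData(body)) {
    return body.data.buckets
  }
  if (shouldRetry(status) && attempt < MAX_RETRIES) {
    return queryLogsApi(doFetch, attempt + 1)
  }
  throw new Error(`Unexpected response from the API: ${JSON.stringify(body)}`)
}
```

Handling the status inline like this avoids the throw-as-control-flow pattern the review flagged: a 503 never raises at all unless the retries are exhausted.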
@thomas-lebeau force-pushed the thomas.lebeau/retry-503-telemetry-errors branch from fef06ba to 1d0f527 on March 5, 2026 at 14:59
@bcaudan (Collaborator) left a comment:

LGTM

@thomas-lebeau thomas-lebeau merged commit 6e817a5 into main Mar 9, 2026
21 checks passed
@thomas-lebeau thomas-lebeau deleted the thomas.lebeau/retry-503-telemetry-errors branch March 9, 2026 08:25
@github-actions github-actions bot locked and limited conversation to collaborators Mar 9, 2026