🐛 retry transient 503 errors in telemetry error checking#4273
Merged
thomas-lebeau merged 2 commits intomainfrom Mar 9, 2026
Merged
🐛 retry transient 503 errors in telemetry error checking#4273thomas-lebeau merged 2 commits intomainfrom
thomas-lebeau merged 2 commits intomainfrom
Conversation
Bundles Sizes Evolution
🚀 CPU PerformancePending... 🧠 Memory PerformancePending... |
|
✅ Tests 🎉 All green!❄️ No new flaky tests detected 🎯 Code Coverage (details) 🔗 Commit SHA: 1d0f527 | Docs | Datadog PR Page | Was this helpful? React with 👍/👎 or give us feedback! |
The deploy-to-prod CI job polls the Datadog Logs Analytics API for 30 minutes after deployment. A transient 503 Service Unavailable response would crash the script immediately since there was no retry logic. Add a retry loop (up to 3 attempts with a 6s delay) around the API call in queryLogsApi for 503 errors, using the existing RATE_LIMIT_DELAY_MS as the backoff interval.
21b81c1 to
b16c0b1
Compare
bcaudan
reviewed
Mar 4, 2026
Comment on lines
+154
to
+168
| if ( | ||
| !data || | ||
| !data.data || | ||
| !Array.isArray(data.data.buckets) || | ||
| !data.data.buckets.every((bucket) => bucket.computes && typeof bucket.computes.c0 === 'number') | ||
| ) { | ||
| throw new Error(`Unexpected response from the API: ${JSON.stringify(data)}`) | ||
| } | ||
|
|
||
| return data.data.buckets | ||
| } catch (error) { | ||
| const fetchError = findError(error, FetchError) | ||
| if (!fetchError || fetchError.response.status !== 503 || attempt === MAX_RETRIES) { | ||
| throw error | ||
| } |
Collaborator
There was a problem hiding this comment.
💬 suggestion: it seems that we are using the first throw to control the execution flow.
it could be easier to follow / maintain with a structure like:
for (let attempt = 1; attempt <= MAX_RETRIES; attempt++) {
const response = await queryLogs()
if (isValid(response)) {
return response.data.buckets
} else if (shouldRetry(response)) {
printLog(...)
await timeout(...)
} else {
throw new Error(...)
}
}- Replace `fetchHandlingError` wrapper with direct `undici.fetch` call in `queryLogsApi`, handling response status inline - Extract `shouldRetry` and `isValidData` helpers for clarity - Switch from for-loop retry to recursive approach - Extract reusable `createFetchError` from `fetchHandlingError` - Update tests to mock `globalThis.fetch` instead of `fetchHandlingError`
fef06ba to
1d0f527
Compare
BenoitZugmeyer
approved these changes
Mar 5, 2026
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to subscribe to this conversation on GitHub.
Already have an account?
Sign in.
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Motivation
The deploy-to-prod CI job polls the Datadog Logs Analytics API every 60 seconds for 30 minutes after deployment. A transient
503 Service Unavailableresponse (e.g. from rate limiting) crashes the script immediately sincefetchHandlingErrorthrows on any non-OK response with no retry logic.Changes
Add retry logic in
queryLogsApi(scripts/deploy/lib/checkTelemetryErrors.ts):fetchHandlingErrorcall in a retry loop (up to 3 attempts)RATE_LIMIT_DELAY_MS(6s) before retryingfindError/FetchErrorutilities to detect the error typeprintLogfor CI visibilityTest instructions
yarn typecheckpassesyarn test:scriptpassesChecklist