From 54f028fc57e9cb0764e0566ca447944a514c206c Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Alejandro=20Font=C3=A1n?= Date: Tue, 21 Apr 2026 00:08:08 +0200 Subject: [PATCH 1/4] [SDCI-2079] Document automatic job retries (Preview) Adds customer-facing documentation for the automatic job retries feature on GitHub Actions and GitLab. Uses a dedicated automatic_retries.md page as the source of truth, surfaced through the compatibility tables on each provider page and the supported features matrix on the pipelines index. --- .../pipelines/_index.md | 1 + .../pipelines/automatic_retries.md | 72 +++++++++++++++++++ .../pipelines/github.md | 2 + .../pipelines/gitlab.md | 2 + 4 files changed, 77 insertions(+) create mode 100644 content/en/continuous_integration/pipelines/automatic_retries.md diff --git a/content/en/continuous_integration/pipelines/_index.md b/content/en/continuous_integration/pipelines/_index.md index 3a760a89f1a..4155c8b33dc 100644 --- a/content/en/continuous_integration/pipelines/_index.md +++ b/content/en/continuous_integration/pipelines/_index.md @@ -47,6 +47,7 @@ Select your CI provider to set up CI Visibility in Datadog: | {{< ci-details title="Infrastructure correlation" >}}Correlation of host-level information for the Datadog Agent, CI pipelines, or job runners to CI pipeline execution data.{{< /ci-details >}} | | | {{< X >}} | | | {{< X >}} | {{< X >}} | {{< X >}} | | | | {{< ci-details title="Running pipelines" >}}Identification of pipelines executions that are running with associated tracing.{{< /ci-details >}} | {{< X >}} | | | | | {{< X >}} | {{< X >}} | {{< X >}} | | {{< X >}} | | {{< ci-details title="Partial retries" >}}Identification of partial retries (for example, when only a subset of jobs were retried).{{< /ci-details >}} | {{< X >}} | {{< X >}} | {{< X >}} | | {{< X >}} | {{< X >}} | {{< X >}} | | {{< X >}} | {{< X >}} | +| {{< ci-details title="Automatic job retries" >}}Preview. 
Datadog retries failed jobs classified as transient by its AI error model. See [Automatic job retries](/continuous_integration/pipelines/automatic_retries/).{{< /ci-details >}} | | | | | | {{< X >}} | {{< X >}} | | | | | {{< ci-details title="Step granularity" >}}Step level spans are available for more granular visibility.{{< /ci-details >}} | | | | | {{< X >}} | {{< X >}} | | {{< X >}}
(_Presented as job spans_) | | {{< X >}} | | {{< ci-details title="Manual steps" >}}Identification of when there is a job with a manual approval phase in the overall pipeline.{{< /ci-details >}} | {{< X >}} | {{< X >}} | {{< X >}} | | {{< X >}} | {{< X >}} | {{< X >}} | {{< X >}} | | {{< X >}} | diff --git a/content/en/continuous_integration/pipelines/automatic_retries.md b/content/en/continuous_integration/pipelines/automatic_retries.md new file mode 100644 index 00000000000..26a2cbd5376 --- /dev/null +++ b/content/en/continuous_integration/pipelines/automatic_retries.md @@ -0,0 +1,72 @@ +--- +title: Automatic Job Retries +further_reading: + - link: "/continuous_integration/pipelines" + tag: "Documentation" + text: "Explore Pipeline Execution Results and Performance" + - link: "/continuous_integration/troubleshooting/" + tag: "Documentation" + text: "Troubleshooting CI Visibility" +--- + +
<div class="alert alert-info">Automatic job retries are in Preview. To request access, contact your Datadog account team.</div>
+ +## Overview + +Automatic job retries save developer time by re-running only the failures that are likely transient—such as network timeouts, infrastructure hiccups, or flaky tests—while leaving genuine code defects untouched. Datadog classifies each failed job with an AI-powered error model and, when the failure is determined retriable, triggers a retry through the CI provider's API without manual intervention. + +This reduces the number of pipelines developers manually re-run, shortens feedback loops, and keeps pipeline success metrics focused on real problems. + +## How it works + +1. A CI job fails in your pipeline. +2. Datadog's AI error classifier inspects the job's logs and error context to determine whether the failure is transient. +3. If the failure is classified as retriable, Datadog requests a retry through the provider's API. +4. Datadog retries each job up to a configurable maximum to prevent infinite retry loops. +5. The retry outcome is reflected on the original pipeline in CI Visibility. + +## Requirements + +- CI Visibility enabled for your [GitHub Actions][1] or [GitLab][2] integration. +- [Datadog Source Code Integration][3] configured for the repositories where you want automatic retries. +- Automatic job retries enabled for your organization. Because this feature is in Preview, access is gated—contact your Datadog account team to request enablement. + +## Provider support + +{{< tabs >}} +{{% tab "GitLab" %}} + +Datadog performs **smart retries** on GitLab: only the specific job classified as retriable is re-run. Other failed jobs that aren't classified retriable, and passing jobs, aren't affected. + +- Retries are triggered per job as soon as the job finishes failing. +- Works with GitLab.com (SaaS) and self-hosted GitLab instances reachable by the Datadog Source Code Integration. +- No additional CI cost beyond the retried job itself. 
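
For context, the per-job ("smart") retry that Datadog triggers corresponds to GitLab's `POST /projects/:id/jobs/:job_id/retry` API endpoint. The sketch below only illustrates the shape of such a request: the project ID, job ID, and token are hypothetical placeholders, and this is not code Datadog ships or that you need to run.

```python
# Illustrative only: the GitLab API call behind retrying one specific job.
# The project ID, job ID, and token below are hypothetical placeholders.
import urllib.request

GITLAB_API = "https://gitlab.com/api/v4"

def retry_job_request(project_id: int, job_id: int, token: str) -> urllib.request.Request:
    """Build the POST request that retries a single failed job."""
    url = f"{GITLAB_API}/projects/{project_id}/jobs/{job_id}/retry"
    return urllib.request.Request(url, method="POST",
                                  headers={"PRIVATE-TOKEN": token})

req = retry_job_request(1234, 5678, "glpat-example")
print(req.get_full_url())  # https://gitlab.com/api/v4/projects/1234/jobs/5678/retry
```

Because the request targets one job ID, other failed jobs and passing jobs in the same pipeline are untouched.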
+ +{{% /tab %}} +{{% tab "GitHub Actions" %}} + +GitHub Actions imposes two provider-level limitations that shape how retries work: + +- **Retries happen after the workflow finishes.** The GitHub API does not allow retrying an individual job while the rest of the workflow is still running. Datadog waits for the workflow to reach a final state before issuing retries. +- **All failed jobs are retried together.** The GitHub API does not support retrying a single job when other jobs in the workflow have also failed. Datadog uses the "rerun failed jobs" endpoint, which re-runs every failed job in the workflow. This may increase the GitHub Actions compute minutes consumed by your pipelines. + +### Protected branches + +The Datadog GitHub App's default permissions do not allow retries on protected branches. To enable automatic retries on a protected branch (for example, your default branch), grant the app Maintainer-level access. Review your organization's policies before expanding permissions. + +{{% /tab %}} +{{< /tabs >}} + +## Limitations + +- Each logical job is retried at most one time. +- Jobs classified as non-retriable (for example, compilation errors or asserted test failures) are never retried. +- If a job has already been retried manually or by provider-native retry rules, Datadog does not issue an additional retry. 
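
The decision order described above, including the limitations in this section, can be sketched in simplified pseudocode. `classify_failure` and `trigger_retry` are hypothetical stand-ins for Datadog internals, shown only to make the ordering of the checks concrete:

```python
# Simplified sketch of the retry decision flow; classify_failure and
# trigger_retry are hypothetical stand-ins, not real Datadog APIs.
MAX_DATADOG_ATTEMPTS = 1  # each logical job is retried at most once

def handle_failed_job(job: dict, classify_failure, trigger_retry) -> str:
    """Decide whether a failed job should be retried automatically."""
    if job.get("retried_externally"):
        return "skipped: already retried manually or by provider rules"
    if job.get("datadog_attempts", 0) >= MAX_DATADOG_ATTEMPTS:
        return "skipped: retry budget exhausted"
    if classify_failure(job["logs"]) != "transient":
        return "skipped: classified as non-retriable"
    trigger_retry(job)  # retry through the provider's API (GitHub/GitLab)
    job["datadog_attempts"] = job.get("datadog_attempts", 0) + 1
    return "retried"
```

A transient failure is retried once; a second failure of the same job, an externally retried job, or a non-transient failure (such as a compilation error) is left alone.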
+ +## Further reading + +{{< partial name="whats-next/whats-next.html" >}} + +[1]: /continuous_integration/pipelines/github/ +[2]: /continuous_integration/pipelines/gitlab/ +[3]: /integrations/guide/source-code-integration/ diff --git a/content/en/continuous_integration/pipelines/github.md b/content/en/continuous_integration/pipelines/github.md index b6320e0ca6b..fbdf8028b67 100644 --- a/content/en/continuous_integration/pipelines/github.md +++ b/content/en/continuous_integration/pipelines/github.md @@ -30,6 +30,7 @@ Set up CI Visibility for GitHub Actions to track the execution of your workflows | [Running pipelines][2] | Running pipelines | View pipeline executions that are running. Queued or waiting pipelines show with status "Running" on Datadog. | | [CI jobs failure analysis][23] | CI jobs failure analysis | Uses LLM models on relevant logs to analyze the root cause of failed CI jobs. | | [Partial retries][3] | Partial pipelines | View partially retried pipeline executions. | +| [Automatic job retries][27] | Automatic job retries | Preview. Datadog retries failed jobs classified as transient by its AI error model. | | Logs correlation | Logs correlation | Correlate pipeline and job spans to logs and enable [job log collection](#collect-job-logs). | | Infrastructure metric correlation | Infrastructure metric correlation | Correlate jobs to [infrastructure host metrics][4] for GitHub jobs. | | [Custom tags][5] [and measures at runtime][6] | Custom tags and measures at runtime | Configure [custom tags and measures][7] at runtime. 
| @@ -158,3 +159,4 @@ The **CI Pipeline List** page shows data for only the default branch of each rep [24]: /continuous_integration/guides/identify_highest_impact_jobs_with_critical_path/ [25]: /glossary/#pipeline-execution-time [26]: /continuous_integration/guides/use_ci_jobs_failure_analysis/#using-pr-comments +[27]: /continuous_integration/pipelines/automatic_retries/ diff --git a/content/en/continuous_integration/pipelines/gitlab.md b/content/en/continuous_integration/pipelines/gitlab.md index df397f6232b..66d8357bd54 100644 --- a/content/en/continuous_integration/pipelines/gitlab.md +++ b/content/en/continuous_integration/pipelines/gitlab.md @@ -28,6 +28,7 @@ Set up CI Visibility for GitLab to collect data on your pipeline executions, ana | [CI jobs failure analysis][28] | CI jobs failure analysis | Uses LLM models on relevant logs to analyze the root cause of failed CI jobs. | | [Filter CI Jobs on the critical path][29] | Filter CI Jobs on the critical path | Filter by jobs on the critical path. | | [Partial retries][19] | Partial pipelines | View partially retried pipeline executions. | +| [Automatic job retries][31] | Automatic job retries | Preview. Datadog retries failed jobs classified as transient by its AI error model. | | [Manual steps][20] | Manual steps | View manually triggered pipelines. | | [Queue time][21] | Queue time | View the amount of time pipeline jobs sit in the queue before processing. | | Logs correlation | Logs correlation | Correlate pipeline spans to logs and enable [job log collection][12]. 
| @@ -466,3 +467,4 @@ The **CI Pipeline List** page shows data for only the default branch of each rep [28]: /continuous_integration/guides/use_ci_jobs_failure_analysis/ [29]: /continuous_integration/guides/identify_highest_impact_jobs_with_critical_path/ [30]: /continuous_integration/guides/use_ci_jobs_failure_analysis/#using-pr-comments +[31]: /continuous_integration/pipelines/automatic_retries/ From a02aa0f58fa60a95f5bc59bd6cc912b0e097b9b8 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Alejandro=20Font=C3=A1n?= Date: Tue, 21 Apr 2026 09:34:42 +0200 Subject: [PATCH 2/4] [SDCI-2079] Apply editorial fixes from PR review - Replace "hiccups" colloquialism with "failures". - Split long em-dashed sentence in Overview into two sentences. - Replace passive "the retry outcome is reflected" with active. - Replace "configurable maximum" with "maximum number of attempts" since the limit is not customer-tunable today. - Replace "when the failure is determined retriable" with "when the failure is identified as retriable". - Replace quoted "rerun failed jobs" with plain prose (GitHub API call). - Replace awkward "compute minutes consumed by your pipelines" with "GitHub Actions compute usage". - Add missing "as" in "aren't classified retriable" (GitLab tab). - Make GitLab provider list items structurally consistent (all full sentences). - Rename "Provider support" heading to "Provider-specific behavior" for stronger AI retrieval. - Add GitHub Actions and GitLab setup pages to further_reading. - Replace em dash with period in access-gating sentence. 
--- .../pipelines/automatic_retries.md | 26 ++++++++++++------- 1 file changed, 16 insertions(+), 10 deletions(-) diff --git a/content/en/continuous_integration/pipelines/automatic_retries.md b/content/en/continuous_integration/pipelines/automatic_retries.md index 26a2cbd5376..a7f331c190d 100644 --- a/content/en/continuous_integration/pipelines/automatic_retries.md +++ b/content/en/continuous_integration/pipelines/automatic_retries.md @@ -4,6 +4,12 @@ further_reading: - link: "/continuous_integration/pipelines" tag: "Documentation" text: "Explore Pipeline Execution Results and Performance" + - link: "/continuous_integration/pipelines/github/" + tag: "Documentation" + text: "Set up CI Visibility for GitHub Actions" + - link: "/continuous_integration/pipelines/gitlab/" + tag: "Documentation" + text: "Set up CI Visibility for GitLab" - link: "/continuous_integration/troubleshooting/" tag: "Documentation" text: "Troubleshooting CI Visibility" @@ -13,34 +19,34 @@ further_reading: ## Overview -Automatic job retries save developer time by re-running only the failures that are likely transient—such as network timeouts, infrastructure hiccups, or flaky tests—while leaving genuine code defects untouched. Datadog classifies each failed job with an AI-powered error model and, when the failure is determined retriable, triggers a retry through the CI provider's API without manual intervention. +Automatic job retries save developer time by re-running failures that are likely transient, such as network timeouts, infrastructure failures, or flaky tests. Genuine code defects are left alone. Datadog runs each failed job through an AI-powered error classifier. When the failure is identified as retriable, Datadog triggers a retry through the CI provider's API without manual intervention. -This reduces the number of pipelines developers manually re-run, shortens feedback loops, and keeps pipeline success metrics focused on real problems. 
+This reduces the number of pipelines developers manually re-run, shortens feedback loops, and keeps pipeline success metrics focused on non-transient failures. ## How it works 1. A CI job fails in your pipeline. 2. Datadog's AI error classifier inspects the job's logs and error context to determine whether the failure is transient. 3. If the failure is classified as retriable, Datadog requests a retry through the provider's API. -4. Datadog retries each job up to a configurable maximum to prevent infinite retry loops. -5. The retry outcome is reflected on the original pipeline in CI Visibility. +4. Datadog retries each job up to a maximum number of attempts to prevent infinite retry loops. +5. Datadog records the retry outcome on the original pipeline in CI Visibility. ## Requirements - CI Visibility enabled for your [GitHub Actions][1] or [GitLab][2] integration. - [Datadog Source Code Integration][3] configured for the repositories where you want automatic retries. -- Automatic job retries enabled for your organization. Because this feature is in Preview, access is gated—contact your Datadog account team to request enablement. +- Automatic job retries enabled for your organization. Because this feature is in Preview, access is gated. Contact your Datadog account team to request enablement. -## Provider support +## Provider-specific behavior {{< tabs >}} {{% tab "GitLab" %}} -Datadog performs **smart retries** on GitLab: only the specific job classified as retriable is re-run. Other failed jobs that aren't classified retriable, and passing jobs, aren't affected. +Datadog performs **smart retries** on GitLab: only the specific job classified as retriable is re-run. Other failed jobs (that aren't classified as retriable) and passing jobs aren't affected. - Retries are triggered per job as soon as the job finishes failing. -- Works with GitLab.com (SaaS) and self-hosted GitLab instances reachable by the Datadog Source Code Integration. 
-- No additional CI cost beyond the retried job itself. +- Smart retries work with GitLab.com (SaaS) and with self-hosted GitLab instances reachable by the Datadog Source Code Integration. +- There is no additional CI cost beyond the retried job. {{% /tab %}} {{% tab "GitHub Actions" %}} @@ -48,7 +54,7 @@ Datadog performs **smart retries** on GitLab: only the specific job classified a GitHub Actions imposes two provider-level limitations that shape how retries work: - **Retries happen after the workflow finishes.** The GitHub API does not allow retrying an individual job while the rest of the workflow is still running. Datadog waits for the workflow to reach a final state before issuing retries. -- **All failed jobs are retried together.** The GitHub API does not support retrying a single job when other jobs in the workflow have also failed. Datadog uses the "rerun failed jobs" endpoint, which re-runs every failed job in the workflow. This may increase the GitHub Actions compute minutes consumed by your pipelines. +- **All failed jobs are retried together.** The GitHub API does not support retrying a single job when other jobs in the workflow have also failed. Datadog reruns every failed job in the workflow through a single GitHub API call. This may increase your GitHub Actions compute usage. ### Protected branches From eafc81ad23f4205c2302ceeb8491438188f2f772 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Alejandro=20Font=C3=A1n?= Date: Tue, 21 Apr 2026 11:03:53 +0200 Subject: [PATCH 3/4] [SDCI-2079] Apply second round of editorial fixes - Collapse Requirements bullet 3 into a single fragment; redirect readers to the banner instead of repeating access-request instructions. - Replace "Genuine code defects are left alone" with "not retried" to avoid the idiom. - Replace ambiguous "This reduces the number of pipelines developers manually re-run" with "Automatic retries reduce the number of pipelines that developers re-run by hand". 
- GitLab tab: replace awkward "as soon as the job finishes failing" with "as soon as the job fails", and drop the redundant second "with" in the Smart retries bullet. --- .../pipelines/automatic_retries.md | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-) diff --git a/content/en/continuous_integration/pipelines/automatic_retries.md b/content/en/continuous_integration/pipelines/automatic_retries.md index a7f331c190d..09e94b80045 100644 --- a/content/en/continuous_integration/pipelines/automatic_retries.md +++ b/content/en/continuous_integration/pipelines/automatic_retries.md @@ -19,9 +19,9 @@ further_reading: ## Overview -Automatic job retries save developer time by re-running failures that are likely transient, such as network timeouts, infrastructure failures, or flaky tests. Genuine code defects are left alone. Datadog runs each failed job through an AI-powered error classifier. When the failure is identified as retriable, Datadog triggers a retry through the CI provider's API without manual intervention. +Automatic job retries save developer time by re-running failures that are likely transient, such as network timeouts, infrastructure failures, or flaky tests. Genuine code defects are not retried. Datadog runs each failed job through an AI-powered error classifier. When the failure is identified as retriable, Datadog triggers a retry through the CI provider's API without manual intervention. -This reduces the number of pipelines developers manually re-run, shortens feedback loops, and keeps pipeline success metrics focused on non-transient failures. +Automatic retries reduce the number of pipelines that developers re-run by hand, shorten feedback loops, and keep pipeline success metrics focused on non-transient failures. ## How it works @@ -35,7 +35,7 @@ This reduces the number of pipelines developers manually re-run, shortens feedba - CI Visibility enabled for your [GitHub Actions][1] or [GitLab][2] integration. 
- [Datadog Source Code Integration][3] configured for the repositories where you want automatic retries. -- Automatic job retries enabled for your organization. Because this feature is in Preview, access is gated. Contact your Datadog account team to request enablement. +- Automatic job retries enabled for your organization (see the banner above for how to request access). ## Provider-specific behavior @@ -44,8 +44,8 @@ This reduces the number of pipelines developers manually re-run, shortens feedba Datadog performs **smart retries** on GitLab: only the specific job classified as retriable is re-run. Other failed jobs (that aren't classified as retriable) and passing jobs aren't affected. -- Retries are triggered per job as soon as the job finishes failing. -- Smart retries work with GitLab.com (SaaS) and with self-hosted GitLab instances reachable by the Datadog Source Code Integration. +- Retries are triggered per job, as soon as the job fails. +- Smart retries work with GitLab.com (SaaS) and self-hosted GitLab instances reachable by the Datadog Source Code Integration. - There is no additional CI cost beyond the retried job. {{% /tab %}} From 7148ff0fd54aa8d31dc9e1ce6819c442b801f3f9 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Alejandro=20Font=C3=A1n?= Date: Tue, 21 Apr 2026 13:32:45 +0200 Subject: [PATCH 4/4] [SDCI-2079] Add job log collection dependency to Requirements Automatic retries use the same AI error classifier as CI jobs failure analysis, which reads indexed CI job logs to decide whether a failure is transient. Adds the log collection dependency to the Requirements list with provider-specific setup links, plus a cross-reference to the failure analysis guide. 
--- .../continuous_integration/pipelines/automatic_retries.md | 6 ++++++ 1 file changed, 6 insertions(+) diff --git a/content/en/continuous_integration/pipelines/automatic_retries.md b/content/en/continuous_integration/pipelines/automatic_retries.md index 09e94b80045..48085a832bf 100644 --- a/content/en/continuous_integration/pipelines/automatic_retries.md +++ b/content/en/continuous_integration/pipelines/automatic_retries.md @@ -35,8 +35,11 @@ Automatic retries reduce the number of pipelines that developers re-run by hand, - CI Visibility enabled for your [GitHub Actions][1] or [GitLab][2] integration. - [Datadog Source Code Integration][3] configured for the repositories where you want automatic retries. +- Indexed CI job logs for those repositories (see [Collect job logs for GitHub Actions][4] or [Collect job logs for GitLab][5]). - Automatic job retries enabled for your organization (see the banner above for how to request access). +Automatic retries rely on the same AI error classifier used by [CI jobs failure analysis][6], which reads indexed job logs to decide whether a failure is transient. + ## Provider-specific behavior {{< tabs >}} @@ -76,3 +79,6 @@ The Datadog GitHub App's default permissions do not allow retries on protected b [1]: /continuous_integration/pipelines/github/ [2]: /continuous_integration/pipelines/gitlab/ [3]: /integrations/guide/source-code-integration/ +[4]: /continuous_integration/pipelines/github/#collect-job-logs +[5]: /continuous_integration/pipelines/gitlab/#collect-job-logs +[6]: /continuous_integration/guides/use_ci_jobs_failure_analysis/