diff --git a/content/en/continuous_integration/pipelines/_index.md b/content/en/continuous_integration/pipelines/_index.md
index 3a760a89f1a..4155c8b33dc 100644
--- a/content/en/continuous_integration/pipelines/_index.md
+++ b/content/en/continuous_integration/pipelines/_index.md
@@ -47,6 +47,7 @@ Select your CI provider to set up CI Visibility in Datadog:
 | {{< ci-details title="Infrastructure correlation" >}}Correlation of host-level information for the Datadog Agent, CI pipelines, or job runners to CI pipeline execution data.{{< /ci-details >}} | | | {{< X >}} | | | {{< X >}} | {{< X >}} | {{< X >}} | | |
 | {{< ci-details title="Running pipelines" >}}Identification of pipelines executions that are running with associated tracing.{{< /ci-details >}} | {{< X >}} | | | | | {{< X >}} | {{< X >}} | {{< X >}} | | {{< X >}} |
 | {{< ci-details title="Partial retries" >}}Identification of partial retries (for example, when only a subset of jobs were retried).{{< /ci-details >}} | {{< X >}} | {{< X >}} | {{< X >}} | | {{< X >}} | {{< X >}} | {{< X >}} | | {{< X >}} | {{< X >}} |
+| {{< ci-details title="Automatic job retries" >}}Preview. Datadog retries failed jobs classified as transient by its AI error model. More info.{{< /ci-details >}} | | | | | | {{< X >}} | {{< X >}} | | | |
 | {{< ci-details title="Step granularity" >}}Step level spans are available for more granular visibility.{{< /ci-details >}} | | | | | {{< X >}} | {{< X >}} | | {{< X >}} (_Presented as job spans_) | | {{< X >}} |
 | {{< ci-details title="Manual steps" >}}Identification of when there is a job with a manual approval phase in the overall pipeline.{{< /ci-details >}} | {{< X >}} | {{< X >}} | {{< X >}} | | {{< X >}} | {{< X >}} | {{< X >}} | {{< X >}} | | {{< X >}} |
diff --git a/content/en/continuous_integration/pipelines/automatic_retries.md b/content/en/continuous_integration/pipelines/automatic_retries.md
new file mode 100644
index 00000000000..48085a832bf
--- /dev/null
+++ b/content/en/continuous_integration/pipelines/automatic_retries.md
@@ -0,0 +1,84 @@
+---
+title: Automatic Job Retries
+further_reading:
+  - link: "/continuous_integration/pipelines"
+    tag: "Documentation"
+    text: "Explore Pipeline Execution Results and Performance"
+  - link: "/continuous_integration/pipelines/github/"
+    tag: "Documentation"
+    text: "Set up CI Visibility for GitHub Actions"
+  - link: "/continuous_integration/pipelines/gitlab/"
+    tag: "Documentation"
+    text: "Set up CI Visibility for GitLab"
+  - link: "/continuous_integration/troubleshooting/"
+    tag: "Documentation"
+    text: "Troubleshooting CI Visibility"
+---
+
+Automatic job retries are in Preview. To request access, contact your Datadog account team.
+
+## Overview
+
+Automatic job retries save developer time by re-running failures that are likely transient, such as network timeouts, infrastructure failures, or flaky tests. Genuine code defects are not retried. Datadog runs each failed job through an AI-powered error classifier. When the failure is identified as retriable, Datadog triggers a retry through the CI provider's API without manual intervention.
+
+Automatic retries reduce the number of pipelines that developers re-run by hand, shorten feedback loops, and keep pipeline success metrics focused on non-transient failures.
+
+## How it works
+
+1. A CI job fails in your pipeline.
+2. Datadog's AI error classifier inspects the job's logs and error context to determine whether the failure is transient.
+3. If the failure is classified as retriable, Datadog requests a retry through the provider's API.
+4. Datadog retries each job up to a maximum number of attempts to prevent infinite retry loops.
+5. Datadog records the retry outcome on the original pipeline in CI Visibility.
+
+## Requirements
+
+- CI Visibility enabled for your [GitHub Actions][1] or [GitLab][2] integration.
+- [Datadog Source Code Integration][3] configured for the repositories where you want automatic retries.
+- Indexed CI job logs for those repositories (see [Collect job logs for GitHub Actions][4] or [Collect job logs for GitLab][5]).
+- Automatic job retries enabled for your organization (see the banner above for how to request access).
+
+Automatic retries rely on the same AI error classifier used by [CI jobs failure analysis][6], which reads indexed job logs to decide whether a failure is transient.
+
+## Provider-specific behavior
+
+{{< tabs >}}
+{{% tab "GitLab" %}}
+
+Datadog performs **smart retries** on GitLab: only the specific job classified as retriable is re-run. Passing jobs and failed jobs that aren't classified as retriable are not affected.
+
+- Retries are triggered per job, as soon as the job fails.
+- Smart retries work with GitLab.com (SaaS) and self-hosted GitLab instances reachable by the Datadog Source Code Integration.
+- There is no additional CI cost beyond the retried job.
+
+{{% /tab %}}
+{{% tab "GitHub Actions" %}}
+
+GitHub Actions imposes two provider-level limitations that shape how retries work:
+
+- **Retries happen after the workflow finishes.** The GitHub API does not allow retrying an individual job while the rest of the workflow is still running. Datadog waits for the workflow to reach a final state before issuing retries.
+- **All failed jobs are retried together.** The GitHub API does not support retrying a single job when other jobs in the workflow have also failed. Datadog reruns every failed job in the workflow through a single GitHub API call. This may increase your GitHub Actions compute usage.
+
+### Protected branches
+
+The Datadog GitHub App's default permissions do not allow retries on protected branches. To enable automatic retries on a protected branch (for example, your default branch), grant the app Maintainer-level access. Review your organization's policies before expanding permissions.
+
+{{% /tab %}}
+{{< /tabs >}}
+
+## Limitations
+
+- Each logical job is retried at most once.
+- Jobs classified as non-retriable (for example, compilation errors or asserted test failures) are never retried.
+- If a job has already been retried manually or by provider-native retry rules, Datadog does not issue an additional retry.
+
+## Further reading
+
+{{< partial name="whats-next/whats-next.html" >}}
+
+[1]: /continuous_integration/pipelines/github/
+[2]: /continuous_integration/pipelines/gitlab/
+[3]: /integrations/guide/source-code-integration/
+[4]: /continuous_integration/pipelines/github/#collect-job-logs
+[5]: /continuous_integration/pipelines/gitlab/#collect-job-logs
+[6]: /continuous_integration/guides/use_ci_jobs_failure_analysis/
diff --git a/content/en/continuous_integration/pipelines/github.md b/content/en/continuous_integration/pipelines/github.md
index b6320e0ca6b..fbdf8028b67 100644
--- a/content/en/continuous_integration/pipelines/github.md
+++ b/content/en/continuous_integration/pipelines/github.md
@@ -30,6 +30,7 @@ Set up CI Visibility for GitHub Actions to track the execution of your workflows
 | [Running pipelines][2] | Running pipelines | View pipeline executions that are running. Queued or waiting pipelines show with status "Running" on Datadog. |
 | [CI jobs failure analysis][23] | CI jobs failure analysis | Uses LLM models on relevant logs to analyze the root cause of failed CI jobs. |
 | [Partial retries][3] | Partial pipelines | View partially retried pipeline executions. |
+| [Automatic job retries][27] | Automatic job retries | Preview. Datadog retries failed jobs classified as transient by its AI error model. |
 | Logs correlation | Logs correlation | Correlate pipeline and job spans to logs and enable [job log collection](#collect-job-logs). |
 | Infrastructure metric correlation | Infrastructure metric correlation | Correlate jobs to [infrastructure host metrics][4] for GitHub jobs. |
 | [Custom tags][5] [and measures at runtime][6] | Custom tags and measures at runtime | Configure [custom tags and measures][7] at runtime. |
@@ -158,3 +159,4 @@ The **CI Pipeline List** page shows data for only the default branch of each rep
 [24]: /continuous_integration/guides/identify_highest_impact_jobs_with_critical_path/
 [25]: /glossary/#pipeline-execution-time
 [26]: /continuous_integration/guides/use_ci_jobs_failure_analysis/#using-pr-comments
+[27]: /continuous_integration/pipelines/automatic_retries/
diff --git a/content/en/continuous_integration/pipelines/gitlab.md b/content/en/continuous_integration/pipelines/gitlab.md
index df397f6232b..66d8357bd54 100644
--- a/content/en/continuous_integration/pipelines/gitlab.md
+++ b/content/en/continuous_integration/pipelines/gitlab.md
@@ -28,6 +28,7 @@ Set up CI Visibility for GitLab to collect data on your pipeline executions, ana
 | [CI jobs failure analysis][28] | CI jobs failure analysis | Uses LLM models on relevant logs to analyze the root cause of failed CI jobs. |
 | [Filter CI Jobs on the critical path][29] | Filter CI Jobs on the critical path | Filter by jobs on the critical path. |
 | [Partial retries][19] | Partial pipelines | View partially retried pipeline executions. |
+| [Automatic job retries][31] | Automatic job retries | Preview. Datadog retries failed jobs classified as transient by its AI error model. |
 | [Manual steps][20] | Manual steps | View manually triggered pipelines. |
 | [Queue time][21] | Queue time | View the amount of time pipeline jobs sit in the queue before processing. |
 | Logs correlation | Logs correlation | Correlate pipeline spans to logs and enable [job log collection][12]. |
@@ -466,3 +467,4 @@ The **CI Pipeline List** page shows data for only the default branch of each rep
 [28]: /continuous_integration/guides/use_ci_jobs_failure_analysis/
 [29]: /continuous_integration/guides/identify_highest_impact_jobs_with_critical_path/
 [30]: /continuous_integration/guides/use_ci_jobs_failure_analysis/#using-pr-comments
+[31]: /continuous_integration/pipelines/automatic_retries/
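
Review note: the retry rules that `automatic_retries.md` describes under "Provider-specific behavior" and "Limitations" can be sketched in a few lines of Python. This is an illustration only and not Datadog's implementation. `Provider`, `FailedJob`, and `plan_retries` are hypothetical names, and the classifier verdict is assumed to be a plain boolean. The GitHub Actions branch also assumes that a single eligible job is enough to trigger the workflow-wide rerun, which the page does not state.

```python
from dataclasses import dataclass
from enum import Enum


class Provider(Enum):
    GITLAB = "gitlab"
    GITHUB_ACTIONS = "github_actions"


@dataclass(frozen=True)
class FailedJob:
    """A failed CI job. Hypothetical model, not a Datadog type."""

    name: str
    transient: bool  # classifier verdict (assumed to be a boolean)
    already_retried: bool = False  # manual, provider-native, or earlier automatic retry


def plan_retries(provider, failed_jobs, workflow_finished=True):
    """Return the names of the jobs that a retry request would re-run."""
    # Limitations: non-retriable failures are never retried, and a job that
    # has already been retried does not get an additional retry.
    eligible = [j for j in failed_jobs if j.transient and not j.already_retried]
    if not eligible:
        return []

    if provider is Provider.GITLAB:
        # GitLab: per-job retry, so only the retriable jobs are re-run.
        return [j.name for j in eligible]

    # GitHub Actions: wait until the workflow reaches a final state, then
    # re-run every failed job together (assumed trigger: any eligible job).
    if not workflow_finished:
        return []
    return [j.name for j in failed_jobs]
```

Calling `plan_retries(Provider.GITLAB, jobs)` returns only the transient, not-yet-retried jobs, while the same list under `Provider.GITHUB_ACTIONS` returns nothing until `workflow_finished` is true and then returns every failed job. That difference is the reason the page warns about extra GitHub Actions compute usage.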