1 change: 1 addition & 0 deletions content/en/continuous_integration/pipelines/_index.md
@@ -47,6 +47,7 @@ Select your CI provider to set up CI Visibility in Datadog:
| {{< ci-details title="Infrastructure correlation" >}}Correlation of host-level information for the Datadog Agent, CI pipelines, or job runners to CI pipeline execution data.{{< /ci-details >}} | | | {{< X >}} | | | {{< X >}} | {{< X >}} | {{< X >}} | | |
| {{< ci-details title="Running pipelines" >}}Identification of pipeline executions that are running with associated tracing.{{< /ci-details >}} | {{< X >}} | | | | | {{< X >}} | {{< X >}} | {{< X >}} | | {{< X >}} |
| {{< ci-details title="Partial retries" >}}Identification of partial retries (for example, when only a subset of jobs were retried).{{< /ci-details >}} | {{< X >}} | {{< X >}} | {{< X >}} | | {{< X >}} | {{< X >}} | {{< X >}} | | {{< X >}} | {{< X >}} |
| {{< ci-details title="Automatic job retries" >}}Preview. Datadog retries failed jobs classified as transient by its AI error model. <a href="https://docs.datadoghq.com/continuous_integration/pipelines/automatic_retries/">More info</a>.{{< /ci-details >}} | | | | | | {{< X >}} | {{< X >}} | | | |
| {{< ci-details title="Step granularity" >}}Step level spans are available for more granular visibility.{{< /ci-details >}} | | | | | {{< X >}} | {{< X >}} | | {{< X >}} <br /> (_Presented as job spans_) | | {{< X >}} |
| {{< ci-details title="Manual steps" >}}Identification of when there is a job with a manual approval phase in the overall pipeline.{{< /ci-details >}} | {{< X >}} | {{< X >}} | {{< X >}} | | {{< X >}} | {{< X >}} | {{< X >}} | {{< X >}} | | {{< X >}} |

@@ -0,0 +1,84 @@
---
title: Automatic Job Retries
further_reading:
- link: "/continuous_integration/pipelines"
tag: "Documentation"
text: "Explore Pipeline Execution Results and Performance"
- link: "/continuous_integration/pipelines/github/"
tag: "Documentation"
text: "Set up CI Visibility for GitHub Actions"
- link: "/continuous_integration/pipelines/gitlab/"
tag: "Documentation"
text: "Set up CI Visibility for GitLab"
- link: "/continuous_integration/troubleshooting/"
tag: "Documentation"
text: "Troubleshooting CI Visibility"
---

<div class="alert alert-info">Automatic job retries are in Preview. To request access, contact your Datadog account team.</div>

## Overview

Automatic job retries save developer time by re-running jobs whose failures are likely transient, such as network timeouts, infrastructure failures, or flaky tests. Failures caused by genuine code defects are not retried. Datadog runs each failed job through an AI-powered error classifier. When the classifier identifies the failure as retriable, Datadog triggers a retry through the CI provider's API without manual intervention.

Automatic retries reduce the number of pipelines that developers re-run by hand, shorten feedback loops, and keep pipeline success metrics focused on non-transient failures.

## How it works

1. A CI job fails in your pipeline.
2. Datadog's AI error classifier inspects the job's logs and error context to determine whether the failure is transient.
3. If the failure is classified as retriable, Datadog requests a retry through the provider's API.
4. Datadog caps the number of automatic retries per job (see [Limitations](#limitations)) to prevent infinite retry loops.
5. Datadog records the retry outcome on the original pipeline in CI Visibility.
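
The decision in steps 2 through 4 can be sketched as a small eligibility check. This is an illustrative sketch only, not Datadog's implementation: the `FailedJob` type, the classification labels, and the `should_retry` helper are hypothetical names, and the cap of one automatic retry is taken from the Limitations section of this page.

```python
from dataclasses import dataclass

# Assumption: the cap matches the "at most one time" rule in Limitations.
MAX_AUTOMATIC_RETRIES = 1


@dataclass
class FailedJob:
    """Hypothetical view of a failed CI job after classification."""
    name: str
    classification: str            # hypothetical labels: "transient" or "non_retriable"
    automatic_retries: int = 0     # automatic retries already issued for this job
    retried_elsewhere: bool = False  # retried manually or by provider-native rules


def should_retry(job: FailedJob) -> bool:
    """Return True when the job qualifies for an automatic retry."""
    if job.classification != "transient":
        return False  # for example, compilation errors or asserted test failures
    if job.retried_elsewhere:
        return False  # a manual or provider-native retry already happened
    return job.automatic_retries < MAX_AUTOMATIC_RETRIES
```

For example, `should_retry(FailedJob("integration-tests", "transient"))` returns `True`, while the same job with `automatic_retries=1` is skipped.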

## Requirements

- CI Visibility enabled for your [GitHub Actions][1] or [GitLab][2] integration.
- [Datadog Source Code Integration][3] configured for the repositories where you want automatic retries.
- Indexed CI job logs for those repositories (see [Collect job logs for GitHub Actions][4] or [Collect job logs for GitLab][5]).
- Automatic job retries enabled for your organization (see the banner above for how to request access).

Automatic retries rely on the same AI error classifier used by [CI jobs failure analysis][6], which reads indexed job logs to decide whether a failure is transient.

## Provider-specific behavior

{{< tabs >}}
{{% tab "GitLab" %}}

Datadog performs **smart retries** on GitLab: only the specific job classified as retriable is re-run. Passing jobs, and failed jobs that aren't classified as retriable, are not affected.

- Retries are triggered per job, as soon as the job fails.
- Smart retries work with GitLab.com (SaaS) and self-hosted GitLab instances reachable by the Datadog Source Code Integration.
- There is no additional CI cost beyond the retried job.
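
This page does not specify which API call Datadog issues. As a point of reference, GitLab's public REST API exposes a per-job retry endpoint (`POST /projects/:id/jobs/:job_id/retry`), which matches the per-job behavior described above. The following sketch only builds that request; the `gitlab_retry_request` helper is a hypothetical name, and authentication headers are omitted.

```python
from urllib.parse import quote


def gitlab_retry_request(base_url: str, project: str, job_id: int) -> tuple[str, str]:
    """Build the (method, URL) pair for retrying one GitLab job.

    `project` can be a numeric ID or a namespaced path; a path must be
    URL-encoded, so "/" becomes "%2F".
    """
    encoded_project = quote(str(project), safe="")
    url = f"{base_url.rstrip('/')}/api/v4/projects/{encoded_project}/jobs/{job_id}/retry"
    return "POST", url
```

Because the helper accepts a `base_url`, the same shape applies to GitLab.com and to a self-hosted instance.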

{{% /tab %}}
{{% tab "GitHub Actions" %}}

GitHub Actions imposes two provider-level limitations that shape how retries work:

- **Retries happen after the workflow finishes.** The GitHub API does not allow retrying an individual job while the rest of the workflow is still running. Datadog waits for the workflow to reach a final state before issuing retries.
- **All failed jobs are retried together.** The GitHub API does not support retrying a single job when other jobs in the workflow have also failed. Datadog reruns every failed job in the workflow through a single GitHub API call. This may increase your GitHub Actions compute usage.
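
The two constraints above can be sketched together: wait for the workflow run to finish, then issue one call that re-runs every failed job. GitHub's public REST API exposes `POST /repos/{owner}/{repo}/actions/runs/{run_id}/rerun-failed-jobs` for this purpose. The sketch below is an assumption about the general shape, not Datadog's implementation; `plan_github_retry` is a hypothetical helper, and authentication is omitted.

```python
from typing import Optional

GITHUB_API = "https://api.github.com"


def rerun_failed_jobs_url(owner: str, repo: str, run_id: int) -> str:
    """URL of GitHub's endpoint that re-runs all failed jobs in a workflow run."""
    return f"{GITHUB_API}/repos/{owner}/{repo}/actions/runs/{run_id}/rerun-failed-jobs"


def plan_github_retry(run_status: str, owner: str, repo: str, run_id: int) -> Optional[str]:
    """Return the URL to POST to, or None while the workflow run is still active.

    GitHub reports a finished workflow run with status "completed"; any other
    status means the retry has to wait.
    """
    if run_status != "completed":
        return None
    return rerun_failed_jobs_url(owner, repo, run_id)
```

A single request covers every failed job in the run, which is why compute usage can grow when several jobs failed for unrelated reasons.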

### Protected branches

The Datadog GitHub App's default permissions do not allow retries on protected branches. To enable automatic retries on a protected branch (for example, your default branch), grant the app Maintainer-level access. Review your organization's policies before expanding permissions.

{{% /tab %}}
{{< /tabs >}}

## Limitations

- Each logical job is retried at most one time.
- Jobs classified as non-retriable (for example, compilation errors or asserted test failures) are never retried.
- If a job has already been retried manually or by provider-native retry rules, Datadog does not issue an additional retry.

## Further reading

{{< partial name="whats-next/whats-next.html" >}}

[1]: /continuous_integration/pipelines/github/
[2]: /continuous_integration/pipelines/gitlab/
[3]: /integrations/guide/source-code-integration/
[4]: /continuous_integration/pipelines/github/#collect-job-logs
[5]: /continuous_integration/pipelines/gitlab/#collect-job-logs
[6]: /continuous_integration/guides/use_ci_jobs_failure_analysis/
2 changes: 2 additions & 0 deletions content/en/continuous_integration/pipelines/github.md
@@ -30,6 +30,7 @@ Set up CI Visibility for GitHub Actions to track the execution of your workflows
| [Running pipelines][2] | Running pipelines | View pipeline executions that are running. Queued or waiting pipelines show with status "Running" on Datadog. |
| [CI jobs failure analysis][23] | CI jobs failure analysis | Uses LLM models on relevant logs to analyze the root cause of failed CI jobs. |
| [Partial retries][3] | Partial pipelines | View partially retried pipeline executions. |
| [Automatic job retries][27] | Automatic job retries | Preview. Datadog retries failed jobs classified as transient by its AI error model. |
| Logs correlation | Logs correlation | Correlate pipeline and job spans to logs and enable [job log collection](#collect-job-logs). |
| Infrastructure metric correlation | Infrastructure metric correlation | Correlate jobs to [infrastructure host metrics][4] for GitHub jobs. |
| [Custom tags][5] [and measures at runtime][6] | Custom tags and measures at runtime | Configure [custom tags and measures][7] at runtime. |
@@ -158,3 +159,4 @@ The **CI Pipeline List** page shows data for only the default branch of each rep
[24]: /continuous_integration/guides/identify_highest_impact_jobs_with_critical_path/
[25]: /glossary/#pipeline-execution-time
[26]: /continuous_integration/guides/use_ci_jobs_failure_analysis/#using-pr-comments
[27]: /continuous_integration/pipelines/automatic_retries/
2 changes: 2 additions & 0 deletions content/en/continuous_integration/pipelines/gitlab.md
@@ -28,6 +28,7 @@ Set up CI Visibility for GitLab to collect data on your pipeline executions, ana
| [CI jobs failure analysis][28] | CI jobs failure analysis | Uses LLM models on relevant logs to analyze the root cause of failed CI jobs. |
| [Filter CI Jobs on the critical path][29] | Filter CI Jobs on the critical path | Filter by jobs on the critical path. |
| [Partial retries][19] | Partial pipelines | View partially retried pipeline executions. |
| [Automatic job retries][31] | Automatic job retries | Preview. Datadog retries failed jobs classified as transient by its AI error model. |
| [Manual steps][20] | Manual steps | View manually triggered pipelines. |
| [Queue time][21] | Queue time | View the amount of time pipeline jobs sit in the queue before processing. |
| Logs correlation | Logs correlation | Correlate pipeline spans to logs and enable [job log collection][12]. |
@@ -466,3 +467,4 @@ The **CI Pipeline List** page shows data for only the default branch of each rep
[28]: /continuous_integration/guides/use_ci_jobs_failure_analysis/
[29]: /continuous_integration/guides/identify_highest_impact_jobs_with_critical_path/
[30]: /continuous_integration/guides/use_ci_jobs_failure_analysis/#using-pr-comments
[31]: /continuous_integration/pipelines/automatic_retries/