Fix e2e_autoscaling CI failures by L3n41c · Pull Request #2632 · DataDog/datadog-operator

L3n41c · 2026-02-23T15:25:56Z

Summary

The e2e_autoscaling CI job has had a 0% success rate for 2+ weeks (32 errors out of 33 runs on main). This PR fixes three distinct bugs in the autoscaling e2e tests and their supporting infrastructure:

1. AWS STS token expiry during test suite (`69ba0c8`)

AWS credentials were captured once during SetupSuite() and stored as static strings. These STS session tokens expired after ~1 hour. By the time later test groups ran (~45 min in), all AWS API calls failed with HTTP 403 ExpiredToken.

Fix: Store the aws.Config object instead of static credential strings, and call Credentials.Retrieve() before each kubectl-datadog subprocess invocation to get fresh tokens.
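The difference between capturing credentials once and asking for them before each subprocess can be sketched with the standard library alone. This is only an illustration: `credentialsProvider`, `subprocessEnv`, and `rotatingProvider` are hypothetical names, and the interface is a simplified stand-in (no context, no expiry) for whatever the SDK's `Credentials.Retrieve()` returns.

```go
package main

import "fmt"

// credentials is a hypothetical, trimmed-down credential value holding
// only the fields a subprocess needs.
type credentials struct {
	AccessKeyID     string
	SecretAccessKey string
	SessionToken    string
}

// credentialsProvider is a hypothetical interface modelled on the
// Retrieve() call named in the PR description (simplified signature).
type credentialsProvider interface {
	Retrieve() (credentials, error)
}

// subprocessEnv asks the provider at call time, so each subprocess gets
// whatever the provider currently considers valid rather than strings
// captured once during suite setup.
func subprocessEnv(p credentialsProvider) ([]string, error) {
	c, err := p.Retrieve()
	if err != nil {
		return nil, fmt.Errorf("retrieving AWS credentials: %w", err)
	}
	return []string{
		"AWS_ACCESS_KEY_ID=" + c.AccessKeyID,
		"AWS_SECRET_ACCESS_KEY=" + c.SecretAccessKey,
		"AWS_SESSION_TOKEN=" + c.SessionToken,
	}, nil
}

// rotatingProvider is a test double that hands out a new token per call.
type rotatingProvider struct{ calls int }

func (r *rotatingProvider) Retrieve() (credentials, error) {
	r.calls++
	return credentials{
		AccessKeyID:     "AKIDEXAMPLE",
		SecretAccessKey: "secret",
		SessionToken:    fmt.Sprintf("token-%d", r.calls),
	}, nil
}

func main() {
	p := &rotatingProvider{}
	for i := 0; i < 2; i++ {
		env, err := subprocessEnv(p)
		if err != nil {
			panic(err)
		}
		fmt.Println(env[2]) // the session token seen by this invocation
	}
}
```

Each simulated invocation sees a different session token, which is the property the fix relies on.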

2. CloudFormation stack deletion failure due to orphaned IAM instance profiles (`542ba7dc`)

Karpenter dynamically creates IAM instance profiles at runtime and attaches the KarpenterNodeRole. During uninstall, Karpenter is killed (Helm uninstall) before cleaning up these instance profiles. The orphaned profiles still have the role attached, so CloudFormation cannot delete the role and the stack deletion fails with DELETE_FAILED — Cannot delete entity, must remove roles from instance profile first (Service: Iam, Status Code: 409).

Fix: Add an explicit deleteKarpenterInstanceProfiles step in the uninstall flow that lists all instance profiles associated with the KarpenterNodeRole, removes the role from each, then deletes them. This runs after Helm uninstall but before CloudFormation stack deletion.
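The list, detach, delete order described above can be sketched as follows. This is not the PR's implementation: `instanceProfileAPI` is a hypothetical narrowed interface with simplified signatures (plain strings, no context or request structs), and `fakeIAM` is an in-memory double that, as an assumption for illustration, refuses to delete a profile that still has a role attached.

```go
package main

import (
	"fmt"
	"sort"
)

// instanceProfileAPI is a hypothetical interface covering the three
// operations the cleanup needs.
type instanceProfileAPI interface {
	ListInstanceProfilesForRole(role string) ([]string, error)
	RemoveRoleFromInstanceProfile(profile, role string) error
	DeleteInstanceProfile(profile string) error
}

// deleteInstanceProfilesForRole detaches the role from every instance
// profile that references it, then deletes the profile.
func deleteInstanceProfilesForRole(api instanceProfileAPI, role string) error {
	profiles, err := api.ListInstanceProfilesForRole(role)
	if err != nil {
		return fmt.Errorf("listing instance profiles for %s: %w", role, err)
	}
	for _, p := range profiles {
		// Detach first: the role cannot be deleted while still attached.
		if err := api.RemoveRoleFromInstanceProfile(p, role); err != nil {
			return fmt.Errorf("removing %s from %s: %w", role, p, err)
		}
		if err := api.DeleteInstanceProfile(p); err != nil {
			return fmt.Errorf("deleting instance profile %s: %w", p, err)
		}
	}
	return nil
}

// fakeIAM maps profile name to attached role and records calls in order.
type fakeIAM struct {
	profiles map[string]string
	calls    []string
}

func (f *fakeIAM) ListInstanceProfilesForRole(role string) ([]string, error) {
	var out []string
	for name, r := range f.profiles {
		if r == role {
			out = append(out, name)
		}
	}
	sort.Strings(out) // deterministic order for the demo
	return out, nil
}

func (f *fakeIAM) RemoveRoleFromInstanceProfile(profile, role string) error {
	if f.profiles[profile] != role {
		return fmt.Errorf("role %s is not attached to %s", role, profile)
	}
	f.profiles[profile] = ""
	f.calls = append(f.calls, "remove:"+profile)
	return nil
}

func (f *fakeIAM) DeleteInstanceProfile(profile string) error {
	if f.profiles[profile] != "" {
		return fmt.Errorf("must remove roles from %s first", profile)
	}
	delete(f.profiles, profile)
	f.calls = append(f.calls, "delete:"+profile)
	return nil
}

func main() {
	f := &fakeIAM{profiles: map[string]string{
		"karpenter-a": "KarpenterNodeRole",
		"other":       "SomeOtherRole",
	}}
	if err := deleteInstanceProfilesForRole(f, "KarpenterNodeRole"); err != nil {
		panic(err)
	}
	fmt.Println(f.calls, len(f.profiles))
}
```

Profiles attached to other roles are left untouched, so the cleanup stays scoped to the role that CloudFormation needs to delete.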

3. CI trigger glob not matching nested files (`d0c3b72`)

The ** glob pattern in GitLab CI's Ruby File.fnmatch with FNM_PATHNAME only matches one directory level, not recursively. Changes to files in nested subdirectories (e.g. cmd/kubectl-datadog/autoscaling/cluster/uninstall/uninstall.go) did not auto-trigger the e2e_autoscaling job.

Fix: Change the glob from ** to **/* so nested file changes correctly trigger the job.
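To make the one-level versus recursive distinction concrete, here is a small Go approximation of the matching behavior described above. It is a sketch under assumptions, not Ruby's implementation: a `**` segment followed by further segments is treated as zero or more directories, any other segment (including a trailing `**`) matches exactly one path component, and dotfile rules are ignored. The `cmd/kubectl-datadog/` pattern prefix is hypothetical; the actual pattern in the CI configuration is not shown here.

```go
package main

import (
	"fmt"
	"path"
	"strings"
)

// matchSegments matches pattern segments against path segments.
func matchSegments(pat, name []string) bool {
	if len(pat) == 0 {
		return len(name) == 0
	}
	// "**" followed by more segments stands for zero or more directories.
	if pat[0] == "**" && len(pat) > 1 {
		for i := 0; i <= len(name); i++ {
			if matchSegments(pat[1:], name[i:]) {
				return true
			}
		}
		return false
	}
	if len(name) == 0 {
		return false
	}
	// Any other segment, including a trailing "**", matches one component:
	// path.Match never lets "*" cross a "/".
	ok, err := path.Match(pat[0], name[0])
	return err == nil && ok && matchSegments(pat[1:], name[1:])
}

// fnmatchPathname approximates pathname-aware glob matching.
func fnmatchPathname(pattern, name string) bool {
	return matchSegments(strings.Split(pattern, "/"), strings.Split(name, "/"))
}

func main() {
	const nested = "cmd/kubectl-datadog/autoscaling/cluster/uninstall/uninstall.go"
	fmt.Println(fnmatchPathname("cmd/kubectl-datadog/**", nested))   // false
	fmt.Println(fnmatchPathname("cmd/kubectl-datadog/**/*", nested)) // true
}
```

Under these assumptions the trailing `**` only reaches files directly inside the directory, while `**/*` also reaches files at any depth below it.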

Test plan

go vet and go build pass
CI validation workflow (make test) passes
CI check-golang-version passes
e2e_autoscaling job passes end-to-end (triggered manually, running)

🤖 Generated with Claude Code

The e2e_autoscaling suite has had a 0% success rate for 2+ weeks.

AWS credentials were captured once during SetupSuite and stored as static strings. By the time later test groups ran (~45 min in), the STS session tokens had expired, causing all AWS API calls to fail with HTTP 403 ExpiredToken.

Store the aws.Config instead and call Credentials.Retrieve() before each kubectl-datadog subprocess invocation to get fresh tokens from the SDK credential chain.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

codecov-commenter · 2026-02-23T16:50:39Z

Codecov Report

❌ Patch coverage is 0% with 33 lines in your changes missing coverage. Please review.
✅ Project coverage is 38.51%. Comparing base (e9a6dec) to head (49625e4).
⚠️ Report is 3 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| ...datadog/autoscaling/cluster/uninstall/uninstall.go | 0.00% | 32 Missing ⚠️ |
| ...adog/autoscaling/cluster/common/clients/clients.go | 0.00% | 1 Missing ⚠️ |

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #2632      +/-   ##
==========================================
+ Coverage   38.43%   38.51%   +0.07%     
==========================================
  Files         305      305              
  Lines       26276    26497     +221     
==========================================
+ Hits        10098    10204     +106     
- Misses      15425    15527     +102     
- Partials      753      766      +13

| Flag | Coverage Δ |
|---|---|
| unittests | `38.51% <0.00%> (+0.07%)` ⬆️ |


| Files with missing lines | Coverage Δ |
|---|---|
| ...adog/autoscaling/cluster/common/clients/clients.go | `0.00% <0.00%> (ø)` |
| ...datadog/autoscaling/cluster/uninstall/uninstall.go | `0.00% <0.00%> (ø)` |

... and 4 files with indirect coverage changes

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

…tion

Karpenter dynamically creates IAM instance profiles at runtime and attaches the KarpenterNodeRole to them. During uninstall, Karpenter is killed (Helm uninstall) before it finishes cleaning up these instance profiles. The orphaned instance profiles still have the KarpenterNodeRole attached, so CloudFormation cannot delete the role and the stack deletion fails with:

DELETE_FAILED: Cannot delete entity, must remove roles from instance profile first. (Service: Iam, Status Code: 409)

Add an explicit deleteKarpenterInstanceProfiles step in the uninstall flow that lists all instance profiles associated with the KarpenterNodeRole, removes the role from each, then deletes them. This runs after Helm uninstall but before CloudFormation stack deletion.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

The `go work sync` + `go mod tidy` in CI correctly marks aws-sdk-go-v2 as a direct dependency in test/e2e/go.mod since it is imported directly in the autoscaling e2e test.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

L3n41c · 2026-02-25T10:19:03Z

@codex review

chatgpt-codex-connector · 2026-02-25T10:24:58Z

Codex Review: Didn't find any major issues. Chef's kiss.


The `**` glob pattern in GitLab CI's Ruby File.fnmatch with FNM_PATHNAME only matches one directory level, not recursively. Change to `**/*` so that changes to files in nested subdirectories (e.g. cmd/kubectl-datadog/autoscaling/cluster/uninstall/uninstall.go) correctly auto-trigger the e2e_autoscaling job.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

AlexanderYastrebov

Nice!

The previous fix retrieved fresh STS tokens via Retrieve() before each subprocess call, but passed them as static env vars. If kubectl-datadog runs for 10+ minutes (e.g. waiting for CloudFormation stack creation), the static tokens expire mid-execution with HTTP 403 ExpiredToken.

Instead, propagate AWS_PROFILE and config file paths so the subprocess uses the same AssumeRole credential chain and can refresh its own tokens during long-running operations.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
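A minimal sketch of forwarding the profile and config locations, rather than resolved tokens, to a subprocess might look like this. `profileEnv` and `forwardedAWSVars` are hypothetical names, and the choice of `AWS_CONFIG_FILE` and `AWS_SHARED_CREDENTIALS_FILE` as the "config file paths" is an assumption; the commit message does not list the exact variables.

```go
package main

import (
	"fmt"
	"os"
)

// forwardedAWSVars lists the variables that tell a child process where to
// find its credential chain. Which paths the PR forwards is an assumption.
var forwardedAWSVars = []string{
	"AWS_PROFILE",
	"AWS_CONFIG_FILE",
	"AWS_SHARED_CREDENTIALS_FILE",
}

// profileEnv builds KEY=VALUE entries for the variables that are set.
// No access key or session token is copied, so the child process has to
// resolve, and can later refresh, its own credentials.
func profileEnv(lookup func(string) (string, bool)) []string {
	var env []string
	for _, name := range forwardedAWSVars {
		if v, ok := lookup(name); ok && v != "" {
			env = append(env, name+"="+v)
		}
	}
	return env
}

func main() {
	// In a test helper these entries would be appended to the Env of the
	// exec.Cmd that launches the subprocess.
	for _, kv := range profileEnv(os.LookupEnv) {
		fmt.Println(kv)
	}
}
```

Taking the lookup function as a parameter keeps the helper easy to exercise without touching the real process environment.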

The full suite (provisioning + 4 test groups + teardown) consistently exceeds the 60m limit. Bumping to 90m to avoid spurious timeouts.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

The `go test -timeout` flag was set to 55m via E2E_GO_TEST_TIMEOUT, killing the suite before the 90m GitLab job timeout. Introduce a dedicated E2E_AUTOSCALING_GO_TEST_TIMEOUT=80m variable so the autoscaling suite has enough headroom without affecting the regular e2e-tests target (which runs under a 40m GitLab job timeout).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
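The relationship between the two timeouts is simple arithmetic, sketched here with a hypothetical `headroom` helper: the `go test` timeout should sit below the job timeout so the test binary reports its own timeout before the CI job is killed, while still being long enough for the suite to finish.

```go
package main

import (
	"fmt"
	"time"
)

// headroom returns how much of the job timeout remains after the
// go test timeout has elapsed. Both arguments use Go duration syntax.
func headroom(goTestTimeout, jobTimeout string) (time.Duration, error) {
	inner, err := time.ParseDuration(goTestTimeout)
	if err != nil {
		return 0, fmt.Errorf("parsing go test timeout: %w", err)
	}
	outer, err := time.ParseDuration(jobTimeout)
	if err != nil {
		return 0, fmt.Errorf("parsing job timeout: %w", err)
	}
	return outer - inner, nil
}

func main() {
	// Before: 55m go test timeout inside a 90m job leaves 35m unused.
	before, err := headroom("55m", "90m")
	if err != nil {
		panic(err)
	}
	// After: 80m inside 90m lets the suite run longer and leaves 10m.
	after, err := headroom("80m", "90m")
	if err != nil {
		panic(err)
	}
	fmt.Println(before, after)
}
```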

L3n41c added the bug label Feb 23, 2026

L3n41c force-pushed the lenaic/fix-e2e-autoscaling-aws-creds-expiry branch from 14b4165 to 542ba7d on February 24, 2026 15:49

L3n41c added this to the v1.25.0 milestone Feb 24, 2026

Fix AI slop

abed0b6

L3n41c changed the title from ~~Fix e2e_autoscaling failures caused by expired AWS STS tokens~~ to Fix e2e_autoscaling CI failures on Feb 25, 2026

L3n41c marked this pull request as ready for review February 25, 2026 11:17

L3n41c requested a review from a team February 25, 2026 11:17

L3n41c requested review from a team as code owners February 25, 2026 11:17

AlexanderYastrebov approved these changes Feb 25, 2026


L3n41c force-pushed the lenaic/fix-e2e-autoscaling-aws-creds-expiry branch from 6df2b81 to e350f4b on February 25, 2026 13:29

Increase e2e_autoscaling job timeout from 60m to 90m

9e01a6d


lavigne958 approved these changes Feb 25, 2026


gh-worker-dd-devflow-36fce6 bot added the mergequeue-status: waiting label Feb 25, 2026

Merge branch 'main' into lenaic/fix-e2e-autoscaling-aws-creds-expiry

cf23e1a

tbavelier approved these changes Feb 25, 2026


gh-worker-dd-devflow-36fce6 bot added the mergequeue-status: queued, mergequeue-status: in_progress, and mergequeue-status: rejected labels and removed the mergequeue-status: waiting, mergequeue-status: queued, and mergequeue-status: in_progress labels on Feb 25, 2026

gh-worker-dd-devflow-36fce6 bot added the mergequeue-status: queued, mergequeue-status: in_progress, and mergequeue-status: removed labels and removed the mergequeue-status: rejected, mergequeue-status: queued, and mergequeue-status: in_progress labels on Feb 26, 2026

gh-worker-dd-devflow-36fce6 bot added the mergequeue-status: queued and mergequeue-status: in_progress labels and removed the mergequeue-status: removed and mergequeue-status: queued labels on Feb 26, 2026

gh-worker-dd-mergequeue-cf854d bot merged commit 1091668 into main Feb 26, 2026
54 checks passed

gh-worker-dd-devflow-36fce6 bot removed the mergequeue-status: in_progress label Feb 26, 2026

gh-worker-dd-mergequeue-cf854d bot deleted the lenaic/fix-e2e-autoscaling-aws-creds-expiry branch February 26, 2026 14:55

gh-worker-dd-devflow-36fce6 bot added the mergequeue-status: done label Feb 26, 2026

L3n41c mentioned this pull request Feb 27, 2026

Fix e2e: increase testInstall context timeout from 15m to 25m #2666

Merged


Fix e2e_autoscaling CI failures #2632
gh-worker-dd-mergequeue-cf854d[bot] merged 9 commits into main from lenaic/fix-e2e-autoscaling-aws-creds-expiry


5 participants
