
Fix e2e_autoscaling CI failures#2632

Merged
gh-worker-dd-mergequeue-cf854d[bot] merged 9 commits into main from lenaic/fix-e2e-autoscaling-aws-creds-expiry
Feb 26, 2026


L3n41c (Member) commented Feb 23, 2026

Summary

The e2e_autoscaling CI job has had a 0% success rate for 2+ weeks (32 errors out of 33 runs on main). This PR fixes three distinct bugs in the autoscaling e2e tests and their supporting infrastructure:

1. AWS STS token expiry during test suite (69ba0c8)

AWS credentials were captured once during SetupSuite() and stored as static strings. These STS session tokens expired after ~1 hour. By the time later test groups ran (~45 min in), all AWS API calls failed with HTTP 403 ExpiredToken.

Fix: Store the aws.Config object instead of static credential strings, and call Credentials.Retrieve() before each kubectl-datadog subprocess invocation to get fresh tokens.

2. CloudFormation stack deletion failure due to orphaned IAM instance profiles (542ba7dc)

Karpenter dynamically creates IAM instance profiles at runtime and attaches the KarpenterNodeRole. During uninstall, Karpenter is killed (Helm uninstall) before cleaning up these instance profiles. The orphaned profiles still have the role attached, so CloudFormation cannot delete the role and the stack deletion fails with DELETE_FAILED — Cannot delete entity, must remove roles from instance profile first (Service: Iam, Status Code: 409).

Fix: Add an explicit deleteKarpenterInstanceProfiles step in the uninstall flow that lists all instance profiles associated with the KarpenterNodeRole, removes the role from each, then deletes them. This runs after Helm uninstall but before CloudFormation stack deletion.

3. CI trigger glob not matching nested files (d0c3b72)

GitLab CI matches change patterns with Ruby's File.fnmatch and the FNM_PATHNAME flag, under which the ** glob pattern only matches one directory level, not recursively. As a result, changes to files in nested subdirectories (e.g. cmd/kubectl-datadog/autoscaling/cluster/uninstall/uninstall.go) did not auto-trigger the e2e_autoscaling job.

Fix: Change the glob from ** to **/* so nested file changes correctly trigger the job.
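For intuition only: Go's `path.Match` is not the matcher GitLab uses and has no `**` syntax, but its `*` has the same "never crosses a `/`" property as the single-level behaviour described above. The sketch below shows such a wildcard accepting a direct child and rejecting the nested file from this PR. The `matches` helper and the `main.go` path are made up for the example.

```go
package main

import (
	"fmt"
	"path"
)

// matches reports whether name matches pattern, treating a malformed
// pattern as "no match".
func matches(pattern, name string) bool {
	ok, err := path.Match(pattern, name)
	return err == nil && ok
}

func main() {
	const pattern = "cmd/kubectl-datadog/*"
	// A direct child is matched by a single-level wildcard.
	fmt.Println(matches(pattern, "cmd/kubectl-datadog/main.go"))
	// A file several directories down is not, because "*" stops at "/".
	fmt.Println(matches(pattern, "cmd/kubectl-datadog/autoscaling/cluster/uninstall/uninstall.go"))
}
```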

Test plan

  • go vet and go build pass
  • CI validation workflow (make test) passes
  • CI check-golang-version passes
  • e2e_autoscaling job passes end-to-end (triggered manually, running)

🤖 Generated with Claude Code

The e2e_autoscaling suite has had a 0% success rate for 2+ weeks.
AWS credentials were captured once during SetupSuite and stored as
static strings. By the time later test groups ran (~45 min in), the
STS session tokens had expired, causing all AWS API calls to fail
with HTTP 403 ExpiredToken.

Store the aws.Config instead and call Credentials.Retrieve() before
each kubectl-datadog subprocess invocation to get fresh tokens from
the SDK credential chain.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@L3n41c L3n41c added the bug label Feb 23, 2026
codecov-commenter commented Feb 23, 2026

Codecov Report

❌ Patch coverage is 0% with 33 lines in your changes missing coverage. Please review.
✅ Project coverage is 38.51%. Comparing base (e9a6dec) to head (49625e4).
⚠️ Report is 3 commits behind head on main.

Files with missing lines                                 Patch %   Lines
...datadog/autoscaling/cluster/uninstall/uninstall.go    0.00%     32 Missing ⚠️
...adog/autoscaling/cluster/common/clients/clients.go    0.00%     1 Missing ⚠️

@@            Coverage Diff             @@
##             main    #2632      +/-   ##
==========================================
+ Coverage   38.43%   38.51%   +0.07%     
==========================================
  Files         305      305              
  Lines       26276    26497     +221     
==========================================
+ Hits        10098    10204     +106     
- Misses      15425    15527     +102     
- Partials      753      766      +13     
Flag         Coverage Δ
unittests    38.51% <0.00%> (+0.07%) ⬆️

Files with missing lines                                 Coverage Δ
...adog/autoscaling/cluster/common/clients/clients.go    0.00% <0.00%> (ø)
...datadog/autoscaling/cluster/uninstall/uninstall.go    0.00% <0.00%> (ø)

... and 4 files with indirect coverage changes



Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
…tion

Karpenter dynamically creates IAM instance profiles at runtime and
attaches the KarpenterNodeRole to them. During uninstall, Karpenter
is killed (Helm uninstall) before it finishes cleaning up these
instance profiles. The orphaned instance profiles still have the
KarpenterNodeRole attached, so CloudFormation cannot delete the role
and the stack deletion fails with:

  DELETE_FAILED — Cannot delete entity, must remove roles from
  instance profile first. (Service: Iam, Status Code: 409)

Add an explicit deleteKarpenterInstanceProfiles step in the uninstall
flow that lists all instance profiles associated with the
KarpenterNodeRole, removes the role from each, then deletes them.
This runs after Helm uninstall but before CloudFormation stack deletion.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@L3n41c L3n41c force-pushed the lenaic/fix-e2e-autoscaling-aws-creds-expiry branch from 14b4165 to 542ba7d February 24, 2026 15:49

The `go work sync` + `go mod tidy` in CI correctly marks
aws-sdk-go-v2 as a direct dependency in test/e2e/go.mod since
it is imported directly in the autoscaling e2e test.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@L3n41c L3n41c added this to the v1.25.0 milestone Feb 24, 2026
L3n41c (Member, Author) commented Feb 25, 2026

@codex review

@chatgpt-codex-connector

Codex Review: Didn't find any major issues. Chef's kiss.


The `**` glob pattern in GitLab CI's Ruby File.fnmatch with FNM_PATHNAME
only matches one directory level, not recursively. Change to `**/*` so
that changes to files in nested subdirectories (e.g.
cmd/kubectl-datadog/autoscaling/cluster/uninstall/uninstall.go) correctly
auto-trigger the e2e_autoscaling job.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@L3n41c L3n41c changed the title from "Fix e2e_autoscaling failures caused by expired AWS STS tokens" to "Fix e2e_autoscaling CI failures" Feb 25, 2026
@L3n41c L3n41c marked this pull request as ready for review February 25, 2026 11:17
@L3n41c L3n41c requested a review from a team February 25, 2026 11:17
@L3n41c L3n41c requested review from a team as code owners February 25, 2026 11:17
AlexanderYastrebov (Contributor) left a comment

Nice!

The previous fix retrieved fresh STS tokens via Retrieve() before each
subprocess call, but passed them as static env vars. If kubectl-datadog
runs for 10+ minutes (e.g. waiting for CloudFormation stack creation),
the static tokens expire mid-execution with HTTP 403 ExpiredToken.

Instead, propagate AWS_PROFILE and config file paths so the subprocess
uses the same AssumeRole credential chain and can refresh its own
tokens during long-running operations.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
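
A rough sketch of the propagation approach from the commit message above, assuming the subprocess is started with a minimal environment. `AWS_PROFILE`, `AWS_CONFIG_FILE` and `AWS_SHARED_CREDENTIALS_FILE` are the standard variable names; `profileEnv` and `newCommand` are hypothetical helpers, and the real test code may build its environment differently.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
)

// propagated lists the variables that let a child process rebuild the
// same credential chain instead of inheriting short-lived static tokens.
var propagated = []string{"AWS_PROFILE", "AWS_CONFIG_FILE", "AWS_SHARED_CREDENTIALS_FILE"}

// profileEnv returns KEY=VALUE entries for every propagated variable
// that lookup reports as set and non-empty.
func profileEnv(lookup func(string) (string, bool)) []string {
	var env []string
	for _, key := range propagated {
		if value, ok := lookup(key); ok && value != "" {
			env = append(env, key+"="+value)
		}
	}
	return env
}

// newCommand builds a command whose environment carries PATH plus the
// profile-related variables, and no static session token.
func newCommand(name string, args ...string) *exec.Cmd {
	cmd := exec.Command(name, args...)
	cmd.Env = append([]string{"PATH=" + os.Getenv("PATH")}, profileEnv(os.LookupEnv)...)
	return cmd
}

func main() {
	// The command is only constructed here, never started.
	cmd := newCommand("kubectl-datadog")
	for _, kv := range cmd.Env {
		fmt.Println(kv)
	}
}
```

Because the child resolves credentials itself, a refresh that becomes necessary ten minutes into a CloudFormation wait happens inside the child rather than depending on the parent.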
@L3n41c L3n41c force-pushed the lenaic/fix-e2e-autoscaling-aws-creds-expiry branch from 6df2b81 to e350f4b February 25, 2026 13:29

The full suite (provisioning + 4 test groups + teardown) consistently
exceeds the 60m limit. Bumping to 90m to avoid spurious timeouts.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The `go test -timeout` flag was set to 55m via E2E_GO_TEST_TIMEOUT,
killing the suite before the 90m GitLab job timeout. Introduce a
dedicated E2E_AUTOSCALING_GO_TEST_TIMEOUT=80m variable so the
autoscaling suite has enough headroom without affecting the regular
e2e-tests target (which runs under a 40m GitLab job timeout).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Labels

bug (Something isn't working), mergequeue-status: done


5 participants