Fix e2e_autoscaling CI failures #2632
gh-worker-dd-mergequeue-cf854d[bot] merged 9 commits into main
The e2e_autoscaling suite has had a 0% success rate for 2+ weeks. AWS credentials were captured once during SetupSuite and stored as static strings. By the time later test groups ran (~45 min in), the STS session tokens had expired, causing all AWS API calls to fail with HTTP 403 ExpiredToken. Store the aws.Config instead and call Credentials.Retrieve() before each kubectl-datadog subprocess invocation to get fresh tokens from the SDK credential chain. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Codecov Report
❌ Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## main #2632 +/- ##
==========================================
+ Coverage 38.43% 38.51% +0.07%
==========================================
Files 305 305
Lines 26276 26497 +221
==========================================
+ Hits 10098 10204 +106
- Misses 15425 15527 +102
- Partials 753 766 +13
... and 4 files with indirect coverage changes
…tion Karpenter dynamically creates IAM instance profiles at runtime and attaches the KarpenterNodeRole to them. During uninstall, Karpenter is killed (Helm uninstall) before it finishes cleaning up these instance profiles. The orphaned instance profiles still have the KarpenterNodeRole attached, so CloudFormation cannot delete the role and the stack deletion fails with: DELETE_FAILED — Cannot delete entity, must remove roles from instance profile first. (Service: Iam, Status Code: 409) Add an explicit deleteKarpenterInstanceProfiles step in the uninstall flow that lists all instance profiles associated with the KarpenterNodeRole, removes the role from each, then deletes them. This runs after Helm uninstall but before CloudFormation stack deletion. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
14b4165 to 542ba7d
The `go work sync` + `go mod tidy` in CI correctly marks aws-sdk-go-v2 as a direct dependency in test/e2e/go.mod since it is imported directly in the autoscaling e2e test. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@codex review
Codex Review: Didn't find any major issues. Chef's kiss.
The `**` glob pattern in GitLab CI's Ruby File.fnmatch with FNM_PATHNAME only matches one directory level, not recursively. Change to `**/*` so that changes to files in nested subdirectories (e.g. cmd/kubectl-datadog/autoscaling/cluster/uninstall/uninstall.go) correctly auto-trigger the e2e_autoscaling job. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
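The single-level behaviour can be illustrated in Go, with the caveat that this is an analogy and not GitLab's matcher: Go's `path.Match` has no recursive wildcard, so `**` there acts like `*` and never crosses a `/`, which is the same effect the commit attributes to a bare `**` under `FNM_PATHNAME`. The pattern `cmd/kubectl-datadog/**` and the top-level file name are assumptions made for the example; only the nested path comes from the commit message.

```go
package main

import (
	"fmt"
	"path"
)

// matches reports whether name matches pattern under path.Match rules,
// where wildcards never match a '/' separator.
func matches(pattern, name string) bool {
	ok, err := path.Match(pattern, name)
	return err == nil && ok
}

func main() {
	pattern := "cmd/kubectl-datadog/**" // hypothetical trigger pattern

	// A file directly under the directory matches.
	fmt.Println(matches(pattern, "cmd/kubectl-datadog/main.go"))

	// A file several directories down does not, so a change to it would
	// not trigger a job keyed on this pattern.
	fmt.Println(matches(pattern, "cmd/kubectl-datadog/autoscaling/cluster/uninstall/uninstall.go"))
}
```

Go cannot express the `**/*` form from the fix, so the sketch only demonstrates the failure mode, not the corrected pattern.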
The previous fix retrieved fresh STS tokens via Retrieve() before each subprocess call, but passed them as static env vars. If kubectl-datadog runs for 10+ minutes (e.g. waiting for CloudFormation stack creation), the static tokens expire mid-execution with HTTP 403 ExpiredToken. Instead, propagate AWS_PROFILE and config file paths so the subprocess uses the same AssumeRole credential chain and can refresh its own tokens during long-running operations. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
6df2b81 to e350f4b
The full suite (provisioning + 4 test groups + teardown) consistently exceeds the 60m limit. Bumping to 90m to avoid spurious timeouts. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The `go test -timeout` flag was set to 55m via E2E_GO_TEST_TIMEOUT, killing the suite before the 90m GitLab job timeout. Introduce a dedicated E2E_AUTOSCALING_GO_TEST_TIMEOUT=80m variable so the autoscaling suite has enough headroom without affecting the regular e2e-tests target (which runs under a 40m GitLab job timeout). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Summary
The `e2e_autoscaling` CI job has had a 0% success rate for 2+ weeks (32 errors out of 33 runs on `main`). This PR fixes three distinct bugs in the autoscaling e2e tests and their supporting infrastructure:
1. AWS STS token expiry during test suite (69ba0c8)
AWS credentials were captured once during `SetupSuite()` and stored as static strings. These STS session tokens expired after ~1 hour. By the time later test groups ran (~45 min in), all AWS API calls failed with `HTTP 403 ExpiredToken`.
Fix: Store the `aws.Config` object instead of static credential strings, and call `Credentials.Retrieve()` before each `kubectl-datadog` subprocess invocation to get fresh tokens.
2. CloudFormation stack deletion failure due to orphaned IAM instance profiles (542ba7dc)
Karpenter dynamically creates IAM instance profiles at runtime and attaches the `KarpenterNodeRole`. During uninstall, Karpenter is killed (Helm uninstall) before cleaning up these instance profiles. The orphaned profiles still have the role attached, so CloudFormation cannot delete the role and the stack deletion fails with `DELETE_FAILED — Cannot delete entity, must remove roles from instance profile first (Service: Iam, Status Code: 409)`.
Fix: Add an explicit `deleteKarpenterInstanceProfiles` step in the uninstall flow that lists all instance profiles associated with the `KarpenterNodeRole`, removes the role from each, then deletes them. This runs after Helm uninstall but before CloudFormation stack deletion.
3. CI trigger glob not matching nested files (d0c3b72)
The `**` glob pattern in GitLab CI's Ruby `File.fnmatch` with `FNM_PATHNAME` only matches one directory level, not recursively. Changes to files in nested subdirectories (e.g. `cmd/kubectl-datadog/autoscaling/cluster/uninstall/uninstall.go`) did not auto-trigger the `e2e_autoscaling` job.
Fix: Change the glob from `**` to `**/*` so nested file changes correctly trigger the job.
Test plan
- `go vet` and `go build` pass
- `validation` workflow (`make test`) passes
- `check-golang-version` passes
- `e2e_autoscaling` job passes end-to-end (triggered manually, running)
🤖 Generated with Claude Code