refactor(linux): only start secure-tls-bootstrap.service via kubelet WantedBy= #8632

Open
cameronmeissner wants to merge 2 commits into main from cameissner/stls-defer-startup

Conversation

@cameronmeissner
Contributor

What this PR does / why we need it:

Only start secure-tls-bootstrap.service via kubelet WantedBy=. This ensures that we only start the bootstrap process after we've established outbound connectivity with the cluster's API server, which should generally increase secure TLS bootstrapping QoS and lower bootstrapping latency.

Which issue(s) this PR fixes:

Fixes #

Contributor

Copilot AI left a comment

Pull request overview

This PR refactors the Linux CSE secure TLS bootstrapping flow so secure-tls-bootstrap.service is enabled with WantedBy=kubelet.service (and ordered Before=kubelet.service) rather than being explicitly started during CSE. This aligns secure TLS bootstrapping execution with kubelet startup timing, after the CSE’s API server connectivity validation.

Changes:

  • Rename configureAndStartSecureTLSBootstrapping → configureAndEnableSecureTLSBootstrapping and switch behavior from “enable+start” to “enable only”.
  • Update the secure TLS bootstrapping systemd drop-in to include an [Install] section with WantedBy=kubelet.service.
  • Adjust error code naming and ShellSpec coverage to reflect the new enable-only behavior.
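
The enable-only flow summarized above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the PR's actual code: the function names `write_stls_dropin` and `enable_only`, the drop-in file name `10-kubelet-wantedby.conf`, and the `SYSTEMCTL` override variable are all hypothetical. Only the `Before=kubelet.service` and `WantedBy=kubelet.service` settings and the use of `systemctl enable` without an explicit start come from the review text.

```shell
#!/usr/bin/env bash
# Illustrative sketch only. Function names, the drop-in file name, and the
# SYSTEMCTL override are assumptions and do not come from the PR itself.

# write_stls_dropin DIR
# Writes a drop-in into DIR that orders the service before kubelet.service
# and declares kubelet.service as the unit that wants it.
write_stls_dropin() {
    local dropin_dir="$1"
    mkdir -p "${dropin_dir}" || return 1
    cat > "${dropin_dir}/10-kubelet-wantedby.conf" <<'EOF'
[Unit]
Before=kubelet.service

[Install]
WantedBy=kubelet.service
EOF
}

# enable_only UNIT
# Enables UNIT without starting it. SYSTEMCTL may be overridden (for example
# with "echo") to dry-run the command on a machine without systemd.
enable_only() {
    local unit="$1"
    "${SYSTEMCTL:-systemctl}" enable "${unit}"
}
```

A dry run such as `SYSTEMCTL=echo enable_only secure-tls-bootstrap` prints the command arguments instead of touching systemd, which is one way a ShellSpec-style test could assert that no explicit start is issued.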

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated no comments.

| File | Description |
| --- | --- |
| parts/linux/cloud-init/artifacts/cse_config.sh | Renames the secure TLS bootstrapping configurator and changes it to systemctl enable secure-tls-bootstrap (no explicit start), while writing a drop-in with WantedBy=kubelet.service. |
| parts/linux/cloud-init/artifacts/cse_main.sh | Updates the nodePrep call site and event name to use configureAndEnableSecureTLSBootstrapping. |
| parts/linux/cloud-init/artifacts/cse_helpers.sh | Renames the secure TLS bootstrapping error code constant to reflect “enable” failure semantics. |
| spec/parts/linux/cloud-init/artifacts/cse_config_spec.sh | Updates ShellSpec tests/mocks to assert enable-only behavior and the new function name. |

@djsly
Collaborator

djsly commented Jun 5, 2026

🕵️ AgentBaker E2E Detective — Daily TME run

Failed E2E run: Agentbaker E2E - TME Tenant #20260605.8 (build 166961759) — triggered by Daily TME VHD build 166941956 from this PR's branch (cameissner/stls-defer-startup).

Failed job: Run AgentBaker E2E Linux Tests (logs)

Failed tests (2 / 401 executed):

| Test | Image | Signature |
| --- | --- | --- |
| Test_Ubuntu2204_CSE_CachedPerformance/default/Task_configureKubeletAndKubectl | 2204gen2containerd | configureKubeletAndKubectl took 30.722s, exceeds 27s perf threshold (breach: +3.7s) |
| Test_LocalDNSHostsPlugin/AzureLinuxV3/scriptless_nbc | AzureLinuxV3gen2 | Plugin/validation step failed after kube readiness (stacktrace truncated in ADO) |
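
For reference, the reported breach is simply the measured duration minus the threshold (30.722s − 27s ≈ 3.7s). The sketch below reproduces that arithmetic; `perf_breach` is a hypothetical helper written for illustration and is not part of the E2E suite.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not taken from the AgentBaker E2E code.
# perf_breach DURATION_S THRESHOLD_S
# Prints the breach (e.g. "+3.7s") and returns 1 when the duration exceeds
# the threshold; prints nothing and returns 0 otherwise.
perf_breach() {
    local duration_s="$1" threshold_s="$2"
    awk -v d="${duration_s}" -v t="${threshold_s}" \
        'BEGIN { if (d > t) { printf "+%.1fs\n", d - t; exit 1 } }'
}
```

Called as `perf_breach 30.722 27`, this prints `+3.7s`, matching the figure in the table.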

Likely cause: Perf-threshold flake under westus capacity/SKU contention (node create 3m21s, pod-ready 1m36s in the same run). PR #8632 only re-orders secure-tls-bootstrap.service via kubelet WantedBy= — it does not add work to configureKubeletAndKubectl, and the LocalDNS failure happens after node Ready, off the TLS-bootstrap path touched here.

Flake vs regression: flake (confidence: medium). The same two tests also failed in adjacent E2E run 166885687 on a different VHD/branch (official/v20260605), so the pattern predates and crosses this PR.

Suggested owner: AgentBaker Linux Node SIG (Node Lifecycle) — perf-threshold owner; LocalDNS plugin scenario owner for the second test.

Recommended next action: Do not block this PR on this run. Re-run E2E once to confirm flake; if Test_Ubuntu2204_CSE_CachedPerformance trips again, consider raising the 27s threshold or capturing westus VM-create latency telemetry. Append to the existing flake tracker for both tests.

Posted automatically by Clawpilot (local detective). Analysis only — no pipeline mutation. If this is wrong, ping @sylvainboily.

3 participants