feat: add CoreDNS hosts plugin support for LocalDNS #8165
Add aks-hosts-setup.sh, aks-hosts-setup.service, and aks-hosts-setup.timer to resolve critical AKS FQDNs via LocalDNS hosts plugin. This enables authoritative DNS responses for MCR and other endpoints, reducing dependency on external DNS servers during node bootstrap.
Changes include:
- New systemd units for hosts file setup and periodic refresh
- CSE integration: enableAKSHostsSetup() with VHD-presence guards
- CoreDNS corefile generation with hosts plugin support
- aks-node-controller scriptless path support
- E2E tests for Ubuntu 2204/2404 and AzureLinux V3
- ShellSpec unit tests for all new shell scripts
- Proto/pb.go updates for EnableHostsPlugin field
Add file provisioners for aks-hosts-setup.sh, aks-hosts-setup.service, and aks-hosts-setup.timer to all 10 packer JSON templates, and add cpAndMode entries to packer_source.sh to place them at:
- /opt/azure/containers/aks-hosts-setup.sh (0755)
- /etc/systemd/system/aks-hosts-setup.service (0644)
- /etc/systemd/system/aks-hosts-setup.timer (0644)
Without this, enableAKSHostsSetup() in CSE silently skips because the VHD-presence guard finds the files missing.
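The VHD-presence guard described above can be pictured roughly as follows. This is a minimal sketch, not the CSE implementation: the function name `hosts_setup_files_present` and its optional root-prefix argument are hypothetical and exist only so the check can be exercised against a temp directory. Only the three file paths come from the commit message.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of a VHD-presence guard. The function name and the
# optional root prefix are illustrative and not taken from the CSE scripts.
hosts_setup_files_present() {
    local root="${1:-}"
    local f
    for f in \
        "${root}/opt/azure/containers/aks-hosts-setup.sh" \
        "${root}/etc/systemd/system/aks-hosts-setup.service" \
        "${root}/etc/systemd/system/aks-hosts-setup.timer"; do
        if [ ! -f "${f}" ]; then
            # Report which artifact is missing instead of skipping silently.
            echo "aks-hosts-setup artifact missing on VHD: ${f}" >&2
            return 1
        fi
    done
    return 0
}
```

Logging the missing path, as the sketch does, would make a skipped setup visible in the provisioning output rather than silent.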
- Install dnsutils in shellspec.Dockerfile so nslookup is available in the CI container, enabling real DNS resolution tests.
- Fix localdns_spec.sh: add missing End statement between two It blocks, remove duplicate rm of already-deleted file, and drop assertion for non-existent error message.
Remove teleportd/teleport references from cse_cmd.sh, parser.go, cse_config.sh, cse_helpers.sh, cse_main.sh, baker.go, and types.go. These were not part of the localdns hosts plugin work and were accidentally carried over from a prior merge with main.
Restore files that had unrelated changes leaked in from the old merge:
- parser.go: restore SKIP_WAAGENT_HOLD entry that was accidentally deleted
- vmss.go: restore CustomDataWithHack boothook template, CustomDataFlatcar path, and injectWriteFilesEntriesToCustomData (only add MockUnknownCloud)
- types.go: restore CustomDataWriteFile type (only add MockUnknownCloud tag and localdns helper methods)
- validators.go: restore ValidateNodeExporter and ValidateWaagentLog to main's versions (only add localdns hosts plugin validators)
- cse_helpers.sh: restore to main's version (no localdns changes needed)
- .env.sample: restore to main's version
- cse_main.sh: restore SKIP_WAAGENT_HOLD conditional that was accidentally removed (stale change from old merge)
- aks_model.go: pass e2e-test=true tag when creating Private DNS zones so collectGarbagePrivateDNSZones can clean them up
Two additional SKIP_WAAGENT_HOLD guards in nodePrep (for the unhold calls) were still missing after the previous fix only restored the one in basePrep.
…oss all distros
- Replace nslookup "recursion not available" check with dig AA (Authoritative Answer) flag, which is stronger proof that the CoreDNS hosts plugin served the response rather than forwarding upstream
- Match IPs returned by dig against /etc/localdns/hosts entries
- Remove fake FQDN injection test (won't work since hosts file is populated by aks-hosts-setup.service from real DNS resolution)
- Simplify Corefile validation (remove fragile awk section-parsing)
- Consolidate per-distro tests into table-driven Test_LocalDNSHostsPlugin covering all 7 supported amd64 distros (Ubuntu 2204/2404, Azure Linux V2/V3, CBL Mariner V2, Flatcar, ACL)
- Add table-driven Test_LocalDNSHostsPlugin_Scriptless covering all 5 distros with scriptless support (Ubuntu 2204/2404, Azure Linux V3, Flatcar, ACL)
- Remove duplicate validator calls from dedicated tests since ValidateCommonLinux already runs them when EnableHostsPlugin is set
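To illustrate the AA-flag check mentioned above: dig prints a header line of the form `;; flags: qr aa rd; QUERY: ...`, and the `aa` token appears only when the responder answered authoritatively. The helper below is a sketch with a hypothetical name (`dig_output_is_authoritative`); it reads dig output from stdin so it can be tried on canned text, and it is not the validator used in the e2e suite.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeeds when the dig output on stdin carries the
# "aa" (Authoritative Answer) flag in its ";; flags:" header line.
dig_output_is_authoritative() {
    # [^;]* keeps the match inside the flag list, before the first ';'.
    grep -qE '^;; flags:[^;]* aa[ ;]'
}

# Example invocation (LOCALDNS_IP is an assumed variable, not from the PR):
#   dig "@${LOCALDNS_IP}" mcr.microsoft.com | dig_output_is_authoritative
```

A forwarded answer typically shows `qr rd ra` without `aa`, which is why the flag is stronger evidence than the nslookup "recursion not available" message.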
…mprove e2e validators
- aks-hosts-setup.sh: guard resolve_ipv4() pipeline with || return 0 under pipefail
- aks-hosts-setup.sh: tighten IPv6 regex to reject all-colon strings like ":::::::"
- cse_config.sh: restore LOCALDNS_BASE64_ENCODED_COREFILE in environment file for old VHD compat
- localdns.sh: track annotation background PID and kill in cleanup_localdns_configs
- e2e/types.go: IsHostsPluginEnabled() now checks EnableLocalDns && EnableHostsPlugin for scriptless path
- e2e/validation.go: reorder validators so ValidateLocalDNSHostsFile runs before ValidateAKSHostsSetupService
- e2e/validators.go: fix maxAttempts (60->33) to match ~5 minute polling comment
- spec: add ":::::::" to IPv6 mock test, add LOCALDNS_BASE64_ENCODED_COREFILE env file check
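The first two items above are easier to see in a small example. The sketch below is an assumption-laden illustration, not the code in aks-hosts-setup.sh: `filter_ipv4` and `is_plausible_ipv6` are hypothetical names, and the patterns are deliberately loose. It shows why a grep pipeline needs `|| return 0` under `pipefail`, and one way to reject strings made only of colons.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical filter: keeps only dotted-quad lines from resolver output on stdin.
# grep exits 1 when nothing matches; under pipefail that fails the whole
# pipeline, so the guard turns "no matches" into an empty, successful result.
filter_ipv4() {
    grep -E '^([0-9]{1,3}\.){3}[0-9]{1,3}$' | sort -u || return 0
}

# Hypothetical shape check for IPv6 candidates: only hex digits and colons,
# at least one colon, and at least one hex digit. The last condition rejects
# all-colon strings such as ":::::::". It is a loose filter, not a validator
# (it also rejects the valid unspecified address "::").
is_plausible_ipv6() {
    local addr="$1"
    [[ "${addr}" =~ ^[0-9a-fA-F:]+$ ]] && [[ "${addr}" == *:* ]] && [[ "${addr}" =~ [0-9a-fA-F] ]]
}
```

Because `is_plausible_ipv6` returns non-zero for rejected input, callers running under `set -e` would need to use it inside an `if` or with `||`.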
…, reuse systemd validators
- aks-hosts-setup.sh: switch from nslookup to dig +short for DNS resolution (yewmsft)
- localdns.sh: fix stale "timeout=0" comment referencing removed parameter (yewmsft)
- cse_main.sh: add startup ordering comment for localdns/aks-hosts-setup (yewmsft)
- validators.go: reuse ValidateSystemdUnitIsNotFailed instead of ad-hoc script (cameronmeissner)
- spec: update mock tests from nslookup format to dig +short format
…r refactored function
- baker.go: GetGeneratedLocalDNSCoreFile now uses includeHostsPlugin=false since old VHDs don't provision /etc/localdns/hosts
- cse_main_spec.sh: rewrite tests to set env vars (LOCALDNS_COREFILE_ACTIVE, LOCALDNS_COREFILE_EXPERIMENTAL, SHOULD_ENABLE_HOSTS_PLUGIN) instead of positional args, matching the refactored select_localdns_corefile() signature
select_localdns_corefile() {
    local hosts_file_path="/etc/localdns/hosts"

    # Case 1: Both corefile variants available — dynamic selection
    if [ -n "${LOCALDNS_COREFILE_EXPERIMENTAL:-}" ] && \
       [ -n "${LOCALDNS_COREFILE_ACTIVE:-}" ]; then
        echo "Both corefile variants available, selecting based on current state..." >&2
        echo "LocalDNS corefile selection: SHOULD_ENABLE_HOSTS_PLUGIN=${SHOULD_ENABLE_HOSTS_PLUGIN:-<unset>}" >&2

        if [ "${SHOULD_ENABLE_HOSTS_PLUGIN:-}" = "true" ]; then
            echo "Hosts plugin is enabled, checking ${hosts_file_path} for content..." >&2
            if [ -f "${hosts_file_path}" ] && \
               grep -qE '^[0-9a-fA-F.:]+[[:space:]]+[a-zA-Z]' "${hosts_file_path}"; then
                echo "Hosts file has IP mappings, using corefile with hosts plugin" >&2
select_localdns_corefile() hardcodes the hosts file path to /etc/localdns/hosts. This makes the function harder to test and inconsistent with annotate_node_with_hosts_plugin_status(), which already supports overriding the hosts file path via LOCALDNS_HOSTS_FILE. Consider using an overridable path here as well (e.g., hosts_file_path="${LOCALDNS_HOSTS_FILE:-/etc/localdns/hosts}") so unit tests can avoid writing to /etc and production can be more flexible if the path ever changes.
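A minimal sketch of the suggested override, isolating only the readiness check. The helper name `hosts_file_has_mappings` is hypothetical; the default path, the `LOCALDNS_HOSTS_FILE` variable, and the grep pattern are taken from the hunk and comment above.

```shell
#!/usr/bin/env bash
# Hypothetical helper extracted for illustration: succeeds when the hosts
# file exists and contains at least one "IP hostname" mapping line.
hosts_file_has_mappings() {
    # Tests can point LOCALDNS_HOSTS_FILE at a temp file; production falls
    # back to the default path.
    local hosts_file_path="${LOCALDNS_HOSTS_FILE:-/etc/localdns/hosts}"
    [ -f "${hosts_file_path}" ] && \
        grep -qE '^[0-9a-fA-F.:]+[[:space:]]+[a-zA-Z]' "${hosts_file_path}"
}
```

With this shape a spec can set `LOCALDNS_HOSTS_FILE="${TEST_DIR}/hosts"` in its setup and never touch /etc.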
setup() {
    # Source localdns.sh to get select_localdns_corefile function
    # We set __SOURCED__=1 to only source the functions, not run main execution
    # shellcheck disable=SC1090
    __SOURCED__=1 . "${LOCALDNS_PATH}"

    # Create temp directory for test hosts file
    TEST_DIR=$(mktemp -d)
    HOSTS_FILE="${TEST_DIR}/hosts"
}

cleanup() {
    rm -rf "${TEST_DIR}"
    unset LOCALDNS_COREFILE_ACTIVE
    unset LOCALDNS_COREFILE_EXPERIMENTAL
    unset SHOULD_ENABLE_HOSTS_PLUGIN
}

BeforeEach 'setup'
AfterEach 'cleanup'

Context 'when both corefile variants are available and hosts plugin is enabled'
    It 'returns EXPERIMENTAL when hosts file has valid IP mappings'
        LOCALDNS_COREFILE_ACTIVE="${COREFILE_NO_HOSTS}"
        LOCALDNS_COREFILE_EXPERIMENTAL="${COREFILE_WITH_HOSTS}"
        SHOULD_ENABLE_HOSTS_PLUGIN="true"
        # Create hosts file with valid IP mappings at the path the function checks
        mkdir -p /etc/localdns
        echo "10.0.0.1 mcr.microsoft.com" > /etc/localdns/hosts

        When call select_localdns_corefile
        The output should equal "${COREFILE_WITH_HOSTS}"
        The status should be success
        The stderr should include "Hosts file has IP mappings"
        The stderr should include "using corefile with hosts plugin"
    End

    It 'returns ACTIVE when hosts file exists but has no IP mappings'
        LOCALDNS_COREFILE_ACTIVE="${COREFILE_NO_HOSTS}"
        LOCALDNS_COREFILE_EXPERIMENTAL="${COREFILE_WITH_HOSTS}"
        SHOULD_ENABLE_HOSTS_PLUGIN="true"
        mkdir -p /etc/localdns
        echo "# comment only" > /etc/localdns/hosts

        When call select_localdns_corefile
        The output should equal "${COREFILE_NO_HOSTS}"
        The status should be success
        The stderr should include "not ready yet, falling back to corefile without hosts plugin"
    End

    It 'returns ACTIVE when hosts file does not exist'
        LOCALDNS_COREFILE_ACTIVE="${COREFILE_NO_HOSTS}"
        LOCALDNS_COREFILE_EXPERIMENTAL="${COREFILE_WITH_HOSTS}"
        SHOULD_ENABLE_HOSTS_PLUGIN="true"
        rm -f /etc/localdns/hosts
These tests create and modify /etc/localdns/hosts directly (and setup() creates a temp HOSTS_FILE that is never used). Writing under /etc in unit tests can require elevated permissions and can leak state across specs if the file isn't cleaned up in AfterEach. Prefer using a temp hosts file and pointing the code under test at it (which likely requires select_localdns_corefile() to accept an overridable hosts path), or ensure /etc/localdns/hosts is removed/restored in cleanup().
…exit code Use a single name end-to-end: CSE delivers LOCALDNS_COREFILE_BASE, environment file stores LOCALDNS_COREFILE_BASE, localdns.sh reads LOCALDNS_COREFILE_BASE. No more rename at the CSE↔VHD boundary. Also fix select_localdns_corefile Case 3 (nothing available) to return 1 instead of 0, and guard the caller in regenerate_localdns_corefile with || true to prevent set -e from aborting before the friendly error message.
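The interaction between the non-zero return and `set -e` can be sketched as below. Both function names are hypothetical stand-ins (the real ones are select_localdns_corefile and regenerate_localdns_corefile), and the error text is invented for illustration. The point is only that `|| true` keeps `set -e` from aborting before the caller can print its own message.

```shell
#!/usr/bin/env bash
set -e

# Hypothetical stand-in for the selector: prints the corefile when one is
# available, otherwise prints nothing and returns 1.
select_corefile_or_fail() {
    if [ -n "${LOCALDNS_COREFILE_BASE:-}" ]; then
        printf '%s\n' "${LOCALDNS_COREFILE_BASE}"
        return 0
    fi
    return 1
}

# Hypothetical stand-in for the caller.
regenerate_sketch() {
    local corefile
    # Without "|| true", the failed command substitution would trip set -e
    # here and the script would exit before reaching the message below.
    corefile="$(select_corefile_or_fail)" || true
    if [ -z "${corefile}" ]; then
        echo "No corefile available; cannot regenerate" >&2
        return 1
    fi
    printf '%s\n' "${corefile}"
}
```

Declaring `local corefile` on its own line matters as well: `local corefile="$(...)"` would report the exit status of `local` and hide the failure.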
Verify that generateLocalDNSFiles falls back to LOCALDNS_GENERATED_COREFILE when LOCALDNS_COREFILE_BASE is unset, simulating an old AgentBaker service provisioning a VM with a new VHD.
Pull request overview
Copilot reviewed 44 out of 44 changed files in this pull request and generated 6 comments.
Comments suppressed due to low confidence (2)
spec/parts/linux/cloud-init/artifacts/cse_main_spec.sh:1
- These tests write to the real /etc/localdns/hosts on the test runner/container filesystem but the cleanup() does not remove it. This can leak state across specs and cause false positives/negatives (e.g., later tests observing an old hosts file). Add cleanup that removes /etc/localdns/hosts (and optionally /etc/localdns if empty) or refactor the function under test to accept an overridable hosts path so the spec can use $TEST_DIR instead.
It 'creates hosts file with resolved addresses for all critical FQDNs'
    When run command bash "${TEST_SCRIPT}"
    The status should be success
    The file "$HOSTS_FILE" should be exist
This test requires live DNS resolution to succeed in CI. However, aks-hosts-setup.sh explicitly exits 0 without writing a hosts file when no domains resolve (to avoid systemd marking the unit failed). In restricted or flaky network environments, the test can fail even when the script behavior is correct. To make the spec deterministic, prefer using the mock-dig path for 'hosts file is created' assertions (returning stable A/AAAA answers), and reserve live-DNS tests for a smaller, optionally-skipped integration layer.
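One way to picture the mock-dig approach suggested here is a shell function that shadows the real binary and returns fixed answers. This is a plain-bash sketch under stated assumptions, not the spec's actual mock: `resolve_to_hosts_line` is a hypothetical helper, the mock treats its last argument as the query name, and 192.0.2.10 is a documentation-range address chosen for the example.

```shell
#!/usr/bin/env bash
# Mock dig: a shell function takes precedence over the binary on PATH, so
# code under test that calls "dig" gets a stable answer without the network.
dig() {
    # Assumption of this sketch: the query name is the last argument.
    local name="${!#}"
    case "${name}" in
        mcr.microsoft.com) printf '%s\n' "192.0.2.10" ;;
        *) ;;  # any other name resolves to nothing
    esac
}

# Hypothetical helper: prints one "IP FQDN" hosts line, or nothing (and a
# non-zero status) when the name does not resolve.
resolve_to_hosts_line() {
    local fqdn="$1" ip
    ip="$(dig +short A "${fqdn}" | head -n 1)"
    [ -n "${ip}" ] && printf '%s %s\n' "${ip}" "${fqdn}"
}
```

Assertions about hosts file creation can then run against the mock, leaving any live-DNS checks to a separate, optional layer.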
…ervice The Before= directive blocks kubelet and localdns startup until aks-hosts-setup completes DNS resolution (up to 60s). This contradicts the async design: localdns should start immediately with the base corefile, and dynamic corefile selection handles the upgrade to the hosts-plugin variant once the hosts file is populated.
printf '%s\n' is POSIX-portable and won't interpret escape sequences, unlike echo which is shell-implementation-dependent.
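A short illustration of the difference, using a hypothetical `write_line` wrapper. With `printf '%s\n'`, the argument is always emitted literally, including values that `echo` may treat as an option or, in some shells, as an escape sequence.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper: writes its argument followed by a newline, without
# interpreting option-like values or backslash escapes.
write_line() {
    printf '%s\n' "$1"
}

write_line '-n'      # the two characters -n, then a newline
write_line 'a\tb'    # the four characters a \ t b, then a newline
```

By contrast, `echo -n` in bash prints nothing at all, and some `echo` implementations expand `\t` to a tab.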
- Fix comment in parser/helper.go: selection happens in localdns.sh not cse_main.sh
- Use LOCALDNS_HOSTS_FILE override in select_localdns_corefile() for testability
- Rewrite cse_main_spec.sh to use temp dir instead of /etc/localdns/hosts
- Add actual permissions check to cloud-env file test in cse_config_spec.sh
- Replace brittle tail -n +10 with sed in aks_hosts_setup_spec.sh
SHOULD_ENABLE_HOSTS_PLUGIN="true"
echo "10.0.0.1 mcr.microsoft.com" > "${LOCALDNS_HOSTS_FILE}"

When call select_localdns_corefile
This spec writes to the real /etc/localdns/hosts but cleanup() never removes it. That can leak state across specs (and between examples in this file), causing false positives/negatives depending on execution order. Consider deleting /etc/localdns/hosts (and optionally rmdir /etc/localdns when empty) in cleanup(), or refactoring select_localdns_corefile() to allow a test override path so tests can stay within $TEST_DIR.
    The contents of file "$AKS_CLOUD_ENV_FILE" should equal "TARGET_CLOUD=AzureUSGovernmentCloud"
End

It 'should set correct permissions on cloud-env file'
This test name says it verifies 0644 permissions on the cloud-env file, but it only asserts the file exists. Add an assertion on the file mode (e.g., via stat) so the test actually covers the permission behavior it describes.
It 'should set correct permissions on cloud-env file'
    The file "$AKS_CLOUD_ENV_FILE" should be exist
    cloud_env_perms=$(printf "0%s" "$(stat -c "%a" "$AKS_CLOUD_ENV_FILE")")
    The value "$cloud_env_perms" should equal "0644"
# Helper to build a test script that uses the real system dig.
# Overrides only HOSTS_FILE and TARGET_CLOUD, preserving everything else
# (cloud selection, resolution loop, atomic write) from the real script.
# Uses sed to strip the shebang, set -euo pipefail, and HOSTS_FILE= lines
# so the test is not brittle to comment changes at the top of the script.
build_test_script() {
    local test_dir="$1"
    local hosts_file="$2"
    local target_cloud="${3:-AzurePublicCloud}"
    local test_script="${test_dir}/aks-hosts-setup-test.sh"

    cat > "${test_script}" << EOF
#!/usr/bin/env bash
set -uo pipefail
HOSTS_FILE="${hosts_file}"
export TARGET_CLOUD="${target_cloud}"
EOF
    sed -e '/^#!\/bin\/bash/d' -e '/^set -euo pipefail/d' -e '/^HOSTS_FILE=/d' "${SCRIPT_PATH}" >> "${test_script}"
build_test_script()/build_mock_test_script() depend on the first 9 lines of aks-hosts-setup.sh staying exactly the same (via tail -n +10). That makes the test brittle: any added comment/header line in the script will silently change test behavior. Prefer patching the script more robustly (e.g., sourcing it, or using sed to override HOSTS_FILE / TARGET_CLOUD assignments) rather than relying on fixed line numbers.
MockUnknownCloud was never used by any test scenario and relied on brittle string replacement to inject TARGET_CLOUD into the CSE script. Remove the Tags field and the injection logic in createVMSSModel.
What this PR does / why we need it:
Adds CoreDNS hosts plugin support for LocalDNS. When enabled, critical AKS FQDNs (mcr.microsoft.com, packages.aks.azure.com, etc.) are resolved and cached in /etc/localdns/hosts by a systemd timer (aks-hosts-setup), then served authoritatively by the CoreDNS hosts plugin, eliminating upstream DNS lookups for these endpoints.
Key changes:
- aks-hosts-setup.sh/.service/.timer: resolves AKS FQDNs and writes /etc/localdns/hosts
- hosts /etc/localdns/hosts { fallthrough } block in both VnetDNS and KubeDNS listeners
- LOCALDNS_COREFILE_BASE (vanilla) and LOCALDNS_COREFILE_FULL (with all optional plugins). localdns.sh dynamically selects the full variant once the hosts file is populated.
- enableLocalDNS() is the single entry point for all localdns setup including hosts plugin
Which issue(s) this PR fixes: