feat(e2e): add HTTPS_PROXY + private DNS test scenario #8470
Conversation
Pull request overview
Adds an E2E regression scenario covering node bootstrapping with HTTPS_PROXY configured while the API server FQDN resolves via an Azure Private DNS zone, and wires shared proxy + private DNS infrastructure into cluster preparation. Also refactors scenario mutator signatures to accept the active *Cluster, and broadens AKS cluster creation retry logic to include a specific managed-identity reconciliation NotFound case.
Changes:
- Add `Test_Ubuntu2204_HTTPSProxy_PrivateDNS` scenario and pass the cluster-derived proxy URL into `HTTPProxyConfig` (see the sketch after this list).
- Provision a lightweight HTTP CONNECT proxy (ConfigMap + DaemonSet) and set up a private DNS zone for the API server FQDN during `prepareCluster` (non-network-isolated clusters).
- Update mutator function signatures to accept `*Cluster`, and improve cluster-create retry logic via `isRetryableClusterError`.
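As a rough illustration of the first bullet, a scenario's bootstrap mutator under the new signature could look like the sketch below. This is not the PR's code: the `NoProxy` entries and the exact shape of `cluster.ProxyURL` (string field vs. helper) are assumptions, and `datamodel` refers to the AgentBaker `pkg/agent/datamodel` package.

```go
// Hypothetical sketch of the scenario's BootstrapConfigMutator: use the new
// *Cluster argument to point nodes at the in-cluster proxy via HTTPProxyConfig.
// The NoProxy list shown here is illustrative.
func httpsProxyBootstrapMutator(cluster *Cluster, nbc *datamodel.NodeBootstrappingConfiguration) {
	noProxy := []string{"localhost", "127.0.0.1", "168.63.129.16", "169.254.169.254"}
	nbc.HTTPProxyConfig = &datamodel.HTTPProxyConfig{
		HTTPProxy:  to.Ptr(cluster.ProxyURL),
		HTTPSProxy: to.Ptr(cluster.ProxyURL),
		NoProxy:    &noProxy,
	}
}
```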
Reviewed changes
Copilot reviewed 9 out of 9 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| e2e/types.go | Update mutator function types to accept *Cluster. |
| e2e/test_helpers.go | Thread *Cluster into mutator invocation and pre-provision wrapper scenarios. |
| e2e/scenario_win_test.go | Adapt Windows scenarios/helpers to new mutator signatures. |
| e2e/scenario_test.go | Add Ubuntu 22.04 HTTPS proxy + private DNS scenario; update existing scenarios for new signatures. |
| e2e/scenario_gpu_managed_experience_test.go | Update GPU scenario mutator signatures. |
| e2e/scenario_gpu_daemonset_test.go | Update GPU DaemonSet scenario mutator signature. |
| e2e/scenario_cse_perf_test.go | Update performance scenario mutator signatures. |
| e2e/kube.go | Add proxy ConfigMap/DaemonSet creation + proxy URL discovery via Kubernetes API. |
| e2e/cluster.go | Add cluster-scoped ProxyURL, run proxy + private DNS setup in prepareCluster, and expand retryable cluster-create errors. |
pods, err := k.Typed.CoreV1().Pods("default").List(ctx, metav1.ListOptions{
	LabelSelector: "app=" + proxyAppLabel,
})
if err != nil || len(pods.Items) == 0 {
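For context, a minimal sketch of how `GetProxyURL` might finish this lookup. The receiver type name, the listen port (8080), and the "first running pod wins" choice are assumptions rather than the PR's exact code.

```go
// Hypothetical sketch: discover the proxy URL from a running DaemonSet pod's IP.
func (k *Kubeclient) GetProxyURL(ctx context.Context) (string, error) {
	pods, err := k.Typed.CoreV1().Pods("default").List(ctx, metav1.ListOptions{
		LabelSelector: "app=" + proxyAppLabel,
	})
	if err != nil {
		return "", fmt.Errorf("listing proxy pods: %w", err)
	}
	for _, pod := range pods.Items {
		if pod.Status.Phase == corev1.PodRunning && pod.Status.PodIP != "" {
			return fmt.Sprintf("http://%s:8080/", pod.Status.PodIP), nil // assumed listen port
		}
	}
	return "", fmt.Errorf("no running pod found with label app=%s", proxyAppLabel)
}
```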
const (
	hostNetworkDebugAppLabel = "debug-mariner-tolerated"
	podNetworkDebugAppLabel  = "debugnonhost-mariner-tolerated"
	proxyAppLabel            = "e2e-tinyproxy"
// proxy is not available on network-isolated clusters
if !isNetworkIsolated {
	if err := k.ensureProxyConfigMap(ctx); err != nil {
		return err
	}
	proxyDS := daemonsetProxy(ctx)
	if err := k.CreateDaemonset(ctx, proxyDS); err != nil {
		return err
	}
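For illustration, a trimmed-down sketch of what `daemonsetProxy` could return, assuming the Python CONNECT proxy script ships in the ConfigMap created by `ensureProxyConfigMap`. The ConfigMap name, mount path, script name, and container port are assumptions; only the image comes from the commit message.

```go
// Hypothetical sketch of daemonsetProxy: run a CONNECT proxy script from a
// ConfigMap on every node, using the Mariner Python base image.
func daemonsetProxy(_ context.Context) *appsv1.DaemonSet {
	labels := map[string]string{"app": proxyAppLabel}
	return &appsv1.DaemonSet{
		ObjectMeta: metav1.ObjectMeta{Name: proxyAppLabel, Namespace: "default"},
		Spec: appsv1.DaemonSetSpec{
			Selector: &metav1.LabelSelector{MatchLabels: labels},
			Template: corev1.PodTemplateSpec{
				ObjectMeta: metav1.ObjectMeta{Labels: labels},
				Spec: corev1.PodSpec{
					Containers: []corev1.Container{{
						Name:    "proxy",
						Image:   "mcr.microsoft.com/cbl-mariner/base/python:3",
						Command: []string{"python3", "/scripts/proxy.py"},
						Ports:   []corev1.ContainerPort{{ContainerPort: 8080}},
						VolumeMounts: []corev1.VolumeMount{{
							Name:      "proxy-script",
							MountPath: "/scripts",
						}},
					}},
					Volumes: []corev1.Volume{{
						Name: "proxy-script",
						VolumeSource: corev1.VolumeSource{
							ConfigMap: &corev1.ConfigMapVolumeSource{
								LocalObjectReference: corev1.LocalObjectReference{Name: proxyAppLabel},
							},
						},
					}},
				},
			},
		},
	}
}
```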
Add E2E test for node bootstrapping with HTTPProxyConfig set and private DNS zone for the API server FQDN. Regression coverage for IcM 603699115 / ADO#31707996.

Changes:
- Refactor BootstrapConfigMutator and AKSNodeConfigMutator to accept *Cluster parameter, enabling scenarios to access cluster properties
- Deploy Python-based CONNECT proxy DaemonSet on all non-isolated clusters using mcr.microsoft.com/cbl-mariner/base/python:3
- Create private DNS zone for API server FQDN on all non-isolated clusters, linked to VNet with A record
- Add Test_Ubuntu2204_HTTPProxy_PrivateDNS scenario
- Fix cluster creation retry to handle NotFound errors

Test verified: node boots, CSE completes, kubelet starts, node Ready, test pod runs. Proxy receives CONNECT traffic from CSE outbound check.

Fixes: ADO#31707996

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
err := wait.PollUntilContextTimeout(ctx, 5*time.Second, 2*time.Minute, true, func(ctx context.Context) (bool, error) {
	resp, err := config.Azure.PrivateZonesClient.Get(ctx, nodeResourceGroup, privateZoneName, nil)
	if err != nil {
		return false, nil
var aRecords []*armprivatedns.ARecord
for _, ip := range ips {
	if parsed := net.ParseIP(ip); parsed != nil && parsed.To4() != nil {
		aRecords = append(aRecords, &armprivatedns.ARecord{IPv4Address: to.Ptr(ip)})
	}
}
if len(aRecords) == 0 {
	return fmt.Errorf("no IPv4 addresses for %q", fqdn)
}
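The filtered records then get written into the zone's A record set. A hedged sketch of that write, assuming the e2e Azure config exposes an armprivatedns `RecordSetsClient`; the client field name, the `relativeRecordName` variable, and the TTL are assumptions, not the PR's code.

```go
// Hypothetical sketch: publish the API server's A records into the private zone.
// relativeRecordName is the FQDN with the private zone suffix stripped.
if _, err := config.Azure.RecordSetsClient.CreateOrUpdate(ctx, nodeResourceGroup, privateZoneName,
	armprivatedns.RecordTypeA, relativeRecordName,
	armprivatedns.RecordSet{
		Properties: &armprivatedns.RecordSetProperties{
			TTL:      to.Ptr(int64(300)), // assumed TTL
			ARecords: aRecords,
		},
	}, nil); err != nil {
	return fmt.Errorf("creating A record for %q: %w", fqdn, err)
}
```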
// that can be resolved by retrying, such as 409 Conflict (concurrent operations)
// and NotFound during managed identity reconciliation (stale references after cluster deletion).
func isRetryableClusterError(err error) bool {
	var respErr *azcore.ResponseError
	if !errors.As(err, &respErr) {
		return false
	}
	if respErr.StatusCode == 409 {
		return true
	}
	return respErr.ErrorCode == "NotFound" && strings.Contains(err.Error(), "Reconcile managed identity credential failed")
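A rough sketch of how the create path can consume this helper; the attempt count, backoff, and the inner `createNewAKSCluster` helper name are illustrative, not the PR's values.

```go
// Hypothetical sketch of the retry wrapper: retry only errors that
// isRetryableClusterError classifies as transient.
func createNewAKSClusterWithRetry(ctx context.Context, cluster *Cluster) error {
	var lastErr error
	for attempt := 0; attempt < 3; attempt++ {
		if lastErr = createNewAKSCluster(ctx, cluster); lastErr == nil {
			return nil
		}
		if !isRetryableClusterError(lastErr) {
			return lastErr
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(30 * time.Second): // illustrative backoff
		}
	}
	return fmt.Errorf("creating AKS cluster after retries: %w", lastErr)
}
```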
…APIServer

createPrivateZone and createPrivateDNSLink already handle 409 conflicts internally. The outer PollUntilContextTimeout loops had dead-code 409 checks: the inner functions swallow the 409 and return either success or a differently-typed error that never matches the outer check.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Previously all errors were swallowed during polling, turning non-transient failures (403, invalid names) into misleading timeouts. Now only 404/NotFound is retried; other errors surface immediately. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
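In practice that means the poll callback distinguishes NotFound from everything else. A small sketch of the pattern, assuming the azcore response error carries the HTTP status (needs `net/http`, `errors`, and `azcore` imports); this mirrors the described fix rather than quoting the PR's exact code.

```go
// Hypothetical sketch of the fixed polling callback: keep waiting only while
// the zone is still propagating (404); surface any other error immediately.
err := wait.PollUntilContextTimeout(ctx, 5*time.Second, 2*time.Minute, true, func(ctx context.Context) (bool, error) {
	_, err := config.Azure.PrivateZonesClient.Get(ctx, nodeResourceGroup, privateZoneName, nil)
	if err == nil {
		return true, nil
	}
	var respErr *azcore.ResponseError
	if errors.As(err, &respErr) && respErr.StatusCode == http.StatusNotFound {
		return false, nil // not visible yet, keep polling
	}
	return false, err // non-transient (403, invalid name, ...): fail fast
})
```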
	}
	return k.GetProxyURL(ctx)
}, debugDeps...)
if !isNetworkIsolated {
This surprises me. Probably due to me not quite understanding things. I thought for network isolated clusters, we would need a private DNS for the API server setup. But this code only sets up the private DNS for non-network isolated clusters.
Network-isolated clusters block all outbound and use private endpoints with NSG allow-rules for specific IPs. They don't use HTTP_PROXY — the proxy pattern is for clusters that have internet access through a forward proxy. The private DNS zone here mirrors public DNS for the API server FQDN, which only matters when nodes use a system resolver + proxy combination.
Re: timmy-wright's question about private DNS only for non-isolated clusters: network-isolated clusters already have their own private DNS setup; they block outbound entirely and use a different connectivity model (private endpoints + NSG rules for specific IPs), so the proxy + private DNS pattern doesn't apply to them.
PR #8470 (merged after this branch's last sync with main) changed BootstrapConfigMutator from func(*datamodel.NodeBootstrappingConfiguration) to func(*Cluster, *datamodel.NodeBootstrappingConfiguration). The new Test_Ubuntu2404_NvidiaDevicePluginRunning_MIG_H100_NoReboot was using the old single-arg signature, which broke compilation after merging main. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
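Concretely, the fix is just widening the mutator to the two-argument form; a minimal sketch (the scenario body is elided since only the signature changed, and the struct-literal context is illustrative):

```go
// Before (no longer compiles after merging main):
//   BootstrapConfigMutator: func(nbc *datamodel.NodeBootstrappingConfiguration) { ... },

// After: the *Cluster argument can be ignored when the scenario doesn't need it.
BootstrapConfigMutator: func(_ *Cluster, nbc *datamodel.NodeBootstrappingConfiguration) {
	// scenario-specific tweaks to nbc, unchanged from before
},
```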
Summary
Add E2E test for node bootstrapping with HTTPS_PROXY configured and private DNS zone for the API server FQDN. Regression coverage for IcM 603699115 / ADO#31707996.
Changes
New test: Test_Ubuntu2204_HTTPSProxy_PrivateDNS
- Validates that a node bootstraps successfully when `HTTPS_PROXY` is set and the API server FQDN resolves via a private DNS zone.

Proxy infrastructure (all non-isolated clusters)
- Python-based HTTP CONNECT proxy DaemonSet (image `mcr.microsoft.com/cbl-mariner/base/python:3`)
- Private DNS zone for the API server FQDN, linked to the VNet with an A record
- Both are set up in the `prepareCluster` DAG, consistent across clusters

Signature refactor
- `BootstrapConfigMutator` and `AKSNodeConfigMutator` now accept `*Cluster` as first parameter
- Scenarios can read cluster properties (e.g. the cluster-scoped `ProxyURL`)

Bug fix
- `createNewAKSClusterWithRetry` now retries on managed identity reconciliation `NotFound` errors after cluster deletion

Test Result
Node boots, CSE completes, kubelet starts, the node goes Ready, and the test pod runs; the proxy receives CONNECT traffic from the CSE outbound check.
Fixes: ADO#31707996