chore(ci): dnsmasq caching resolver on runners — DNS root-cause spike (gated, off by default)#23493
Draft
AztecBot wants to merge 1 commit into
What this is
Draft spike, not for merge as-is. Explores the root-cause fix for the recurring merge-train DNS failures (Could not resolve host). The immediate mitigation (retrying the flaky downloads) is #23490 against merge-train/spartan; this is the durable fix.
Root cause
The devbox build container in ci3/bootstrap_ec2 inherits the host's resolver = the EC2 VPC Route 53 resolver (.2), which caps at ~1024 packets/sec per ENI and silently drops queries beyond that. A 64/128-core box fanning out parallel nargo/forge/curl blows past the cap → intermittent failures across github.com, *.githubusercontent.com, binaries.soliditylang.org.
What this does
Stands up a host-local dnsmasq caching resolver and points the build container at it via the docker bridge gateway (--dns 172.17.0.1). Repeated lookups become cache hits, so query volume to the VPC resolver collapses and stays under the cap, fixing all hostnames at once rather than per-host.
- Gated on CI_DNS_CACHE=1 (set in the CI environment that runs bootstrap_ec2). Off by default = zero impact until explicitly enabled.
- server= forwards to whatever the host currently uses (discovered from /etc/resolv.conf), not hardcoded.
How to validate
Set CI_DNS_CACHE=1 for merge-train runners for a few days; expect the github.com / githubusercontent.com / soliditylang DNS failure classes to drop to ~0 and the VPC resolver query rate to fall sharply.
Open questions for the real version (why it's a draft)
- Bake dnsmasq into the AMI instead of apt-get install-ing per container boot, and set the host /etc/resolv.conf to 127.0.0.1 so Docker propagates it automatically (no per-run --dns). We can't rebuild the AMI from here; this spike is the at-boot version so the approach can be exercised first.
- unbound with serve-expired: yes is stronger if we want lookups to survive a fully unreachable upstream. Worth comparing.
- Assumes the docker bridge is already up at 172.17.0.1 when the host script runs (true on these runners since dockerd is already running).
Companion write-up: https://gist.github.com/AztecBot/a22cc18bd30ec0bd3dff72b70d675304
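The forwarding and gating behavior described under "What this does" could look roughly like the sketch below. It is not the script in this PR: the function names render_dnsmasq_conf and write_dns_cache_conf, the cache-size value, and any output path passed in are hypothetical and chosen only for illustration. The sketch derives server= lines from a resolv.conf-style file instead of hardcoding an upstream, listens on the docker bridge gateway, and does nothing unless CI_DNS_CACHE=1.

```shell
# Hypothetical helper (not from this PR): print a dnsmasq config that forwards
# to the nameservers listed in a resolv.conf-style file.
render_dnsmasq_conf() {
  local resolv_conf="${1:-/etc/resolv.conf}"
  local listen_addr="${2:-172.17.0.1}"  # docker bridge gateway, per the description

  echo "listen-address=${listen_addr}"
  echo "bind-interfaces"   # bind only the listed address, not the wildcard
  echo "no-resolv"         # ignore resolv.conf at runtime; use the server= lines only
  echo "cache-size=10000"  # assumed value, tune as needed
  # One server= line per upstream; skip loopback entries so dnsmasq
  # never forwards to itself.
  awk '$1 == "nameserver" && $2 !~ /^127\./ && $2 != "::1" { print "server=" $2 }' "$resolv_conf"
}

# Hypothetical gate: write the config only when CI_DNS_CACHE=1, so the
# default (unset or 0) leaves the host untouched.
write_dns_cache_conf() {
  local out="$1"
  local resolv_conf="${2:-/etc/resolv.conf}"
  [ "${CI_DNS_CACHE:-0}" = "1" ] || return 0
  render_dnsmasq_conf "$resolv_conf" > "$out"
}
```

With a config like this loaded by dnsmasq, the build container would be started with --dns 172.17.0.1 as described above. Skipping loopback nameservers is a design choice of this sketch, not something the PR states.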
Created by claudebox · group: slackbot