Fix EDNS OPT record corruption in DNS cache #3997

Merged
nekohasekai merged 1 commit into SagerNet:testing from berkayozdemirci:fix/edns-opt-cache-corruption
Apr 10, 2026

berkayozdemirci commented Apr 5, 2026

Summary

Fixes EDNS OPT record corruption when caching DNS responses. Two interrelated bugs in dns/client.go:

Bug 1: OPT record TTL field treated as actual TTL

Per RFC 6891 Section 6.1.3, the OPT record's Hdr.Ttl field encodes (ExtRCode << 24) | (Version << 16) | DO | Z — it is NOT a time-to-live. The TTL computation and assignment loops iterate over response.Extra (which contains OPT) and read/write record.Header().Ttl uniformly, destroying the EDNS0 metadata. Cached responses end up with EDNS version 255 and garbage flags.
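
To make the bit layout concrete, here is a minimal standard-library sketch that decodes an OPT "TTL" value into the fields RFC 6891 defines. The `optFields` type and `decodeOptTTL` function are hypothetical names for illustration and are not part of sing-box or any DNS library. The first input, 0x0000000c, matches the upstream log line in this report (version 0, MBZ 0x000c). The second, 0xffffffcf, is an illustrative value chosen to show version 255 with DO set; its extended RCODE byte is an assumption, since the logs do not show it.

```go
package main

import "fmt"

// optFields holds the values that RFC 6891 packs into the 32-bit
// TTL field of an OPT pseudo-record. Hypothetical type for illustration.
type optFields struct {
	ExtRCode uint8  // upper 8 bits of the extended RCODE
	Version  uint8  // EDNS version
	DO       bool   // DNSSEC OK bit
	Z        uint16 // remaining 15 flag bits
}

// decodeOptTTL splits the raw 32-bit value into its EDNS0 fields.
func decodeOptTTL(ttl uint32) optFields {
	return optFields{
		ExtRCode: uint8(ttl >> 24),
		Version:  uint8(ttl >> 16),
		DO:       ttl&0x8000 != 0,
		Z:        uint16(ttl & 0x7fff),
	}
}

func main() {
	// Value consistent with the upstream log line (version 0, MBZ 0x000c).
	fmt.Printf("%+v\n", decodeOptTTL(0x0000000c))
	// Illustrative corrupted value: every upper bit set.
	fmt.Printf("%+v\n", decodeOptTTL(0xffffffcf))
}
```

Read as a plain number, 0x0000000c is 12, so any code that treats it as a TTL in seconds and rewrites it is overwriting the version and flag bits.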

Bug 2: storeCache() stores raw *dns.Msg pointer

After storeCache() saves the pointer, Exchange() continues mutating response (EDNS version downgrade modifies response.Extra). All mutations affect the cached object since freelru stores pointers as-is.
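
The aliasing problem can be reproduced without any DNS library. The sketch below uses hypothetical stand-in types (`rr`, `msg`, `cache`) rather than the real `*dns.Msg` and freelru cache, and only assumes that the cache keeps whatever pointer it is given. It contrasts storing the pointer directly with storing a deep copy.

```go
package main

import "fmt"

// rr and msg are hypothetical stand-ins for a DNS record and message.
type rr struct {
	Type uint16
	TTL  uint32
}

type msg struct {
	ID    uint16
	Extra []rr
}

// Copy returns a deep copy that shares no slice storage with m.
func (m *msg) Copy() *msg {
	c := &msg{ID: m.ID}
	c.Extra = append([]rr(nil), m.Extra...)
	return c
}

// cache is a hypothetical pointer-valued cache keyed by name.
type cache map[string]*msg

// storeAliased keeps the caller's pointer, so later mutations leak in.
func (c cache) storeAliased(key string, m *msg) { c[key] = m }

// storeCopied keeps a private copy, isolating the entry from the caller.
func (c cache) storeCopied(key string, m *msg) { c[key] = m.Copy() }

func main() {
	c := cache{}
	resp := &msg{ID: 1, Extra: []rr{{Type: 41, TTL: 0x0000000c}}}
	c.storeAliased("aliased", resp)
	c.storeCopied("copied", resp)

	// Mutations made after the store, as a caller might do before replying.
	resp.ID = 42
	resp.Extra = nil

	fmt.Println(c["aliased"].ID, len(c["aliased"].Extra)) // 42 0
	fmt.Println(c["copied"].ID, len(c["copied"].Extra))   // 1 1
}
```

The copy costs one allocation per store, which is the trade made here for cache entries that no longer change after they are written.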

Steps to reproduce

  1. Start sing-box with TUN + hijack-dns + independent_cache: true + HTTPS DNS upstream (e.g. Cloudflare)
  2. Restart sing-box to clear cache
  3. Query an affected domain (e.g. pypi.org):
$ sudo resolvectl flush-caches
$ resolvectl query pypi.org
pypi.org: 151.101.0.223     -- link: tun0
# WORKS — response comes directly from upstream
  4. Wait ~30 seconds for cache to be populated
  5. Query again:
$ sudo resolvectl flush-caches
$ resolvectl query pypi.org
pypi.org: resolve call failed: Received invalid reply
# FAILS — response from cache with corrupted EDNS
  6. Restart sing-box to clear cache → works again immediately.

Not all domains trigger this. The corruption depends on the OPT record content from the upstream — specifically whether the MBZ/Z bits (stored in the lower 16 bits of OPT's "TTL" field) are nonzero. Domains like github.com and google.com happen to work because their OPT metadata produces values that don't catastrophically corrupt the version byte.

Evidence

sing-box logs show the corruption at cache write time:

# Upstream exchange (correct):
dns: exchanged OPT OPT PSEUDOSECTION: EDNS: version 0 flags: MBZ: 0x000c, udp: 1232

# Cache write (corrupted):
dns: cached OPT OPT PSEUDOSECTION: EDNS: version 255 flags: do co MBZ: 0x3fcf, udp: 1232

systemd-resolved debug logs:

systemd-resolved: Using feature level UDP+EDNS0 for transaction 26075.
systemd-resolved: Processing incoming packet on transaction 26075 (rcode=SUCCESS).
systemd-resolved: EDNS version newer that our request, bad server.
systemd-resolved: Regular transaction 26075 now complete with <invalid-reply>

Changes

  • Skip dns.TypeOPT records in all TTL computation and assignment loops in Exchange() and loadResponse() (5 locations)
  • Use message.Copy() in storeCache() to isolate cache from caller mutations (4 cache calls)
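
The first change can be sketched as a pair of loops that skip OPT records. This is an illustration under assumptions, not the patch itself: `rr`, `minTTL`, and `setTTL` are hypothetical names, and the real code works on `dns.RR` values inside Exchange() and loadResponse(). The only fact taken from the protocol is that OPT has RR type code 41.

```go
package main

import "fmt"

const (
	typeA   uint16 = 1  // A record type code
	typeOPT uint16 = 41 // OPT pseudo-record type code (RFC 6891)
)

// rr is a hypothetical stand-in for a DNS resource record.
type rr struct {
	Type uint16
	TTL  uint32
}

// minTTL returns the smallest TTL among non-OPT records.
// ok is false when no such record exists.
func minTTL(sections ...[]rr) (ttl uint32, ok bool) {
	for _, section := range sections {
		for _, r := range section {
			if r.Type == typeOPT {
				continue // OPT's TTL field holds EDNS0 metadata, not a lifetime
			}
			if !ok || r.TTL < ttl {
				ttl, ok = r.TTL, true
			}
		}
	}
	return ttl, ok
}

// setTTL overwrites the TTL of every non-OPT record in place.
func setTTL(ttl uint32, sections ...[]rr) {
	for _, section := range sections {
		for i := range section {
			if section[i].Type == typeOPT {
				continue // leave version and flag bits untouched
			}
			section[i].TTL = ttl
		}
	}
}

func main() {
	answer := []rr{{Type: typeA, TTL: 300}, {Type: typeA, TTL: 60}}
	extra := []rr{{Type: typeOPT, TTL: 0x0000000c}}

	ttl, _ := minTTL(answer, extra)
	setTTL(ttl-30, answer, extra) // simulate 30 seconds spent in cache

	fmt.Println(ttl, answer[0].TTL, answer[1].TTL) // 60 30 30
	fmt.Printf("%#x\n", extra[0].TTL)              // 0xc
}
```

Without the two `continue` statements, the minimum in this example would be 12 (the OPT metadata read as seconds), and the OPT record would then be rewritten along with the answers.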

Related

Why previous fixes didn't resolve this

Several commits addressed symptoms but not the root cause:

  • ec4d472: Added EDNS version downgrade logic — if cached response has bad version, strips OPT and creates fresh one. Masks the corruption for the current response, but cache still holds corrupted data.
  • e7ef1b2: Moved storeCache() before response.Id = messageId. Fixes ID mutation, but the EDNS version downgrade code after it still mutates response.Extra on the same pointer as the cached object.
  • f98a3a4: Extended "simple request" check to include OPT records with Ttl == 0. Works as an accidental filter — OPT records with nonzero EDNS0 metadata have Ttl != 0, bypassing cache. But some upstream OPT records have nonzero MBZ bits, hitting the corruption.

None of these address the actual problem: the TTL loops modify the OPT record's TTL field.

The TTL computation and assignment loops treat OPT record's Hdr.Ttl
as a regular TTL, but per RFC 6891 it encodes EDNS0 metadata
(ExtRCode|Version|Flags). This corrupts cached responses causing
systemd-resolved to reject them with EDNS version 255.

Also fix pointer aliasing: storeCache() stored raw *dns.Msg pointer
so subsequent mutations by Exchange() corrupted cached data.

- Skip OPT records in all TTL loops (Exchange + loadResponse)
- Use message.Copy() in storeCache() to isolate cache from mutations

nekohasekai force-pushed the testing branch 4 times, most recently from 6967a17 to a2aa2ac on April 10, 2026 03:36
nekohasekai merged commit cda3883 into SagerNet:testing on Apr 10, 2026