Skip to content

feature: TLS passthrough will select known app_id from TXT record w multiple entries#653

Open
wwwehr wants to merge 1 commit intoDstack-TEE:masterfrom
wwwehr:npw/gateway-support-multi-txt-records
Open

feature: TLS passthrough will select known app_id from TXT record w multiple entries#653
wwwehr wants to merge 1 commit intoDstack-TEE:masterfrom
wwwehr:npw/gateway-support-multi-txt-records

Conversation

@wwwehr
Copy link
Copy Markdown

@wwwehr wwwehr commented Apr 21, 2026

Problem

As a phala cloud user/operator, I want to have two dstack-ingress instances share a
given _dstack-app-address.xxx so that I can properly route a shared hostname
using a load balancer. Currently, the router only accepts first dns record it
discovers which, also, isn't deterministic.

Fix

Extract select_app_address to iterate all values in the TXT record and return
the first whose app_id is known to the local gateway's apps map. Falls back
to the first parseable entry when none are locally known (preserves behavior
for single-instance setups).

The compat path (_tapp-address legacy) is unchanged.

Tests

  • test_resolve_app_address — original live DNS integration test with mock state
  • test_select_app_address_prefers_local — multi-record, picks local app_id
  • test_select_app_address_fallback_when_none_local — multi-record, fallback to first

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR updates the TLS passthrough SNI router to better handle DNS TXT app-address lookups when multiple entries are present, preferring an app address whose app_id is already known locally (and falling back to the first parseable entry otherwise).

Changes:

  • Adds select_app_address(...) to choose an app address from TXT data, preferring locally-known app_ids with a fallback behavior.
  • Threads &Proxy into resolve_app_address(...) so selection can consult the in-memory apps map.
  • Adds unit tests for selection behavior and updates the existing DNS integration test to construct a Proxy.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines 71 to +80
for lookup in [lookup, lookup_legacy] {
let Ok(lookup) = lookup else {
continue;
};
let Some(txt_record) = lookup.iter().next() else {
continue;
};
let Some(data) = txt_record.txt_data().first() else {
continue;
};
return AppAddress::parse(data)
.with_context(|| format!("failed to parse app address for {sni}"));
let locked = state.lock();
if let Ok(addr) = select_app_address(txt_record.txt_data(), &locked.state.apps) {
return Ok(addr);
Copy link

Copilot AI Apr 22, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

resolve_app_address still only inspects the first TXT RR (lookup.iter().next()), so multi-record TXT sets (multiple RRs for the same name, which is common when publishing multiple values) will remain non-deterministic and may ignore locally-known app_ids. To match the stated multi-entry behavior, iterate across all TXT records returned by the lookup and consider all txt_data() strings when selecting the address (prefer local matches across the full set, otherwise fall back to the first parseable).

Copilot uses AI. Check for mistakes.
Comment on lines 83 to 89
} else if let Ok(lookup) = resolver.txt_lookup(txt_domain).await {
if let Some(txt_record) = lookup.iter().next() {
if let Some(data) = txt_record.txt_data().first() {
return AppAddress::parse(data)
.with_context(|| format!("failed to parse app address for {sni}"));
let locked = state.lock();
if let Ok(addr) = select_app_address(txt_record.txt_data(), &locked.state.apps) {
return Ok(addr);
}
}
Copy link

Copilot AI Apr 22, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the non-wildcard branches the error from select_app_address(...) is discarded (if let Ok(addr) = ...), so cases like “TXT record exists but no parseable/usable entries” end up as a generic resolve failure (or a misleading wildcard-lookup error) with no indication that the primary TXT record was present but invalid. Consider retaining the selection error and returning it (with domain context) when no later branch succeeds, so operators can diagnose malformed records.

Copilot uses AI. Check for mistakes.
Comment on lines 59 to +60
/// resolve app address by sni
async fn resolve_app_address(prefix: &str, sni: &str, compat: bool) -> Result<AppAddress> {
async fn resolve_app_address(prefix: &str, sni: &str, compat: bool, state: &Proxy) -> Result<AppAddress> {
Copy link

Copilot AI Apr 22, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

resolve_app_address now depends on &Proxy solely to read locked.state.apps, which tightens coupling and forces tests to construct a full Proxy (spawning background tasks, initializing KvStore/certbot/acceptors) just to exercise DNS parsing/selection. A simpler, more testable shape would be to pass in a read-only view of known app_ids (e.g., a &BTreeMap<...> or &BTreeSet<String> captured/cloned by the caller after locking) and keep DNS resolution independent of the full proxy runtime.

Copilot uses AI. Check for mistakes.
@Leechael
Copy link
Copy Markdown
Collaborator

Leechael commented Apr 22, 2026

Thanks for putting this together. Before landing, two concerns about the DNS layer this operates on — the fix targets a layer that won't see the scenario it's meant to solve.

1. RRset vs. character-strings within one record

DNS has two independent "multi-value" axes at a TXT name:

  • Multiple records in the RRset (each RR is a separate entry). Independent writers — e.g., two dstack-ingress instances each calling their provider's set_txt_record — produce this shape.
  • Multiple <character-string> within a single RR (RFC 1035 §3.3.14). Normally used to encode a single logical value >255 bytes (SPF/DKIM), not alternatives. Producing this shape requires one coordinated writer.

In hickory-resolver 0.24.4, TxtLookup::iter() walks the RRset (yields &TXT, one per RR), and TXT::txt_data() returns the character-strings of one RR. The new code still does:

if let Some(txt_record) = lookup.iter().next() {     // picks ONE RR from the RRset
    if let Ok(addr) = select_app_address(txt_record.txt_data(), ...) {
        //                               ^^^^^^^^^^^^^^^^^^^^^^
        //                               iterates strings within that one RR

So select_app_address never sees a second writer's TXT in the multi-writer scenario — lookup.iter().next() already collapsed the RRset to one RR, nondeterministically. The tests pass because they feed select_app_address a Vec<Box<[u8]>> directly, simulating the single-RR-with-two-strings shape, which bypasses the lookup.iter() step. Against live DNS (where each writer produces one RR with one string), this code's output equals the old .first() path. The same structure exists in the compat branch and the wildcard-fallback branch.

If the intent is "iterate across the RRset", the shape is more like:

for txt_record in lookup.iter() {
    if let Some(data) = txt_record.txt_data().first() {
        // apply known-app selection against `data`
    }
}

2. Upstream writer makes this scenario non-stationary regardless

Even with the gateway-side read fixed, the reference ingress writer (dstack-examples/custom-domain/dstack-ingress/scripts/dns_providers/base.py) defines set_txt_record, set_caa_record, and set_cname_record as delete-then-create against any existing record whose content doesn't match the writer's own. Two instances with different APP_ID:PORT or different LE accounturi overwrite each other on every container start — the RRset has one element at a time, not two. CAA is strictly worse: RFC 8659 permits additive issue records, but the code deletes on tag match, so two LE accounts can't coexist without manual DNS editing.

As you note, the co-tenant scenario is reachable today only by manually pruning CAA and manually forcing two projects onto the same name — not a supported setup. That's fine as a niche, but it means the premise for "TXT records will contain multiple app_id candidates the gateway should choose from" depends on out-of-band manual DNS edits that also have to survive every ingress restart.

Suggestion

If the goal is to keep the door open for externally-coordinated multi-writer setups:

  1. Move the iteration to lookup.iter() so the gateway actually reads the RRset rather than one RR's strings (apply in all three call sites — primary, compat, wildcard).
  2. Document the preconditions (manual DNS co-management, additive CAA, TTL vs. restart-rewrite races) so operators don't expect this to work with stock dstack-ingress.

If the goal is just to make TXT selection deterministic on the common single-record path, the smaller change is fine — but then the PR description should drop the "two shared ingress instances" framing, since the upstream writer prevents that RRset shape from appearing on the wire.

Happy to iterate on this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants