NixOS: can't get machine FQDN when resolved is enabled #132646

Open
B4dM4n opened this issue Aug 4, 2021 · 15 comments · Fixed by #263634

@B4dM4n
Contributor

B4dM4n commented Aug 4, 2021

Describe the bug

When services.resolved.enable = true is used (either manually or with networking.useNetworkd = true), I can no longer get the FQDN of my machine with hostname -f.

Steps To Reproduce

Steps to reproduce the behavior:

  1. Check out f09a770 (current master commit)
  2. Put services.resolved.enable = true in nixos/tests/hostname.nix
  3. Run nix-build ./nixos/tests/hostname.nix -A explicitDomain
  4. The test fails:
...
ahost: must succeed: hostname --fqdn
(0.01 seconds)
error:
Traceback (most recent call last):
  File "/nix/store/mb9lhv3n6h20ybbpr2j8xdhbv5jgijwb-nixos-test-driver/bin/.nixos-test-driver-wrapped", line 943, in run_tests
    exec(tests, globals())
  File "<string>", line 1, in <module>
  File "<string>", line 11, in <module>
AssertionError
cleaning up
killing ahost (pid 10)
(0.00 seconds)

Expected behavior

The test should succeed and hostname -f should return the FQDN when networking.domain is set.

Additional context

With the above procedure I could bisect the failure to #130503.

After #130503 the noExplicitDomain test in nixos/tests/hostname.nix also fails with the following error:

...
(0.06 seconds)
ahost: must succeed: getent hosts 127.0.0.1 | awk '{print $2}'
ahost: output:
error:
Traceback (most recent call last):
  File "/nix/store/mb9lhv3n6h20ybbpr2j8xdhbv5jgijwb-nixos-test-driver/bin/.nixos-test-driver-wrapped", line 943, in run_tests
    exec(tests, globals())
  File "<string>", line 1, in <module>
  File "<string>", line 22, in <module>
  File "/nix/store/mb9lhv3n6h20ybbpr2j8xdhbv5jgijwb-nixos-test-driver/bin/.nixos-test-driver-wrapped", line 483, in succeed
    raise Exception(
Exception: command `getent hosts 127.0.0.1 | awk '{print $2}'` failed (exit code 2)
cleaning up
killing ahost (pid 10)
(0.00 seconds)

Notify maintainers

@flokli

Metadata

[user@system:~]$ nix-shell -p nix-info --run "nix-info -m"
 - system: `"x86_64-linux"`
 - host os: `Linux 5.13.0, NixOS, 21.11 (Porcupine)`
 - multi-user?: `yes`
 - sandbox: `yes`
 - version: `nix-env (Nix) 2.4pre20210712_099df07`
 - nixpkgs: `/nix/var/nix/profiles/per-user/root/channels/nixos`

Maintainer information:

# a list of nixpkgs attributes affected by the problem
attribute:
# a list of nixos modules affected by the problem
module:
@flokli
Contributor

flokli commented Aug 4, 2021

That's interesting. /etc/nsswitch.conf should contain:

without resolved:

hosts:     mymachines files myhostname dns

with resolved:

hosts:     mymachines resolve [!UNAVAIL=return] files myhostname dns

This suggests something inside the resolve module resolves things differently.
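To make the difference between the two configurations easier to see, here is a minimal Python sketch that splits each `hosts:` line into modules and their attached actions. The helper name `parse_hosts_line` is made up for illustration and is not part of any NSS tooling. As far as I understand the NSS action syntax, `[!UNAVAIL=return]` means the lookup stops after `resolve` unless that module is unavailable, so `files`, `myhostname` and `dns` would only be consulted when systemd-resolved is not running.

```python
import re

WITHOUT_RESOLVED = "hosts:     mymachines files myhostname dns"
WITH_RESOLVED = "hosts:     mymachines resolve [!UNAVAIL=return] files myhostname dns"


def parse_hosts_line(line):
    """Split an nsswitch.conf 'hosts:' line into (module, [actions]) pairs.

    Hypothetical helper for illustration only.
    """
    _, _, rest = line.partition(":")
    entries = []
    for token in re.findall(r"\[[^\]]*\]|\S+", rest):
        if token.startswith("["):
            # An action such as [!UNAVAIL=return] belongs to the module before it.
            if entries:
                entries[-1][1].append(token.strip("[]"))
        else:
            entries.append((token, []))
    return entries


for label, line in [("without resolved", WITHOUT_RESOLVED),
                    ("with resolved", WITH_RESOLVED)]:
    print(label, parse_hosts_line(line))
```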

What does hostname --fqdn do under the hood?

@primeos
Member

primeos commented Aug 4, 2021

> This suggests something inside the resolve module resolves things differently.

Yes, systemd doesn't know about the domain name, only the hostname. IIRC it should work with this:

hosts:     mymachines files resolve [!UNAVAIL=return] myhostname dns

Edit: Not so sure anymore, I might have forgotten some details. IIRC resolve should also look at /etc/hosts.
From man systemd-resolved:

  • The mappings defined in /etc/hosts are resolved to their configured addresses and back, but they will not affect lookups for non-address types (like MX). Support for /etc/hosts may be disabled with ReadEtcHosts=no, see resolved.conf(5).
  • This resolver reads and caches /etc/hosts internally. (In other words, nss-resolve replaces nss-files in addition to nss-dns). Entries in /etc/hosts have highest priority.
  • Some names are always resolved internally (see Synthetic Records above). Traditionally they would be resolved by nss-files, and only if provided in /etc/hosts.

I'm not sure what the best solution is here. Maybe we should just do what systemd does and drop the FQDN from /etc/hosts again, if keeping it causes more problems than it's worth (I lost track of all the advantages and drawbacks).

The Synthetic Records likely cause this problem:

  • The local, configured hostname is resolved to all locally configured IP addresses ordered by their scope, or — if none are configured — the IPv4 address 127.0.0.2 (which is on the local loopback interface) and the IPv6 address ::1 (which is the local host).

> What does hostname --fqdn do under the hood?

It was implemented like this when I last looked at it: https://gist.github.com/primeos/8f7f8e1e95518076ef38924125a2f921

@flokli
Contributor

flokli commented Aug 4, 2021

> This suggests something inside the resolve module resolves things differently.
>
> Yes, systemd doesn't know about the domain name, only the hostname. IIRC it should work with this:
>
> hosts:     mymachines files resolve [!UNAVAIL=return] myhostname dns

We shouldn't deviate from what upstream recommends. There are many reasons why NSS modules are configured in the order they are (see the upstream docs on it).

/etc/hosts is meant to be the place for overrides for myhostname - which is why files comes before myhostname (and resolve internally uses the same order).

Maybe we populate our /etc/hosts file wrongly, or maybe we should use another hostname implementation (both proposed in #119236). Maybe myhostname would do the right thing out of the box, but we have something misleading in /etc/hosts?

@stale

stale bot commented Apr 29, 2022

I marked this as stale due to inactivity.

@stale stale bot added the "2.status: stale" label Apr 29, 2022
@mpasternacki
Contributor

The report doesn't mention what hostname --fqdn actually prints for the author, so I'm not sure if it's the same issue. On my system (nixpkgs-unstable, with networking.hostName and networking.domain set), hostname --fqdn shows only the short hostname, without the domain.

[japhy@bmo:~] % nix-shell -p nix-info --run "nix-info -m"
 - system: `"x86_64-linux"`
 - host os: `Linux 6.0.17, NixOS, 23.05 (Stoat), 23.05.20230106.b3818a4`
 - multi-user?: `yes`
 - sandbox: `yes`
 - version: `nix-env (Nix) 2.12.0`
 - nixpkgs: `/etc/nixpkgs/channels/nixpkgs`

@stale stale bot removed the "2.status: stale" label Jan 8, 2023
@wkral
Contributor

wkral commented May 18, 2023

I was attempting to fix nixos.tests.networking.hostname.explicitDomain for ZHF and saw that this issue was seemingly fixed at some point, but broke again in this build on Feb 3rd: https://hydra.nixos.org/build/208026588. I bisected the changes and it seems it broke with fbfe290, which introduced nsncd (assuming I bisected correctly).

I'm sort of piecing things together, from man hostname:

The FQDN is the canonical name returned by gethostbyname2(2) when resolving the result of the gethostname(2) name. The DNS domain name is the part after the first dot.

I browsed through the code and noticed gethostbyname and some others are not currently implemented, but from twosigma/nsncd#49 it seems @flokli you're working on that in nsncd. So perhaps this gets resolved when that does? I didn't see any mention of gethostbyname2 in the project but maybe there is something equivalent that will resolve this issue/fix that test.
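The man page wording quoted above can be approximated in Python. This is only a minimal sketch, not the actual hostname source: it uses getaddrinfo with AI_CANONNAME in place of gethostbyname2 (an assumption), and the function name `fqdn` and its injectable parameters are made up so the logic can be exercised without a real resolver.

```python
import socket


def fqdn(gethostname=socket.gethostname, getaddrinfo=socket.getaddrinfo):
    """Return the canonical name the resolver reports for the local hostname.

    Hypothetical helper: falls back to the plain hostname when the resolver
    supplies no canonical name.
    """
    name = gethostname()
    infos = getaddrinfo(name, None, flags=socket.AI_CANONNAME)
    for _family, _type, _proto, canonname, _sockaddr in infos:
        # Only the first result normally carries the canonical name.
        if canonname:
            return canonname
    return name
```

Called as `fqdn()` with no arguments, this goes through the C library resolver, so on an affected machine it should report the same name that the NSS stack hands to hostname --fqdn. The point of the sketch is that the answer depends entirely on what canonical name the resolver returns for the short hostname.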

@picnoir
Member

picnoir commented May 18, 2023

Related:

#196934
nix-community/nsncd#4

picnoir added a commit to picnoir/nixpkgs that referenced this issue Oct 26, 2023
Note: we decided to rewrite the history of the fork, which had somehow
gotten out of hand. Feature-wise, this version bump fixes the various
faulty host lookup behaviours. See the
nix-community/nsncd#9 and
nix-community/nsncd#10 PRs for more details.

We're in the process of upstreaming this change to twosigma/nsncd,
however, upstream has been pretty slow to review our PRs so far. Since
the hostname bug surfaces quite regularly in the Nixpkgs issue
tracker, we decided to use the nix-community fork as canon for Nixpkgs
for now.

Fixes: NixOS#132646
Fixes: NixOS#261269
@999eagle
Contributor

I still have this issue on my system (or a closely related one).

The exact conditions to trigger this issue seem to be:

  • use systemd-resolved and nsncd
  • have networking.enableIPv6 = true

hostname --short returns my actual (short) hostname, but hostname --fqdn always returns localhost for me.

If I stop systemd-resolved, stop nscd or remove the line ::1 localhost from /etc/hosts, the correct FQDN is returned.

Relevant systemd-resolved logs for the error case:
varlink-28: New incoming message: {"method":"io.systemd.Resolve.ResolveHostname","parameters":{"name":"$hostname","family":10,"flags":0}}
Looking up RR for $hostname IN AAAA.
Following CNAME/DNAME $hostname → localhost.
varlink-28: Sending message: {"parameters":{"addresses":[{"family":10,"address":[0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1]}],"name":"localhost","flags":786945}}
Relevant systemd-resolved logs when IPv6 is disabled:
varlink-28: New incoming message: {"method":"io.systemd.Resolve.ResolveHostname","parameters":{"name":"$hostname","family":10,"flags":0}}
Looking up RR for $hostname IN AAAA.
varlink-28: Sending message: {"error":"io.systemd.Resolve.NoSuchResourceRecord","parameters":{}}
varlink-28: New incoming message: {"method":"io.systemd.Resolve.ResolveHostname","parameters":{"name":"$hostname","family":2,"flags":0}}
Looking up RR for $hostname IN A.
Following CNAME/DNAME $hostname → $fqdn.
varlink-28: Sending message: {"parameters":{"addresses":[{"family":2,"address":[127,0,0,2]}],"name":"$fqdn","flags":786945}}

When the nscd service is stopped, no DNS query is logged by systemd-resolved. It looks like nsncd first tries to resolve the short hostname over IPv6 only, and systemd-resolved rewrites this to localhost because all three names (localhost, the short hostname and the FQDN) are mapped to the same IPv6 address ::1 in /etc/hosts.

@picnoir
Member

picnoir commented Oct 27, 2023

I can't reproduce :(

Could you give me the Nixpkgs commit ref you used for this test? Did you double-check that 92a4138 is in your pin?

@999eagle
Contributor

I'm on 8efd5d1e283604f75a808a20e6cde0ef313d07d4 but with nsncd overridden from 364a38956d05b52e67bf3a3bcc9640368786bae1.

@picnoir
Member

picnoir commented Oct 27, 2023

Gah. This will never end.

Do we have a VM test for this?

I can't check it out rn. Re-opening the issue in the meantime.

@picnoir picnoir reopened this Oct 27, 2023
@picnoir
Member

picnoir commented Oct 27, 2023

cc @Mic92 for #132646 (comment)

Maybe you have more context about this name resolution since you added this ::1 line in the hosts file.

@999eagle, could you try the same test with services.nscd.enableNsncd = false; to see whether we hit the same behaviour with nscd? It'd be handy to know whether this is nsncd-related or not.

@999eagle
Contributor

999eagle commented Oct 27, 2023

I've tried again with nscd instead of nsncd now and it shows the exact same issue with the same logs in systemd-resolved. The "default" nscd apparently also tries to resolve the FQDN using IPv6 first. systemd-resolved still resolves the short hostname to ::1 with the name localhost as that's the first name matching ::1 in /etc/hosts.
Manually changing /etc/hosts to contain the line ::1 $fqdn $hostname localhost makes systemd-resolved correctly return the FQDN. The discussion in #119236 which was already linked here before seems to be very relevant to this.

Edit: My workaround for this has been system.nssDatabases.hosts = lib.mkOrder 500 ["files"]; which still works now. If this is set, no DNS requests for my hostname hit systemd-resolved and getent hosts has some interesting differences.

# with the workaround applied
$ getent hosts ::1
::1             localhost
$ getent hosts $hostname
::1             $fqdn $hostname

# without the workaround
$ getent hosts ::1
::1             localhost $hostname $fqdn
$ getent hosts $hostname
::1             localhost
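The difference between the two getent outputs can be reproduced with a small Python model. This is a simplified sketch of the behaviour described in this thread, not of systemd-resolved's or nss-files' actual code. The /etc/hosts layout and the names `ahost` and `example.org` are placeholders, and `files_style` and `resolved_style` are hypothetical helpers.

```python
# Assumed /etc/hosts layout; "ahost" and "example.org" are placeholder names.
SAMPLE_HOSTS = """\
127.0.0.1 localhost
::1 localhost
127.0.0.2 ahost.example.org ahost
::1 ahost.example.org ahost
"""


def parse_hosts(text):
    """Return [(address, [names...])], skipping comments and blank lines."""
    entries = []
    for line in text.splitlines():
        fields = line.split("#", 1)[0].split()
        if len(fields) >= 2:
            entries.append((fields[0], fields[1:]))
    return entries


def is_v6(addr):
    return ":" in addr


def files_style(entries, name, want_v6):
    """Model of a per-line lookup: the canonical name is the first name
    on the first line of the requested family that mentions `name`."""
    for addr, names in entries:
        if name in names and is_v6(addr) == want_v6:
            return addr, names[0]
    return None


def resolved_style(entries, name, want_v6):
    """Model of the behaviour reported above: find the address for `name`,
    then report the first name listed for that address anywhere in the file."""
    hit = files_style(entries, name, want_v6)
    if hit is None:
        return None
    addr = hit[0]
    for other_addr, names in entries:
        if other_addr == addr:
            return addr, names[0]
    return None


entries = parse_hosts(SAMPLE_HOSTS)
print(files_style(entries, "ahost", want_v6=True))      # ('::1', 'ahost.example.org')
print(resolved_style(entries, "ahost", want_v6=True))   # ('::1', 'localhost')
print(resolved_style(entries, "ahost", want_v6=False))  # ('127.0.0.2', 'ahost.example.org')
```

Under these assumptions the IPv6 lookup yields localhost only in the second model, because the `::1 localhost` line comes first in the file. The IPv4 lookup is unaffected since 127.0.0.2 is listed only for the FQDN line, which matches the logs above.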

@B4dM4n
Contributor Author

B4dM4n commented Oct 27, 2023

> Do we have a VM test for this?

The reproduction steps from #132646 (comment) are sufficient to trigger the bug.

The test still fails when services.resolved.enable = true; is added. Adding services.nscd.enableNsncd = false; also does not fix the issue (tested on master c05811c).

@sedlund
Contributor

sedlund commented Jan 8, 2024

On 23.11 with

networking.hostName = "nixos";
networking.domain = "lan";
systemd.network.enable = true;

I get:

$ hostname
nixos

$ hostname -f
localhost

Fedora 39 has files first. More specifically:

hosts:      files myhostname resolve [!UNAVAIL=return] dns

Red Hat is a large contributor to systemd, uses it in their distribution and recommends it.

I made files first and hostname -f returns nixos.lan as expected. I also removed [!UNAVAIL=return], as it caused my pings to short names in my resolved.domains to time out and falsely claim they didn't exist.

9 participants