Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Disable nscd caching #50316

Merged
merged 8 commits into from Dec 12, 2018

Conversation

@arianvp
Copy link
Contributor

arianvp commented Nov 13, 2018

Motivation for this change

Nscd caching was interferring with systemd's nss module. Disable caching as
caching is not the reason why we use nscd. We use it to resolve nss modules correctly. See commit message for more
detailed information.

Fixes #50273

Things done
  • Tested using sandboxing (nix.useSandbox on NixOS, or option sandbox in nix.conf on non-NixOS)
  • Built on platform(s)
    • NixOS
    • macOS
    • other Linux distributions
  • Tested via one or more NixOS test(s) if existing and applicable for the change (look inside nixos/tests)
  • Tested compilation of all pkgs that depend on this change using nix-shell -p nox --run "nox-review wip"
  • Tested execution of all binary files (usually in ./result/bin/)
  • Determined the impact on package closure size (by running nix path-info -S before and after)
  • Fits CONTRIBUTING.md.

@arianvp arianvp requested a review from Infinisil as a code owner Nov 13, 2018

@arianvp arianvp force-pushed the arianvp:fix-dynamic-user branch 3 times, most recently Nov 13, 2018

@arianvp arianvp changed the title Fix dynamic user Disable nscd caching Nov 13, 2018

@arianvp

This comment has been minimized.

Copy link
Contributor Author

arianvp commented Nov 13, 2018

Supersedes #50042 if we also disable host caching

@arianvp

This comment has been minimized.

Copy link
Contributor Author

arianvp commented Nov 13, 2018

CC @Mic92 . We could also replace it with unscd while we're at it.

@arianvp

This comment has been minimized.

Copy link
Contributor Author

arianvp commented Nov 13, 2018

I haven't actually tried setting enable-cache no but just took @dezgeg 's word for it that it only works if we set ttls to 0 . We do the same when we enable sssd. I also elaborated this in the commit message but haven't actually tested this hypothesis. I could also try fully disabling the caches.

@Mic92

This comment has been minimized.

Copy link
Contributor

Mic92 commented Nov 13, 2018

I think we should first disable negative-time-to-live caches for hosts:

negative-time-to-live hosts 0

This should fix problems people have when they cannot reach a site, when the hostname was resolved while having no connectivity.
And then test unscd for a while. It could have even funnier bugs.

@lheckemann

This comment has been minimized.

Copy link
Member

lheckemann commented Nov 13, 2018

I think this is worth mentioning in the release notes.

@peterhoeg

This comment has been minimized.

Copy link
Member

peterhoeg commented Nov 14, 2018

I think we should first disable negative-time-to-live caches

We might as well disable the positive TTL too and say that nscd is not for caching. Whenever something unexplained happens with name resolution, my first step is always just to restart nscd which solves it nine times out of 10.

@edolstra

This comment has been minimized.

Copy link
Member

edolstra commented Nov 14, 2018

Hm, what's the advantage of disabling caching of positive results?

@arianvp

This comment has been minimized.

Copy link
Contributor Author

arianvp commented Nov 14, 2018

Basically I wanted to be 100% sure nothing fishy is happening. I disabled everything to be 100% sure no funny stuff is happening. Systemd project acknowledged that they are not nscd-aware and they might look into fixing that. This means there might be more subtle bugs lurking in the presence of caching. I'm not sure if the positive cache is invalidated when a systemd unit with Dynamic users stops. If it doesn't, this might cause funny bugs when another systemd service with Dynamic users recycles that uid and discovers there's already a name associated to the uid.

user lookups are basically instantaneous and the only moment we would want to cache them is in the presence of LDAP to improve performance. But sssd already does its own caching and so we already disable nscd caching when sssd is enabled in NixOS and Red Hat (https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/6/html/deployment_guide/usingnscd-sssd)

So the only moment we seem to have caching is when the passwd database is backed by a file. I think this is a bit overkill and will cause more issues than gains.

Edit: @edolstra oh I think I now just realised you were talking about caching of hosts, not the other caches.

@arianvp

This comment has been minimized.

Copy link
Contributor Author

arianvp commented Nov 14, 2018

I've disabled negative caching for hosts for now. and left positive caching in.

I'm not very familiar with the release process of NixOS. What should a release notes entry look like, and where do I put it?

@Mic92

This comment has been minimized.

@arianvp

This comment has been minimized.

Copy link
Contributor Author

arianvp commented Nov 14, 2018

Now that sssd and nscd ship basically the same config, is it worth to remove the settings from sssd ?

@Mic92

This comment has been minimized.

Copy link
Contributor

Mic92 commented Nov 14, 2018

@arianvp when they are 100% the same, we can drop the sssd one.

suggested-size group 211
check-files group yes
persistent group no
shared group yes

enable-cache hosts yes
positive-time-to-live hosts 600
negative-time-to-live hosts 5
negative-time-to-live hosts 0

This comment has been minimized.

Copy link
@Mic92

Mic92 Nov 14, 2018

Contributor

I wonder if we at least should backport this one, since it can be quiet annoying if websites do not load correctly after connecting to a hotspot.

This comment has been minimized.

Copy link
@Profpatsch

Profpatsch Nov 20, 2018

Member

Yes, I’ve had this problem for more than a month now, so the release must have it as well.

@peterhoeg

This comment has been minimized.

Copy link
Member

peterhoeg commented Nov 15, 2018

Hm, what's the advantage of disabling caching of positive results?

We don't gain anything by having it enabled and disabling it removes one potential source of problems.

nixos/doc/manual/release-notes/rl-1903.xml Outdated
<listitem>
<para>
The <literal>nscd</literal> now disables all caching of
<literal>passwd</literal> and <literal>group</litral> databases by

This comment has been minimized.

Copy link
@jokogr

jokogr Nov 15, 2018

Contributor

Should be </literal> the closing tag in group

This comment has been minimized.

Copy link
@lheckemann

lheckemann Nov 15, 2018

Member

nix build -f my-nixpkgs/nixos config.system.build.manual --arg configuration {} to make sure the manual builds :)

@Mic92

This comment has been minimized.

Copy link
Contributor

Mic92 commented Nov 15, 2018

@peterhoeg we gain a dns cache if it is enabled. I don't see how this would cause problems if it caches valid entries. DNS is designed around for those caches and uses TTLs itself to control invalidation.

@@ -245,6 +245,66 @@
options.
</para>
</listitem>
<listitem>
<para>
The <literal>nscd</literal> now disables all caching of

This comment has been minimized.

Copy link
@flokli

flokli Dec 5, 2018

Contributor
Suggested change
The <literal>nscd</literal> now disables all caching of
The <literal>nscd</literal> service now disables all caching of
of <literal>DynamicUser=</literal> in systemd services.
The was already the default behaviour in presence of
<literal>services.sssd.enable = true</literal> because nscd caching
would interfere sssd in unpredictable ways as well.Because we're using nscd

This comment has been minimized.

Copy link
@flokli

flokli Dec 5, 2018

Contributor
Suggested change
would interfere sssd in unpredictable ways as well.Because we're using nscd
would interfere with `sssd` in unpredictable ways as well. Because we're using nscd

(from #50316 (comment))

@arianvp

This comment has been minimized.

Copy link
Contributor Author

arianvp commented Dec 5, 2018

@edolstra I reverted your revert, as I don't think it fixes anything. I'm 100% sure that nscd already invalidates the cache when /etc/resolv.conf is changed, through inotify.

@arianvp arianvp force-pushed the arianvp:fix-dynamic-user branch to aeb4aa1 Dec 5, 2018

@arianvp

This comment has been minimized.

Copy link
Contributor Author

arianvp commented Dec 5, 2018

After reading the discussion at #42569 perhaps we should only unrevert the commit Eelco did, after we fully disable caching in nscd. I added Eelco's revert to the patchset again ...

I think we should merge this as is, and then in a next PR:

  1. introducde systemd-resolved
  2. Disable positive caching of domains in nscd
  3. Remove all forms of restart triggers / cache invalidations of nscd

@flokli flokli referenced this pull request Dec 6, 2018

Merged

GCE OSLogin module: init #51566

8 of 14 tasks complete
<literal>libnss_systemd.so</literal> module which is used by
<literal>systemd</literal> to manage uids and usernames in the presence
of <literal>DynamicUser=</literal> in systemd services.
The was already the default behaviour in presence of

This comment has been minimized.

Copy link
@flokli

flokli Dec 6, 2018

Contributor
Suggested change
The was already the default behaviour in presence of
This was already the default behaviour in presence of
@flokli
Copy link
Contributor

flokli left a comment

more doc feedback

@arianvp

This comment has been minimized.

Copy link
Contributor Author

arianvp commented Dec 12, 2018

I have addressed the doc issues. I think this is ready for merge. I have created a follow-up ticket with the rest of the tasks that popped up in this discussion #51911

@@ -4,25 +4,41 @@ paranoia no
debug-level 0

enable-cache passwd yes

This comment has been minimized.

Copy link
@flokli

flokli Dec 12, 2018

Contributor

do we want to disable caches where we set a zero ttl for both positive and negative? Seems less confusing to me :-)

This comment has been minimized.

Copy link
@arianvp

arianvp Dec 12, 2018

Author Contributor

As eelco mentioned, this would have a serious performance penalty. so I don't want to change it until #51911 is implemented. I think this is a good compromise where people do not get failed lookups when switching networks, but do also have performance. Until we figure out the resolved business

This comment has been minimized.

Copy link
@arianvp

arianvp Dec 12, 2018

Author Contributor

Oh I misread: See the original commit message why we can't do this:

Note that we can not just put in /etc/nscd.conf:
enable-cache passwd no

As this will actually cause glibc to _not_ forward the call to nscd
at all, and thus never reach the nss modules. Instead we set
the negative and positive cache ttls  to 0 seconds as a workaround.
This way, Glibc will always forward requests to nscd, but results
will never be cached.

This comment has been minimized.

Copy link
@arianvp

arianvp Dec 12, 2018

Author Contributor

I added this as a comment to nscd.conf to clarify

arianvp added some commits Nov 13, 2018

nixos/nscd: Disable caching of group and passwd
Systemd provides an option for allocating DynamicUsers
which we want to use in NixOS to harden service configuration.
However, we discovered that the user wasn't allocated properly
for services. After some digging this turned out to be, of course,
a cache inconsistency problem.

When a DynamicUser creation is performed, Systemd check beforehand
whether the requested user already exists statically. If it does,
it bails out. If it doesn't, systemd continues with allocating the
user.

However, by checking whether the user exists,  nscd will store
the fact that the user does not exist in it's negative cache.
When the service tries to lookup what user is associated to its
uid (By calling whoami, for example), it will try to consult
libnss_systemd.so However this will read from the cache and tell
report that the user doesn't exist, and thus will return that
there is no user associated with the uid. It will continue
to do so for the cache duration time.  If the service
doesn't immediately looks up its username, this bug is not
triggered, as the cache will be invalidated around this time.
However, if the service is quick enough, it might end up
in a situation where it's incorrectly reported that the
user doesn't exist.

Preferably, we would not be using nscd at all. But we need to
use it because glibc reads  nss modules from /etc/nsswitch.conf
by looking relative to the global LD_LIBRARY_PATH.  Because LD_LIBRARY_PATH
is not set globally (as that would lead to impurities and ABI issues),
glibc will fail to find any nss modules.
Instead, as a hack, we start up nscd with LD_LIBRARY_PATH set
for only that service. Glibc will forward all nss syscalls to
nscd, which will then respect the LD_LIBRARY_PATH and only
read from locations specified in the NixOS config.
we can load nss modules in a pure fashion.

However, I think by accident, we just copied over the default
settings of nscd, which actually caches user and group lookups.
We already disable this when sssd is enabled, as this interferes
with the correct working of libnss_sss.so as it already
does its own caching of LDAP requests.
(See https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/6/html/deployment_guide/usingnscd-sssd)

Because nscd caching is now also interferring with libnss_systemd.so
and probably also with other nsss modules, lets just pre-emptively
disable caching for now for all options related to users and groups,
but keep it for caching hosts ans services lookups.

Note that we can not just put in /etc/nscd.conf:
enable-cache passwd no

As this will actually cause glibc to _not_ forward the call to nscd
at all, and thus never reach the nss modules. Instead we set
the negative and positive cache ttls  to 0 seconds as a workaround.
This way, Glibc will always forward requests to nscd, but results
will never be cached.

Fixes #50273
nixos/nscd: also add netgroup to the config
It was the last database that wasn't listed.

@arianvp arianvp force-pushed the arianvp:fix-dynamic-user branch to 1d5f4cb Dec 12, 2018

@flokli

This comment has been minimized.

Copy link
Contributor

flokli commented Dec 12, 2018

LGTM.

@Mic92 Mic92 merged commit 5feba45 into NixOS:master Dec 12, 2018

9 checks passed

grahamcofborg-eval ^.^!
Details
grahamcofborg-eval-check-meta config.nix: checkMeta = true
Details
grahamcofborg-eval-nixos-manual nix-instantiate --arg nixpkgs { outPath=./.; revCount=999999; shortRev="ofborg"; } ./nixos/release.nix -A manual
Details
grahamcofborg-eval-nixos-options nix-instantiate --arg nixpkgs { outPath=./.; revCount=999999; shortRev="ofborg"; } ./nixos/release.nix -A options
Details
grahamcofborg-eval-nixpkgs-manual nix-instantiate --arg nixpkgs { outPath=./.; revCount=999999; shortRev="ofborg"; } ./pkgs/top-level/release.nix -A manual
Details
grahamcofborg-eval-nixpkgs-tarball nix-instantiate --arg nixpkgs { outPath=./.; revCount=999999; shortRev="ofborg"; } ./pkgs/top-level/release.nix -A tarball
Details
grahamcofborg-eval-nixpkgs-unstable-jobset nix-instantiate --arg nixpkgs { outPath=./.; revCount=999999; shortRev="ofborg"; } ./pkgs/top-level/release.nix -A unstable
Details
grahamcofborg-eval-package-list nix-env -qa --json --file .
Details
grahamcofborg-eval-package-list-no-aliases nix-env -qa --json --file . --arg config { allowAliases = false; }
Details
@Mic92

This comment has been minimized.

Copy link
Contributor

Mic92 commented Dec 12, 2018

I backported @edolstra revert in 25d0880, which fix the biggest pain point we had at the moment. Before backporting a change like this one, I would give a test period on unstable. We might not need to backport at all since the problems with DynamicUser should not affect many people.

@flokli

This comment has been minimized.

Copy link
Contributor

flokli commented Feb 4, 2019

@arianvp Had a discussion at FOSDEM about how we (ab)use nscd to pass nss modules, as described in 1d5f4cb.

This seems to be a bad idea, due to glibc falling back to local resolving (with broken nss modules) after very quick timeouts.

Can we get glibc to load nss modules from some global /run/nss-modules path instead, pointing to something in the nix-store?

@Mic92

This comment has been minimized.

Copy link
Contributor

Mic92 commented Feb 4, 2019

@flokli This can break if a program was linked against a different version of glibc
since. This is why we use nscd in the first place.

@arianvp

This comment has been minimized.

Copy link
Contributor Author

arianvp commented Feb 5, 2019

Only in the case where glibc breaks abi right? We do something similar for OpenGL/Vulkan iirc

@flokli

This comment has been minimized.

Copy link
Contributor

flokli commented Feb 5, 2019

I opened #55276 about that.

@Mic92

This comment has been minimized.

Copy link
Contributor

Mic92 commented Feb 5, 2019

@arianvp This could make it impossible to mix releases, but not just for OpenGL applications but every application.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.