NSS is a problem #129

Closed
JonathonReinhart opened this issue Jun 9, 2020 · 10 comments · Fixed by #139
Labels
nss Issues related to GLIBC NSS

Comments

@JonathonReinhart
Owner

JonathonReinhart commented Jun 9, 2020

Background

Name Service Switch (NSS) is a feature of GLIBC that makes the sources for the various name databases (passwd, group, hosts, etc.) extensible and dynamically configurable via /etc/nsswitch.conf. Each SERVICE listed there is provided by a library named libnss_SERVICE.so.

Core Problem

When a glibc binary built with staticx runs on a target system:

  • The bundled glibc will read /etc/nsswitch.conf on the target system and try to load various libnss_SERVICE.so libraries
  • glibc will then attempt to load libnss_SERVICE.so (if it exists) from the target system, and that library is likely not compatible with the bundled glibc

Issues

  1. The associated libnss_SERVICE.so libraries are not discoverable by ldd (glibc loads them with dlopen() at runtime)
    • This means they are excluded by default
  2. Even if staticx bundled all of the configured libnss_SERVICE.so files, the target system could still have additional services configured which are not bundled.
  3. Even if staticx bundled every possible libnss_SERVICE.so file (and their dependencies), there's no way to know if they would be compatible with the target system.
    • E.g. libnss_winbind.so talks to winbindd
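To make the point about unbundled services concrete: which modules glibc tries to load is decided by the target's nsswitch.conf at run time, not by anything ldd can see at build time. The sketch below maps one nsswitch.conf line to the module file names glibc would look for. It assumes the libnss_SERVICE.so.2 naming visible in the strace output in this thread and skips [STATUS=action] items; nss_libs_for_line is a hypothetical helper, and newer glibc releases may build some services (such as files and dns) into libc itself.

```c
#include <ctype.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: given one nsswitch.conf line such as
 *   "hosts: files mdns4_minimal [NOTFOUND=return] dns myhostname"
 * write "libnss_<service>.so.2" for each listed service into out[].
 * "[STATUS=action]" items and trailing '#' comments are skipped.
 * Returns the number of names written, or -1 if the line has no ':'. */
int nss_libs_for_line(const char *line, char out[][64], int max)
{
    const char *p = strchr(line, ':');
    if (!p)
        return -1;
    p++; /* skip past "database:" */

    int n = 0;
    while (n < max) {
        while (isspace((unsigned char)*p))
            p++;
        if (*p == '\0' || *p == '#')
            break;
        if (*p == '[') { /* action item, not a service */
            const char *end = strchr(p, ']');
            if (!end)
                break;
            p = end + 1;
            continue;
        }
        const char *start = p;
        while (*p && !isspace((unsigned char)*p) && *p != '[' && *p != '#')
            p++;
        snprintf(out[n], 64, "libnss_%.*s.so.2", (int)(p - start), start);
        n++;
    }
    return n;
}
```

Fed the Debian 10 hosts line quoted later in this thread, it yields libnss_files.so.2, libnss_mdns4_minimal.so.2, libnss_dns.so.2 and libnss_myhostname.so.2, none of which appear in the program's DT_NEEDED entries.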


Terms

  • bundled -- Included in the staticx archive
  • target system -- The system where a staticx binary is run
@JonathonReinhart
Owner Author

JonathonReinhart commented Jun 9, 2020

Tests

Build on Debian 10, run on CentOS 6

Running id (built with staticx on Debian 10) in a CentOS 6 Docker container (noting here that I had to enable vsyscall=emulate).

/etc/nsswitch.conf

passwd:     files
shadow:     files
group:      files

strace:

23    openat(AT_FDCWD, "/etc/nsswitch.conf", O_RDONLY|O_CLOEXEC) = 3 
23    fstat(3, {st_mode=S_IFREG|0644, st_size=1688, ...}) = 0 
23    read(3, "#\n# /etc/nsswitch.conf\n#\n# An ex"..., 4096) = 1688
23    read(3, "", 4096)                 = 0
23    close(3)                          = 0
23    openat(AT_FDCWD, "/tmp/staticx-mMCPcj/libnss_files.so.2", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
23    openat(AT_FDCWD, "/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3 
23    fstat(3, {st_mode=S_IFREG|0644, st_size=11236, ...}) = 0 
23    mmap(NULL, 11236, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7f57e3caf000
23    close(3)                          = 0
23    openat(AT_FDCWD, "/lib64/libnss_files.so.2", O_RDONLY|O_CLOEXEC) = 3 
23    read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\360!\0\0\0\0\0\0"..., 832) = 832 
23    fstat(3, {st_mode=S_IFREG|0755, st_size=66432, ...}) = 0 
23    mmap(NULL, 2151824, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7f57e3aa1000
23    mprotect(0x7f57e3aae000, 2093056, PROT_NONE) = 0 
23    mmap(0x7f57e3cad000, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0xc000) = 0x7f57e3cad000
23    close(3)                          = 0

We can see:

  • /etc/nsswitch.conf is read
  • /tmp/staticx-mMCPcj/libnss_files.so.2 is attempted but didn't exist
  • /etc/ld.so.cache is consulted
  • /lib64/libnss_files.so.2 is loaded ⚠️

It's a wonder this worked.

Build on CentOS 6, run on Debian 10

First of all, the application didn't run correctly: it failed to resolve uid/gid numbers to names:

$ strace -f -o id-sx.staticx ./id.sx 
uid=1000 gid=1000 groups=1000,24,25,27,29,30,44,46,109,112,116,125,127,998

And this snippet of strace output:

32634 open("/etc/nsswitch.conf", O_RDONLY|O_CLOEXEC) = 3
32634 fstat(3, {st_mode=S_IFREG|0644, st_size=553, ...}) = 0
32634 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7ffa4663e000
32634 read(3, "# /etc/nsswitch.conf\n#\n# Example"..., 4096) = 553
32634 read(3, "", 4096)                 = 0
32634 close(3)                          = 0
32634 munmap(0x7ffa4663e000, 4096)      = 0  
32634 open("/tmp/staticx-cgEGCK/libnss_files.so.2", O_RDONLY) = -1 ENOENT (No such file or directory)
32634 open("/etc/ld.so.cache", O_RDONLY) = 3
32634 fstat(3, {st_mode=S_IFREG|0644, st_size=113870, ...}) = 0
32634 mmap(NULL, 113870, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7ffa46623000
32634 close(3)                          = 0
32634 open("/lib/x86_64-linux-gnu/libnss_files.so.2", O_RDONLY) = 3
32634 read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\0003\0\0\0\0\0\0"..., 832) = 832
32634 fstat(3, {st_mode=S_IFREG|0644, st_size=55792, ...}) = 0
32634 mmap(NULL, 83768, PROT_READ, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7ffa4660e000
32634 mprotect(0x7ffa46611000, 40960, PROT_NONE) = 0
32634 mmap(0x7ffa46611000, 28672, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x3000) = 0x7ffa46611000
32634 mmap(0x7ffa46618000, 8192, PROT_READ, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0xa000) = 0x7ffa46618000
32634 mmap(0x7ffa4661b000, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0xc000) = 0x7ffa4661b000
32634 mmap(0x7ffa4661d000, 22328, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7ffa4661d000
32634 close(3)                          = 0

...shows similar behavior, but with the versions reversed.

@JonathonReinhart
Owner Author

JonathonReinhart commented Jun 10, 2020

Possible Workarounds

__nss_configure_lookup

This API allows you to override /etc/nsswitch.conf programmatically. However, it appears to work only if it is called before the database is first accessed.

#include <stdio.h>
#include <error.h>      /* error() (GNU extension) */
#include <errno.h>
#include <netdb.h>      /* gethostbyname() */
#include <nss.h>        /* __nss_configure_lookup() */
#include <arpa/inet.h>  /* inet_ntop() */

/* Resolve a hostname through NSS and print every address it maps to. */
static void test_gethost(void)
{
    struct hostent *h;
    const char *host = "www.google.com";
    h = gethostbyname(host);
    if (!h)
        error(2, 0, "gethostbyname() failed");

    printf("Addresses for %s:\n", host);
    char **a;
    for (a = h->h_addr_list; *a; a++) {
        char buf[100];
        printf("    %s\n", inet_ntop(h->h_addrtype, *a, buf, sizeof(buf)));
    }
}

int main(int argc, char *argv[])
{
    (void)argv;

    /* With any argument, restrict the "hosts" database to the "files"
     * service (/etc/hosts only), before the first lookup happens. */
    if (argc > 1) {
        const char *db = "hosts";
        const char *service_line = "files";

        printf("__nss_configure_lookup(\"%s\", \"%s\")\n", db, service_line);

        int rc = __nss_configure_lookup(db, service_line);
        if (rc) {
            fprintf(stderr, "__nss_configure_lookup failed: %m\n");
            return 2;
        }
    }

    test_gethost();
    return 0;
}

$ ./nss-hack
Addresses for www.google.com:
    172.217.0.4

$ ./nss-hack 1
__nss_configure_lookup("hosts", "files")
./nss-hack: gethostbyname() failed


So it's pretty gross and involved, but since we can't really neuter or avoid NSS, perhaps this scheme would work:

  • Add a staticx hook that detects glibc, and forcefully adds a predetermined minimal set of service libraries (and their dependencies, of course) to the dependency list, e.g.:
    • libnss_files.so
    • libnss_dns.so
  • Write a small GLIBC-compatible library which calls __nss_configure_lookup for all databases, effectively overriding the target system's /etc/nsswitch.conf with a sane, minimal set, which matches the shared libraries listed above
  • When invoking the target process, LD_PRELOAD that library

This will of course complicate the build system as well.
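The library part of that scheme could look roughly like this: a small shared object whose constructor calls __nss_configure_lookup (declared in glibc's <nss.h>, returning 0 on success) for each database before the program's main() runs. This is a sketch, not the eventual libnsskill.so: the names nss_overrides, nss_override_all and nss_override_init are hypothetical, and the database list simply mirrors the nsskill output shown later in this thread.

```c
#include <nss.h>
#include <stdio.h>

/* Minimal service line to force for every NSS database. The choice of
 * services (e.g. "files dns" for hosts) is an assumption and must match
 * the libnss_* modules that actually get bundled. */
static const struct {
    const char *db;
    const char *services;
} nss_overrides[] = {
    {"aliases", "files"},   {"ethers", "files"},     {"group", "files"},
    {"gshadow", "files"},   {"hosts", "files dns"},  {"initgroups", "files"},
    {"netgroup", "files"},  {"networks", "files"},   {"passwd", "files"},
    {"protocols", "files"}, {"publickey", "files"},  {"rpc", "files"},
    {"services", "files"},  {"shadow", "files"},
};

/* Hypothetical helper: apply every override via __nss_configure_lookup().
 * Returns the number of databases that could not be configured. */
int nss_override_all(void)
{
    int failures = 0;
    for (size_t i = 0; i < sizeof(nss_overrides) / sizeof(nss_overrides[0]); i++) {
        if (__nss_configure_lookup(nss_overrides[i].db, nss_overrides[i].services) != 0) {
            fprintf(stderr, "nss override: __nss_configure_lookup(\"%s\", \"%s\") failed\n",
                    nss_overrides[i].db, nss_overrides[i].services);
            failures++;
        }
    }
    return failures;
}

/* Runs automatically when the shared object is loaded (e.g. via
 * LD_PRELOAD), i.e. before main() and before any NSS lookup. */
__attribute__((constructor))
static void nss_override_init(void)
{
    nss_override_all();
}
```

Built as a shared object (for example gcc -shared -fPIC -o libnss-override.so nss_override.c, file names hypothetical) and preloaded, it applies the overrides before the application touches any database; the matching libnss_files/libnss_dns modules still have to be bundled.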

@joshuarli

https://sourceware.org/glibc/wiki/FAQ

In fact, one cannot say anymore that a libc compiled with this option is using NSS. There is no switch anymore. Therefore it is highly recommended not to use --enable-static-nss since this makes the behaviour of the programs on the system inconsistent.

It might be worth asking how popular /etc/nsswitch.conf is. Last I checked, musl libc 1.1.21 doesn't even read it. Example program and its strace.

There is even a very recent Go PR, authored by the Alpine Linux creator, for an accepted proposal to prefer /etc/hosts for DNS when no /etc/nsswitch.conf is present.

@joshuarli

joshuarli commented Jun 23, 2020

So, you could potentially avoid doing a lot of work (and make staticx binaries better IMO; look at the glibc strace on the linked page to see how gross it is) by vendoring the appropriate functions from musl libc and patching the resulting executable.

But I don't know how important or popular /etc/nsswitch.conf is.

@JonathonReinhart
Owner Author

JonathonReinhart commented Jul 8, 2020

The __nss_configure_lookup hack seems to work.

Running nss-neuter (12a950d):

bad_nsswitch.conf:

passwd: bad
shadow: bad
group: bad
hosts: bad

Command:

docker run --rm -it \
    -v $(realpath ./bad_nsswitch.conf):/etc/nsswitch.conf:ro \
    -v $(pwd):$(pwd):ro -w $(pwd) \
    centos:6 /bin/bash -c 'whoami; echo --------; ./whoami.sx'

Output:

whoami: cannot find name for user ID 0
--------
nsskill: __nss_configure_lookup("aliases", "files")
nsskill: __nss_configure_lookup("ethers", "files")
nsskill: __nss_configure_lookup("group", "files")
nsskill: __nss_configure_lookup("gshadow", "files")
nsskill: __nss_configure_lookup("hosts", "files dns")
nsskill: __nss_configure_lookup("initgroups", "files")
nsskill: __nss_configure_lookup("netgroup", "files")
nsskill: __nss_configure_lookup("networks", "files")
nsskill: __nss_configure_lookup("passwd", "files")
nsskill: __nss_configure_lookup("protocols", "files")
nsskill: __nss_configure_lookup("publickey", "files")
nsskill: __nss_configure_lookup("rpc", "files")
nsskill: __nss_configure_lookup("services", "files")
nsskill: __nss_configure_lookup("shadow", "files")
root

Note that my whoami.sx was able to resolve uid 0, even though the whoami from the Docker image could not.

@JonathonReinhart
Owner Author

@joshuarli Thanks for the insight!

staticx will never be able to fully support NSS, and indeed that is not my goal. My goal with this issue is simply to avoid the problems caused by the dlopen() calls made by the bundled GLIBC.

So, you could potentially avoid doing a lot of work... by vendoring the appropriate functions from musl libc and patching the resulting executable.

This is one road that I could take, but:

  1. It would be pretty complicated. Patching an arbitrary glibc binary to neuter the NSS functionality would be no easy task. These are internal functions in an optimized binary that probably has no symbols.
  2. It would be a departure from the way staticx works today. Staticx tries to make the application run the way the developer intended (with glibc). NSS is a bit of a corner case here because it is runtime configuration that enables loading of arbitrary code.

But I don't know how important or popular /etc/nsswitch.conf is.

It's present on every Linux system I've ever used. Whether or not users realize it, installed packages sometimes modify nsswitch.conf to plug in their behavior. For example, here's my never-intentionally-modified nsswitch.conf from Debian 10:

passwd:         files systemd
group:          files systemd
shadow:         files
gshadow:        files

hosts:          files mdns4_minimal [NOTFOUND=return] dns myhostname
networks:       files

protocols:      db files
services:       db files
ethers:         db files
rpc:            db files

netgroup:       nis

Note:

@JonathonReinhart
Owner Author

Looking towards getting this functionality into master, I have some thoughts/concerns:

Compatibility

There are several realms between which there needs to be compatibility:

  • (A) System where staticx itself is built
    • This is usually Travis CI where Python wheels are built
    • Could also be a source installation
  • (B) System from which user programs come
  • (C) System where staticx is run to build staticx archive programs from B
    • I would expect this to usually be the same as B, but there could be cases where it is not:
      • Different Docker images in a CI pipeline
      • Different libc (e.g. musl)
  • (D) Target system where staticx archives are run

The goal of StaticX is to completely isolate D, so we can ignore that.

I think C != B is an unsupported use-case of staticx: It's the same as moving a dynamically-linked program to another system which is the problem that staticx intends to solve. So for this conversation we'll assume B == C.

That leaves us with A and C.

Historically, there have been no compatibility requirements between A and C: the bootloader is statically linked against musl libc. But now, libnsskill.so is a dynamic shared object, and it needs to be directly compatible with the bundled libc (from C).

  • This is not a problem if the user installs staticx from source; libnsskill.so will be built with the system gcc and linked against the same libc as the user programs.
  • The primary problem then is ensuring compatibility between the libnsskill.so built in Travis and the bundled GLIBC.

@JonathonReinhart
Owner Author

JonathonReinhart commented Jul 12, 2020

Why did this job fail?
https://travis-ci.org/github/JonathonReinhart/staticx/jobs/706981043

PyInstalled application run:
aux: Hello from our auxiliary app: /tmp/_MEId67Fk3/aux-glibc-dynamic
aux: Hello from our statically-linked auxiliary app: /tmp/_MEId67Fk3/aux-glibc-static
aux: Hello from our auxiliary app: /tmp/_MEId67Fk3/aux-musl-dynamic
aux: Hello from our statically-linked auxiliary app: /tmp/_MEId67Fk3/aux-musl-static
Making staticx executable ($STATICX_FLAGS=):
WARNING:root:Unexpected ldd error (1): /tmp/staticx-pyi-269cseyn/aux-musl-dynamic: error while loading shared libraries: /usr/lib/x86_64-linux-gnu/libc.so: invalid ELF header
Running staticx executable
aux: Hello from our auxiliary app: /tmp/_MEIcM5Gqb/aux-glibc-dynamic
aux: Hello from our statically-linked auxiliary app: /tmp/_MEIcM5Gqb/aux-glibc-static
Error relocating /tmp/staticx-jekcji/libnsskill.so: __nss_configure_lookup: symbol not found
Error relocating /tmp/staticx-jekcji/libnsskill.so: __fprintf_chk: symbol not found
[9023] Failed to execute script app
Traceback (most recent call last):
  File "app.py", line 39, in <module>
    main()
  File "app.py",
(truncated?)

Okay, this comes from a super-weird corner case test (that isn't really realistic): pyinstall-aux-static-exec. This test bundles four "auxiliary" applications inside of a PyInstaller application, and then staticx-ifies that:

  • aux-glibc-dynamic
  • aux-glibc-static
  • aux-musl-dynamic
  • aux-musl-static

The statically-linked applications (aux-*-static) should obviously work fine. And the glibc dynamic app should be using the same libc as the Python interpreter, so its dependencies will get picked up okay.

The weird (and arguably invalid) one here is aux-musl-dynamic. Why should a user expect to be able to take an application which is dynamically-linked against musl libc, and bundle it in a PyInstaller app and move it around? In fact, we already had to special-case this when running under Docker: 493c814.

Regardless, the error occurs (I think) because the staticx bootloader sets LD_PRELOAD to load libnsskill.so. This environment variable is persisted through, to the point where the python app tries to run aux-musl-dynamic. Then /usr/lib/x86_64-linux-musl/libc.so sees LD_PRELOAD and tries to load libnsskill.so which needs __nss_configure_lookup, which doesn't exist in musl.

So even though this is a weird test case, it highlights the fact that LD_PRELOAD might be problematic for child processes of the bundled application. I'm not sure how to handle this.

@JonathonReinhart
Owner Author

JonathonReinhart commented Jul 14, 2020

I was able to successfully hook execv, execve (ugh, probably need to hook all variants) in libnsskill.so, and unset LD_PRELOAD from the environment block before executing the child process. (Inspired by https://haxelion.eu/article/LD_NOT_PRELOADED_FOR_REAL/).
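The environment-block part of that hook can be sketched as follows. This is a sketch under assumptions, not the code from the branch: env_without_ld_preload is a hypothetical name, and the wrapper that would call it (an execve() definition that forwards the filtered array to the real execve(), looked up with dlsym(RTLD_NEXT, "execve")) is left out.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper for an execve() hook: return a newly allocated copy
 * of the NULL-terminated array envp with every "LD_PRELOAD=..." entry
 * removed. Only the pointer array is copied; the strings are shared.
 * Returns NULL if allocation fails. The caller frees the array. */
char **env_without_ld_preload(char *const envp[])
{
    static const char key[] = "LD_PRELOAD=";
    size_t n = 0;
    while (envp[n] != NULL)
        n++;

    char **out = malloc((n + 1) * sizeof(*out));
    if (out == NULL)
        return NULL;

    size_t j = 0;
    for (size_t i = 0; i < n; i++) {
        /* sizeof(key) - 1 == strlen("LD_PRELOAD=") */
        if (strncmp(envp[i], key, sizeof(key) - 1) != 0)
            out[j++] = envp[i];
    }
    out[j] = NULL;
    return out;
}
```

Variants that take an explicit environment (execve, execvpe, ...) would pass their envp argument through this helper, while variants such as execv use the process's current environ.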

However, this presents a decision: Do we always want to unset LD_PRELOAD?

  • If the user program is executing a system executable, then LD_PRELOAD should definitely be unset. We don't want to interfere with system glibc-linked programs.
  • Otherwise, what is the user program executing?
    • Another "bundled" application?
      • This is something that Staticx does not currently support. If you try this (by adding another, unmodified program to the bundle), then that program would run with the target system's GLIBC because we didn't modify it to set its rpath/interp
    • What about our painful PyInstaller example? (A PyInstaller application with a bundled program). Well it depends.
      • *-static -- Doesn't care about LD_PRELOAD
      • glibc-dynamic -- This is already a problem for the same reasons I just said. So why should we try to support it?
      • musl-dynamic -- This pathological case doesn't work anyway, for an even worse reason (musl libc.so isn't on the target)

So maybe it doesn't really even matter. To me, it seems that the safest choice is to unset LD_PRELOAD at init time, and call it a day, since we don't support dynamic bundled apps at all anyway.
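Unsetting LD_PRELOAD at init time amounts to a load-time constructor in the preloaded library. A minimal sketch; drop_ld_preload is a hypothetical name:

```c
#include <stdlib.h>

/* Hypothetical "unset at init time" approach: a constructor in the
 * preloaded library that removes LD_PRELOAD from this process's own
 * environment. The dynamic loader has already consumed LD_PRELOAD for the
 * current process, so this only affects programs it exec()s later. */
__attribute__((constructor))
void drop_ld_preload(void)
{
    unsetenv("LD_PRELOAD");
}
```

Because the change only takes effect for later exec()s, any child process that itself needs the preloaded library loses it as well.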


Edit: Crap. I forgot that the PyInstaller bootloader will fork and exec, just like the staticx bootloader. So this means that we will (under the current plan) LD_PRELOAD libnsskill into the PyInstaller bootloader, but not into the child process (where the Python code actually runs).

So I guess this means we need to do the exec-hooking solution.....

On the plus side, PyInstaller execs itself (which is already patched), so we don't need to worry about nodeflib, etc.

JonathonReinhart added a commit that referenced this issue Jul 19, 2020
Bringing along your own extra dynamically-linked executable is not
supported by staticx.

Even the name of this overall test says "static".

See:
#129 (comment)
@JonathonReinhart
Owner Author

To close the loop on the previous discussion, I ended up (in #139) dropping LD_PRELOAD in favor of patchelf --add-needed. This had the benefit of not having to worry about LD_PRELOAD being dropped or used inadvertently, and it works across a fork/exec.
