top: do not depend on ps(1) in container

This ended up more complicated than expected. Let's start with the problem to show why I am doing this:

Currently we simply execute ps(1) in the container. This has some drawbacks. First, obviously you need to have ps(1) in the container image. That is not always the case, especially in small images. Second, even if you do, it will often be only busybox's ps, which supports far fewer options.

Now we also have psgo, which is used by default, but that only supports a small subset of ps(1) options. Implementing all options there is way too much work.

Docker on the other hand executes ps(1) directly on the host and tries to filter pids with `-q`, an option which is not supported by busybox's ps and conflicts with other ps(1) arguments. That means they fall back to full ps(1) on the host and then filter based on the pid in the output. This is kinda ugly and falls short because users can modify the ps output and it may not even include the pid in the output, which causes an error.

So every solution has a different drawback, but what if we can combine them somehow?! This commit tries exactly that.

We use ps(1) from the host and execute that in the container's pid namespace. There are some security concerns that must be addressed:
- mount the executable paths for ps and podman itself readonly to prevent the container from overwriting it via /proc/self/exe.
- set NO_NEW_PRIVS, SET_DUMPABLE and PDEATHSIG
- close all non std fds to prevent leaking files that the caller had open
- unset all environment variables to not leak any into the container

Technically this could be a breaking change if somebody does not have ps on the host and only in the container, but I find that very unlikely, so I have removed the in-container fallback.

This updates the docs accordingly. Note that podman pod top never falls back to executing ps in the container, as this makes no sense with multiple containers, so I fixed the docs there as well.

Fixes containers#19001
Fixes https://bugzilla.redhat.com/show_bug.cgi?id=2215572

Signed-off-by: Paul Holzinger <pholzing@redhat.com>
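
To make the drawback of host-side filtering concrete, here is a minimal Go sketch of filtering ps output by its PID column. It is not Docker's actual code; `filterByPID` and the sample output are hypothetical and only illustrate why a user-defined output format without a PID column cannot be filtered.

```go
package main

import (
	"errors"
	"fmt"
	"strconv"
	"strings"
)

// filterByPID is a hypothetical helper: it keeps the header line plus every
// row whose PID column matches one of the wanted pids.
func filterByPID(lines []string, wanted map[int]bool) ([]string, error) {
	if len(lines) == 0 {
		return nil, errors.New("empty ps output")
	}
	// Locate the PID column by its header name.
	col := -1
	for i, name := range strings.Fields(lines[0]) {
		if name == "PID" {
			col = i
			break
		}
	}
	if col < 0 {
		// A custom output format without a PID column leaves nothing
		// to filter on, so the only option is to fail.
		return nil, errors.New("ps output has no PID column")
	}
	out := []string{lines[0]}
	for _, line := range lines[1:] {
		fields := strings.Fields(line)
		if col >= len(fields) {
			continue
		}
		pid, err := strconv.Atoi(fields[col])
		if err == nil && wanted[pid] {
			out = append(out, line)
		}
	}
	return out, nil
}

func main() {
	ps := []string{
		"USER  PID  COMMAND",
		"root  1    init",
		"root  42   sleep 100",
	}
	rows, err := filterByPID(ps, map[int]bool{42: true})
	fmt.Println(rows, err)

	// The user asked only for USER and COMMAND, so filtering fails.
	_, err = filterByPID([]string{"USER COMMAND", "root init"}, map[int]bool{1: true})
	fmt.Println(err)
}
```

Running ps inside the container's pid namespace avoids this step entirely, because ps only sees the container's processes in the first place.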
Luap99 · Jul 7, 2023 · d963033
1 parent 2560716
Showing 10 changed files with 303 additions and 54 deletions.
diff --git a/docs/source/markdown/podman-pod-top.1.md.in b/docs/source/markdown/podman-pod-top.1.md.in
@@ -7,7 +7,9 @@ podman\-pod\-top - Display the running processes of containers in a pod
 **podman pod top** [*options*] *pod* [*format-descriptors*]
 
 ## DESCRIPTION
-Display the running processes of containers in a pod. The *format-descriptors* are ps (1) compatible AIX format descriptors but extended to print additional information, such as the seccomp mode or the effective capabilities of a given process. The descriptors can either be passed as separate arguments or as a single comma-separated argument. Note that if additional options of ps(1) are specified, Podman falls back to executing ps with the specified arguments and options in the container.
+Display the running processes of containers in a pod. The *format-descriptors* are ps (1) compatible AIX format
+descriptors but extended to print additional information, such as the seccomp mode or the effective capabilities
+of a given process. The descriptors can either be passed as separate arguments or as a single comma-separated argument.
 
 ## OPTIONS
 

diff --git a/docs/source/markdown/podman-top.1.md.in b/docs/source/markdown/podman-top.1.md.in
@@ -9,7 +9,13 @@ podman\-top - Display the running processes of a container
 **podman container top** [*options*] *container* [*format-descriptors*]
 
 ## DESCRIPTION
-Display the running processes of the container. The *format-descriptors* are ps (1) compatible AIX format descriptors but extended to print additional information, such as the seccomp mode or the effective capabilities of a given process. The descriptors can either be passed as separated arguments or as a single comma-separated argument. Note that options and or flags of ps(1) can also be specified; in this case, Podman falls back to executing ps with the specified arguments and flags in the container.  Please use the "h*" descriptors to extract host-related information.  For instance, `podman top $name hpid huser` to display the PID and user of the processes in the host context.
+Display the running processes of the container. The *format-descriptors* are ps (1) compatible AIX format
+descriptors but extended to print additional information, such as the seccomp mode or the effective capabilities
+of a given process. The descriptors can either be passed as separated arguments or as a single comma-separated
+argument. Note that options and or flags of ps(1) can also be specified; in this case, Podman falls back to
+executing ps(1) from the host with the specified arguments and flags in the container namespace.  Please use the
+"h*" descriptors to extract host-related information.  For instance, `podman top $name hpid huser` to display
+the PID and user of the processes in the host context.
 
 ## OPTIONS
 
@@ -90,7 +96,7 @@ PID   SECCOMP   COMMAND     %CPU
 8     filter    vi /etc/    0.000
 ```
 
-Podman falls back to executing ps(1) in the container if an unknown descriptor is specified.
+Podman falls back to executing ps(1) from the host in the container namespace if an unknown descriptor is specified.
 
 ```
 $ podman top -l -- aux

diff --git a/libpod/container_top_linux.c b/libpod/container_top_linux.c
@@ -0,0 +1,81 @@
+#define _GNU_SOURCE
+#include <errno.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <sys/mount.h>
+#include <sys/wait.h>
+#include <unistd.h>
+
+/* keep special_exit_code in sync with container_top_linux.go */
+int special_exit_code = 255;
+char **argv = NULL;
+
+void
+create_argv (int len)
+{
+  /* allocate one extra element because we need a final NULL in c */
+  argv = malloc (sizeof (char *) * (len + 1));
+  if (argv == NULL)
+    {
+      fprintf (stderr, "failed to allocate ps argv");
+      exit (special_exit_code);
+    }
+  /* add final NULL */
+  argv[len] = NULL;
+}
+
+void
+set_argv (int pos, char *arg)
+{
+  argv[pos] = arg;
+}
+
+/*
+  We use cgo code here so we can fork then exec separately,
+  this is done so we can mount proc after the fork because the pid namespace is
+  only active after spawning children.
+*/
+void
+fork_exec_ps ()
+{
+  int r, status = 0;
+  pid_t pid;
+
+  if (argv == NULL)
+    {
+      fprintf (stderr, "argv not initialized");
+      exit (special_exit_code);
+    }
+
+  pid = fork ();
+  if (pid < 0)
+    {
+      fprintf (stderr, "fork: %m");
+      exit (special_exit_code);
+    }
+  if (pid == 0)
+    {
+      r = mount ("proc", "/proc", "proc", 0, NULL);
+      if (r < 0)
+        {
+          fprintf (stderr, "mount proc: %m");
+          exit (special_exit_code);
+        }
+      /* use execve to unset all env vars, we do not want to leak anything into the container */
+      execve (argv[0], argv, NULL);
+      fprintf (stderr, "execve: %m");
+      exit (special_exit_code);
+    }
+
+  r = waitpid (pid, &status, 0);
+  if (r < 0)
+    {
+      fprintf (stderr, "waitpid: %m");
+      exit (special_exit_code);
+    }
+  if (WIFEXITED (status))
+    exit (WEXITSTATUS (status));
+  if (WIFSIGNALED (status))
+    exit (128 + WTERMSIG (status));
+  exit (special_exit_code);
+}
diff --git a/libpod/container_top_linux.go b/libpod/container_top_linux.go
@@ -1,23 +1,181 @@
-//go:build linux
-// +build linux
+//go:build linux && cgo
+// +build linux,cgo
 
 package libpod
 
 import (
 	"bufio"
+	"bytes"
 	"errors"
 	"fmt"
 	"os"
+	"os/exec"
+	"path/filepath"
+	"runtime"
 	"strconv"
 	"strings"
+	"syscall"
+	"unsafe"
 
 	"github.com/containers/podman/v4/libpod/define"
 	"github.com/containers/podman/v4/pkg/rootless"
 	"github.com/containers/psgo"
+	"github.com/containers/storage/pkg/reexec"
 	"github.com/google/shlex"
-	"github.com/sirupsen/logrus"
+	"golang.org/x/sys/unix"
 )
 
+/*
+#include <stdlib.h>
+void fork_exec_ps();
+void create_argv(int len);
+void set_argv(int pos, char *arg);
+*/
+import "C"
+
+const (
+	// podmanTopCommand is the reexec key to safely setup the environment for ps to be executed
+	podmanTopCommand = "podman-top"
+
+	// podmanTopExitCode is a special exit code to signal that podman failed to do something in
+	// the reexec command, not in ps. This is used to give a better error.
+	podmanTopExitCode = 255
+)
+
+func init() {
+	reexec.Register(podmanTopCommand, podmanTopMain)
+}
+
+// podmanTopMain - main function for the reexec
+func podmanTopMain() {
+	if err := podmanTopInner(); err != nil {
+		fmt.Fprint(os.Stderr, err.Error())
+		os.Exit(podmanTopExitCode)
+	}
+	os.Exit(0)
+}
+
+// podmanTopInner os.Args = {command name} {pid} {psPath} [args...]
+// We are reexec'd in a new mountns, then we need to set some security settings in order
+// to safely execute ps in the container pid namespace. Most notably make sure podman and
+// ps are read only to prevent a process from overwriting it.
+func podmanTopInner() error {
+	if len(os.Args) < 3 {
+		return fmt.Errorf("internal error, need at least two arguments")
+	}
+
+	// We have to lock the thread as we a) switch namespace below and b) use PR_SET_PDEATHSIG
+	// Also do not unlock as this thread should not be reused by go we exit anyway at the end.
+	runtime.LockOSThread()
+
+	if err := unix.Prctl(unix.PR_SET_PDEATHSIG, uintptr(unix.SIGKILL), 0, 0, 0); err != nil {
+		return fmt.Errorf("PR_SET_PDEATHSIG: %w", err)
+	}
+	if err := unix.Prctl(unix.PR_SET_DUMPABLE, 0, 0, 0, 0); err != nil {
+		return fmt.Errorf("PR_SET_DUMPABLE: %w", err)
+	}
+
+	if err := unix.Prctl(unix.PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0); err != nil {
+		return fmt.Errorf("PR_SET_NO_NEW_PRIVS: %w", err)
+	}
+
+	if err := unix.Mount("none", "/", "", unix.MS_REC|unix.MS_PRIVATE, ""); err != nil {
+		return fmt.Errorf("make / mount private: %w", err)
+	}
+
+	psPath := os.Args[2]
+
+	// try to mount everything read only
+	if err := unix.MountSetattr(0, "/", unix.AT_RECURSIVE, &unix.MountAttr{
+		Attr_set: unix.MOUNT_ATTR_RDONLY,
+	}); err != nil {
+		if err != unix.ENOSYS {
+			return fmt.Errorf("mount_setattr / readonly: %w", err)
+		}
+		// old kernel without mount_setattr, i.e. on RHEL 8.8
+		// Bind mount the directories readonly for both podman and ps.
+		psPath, err = remountReadOnly(psPath)
+		if err != nil {
+			return err
+		}
+		_, err = remountReadOnly(reexec.Self())
+		if err != nil {
+			return err
+		}
+	}
+
+	// extra safety check make sure the ps path is actually read only
+	err := unix.Access(psPath, unix.W_OK)
+	if err == nil {
+		return fmt.Errorf("%q was not mounted read only, this can be dangerous so we will not execute it", psPath)
+	}
+
+	pid := os.Args[1]
+	// join the pid namespace of pid
+	pidFD, err := os.Open(fmt.Sprintf("/proc/%s/ns/pid", pid))
+	if err != nil {
+		return fmt.Errorf("open pidns: %w", err)
+	}
+	if err := unix.Setns(int(pidFD.Fd()), unix.CLONE_NEWPID); err != nil {
+		return fmt.Errorf("setns NEWPID: %w", err)
+	}
+	pidFD.Close()
+
+	args := []string{psPath}
+	args = append(args, os.Args[3:]...)
+
+	C.create_argv(C.int(len(args)))
+	for i, arg := range args {
+		cArg := C.CString(arg)
+		C.set_argv(C.int(i), cArg)
+		defer C.free(unsafe.Pointer(cArg))
+	}
+
+	// Now try to close open fds except std streams
+	// While golang open everything O_CLOEXEC it could still leak fds from
+	// the parent, i.e. bash. In this case an attacker might be able to
+	// read/write from them.
+	// Do this as last step, it has to happen before to fork because the child
+	// will be immediately in pid namespace so we cannot close them in the child.
+	entries, err := os.ReadDir("/proc/self/fd")
+	if err != nil {
+		return err
+	}
+	for _, e := range entries {
+		i, err := strconv.Atoi(e.Name())
+		// IsFdInherited checks that we got the fd from a parent process and only close them,
+		// when we close all that would include the ones from the go runtime which
+		// then can panic because of that.
+		if err == nil && i > unix.Stderr && rootless.IsFdInherited(i) {
+			_ = unix.Close(i)
+		}
+	}
+
+	// this function will always exit for us
+	C.fork_exec_ps()
+	return nil
+}
+
+// remountReadOnly remounts the parent directory of the given path read only
+// return the resolved path or an error. The path can then be used to exec the
+// binary as we know it is on a read only mount now.
+func remountReadOnly(path string) (string, error) {
+	resolvedPath, err := filepath.EvalSymlinks(path)
+	if err != nil {
+		return "", fmt.Errorf("resolve symlink for %s: %w", path, err)
+	}
+	dir := filepath.Dir(resolvedPath)
+	// create mount point
+	if err := unix.Mount(dir, dir, "", unix.MS_BIND, ""); err != nil {
+		return "", fmt.Errorf("mount %s read only: %w", dir, err)
+	}
+	// remount readonly
+	if err := unix.Mount(dir, dir, "", unix.MS_BIND|unix.MS_REMOUNT|unix.MS_RDONLY, ""); err != nil {
+		return "", fmt.Errorf("mount %s read only: %w", dir, err)
+	}
+	return resolvedPath, nil
+}
+
 // Top gathers statistics about the running processes in a container. It returns a
 // []string for output
 func (c *Container) Top(descriptors []string) ([]string, error) {
@@ -70,7 +228,7 @@ func (c *Container) Top(descriptors []string) ([]string, error) {
 
 	output, err = c.execPS(psDescriptors)
 	if err != nil {
-		return nil, fmt.Errorf("executing ps(1) in the container: %w", err)
+		return nil, fmt.Errorf("executing ps(1): %w", err)
 	}
 
 	// Trick: filter the ps command from the output instead of
@@ -113,60 +271,52 @@ func (c *Container) GetContainerPidInformation(descriptors []string) ([]string,
 	return res, nil
 }
 
-// execPS executes ps(1) with the specified args in the container.
-func (c *Container) execPS(args []string) ([]string, error) {
+// execute ps(1) from the host within the container pid namespace
+func (c *Container) execPS(psArgs []string) ([]string, error) {
 	rPipe, wPipe, err := os.Pipe()
 	if err != nil {
 		return nil, err
 	}
 	defer wPipe.Close()
 	defer rPipe.Close()
 
-	rErrPipe, wErrPipe, err := os.Pipe()
-	if err != nil {
-		return nil, err
-	}
-	defer wErrPipe.Close()
-	defer rErrPipe.Close()
-
-	streams := new(define.AttachStreams)
-	streams.OutputStream = wPipe
-	streams.ErrorStream = wErrPipe
-	streams.AttachOutput = true
-	streams.AttachError = true
-
 	stdout := []string{}
 	go func() {
 		scanner := bufio.NewScanner(rPipe)
 		for scanner.Scan() {
 			stdout = append(stdout, scanner.Text())
 		}
 	}()
-	stderr := []string{}
-	go func() {
-		scanner := bufio.NewScanner(rErrPipe)
-		for scanner.Scan() {
-			stderr = append(stderr, scanner.Text())
-		}
-	}()
 
-	cmd := append([]string{"ps"}, args...)
-	config := new(ExecConfig)
-	config.Command = cmd
-	ec, err := c.Exec(config, streams, nil)
+	psPath, err := exec.LookPath("ps")
 	if err != nil {
 		return nil, err
-	} else if ec != 0 {
-		return nil, fmt.Errorf("runtime failed with exit status: %d and output: %s", ec, strings.Join(stderr, " "))
 	}
+	args := append([]string{podmanTopCommand, strconv.Itoa(c.state.PID), psPath}, psArgs...)
 
-	if logrus.GetLevel() >= logrus.DebugLevel {
-		// If we're running in debug mode or higher, we might want to have a
-		// look at stderr which includes debug logs from conmon.
-		for _, log := range stderr {
-			logrus.Debugf("%s", log)
+	cmd := reexec.Command(args...)
+	cmd.SysProcAttr = &syscall.SysProcAttr{
+		Unshareflags: unix.CLONE_NEWNS,
+	}
+	var errBuf bytes.Buffer
+	cmd.Stdout = wPipe
+	cmd.Stderr = &errBuf
+	// nil means use current env so explicitly unset all, to not leak any sensitive env vars
+	cmd.Env = []string{}
+	err = cmd.Run()
+	if err != nil {
+		exitError := &exec.ExitError{}
+		if errors.As(err, &exitError) {
+			if exitError.ExitCode() != podmanTopExitCode {
+				// ps command failed
+				err = fmt.Errorf("ps(1) failed with exit code %d: %s", exitError.ExitCode(), errBuf.String())
+			} else {
+				// podman-top reexec setup fails somewhere
+				err = fmt.Errorf("could not execute ps(1) in the container pid namespace: %s", errBuf.String())
+			}
+		} else {
+			err = fmt.Errorf("could not reexec podman-top command: %w", err)
 		}
 	}
-
-	return stdout, nil
+	return stdout, err
 }