top: do not depend on ps(1) in container
This ended up more complicated than expected. Let's start with the
problem to show why I am doing this:

Currently we simply execute ps(1) in the container. This has some
drawbacks. First, obviously you need to have ps(1) in the container
image. That is not always the case, especially in small images. Second,
even if you do, it will often be only busybox's ps, which supports far
fewer options.

Now we also have psgo, which is used by default, but that only supports a
small subset of ps(1) options. Implementing all options there is way too
much work.

Docker, on the other hand, executes ps(1) directly on the host and tries
to filter pids with `-q`, an option which is not supported by busybox's
ps and conflicts with other ps(1) arguments. That means they fall back
to full ps(1) on the host and then filter based on the pid in the
output. This is kinda ugly and falls short because users can modify the
ps output and it may not even include the pid in the output, which causes
an error.
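
To make that failure mode concrete, here is a minimal Go sketch of
output-based filtering. It is not Docker's actual code: `filterByPID` and
the sample rows are hypothetical, and it assumes whitespace-separated
columns with a header line.

```go
package main

import (
	"errors"
	"fmt"
	"strconv"
	"strings"
)

// filterByPID keeps the header line and every row whose PID column matches
// one of the wanted PIDs. Hypothetical helper, for illustration only.
func filterByPID(psOutput []string, wanted map[int]bool) ([]string, error) {
	if len(psOutput) == 0 {
		return nil, errors.New("empty ps output")
	}
	// Locate the PID column in the header; a custom format may omit it.
	pidCol := -1
	for i, name := range strings.Fields(psOutput[0]) {
		if name == "PID" {
			pidCol = i
			break
		}
	}
	if pidCol < 0 {
		return nil, errors.New("ps output has no PID column to filter on")
	}
	filtered := []string{psOutput[0]}
	for _, line := range psOutput[1:] {
		fields := strings.Fields(line)
		if pidCol >= len(fields) {
			continue // short row, cannot contain a PID
		}
		pid, err := strconv.Atoi(fields[pidCol])
		if err == nil && wanted[pid] {
			filtered = append(filtered, line)
		}
	}
	return filtered, nil
}

func main() {
	out := []string{
		"USER  PID COMMAND",
		"root    1 /sbin/init",
		"root 4242 sleep 100",
	}
	rows, err := filterByPID(out, map[int]bool{4242: true})
	fmt.Println(rows, err)

	// With a user-supplied format such as "-o comm" there is no PID column.
	_, err = filterByPID([]string{"COMMAND", "sleep"}, map[int]bool{4242: true})
	fmt.Println(err)
}
```

The second call shows the problem: once the user picks the columns, the
filter has nothing to match on and can only return an error.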

So every solution has a different drawback, but what if we can combine
them somehow?! This commit tries exactly that.

We use ps(1) from the host and execute that in the container's pid
namespace.
There are some security concerns that must be addressed:
- mount the executable paths for ps and podman itself read-only to
  prevent the container from overwriting them via /proc/self/exe.
- set NO_NEW_PRIVS, SET_DUMPABLE and PDEATHSIG
- close all non-std fds to prevent leaking files that the caller had
  open
- unset all environment variables to not leak any into the container
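
As an illustration of the fd-closing step, the selection logic can be
sketched in plain Go. `fdsAboveStderr` is a hypothetical helper that only
decides which entries of a /proc/self/fd style listing are candidates; the
code in this commit additionally checks that an fd was inherited before
closing it, so that fds owned by the Go runtime are left alone.

```go
package main

import (
	"fmt"
	"strconv"
)

// fdsAboveStderr parses the entry names of a /proc/self/fd style listing and
// returns the descriptors above stderr (fd 2), i.e. the candidates to close.
// Hypothetical helper; it does not close anything itself.
func fdsAboveStderr(names []string) []int {
	const stderrFd = 2
	var fds []int
	for _, name := range names {
		fd, err := strconv.Atoi(name)
		if err != nil {
			continue // not a numeric entry, ignore it
		}
		if fd > stderrFd {
			fds = append(fds, fd)
		}
	}
	return fds
}

func main() {
	// Sample listing; a real caller would pass the names read from /proc/self/fd.
	fmt.Println(fdsAboveStderr([]string{"0", "1", "2", "3", "7", "junk"}))
}
```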

Technically this could be a breaking change if somebody does not
have ps on the host and only in the container, but I find that very
unlikely, so I have removed the in-container fallback.

This updates the docs accordingly. Note that podman pod top never falls
back to executing ps in the container, as this makes no sense with
multiple containers, so I fixed the docs there as well.

Fixes containers#19001
Fixes https://bugzilla.redhat.com/show_bug.cgi?id=2215572

Signed-off-by: Paul Holzinger <pholzing@redhat.com>
Luap99 committed Jul 7, 2023
1 parent 2560716 commit d963033
Showing 10 changed files with 303 additions and 54 deletions.
4 changes: 3 additions & 1 deletion docs/source/markdown/podman-pod-top.1.md.in
@@ -7,7 +7,9 @@ podman\-pod\-top - Display the running processes of containers in a pod
**podman pod top** [*options*] *pod* [*format-descriptors*]

## DESCRIPTION
-Display the running processes of containers in a pod. The *format-descriptors* are ps (1) compatible AIX format descriptors but extended to print additional information, such as the seccomp mode or the effective capabilities of a given process. The descriptors can either be passed as separate arguments or as a single comma-separated argument. Note that if additional options of ps(1) are specified, Podman falls back to executing ps with the specified arguments and options in the container.
+Display the running processes of containers in a pod. The *format-descriptors* are ps (1) compatible AIX format
+descriptors but extended to print additional information, such as the seccomp mode or the effective capabilities
+of a given process. The descriptors can either be passed as separate arguments or as a single comma-separated argument.

## OPTIONS

10 changes: 8 additions & 2 deletions docs/source/markdown/podman-top.1.md.in
@@ -9,7 +9,13 @@ podman\-top - Display the running processes of a container
**podman container top** [*options*] *container* [*format-descriptors*]

## DESCRIPTION
-Display the running processes of the container. The *format-descriptors* are ps (1) compatible AIX format descriptors but extended to print additional information, such as the seccomp mode or the effective capabilities of a given process. The descriptors can either be passed as separated arguments or as a single comma-separated argument. Note that options and or flags of ps(1) can also be specified; in this case, Podman falls back to executing ps with the specified arguments and flags in the container. Please use the "h*" descriptors to extract host-related information. For instance, `podman top $name hpid huser` to display the PID and user of the processes in the host context.
+Display the running processes of the container. The *format-descriptors* are ps (1) compatible AIX format
+descriptors but extended to print additional information, such as the seccomp mode or the effective capabilities
+of a given process. The descriptors can either be passed as separated arguments or as a single comma-separated
+argument. Note that options and or flags of ps(1) can also be specified; in this case, Podman falls back to
+executing ps(1) from the host with the specified arguments and flags in the container namespace. Please use the
+"h*" descriptors to extract host-related information. For instance, `podman top $name hpid huser` to display
+the PID and user of the processes in the host context.

## OPTIONS

@@ -90,7 +96,7 @@ PID SECCOMP COMMAND %CPU
8 filter vi /etc/ 0.000
```

-Podman falls back to executing ps(1) in the container if an unknown descriptor is specified.
+Podman falls back to executing ps(1) from the host in the container namespace if an unknown descriptor is specified.

```
$ podman top -l -- aux
81 changes: 81 additions & 0 deletions libpod/container_top_linux.c
@@ -0,0 +1,81 @@
#define _GNU_SOURCE
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mount.h>
#include <sys/wait.h>
#include <unistd.h>

/* keep special_exit_code in sync with container_top_linux.go */
int special_exit_code = 255;
char **argv = NULL;

void
create_argv (int len)
{
  /* allocate one extra element because we need a final NULL in c */
  argv = malloc (sizeof (char *) * (len + 1));
  if (argv == NULL)
    {
      fprintf (stderr, "failed to allocate ps argv");
      exit (special_exit_code);
    }
  /* add final NULL */
  argv[len] = NULL;
}

void
set_argv (int pos, char *arg)
{
  argv[pos] = arg;
}

/*
  We use cgo code here so we can fork then exec separately,
  this is done so we can mount proc after the fork because the pid namespace is
  only active after spawning children.
*/
void
fork_exec_ps ()
{
  int r, status = 0;
  pid_t pid;

  if (argv == NULL)
    {
      fprintf (stderr, "argv not initialized");
      exit (special_exit_code);
    }

  pid = fork ();
  if (pid < 0)
    {
      fprintf (stderr, "fork: %m");
      exit (special_exit_code);
    }
  if (pid == 0)
    {
      r = mount ("proc", "/proc", "proc", 0, NULL);
      if (r < 0)
        {
          fprintf (stderr, "mount proc: %m");
          exit (special_exit_code);
        }
      /* use execve to unset all env vars, we do not want to leak anything into the container */
      execve (argv[0], argv, NULL);
      fprintf (stderr, "execve: %m");
      exit (special_exit_code);
    }

  r = waitpid (pid, &status, 0);
  if (r < 0)
    {
      fprintf (stderr, "waitpid: %m");
      exit (special_exit_code);
    }
  if (WIFEXITED (status))
    exit (WEXITSTATUS (status));
  if (WIFSIGNALED (status))
    exit (128 + WTERMSIG (status));
  exit (special_exit_code);
}
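
The exit-status handling at the end of `fork_exec_ps` can be restated as a
small pure function. The Go sketch below is illustrative only:
`exitCodeFor` and its boolean parameters are hypothetical stand-ins for the
`WIFEXITED`/`WIFSIGNALED` checks, and `specialExitCode` mirrors
`special_exit_code` above.

```go
package main

import "fmt"

// specialExitCode mirrors special_exit_code in the C helper.
const specialExitCode = 255

// exitCodeFor is a hypothetical restatement of the tail of fork_exec_ps:
// a normal exit passes the status through, death by signal becomes
// 128+signal, and anything else is reported as an internal failure.
func exitCodeFor(exited bool, exitStatus int, signaled bool, signal int) int {
	switch {
	case exited:
		return exitStatus
	case signaled:
		return 128 + signal
	default:
		return specialExitCode
	}
}

func main() {
	fmt.Println(exitCodeFor(true, 1, false, 0))  // ps exited with status 1
	fmt.Println(exitCodeFor(false, 0, true, 9))  // ps was killed by signal 9
	fmt.Println(exitCodeFor(false, 0, false, 0)) // neither exited nor signaled
}
```

One consequence of this scheme, as far as the sketch goes: a ps that itself
exits with 255 cannot be told apart from a setup failure by the caller.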
228 changes: 189 additions & 39 deletions libpod/container_top_linux.go
@@ -1,23 +1,181 @@
//go:build linux
// +build linux
//go:build linux && cgo
// +build linux,cgo

package libpod

import (
	"bufio"
	"bytes"
	"errors"
	"fmt"
	"os"
	"os/exec"
	"path/filepath"
	"runtime"
	"strconv"
	"strings"
	"syscall"
	"unsafe"

	"github.com/containers/podman/v4/libpod/define"
	"github.com/containers/podman/v4/pkg/rootless"
	"github.com/containers/psgo"
	"github.com/containers/storage/pkg/reexec"
	"github.com/google/shlex"
	"github.com/sirupsen/logrus"
	"golang.org/x/sys/unix"
)

/*
#include <stdlib.h>
void fork_exec_ps();
void create_argv(int len);
void set_argv(int pos, char *arg);
*/
import "C"

const (
	// podmanTopCommand is the reexec key to safely setup the environment for ps to be executed
	podmanTopCommand = "podman-top"

	// podmanTopExitCode is a special exit code to signal that podman failed to do something in
	// the reexec command, not ps. This is used to give a better error.
	podmanTopExitCode = 255
)

func init() {
	reexec.Register(podmanTopCommand, podmanTopMain)
}

// podmanTopMain - main function for the reexec
func podmanTopMain() {
	if err := podmanTopInner(); err != nil {
		fmt.Fprint(os.Stderr, err.Error())
		os.Exit(podmanTopExitCode)
	}
	os.Exit(0)
}

// podmanTopInner os.Args = {command name} {pid} {psPath} [args...]
// We are reexec'd in a new mountns, then we need to set some security settings in order
// to safely execute ps in the container pid namespace. Most notably make sure podman and
// ps are read only to prevent a process from overwriting them.
func podmanTopInner() error {
	if len(os.Args) < 3 {
		return fmt.Errorf("internal error, need at least two arguments")
	}

	// We have to lock the thread as we a) switch namespace below and b) use PR_SET_PDEATHSIG
	// Also do not unlock as this thread should not be reused by go, we exit anyway at the end.
	runtime.LockOSThread()

	if err := unix.Prctl(unix.PR_SET_PDEATHSIG, uintptr(unix.SIGKILL), 0, 0, 0); err != nil {
		return fmt.Errorf("PR_SET_PDEATHSIG: %w", err)
	}
	if err := unix.Prctl(unix.PR_SET_DUMPABLE, 0, 0, 0, 0); err != nil {
		return fmt.Errorf("PR_SET_DUMPABLE: %w", err)
	}

	if err := unix.Prctl(unix.PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0); err != nil {
		return fmt.Errorf("PR_SET_NO_NEW_PRIVS: %w", err)
	}

	if err := unix.Mount("none", "/", "", unix.MS_REC|unix.MS_PRIVATE, ""); err != nil {
		return fmt.Errorf("make / mount private: %w", err)
	}

	psPath := os.Args[2]

	// try to mount everything read only
	if err := unix.MountSetattr(0, "/", unix.AT_RECURSIVE, &unix.MountAttr{
		Attr_set: unix.MOUNT_ATTR_RDONLY,
	}); err != nil {
		if err != unix.ENOSYS {
			return fmt.Errorf("mount_setattr / readonly: %w", err)
		}
		// old kernel without mount_setattr, i.e. on RHEL 8.8
		// Bind mount the directories readonly for both podman and ps.
		psPath, err = remountReadOnly(psPath)
		if err != nil {
			return err
		}
		_, err = remountReadOnly(reexec.Self())
		if err != nil {
			return err
		}
	}

	// extra safety check: make sure the ps path is actually read only
	err := unix.Access(psPath, unix.W_OK)
	if err == nil {
		return fmt.Errorf("%q was not mounted read only, this can be dangerous so we will not execute it", psPath)
	}

	pid := os.Args[1]
	// join the pid namespace of pid
	pidFD, err := os.Open(fmt.Sprintf("/proc/%s/ns/pid", pid))
	if err != nil {
		return fmt.Errorf("open pidns: %w", err)
	}
	if err := unix.Setns(int(pidFD.Fd()), unix.CLONE_NEWPID); err != nil {
		return fmt.Errorf("setns NEWPID: %w", err)
	}
	pidFD.Close()

	args := []string{psPath}
	args = append(args, os.Args[3:]...)

	C.create_argv(C.int(len(args)))
	for i, arg := range args {
		cArg := C.CString(arg)
		C.set_argv(C.int(i), cArg)
		defer C.free(unsafe.Pointer(cArg))
	}

	// Now try to close open fds except std streams
	// While golang opens everything O_CLOEXEC it could still leak fds from
	// the parent, i.e. bash. In this case an attacker might be able to
	// read/write from them.
	// Do this as last step, it has to happen before the fork because the child
	// will be immediately in the pid namespace so we cannot close them in the child.
	entries, err := os.ReadDir("/proc/self/fd")
	if err != nil {
		return err
	}
	for _, e := range entries {
		i, err := strconv.Atoi(e.Name())
		// IsFdInherited checks that we got the fd from a parent process and we only close those,
		// if we closed all fds that would include the ones from the go runtime which
		// then can panic because of that.
		if err == nil && i > unix.Stderr && rootless.IsFdInherited(i) {
			_ = unix.Close(i)
		}
	}

	// this function will always exit for us
	C.fork_exec_ps()
	return nil
}

// remountReadOnly remounts the parent directory of the given path read only
// return the resolved path or an error. The path can then be used to exec the
// binary as we know it is on a read only mount now.
func remountReadOnly(path string) (string, error) {
	resolvedPath, err := filepath.EvalSymlinks(path)
	if err != nil {
		return "", fmt.Errorf("resolve symlink for %s: %w", path, err)
	}
	dir := filepath.Dir(resolvedPath)
	// create mount point
	if err := unix.Mount(dir, dir, "", unix.MS_BIND, ""); err != nil {
		return "", fmt.Errorf("mount %s read only: %w", dir, err)
	}
	// remount readonly
	if err := unix.Mount(dir, dir, "", unix.MS_BIND|unix.MS_REMOUNT|unix.MS_RDONLY, ""); err != nil {
		return "", fmt.Errorf("mount %s read only: %w", dir, err)
	}
	return resolvedPath, nil
}

// Top gathers statistics about the running processes in a container. It returns a
// []string for output
func (c *Container) Top(descriptors []string) ([]string, error) {
@@ -70,7 +228,7 @@ func (c *Container) Top(descriptors []string) ([]string, error) {

	output, err = c.execPS(psDescriptors)
	if err != nil {
-		return nil, fmt.Errorf("executing ps(1) in the container: %w", err)
+		return nil, fmt.Errorf("executing ps(1): %w", err)
	}

	// Trick: filter the ps command from the output instead of
@@ -113,60 +271,52 @@ func (c *Container) GetContainerPidInformation(descriptors []string) ([]string,
	return res, nil
}

// execPS executes ps(1) with the specified args in the container.
func (c *Container) execPS(args []string) ([]string, error) {
// execute ps(1) from the host within the container pid namespace
func (c *Container) execPS(psArgs []string) ([]string, error) {
	rPipe, wPipe, err := os.Pipe()
	if err != nil {
		return nil, err
	}
	defer wPipe.Close()
	defer rPipe.Close()

	rErrPipe, wErrPipe, err := os.Pipe()
	if err != nil {
		return nil, err
	}
	defer wErrPipe.Close()
	defer rErrPipe.Close()

	streams := new(define.AttachStreams)
	streams.OutputStream = wPipe
	streams.ErrorStream = wErrPipe
	streams.AttachOutput = true
	streams.AttachError = true

	stdout := []string{}
	go func() {
		scanner := bufio.NewScanner(rPipe)
		for scanner.Scan() {
			stdout = append(stdout, scanner.Text())
		}
	}()
	stderr := []string{}
	go func() {
		scanner := bufio.NewScanner(rErrPipe)
		for scanner.Scan() {
			stderr = append(stderr, scanner.Text())
		}
	}()

	cmd := append([]string{"ps"}, args...)
	config := new(ExecConfig)
	config.Command = cmd
	ec, err := c.Exec(config, streams, nil)
	psPath, err := exec.LookPath("ps")
	if err != nil {
		return nil, err
	} else if ec != 0 {
		return nil, fmt.Errorf("runtime failed with exit status: %d and output: %s", ec, strings.Join(stderr, " "))
	}
	args := append([]string{podmanTopCommand, strconv.Itoa(c.state.PID), psPath}, psArgs...)

	if logrus.GetLevel() >= logrus.DebugLevel {
		// If we're running in debug mode or higher, we might want to have a
		// look at stderr which includes debug logs from conmon.
		for _, log := range stderr {
			logrus.Debugf("%s", log)
	cmd := reexec.Command(args...)
	cmd.SysProcAttr = &syscall.SysProcAttr{
		Unshareflags: unix.CLONE_NEWNS,
	}
	var errBuf bytes.Buffer
	cmd.Stdout = wPipe
	cmd.Stderr = &errBuf
	// nil means use current env so explicitly unset all, to not leak any sensitive env vars
	cmd.Env = []string{}
	err = cmd.Run()
	if err != nil {
		exitError := &exec.ExitError{}
		if errors.As(err, &exitError) {
			if exitError.ExitCode() != podmanTopExitCode {
				// ps command failed
				err = fmt.Errorf("ps(1) failed with exit code %d: %s", exitError.ExitCode(), errBuf.String())
			} else {
				// podman-top reexec setup fails somewhere
				err = fmt.Errorf("could not execute ps(1) in the container pid namespace: %s", errBuf.String())
			}
		} else {
			err = fmt.Errorf("could not reexec podman-top command: %w", err)
		}
	}

	return stdout, nil
	return stdout, err
}
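
The argument layout shared by `execPS` and `podmanTopInner`
(`{command name} {pid} {psPath} [args...]`) can be sketched as a build and
parse pair. `buildArgs` and `parseArgs` are hypothetical names used only
for this illustration; the diff above inlines both sides.

```go
package main

import (
	"errors"
	"fmt"
	"strconv"
)

// reexecName is the value of podmanTopCommand in the diff.
const reexecName = "podman-top"

// buildArgs lays out the reexec argv as {command name} {pid} {psPath} [args...].
func buildArgs(pid int, psPath string, psArgs []string) []string {
	args := []string{reexecName, strconv.Itoa(pid), psPath}
	return append(args, psArgs...)
}

// parseArgs is the receiving side: it needs at least the command name,
// the pid and the ps path, everything after that is passed on to ps.
func parseArgs(argv []string) (pid, psPath string, psArgs []string, err error) {
	if len(argv) < 3 {
		return "", "", nil, errors.New("need at least a pid and a ps path")
	}
	return argv[1], argv[2], argv[3:], nil
}

func main() {
	// The pid and path below are made-up sample values.
	argv := buildArgs(1234, "/usr/bin/ps", []string{"-e", "-f"})
	fmt.Println(argv)

	pid, psPath, psArgs, err := parseArgs(argv)
	fmt.Println(pid, psPath, psArgs, err)
}
```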
