forked from containers/podman
Commit
top: do not depend on ps(1) in container
This ended up more complicated than expected. Let's start with the problem to show why I am doing this:

Currently we simply execute ps(1) in the container. This has some drawbacks. First, you obviously need to have ps(1) in the container image. That is not always the case, especially in small images. Second, even if you do, it will often be only busybox's ps, which supports far fewer options.

Now we also have psgo, which is used by default, but it only supports a small subset of ps(1) options. Implementing all options there is way too much work.

Docker, on the other hand, executes ps(1) directly on the host and tries to filter pids with `-q`, an option which is not supported by busybox's ps and conflicts with other ps(1) arguments. That means they fall back to a full ps(1) on the host and then filter based on the pid in the output. This is kinda ugly and falls short because users can modify the ps output, and it may not even include the pid, which causes an error.

So every solution has a different drawback, but what if we can combine them somehow?! This commit tries exactly that.

We use ps(1) from the host and execute that in the container's pid namespace. There are some security concerns that must be addressed:
- mount the executable paths for ps and podman itself read-only to prevent the container from overwriting them via /proc/self/exe
- set NO_NEW_PRIVS, SET_DUMPABLE and PDEATHSIG
- close all non-std fds to prevent leaking files that the caller had open
- unset all environment variables to not leak any into the container

Technically this could be a breaking change if somebody does not have ps on the host and only in the container, but I find that very unlikely, so I have removed the in-container fallback.

This updates the docs accordingly. Note that podman pod top never falls back to executing ps in the container, as this makes no sense with multiple containers, so I fixed the docs there as well.

Fixes containers#19001
Fixes https://bugzilla.redhat.com/show_bug.cgi?id=2215572

Signed-off-by: Paul Holzinger <pholzing@redhat.com>
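
The process-level hardening steps listed in the commit message (the prctl settings and closing inherited file descriptors) can be sketched in C roughly as follows. This is only an illustration under assumptions, not the code from this commit: `harden_process` and `close_non_std_fds` are hypothetical names, clearing the dumpable flag (value 0) and using SIGKILL as the parent-death signal are assumed choices, and walking /proc/self/fd is just one possible way to find open descriptors.

```c
#define _GNU_SOURCE
#include <dirent.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/prctl.h>
#include <unistd.h>

/* Hypothetical helper, not part of the commit: apply the three prctl(2)
   settings named in the commit message. Returns 0 on success, -1 on error. */
int
harden_process (void)
{
  /* never gain privileges through setuid binaries or file capabilities */
  if (prctl (PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) < 0)
    return -1;
  /* assumption: "SET_DUMPABLE" means clearing the dumpable flag */
  if (prctl (PR_SET_DUMPABLE, 0, 0, 0, 0) < 0)
    return -1;
  /* assumption: SIGKILL is the signal delivered when the parent dies */
  if (prctl (PR_SET_PDEATHSIG, SIGKILL, 0, 0, 0) < 0)
    return -1;
  return 0;
}

/* Hypothetical helper: close every fd above stderr by walking /proc/self/fd.
   Returns the number of descriptors closed, or -1 if the directory cannot
   be opened. */
int
close_non_std_fds (void)
{
  DIR *dir = opendir ("/proc/self/fd");
  struct dirent *entry;
  int closed = 0;

  if (dir == NULL)
    return -1;
  while ((entry = readdir (dir)) != NULL)
    {
      char *end;
      long fd = strtol (entry->d_name, &end, 10);

      /* skip ".", "..", the std fds and the fd backing this directory stream */
      if (*end != '\0' || fd <= STDERR_FILENO || fd == dirfd (dir))
        continue;
      if (close ((int) fd) == 0)
        closed++;
    }
  closedir (dir);
  return closed;
}
```

A caller would run both helpers before the fork and exec of ps. The remaining two items from the list, the read-only mounts and the empty environment, are handled by mounts and by the `execve` call with a NULL environment respectively, so they are not repeated here.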
Showing 10 changed files with 302 additions and 57 deletions.
@@ -0,0 +1,81 @@
#define _GNU_SOURCE
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mount.h>
#include <sys/wait.h>
#include <unistd.h>

/* keep special_exit_code in sync with container_top_linux.go */
int special_exit_code = 255;
char **argv = NULL;

void
create_argv (int len)
{
  /* allocate one extra element because we need a final NULL in c */
  argv = malloc (sizeof (char *) * (len + 1));
  if (argv == NULL)
    {
      fprintf (stderr, "failed to allocate ps argv");
      exit (special_exit_code);
    }
  /* add final NULL */
  argv[len] = NULL;
}

void
set_argv (int pos, char *arg)
{
  argv[pos] = arg;
}

/*
  We use cgo code here so we can fork then exec separately,
  this is done so we can mount proc after the fork because the pid namespace is
  only active after spawning children.
*/
void
fork_exec_ps ()
{
  int r, status = 0;
  pid_t pid;

  if (argv == NULL)
    {
      fprintf (stderr, "argv not initialized");
      exit (special_exit_code);
    }

  pid = fork ();
  if (pid < 0)
    {
      fprintf (stderr, "fork: %m");
      exit (special_exit_code);
    }
  if (pid == 0)
    {
      r = mount ("proc", "/proc", "proc", 0, NULL);
      if (r < 0)
        {
          fprintf (stderr, "mount proc: %m");
          exit (special_exit_code);
        }
      /* use execve to unset all env vars, we do not want to leak anything into the container */
      execve (argv[0], argv, NULL);
      fprintf (stderr, "execve: %m");
      exit (special_exit_code);
    }

  r = waitpid (pid, &status, 0);
  if (r < 0)
    {
      fprintf (stderr, "waitpid: %m");
      exit (special_exit_code);
    }
  if (WIFEXITED (status))
    exit (WEXITSTATUS (status));
  if (WIFSIGNALED (status))
    exit (128 + WTERMSIG (status));
  exit (special_exit_code);
}
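
The tail of `fork_exec_ps` turns the child's wait status into an exit code: a normal exit passes through unchanged, death by signal becomes 128 plus the signal number (the same convention shells use), and anything else falls back to the special exit code. The sketch below isolates that mapping so it can be tried without mounting proc or running ps. It is not part of the commit: `map_wait_status` and `run_child` are hypothetical names, and the function returns the code instead of calling `exit` purely to make it easy to exercise.

```c
#include <signal.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: the same mapping as the end of fork_exec_ps (), but
   returning the code instead of exiting. fallback plays the role of
   special_exit_code. */
int
map_wait_status (int status, int fallback)
{
  if (WIFEXITED (status))
    return WEXITSTATUS (status);
  if (WIFSIGNALED (status))
    return 128 + WTERMSIG (status);
  return fallback;
}

/* Hypothetical helper for trying the mapping: fork a child that either
   exits with exit_code (when sig is 0) or kills itself with sig, then map
   its wait status. Returns -1 if fork or waitpid fails. */
int
run_child (int exit_code, int sig)
{
  int status = 0;
  pid_t pid = fork ();

  if (pid < 0)
    return -1;
  if (pid == 0)
    {
      if (sig != 0)
        raise (sig);
      _exit (exit_code);
    }
  if (waitpid (pid, &status, 0) < 0)
    return -1;
  return map_wait_status (status, 255);
}
```

For example, `run_child (7, 0)` yields 7, while `run_child (0, SIGKILL)` yields 137 (128 + 9), which is the value a caller on the Go side would see for a ps process killed by SIGKILL under this scheme.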