Skip to content

Sync 1.1 to 1.1-m#3

Merged
ameyag merged 29 commits intoMirantis:release-1.1-mfrom
ameyag:sync-0216
Feb 19, 2024
Merged

Sync 1.1 to 1.1-m#3
ameyag merged 29 commits intoMirantis:release-1.1-mfrom
ameyag:sync-0216

Conversation

@ameyag
Copy link
Copy Markdown

@ameyag ameyag commented Feb 16, 2024

In #2, only GHSA-xr7r-f8xq-vfvv was cherry-picked. This is syncing latest 1.1 into 1.1-m

dependabot Bot and others added 29 commits December 11, 2023 12:24
Bumps [github.com/cyphar/filepath-securejoin](https://github.com/cyphar/filepath-securejoin) from 0.2.3 to 0.2.4.
- [Release notes](https://github.com/cyphar/filepath-securejoin/releases)
- [Commits](cyphar/filepath-securejoin@v0.2.3...v0.2.4)

---
updated-dependencies:
- dependency-name: github.com/cyphar/filepath-securejoin
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
(cherry picked from commit c7ad274)
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
…date_securejoin

[1.1 backport] build(deps): bump github.com/cyphar/filepath-securejoin
This field reports swap-only usage. For cgroupv1, `Usage` and `Failcnt`
are set by subtracting memory usage from memory+swap usage. For cgroupv2,
`Usage`, `Limit`, and `MaxUsage` are set. This commit also export `MaxUsage`
of memory under cgroupv2 mode, using `memory.peak` introduced in kernel 5.19.

Signed-off-by: Heran Yang <heran55@126.com>
(cherry picked from commit 104b8dc)
Signed-off-by: Harshal Patil <harpatil@redhat.com>
[1.1] feat: add swapOnlyUsage in MemoryStats
(This is a cherry-pick of 1912d59.)

Our handling for name space paths with user namespaces has been broken
for a long time. In particular, the need to parse /proc/self/*id_map in
quite a few places meant that we would treat userns configurations that
had a namespace path as if they were a userns configuration without
mappings, resulting in errors.

The primary issue was down to the id translation helper functions, which
could only handle configurations that had explicit mappings. Obviously,
when joining a user namespace we need to map the ids but figuring out
the correct mapping is non-trivial in comparison.

In order to get the mapping, you need to read /proc/<pid>/*id_map of a
process inside the userns -- while most userns paths will be of the form
/proc/<pid>/ns/user (and we have a fast-path for this case), this is not
guaranteed and thus it is necessary to spawn a process inside the
container and read its /proc/<pid>/*id_map files in the general case.

As Go does not allow us spawn a subprocess into a target userns,
we have to use CGo to fork a sub-process which does the setns(2). To be
honest, this is a little dodgy in regards to POSIX signal-safety(7) but
since we do no allocations and we are executing in the forked context
from a Go program (not a C program), it should be okay. The other
alternative would be to do an expensive re-exec (a-la nsexec which would
make several other bits of runc more complicated), or to use nsenter(1)
which might not exist on the system and is less than ideal.

Because we need to logically remap users quite a few times in runc
(including in "runc init", where joining the namespace is not feasable),
we cache the mapping inside the libcontainer config struct. A future
patch will make sure that we stop allow invalid user configurations
where a mapping is specified as well as a userns path to join.

Finally, add an integration test to make sure we don't regress this again.

Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
(This is a cherry-pick of 09822c3.)

For userns and timens, the mappings (and offsets, respectively) cannot
be changed after the namespace is first configured. Thus, configuring a
container with a namespace path to join means that you cannot also
provide configuration for said namespace. Previously we would silently
ignore the configuration (and just join the provided path), but we
really should be returning an error (especially when you consider that
the configuration userns mappings are used quite a bit in runc with the
assumption that they are the correct mapping for the userns -- but in
this case they are not).

In the case of userns, the mappings are also required if you _do not_
specify a path, while in the case of the time namespace you can have a
container with a timens but no mappings specified.

It should be noted that the case checking that the user has not
specified a userns path and a userns mapping needs to be handled in
specconv (as opposed to the configuration validator) because with this
patchset we now cache the mappings of path-based userns configurations
and thus the validator can't be sure whether the mapping is a cached
mapping or a user-specified one. So we do the validation in specconv,
and thus the test for this needs to be an integration test.

Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
(This is a cherry-pick of 6fa8d06.)

Given we've had several bugs in this behaviour that have now been fixed,
add an integration test that makes sure that you can start a container
that joins all of the namespaces of a second container.

The only namespace we do not join is the mount namespace, because
joining a namespace that has been pivot_root'd leads to a bunch of
errors. In principle, removing everything from config.json that requires
a mount _should_ work, but the root.path configuration is mandatory and
we cannot just ignore setting up the rootfs in the namespace joining
scenario (if the user has configured a different rootfs, we need to use
it or error out, and there's no reasonable way of checking if if the
rootfs paths are the same that doesn't result in spaghetti logic).

Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
(This is a cherry-pick of ebcef3e.)

It turns out that the error added in commit 09822c3 ("configs:
disallow ambiguous userns and timens configurations") causes issues with
containerd and CRIO because they pass both userns mappings and a userns
path.

These configurations are broken, but to avoid the regression in this one
case, output a warning to tell the user that the configuration is
incorrect but we will continue to use it if and only if the configured
mappings are identical to the mappings of the provided namespace.

Fixes: 09822c3 ("configs: disallow ambiguous userns and timens configurations")
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
(This is a cherry-pick of 482e563.)

Using ints for all of our mapping structures means that a 32-bit binary
errors out when trying to parse /proc/self/*id_map:

  failed to cache mappings for userns: failed to parse uid_map of userns /proc/1/ns/user:
  parsing id map failed: invalid format in line "         0          0 4294967295": integer overflow on token 4294967295

This issue was unearthed by commit 1912d59 ("*: actually support
joining a userns with a new container") but the underlying issue has
been present since the docker/libcontainer days.

In theory, switching to uint32 (to match the spec) instead of int64
would also work, but keeping everything signed seems much less
error-prone. It's also important to note that a mapping might be too
large for an int on 32-bit, so we detect this during the mapping.

Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
[1.1] *: fix several issues with userns path handling
Signed-off-by: lfbzhm <lifubang@acmcoder.com>
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
Signed-off-by: lfbzhm <lifubang@acmcoder.com>
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
lfbzhm (2):
  VERSION: back to development
  VERSION: release 1.1.11

LGTMs: AkihiroSuda cyphar
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
Aleksa Sarai (2):
  keyring: update AkihiroSuda key expiry
  keyring: update cyphar@cyphar.com key expiry

LGTMs: AkihiroSuda lifubang
(This is a cherry-pick of 937ca10.)

Signed-off-by: hang.jiang <hang.jiang@daocloud.io>
Fixes: GHSA-xr7r-f8xq-vfvv CVE-2024-21626
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
If a file descriptor of a directory in the host's mount namespace is
leaked to runc init, a malicious config.json could use /proc/self/fd/...
as a working directory to allow for host filesystem access after the
container runs. This can also be exploited by a container process if it
knows that an administrator will use "runc exec --cwd" and the target
--cwd (the attacker can change that cwd to be a symlink pointing to
/proc/self/fd/... and wait for the process to exec and then snoop on
/proc/$pid/cwd to get access to the host). The former issue can lead to
a critical vulnerability in Docker and Kubernetes, while the latter is a
container breakout.

We can (ab)use the fact that getcwd(2) on Linux detects this exact case,
and getcwd(3) and Go's Getwd() return an error as a result. Thus, if we
just do os.Getwd() after chdir we can easily detect this case and error
out.

In runc 1.1, a /sys/fs/cgroup handle happens to be leaked to "runc
init", making this exploitable. On runc main it just so happens that the
leaked /sys/fs/cgroup gets clobbered and thus this is only consistently
exploitable for runc 1.1.

Fixes: GHSA-xr7r-f8xq-vfvv CVE-2024-21626
Co-developed-by: lifubang <lifubang@acmcoder.com>
Signed-off-by: lifubang <lifubang@acmcoder.com>
[refactored the implementation and added more comments]
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
(This is a partial backport of a minor change included in commit
dac4171.)

This mirrors the logic in standard_init_linux.go, and also ensures that
we do not call exec.LookPath in the final execve step.

While this is okay for regular binaries, it seems exec.LookPath calls
os.Getenv which tries to emit a log entry to the test harness when
running in "go test" mode. In a future patch (in order to fix
CVE-2024-21626), we will close all of the file descriptors immediately
before execve, which would mean the file descriptor for test harness
logging would be closed at execve time. So, moving exec.LookPath earlier
is necessary.

Ref: dac4171 ("runc-dmz: reduce memfd binary cloning cost with small C binary")
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
If we leak a file descriptor referencing the host filesystem, an
attacker could use a /proc/self/fd magic-link as the source for execve
to execute a host binary in the container. This would allow the binary
itself (or a process inside the container in the 'runc exec' case) to
write to a host binary, leading to a container escape.

The simple solution is to make sure we close all file descriptors
immediately before the execve(2) step. Doing this earlier can lead to very
serious issues in Go (as file descriptors can be reused, any (*os.File)
reference could start silently operating on a different file) so we have
to do it as late as possible.

Unfortunately, there are some Go runtime file descriptors that we must
not close (otherwise the Go scheduler panics randomly). The only way of
being sure which file descriptors cannot be closed is to sneakily
go:linkname the runtime internal "internal/poll.IsPollDescriptor"
function. This is almost certainly not recommended but there isn't any
other way to be absolutely sure, while also closing any other possible
files.

In addition, we can keep the logrus forwarding logfd open because you
cannot execve a pipe and the contents of the pipe are so restricted
(JSON-encoded in a format we pick) that it seems unlikely you could even
construct shellcode. Closing the logfd causes issues if there is an
error returned from execve.

In mainline runc, runc-dmz protects us against this attack because the
intermediate execve(2) closes all of the O_CLOEXEC internal runc file
descriptors and thus runc-dmz cannot access them to attack the host.

Fixes: GHSA-xr7r-f8xq-vfvv CVE-2024-21626
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
We auto-close this file descriptor in the final exec step, but it's
probably a good idea to not possibly leak the file descriptor to "runc
init" (we've had issues like this in the past) especially since it is a
directory handle from the host mount namespace.

In practice, on runc 1.1 this does leak to "runc init" but on main the
handle has a low enough file descriptor that it gets clobbered by the
ForkExec of "runc init".

OPEN_TREE_CLONE would let us protect this handle even further, but the
performance impact of creating an anonymous mount namespace is probably
not worth it.

Also, switch to using an *os.File for the handle so if it goes out of
scope during setup (i.e. an error occurs during setup) it will get
cleaned up by the GC.

Fixes: GHSA-xr7r-f8xq-vfvv CVE-2024-21626
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
Given the core issue in GHSA-xr7r-f8xq-vfvv was that we were unknowingly
leaking file descriptors to "runc init", it seems prudent to make sure
we proactively prevent this in the future. The solution is to simply
mark all non-stdio file descriptors as O_CLOEXEC before we spawn "runc
init".

For libcontainer library users, this could result in unrelated files
being marked as O_CLOEXEC -- however (for the same reason we are doing
this for runc), for security reasons those files should've been marked
as O_CLOEXEC anyway.

Fixes: GHSA-xr7r-f8xq-vfvv CVE-2024-21626
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
We close the logfd before execve so there's no need to special case it.
In addition, it turns out that (*os.File).Fd() doesn't handle the case
where the file was closed and so it seems suspect to use that kind of
check.

Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
This is a security fix for CVE-2024-21626. See the advisory[1] for more
details.

Aleksa Sarai (6):
  init: don't special-case logrus fds
  libcontainer: mark all non-stdio fds O_CLOEXEC before spawning init
  cgroup: plug leaks of /sys/fs/cgroup handle
  init: close internal fds before execve
  setns init: do explicit lookup of execve argument early
  init: verify after chdir that cwd is inside the container

Hang Jiang (1):
  Fix File to Close

[1]: GHSA-xr7r-f8xq-vfvv

Fixes: GHSA-xr7r-f8xq-vfvv CVE-2024-21626
LGTMs: cyphar AkihiroSuda kolyshkin lifubang
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
Signed-off-by: Ameya Gawde <agawde@mirantis.com>
@ameyag ameyag requested a review from aepifanov February 16, 2024 21:52
@ameyag ameyag merged commit 77d7dd5 into Mirantis:release-1.1-m Feb 19, 2024
@ameyag ameyag deleted the sync-0216 branch February 19, 2024 21:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants