Sync 1.1 to 1.1-m by ameyag · Pull Request #3 · Mirantis/runc

ameyag · 2024-02-16T21:52:15Z

In #2, only GHSA-xr7r-f8xq-vfvv was cherry-picked. This PR syncs the latest 1.1 into 1.1-m.

Bumps [github.com/cyphar/filepath-securejoin](https://github.com/cyphar/filepath-securejoin) from 0.2.3 to 0.2.4.
- [Release notes](https://github.com/cyphar/filepath-securejoin/releases)
- [Commits](cyphar/filepath-securejoin@v0.2.3...v0.2.4)

---
updated-dependencies:
- dependency-name: github.com/cyphar/filepath-securejoin
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
(cherry picked from commit c7ad274)
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>

…date_securejoin [1.1 backport] build(deps): bump github.com/cyphar/filepath-securejoin

This field reports swap-only usage. For cgroupv1, `Usage` and `Failcnt` are set by subtracting memory usage from memory+swap usage. For cgroupv2, `Usage`, `Limit`, and `MaxUsage` are set. This commit also exports `MaxUsage` of memory under cgroupv2 mode, using `memory.peak` introduced in kernel 5.19. Signed-off-by: Heran Yang <heran55@126.com> (cherry picked from commit 104b8dc) Signed-off-by: Harshal Patil <harpatil@redhat.com>
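The cgroupv1 subtraction described above can be sketched as follows. This is a minimal illustration, not the libcontainer code: the `memoryData` type and the `swapOnlyV1` and `saturatingSub` names are hypothetical, and clamping at zero is an assumption made here because the two counters are read at different moments.

```go
package main

import "fmt"

// memoryData is a hypothetical stand-in for one set of cgroup counters.
// It is not the libcontainer type.
type memoryData struct {
	Usage   uint64
	Failcnt uint64
}

// saturatingSub returns a-b, or 0 if b is larger than a.
// The clamp is an assumption of this sketch: the two counters are sampled
// separately, so the difference could otherwise underflow.
func saturatingSub(a, b uint64) uint64 {
	if a < b {
		return 0
	}
	return a - b
}

// swapOnlyV1 derives swap-only counters for cgroup v1 by subtracting the
// memory counters from the memory+swap counters.
func swapOnlyV1(mem, memSwap memoryData) memoryData {
	return memoryData{
		Usage:   saturatingSub(memSwap.Usage, mem.Usage),
		Failcnt: saturatingSub(memSwap.Failcnt, mem.Failcnt),
	}
}

func main() {
	mem := memoryData{Usage: 300 << 20, Failcnt: 2}
	memSwap := memoryData{Usage: 500 << 20, Failcnt: 5}
	fmt.Printf("%+v\n", swapOnlyV1(mem, memSwap))
}
```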

[1.1] feat: add swapOnlyUsage in MemoryStats

(This is a cherry-pick of 1912d59.)

Our handling for namespace paths with user namespaces has been broken for a long time. In particular, the need to parse /proc/self/*id_map in quite a few places meant that we would treat userns configurations that had a namespace path as if they were a userns configuration without mappings, resulting in errors.

The primary issue was down to the id translation helper functions, which could only handle configurations that had explicit mappings. Obviously, when joining a user namespace we need to map the ids but figuring out the correct mapping is non-trivial in comparison.

In order to get the mapping, you need to read /proc/<pid>/*id_map of a process inside the userns -- while most userns paths will be of the form /proc/<pid>/ns/user (and we have a fast-path for this case), this is not guaranteed and thus it is necessary to spawn a process inside the container and read its /proc/<pid>/*id_map files in the general case.

As Go does not allow us to spawn a subprocess into a target userns, we have to use CGo to fork a sub-process which does the setns(2). To be honest, this is a little dodgy in regards to POSIX signal-safety(7) but since we do no allocations and we are executing in the forked context from a Go program (not a C program), it should be okay. The other alternative would be to do an expensive re-exec (a-la nsexec which would make several other bits of runc more complicated), or to use nsenter(1) which might not exist on the system and is less than ideal.

Because we need to logically remap users quite a few times in runc (including in "runc init", where joining the namespace is not feasible), we cache the mapping inside the libcontainer config struct. A future patch will make sure that we stop allowing invalid user configurations where a mapping is specified as well as a userns path to join.

Finally, add an integration test to make sure we don't regress this again.

Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
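The fast path mentioned in this commit message, where the userns path has the form /proc/<pid>/ns/user, can be illustrated with a short sketch. The `idMapPaths` helper and its regular expression are hypothetical and are not taken from runc; the sketch only shows how such a path could be turned into the matching uid_map and gid_map paths without spawning a process.

```go
package main

import (
	"fmt"
	"regexp"
)

// procUsernsPath matches paths of the form /proc/<pid>/ns/user.
// The exact pattern is an assumption of this sketch.
var procUsernsPath = regexp.MustCompile(`^/proc/(\d+)/ns/user$`)

// idMapPaths returns the uid_map and gid_map paths for a userns path of the
// form /proc/<pid>/ns/user. ok is false for any other path, in which case the
// caller would have to fall back to the general (slower) approach.
func idMapPaths(nsPath string) (uidMap, gidMap string, ok bool) {
	m := procUsernsPath.FindStringSubmatch(nsPath)
	if m == nil {
		return "", "", false
	}
	base := "/proc/" + m[1]
	return base + "/uid_map", base + "/gid_map", true
}

func main() {
	for _, p := range []string{"/proc/1234/ns/user", "/run/userns/example"} {
		uidMap, gidMap, ok := idMapPaths(p)
		fmt.Println(p, uidMap, gidMap, ok)
	}
}
```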

(This is a cherry-pick of 09822c3.) For userns and timens, the mappings (and offsets, respectively) cannot be changed after the namespace is first configured. Thus, configuring a container with a namespace path to join means that you cannot also provide configuration for said namespace. Previously we would silently ignore the configuration (and just join the provided path), but we really should be returning an error (especially when you consider that the configuration userns mappings are used quite a bit in runc with the assumption that they are the correct mapping for the userns -- but in this case they are not). In the case of userns, the mappings are also required if you _do not_ specify a path, while in the case of the time namespace you can have a container with a timens but no mappings specified. It should be noted that the case checking that the user has not specified a userns path and a userns mapping needs to be handled in specconv (as opposed to the configuration validator) because with this patchset we now cache the mappings of path-based userns configurations and thus the validator can't be sure whether the mapping is a cached mapping or a user-specified one. So we do the validation in specconv, and thus the test for this needs to be an integration test. Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>

(This is a cherry-pick of 6fa8d06.) Given we've had several bugs in this behaviour that have now been fixed, add an integration test that makes sure that you can start a container that joins all of the namespaces of a second container. The only namespace we do not join is the mount namespace, because joining a namespace that has been pivot_root'd leads to a bunch of errors. In principle, removing everything from config.json that requires a mount _should_ work, but the root.path configuration is mandatory and we cannot just ignore setting up the rootfs in the namespace joining scenario (if the user has configured a different rootfs, we need to use it or error out, and there's no reasonable way of checking if the rootfs paths are the same that doesn't result in spaghetti logic). Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>

(This is a cherry-pick of ebcef3e.) It turns out that the error added in commit 09822c3 ("configs: disallow ambiguous userns and timens configurations") causes issues with containerd and CRIO because they pass both userns mappings and a userns path. These configurations are broken, but to avoid the regression in this one case, output a warning to tell the user that the configuration is incorrect but we will continue to use it if and only if the configured mappings are identical to the mappings of the provided namespace. Fixes: 09822c3 ("configs: disallow ambiguous userns and timens configurations") Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
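The userns rules from the two commits above (mappings required without a path, a path plus differing mappings rejected, a path plus identical mappings tolerated with a warning) can be sketched roughly like this. The `idMap` type, `checkUsernsConfig`, and the message strings are hypothetical; the order-sensitive comparison is also an assumption of the sketch, not something the commit messages state.

```go
package main

import (
	"errors"
	"fmt"
)

// idMap is a hypothetical stand-in for one line of a uid_map or gid_map.
type idMap struct {
	ContainerID, HostID, Size int64
}

// sameMappings reports whether two mapping lists are identical, entry by
// entry and in the same order (an assumption of this sketch).
func sameMappings(a, b []idMap) bool {
	if len(a) != len(b) {
		return false
	}
	for i := range a {
		if a[i] != b[i] {
			return false
		}
	}
	return true
}

// checkUsernsConfig validates a userns configuration.
// configured holds the mappings from the config, actual holds the mappings
// read from the namespace at nsPath (unused when nsPath is empty).
// It returns a non-empty warning when the config is tolerated but incorrect.
func checkUsernsConfig(nsPath string, configured, actual []idMap) (warning string, err error) {
	switch {
	case nsPath == "" && len(configured) == 0:
		return "", errors.New("a new user namespace requires mappings")
	case nsPath == "" || len(configured) == 0:
		return "", nil // unambiguous: either create with mappings or join a path
	case sameMappings(configured, actual):
		return "both a userns path and mappings were given; the mappings match the namespace and are otherwise ignored", nil
	default:
		return "", errors.New("userns path given together with mappings that differ from that namespace")
	}
}

func main() {
	m := []idMap{{ContainerID: 0, HostID: 1000, Size: 65536}}
	warning, err := checkUsernsConfig("/proc/1/ns/user", m, m)
	fmt.Printf("warning=%q err=%v\n", warning, err)
}
```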

(This is a cherry-pick of 482e563.)

Using ints for all of our mapping structures means that a 32-bit binary errors out when trying to parse /proc/self/*id_map:

  failed to cache mappings for userns: failed to parse uid_map of userns /proc/1/ns/user: parsing id map failed: invalid format in line " 0 0 4294967295": integer overflow on token 4294967295

This issue was unearthed by commit 1912d59 ("*: actually support joining a userns with a new container") but the underlying issue has been present since the docker/libcontainer days.

In theory, switching to uint32 (to match the spec) instead of int64 would also work, but keeping everything signed seems much less error-prone. It's also important to note that a mapping might be too large for an int on 32-bit, so we detect this during the mapping.

Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
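To make the overflow concrete, here is a minimal sketch of parsing id_map content into int64 fields and then checking whether a value would still fit a 32-bit int. The `idMap` type and the `parseIDMap` and `fitsInt32` helpers are hypothetical names for illustration and are not runc's parser.

```go
package main

import (
	"bufio"
	"fmt"
	"math"
	"strconv"
	"strings"
)

// idMap is a hypothetical stand-in for one id_map line, using int64 so that
// values such as 4294967295 parse on every architecture.
type idMap struct {
	ContainerID, HostID, Size int64
}

// parseIDMap parses the contents of a uid_map or gid_map file, where each
// line holds three whitespace-separated decimal numbers.
func parseIDMap(content string) ([]idMap, error) {
	var maps []idMap
	sc := bufio.NewScanner(strings.NewReader(content))
	for sc.Scan() {
		fields := strings.Fields(sc.Text())
		if len(fields) == 0 {
			continue // skip blank lines
		}
		if len(fields) != 3 {
			return nil, fmt.Errorf("invalid id map line %q", sc.Text())
		}
		var vals [3]int64
		for i, f := range fields {
			v, err := strconv.ParseInt(f, 10, 64)
			if err != nil {
				return nil, fmt.Errorf("invalid id map line %q: %w", sc.Text(), err)
			}
			vals[i] = v
		}
		maps = append(maps, idMap{ContainerID: vals[0], HostID: vals[1], Size: vals[2]})
	}
	return maps, sc.Err()
}

// fitsInt32 reports whether v can be stored in a 32-bit signed int, which is
// the size of Go's int on 32-bit platforms.
func fitsInt32(v int64) bool {
	return v >= math.MinInt32 && v <= math.MaxInt32
}

func main() {
	maps, err := parseIDMap("         0          0 4294967295\n")
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	for _, m := range maps {
		fmt.Printf("%+v fits 32-bit int: %v\n", m, fitsInt32(m.Size))
	}
}
```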

[1.1] *: fix several issues with userns path handling

Signed-off-by: lfbzhm <lifubang@acmcoder.com> Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>

lfbzhm (2):
  VERSION: back to development
  VERSION: release 1.1.11

LGTMs: AkihiroSuda cyphar

Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>

Aleksa Sarai (2):
  keyring: update AkihiroSuda key expiry
  keyring: update cyphar@cyphar.com key expiry

LGTMs: AkihiroSuda lifubang

(This is a cherry-pick of 937ca10.) Signed-off-by: hang.jiang <hang.jiang@daocloud.io> Fixes: GHSA-xr7r-f8xq-vfvv CVE-2024-21626 Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>

If a file descriptor of a directory in the host's mount namespace is leaked to runc init, a malicious config.json could use /proc/self/fd/... as a working directory to allow for host filesystem access after the container runs. This can also be exploited by a container process if it knows that an administrator will use "runc exec --cwd" and the target --cwd (the attacker can change that cwd to be a symlink pointing to /proc/self/fd/... and wait for the process to exec and then snoop on /proc/$pid/cwd to get access to the host). The former issue can lead to a critical vulnerability in Docker and Kubernetes, while the latter is a container breakout. We can (ab)use the fact that getcwd(2) on Linux detects this exact case, and getcwd(3) and Go's Getwd() return an error as a result. Thus, if we just do os.Getwd() after chdir we can easily detect this case and error out. In runc 1.1, a /sys/fs/cgroup handle happens to be leaked to "runc init", making this exploitable. On runc main it just so happens that the leaked /sys/fs/cgroup gets clobbered and thus this is only consistently exploitable for runc 1.1. Fixes: GHSA-xr7r-f8xq-vfvv CVE-2024-21626 Co-developed-by: lifubang <lifubang@acmcoder.com> Signed-off-by: lifubang <lifubang@acmcoder.com> [refactored the implementation and added more comments] Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
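The check described above, calling os.Getwd() right after chdir and treating a failure as a sign that the working directory is not reachable from the current root, can be sketched as follows. The `chdirVerified` helper and its error text are hypothetical and are not the runc implementation.

```go
package main

import (
	"fmt"
	"os"
)

// chdirVerified changes into dir and then asks for the current working
// directory. On Linux, getcwd(2) reports a directory outside the current
// root in a way that makes Go's os.Getwd return an error, so a failure here
// is treated as a possible breakout attempt.
func chdirVerified(dir string) error {
	if err := os.Chdir(dir); err != nil {
		return fmt.Errorf("chdir to %q: %w", dir, err)
	}
	if _, err := os.Getwd(); err != nil {
		return fmt.Errorf("working directory %q is not reachable from the current root: %w", dir, err)
	}
	return nil
}

func main() {
	if err := chdirVerified("/"); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println("cwd verified")
}
```

In an ordinary process the check simply passes; it only fails in the leaked-descriptor scenario the commit message describes.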

(This is a partial backport of a minor change included in commit dac4171.) This mirrors the logic in standard_init_linux.go, and also ensures that we do not call exec.LookPath in the final execve step. While this is okay for regular binaries, it seems exec.LookPath calls os.Getenv which tries to emit a log entry to the test harness when running in "go test" mode. In a future patch (in order to fix CVE-2024-21626), we will close all of the file descriptors immediately before execve, which would mean the file descriptor for test harness logging would be closed at execve time. So, moving exec.LookPath earlier is necessary. Ref: dac4171 ("runc-dmz: reduce memfd binary cloning cost with small C binary") Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>

If we leak a file descriptor referencing the host filesystem, an attacker could use a /proc/self/fd magic-link as the source for execve to execute a host binary in the container. This would allow the binary itself (or a process inside the container in the 'runc exec' case) to write to a host binary, leading to a container escape. The simple solution is to make sure we close all file descriptors immediately before the execve(2) step. Doing this earlier can lead to very serious issues in Go (as file descriptors can be reused, any (*os.File) reference could start silently operating on a different file) so we have to do it as late as possible. Unfortunately, there are some Go runtime file descriptors that we must not close (otherwise the Go scheduler panics randomly). The only way of being sure which file descriptors cannot be closed is to sneakily go:linkname the runtime internal "internal/poll.IsPollDescriptor" function. This is almost certainly not recommended but there isn't any other way to be absolutely sure, while also closing any other possible files. In addition, we can keep the logrus forwarding logfd open because you cannot execve a pipe and the contents of the pipe are so restricted (JSON-encoded in a format we pick) that it seems unlikely you could even construct shellcode. Closing the logfd causes issues if there is an error returned from execve. In mainline runc, runc-dmz protects us against this attack because the intermediate execve(2) closes all of the O_CLOEXEC internal runc file descriptors and thus runc-dmz cannot access them to attack the host. Fixes: GHSA-xr7r-f8xq-vfvv CVE-2024-21626 Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>

We auto-close this file descriptor in the final exec step, but it's probably a good idea to not possibly leak the file descriptor to "runc init" (we've had issues like this in the past) especially since it is a directory handle from the host mount namespace. In practice, on runc 1.1 this does leak to "runc init" but on main the handle has a low enough file descriptor that it gets clobbered by the ForkExec of "runc init". OPEN_TREE_CLONE would let us protect this handle even further, but the performance impact of creating an anonymous mount namespace is probably not worth it. Also, switch to using an *os.File for the handle so if it goes out of scope during setup (i.e. an error occurs during setup) it will get cleaned up by the GC. Fixes: GHSA-xr7r-f8xq-vfvv CVE-2024-21626 Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>

Given the core issue in GHSA-xr7r-f8xq-vfvv was that we were unknowingly leaking file descriptors to "runc init", it seems prudent to make sure we proactively prevent this in the future. The solution is to simply mark all non-stdio file descriptors as O_CLOEXEC before we spawn "runc init". For libcontainer library users, this could result in unrelated files being marked as O_CLOEXEC -- however (for the same reason we are doing this for runc), for security reasons those files should've been marked as O_CLOEXEC anyway. Fixes: GHSA-xr7r-f8xq-vfvv CVE-2024-21626 Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
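One way to mark every non-stdio descriptor close-on-exec is to walk /proc/self/fd, as in the sketch below. It assumes Linux with /proc mounted, and the `closeOnExecFrom`, `isCloseOnExec`, and `dupWithoutCloexec` helpers are hypothetical names; this is an illustration of the idea and not the code used in runc.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"syscall"
)

// closeOnExecFrom sets FD_CLOEXEC on every open descriptor >= minFd by
// listing /proc/self/fd. The descriptor used for the listing itself is
// already closed when the loop runs; setting the flag on it is a no-op.
func closeOnExecFrom(minFd int) error {
	entries, err := os.ReadDir("/proc/self/fd")
	if err != nil {
		return err
	}
	for _, e := range entries {
		fd, err := strconv.Atoi(e.Name())
		if err != nil || fd < minFd {
			continue
		}
		syscall.CloseOnExec(fd)
	}
	return nil
}

// isCloseOnExec reports whether FD_CLOEXEC is set on fd.
func isCloseOnExec(fd int) (bool, error) {
	flags, _, errno := syscall.Syscall(syscall.SYS_FCNTL, uintptr(fd), syscall.F_GETFD, 0)
	if errno != 0 {
		return false, errno
	}
	return flags&syscall.FD_CLOEXEC != 0, nil
}

// dupWithoutCloexec duplicates fd with dup(2), which does not copy the
// close-on-exec flag, to simulate a descriptor that would leak across exec.
func dupWithoutCloexec(fd int) (int, error) {
	return syscall.Dup(fd)
}

func main() {
	fd, err := dupWithoutCloexec(2)
	if err != nil {
		fmt.Fprintln(os.Stderr, "dup:", err)
		os.Exit(1)
	}
	before, _ := isCloseOnExec(fd)
	if err := closeOnExecFrom(3); err != nil {
		fmt.Fprintln(os.Stderr, "closeOnExecFrom:", err)
		os.Exit(1)
	}
	after, _ := isCloseOnExec(fd)
	fmt.Printf("fd %d close-on-exec before=%v after=%v\n", fd, before, after)
}
```

Starting at descriptor 3 leaves stdin, stdout, and stderr untouched, which matches the "non-stdio" wording of the commit message.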

We close the logfd before execve so there's no need to special case it. In addition, it turns out that (*os.File).Fd() doesn't handle the case where the file was closed and so it seems suspect to use that kind of check. Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>

This is a security fix for CVE-2024-21626. See the advisory[1] for more details.

Aleksa Sarai (6):
  init: don't special-case logrus fds
  libcontainer: mark all non-stdio fds O_CLOEXEC before spawning init
  cgroup: plug leaks of /sys/fs/cgroup handle
  init: close internal fds before execve
  setns init: do explicit lookup of execve argument early
  init: verify after chdir that cwd is inside the container

Hang Jiang (1):
  Fix File to Close

[1]: GHSA-xr7r-f8xq-vfvv

Fixes: GHSA-xr7r-f8xq-vfvv CVE-2024-21626
LGTMs: cyphar AkihiroSuda kolyshkin lifubang

Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>

runc: release 1.1.12

Signed-off-by: Ameya Gawde <agawde@mirantis.com>

dependabot Bot and others added 29 commits December 11, 2023 12:24

Merge pull request opencontainers#4140 from thaJeztah/1.1_backport_up…

4f13093

…date_securejoin [1.1 backport] build(deps): bump github.com/cyphar/filepath-securejoin

Merge pull request opencontainers#4131 from harche/backport

75d99b4

[1.1] feat: add swapOnlyUsage in MemoryStats

Merge pull request opencontainers#4144 from cyphar/1.1-ns-path-handling

930fde5

[1.1] *: fix several issues with userns path handling

VERSION: release 1.1.11

4bccb38

Signed-off-by: lfbzhm <lifubang@acmcoder.com> Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>

VERSION: back to development

7887736

Signed-off-by: lfbzhm <lifubang@acmcoder.com> Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>

merge opencontainers#4160 into opencontainers/runc:release-1.1

452f520

lfbzhm (2):
  VERSION: back to development
  VERSION: release 1.1.11

LGTMs: AkihiroSuda cyphar

keyring: update cyphar@cyphar.com key expiry

d561e5d

Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>

keyring: update AkihiroSuda key expiry

d0b1a37

Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>

merge opencontainers#4177 into opencontainers/runc:release-1.1

099ff69

Aleksa Sarai (2):
  keyring: update AkihiroSuda key expiry
  keyring: update cyphar@cyphar.com key expiry

LGTMs: AkihiroSuda lifubang

Fix File to Close

506552a

(This is a cherry-pick of 937ca10.) Signed-off-by: hang.jiang <hang.jiang@daocloud.io> Fixes: GHSA-xr7r-f8xq-vfvv CVE-2024-21626 Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>

VERSION: release 1.1.12

51d5e94

Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>

VERSION: back to development

29d6d87

Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>

Merge pull request from GHSA-xr7r-f8xq-vfvv

a9833ff

runc: release 1.1.12

Merge remote-tracking branch 'Mirantis/release-1.1-m' into sync-0216

41acd13

Bump version to Mirantis ver 1.1.12-m1

13cb2b0

Signed-off-by: Ameya Gawde <agawde@mirantis.com>

ameyag requested a review from aepifanov February 16, 2024 21:52

ameyag merged commit 77d7dd5 into Mirantis:release-1.1-m Feb 19, 2024

ameyag deleted the sync-0216 branch February 19, 2024 21:31

ameyag merged 29 commits into Mirantis:release-1.1-m from ameyag:sync-0216
