Enable debugger support for containers #315

ishitatsuyuki · 2018-04-03T13:48:10Z

(untested)

jankeromnes · 2018-04-03T15:40:21Z

Thanks a lot for working on enabling rr in Janitor containers!

@jld could you please confirm that adding CAP_SYS_PTRACE and using seccomp=unconfined on all our Docker containers isn't too risky from a security standpoint? (Taking into account that our users are sudo in their containers, but we don't want them to be root in our Docker servers.)

@rocallahan does adding CAP_SYS_PTRACE and using seccomp=unconfined seem correct for allowing rr to work inside Docker containers? (Assuming that /proc/sys/kernel/perf_event_paranoid is already 1 on the server.)

jankeromnes · 2018-04-03T15:43:47Z

lib/docker.js

+      HostConfig: {
+        CapAdd: ['SYS_PTRACE'],
+        PortBindings: {},
+        SecurityOpt: ['seccomp=unconfined']


I don't know much about seccomp, but isn't unconfined a bit drastic? Couldn't we instead tweak the seccomp profile (a.k.a. the seccomp policy JSON file) to enable just the required syscalls and nothing more, as suggested in #232 (comment)?

ptrace allows seccomp to be bypassed, so constraining anything doesn't improve security.

ptrace allows seccomp to be bypassed

That used to be true but I don't think it is anymore.

Looks like you're right; the hole was patched for each architecture and got testing with torvalds/linux@58d0a86.

To sum up: I don't see any significant security issues with unconfined, and maintaining a policy file is hard.

rocallahan · 2018-04-03T19:58:45Z

does adding CAP_SYS_PTRACE and using seccomp=unconfined seem correct for allowing rr to work inside Docker containers?

Yes.

jankeromnes · 2018-04-06T17:04:03Z

Before we move forward with this, I'd like to be certain that adding CAP_SYS_PTRACE and using seccomp=unconfined on all Janitor's Docker containers isn't too risky from a security standpoint.

Ping @jld ?

jankeromnes · 2018-04-27T06:26:58Z

I'd really like to find a solution where we only enable the syscalls actually needed for rr, instead of all 44 potentially dangerous syscalls that are blocked by default.

The default seccomp profile includes these lines:

		{
			"names": [
				"kcmp",
				"process_vm_readv",
				"process_vm_writev",
				"ptrace"
			],
			"action": "SCMP_ACT_ALLOW",
			"args": [],
			"comment": "",
			"includes": {
				"caps": [
					"CAP_SYS_PTRACE"
				]
			},
			"excludes": {}
		}

Am I interpreting these correctly if I conclude that adding CAP_SYS_PTRACE will also make seccomp allow the syscalls ptrace, kcmp, process_vm_readv and process_vm_writev? EDIT: No, these syscalls are blocked by both seccomp and the dropped SYS_PTRACE.

@Martiusweb speculates that rr only needs ptrace, perf_event_open and maybe also userfaultfd, and he suggests that we try enabling these first, and iterate on any missing syscalls afterwards.

My preferred outcome would be to fork the default seccomp profile to make rr work, and propose our changes upstream or rebase our fork from time to time.

rocallahan · 2018-04-27T06:34:56Z

@Martiusweb speculates that rr only needs ptrace, perf_event_open and maybe also userfaultfd, and he suggests that we try enabling these first, and iterate on any missing syscalls afterwards.

That should work.

We don't need userfaultfd.

jankeromnes · 2018-04-27T06:54:59Z

So here are the syscalls used by rr during a very simple record & replay session, and their seccomp specificities if any:

access: (allowed)
arch_prctl: "arches": [ "amd64", "x32" ]
bind: (allowed)
brk: (allowed)
chmod: (allowed)
clone: "caps": [ "CAP_SYS_ADMIN" ]
close: (allowed)
execve: (allowed)
exit_group: (allowed)
fcntl: (allowed)
fstat: (allowed)
ftruncate: (allowed)
futex: (allowed)
getrlimit: (allowed)
getsockopt: (allowed)
getxattr: (allowed)
ioctl: (allowed)
kill: (allowed)
link: (allowed)
listen: (allowed)
lseek: (allowed)
lstat: (allowed)
mkdir: (allowed)
mmap: (allowed)
mprotect: (allowed)
munmap: (allowed)
open: (allowed)
perf_event_open: "caps": [ "CAP_SYS_ADMIN" ]
pipe2: (allowed)
poll: (allowed)
prctl: (allowed)
pread64: (allowed)
ptrace: "caps": [ "CAP_SYS_PTRACE" ]
pwrite64: (allowed)
read: (allowed)
readlink: (allowed)
recvmsg: (allowed)
rename: (allowed)
rt_sigaction: (allowed)
rt_sigprocmask: (allowed)
rt_sigreturn: (allowed)
sched_setaffinity: (allowed)
set_robust_list: (allowed)
set_tid_address: (allowed)
setitimer: (allowed)
setrlimit: (allowed)
socket: (allowed)
stat: (allowed)
statfs: (allowed)
symlink: (allowed)
syscall_446: ???
uname: (allowed)
unlink: (allowed)
wait4: (allowed)
write: (allowed)

So I conclude that to make rr work inside Docker containers, we need to enable the following 3 syscalls (annotated by the reason Docker blocks it by default):

clone: Deny cloning new namespaces. Also gated by CAP_SYS_ADMIN for CLONE_* flags, except CLONE_USERNS.
perf_event_open: Tracing/profiling syscall, which could leak a lot of information on the host.
ptrace: Tracing/profiling syscall, which could leak a lot of information on the host. Already blocked by dropping CAP_PTRACE.

(I have no idea what syscall_446 is all about.)

rocallahan · 2018-04-27T07:22:40Z

I don't think you need to worry about clone. rr does not create new namespaces (though it supports recording children that do). It uses clone to create processes and threads, but I'm sure your seccomp policy would allow that.

Don't worry about 446 (and rr's other internal syscalls, 442 to 447). rr uses them internally and I think blocking them via seccomp will not affect rr's usage.

rocallahan · 2018-04-27T07:23:23Z

Though you might need to whitelist them.

jankeromnes · 2018-04-27T07:27:46Z

Awesome, thanks @rocallahan for clarifying!

I think we can reasonably tweak Docker's default seccomp profile to authorize perf_event_open and ptrace, in addition to adding CAP_SYS_PTRACE.

ishitatsuyuki · 2018-04-27T08:15:29Z

There is a large obstacle here: passing custom seccomp profile requires a file on the Docker host, but currently we operate remotely without an agent.

jankeromnes · 2018-04-27T08:22:36Z

There is a large obstacle here: passing custom seccomp profile requires a file on the Docker host, but currently we operate remotely without an agent.

Well, we do have a Janitor clone on each Docker host, which we use to properly configure Docker (and also run an authenticated Docker proxy). Maybe we can commit a custom seccomp profile somewhere in the Janitor repository, update our host's checkouts, and point our Docker daemons to that file?

tschneidereit · 2018-08-26T14:19:54Z

@jankeromnes has there been any movement on this? Having rr support would be fantastic, but what I currently really need is the ability to set breakpoints in gdb, which unfortunately also requires ptrace and at least some seccomp changes.

jankeromnes · 2018-08-27T13:53:44Z

@tschneidereit Hello!

We can already add CAP_SYS_PTRACE as a first step.

Unfortunately, we're still blocked on creating a fork of Docker's default seccomp profile which just enables these two additional syscalls:

perf_event_open
ptrace

Note: This pull request would indeed allow rr to work (manually tested with @whimboo), but I believe that using seccomp=unconfined (i.e. completely disabling seccomp) is too risky from a security standpoint. (But maybe I'm wrong, e.g. I believe that Kubernetes disables Docker's seccomp altogether, or did at some point.)

nikopen · 2018-11-14T13:11:25Z

hi @jankeromnes , this is the most relevant resource on the nets on how safe SYS_PTRACE is for containers, or so it seems.

Kubernetes indeed allows to manually allow / apply any capabilities to either profiles or directly to pods/containers, bypassing naughty Docker.

But, is there a definitive answer on the safety of allowing SYS_PTRACE for all containers of a cluster? Are there any security risks? Also, is there any difference on CAP_SYS_PTRACE and SYS_PTRACE?

jankeromnes · 2018-11-14T15:24:40Z

Hi @nikopen,

this is the most relevant resource on the nets on how safe SYS_PTRACE is for containers, or so it seems.

Hm, that's quite unfortunate, as it seems nobody here really understands the security implications of enabling these capabilities and syscalls in third-party controlled containers on shared infrastructure.

is there any difference on CAP_SYS_PTRACE and SYS_PTRACE?

No, I believe these are the same thing (i.e. the capability or unit of Linux superuser privileges called "PTRACE").

Kubernetes indeed allows to manually allow / apply any capabilities to either profiles or directly to pods/containers, bypassing naughty Docker.

Thank you for this information. I guess this means that Kubernetes disables Docker's seccomp because it already has its own permission checking mechanism (presumably with similar "safe defaults" regarding allowed or forbidden capabilities, in which case I'd love to know if they allow the CAP_SYS_PTRACE capability or the perf_event_open and ptrace syscalls by default).

But, is there a definitive answer on the safety of allowing SYS_PTRACE for all containers of a cluster? Are there any security risks?

Unfortunately, we don't have a definitive answer for this. From my (limited) understanding, it seems that in the past, someone made a proof of concept using the PTRACE capability to "escape" a Docker container. I also believe (without proof) that this vulnerability has since been fixed, because the security of Docker's unprivileged containers has been improved, and I'm not aware of any recent PTRACE-related container escapes. However, I could be awfully wrong here, and I'd love to hear from someone more knowledgeable about this.

nikopen · 2018-11-14T16:42:18Z

Thank you for this information. I guess this means that Kubernetes disables Docker's seccomp because it already has its own permission checking mechanism (presumably with similar "safe defaults" regarding allowed or forbidden capabilities, in which case I'd love to know if they allow the CAP_SYS_PTRACE capability or the perf_event_open and ptrace syscalls by default).

Indeed, k8s disables Docker's seccomp, and it can be re-enabled via a specific annotation. The default ServiceAccount on an unconfigured 'vanilla' k8s installation is very open - allows pretty much everything, and then granular capabilities can be defined either via securityContext in a pod/deployment or via PodSecurityPolicies, to have unprivileged pods but allowing SYS_PTRACE et cetera.

Unfortunately, we don't have a definitive answer for this. From my (limited) understanding, it seems that in the past, someone made a proof of concept using the PTRACE capability to "escape" a Docker container. I also believe (without proof) that this vulnerability has since been fixed, because the security of Docker's unprivileged containers has been improved, and I'm not aware of any recent PTRACE-related container escapes. However, I could be awfully wrong here, and I'd love to hear from someone more knowledgeable about this.

That's interesting, is it related with this commit? torvalds/linux@58d0a86
I can see it entered the 4.8 kernel. I could mail the k8s security team to learn more about this domain, maybe they have something easily reachable to share.

rocallahan · 2018-11-14T21:27:04Z

That's interesting, is it related with this commit? torvalds/linux@58d0a86

As the commit message says, in the past a process could use ptrace() to negate any seccomp() policy imposed on it, by changing the syscall number after the seccomp policy had run. Therefore if you imposed any seccomp policy at all, you had to block ptrace() as well. That particular issue is fixed since 4.8 so that is no longer a reason to block ptrace() on those kernels.

Another reason to block ptrace() is that it's just a really complicated API which provides a considerable amount of attack surface through which an attacker might be able to exploit zero-day kernel bugs. It makes sense that if your containers don't need ptrace(), you should block it. This will always be true I guess.

I don't know of any bugs or features of ptrace() that enable container escapes or other security policy violations. I think if any were known, they would be treated as security bugs and fixed. At this point I think it should be treated like any other API that is thought to be safe but adds significant attack surface.

jankeromnes · 2019-05-06T15:15:53Z

Update:

Docker's default seccomp profile now seems to authorize the ptrace syscall if kernel >= 4.8:

{
	"names": [
		"ptrace"
	],
	"action": "SCMP_ACT_ALLOW",
	"args": null,
	"comment": "",
	"includes": {
		"minKernel": "4.8"
	},
	"excludes": {}
},

(Source.)

If my interpretation is correct, this means that in order to allow the ptrace syscall in Docker containers, you no longer need a custom seccomp profile -- just a recent Docker (that has moby/moby@1124543) and Linux kernel (>= 4.8).

However, as far as I know, you still need CAP_SYS_PTRACE for ptrace to actually work in containers.

Also, a custom seccomp profile is still required here, because rr also needs perf_event_open (which is still blocked by Docker's default seccomp profile).

Enable debugger support for containers

f480a6b

ishitatsuyuki requested a review from jankeromnes April 3, 2018 13:48

jankeromnes reviewed Apr 3, 2018

View reviewed changes

jankeromnes added the Docker label Apr 3, 2018

jankeromnes mentioned this pull request Sep 5, 2018

[Chromium] run failure w/ No usable sandbox #383

Open

jankeromnes mentioned this pull request Nov 16, 2018

Enable debugger support for containers #399

Merged

jankeromnes mentioned this pull request May 6, 2019

GDB isn't working gitpod-io/gitpod#544

Closed

jankeromnes mentioned this pull request May 2, 2022

Support rr debugger (record-replay) by allowing the syscall perf_event_open in Gitpod workspaces gitpod-io/gitpod#9687

Open

3 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Enable debugger support for containers #315

Enable debugger support for containers #315

ishitatsuyuki commented Apr 3, 2018

jankeromnes commented Apr 3, 2018

jankeromnes Apr 3, 2018

ishitatsuyuki Apr 3, 2018

rocallahan Apr 3, 2018

ishitatsuyuki Apr 4, 2018

ishitatsuyuki Apr 4, 2018 •

edited

rocallahan commented Apr 3, 2018

jankeromnes commented Apr 6, 2018

jankeromnes commented Apr 27, 2018 •

edited

rocallahan commented Apr 27, 2018

jankeromnes commented Apr 27, 2018

rocallahan commented Apr 27, 2018

rocallahan commented Apr 27, 2018

jankeromnes commented Apr 27, 2018

ishitatsuyuki commented Apr 27, 2018

jankeromnes commented Apr 27, 2018 •

edited

tschneidereit commented Aug 26, 2018

jankeromnes commented Aug 27, 2018 •

edited

nikopen commented Nov 14, 2018 •

edited

jankeromnes commented Nov 14, 2018

nikopen commented Nov 14, 2018 •

edited

rocallahan commented Nov 14, 2018

jankeromnes commented May 6, 2019 •

edited

Enable debugger support for containers #315

Are you sure you want to change the base?

Enable debugger support for containers #315

Conversation

ishitatsuyuki commented Apr 3, 2018

jankeromnes commented Apr 3, 2018

jankeromnes Apr 3, 2018

Choose a reason for hiding this comment

ishitatsuyuki Apr 3, 2018

Choose a reason for hiding this comment

rocallahan Apr 3, 2018

Choose a reason for hiding this comment

ishitatsuyuki Apr 4, 2018

Choose a reason for hiding this comment

ishitatsuyuki Apr 4, 2018 • edited

Choose a reason for hiding this comment

rocallahan commented Apr 3, 2018

jankeromnes commented Apr 6, 2018

jankeromnes commented Apr 27, 2018 • edited

rocallahan commented Apr 27, 2018

jankeromnes commented Apr 27, 2018

rocallahan commented Apr 27, 2018

rocallahan commented Apr 27, 2018

jankeromnes commented Apr 27, 2018

ishitatsuyuki commented Apr 27, 2018

jankeromnes commented Apr 27, 2018 • edited

tschneidereit commented Aug 26, 2018

jankeromnes commented Aug 27, 2018 • edited

nikopen commented Nov 14, 2018 • edited

jankeromnes commented Nov 14, 2018

nikopen commented Nov 14, 2018 • edited

rocallahan commented Nov 14, 2018

jankeromnes commented May 6, 2019 • edited

ishitatsuyuki Apr 4, 2018 •

edited

jankeromnes commented Apr 27, 2018 •

edited

jankeromnes commented Apr 27, 2018 •

edited

jankeromnes commented Aug 27, 2018 •

edited

nikopen commented Nov 14, 2018 •

edited

nikopen commented Nov 14, 2018 •

edited

jankeromnes commented May 6, 2019 •

edited