Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Enable debugger support for containers #315

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

ishitatsuyuki
Copy link
Contributor

(untested)

@jankeromnes
Copy link
Member

Thanks a lot for working on enabling rr in Janitor containers!

@jld could you please confirm that adding CAP_SYS_PTRACE and using seccomp=unconfined on all our Docker containers isn't too risky from a security standpoint? (Taking into account that our users are sudo in their containers, but we don't want them to be root in our Docker servers.)

@rocallahan does adding CAP_SYS_PTRACE and using seccomp=unconfined seem correct for allowing rr to work inside Docker containers? (Assuming that /proc/sys/kernel/perf_event_paranoid is already 1 on the server.)

HostConfig: {
CapAdd: ['SYS_PTRACE'],
PortBindings: {},
SecurityOpt: ['seccomp=unconfined']
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know much about seccomp, but isn't unconfined a bit drastic? Couldn't we instead tweak the seccomp profile (a.k.a. the seccomp policy JSON file) to enable just the required syscalls and nothing more, as suggested in #232 (comment)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ptrace allows seccomp to be bypassed, so constraining anything doesn't improve security.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ptrace allows seccomp to be bypassed

That used to be true but I don't think it is anymore.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like you're right; the hole was patched for each architecture and got testing with torvalds/linux@58d0a86.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To sum up: I don't see any significant security issues with unconfined, and maintaining a policy file is hard.

@rocallahan
Copy link

does adding CAP_SYS_PTRACE and using seccomp=unconfined seem correct for allowing rr to work inside Docker containers?

Yes.

@jankeromnes
Copy link
Member

Before we move forward with this, I'd like to be certain that adding CAP_SYS_PTRACE and using seccomp=unconfined on all Janitor's Docker containers isn't too risky from a security standpoint.

Ping @jld ?

@jankeromnes
Copy link
Member

jankeromnes commented Apr 27, 2018

I'd really like to find a solution where we only enable the syscalls actually needed for rr, instead of all 44 potentially dangerous syscalls that are blocked by default.

The default seccomp profile includes these lines:

		{
			"names": [
				"kcmp",
				"process_vm_readv",
				"process_vm_writev",
				"ptrace"
			],
			"action": "SCMP_ACT_ALLOW",
			"args": [],
			"comment": "",
			"includes": {
				"caps": [
					"CAP_SYS_PTRACE"
				]
			},
			"excludes": {}
		}

Am I interpreting these correctly if I conclude that adding CAP_SYS_PTRACE will also make seccomp allow the syscalls ptrace, kcmp, process_vm_readv and process_vm_writev? EDIT: No, these syscalls are blocked by both seccomp and the dropped SYS_PTRACE.

@Martiusweb speculates that rr only needs ptrace, perf_event_open and maybe also userfaultfd, and he suggests that we try enabling these first, and iterate on any missing syscalls afterwards.

My preferred outcome would be to fork the default seccomp profile to make rr work, and propose our changes upstream or rebase our fork from time to time.

@rocallahan
Copy link

@Martiusweb speculates that rr only needs ptrace, perf_event_open and maybe also userfaultfd, and he suggests that we try enabling these first, and iterate on any missing syscalls afterwards.

That should work.

We don't need userfaultfd.

@jankeromnes
Copy link
Member

So here are the syscalls used by rr during a very simple record & replay session, and their seccomp specificities if any:

access: (allowed)
arch_prctl: "arches": [ "amd64", "x32" ]
bind: (allowed)
brk: (allowed)
chmod: (allowed)
clone: "caps": [ "CAP_SYS_ADMIN" ]
close: (allowed)
execve: (allowed)
exit_group: (allowed)
fcntl: (allowed)
fstat: (allowed)
ftruncate: (allowed)
futex: (allowed)
getrlimit: (allowed)
getsockopt: (allowed)
getxattr: (allowed)
ioctl: (allowed)
kill: (allowed)
link: (allowed)
listen: (allowed)
lseek: (allowed)
lstat: (allowed)
mkdir: (allowed)
mmap: (allowed)
mprotect: (allowed)
munmap: (allowed)
open: (allowed)
perf_event_open: "caps": [ "CAP_SYS_ADMIN" ]
pipe2: (allowed)
poll: (allowed)
prctl: (allowed)
pread64: (allowed)
ptrace: "caps": [ "CAP_SYS_PTRACE" ]
pwrite64: (allowed)
read: (allowed)
readlink: (allowed)
recvmsg: (allowed)
rename: (allowed)
rt_sigaction: (allowed)
rt_sigprocmask: (allowed)
rt_sigreturn: (allowed)
sched_setaffinity: (allowed)
set_robust_list: (allowed)
set_tid_address: (allowed)
setitimer: (allowed)
setrlimit: (allowed)
socket: (allowed)
stat: (allowed)
statfs: (allowed)
symlink: (allowed)
syscall_446: ???
uname: (allowed)
unlink: (allowed)
wait4: (allowed)
write: (allowed)

So I conclude that to make rr work inside Docker containers, we need to enable the following 3 syscalls (annotated by the reason Docker blocks it by default):

  • clone: Deny cloning new namespaces. Also gated by CAP_SYS_ADMIN for CLONE_* flags, except CLONE_USERNS.
  • perf_event_open: Tracing/profiling syscall, which could leak a lot of information on the host.
  • ptrace: Tracing/profiling syscall, which could leak a lot of information on the host. Already blocked by dropping CAP_PTRACE.

(I have no idea what syscall_446 is all about.)

@rocallahan
Copy link

I don't think you need to worry about clone. rr does not create new namespaces (though it supports recording children that do). It uses clone to create processes and threads, but I'm sure your seccomp policy would allow that.

Don't worry about 446 (and rr's other internal syscalls, 442 to 447). rr uses them internally and I think blocking them via seccomp will not affect rr's usage.

@rocallahan
Copy link

Though you might need to whitelist them.

@jankeromnes
Copy link
Member

Awesome, thanks @rocallahan for clarifying!

I think we can reasonably tweak Docker's default seccomp profile to authorize perf_event_open and ptrace, in addition to adding CAP_SYS_PTRACE.

@ishitatsuyuki
Copy link
Contributor Author

There is a large obstacle here: passing custom seccomp profile requires a file on the Docker host, but currently we operate remotely without an agent.

@jankeromnes
Copy link
Member

jankeromnes commented Apr 27, 2018

There is a large obstacle here: passing custom seccomp profile requires a file on the Docker host, but currently we operate remotely without an agent.

Well, we do have a Janitor clone on each Docker host, which we use to properly configure Docker (and also run an authenticated Docker proxy). Maybe we can commit a custom seccomp profile somewhere in the Janitor repository, update our host's checkouts, and point our Docker daemons to that file?

@tschneidereit
Copy link

@jankeromnes has there been any movement on this? Having rr support would be fantastic, but what I currently really need is the ability to set breakpoints in gdb, which unfortunately also requires ptrace and at least some seccomp changes.

@jankeromnes
Copy link
Member

jankeromnes commented Aug 27, 2018

@tschneidereit Hello!

We can already add CAP_SYS_PTRACE as a first step.

Unfortunately, we're still blocked on creating a fork of Docker's default seccomp profile which just enables these two additional syscalls:

  • perf_event_open
  • ptrace

Note: This pull request would indeed allow rr to work (manually tested with @whimboo), but I believe that using seccomp=unconfined (i.e. completely disabling seccomp) is too risky from a security standpoint. (But maybe I'm wrong, e.g. I believe that Kubernetes disables Docker's seccomp altogether, or did at some point.)

@nikopen
Copy link

nikopen commented Nov 14, 2018

hi @jankeromnes , this is the most relevant resource on the nets on how safe SYS_PTRACE is for containers, or so it seems.

Kubernetes indeed allows to manually allow / apply any capabilities to either profiles or directly to pods/containers, bypassing naughty Docker.

But, is there a definitive answer on the safety of allowing SYS_PTRACE for all containers of a cluster? Are there any security risks? Also, is there any difference on CAP_SYS_PTRACE and SYS_PTRACE?

@jankeromnes
Copy link
Member

Hi @nikopen,

this is the most relevant resource on the nets on how safe SYS_PTRACE is for containers, or so it seems.

Hm, that's quite unfortunate, as it seems nobody here really understands the security implications of enabling these capabilities and syscalls in third-party controlled containers on shared infrastructure.

is there any difference on CAP_SYS_PTRACE and SYS_PTRACE?

No, I believe these are the same thing (i.e. the capability or unit of Linux superuser privileges called "PTRACE").

Kubernetes indeed allows to manually allow / apply any capabilities to either profiles or directly to pods/containers, bypassing naughty Docker.

Thank you for this information. I guess this means that Kubernetes disables Docker's seccomp because it already has its own permission checking mechanism (presumably with similar "safe defaults" regarding allowed or forbidden capabilities, in which case I'd love to know if they allow the CAP_SYS_PTRACE capability or the perf_event_open and ptrace syscalls by default).

But, is there a definitive answer on the safety of allowing SYS_PTRACE for all containers of a cluster? Are there any security risks?

Unfortunately, we don't have a definitive answer for this. From my (limited) understanding, it seems that in the past, someone made a proof of concept using the PTRACE capability to "escape" a Docker container. I also believe (without proof) that this vulnerability has since been fixed, because the security of Docker's unprivileged containers has been improved, and I'm not aware of any recent PTRACE-related container escapes. However, I could be awfully wrong here, and I'd love to hear from someone more knowledgeable about this.

@nikopen
Copy link

nikopen commented Nov 14, 2018

Thank you for this information. I guess this means that Kubernetes disables Docker's seccomp because it already has its own permission checking mechanism (presumably with similar "safe defaults" regarding allowed or forbidden capabilities, in which case I'd love to know if they allow the CAP_SYS_PTRACE capability or the perf_event_open and ptrace syscalls by default).

Indeed, k8s disables Docker's seccomp, and it can be re-enabled via a specific annotation. The default ServiceAccount on an unconfigured 'vanilla' k8s installation is very open - allows pretty much everything, and then granular capabilities can be defined either via securityContext in a pod/deployment or via PodSecurityPolicies, to have unprivileged pods but allowing SYS_PTRACE et cetera.

Unfortunately, we don't have a definitive answer for this. From my (limited) understanding, it seems that in the past, someone made a proof of concept using the PTRACE capability to "escape" a Docker container. I also believe (without proof) that this vulnerability has since been fixed, because the security of Docker's unprivileged containers has been improved, and I'm not aware of any recent PTRACE-related container escapes. However, I could be awfully wrong here, and I'd love to hear from someone more knowledgeable about this.

That's interesting, is it related with this commit? torvalds/linux@58d0a86
I can see it entered the 4.8 kernel. I could mail the k8s security team to learn more about this domain, maybe they have something easily reachable to share.

@rocallahan
Copy link

That's interesting, is it related with this commit? torvalds/linux@58d0a86

As the commit message says, in the past a process could use ptrace() to negate any seccomp() policy imposed on it, by changing the syscall number after the seccomp policy had run. Therefore if you imposed any seccomp policy at all, you had to block ptrace() as well. That particular issue is fixed since 4.8 so that is no longer a reason to block ptrace() on those kernels.

Another reason to block ptrace() is that it's just a really complicated API which provides a considerable amount of attack surface through which an attacker might be able to exploit zero-day kernel bugs. It makes sense that if your containers don't need ptrace(), you should block it. This will always be true I guess.

I don't know of any bugs or features of ptrace() that enable container escapes or other security policy violations. I think if any were known, they would be treated as security bugs and fixed. At this point I think it should be treated like any other API that is thought to be safe but adds significant attack surface.

@jankeromnes
Copy link
Member

jankeromnes commented May 6, 2019

Update:

Docker's default seccomp profile now seems to authorize the ptrace syscall if kernel >= 4.8:

{
	"names": [
		"ptrace"
	],
	"action": "SCMP_ACT_ALLOW",
	"args": null,
	"comment": "",
	"includes": {
		"minKernel": "4.8"
	},
	"excludes": {}
},

(Source.)

If my interpretation is correct, this means that in order to allow the ptrace syscall in Docker containers, you no longer need a custom seccomp profile -- just a recent Docker (that has moby/moby@1124543) and Linux kernel (>= 4.8).

However, as far as I know, you still need CAP_SYS_PTRACE for ptrace to actually work in containers.

Also, a custom seccomp profile is still required here, because rr also needs perf_event_open (which is still blocked by Docker's default seccomp profile).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants