support cgroup v2 (unified hierarchy) #654

sols1 · 2016-03-17T20:32:34Z

cgroup v2 (unified hierarchy) is now official in 4.5:

https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=34a9304a96d6351c2d35dcdc9293258378fc0bd8

cgroup v2 should have more sensible behavior:

https://www.youtube.com/watch?v=PzpG40WiEfM

moby/moby#16238

cyphar · 2016-03-22T01:17:21Z

cgroupv2 still doesn't support many of the cgroup controllers we need for runc. The most important one is the device "cgroup", which is a hard requirement for security. As far as I can see, CPU still hasn't been implemented either. Also, many of the other cgroups provide us with protections against other resource exhaustion attacks.

sols1 · 2016-04-27T00:03:50Z

It is possible to do cgroup v2 for some controllers and cgroup v1 for others, which are still not available for cgroup v2.

Memory is the most difficult resource to manage and that's what is fixed in cgroup v2.

The device cgroup seems to be fairly straightforward to convert to cgroup v2: add device permissions to existing single hierarchy.

cyphar · 2016-04-27T00:19:49Z

The other issue is that we need to be running on a distribution which supports cgroupv2 as the default setup with systemd (which is essentially none of them). We can't really use cgroupv2 otherwise because it would require either:

Moving all of the processes in the system to the v2 equivalent. But because of the internal node (and threadgroup) constraints this won't be pretty and we'd be changing distro policy.
Moving just the subtree to the v2 equivalent. While this is technically allowed, the documentation makes it clear that it's a development tool and shouldn't be used for production purposes.

For me, one of the biggest benefits of cgroupv2 is that cgroup namespaces make more sense on v2. Unfortunately, cgroup namespaces don't implement features that would make them useful at the moment (see #774 and #781). So there's that.

And yes, we can use both v2 and v1 at the same time, but that doesn't make the implementation any nicer (now we'd have to use two managers with two different "cgroup paths").

rodionos · 2016-05-01T20:21:39Z

For context, Ubuntu 16 LTS is on kernel version 4.4
https://wiki.ubuntu.com/XenialXerus/ReleaseNotes#Linux_kernel_4.4

sols1 · 2016-05-24T19:34:52Z

Not sure I understand all the issues related to cgroup namespaces. It would be nice to resolve all conceptual issues before doing this but for practical production use of containers resource management is a big issue and memory is the most difficult resource to manage because of its "non-renewable" nature so to speak.

For example, Paralles/Virtuozzo used containers in production for 10+ years and they ended up back porting memory cgroup v2 to the old kernel that they used (RHEL6, if I'm not mistaken).

Also, as far as I understand Google used containers in production for a long time and they had some kernel patches to deal with memory accounting and management.

cyphar · 2016-05-24T22:16:59Z

@sols1

Not sure I understand all the issues related to cgroup namespaces. It would be nice to resolve all conceptual issues before doing this but ...

cgroup namespaces was a benefit of cgroupv2 😉. The general issue with cgroupv2 is that there just aren't enough controllers enabled for us to be able to use it properly (at a minimum, we'd need the freezer and device cgroups), and using both cgroupv2 and cgroupv1 together will make the implementation more complicated than it needs to be. On the plus side, we don't need the net_* controllers in cgroupv2 (they won't ever be added to cgroupv2) because you can now specify iptable rules by cgroup path (which AFAIK is namespaced by cgroups).

I'd be happy to work on kernel patches to add support for the controllers, but I'd recommend pushing upstream to get more controllers enabled for cgroupv2 -- they just aren't feature complete for us right now and I don't feel good about adding hacks to our cgroup management implementation to deal with cgroupv2's shortcomings.

but for practical production use of containers resource management is a big issue and memory is the most difficult resource to manage because of its "non-renewable" nature so to speak.

I understand, but there's also the problem that I'm not sure how we could test our use of cgroupv2 because systemd uses the cgroupv1 hierarchy on almost every distribution (I tried to switch to cgroupv2 on my laptop while my system was running -- it did not end well).

justincormack · 2016-10-19T10:54:29Z

@cyphar we are in the merge window for 4.9 which will be next LTS, so it is getting quite late to get support in for the next few years for most distros - any chance of looking at the kernel patches?

I am happy to help testing, it should be fairly easy on Alpine Linux as it does not use systemd so can change more easily.

sols1 · 2016-10-19T17:18:49Z

RancherOS (https://github.com/rancher/os) is another option. It does not use systemd and even systemd emulation was removed AFAIK.

cyphar · 2016-10-20T08:22:40Z

I haven't really had a chance to work on kernel patches recently. However, I did try a few months ago to implement freezer so it worked with cgroupv2 -- as far as I can tell it's not really that trivial to do. Namely there are some edge cases that made the handling non-clear. And I looked at the devices code but its quite a bit more complicated than the freezer code.

I might take look sometime next month, but I can't really guarantee anything (I've been swamped quite recently).

hustcat · 2016-11-29T07:00:22Z

Buffer io throttle is another biggest benefits of cgroupv2.

rhatdan · 2017-01-09T16:55:00Z

Rawhide just moved to CgroupV2. Causing docker/runc to blow up.

https://bugzilla.redhat.com/show_bug.cgi?id=1411286

docker run -ti fedora bash
/usr/bin/docker-current: Error response from daemon: invalid header field value "oci runtime error: container_linux.go:247: starting container process caused \"process_linux.go:359: container init caused \\\"rootfs_linux.go:54: mounting \\\\\\\"cgroup\\\\\\\" to rootfs \\\\\\\"/var/lib/docker/overlay/e1432a26e33bebbc27619c9802d9218f3da8938b7f1696ca9be0890a2e75ac65/merged\\\\\\\" at \\\\\\\"/sys/fs/cgroup\\\\\\\" caused \\\\\\\"no subsystem for mount\\\\\\\"\\\"\"\n".

uname -r

4.10.0-0.rc2.git4.1.fc26.x86_64

config: Bring "unique... within this map" back together

This condition landed in 27a05de (Add text about extensions, 2016-06-26, opencontainers#510) with subsequent wording tweaks in 3f0440b (config.md: add empty limit for key of annotations, Dec 28 10:35:19 2016, opencontainers#645) and 2c8feeb (config: Bring "unique... within this map" back together, 2017-01-12, opencontainers#654). However, since eeaccfa (glossary: Make objects explicitly unordered and forbid duplicate names, 2016-09-27, opencontainers#584) we forbid duplicate keys on *all* objects (not just annotations), so this PR removes the redundant annotation-specific condition. Signed-off-by: W. Trevor King <wking@tremily.us>

webczat · 2017-10-06T17:52:59Z

isn't cpu controller merged for 4.14 already?

cyphar · 2017-10-07T07:36:18Z

4.14 isn't out yet 😉. CPU and memory have been merged, but there's still some disagreements over some bits (I still have to read through some patches I saw on the ML).

@brauner (from the LXC project) gave a nice talk about the more generic issues about cgroupv2: https://www.youtube.com/watch?v=P6Xnm0IhiSo .

webczat · 2017-10-07T14:14:52Z

I thought that cpu controller is merged so there are no more questions. Memory controller was there for a longer time wasn't it? 4.14 is not out but it is at rc so probably nothing significant can change

sargun · 2017-11-13T20:39:11Z

4.14 is out now.

cyphar · 2017-11-14T05:44:24Z

My reservations about cgroupv2's shortcomings (and the issues with the "hybrid" mode of operation) still hold. Not to mention that (last I tried) I wasn't able to get a system to boot with cgroupv2 enabled -- which doesn't bode well for testing any of that code.

redbaron · 2018-05-23T21:28:37Z

Is there any news/development regarding cgroups v2?

cyphar · 2018-05-25T04:35:48Z

Not really. freezer/devices is still not enabled on cgroupv2 and there are still arguments about the threaded mode of operation that was merged in 4.14.

sargun · 2018-05-25T04:38:25Z

You don't need freezer.

…

On Thu, May 24, 2018 at 9:35 PM, Aleksa Sarai ***@***.***> wrote: Not really. freezer is still not enabled on cgroupv2 and there are still arguments about the threaded mode of operation that was merged in 4.14. — You are receiving this because you commented. Reply to this email directly, view it on GitHub <#654 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AAtyRLHcgmthQNcwVoQoHYAEn_YwbnHyks5t14ougaJpZM4HzVBa> .

cyphar · 2018-06-03T11:59:09Z

You don't need it, but you do want it. The main problem is that we'd still need to have a hybrid mode (which is something I've always felt uncomfortable with the idea of).

sargun · 2018-09-24T16:02:16Z

@cyphar For users who do not use freezer (because they have PID namespaces) and they aren't trying to take live snapshots, do you think it's reasonable to have cgroupv2 support, and be able to have runc use the cgroupv2 "alternate" mode?

cyphar · 2018-09-25T05:33:32Z

I don't mind having a pure-cgroupv2 implementation, but I don't think it would be ultimately useful. As far as I know, no distribution actually uses cgroupv2 controllers "for real" (to be fair, we are also probably the reason it hasn't happened yet). I unfortunately think that we must have a hybrid implementation otherwise we won't be able to implement the cgroup parts OCI spec fully on ordinary systems (I mean, we can error out and that's compliant but it's not correct). Maybe for a first step pure-cgroupv2 would be fine but I'm not 100% on that.

But my main concern is that this actually is going to be harder than you might think to implement. @brauner gave a talk about this last year, specifically in the context of LXC and container runtimes in general. The no-internal-process constraint in particular means that container runtimes will have to do a very large amount of dodgy things in order to be able to run containers inside a new cgroup (you have to move the processes from any parent cgroups into a new leaf-node). In addition, subtree_control gives you quite a few headaches because some parent cgroup could limit your ability to create new

In the Docker case this won't be as awful (though it will still be bad) because you can just create a new cgroup at /docker/FOO which will avoid some of the internal-process constraint issues (it's very unlikely that the cgroup is completely unused and so / will not be a leaf node). But I have a feeling systemd will cause many headaches if we start doing things like that in cgroupv2 -- especially since in cgroupv2 they have the same problem as us with the internal-process constraint.

alban · 2018-09-25T08:54:17Z

we won't be able to implement the cgroup parts OCI spec fully on ordinary systems

I agree, the current OCI spec has been written with cgroup-v1 in mind... the device cgroup and the network classID are tied to cgroup-v1.

In cgroup-v2, the same features can be achieved with some equivalents for device cgroup and net_cls but that's different API.

So in my opinion, the OCI spec would need an update for cgroup-v2... either include some croup-v2 concepts or be abstracted.

But I have a feeling systemd will cause many headaches if we start doing things like that in cgroupv2

Do you refer to the systemd in the container, on the host, or using the container runtime systemd-nspawn?

For reference, systemd (on the host) supports 3 options for container runtimes with cgroup-v2.

sargun · 2018-09-25T12:08:17Z

Yeah, I think there are two threads here:

We need to change the OCI spec to accomodate cgroupv2, and not be as "prescriptive" about how cgroups are implemented.
We need a cgroupv2 engine

I think that the engine should ideally have pluggable backends. The first one should probably just make RPCs to systemd to create slices and scopes. For example, in our system today, we run all containers under /containers.slice. I can imagine something like this:

/containers.slice/..
        (The following scopes are created by systemd with Delegate=true)
        /container-1.scope (Resource constraints exist here)
        /container-2.scope

It might make sense for us to do our own cgroup control eventually, but given how poorly systemd plays with others, and how much investment goes into it, I see no reason to reinvent the wheel.

arianvp · 2019-06-23T19:02:18Z

FYI systemd-nspawn actually implements the OCI spec since the latest release and I assume it works with cgroups v2 (I would be surprised if it didn't) so perhaps that isn't such a large blocker as we thought? https://github.com/systemd/systemd/blob/916f595c7cbe5dd5028a23a17a245ef19e8f6a29/NEWS#L628

cyphar · 2019-08-25T17:20:26Z

Fedora 31 is switching to cgroupv2 entirely (and will start using crun as a result -- because it supports cgroupv2). I guess now it's do or die (work is being done in #2114).

mrunalp · 2019-08-25T20:20:01Z

@cyphar @giuseppe @filbranden and others interested in this, I think we need to settle on what we want to do in the runtime spec soon. I am leaning towards having a separate cgroupv2 struct in the spec and then allowing conversion in runc if needed. wdyt?
We can add this to the agenda for the OCI call if we think it is better discussed in sync there.

timchenxiaoyu · 2019-10-11T01:45:41Z

does cgroup v2 support limit page cache used ???

Werkov · 2019-10-11T07:32:31Z

@timchenxiaoyu Yes. In the sense that v1 controller limits page cache too.

AkihiroSuda · 2019-10-22T20:00:27Z

@cyphar I think this can be closed and now we should create separate issues for remaining tasks

crosbymichael · 2019-10-23T14:05:10Z

Ok, i'll close this and we will work out of individual issues remaining

Jamlee · 2019-10-25T06:23:08Z

so, where is the new issue about cgroup ?

AkihiroSuda · 2019-10-25T06:36:51Z

Basic support for cpu, cpuset, memory, pids, io (blkio), and freezer controllers is already done. (#2113)

The major remaining issues are:

the lack of the support for the v2 device controller (CRITICAL): cgroup2: eBPF device controller #2144 (PR: cgroup2: port over eBPF device controller from crun #2145)
the lack of runc ps: cgroup2: implement runc ps #2149
the lack of CI: CI: use Vagrant on Travis (mainly for cgroup2) #2124

I think maintainers should set cgroup2 labels so that people can easily find them.

Maybe we should also discuss the design of Manager and Subsystem structs: #2148 (comment)

Keruspe · 2019-10-25T07:42:42Z

What about a milestone?

AkihiroSuda · 2019-10-29T14:21:00Z

I wrote a blog about this: https://medium.com/nttlabs/cgroup-v2-596d035be4d7

There is no official milestone, but I think it will almost reach feature-complete when #2144 and #2149 gets merged.

AkihiroSuda · 2019-11-01T04:34:23Z

Rootful mode seems almost feature-complete now.

Rootless still doesn't work for cgroup2: #2163

Docker fails to start with "Devices cgroup isn't mounted." According to the systemd documentation: systemd now defaults to the "unified" cgroup hierarchy setup during build-time, i.e. -Ddefault-hierarchy=unified is now the build-time default. Previously, -Ddefault-hierarchy=hybrid was the default. [...] Downstream production distributions might want to continue to use -Ddefault-hierarchy=hybrid (or even =legacy) for their builds as unfortunately the popular container managers have not caught up with the kernel API changes. Changing this option to "hybrid" or "legacy" fixes the Docker startup. Reference: opencontainers/runc#654 Signed-off-by: Christian Stewart <christian@paral.in>

Docker fails to start with "Devices cgroup isn't mounted" as of systemd 243. According to the systemd documentation: systemd now defaults to the "unified" cgroup hierarchy setup during build-time, i.e. -Ddefault-hierarchy=unified is now the build-time default. Previously, -Ddefault-hierarchy=hybrid was the default. [...] Downstream production distributions might want to continue to use -Ddefault-hierarchy=hybrid (or even =legacy) for their builds as unfortunately the popular container managers have not caught up with the kernel API changes. Changing this option to "hybrid" or "legacy" fixes the Docker startup. Reference: opencontainers/runc#654 Signed-off-by: Christian Stewart <christian@paral.in> Tested-by: Jérémy Rosen <jeremy.rosen@smile.fr> Signed-off-by: Thomas Petazzoni <thomas.petazzoni@bootlin.com>

AkihiroSuda · 2020-01-14T05:13:36Z

I compiled the list of leftover TODOs: #2209

Docker fails to start with "Devices cgroup isn't mounted" as of systemd 243. According to the systemd documentation: systemd now defaults to the "unified" cgroup hierarchy setup during build-time, i.e. -Ddefault-hierarchy=unified is now the build-time default. Previously, -Ddefault-hierarchy=hybrid was the default. [...] Downstream production distributions might want to continue to use -Ddefault-hierarchy=hybrid (or even =legacy) for their builds as unfortunately the popular container managers have not caught up with the kernel API changes. Changing this option to "hybrid" or "legacy" fixes the Docker startup. Reference: opencontainers/runc#654 Signed-off-by: Christian Stewart <christian@paral.in> Tested-by: Jérémy Rosen <jeremy.rosen@smile.fr> Signed-off-by: Thomas Petazzoni <thomas.petazzoni@bootlin.com>

Last blocker got fixed, so updating runc should be last thing to do. opencontainers/runc#654

Docker fails to start with "Devices cgroup isn't mounted" as of systemd 243. According to the systemd documentation: systemd now defaults to the "unified" cgroup hierarchy setup during build-time, i.e. -Ddefault-hierarchy=unified is now the build-time default. Previously, -Ddefault-hierarchy=hybrid was the default. [...] Downstream production distributions might want to continue to use -Ddefault-hierarchy=hybrid (or even =legacy) for their builds as unfortunately the popular container managers have not caught up with the kernel API changes. Changing this option to "hybrid" or "legacy" fixes the Docker startup. Reference: opencontainers/runc#654 Signed-off-by: Christian Stewart <christian@paral.in> Tested-by: Jérémy Rosen <jeremy.rosen@smile.fr> Signed-off-by: Thomas Petazzoni <thomas.petazzoni@bootlin.com>

cyphar mentioned this issue Oct 18, 2016

Add CPU hotplug support #1119

Closed

stefanberger pushed a commit to stefanberger/runc that referenced this issue Sep 8, 2017

Merge pull request opencontainers#654 from wking/unique-within-this-map

b10c0b2

config: Bring "unique... within this map" back together

AkihiroSuda mentioned this issue Oct 17, 2018

Allow running dockerd as a non-root user (Rootless mode) moby/moby#38050

Merged

mskarbek mentioned this issue Oct 17, 2018

container does not start because of systemd support for managing cgroups is not available containers/podman#1534

Closed

AkihiroSuda mentioned this issue Dec 22, 2018

[rootless] question: plan for supporting cgroups? containers/podman#1429

Closed

thaJeztah mentioned this issue Jan 30, 2019

Docker does not run with unified cgroup hierarchy moby/moby#16238

Closed

croissanne mentioned this issue Aug 31, 2019

Fedora 31 cockpit-project/cockpit#12619

Merged

3 tasks

arianvp mentioned this issue Sep 4, 2019

systemd: 242 -> 243 NixOS/nixpkgs#68096

Merged

10 tasks

sandys mentioned this issue Oct 22, 2019

Add buildah/podman for managing images k3s-io/k3s#488

Closed

crosbymichael closed this as completed Oct 23, 2019

paralin mentioned this issue Nov 11, 2019

Docker fails to start with new unified cgroupv2 fs skiffos/SkiffOS#100

Closed

1 task

rektide mentioned this issue Dec 21, 2019

Support cgroups v2 k3s-io/k3s#900

Closed

tpgxyz added a commit to OpenMandrivaAssociation/systemd that referenced this issue Jun 17, 2020

switching to default-hierarchy=unified

86cb4a2

Last blocker got fixed, so updating runc should be last thing to do. opencontainers/runc#654

tpgxyz added a commit to OpenMandrivaAssociation/systemd that referenced this issue May 7, 2021

switching to default-hierarchy=unified

15c1ce9

Last blocker got fixed, so updating runc should be last thing to do. opencontainers/runc#654

utam0k mentioned this issue May 24, 2021

Support for cgroups v2 containers/youki#32

Closed

support cgroup v2 (unified hierarchy) #654

support cgroup v2 (unified hierarchy) #654

Comments

sols1 commented Mar 17, 2016

moby/moby#16238

cyphar commented Mar 22, 2016

sols1 commented Apr 27, 2016

cyphar commented Apr 27, 2016 • edited

rodionos commented May 1, 2016

sols1 commented May 24, 2016

cyphar commented May 24, 2016 • edited

justincormack commented Oct 19, 2016

sols1 commented Oct 19, 2016

cyphar commented Oct 20, 2016

hustcat commented Nov 29, 2016

rhatdan commented Jan 9, 2017

uname -r

webczat commented Oct 6, 2017

cyphar commented Oct 7, 2017

webczat commented Oct 7, 2017 via email

sargun commented Nov 13, 2017

cyphar commented Nov 14, 2017

redbaron commented May 23, 2018

cyphar commented May 25, 2018 • edited

sargun commented May 25, 2018 via email

cyphar commented Jun 3, 2018 • edited

sargun commented Sep 24, 2018 • edited

cyphar commented Sep 25, 2018 • edited

alban commented Sep 25, 2018

sargun commented Sep 25, 2018 • edited

arianvp commented Jun 23, 2019 • edited

cyphar commented Aug 25, 2019 • edited

mrunalp commented Aug 25, 2019

timchenxiaoyu commented Oct 11, 2019

Werkov commented Oct 11, 2019

AkihiroSuda commented Oct 22, 2019

crosbymichael commented Oct 23, 2019

Jamlee commented Oct 25, 2019

AkihiroSuda commented Oct 25, 2019 • edited

Keruspe commented Oct 25, 2019 via email

AkihiroSuda commented Oct 29, 2019

AkihiroSuda commented Nov 1, 2019

AkihiroSuda commented Jan 14, 2020

cyphar commented Apr 27, 2016 •

edited

cyphar commented May 24, 2016 •

edited

cyphar commented May 25, 2018 •

edited

cyphar commented Jun 3, 2018 •

edited

sargun commented Sep 24, 2018 •

edited

cyphar commented Sep 25, 2018 •

edited

sargun commented Sep 25, 2018 •

edited

arianvp commented Jun 23, 2019 •

edited

cyphar commented Aug 25, 2019 •

edited

AkihiroSuda commented Oct 25, 2019 •

edited