Patch set to support systemd as PID1 in container #13525

rhatdan · 2015-05-28T01:02:21Z

This patch set combines a few patches to allow systemd to fully function in a non privileged container

I am closeing down some pull requests for /run and journald and putting them all behind a

docker run --systemd option.

This patch set includes a couple of patches that should get merged without using the systemd pull.

jessfraz · 2015-05-28T23:54:07Z

i just can't condone a --systemd flag

jessfraz · 2015-05-28T23:54:51Z

whats next --sysvinit, --upstart, --i-just-wrote-my-own-init-system-in-a-weekend-give-me-a-flag

rhatdan · 2015-05-29T00:52:15Z

--systemd is the primary init system on almost all linux distributions and ubunto has said they will adopt it. Alot of the changes are defaults for standard init systems including the proper signals to send to an init daemon.

rhatdan · 2015-05-29T11:29:59Z

Ubuntu (sic) has already replaced Upstream with systemd:

https://wiki.ubuntu.com/VividVervet/ReleaseNotes/#Boot_and_service_management

It's just not in a long-term release yet.

rhatdan · 2015-05-29T13:14:36Z

I have no problem with changing the option to something like --os or --init, but this patch set is basically following these standards defined by systemd maintainers.

https://wiki.freedesktop.org/www/Software/systemd/ContainerInterface/

jessfraz · 2015-05-29T13:21:12Z

"Do not drop CAP_SYS_ADMIN from the container. "

Seems secure....

On Friday, May 29, 2015, Daniel J Walsh notifications@github.com wrote:

I have no problem with changing the option to something like --os or
--init, but this patch set is basically following these standards defined
by systemd maintainers.

https://wiki.freedesktop.org/www/Software/systemd/ContainerInterface/

—
Reply to this email directly or view it on GitHub
#13525 (comment).

shaded-enmity · 2015-05-29T13:27:49Z

@jfrazelle why the straw man?

Or do you think that integrating one process manager (which the Docker daemon is) with system-wide service/process manager is bad? I don't get it. Especially in the context of this:
https://github.com/docker/docker/tree/master/contrib/init

So yeah, support natively both systemd and upstart and let users choose?
I'd just change the option to --init=[none|systemd|upstart] and all is well, isn't it?

jessfraz · 2015-05-29T13:33:00Z

I think there is another way we can make everyone happy, and the Init flags
were a joke chill out

On Friday, May 29, 2015, Causa Sui notifications@github.com wrote:

@jfrazelle https://github.com/jfrazelle why the straw man?

Or do you think that integrating one process manager (which the Docker
daemon is) with system-wide service/process manager is bad? I don't get
it. Especially in the context of this:
https://github.com/docker/docker/tree/master/contrib/init

So yeah, support natively both systemd and upstart and let users choose?
I'd just change the option to --init=[none|systemd|upstart] and all is
well, isn't it?

—
Reply to this email directly or view it on GitHub
#13525 (comment).

rhatdan · 2015-05-29T14:02:47Z

"Do not drop CAP_SYS_ADMIN from the container. "

Not sure what you mean, I don't want CAP_SYS_ADMIN in the container.

rhatdan · 2015-05-29T14:06:34Z

How about we go with

--subsystem=systemd
or
--boot=systemd

Then if some other init system comes along, they could implement it without haveing to add options.

jessfraz · 2015-05-29T14:16:14Z

If you don't want it in the container then you should drop it

On Friday, May 29, 2015, Daniel J Walsh <notifications@github.com
javascript:_e(%7B%7D,'cvml','notifications@github.com');> wrote:

"Do not drop CAP_SYS_ADMIN from the container. "

Not sure what you mean, I don't want CAP_SYS_ADMIN in the container.

—
Reply to this email directly or view it on GitHub
#13525 (comment).

rhatdan · 2015-05-29T14:27:57Z

It is dropped by default. I don't think we are communicating well. Maybe we should go to IRC. I don't understand what you are trying to say.

jessfraz · 2015-05-29T14:38:58Z

I think it's fine don't worry it's just interesting wording to warn users
not to drop a cap that isn't even added

On Friday, May 29, 2015, Daniel J Walsh notifications@github.com wrote:

It is dropped by default. I don't think we are communicating well. Maybe
we should go to IRC. I don't understand what you are trying to say.

—
Reply to this email directly or view it on GitHub
#13525 (comment).

shaded-enmity · 2015-05-29T16:08:35Z

@jfrazelle Well, making everyone happy is utopic thought :)

I think that a program that does OS level virtualization should be aware about the innards of the OS it is virtualizing to simply provide the most seamless experience possible.

BTW: By default, when you create user namespace you get a full capability set, therefore it is suggested not to drop that capability. The fact that Docker drops it by default is nice, yet completely unrelated to the underlying kernel technologies, since Docker doesn't even use user namespace in the first place and therefore "relies" on the fact that CAP_SYS_ADMIN is inherited from the root account the Daemon is running as.

rhatdan · 2015-05-29T16:56:17Z

Ok, you were reading lennarts article. I was wondering where these quotes were coming from. I want to run a non-privileged container, so we will not follow that part of his description as well as allowing the container to do mknod. I believe they currently disable Pr

LalatenduMohanty · 2015-05-29T16:59:33Z

Honestly I don't see this as an issue (i.e. the idea) i.e. init subsystem specific flags . Init systems are critical piece for os/processes. I don't think anyone writes init system just like that. For example if we look at history of Linux(may be Unix too), how many times new init systems came to major Linux/GNU distributions.
So an optional flag makes the Docker experience better , then why not.

ncoghlan · 2015-06-02T04:32:01Z

I don't think it makes sense for Docker to go down the path of supporting arbitrary init systems running inside containers, which I believe is the implication @jfrazelle was reacting to. That expectation of having to eventually adjust Docker to handle the different init systems is implied by the "--systemd" name for the flag, as well as by the alternative suggestions of making Docker aware of multiple init systems and configuring the container differently based on which init system is being selected.

However, I do think it makes sense for Docker to better support the "multi-service container" use case, where an existing application that isn't particularly amenable to containerisation can be run inside a multi-service container, allow it to be managed and deployed using Docker based container management and orchestration tools, rather than having to have separate tooling to manage full VMs for such applications.

With a flag name like "--multiservice", this feature would let Docker define for init system developers what they need to do to integrate with Docker as the init system running inside a multi-service container. The fact that systemd already supports working this way shows that the defined interface is sufficient to get a multi-service container working correctly. Other init systems are likely to work better in a container configured this way, and if they still don't work completely, then it will be up to the developers of those systems to either update their init system to support Docker properly (as the systemd developers have), or else to propose changes to the way Docker configures multi-service containers.

rhatdan · 2015-06-10T15:23:42Z

Reworked this patch set to use --init=systemd rather then --systemd. Also added some more man page changes and completion code. This would allow additional init systems should they come along to add patches to run in their defined containerized world.

I would like to at least get the ok on the CLI change, so we could announce this to the Fedora world.

thaJeztah · 2015-06-14T17:52:44Z

ping @icecrime @tiborvass @jfrazelle @crosbymichael for #13525 (comment)

arthurzenika · 2015-06-23T07:46:35Z

we're interrested in this too, thanks @rhatdan for this pull request.

gw0 · 2015-07-03T00:39:28Z

The main thing this patch does is sets the default graceful stop signal at create/run time. So why not name the parameter --signal?

As far as I can see not just process managers, but many daemons use different signals for graceful shutdown, so it doesn't make sense to enforce SIGTERM (check out Handling Signals section). Also check out systemd and docker discussion of this bug:

"When you send PID 1 a SIGTERM, then this means that PID 1 shall reexecute. It always has meant that, regardless if you look at sysvinit, upstart or systemd. It's one of the special semantics that PID 1 has."

Otherwise with the official docker-upstart image graceful shutdown can be achieved with SIGPWR:

# docker kill --signal="PWR" xxx1 && sleep 10 && docker kill xxx1

Systemd and others also use SIGPWR for graceful shutdown (if prepared correctly, see configure_debian_systemd()). Also newer versions of lxc-stop have SIGPWR as halt signal:

"By default, it will request a clean shutdown of the container by sending lxc.haltsignal (defaults to SIGPWR) to the container's init process, waiting up to 60 seconds for the container to exit, and then returning. If the container fails to cleanly exit in 60 seconds, it will be sent the lxc.stopsignal (defaults to SIGKILL) to force it to shut down. A request to reboot will send the lxc.rebootsignal (defaults to SIGINT) to the container's init process."

To me it seems that it would make most sense to at least put the default graceful stop (now SIGTERM) and default graceful stop timeout before killing (now only in stop command) as settings into Dockerfile and at create/run time (eg. docker create --stop-signal SIGPWR --stop-timeout 60 ...). But also overridable at stop time as timeout already is. (At stop time there could also be a --ask-before-kill option to ask the user what to do.)

mitar · 2015-07-03T09:17:08Z

--stop-signal is a great idea!

rhatdan · 2015-07-13T19:03:28Z

Sometimes I think people are clouding their dislike of systemd rather then doing what is best for the users. systemd is the default init system, which will be running on the majority of systems that docker will be running on going forward.

Having a systemd running properly within the container, without having everyone need to learn some fancy recipe is the goal.

Perhaps this should be moved to runc going forward, but we still would prefer docker to tell the container it will be running a traditional init system as pid1.

timthelion · 2015-07-27T18:49:12Z

docs/reference/run.md

@@ -201,6 +202,14 @@ more advanced use case would be changing the host's hostname from a container.
 > **Note**: `--uts="host"` gives the container full access to change the
 > hostname of the host and is therefore considered insecure.

+## INIT settings (--init)
+   Default to docker standard method of running containers.


Defaults to Docker's standard method of running contianers.

Though this is still confusing.

Suggest some other wording?

## INIT settings (--init) Specify alternate method of running containers.

I would write:

INIT settings (--init)

Enable a pre-configured profile for running init systems within containers.
Default: No profile is enabled.

While you're working on that, please add a blank line after the heading (## INIT..), otherwise it may not render properly when generating the docs

rhatdan · 2015-07-27T19:38:01Z

Changed the docs

Signed-off-by: Sally O'Malley <somalley@redhat.com> Docker-DCO-1.1-Signed-off-by: Dan Walsh <dwalsh@redhat.com> (github: rhatdan)

This patch will use the new PreMount and PostMount hooks to "tar" up the contents of the base image on top of tmpfs mount points. We want to add support for /run and /tmp on tmpfs and Michael Chrosby idea to add --tmpfs PATH would require a patch like this. Docker-DCO-1.1-Signed-off-by: Dan Walsh <dwalsh@redhat.com> (github: rhatdan)

This tells systemd what to name the /etc/machine-id file which links the journald within the container to to external log Docker-DCO-1.1-Signed-off-by: Dan Walsh <dwalsh@redhat.com> (github: rhatdan)

…ystem mode If you are running a container with an init system like systemd as pid 1 this i option will modify the default container environment to follow the standard defined for that init system. Specifying --init=systemd will run the container with the following changes: /run mounted as a tmpfs /sys/fs/cgroup volume mounted from the host into the container read-only /var/log/journal/UUID volume mounted from the host into the container to allow journald data from the container to be seen on the host. Changes the default signals send from docker stop to the init process (systemd) Docker-DCO-1.1-Signed-off-by: Dan Walsh <dwalsh@redhat.com> (github: rhatdan) Docker-DCO-1.1-Signed-off-by: Dan Walsh <dwalsh@redhat.com> (github: rhatdan)

Docker-DCO-1.1-Signed-off-by: Dan Walsh <dwalsh@redhat.com> (github: rhatdan)

calavera · 2015-08-07T21:53:51Z

We've discussed this change with several maintainers and we reached the agreement that we do NOT want to add a flag for a specific init system to docker.

We're going to provider the proper primitives to make this work, like allowing to stop a container with a specific signal and allowing to mount tmpfs.

Closing this PR. Thank you all for your feedback.

timthelion · 2015-08-07T22:02:13Z

@calavera It will take longer to add each of these capabilities in a general way, but I agree that it will be worth the effort.

rhatdan · 2015-08-08T09:57:51Z

I totally disagree with this decision. I don't see a clean way to do this without at least getting the container ID information into the container. The complexity of doing this correctly is going to be a huge headache to users, and going to cause lots of bugs when systemd/docker does not work correctly.

I responded to emails discussing this issue with my concerns and then never heard back on them.

thaJeztah · 2015-08-08T10:07:54Z

@icecrime @calavera would it be a option to schedule a hangout with @rhatdan ? I've seen this issue, and several related proposals/attempts (e.g. the container_ uuid) appear for over a year; is there a ways to reach some consensus, or better discuss the approach @calavera proposes?

timthelion · 2015-08-08T10:10:40Z

@thaJeztah would you mind linkng me to @calavera's approach? I am not in the in club I guess ;)

timthelion · 2015-08-08T10:13:01Z

@thaJeztah Or do you simply refer to his closing comments?

thaJeztah · 2015-08-08T10:25:24Z

@timthelion sorry, yes, just meant the approach in his comment (not "in the loop" either, if there's any "loop", lol)

Just hoping we can find a solution that's satisfactory to all involved here ❤️

calavera · 2015-08-10T17:22:21Z

@timthelion There are two PRs open already to address some of the issues that this change tried to solve:

Stop signaling: #15307
Tmpfs: #13587

We already mount the cgroups directory inside the container as read-only, and other issues raised here can be work around with options that we already have.

rhatdan · 2015-08-10T18:52:04Z

The unification of the UUID is the one that can not be fixed. Having separate UUID to refer to the docker container, is a user nightmare.

This patch along with machinectl patches basically allow a user to refer to a container based on the UUID.

The end goal is to be able to do

journalctl -M UUID

And then would see the journal from inside of the container.

This requieres RegisterMachine to register the UUID, /var/log/journal/UUID from the host mounted into the container, /etc/machine-id generated by systemd to match the UUID, which can only happen if $container_uuid=UUID before systemd starts.

timthelion · 2015-08-10T20:00:39Z

@rhatdan Then can you make a pull request that would create a --expose-container-uuid-to-container CLI option to docker run which would set $DOCKER_CONTAINER_UUID at container creation time? That should be pretty strait forward, no?

rhatdan · 2015-08-10T20:09:51Z

I have had a pull request to do this for over a year

#7685

but it is not that simple.

Since we need to know the name before the container starts to create the /var/log/docker/UUID directory to mount into the container. We also need the RegisterMachine patch to be added to to docker to register the UUID before the container starts.

#13526

jeffmccune · 2015-11-20T21:45:35Z

@calavera

We've discussed this change with several maintainers and we reached the agreement that we do NOT want to add a flag for a specific init system to docker.

Could you explain why please? I'd like to see this added for more efficient system level testing using containers running systemd rather than full VM's running systemd. I've read this comment thread a few times and it's not clear why the patch has been rejected.

GordonTheTurtle added the status/0-triage label May 28, 2015

This was referenced May 28, 2015

New implementation of /run support #8478

Closed

Implement journalctl to see journald data within a docker container #10994

Closed

cpuguy83 added status/1-design-review and removed status/0-triage labels May 28, 2015

rhatdan mentioned this pull request May 28, 2015

Tar up contents of child directory onto tmpfs if mounted over #13212

Closed

rhatdan force-pushed the systemd branch from 0c483d0 to d3a7938 Compare May 28, 2015 01:09

rhatdan force-pushed the systemd branch from d3a7938 to fd78923 Compare June 10, 2015 15:18

rhatdan force-pushed the systemd branch 2 times, most recently from 4e63cde to 09cb2fe Compare July 20, 2015 20:30

tiborvass added the status/needs-attention Calls for a collective discussion during a review session label Jul 23, 2015

rhatdan force-pushed the systemd branch 3 times, most recently from a87edd7 to 7031529 Compare July 24, 2015 18:26

timthelion reviewed Jul 27, 2015
View reviewed changes

rhatdan force-pushed the systemd branch from 7031529 to b9e3855 Compare July 27, 2015 19:36

rhatdan added 5 commits July 28, 2015 09:10

RegisterMachine truncate container names to first 64 chars

4836835

Signed-off-by: Sally O'Malley <somalley@redhat.com> Docker-DCO-1.1-Signed-off-by: Dan Walsh <dwalsh@redhat.com> (github: rhatdan)

Set container_uuid environment variable.

7779652

This tells systemd what to name the /etc/machine-id file which links the journald within the container to to external log Docker-DCO-1.1-Signed-off-by: Dan Walsh <dwalsh@redhat.com> (github: rhatdan)

Merge branch 'master' of github.com:docker/docker into systemd

c79fa46

Docker-DCO-1.1-Signed-off-by: Dan Walsh <dwalsh@redhat.com> (github: rhatdan)

rhatdan force-pushed the systemd branch from b9e3855 to c79fa46 Compare July 28, 2015 18:39

calavera closed this Aug 7, 2015

immarvin mentioned this pull request Mar 7, 2016

[FVT]xCAT in docker :genimage failed xcat2/xcat-core#663

Closed

Patch set to support systemd as PID1 in container #13525

Patch set to support systemd as PID1 in container #13525

Conversation

rhatdan commented May 28, 2015

jessfraz commented May 28, 2015

jessfraz commented May 28, 2015

rhatdan commented May 29, 2015

rhatdan commented May 29, 2015

rhatdan commented May 29, 2015

jessfraz commented May 29, 2015

shaded-enmity commented May 29, 2015

jessfraz commented May 29, 2015

rhatdan commented May 29, 2015

rhatdan commented May 29, 2015

jessfraz commented May 29, 2015

rhatdan commented May 29, 2015

jessfraz commented May 29, 2015

shaded-enmity commented May 29, 2015

rhatdan commented May 29, 2015

LalatenduMohanty commented May 29, 2015

ncoghlan commented Jun 2, 2015

rhatdan commented Jun 10, 2015

thaJeztah commented Jun 14, 2015

arthurzenika commented Jun 23, 2015

gw0 commented Jul 3, 2015

mitar commented Jul 3, 2015

rhatdan commented Jul 13, 2015

timthelion Jul 27, 2015

Choose a reason for hiding this comment

rhatdan Jul 27, 2015

Choose a reason for hiding this comment

timthelion Jul 27, 2015

Choose a reason for hiding this comment

INIT settings (--init)

thaJeztah Jul 27, 2015

Choose a reason for hiding this comment

rhatdan commented Jul 27, 2015

calavera commented Aug 7, 2015

timthelion commented Aug 7, 2015

rhatdan commented Aug 8, 2015

thaJeztah commented Aug 8, 2015

timthelion commented Aug 8, 2015

timthelion commented Aug 8, 2015

thaJeztah commented Aug 8, 2015

calavera commented Aug 10, 2015

rhatdan commented Aug 10, 2015

timthelion commented Aug 10, 2015

rhatdan commented Aug 10, 2015

jeffmccune commented Nov 20, 2015