Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Patch set to support systemd as PID1 in container #13525

Closed
wants to merge 5 commits into from

Conversation

rhatdan
Copy link
Contributor

@rhatdan rhatdan commented May 28, 2015

This patch set combines a few patches to allow systemd to fully function in a non privileged container

I am closeing down some pull requests for /run and journald and putting them all behind a

docker run --systemd option.

This patch set includes a couple of patches that should get merged without using the systemd pull.

@jessfraz
Copy link
Contributor

i just can't condone a --systemd flag

@jessfraz
Copy link
Contributor

whats next --sysvinit, --upstart, --i-just-wrote-my-own-init-system-in-a-weekend-give-me-a-flag

@rhatdan
Copy link
Contributor Author

rhatdan commented May 29, 2015

--systemd is the primary init system on almost all linux distributions and ubunto has said they will adopt it. Alot of the changes are defaults for standard init systems including the proper signals to send to an init daemon.

@rhatdan
Copy link
Contributor Author

rhatdan commented May 29, 2015

Ubuntu (sic) has already replaced Upstream with systemd:

https://wiki.ubuntu.com/VividVervet/ReleaseNotes/#Boot_and_service_management

It's just not in a long-term release yet.

@rhatdan
Copy link
Contributor Author

rhatdan commented May 29, 2015

I have no problem with changing the option to something like --os or --init, but this patch set is basically following these standards defined by systemd maintainers.

https://wiki.freedesktop.org/www/Software/systemd/ContainerInterface/

@jessfraz
Copy link
Contributor

"Do not drop CAP_SYS_ADMIN from the container. "

Seems secure....

On Friday, May 29, 2015, Daniel J Walsh notifications@github.com wrote:

I have no problem with changing the option to something like --os or
--init, but this patch set is basically following these standards defined
by systemd maintainers.

https://wiki.freedesktop.org/www/Software/systemd/ContainerInterface/


Reply to this email directly or view it on GitHub
#13525 (comment).

@shaded-enmity
Copy link

@jfrazelle why the straw man?

Or do you think that integrating one process manager (which the Docker daemon is) with system-wide service/process manager is bad? I don't get it. Especially in the context of this:
https://github.com/docker/docker/tree/master/contrib/init

So yeah, support natively both systemd and upstart and let users choose?
I'd just change the option to --init=[none|systemd|upstart] and all is well, isn't it?

@jessfraz
Copy link
Contributor

I think there is another way we can make everyone happy, and the Init flags
were a joke chill out

On Friday, May 29, 2015, Causa Sui notifications@github.com wrote:

@jfrazelle https://github.com/jfrazelle why the straw man?

Or do you think that integrating one process manager (which the Docker
daemon is) with system-wide service/process manager is bad? I don't get
it. Especially in the context of this:
https://github.com/docker/docker/tree/master/contrib/init

So yeah, support natively both systemd and upstart and let users choose?
I'd just change the option to --init=[none|systemd|upstart] and all is
well, isn't it?


Reply to this email directly or view it on GitHub
#13525 (comment).

@rhatdan
Copy link
Contributor Author

rhatdan commented May 29, 2015

"Do not drop CAP_SYS_ADMIN from the container. "

Not sure what you mean, I don't want CAP_SYS_ADMIN in the container.

@rhatdan
Copy link
Contributor Author

rhatdan commented May 29, 2015

How about we go with

--subsystem=systemd
or
--boot=systemd

Then if some other init system comes along, they could implement it without haveing to add options.

@jessfraz
Copy link
Contributor

If you don't want it in the container then you should drop it

On Friday, May 29, 2015, Daniel J Walsh <notifications@github.com
javascript:_e(%7B%7D,'cvml','notifications@github.com');> wrote:

"Do not drop CAP_SYS_ADMIN from the container. "

Not sure what you mean, I don't want CAP_SYS_ADMIN in the container.


Reply to this email directly or view it on GitHub
#13525 (comment).

@rhatdan
Copy link
Contributor Author

rhatdan commented May 29, 2015

It is dropped by default. I don't think we are communicating well. Maybe we should go to IRC. I don't understand what you are trying to say.

@jessfraz
Copy link
Contributor

I think it's fine don't worry it's just interesting wording to warn users
not to drop a cap that isn't even added

On Friday, May 29, 2015, Daniel J Walsh notifications@github.com wrote:

It is dropped by default. I don't think we are communicating well. Maybe
we should go to IRC. I don't understand what you are trying to say.


Reply to this email directly or view it on GitHub
#13525 (comment).

@shaded-enmity
Copy link

@jfrazelle Well, making everyone happy is utopic thought :)

I think that a program that does OS level virtualization should be aware about the innards of the OS it is virtualizing to simply provide the most seamless experience possible.

BTW: By default, when you create user namespace you get a full capability set, therefore it is suggested not to drop that capability. The fact that Docker drops it by default is nice, yet completely unrelated to the underlying kernel technologies, since Docker doesn't even use user namespace in the first place and therefore "relies" on the fact that CAP_SYS_ADMIN is inherited from the root account the Daemon is running as.

@rhatdan
Copy link
Contributor Author

rhatdan commented May 29, 2015

Ok, you were reading lennarts article. I was wondering where these quotes were coming from. I want to run a non-privileged container, so we will not follow that part of his description as well as allowing the container to do mknod. I believe they currently disable Pr

@LalatenduMohanty
Copy link
Contributor

Honestly I don't see this as an issue (i.e. the idea) i.e. init subsystem specific flags . Init systems are critical piece for os/processes. I don't think anyone writes init system just like that. For example if we look at history of Linux(may be Unix too), how many times new init systems came to major Linux/GNU distributions.
So an optional flag makes the Docker experience better , then why not.

@ncoghlan
Copy link

ncoghlan commented Jun 2, 2015

I don't think it makes sense for Docker to go down the path of supporting arbitrary init systems running inside containers, which I believe is the implication @jfrazelle was reacting to. That expectation of having to eventually adjust Docker to handle the different init systems is implied by the "--systemd" name for the flag, as well as by the alternative suggestions of making Docker aware of multiple init systems and configuring the container differently based on which init system is being selected.

However, I do think it makes sense for Docker to better support the "multi-service container" use case, where an existing application that isn't particularly amenable to containerisation can be run inside a multi-service container, allow it to be managed and deployed using Docker based container management and orchestration tools, rather than having to have separate tooling to manage full VMs for such applications.

With a flag name like "--multiservice", this feature would let Docker define for init system developers what they need to do to integrate with Docker as the init system running inside a multi-service container. The fact that systemd already supports working this way shows that the defined interface is sufficient to get a multi-service container working correctly. Other init systems are likely to work better in a container configured this way, and if they still don't work completely, then it will be up to the developers of those systems to either update their init system to support Docker properly (as the systemd developers have), or else to propose changes to the way Docker configures multi-service containers.

@rhatdan
Copy link
Contributor Author

rhatdan commented Jun 10, 2015

Reworked this patch set to use --init=systemd rather then --systemd. Also added some more man page changes and completion code. This would allow additional init systems should they come along to add patches to run in their defined containerized world.

I would like to at least get the ok on the CLI change, so we could announce this to the Fedora world.

@thaJeztah
Copy link
Member

@arthurzenika
Copy link

we're interrested in this too, thanks @rhatdan for this pull request.

@gw0
Copy link

gw0 commented Jul 3, 2015

The main thing this patch does is sets the default graceful stop signal at create/run time. So why not name the parameter --signal?

As far as I can see not just process managers, but many daemons use different signals for graceful shutdown, so it doesn't make sense to enforce SIGTERM (check out Handling Signals section). Also check out systemd and docker discussion of this bug:

"When you send PID 1 a SIGTERM, then this means that PID 1 shall reexecute. It always has meant that, regardless if you look at sysvinit, upstart or systemd. It's one of the special semantics that PID 1 has."

Otherwise with the official docker-upstart image graceful shutdown can be achieved with SIGPWR:

# docker kill --signal="PWR" xxx1 && sleep 10 && docker kill xxx1

Systemd and others also use SIGPWR for graceful shutdown (if prepared correctly, see configure_debian_systemd()). Also newer versions of lxc-stop have SIGPWR as halt signal:

"By default, it will request a clean shutdown of the container by sending lxc.haltsignal (defaults to SIGPWR) to the container's init process, waiting up to 60 seconds for the container to exit, and then returning. If the container fails to cleanly exit in 60 seconds, it will be sent the lxc.stopsignal (defaults to SIGKILL) to force it to shut down. A request to reboot will send the lxc.rebootsignal (defaults to SIGINT) to the container's init process."

To me it seems that it would make most sense to at least put the default graceful stop (now SIGTERM) and default graceful stop timeout before killing (now only in stop command) as settings into Dockerfile and at create/run time (eg. docker create --stop-signal SIGPWR --stop-timeout 60 ...). But also overridable at stop time as timeout already is. (At stop time there could also be a --ask-before-kill option to ask the user what to do.)

@mitar
Copy link

mitar commented Jul 3, 2015

--stop-signal is a great idea!

@rhatdan
Copy link
Contributor Author

rhatdan commented Jul 13, 2015

Sometimes I think people are clouding their dislike of systemd rather then doing what is best for the users. systemd is the default init system, which will be running on the majority of systems that docker will be running on going forward.

Having a systemd running properly within the container, without having everyone need to learn some fancy recipe is the goal.

Perhaps this should be moved to runc going forward, but we still would prefer docker to tell the container it will be running a traditional init system as pid1.

@rhatdan rhatdan force-pushed the systemd branch 2 times, most recently from 4e63cde to 09cb2fe Compare July 20, 2015 20:30
@tiborvass tiborvass added the status/needs-attention Calls for a collective discussion during a review session label Jul 23, 2015
@rhatdan rhatdan force-pushed the systemd branch 3 times, most recently from a87edd7 to 7031529 Compare July 24, 2015 18:26
@@ -201,6 +202,14 @@ more advanced use case would be changing the host's hostname from a container.
> **Note**: `--uts="host"` gives the container full access to change the
> hostname of the host and is therefore considered insecure.

## INIT settings (--init)
Default to docker standard method of running containers.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Defaults to Docker's standard method of running contianers.

Though this is still confusing.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggest some other wording?

## INIT settings (--init)
Specify alternate method of running containers.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would write:

INIT settings (--init)

Enable a pre-configured profile for running init systems within containers.
Default: No profile is enabled.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While you're working on that, please add a blank line after the heading (## INIT..), otherwise it may not render properly when generating the docs

@rhatdan
Copy link
Contributor Author

rhatdan commented Jul 27, 2015

Changed the docs

Signed-off-by: Sally O'Malley <somalley@redhat.com>

Docker-DCO-1.1-Signed-off-by: Dan Walsh <dwalsh@redhat.com> (github: rhatdan)
This patch will use the new PreMount and PostMount hooks to "tar"
up the contents of the base image on top of tmpfs mount points.

We want to add support for /run and /tmp on tmpfs and Michael Chrosby idea to
add --tmpfs PATH would require a patch like this.

Docker-DCO-1.1-Signed-off-by: Dan Walsh <dwalsh@redhat.com> (github: rhatdan)
This tells systemd what to name the /etc/machine-id file
which links the journald within the container to to external log

Docker-DCO-1.1-Signed-off-by: Dan Walsh <dwalsh@redhat.com> (github: rhatdan)
…ystem mode

If you are running a container with an init system like systemd as pid 1 this i
option will modify the default container environment to follow the standard
defined for that init system.

Specifying --init=systemd will run the container with the following changes:

 /run mounted as a tmpfs
 /sys/fs/cgroup volume mounted from the host into the container read-only
 /var/log/journal/UUID volume mounted from the host into the container to allow
journald data from the container to be seen on the host.
 Changes the default signals send from docker stop to the init process (systemd)

Docker-DCO-1.1-Signed-off-by: Dan Walsh <dwalsh@redhat.com> (github: rhatdan)

Docker-DCO-1.1-Signed-off-by: Dan Walsh <dwalsh@redhat.com> (github: rhatdan)
Docker-DCO-1.1-Signed-off-by: Dan Walsh <dwalsh@redhat.com> (github: rhatdan)
@calavera
Copy link
Contributor

calavera commented Aug 7, 2015

We've discussed this change with several maintainers and we reached the agreement that we do NOT want to add a flag for a specific init system to docker.

We're going to provider the proper primitives to make this work, like allowing to stop a container with a specific signal and allowing to mount tmpfs.

Closing this PR. Thank you all for your feedback.

@calavera calavera closed this Aug 7, 2015
@timthelion
Copy link
Contributor

@calavera It will take longer to add each of these capabilities in a general way, but I agree that it will be worth the effort.

@rhatdan
Copy link
Contributor Author

rhatdan commented Aug 8, 2015

I totally disagree with this decision. I don't see a clean way to do this without at least getting the container ID information into the container. The complexity of doing this correctly is going to be a huge headache to users, and going to cause lots of bugs when systemd/docker does not work correctly.

I responded to emails discussing this issue with my concerns and then never heard back on them.

@thaJeztah
Copy link
Member

@icecrime @calavera would it be a option to schedule a hangout with @rhatdan ? I've seen this issue, and several related proposals/attempts (e.g. the container_ uuid) appear for over a year; is there a ways to reach some consensus, or better discuss the approach @calavera proposes?

@timthelion
Copy link
Contributor

@thaJeztah would you mind linkng me to @calavera's approach? I am not in the in club I guess ;)

@timthelion
Copy link
Contributor

@thaJeztah Or do you simply refer to his closing comments?

@thaJeztah
Copy link
Member

@timthelion sorry, yes, just meant the approach in his comment (not "in the loop" either, if there's any "loop", lol)

Just hoping we can find a solution that's satisfactory to all involved here ❤️

@calavera
Copy link
Contributor

@timthelion There are two PRs open already to address some of the issues that this change tried to solve:

Stop signaling: #15307
Tmpfs: #13587

We already mount the cgroups directory inside the container as read-only, and other issues raised here can be work around with options that we already have.

@rhatdan
Copy link
Contributor Author

rhatdan commented Aug 10, 2015

The unification of the UUID is the one that can not be fixed. Having separate UUID to refer to the docker container, is a user nightmare.

This patch along with machinectl patches basically allow a user to refer to a container based on the UUID.

The end goal is to be able to do

journalctl -M UUID

And then would see the journal from inside of the container.

This requieres RegisterMachine to register the UUID, /var/log/journal/UUID from the host mounted into the container, /etc/machine-id generated by systemd to match the UUID, which can only happen if $container_uuid=UUID before systemd starts.

@timthelion
Copy link
Contributor

@rhatdan Then can you make a pull request that would create a --expose-container-uuid-to-container CLI option to docker run which would set $DOCKER_CONTAINER_UUID at container creation time? That should be pretty strait forward, no?

@rhatdan
Copy link
Contributor Author

rhatdan commented Aug 10, 2015

I have had a pull request to do this for over a year

#7685

but it is not that simple.

Since we need to know the name before the container starts to create the /var/log/docker/UUID directory to mount into the container. We also need the RegisterMachine patch to be added to to docker to register the UUID before the container starts.

#13526

@jeffmccune
Copy link

@calavera

We've discussed this change with several maintainers and we reached the agreement that we do NOT want to add a flag for a specific init system to docker.

Could you explain why please? I'd like to see this added for more efficient system level testing using containers running systemd rather than full VM's running systemd. I've read this comment thread a few times and it's not clear why the patch has been rejected.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
status/needs-attention Calls for a collective discussion during a review session status/1-design-review
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet