This repository has been archived by the owner on Jan 10, 2023. It is now read-only.

Systemd support #211

Merged: 2 commits on Feb 19, 2019

@rgulewich rgulewich commented Dec 20, 2018

For all containers:

  • Mount /run as tmpfs (default size 128 MiB)

For systemd-labeled containers (those running an image with the
com.netflix.titus.systemd label set to "true"):

  • Mount /run/lock as its own tmpfs mount
  • Tini execs the container's init command so that it runs as PID 1
  • Run them using the standard AppArmor and seccomp profiles: no
    CAP_SYS_ADMIN requirement

Other notes:

  • This requires cgroup namespaces to be enabled in Docker; otherwise
    the systemd container fails to come up because it cannot create
    new cgroups.
  • Move to Bionic for the systemd test image: the version of systemd that
    ships with it can start without CAP_SYS_ADMIN
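
As a rough illustration of the mount layout described above, the following sketch builds a mount-point-to-options map of the shape that Docker's `HostConfig.Tmpfs` field takes (`map[string]string`). The function name, the option strings, and the `/run/lock` size are assumptions made for this example; the PR only states the 128 MiB default for `/run`.

```go
package main

import (
	"fmt"
	"sort"
)

// tmpfsMounts returns mount point -> mount options. The option strings are
// illustrative assumptions; only the 128 MiB size for /run comes from the
// PR description.
func tmpfsMounts(isSystemd bool) map[string]string {
	mounts := map[string]string{
		// Every container gets /run as tmpfs.
		"/run": "rw,exec,size=128m",
	}
	if isSystemd {
		// systemd-labeled containers also get /run/lock as its own mount.
		// The 5m size is a placeholder, not a value taken from the PR.
		mounts["/run/lock"] = "rw,noexec,nosuid,nodev,size=5m"
	}
	return mounts
}

func main() {
	for _, systemd := range []bool{false, true} {
		mounts := tmpfsMounts(systemd)

		// Sort the paths so the output order is stable.
		paths := make([]string, 0, len(mounts))
		for p := range mounts {
			paths = append(paths, p)
		}
		sort.Strings(paths)

		fmt.Printf("systemd=%v\n", systemd)
		for _, p := range paths {
			fmt.Printf("  %s -> %s\n", p, mounts[p])
		}
	}
}
```

Hard-coding the size in the option string, rather than formatting it at runtime, matches the simplification suggested later in the review.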

codecov bot commented Dec 20, 2018

Codecov Report

Merging #211 into master will increase coverage by 0.16%.
The diff coverage is 67.39%.

@@            Coverage Diff             @@
##           master     #211      +/-   ##
==========================================
+ Coverage   35.65%   35.82%   +0.16%     
==========================================
  Files          71       71              
  Lines        9031     9061      +30     
==========================================
+ Hits         3220     3246      +26     
+ Misses       5412     5409       -3     
- Partials      399      406       +7

| Impacted Files | Coverage Δ |
|---|---|
| executor/runtime/types/types.go | 67.24% <ø> (ø) ⬆️ |
| executor/runtime/docker/capabilities.go | 82.35% <100%> (+1.1%) ⬆️ |
| executor/runtime/docker/seccomp/seccomp.go | 28.81% <100%> (+2.65%) ⬆️ |
| executor/runtime/docker/docker.go | 55.15% <57.89%> (-0.09%) ⬇️ |
| executor/runtime/docker/docker_linux.go | 60.68% <68.18%> (+1.73%) ⬆️ |

coveralls commented Dec 21, 2018

Pull Request Test Coverage Report for Build 3233

  • 5 of 58 (8.62%) changed or added relevant lines in 4 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage decreased (-0.07%) to 25.784%

| Changes Missing Coverage | Covered Lines | Changed/Added Lines | % |
|---|---|---|---|
| executor/runtime/docker/capabilities.go | 3 | 6 | 50.0% |
| executor/runtime/docker/docker_linux.go | 0 | 24 | 0.0% |
| executor/runtime/docker/docker.go | 0 | 26 | 0.0% |

Totals:

  • Change from base Build 3218: -0.07%
  • Covered Lines: 2985
  • Relevant Lines: 11577


@rgulewich rgulewich force-pushed the systemd branch 3 times, most recently from 99fde5e to 4bb18a7 Compare January 28, 2019 17:26
@sargun sargun left a comment

I think the only functional bug in here is the concurrency bug with inspecting the image and modifying a shared object. The rest are stylistic.

I guess we cannot add a full end-to-end test until we get the new Docker, correct?

.circleci/config.yml (resolved)
@@ -58,9 +59,12 @@ func setupAdditionalCapabilities(c *runtimeTypes.Container, hostCfg *container.H
if c.TitusInfo.GetAllowNestedContainers() {
apparmorProfile = "docker-nested"
seccompProfile = "nested-container.json"
c.Env["TINI_UNSHARE"] = trueString
Contributor

We can rip out the nested containers code if you want?

Contributor Author

Are we never going to use it again? I wasn't sure.

Contributor

We might, but for now, it's cruft that's not supported, and should never be set. Probably worth just getting rid of the code and throwing an error if we find it's set to true.
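
One way to read that suggestion is a small up-front check that rejects the flag instead of configuring nesting. The sketch below is hypothetical: `validateNesting` and the error value are names invented for this example, and the boolean stands in for the result of `c.TitusInfo.GetAllowNestedContainers()`.

```go
package main

import (
	"errors"
	"fmt"
)

// errNestedContainersUnsupported is a hypothetical sentinel error.
var errNestedContainersUnsupported = errors.New("nested containers are not supported")

// validateNesting fails fast when the (unsupported) nested-containers flag
// is set, rather than silently applying the nested profiles.
func validateNesting(allowNested bool) error {
	if allowNested {
		return errNestedContainersUnsupported
	}
	return nil
}

func main() {
	fmt.Println("allowNested=false:", validateNesting(false))
	fmt.Println("allowNested=true: ", validateNesting(true))
}
```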

@@ -468,6 +471,20 @@ func (r *DockerRuntime) dockerConfig(c *runtimeTypes.Container, binds []string,
// Maybe set cfs bandwidth has to be called _after_
maybeSetCFSBandwidth(r.dockerCfg.cfsBandwidthPeriod, c, hostCfg)

// Always setup tmpfs: it's needed to ensure Metatron credentials don't persist across reboots and for SystemD to work
tmpFsSize := int64(defaultRunTmpFsSize)
if hostCfg.Memory < tmpFsSize {
Contributor

I don't think we need this. It should never happen, since container memory should never be so low. In fact, if system memory = tmpfs memory, aren't we effectively leaking double the RAM?

We "trust" the master to do validation on resource dimensions.

Contributor

I think this would also simplify the code a little, because you could then hard-code the sizes below rather than building them with fmt.Sprintf.

Contributor Author

Sure, that's fine. When you pass in tmpfs volumes to docker at create time, the used tmpfs bytes count toward the cgroup's limit, so it doesn't double count.

if systemdBool, ok := imageInfo.Config.Labels[systemdImageLabel]; ok {
val, err := strconv.ParseBool(systemdBool)
if err != nil {
return err
Contributor

Wrap this error so we know where it's coming from (IIRC, the function is errors.Wrapf).
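
A minimal sketch of the wrapping idea, using only the standard library: `fmt.Errorf` with `%w` is swapped in here for the `errors.Wrapf` helper mentioned above so that the example has no external dependencies. `parseSystemdLabel` is a hypothetical helper, not a function from this PR; it also returns the parsed value instead of mutating the container.

```go
package main

import (
	"fmt"
	"strconv"
)

// systemdImageLabel is the label named in the PR description.
const systemdImageLabel = "com.netflix.titus.systemd"

// parseSystemdLabel reports whether the image labels request systemd mode.
// A missing label means "not systemd". An unparsable value yields an error
// that names the label and includes the raw value, so its origin is clear.
func parseSystemdLabel(labels map[string]string) (bool, error) {
	raw, ok := labels[systemdImageLabel]
	if !ok {
		return false, nil
	}
	val, err := strconv.ParseBool(raw)
	if err != nil {
		return false, fmt.Errorf("parsing image label %s=%q: %w", systemdImageLabel, raw, err)
	}
	return val, nil
}

func main() {
	cases := []map[string]string{
		{systemdImageLabel: "true"},
		{},
		{systemdImageLabel: "yes please"},
	}
	for _, labels := range cases {
		val, err := parseSystemdLabel(labels)
		fmt.Printf("labels=%v systemd=%v err=%v\n", labels, val, err)
	}
}
```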

// Use image labels to determine if the container should be configured to run SystemD
func setSystemdRunning(imageInfo types.ImageInspect, c *runtimeTypes.Container) error {
l := log.WithField("imageName", c.QualifiedImageName())

Contributor

Why not log the value here, rather than below, so if the value is "corrupted" we can at least figure out what's going on?

Contributor

Also, this logger should descend from the one already in Prepare

}

l.Info("SystemD image label not set: not configuring container to run SystemD")
c.IsSystemD = false
Contributor

Isn't this the default anyway, which makes this assignment redundant?

@@ -910,7 +947,7 @@ func (r *DockerRuntime) Prepare(parentCtx context.Context, c *runtimeTypes.Conta
}

myImageInfo = imageInfo
- return nil
+ return setSystemdRunning(*imageInfo, c)
Contributor

You cannot do it in here because that's a concurrency violation.

Contributor

Since this modifies "c" when other people could be using it
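
One possible way around this, sketched under assumptions: the concurrent inspection step only writes to a local variable, and the shared container is updated after all goroutines have finished. The `container` type, `inspectImage`, and `prepare` below are stand-ins invented for this example and do not reproduce the structure of `Prepare`.

```go
package main

import (
	"fmt"
	"sync"
)

// container is a stand-in for the shared runtime container object.
type container struct {
	IsSystemD bool
}

// inspectImage is a hypothetical placeholder for the image inspection step.
// It returns the image labels and never touches the container.
func inspectImage() map[string]string {
	return map[string]string{"com.netflix.titus.systemd": "true"}
}

// prepare runs the inspection concurrently with other setup work and applies
// the result to c only after every goroutine has finished.
func prepare(c *container) {
	var (
		wg     sync.WaitGroup
		labels map[string]string // written by exactly one goroutine
	)

	wg.Add(2)
	go func() {
		defer wg.Done()
		labels = inspectImage() // no access to c in here
	}()
	go func() {
		defer wg.Done()
		_ = c.IsSystemD // other setup work may read c concurrently
	}()
	wg.Wait() // all goroutine writes are visible after this returns

	// Safe: no other goroutine is using c at this point.
	c.IsSystemD = labels["com.netflix.titus.systemd"] == "true"
}

func main() {
	c := &container{}
	prepare(c)
	fmt.Println("IsSystemD:", c.IsSystemD)
}
```

Running a variant of this under `go run -race` is a quick way to confirm that the write to the container no longer overlaps with concurrent readers.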

@rgulewich
Contributor Author

> I guess we cannot add a full end-to-end test until we get the new Docker, correct?

@sargun - Correct. In the meantime, I added a separate functional test for this.

@rgulewich rgulewich force-pushed the systemd branch 2 times, most recently from ba1f980 to 20f14ca Compare February 18, 2019 22:26
@@ -61,6 +61,7 @@ func NewContainer(taskID string, titusInfo *titus.ContainerInfo, resources *runt
TitusInfo: titusInfo,
Resources: resources,
Env: env,
IsSystemD: false,
Contributor

Isn't false the default? Any reason to set it?
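
For reference, a small sketch of the zero-value behaviour behind this comment. The `container` type and `newContainer` are cut-down stand-ins for this example, not the real `NewContainer`.

```go
package main

import "fmt"

// container is a cut-down stand-in for the runtime container type.
type container struct {
	TaskID    string
	IsSystemD bool
}

// newContainer leaves IsSystemD out of the composite literal on purpose:
// fields that are not listed take their zero value, which is false for bool.
func newContainer(taskID string) *container {
	return &container{TaskID: taskID}
}

func main() {
	c := newContainer("task-1")
	fmt.Println("IsSystemD:", c.IsSystemD)
}
```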

l.Infof("SystemD image label set to %s", systemdBool)
val, err := strconv.ParseBool(systemdBool)
if err != nil {
return errors.Wrap(err, "Error parsing SystemD image label")
Contributor

SystemD -> systemd

l := log.WithField("imageName", c.QualifiedImageName())

if systemdBool, ok := imageInfo.Config.Labels[systemdImageLabel]; ok {
l.Infof("SystemD image label set to %s", systemdBool)
Contributor

Use WithField.

return nil
}

l.Info("SystemD image label not set: not configuring container to run SystemD")
Contributor

See above. Also, does it make sense to log on the negative?

@@ -1699,6 +1728,11 @@ func (r *DockerRuntime) setupPostStartLogDirTiniHandleConnection2(parentCtx cont
return err
}

if err := setCgroupOwnership(parentCtx, c, cred); err != nil {
log.Error("Unable to setup container nesting: ", err)
Contributor

Is this error message correct? Given that we've removed nesting (or at least begun to remove it), shouldn't it be something like "Unable to delegate cgroups"? Also, use WithError.

@@ -13,6 +13,9 @@ import (
"time"
"unsafe"

Contributor

Extra space.

return nil
}

cgroupPath := filepath.Join("/proc/", strconv.FormatInt(int64(cred.pid), 10), "cgroup")
Contributor

strconv.Itoa?
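
A sketch of what the `strconv.Itoa` variant could look like, assuming the PID is an `int32` (the hunk only shows that it is cast with `int64(...)`). The `parseCgroupLine` helper is an addition for illustration: it splits one line of `/proc/<pid>/cgroup` into the `hierarchy-ID:controller-list:cgroup-path` fields documented in cgroups(7), and is not taken from the PR.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strconv"
	"strings"
)

// cgroupFilePath returns /proc/<pid>/cgroup for the given process ID.
// The int32 parameter type is an assumption made for this example.
func cgroupFilePath(pid int32) string {
	return filepath.Join("/proc", strconv.Itoa(int(pid)), "cgroup")
}

// cgroupEntry holds the three colon-separated fields of one line.
type cgroupEntry struct {
	HierarchyID string
	Controllers []string
	Path        string
}

// parseCgroupLine splits "hierarchy-ID:controller-list:cgroup-path".
// The controller list is empty for the cgroup v2 entry ("0::/path").
func parseCgroupLine(line string) (cgroupEntry, error) {
	parts := strings.SplitN(line, ":", 3)
	if len(parts) != 3 {
		return cgroupEntry{}, fmt.Errorf("malformed cgroup line %q", line)
	}
	var controllers []string
	if parts[1] != "" {
		controllers = strings.Split(parts[1], ",")
	}
	return cgroupEntry{HierarchyID: parts[0], Controllers: controllers, Path: parts[2]}, nil
}

func main() {
	fmt.Println(cgroupFilePath(1234))

	for _, line := range []string{"4:cpu,cpuacct:/docker/abc", "0::/init.scope", "garbage"} {
		entry, err := parseCgroupLine(line)
		fmt.Printf("%q -> %+v err=%v\n", line, entry, err)
	}
}
```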
