Systemd support #211
Systemd support #211
Conversation
Codecov Report
@@ Coverage Diff @@
## master #211 +/- ##
==========================================
+ Coverage 35.65% 35.82% +0.16%
==========================================
Files 71 71
Lines 9031 9061 +30
==========================================
+ Hits 3220 3246 +26
+ Misses 5412 5409 -3
- Partials 399 406 +7
|
Pull Request Test Coverage Report for Build 3233
💛 - Coveralls |
99fde5e
to
4bb18a7
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the only functional bug in here is the concurrency bug with inspecting the image and modifying a shared object. The rest are stylistic.
I guess we cannot add a full end-to-end test until we get the new Docker, correct?
@@ -58,9 +59,12 @@ func setupAdditionalCapabilities(c *runtimeTypes.Container, hostCfg *container.H | |||
if c.TitusInfo.GetAllowNestedContainers() { | |||
apparmorProfile = "docker-nested" | |||
seccompProfile = "nested-container.json" | |||
c.Env["TINI_UNSHARE"] = trueString |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can rip out the nested containers code if you want?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are we never going to use it again? I wasn't sure.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We might, but for now, it's cruft that's not supported, and should never be set. Probably worth just getting rid of the code and throwing an error if we find it's set to true.
executor/runtime/docker/docker.go
Outdated
@@ -468,6 +471,20 @@ func (r *DockerRuntime) dockerConfig(c *runtimeTypes.Container, binds []string, | |||
// Maybe set cfs bandwidth has to be called _after_ | |||
maybeSetCFSBandwidth(r.dockerCfg.cfsBandwidthPeriod, c, hostCfg) | |||
|
|||
// Always setup tmpfs: it's needed to ensure Metatron credentials don't persist across reboots and for SystemD to work | |||
tmpFsSize := int64(defaultRunTmpFsSize) | |||
if hostCfg.Memory < tmpFsSize { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we need this. It should never happen, since container memory should never be so low. In fact, if system memory = tmpfs memory, aren't we effectively leaking double the RAM?
We "trust" the master to do validation on resource dimensions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this would also simplify the code a little bit, because then you could just hard code the below sizes, rather than fmt.Sprintf'ing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, that's fine. When you pass in tmpfs volumes to docker at create time, the used tmpfs bytes count toward the cgroup's limit, so it doesn't double count.
executor/runtime/docker/docker.go
Outdated
if systemdBool, ok := imageInfo.Config.Labels[systemdImageLabel]; ok { | ||
val, err := strconv.ParseBool(systemdBool) | ||
if err != nil { | ||
return err |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
wrap this error so we know where it's coming from (IIRC, the function is errors.wrapf
// Use image labels to determine if the container should be configured to run SystemD | ||
func setSystemdRunning(imageInfo types.ImageInspect, c *runtimeTypes.Container) error { | ||
l := log.WithField("imageName", c.QualifiedImageName()) | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not log the value here, rather than below, so if the value is "corrupted" we can at least figure out what's going on?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, this logger should descend from the one already in Prepare
executor/runtime/docker/docker.go
Outdated
} | ||
|
||
l.Info("SystemD image label not set: not configuring container to run SystemD") | ||
c.IsSystemD = false |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Isn't this the default anyways, so this is redundant?
executor/runtime/docker/docker.go
Outdated
@@ -910,7 +947,7 @@ func (r *DockerRuntime) Prepare(parentCtx context.Context, c *runtimeTypes.Conta | |||
} | |||
|
|||
myImageInfo = imageInfo | |||
return nil | |||
return setSystemdRunning(*imageInfo, c) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You cannot do it in here because that's a concurrency violation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since this modifies "c" when other people could be using it
@sargun - Correct. In the meantime, I added a separate functional test for this. |
f742d1a
to
be0ef7c
Compare
For all containers: - Mount /run as tmpfs (default size 128 MiB) For systemd labeled containers (those running and image with the `com.netflix.titus.systemd` label set to "true"): - Mount `/run/lock` as its own tmpfs mount - Tini exec's the container's init command so that it runs as pid 1 - Run them using the standard apparmor and seccomp profiles: no CAP_SYS_ADMIN requirement Other notes: - This requires that cgroup namespaces are enabled in docker, otherwise the systemd container will fail to come up due to not being able to create new cgroups. - Move to Bionic for the systemd test image: the version of systemd that ships with it is able to start without CAP_SYS_ADMIN
ba1f980
to
20f14ca
Compare
executor/runtime/container.go
Outdated
@@ -61,6 +61,7 @@ func NewContainer(taskID string, titusInfo *titus.ContainerInfo, resources *runt | |||
TitusInfo: titusInfo, | |||
Resources: resources, | |||
Env: env, | |||
IsSystemD: false, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Isn't false the default, any reason to set it?
executor/runtime/docker/docker.go
Outdated
l.Infof("SystemD image label set to %s", systemdBool) | ||
val, err := strconv.ParseBool(systemdBool) | ||
if err != nil { | ||
return errors.Wrap(err, "Error parsing SystemD image label") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
SystemD -> systemd
executor/runtime/docker/docker.go
Outdated
l := log.WithField("imageName", c.QualifiedImageName()) | ||
|
||
if systemdBool, ok := imageInfo.Config.Labels[systemdImageLabel]; ok { | ||
l.Infof("SystemD image label set to %s", systemdBool) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use withfield.
executor/runtime/docker/docker.go
Outdated
return nil | ||
} | ||
|
||
l.Info("SystemD image label not set: not configuring container to run SystemD") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See above. Also, does it make sense to log on the negative?
executor/runtime/docker/docker.go
Outdated
@@ -1699,6 +1728,11 @@ func (r *DockerRuntime) setupPostStartLogDirTiniHandleConnection2(parentCtx cont | |||
return err | |||
} | |||
|
|||
if err := setCgroupOwnership(parentCtx, c, cred); err != nil { | |||
log.Error("Unable to setup container nesting: ", err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this error message correct, given we've removed nesting (or at least begun to remove it), shouldn't it be something like "Unable to delegate cgroups?"? also, use withErr
@@ -13,6 +13,9 @@ import ( | |||
"time" | |||
"unsafe" | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Extra space.
return nil | ||
} | ||
|
||
cgroupPath := filepath.Join("/proc/", strconv.FormatInt(int64(cred.pid), 10), "cgroup") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
stronv.itoa?
For all containers:
For systemd labeled containers (those running and image with the
com.netflix.titus.systemd
label set to "true"):/run/lock
as its own tmpfs mountCAP_SYS_ADMIN requirement
Other notes:
the systemd container will fail to come up due to not being able to
create new cgroups.
ships with it is able to start without CAP_SYS_ADMIN