
Multi-Layered Docker Images #47411

Merged
merged 5 commits into from Oct 1, 2018

10 participants
@grahamc
Member

grahamc commented Sep 26, 2018

Create a many-layered Docker Image.

Implements much less than buildImage:

  • Doesn't support specific uids/gids
  • Doesn't support running commands after building
  • Doesn't require qemu
  • Doesn't create mutable copies of the files in the path
  • Doesn't support parent images

If you want those features, I recommend using buildImage.

Notably, it does support:

  • Caching low-level, common paths based on a graph traversal
    algorithm; see later in this description.

  • Configurable number of layers. If you're not using AUFS or not
    extending the image, you can specify a larger number of layers at
    build time:

    pkgs.dockerTools.buildLayeredImage {
      name = "hello";
      maxLayers = 128;
      config.Cmd = [ "${pkgs.gitFull}/bin/git" ];
    };
    
  • Parallelized creation of the layers, improving build speed.

  • The contents of the image include the closure of the configuration,
    so you don't have to specify paths in both contents and config.

    With buildImage, paths referred to by the config were not included
    automatically in the image. Thus, if you wanted to call Git, you
    had to specify it twice:

    pkgs.dockerTools.buildImage {
      name = "hello";
      contents = [ pkgs.gitFull ];
      config.Cmd = [ "${pkgs.gitFull}/bin/git" ];
    };
    

    buildLayeredImage on the other hand includes the runtime closure of
    the config when calculating the contents of the image:

    pkgs.dockerTools.buildLayeredImage {
      name = "hello";
      config.Cmd = [ "${pkgs.gitFull}/bin/git" ];
    };
    

Minor Problems

  • If any of the store paths change, every layer will be rebuilt in
    the nix-build. However, because the layers are bit-for-bit
    reproducible, when these images are loaded into Docker they will
    match existing layers and not be imported or uploaded twice.
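This deduplication works because Docker identifies a layer by the digest of its bytes, not by where it came from. A minimal sketch of the idea (illustrative only, not code from this PR or from Docker itself):

```python
import hashlib

def layer_id(tar_bytes: bytes) -> str:
    """A layer's identity is just the digest of its tarball, so
    bit-for-bit identical layers get identical IDs."""
    return "sha256:" + hashlib.sha256(tar_bytes).hexdigest()

# Two reproducible builds of the same store path produce the same bytes,
# hence the same layer ID -- Docker stores and uploads it only once.
assert layer_id(b"store path contents") == layer_id(b"store path contents")
```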

Common Questions

  • Aren't Docker layers ordered?

    No. People who have used a Dockerfile before assume Docker's
    layers are inherently ordered. However, this is not true -- Docker
    layers are content-addressable and are not explicitly ordered until
    they are composed into an image.

  • What happens if I have more than maxLayers of store paths?

    The first (maxLayers-2) most "popular" paths will have their own
    individual layers, then layer #(maxLayers-1) will contain all the
    remaining "unpopular" paths, and finally layer #(maxLayers) will
    contain the Image configuration.
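That overflow rule can be sketched as follows (a hypothetical helper, not the PR's actual implementation; paths are assumed to be pre-sorted most popular first):

```python
def assign_layers(paths_by_popularity, max_layers):
    """Sketch of the maxLayers overflow rule described above."""
    # The (maxLayers - 2) most popular paths each get their own layer.
    individual = paths_by_popularity[:max_layers - 2]
    remainder = paths_by_popularity[max_layers - 2:]
    layers = [[p] for p in individual]
    # Layer #(maxLayers - 1): all remaining "unpopular" paths together.
    if remainder:
        layers.append(remainder)
    # Layer #(maxLayers): the image configuration.
    layers.append(["<image configuration>"])
    return layers

# With maxLayers = 4 and five store paths, sorted most popular first:
assign_layers(["F", "E", "D", "C", "B"], 4)
# -> [['F'], ['E'], ['D', 'C', 'B'], ['<image configuration>']]
```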

Popularity Contest Algorithm

Using a simple algorithm, convert the references to a path into a
sorted list of dependent paths based on how often they're referenced
and how deep in the tree they live. Equally-"popular" paths are then
sorted by name.

The existing writeReferencesToFile prints the paths in a simple
ASCII-based sort order.

Sorting the paths by graph improves the chances that the differences
between two builds appear near the end of the list, instead of near
the beginning. This makes a difference for Nix builds which export a
closure for another program to consume, if that program implements its
own level of binary diffing.

For example, consider Docker images. If each store path is a separate
layer then Docker images can be very efficiently transferred between
systems, and we get very good cache reuse between images built with
the same version of Nixpkgs. However, since Docker only reliably
supports a small number of layers (42), it is important to pick the
individual layers carefully. By storing very popular store paths in
the first 40 layers, we improve the chances that the next Docker
image will share many of those layers.

Given the dependency tree:

A - B - C - D -\
 \   \   \      \
  \   \   \      \
   \   \ - E ---- F
    \- G

Nodes which have multiple references are duplicated:

A - B - C - D - F
 \   \   \
  \   \   \- E - F
   \   \
    \   \- E - F
     \
      \- G

Each leaf node is now replaced by a counter defaulted to 1:

A - B - C - D - (F:1)
 \   \   \
  \   \   \- E - (F:1)
   \   \
    \   \- E - (F:1)
     \
      \- (G:1)

Then each leaf counter is merged with its parent node, replacing the
parent node with a counter of 1, and each existing counter being
incremented by 1. That is to say - D - (F:1) becomes - (D:1, F:2):

A - B - C - (D:1, F:2)
 \   \   \
  \   \   \- (E:1, F:2)
   \   \
    \   \- (E:1, F:2)
     \
      \- (G:1)

Then each leaf counter is merged with its parent node again, merging
any counters, then incrementing each:

A - B - (C:1, D:2, E:2, F:5)
 \   \
  \   \- (E:1, F:2)
   \
    \- (G:1)

And again:

A - (B:1, C:2, D:3, E:4, F:8)
 \
  \- (G:1)

And again:

(A:1, B:2, C:3, D:4, E:5, F:9, G:2)

and then paths have the following "popularity":

A     1
B     2
C     3
D     4
E     5
F     9
G     2

and the popularity contest would result in the paths being printed as:

F
E
D
C
B
G
A
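
The counting scheme above can be sketched in Python. The names popularity and order_by_popularity are my own; the PR implements this as referencesByPopularity:

```python
from collections import Counter

def popularity(graph, node):
    """Counting scheme described above, for an acyclic reference graph.
    A leaf counts as 1; a parent merges its children's counters (summing
    shared paths), bumps each merged count by 1, then adds itself at 1."""
    children = graph.get(node, [])
    if not children:
        return Counter({node: 1})
    merged = Counter()
    for child in children:
        merged.update(popularity(graph, child))  # sum counters across subtrees
    bumped = Counter({path: n + 1 for path, n in merged.items()})
    bumped[node] = 1
    return bumped

def order_by_popularity(graph, root):
    counts = popularity(graph, root)
    # Most popular first; ties broken by name.
    return [p for p, _ in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))]

# The dependency tree from the example above:
graph = {"A": ["B", "G"], "B": ["C", "E"], "C": ["D", "E"],
         "D": ["F"], "E": ["F"]}
order_by_popularity(graph, "A")  # -> ['F', 'E', 'D', 'C', 'B', 'G', 'A']
```

Note how the tie between B and G (both with popularity 2) is resolved alphabetically, matching the printed order above.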

Things done
  • Tested using sandboxing (nix.useSandbox on NixOS, or option sandbox in nix.conf on non-NixOS)
  • Built on platform(s)
    • NixOS
    • macOS
    • other Linux distributions
  • Tested via one or more NixOS test(s) if existing and applicable for the change (look inside nixos/tests)
  • Tested compilation of all pkgs that depend on this change using nix-shell -p nox --run "nox-review wip"
  • Tested execution of all binary files (usually in ./result/bin/)
  • Determined the impact on package closure size (by running nix path-info -S before and after)
  • Fits CONTRIBUTING.md.

This PR was written while working for Target.

cc @adelbertc @shlevy @stew

graham-at-target added some commits Sep 25, 2018

referencesByPopularity: init to sort packages by a cachability heuristic
dockerTools.buildLayeredImage: init

@grahamc grahamc requested a review from nlewo Sep 26, 2018

@grahamc
Member

grahamc commented Sep 26, 2018

Docs rendered:
image


@srhb
Contributor

srhb commented Sep 27, 2018

I'm not sure I understand exactly how your "automatic closure inclusion" differs from dockerTools.buildImage, and the example doesn't really clear it up.

Using buildImage and starting a container from this image outputs "Hello, world!" as expected:

pkgs.dockerTools.buildImage {
  name = "hello";
  config.Cmd = [ "${pkgs.hello}/bin/hello" ];
}

Also, adding it to contents just creates the moral equivalent of a buildEnv in /

@grahamc
Member

grahamc commented Sep 27, 2018

Great catch, @srhb -- fixed up my docs. Not sure where I got that idea.

@vdemeester

Looks really promising… gotta find some time to try it out though 👼
@grahamc I wonder if at some point we could/should build OCI images (that said… docker cannot import oci image as of today 😓)

@nlewo
Member

nlewo commented Sep 27, 2018

@grahamc It would be nice to add a basic test (build and run a layered image) in nixos/tests/docker-tools.nix. Otherwise, LGTM.

@vdemeester Docker cannot, but Skopeo can convert an OCI image to a Docker image! I already tried a bit to use Buildah to build an image instead of our shell scripts, but it's still WIP...

@vdemeester
Contributor

vdemeester commented Sep 27, 2018

@nlewo there is also umoci to build OCI images. But yeah, forgot about using skopeo for that use case 👼. I wanted to investigate using it to simplify the scripts too 😝

@roberth
Contributor

roberth commented Sep 27, 2018

Nice work @grahamc!
I wonder how it performs for various kinds of images, depending on languages used, use of data dependencies, etc. Do you have any info about this?

It seems to me that small packages might be a problem if there are too many, but it may not be a problem in practice. If it is, there may be another heuristic to cluster the small ones in a clever way.
Well, that's just me speculating.

It looks very compelling!

@grahamc
Member

grahamc commented Sep 27, 2018

It would be nice to add a basic test (build and run a layered image) in nixos/tests/docker-tools.nix.

Test added!

It seems to me that small packages might be a problem if there are too many, but it may not be a problem in practice. If it is, there may be another heuristic to cluster the small ones in a clever way.

Unfortunately, trying to combine paths makes it much less likely that different images would share anything. Small packages don't seem to be much of a problem or cause issues.

I've tested this with images containing Python, Fortran, Java, Haskell, PHP, Bash... sometimes all at once :)

@adisbladis
Member

adisbladis commented Sep 28, 2018

It seems to me that small packages might be a problem if there are too many, but it may not be a problem in practice. If it is, there may be another heuristic to cluster the small ones in a clever way.
Well, that's just me speculating.

I did run into this issue building images based on nodejs with ~800 packages in one image.
I'm thinking that this PR is a good general approach to the problem, but we may need to take a different approach in some pathological cases.


@nlewo

nlewo approved these changes Sep 28, 2018

@grahamc
Member

grahamc commented Sep 28, 2018

I did run into this issue building images based on nodejs with ~800 packages in one image.

I tried building images with large numbers of dependencies, like quassel-webserver and azure-cli. Not 800, but a large number nonetheless.

I found that base layers were well selected: glibc, nodejs, ncurses, etc. I think if you want greater control over what layers are created, buildImage is for you.

@nlewo nlewo merged commit 56b4db9 into NixOS:master Oct 1, 2018

8 checks passed

  • grahamcofborg-eval ^.^!
  • grahamcofborg-eval-check-meta config.nix: checkMeta = true
  • grahamcofborg-eval-nixos-manual nix-instantiate ./nixos/release.nix -A manual
  • grahamcofborg-eval-nixos-options nix-instantiate ./nixos/release.nix -A options
  • grahamcofborg-eval-nixpkgs-manual nix-instantiate ./pkgs/top-level/release.nix -A manual
  • grahamcofborg-eval-nixpkgs-tarball nix-instantiate ./pkgs/top-level/release.nix -A tarball
  • grahamcofborg-eval-nixpkgs-unstable-jobset nix-instantiate ./pkgs/top-level/release.nix -A unstable
  • grahamcofborg-eval-package-list nix-env -qa --json --file .
@nlewo
Member

nlewo commented Oct 1, 2018

Thanks!

@moretea
Contributor

moretea commented Oct 1, 2018

Great! I was not aware that there was a practical limit on the number of layers in an image.

I wrote https://github.com/ContainerSolutions/nixpkgs-overlay/blob/master/docker-tools.nix last April. It differs by building the layers as separate derivations that can be cached by Nix. That might actually speed up building images quite a bit, by doing the minimal amount of work possible.

It also offers two types of outputs: a directory of image layers, and a final tar file that is accepted by docker load.
My plan is to write a simple script that will upload the layers from this layers directory to a Docker Registry directly soonish.

@graham-at-target
Contributor

graham-at-target commented Oct 1, 2018

Indeed, a hard limit at 125: https://github.com/moby/moby/blob/b3e9f7b13b0f0c414fa6253e1f17a86b2cff68b5/layer/layer_store.go#L23-L26

It looks like your code there uses IFD. Is that right?

@graham-at-target graham-at-target deleted the graham-at-target:multi-layered-images-crafted branch Oct 1, 2018

@moretea
Contributor

moretea commented Oct 1, 2018

Hm. I only tried small programs so far ;) Too bad that Docker has this limitation...

My code also only relies on writeReferencesToFile. See about here.

@graham-at-target
Contributor

graham-at-target commented Oct 1, 2018

Ah, yeah, unfortunately that does do a build and then imports it, so we can't use it in Nixpkgs.

@Nadrieril
Contributor

Nadrieril commented Oct 1, 2018

@graham-at-target Cool work! I just read your blog post on this, and I was wondering: would it be possible to merge some of the very common layers together if they are often used together?
For example, I would guess most images will want bash, so having the first layer be bash plus its deps would save a few layers compared to having bash split into glibc/ncurses/readline. This wouldn't be much worse for cache hits, since most people who want glibc also want bash. In a big project, the few layers saved that way can be used to cache more dependencies, which might significantly improve build time when e.g. changing only the app code.
Finding an algorithm that computes this seems hard though, in particular since it doesn't seem to be guessable from a single project.
Does that make sense? I'm guessing that your solution is already a significant improvement; I might just be nitpicking.

@adisbladis
Member

adisbladis commented Oct 2, 2018

@Nadrieril I had the same thought: it's probably better to explicitly cut the image into layers at certain known points like nodejs, python3, glibc etc.

It's by no means done since I haven't had much time yet; here is the current progress: https://gist.github.com/adisbladis/777c7d8240be35faa107bc8d6c869a9f.

I plan to add some minor features and port all the Python code to Nix before I consider it done.

