Complete build on "wide" public cloud? #30525

Open
lukego opened this issue Oct 18, 2017 · 6 comments
Labels: 2.status: stale (https://github.com/NixOS/nixpkgs/blob/master/.github/STALE-BOT.md)

Comments

@lukego (Contributor) commented Oct 18, 2017

Is there a well-known way to build large nix expressions (e.g. whole nixpkgs) on "wide" public cloud compute resources? For example, to run each individual build in a separate Google Compute Engine preemptible VM instance, with thousands of VMs in total. Parallelism would be bounded only by the number of preemptible instances that are available on the cloud at the time and the dependencies between nix derivations.

I am curious whether solutions for this already exist and how heavyweight they are (e.g. using only basic Nix commands, requiring NixOps, or requiring Hydra).

Generally musing that, with preemptible VMs now being billed in one-second increments, it may be time to consider a 1:1 scheduler from builds to machines, both for simplicity (outsourcing scheduling to the cloud provider) and for performance (more elastic capacity for lower build latency).
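For a sense of what the unit of work would be, here is a minimal shell sketch (the hello attribute and <nixpkgs> path are just placeholders): every .drv file in an expression's build-time closure is an independent build that could, in principle, be scheduled onto its own VM.

    # Instantiate one attribute to get its top-level derivation (attribute is a placeholder).
    drv=$(nix-instantiate '<nixpkgs>' -A hello)

    # Every .drv in the build-time closure is one potential unit of work for a
    # 1:1 build-to-machine scheduler.
    nix-store --query --requisites "$drv" | grep '\.drv$'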

@edolstra (Member) commented

Even with 1-second billing, it's still advantageous to reuse instances, to avoid downloading the same dependencies over and over again. Hydra doesn't currently schedule builds in a way that minimizes downloads, but there will be a lot of overlap between dependencies anyway (e.g. stdenv will be the same for most builds).

hydra.nixos.org uses a script called hydra-provisioner that creates EC2 spot instances depending on the current queue size. hydra-provisioner uses NixOps, but any provisioning method will do; the only requirement is that the machines are reachable via SSH and have Nix installed (doesn't even have to be NixOS). The use of NixOps is not ideal for 1-second billing because the deployment can take a few minutes. It would be better to pre-bake an AMI with the required configuration and to use spot fleet to spin up instances.
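To illustrate that minimal requirement: with a recent Nix, plain distributed builds can already target such machines through a machines file, no Hydra or NixOps needed. A rough sketch, where the hostnames, user, and key path are placeholders:

    # Describe the provisioned instances as plain Nix remote builders
    # (URI, system type, SSH key, max parallel jobs, speed factor).
    cat > /etc/nix/machines <<'EOF'
    ssh://nixbld@build1.example.com x86_64-linux /root/.ssh/id_ed25519 8 1
    ssh://nixbld@build2.example.com x86_64-linux /root/.ssh/id_ed25519 8 1
    EOF

    # --max-jobs 0 forbids local builds, so everything goes to the remote machines.
    nix-build '<nixpkgs>' -A hello --max-jobs 0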

@edolstra (Member) commented

If we're really talking about thousands of machines, then Hydra is not appropriate because of its centralized architecture; scaling to that size would require a different design. E.g. each worker node independently fetches work items from a queue, where a work item is something like "build the foo attribute from repo git://bar", and independently uploads the results to S3.
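A sketch of what one worker iteration could look like, assuming a hypothetical fetch-work-item command that yields a repository URL, revision and attribute, an S3 bucket name that is purely a placeholder, and a Nix new enough to have nix copy with the S3 store:

    # One work item: repository URL, revision and attribute to build
    # (fetch-work-item is hypothetical; its output format is an assumption).
    read -r repo rev attr < <(fetch-work-item)

    # Check out the requested revision and build the attribute locally.
    git clone "$repo" work && git -C work checkout "$rev"
    out=$(nix-build work -A "$attr")

    # Upload the result closure to an S3 binary cache (bucket is a placeholder;
    # cache signing configuration omitted).
    nix copy --to 's3://example-nix-cache' "$out"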

@lukego (Contributor, Author) commented Oct 19, 2017

Thanks for playing along with the thought experiment.

Yes, it seems like a different architecture would be needed: one machine evaluates a Nix expression to find the set of required derivations, that dependency tree is stored in a common location along with the status of each derivation, and a set of build slaves "work-steal" derivations whose dependencies are already built. Right?
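A crude sketch of that evaluation step, assuming a placeholder ./release.nix with an all attribute: dump each required derivation together with its direct .drv dependencies and an initial status, which could seed whatever shared table the build slaves steal work from.

    # Evaluate once on a single machine (expression and attribute are placeholders).
    top=$(nix-instantiate ./release.nix -A all)

    # Emit "<derivation> pending <direct .drv dependencies>" records that could
    # seed a shared status table for the build slaves to steal work from.
    for d in $(nix-store --query --requisites "$top" | grep '\.drv$'); do
      deps=$(nix-store --query --references "$d" | grep '\.drv$' | tr '\n' ' ')
      printf '%s pending %s\n' "$d" "$deps"
    done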

@lukego (Contributor, Author) commented Jan 24, 2018

Here is a second iteration of this idea. This is based on a braindump over at lukego/live#22.

Goal: Build a set of benchmarks in parallel using cloud resources. The benchmarks are plain Nix derivations but they are expected to be numerous (thousands), to be independent (no build dependencies from one benchmark to another), and to have similar dependencies.

Solution:

  1. Build a NixOS image with a nix store containing the union of the dependencies of all benchmarks.
  2. Create a Google Pub/Sub message queue with one event per benchmark that needs to run.
  3. Execute builds for the events automatically using a Google Managed Instance Group with a suitable instance type ("Scaling Based on a Queue-Based Workload").

This way the software dependencies are built locally, the GCE instances boot with a pre-populated Nix store, and the Google infrastructure is responsible for starting and stopping VM instances based on, e.g., a target 2:1 ratio of pending builds to VMs. The results of the benchmark derivations would need to be scavenged back into our own Nix store, e.g. via a Pub/Sub stream of results.
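For step 2, the queue could be seeded from the evaluated attribute names. A rough sketch, where the topic name and the shape of benchmarks.nix are assumptions and the exact gcloud syntax may differ between versions:

    # Create the topic once (name is a placeholder).
    gcloud pubsub topics create benchmark-builds

    # Publish one message per benchmark attribute from the evaluated expression.
    nix-instantiate --eval --strict --json \
      -E 'builtins.attrNames (import ./benchmarks.nix {})' \
      | jq -r '.[]' \
      | while read -r attr; do
          gcloud pubsub topics publish benchmark-builds --message="$attr"
        done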

This would hopefully execute the full benchmark campaign in O(log n) time and it may be possible to improve this to O(1) depending on the auto-scaling policy. This could be attractive with a basic price of $0.02 per hour for 2-vCPU instances (suitable for single-core benchmarks) and per-second billing increments.

Risks:

  • Can GCE be depended on to supply ~1000 small VM instances at short notice?

I also considered the new Hetzner Cloud. This is about half the cost of GCE but does not seem suitable due to coarse per-hour billing increments. The whole idea here is to reduce hours/days/weeks of CPU-time down to minutes of wall-clock time and the billing model would need to support this.

Whaddayareckon?

@lukego (Contributor, Author) commented Jan 24, 2018

Here is a third iteration that's intended to break the dependency on Google Compute Engine:

  1. Evaluate benchmarks.nix to get the set of derivations to build in parallel.
  2. Build a NixOS image with all prerequisites in the store.
  3. Statically partition (shard) the derivations into N subsets.
  4. Use NixOps to deploy VMs build1..buildN, each of which builds one of the subsets.
  5. Poll the VMs. Once a VM's builds have all finished, recover the results into our nix store (see the sketch below). If a VM failed to start - or was killed via preemption - then start it again.
  6. Finish when every VM has successfully completed its builds.

That should work on any cloud provider supported by NixOps, and it should be O(1) rather than O(log n) because the number of VMs is not scaled down prematurely as the queue shrinks.
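A sketch of the recovery half of step 5, assuming each VM registers its finished results as GC roots under a known directory; the hostnames and the GC-root path are placeholders:

    # Pull finished result closures back from each build VM into the local store.
    N=16   # number of shards/VMs; placeholder
    for i in $(seq 1 "$N"); do
      host="build$i.example.com"   # placeholder hostname
      # Resolve the store paths the VM registered as roots for its results.
      paths=$(ssh "root@$host" 'readlink -f /nix/var/nix/gcroots/benchmark-results/*')
      # $paths is deliberately unquoted so each path becomes its own argument.
      nix-copy-closure --from "root@$host" $paths
    done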

@stale (bot) commented Jun 5, 2020

Thank you for your contributions.

This has been automatically marked as stale because it has had no activity for 180 days.

If this is still important to you, we ask that you leave a comment below. Your comment can be as simple as "still important to me". This lets people see that at least one person still cares about this. Someone will have to do this at most twice a year if there is no other activity.

Here are suggestions that might help resolve this more quickly:

  1. Search for maintainers and people that previously touched the related code and @ mention them in a comment.
  2. Ask on the NixOS Discourse.
  3. Ask on the #nixos channel on irc.freenode.net.

stale bot added the 2.status: stale label (https://github.com/NixOS/nixpkgs/blob/master/.github/STALE-BOT.md) on Jun 5, 2020