Commit 7701bc4 — docs: Propose build batching for Bazel.
sethnelson-tecton committed May 21, 2024
1 changed file: docs-v2/design_proposals/concurrent-bazel-builds.md (138 additions, 0 deletions)

# Concurrent Bazel Builds

* Author(s): Seth Nelson
* Design Shepherd:
* Date: 2024/05/21
* Status: [Reviewed/Cancelled/Under implementation/Complete]

## Background

Bazel supports a high degree of concurrency in its action graph model; however, Skaffold cannot currently take
advantage of that concurrency to build multiple targets, because it is modeled around a separate builder
invocation per artifact. This does not play nicely with Bazel's workspace lock model: invoking multiple builds
in parallel as separate processes forces them to serialize. This effectively renders Skaffold's `--build-concurrency`
flag ineffective for Bazel (see https://github.com/GoogleContainerTools/skaffold/issues/6047) and results in longer
overall build times than if the action execution of multiple targets could interleave.

This document proposes changes to Skaffold that will enable multiple targets to be built in a single Bazel invocation.


## Design

### Where to break the "one artifact per build" abstraction

As part of this design, I considered three places to introduce the concept of multi-artifact builds:

1. For all builds - in the `builder_mux`, group builds before scheduling.
2. For the local builder pipeline - allow any underlying artifactBuilder to implement a `BuildMultiple` method,
and wrap that in a `BatchingBuilder` which can queue up multiple incoming builds before sending groups into `BuildMultiple`.
3. For Bazel only, where batching happens as an implementation detail within `bazel/build.go`.

Ultimately, this design proposes #3.

* While #1 results (in my opinion) in the cleanest abstraction model (after all, `builder_mux` is the demux point for
  turning a list of artifacts into a set of builds to be scheduled), it was rejected due to its invasiveness and
  complexity. Implementing this at the `builder_mux` level would fork a large number of methods into multi-artifact
  variants; it would also introduce significant scheduling complexity (to avoid cycles introduced by batching build
  targets). That complexity did not feel warranted for a feature that is likely to be used only by Bazel.
* #2 was rejected because it would (continue to) incorrectly couple the compilation phase with the load/push phase of
  Bazel's `artifactBuilder` implementation. We only want to group the actual `bazel build` invocation, not the
  subsequent `docker.Push`/`b.loadImage` calls that are also part of the `Build` implementation. Batching at the
  `Build` implementation level would (a) reduce the speed gains, because the loads/pushes would be serialized, and
  (b) add implementation complexity, because we would need to handle `BatchingBuilder`s with different
  `artifactBuilder` constructor parameters (such as `loadImages`).


### Design Overview

This design proposes implementing build batching for Bazel builds by altering the Bazel builder to build multiple
TARs in a single invocation, and by multiplexing contemporaneous builds (i.e. waiting for a period of time for build
targets to "collect" and then building them all together).

* This changes no existing externally-visible methods/interfaces.
* This does not _require_ config changes, though we could consider (a) initially hiding this functionality behind a configuration flag for rollout safety, and (b) making the batch window configurable.
* This design naturally respects the `--build-concurrency` flag, which limits how many builds will be scheduled contemporaneously.

### Design Elements

#### Bazel Builder can build multiple TARs

This is an in-place change: we replace `buildTar` with `buildTars`, which builds up a list of `buildTargets` and produces them in a single `bazel build` invocation, sketched below.
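
A minimal sketch of what the changed entry point might look like. The receiver, signature, and direct `exec`-based invocation are illustrative assumptions, not the POC implementation; only the general shape (one `bazel build` per batch) is taken from this proposal.

```go
// Illustrative only: build every artifact in a batch with one `bazel build`.
// All artifacts in a batch share identical build args and platform flags,
// which the computeBatchKey function described below guarantees.
func (b *Builder) buildTars(ctx context.Context, out io.Writer, workspace string,
	artifacts []*latest.BazelArtifact) (map[*latest.BazelArtifact]string, error) {
	args := []string{"build"}
	args = append(args, artifacts[0].BuildArgs...)
	for _, a := range artifacts {
		args = append(args, a.BuildTarget)
	}
	// Run bazel once for the whole batch, streaming output to the shared writer.
	cmd := exec.CommandContext(ctx, "bazel", args...)
	cmd.Dir = workspace
	cmd.Stdout, cmd.Stderr = out, out
	if err := cmd.Run(); err != nil {
		return nil, fmt.Errorf("running bazel build: %w", err)
	}
	tars := make(map[*latest.BazelArtifact]string, len(artifacts))
	// ... locate each produced .tar under bazel-bin and record its path in tars,
	// exactly as the single-target implementation does today ...
	return tars, nil
}
```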

#### computeBatchKey function

A `computeBatchKey` function returns the same value when two different builds can be grouped together (a sketch follows
the list below). Elements of the Bazel batch key include:
* The sorted list of build args
* The (currently, single) bazel platform flag
* The artifact's workspace
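
A minimal sketch of how such a key could be computed, assuming the `BuildArgs` field on `latest.BazelArtifact` and a simple string concatenation; the real implementation could just as well hash these elements.

```go
// Illustrative only: two builds may share a bazel invocation iff their keys match.
func computeBatchKey(a *latest.BazelArtifact, workspace string, platforms platform.Matcher) string {
	args := append([]string{}, a.BuildArgs...)
	sort.Strings(args)
	return strings.Join([]string{
		strings.Join(args, " "),      // the sorted list of build args
		fmt.Sprintf("%v", platforms), // the (currently, single) bazel platform flag
		workspace,                    // the artifact's workspace
	}, "|")
}
```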

#### batchingBuilder

A `batchingBuilder` wraps an underlying multi-artifact build function (the `buildTars` function) and maintains a map of
not-yet-started `MultiArtifactBuild` objects, keyed by batch key.

```go
type batchingBuilder struct {
	builder   multiJarBuilder
	batchSize time.Duration
	builds    map[string]*MultiArtifactBuild
	mu        sync.Mutex
}
```

```go
type MultiArtifactBuild struct {
	ctx       context.Context
	artifacts []*latest.Artifact
	platforms platform.Matcher
	workspace string
	mu        sync.Mutex
	doneCond  chan bool
	running   bool
	out       io.Writer
	digest    map[*latest.BazelArtifact]string
	err       error
}
```

When the `batchingBuilder` is asked to build an artifact, it first checks whether there is an existing, not-yet-started
build for the artifact's batch key (see the sketch after this list).
* If there is, it simply adds the artifact to that build's `artifacts` list.
* If there is not, it creates a new `MultiArtifactBuild` and kicks off a goroutine that waits `batchSize` before
  invoking its `multiJarBuilder`.
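
A minimal sketch of that flow, assuming an illustrative `Build` signature and a `multiJarBuilder` function type that takes the whole `MultiArtifactBuild` and fills in its `digest` map (in practice, a thin wrapper around `buildTars`); the POC's actual signatures may differ.

```go
// Illustrative only; not the POC implementation.
type multiJarBuilder func(*MultiArtifactBuild) error

func (b *batchingBuilder) Build(ctx context.Context, out io.Writer, a *latest.Artifact,
	platforms platform.Matcher, workspace string) (string, error) {
	key := computeBatchKey(a.BazelArtifact, workspace, platforms)

	b.mu.Lock()
	batch, ok := b.builds[key]
	if !ok {
		// No open batch for this key: create one and schedule it to run once
		// the batching window elapses.
		batch = &MultiArtifactBuild{
			ctx:       ctx,
			platforms: platforms,
			workspace: workspace,
			out:       out, // only the first caller's writer is used (see logging notes below)
			doneCond:  make(chan bool),
			digest:    map[*latest.BazelArtifact]string{},
		}
		b.builds[key] = batch
		go func() {
			time.Sleep(b.batchSize) // let contemporaneous builds collect
			b.mu.Lock()
			batch.running = true
			delete(b.builds, key) // later arrivals start a fresh batch
			b.mu.Unlock()
			batch.err = b.builder(batch) // one bazel invocation for the whole batch
			close(batch.doneCond)
		}()
	}
	batch.artifacts = append(batch.artifacts, a)
	b.mu.Unlock()

	<-batch.doneCond // wait for the shared invocation to finish
	if batch.err != nil {
		return "", batch.err
	}
	return batch.digest[a.BazelArtifact], nil
}
```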

#### How this impacts logging and metrics

Given our choice in where to break the "one artifact per build" abstraction, this design presents some rough edges
around logging and metrics that are worth acknowledging:
* If any build in a batch fails, all builds in that batch are reported as failed to the Events API / console output.
  * This is fundamentally unavoidable in any implementation that uses a single bazel invocation to build multiple
    targets.
* Only one build's `OutputWriter` is actually used (that of the first build to create the `MultiArtifactBuild` object).
  * This presents some rough edges in my POC that I could use some help ironing out: depending on scheduling order,
    output is sometimes not streamed to the console until the build is complete.

### Open Questions

**\<Should we include either of the configuration knobs listed in the Design Overview?\>**

Resolution: __Not Yet Resolved__


## Implementation plan

I believe this change is well-contained enough to be implemented in a single PR; please let me know what you think.
You can find a functional proof-of-concept here: https://github.com/GoogleContainerTools/skaffold/compare/main...sethnelson-tecton:skaffold:sethn/parallel-builds

___


## Integration test plan

New test cases (in addition to the existing test cases for unbatched builds):

1. Batched builds that can all batch together (2 artifacts, no dependencies, `--build-concurrency >= 2`)
1. Builds that should not all batch together due to concurrency limits (4 artifacts, `--build-concurrency >= 2`)
1. Builds that should not batch together due to scheduling (2 artifacts, one depends on the other, `--build-concurrency >= 2`)
1. Batched builds with different build args
