Background
Comparing bazel-diff against ewhauser/bazel-differ and bazel-contrib/target-determinator (TD): both ship a persistent on-disk cache for hash output, keyed by git revision (TD additionally keys on bazel binary SHA, bazel version, and workspace SHA). bazel-diff today re-runs generate-hashes from scratch on every invocation, even when CI is replaying the same commit hour after hour.
What ewhauser/bazel-differ does
- internal/cache/disk_cache.go: file path layout is <cacheDir>/<sha[0:2]>/<sha>/hashes.json.
- cmd/get_targets.go: the wrapper command does git checkout + lookup-or-compute + diff + a downstream query in one go, so cache hits skip the Bazel call entirely.
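To make the layout and the lookup-or-compute step concrete, here is a minimal Python sketch. It is not bazel-differ's actual Go code: `entry_path`, `lookup_or_compute`, and the `compute` callback are hypothetical names, and the JSON payload shape (label to hash) is assumed.

```python
import json
from pathlib import Path
from typing import Callable, Dict


def entry_path(cache_dir: Path, sha: str) -> Path:
    """Return <cacheDir>/<sha[0:2]>/<sha>/hashes.json for a revision SHA."""
    # The two-character prefix shards entries across subdirectories.
    return cache_dir / sha[:2] / sha / "hashes.json"


def lookup_or_compute(
    cache_dir: Path, sha: str, compute: Callable[[], Dict[str, str]]
) -> Dict[str, str]:
    """Return cached hashes for `sha`, calling `compute` only on a miss."""
    path = entry_path(cache_dir, sha)
    if path.is_file():
        # Cache hit: the expensive computation is skipped entirely.
        return json.loads(path.read_text())
    hashes = compute()  # hypothetical stand-in for the Bazel query + hashing
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(hashes, sort_keys=True))
    return hashes
```

A second call with the same `cache_dir` and `sha` returns the stored JSON without invoking `compute` again, which is the property that lets a cache hit skip Bazel.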
What target-determinator does
pkg/cache.go (in their tree) — cache entry key includes bazel-binary SHA256, bazel-version string, git tree SHA, target pattern, and the cquery expression. This is the more correct keying — bazel-version changes alone can flip our skylarkEnvironmentHashCode, and a bazel binary swap (e.g. local bazelisk upgrade) silently changes outputs.
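The point about a binary swap suggests keying on the binary's contents rather than its path or reported version. A minimal Python sketch of that digest, assuming the resolved bazel binary is a regular file we can read (`file_sha256` is a hypothetical helper, not TD's code):

```python
import hashlib
from pathlib import Path


def file_sha256(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 of a file, read in chunks to bound memory use."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        # iter() with a sentinel keeps reading until read() returns b"".
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Two binaries at the same path with different contents produce different digests, so an in-place upgrade would miss the cache instead of returning stale hashes.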
Why we'd want it
- CI hosts that retry the same commit (re-runs, rebases that re-push the same tree, matrix flakes) currently re-do the full bazel query graph walk every time. On a graph with tens of thousands of targets, that's ~minutes per re-run, paid in cloud cost and developer wait time.
- Local dev: a developer iterating on a branch often git checkouts back to the same base SHA between experiments; caching lets the second generate-hashes against that base be ~instant.
Sketch
- New flag(s) on generate-hashes (and optionally a wrapper command): --cacheDir=<path>, opt-in; absent means today's behavior.
- Implicit key fragments to include in the entry: git rev-parse HEAD, bazel binary SHA256, bazel --version string, the set of --bazelStartupOptions / --bazelCommandOptions / --cqueryCommandOptions / --useCquery / --fineGrainedHashExternalRepos* / --ignoredRuleHashingAttributes (all of which materially affect the output), the workspace path (for cross-workspace safety).
- Cache invalidation is purely "don't trust on key mismatch"; no TTL, no GC (let the user manage cacheDir size).
- The cache lives alongside the existing hash JSON, not inside it, so consumers that read hashes.json directly are unaffected.
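One way the key fragments listed above could be folded into a single entry key, as a rough Python sketch. The function name, parameter names, and the canonical-JSON encoding are all assumptions for illustration; nothing here is an existing bazel-diff API. Note that a key built from git rev-parse HEAD alone does not see uncommitted changes, so this sketch assumes a clean working tree.

```python
import hashlib
import json
from typing import Mapping, Sequence


def cache_key(
    git_head: str,
    bazel_binary_sha256: str,
    bazel_version: str,
    workspace_path: str,
    options: Mapping[str, Sequence[str]],
) -> str:
    """Derive one hex key from every input that can change the hash output."""
    payload = {
        "git_head": git_head,
        "bazel_binary_sha256": bazel_binary_sha256,
        "bazel_version": bazel_version,
        "workspace_path": workspace_path,
        # Values keep their order because the order of Bazel options can matter.
        "options": {name: list(values) for name, values in options.items()},
    }
    # sort_keys gives a canonical encoding, so flag-name order does not matter.
    encoded = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(encoded.encode("utf-8")).hexdigest()
```

With this shape, "don't trust on key mismatch" falls out for free: any change to a fragment yields a different key, so the old entry is simply never looked up.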
Open questions
- Should the cache be at the command level (one entry per generate-hashes invocation) or per-target (allowing partial-incremental hashing)? TD does the former; per-target would be a bigger lift but unlocks proper incremental hashing.
- Atomic writes: standard "write temp + rename" so concurrent CI on the same workspace doesn't tear.
- Cache hit logging: surface to verbose / structured logs so CI can confirm hits.
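For the atomic-write item, the usual "write temp + rename" pattern looks roughly like this in Python. `atomic_write_json` is a hypothetical helper, and the rename is only atomic when the temp file and the destination are on the same filesystem, which is why the temp file is created next to the destination.

```python
import contextlib
import json
import os
import tempfile
from pathlib import Path
from typing import Dict


def atomic_write_json(dest: Path, data: Dict[str, str]) -> None:
    """Write `data` to `dest` so readers never observe a partial file."""
    dest.parent.mkdir(parents=True, exist_ok=True)
    # Create the temp file in the destination directory so the final rename
    # stays on one filesystem.
    fd, tmp_name = tempfile.mkstemp(
        dir=dest.parent, prefix=dest.name + ".", suffix=".tmp"
    )
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(data, f, sort_keys=True)
            f.flush()
            os.fsync(f.fileno())  # push the contents to disk before renaming
        os.replace(tmp_name, dest)  # replaces any existing entry in one step
    except BaseException:
        # Clean up the temp file if anything failed before the rename.
        with contextlib.suppress(FileNotFoundError):
            os.unlink(tmp_name)
        raise
```

If two CI jobs race on the same entry, each writes its own temp file and the last rename wins; since both computed the same key, either result should be acceptable.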
This is a feature, not a correctness fix — filing as an issue so we can discuss before anyone starts on it.