orcadev tool (benchmarking, dev setup, etc.) #176
Merged
orcadev is a multi-purpose CLI for working with a running orca dev
cluster. It replaces the older orcaseed seed-only tool: every
orcaseed capability (synthetic-blob generation, single-file upload,
origin listing, bulk delete) is reachable here as a subcommand,
plus a broader debugging surface.
Subcommands
-----------
  upload       Seed the origin (file or N synthetic blobs). Supports
               both awss3 (LocalStack) and azureblob (Azurite / real
               Azure) drivers - orcaseed only spoke azureblob.
  list         Enumerate origin objects.
  delete       Bulk delete origin objects (interactive by default).
  roundtrip    Upload data, fetch through orca's edge, and compare a
               streaming SHA-256 of the source bytes against a
               streaming SHA-256 of the response. Headline correctness
               check. Supports --range, --repeat, --cleanup, and
               --dump-diff (side-by-side hex dump of the first
               differing bytes on mismatch).
  cache        Inspect or clear the cachestore.
    list       Enumerate cachestore chunks (raw paths).
    inspect    Given (bucket, key), compute the canonical chunk
               paths via internal/orca/chunk and HEAD each in the
               cachestore; print per-chunk presence + size.
    clear      Bulk delete chunks by prefix or by object.
  bench        Parallel GET throughput / latency benchmark.
               Emits human-friendly text on stdout plus optional
               JSON (--output json or --json-out PATH) with a
               log-spaced latency histogram (configurable bounds and
               bucket count; default 50 buckets across 100us..10s).
               JSON schema is versioned (schema_version=1) for
               cross-run comparison.
  scenario     Canned end-to-end scenarios: cold-warm, range-stress,
               empty-object, etag-change. Same JSON envelope as bench
               so a CI pipeline can chain them.
All subcommands accept --config <orca.yaml> to populate origin and
cachestore coordinates from the same YAML the orca daemon consumes.
Per-flag overrides win over the YAML value.
Dev-harness changes
-------------------
LocalStack now exposes a NodePort (default 30200) mirroring the
existing Azurite NodePort 30100, so the host-side tool can talk to
the cachestore + awss3 origin without a kubectl port-forward. The
kind extraPortMappings are extended accordingly. A new render flag
LocalstackNodePort is passed through hack/orca/Makefile's
render-dev target.
Makefile targets renamed
------------------------
seed-azure -> dev-azure
seed-generate -> data-generate
seed-upload -> data-upload
seed-list -> data-list
seed-delete -> data-delete
Plus six new orcadev-specific targets: roundtrip, cache-list,
cache-inspect, cache-clear, bench, scenario. The SEED_ARGS variable
is renamed ARGS for action-oriented clarity.
The orcaseed source tree under hack/cmd/orcaseed/ is deleted in
this commit; orcaseed's TestParseSize coverage moves to orcadev's
size_test.go.
Verification
------------
go build ./... clean
go test -count=1 ./hack/cmd/orcadev/... clean
golangci-lint run -c .golangci.yaml ./hack/cmd/orcadev/... clean
Smoke-tested against the existing dev cluster: upload + list +
delete cycle against Azurite works end-to-end; bench correctly
reports connection refused when LocalStack is not yet on the new
NodePort (existing dev clusters need a 'make -C hack/orca deploy'
to pick up the NodePort change).
Adds a new Step 8 to hack/orca/quickstart.md walking through the
common orcadev workflows against a running dev cluster:
  8a Roundtrip         - one-command SHA-256 correctness check,
                         including a sample --dump-diff mismatch
                         output for triage.
  8b Benchmarks        - throughput + latency runs, plus the
                         jq-based comparison workflow using
                         --json-out + --label across iterations.
  8c Scenarios         - one-line invocations of cold-warm,
                         range-stress, empty-object, etag-change.
  8d Cache inspection  - cache-list / cache-inspect / cache-clear
                         for inspect-and-clear-between-experiments
                         workflows.
The existing 'Tear down' section becomes Step 9. dev-harness.md's
'Exercise the cache' section gets a one-line pointer to the new
step so readers landing there for benchmarking find their way to
the tutorial.
Replaces the one-shot Kubernetes Jobs that previously created the
LocalStack S3 buckets ('orca-cache', 'orca-origin') and the Azurite
container ('orca-test') with mechanisms that re-fire on every
emulator restart.
Failure mode being fixed
------------------------
LocalStack and Azurite both run with ephemeral state (emptyDir +
PERSISTENCE=0 / no persistence mode). When their pods restart (OOM,
eviction, manual delete, kind node restart) state is wiped. The
existing 'orca-buckets-init' and 'orca-azurite-container-init' Jobs
are Kubernetes Jobs - they ran once at first deploy and could not
re-run after emulator restarts. Result: orca pods CrashLoopBackOff
forever with 'NoSuchBucket' on the cachestore versioning probe;
manual recovery was required.
Mechanism
---------
LocalStack: 'localstack-init-buckets' ConfigMap mounted at
/etc/localstack/init/ready.d/init-buckets.sh (defaultMode 0755).
LocalStack 3.x's native init-hooks pattern rescans this directory on
every container start. The script idempotently creates both buckets
and re-checks cachestore-versioning is unset (orca's versioningGate
requirement).
Azurite: 'container-ensurer' sidecar in the same Pod, running a
30-second forever-loop that calls 'az storage container create'
idempotently. Talks to Azurite over loopback (127.0.0.1:10000) so
it doesn't depend on cluster DNS.
Both emulator manifests lose their separate init-Job templates
(02-init-job.yaml.tmpl, 04-azurite-init.yaml.tmpl); 'make
deploy-localstack' and 'make deploy-azurite' drop the Job
apply + wait steps and gain bucket/container readiness polls so
operators see explicit success once the init mechanisms have run.
After this change, 'make -C hack/orca deploy-localstack' and
'deploy-azurite' are clean idempotent recovery targets: re-running
either against a stale cluster heals the buckets without a full
'orca-down && orca-up'.
Verification
------------
Live-tested against an existing kind cluster:
- Applied the new LocalStack manifest; init-hook ran and created
both buckets within 5s of container Ready.
- Deleted the LocalStack pod; new pod's init-hook recreated both
buckets within 5s of restart.
- Applied the new Azurite manifest with the container-ensurer
sidecar; sidecar created 'orca-test' within 30s.
- Deleted the Azurite pod; new sidecar recreated the container
within 30s.
- Three orca pods that had been CrashLoopBackOff for 9 days with
NoSuchBucket reached 3/3 Ready after a rollout-restart.
Initial sidecar memory limit of 64Mi was too low (Azure CLI is
Python-based and loads ~150MB of modules); bumped to 256Mi limit /
128Mi request, sized to be comfortably above measured RSS.
Templates render and pass kubectl --dry-run validation; no Go code
changed; CI surface unaffected.
…data-random
Replaces the existing 'upload --generate --prefix' flag with '--name'
(same semantics in --file mode; in --generate mode the per-blob
index is appended). Indices are now 1-based, so:
upload --generate --count 1 --name foo -> blob 'foo1'
upload --generate --count 3 --name foo -> blobs 'foo1','foo2','foo3'
upload --generate --count 5 -> blobs 'synth1..synth5'
(default --name is 'synth')
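The naming rule can be sketched as follows. blobNames is a hypothetical helper written only to illustrate the 1-based, no-separator shape shown in the examples above:

```go
package main

import "fmt"

// blobNames returns the names --generate would produce for a given
// --name and --count: the 1-based index is appended with no separator.
// Hypothetical helper; an empty name falls back to the default 'synth'.
func blobNames(name string, count int) []string {
	if name == "" {
		name = "synth"
	}
	names := make([]string, 0, count)
	for i := 1; i <= count; i++ {
		names = append(names, fmt.Sprintf("%s%d", name, i))
	}
	return names
}

func main() {
	fmt.Println(blobNames("foo", 3))
	fmt.Println(blobNames("", 5))
}
```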
Adds a new make target wrapping the singleton-blob case:
make -C hack/orca data-random NAME=foo SIZE=10MiB
which uploads exactly one random blob literally named 'foo1' of
SIZE bytes. Useful for seeding a key to feed 'bench KEY=foo1' or
'roundtrip --key foo1' without dropping a file on local disk first.
Breaking change vs. the prior flag surface: --prefix is removed,
not deprecated. Acceptable since PR #176 (which introduced --prefix
into orcadev) has not yet merged. The default output blob-name
shape changes from 'synth-0' to 'synth1' (no dash separator;
indices start at 1 rather than 0).
Determinism note: under --seed, blob i's content is now derived
from (seed + i) with i in 1..count, replacing the previous
0..count-1 range. Within the new world the byte-identical-across-
runs guarantee holds.
Drive-by fix: awss3 Put now buffers the body into a bytes.Reader
before calling PutObject. The aws-sdk-go-v2 SigV4 signer requires
a seekable body to pre-compute X-Amz-Content-SHA256; the previous
implementation passed an unseekable io.Reader from newRandomReader
and failed at runtime with 'failed to seek body to start, request
stream is not seekable'. Caught while live-testing the new
data-random target against LocalStack.
Verification
------------
go build / go test / golangci-lint: clean (0 issues)
Live against Azurite NodePort 30100:
make data-random NAME=foo SIZE=1MiB -> foo1 (1.00 MiB)
make data-generate ARGS='--count 3 --name multi --size 256KiB'
-> multi1,multi2,multi3
make data-generate ARGS='--count 2 --size 128KiB' (default name)
-> synth1, synth2
…dtrip/scenario
The orca edge listener (8443) is a ClusterIP-only Service in the
dev harness, so anything orcadev does that hits it (bench,
roundtrip, scenario) requires a kubectl port-forward. Today
operators have to run 'make -C hack/orca port-forward' in another
shell first; forgetting yields 'dial tcp 127.0.0.1:8443: connect:
connection refused' and a confusing user experience.
This change adds an auto-managed port-forward:
1. Before constructing edgeClient, ensureEdgeReachable probes
localhost:8443 with a 500ms TCP dial.
2. If the probe succeeds (operator already has a port-forward,
or anything else is bound) -> no-op, proceed.
3. If the probe fails AND --orca-url is the dev default
(localhost:8443 / 127.0.0.1:8443) AND --auto-port-forward
is true (default) -> spawn 'kubectl --context kind-orca-dev
-n unbounded-kube port-forward svc/orca 8443:8443' as a
subprocess, wait up to 10s for the 'Forwarding from' stdout
sentinel, then return a deferred cleanup that SIGTERMs the
subprocess.
4. If --auto-port-forward=false OR the URL is non-default ->
no-op (operator clearly knows what they're doing); the
original connection error surfaces through the actual
request.
New flags (both persistent, both opt-out flavored):
--auto-port-forward (bool, default true)
--kube-context (string, default 'kind-orca-dev')
A one-line 'auto port-forward: localhost:8443 -> svc/orca:8443'
is printed to stderr when the forward fires so operators aren't
surprised. Subprocess stderr is captured into an in-memory buffer
and surfaced in any error message (handles 'context not found',
'service not found', kubectl-not-on-PATH).
Tests cover the loopback probe (open and closed paths), the URL
host:port parser, and the 'Forwarding from' sentinel detector.
The kubectl subprocess itself is exercised live.
Verified live against the dev cluster:
- Without a port-forward: 'auto port-forward:' fires, orca
reachable, subprocess cleaned up on exit.
- With operator's own 'make -C hack/orca port-forward' running:
probe succeeds; no new forward spawned.
- --auto-port-forward=false: connection refused (today's
behavior preserved).
- --orca-url http://nonexistent:9999: auto-forward skipped;
DNS error pass-through.
Makefile help text and quickstart.md Step 8 prelude updated to
note the auto-managed behaviour; the manual port-forward target
remains available for long-lived sessions.
…mes opt-in
The dev harness was internally inconsistent: .env.example shipped
ORIGIN_DRIVER=awss3 and the Makefile fell back to awss3, but
quickstart.md instructed operators to override to azureblob and
the orcadev tool's tutorial assumed an Azurite origin. Operators
following quickstart-as-written ended up in azureblob mode while
the rest of the harness still defaulted to awss3, producing a
mismatch where orcadev's awss3 defaults seeded the wrong backend
and orca returned 404 on every fetch.
Flips the default everywhere:
  hack/orca/.env.example
    ORIGIN_DRIVER=awss3 -> azureblob
    ORIGIN_ID=awss3-localstack -> azureblob-azurite
    Reorder the three modes block: azureblob primary, awss3
    secondary, real-Azure tertiary
  hack/orca/Makefile
    ${ORIGIN_DRIVER:-awss3} -> :-azureblob
    ${ORIGIN_ID:-awss3-localstack} -> :-azureblob-azurite
    Remove the deploy-azurite-maybe target entirely; both
    LocalStack and Azurite are now always deployed regardless of
    ORIGIN_DRIVER. The cost is a few MB of memory per cluster;
    the benefit is eliminating the 'switched modes mid-cluster
    left Azurite undeployed' bug class and making every mode
    switch require only a ConfigMap change.
  hack/cmd/orcadev/orcadev/config.go defaultGlobalFlags
    originDriver: awss3 -> azureblob
    originID: inttest-origin -> azureblob-azurite
    originBucket: orca-origin -> orca-test
    originEndpoint: localhost:30200 -> localhost:30100/devstoreaccount1/
    (Cachestore fields unchanged; awss3 fields kept so the
    --origin-driver=awss3 opt-in path still works without
    additional credential flags.)
  hack/orca/dev-harness.md
    Origin modes table reordered; 'What you get' updated to
    always-on Azurite; bring-up sequence step 7 updated to
    'deploy-azurite' (no longer conditional); 'Switching origins'
    inverted (was 'switch FROM awss3 TO azureblob'; now the
    reverse); 'Recovery' lists both deploy-localstack and
    deploy-azurite; troubleshooting section invokes the right
    mode.
  hack/orca/quickstart.md
    Step 1 simplified: defaults are already azureblob-with-Azurite,
    so the only optional edit is LOG_LEVEL=debug. Step 2 pod-list
    drops the stale 'orca-buckets-init' and
    'orca-azurite-container-init' Job rows (those Jobs were
    removed in commit 886607c; the list shows the current
    always-on Deployments instead).
  hack/cmd/orcadev/orcadev/orcadev_test.go
    Default-driver assertion flipped to expect 'azureblob'.
Drive-by: add a !.env.example exception to .gitignore so the
new defaults actually ship. Previously the .env.* rule silently
swallowed .env.example, leaving the file local-only on each
developer's machine and explaining why the .env.example shape has
been drifting from what quickstart.md actually recommends.
Real-Azure mode unchanged. Existing operators with a custom .env
in awss3 mode keep working (their .env overrides the new
default); they will need to pass --origin-driver=awss3 to orcadev
or update their .env to track the new default.
Verified live against the dev cluster:
make -C hack/orca render -> origin driver: azureblob
make -C hack/orca deploy -> Azurite + LocalStack + orca up
make -C hack/orca data-random NAME=defaulttest SIZE=1MiB
-> uploads to Azurite/orca-test
make -C hack/orca bench KEY=defaulttest1 ARGS='--duration 10s'
-> 107 MiB/s, 1078/1078 ok
make -C hack/orca roundtrip FILE=/tmp/r.bin
-> PASS, SHA-256 matches
make -C hack/orca cache-list -> chunks under azureblob-azurite/...
make -C hack/orca cache-inspect BUCKET=orca-test KEY=defaulttest1
-> 1/1 chunks (100%)
awss3 opt-in regression check -> uploads, lists, deletes OK
… of cancelling them
Before this change, '--duration N' used a single context.WithTimeout
for both the worker-loop gate AND every in-flight HTTP request.
When the timer fired, mid-flight requests were ripped out from
under the SDK and counted as 'context_canceled' / 'body_read_error'
in the errors-by-code summary. A typical 16-worker run reported
~16 errors at the tail of every benchmark - benign but
indistinguishable from a real failure mode.
Splits the single context into two contexts plus a drain timer:
  gateCtx     - controls new-work admission. Expires at --duration.
                Workers select on it before picking up another
                request; when it fires they stop accepting work.
  reqCtx      - what HTTP calls actually use. Stays open after
                gateCtx closes; this is what lets in-flight
                requests run to natural completion.
  drainTimer  - (context.AfterFunc on gateCtx + time.AfterFunc on
                the drain budget) cancels reqCtx after the
                configurable drain timeout if in-flight requests
                still haven't finished.
So a typical 30-second bench run now exits at ~30s if everything
is healthy and at most 30s + --drain-timeout if something is
genuinely hung.
New flag:
--drain-timeout 10s how long to wait for in-flight requests
after the gate closes. Default 10s
(matches g.timeout; well above the
observed p99 for any reasonable run).
New JSON fields (schema_version still 1; additive non-breaking):
  config.drain_timeout_seconds   the configured drain budget
  results.gate_seconds           wall-clock the gate was open
  results.drain_seconds          wall-clock spent draining (near
                                 zero in --requests mode, where the
                                 gate closes only once the last
                                 request has been admitted)
elapsed_seconds remains the total (gate + drain), so consumers
keying off it continue to work.
Human output gains two lines mirroring the JSON fields.
In --requests N mode the new fields still behave sensibly: the
last worker through the reqLimit gate calls setGateClosed(),
producing gate_seconds close to elapsed_seconds and a tiny
drain_seconds covering any concurrent workers still finishing
their last fetch.
Verified live against the dev cluster:
bench --duration 10s --concurrency 8 --range-size 256KiB --read-pattern random
-> 969 requests, 0 errors, gate 10.001s, drain 59ms.
The 16-error tail storm is gone.
bench --duration 5s --concurrency 8 --drain-timeout 1ms (...)
-> 8 errors (exactly --concurrency), drain budget exhausted.
Proves the drain cap fires when in-flight requests
overrun the budget.
bench --duration= --requests 50 --concurrency 4 (...)
-> 50 requests, 0 errors, gate 551ms, drain 9ms.
--requests mode unaffected.
…pers)
- percentile: collapse percentile() to call percentileSorted() after
  the in-place sort, eliminating a copy of the ceiling-rank math.
- upload.go: drop the i := i loop-variable shadow now that the module
  is on go 1.26 (per-iteration variable since 1.22).
- roundtrip.go: replace the bespoke parseInt with strconv.ParseInt and
  the manual dash-search loop with strings.IndexByte / HasPrefix.
- hash.go: add unquoteETag() helper and reuse at the 4 origin sites
  and 1 edge site that previously inlined strings.Trim(s, "\"").
- list.go: route runDelete through the existing confirmPrompt helper.
  Introduce errConfirmAborted as a sentinel so runDelete preserves its
  'aborted.' + exit 0 behaviour while cache.go's callers still surface
  a non-zero exit on decline. confirmPrompt also no longer prints a
  leading space when msg is empty.
…cenario env)
- io.go: add emitJSONResult[T any] that bench.emitBenchResult and
  scenario.emitScenarioResult now delegate to. Encoding json into
  stdout / --json-out path was duplicated verbatim across the two.
- cachestore.go: add buildS3Client() with the static credentials /
  checksum opt-out / endpoint / path-style configuration. Both
  newCachestoreClient and newAWSS3Origin call it; drops a half-screen
  of awsconfig.LoadDefaultConfig boilerplate from origin.go.
- cachestore.go: add walkS3() pagination helper consumed by both
  cachestoreClient.List and awss3Origin.List. The visit callback
  returns false to short-circuit the walk for limit-bounded callers.
- cache.go: add forEachChunk() that loops over chunk.Key indices,
  HEADing each path. clearByObject (cache) and clearScenarioObject
  (scenario) now drive the same loop. The latter previously had a
  hardcoded 1024-index cap; it now derives nChunks from the origin
  HEAD size via the new resolveObjectMetadata helper, with the 1024
  fallback only used when size is unknown.
- scenario.go: add newScenarioEnv + scenarioCleanup. Each of the four
  scenarios was opening with the same 8 lines of origin-client /
  ensure-bucket / edge-client construction and the same 5-line
  defer-cleanup-if-not-keepData block. Both are now one-liners.
… precedence
- bench.go: extract pickBenchRange() from the worker hot loop in
  runBenchLoop. The original 25-line if/else cascade is now a single
  call site with three intent-named branches (full / random /
  sequential). Adds bench_test.go::TestPickBenchRange covering each
  branch including the wrap-at-boundary case for sequential and a
  256-draw in-bounds check for random.
- config.go: switch globalFlags.resolve from 'value differs from
  default => operator set the flag' to cobra Flag.Changed. This is
  strictly more correct: an operator who passes the literal default
  (e.g. --origin-bucket=orca-test) now wins over the YAML rather than
  silently letting the YAML override. resolve now takes a
  *cobra.Command instead of a context.Context (the ctx parameter was
  unused). Drops the apologetic 'good-enough' comment.
- orcadev.go: pass cmd directly to resolve.
The dev-harness manifests under deploy/orca/dev/ used to include two
one-shot Jobs (02-init-job.yaml.tmpl, 04-azurite-init.yaml.tmpl) that
created the LocalStack S3 buckets and the Azurite container at first
deploy. Commit 886607c ("dev-harness: self-healing LocalStack +
Azurite bucket bootstrap") deleted both Jobs and replaced them with
PostStart lifecycle hooks / a sidecar driven by an inline ConfigMap so
the buckets/container are re-created on every emulator restart.

That commit did not update TestDevManifestsRender, which still
asserted a Job kind in the rendered output, leaving 'make' broken on
this branch and main since the May 21 dev-harness change. The rendered
kind set is now ConfigMap + Deployment + Service; update the assertion
to match and document why Job is no longer expected.

Unrelated drive-by fix included here so the orcadev refactor branch
runs 'make' green end-to-end.
Adds an 'orcadev' target that runs 'test' then builds bin/orcadev from
./hack/cmd/orcadev. Recipe shape mirrors the existing forge target;
the ORCADEV_BIN / ORCADEV_CMD variables sit alongside the target
itself rather than the global variables header so the tool's build
config travels with its only consumer. Also adds the tool to the
.PHONY list (next to forge) and the manually curated help text under
the Build section.

Deliberately NOT added to the 'all' target: orcadev is a host-side
operator tool used against an already-running orca harness, not a
shipped binary, so the default 'make' shouldn't pay its build cost.
hack/orca/Makefile keeps its 'go run hack/cmd/orcadev' invocations so
the inner dev loop stays free of a separate compile step.
Replace the previous .env + Make-targets + multiple shell scripts flow
with one install script (hack/orca/setup-orca.sh) plus thin kind
cluster lifecycle helpers (kind-up.sh / kind-down.sh) and one
quickstart (hack/orca/README.md).

The install script is cluster-agnostic: it works against any kubectl
context (kind, AKS, EKS, k3d, ...) with the same handful of flags.
Defaults match the previously-supported dev shape: azureblob origin
backed by in-cluster Azurite, S3 cachestore backed by in-cluster
LocalStack, zero real cloud credentials required. Real-Azure mode is
opted into via AZURE_STORAGE_* env vars.

orcadev grows a --preset=dev flag (the default) that bundles the
well-known dev coordinates. The auto-port-forward machinery now covers
svc/orca + svc/azurite + svc/localstack so the same
`bin/orcadev <verb>` invocation works on every cluster flavor without
depending on kind NodePort hostPort mappings. Kube context default
switches from kind-orca-dev to empty (current context).

Removed:
- quickstart.md, dev-harness.md (folded into README.md)
- .env.example, orcadev-flags.sh (replaced by CLI flags + Go preset)
- kind-create.sh, kind-load.sh, down.sh, deploy-credentials.sh (folded
  into setup-orca.sh / kind-up.sh / kind-down.sh)
- rendered-dev/ directory (setup-orca.sh renders to a tempdir)

Root Makefile gains orca-install (current context) and orca-kind-up /
orca-kind-down (with orca-up / orca-down kept as back-compat aliases).
hack/orca/Makefile shrinks to the three operational verbs (status,
logs, port-forward); everything else is done directly with
bin/orcadev <verb>.
…ward, anti-affinity)
Addresses the seven dev-workflow review findings logged in
designs/orca/dev-workflow-remediation-plan.md.
1. Drop kind NodePort: emulator Services switch to ClusterIP
   unconditionally; remove kind extraPortMappings. orcadev
   port-forwards everywhere now, so NodePort offered no value and was
   an AKS / shared-cluster footgun (fixed nodePort 30100/30200 could
   collide or be policy-blocked).
2. Hardened uninstall: setup-orca.sh --uninstall now deletes only
   resources matching app.kubernetes.io/name=orca or
   app.kubernetes.io/part-of=orca-dev labels and leaves the namespace
   intact. New --delete-namespace flag explicitly opts into removing
   the namespace (and every unrelated resource in it). orca-credentials
   Secret now carries the orca name label so the label-selector picks
   it up.
3. Universal port-forward coverage: every orcadev subcommand that
   touches origin / cachestore / edge now calls ensurePortForwards at
   the top of its RunE (upload, list, delete, cache
   list/inspect/clear, roundtrip, bench, scenario). Previously only
   roundtrip/bench/scenario opened forwards, so
   bin/orcadev upload --generate failed on AKS once kind NodePorts were
   dropped. TestEverySubcommandOpensPortForwards is the structural
   guard.
4. README accuracy: drop multi-object and range-large from the
   scenario list (not implemented), drop the same from the orcadev
   package docstring, update cache examples to the default azureblob
   container orca-test, document --delete-namespace and the
   existing-cluster anti-affinity behavior.
5. RequireAntiAffinity template knob in deploy/orca/04-deployment:
   true (kind) renders the strict requiredDuringScheduling block;
   false (non-kind) renders preferredDuringScheduling so clusters with
   fewer than 3 schedulable nodes still roll out. setup-orca.sh picks
   the right default by detecting kind-* contexts.
6. Tempdir trap consolidation in setup-orca.sh: single cleanup_paths
   stack + one EXIT trap, replacing the previous double-trap that
   leaked the kind image archive tempdir on --kind-load runs.
7. Safety rails: make orca-install errors when targeted at a non-kind
   context with the default ghcr.io/azure/orca:dev image (which
   wouldn't be pullable). setup-orca.sh --build without --kind-load
   now errors fast instead of building an image nothing uses.
Validated on kind end-to-end: fresh cluster + setup-orca.sh + orcadev
upload/list/cache/roundtrip/scenario/bench. AKS validation remains a
follow-up owner task per the plan.
jveski
reviewed
May 28, 2026
The old roundtrip output buried the source and received SHA-256s
at the right edges of two long, differently-shaped lines, so a
human had to compare 64 hex characters across visual whitespace
to confirm they matched. The new layout puts each sha256 on its
own indented line under a short heading and appends a MATCH or
MISMATCH marker so the at-a-glance verdict is unambiguous:
  source: orca-test.bin (5.00 MiB)
    sha256: 45a643a91d90c4fb...
  iter 0: status=200 bytes=5.00 MiB elapsed=151ms rate=33.08 MiB/s
    sha256: 45a643a91d90c4fb... MATCH
  PASS sha256=45a643a9... (3 iters)
The MISMATCH branch keeps the existing copy-paste summary block
(MISMATCH on iter N + source + received) so operators still get
full hashes for incident triage, but the per-iter MISMATCH marker
now appears at the same indentation as MATCH so visual scanning
catches it immediately.
Coverage: new roundtrip_output_test.go captures stderr through an
os.Pipe and asserts the heading + indented sha256 + marker + PASS
format end-to-end against a fake origin + httptest edge. Two new
helpers (shortHash, iterLabel) get their own table tests for
singular/plural and short-hash boundary cases.
Also fixes the stale setup-orca.sh "Next steps" block: it
previously printed `bin/orcadev roundtrip --file /tmp/test.bin`
without telling the user where /tmp/test.bin comes from. Now it
prints a `dd` line first and uses /tmp/orca-test.bin to match
the README example, and leads with `scenario cold-warm` which
needs no seed file at all.
The previous errors were terse:
Error: --file and --key are mutually exclusive
Error: one of --file or --key is required
This left the user staring at the help text trying to figure out
which mode they actually wanted. The new messages spell out the
source-of-truth contract for each mode and, in the mutually-exclusive
case, suggest the workaround ("upload it under a different name")
that the user actually wants if they were trying to compare a local
file against an existing origin object.
The two modes are:
--file PATH: source-of-truth is the local file. Upload it, fetch
it back through orca, compare bytes.
--key NAME: source-of-truth is the current origin object. Read
it from origin, fetch the same key through orca,
compare bytes.
Combining them would be ambiguous (which is the source?) so it is
rejected, but the error now says so in words the operator can act
on without re-reading the long help.
Adds TestRunRoundtrip_FlagErrors which locks both messages.
PR #176 review nit: the edge HTTP client used the zero-value
http.Client which falls back to http.DefaultTransport. That transport
caps MaxIdleConnsPerHost at 2, meaning every concurrent request beyond
the first two against the single orca host had to pay a fresh TCP
handshake (and TLS, in production deployments).

The bench subcommand defaults to 8 workers and the README suggests
--concurrency 16; scenarios spin a few dozen parallel range reads.
Capping the keep-alive pool at 2 throttled all of these to the stdlib
default, silently slowing benchmarks and hiding real performance
characteristics.

Fix clones http.DefaultTransport (so we inherit Dial / TLS / HTTP-2
defaults), raises MaxIdleConnsPerHost to 256 (more than any realistic
dev concurrency), and explicitly pins MaxConnsPerHost=0 (unlimited) so
a future reviewer can see we never silently throttle the operator's
--concurrency choice.

Two new tests lock the transport sizing and the 5x-timeout cap so a
future refactor of newEdgeClient cannot quietly reintroduce the
regression.
jveski
approved these changes
May 29, 2026
Why this PR exists
Other developers are starting to build components that talk to Orca. They need a simple way to get Orca running in a Kubernetes cluster (kind on their laptop or an existing AKS / EKS / k3d cluster), put some test data into it, and run benchmarks or canned scenarios against it. Before this PR the answer was "read three READMEs, copy an env file, edit it, run several make targets across two directories, and hope the kind NodePorts work." That is not a coherent entrypoint.
What this PR adds
Three things, in roughly increasing order of footprint:
1. A developer tool: bin/orcadev (in hack/cmd/orcadev/)

One command-line tool for everything a developer wants to do against a running Orca:

- orcadev upload
- orcadev list
- orcadev delete
- orcadev roundtrip
- orcadev cache list / inspect / clear
- orcadev bench
- orcadev scenario (cold-warm, range-stress, empty-object, etag-change)

The tool figures out how to reach the cluster automatically: it opens kubectl port-forward connections to the three services it needs (the Orca edge, Azurite for origin storage, LocalStack for the cache) and tears them down when the command finishes. The same commands work on kind and on any other cluster you can reach with kubectl.

2. A single install script: ./hack/orca/setup-orca.sh

One script with a small number of flags. It builds the Orca manifests on the fly, applies them to whatever kubectl context you point it at, waits for everything to be ready, and prints the next-step commands. The default install runs:

Two helper scripts handle the kind cluster lifecycle: kind-up.sh (create a kind cluster suitable for Orca) and kind-down.sh (delete it). The root Makefile exposes make orca-kind-up, make orca-kind-down, make orca-install, and make orca-reset as muscle-memory wrappers.

3. One quickstart README: hack/orca/README.md

A single document that walks a developer from "I have nothing" to "I am running benchmarks." It replaces the previous mix of quickstart.md, dev-harness.md, and a .env.example file (all deleted in this PR).

What reviewers should look at first

In order of importance:

1. hack/orca/README.md. This is what new developers will read. If it doesn't make sense to a colleague who has never touched Orca, the PR has failed at its primary goal. Read it top to bottom and try the commands.
2. hack/orca/setup-orca.sh. Especially the uninstall path (label-based delete + the --delete-namespace opt-in) and the safety rails on --build / --kind-load. This script is the one thing developers will run on their clusters; mistakes here can delete unrelated resources.
3. hack/cmd/orcadev/orcadev/portforward.go and port_forward_coverage_test.go. The port-forward logic is what makes bin/orcadev work the same on kind and non-kind clusters. The coverage test asserts that every subcommand that talks to a storage service opens the right forwards first; please confirm this contract feels right.
4. deploy/orca/04-deployment.yaml.tmpl and deploy/orca/dev/*.yaml.tmpl. Two template changes worth a careful read: the new RequireAntiAffinity knob in the Orca Deployment (kind keeps required, others get preferred), and the switch from NodePort to ClusterIP for Azurite and LocalStack.
5. Makefile changes to the ##@ Orca block. New / renamed targets: orca-install, orca-kind-up, orca-kind-down, orca-reset. Old aliases orca-up / orca-down are kept so existing muscle memory keeps working.
6. designs/orca/dev-workflow-remediation-plan.md. The plan document captured during the second review round; useful context for "why is this structured this way."

Files you can probably skim:

- *_test.go additions: structural guards, no production behavior.
- hack/cmd/orcadev/orcadev/*.go outside portforward.go / preset.go: the subcommand bodies are mostly the original orcadev work that was already on this branch before the review.

How to try it