orcadev tool (benchmarking, dev setup, etc.) #176

Merged: plombardi89 merged 46 commits into main from phlombar/orcadev-tool on May 29, 2026

@plombardi89 plombardi89 commented May 21, 2026

Why this PR exists

Other developers are starting to build components that talk to Orca. They need a simple way to get Orca running in a Kubernetes cluster (kind on their laptop or an existing AKS / EKS / k3d cluster), put some test data into it, and run benchmarks or canned scenarios against it. Before this PR the answer was "read three READMEs, copy an env file, edit it, run several make targets across two directories, and hope the kind NodePorts work." That is not a coherent entrypoint.

What this PR adds

Three things, in roughly increasing order of footprint:

1. A developer tool: bin/orcadev (in hack/cmd/orcadev/)

One command-line tool for everything a developer wants to do against a running Orca:

Command                                 What it does
orcadev upload                          Put a file (or N random blobs) into the origin.
orcadev list                            Show what's in the origin.
orcadev delete                          Remove things from the origin.
orcadev roundtrip                       Upload a file, fetch it back through Orca, check the bytes match.
orcadev cache list / inspect / clear    See or remove what's in Orca's cache.
orcadev bench                           Run a parallel-GET benchmark. Outputs human text and JSON for comparing runs.
orcadev scenario                        Run a canned end-to-end scenario (cold-warm, range-stress, empty-object, etag-change).

The tool figures out how to reach the cluster automatically: it opens kubectl port-forward connections to the three services it needs (the Orca edge, Azurite for origin storage, LocalStack for the cache) and tears them down when the command finishes. The same commands work on kind and on any other cluster you can reach with kubectl.

2. A single install script: ./hack/orca/setup-orca.sh

One script with a small number of flags. It builds the Orca manifests on the fly, applies them to whatever kubectl context you point it at, waits for everything to be ready, and prints the next-step commands. The default install runs:

  • Orca itself (3 replicas).
  • Azurite as the Azure Blob Storage origin.
  • LocalStack as the S3 cache store.
  • Zero real cloud credentials required.

Two helper scripts handle the kind cluster lifecycle: kind-up.sh (create a kind cluster suitable for Orca) and kind-down.sh (delete it). The root Makefile exposes make orca-kind-up, make orca-kind-down, make orca-install, and make orca-reset as muscle-memory wrappers.

3. One quickstart README: hack/orca/README.md

A single document that walks a developer from "I have nothing" to "I am running benchmarks." It replaces the previous mix of quickstart.md, dev-harness.md, and a .env.example file (all deleted in this PR).

What reviewers should look at first

In order of importance:

  1. hack/orca/README.md. This is what new developers will read. If it doesn't make sense to a colleague who has never touched Orca, the PR has failed at its primary goal. Read it top to bottom and try the commands.

  2. hack/orca/setup-orca.sh. Especially the uninstall path (label-based delete + the --delete-namespace opt-in) and the safety rails on --build / --kind-load. This script is the one thing developers will run on their clusters; mistakes here can delete unrelated resources.

  3. hack/cmd/orcadev/orcadev/portforward.go and port_forward_coverage_test.go. The port-forward logic is what makes bin/orcadev work the same on kind and non-kind clusters. The coverage test asserts that every subcommand that talks to a storage service opens the right forwards first; please confirm this contract feels right.

  4. deploy/orca/04-deployment.yaml.tmpl and deploy/orca/dev/*.yaml.tmpl. Two template changes worth a careful read: the new RequireAntiAffinity knob in the Orca Deployment (kind keeps required, others get preferred), and the switch from NodePort to ClusterIP for Azurite and LocalStack.

  5. Makefile changes to the ##@ Orca block. New / renamed targets: orca-install, orca-kind-up, orca-kind-down, orca-reset. Old aliases orca-up / orca-down are kept so existing muscle memory keeps working.

  6. designs/orca/dev-workflow-remediation-plan.md. The plan document captured during the second review round; useful context for "why is this structured this way."

Files you can probably skim:

  • All *_test.go additions: structural guards, no production behavior.
  • hack/cmd/orcadev/orcadev/*.go outside portforward.go / preset.go: the subcommand bodies are mostly the original orcadev work that was already on this branch before the review.

How to try it

# Build the tools.
make orca-build orcadev

# Bring up kind + install Orca + Azurite + LocalStack.
make orca-kind-up

# Verify.
kubectl --context kind-orca-dev -n unbounded-kube get pods

# Seed and exercise.
bin/orcadev upload --generate --count 5 --size 10MiB
bin/orcadev roundtrip --file /tmp/some-file.bin
bin/orcadev scenario cold-warm
bin/orcadev bench --key synth1 --duration 30s

# Tear down.
make orca-kind-down

orcadev is a multi-purpose CLI for working with a running orca dev
cluster. It replaces the older orcaseed seed-only tool: every
orcaseed capability (synthetic-blob generation, single-file upload,
origin listing, bulk delete) is reachable here as a subcommand,
plus a broader debugging surface.

Subcommands
-----------

  upload     Seed the origin (file or N synthetic blobs). Supports
             both awss3 (LocalStack) and azureblob (Azurite / real
             Azure) drivers - orcaseed only spoke azureblob.
  list       Enumerate origin objects.
  delete     Bulk delete origin objects (interactive by default).
  roundtrip  Upload data, fetch through orca's edge, and compare a
             streaming SHA-256 of the source bytes against a
             streaming SHA-256 of the response. Headline correctness
             check. Supports --range, --repeat, --cleanup, and
             --dump-diff (side-by-side hex dump of the first
             differing bytes on mismatch).
  cache      Inspect or clear the cachestore.
    list       Enumerate cachestore chunks (raw paths).
    inspect    Given (bucket, key), compute the canonical chunk
               paths via internal/orca/chunk and HEAD each in the
               cachestore; print per-chunk presence + size.
    clear      Bulk delete chunks by prefix or by object.
  bench      Parallel GET throughput / latency benchmark.
             Emits human-friendly text on stdout plus optional
             JSON (--output json or --json-out PATH) with a
             log-spaced latency histogram (configurable bounds and
             bucket count; default 50 buckets across 100us..10s).
             JSON schema is versioned (schema_version=1) for
             cross-run comparison.
  scenario   Canned end-to-end scenarios: cold-warm, range-stress,
             empty-object, etag-change. Same JSON envelope as bench
             so a CI pipeline can chain them.
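
The log-spaced histogram mentioned under bench can be sketched as
below. This is a minimal illustration, not the orcadev implementation:
the function names, the geometric edge formula, and the clamping of
out-of-range samples are all assumptions made for the example.

```go
package main

import (
	"math"
	"sort"
	"time"
)

// logSpacedBounds returns n+1 bucket edges spaced evenly on a log scale
// from lo to hi. Assumes 0 < lo < hi and n >= 1. (Hypothetical helper.)
func logSpacedBounds(lo, hi time.Duration, n int) []time.Duration {
	edges := make([]time.Duration, n+1)
	span := math.Log(float64(hi) / float64(lo))
	for i := range edges {
		f := float64(lo) * math.Exp(span*float64(i)/float64(n))
		edges[i] = time.Duration(math.Round(f))
	}
	// Pin the endpoints so float rounding cannot move them.
	edges[0], edges[n] = lo, hi
	return edges
}

// bucketIndex returns the bucket whose half-open range [edge[k], edge[k+1])
// contains d. Samples outside the overall range are clamped into the first
// or last bucket (an assumption; a real tool might count them separately).
func bucketIndex(edges []time.Duration, d time.Duration) int {
	last := len(edges) - 2
	// Index of the first edge strictly greater than d.
	i := sort.Search(len(edges), func(i int) bool { return edges[i] > d })
	switch {
	case i == 0:
		return 0
	case i > last+1:
		return last
	default:
		return i - 1
	}
}
```

With the documented defaults (50 buckets across 100us..10s) this gives
ten buckets per decade, so bucket resolution stays roughly constant in
relative terms from the fast path to the slow tail.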

All subcommands accept --config <orca.yaml> to populate origin and
cachestore coordinates from the same YAML the orca daemon consumes.
Per-flag overrides win over the YAML value.

Dev-harness changes
-------------------

LocalStack now exposes a NodePort (default 30200) mirroring the
existing Azurite NodePort 30100, so the host-side tool can talk to
the cachestore + awss3 origin without a kubectl port-forward. The
kind extraPortMappings are extended accordingly. A new render flag
LocalstackNodePort is passed through hack/orca/Makefile's
render-dev target.

Makefile targets renamed
------------------------

  seed-azure          -> dev-azure
  seed-generate       -> data-generate
  seed-upload         -> data-upload
  seed-list           -> data-list
  seed-delete         -> data-delete

Plus six new orcadev-specific targets: roundtrip, cache-list,
cache-inspect, cache-clear, bench, scenario. The SEED_ARGS variable
is renamed ARGS for action-oriented clarity.

The orcaseed source tree under hack/cmd/orcaseed/ is deleted in
this commit; orcaseed's TestParseSize coverage moves to orcadev's
size_test.go.

Verification
------------

  go build ./...                          clean
  go test -count=1 ./hack/cmd/orcadev/... clean
  golangci-lint run -c .golangci.yaml ./hack/cmd/orcadev/...  clean

Smoke-tested against the existing dev cluster: upload + list +
delete cycle against Azurite works end-to-end; bench correctly
reports connection refused when LocalStack is not yet on the new
NodePort (existing dev clusters need a 'make -C hack/orca deploy'
to pick up the NodePort change).
@plombardi89 plombardi89 requested a review from a team May 21, 2026 16:45
Adds a new Step 8 to hack/orca/quickstart.md walking through the
common orcadev workflows against a running dev cluster:

  8a Roundtrip          - one-command SHA-256 correctness check,
                          including a sample --dump-diff mismatch
                          output for triage.
  8b Benchmarks         - throughput + latency runs, plus the
                          jq-based comparison workflow using
                          --json-out + --label across iterations.
  8c Scenarios          - one-line invocations of cold-warm,
                          range-stress, empty-object, etag-change.
  8d Cache inspection   - cache-list / cache-inspect / cache-clear
                          for inspect-and-clear-between-experiments
                          workflows.

The existing 'Tear down' section becomes Step 9. dev-harness.md's
'Exercise the cache' section gets a one-line pointer to the new
step so readers landing there for benchmarking find their way to
the tutorial.
Replaces the one-shot Kubernetes Jobs that previously created the
LocalStack S3 buckets ('orca-cache', 'orca-origin') and the Azurite
container ('orca-test') with mechanisms that re-fire on every
emulator restart.

Failure mode being fixed
------------------------

LocalStack and Azurite both run with ephemeral state (emptyDir +
PERSISTENCE=0 / no persistence mode). When their pods restart (OOM,
eviction, manual delete, kind node restart) state is wiped. The
existing 'orca-buckets-init' and 'orca-azurite-container-init' Jobs
are Kubernetes Jobs - they ran once at first deploy and could not
re-run after emulator restarts. Result: orca pods CrashLoopBackOff
forever with 'NoSuchBucket' on the cachestore versioning probe;
manual recovery was required.

Mechanism
---------

LocalStack: 'localstack-init-buckets' ConfigMap mounted at
/etc/localstack/init/ready.d/init-buckets.sh (defaultMode 0755).
LocalStack 3.x's native init-hooks pattern rescans this directory on
every container start. The script idempotently creates both buckets
and re-checks cachestore-versioning is unset (orca's versioningGate
requirement).

Azurite: 'container-ensurer' sidecar in the same Pod, running a
30-second forever-loop that calls 'az storage container create'
idempotently. Talks to Azurite over loopback (127.0.0.1:10000) so
it doesn't depend on cluster DNS.

Both files lose their separate init-Job templates
(02-init-job.yaml.tmpl, 04-azurite-init.yaml.tmpl); 'make
deploy-localstack' and 'make deploy-azurite' lose the Job
apply + wait steps and gain bucket/container readiness polls so
operators see explicit success when the init mechanisms have run.

After this change, 'make -C hack/orca deploy-localstack' and
'deploy-azurite' are clean idempotent recovery targets: re-running
either against a stale cluster heals the buckets without a full
'orca-down && orca-up'.

Verification
------------

Live-tested against an existing kind cluster:

  - Applied the new LocalStack manifest; init-hook ran and created
    both buckets within 5s of container Ready.
  - Deleted the LocalStack pod; new pod's init-hook recreated both
    buckets within 5s of restart.
  - Applied the new Azurite manifest with the container-ensurer
    sidecar; sidecar created 'orca-test' within 30s.
  - Deleted the Azurite pod; new sidecar recreated the container
    within 30s.
  - Three orca pods that had been CrashLoopBackOff for 9 days with
    NoSuchBucket reached 3/3 Ready after a rollout-restart.

Initial sidecar memory limit of 64Mi was too low (Azure CLI is
Python-based and loads ~150MB of modules); bumped to 256Mi limit /
128Mi request, sized to be comfortably above measured RSS.

Templates render and pass kubectl --dry-run validation; no Go code
changed; CI surface unaffected.
…data-random

Replaces the existing 'upload --generate --prefix' flag with '--name'
(same semantics in --file mode; in --generate mode the per-blob
index is appended). Indices are now 1-based, so:

  upload --generate --count 1 --name foo  -> blob 'foo1'
  upload --generate --count 3 --name foo  -> blobs 'foo1','foo2','foo3'
  upload --generate --count 5             -> blobs 'synth1..synth5'
                                             (default --name is 'synth')

Adds a new make target wrapping the singleton-blob case:

  make -C hack/orca data-random NAME=foo SIZE=10MiB

which uploads exactly one random blob literally named 'foo1' of
SIZE bytes. Useful for seeding a key to feed 'bench KEY=foo1' or
'roundtrip --key foo1' without dropping a file on local disk first.

Breaking change vs. the prior flag surface: --prefix is removed,
not deprecated. Acceptable since PR #176 (which introduced --prefix
into orcadev) has not yet merged. The default output blob-name
shape changes from 'synth-0' to 'synth1' (no dash separator;
indices start at 1 rather than 0).

Determinism note: under --seed, blob i's content is now derived
from (seed + i) with i in 1..count, replacing the previous
0..count-1 range. The byte-identical-across-runs guarantee still
holds from this commit onward, but blobs generated with the same
seed before this change will not match.
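
A minimal sketch of the naming and seeding rules described above.
The function names are hypothetical, and the use of math/rand as the
byte source is an assumption; only the 1-based index and the
(seed + i) derivation come from this commit message.

```go
package main

import (
	"fmt"
	"math/rand"
)

// blobName appends the 1-based index directly to the base name,
// with no separator: ("foo", 1) -> "foo1".
func blobName(base string, i int) string {
	return fmt.Sprintf("%s%d", base, i)
}

// blobNames lists the names produced for a given --count.
func blobNames(base string, count int) []string {
	names := make([]string, 0, count)
	for i := 1; i <= count; i++ {
		names = append(names, blobName(base, i))
	}
	return names
}

// blobContent derives blob i's bytes from (seed + i), so the same
// seed and index always yield the same bytes.
// Assumption: math/rand stands in for whatever generator orcadev uses.
func blobContent(seed int64, i int, size int) []byte {
	r := rand.New(rand.NewSource(seed + int64(i)))
	buf := make([]byte, size)
	r.Read(buf) // (*rand.Rand).Read always fills buf and returns a nil error
	return buf
}
```

Calling blobNames("synth", 5) yields synth1 through synth5, matching
the default shape listed above.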

Drive-by fix: awss3 Put now buffers the body into a bytes.Reader
before calling PutObject. The aws-sdk-go-v2 SigV4 signer requires
a seekable body to pre-compute X-Amz-Content-SHA256; the previous
implementation passed an unseekable io.Reader from newRandomReader
and failed at runtime with 'failed to seek body to start, request
stream is not seekable'. Caught while live-testing the new
data-random target against LocalStack.
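
The buffering fix can be illustrated without the AWS SDK. The sketch
below is an assumption-laden stand-in: patternReader is a hypothetical
replacement for newRandomReader, and bufferBody shows only the
"read fully, then wrap in bytes.Reader" step, not the PutObject call.

```go
package main

import (
	"bytes"
	"io"
)

// patternReader is a hypothetical stream that yields n bytes and has
// no Seek method, like the unseekable reader described above.
type patternReader struct {
	remaining int
	next      byte
}

func newPatternReader(n int) io.Reader {
	return &patternReader{remaining: n}
}

func (p *patternReader) Read(buf []byte) (int, error) {
	if p.remaining == 0 {
		return 0, io.EOF
	}
	n := len(buf)
	if n > p.remaining {
		n = p.remaining
	}
	for i := 0; i < n; i++ {
		buf[i] = p.next
		p.next++
	}
	p.remaining -= n
	return n, nil
}

// bufferBody reads r fully into memory and returns a *bytes.Reader,
// which implements io.ReadSeeker and so can be rewound by a signer
// that needs to hash the payload before sending it.
func bufferBody(r io.Reader) (*bytes.Reader, error) {
	data, err := io.ReadAll(r)
	if err != nil {
		return nil, err
	}
	return bytes.NewReader(data), nil
}
```

The trade-off is memory: the whole body is held in RAM, which is
reasonable for dev-sized blobs but would need revisiting for very
large uploads.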

Verification
------------

  go build / go test / golangci-lint:  clean (0 issues)
  Live against Azurite NodePort 30100:
    make data-random NAME=foo SIZE=1MiB         -> foo1 (1.00 MiB)
    make data-generate ARGS='--count 3 --name multi --size 256KiB'
                                                -> multi1,multi2,multi3
    make data-generate ARGS='--count 2 --size 128KiB'  (default name)
                                                -> synth1, synth2
…dtrip/scenario

The orca edge listener (8443) is a ClusterIP-only Service in the
dev harness, so anything orcadev does that hits it (bench,
roundtrip, scenario) requires a kubectl port-forward. Today
operators have to run 'make -C hack/orca port-forward' in another
shell first; forgetting yields 'dial tcp 127.0.0.1:8443: connect:
connection refused' and a confusing user experience.

This change adds an auto-managed port-forward:

  1. Before constructing edgeClient, ensureEdgeReachable probes
     localhost:8443 with a 500ms TCP dial.
  2. If the probe succeeds (operator already has a port-forward,
     or anything else is bound) -> no-op, proceed.
  3. If the probe fails AND --orca-url is the dev default
     (localhost:8443 / 127.0.0.1:8443) AND --auto-port-forward
     is true (default) -> spawn 'kubectl --context kind-orca-dev
     -n unbounded-kube port-forward svc/orca 8443:8443' as a
     subprocess, wait up to 10s for the 'Forwarding from' stdout
     sentinel, then return a deferred cleanup that SIGTERMs the
     subprocess.
  4. If --auto-port-forward=false OR the URL is non-default ->
     no-op (operator clearly knows what they're doing); the
     original connection error surfaces through the actual
     request.
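
The decision in steps 1-4 can be sketched as follows. This is an
illustration only: probeTCP, isDevDefault, and needsAutoForward are
hypothetical names, the URL is assumed to carry a scheme (for example
http://localhost:8443), and the kubectl subprocess handling is left
out.

```go
package main

import (
	"net"
	"net/url"
	"time"
)

// probeTCP reports whether something accepts a TCP connection on addr
// within the timeout (step 1 uses 500ms).
func probeTCP(addr string, timeout time.Duration) bool {
	conn, err := net.DialTimeout("tcp", addr, timeout)
	if err != nil {
		return false
	}
	_ = conn.Close()
	return true
}

// isDevDefault reports whether rawURL points at the dev default edge
// address. Assumes rawURL includes a scheme so url.Parse fills Host.
func isDevDefault(rawURL string) bool {
	u, err := url.Parse(rawURL)
	if err != nil {
		return false
	}
	switch u.Host {
	case "localhost:8443", "127.0.0.1:8443":
		return true
	}
	return false
}

// needsAutoForward combines steps 2-4: spawn a forward only when the
// probe failed, the flag is on, and the URL is the dev default.
func needsAutoForward(rawURL string, autoForward, reachable bool) bool {
	if reachable {
		return false // step 2: something already listens, nothing to do
	}
	return autoForward && isDevDefault(rawURL)
}
```

A caller would run probeTCP("localhost:8443", 500*time.Millisecond)
first and pass the result in as reachable.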

New flags (both persistent, both opt-out flavored):

  --auto-port-forward (bool, default true)
  --kube-context      (string, default 'kind-orca-dev')

A one-line 'auto port-forward: localhost:8443 -> svc/orca:8443'
is printed to stderr when the forward fires so operators aren't
surprised. Subprocess stderr is captured into an in-memory buffer
and surfaced in any error message (handles 'context not found',
'service not found', kubectl-not-on-PATH).

Tests cover the loopback probe (open and closed paths), the URL
host:port parser, and the 'Forwarding from' sentinel detector.
The kubectl subprocess itself is exercised live.

Verified live against the dev cluster:

  - Without a port-forward: 'auto port-forward:' fires, orca
    reachable, subprocess cleaned up on exit.
  - With operator's own 'make -C hack/orca port-forward' running:
    probe succeeds; no new forward spawned.
  - --auto-port-forward=false: connection refused (today's
    behavior preserved).
  - --orca-url http://nonexistent:9999: auto-forward skipped;
    DNS error pass-through.

Makefile help text and quickstart.md Step 8 prelude updated to
note the auto-managed behaviour; the manual port-forward target
remains available for long-lived sessions.
…mes opt-in

The dev harness was internally inconsistent: .env.example shipped
ORIGIN_DRIVER=awss3 and the Makefile fell back to awss3, but
quickstart.md instructed operators to override to azureblob and
the orcadev tool's tutorial assumed an Azurite origin. Operators
following quickstart-as-written ended up in azureblob mode while
the rest of the harness still defaulted to awss3, producing a
mismatch where orcadev's awss3 defaults seeded the wrong backend
and orca returned 404 on every fetch.

Flips the default everywhere:

  hack/orca/.env.example
    ORIGIN_DRIVER=awss3                 -> azureblob
    ORIGIN_ID=awss3-localstack          -> azureblob-azurite
    Reorder the three modes block: azureblob primary, awss3
    secondary, real-Azure tertiary

  hack/orca/Makefile
    ${ORIGIN_DRIVER:-awss3}              -> :-azureblob
    ${ORIGIN_ID:-awss3-localstack}       -> :-azureblob-azurite
    Remove the deploy-azurite-maybe target entirely; both
    LocalStack and Azurite are now always deployed regardless of
    ORIGIN_DRIVER. The cost is a few MB of memory per cluster;
    the benefit is eliminating the 'switched modes mid-cluster
    left Azurite undeployed' bug class and making every mode
    switch require only a ConfigMap change.

  hack/cmd/orcadev/orcadev/config.go defaultGlobalFlags
    originDriver:   awss3                  -> azureblob
    originID:       inttest-origin         -> azureblob-azurite
    originBucket:   orca-origin            -> orca-test
    originEndpoint: localhost:30200        -> localhost:30100/devstoreaccount1/
    (Cachestore fields unchanged; awss3 fields kept so the
    --origin-driver=awss3 opt-in path still works without
    additional credential flags.)

  hack/orca/dev-harness.md
    Origin modes table reordered; 'What you get' updated to
    always-on Azurite; bring-up sequence step 7 updated to
    'deploy-azurite' (no longer conditional); 'Switching origins'
    inverted (was 'switch FROM awss3 TO azureblob'; now the
    reverse); 'Recovery' lists both deploy-localstack and
    deploy-azurite; troubleshooting section invokes the right
    mode.

  hack/orca/quickstart.md
    Step 1 simplified: defaults are already azureblob-with-Azurite,
    so the only optional edit is LOG_LEVEL=debug. Step 2 pod-list
    drops the stale 'orca-buckets-init' and
    'orca-azurite-container-init' Job rows (those Jobs were
    removed in commit 886607c; the list shows the current
    always-on Deployments instead).

  hack/cmd/orcadev/orcadev/orcadev_test.go
    Default-driver assertion flipped to expect 'azureblob'.

Drive-by: add a !.env.example exception to .gitignore so the
new defaults actually ship. Previously the .env.* rule silently
swallowed .env.example, leaving the file local-only on each
developer's machine, which explains why the .env.example shape has
been drifting from what quickstart.md actually recommends.

Real-Azure mode unchanged. Existing operators with a custom .env
in awss3 mode keep working (their .env overrides the new
default); they will need to pass --origin-driver=awss3 to orcadev
or update their .env to track the new default.

Verified live against the dev cluster:

  make -C hack/orca render            -> origin driver: azureblob
  make -C hack/orca deploy            -> Azurite + LocalStack + orca up
  make -C hack/orca data-random NAME=defaulttest SIZE=1MiB
                                       -> uploads to Azurite/orca-test
  make -C hack/orca bench KEY=defaulttest1 ARGS='--duration 10s'
                                       -> 107 MiB/s, 1078/1078 ok
  make -C hack/orca roundtrip FILE=/tmp/r.bin
                                       -> PASS, SHA-256 matches
  make -C hack/orca cache-list        -> chunks under azureblob-azurite/...
  make -C hack/orca cache-inspect BUCKET=orca-test KEY=defaulttest1
                                       -> 1/1 chunks (100%)
  awss3 opt-in regression check       -> uploads, lists, deletes OK
… of cancelling them

Before this change, '--duration N' used a single context.WithTimeout
for both the worker-loop gate AND every in-flight HTTP request.
When the timer fired, mid-flight requests were ripped out from
under the SDK and counted as 'context_canceled' / 'body_read_error'
in the errors-by-code summary. A typical 16-worker run reported
~16 errors at the tail of every benchmark - benign but
indistinguishable from a real failure mode.

Splits the loop into three contexts:

  gateCtx   - controls new-work admission. Expires at --duration.
              Workers select on it before picking up another
              request; when it fires they stop accepting work.

  reqCtx    - what HTTP calls actually use. Stays open after
              gateCtx closes; this is what lets in-flight
              requests run to natural completion.

  drainTimer (context.AfterFunc on gateCtx + time.AfterFunc on
              the drain budget) - cancels reqCtx after the
              configurable drain timeout if in-flight requests
              still haven't finished.

So a typical 30-second bench run now exits at ~30s if everything
is healthy and at most 30s + --drain-timeout if something is
genuinely hung.
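
The gate / request / drain split can be sketched as below. This is a
simplified illustration under stated assumptions, not the bench.go
code: runTimed, requestFunc, and sleepRequest are hypothetical names,
and the real loop also tracks latencies and error codes.

```go
package main

import (
	"context"
	"sync"
	"sync/atomic"
	"time"
)

// requestFunc stands in for one HTTP GET against the edge.
type requestFunc func(ctx context.Context) error

// sleepRequest simulates a request that takes d unless ctx is cancelled.
func sleepRequest(d time.Duration) requestFunc {
	return func(ctx context.Context) error {
		t := time.NewTimer(d)
		defer t.Stop()
		select {
		case <-t.C:
			return nil
		case <-ctx.Done():
			return ctx.Err()
		}
	}
}

// runTimed admits new requests until duration elapses, then gives
// in-flight requests up to drain to finish before cancelling them.
func runTimed(duration, drain time.Duration, workers int, do requestFunc) (ok, failed int64) {
	// gateCtx controls admission only; requests never see it.
	gateCtx, cancelGate := context.WithTimeout(context.Background(), duration)
	defer cancelGate()

	// reqCtx is what requests use; it outlives the gate.
	reqCtx, cancelReq := context.WithCancel(context.Background())
	defer cancelReq()

	// Once the gate closes, start the drain budget; when it runs out,
	// cancel whatever is still in flight. cancelReq is idempotent, so a
	// late timer firing after return is harmless.
	context.AfterFunc(gateCtx, func() {
		time.AfterFunc(drain, cancelReq)
	})

	var wg sync.WaitGroup
	var okN, failN atomic.Int64
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for gateCtx.Err() == nil { // stop accepting work once the gate closes
				if err := do(reqCtx); err != nil {
					failN.Add(1)
				} else {
					okN.Add(1)
				}
			}
		}()
	}
	wg.Wait()
	return okN.Load(), failN.Load()
}
```

With fast requests the tail errors disappear because nothing cancels
reqCtx before the in-flight calls return; with a tiny drain budget
and slow requests, each worker reports exactly one cancelled call.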

New flag:

  --drain-timeout 10s   how long to wait for in-flight requests
                        after the gate closes. Default 10s
                        (matches g.timeout; well above the
                        observed p99 for any reasonable run).

New JSON fields (schema_version still 1; additive non-breaking):

  config.drain_timeout_seconds   the configured drain budget
  results.gate_seconds           wall-clock the gate was open
  results.drain_seconds          wall-clock spent draining
                                 (drain_seconds == 0 in --requests
                                 mode since there is no gate phase)

  elapsed_seconds remains the total (gate + drain), so consumers
  keying off it continue to work.

Human output gains two lines mirroring the JSON fields.

In --requests N mode the new fields still behave sensibly: the
last worker through the reqLimit gate calls setGateClosed(),
producing gate_seconds == elapsed_seconds and a tiny
drain_seconds covering any concurrent workers still finishing
their last fetch.

Verified live against the dev cluster:

  bench --duration 10s --concurrency 8 --range-size 256KiB --read-pattern random
    -> 969 requests, 0 errors, gate 10.001s, drain 59ms.
       The 16-error tail storm is gone.

  bench --duration 5s --concurrency 8 --drain-timeout 1ms (...)
    -> 8 errors (exactly --concurrency), drain budget exhausted.
       Proves the drain cap fires when in-flight requests
       overrun the budget.

  bench --duration= --requests 50 --concurrency 4 (...)
    -> 50 requests, 0 errors, gate 551ms, drain 9ms.
       --requests mode unaffected.
…pers)

- percentile: collapse percentile() to call percentileSorted() after
  the in-place sort, eliminating a copy of the ceiling-rank math.
- upload.go: drop the i := i loop-variable shadow now that the
  module is on go 1.26 (per-iteration variable since 1.22).
- roundtrip.go: replace the bespoke parseInt with strconv.ParseInt
  and the manual dash-search loop with strings.IndexByte / HasPrefix.
- hash.go: add unquoteETag() helper and reuse at the 4 origin sites
  and 1 edge site that previously inlined strings.Trim(s, "\"").
- list.go: route runDelete through the existing confirmPrompt
  helper. Introduce errConfirmAborted as a sentinel so runDelete
  preserves its 'aborted.' + exit 0 behaviour while cache.go's
  callers still surface a non-zero exit on decline. confirmPrompt
  also no longer prints a leading space when msg is empty.
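
For readers unfamiliar with the ceiling-rank method named in the
percentile bullet, a minimal sketch follows. It assumes the
nearest-rank definition (rank = ceil(p/100 * n)) and uses float64
samples for brevity; the real helpers may differ in types and edge
handling.

```go
package main

import (
	"math"
	"sort"
)

// percentileSorted returns the p-th percentile (0 < p <= 100) of an
// ascending-sorted slice using the nearest-rank (ceiling) method.
// An empty slice yields 0.
func percentileSorted(sorted []float64, p float64) float64 {
	if len(sorted) == 0 {
		return 0
	}
	rank := int(math.Ceil(p * float64(len(sorted)) / 100))
	if rank < 1 {
		rank = 1
	}
	if rank > len(sorted) {
		rank = len(sorted)
	}
	return sorted[rank-1]
}

// percentile sorts samples in place and delegates, so the rank math
// lives in exactly one function.
func percentile(samples []float64, p float64) float64 {
	sort.Float64s(samples)
	return percentileSorted(samples, p)
}
```
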
…cenario env)

- io.go: add emitJSONResult[T any] that bench.emitBenchResult and
  scenario.emitScenarioResult now delegate to. The code that encodes
  JSON to stdout or the --json-out path was duplicated verbatim
  across the two.
- cachestore.go: add buildS3Client() with the static credentials /
  checksum opt-out / endpoint / path-style configuration. Both
  newCachestoreClient and newAWSS3Origin call it; drops a half-screen
  of awsconfig.LoadDefaultConfig boilerplate from origin.go.
- cachestore.go: add walkS3() pagination helper consumed by both
  cachestoreClient.List and awss3Origin.List. The visit callback
  returns false to short-circuit the walk for limit-bounded callers.
- cache.go: add forEachChunk() that loops over chunk.Key indices,
  HEADing each path. clearByObject (cache) and clearScenarioObject
  (scenario) now drive the same loop. The latter previously had a
  hardcoded 1024-index cap; it now derives nChunks from the origin
  HEAD size via the new resolveObjectMetadata helper, with the 1024
  fallback only used when size is unknown.
- scenario.go: add newScenarioEnv + scenarioCleanup. Each of the
  four scenarios was opening with the same 8 lines of origin-client
  / ensure-bucket / edge-client construction and the same 5-line
  defer-cleanup-if-not-keepData block. Both are now one-liners.
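
The short-circuiting visit callback described for walkS3 follows a
common pagination pattern, sketched below without the AWS SDK. The
page and fetchFunc types are hypothetical stand-ins for a paginated
list call; only the "visit returns false to stop" contract is taken
from the bullet above.

```go
package main

// page is one batch of keys plus the token for the next batch
// (empty when there are no more pages). Hypothetical type.
type page struct {
	keys []string
	next string
}

// fetchFunc returns the page that starts at token. It stands in for
// a paginated list call against the object store.
type fetchFunc func(token string) (page, error)

// walk calls visit for every key across all pages, in order. It stops
// as soon as visit returns false, without fetching further pages.
func walk(fetch fetchFunc, visit func(key string) bool) error {
	token := ""
	for {
		p, err := fetch(token)
		if err != nil {
			return err
		}
		for _, k := range p.keys {
			if !visit(k) {
				return nil // caller has seen enough
			}
		}
		if p.next == "" {
			return nil
		}
		token = p.next
	}
}
```

A limit-bounded caller returns false from visit once it has collected
enough keys, which avoids listing the rest of a large bucket.
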
… precedence

- bench.go: extract pickBenchRange() from the worker hot loop in
  runBenchLoop. The original 25-line if/else cascade is now a single
  call site with three intent-named branches (full / random /
  sequential). Adds bench_test.go::TestPickBenchRange covering each
  branch including the wrap-at-boundary case for sequential and a
  256-draw in-bounds check for random.
- config.go: switch globalFlags.resolve from 'value differs from
  default => operator set the flag' to cobra Flag.Changed. This is
  strictly more correct: an operator who passes the literal default
  (e.g. --origin-bucket=orca-test) now wins over the YAML rather
  than silently letting the YAML override. resolve now takes a
  *cobra.Command instead of a context.Context (the ctx parameter
  was unused). Drops the apologetic 'good-enough' comment.
- orcadev.go: pass cmd directly to resolve.
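
The precedence change in config.go can be illustrated with the
standard library's flag package, whose FlagSet.Visit walks only the
flags that were actually set. This is an analogy, not the orcadev
code: orcadev uses cobra's Flag.Changed, and resolveBucket is a
hypothetical function.

```go
package main

import "flag"

// resolveBucket picks the origin bucket with the precedence
// explicit flag > YAML value > built-in default.
func resolveBucket(args []string, yamlBucket string) (string, error) {
	const def = "orca-test"
	fs := flag.NewFlagSet("orcadev", flag.ContinueOnError)
	bucket := fs.String("origin-bucket", def, "origin bucket or container")
	if err := fs.Parse(args); err != nil {
		return "", err
	}

	// Visit only sees flags set on the command line, so a flag passed
	// with the literal default value still counts as explicit. Comparing
	// *bucket against def could not tell those two cases apart.
	explicit := false
	fs.Visit(func(f *flag.Flag) {
		if f.Name == "origin-bucket" {
			explicit = true
		}
	})

	switch {
	case explicit:
		return *bucket, nil
	case yamlBucket != "":
		return yamlBucket, nil
	default:
		return def, nil
	}
}
```
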
The dev-harness manifests under deploy/orca/dev/ used to include two
one-shot Jobs (02-init-job.yaml.tmpl, 04-azurite-init.yaml.tmpl) that
created the LocalStack S3 buckets and the Azurite container at first
deploy. Commit 886607c ("dev-harness: self-healing LocalStack +
Azurite bucket bootstrap") deleted both Jobs and replaced them with
PostStart lifecycle hooks / a sidecar driven by an inline ConfigMap
so the buckets/container are re-created on every emulator restart.

That commit did not update TestDevManifestsRender, which still
asserted a Job kind in the rendered output, leaving 'make' broken on
this branch and main since the May 21 dev-harness change. The
rendered kind set is now ConfigMap + Deployment + Service; update
the assertion to match and document why Job is no longer expected.

Unrelated drive-by fix included here so the orcadev refactor branch
runs 'make' green end-to-end.
Adds an 'orcadev' target that runs 'test' then builds bin/orcadev
from ./hack/cmd/orcadev. Recipe shape mirrors the existing forge
target; the ORCADEV_BIN / ORCADEV_CMD variables sit alongside the
target itself rather than the global variables header so the tool's
build config travels with its only consumer. Also adds the tool to
the .PHONY list (next to forge) and the manually curated help text
under the Build section.

Deliberately NOT added to the 'all' target: orcadev is a host-side
operator tool used against an already-running orca harness, not a
shipped binary, so the default 'make' shouldn't pay its build cost.
hack/orca/Makefile keeps its 'go run hack/cmd/orcadev' invocations
so the inner dev loop stays free of a separate compile step.
Replace the previous .env + Make-targets + multiple shell scripts
flow with one install script (hack/orca/setup-orca.sh) plus thin
kind cluster lifecycle helpers (kind-up.sh / kind-down.sh) and one
quickstart (hack/orca/README.md). The install script is
cluster-agnostic: it works against any kubectl context (kind, AKS,
EKS, k3d, ...) with the same handful of flags.

Defaults match the previously-supported dev shape: azureblob origin
backed by in-cluster Azurite, S3 cachestore backed by in-cluster
LocalStack, zero real cloud credentials required. Real-Azure mode
is opted into via AZURE_STORAGE_* env vars.

orcadev grows a --preset=dev flag (the default) that bundles the
well-known dev coordinates. The auto-port-forward machinery now
covers svc/orca + svc/azurite + svc/localstack so the same
`bin/orcadev <verb>` invocation works on every cluster flavor
without depending on kind NodePort hostPort mappings. Kube context
default switches from kind-orca-dev to empty (current context).

Removed: quickstart.md, dev-harness.md (folded into README.md);
.env.example, orcadev-flags.sh (replaced by CLI flags + Go preset);
kind-create.sh, kind-load.sh, down.sh, deploy-credentials.sh
(folded into setup-orca.sh / kind-up.sh / kind-down.sh);
rendered-dev/ directory (setup-orca.sh renders to a tempdir).

Root Makefile gains orca-install (current context) and
orca-kind-up / orca-kind-down (with orca-up / orca-down kept as
back-compat aliases). hack/orca/Makefile shrinks to the three
operational verbs (status, logs, port-forward); everything else is
done directly with bin/orcadev <verb>.
…ward, anti-affinity)

Addresses the seven dev-workflow review findings logged in
designs/orca/dev-workflow-remediation-plan.md.

1. Drop kind NodePort: emulator Services switch to ClusterIP
   unconditionally; remove kind extraPortMappings. orcadev
   port-forwards everywhere now, so NodePort offered no value and
   was an AKS / shared-cluster footgun (fixed nodePort 30100/30200
   could collide or be policy-blocked).

2. Hardened uninstall: setup-orca.sh --uninstall now deletes only
   resources matching app.kubernetes.io/name=orca or
   app.kubernetes.io/part-of=orca-dev labels and leaves the
   namespace intact. New --delete-namespace flag explicitly opts
   into removing the namespace (and every unrelated resource in
   it). orca-credentials Secret now carries the orca name label so
   the label-selector picks it up.

3. Universal port-forward coverage: every orcadev subcommand that
   touches origin / cachestore / edge now calls ensurePortForwards
   at the top of its RunE (upload, list, delete, cache list/inspect/
   clear, roundtrip, bench, scenario). Previously only roundtrip/
   bench/scenario opened forwards, so bin/orcadev upload --generate
   failed on AKS once kind NodePorts were dropped.
   TestEverySubcommandOpensPortForwards is the structural guard.

4. README accuracy: drop multi-object and range-large from the
   scenario list (not implemented), drop the same from the orcadev
   package docstring, update cache examples to the default azureblob
   container orca-test, document --delete-namespace and the
   existing-cluster anti-affinity behavior.

5. RequireAntiAffinity template knob in deploy/orca/04-deployment:
   true (kind) renders the strict requiredDuringScheduling block;
   false (non-kind) renders preferredDuringScheduling so clusters
   with fewer than 3 schedulable nodes still roll out. setup-orca.sh
   picks the right default by detecting kind-* contexts.

6. Tempdir trap consolidation in setup-orca.sh: single cleanup_paths
   stack + one EXIT trap, replacing the previous double-trap that
   leaked the kind image archive tempdir on --kind-load runs.

7. Safety rails: make orca-install errors when targeted at a non-kind
   context with the default ghcr.io/azure/orca:dev image (which
   wouldn't be pullable). setup-orca.sh --build without --kind-load
   now errors fast instead of building an image nothing uses.
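
Item 5 above can be sketched in Go. This is an illustration only: the struct, the template fragment, and the helper names are hypothetical, and in the commit the context detection lives in setup-orca.sh rather than in Go. Only the RequireAntiAffinity knob, the required/preferred split, and the kind-* prefix check come from the description.

```go
package main

import (
	"fmt"
	"strings"
	"text/template"
)

// deployParams is a hypothetical stand-in for the values handed to the
// deployment template; only RequireAntiAffinity is named in the commit.
type deployParams struct {
	RequireAntiAffinity bool
}

// affinityTmpl is an illustrative fragment, not the real 04-deployment file.
const affinityTmpl = `affinity:
  podAntiAffinity:
{{- if .RequireAntiAffinity }}
    requiredDuringSchedulingIgnoredDuringExecution:
      - topologyKey: kubernetes.io/hostname
        labelSelector:
          matchLabels:
            app.kubernetes.io/name: orca
{{- else }}
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          topologyKey: kubernetes.io/hostname
          labelSelector:
            matchLabels:
              app.kubernetes.io/name: orca
{{- end }}
`

// defaultRequireAntiAffinity mirrors the described default: strict on
// kind-* contexts, relaxed on every other context.
func defaultRequireAntiAffinity(kubeContext string) bool {
	return strings.HasPrefix(kubeContext, "kind-")
}

// render executes the fragment with the default chosen for kubeContext.
func render(kubeContext string) (string, error) {
	t, err := template.New("affinity").Parse(affinityTmpl)
	if err != nil {
		return "", err
	}
	var b strings.Builder
	p := deployParams{RequireAntiAffinity: defaultRequireAntiAffinity(kubeContext)}
	if err := t.Execute(&b, p); err != nil {
		return "", err
	}
	return b.String(), nil
}

func main() {
	// "kind-orca" and "my-aks-cluster" are made-up context names.
	for _, ctx := range []string{"kind-orca", "my-aks-cluster"} {
		out, err := render(ctx)
		if err != nil {
			fmt.Println("render failed:", err)
			return
		}
		fmt.Printf("# context %s\n%s\n", ctx, out)
	}
}
```

The preferred form lets the scheduler co-locate replicas when it has to, which is why a cluster with fewer than 3 schedulable nodes can still finish the rollout.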

Validated on kind end-to-end: fresh cluster + setup-orca.sh +
orcadev upload/list/cache/roundtrip/scenario/bench. AKS validation
remains a follow-up owner task per the plan.
@plombardi89 plombardi89 changed the title hack: introduce orcadev, dev/debug tool (subsumes orcaseed) orcadev tool (benchmarking, dev setup, etc.) May 28, 2026
Comment thread hack/cmd/orcadev/orcadev/edge.go
jveski
jveski previously approved these changes May 28, 2026
Contributor

@jveski jveski left a comment

LGTM, one nit

The old roundtrip output buried the source and received SHA-256s
at the right edges of two long, differently-shaped lines, so a
human had to compare 64 hex characters across visual whitespace
to confirm they matched. The new layout puts each sha256 on its
own indented line under a short heading and appends a MATCH or
MISMATCH marker so the at-a-glance verdict is unambiguous:

  source: orca-test.bin (5.00 MiB)
    sha256: 45a643a91d90c4fb...

  iter 0: status=200 bytes=5.00 MiB elapsed=151ms rate=33.08 MiB/s
    sha256: 45a643a91d90c4fb...  MATCH

  PASS sha256=45a643a9... (3 iters)

The MISMATCH branch keeps the existing copy-paste summary block
(MISMATCH on iter N + source + received) so operators still get
full hashes for incident triage, but the per-iter MISMATCH marker
now appears at the same indentation as MATCH so visual scanning
catches it immediately.

Coverage: new roundtrip_output_test.go captures stderr through an
os.Pipe and asserts the heading + indented sha256 + marker + PASS
format end-to-end against a fake origin + httptest edge. Two new
helpers (shortHash, iterLabel) get their own table tests for
singular/plural and short-hash boundary cases.
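
A minimal sketch of what the two helpers might look like, assuming the signatures below (the commit names shortHash and iterLabel but does not show them). The truncation widths of 16 and 8 characters are inferred from the sample output above, and formatIterHash is a hypothetical name for the per-iteration line.

```go
package main

import "fmt"

// shortHash returns the first n characters of a hex digest followed by
// "...". A digest of n characters or fewer is returned unchanged.
// Hypothetical signature; the real helper may differ.
func shortHash(hexDigest string, n int) string {
	if n < 0 {
		n = 0
	}
	if len(hexDigest) <= n {
		return hexDigest
	}
	return hexDigest[:n] + "..."
}

// iterLabel picks the singular or plural form for the PASS summary.
func iterLabel(n int) string {
	if n == 1 {
		return "1 iter"
	}
	return fmt.Sprintf("%d iters", n)
}

// formatIterHash renders the indented per-iteration sha256 line with its
// MATCH or MISMATCH marker at a fixed position.
func formatIterHash(source, received string) string {
	marker := "MATCH"
	if source != received {
		marker = "MISMATCH"
	}
	return fmt.Sprintf("    sha256: %s  %s", shortHash(received, 16), marker)
}

func main() {
	// Made-up digests for illustration; real ones are 64 hex characters.
	src := "45a643a91d90c4fb0123456789abcdef"
	bad := "ffffffffffffffff0123456789abcdef"

	fmt.Println(formatIterHash(src, src))
	fmt.Println(formatIterHash(src, bad))
	fmt.Printf("  PASS sha256=%s (%s)\n", shortHash(src, 8), iterLabel(3))
}
```

Keeping both markers on the same formatted line is what makes the boundary cases worth a table test: a digest shorter than the cutoff must not gain a trailing "...", and one iteration must not print "1 iters".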

Also fixes the stale setup-orca.sh "Next steps" block: it
previously printed `bin/orcadev roundtrip --file /tmp/test.bin`
without telling the user where /tmp/test.bin comes from. Now it
prints a `dd` line first and uses /tmp/orca-test.bin to match
the README example, and leads with `scenario cold-warm` which
needs no seed file at all.

The previous errors were terse:

  Error: --file and --key are mutually exclusive
  Error: one of --file or --key is required

This left the user staring at the help text trying to figure out
which mode they actually wanted. The new messages spell out the
source-of-truth contract for each mode and, in the mutually-exclusive
case, suggest the workaround ("upload it under a different name")
that the user actually wants if they were trying to compare a local
file against an existing origin object.

The two modes are:

  --file PATH: source-of-truth is the local file. Upload it, fetch
               it back through orca, compare bytes.

  --key NAME:  source-of-truth is the current origin object. Read
               it from origin, fetch the same key through orca,
               compare bytes.

Combining them would be ambiguous (which is the source?) so it is
rejected, but the error now says so in words the operator can act
on without re-reading the long help.

Adds TestRunRoundtrip_FlagErrors which locks both messages.
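
The two-mode contract can be sketched as a small validation function. The names resolveMode, modeFile, and modeKey are hypothetical, and the error strings are written in the spirit of the description rather than copied from the real messages, which are not shown here.

```go
package main

import (
	"errors"
	"fmt"
)

// roundtripMode says which side is the source of truth for the comparison.
type roundtripMode int

const (
	modeFile roundtripMode = iota + 1 // local file is the source of truth
	modeKey                           // existing origin object is the source of truth
)

// resolveMode validates the --file / --key pair. The message wording is an
// assumption; only the two rules themselves come from the commit.
func resolveMode(file, key string) (roundtripMode, error) {
	switch {
	case file != "" && key != "":
		return 0, errors.New("--file and --key are mutually exclusive: " +
			"--file treats the local file as the source of truth, --key treats " +
			"the current origin object as the source of truth; to compare a " +
			"local file against an existing object, upload it under a different name")
	case file == "" && key == "":
		return 0, errors.New("one of --file or --key is required: " +
			"use --file PATH to upload a local file and fetch it back through orca, " +
			"or --key NAME to verify an object already in the origin")
	case file != "":
		return modeFile, nil
	default:
		return modeKey, nil
	}
}

func main() {
	// Each pair is (file, key); values are placeholders.
	for _, c := range [][2]string{
		{"/tmp/orca-test.bin", ""},
		{"", "orca-test.bin"},
		{"/tmp/orca-test.bin", "orca-test.bin"},
		{"", ""},
	} {
		mode, err := resolveMode(c[0], c[1])
		fmt.Printf("file=%q key=%q -> mode=%d err=%v\n", c[0], c[1], mode, err)
	}
}
```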

PR #176 review nit: the edge HTTP client used a zero-value
http.Client, which falls back to http.DefaultTransport. That
transport leaves MaxIdleConnsPerHost unset, so the stdlib default
of 2 applies: only two idle connections to the single orca host
are kept for reuse, and concurrent requests beyond those two tend
to pay a fresh TCP handshake (and TLS, in production deployments).

The bench subcommand defaults to 8 workers and the README suggests
--concurrency 16; scenarios spin up a few dozen parallel range reads.
With a keep-alive pool of only 2, all of these kept re-dialing
connections, silently slowing benchmarks and hiding real
performance characteristics.

The fix clones http.DefaultTransport (so we inherit the Dial / TLS /
HTTP-2 defaults), raises MaxIdleConnsPerHost to 256 (more than any
realistic dev concurrency), and explicitly pins MaxConnsPerHost=0
(unlimited) so a future reviewer can see we never silently throttle
the operator's --concurrency choice.

Two new tests lock the transport sizing and the 5x-timeout cap so
a future refactor of newEdgeClient cannot quietly reintroduce the
regression.
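
A minimal sketch of the transport change, assuming a constructor shaped like the one below. The commit names newEdgeClient but does not show its signature; edgeTransportLimits is a hypothetical helper added here for inspection, and the 5x-timeout cap mentioned above is not modelled because its details are not given.

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

// newEdgeClient builds an HTTP client whose keep-alive pool does not
// bottleneck concurrent requests against a single host.
// The signature is an assumption made for this sketch.
func newEdgeClient(timeout time.Duration) *http.Client {
	var tr *http.Transport
	if def, ok := http.DefaultTransport.(*http.Transport); ok {
		// Clone keeps the stdlib proxy, dial, TLS, and HTTP/2 defaults.
		tr = def.Clone()
	} else {
		// DefaultTransport was replaced by something else; start clean.
		tr = &http.Transport{}
	}
	// Keep up to 256 idle connections per host instead of the default 2.
	tr.MaxIdleConnsPerHost = 256
	// Zero means no limit on total connections per host.
	tr.MaxConnsPerHost = 0
	// Note: the clone also inherits MaxIdleConns (100 on the stock
	// DefaultTransport), which bounds idle connections across all hosts.
	// This sketch leaves it untouched because the commit does not mention it.
	return &http.Client{Transport: tr, Timeout: timeout}
}

// edgeTransportLimits reports the pool sizing of a client built above.
func edgeTransportLimits(c *http.Client) (idlePerHost, maxPerHost int, ok bool) {
	tr, ok := c.Transport.(*http.Transport)
	if !ok {
		return 0, 0, false
	}
	return tr.MaxIdleConnsPerHost, tr.MaxConnsPerHost, true
}

func main() {
	c := newEdgeClient(30 * time.Second)
	idle, max, ok := edgeTransportLimits(c)
	fmt.Println("idle per host:", idle, "max per host:", max, "ok:", ok)
}
```

Cloning rather than constructing a bare http.Transport matters because a zero-value transport would drop the default dial timeouts and proxy handling along with the pool limits.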
@plombardi89 plombardi89 merged commit 67622ee into main May 29, 2026
24 checks passed
@plombardi89 plombardi89 deleted the phlombar/orcadev-tool branch May 29, 2026 16:36

2 participants