chore: sync prod state to master — vault, custom domains, magic links, kaniko, dpop, rbac #1
…, kaniko, dpop, rbac
This is a catch-up commit: the master branch had drifted significantly from the code actually running in production. Bundling the divergence here so subsequent PRs (deploy compute correctness, hostname injection, build context delivery, etc) can branch off a master that reflects reality. Each theme below is internally coherent; reviewers may want to read by section.
Migrations (new, applied in prod):
- 008_vault.sql — vault_secrets + vault_audit_log
- 009_env_column.sql — env scope column on resources/deployments
- 010_team_invitations.sql — pending-invite rows
- 011_api_keys.sql — programmatic agent tokens
- 012_audit_log.sql — generic audit log
- 013_magic_links.sql — email magic-link auth
- 014_custom_domains.sql — *.custom.tld pointing at deploys
New handlers:
- vault.go, vault_resolve.go — encrypted env-var store + vault://KEY ref resolution
- api_keys.go — programmatic token issuance
- audit.go — read audit log
- custom_domain.go — custom domain CRUD
- magic_link.go — email magic-link issue + redeem
- teams.go — team-member CRUD
- wellknown.go — RFC 8615 /.well-known/oauth-protected-resource
- team_members.go — invite + role management
New middleware:
- api_key.go — bearer-API-key extraction
- dpop.go — RFC 9449 DPoP proof binding (cnf / jkt claims)
- quota.go — per-team rate limiting
- rbac.go + role_lookup.go — role-based access for team-scoped endpoints
- auth.go — audience (aud) + cnf validation added
New models:
- vault.go, api_key.go, audit_log.go, custom_domain.go, magic_link.go, team_invitations.go
- deployment.go + resource.go gain env column + helper methods
Compute provider rewrite (internal/providers/compute/k8s/):
- client.go — kaniko in-cluster build (replaces the Rancher-Desktop "docker build" fallback). Builds via kaniko Job with build-context Secret; pushes to ghcr.io/.../instant-userapp/<id>.
- custom_domain.go — Ingress + cert-manager Certificate reconcile for *.custom.tld
- stack.go — multi-service stack provisioning
Plans:
- razorpay.go — subscription create/cancel/update via Razorpay API (matches the live billing handler)
Other:
- config.go — new env vars for vault key, magic-link expiry, custom-domain wildcard cert, kaniko image, etc.
- main.go — wire all new routes + middleware
- plans.yaml — vault + deployment limits added
- email.go — magic-link template + transactional send
Verified live in prod; deferred to git for too long. Subsequent PRs will fix specific friction points discovered while testing.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
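The vault://KEY reference resolution mentioned for vault_resolve.go can be pictured roughly as below. This is only a sketch under stated assumptions: the function name `resolveVaultRefs`, the `lookup` callback, and the error on a missing key are hypothetical and not taken from the repository; only the `vault://` prefix comes from the commit message.

```go
package main

import (
	"fmt"
	"strings"
)

// vaultScheme is the reference prefix named in the commit message.
const vaultScheme = "vault://"

// resolveVaultRefs returns a copy of env in which every "vault://KEY" value
// is replaced by the secret stored under KEY. lookup is a hypothetical
// stand-in for the real (encrypted) vault store.
func resolveVaultRefs(env map[string]string, lookup func(key string) (string, bool)) (map[string]string, error) {
	out := make(map[string]string, len(env))
	for name, val := range env {
		key, isRef := strings.CutPrefix(val, vaultScheme)
		if !isRef {
			out[name] = val // plain value, passed through unchanged
			continue
		}
		secret, ok := lookup(key)
		if !ok {
			// Assumption: an unknown key is an error rather than an empty value.
			return nil, fmt.Errorf("env %s: vault key %q not found", name, key)
		}
		out[name] = secret
	}
	return out, nil
}

func main() {
	secrets := map[string]string{"DB_PASSWORD": "s3cret"}
	lookup := func(k string) (string, bool) { v, ok := secrets[k]; return v, ok }

	env, err := resolveVaultRefs(map[string]string{
		"PASSWORD": "vault://DB_PASSWORD",
		"PORT":     "8080",
	}, lookup)
	fmt.Println(env["PASSWORD"], env["PORT"], err)
}
```

Failing on an unknown key, instead of silently injecting an empty string, is one possible design choice; the real handler may behave differently.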
mastermanas805
added a commit
that referenced
this pull request
May 11, 2026
Friction #1 from the agent-friction memo turned out not to be a missing feature — POST /stacks/new is fully implemented (router.go:178, handler in stack.go, K8sStackProvider for the build/deploy pipeline, manifest parser with service:// URL resolution). The friction was that none of it appeared in openapi.json, so an agent reading the spec had no way to discover the multi-service deploy primitive.
This commit adds OpenAPI entries for:
- POST /stacks/new — manifest + per-service tarballs
- GET /stacks/{slug} — per-service status + URLs
- DELETE /stacks/{slug} — teardown
- POST /stacks/{slug}/redeploy
Plus the StackRequest and StackResponse schemas covering the manifest shape (services keyed by name, each with build/port/expose/needs/env) and the response shape (overall status + services array with per-service status and the exposed url).
Tests:
- TestOpenAPI_StacksEndpointsDocumented guards the paths + schema.
- TestOpenAPISpecParses catches any string-quoting regression.
Live verification before this PR:
Deployed a 2-service stack (api + web) anonymously to prod:
$ curl -F manifest=<instant.yaml -F api=@api.tar.gz -F web=@web.tar.gz https://api.instanode.dev/stacks/new
→ 202, stk-72ca87d4
Polled GET /stacks/stk-72ca87d4 until status=healthy
Hit https://web-stk-72ca87d4.deployment.instanode.dev/ → "<h1>Stack OK</h1>..."
Hit /api on the web service → cross-service routing worked:
{"ok":true,"via_web":true,"api_says":{"ok":true,"from":"api"}}
Confirmed web pod env: API_URL=http://api:8080 (service:// resolved)
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
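The service:// resolution observed in the verification above (API_URL=http://api:8080) could be sketched as follows. The function name `resolveServiceURL` and the `ports` map are hypothetical; the only facts taken from the commit are the `service://` scheme, the per-service `port` field, and the resolved `http://api:8080` value.

```go
package main

import (
	"fmt"
	"strings"
)

// resolveServiceURL rewrites "service://<name>" into "http://<name>:<port>"
// using the ports declared per service in the manifest. Values that are not
// service references pass through unchanged.
func resolveServiceURL(val string, ports map[string]int) (string, error) {
	name, isRef := strings.CutPrefix(val, "service://")
	if !isRef {
		return val, nil
	}
	port, ok := ports[name]
	if !ok {
		// Assumption: referencing an undeclared service is a manifest error.
		return "", fmt.Errorf("unknown service %q", name)
	}
	return fmt.Sprintf("http://%s:%d", name, port), nil
}

func main() {
	ports := map[string]int{"api": 8080, "web": 3000}
	url, err := resolveServiceURL("service://api", ports)
	fmt.Println(url, err)
}
```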
mastermanas805
added a commit
that referenced
this pull request
May 16, 2026
…tract bugs
The api Deploy workflow had been red on every run. Root causes, all fixed:
Schema-mirror drift (the bulk of the ~25 failures)
testhelpers.runMigrations hand-mirrors the prod schema; CI ran against a
bare Postgres so any migration that added a table/column without a mirror
edit silently broke the gate. Completed the mirror (email_events,
pending_deletions, deployment_events, api_keys, magic_links, custom_domains,
service_components, uptime_samples + 9 drifted columns) and added
TestRunMigrationsMirrorsEveryMigrationTable — a guard that enumerates every
CREATE TABLE in the migration files and fails if one is unmirrored.
deploy.yml now also applies the real migration files before testing, so CI
runs the same schema developers do (kills column drift too).
Genuine product/test bugs the lax mirror had been masking:
- rate_limit.go / idempotency.go: a nil *redis.Client SIGSEGV'd the whole
API on the first request. Both now fail open (CLAUDE.md convention #1).
- cache/redis.go provisionLocal/StorageBytes: nil Redis client now returns
a clean error → handler 503, never a panic (convention #2).
- models.AcceptInvitation: did not clear is_primary when moving an invitee
into a team, violating uq_users_one_primary_per_team → 500. Now cleared.
- resource_elevate_test.go inserted stack/deployment rows missing the real
NOT NULL columns (namespace, app_id); now valid.
- Fixed 5 handler tests with stale tier/seat/mock assumptions and dropped
the stale -skip list (TestOpenAPI/TestCrossTeam/TestCustomDomainCreate
all pass now).
go test ./... -short: 25 packages, 0 FAIL, 0 panic. build + vet clean.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
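The idea behind TestRunMigrationsMirrorsEveryMigrationTable can be illustrated with a small sketch: collect every CREATE TABLE name from the migration SQL and report those absent from the mirror. The helper names, the regular expression, and the in-memory inputs are assumptions for illustration; the real guard reads the migration files and may parse them differently (this regex, for example, does not handle schema-qualified names).

```go
package main

import (
	"fmt"
	"regexp"
	"sort"
	"strings"
)

// createTableRE captures the table name from CREATE TABLE statements,
// with or without IF NOT EXISTS and optional double quotes.
var createTableRE = regexp.MustCompile(
	`(?i)CREATE\s+TABLE\s+(?:IF\s+NOT\s+EXISTS\s+)?"?([a-z_][a-z0-9_]*)"?`)

// tablesIn lists the lower-cased table names created by a SQL script.
func tablesIn(sql string) []string {
	var names []string
	for _, m := range createTableRE.FindAllStringSubmatch(sql, -1) {
		names = append(names, strings.ToLower(m[1]))
	}
	return names
}

// unmirrored returns, sorted, every table created by the migrations that
// the hand-written mirror schema does not create.
func unmirrored(migrations []string, mirror string) []string {
	have := map[string]bool{}
	for _, t := range tablesIn(mirror) {
		have[t] = true
	}
	seen := map[string]bool{}
	var missing []string
	for _, sql := range migrations {
		for _, t := range tablesIn(sql) {
			if !have[t] && !seen[t] {
				seen[t] = true
				missing = append(missing, t)
			}
		}
	}
	sort.Strings(missing)
	return missing
}

func main() {
	migrations := []string{
		"CREATE TABLE api_keys (id uuid);",
		"CREATE TABLE IF NOT EXISTS magic_links (id uuid);",
	}
	mirror := "CREATE TABLE api_keys (id uuid);"
	fmt.Println(unmirrored(migrations, mirror))
}
```

A test built on this would fail whenever `unmirrored` returns a non-empty slice.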
mastermanas805
added a commit
that referenced
this pull request
May 17, 2026
…e_id
Token-truncation class (P1, BUGHUNT-REPORT-2026-05-17-round2 recurring pattern #1).
The storage backend derived the object-key prefix as token[:8]. In shared-key mode tenant isolation is by prefix CONVENTION only (every customer holds the same master key), so two tokens sharing 8 hex chars shared an object namespace — tenant B could read and overwrite tenant A's objects. A tenant-isolation security boundary must not depend on an 8-char collision not happening. minio-admin mode had the same flaw on the IAM user name (key_<token[:8]>).
New prefixident.go canonical-identifier package: objectPrefixForToken uses the FULL token; the provider returns the slash-free prefix as Credentials.ProviderResourceID; the storage handler persists it via models.UpdateProviderResourceID on both the anonymous and authenticated provision paths. Deprovision resolves the IAM user/policy names from the stored value via resolveObjectPrefix and ALSO probes the legacy token[:8] form so pre-fix IAM users are still cleaned up. minio access-key / policy names derive from the (full-token) prefix via minioAccessKeyID / minioPolicyName.
NO object migration: existing objects stay under their old token[:8] prefix; legacy rows (empty provider_resource_id) keep reading them via the worker scanner's identical legacy fallback. New rows get the full-token prefix.
Coverage tests in prefixident_test.go assert the stored-PRID path is used verbatim, the legacy token[:8] fallback still resolves, and two 8-char-prefix-sharing tokens no longer collide.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
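The collision and the fallback described above can be shown with a short sketch. The function names here (`legacyPrefix`, `fullTokenPrefix`, `resolvePrefix`) and the sample tokens are hypothetical; they only mirror the behaviour the commit message describes and are not the contents of prefixident.go.

```go
package main

import "fmt"

// legacyPrefix mirrors the pre-fix derivation: only the first 8 characters
// of the token name the object namespace.
func legacyPrefix(token string) string {
	if len(token) < 8 {
		return token
	}
	return token[:8]
}

// fullTokenPrefix mirrors the post-fix idea: the whole token is the prefix,
// so two distinct tokens can never share a namespace.
func fullTokenPrefix(token string) string {
	return token
}

// resolvePrefix prefers the stored provider resource ID and falls back to
// the legacy form for rows created before the fix (empty stored value).
func resolvePrefix(storedPRID, token string) string {
	if storedPRID != "" {
		return storedPRID
	}
	return legacyPrefix(token)
}

func main() {
	a := "deadbeef1111aaaa"
	b := "deadbeef2222bbbb"

	fmt.Println("legacy collision:", legacyPrefix(a) == legacyPrefix(b))
	fmt.Println("full-token collision:", fullTokenPrefix(a) == fullTokenPrefix(b))
	fmt.Println("legacy row resolves to:", resolvePrefix("", a))
}
```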
mastermanas805
added a commit
that referenced
this pull request
May 17, 2026
…ning
Round-3 P2 remediation across deploy/stack/webhook/auth surfaces.
1. deploy.go Redeploy now rejects a deployment in a terminal status (expired/deleted/stopped) with 409 + error code `deployment_not_redeployable` — redeploying one would resurrect an over-TTL/over-cap workload. New models.IsDeploymentTerminal + DeployStatusStopped const.
2. stack.go Redeploy re-runs the per-tier deployments_apps cap check when the stack is NOT in an active (slot-occupying) status — a failed/stopped stack flipping back to `building` could take a team to cap+1. New models.IsStackActive.
3. stack.go empty-env vault fallback changed from "production" to models.EnvDefault (development) at both new + redeploy sites — convention #11: a no-env legacy stack must not silently read production secrets.
4. deploy_teardown_reconciler.go increments a new metrics.DeployTeardownMarkFailed counter when MarkDeploymentTornDown fails — a persistently stuck row is now alertable in NR, not a silent log.
5. auth.go findOrCreateUserGitHub now matches an existing account by email (GetUserByEmail) and links github_id via new models.LinkGitHubID instead of forking a new team/user — mirrors findOrCreateUserGoogle and rejects takeover of an account already linked to a different GitHub ID. The /user/emails fallback now filters on Verified && Primary.
6. (already correct) models.CreateUser already routes email through NormalizeEmail at the write boundary — every OAuth/magic-link/claim call site is covered. No change needed; verified.
7. webhook.go receive_url is now built from a fixed server-controlled base (new webhookReceiveBaseURL: API_PUBLIC_URL / compiled-in base; c.BaseURL() only as a non-production dev fallback) instead of the client-controllable Host header. The URL is encrypted + persisted, so a client-settable host was a persistence-poisoning vector.
8. webhook.go Receive + ListRequests reject any non-webhook resource token with 404 — GetResourceByToken selects by token only, so a postgres/redis token previously passed.
9. auth.go GoogleAuthURL drops the impossible url.Parse-error 500 branch (the argument is a compile-time constant) — matches GoogleStart.
Regression tests: models/redeploy_guard_test.go (IsDeploymentTerminal, IsStackActive), models/link_github_id_test.go (LinkGitHubID), handlers/webhook_receive_base_url_test.go (#7), handlers/p2_roundup_test.go (#1 error code, #4 metric, #9 GoogleAuthURL), and a wrong-resource-type case appended to handlers/webhook_test.go (#8).
go build ./... and go vet ./... pass. New no-DB regression tests pass; the DB/Redis-backed suites require a test Postgres/Redis (unavailable in this environment).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
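The terminal-status guard from item 1 might look roughly like this. It is a sketch only: `isDeploymentTerminal` and `redeployGuard` are hypothetical names standing in for models.IsDeploymentTerminal and the handler check, and returning 0 for "no rejection" is an illustrative convention, not the real handler's control flow. The three statuses, the 409, and the error code come from the commit message.

```go
package main

import (
	"fmt"
	"net/http"
)

// Terminal statuses named in the commit message.
const (
	statusExpired = "expired"
	statusDeleted = "deleted"
	statusStopped = "stopped"
)

// isDeploymentTerminal reports whether a deployment can no longer be
// brought back by a redeploy.
func isDeploymentTerminal(status string) bool {
	switch status {
	case statusExpired, statusDeleted, statusStopped:
		return true
	}
	return false
}

// redeployGuard returns the HTTP status and error code used to reject a
// redeploy, or (0, "") when the redeploy may proceed.
func redeployGuard(status string) (httpStatus int, errCode string) {
	if isDeploymentTerminal(status) {
		return http.StatusConflict, "deployment_not_redeployable"
	}
	return 0, ""
}

func main() {
	for _, s := range []string{"healthy", "expired", "stopped"} {
		code, errCode := redeployGuard(s)
		fmt.Printf("%-8s -> %d %q\n", s, code, errCode)
	}
}
```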
mastermanas805
added a commit
that referenced
this pull request
May 20, 2026
…/B18
Closes a batch of P2/P3 envelope-contract and ordering issues identified by
the B9 (provisioning), B10 (auth/ratelimit/quota), and B18 (input fuzz)
bug-bash reports. All fixes carry inline CLAUDE.md rule-17 coverage notes.
Fixes:
1. **B18 M4** — `POST /storage/:token/presign` validated the body before
checking token existence: pre-fix, a random UUID returned
`invalid_operation` (400) before the existence check fired. Reordered to
token parse → resource lookup → body-shape validation, which closes an
information-flow risk if validators ever loosen.
File: internal/handlers/storage_presign.go
2. **B18 M2** — Remove silent 120-byte truncation in sanitizeName. The
authoritative length bound was already requireName's 64-rune gate; the
second silent cap created a latent footgun if the name regex ever
loosens to allow multi-byte runes. Updated regression test for the
single-gate contract.
Files: internal/handlers/provision_helper.go, provision_helper_test.go
3. **B18 M3** — Document the intentional UUID-shape-before-auth ordering on
`GET /api/v1/webhooks/:token/requests`. The webhook token is a
public-by-design capability (lands in HTTP headers/logs/outbound URLs);
"well-formed-but-unknown" is not an oracle leak. Doc-only comment so
future refactors preserve the intent.
File: internal/handlers/webhook.go
4. **B18 L1** — Surface `X-Instant-Notice: name_normalized` header when
sanitizeName mutates the request name (CRLF / tab / NUL / HTML-special
chars stripped). Pre-fix the mutation was silent — agents looking up
"db_for_user\n" later by exact name would never find the persisted
"db_for_user". Header-only signal; does NOT fail the request (the
strip is a deliberate hardening on top of the regex).
File: internal/handlers/provision_helper.go
5. **B18 L2** — `parseProvisionBody` returns 415 `unsupported_media_type`
when the request carries an explicit non-JSON Content-Type
(application/xml etc.). Pre-fix, sending XML with `Content-Type:
application/xml` returned 400 `name_required` — a misleading code that
cost the caller one extra debugging cycle. The OpenAPI spec only
declares `application/json`; 415 is the RFC-correct status.
File: internal/handlers/provision_helper.go
6. **B10 P2-3** — Razorpay webhook invalid-signature envelope hydrated with
the canonical ErrorResponse shape. Pre-fix, signature failures returned
`{ok:false,error:"invalid_signature"}` with no request_id, message,
retry_after_seconds, or agent_action. Razorpay support always asks for
the request_id when a webhook fails. Same hydration applied to the
invalid_payload path.
File: internal/handlers/billing.go
7. **B10 P2-4** — Add `WWW-Authenticate: Bearer realm="instanode"` to every
401 from respondUnauthorized. RFC 6750 §3 requires this header on every
401 from a Bearer-protected resource. Pre-fix only the audience-mismatch
path emitted it. OAuth-aware clients and HTTP debugging tools look for
it.
File: internal/middleware/auth.go
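The Content-Type gate from fix 5 can be sketched with the standard library. The function name `contentTypeStatus` is hypothetical, and two behaviours here are assumptions rather than facts from the commit: a missing header is treated as acceptable, and a malformed header is treated as unsupported.

```go
package main

import (
	"fmt"
	"mime"
	"net/http"
)

// contentTypeStatus returns http.StatusUnsupportedMediaType (415) when the
// request declares an explicit non-JSON Content-Type, and 0 when the body
// may be parsed as JSON.
func contentTypeStatus(header string) int {
	if header == "" {
		return 0 // assumption: no explicit type means "try JSON"
	}
	mediaType, _, err := mime.ParseMediaType(header)
	if err != nil || mediaType != "application/json" {
		return http.StatusUnsupportedMediaType
	}
	return 0
}

func main() {
	for _, h := range []string{"", "application/json; charset=utf-8", "application/xml"} {
		fmt.Printf("%q -> %d\n", h, contentTypeStatus(h))
	}
}
```

Parsing with `mime.ParseMediaType` rather than comparing raw strings keeps parameters such as `charset=utf-8` from tripping the check.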
Gate (matches CI/deploy.yml):
- `go build ./...` — green
- `go vet ./...` — green
- `go test ./... -short -count=1 -p 1` — green on every modified package;
pre-existing failures (12 in handlers + 2 in models + 3 B13 contract
tests) verified unchanged-against-master by stashing the patch and
re-running the same suite. All pre-existing flakes documented in
CLAUDE.md "Known Design Gaps".
Skipped (already shipped today, per brief):
- AESKeyring (a3155a5), B5/B11/B13/B7 (0c7991c), presign middleware (PR
#122, not yet on master), 768c0ca's 8 fixes.
Coverage:
- B19 finding #1 (presign middleware) — already shipped in PR #122; this
PR does not duplicate.
- B19 finding #4 (lease-recovery RTO) — worker-side, tracked as task #245.
- B9 P3-F8 (X-RateLimit-Remaining: 0 on success) — investigated; the math
in rate_limit.go is correct (limit-count). The reported 0-on-success is
not reproducible from the code path; left for in-prod re-probe after
this lands.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
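For fix 7 in the list above, a minimal illustration of attaching the challenge header to a 401 is shown here using net/http as a stand-in; the project's actual web framework and the body of respondUnauthorized are not shown in this PR, so `respondUnauthorizedSketch` and `unauthorizedResponse` are hypothetical names. The header value is the one quoted in the commit message.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

// respondUnauthorizedSketch sets the Bearer challenge before writing the
// 401 status, so every unauthorized response carries WWW-Authenticate.
func respondUnauthorizedSketch(w http.ResponseWriter) {
	w.Header().Set("WWW-Authenticate", `Bearer realm="instanode"`)
	w.WriteHeader(http.StatusUnauthorized)
}

// unauthorizedResponse runs the sketch against an in-memory recorder and
// returns the status code and challenge header it produced.
func unauthorizedResponse() (int, string) {
	rec := httptest.NewRecorder()
	respondUnauthorizedSketch(rec)
	return rec.Code, rec.Header().Get("WWW-Authenticate")
}

func main() {
	code, challenge := unauthorizedResponse()
	fmt.Println(code, challenge)
}
```

Headers must be set before `WriteHeader`; setting them afterwards has no effect on the response.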
mastermanas805
added a commit
that referenced
this pull request
May 21, 2026
…/B18
Closes a batch of P2/P3 envelope-contract and ordering issues identified by
the B9 (provisioning), B10 (auth/ratelimit/quota), and B18 (input fuzz)
bug-bash reports. All fixes carry inline CLAUDE.md rule-17 coverage notes.
Fixes:
1. **B18 M4** — `POST /storage/:token/presign` validated the body before
checking token existence: pre-fix, a random UUID returned
`invalid_operation` (400) before the existence check fired. Reordered to
token parse → resource lookup → body-shape validation, which closes an
information-flow risk if validators ever loosen.
File: internal/handlers/storage_presign.go
2. **B18 M2** — Remove silent 120-byte truncation in sanitizeName. The
authoritative length bound was already requireName's 64-rune gate; the
second silent cap created a latent footgun if the name regex ever
loosens to allow multi-byte runes. Updated regression test for the
single-gate contract.
Files: internal/handlers/provision_helper.go, provision_helper_test.go
3. **B18 M3** — Document the intentional UUID-shape-before-auth ordering on
`GET /api/v1/webhooks/:token/requests`. The webhook token is a
public-by-design capability (lands in HTTP headers/logs/outbound URLs);
"well-formed-but-unknown" is not an oracle leak. Doc-only comment so
future refactors preserve the intent.
File: internal/handlers/webhook.go
4. **B18 L1** — Surface `X-Instant-Notice: name_normalized` header when
sanitizeName mutates the request name (CRLF / tab / NUL / HTML-special
chars stripped). Pre-fix the mutation was silent — agents looking up
"db_for_user\n" later by exact name would never find the persisted
"db_for_user". Header-only signal; does NOT fail the request (the
strip is a deliberate hardening on top of the regex).
File: internal/handlers/provision_helper.go
5. **B18 L2** — `parseProvisionBody` returns 415 `unsupported_media_type`
when the request carries an explicit non-JSON Content-Type
(application/xml etc.). Pre-fix, sending XML with `Content-Type:
application/xml` returned 400 `name_required` — a misleading code that
cost the caller one extra debugging cycle. The OpenAPI spec only
declares `application/json`; 415 is the RFC-correct status.
File: internal/handlers/provision_helper.go
6. **B10 P2-3** — Razorpay webhook invalid-signature envelope hydrated with
the canonical ErrorResponse shape. Pre-fix, signature failures returned
`{ok:false,error:"invalid_signature"}` with no request_id, message,
retry_after_seconds, or agent_action. Razorpay support always asks for
the request_id when a webhook fails. Same hydration applied to the
invalid_payload path.
File: internal/handlers/billing.go
7. **B10 P2-4** — Add `WWW-Authenticate: Bearer realm="instanode"` to every
401 from respondUnauthorized. RFC 6750 §3 requires this header on every
401 from a Bearer-protected resource. Pre-fix only the audience-mismatch
path emitted it. OAuth-aware clients and HTTP debugging tools look for
it.
File: internal/middleware/auth.go
Gate (matches CI/deploy.yml):
- `go build ./...` — green
- `go vet ./...` — green
- `go test ./... -short -count=1 -p 1` — green on every modified package;
pre-existing failures (12 in handlers + 2 in models + 3 B13 contract
tests) verified unchanged-against-master by stashing the patch and
re-running the same suite. All pre-existing flakes documented in
CLAUDE.md "Known Design Gaps".
Skipped (already shipped today, per brief):
- AESKeyring (ed55c41), B5/B11/B13/B7 (ed14581), presign middleware (PR
#122, not yet on master), f1ba49b's 8 fixes.
Coverage:
- B19 finding #1 (presign middleware) — already shipped in PR #122; this
PR does not duplicate.
- B19 finding #4 (lease-recovery RTO) — worker-side, tracked as task #245.
- B9 P3-F8 (X-RateLimit-Remaining: 0 on success) — investigated; the math
in rate_limit.go is correct (limit-count). The reported 0-on-success is
not reproducible from the code path; left for in-prod re-probe after
this lands.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
Catch-up commit. `api/master` had drifted from the code actually running in production. This bundles the divergence so subsequent PRs (deploy compute correctness, hostname injection, build context delivery, agent diagnostics, etc) can branch off a master that reflects reality.
Themes
Test plan
What this unblocks
Friction PR 1–5 from the deploy friction post-mortem all touch this code and now have a clean baseline to branch from.
🤖 Generated with Claude Code