chore: sync prod state to master — vault, custom domains, magic links, kaniko, dpop, rbac #1

Merged
mastermanas805 merged 1 commit into master from chore/sync-prod-state-to-master
May 11, 2026
@mastermanas805

Summary

Catch-up commit. `api/master` had drifted from the code actually running in production. This PR bundles the divergence so that subsequent PRs (deploy compute correctness, hostname injection, build context delivery, agent diagnostics, etc.) can branch off a master that reflects reality.

Themes

  • Migrations 008–014: vault, env column, team invitations, API keys, audit log, magic links, custom domains
  • New handlers: vault.go, vault_resolve.go, api_keys.go, audit.go, custom_domain.go, magic_link.go, teams.go, wellknown.go, team_members.go
  • New middleware: api_key, dpop (RFC 9449), quota, rbac, role_lookup, audience+cnf checks on auth
  • New models: vault, api_key, audit_log, custom_domain, magic_link, team_invitations + deployment/resource env column
  • Compute provider rewrite: kaniko in-cluster build (replaces Rancher-Desktop docker fallback); custom-domain Ingress + cert-manager reconcile; multi-service stack
  • Plans: razorpay subscription CRUD; vault + deployment limits in plans.yaml
  • Other: config, main router wiring, magic-link email template

Test plan

  • `go build ./...` passes
  • All code is already running in prod — this is a sync, not net-new behaviour
  • Per-theme tests will be added/strengthened in subsequent friction PRs

What this unblocks

Friction PR 1–5 from the deploy friction post-mortem all touch this code and now have a clean baseline to branch from.

🤖 Generated with Claude Code

…, kaniko, dpop, rbac

This is a catch-up commit: the master branch had drifted significantly
from the code actually running in production. Bundling the divergence
here so subsequent PRs (deploy compute correctness, hostname injection,
build context delivery, etc.) can branch off a master that reflects
reality. Each theme below is internally coherent; reviewers may want
to read by section.

Migrations (new, applied in prod):
- 008_vault.sql               — vault_secrets + vault_audit_log
- 009_env_column.sql          — env scope column on resources/deployments
- 010_team_invitations.sql    — pending-invite rows
- 011_api_keys.sql            — programmatic agent tokens
- 012_audit_log.sql           — generic audit log
- 013_magic_links.sql         — email magic-link auth
- 014_custom_domains.sql      — *.custom.tld pointing at deploys

New handlers:
- vault.go, vault_resolve.go  — encrypted env-var store + vault://KEY ref resolution
- api_keys.go                 — programmatic token issuance
- audit.go                    — read audit log
- custom_domain.go            — custom domain CRUD
- magic_link.go               — email magic-link issue + redeem
- teams.go                    — team-member CRUD
- wellknown.go                — RFC 8615 /.well-known/oauth-protected-resource
- team_members.go             — invite + role management
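
The `vault://KEY` reference resolution mentioned for vault_resolve.go can be pictured roughly as follows. This is a minimal sketch, not the code in this PR: the function name `resolveVaultRefs`, the plain-map secret store, and the choice to fail on a missing key are all assumptions made for illustration.

```go
package main

import (
	"fmt"
	"strings"
)

// vaultScheme is the reference prefix named in the commit message.
const vaultScheme = "vault://"

// resolveVaultRefs returns a copy of env in which every value of the form
// vault://KEY is replaced by the secret stored under KEY. Values without the
// prefix pass through unchanged. A reference to a missing key is an error, so
// a deploy never starts with an unresolved placeholder.
// Hypothetical helper: the real vault_resolve.go is not shown in this PR.
func resolveVaultRefs(env, secrets map[string]string) (map[string]string, error) {
	out := make(map[string]string, len(env))
	for name, value := range env {
		key, isRef := strings.CutPrefix(value, vaultScheme)
		if !isRef {
			out[name] = value
			continue
		}
		secret, ok := secrets[key]
		if !ok {
			return nil, fmt.Errorf("env %s: vault key %q not found", name, key)
		}
		out[name] = secret
	}
	return out, nil
}

func main() {
	secrets := map[string]string{"DB_PASSWORD": "s3cret"}
	env := map[string]string{"PORT": "8080", "PGPASSWORD": "vault://DB_PASSWORD"}
	resolved, err := resolveVaultRefs(env, secrets)
	fmt.Println(resolved["PGPASSWORD"], resolved["PORT"], err)
}
```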

New middleware:
- api_key.go                  — bearer-API-key extraction
- dpop.go                     — RFC 9449 DPoP proof binding (cnf / jkt claims)
- quota.go                    — per-team rate limiting
- rbac.go + role_lookup.go    — role-based access for team-scoped endpoints
- auth.go                     — audience (aud) + cnf validation added
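
For readers unfamiliar with the `cnf` / `jkt` binding that dpop.go and auth.go enforce, the core check is that the RFC 7638 thumbprint of the key in the DPoP proof equals the `jkt` value carried in the access token. The sketch below covers only that comparison for an EC key; signature verification, `htm`/`htu`/`iat` checks, and replay protection are omitted, and the function names are hypothetical rather than taken from this repository.

```go
package main

import (
	"crypto/sha256"
	"crypto/subtle"
	"encoding/base64"
	"fmt"
)

// ecThumbprint computes the RFC 7638 SHA-256 thumbprint of an EC public JWK.
// The required members (crv, kty, x, y) are serialized in lexicographic order
// with no whitespace, hashed, and base64url-encoded without padding.
// x and y are the base64url coordinate strings exactly as they appear in the
// JWK; for that alphabet Go's %q quoting matches JSON string quoting.
func ecThumbprint(crv, x, y string) string {
	canonical := fmt.Sprintf(`{"crv":%q,"kty":"EC","x":%q,"y":%q}`, crv, x, y)
	sum := sha256.Sum256([]byte(canonical))
	return base64.RawURLEncoding.EncodeToString(sum[:])
}

// proofMatchesToken reports whether the key in the DPoP proof is the key the
// access token is bound to (the token's cnf.jkt claim).
func proofMatchesToken(tokenJKT, crv, x, y string) bool {
	got := ecThumbprint(crv, x, y)
	return subtle.ConstantTimeCompare([]byte(got), []byte(tokenJKT)) == 1
}

func main() {
	// Placeholder coordinates for illustration, not a real key.
	x, y := "AAAA", "BBBB"
	jkt := ecThumbprint("P-256", x, y)
	fmt.Println(proofMatchesToken(jkt, "P-256", x, y)) // same key
	fmt.Println(proofMatchesToken(jkt, "P-256", y, x)) // different key
}
```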

New models:
- vault.go, api_key.go, audit_log.go, custom_domain.go,
  magic_link.go, team_invitations.go
- deployment.go + resource.go gain env column + helper methods

Compute provider rewrite (internal/providers/compute/k8s/):
- client.go                   — kaniko in-cluster build (replaces the
  Rancher-Desktop "docker build" fallback). Builds via kaniko Job with
  build-context Secret; pushes to ghcr.io/.../instant-userapp/<id>.
- custom_domain.go            — Ingress + cert-manager Certificate
  reconcile for *.custom.tld
- stack.go                    — multi-service stack provisioning

Plans:
- razorpay.go                 — subscription create/cancel/update via
  Razorpay API (matches the live billing handler)

Other:
- config.go                   — new env vars for vault key, magic-link
  expiry, custom-domain wildcard cert, kaniko image, etc.
- main.go                     — wire all new routes + middleware
- plans.yaml                  — vault + deployment limits added
- email.go                    — magic-link template + transactional send

Verified live in prod; committing it to git was deferred too long. Subsequent PRs
will fix specific friction points discovered while testing.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@mastermanas805 mastermanas805 merged commit e239e60 into master May 11, 2026
@mastermanas805 mastermanas805 deleted the chore/sync-prod-state-to-master branch May 11, 2026 08:03
mastermanas805 added a commit that referenced this pull request May 11, 2026
Friction #1 from the agent-friction memo turned out not to be a missing
feature — POST /stacks/new is fully implemented (router.go:178,
handler in stack.go, K8sStackProvider for the build/deploy pipeline,
manifest parser with service:// URL resolution). The friction was that
none of it appeared in openapi.json, so an agent reading the spec had
no way to discover the multi-service deploy primitive.

This commit adds OpenAPI entries for:

  POST   /stacks/new           — manifest + per-service tarballs
  GET    /stacks/{slug}        — per-service status + URLs
  DELETE /stacks/{slug}        — teardown
  POST   /stacks/{slug}/redeploy

Plus the StackRequest and StackResponse schemas covering the manifest
shape (services keyed by name, each with build/port/expose/needs/env)
and the response shape (overall status + services array with per-
service status and the exposed url).

Tests:
- TestOpenAPI_StacksEndpointsDocumented guards the paths + schema.
- TestOpenAPISpecParses catches any string-quoting regression.

Live verification before this PR:
  Deployed a 2-service stack (api + web) anonymously to prod:
    $ curl -F 'manifest=<instant.yaml' -F api=@api.tar.gz -F web=@web.tar.gz \
        https://api.instanode.dev/stacks/new
    → 202, stk-72ca87d4
  Polled GET /stacks/stk-72ca87d4 until status=healthy
  Hit https://web-stk-72ca87d4.deployment.instanode.dev/
    → "<h1>Stack OK</h1>..."
  Hit /api on the web service → cross-service routing worked:
    {"ok":true,"via_web":true,"api_says":{"ok":true,"from":"api"}}
  Confirmed web pod env: API_URL=http://api:8080 (service:// resolved)
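
The `service://` resolution observed above (`API_URL=http://api:8080`) could be sketched as follows. The `service` struct is a cut-down, hypothetical view of a manifest entry, and the rule "rewrite `service://NAME` to `http://NAME:PORT` using the target's declared port" is inferred from the verification output, not read from the manifest parser itself.

```go
package main

import (
	"fmt"
	"strings"
)

// service is a hypothetical, reduced view of one manifest entry.
type service struct {
	Port int
	Env  map[string]string
}

// resolveServiceRefs rewrites env values of the form service://NAME into
// http://NAME:PORT, using the port the named service declares. A reference
// to a service that is not in the manifest is an error.
func resolveServiceRefs(services map[string]service, env map[string]string) (map[string]string, error) {
	out := make(map[string]string, len(env))
	for key, value := range env {
		name, isRef := strings.CutPrefix(value, "service://")
		if !isRef {
			out[key] = value
			continue
		}
		target, ok := services[name]
		if !ok {
			return nil, fmt.Errorf("env %s: unknown service %q", key, name)
		}
		out[key] = fmt.Sprintf("http://%s:%d", name, target.Port)
	}
	return out, nil
}

func main() {
	services := map[string]service{
		"api": {Port: 8080},
		"web": {Port: 3000, Env: map[string]string{"API_URL": "service://api"}},
	}
	env, err := resolveServiceRefs(services, services["web"].Env)
	fmt.Println(env["API_URL"], err)
}
```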

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
mastermanas805 added a commit that referenced this pull request May 16, 2026
…tract bugs

The api Deploy workflow had been red on every run. Root causes, all fixed:

Schema-mirror drift (the bulk of the ~25 failures)
  testhelpers.runMigrations hand-mirrors the prod schema; CI ran against a
  bare Postgres so any migration that added a table/column without a mirror
  edit silently broke the gate. Completed the mirror (email_events,
  pending_deletions, deployment_events, api_keys, magic_links, custom_domains,
  service_components, uptime_samples + 9 drifted columns) and added
  TestRunMigrationsMirrorsEveryMigrationTable — a guard that enumerates every
  CREATE TABLE in the migration files and fails if one is unmirrored.
  deploy.yml now also applies the real migration files before testing, so CI
  runs the same schema developers do (kills column drift too).
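
A guard in the spirit of TestRunMigrationsMirrorsEveryMigrationTable can be sketched as a set difference between the tables created by the migrations and those created by the mirror. The regular expression and helper names below are assumptions; the real test reads the migration files from disk, and this simplified pattern does not handle schema-qualified names.

```go
package main

import (
	"fmt"
	"regexp"
	"sort"
	"strings"
)

// createTableRe matches CREATE TABLE [IF NOT EXISTS] name, case-insensitively.
// Simplification: schema-qualified names (public.foo) are not handled.
var createTableRe = regexp.MustCompile(`(?i)create\s+table\s+(?:if\s+not\s+exists\s+)?"?([a-z_][a-z0-9_]*)"?`)

// tablesIn returns the set of table names created by the given SQL text.
func tablesIn(sql string) map[string]bool {
	tables := map[string]bool{}
	for _, m := range createTableRe.FindAllStringSubmatch(sql, -1) {
		tables[strings.ToLower(m[1])] = true
	}
	return tables
}

// unmirrored lists, in sorted order, the tables that the migrations create
// but the hand-written mirror does not.
func unmirrored(migrations, mirror string) []string {
	have := tablesIn(mirror)
	var missing []string
	for table := range tablesIn(migrations) {
		if !have[table] {
			missing = append(missing, table)
		}
	}
	sort.Strings(missing)
	return missing
}

func main() {
	migrations := "CREATE TABLE IF NOT EXISTS api_keys (id uuid);\ncreate table magic_links (id uuid);"
	mirror := "CREATE TABLE api_keys (id uuid);"
	fmt.Println(unmirrored(migrations, mirror)) // a real test would fail if this is non-empty
}
```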

Genuine product/test bugs the lax mirror had been masking:
  - rate_limit.go / idempotency.go: a nil *redis.Client SIGSEGV'd the whole
    API on the first request. Both now fail open (CLAUDE.md convention #1).
  - cache/redis.go provisionLocal/StorageBytes: nil Redis client now returns
    a clean error → handler 503, never a panic (convention #2).
  - models.AcceptInvitation: did not clear is_primary when moving an invitee
    into a team, violating uq_users_one_primary_per_team → 500. Now cleared.
  - resource_elevate_test.go inserted stack/deployment rows missing the real
    NOT NULL columns (namespace, app_id); now valid.
  - Fixed 5 handler tests with stale tier/seat/mock assumptions and dropped
    the stale -skip list (TestOpenAPI/TestCrossTeam/TestCustomDomainCreate
    all pass now).
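
The fail-open behaviour described for rate_limit.go and idempotency.go amounts to checking for a nil client before touching it and letting the request through when it is absent. The sketch below uses `net/http` and an in-memory `store` type as a hypothetical stand-in for `*redis.Client`; the actual handlers and their framework are not shown in this commit.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

// store is a hypothetical in-memory stand-in for the Redis client.
type store struct{ counts map[string]int }

// incr bumps and returns the request count for key.
func (s *store) incr(key string) int {
	s.counts[key]++
	return s.counts[key]
}

// rateLimit rejects requests over limit with 429. It fails open: with no
// backing store it skips limiting instead of dereferencing a nil pointer.
func rateLimit(s *store, limit int, next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if s != nil && s.incr(r.RemoteAddr) > limit {
			http.Error(w, "rate_limited", http.StatusTooManyRequests)
			return
		}
		next.ServeHTTP(w, r)
	})
}

// lastStatus sends n requests through the limiter and returns the status
// code of the final one.
func lastStatus(s *store, limit, n int) int {
	ok := http.HandlerFunc(func(w http.ResponseWriter, _ *http.Request) {
		w.WriteHeader(http.StatusOK)
	})
	h := rateLimit(s, limit, ok)
	code := 0
	for i := 0; i < n; i++ {
		rec := httptest.NewRecorder()
		h.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, "/", nil))
		code = rec.Code
	}
	return code
}

func main() {
	// Nil store: every request passes.
	fmt.Println(lastStatus(nil, 1, 3))
	// Real store with limit 1: the second request is over the limit.
	fmt.Println(lastStatus(&store{counts: map[string]int{}}, 1, 2))
}
```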

go test ./... -short: 25 packages, 0 FAIL, 0 panic. build + vet clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
mastermanas805 added a commit that referenced this pull request May 17, 2026
…e_id

Token-truncation class (P1, BUGHUNT-REPORT-2026-05-17-round2 recurring
pattern #1). The storage backend derived the object-key prefix as
token[:8]. In shared-key mode tenant isolation is by prefix CONVENTION
only (every customer holds the same master key), so two tokens sharing 8
hex chars shared an object namespace — tenant B could read and overwrite
tenant A's objects. A tenant-isolation security boundary must not depend
on an 8-char collision not happening. minio-admin mode had the same flaw
on the IAM user name (key_<token[:8]>).

New prefixident.go canonical-identifier package: objectPrefixForToken
uses the FULL token; the provider returns the slash-free prefix as
Credentials.ProviderResourceID; the storage handler persists it via
models.UpdateProviderResourceID on both the anonymous and authenticated
provision paths. Deprovision resolves the IAM user/policy names from the
stored value via resolveObjectPrefix and ALSO probes the legacy
token[:8] form so pre-fix IAM users are still cleaned up. minio
access-key / policy names derive from the (full-token) prefix via
minioAccessKeyID / minioPolicyName.

NO object migration: existing objects stay under their old token[:8]
prefix; legacy rows (empty provider_resource_id) keep reading them via
the worker scanner's identical legacy fallback. New rows get the
full-token prefix.

Coverage tests in prefixident_test.go assert the stored-PRID path is used
verbatim, the legacy token[:8] fallback still resolves, and two
8-char-prefix-sharing tokens no longer collide.
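
The collision and the fallback described above can be illustrated with a short sketch. The function names `objectPrefixForToken` and `resolveObjectPrefix` come from the commit message, but the bodies here are assumptions: in particular, the full-token prefix is taken to be the token verbatim, and `legacyPrefix` is a hypothetical helper reproducing the old `token[:8]` derivation.

```go
package main

import "fmt"

// legacyPrefixLen is the truncation length used before the fix.
const legacyPrefixLen = 8

// objectPrefixForToken returns the namespace prefix for new rows. Assumed
// here to be the full token, so two distinct tokens never share a prefix.
func objectPrefixForToken(token string) string {
	return token
}

// legacyPrefix reproduces the pre-fix derivation (token[:8]), guarding
// against tokens shorter than eight characters.
func legacyPrefix(token string) string {
	if len(token) < legacyPrefixLen {
		return token
	}
	return token[:legacyPrefixLen]
}

// resolveObjectPrefix prefers the stored provider_resource_id and falls back
// to the legacy form for rows created before the fix (empty stored value).
func resolveObjectPrefix(storedPRID, token string) string {
	if storedPRID != "" {
		return storedPRID
	}
	return legacyPrefix(token)
}

func main() {
	a := "deadbeef-0000-4000-8000-000000000001"
	b := "deadbeef-ffff-4000-8000-000000000002"
	fmt.Println(legacyPrefix(a) == legacyPrefix(b))                 // legacy form collides
	fmt.Println(objectPrefixForToken(a) == objectPrefixForToken(b)) // full token does not
	fmt.Println(resolveObjectPrefix("", a))                         // legacy row
	fmt.Println(resolveObjectPrefix(objectPrefixForToken(a), a))    // new row
}
```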

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
mastermanas805 added a commit that referenced this pull request May 17, 2026
…ning

Round-3 P2 remediation across deploy/stack/webhook/auth surfaces.

1. deploy.go Redeploy now rejects a deployment in a terminal status
   (expired/deleted/stopped) with 409 + error code
   `deployment_not_redeployable` — redeploying one would resurrect an
   over-TTL/over-cap workload. New models.IsDeploymentTerminal +
   DeployStatusStopped const.

2. stack.go Redeploy re-runs the per-tier deployments_apps cap check when
   the stack is NOT in an active (slot-occupying) status — a failed/stopped
   stack flipping back to `building` could take a team to cap+1. New
   models.IsStackActive.

3. stack.go empty-env vault fallback changed from "production" to
   models.EnvDefault (development) at both new + redeploy sites — convention
   #11: a no-env legacy stack must not silently read production secrets.

4. deploy_teardown_reconciler.go increments a new
   metrics.DeployTeardownMarkFailed counter when MarkDeploymentTornDown
   fails — a persistently stuck row is now alertable in NR, not a silent log.

5. auth.go findOrCreateUserGitHub now matches an existing account by email
   (GetUserByEmail) and links github_id via new models.LinkGitHubID instead
   of forking a new team/user — mirrors findOrCreateUserGoogle and rejects
   takeover of an account already linked to a different GitHub ID. The
   /user/emails fallback now filters on Verified && Primary.

6. (already correct) models.CreateUser already routes email through
   NormalizeEmail at the write boundary — every OAuth/magic-link/claim call
   site is covered. No change needed; verified.

7. webhook.go receive_url is now built from a fixed server-controlled base
   (new webhookReceiveBaseURL: API_PUBLIC_URL / compiled-in base; c.BaseURL()
   only as a non-production dev fallback) instead of the client-controllable
   Host header. The URL is encrypted + persisted, so a client-settable host
   was a persistence-poisoning vector.

8. webhook.go Receive + ListRequests reject any non-webhook resource token
   with 404 — GetResourceByToken selects by token only, so a postgres/redis
   token previously passed.

9. auth.go GoogleAuthURL drops the impossible url.Parse-error 500 branch
   (the argument is a compile-time constant) — matches GoogleStart.
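
As an illustration of item 1, the terminal-status guard might look like the sketch below. `IsDeploymentTerminal`, `DeployStatusStopped`, the three status values, the 409 status, and the `deployment_not_redeployable` code come from the text above; the other constant names, the `redeployStatus` helper, and the 202 success status are assumptions.

```go
package main

import (
	"fmt"
	"net/http"
)

// Status values as listed in the commit message. Only DeployStatusStopped is
// named there; the other two constant names are hypothetical.
const (
	DeployStatusExpired = "expired"
	DeployStatusDeleted = "deleted"
	DeployStatusStopped = "stopped"
)

// IsDeploymentTerminal reports whether a deployment has reached a status
// from which it must not be redeployed.
func IsDeploymentTerminal(status string) bool {
	switch status {
	case DeployStatusExpired, DeployStatusDeleted, DeployStatusStopped:
		return true
	}
	return false
}

// redeployStatus returns the HTTP status and error code a redeploy request
// would receive for a deployment in the given status. The 202 on success is
// an assumption for this sketch.
func redeployStatus(status string) (int, string) {
	if IsDeploymentTerminal(status) {
		return http.StatusConflict, "deployment_not_redeployable"
	}
	return http.StatusAccepted, ""
}

func main() {
	for _, s := range []string{"healthy", "stopped", "expired"} {
		code, errCode := redeployStatus(s)
		fmt.Println(s, code, errCode)
	}
}
```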

Regression tests: models/redeploy_guard_test.go (IsDeploymentTerminal,
IsStackActive), models/link_github_id_test.go (LinkGitHubID),
handlers/webhook_receive_base_url_test.go (#7), handlers/p2_roundup_test.go
(#1 error code, #4 metric, #9 GoogleAuthURL), and a wrong-resource-type case
appended to handlers/webhook_test.go (#8).
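
Item 7's base-URL selection can be sketched as a pure function over its inputs. `webhookReceiveBaseURL` is named in the commit message, but its signature and the exact precedence below (configured URL first, request-derived base only outside production, compiled-in base otherwise) are one plausible reading, stated here as an assumption. The URLs in `main` are placeholders.

```go
package main

import (
	"fmt"
	"strings"
)

// webhookReceiveBaseURL picks the base for receive_url without trusting the
// request in production.
//   - apiPublicURL: value of API_PUBLIC_URL, may be empty
//   - compiledBase: the compiled-in default
//   - requestBase:  what the request claims (derived from the Host header)
//   - production:   whether the server runs in production
func webhookReceiveBaseURL(apiPublicURL, compiledBase, requestBase string, production bool) string {
	if apiPublicURL != "" {
		return strings.TrimRight(apiPublicURL, "/")
	}
	// Dev convenience only: a client-controlled value is never used in prod.
	if !production && requestBase != "" {
		return strings.TrimRight(requestBase, "/")
	}
	return strings.TrimRight(compiledBase, "/")
}

func main() {
	const compiled = "https://api.example.test"
	fmt.Println(webhookReceiveBaseURL("", compiled, "https://attacker.test", true))
	fmt.Println(webhookReceiveBaseURL("", compiled, "http://localhost:8080", false))
	fmt.Println(webhookReceiveBaseURL("https://public.example.test/", compiled, "https://attacker.test", true))
}
```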

go build ./... and go vet ./... pass. New no-DB regression tests pass; the
DB/Redis-backed suites require a test Postgres/Redis (unavailable in this
environment).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
mastermanas805 added a commit that referenced this pull request May 20, 2026
…/B18

Closes a batch of P2/P3 envelope-contract and ordering issues identified by
the B9 (provisioning), B10 (auth/ratelimit/quota), and B18 (input fuzz)
bug-bash reports. All fixes carry inline CLAUDE.md rule-17 coverage notes.

Fixes:

1. **B18 M4** — `POST /storage/:token/presign` validated the body before checking
   token existence. Pre-fix, a random UUID returned `invalid_operation` (400)
   before the existence check fired. Reordered: token parse → resource
   lookup → body-shape validation. Closes information-flow risk if validators
   ever loosen.
   File: internal/handlers/storage_presign.go

2. **B18 M2** — Remove silent 120-byte truncation in sanitizeName. The
   authoritative length bound was already requireName's 64-rune gate; the
   second silent cap created a latent footgun if the name regex ever
   loosens to allow multi-byte runes. Updated regression test for the
   single-gate contract.
   Files: internal/handlers/provision_helper.go, provision_helper_test.go

3. **B18 M3** — Document the intentional UUID-shape-before-auth ordering on
   `GET /api/v1/webhooks/:token/requests`. The webhook token is a
   public-by-design capability (lands in HTTP headers/logs/outbound URLs);
   "well-formed-but-unknown" is not an oracle leak. Doc-only comment so
   future refactors preserve the intent.
   File: internal/handlers/webhook.go

4. **B18 L1** — Surface `X-Instant-Notice: name_normalized` header when
   sanitizeName mutates the request name (CRLF / tab / NUL / HTML-special
   chars stripped). Pre-fix the mutation was silent — agents looking up
   "db_for_user\n" later by exact name would never find the persisted
   "db_for_user". Header-only signal; does NOT fail the request (the
   strip is a deliberate hardening on top of the regex).
   File: internal/handlers/provision_helper.go

5. **B18 L2** — `parseProvisionBody` returns 415 `unsupported_media_type`
   when the request carries an explicit non-JSON Content-Type
   (application/xml etc.). Pre-fix, sending XML with `Content-Type:
   application/xml` returned 400 `name_required` — a misleading code that
   cost the caller one extra debugging cycle. The OpenAPI spec only
   declares `application/json`; 415 is the RFC-correct status.
   File: internal/handlers/provision_helper.go

6. **B10 P2-3** — Razorpay webhook invalid-signature envelope hydrated with
   the canonical ErrorResponse shape. Pre-fix, signature failures returned
   `{ok:false,error:"invalid_signature"}` with no request_id, message,
   retry_after_seconds, or agent_action. Razorpay support always asks for
   the request_id when a webhook fails. Same hydration applied to the
   invalid_payload path.
   File: internal/handlers/billing.go

7. **B10 P2-4** — Add `WWW-Authenticate: Bearer realm="instanode"` to every
   401 from respondUnauthorized. RFC 6750 §3 requires this header on every
   401 from a Bearer-protected resource. Pre-fix only the audience-mismatch
   path emitted it. OAuth-aware clients and HTTP debugging tools look for
   it.
   File: internal/middleware/auth.go
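
The Content-Type gate from fix 5 can be sketched with the standard library's `mime.ParseMediaType`. The helper name `checkContentType` is hypothetical, and two behaviours are assumptions beyond what the text states: an absent header is allowed through, and a malformed header is treated like a non-JSON type.

```go
package main

import (
	"fmt"
	"mime"
	"net/http"
)

// checkContentType returns (415, "unsupported_media_type") when the header
// names an explicit non-JSON media type, and (0, "") when the request may
// proceed to body parsing. Parameters such as charset are ignored.
func checkContentType(header string) (int, string) {
	if header == "" {
		// No explicit type: assumed acceptable, body parsing decides.
		return 0, ""
	}
	mediaType, _, err := mime.ParseMediaType(header)
	if err != nil || mediaType != "application/json" {
		return http.StatusUnsupportedMediaType, "unsupported_media_type"
	}
	return 0, ""
}

func main() {
	for _, h := range []string{"application/json; charset=utf-8", "application/xml", ""} {
		status, code := checkContentType(h)
		fmt.Printf("%q -> %d %s\n", h, status, code)
	}
}
```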

Gate (matches CI/deploy.yml):
- `go build ./...` — green
- `go vet ./...` — green
- `go test ./... -short -count=1 -p 1` — green on every modified package;
  pre-existing failures (12 in handlers + 2 in models + 3 B13 contract
  tests) verified unchanged-against-master by stashing the patch and
  re-running the same suite. All pre-existing flakes documented in
  CLAUDE.md "Known Design Gaps".

Skipped (already shipped today, per brief):
- AESKeyring (a3155a5), B5/B11/B13/B7 (0c7991c), presign middleware (PR
  #122, not yet on master), 768c0ca's 8 fixes.

Coverage:
- B19 finding #1 (presign middleware) — already shipped in PR #122; this
  PR does not duplicate.
- B19 finding #4 (lease-recovery RTO) — worker-side, tracked as task #245.
- B9 P3-F8 (X-RateLimit-Remaining: 0 on success) — investigated; the math
  in rate_limit.go is correct (limit-count). The reported 0-on-success is
  not reproducible from the code path; left for in-prod re-probe after
  this lands.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
mastermanas805 added a commit that referenced this pull request May 21, 2026
…/B18

Closes a batch of P2/P3 envelope-contract and ordering issues identified by
the B9 (provisioning), B10 (auth/ratelimit/quota), and B18 (input fuzz)
bug-bash reports. All fixes carry inline CLAUDE.md rule-17 coverage notes.

Fixes:

1. **B18 M4** — `POST /storage/:token/presign` validated the body before checking
   token existence. Pre-fix, a random UUID returned `invalid_operation` (400)
   before the existence check fired. Reordered: token parse → resource
   lookup → body-shape validation. Closes information-flow risk if validators
   ever loosen.
   File: internal/handlers/storage_presign.go

2. **B18 M2** — Remove silent 120-byte truncation in sanitizeName. The
   authoritative length bound was already requireName's 64-rune gate; the
   second silent cap created a latent footgun if the name regex ever
   loosens to allow multi-byte runes. Updated regression test for the
   single-gate contract.
   Files: internal/handlers/provision_helper.go, provision_helper_test.go

3. **B18 M3** — Document the intentional UUID-shape-before-auth ordering on
   `GET /api/v1/webhooks/:token/requests`. The webhook token is a
   public-by-design capability (lands in HTTP headers/logs/outbound URLs);
   "well-formed-but-unknown" is not an oracle leak. Doc-only comment so
   future refactors preserve the intent.
   File: internal/handlers/webhook.go

4. **B18 L1** — Surface `X-Instant-Notice: name_normalized` header when
   sanitizeName mutates the request name (CRLF / tab / NUL / HTML-special
   chars stripped). Pre-fix the mutation was silent — agents looking up
   "db_for_user\n" later by exact name would never find the persisted
   "db_for_user". Header-only signal; does NOT fail the request (the
   strip is a deliberate hardening on top of the regex).
   File: internal/handlers/provision_helper.go

5. **B18 L2** — `parseProvisionBody` returns 415 `unsupported_media_type`
   when the request carries an explicit non-JSON Content-Type
   (application/xml etc.). Pre-fix, sending XML with `Content-Type:
   application/xml` returned 400 `name_required` — a misleading code that
   cost the caller one extra debugging cycle. The OpenAPI spec only
   declares `application/json`; 415 is the RFC-correct status.
   File: internal/handlers/provision_helper.go

6. **B10 P2-3** — Razorpay webhook invalid-signature envelope hydrated with
   the canonical ErrorResponse shape. Pre-fix, signature failures returned
   `{ok:false,error:"invalid_signature"}` with no request_id, message,
   retry_after_seconds, or agent_action. Razorpay support always asks for
   the request_id when a webhook fails. Same hydration applied to the
   invalid_payload path.
   File: internal/handlers/billing.go

7. **B10 P2-4** — Add `WWW-Authenticate: Bearer realm="instanode"` to every
   401 from respondUnauthorized. RFC 6750 §3 requires this header on every
   401 from a Bearer-protected resource. Pre-fix only the audience-mismatch
   path emitted it. OAuth-aware clients and HTTP debugging tools look for
   it.
   File: internal/middleware/auth.go
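
Fix 7 boils down to setting the challenge header in the one helper that every 401 path goes through. A minimal `net/http` sketch, assuming a much-reduced error envelope and a hypothetical `missing_token` error code (the real `respondUnauthorized` signature and the canonical ErrorResponse shape are not shown in this commit):

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// respondUnauthorized writes a 401 with the Bearer challenge header set, so
// no caller can forget it. The JSON body is a reduced, assumed envelope.
func respondUnauthorized(w http.ResponseWriter, code string) {
	w.Header().Set("WWW-Authenticate", `Bearer realm="instanode"`)
	w.Header().Set("Content-Type", "application/json")
	w.WriteHeader(http.StatusUnauthorized)
	_ = json.NewEncoder(w).Encode(map[string]any{"ok": false, "error": code})
}

// challengeFor runs respondUnauthorized against a recorder and returns the
// status code and the WWW-Authenticate header that a client would see.
func challengeFor(code string) (int, string) {
	rec := httptest.NewRecorder()
	respondUnauthorized(rec, code)
	return rec.Code, rec.Header().Get("WWW-Authenticate")
}

func main() {
	status, challenge := challengeFor("missing_token")
	fmt.Println(status, challenge)
}
```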

Gate (matches CI/deploy.yml):
- `go build ./...` — green
- `go vet ./...` — green
- `go test ./... -short -count=1 -p 1` — green on every modified package;
  pre-existing failures (12 in handlers + 2 in models + 3 B13 contract
  tests) verified unchanged-against-master by stashing the patch and
  re-running the same suite. All pre-existing flakes documented in
  CLAUDE.md "Known Design Gaps".

Skipped (already shipped today, per brief):
- AESKeyring (ed55c41), B5/B11/B13/B7 (ed14581), presign middleware (PR
  #122, not yet on master), f1ba49b's 8 fixes.

Coverage:
- B19 finding #1 (presign middleware) — already shipped in PR #122; this
  PR does not duplicate.
- B19 finding #4 (lease-recovery RTO) — worker-side, tracked as task #245.
- B9 P3-F8 (X-RateLimit-Remaining: 0 on success) — investigated; the math
  in rate_limit.go is correct (limit-count). The reported 0-on-success is
  not reproducible from the code path; left for in-prod re-probe after
  this lands.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>