feat(plans): add vault tier policy (max entries + allowed envs)#1

Merged
mastermanas805 merged 1 commit into master from feat/vault-plan-policy on May 11, 2026

@mastermanas805
Member

Summary

Adds vault-feature policy fields to `PlanLimits`:

  • `VaultMaxEntries int` — per-team cap. `-1` = unlimited, `0` = feature unavailable.
  • `VaultEnvsAllowed []string` — allowed env scopes for vault entries.

Test coverage in `plans_test.go` for both fields across all tiers.

Test plan

  • `go test ./plans/...` passes

🤖 Generated with Claude Code

Adds two fields to PlanLimits:

- VaultMaxEntries (int): per-team cap on vault entries. -1 = unlimited,
  0 = vault feature unavailable on this tier.
- VaultEnvsAllowed ([]string): list of environment names permitted for
  vault entries (production / staging / dev / ...).

Test cases extend plans_test.go to cover both fields across all tiers.
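
A minimal sketch of how a caller might interpret the two fields, assuming the sentinel semantics stated above (`-1` = unlimited, `0` = unavailable). `CanAddVaultEntry` and the tier values in `main` are hypothetical and are not part of this PR; only the field names come from the commit.

```go
package main

import "fmt"

// PlanLimits mirrors only the two vault fields named in this commit.
type PlanLimits struct {
	VaultMaxEntries  int      // -1 = unlimited, 0 = vault unavailable on this tier
	VaultEnvsAllowed []string // environment names permitted for vault entries
}

// CanAddVaultEntry is a hypothetical helper, not part of this PR. It reports
// whether a team already holding `current` entries may add one more in `env`.
func CanAddVaultEntry(l PlanLimits, current int, env string) bool {
	if l.VaultMaxEntries == 0 {
		return false // feature unavailable on this tier
	}
	if l.VaultMaxEntries > 0 && current >= l.VaultMaxEntries {
		return false // cap reached; -1 skips this check entirely
	}
	for _, allowed := range l.VaultEnvsAllowed {
		if allowed == env {
			return true
		}
	}
	return false // env is not in the tier's allow-list
}

func main() {
	// Illustrative values only; the real per-tier numbers live in the registry.
	tier := PlanLimits{VaultMaxEntries: 2, VaultEnvsAllowed: []string{"production"}}
	fmt.Println(CanAddVaultEntry(tier, 1, "production"))
	fmt.Println(CanAddVaultEntry(tier, 2, "production"))
	fmt.Println(CanAddVaultEntry(tier, 0, "staging"))
}
```

Checking the `0` sentinel before the cap keeps "feature off" distinct from "cap reached", which matters if callers surface different error messages for the two cases.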

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@mastermanas805 mastermanas805 merged commit 0a31ebb into master May 11, 2026
@mastermanas805 mastermanas805 deleted the feat/vault-plan-policy branch May 11, 2026 08:00
mastermanas805 added a commit that referenced this pull request May 21, 2026
… OSV-Scanner) (#16)

* feat(plans): add vault tier policy (max entries + allowed envs) (#1)

Adds two fields to PlanLimits:

- VaultMaxEntries (int): per-team cap on vault entries. -1 = unlimited,
  0 = vault feature unavailable on this tier.
- VaultEnvsAllowed ([]string): list of environment names permitted for
  vault entries (production / staging / dev / ...).

Test cases extend plans_test.go to cover both fields across all tiers.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* common: add buildinfo package for compile-time GitSHA/BuildTime/Version

New `instant.dev/common/buildinfo` exposes three package vars
(`GitSHA`, `BuildTime`, `Version`) defaulting to sentinel strings.
Real values are wired in at link time via `go build -ldflags -X` —
the Dockerfile in each service passes `--build-arg GIT_SHA=...`
into the ldflag so /healthz and slog log lines stamp the exact
commit the running pod was built from.
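
A minimal sketch of the link-time stamping pattern described above. The variable names come from this commit; the sentinel values, the `Summary` helper, and the use of `package main` are assumptions made to keep the example self-contained. In the real package the `-X` target would be the `instant.dev/common/buildinfo` import path rather than `main`.

```go
package main

import "fmt"

// These defaults are placeholders (assumed values). They can be overridden
// at link time, for example:
//
//	go build -ldflags "-X main.GitSHA=$(git rev-parse HEAD) -X main.Version=1.2.3"
//
// -X only works on package-level string variables, which is why these are
// plain strings and not constants.
var (
	GitSHA    = "dev"
	BuildTime = "unknown"
	Version   = "dev"
)

// Summary is a hypothetical helper that formats the three values the way a
// /healthz handler or a log attribute might.
func Summary() string {
	return fmt.Sprintf("version=%s commit=%s built=%s", Version, GitSHA, BuildTime)
}

func main() {
	fmt.Println(Summary())
}
```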

This is track 1 of 8 in the observability rollout.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* common/logctx: relocate from api repo into the canonical common module

Track 2 of the observability rollout originally created common/logctx
inside the api repo as a side effect of dispatching from an api worktree.
This blocked the obsstubs→common refactor in the api router because the
api/go.mod has `replace instant.dev/common => ../common` — so imports of
instant.dev/common/logctx were resolving to the monorepo common dir which
didn't have the package.

This commit puts common/logctx where its module path says it lives. After
this lands, the api repo's fix-obsstubs-to-common-2026-05-12 PR can drop
its obsstubs/ stubs and import instant.dev/common/logctx directly.

No code changes to the package itself — straight relocation.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* plans: restore free tier in Default() to mirror anonymous

The api repo's plans tests (TestDefault_AllStandardTiersPresent,
TestAll_ReturnsAllPlans, TestFreeTier_MirrorsAnonymous) require a `free`
tier in the default registry. The api-level plans.yaml already defines
`free` as a byte-for-byte clone of `anonymous` (same limits, same
features) — the only difference being audience (free = claimed-but-unpaid
teams, anonymous = pre-claim agents). Both still get reaped at 24h, so
the pay-from-day-one policy holds.

The `free` tier is real product surface, not test scaffolding:
  - api/internal/handlers/billing.go:361 sets tier="free" for unpaid teams
  - api/internal/handlers/webhook.go:411-416 reaps both anonymous and free
  - api/internal/handlers/openapi.go advertises "free" in 3 schemas
  - api/internal/models/resource_elevate_test.go uses tier "free"
  - api/internal/handlers/onboarding_test.go asserts tier == "free"

The FREE-TIER-RECYCLE-2026-05-12.md plan also depends on `free` existing
in the registry (Option B email-gate falls into this tier).

Mirroring rule: anonymous and free must stay byte-identical so that an
anonymous->free flip at claim time cannot widen or narrow quotas.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* logctx: read commit_id from buildinfo.GitSHA, drop env var fallback

Today's B1 + B2 dispatches both surfaced that /healthz returned the real
commit SHA (via buildinfo.GitSHA from the ldflag-patched Dockerfile) but
slog lines showed commit_id=dev because logctx read os.Getenv("COMMIT_ID").
The two systems disagreed.

The env-var fallback was a decoupling shim from when logctx shipped before
buildinfo. Now both live on the same common module; collapse to a direct
import.

* plans: add yearly variants (hobby/pro/team) + BillingPeriod helpers

Adds hobby_yearly ($90/yr), pro_yearly ($490/yr), team_yearly ($1990/yr)
to the embedded default registry — each mirrors its monthly counterpart's
limits + features exactly, only `price_monthly_cents` (annual amount in
cents) and `billing_period: yearly` differ.

New helpers:
  - Plan.BillingPeriod field
  - Registry.BillingPeriod(tier) — "monthly" | "yearly"
  - CanonicalTier(tier) — strips "_yearly" suffix so the webhook can map
    yearly plan_ids back to the bare tier and teams.plan_tier stays
    cycle-agnostic.

Tests pin the mirror invariant (limits + features identical to base tier)
and that yearly_price < monthly_price * 12 so the "save $X/yr" badge is
honest.
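
A minimal sketch of the suffix handling described above. `CanonicalTier` follows the stated behaviour (strip `_yearly`). The free-standing `BillingPeriod` function here derives the period from the suffix, which is an assumption for illustration; per the commit, the real `Registry.BillingPeriod` is a registry method backed by the `billing_period` field.

```go
package main

import (
	"fmt"
	"strings"
)

const yearlySuffix = "_yearly"

// CanonicalTier maps a yearly plan id back to its bare tier name so that a
// stored plan tier stays independent of the billing cycle.
func CanonicalTier(tier string) string {
	return strings.TrimSuffix(tier, yearlySuffix)
}

// BillingPeriod is a simplified stand-in (assumption): it infers the period
// from the tier name instead of reading a registry field.
func BillingPeriod(tier string) string {
	if strings.HasSuffix(tier, yearlySuffix) {
		return "yearly"
	}
	return "monthly"
}

func main() {
	for _, t := range []string{"hobby", "pro_yearly", "team_yearly"} {
		fmt.Printf("%s -> tier=%s period=%s\n", t, CanonicalTier(t), BillingPeriod(t))
	}
}
```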

* plans: yearly discount 17% -> 10% (hobby $97.20 / pro $529.20 / team $2149.20)

P2 shipped the yearly variants at ~17% off monthly. User feedback: 17% is
too steep a give-up on annual revenue; standardize on 10% off across all
three tiers to keep yearly attractive without leaving margin on the table.

New prices (annual amount in cents, stored in price_monthly_cents per the
existing schema):

  hobby_yearly:  9000   -> 9720    ($90.00   -> $97.20)
  pro_yearly:   49000   -> 52920   ($490.00  -> $529.20)
  team_yearly: 199000   -> 214920  ($1990.00 -> $2149.20)

Each new price = (monthly * 12 * 0.9), giving an effective monthly rate
of $8.10 / $44.10 / $179.10 respectively.

Tests:
  - existing TestYearlyVariants_MirrorMonthlyLimits still passes (limits +
    features unchanged)
  - existing TestYearlyPrices_DiscountedVsMonthlyTimesTwelve still passes
  - new TestYearlyDiscountIsExactly10Percent locks the contract:
    (yearly / 12) / monthly == 0.9 +/- 0.01 for hobby/pro/team. Future
    price changes that drift the discount fail loudly.

Operator action required (not automatable from this PR): the existing
RAZORPAY_PLAN_ID_HOBBY_YEARLY / _PRO_YEARLY / _TEAM_YEARLY env vars still
point at the OLD prices in the Razorpay dashboard. Operator must EITHER
edit the 3 existing yearly plans in Razorpay to the new prices
($97.20, $529.20, $2149.20) OR create 3 new plans + rotate the env vars
in the k8s secret. Until then, checkout will charge the old amounts even
though the dashboard quotes the new ones.

Dashboard impact: none — the "Save $X/yr" badge reads PriceMonthly from
the registry, so it auto-updates once this lands.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* plans: yearly back to '2 months free' (hobby $90 / pro $490 / team $1990)

Reverts common#7 (yearly @ 10% off) back to the original 17%-ish pricing
expressed as exactly monthly x 10 — the mathematical form of "2 months
free". Per PRICING-BEST-PRACTICES-2026-05-13.md (top recommendation #3,
Athenic case study), the "2 months free" framing outperforms percentage-off
copy by ~3.4x in conversion. To use that framing honestly we need
yearly_cents == monthly_cents * 10.

- hobby_yearly: 9720 -> 9000 cents ($97.20 -> $90/yr)
- pro_yearly:   52920 -> 49000 cents ($529.20 -> $490/yr)
- team_yearly:  214920 -> 199000 cents ($2149.20 -> $1990/yr)

Tests:
- Renamed TestYearlyDiscountIsExactly10Percent -> TestYearlyIsTwoMonthsFree
  (asserts (yearly/12)/monthly == 10/12 within 0.01).
- Added TestYearlyIsExactlyMonthlyTimesTen — strict integer-cents lock so
  the "2 months free" claim is provable to the cent.
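
The two checks above can be sketched as follows, using the prices quoted in this commit ($9 / $49 / $199 monthly; $90 / $490 / $1990 yearly). The type and function names are hypothetical and are not the test helpers in `plans_test.go`.

```go
package main

import "fmt"

type price struct {
	name         string
	monthlyCents int
	yearlyCents  int
}

// Prices as quoted in this commit message, in cents.
var prices = []price{
	{"hobby", 900, 9000},
	{"pro", 4900, 49000},
	{"team", 19900, 199000},
}

// IsTwoMonthsFree is the strict integer-cents form: yearly == monthly * 10.
func IsTwoMonthsFree(p price) bool {
	return p.yearlyCents == p.monthlyCents*10
}

// EffectiveRatio returns (yearly / 12) / monthly, the ratio form of the check.
func EffectiveRatio(p price) float64 {
	return float64(p.yearlyCents) / 12 / float64(p.monthlyCents)
}

// Within reports whether |a - b| <= tol.
func Within(a, b, tol float64) bool {
	d := a - b
	if d < 0 {
		d = -d
	}
	return d <= tol
}

func main() {
	for _, p := range prices {
		fmt.Printf("%s exact=%v ratio=%.4f\n", p.name, IsTwoMonthsFree(p), EffectiveRatio(p))
	}
}
```

The integer comparison is the stronger of the two: a tolerance of 0.01 on the ratio would still accept a price that is off by a few cents, while the x10 lock would not.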

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* plans: differentiate yearly discount — hobby 'save 1 month', pro/team '2 months free'

Hobby Annual is now $99/yr (= $9 x 11 = 8.3% off, "save 1 month").
Pro Annual stays $490/yr (= $49 x 10 = 17% off, "2 months free").
Team Annual stays $1990/yr (= $199 x 10 = 17% off, "2 months free").

Strategic intent: when a hobby user sees their annual savings is small
but Pro Annual saves "2 months free / $98", the differential nudges
them to tier-skip into Pro Annual rather than just switch billing frequency.

Tests:
- Split TestYearlyIsTwoMonthsFree into TestProAnnualIsTwoMonthsFree
  (pro+team only, 10/12 ratio) + TestHobbyAnnualIsOneMonthFree
  (hobby only, 11/12 ratio).
- Renamed TestYearlyIsExactlyMonthlyTimesTen to
  TestProTeamYearlyIsMonthlyTimesTen and added
  TestHobbyYearlyIsMonthlyTimesEleven for the new x11 lock.
- Added TestTierDiscountDifferentiation locking the strategic intent:
  pro_yearly_ratio < hobby_yearly_ratio (and same for team).

* plans: shared Rank() helper for tier ordering

Two package-private rank functions used to live in the api repo
(internal/handlers/billing.go::tierRank and
internal/handlers/admin_customers.go::adminTierRank). They had subtly
different orderings — billing.go covered 6 tiers (anonymous .. team),
admin_customers.go covered 4 (free .. team) and was off-by-one against
billing for the same names. The discrepancy never bit production because
the admin surface never sees anonymous/growth, but it's a footgun.

Promote a single canonical ordering here so all modules share one rank
function. Returns -1 for unknown tiers; callers must guard against the
sentinel when comparing ranks (a negative rank means "no transition
direction"). Yearly variants are NOT auto-normalised — callers pass them
through CanonicalTier first if they want "pro_yearly" to rank as "pro".

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* plans: add hobby_plus tier — $19/mo mid-step between Hobby and Pro (W11) (#11)

Inserts a new hobby_plus tier between hobby ($9) and pro ($49):
  - 2 deployment apps (vs hobby's 1)
  - custom_domains: true (first paid tier with this feature)
  - 5 GB object storage, 1 GB MongoDB, multi-env vault (50 entries)
  - 14-day backups with 1-click restore (vs hobby's 7d, no restore)
  - $199/yr annual variant (hobby_plus_yearly, ~13% discount)

Research-backed pricing decoy: triple-tier $9/$19/$49 lifts conversion
~22% vs $9/$49 by anchoring against the middle price.

Rank ordering: anonymous=0, free=1, hobby=2, hobby_plus=3, growth=4,
pro=5, team=6. Every previous upgrade transition still resolves as
"upgrade" because the relative ordering is preserved (only absolute
values shifted).

Also removes the legacy TrialDays field from Plan + Registry to keep
common in lockstep with the api (which removed trial in W10).

* plans: add custom_domains_max per-tier cap (FIX-G) (#12)

Adds Limits.CustomDomainsMax field + Registry.CustomDomainsMaxLimit()
method so handlers can enforce a per-team count cap on custom hostnames.

Tier ladder (mirrors defaultYAML and api/plans.yaml):

  anonymous / free / hobby / hobby_yearly  = 0  (feature off — boolean trips first)
  hobby_plus / hobby_plus_yearly           = 1  (first tier with the feature)
  growth                                   = 3
  pro / pro_yearly                         = 5
  team / team_yearly                       = 50 (effectively unlimited for dashboards)

Closes BugBash U10 / #128 — previously the boolean Features.CustomDomains
flag was the only gate, letting any Hobby Plus+ team bind an unbounded
number of hostnames. Pairs with api PR that enforces the cap in
custom_domain.go before the row insert.

Tests:
- TestCustomDomainsMaxLimit locks the per-tier numbers above.
- TestCustomDomainsMax_PairedWithBooleanFlag guards the invariant that
  custom_domains_max > 0 always pairs with features.custom_domains:true
  (and vice versa) — drift between the two is dead code or unreachable
  capacity.
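
A minimal sketch of the pairing invariant, assuming the cap values from the tier ladder above. The boolean flag values are inferred from "first tier with the feature" and are not copied from the YAML; `Unpaired` is a hypothetical helper and not the actual test.

```go
package main

import (
	"fmt"
	"sort"
)

type tier struct {
	CustomDomains    bool // features.custom_domains
	CustomDomainsMax int  // limits.custom_domains_max
}

// Caps follow the ladder in this commit; the booleans are inferred.
var tiers = map[string]tier{
	"anonymous":         {false, 0},
	"free":              {false, 0},
	"hobby":             {false, 0},
	"hobby_yearly":      {false, 0},
	"hobby_plus":        {true, 1},
	"hobby_plus_yearly": {true, 1},
	"growth":            {true, 3},
	"pro":               {true, 5},
	"pro_yearly":        {true, 5},
	"team":              {true, 50},
	"team_yearly":       {true, 50},
}

// Unpaired returns, in sorted order, every tier where the boolean flag and a
// positive cap disagree (flag on with cap 0, or cap > 0 with flag off).
func Unpaired(ts map[string]tier) []string {
	var bad []string
	for name, t := range ts {
		if (t.CustomDomainsMax > 0) != t.CustomDomains {
			bad = append(bad, name)
		}
	}
	sort.Strings(bad)
	return bad
}

func main() {
	fmt.Println("unpaired tiers:", Unpaired(tiers))
}
```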

* plans: add rpo_minutes / rto_minutes per-tier (FIX-H #Q50) (#13)

Adds two Limits fields surfaced on GET /api/v1/capabilities so an agent
can reason about a tier's durability promises before provisioning. Pairs
with the FIX-H api/worker backup-integrity work: the api handler reads
RPOMinutes/RTOMinutes via the new Registry methods. Anonymous/free
return 0 ("not promised") because those tiers don't take scheduled
backups; hobby/hobby_plus = 1440/30, pro/team = 60/15.

No yaml updates here — plans.yaml lives in the api repo and FIX-H ships
the values there in the same wave.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* plans: Pro storage bump + Growth bump (PRICING-AUDIT-2026-05-15)

Pro: postgres 5→10 GB, vector 5→10 GB, redis 256→512 MB, mongo 2→5 GB,
object 10→50 GB. Same $49/mo. Defensible against Supabase Pro ($25/8 GB
PG/100 GB object) on a 30-second side-by-side.

Growth: postgres + vector 5→20 GB, redis 256→1024 MB so the tier ladder
stays ordered above Pro.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* plans: hobby_plus rolled back to production-only vault envs

W12 pricing pass (2026-05-15): multi-env is Pro+. Mirrors the
api/plans.yaml change and updates TestHobbyPlus_TierMatrix +
TestVaultEnvsAllowed_HobbyIsProductionOnly to assert the new
production-only posture.

Code gate lives in api/internal/handlers/stack.go::multiEnvTierAllowed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(plans): add QueueCount limit field + QueueCountLimit() method (A6)

Adds `queue_count: int` to the Limits struct and `QueueCountLimit(tier string) int`
to Registry. The zero-value fallback treats absent fields as unlimited (-1) for
backward compatibility with YAML files that predate this change.

queue_count values in defaultYAML:
  anonymous/free/growth/team/team_yearly: -1 (unlimited)
  hobby/hobby_yearly: 3
  hobby_plus/hobby_plus_yearly: 5
  pro/pro_yearly: 20
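
A minimal sketch of the zero-value fallback described above, assuming a map-backed registry. The struct layout is hypothetical; only the field name, the method name, and the "absent means unlimited" rule come from this commit.

```go
package main

import "fmt"

// Limits carries only the field relevant to this example.
type Limits struct {
	QueueCount int // 0 when the YAML key is absent
}

// Registry is a simplified stand-in for the real plan registry.
type Registry struct {
	plans map[string]Limits
}

// QueueCountLimit returns the tier's queue cap. A zero value (key missing
// from older YAML) is reported as -1, i.e. unlimited. In this sketch an
// unknown tier also yields the zero value and therefore -1; whether the real
// method behaves that way is an assumption.
func (r Registry) QueueCountLimit(tier string) int {
	if n := r.plans[tier].QueueCount; n != 0 {
		return n
	}
	return -1
}

func main() {
	r := Registry{plans: map[string]Limits{
		"hobby":      {QueueCount: 3},
		"hobby_plus": {QueueCount: 5},
		"pro":        {QueueCount: 20},
		"team":       {QueueCount: -1},
		"legacy":     {}, // predates the queue_count key
	}}
	for _, t := range []string{"hobby", "hobby_plus", "pro", "team", "legacy"} {
		fmt.Println(t, r.QueueCountLimit(t))
	}
}
```

One consequence of this fallback is that `0` cannot be used to mean "no queues allowed"; a tier that needs that meaning would require a separate flag.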

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(plans): correct growth/pro tier-rank inversion

P1, BUGHUNT-REPORT-2026-05-17-round2: the canonical Rank table had
growth=4, pro=5 — i.e. growth ranked BELOW pro. This contradicted
plans.yaml pricing (pro $49/mo < growth $99/mo) and the worker's
billingTierRankMap (pro=4, growth=5). The api consumes common's Rank,
the worker uses its own table — the two disagreed, so an automatic
plan transition could be misclassified as an upgrade when it was a
downgrade (and vice versa).

Rank is now anchored to price: anonymous 0, free 1, hobby 2,
hobby_plus 3, pro 4, growth 5, team 6 — matching the worker.

rank_test.go updated: TestRank_AllStandardTiers / _MonotonicallyIncreasing
/ _CaseInsensitive reflect the corrected order; new
TestRank_ProRanksBelowGrowth pins pro < growth < team explicitly so the
inversion cannot regress.
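
A minimal sketch of the corrected rank table and of a caller that guards the `-1` sentinel. The rank values are the ones listed in this commit; `Direction` is a hypothetical caller, and the case-insensitive lookup is inferred from the `_CaseInsensitive` test name.

```go
package main

import (
	"fmt"
	"strings"
)

// Price-anchored ordering as stated in this commit.
var tierRank = map[string]int{
	"anonymous":  0,
	"free":       1,
	"hobby":      2,
	"hobby_plus": 3,
	"pro":        4,
	"growth":     5,
	"team":       6,
}

// Rank returns the tier's position, or -1 for an unknown tier. Yearly
// variants are not normalised here and therefore rank as unknown.
func Rank(tier string) int {
	if r, ok := tierRank[strings.ToLower(tier)]; ok {
		return r
	}
	return -1
}

// Direction is a hypothetical caller: it canonicalises yearly ids first and
// refuses to classify a transition when either side is unknown.
func Direction(from, to string) string {
	canonical := func(t string) string {
		return strings.TrimSuffix(strings.ToLower(t), "_yearly")
	}
	a, b := Rank(canonical(from)), Rank(canonical(to))
	switch {
	case a < 0 || b < 0:
		return "unknown"
	case b > a:
		return "upgrade"
	case b < a:
		return "downgrade"
	default:
		return "same"
	}
}

func main() {
	fmt.Println(Direction("pro", "growth"))
	fmt.Println(Direction("growth", "pro_yearly"))
	fmt.Println(Direction("pro", "enterprise"))
}
```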

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(plans): correct hobby_yearly price drift in defaultYAML

defaultYAML's hobby_yearly block had price_monthly_cents: 9900, while
api/plans.yaml (the source of truth, confirmed correct against the
instanode-web PricingPage FIX-K note "$90/yr = $7.50/mo") holds 9000.
defaultYAML is documented to be a byte-mirror of api/plans.yaml.

Diffed all four _yearly blocks (hobby_yearly, hobby_plus_yearly,
pro_yearly, team_yearly): only the hobby_yearly price disagreed — every
other yearly-block price and limit field was already in sync.

The 9000 value puts hobby_yearly at hobby x10 ("save 2 months"), which
contradicted three tests that pinned the stale x11 "save 1 month" model
(TestHobbyAnnualIsOneMonthFree, TestHobbyYearlyIsMonthlyTimesEleven,
TestTierDiscountDifferentiation). Since plans.yaml is authoritative,
those tests encoded the drift and are replaced:
 - TestHobbyAnnualIsTwoMonthsFree  (10/12 ratio for hobby)
 - TestYearlyIsMonthlyTimesTen     (x10 lock for hobby/pro/team)
 - TestTierDiscountUniformity      (uniform 10/12 across core tiers)
 - TestHobbyPlusYearlyDiscount     (hobby_plus's distinct mid-discount)

Added TestHobbyYearlyPriceIsPinned — a value-pinning guard that fails if
defaultYAML's hobby_yearly price drifts off 9000 again.

go build ./... / go vet ./... / go test ./... -count=1 all pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(plans): add rpo_minutes/rto_minutes to every defaultYAML tier

BugBash 2026-05-18 P2-W2-41: common/plans.go's defaultYAML const set no
rpo_minutes/rto_minutes on any tier block, so plans.Default() reported
RPO=RTO=0 for every tier — including Pro/Team whose real values are
60/15. The Limits.RPOMinutes/RTOMinutes struct fields and the
RPOMinutes()/RTOMinutes() accessors already existed; only the embedded
YAML was missing the keys.

GET /api/v1/capabilities is served from a Default()-backed registry in
any environment without a plans.yaml file present, so an agent reasoning
about a workload's durability requirement got a false "not promised"
(0/0) signal for paid tiers.

- Add rpo_minutes/rto_minutes to all 11 tier blocks in defaultYAML,
  matching api/plans.yaml exactly (anon/free 0/0, hobby* 1440/30,
  pro*/team*/growth 60/15).
- Re-verified the whole defaultYAML is a faithful mirror of
  api/plans.yaml — programmatic limits/features/price/billing_period
  diff is now clean (audience is YAML-only metadata, no struct field).
- Add TestRPORTOMinutes_DefaultYAMLMatchesAPIPlansYAML — a
  registry-iterating regression test that fails if a new tier is added
  without RPO/RTO coverage or if Pro's values regress to 0.

Symptom:        plans.Default() RPOMinutes/RTOMinutes == 0 for all tiers
Enumeration:    grep -c 'rpo_minutes:' plans/plans.go (was 0, now 11)
Sites found:    11 tier blocks
Sites touched:  11
Coverage test:  TestRPORTOMinutes_DefaultYAMLMatchesAPIPlansYAML

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(resourcestatus): canonical ResourceStatus enum + expiry-stage derivation

BugBash flagged "expiry-stage predicate divergence": api and worker each
carried independently-drifting hand-written predicates for resource
status (active/paused/suspended/expired/deleted) and for the expiry-warning
stage derived from expires_at vs now.

New package instant.dev/common/resourcestatus is the single source of truth:

  - Status enum + Valid/IsActive/IsPaused/IsSuspended/IsExpired/IsDeleted/
    IsTerminal/IsReapable predicates, AllStatuses(), Parse(), ReapableStatuses()
  - ExpiryStage enum (none/12h/6h/1h/past-ttl) + DeriveExpiryStage(),
    HoursUntilExpiry(), IsPastTTL() — the worker's selectStage/hoursLeft
    logic centralised, P2-12 "most-imminent-bucket-wins" behaviour preserved

Exhaustive tests TestStatusPredicates_ExhaustiveOverEnum and
TestDeriveExpiryStage_ExhaustiveOverStagesAndBoundaries iterate
AllStatuses()/AllExpiryStages() — adding an enum value without handling
it fails the build.
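
A minimal sketch of the "most-imminent-bucket-wins" derivation, assuming inclusive upper bounds at 1h, 6h and 12h. The stage names come from this commit; the exact boundary handling and the `StageForHoursLeft` helper are assumptions, not the package's actual code.

```go
package main

import (
	"fmt"
	"time"
)

type ExpiryStage string

const (
	StageNone    ExpiryStage = "none"
	Stage12h     ExpiryStage = "12h"
	Stage6h      ExpiryStage = "6h"
	Stage1h      ExpiryStage = "1h"
	StagePastTTL ExpiryStage = "past-ttl"
)

// HoursUntilExpiry returns the remaining lifetime in hours (negative once
// expires_at has passed).
func HoursUntilExpiry(expiresAt, now time.Time) float64 {
	return expiresAt.Sub(now).Hours()
}

// StageForHoursLeft checks the narrowest window first, so a resource with
// 30 minutes left lands in "1h" even though it also sits inside the 6h and
// 12h windows. Boundary inclusivity is an assumption.
func StageForHoursLeft(h float64) ExpiryStage {
	switch {
	case h <= 0:
		return StagePastTTL
	case h <= 1:
		return Stage1h
	case h <= 6:
		return Stage6h
	case h <= 12:
		return Stage12h
	default:
		return StageNone
	}
}

// DeriveExpiryStage combines the two helpers above.
func DeriveExpiryStage(expiresAt, now time.Time) ExpiryStage {
	return StageForHoursLeft(HoursUntilExpiry(expiresAt, now))
}

func main() {
	now := time.Now()
	fmt.Println(DeriveExpiryStage(now.Add(30*time.Minute), now))
	fmt.Println(DeriveExpiryStage(now.Add(20*time.Hour), now))
}
```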

Cross-repo contract change (CLAUDE.md rule 22): api + worker convert to
this package in follow-up commits.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(resourcestatus): add StatusPending for two-phase provision lifecycle

MR-P0-2 (BugBash 2026-05-20). The api's provisioner_reconciler sweeps
`WHERE status='pending'` to recover rows stranded by an api crash
mid-provision, but no code ever wrote 'pending' — every CreateResource
INSERT landed on the column DEFAULT 'active' immediately, so the
crash-recovery subsystem was dead code that matched zero rows.

Add the StatusPending constant + IsPending predicate + cases in
AllStatuses/Valid so the api side can insert pending and flip to
active only after the backend provision RPC + persistence succeed.
Pending is NOT reapable (the reconciler, not the TTL reaper, handles
a stranded pending row) and NOT terminal.

Update the exhaustive-status table test to add the StatusPending case.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* storageprovider: cloud-agnostic storage credential abstraction

Define the StorageCredentialProvider interface so /storage/new can switch
from DO Spaces shared-master-key to Cloudflare R2 prefix-scoped tokens
(or AWS S3 STS sessions) via OBJECT_STORE_BACKEND env flip + data
migration — no application code changes.

Per STORAGE-ABSTRACTION-DESIGN-2026-05-20.md:

  Provider             PrefixScoped  STS  BucketPerTenant  MaxKeys
  ───────────────────  ────────────  ───  ───────────────  ─────────
  do-spaces (today)    no            no   ~100/account     200
  r2                   yes           yes  yes              unbounded
  s3 (skeleton)        yes           yes  yes              unbounded

Each impl reports its actual capabilities; the api's POST /storage/new
consults Capabilities() to pick credential vs broker mode. The S3 impl is
skeleton-only — session-policy assembly is real and tested, AWS SDK
wiring is injected via SetAssumeRoleFunc. The MinIO impl lives in api/
so common stays free of madmin-go transitive deps.
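
A minimal sketch of capability-driven mode selection. Only the interface name and `Capabilities()` come from this commit. The struct fields, `staticProvider`, `ModeFor`, and the rule "prefix-scoped backends hand out credentials, everything else goes through the broker" are assumptions for illustration; the real interface also covers credential issuance, which is omitted here.

```go
package main

import "fmt"

// Capabilities is reduced to the two columns used in this example.
type Capabilities struct {
	PrefixScoped bool
	STS          bool
}

// StorageCredentialProvider shows only the capability-reporting part of the
// interface.
type StorageCredentialProvider interface {
	Name() string
	Capabilities() Capabilities
}

// staticProvider is a hypothetical test double with fixed capabilities.
type staticProvider struct {
	name string
	caps Capabilities
}

func (p staticProvider) Name() string               { return p.name }
func (p staticProvider) Capabilities() Capabilities { return p.caps }

// ModeFor picks "credential" when the backend can scope a key to a tenant
// prefix, and "broker" otherwise (assumed rule).
func ModeFor(p StorageCredentialProvider) string {
	if p.Capabilities().PrefixScoped {
		return "credential"
	}
	return "broker"
}

func main() {
	providers := []StorageCredentialProvider{
		staticProvider{"do-spaces", Capabilities{PrefixScoped: false, STS: false}},
		staticProvider{"r2", Capabilities{PrefixScoped: true, STS: true}},
	}
	for _, p := range providers {
		fmt.Printf("%s -> %s\n", p.Name(), ModeFor(p))
	}
}
```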

Tests (CLAUDE.md rule 18 — registry-iterating, not hand-typed):
  - contract_test.go iterates ListRegistered() and validates every
    backend satisfies the interface
  - dospaces_test.go: capability shape, shared-master-key issuance
  - r2_test.go: mocks Cloudflare R2 API; asserts the buckets/keys
    request body carries parameters.prefixes (prefix-scoping)
    AND the temp-creds request carries ttlSeconds + session token
  - s3_test.go: stub AssumeRole; asserts session policy carries
    Condition.StringLike.s3:prefix = <token>/*

build/vet/test green on instant.dev/common.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(queueprovider): per-tenant queue isolation interface + 4 backends

MR-P0-5 (NATS per-tenant isolation, 2026-05-20). Held architecture P0.
See NATS-ISOLATION-MIGRATION-2026-05-20.md in repo root for the design doc.

# What this adds

`common/queueprovider/` — provider-agnostic interface for per-tenant queue
credential issuance, mirroring the `common/storageprovider/` pattern.

Implementations:
- nats/      — real impl, NATS operator-mode (per-tenant accounts + signed
               user JWTs via nats-io/nkeys + nats-io/jwt/v2). Falls back to
               legacy_open transparently when no operator seed is configured,
               so api can deploy BEFORE the operator runs `nsc generate`.
- rabbitmq/  — skeleton; ErrNotImplemented. Portability proof.
- kafka/     — skeleton; ErrNotImplemented. Portability proof.
- legacyopen/— cutover shim returning no creds (grandfathered behavior).

# Why

NATS in `instant-data` runs unauthenticated. Any pod in the cluster can dial
nats://nats.instant-data.svc.cluster.local:4222 and read/write every other
tenant's subjects + JetStream streams. The "subject prefix derived from token"
pattern is naming convention, not isolation.

Post-cutover: tenant accounts are signed by the operator key; each tenant
gets its own NATS account = its own JetStream namespace = its own subject
namespace. Cross-tenant pub/sub is denied at the server.

# Tests

- contract_test.go iterates every registered backend (CLAUDE.md rule 18) —
  no hand-typed slices.
- nats/nats_test.go verifies (a) IssueIsolatedCredentials mints a valid
  user JWT with subject-scoped permissions, (b) two tenants get
  DISJOINT subject allow-lists (the breach we're fixing), (c) TTL applies
  to user JWT expiry, (d) Revoke pushes an updated account claim.

# Coverage block

Symptom:       NATS unauthenticated cross-tenant access
Enumeration:   rg -F 'nats://' across all 6 repos — see design doc
Sites found:   ~36 hits across api/worker/provisioner/common/infra/dashboard
Sites touched: common/queueprovider lands the interface; this PR ships
               common only. api wires the interface in a paired PR.
Coverage test: TestRegistry_AllProvidersSatisfyContract +
               TestNATS_TwoTenants_DisjointSubjectPermissions
Live verified: pending operator key generation (needs operator action)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* common: add readiness package for deep /readyz checks

Shared library for the api / worker / provisioner deep readiness probe.
Each service mounts a /readyz handler that runs component-by-component
checks (platform_db, brevo, razorpay, do_spaces, provisioner_grpc,
river, etc.) in parallel under a per-check 10s cache, then derives
overall=ok|degraded|failed per the per-service criticality matrix.

Wired to k8s readinessProbe (not livenessProbe — a Brevo outage MUST
NOT SIGKILL every api pod). A failed critical check returns 503 so
kubelet pulls the pod from the Service endpoints; a failed non-critical
check returns 200 + overall=degraded so the pod keeps serving while
the NR alert fires for the operator. This is the surface the Brevo
silent-rejection bug from 2026-05-20 would have caught weeks earlier.
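
A minimal sketch of the overall-status derivation described above: any failed critical check yields `failed` with HTTP 503, a failed non-critical check yields `degraded` with HTTP 200. The `CheckResult` type and `Overall` function are hypothetical names; the parallel execution and the 10s cache are omitted.

```go
package main

import (
	"fmt"
	"net/http"
)

// CheckResult is a hypothetical per-component outcome.
type CheckResult struct {
	Name     string
	Critical bool
	OK       bool
}

// Overall folds individual results into the overall status string and the
// HTTP status code the /readyz handler would return.
func Overall(results []CheckResult) (string, int) {
	status := "ok"
	for _, r := range results {
		if r.OK {
			continue
		}
		if r.Critical {
			// One failed critical check is enough to pull the pod from the
			// Service endpoints.
			return "failed", http.StatusServiceUnavailable
		}
		status = "degraded"
	}
	return status, http.StatusOK
}

func main() {
	status, code := Overall([]CheckResult{
		{Name: "platform_db", Critical: true, OK: true},
		{Name: "brevo", Critical: false, OK: false},
	})
	fmt.Println(status, code)
}
```

Which checks count as critical is decided per service; the split used in `main` is illustrative only.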

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(storageprovider): accept shared-key / shared-master-key as do-spaces aliases

Live prod deploys OBJECT_STORE_BACKEND=shared-key (legacy naming from
api/internal/config.go mode-resolution), which previously failed
NormalizeBackend() and made the factory return ErrUnknownBackend.
This commit teaches the factory to collapse "shared-key" / "shared_key" /
"sharedkey" / "shared-master-key" / "shared_master_key" onto "do-spaces",
matching the storage-mode label surfaced in /storage/new responses.
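
A minimal sketch of the alias collapsing described above. The alias spellings and the canonical names come from this commit log; the function signature, the lower-casing and trimming, and the error text are assumptions and may differ from the real `NormalizeBackend`.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// ErrUnknownBackend is returned for empty or unrecognised values.
var ErrUnknownBackend = errors.New("unknown storage backend")

// NormalizeBackend maps an operator-supplied value onto a canonical backend
// name. Legacy "shared key" spellings collapse onto "do-spaces".
func NormalizeBackend(raw string) (string, error) {
	switch strings.ToLower(strings.TrimSpace(raw)) {
	case "do-spaces",
		"shared-key", "shared_key", "sharedkey",
		"shared-master-key", "shared_master_key":
		return "do-spaces", nil
	case "r2":
		return "r2", nil
	case "s3":
		return "s3", nil
	case "minio":
		return "minio", nil
	default:
		// Empty values are rejected as well, so a missing env var fails
		// loudly instead of silently selecting a provider.
		return "", ErrUnknownBackend
	}
}

func main() {
	for _, v := range []string{"shared-key", "r2", ""} {
		backend, err := NormalizeBackend(v)
		fmt.Printf("%q -> %q, err=%v\n", v, backend, err)
	}
}
```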

Coverage block (per CLAUDE.md rule 17):
  Symptom:        live OBJECT_STORE_BACKEND=shared-key didn't match factory enum
  Enumeration:    grep -rn 'NormalizeBackend\|OBJECT_STORE_BACKEND' common/ api/
  Sites found:    2 (factory.go switch + contract_test.go cases)
  Sites touched:  2
  Coverage test:  TestNormalizeBackend covers shared-key + variants
  Live verified:  next deploy of api will boot cleanly with the existing
                  k8s secret instead of crashing on unknown-backend.

Closes P1 from DOC-REALITY-DELTA-2026-05-20.md §3.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* storageprovider: B17-P1 godoc fix + canonical Backend constants

Two B17 BugBash findings for the SDK-side storage abstraction:

1. Config.Backend godoc claimed "empty or unknown values land on minio".
   The implementation actually returns ErrUnknownBackend for empty/unknown
   Backend values (deliberately — defaulting to a real provider has masked
   operator misconfiguration in the past). Godoc updated to match the
   shipped behavior and explain why empty is rejected loudly.

2. Canonical Backend identifiers exported as constants
   (BackendDOSpaces / BackendR2 / BackendS3 / BackendMinIO) so callers
   can compare against typed names instead of stringly-typed magic
   strings. BackendSharedKey kept as a Deprecated: alias for legacy
   operator configs that emitted "shared-key"; NormalizeBackend collapses
   it to BackendDOSpaces — both reach the same implementation.

Gate green: go build / vet / test ./storageprovider/... all PASS.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* storageprovider: B17 P2/P3 sweep — hardened sanitiser + Capabilities docs

Closes the storage-broker P2/P3 findings from BUGBASH-2026-05-20 (B17).
P0/P1s on the broker route (rate-limit, auth, signing key) ship separately
in the api repo (they touch handler middleware, not common).

Fixes in this commit:

* B17-STORAGE-P2-14 — Add common/storageprovider/sanitise.go with
  SanitiseTenantKey(in string) string. The api-side legacy
  `sanitisePresignKey` covers `..`, `.`, leading `/` and double-slash but
  not the shapes the audit flagged:
    - URL-encoded `..` (%2e%2e, %2E%2E, ..%2f, mixed case, double-encoded)
    - NUL bytes (raw \x00 and percent-encoded %00) anywhere in the key
    - Windows-style \\\\ separators that minio-go treats as literals
    - Mixed Unicode dots (documented as NOT collapsed — homoglyphs like
      U+2025 are regular key segments)
  Sanitisation is conservative: `.` / `..` components are DROPPED, never
  path-resolved. That's strictly safer than path.Clean (which would pop a
  legitimate parent segment if a tenant snuck `..` past the decoder).
  Tests cover 25+ traversal shapes and pin three invariants:
    - no leading slash on output
    - no `.` or `..` component survives
    - no NUL byte survives
  The api's legacy sanitiser is kept for now; migration of callsites is a
  separate slice — this commit is the canonical helper + coverage.

* B17-STORAGE-P2-16 — Document the previously "dead" Capabilities fields
  (ServerAccessLogs, MaxKeysPerAccount) explicitly as INFORMATIONAL ONLY.
  Both are populated by every backend impl (do-spaces 200, r2 0, s3 0,
  minio 0) but consumed by no routing code today. The doc now spells out
  why they exist (operator audits + future credential-pool / cap-alert
  hooks have one source of truth) and tells readers NOT to branch routing
  decisions on them. Avoids the next reviewer concluding they're dead and
  removing them, breaking forward-compat for consumers that started
  reading the fields after the abstraction shipped.
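
A minimal sketch of the drop-not-resolve sanitisation described for `SanitiseTenantKey` above. This is not the shipped helper. The bounded decode loop, the conversion of backslashes to `/`, and the handling of malformed percent-escapes are assumptions chosen to satisfy the three stated invariants (no leading slash, no `.` or `..` component, no NUL byte).

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// maxDecodeRounds bounds how many layers of percent-encoding are peeled off.
const maxDecodeRounds = 3

// SanitiseTenantKey returns a key with traversal components dropped.
func SanitiseTenantKey(in string) string {
	s := in
	// Decode repeatedly so double-encoded shapes such as %252e%252e are
	// reduced to their literal form before inspection.
	for i := 0; i < maxDecodeRounds; i++ {
		dec, err := url.PathUnescape(s)
		if err != nil || dec == s {
			break
		}
		s = dec
	}
	s = strings.ReplaceAll(s, "\x00", "") // strip raw and decoded NUL bytes
	s = strings.ReplaceAll(s, "\\", "/")  // treat backslash as a separator (assumption)

	var kept []string
	for _, part := range strings.Split(s, "/") {
		// Empty, "." and ".." components are dropped, never resolved, so a
		// ".." can not pop a legitimate parent segment.
		if part == "" || part == "." || part == ".." {
			continue
		}
		kept = append(kept, part)
	}
	return strings.Join(kept, "/")
}

func main() {
	for _, k := range []string{"../../etc/passwd", "%2e%2e/secret", "a/../b"} {
		fmt.Printf("%q -> %q\n", k, SanitiseTenantKey(k))
	}
}
```

Note the difference from `path.Clean`: for `a/../b` this sketch returns `a/b`, while `path.Clean` would resolve the `..` and return `b`.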

Coverage block per CLAUDE.md rule 17:
  Symptom:        path-traversal sanitiser missing URL-encoded / NUL /
                  Windows-separator shapes (B17-STORAGE-P2-14) + dead
                  Capabilities fields with no consumer (B17-STORAGE-P2-16)
  Enumeration:    `grep -rn sanitisePresignKey api/` (1 site, kept) +
                  `grep -rn 'ServerAccessLogs\|MaxKeysPerAccount'`
                  (5 sites: provider.go + 4 backend impls; doc-only
                  change, no behavior delta)
  Sites found:    2 sanitisers + 5 Capabilities field references
  Sites touched:  1 new canonical sanitiser in common (api-side migration
                  deferred — sanitise.go is the canonical surface; api's
                  legacy sanitisePresignKey is documented in
                  api/internal/handlers/storage_presign.go and will be
                  swapped in a follow-up slice) + provider.go godoc
  Coverage test:  TestSanitiseTenantKey_DefenseInDepth (25 cases) +
                  TestSanitiseTenantKey_NoLeadingSlash +
                  TestSanitiseTenantKey_NoTraversalComponentSurvives +
                  TestSanitiseTenantKey_StripsNUL
  Gates green:    go build ./... clean / go vet ./... clean /
                  go test ./... -count=1 PASS (all 12 packages green;
                  ok instant.dev/common/storageprovider 4.398s)
  Live verified:  Library change — api/worker/provisioner pick it up on
                  their next CI run (they depend on instant.dev/common via
                  go.mod replace or version bump).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(plans): B6-P3 — growth.deployments_apps 5 → 50

Pro's deployments_apps = 10; the previous Growth value of 5 placed
Growth ($99/mo) BELOW Pro ($49/mo) on a customer-facing dimension.
Bumped to 50 — preserves tier-ladder ordering above Pro while staying
short of Team's unlimited (-1).

Kept synchronised with api/plans.yaml (the api repo's wave-3 consolidated
commit also flips the value); the api's tier-ladder invariants pinning
test loads api/plans.yaml directly, so this commit only affects the
embedded defaultYAML fallback used in package-default tests.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* security(readiness): redact secrets in scrub() before truncation

Wave-3 audit P1, 2026-05-21.

scrub() in common/readiness/checks.go truncated upstream errors to 80
chars but did NOT redact credential fragments. A real-world pq error
like 'password authentication failed for user "instant" password=...'
would surface verbatim via the publicly-reachable /readyz endpoint on
api/worker/provisioner.

Affects two callsites: PingDB, PingRedis. HTTPHeadCheck + GRPCHealth
already used scrubNetError which maps to a fixed enum.

Fix:
  - Redact BEFORE truncate. Truncate-first leaks credentials that land
    in the first 80 chars of the upstream message.
  - Package-level regexp registry covers: pq password=/passwd=/pwd=
    kv pairs, URL-embedded credentials (scheme://user:pass@host), pq
    'for user "..."' username leak (semi-sensitive), Authorization:
    Bearer/Basic, known secret-shape prefixes (xkeysib-, sk-, rzp_),
    catch-all 32+ hex.

Tests (CLAUDE.md rule 18 — registry-iterating, not hand-typed):
  - TestScrub_RedactsDBPassword, _URLCredentials, _Bearer, _HexSecrets,
    _KnownPrefixes — per-pattern unit assertions
  - TestScrub_RedactsBeforeTruncating — pins the load-bearing
    redact-before-truncate invariant
  - TestScrub_RegistryWalk — 15-row registry walks every shape; a new
    secretPatterns entry without a registry row trips review
  - TestPingRedis_RedactsCredentialsEndToEnd — exercises the public
    callsite end-to-end via fakePinger
  - TestScrub_TruncatesAfterRedaction / _TrimsWhitespace /
    _PreservesNonSecretShape — defensive regression coverage

Coverage block:
  Symptom:       /readyz last_error leaked DB/URL/Bearer creds
  Enumeration:   rg -F 'scrub(' common/readiness
  Sites found:   2 (PingDB, PingRedis)
  Sites touched: 2 — fix is in scrub() itself; both callers inherit
  Coverage test: TestScrub_RegistryWalk + TestPingRedis_RedactsCredentialsEndToEnd
  Live verified: /readyz JSON shape — last_error empty in healthy
                 state on api/worker/provisioner; degraded paths will
                 now redact

ExportForTest pattern keeps the scrub() helper unexported in
production binaries while letting external _test packages assert
on the raw output directly.

Gate: `cd common && go build ./... && go vet ./... && go test
./readiness/... -count=1 -race` is all green (24 tests, including 15
registry rows).

The pre-existing plans/TestDeploymentsAppsLimit_Tiers failure comes from
cc97d4f (growth 5→50) and is out of scope for this security fix.

* fix(bugbash 2026-05-21): NATS AccountSeed for post-restart revocation + test alignment (#14)

* fix(queueprovider/nats): A04-F3 — expose AccountSeed for post-restart revocation

Migration 060 added resources.queue_account_seed_encrypted to make NATS account
revocation survive a provisioner pod restart, but IssueTenantCredentials was
discarding the freshly minted account seed (`_ = accountSeed`). Without the
seed reaching the api caller, the column was never populated and
RevokeWithSeed could never re-sign the account claim after a restart wiped the
in-memory accountCache.

This change:
  - Adds TenantCreds.AccountSeed (documented as a secret; NEVER log).
  - Populates AccountSeed in nats.IssueTenantCredentials.
  - Adds round-trip test proving RevokeWithSeed works without accountCache
    (simulates the post-restart path that migration 060 was built for).

Cross-repo: api + worker must (a) bump common, (b) AES-256-GCM-encrypt
AccountSeed via the existing keyring and persist it to
queue_account_seed_encrypted, (c) decrypt + pass it to RevokeWithSeed on
teardown. Tracked separately. Forward-compatible: AccountSeed is only populated
on isolated provisions, so legacy_open prod is unaffected.

Coverage block (rule 17):
  Symptom:       queue_account_seed_encrypted always NULL; revocation no-ops post-restart
  Enumeration:   rg -n 'AccountSeed|queue_account_seed_encrypted' common/
  Sites found:   3 (TenantCreds field, IssueTenantCredentials return, RevokeWithSeed param)
  Sites touched: all 3 (RevokeWithSeed already accepted seed; populating it now activates the path)
  Coverage test: TestNATS_IssueExposesAccountSeed_AndRevokeWithSeed_RoundTrips

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(test): growth tier DeploymentsAppsLimit asserts 50 (wave-3 BugBash value)

Wave-3 BugBash bumped growth tier deployments_apps from 5 → 50 in plans.yaml; test
was not updated. Test fix only — plans.yaml + common/plans/plans.go defaultYAML
are the authoritative source.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* ci: Tier 1 OSS security scanners

Adds GitHub-native + free OSS vulnerability scanners. 100% free for
public repos.

- CodeQL with security-extended query suite
- Dependabot for gomod + github-actions
- govulncheck (Go reachability-filtered CVE scan)
- OSV-Scanner (cross-ecosystem CVE scan)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* ci: scanner workflows clone sibling proto repo

The Tier 1 CodeQL + govulncheck workflows failed on PR #16 because
common uses `replace instant.dev/proto => ../proto` in go.mod.

Fix: each workflow now checks out common into ./common, plus clones
the public sibling repo InstaNode-dev/proto.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore(go): bump toolchain to 1.25.10 — fixes reachable stdlib CVEs

govulncheck on PR #16 flagged Go-stdlib vulnerabilities reachable
from production code paths. All fixed in Go 1.25.9–1.25.10.

Also merges any in-flight master commits onto the scanner-install
branch.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>