tracking: prod-bug-resistance checklist (umbrella) #34

## Goal

Single tracking surface for everything we need before and after the first paying customer touches the system. Each item links out to its own implementation Issue. This Issue stays open until the definition of done at the bottom is met.

## Tier 1 — must-have, cheap, do this week

- [ ] **#31** — Prod branch as deploy target (`main → prod` flow)
- [ ] **#33** — Staging environment (Supabase + Vapi + Telnyx + Stripe in test mode)
- [ ] **PR template with smoke checklist** for every `main → prod` PR
  - Place test call against staging
  - Order 2 items, verify cart at finalize
  - Decline path ("no" to SMS consent) — verify no SMS, order cancelled
  - Happy path — verify SMS arrives with Stripe URL
  - Pay test card (`4242…`) — verify webhook fires, kitchen sees order
  - Kitchen advance flow — verify status SMS to customer
  - File as separate Issue when ready
- [ ] **Branch protection on `prod`** — required CI checks, no force-push, no direct commits, linear history. File when #31 lands.
- [ ] **Migration deploy workflow** (open follow-up #5) — folded into #33 sub-task H
- [ ] **Documented rollback runbook** — exact commands per environment. Update [developer/m8-runbook.md](developer/m8-runbook.md). File as Issue.
- [ ] **API contract fixtures + tests** — Vapi, Stripe, Telnyx response schemas validated by Vitest. Catches "vendor changed validator" the moment it happens. File as Issue.
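
A minimal sketch of what a contract fixture check could look like, written as plain TypeScript so it runs without a test runner; in the real suite the final comparison would presumably sit inside a Vitest `expect`. Everything named here is hypothetical: `contractViolations`, `ContractSchema`, and the fixture fields are illustrative and do not reflect any vendor's documented schema. The idea is to record a real response from staging, list the fields our code depends on, and fail loudly when a field disappears or changes type.

```typescript
// Structural check for a recorded vendor response (fixture).
// All names and fields below are illustrative assumptions.

type FieldType = "string" | "number" | "boolean" | "object";
type ContractSchema = Record<string, FieldType>;

// Returns human-readable mismatches. An empty list means the payload still
// has every field we depend on, with the type we expect.
function contractViolations(payload: unknown, schema: ContractSchema): string[] {
  if (typeof payload !== "object" || payload === null) {
    return ["payload is not an object"];
  }
  const record = payload as Record<string, unknown>;
  const problems: string[] = [];
  for (const [field, expected] of Object.entries(schema)) {
    const value = record[field];
    if (value === undefined) {
      problems.push(`missing field: ${field}`);
    } else if (value === null || typeof value !== expected) {
      const actual = value === null ? "null" : typeof value;
      problems.push(`wrong type for ${field}: expected ${expected}, got ${actual}`);
    }
  }
  return problems;
}

// Hypothetical fields a payment webhook handler might read.
const paymentEventSchema: ContractSchema = {
  id: "string",
  type: "string",
  amount_total: "number",
};

// Hypothetical fixture; a real one would be captured from a staging call.
const paymentEventFixture = {
  id: "evt_test_1",
  type: "payment.completed",
  amount_total: 2450,
};

console.log(contractViolations(paymentEventFixture, paymentEventSchema));
```

A check like this only notices vendor drift if the fixtures are refreshed from real staging traffic; against a stale fixture it guards our own parsing code only.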

## Tier 2 — high-value, ~half-day each

- [ ] **Synthetic prod smoke test cron** — Vapi outbound test call every 15 min against prod, verifies SMS+payment loop. Page on 3 consecutive failures. File as Issue.
- [ ] **Error alerting** — Edge Function 5xx → Slack/email webhook. Today errors are invisible until a customer complains. File as Issue.
- [ ] **Real-time SLO dashboard** — call success rate, SMS delivery rate, payment completion rate. Could start with a SQL query + cron, evolve into Grafana later. File as Issue.
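
The "page on 3 consecutive failures" rule for the synthetic smoke test can be sketched as a pure function, assuming the cron job persists a small counter between runs (for example in a database row; where it is stored is not decided here). `recordRun`, `StreakState`, and `PAGE_THRESHOLD` are hypothetical names. This sketch pages once, on the run that reaches the threshold, rather than on every later failure; that is a design choice, not something the checklist specifies.

```typescript
// State the cron job would load before a run and save after it (assumed).
interface StreakState {
  consecutiveFailures: number;
}

interface StreakResult {
  state: StreakState;
  shouldPage: boolean;
}

const PAGE_THRESHOLD = 3;

// Folds one smoke-test result into the running streak.
// A pass resets the streak; a failure extends it. shouldPage is true only
// on the run where the streak first reaches the threshold.
function recordRun(
  prev: StreakState,
  passed: boolean,
  threshold: number = PAGE_THRESHOLD,
): StreakResult {
  if (passed) {
    return { state: { consecutiveFailures: 0 }, shouldPage: false };
  }
  const consecutiveFailures = prev.consecutiveFailures + 1;
  return {
    state: { consecutiveFailures },
    shouldPage: consecutiveFailures === threshold,
  };
}

// Usage: replay a sequence of results the way successive cron runs would.
let state: StreakState = { consecutiveFailures: 0 };
for (const passed of [false, false, true, false, false, false]) {
  const result = recordRun(state, passed);
  state = result.state;
  if (result.shouldPage) {
    console.log("page on-call: smoke test failed 3 times in a row");
  }
}
```

Two failures followed by a pass do not page, because the pass resets the counter.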

## Tier 3 — medium-value, do after first paying customer

- [ ] **Feature flags per-restaurant** — `experimental_*` columns on `restaurants`. Roll out new behavior on Sui's first, watch a week, then enable for others. File as Issue.
- [ ] **E2E test suite against staging** — Playwright/Vapi outbound API. Codifies the smoke checklist into automation. File as Issue.
- [ ] **Postmortem template** — every prod incident gets a 1-pager. Add to [developer/](developer/). File as Issue.
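
For the per-restaurant feature flags, here is a rough sketch of how a flag lookup might read, assuming each flag is a boolean `experimental_*` column on the `restaurants` row. Only the column prefix comes from the checklist; `RestaurantRow`, `isExperimentEnabled`, and the `sms_retry` flag are hypothetical. A missing or non-`true` column is treated as off, so new behavior stays disabled for any restaurant that has not been opted in.

```typescript
// Assumed shape of a row from the `restaurants` table.
interface RestaurantRow {
  id: string;
  name: string;
  [column: string]: unknown; // experimental_* columns land here
}

// True only when the matching experimental_* column is exactly `true`.
// Missing, null, or non-boolean values all count as disabled.
function isExperimentEnabled(row: RestaurantRow, experiment: string): boolean {
  return row[`experimental_${experiment}`] === true;
}

// Usage: the first restaurant is opted in, the second is not.
const pilot: RestaurantRow = { id: "r1", name: "Sui's", experimental_sms_retry: true };
const other: RestaurantRow = { id: "r2", name: "Other", experimental_sms_retry: false };

console.log(isExperimentEnabled(pilot, "sms_retry"));
console.log(isExperimentEnabled(other, "sms_retry"));
```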

## Tier 4 — aspirational

- [ ] **Hotfix flow documented** — `hotfix/*` branches off `prod`, fixed, merged, cherry-picked back to `main`. File as Issue.
- [ ] **Canary / progressive rollout** — for high-risk features, route a subset of restaurants first. Probably overkill until 10+ tenants.
- [ ] **Multi-region failover** — out of scope for v1.

## Anti-patterns we already learned the hard way (don't repeat)

- *"Looks right per docs, ship it."* Vapi/Stripe/Telnyx APIs evolve. Speculative changes need a real-call test against staging before prod. (PRs #28→#30, May 4–5 2026.)
- *Same person reviewing their own changes* — Greptile is the substitute reviewer. Don't merge PRs that Greptile flagged P1 without addressing the finding or explicitly accepting the risk in the reply.
- *Deploys triggered by green CI alone* — CI catches code regressions, not config or contract regressions. Both need a real call against staging to surface.
- *Mixing infra changes with feature changes in one PR* — `serverMessages` snuck into a feature PR's scope and broke calls. Infra deserves its own PRs.

## How to use this Issue

- Each unchecked sub-item should have its own Issue when ready to work on.
- Update this Issue's checkboxes as sub-Issues land.
- When all Tier 1 + Tier 2 items are checked and the definition of done below is met, this Issue gets closed.

## Definition of done for this umbrella

All Tier 1 and Tier 2 items are closed AND verified on staging, AND the production smoke checklist has been run end-to-end with no regressions for two consecutive releases.
