tracking: prod-bug-resistance checklist (umbrella) #34

@ByteStreams-AI

Goal

Single tracking surface for everything we need both before and after the first paying customer touches the system. Each item links out to its own implementation Issue once that Issue is filed. This Issue stays open until all Tier 1 and Tier 2 sub-items are complete (see the definition of done below).

Tier 1 — must-have, cheap, do this week

Tier 2 — high-value, ~half-day each

  • Synthetic prod smoke test cron — Vapi outbound test call every 15 min against prod, verifies SMS+payment loop. Page on 3 consecutive failures. File as Issue.
  • Error alerting — Edge Function 5xx → Slack/email webhook. Today errors are invisible until a customer complains. File as Issue.
  • Real-time SLO dashboard — call success rate, SMS delivery rate, payment completion rate. Could start with a SQL query + cron, evolve into Grafana later. File as Issue.
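The paging rule in the smoke-test item and the rates in the dashboard item can be sketched in a few lines. This is a minimal illustration rather than the project's implementation: `should_page`, `success_rate`, and the boolean result lists are hypothetical names, and only the threshold of 3 comes from the checklist.

```python
from typing import Optional, Sequence

# From the checklist: page on 3 consecutive failures.
PAGE_AFTER_CONSECUTIVE_FAILURES = 3


def should_page(
    results: Sequence[bool],
    threshold: int = PAGE_AFTER_CONSECUTIVE_FAILURES,
) -> bool:
    """Return True when the most recent `threshold` smoke runs all failed.

    `results` is ordered oldest to newest; True means the run passed.
    """
    if threshold <= 0:
        raise ValueError("threshold must be positive")
    if len(results) < threshold:
        return False  # not enough history to call it an outage
    return not any(results[-threshold:])


def success_rate(outcomes: Sequence[bool]) -> Optional[float]:
    """Fraction of successful outcomes, or None when there is no data yet."""
    if not outcomes:
        return None
    return sum(outcomes) / len(outcomes)


# Hypothetical usage: one pass followed by three failed runs triggers a page.
page_now = should_page([True, False, False, False])
call_success = success_rate([True, True, False, True])
```

Returning `None` for an empty window keeps "no calls yet" distinct from a 0% rate, which is likely to matter on a low-traffic system. The same `success_rate` helper could serve the call, SMS, and payment metrics if each is reduced to a list of pass/fail outcomes.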

Tier 3 — medium-value, do after first paying customer

  • Feature flags per-restaurant — `experimental_*` columns on `restaurants`. Roll out new behavior on Sui's first, watch a week, then enable for others. File as Issue.
  • E2E test suite against staging — Playwright/Vapi outbound API. Codifies the smoke checklist into automation. File as Issue.
  • Postmortem template — every prod incident gets a 1-pager. Add to developer/. File as Issue.
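One way to read the `experimental_*` columns is a small lookup that treats a missing or NULL column as off. The sketch below assumes each `restaurants` row is available as a dict and that the flag columns are booleans; `is_enabled` and the flag name `new_sms_copy` are hypothetical.

```python
from typing import Any, Mapping

# Column naming taken from the checklist item; everything else is illustrative.
FLAG_PREFIX = "experimental_"


def is_enabled(restaurant: Mapping[str, Any], flag: str) -> bool:
    """Return True only if the row has the flag column set to True.

    Missing columns and NULL (None) values count as disabled, so a new flag
    stays off for every tenant until it is switched on explicitly.
    """
    return restaurant.get(FLAG_PREFIX + flag) is True


# Hypothetical rows: the pilot restaurant has the flag on, the other does not.
pilot = {"id": 1, "experimental_new_sms_copy": True}
other = {"id": 2, "experimental_new_sms_copy": None}

pilot_on = is_enabled(pilot, "new_sms_copy")
other_on = is_enabled(other, "new_sms_copy")
```

Defaulting to off fits the rollout described above: the new behavior reaches one restaurant first, and the others are unaffected until their column is set.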

Tier 4 — aspirational

  • Hotfix flow documented — branch `hotfix/*` off `prod`, fix, merge back into `prod`, then cherry-pick the fix to `main`. File as Issue.
  • Canary / progressive rollout — for high-risk features, route subset of restaurants first. Probably overkill until 10+ tenants.
  • Multi-region failover — out of scope for v1.

Anti-patterns we already learned the hard way (don't repeat)

  • "Looks right per docs, ship it." Vapi/Stripe/Telnyx APIs evolve. Speculative changes need a real-call test against staging before prod. (PRs #28 "fix(vapi): subscribe to end-of-call-report so SMS dispatch can fire" and #30 "fix(vapi): drop serverMessages — phone-number-level subscription", May 4–5 2026.)
  • Same person reviewing their own changes — Greptile is the substitute reviewer. Don't merge a PR that Greptile flagged P1 without addressing the finding or explicitly accepting the risk in the reply.
  • Deploys triggered by green CI alone — CI catches code regressions, not config or contract regressions. Both need a real call against staging to surface.
  • Mixing infra changes with feature changes in one PR — `serverMessages` snuck into a feature PR's scope and broke calls. Infra deserves its own PRs.

How to use this Issue

  • Each unchecked sub-item should have its own Issue when ready to work on.
  • Update this Issue's checkboxes as sub-Issues land.
  • When all Tier 1 and Tier 2 items are checked and the definition of done below is met, this Issue gets closed.

Definition of done for this umbrella

All Tier 1 and Tier 2 items closed AND verified on staging AND a production smoke checklist has been run end-to-end with no regressions for two consecutive releases.
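The three conditions above combine as a plain AND, which a short predicate makes explicit. This is a sketch under one assumption that the text does not state: "two consecutive releases" is read as the two most recent releases. The function and parameter names are hypothetical.

```python
from typing import Sequence


def umbrella_done(
    tier1_tier2_closed: bool,
    verified_on_staging: bool,
    release_smoke_results: Sequence[bool],
) -> bool:
    """Return True when all three definition-of-done conditions hold.

    `release_smoke_results` is ordered oldest to newest; True means the
    production smoke checklist ran end-to-end with no regressions for
    that release. Assumption: only the two most recent releases count.
    """
    last_two_clean = (
        len(release_smoke_results) >= 2 and all(release_smoke_results[-2:])
    )
    return tier1_tier2_closed and verified_on_staging and last_two_clean


# Hypothetical history: an early regression, then two clean releases.
done = umbrella_done(True, True, [False, True, True])
```

Under this reading, a single regression resets the count: a history of clean, regression, clean does not satisfy the rule until one more clean release ships.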
