Before going live:

- [ ] Set up on-call schedule (PagerDuty / Opsgenie / plain phone rotation)
- [ ] Define SEV1/SEV2/SEV3 severity levels and response SLAs
- [ ] Write incident runbook: how to roll back a bad deploy, restart worker, flush Redis queue
- [ ] Add runbook link to status page and internal wiki
- [ ] Test an alert end-to-end (fire a fake alert, confirm the right person is paged)

**Owner:** Eng lead
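
One way to make the severity levels and response SLAs concrete is to write them down as data and check the fake-alert test against them. The sketch below is only an illustration: the `ACK_SLA` values are placeholders rather than recommendations, and `Severity`, `ACK_SLA`, and `ack_within_sla` are hypothetical names, not part of any paging tool.

```python
from datetime import datetime, timedelta
from enum import Enum


class Severity(Enum):
    SEV1 = "SEV1"
    SEV2 = "SEV2"
    SEV3 = "SEV3"


# Placeholder acknowledgement targets (assumptions, not recommendations):
# replace these with the response SLAs the team actually agrees on.
ACK_SLA = {
    Severity.SEV1: timedelta(minutes=15),
    Severity.SEV2: timedelta(hours=1),
    Severity.SEV3: timedelta(hours=24),
}


def ack_within_sla(severity, fired_at, acked_at):
    """Return True if the alert was acknowledged within the target for its severity."""
    return acked_at - fired_at <= ACK_SLA[severity]


if __name__ == "__main__":
    # Example: record when the fake alert fired and when the on-call person acknowledged it.
    fired = datetime(2024, 1, 1, 12, 0)
    acked = fired + timedelta(minutes=10)
    print(ack_within_sla(Severity.SEV1, fired, acked))
```

Recording the fire and acknowledgement times during the end-to-end alert test gives a simple pass/fail check against whatever targets are chosen.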