# Incident Detection

|  | Language Used | Behavior Displayed |
| --- | --- | --- |
| **Novice** | 1. "Other teams (QA, customer support) will notify us of any problems."<br>2. "Problems with our service are obvious; outages are obvious to everyone." | 1. The service team is notified of incidents via manual, external notification mechanisms (ticketing system, phone calls, etc.).<br>2. No baseline metrics are established; service level is described in one of four broad categories ("available", "unavailable", "degraded", "answering, but unavailable").<sup>1</sup> |
| **Beginner** | 1. "Most of the time, we're the first to know when a service has transitioned from available to unavailable (or to another state)."<br>2. "We're the first to know when a service is impacted." | 1. External monitoring is in place to detect, in real time, when a service transitions between the four broad buckets.<br>2. The team is notified automatically when monitoring detects a transition between these buckets. |
| **Competent** | 1. "We've detected a number of service level transitions via the monitoring of very new (and maybe very old) API endpoints; in all cases, MTTD was reduced."<br>2. "We use historical data to make manual, 'first approximation' guesses about service level changes; we're starting to communicate this information outward, potentially in ongoing discussions about SLAs." | 1. Historical data has been collected to establish broad baselines of acceptable service, enough to infer bands within the four buckets.<br>2. External monitoring of infrastructure, API endpoints, and other outward-facing interfaces exists and is recorded in the (historical) monitoring system. |
| **Proficient** | 1. "Other teams can help us monitor our own service because we've provided hooks for them to integrate within their own systems."<br>2. "We prioritize feature requests and bug reports for these monitoring hooks within our development sprints and in our organizational support work; monitoring is a first-class citizen for our team and takes precedence over the deployment of new features."<br>3. "I know that a specific code/infrastructure change caused this specific change in service level; here's how I know..." | 1. Baseline data is comprehensive enough to be statistically correlated with the current code state and mapped to code changes.<br>2. Application internals report monitoring data to the monitoring system.<br>3. Monitoring systems use statistical significance to confirm (or rule out) service anomalies (a minimal sketch of this idea follows the table). |
| **Advanced** | 1. "We've decoupled the deployment of code and/or infrastructure changes, because we can roll those changes back or forward, as necessary, to automatically remediate an issue before any service level impact becomes noticeable."<br>2. "Our team isn't being paged anymore for changes that automation can react to; the number of incidents that on-call engineers have to respond to is measurably down." | 1. Monitoring output is reincorporated into operational behavior in an automated fashion.<br>2. Anomalies do not result in defined "incidents," because operational systems automatically react to statistically significant changes in metrics. |

<sup>1</sup> The distinction between "unavailable" and "answering" is that in the former, the service does not respond to requests at all; in the latter, the service responds but does not provide the requested functionality (e.g., returning HTTP 5xx response codes).
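
The "Proficient" and "Advanced" rows above turn on one mechanism: comparing a current metric against a historical baseline and reacting automatically when the deviation is statistically significant. The snippet below is a minimal sketch of that idea under stated assumptions; the error-rate metric, the three-sigma threshold, and the "roll back" action are illustrative placeholders, not prescriptions from this model.

```python
# Minimal sketch: flag statistically significant deviations from a
# historical baseline and feed the result back into operations.
# All names, values, and the remediation action are illustrative.
from statistics import mean, stdev


def is_anomalous(current: float, baseline: list[float], z_threshold: float = 3.0) -> bool:
    """Return True if `current` deviates from the baseline mean by more than
    `z_threshold` standard deviations (a simple significance test)."""
    mu, sigma = mean(baseline), stdev(baseline)
    if sigma == 0:
        return current != mu  # flat baseline: any change at all is a deviation
    return abs(current - mu) / sigma >= z_threshold


def react(current: float, baseline: list[float]) -> str:
    """"Advanced"-level behavior: a significant anomaly triggers automated
    remediation (e.g. rolling the latest change back) instead of a page."""
    return "roll back latest change" if is_anomalous(current, baseline) else "no action"


if __name__ == "__main__":
    error_rate_baseline = [0.010, 0.012, 0.011, 0.009, 0.010, 0.013]  # hypothetical history
    print(react(0.250, error_rate_baseline))  # -> roll back latest change
    print(react(0.011, error_rate_baseline))  # -> no action
```

In practice, the baseline would come from the historical monitoring system described in the "Competent" row, and the reaction would be wired into deployment tooling rather than returned as a string.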