Ops Issue: [Link to ops ticket]
Authors: [Name of CoE authors]
Pages / escalations before first accept: [Number of team members the escalation went through before somebody accepted]
Time to first response: [How long from first page until a team member accepted and responded]
Number of team participants: [Number of people who took part in the incident response]
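If the header metrics above are computed from paging-tool logs rather than filled in by hand, a minimal sketch of the "time to first response" calculation (timestamp formats and values are assumptions, not tied to any specific paging tool):

```python
from datetime import datetime, timedelta

def time_to_first_response(first_page: str, first_accept: str) -> timedelta:
    """Elapsed time from the first page until a responder accepted.

    Both arguments are ISO-8601 timestamps (hypothetical format; adjust
    to whatever your paging tool actually exports).
    """
    fmt = "%Y-%m-%dT%H:%M:%S"
    return datetime.strptime(first_accept, fmt) - datetime.strptime(first_page, fmt)

# Hypothetical example: paged at 03:02:00, accepted at 03:14:30
delta = time_to_first_response("2024-05-01T03:02:00", "2024-05-01T03:14:30")
print(delta)  # 0:12:30
```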
Provide a high-level description of what happened.
Did we find out about this from an existing alarm? Did we find out from user complaints? Or did we find out some other way?
What behavior did we see? Who was impacted? What services were impacted? What changes did we see on our dashboards/graphs?
What process was used? What tools were used? What did you find in each tool? What were your hunches or assumptions? What did you rule out? Was there an existing runbook addressing confirmation and mitigation steps?
Provide the incident timeline, from the time the issue started, through customer impact, to incident resolution. Which tasks had a positive impact on the outcome? Which had a negative impact? Which had no impact on restoring service?
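A hypothetical illustration of the timeline format (all times and events below are invented to show the expected shape, not taken from a real incident):

```
2024-05-01T02:48  Deployment of a config change completes fleet-wide
2024-05-01T02:55  Error rate begins rising; first customer impact
2024-05-01T03:02  High-severity alarm fires; on-call paged
2024-05-01T03:14  On-call accepts and begins investigation        [positive]
2024-05-01T03:30  Host reboot attempted; no effect on error rate  [no impact]
2024-05-01T03:45  Unrelated cache flush worsens latency           [negative]
2024-05-01T04:05  Config change identified and rolled back        [positive]
2024-05-01T04:20  Error rate back to baseline; incident resolved
```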
Start by asking why the failure happened. Keep asking why until you reach the root cause. There can be more than five questions, but consider five the minimum. It's okay to follow question branches.
Q.
A.
Q.
A.
Q.
A.
Q.
A.
Q.
A.
Root Cause:
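A hypothetical illustration of a completed Five Whys chain (the incident, answers, and root cause below are invented solely to show the expected depth and format):

```
Q. Why did customers see elevated error rates?
A. The service exhausted its database connection pool.
Q. Why was the connection pool exhausted?
A. A deploy doubled the worker count, and each worker opens its own pool.
Q. Why did the deploy double the worker count?
A. A setting intended for a larger host class was applied fleet-wide.
Q. Why was the setting applied fleet-wide?
A. The config system has no per-host-class scoping, so overrides apply globally.
Q. Why didn't pre-production testing catch the exhaustion?
A. Staging load is far below production load, so the pool never saturated there.
Root Cause: Global-only config overrides, combined with non-representative
pre-production load, allowed a host-class-specific setting to exhaust
production database connections.
```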
How was the issue fixed or the customer impact mitigated, and what resources or teams were required to do so?
Did we already know about this potential failure mode? Did we have scheduled work? Were there backlog items that would have prevented this issue?
How can we do better (in alarms, process, automation, response, etc.)? What could we have done to prevent this issue from occurring? How do we make sure it never happens again? What can we do to improve how the incident was handled? Think big; think outside the box.
As a thought exercise, how could the blast radius for a similar event be cut in half?
What follow-up actions are we taking? What scheduled work was created?