Incident Name

Ops Issue: [Link to ops ticket]
Authors: [Name of CoE authors]
Pages / escalations before first accept: [Number of team members the escalation went through before somebody accepted]
Time to first response: [How long from first page until a team member accepted and responded]
Number of team participants:
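
The two paging fields above can be derived from a page log rather than estimated from memory. The following is a minimal sketch, assuming a hypothetical log of `(paged_at, responder, accepted_at)` records, where `accepted_at` is `None` if the responder never accepted. The record layout, the function name `first_response_metrics`, and the sample names are illustrative only and do not come from any particular paging tool. It also assumes that "pages before first accept" counts the pages sent before the one that was accepted.

```python
from datetime import datetime, timedelta

def first_response_metrics(pages):
    """Return (pages_before_first_accept, time_to_first_response).

    `pages` is a hypothetical list of (paged_at, responder, accepted_at)
    tuples; accepted_at is None when that responder never accepted.
    """
    ordered = sorted(pages, key=lambda page: page[0])
    if not ordered:
        return 0, None
    first_page_time = ordered[0][0]
    for index, (_paged_at, _responder, accepted_at) in enumerate(ordered):
        if accepted_at is not None:
            # Pages sent before this one went unanswered.
            return index, accepted_at - first_page_time
    # Nobody accepted: every page went unanswered.
    return len(ordered), None

# Illustrative data only.
pages = [
    (datetime(2024, 1, 1, 3, 0), "alice", None),
    (datetime(2024, 1, 1, 3, 10), "bob", None),
    (datetime(2024, 1, 1, 3, 20), "carol", datetime(2024, 1, 1, 3, 25)),
]
count, delay = first_response_metrics(pages)
print(count, delay)
```

If your team counts the accepting responder as part of the escalation chain, add one to the first value.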

Incident Description

Provide a high level description of what happened.

How was the incident detected?

Did we find out about this from an existing alarm? Did we find out from user complaints? Did we find out by some other method?

What were the symptoms / impact?

What behavior did we see? Who was impacted? What services were impacted? What changes did we see on our dashboards/graphs?

What discovery or investigation was done?

What process was used? What tools were used? What did you find in each tool? What were your hunches or assumptions? What did you rule out? Was there an existing runbook addressing confirmation and mitigation steps?

Timeline

Give the incident timeline, from the time the issue started, through customer impact, to incident resolution. Which tasks had a positive impact on the outcome? Which tasks had a negative impact on the outcome? Which tasks had no impact on restoring service?

5-whys and Root Cause

Start by asking why the failure happened. Keep asking why until you get to the root cause. There can be more than 5 questions, but consider 5 the minimum. It's okay to follow question branches.

Q.

A.

Q.

A.

Q.

A.

Q.

A.

Q.

A.

Root Cause:

How was the issue resolved and the impact mitigated?

How was the issue fixed or the customer impact mitigated, and what resources or teams were required to do so?

Were there existing backlog items for this issue? Was this a known failure mode?

Did we already know about this potential failure mode? Did we have scheduled work? Were there backlog items that would have prevented this issue?

Overall learnings and recommendations

How can we do better (in alarms, process, automation, response, etc.)? What could we have done to prevent this issue from occurring? How do we make sure this never happens again? What can we do to improve how the incident was handled? Think big, think outside the box.
As a thought exercise, how could the blast radius for a similar event be cut in half?

What went well?

What went wrong?

Where did we get lucky?

Recommendations

What are the actionable tasks and follow-ups?

What follow-up actions are we taking? What scheduled work was created?