Typo (availabillity ==> availability)
jaysonmc authored and matthewskelton committed Aug 3, 2021
1 parent c81141f commit 4ee0f72
Showing 1 changed file with 2 additions and 2 deletions.
4 changes: 2 additions & 2 deletions reliability.md
@@ -25,8 +25,8 @@ Method: Use the [*Spotify Squad Health Check*](https://labs.spotify.com/2014/0
| 2\. **User Goals and SLIs** - What should your service/application do from the viewpoint of the user? | We do not have a clear definition of what our application or service does from the user perspective. | We have clear, user-centric definitions of the application/service capabilities and outcomes from a user perspective. |
| 3\. **Understanding users and behavior** - Who are the users of the software and how do they interact with the software? How do you know? | We don't really know how our users interact with our application/service --OR-- We don't really know who our users are. | We have **user personas** validated through user research and we measure and track usage of the applications/services using digital **telemetry**. |
| 4\. **SLIs/SLOs** - How do you **know when users have experienced an outage** or unexpected behaviour in the software? | We know there is an outage or problem when users complain via chat or the help desk. | We proactively monitor the user experience using synthetic transactions across the key user journeys. |
-| 5\. **Service Health** - What is the single most important **indicator or metric** you use to determine the **health and availability of your software** in production/live? | We don't have a single key metric for the health and availabillity of the application/service. | We have a clear, agreed key metric for each application/service and we display this figure on a team-visible dashboard. The dashboard data is updated at least every 10 minutes. |
-| 6\. **SLIs** - What combination of three or four **indicators or metrics** do you use (or could/would you use) to provide a **comprehensive picture of the health and availability** of your software in production/live? | We don't have a set of key metrics for the health and availabillity of the application/service. | We have a clear, agreed set of key metrics for each application/service and we display this figure on a team-visible dashboard. The dashboard data is updated at least every 10 minutes. |
+| 5\. **Service Health** - What is the single most important **indicator or metric** you use to determine the **health and availability of your software** in production/live? | We don't have a single key metric for the health and availability of the application/service. | We have a clear, agreed key metric for each application/service and we display this figure on a team-visible dashboard. The dashboard data is updated at least every 10 minutes. |
+| 6\. **SLIs** - What combination of three or four **indicators or metrics** do you use (or could/would you use) to provide a **comprehensive picture of the health and availability** of your software in production/live? | We don't have a set of key metrics for the health and availability of the application/service. | We have a clear, agreed set of key metrics for each application/service and we display this figure on a team-visible dashboard. The dashboard data is updated at least every 10 minutes. |
| 7\. **Error Budget and similar mechanisms** - How does the team know when to **spend time on operational aspects** of the software (logging, metrics, performance, reliability, security, etc.)? Does that time actually get spent? | We spend time on operational aspects only when there is a problem that needs fixing. | We allocate between 20% and 30% of our time for working on operational aspects and we check this each week. We alert if we have not spent time on operational aspects --OR-- We use SRE Error Budgets to plan our time spent on operational aspects. |
| 8\. **Alerting** - What proportion (approximately) of your time and effort as a team do you spend on **making alerts and operational messages more reliable and more relevant**? | We spend as little time as possible on alerts and operational messages - we need to focus on user-visible features. | We regularly spend time reviewing and improving alerts and operational messages. |
| 9\. **Toil and fixing problems** - What proportion (approx) of your time gets taken up with incidents from live systems and how predictable is the time needed to fix problems? | We do not deal with issues from live systems at all - we focus on new features --OR-- live issues can really affect our delivery cadence and are very disruptive. | We allocate a consistent amount of time for dealing with live issues --OR-- one team member is responsible for triage of live issues each week OR we rarely have problems with live issues because the software works well. |