Add criteria for Poor and Excellent for each question
matthewskelton committed Aug 7, 2018
1 parent b0aee3b commit bf9db66
Showing 1 changed file with 78 additions and 0 deletions.
operability-questions.md: 78 additions & 0 deletions
@@ -11,6 +11,8 @@ Each question requires answers to these key questions:
* **Evidence**: How will you demonstrate this using evidence?
* **Score**: a score of 1 to 5 for how well your approach compares to industry-leading approaches (1 is poor; 5 is excellent)

Use the _Poor_ and _Excellent_ criteria to help guide your scores. These are examples of bad and good extremes; your situation may demand slightly different criteria.

> Print this page and record questions using pen & paper with your team

Copyright © 2018 [Conflux Digital Ltd](https://confluxdigital.net/)
@@ -45,6 +47,9 @@ _4_

> We need a clear understanding of the people and teams that can help to make the software systems work well.
* Poor: _We fix any operator problems after go-live_
* Excellent: _We collaborate with the live service / operations teams from the start of the project_

### Who?

### How?
@@ -57,6 +62,9 @@ _4_

> We should have a clear approach for meeting the needs of operations people.
* Poor: _We respond to operational requests after go-live when tickets are raised by the live service teams_
* Excellent: _We collaborate on operational aspects from the very first week of the engagement/project_

### Who?

### How?
@@ -71,6 +79,9 @@ _4_

> We should be spending a good proportion of time and effort on addressing operational aspects.
* Poor: _We try to spend as little time and effort as possible on operational aspects / We do not track the spend on operational aspects at all_
* Excellent: _We spend 30% of our time and budget addressing operational aspects_

### Who?

### How?
@@ -83,6 +94,9 @@ _4_

> We should be addressing operational aspects on a regular, frequent basis throughout the project, not occasionally or just at the end of the project.
* Poor: _We do not track operational aspects / It can take months to address operational aspects_
* Excellent: _We deploy changes to address operational aspects as frequently as we deploy changes to address user-visible features_

### Who?

### How?
@@ -97,6 +111,9 @@ _4_

> We need clarity about the number and nature of feature toggles for a system. Feature toggles need careful management.
* Poor: _We need to run diffs against config files to determine which feature toggles are active_
* Excellent: _We have a simple UI or API to report the active/inactive feature flags in an environment_ (see the sketch below)
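
A minimal sketch of such a reporting endpoint, using only the Python standard library; the `FLAGS` store and the `/flags` path are hypothetical stand-ins for a real flag source:

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical flag store: in a real system this would be read from your
# config service or environment rather than hard-coded.
FLAGS = {"new-checkout": True, "beta-search": False, "dark-mode": True}

class FlagReportHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/flags":
            self.send_error(404)
            return
        report = {"active": sorted(k for k, v in FLAGS.items() if v),
                  "inactive": sorted(k for k, v in FLAGS.items() if not v)}
        body = json.dumps(report).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    HTTPServer(("", 8080), FlagReportHandler).serve_forever()
```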

### Who?

### How?
@@ -109,6 +126,9 @@ _4_

> We need to be able to change the configuration of software in an environment without redeploying the executable binaries or scripts.
* Poor: _We cannot deploy a configuration change without deploying the software_
* Excellent: _We simply run a config deployment, separate from the software deployment_

### Who?

### How?
@@ -121,6 +141,9 @@ _4_

> We need to ensure that only valid, tested configuration data is being used and that the configuration schema itself is controlled.
* Poor: _We cannot verify the configuration in use_
* Excellent: _We use `sha256sum` hashes to verify the configuration in use_ (see the sketch below)
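
As a sketch, a pipeline step can compute the same digest that `sha256sum` reports and refuse to proceed on a mismatch; the file path and expected digest would come from your pipeline:

```python
import hashlib
import sys

def sha256_of(path: str) -> str:
    """Hex SHA-256 digest of a file, matching `sha256sum` output."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            digest.update(chunk)
    return digest.hexdigest()

if __name__ == "__main__":
    # Usage: verify_config.py <config-file> <expected-sha256>
    path, expected = sys.argv[1], sys.argv[2]
    actual = sha256_of(path)
    if actual != expected:
        sys.exit(f"Config mismatch: expected {expected}, got {actual}")
    print(f"Config verified: {actual}")
```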

### Who?

### How?
@@ -135,6 +158,9 @@ _4_

> We need to define simple ways to report health of the system in ways that are meaningful for that system.
* Poor: _We wait for checks made manually by another team to tell us if our software is healthy_
* Excellent: _We query the software using a standard HTTP healthcheck URL, returning HTTP 200/500, etc. based on logic that we write in the code_ (see the sketch below)
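
A minimal sketch of such an endpoint using the Python standard library; `dependencies_healthy()` stands in for whatever logic is meaningful for your system:

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

def dependencies_healthy() -> bool:
    # Placeholder: check the database, queue depth, disk space, etc.
    return True

class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/health":
            self.send_error(404)
            return
        status = 200 if dependencies_healthy() else 500
        self.send_response(status)
        self.end_headers()
        self.wfile.write(b"OK" if status == 200 else b"UNHEALTHY")

if __name__ == "__main__":
    HTTPServer(("", 8080), HealthHandler).serve_forever()
```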

### Who?

### How?
@@ -147,6 +173,9 @@ _4_

> We need to define simple ways to report health of the system in ways that are meaningful for that system.
* Poor: _We do not have service KPIs defined_
* Excellent: _We use logging and/or time series metrics to emit service KPIs that are picked up by a dashboard_ (see the sketch below)
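
One common shape for this is emitting StatsD-style counters that a metrics backend aggregates for the dashboard. A sketch, assuming a StatsD-compatible listener on the conventional UDP port 8125; the metric name is illustrative:

```python
import socket

def emit_counter(name: str, value: int = 1,
                 host: str = "127.0.0.1", port: int = 8125) -> None:
    """Fire-and-forget a StatsD-style counter over UDP."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(f"{name}:{value}|c".encode(), (host, port))

# Hypothetical KPI: the dashboard charts the rate of completed orders.
emit_counter("shop.orders.completed")
```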

### Who?

### How?
@@ -159,6 +188,9 @@ _4_

> Logging is a key aspect of modern software systems and must be working correctly at all times.
* Poor: _We do not test if logging is working_
* Excellent: _We test that logging is working using BDD feature tests that search for specific log message strings after a particular application behaviour is executed_ (see the sketch below)
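
A sketch of the idea in plain `unittest` form; a full BDD suite would phrase this as a feature scenario and search the aggregated log store, and `take_payment` is a hypothetical behaviour under test:

```python
import io
import logging
import unittest

log = logging.getLogger("payments")

def take_payment(amount: int) -> None:
    # Hypothetical application behaviour under test.
    log.info("payment accepted amount=%s", amount)

class LoggingWorksTest(unittest.TestCase):
    def test_payment_emits_expected_log_line(self):
        buffer = io.StringIO()
        handler = logging.StreamHandler(buffer)
        log.addHandler(handler)
        log.setLevel(logging.INFO)
        try:
            take_payment(42)
        finally:
            log.removeHandler(handler)
        self.assertIn("payment accepted amount=42", buffer.getvalue())

if __name__ == "__main__":
    unittest.main()
```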

### Who?

### How?
@@ -171,6 +203,9 @@ _4_

> Time series metrics are a key aspect of modern software systems and must be working correctly at all times.
* Poor: _We do not test if time series metrics are working_
* Excellent: _We test that time series metrics are working using BDD feature tests that search for specific time series data after a particular application behaviour is executed_ (see the sketch below)
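
A sketch of the same idea for metrics: point the emitter at a throwaway listener and assert that the expected datagram arrives (the StatsD-style wiring here is hypothetical):

```python
import socket

def emit_counter(name: str, addr) -> None:
    """Hypothetical application code: emit a StatsD-style counter over UDP."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(f"{name}:1|c".encode(), addr)

def test_login_emits_a_counter():
    # A throwaway UDP listener stands in for the real metrics agent.
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sink:
        sink.bind(("127.0.0.1", 0))
        sink.settimeout(2)
        emit_counter("app.user.login", sink.getsockname())  # behaviour under test
        assert sink.recv(1024).decode() == "app.user.login:1|c"
```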

### Who?

### How?
@@ -183,6 +218,9 @@ _4_

> Keeping software testable is a key aspect of operability.
* Poor: _We do not explicitly aim to make our software easily testable_
* Excellent: _We run clients and external test packs against all parts of our software within our deployment pipeline_

### Who?

### How?
@@ -197,6 +235,9 @@ _4_

> We need to have clarity about certificate renewal so we avoid systems breaking due to expired certificates.
* Poor: _We do not know when our certificates are going to expire_
* Excellent: _We use certificate monitoring tools to keep a live check on when certs will expire so we can take remedial action ahead of time_ (see the sketch below)
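
A sketch of such a check using the Python standard library; the host and the 30-day threshold are examples, and in practice this would run on a schedule and feed an alerting channel:

```python
import socket
import ssl
from datetime import datetime

def days_until_expiry(host: str, port: int = 443) -> int:
    """Fetch the server certificate and return whole days until it expires."""
    ctx = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=5) as sock:
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            not_after = tls.getpeercert()["notAfter"]
    expires = datetime.utcfromtimestamp(ssl.cert_time_to_seconds(not_after))
    return (expires - datetime.utcnow()).days

if __name__ == "__main__":
    remaining = days_until_expiry("example.com")
    if remaining < 30:  # example remediation threshold
        print(f"WARNING: certificate expires in {remaining} days")
```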

### Who?

### How?
@@ -209,6 +250,9 @@ _4_

> We need to have clarity about certificate renewal so we avoid systems breaking due to expired certificates.
* Poor: _Another team renews and installs SSL/TLS certificates manually_
* Excellent: _We use automated processes to renew and configure SSL/TLS certificates using *Let's Encrypt*_ (see the sketch below)
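
As one possible shape for that automation, a scheduled job can run `certbot renew`, which only renews certificates nearing expiry; this sketch assumes `certbot` is installed and configured for the host, and the reload hook is an example:

```python
import subprocess

# Run from cron or a systemd timer. `certbot renew` renews only certificates
# close to expiry; --deploy-hook runs only after a successful renewal.
# Adjust the hook for your web server or load balancer.
subprocess.run(
    ["certbot", "renew", "--quiet", "--deploy-hook", "systemctl reload nginx"],
    check=True,
)
```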

### Who?

### How?
@@ -221,6 +265,9 @@ _4_

> We need to have clarity about certificate renewal so we avoid systems breaking due to expired certificates.
* Poor: _Another team renews and installs certificates manually_
* Excellent: _We use automated processes to renew and configure certificates using an API_

### Who?

### How?
@@ -233,6 +280,9 @@ _4_

> We need to encrypt data in transit to prevent eavesdropping.
* Poor: _We do not explicitly test for transport security; we assume that another team will configure security for us_
* Excellent: _We test for secure transport as a specific feature of our application_ (see the sketch below)
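
For example, a feature test can assert that the plain-HTTP listener does nothing but redirect to HTTPS; this is a sketch and the hostname is a placeholder:

```python
import http.client

def test_plain_http_only_redirects_to_https():
    conn = http.client.HTTPConnection("app.example.com", 80, timeout=5)
    try:
        conn.request("GET", "/")
        resp = conn.getresponse()
        assert resp.status in (301, 308)  # permanent redirect only
        assert resp.getheader("Location", "").startswith("https://")
    finally:
        conn.close()
```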

### Who?

### How?
@@ -245,6 +295,9 @@ _4_

> We need to mask or hide sensitive data in logs whilst still exposing the surrounding data to teams.
* Poor: _We do not test for data masking in logs_
* Excellent: _We test that data masking is happening by using BDD feature tests that search for specific log message strings after a particular application behaviour is executed_ (see the sketch below)
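
A sketch of masking logic and its test; the pattern shown is deliberately crude, and real card-number detection would be stricter:

```python
import re

PAN_PATTERN = re.compile(r"\b\d{13,16}\b")  # crude card-number matcher

def mask_pan(message: str) -> str:
    """Keep the last four digits of a card number; mask the rest."""
    return PAN_PATTERN.sub(
        lambda m: "*" * (len(m.group()) - 4) + m.group()[-4:], message
    )

def test_card_numbers_never_reach_the_logs():
    masked = mask_pan("payment card=4111111111111111 accepted")
    assert "4111111111111111" not in masked
    assert masked == "payment card=************1111 accepted"
```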

### Who?

### How?
@@ -257,6 +310,9 @@ _4_

> We need to apply patches to public-facing systems as quickly as possible (but still safely) when a Zero-Day vulnerability is disclosed.
* Poor: _Another team is responsible for patching / We do not know if or when a Zero-Day vulnerability occurs_
* Excellent: _We work with the security team to test and roll out a fix, using our automated deployment pipeline to test and deploy the change_

### Who?

### How?
@@ -271,6 +327,9 @@ _4_

> We need to demonstrate that the software can perform well.
* Poor: _We rely on the Performance team to validate the performance of our service or application_
* Excellent: _We run a set of indicative performance tests within our deployment pipeline that are run on every check-in to version control_ (see the sketch below)
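
A sketch of an indicative in-pipeline check; `do_lookup` stands in for the real call under test and the thresholds are examples, not recommendations:

```python
import statistics
import time

def do_lookup(key: str) -> None:
    time.sleep(0.01)  # stand-in for the real call under test

def test_lookup_p95_stays_under_200ms():
    timings = []
    for _ in range(50):
        start = time.perf_counter()
        do_lookup("some-key")
        timings.append(time.perf_counter() - start)
    p95 = statistics.quantiles(timings, n=20)[-1]  # 95th percentile
    assert p95 < 0.200, f"p95 was {p95 * 1000:.0f} ms"
```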

### Who?

### How?
@@ -285,6 +344,9 @@ _4_

> We need to define and share a set of known failure modes or failure conditions so we better understand how the software will operate.
* Poor: _We do not really know how the system might fail_
* Excellent: _We use a set of error identifiers to define the failure modes in our software and we use these identifiers in our log messages_ (see the sketch below)
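
One way to make the identifiers concrete is a shared enumeration that every log call references; the codes and the stock-service example here are hypothetical:

```python
import enum
import logging

log = logging.getLogger("orders")

class ErrorId(enum.Enum):
    """Known failure modes, agreed with the operations team."""
    DB_UNAVAILABLE = "ORD-1001"
    PAYMENT_TIMEOUT = "ORD-1002"
    STOCK_TIMEOUT = "ORD-1003"

def reserve_stock(item: str) -> None:
    try:
        ...  # call the stock service (omitted)
    except TimeoutError:
        log.error("[%s] stock reservation timed out item=%s",
                  ErrorId.STOCK_TIMEOUT.value, item)
        raise
```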

### Who?

### How?
@@ -297,6 +359,9 @@ _4_

> We need to demonstrate that the system does not overload downstream systems with reconnection attempts, and uses sensible back-off schemes.
* Poor: _We do not really know whether connection retry works properly_
* Excellent: _We test the connection retry logic as part of our automated deployment pipeline_ (see the sketch below)
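
A sketch of capped exponential backoff with full jitter; the attempt count and delay bounds are illustrative:

```python
import random
import time

def connect_with_backoff(connect, max_attempts: int = 5,
                         base_delay: float = 0.5, cap: float = 30.0):
    """Retry `connect` with capped exponential backoff plus full jitter,
    so a recovering downstream system is not flooded with reconnections."""
    for attempt in range(1, max_attempts + 1):
        try:
            return connect()
        except ConnectionError:
            if attempt == max_attempts:
                raise
            delay = min(cap, base_delay * 2 ** (attempt - 1))
            time.sleep(random.uniform(0, delay))
```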

### Who?

### How?
@@ -311,6 +376,9 @@ _4_

> We need to be able to trace a request across multiple servers/containers/nodes for runtime diagnostic purposes.
* Poor: _We do not trace calls through the system_
* Excellent: _We use a standard tracing library such as OpenTracing to trace calls through the system. We collaborate with other teams to ensure that the correct tracing fields are maintained across component boundaries_ (see the sketch below)
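
A sketch using the `opentracing` Python API; the downstream URL and the `http_get` callable are placeholders, and a concrete tracer (e.g. Jaeger's) would be registered at startup:

```python
import opentracing
from opentracing.propagation import Format

def call_downstream(http_get):
    tracer = opentracing.global_tracer()
    with tracer.start_active_span("call-downstream") as scope:
        headers = {}
        # Inject the trace context so the next service continues the same trace.
        tracer.inject(scope.span.context, Format.HTTP_HEADERS, headers)
        return http_get("https://inventory.internal/items", headers=headers)
```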

### Who?

### How?
@@ -323,6 +391,9 @@ _4_

> We need to display key information about the live operation of the system to teams focused on operations.
* Poor: _Operations teams tend to discover the status indicators themselves_
* Excellent: _We build a dashboard in collaboration with the Operations teams so they have all the details they need, presented in a user-friendly way with UX as a key consideration_

### Who?

### How?
@@ -337,6 +408,10 @@ _4_

> We need to demonstrate that the software can recover from internal failures gracefully.

* Poor: _We do not really know whether the system can recover from internal failures_
* Excellent: _We test many internal failure scenarios as part of our automated deployment pipeline_ (see the sketch below)
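
A sketch of one such scenario, with a hypothetical cache dependency: the test injects a failure and asserts that the service degrades gracefully instead of erroring:

```python
def fetch_recommendations(cache_get, fallback=("bestsellers",)):
    """Degrade gracefully: if the cache is down, serve a static fallback."""
    try:
        return cache_get("recommendations")
    except ConnectionError:
        return list(fallback)

def test_recovers_when_cache_is_down():
    def broken_cache(_key):
        raise ConnectionError("cache unavailable")
    assert fetch_recommendations(broken_cache) == ["bestsellers"]
```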

### Who?

### How?
@@ -349,6 +424,9 @@ _4_

> We need to demonstrate that the software can recover from external failures gracefully.
* Poor: _We do not really know whether the system can recover from external failures_
* Excellent: _We test many external failure scenarios as part of our automated deployment pipeline_

### Who?

### How?