What is HumanOps?
HumanOps is a set of principles that focus on the human aspects of running infrastructure.
It deliberately highlights the importance of the teams running systems, not just the systems themselves.
The health of your infrastructure is not just about hardware, software, automation and uptime; it also includes the health and wellbeing of your team.
The goal of HumanOps is to improve and maintain the good health of your team: easing communication, reducing fatigue and reducing stress.
- Humans are part of the system. Well-architected, modern infrastructure is as automated and self-managing as possible, but it still requires human operation for maintenance, upgrades and incident response. As such, human operators should be considered a part of the system just like any other component, so that proper consideration can be given to their role.
- Humans impact systems. The interaction between human operators and the system goes both ways: the actions of the system can impact the wellbeing of the human operators, just as the actions of the operators can impact the reliable operation of the system. As such, equal importance should be placed on the wellbeing of the team and on the system they are managing.
- Humans impact business. Just as there is a direct business impact from downtime caused by unreliable systems, there is a direct business impact from an unhealthy team working on those systems.
- Human issues count as system issues. System issues, e.g. error rates, uptime SLA % and other technical points, tend to get more focus when prioritising issues. However, although the link may be less direct, the poor health of the operations team can have just as much impact on the system. As such, the impact of issues on the operations team, e.g. alerts out of hours or frequent manual interventions, should receive equal weight to system issues.
- Escalate to humans as a last resort. Dealing with incidents rapidly, consistently and without error is best suited to machines. As such, the system should take whatever actions it can to resolve problems by itself, and involve the human operators only in edge cases, as “escalations”, rather than deferring to the human operations team by default.
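The "escalate to humans as a last resort" principle can be illustrated with a minimal sketch. It assumes that each automated remediation is a function returning True when it fixes the problem, and that paging an operator is a single callback. All names here (`Incident`, `handle_incident`, `restart_service`, `fail_over`) are hypothetical and only illustrate the idea; they are not part of any HumanOps tooling.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Incident:
    name: str
    log: List[str] = field(default_factory=list)


def handle_incident(
    incident: Incident,
    remediations: List[Callable[[Incident], bool]],
    escalate: Callable[[Incident], None],
) -> bool:
    """Try each automated remediation in order; page a human only if all fail.

    Returns True if the system resolved the incident by itself.
    """
    for remediation in remediations:
        try:
            resolved = remediation(incident)
        except Exception as exc:
            # A crashing remediation counts as a failed attempt, not a page.
            incident.log.append(f"{remediation.__name__} raised {exc!r}")
            continue
        if resolved:
            incident.log.append(f"resolved by {remediation.__name__}")
            return True
        incident.log.append(f"{remediation.__name__} did not resolve the issue")

    # Every automated option is exhausted: this is the escalation case.
    escalate(incident)
    incident.log.append("escalated to human operator")
    return False


# Hypothetical remediations, for illustration only.
def restart_service(incident: Incident) -> bool:
    return False  # pretend the restart did not help


def fail_over(incident: Incident) -> bool:
    return True  # pretend the failover worked
```

With `[restart_service, fail_over]` as the remediation list, the incident is resolved without anyone being paged. With only `[restart_service]`, the `escalate` callback is invoked exactly once. The log kept on the incident gives the operator context on what was already tried if they are eventually paged.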
The metrics of HumanOps
How do we know if we are being successful in applying HumanOps principles? Many of the benefits will be qualitative, e.g. better work/life balance, but others can be measured more quantitatively:
- Better sleep (achieving sleep length targets, zero interruptions/wakeups)
- "Reasonable" working hours (not needing to work late, not getting called at weekends)
- Reduced stress (lower blood pressure, reduced anxiety)
- Reduction in outages/incidents where root cause analysis identifies human error as a cause or contributing element
- Reduction in time to resolution (due to improved response process & incident handling)
- Reduction in alerts/incidents requiring human response
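Several of the operational metrics above could be derived from an alert history. The sketch below assumes a hypothetical `Alert` record with a timestamp, a flag for whether a human was paged, a resolution time in minutes and a flag for human error. It also assumes working hours of 09:00 to 18:00, Monday to Friday. Personal health metrics such as sleep or blood pressure are outside its scope.

```python
from dataclasses import dataclass
from datetime import datetime
from statistics import mean
from typing import Dict, List, Optional


@dataclass
class Alert:
    fired_at: datetime
    paged_human: bool
    minutes_to_resolve: float
    human_error: bool = False  # set from root cause analysis


def is_out_of_hours(ts: datetime, start_hour: int = 9, end_hour: int = 18) -> bool:
    # Weekends (Saturday=5, Sunday=6) or any time outside the working day.
    return ts.weekday() >= 5 or not (start_hour <= ts.hour < end_hour)


def humanops_metrics(alerts: List[Alert]) -> Dict[str, Optional[float]]:
    """Summarise an alert history into a few HumanOps-style counters."""
    paged = [a for a in alerts if a.paged_human]
    return {
        "alerts_total": len(alerts),
        "alerts_paging_human": len(paged),
        "out_of_hours_pages": sum(is_out_of_hours(a.fired_at) for a in paged),
        "incidents_with_human_error": sum(a.human_error for a in alerts),
        "mean_minutes_to_resolve": (
            mean(a.minutes_to_resolve for a in alerts) if alerts else None
        ),
    }
```

Comparing these figures from one period to the next gives a rough view of whether alerts requiring human response, out-of-hours interruptions and time to resolution are actually going down.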