This repository has been archived by the owner on Jun 23, 2021. It is now read-only.

Operations (DevOps)

James Hood edited this page Oct 14, 2019 · 5 revisions

This application uses Amazon CloudWatch for managing operations (DevOps). This page includes patterns and best practices for working with CloudWatch that are used in this application.

Set up operations infrastructure in a separate stack

Operations infrastructure such as alarms and dashboards is set up in a separate stack from the primary backend service infrastructure. This decouples deployment of operations infrastructure from deployment of the backend service itself. That is useful because you will generally want to deploy changes to alarms and dashboards more quickly than changes to the production service. It also helps you avoid hitting CloudFormation stack resource limits, since a complex service can have many alarms.

In this application, the ops stack is broken up into two nested stacks: one for alarms and one for dashboards. In a more complex service, you may choose to organize alarms into multiple nested stacks, e.g., alarms for synchronous API request processing vs alarms for async stream processing.
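As a rough illustration of that layout, the sketch below builds a parent ops template containing two nested stacks. It is not this project's actual template: the `Stage` parameter, the logical IDs, and the template URLs are hypothetical, and the template is built as a Python dict only so it can be printed as JSON.

```python
import json


def build_ops_template(alarms_template_url, dashboards_template_url):
    """Return a parent ops-stack template with one nested stack per concern."""

    def nested_stack(template_url):
        # AWS::CloudFormation::Stack is CloudFormation's nested-stack resource type.
        return {
            "Type": "AWS::CloudFormation::Stack",
            "Properties": {
                "TemplateURL": template_url,
                # "Stage" is a hypothetical parameter passed down to each child.
                "Parameters": {"Stage": {"Ref": "Stage"}},
            },
        }

    return {
        "AWSTemplateFormatVersion": "2010-09-09",
        "Description": "Ops stack sketch: alarms and dashboards as nested stacks",
        "Parameters": {"Stage": {"Type": "String"}},
        "Resources": {
            "Alarms": nested_stack(alarms_template_url),
            "Dashboards": nested_stack(dashboards_template_url),
        },
    }


if __name__ == "__main__":
    # Hypothetical URLs; in practice these point at templates uploaded to S3.
    template = build_ops_template(
        "https://example.com/alarms.template.yaml",
        "https://example.com/dashboards.template.yaml",
    )
    print(json.dumps(template, indent=2))
```

Because the backend service is not part of this template, the ops stack can be deployed on its own schedule. Splitting alarms further would mean adding more `nested_stack(...)` entries under `Resources`.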

Examples in this project:

  1. Operations infrastructure is created as part of a separate ops stack, which contains nested stacks for alarms and dashboards.

Alarming on API metrics vs Lambda metrics

For Lambda functions that handle synchronous API requests, it's better to alarm on the error and latency metrics produced by Amazon API Gateway rather than the metrics produced by AWS Lambda. Amazon API Gateway metrics are more granular: they provide per-operation metrics and split error metrics into 4xx (client errors) and 5xx (service errors), which helps you tell a service bug apart from a normal client error such as an input validation failure. The latency metrics provided by Amazon API Gateway are also closer to what the client actually experiences, since they include the overhead of Amazon API Gateway processing the request in addition to the Lambda execution time.
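A minimal sketch of what such alarms might look like as CloudFormation resources, built here as Python dicts. The `AWS/ApiGateway` namespace, the `5XXError` and `Latency` metric names, and the `ApiName`/`Stage`/`Resource`/`Method` dimensions are the standard REST API ones (per-method metrics require detailed metrics to be enabled on the stage). The API name, stage, path, and thresholds are hypothetical and are not taken from this project.

```python
def api_dimensions(api_name, stage, resource, method):
    """Dimensions that select one API operation (per-method metrics)."""
    return [
        {"Name": "ApiName", "Value": api_name},
        {"Name": "Stage", "Value": stage},
        {"Name": "Resource", "Value": resource},
        {"Name": "Method", "Value": method},
    ]


def api_5xx_alarm(api_name, stage, resource, method, threshold=1):
    """Alarm when an operation returns at least `threshold` 5xx errors per minute."""
    return {
        "Type": "AWS::CloudWatch::Alarm",
        "Properties": {
            "AlarmDescription": f"5xx errors on {method} {resource}",
            "Namespace": "AWS/ApiGateway",
            "MetricName": "5XXError",
            "Dimensions": api_dimensions(api_name, stage, resource, method),
            "Statistic": "Sum",
            "Period": 60,
            "EvaluationPeriods": 1,
            "Threshold": threshold,
            "ComparisonOperator": "GreaterThanOrEqualToThreshold",
            # No requests means no data points; do not treat that as an outage.
            "TreatMissingData": "notBreaching",
        },
    }


def api_latency_alarm(api_name, stage, resource, method, p99_millis=2000):
    """Alarm when the p99 latency seen at API Gateway exceeds `p99_millis`."""
    return {
        "Type": "AWS::CloudWatch::Alarm",
        "Properties": {
            "AlarmDescription": f"High p99 latency on {method} {resource}",
            "Namespace": "AWS/ApiGateway",
            "MetricName": "Latency",  # reported in milliseconds
            "Dimensions": api_dimensions(api_name, stage, resource, method),
            # Percentiles use ExtendedStatistic instead of Statistic.
            "ExtendedStatistic": "p99",
            "Period": 60,
            "EvaluationPeriods": 5,
            "Threshold": p99_millis,
            "ComparisonOperator": "GreaterThanThreshold",
            "TreatMissingData": "notBreaching",
        },
    }


if __name__ == "__main__":
    # Hypothetical API name, stage, and path.
    print(api_5xx_alarm("my-api", "Prod", "/applications", "GET"))
    print(api_latency_alarm("my-api", "Prod", "/applications", "GET"))
```

Only `5XXError` is alarmed on here. A spike in `4XXError` usually reflects client mistakes, so it tends to need a looser threshold, if it is alarmed on at all.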

For asynchronous Lambda functions, e.g., those triggered by an S3 bucket write or an SQS queue, you should alarm on dead-letter queue (DLQ) message counts (the SQS ApproximateNumberOfMessagesVisible metric) rather than Lambda function errors. This assumes you're following the best practice of configuring a DLQ on all async processing Lambda functions. The reason is that the Lambda service automatically retries failed async Lambda requests, so an alarm on Lambda function errors can fire falsely when a function encounters a transient error but then succeeds on retry.
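The difference between the two signals can be seen in a small simulation, sketched below together with a possible DLQ alarm definition. The handler, the event names, and the queue name are hypothetical. The retry count assumes Lambda's default for asynchronous invocations (two retries after the initial attempt), and the simulation is a simplification of the real service.

```python
MAX_ATTEMPTS = 3  # initial attempt + 2 retries (default for async invocations)


def simulate_async_processing(events, handler):
    """Return (function_error_count, dlq_messages) for a batch of async events."""
    function_errors = 0
    dlq_messages = []
    for event in events:
        for attempt in range(1, MAX_ATTEMPTS + 1):
            try:
                handler(event, attempt)
                break  # success: no further retries
            except Exception:
                function_errors += 1  # every failed attempt counts as an error
        else:
            # All attempts failed, so the event ends up in the DLQ.
            dlq_messages.append(event)
    return function_errors, dlq_messages


def hypothetical_handler(event, attempt):
    """Fail once on transient events and always on poison events."""
    if event == "transient" and attempt == 1:
        raise RuntimeError("temporary downstream failure")
    if event == "poison":
        raise ValueError("unprocessable event")


def dlq_alarm(queue_name):
    """Alarm as soon as any message is visible in the DLQ."""
    return {
        "Type": "AWS::CloudWatch::Alarm",
        "Properties": {
            "Namespace": "AWS/SQS",
            "MetricName": "ApproximateNumberOfMessagesVisible",
            "Dimensions": [{"Name": "QueueName", "Value": queue_name}],
            "Statistic": "Maximum",
            "Period": 300,
            "EvaluationPeriods": 1,
            "Threshold": 1,
            "ComparisonOperator": "GreaterThanOrEqualToThreshold",
            "TreatMissingData": "notBreaching",
        },
    }


if __name__ == "__main__":
    events = ["ok", "transient", "transient", "poison"]
    errors, dlq = simulate_async_processing(events, hypothetical_handler)
    print(f"function errors: {errors}, DLQ messages: {len(dlq)}")
```

With these four events the function reports five errors, but only the one event that never succeeds reaches the DLQ. An alarm on the error count would react to the two transient failures that recovered on their own, while the DLQ alarm fires only for work that was actually lost.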

Examples in this project:

  1. Configuring API alarms against Amazon API Gateway metrics rather than Lambda metrics.

Configure dashboards via the Amazon CloudWatch console

CloudWatch dashboards are a powerful way to get a quick view of the health of your system. They should be managed by AWS CloudFormation so they are deployed along with the rest of your application. However, writing a dashboard configuration in CloudFormation by hand is extremely painful. A nice shortcut is to configure your dashboard manually in the CloudWatch console and then click on Actions -> View/edit source. This gives you the JSON definition of the dashboard, which you can copy/paste into your CloudFormation template. You should then use Fn::Sub to replace any hardcoded values within the dashboard definition, such as the AWS region or resource names, with template references so that the template stays portable.
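The substitution step can be sketched as a small helper that takes the JSON copied from the console and wraps it in Fn::Sub. This is only an illustration of the idea: the helper name, the pasted widget, the function name `my-backend-fn`, and the `BackendFunctionName` template parameter are all hypothetical. The plain string replacement is also naive and would rewrite a hardcoded value wherever it appears in the body.

```python
import json


def to_fn_sub(dashboard_json, replacements):
    """Wrap a console-exported dashboard body in Fn::Sub.

    `replacements` maps each hardcoded literal to the template variable
    (pseudo parameter, parameter, or resource name) that should replace it.
    """
    # Escape any literal "${" first so Fn::Sub leaves it alone ("${!" is the escape).
    body = dashboard_json.replace("${", "${!")
    for literal, variable in replacements.items():
        body = body.replace(literal, "${" + variable + "}")
    return {"Fn::Sub": body}


# Hypothetical dashboard source as copied from the console.
pasted = json.dumps(
    {
        "widgets": [
            {
                "type": "metric",
                "x": 0,
                "y": 0,
                "width": 12,
                "height": 6,
                "properties": {
                    "metrics": [
                        ["AWS/Lambda", "Errors", "FunctionName", "my-backend-fn"]
                    ],
                    "region": "us-east-1",
                    "title": "Lambda errors",
                },
            }
        ]
    }
)

dashboard_resource = {
    "Type": "AWS::CloudWatch::Dashboard",
    "Properties": {
        "DashboardBody": to_fn_sub(
            pasted,
            {
                "us-east-1": "AWS::Region",
                "my-backend-fn": "BackendFunctionName",
            },
        )
    },
}

if __name__ == "__main__":
    print(json.dumps(dashboard_resource, indent=2))
```

After the rewrite, the dashboard body no longer mentions a specific region or function name, so the same template can be deployed to another region or account without editing it.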

Examples in this project:

  1. The dashboard definition was originally pasted from the CloudWatch console, but hardcoded values such as the AWS region and Lambda function names have been replaced with CloudFormation references.

Add CloudWatch Insights queries on API access logs to dashboards

CloudWatch Insights (officially CloudWatch Logs Insights) combined with Amazon API Gateway access logging is a powerful way to quickly understand the requests coming into your service. For example, you can query the top users for a given time period or, in an outage scenario, the top users impacted by the outage. The dashboard also links you to the CloudWatch Insights console, where you can modify the query and run it ad hoc, which is useful during an outage.
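As an illustration, a dashboard widget for a "top impacted users" query might be generated along the lines below. This is a sketch under assumptions: the log group name is hypothetical, and the `status` and `caller` fields exist only if your API Gateway access log format emits JSON with those keys (with `status` as a number). Adjust the query to match your own log format.

```python
def top_5xx_callers_widget(log_group, region, limit=10):
    """Build a CloudWatch dashboard log widget listing the callers with most 5xx errors."""
    query = (
        f"SOURCE '{log_group}' "
        "| filter status >= 500 "  # assumes a numeric `status` field in the access log
        "| stats count(*) as errorCount by caller "  # `caller` is a hypothetical field
        "| sort errorCount desc "
        f"| limit {limit}"
    )
    return {
        "type": "log",  # widget type for CloudWatch Logs Insights queries
        "x": 0,
        "y": 0,
        "width": 24,
        "height": 6,
        "properties": {
            "query": query,
            "region": region,
            "title": f"Top {limit} callers impacted by API 5xx errors",
            "view": "table",
        },
    }


if __name__ == "__main__":
    # Hypothetical log group; "${AWS::Region}" is resolved later by Fn::Sub.
    widget = top_5xx_callers_widget("/hypothetical/api-access-logs", "${AWS::Region}")
    print(widget["properties"]["query"])
```

The returned dict goes into the `widgets` list of the dashboard body. Passing `${AWS::Region}` as the region only makes sense when the body is later wrapped in Fn::Sub.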

Examples in this project:

  1. The dashboard defined in the ops stack includes several CloudWatch Insights queries that are useful in real-world applications, e.g., Top 10 Customers Impacted by API 5xx errors.