This repository has been archived by the owner on Jun 23, 2021. It is now read-only.

Serverless Operations

Lu Hong edited this page Oct 11, 2019 · 35 revisions

Operations plays a vital role in a production service. This project captures how we at AWS implement operational best practices such as alarms, dashboards, and CI/CD pipelines. When you deploy the project by following the Quickstart guide, the operations component is deployed automatically in your AWS account. It uses Amazon CloudWatch for alarms and dashboards. Follow the steps in this walkthrough to learn more.

Alarms

To view the alarms set up by the project deployment:

  1. Log in to the AWS account that you deployed the project to. If you have not deployed the project, see the Quick Start.
  2. Go to the CloudWatch console and click "Alarms".
  3. Search for "realworld-serverless-application-ops-Alarm" and you will see four alarms: Api4xxErrors, ApiAvailability, ApiLatencyP90, and ApiLatencyP50. They monitor the most critical operational problems: error rate, availability, and latency.
  4. Click an alarm to see its detail page.
    1. Api4xxErrors: alarms when the 4xx error rate is above 0.3 for 5 minutes. Some 4xx errors are expected during normal operation, but a large percentage of them can indicate a bad deployment.
    2. ApiAvailability: alarms when the 5xx error rate is above 0.1 for 5 minutes. 5xx errors are not expected and indicate a server-side failure.
    3. ApiLatencyP90: alarms when P90 latency is above 2000 for 5 minutes. It indicates a latency spike and notifies you when customers are experiencing unusually high latency.
    4. ApiLatencyP50: alarms when P50 latency is above 200 for 5 minutes. P50 is the median latency (half of all requests complete at or below this value), so it reflects the general customer experience.
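The alarm conditions above combine two ideas: a percentile of request latency, and a threshold that must be breached for a sustained window. The sketch below illustrates both in plain Python. It is not how CloudWatch is implemented; the nearest-rank percentile method, the one-datapoint-per-minute assumption, and the function names `percentile` and `should_alarm` are all illustrative assumptions.

```python
import math


def percentile(values, p):
    """Nearest-rank percentile: the smallest sample such that at least
    p percent of all samples are less than or equal to it.
    (Assumption: CloudWatch may use a different estimation method.)"""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def should_alarm(datapoints, threshold, periods=5):
    """Return True when the most recent `periods` datapoints all exceed
    `threshold`. Assumes one datapoint per minute, so periods=5
    approximates "above the threshold for 5 minutes"."""
    if len(datapoints) < periods:
        return False
    return all(value > threshold for value in datapoints[-periods:])


# Hypothetical latencies for ten requests
latencies = [100, 120, 150, 180, 190, 210, 250, 400, 900, 2500]
p50 = percentile(latencies, 50)  # 190: the median request
p90 = percentile(latencies, 90)  # 900: only the slowest tenth is above this

# Hypothetical per-minute 4xx error rates checked against the 0.3 threshold
sustained = should_alarm([0.05, 0.35, 0.40, 0.31, 0.50, 0.45], 0.3)  # True
recovered = should_alarm([0.35, 0.40, 0.20, 0.50, 0.45], 0.3)        # False
```

A single slow request (2500 here) barely moves P50 but dominates the tail, which is why the walkthrough pairs a P50 alarm for the typical experience with a P90 alarm for spikes.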

Here is an example of alarms:

[Screenshot: example alarms]

When an alarm goes into the "ALARM" state, a message is sent to an Amazon SNS topic named "AlarmsTopic" and prefixed with "realworld-serverless-application-ops-Alarm". To receive notifications, subscribe to the topic:

  1. Go to the Amazon SNS console, click "Topics", and choose the topic "realworld-serverless-application-ops-Alarm-xxx-AlarmsTopic".
  2. On the detail page, click "Create subscription" and choose the desired protocol and endpoint. For example, to receive email notifications, choose "Email" under "Protocol" and enter your email address under "Endpoint".
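The same subscription can also be created programmatically. The following is a minimal sketch, assuming a boto3-style SNS client is passed in; `list_topics` and `subscribe` are real boto3 SNS operations, while the helper names `find_alarm_topic_arns` and `subscribe_email` and the name-matching rule are assumptions made for illustration.

```python
ALARM_PREFIX = "realworld-serverless-application-ops-Alarm"


def find_alarm_topic_arns(sns_client, prefix=ALARM_PREFIX):
    """Return ARNs of topics whose name starts with `prefix` and contains
    "AlarmsTopic". The matching rule is an assumption based on the naming
    pattern shown in the steps above."""
    arns = []
    kwargs = {}
    while True:
        page = sns_client.list_topics(**kwargs)
        for topic in page.get("Topics", []):
            arn = topic["TopicArn"]
            name = arn.rsplit(":", 1)[-1]  # topic name is the last ARN segment
            if name.startswith(prefix) and "AlarmsTopic" in name:
                arns.append(arn)
        token = page.get("NextToken")
        if not token:
            return arns
        kwargs = {"NextToken": token}  # fetch the next page of topics


def subscribe_email(sns_client, topic_arn, email):
    """Subscribe an email address to the topic. The subscription stays
    pending until the recipient confirms via the email SNS sends."""
    response = sns_client.subscribe(
        TopicArn=topic_arn,
        Protocol="email",
        Endpoint=email,
    )
    return response["SubscriptionArn"]
```

To try it against a real account, create the client with `boto3.client("sns")` (this requires the boto3 package and configured AWS credentials) and pass it to both helpers.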

Here is an example email notification:

[Screenshot: example email notification]

Dashboard

To view the dashboard set up by the project deployment:

  1. Go to the CloudWatch console and click "Dashboards".
  2. Click the name "Dashboard-xxx" to open the dashboard.

The dashboard is composed of three parts: API Gateway metrics, API Lambda metrics and CloudWatch Insights queries.

Here is an example dashboard:

[Screenshot: example dashboard]

API Gateway metrics

API Gateway metrics provide a view of API usage and health. Metrics include 5XX error count and availability, request count, 4XX error count, and latency.
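As a rough illustration of how an availability figure can be derived from these metrics, the sketch below assumes availability is defined as the fraction of requests that did not return a 5xx error. That definition and the function name `availability` are assumptions; the dashboard's actual metric math may differ.

```python
def availability(request_count, error_5xx_count):
    """Fraction of requests that did not fail with a 5xx error.
    Assumed definition: 1 - (5xx errors / total requests).
    With no traffic there is nothing to fail, so report 1.0."""
    if request_count == 0:
        return 1.0
    return 1.0 - error_5xx_count / request_count


# Hypothetical sample: 5 server errors out of 1000 requests, about 99.5%
sample = availability(1000, 5)
```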

API Lambda metrics

API Lambda metrics measure the usage and health of the Lambda function. Metrics include error count and success rate, invocations, and latency.

CloudWatch Insights queries

CloudWatch Insights queries provide insight into API performance and how customers are impacted by performance issues. Queries include "Top 10 customers by Request Count", "Top 10 Customers Impacted by API 5xx", "Top 10 API 5xx Errors", "Top 10 API 4xx Errors", and "Top 10 API Latency Requests".

"Top 10 customers by Request Count" and "Top 10 Customers Impacted by API 5xx" show, respectively, the customers who use the service most actively and the customers who are most impacted by service issues.

"Top 10 API 5xx Errors" and "Top 10 API 4xx Errors" provide the error messages or error types.

"Top 10 API Latency Requests" shows which types of requests are experiencing the highest latencies.
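To give a feel for what a query such as "Top 10 Customers Impacted by API 5xx" might look like, here is a minimal sketch that builds a CloudWatch Logs Insights query string and starts it through a boto3-style CloudWatch Logs client. The log field names `customerId` and `status`, the log group name, and both helper functions are hypothetical; the project's real queries and log format may differ. `start_query` is a real boto3 CloudWatch Logs operation.

```python
def top_customers_by_5xx_query(customer_field="customerId",
                               status_field="status", limit=10):
    """Build a Logs Insights query that counts 5xx responses per customer.
    Both field names are hypothetical and depend on the log format."""
    return (
        f"filter {status_field} >= 500 "
        f"| stats count(*) as errorCount by {customer_field} "
        f"| sort errorCount desc "
        f"| limit {limit}"
    )


def start_insights_query(logs_client, log_group, query, start_epoch, end_epoch):
    """Start the query over [start_epoch, end_epoch] (Unix seconds) and
    return its query ID. Results are fetched separately with
    get_query_results once the query completes."""
    response = logs_client.start_query(
        logGroupName=log_group,
        startTime=start_epoch,
        endTime=end_epoch,
        queryString=query,
    )
    return response["queryId"]
```

With a real client from `boto3.client("logs")`, the returned ID can be passed to `get_query_results(queryId=...)` and polled until the reported status is "Complete".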