doc: Add TTS telemetry and ECS monitoring

TheThingsIndustries · Nov 3, 2020 · 0f5643a
1 parent 83d0f03
commit 0f5643a

Showing 2 changed files with 156 additions and 0 deletions.
diff --git a/doc/content/getting-started/aws/ecs/monitoring/_index.md b/doc/content/getting-started/aws/ecs/monitoring/_index.md
@@ -0,0 +1,89 @@
+---
+title: "Monitoring"
+description: ""
+weight: 5
+---
+
+This page describes how you can monitor a {{% tts %}} deployment on AWS ECS.
+
+<!--more-->
+
+One of the [deployment steps]({{< ref "../deployment" >}}) was to deploy the "Monitoring" stack. This installed [Prometheus](https://prometheus.io/) into your cluster.
+
+The provided Prometheus image already comes with a number of recording and alerting rules that are useful for most (if not all) deployments.
+
+## Alerts
+
+For all ECS services that form your {{% tts %}} deployment, the following alerts are defined:
+
+- `NoIdentityServers` (The cluster does not have any Identity Server instances). This means that there are no ECS tasks for the Identity Server component.
+- `SomeIdentityServersDown` (Some Identity Server instances of The Things Enterprise Stack are down). This means that some of the Identity Server instances are not responding. In many cases the deployment will still be online, as requests are routed to other instances.
+- `AllIdentityServersDown` (All Identity Server instances of The Things Enterprise Stack are down). This means that none of the Identity Server instances is responding.
+
+Similar alerts are defined for the Gateway Server, Network Server, Application Server, Join Server, etc.
+
+For the deployment as a whole, the following alerts are defined:
+
+- `InternalServerErrors` (An instance is returning responses with 500 codes).
+- `UpcomingCertificateExpiry` (A TLS certificate expires in less than 14 days).
+- `CertificateExpired` (A TLS certificate has expired).
+- `IncreasedLatency` (An instance is experiencing increased latency).
+- `UpcomingLicenseExpiry` (The Things Enterprise Stack license expires in less than 14 days).
+- `LicenseExpired` (The Things Enterprise Stack license has expired).
+
+For several components there are alerts that fire when traffic patterns deviate from "usual" traffic:
+
+- `GatewayServerReceivedUplinkTrafficDrop` (The Gateway Server of The Things Enterprise Stack received less uplink traffic than usual). This alert includes the protocol (UDP, MQTT, WS), so check which protocol is affected when it fires.
+- `GatewayServerForwardedUplinkTrafficDrop` (The Gateway Server of The Things Enterprise Stack forwarded less uplink traffic than usual). This alert includes the host (`cluster` or `packetbroker`), so check which host is affected when it fires. If the host is `cluster`, there may be an issue with the Network Server. If the host is `packetbroker`, there may be an issue with the Packet Broker Agent.
+- `NetworkServerReceivedUplinkTrafficDrop` (The Network Server of The Things Enterprise Stack received less uplink traffic than usual).
+- `ApplicationServerReceivedUplinkTrafficDrop` (The Application Server of The Things Enterprise Stack received less uplink traffic than usual).
+
+Similar alerts for downlink traffic will be added in the future.
+
+## Metrics
+
+In addition to the [metrics exported by {{% tts %}}]({{< ref "/reference/telemetry" >}}), the provided Prometheus image defines a number of recording rules. These rules provide the input for the alerting rules described above, and can also be useful in your dashboards.
+
+For all services the following metrics are recorded:
+
+- `ttn_lw_log_log_messages_rate` records the rate of log messages by job (service), namespace and level. This can be useful for monitoring warning and error rates.
+- `ttn_lw_events_publishes_rate` records the rate of published events by event name.
+
+For the Gateway Server:
+
+- `ttn_lw_gs_connected_gateways:by_protocol` and `ttn_lw_gs_connected_gateways:by_tenant_id` record the number of connected gateways in the cluster.
+- `ttn_lw_gs_uplink_received_rate:by_protocol` and `ttn_lw_gs_uplink_received_rate:by_tenant_id` record the rate of uplink messages received by the Gateway Server.
+- `ttn_lw_gs_downlink_sent_rate:by_protocol` and `ttn_lw_gs_downlink_sent_rate:by_tenant_id` record the rate of downlink messages sent by the Gateway Server.
+- `ttn_lw_gs_status_received_rate:by_protocol` and `ttn_lw_gs_status_received_rate:by_tenant_id` record the rate of status messages received by the Gateway Server.
+- `ttn_lw_gs_uplink_forwarded_rate:by_host` and `ttn_lw_gs_uplink_forwarded_rate:by_tenant_id` record the rate of messages forwarded to the Network Server or Packet Broker. The `by_tenant_id` variant only considers messages forwarded to the Network Server.
+- `ttn_lw_gs_uplink_dropped_rate:by_host` and `ttn_lw_gs_uplink_dropped_rate:by_tenant_id` record the rate of messages that are dropped for various reasons.
+
+For some of these metrics, there are also variants with `:avg` and `:stddev` suffixes that can be used for anomaly detection, and are therefore also used in alerting rules.
+
+For the Network Server:
+
+- `ttn_lw_ns_uplink_received_rate` and `ttn_lw_ns_uplink_received_rate:by_tenant_id` record the rate of uplink messages received from the Gateway Server or Packet Broker.
+- `ttn_lw_ns_uplink_duplicates_rate:by_tenant_id` records the rate of duplicate uplinks that are merged into a single uplink.
+- `ttn_lw_ns_uplink_processed_rate` and `ttn_lw_ns_uplink_processed_rate:by_tenant_id` record the rate of processed uplinks, meaning uplinks that are matched to a device.
+- `ttn_lw_ns_uplink_forwarded_rate` and `ttn_lw_ns_uplink_forwarded_rate:by_tenant_id` record the rate of uplinks forwarded to an Application Server.
+- `ttn_lw_ns_uplink_dropped_rate:by_tenant_id` records the rate of uplinks that are dropped for various reasons.
+- `ttn_lw_ns_downlink_attempted_rate` and `ttn_lw_ns_downlink_attempted_rate:by_tenant_id` record the rate of attempted downlink messages; `ttn_lw_ns_downlink_forwarded_rate` and `ttn_lw_ns_downlink_forwarded_rate:by_tenant_id` record the rate of forwarded downlink messages.
+
+For some of these metrics, there are also variants with `:avg` and `:stddev` suffixes that can be used for anomaly detection, and are therefore also used in alerting rules.
+
+For the Application Server:
+
+- `ttn_lw_as_subscriptions:by_protocol` and `ttn_lw_as_subscriptions:by_tenant_id` record the number of active uplink subscriptions from integrations and external applications.
+- `ttn_lw_as_pubsub_integrations:by_provider` and `ttn_lw_as_pubsub_integrations:by_tenant_id` record the number of active pub/sub integrations.
+- `ttn_lw_as_uplink_received_rate` and `ttn_lw_as_uplink_received_rate:by_tenant_id` record the rate of uplink messages received from the Network Server.
+- `ttn_lw_as_uplink_forwarded_rate` and `ttn_lw_as_uplink_forwarded_rate:by_tenant_id` record the rate of uplink messages forwarded to integrations and external applications.
+- `ttn_lw_as_uplink_dropped_rate:by_tenant_id` records the rate of uplink messages dropped for various reasons.
+- `ttn_lw_as_downlink_received_rate` and `ttn_lw_as_downlink_received_rate:by_tenant_id` record the rate of downlink messages received from integrations and external applications.
+- `ttn_lw_as_downlink_forwarded_rate` and `ttn_lw_as_downlink_forwarded_rate:by_tenant_id` record the rate of downlink messages forwarded to the Network Server.
+- `ttn_lw_as_downlink_dropped_rate:by_tenant_id` records the rate of downlink messages dropped for various reasons.
+
+For some of these metrics, there are also variants with `:avg` and `:stddev` suffixes that can be used for anomaly detection, and are therefore also used in alerting rules.
+
+For the Join Server:
+
+- `ttn_lw_js_join_accepted_rate:by_tenant_id` and `ttn_lw_js_join_rejected_rate:by_tenant_id` record the rate of accepted and rejected join requests.
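
Review note on the `:avg` and `:stddev` variants described in the file above: one common way to use such values for anomaly detection is a deviation check against the average. The following Python sketch illustrates that idea only; it is not the alerting rule shipped in the Prometheus image. The function name `is_traffic_drop` and the default threshold of 3 standard deviations are assumptions made for illustration.

```python
def is_traffic_drop(current_rate: float, avg_rate: float, stddev_rate: float,
                    threshold: float = 3.0) -> bool:
    """Report whether current_rate is unusually low compared to its history.

    current_rate -- latest value of a rate metric (hypothetical input)
    avg_rate     -- value of the matching `:avg` variant
    stddev_rate  -- value of the matching `:stddev` variant
    threshold    -- number of standard deviations below the average that
                    counts as a drop (3.0 is an assumed default)
    """
    if stddev_rate <= 0:
        # Without measurable variance, fall back to a plain comparison.
        return current_rate < avg_rate
    # Distance below the average, expressed in standard deviations.
    deviations_below = (avg_rate - current_rate) / stddev_rate
    return deviations_below > threshold
```

A check like this only flags drops; a rate above the average never triggers it, which matches the "less traffic than usual" wording of the alerts.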
diff --git a/doc/content/reference/telemetry/_index.md b/doc/content/reference/telemetry/_index.md
@@ -0,0 +1,67 @@
+---
+title: 'Telemetry'
+description: ''
+---
+
+{{% tts %}} exports [Prometheus](https://prometheus.io/) telemetry on the `/metrics` endpoint. This reference gives an overview of the most important metrics.
+
+<!--more-->
+
+> **NOTE:** Metrics are not covered by our compatibility commitment. This means that metrics may be changed or removed in major, minor, or patch releases. This page only lists metrics that are considered relatively stable, but even these metrics may be changed.
+
+## Process and Go Metrics
+
+{{% tts %}} uses the [Prometheus instrumentation library for Go](https://github.com/prometheus/client_golang/) that includes a collector for the state of the process (on Linux) and metrics from the Go runtime.
+
+## gRPC Metrics
+
+{{% tts %}} uses the [Prometheus gRPC instrumentation library](https://github.com/grpc-ecosystem/go-grpc-prometheus) that exposes metrics about gRPC method calls.
+
+- `grpc_server_conns_opened_total` and `grpc_server_conns_closed_total` can be used to see how many gRPC connections to {{% tts %}} have been opened and closed. The difference between the two is the number of connections that are currently open.
+- `grpc_server_started_total` and `grpc_server_handled_total` can be used to see how many RPCs to {{% tts %}} are started, active and finished (including response code).
+- `grpc_server_msg_sent_total` and `grpc_server_msg_received_total` can be used to see how many RPC messages are sent/received by the server.
+
+Similar metrics exist for gRPC client connections opened by {{% tts %}}, and RPCs made by {{% tts %}}.
+
+## General Metrics
+
+- `ttn_lw_license_expiry_seconds` can be used to keep track of license expiry. {{< distributions-inline "Enterprise" >}}
+- `ttn_lw_log_log_messages_total` can be used to track the log messages written by different log namespaces at different log levels.
+- `ttn_lw_events_publishes_total` can be used to track the published events by event type.
+- `ttn_lw_events_channel_dropped_total` can be used to watch for dropped events, which typically indicates that a consumer (such as a user's web browser) can't keep up.
+
+For the Gateway Server:
+
+- `ttn_lw_gs_connected_gateways` indicates the number of connected gateways.
+- `ttn_lw_gs_uplink_received_total` indicates the number of uplink messages received from gateways.
+- `ttn_lw_gs_downlink_sent_total` indicates the number of downlink messages sent to gateways.
+- `ttn_lw_gs_status_received_total` indicates the number of status messages received from gateways.
+- `ttn_lw_gs_uplink_forwarded_total` indicates the number of uplink messages forwarded to the Network Server or Packet Broker.
+- `ttn_lw_gs_uplink_dropped_total` indicates the number of uplink messages that are dropped for various reasons.
+
+For the Network Server:
+
+- `ttn_lw_ns_uplink_received_total` indicates the number of uplink messages received from the Gateway Server or Packet Broker.
+- `ttn_lw_ns_uplink_duplicates_total` indicates the number of duplicate uplinks that are merged into a single uplink.
+- `ttn_lw_ns_uplink_processed_total` indicates the number of processed uplinks, meaning uplinks that are matched to a device.
+- `ttn_lw_ns_uplink_forwarded_total` indicates the number of uplinks forwarded to an Application Server.
+- `ttn_lw_ns_uplink_dropped_total` indicates the number of uplinks that are dropped for various reasons.
+- `ttn_lw_ns_downlink_attempted_total` indicates the number of downlink attempts.
+- `ttn_lw_ns_downlink_forwarded_total` indicates the number of downlinks forwarded to the Gateway Server.
+- `ttn_lw_ns_uplink_gateways` is a histogram that indicates the number of gateways that received a given uplink.
+
+For the Application Server:
+
+- `ttn_lw_as_subscriptions_started_total` and `ttn_lw_as_subscriptions_stopped_total` indicate the number of started/stopped uplink subscriptions from integrations and external applications.
+- `ttn_lw_as_pubsub_integrations_started_total` and `ttn_lw_as_pubsub_integrations_stopped_total` indicate the number of started/stopped pub/sub integrations.
+- `ttn_lw_as_uplink_received_total` indicates the number of uplink messages received from the Network Server.
+- `ttn_lw_as_uplink_forwarded_total` indicates the number of uplink messages forwarded to integrations and external applications.
+- `ttn_lw_as_uplink_dropped_total` indicates the number of uplink messages dropped for various reasons.
+- `ttn_lw_as_downlink_received_total` indicates the number of downlink messages received from integrations and external applications.
+- `ttn_lw_as_downlink_forwarded_total` indicates the number of downlink messages forwarded to the Network Server.
+- `ttn_lw_as_downlink_dropped_total` indicates the number of downlink messages dropped for various reasons.
+- `ttn_lw_javascript_run_latency_seconds` is a histogram that indicates the duration of executing JavaScript payload formatters.
+
+For the Join Server:
+
+- `ttn_lw_js_join_accepted_total` and `ttn_lw_js_join_rejected_total` indicate the number of accepted and rejected join requests.
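
Review note on the telemetry reference above: the counters it lists are served in the Prometheus text exposition format on `/metrics`. The Python sketch below shows, under stated assumptions, how such text could be summed per metric name and how the number of open gRPC connections follows from the opened and closed counters. The sample values and the `protocol` label are invented for illustration, and the parser assumes samples carry no trailing timestamp.

```python
from collections import defaultdict
from typing import Dict

# Hypothetical scrape output; the values and the `protocol` label are made up.
SAMPLE = """\
# TYPE grpc_server_conns_opened_total counter
grpc_server_conns_opened_total 12
grpc_server_conns_closed_total 9
ttn_lw_gs_uplink_received_total{protocol="udp"} 100
ttn_lw_gs_uplink_received_total{protocol="mqtt"} 50
"""


def sum_samples(exposition_text: str) -> Dict[str, float]:
    """Sum all samples per metric name, ignoring labels and comment lines."""
    totals: Dict[str, float] = defaultdict(float)
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        # The value is the last space-separated token (no timestamp assumed).
        series, _, value = line.rpartition(" ")
        name = series.split("{", 1)[0]  # drop the label set, if any
        totals[name] += float(value)
    return dict(totals)


def open_grpc_connections(totals: Dict[str, float]) -> float:
    """Currently open connections: opened minus closed."""
    opened = totals.get("grpc_server_conns_opened_total", 0.0)
    closed = totals.get("grpc_server_conns_closed_total", 0.0)
    return opened - closed
```

For example, `open_grpc_connections(sum_samples(SAMPLE))` subtracts the closed counter from the opened counter in the sample text. In a real deployment this arithmetic would more likely be done in a Prometheus query than in client code.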