Add OpenTelemetry Monitoring to DocumentDB Gateway (Phase 1) #195

Conversation

udsmicrosoft
This PR implements Phase 1 of OpenTelemetry instrumentation for the DocumentDB Gateway. The implementation focuses on two critical metrics that provide immediate operational visibility:

Cluster Availability Tracking

  • Adds metrics to monitor the perceived availability of primary and secondary clusters
  • Implemented as gauge metrics with cluster_id labels
  • Values: 1 (available) or 0 (unavailable)

Request Routing Monitoring

  • Tracks traffic routing between primary and secondary regions
  • Implemented as counter metrics with target_region and status labels
  • Provides visibility into traffic shifts during failover events

The PR includes:

  • OpenTelemetry provider implementation
  • Telemetry trait integration with the gateway
  • Docker-based monitoring stack with:
    • OpenTelemetry Collector
    • Prometheus
    • Grafana dashboards
  • Documentation on monitoring setup and usage

This implementation creates a foundation for enhanced observability that will help detect and diagnose failover events and cluster availability issues in multi-region deployments.
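The two Phase 1 metrics described above can be sketched as a provider trait with an in-memory backing store. This is an illustrative stand-in, not the PR's actual API: the trait, struct, and method names here are assumptions, and a real implementation would record into OpenTelemetry instruments instead of hash maps.

```rust
use std::collections::HashMap;
use std::sync::Mutex;

// Hypothetical trait mirroring the two Phase 1 metrics.
trait TelemetryProvider {
    /// Gauge semantics: 1 = available, 0 = unavailable, keyed by cluster_id.
    fn record_cluster_availability(&self, cluster_id: &str, available: bool);
    /// Counter semantics: incremented per request, keyed by (target_region, status).
    fn record_request_routed(&self, target_region: &str, status: &str);
}

/// In-memory stand-in for the OpenTelemetry-backed provider.
#[derive(Default)]
struct InMemoryTelemetry {
    availability: Mutex<HashMap<String, u64>>,
    routed: Mutex<HashMap<(String, String), u64>>,
}

impl TelemetryProvider for InMemoryTelemetry {
    fn record_cluster_availability(&self, cluster_id: &str, available: bool) {
        // Gauge: overwrite the last observed value for this cluster.
        self.availability
            .lock()
            .unwrap()
            .insert(cluster_id.to_string(), available as u64);
    }

    fn record_request_routed(&self, target_region: &str, status: &str) {
        // Counter: monotonically increment per (region, status) pair.
        *self
            .routed
            .lock()
            .unwrap()
            .entry((target_region.to_string(), status.to_string()))
            .or_insert(0) += 1;
    }
}
```

The gauge-vs-counter distinction matters for the dashboards: availability is a point-in-time value that can go back down, while routing counts only ever increase and are meant to be rate()-ed in Prometheus.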

@AndrewKhoma AndrewKhoma left a comment

can we sync offline on what we're trying to achieve with the otel for gw monitoring, please?


let telemetry_clone = telemetry_provider.clone();

// Initialize the cluster monitor with the primary pool

nit: it's the system pool, used for system requests such as getting the GUCs from the backend

run_server(service_context, certificate_options, None, token.clone(), None)
let telemetry_provider = Arc::new(documentdb_gateway::open_telemetry_provider::OpenTelemetryProvider::new());

let telemetry_clone = telemetry_provider.clone();

let's use Arc::clone instead of the postfix notation to show that it's not a heavy-weight clone, but rather a refcount increase
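The convention the reviewer is asking for can be shown in a small sketch (the `provider` value here is a placeholder, not the PR's actual type):

```rust
use std::sync::Arc;

/// Returns the strong count after taking two extra handles to `provider`.
fn demo_arc_clone() -> usize {
    let provider = Arc::new(String::from("telemetry"));
    // Postfix `.clone()` compiles, but reads like a deep copy of the data.
    let _a = provider.clone();
    // `Arc::clone(&x)` makes it explicit that only the refcount is bumped;
    // the underlying value is never duplicated.
    let _b = Arc::clone(&provider);
    Arc::strong_count(&provider)
}
```

Both forms are semantically identical; the fully qualified form is simply the idiom Clippy's `clone_on_ref_ptr` lint encourages for readability.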

@@ -85,7 +85,21 @@ async fn main() {
.await
.unwrap();

run_server(service_context, certificate_options, None, token.clone(), None)
let telemetry_provider = Arc::new(documentdb_gateway::open_telemetry_provider::OpenTelemetryProvider::new());

why is it Arc in the first place? i thought the telemetry provider is owned by the service context, or is my mental model off?

use crate::postgres::Pool;
use crate::telemetry::TelemetryProvider;

/// ClusterMonitor periodically checks the availability of primary and secondary clusters

can you specify which primary and secondary clusters we are connecting to? in the current layout we only have 1 postgres instance per 1 instance of gw on the host machine, so I'm confused to see the secondary notation here

pub struct ClusterMonitor {
primary_pool: Arc<Pool>,
secondary_pool: Option<Arc<Pool>>,
telemetry: Arc<dyn TelemetryProvider>,

any chance of having it as a generic instead of a trait object? since we already know the concrete type of the telemetry provider in main.rs at compile time
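The suggested change amounts to swapping `Arc<dyn TelemetryProvider>` for a type parameter. A minimal sketch, with all names (`ClusterMonitor`, `OpenTelemetryProvider`, the trait method) illustrative rather than taken from the PR:

```rust
trait TelemetryProvider {
    fn record_availability(&self, cluster: &str, up: bool) -> (String, bool);
}

struct OpenTelemetryProvider;

impl TelemetryProvider for OpenTelemetryProvider {
    fn record_availability(&self, cluster: &str, up: bool) -> (String, bool) {
        // Stand-in body; a real provider would emit a gauge sample here.
        (cluster.to_string(), up)
    }
}

// Generic over the provider type: since main.rs knows the concrete type at
// compile time, calls are statically dispatched (and inlinable), and no
// `Arc<dyn ...>` indirection or vtable is needed.
struct ClusterMonitor<T: TelemetryProvider> {
    telemetry: T,
}

impl<T: TelemetryProvider> ClusterMonitor<T> {
    fn report(&self, cluster: &str, up: bool) -> (String, bool) {
        self.telemetry.record_availability(cluster, up)
    }
}
```

The trait-object form only becomes necessary if the provider must be chosen at runtime or stored alongside other provider types.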

/// Check if a specific connection pool is available by attempting to get a connection
async fn check_pool_availability(&self, pool: &Pool, cluster_name: &str) -> bool {
match pool.get().await {
Ok(mut client) => {

let's not name a pg connection a client; let's name it connection instead, since we'll have an actual data client soon

pub fn new() -> Self {
// Initialize the global meter provider if not already done
let meter_provider = METER_PROVIDER.get_or_init(|| {
let endpoint = env::var("OTEL_EXPORTER_OTLP_ENDPOINT")

can we capture this env variable through the config file?
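One common way to honor this request while keeping the env var as a fallback is a small resolution helper. This is a sketch under assumed precedence rules (config first, then env, then a default); the function name and the `4317` default port are illustrative, though `OTEL_EXPORTER_OTLP_ENDPOINT` itself is the standard OTLP exporter variable:

```rust
use std::env;

/// Resolve the OTLP endpoint: explicit config value wins, then the
/// OTEL_EXPORTER_OTLP_ENDPOINT env var, then a local default.
fn resolve_otlp_endpoint(config_value: Option<&str>) -> String {
    config_value
        .map(str::to_string)
        .or_else(|| env::var("OTEL_EXPORTER_OTLP_ENDPOINT").ok())
        .unwrap_or_else(|| "http://localhost:4317".to_string())
}
```

This keeps deployments that already set the env var working while letting the gateway's config file become the primary source of truth.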

use opentelemetry_otlp::{WithExportConfig};

// Global meter provider for OpenTelemetry
static METER_PROVIDER: OnceCell<Arc<opentelemetry_sdk::metrics::SdkMeterProvider>> = OnceCell::new();

what are we trying to achieve here with Arc? since this smart pointer is basically the owner of this type, and we're making the only instance of it a singleton, what's the logic behind refcounting in an async context for this one?

and let's not use OnceCell; if it should be singly owned, then let's move it to the ServiceContext and use it from there. I'm not a fan of this library and the way it lets you redefine the value it holds
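The ownership move the reviewer suggests would look roughly like this. A minimal sketch, assuming simplified `ServiceContext` and `MeterProvider` shapes (the real types carry far more state); the point is that the provider lives in exactly one place with no global, no `static`, and no cell type at all:

```rust
/// Stand-in for the SdkMeterProvider; holds only what the sketch needs.
struct MeterProvider {
    endpoint: String,
}

/// The service context owns the meter provider outright, so its lifetime
/// is tied to the context rather than to a process-wide singleton.
struct ServiceContext {
    meter_provider: MeterProvider,
}

impl ServiceContext {
    fn new(endpoint: &str) -> Self {
        Self {
            meter_provider: MeterProvider {
                endpoint: endpoint.to_string(),
            },
        }
    }

    fn meter_endpoint(&self) -> &str {
        &self.meter_provider.endpoint
    }
}
```

With single ownership established at construction time, there is nothing left to "redefine" later, which is the property the reviewer is after.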


let traffic_counter = meter
.u64_counter("docdb_gateway_request_routing")
.with_description("Counts requests routed to primary or secondary regions")

but gw is not in charge of regional failover, right? so maybe let's use this meter to just count the number of requests that gw served?
