Add OpenTelemetry Monitoring to DocumentDB Gateway (Phase 1) #195

Conversation

udsmicrosoft
This PR implements Phase 1 of OpenTelemetry instrumentation for the DocumentDB Gateway. The implementation focuses on two critical metrics that provide immediate operational visibility:

Cluster Availability Tracking

  • Adds metrics to monitor the perceived availability of primary and secondary clusters
  • Implemented as gauge metrics with cluster_id labels
  • Values: 1 (available) or 0 (unavailable)

Request Routing Monitoring

  • Tracks traffic routing between primary and secondary regions
  • Implemented as counter metrics with target_region and status labels
  • Provides visibility into traffic shifts during failover events

The PR includes:

  • OpenTelemetry provider implementation
  • Telemetry trait integration with the gateway
  • Docker-based monitoring stack with:
    • OpenTelemetry Collector
    • Prometheus
    • Grafana dashboards
  • Documentation on monitoring setup and usage

This implementation creates a foundation for enhanced observability that will help detect and diagnose failover events and cluster availability issues in multi-region deployments.
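The two Phase 1 metrics described above can be sketched as a provider trait with an in-memory backing store. This is an illustrative stand-in, not the PR's actual API: the trait, struct, and method names here are assumptions, and a real implementation would record into OpenTelemetry instruments instead of hash maps.

```rust
use std::collections::HashMap;
use std::sync::Mutex;

// Hypothetical trait mirroring the two Phase 1 metrics.
trait TelemetryProvider {
    /// Gauge semantics: 1 = available, 0 = unavailable, keyed by cluster_id.
    fn record_cluster_availability(&self, cluster_id: &str, available: bool);
    /// Counter semantics: incremented per request, keyed by (target_region, status).
    fn record_request_routed(&self, target_region: &str, status: &str);
}

/// In-memory stand-in for the OpenTelemetry-backed provider.
#[derive(Default)]
struct InMemoryTelemetry {
    availability: Mutex<HashMap<String, u64>>,
    routed: Mutex<HashMap<(String, String), u64>>,
}

impl TelemetryProvider for InMemoryTelemetry {
    fn record_cluster_availability(&self, cluster_id: &str, available: bool) {
        // Gauge: overwrite the last observed value for this cluster.
        self.availability
            .lock()
            .unwrap()
            .insert(cluster_id.to_string(), available as u64);
    }

    fn record_request_routed(&self, target_region: &str, status: &str) {
        // Counter: monotonically increment per (region, status) pair.
        *self
            .routed
            .lock()
            .unwrap()
            .entry((target_region.to_string(), status.to_string()))
            .or_insert(0) += 1;
    }
}
```

The gauge-vs-counter distinction matters for the dashboards: availability is a point-in-time value that can go back down, while routing counts only ever increase and are meant to be rate()-ed in Prometheus.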

@AndrewKhoma AndrewKhoma left a comment

can we sync offline on what we're trying to achieve with the otel for gw monitoring, please?


let telemetry_clone = telemetry_provider.clone();

// Initialize the cluster monitor with the primary pool

nit: it's the system pool, used for system requests such as getting the GUCs from the backend

run_server(service_context, certificate_options, None, token.clone(), None)
let telemetry_provider = Arc::new(documentdb_gateway::open_telemetry_provider::OpenTelemetryProvider::new());

let telemetry_clone = telemetry_provider.clone();

let's use Arc::clone instead of the postfix notation to show that it's not a heavy-weight clone, but rather a refcount increase
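The convention the reviewer is asking for can be shown in a small sketch (the `provider` value here is a placeholder, not the PR's actual type):

```rust
use std::sync::Arc;

/// Returns the strong count after taking two extra handles to `provider`.
fn demo_arc_clone() -> usize {
    let provider = Arc::new(String::from("telemetry"));
    // Postfix `.clone()` compiles, but reads like a deep copy of the data.
    let _a = provider.clone();
    // `Arc::clone(&x)` makes it explicit that only the refcount is bumped;
    // the underlying value is never duplicated.
    let _b = Arc::clone(&provider);
    Arc::strong_count(&provider)
}
```

Both forms are semantically identical; the fully qualified form is simply the idiom Clippy's `clone_on_ref_ptr` lint encourages for readability.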

@@ -85,7 +85,21 @@ async fn main() {
.await
.unwrap();

run_server(service_context, certificate_options, None, token.clone(), None)
let telemetry_provider = Arc::new(documentdb_gateway::open_telemetry_provider::OpenTelemetryProvider::new());

why is it Arc in the first place? i thought the telemetry provider is owned by the service context, or is my mental model off?

use crate::postgres::Pool;
use crate::telemetry::TelemetryProvider;

/// ClusterMonitor periodically checks the availability of primary and secondary clusters

can you specify which primary and secondary clusters we are connecting to? in the current layout we only have 1 postgres instance per 1 instance of gw on the host machine, so I'm confused to see the secondary notation here

pub struct ClusterMonitor {
primary_pool: Arc<Pool>,
secondary_pool: Option<Arc<Pool>>,
telemetry: Arc<dyn TelemetryProvider>,

any chance of having it as a generic instead of a trait object? since we already know the concrete type of the telemetry provider in main.rs at compile time
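The suggested change amounts to swapping `Arc<dyn TelemetryProvider>` for a type parameter. A minimal sketch, with all names (`ClusterMonitor`, `OpenTelemetryProvider`, the trait method) illustrative rather than taken from the PR:

```rust
trait TelemetryProvider {
    fn record_availability(&self, cluster: &str, up: bool) -> (String, bool);
}

struct OpenTelemetryProvider;

impl TelemetryProvider for OpenTelemetryProvider {
    fn record_availability(&self, cluster: &str, up: bool) -> (String, bool) {
        // Stand-in body; a real provider would emit a gauge sample here.
        (cluster.to_string(), up)
    }
}

// Generic over the provider type: since main.rs knows the concrete type at
// compile time, calls are statically dispatched (and inlinable), and no
// `Arc<dyn ...>` indirection or vtable is needed.
struct ClusterMonitor<T: TelemetryProvider> {
    telemetry: T,
}

impl<T: TelemetryProvider> ClusterMonitor<T> {
    fn report(&self, cluster: &str, up: bool) -> (String, bool) {
        self.telemetry.record_availability(cluster, up)
    }
}
```

The trait-object form only becomes necessary if the provider must be chosen at runtime or stored alongside other provider types.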

/// Check if a specific connection pool is available by attempting to get a connection
async fn check_pool_availability(&self, pool: &Pool, cluster_name: &str) -> bool {
match pool.get().await {
Ok(mut client) => {

let's not name a pg connection a client; let's name it connection instead, since we'll have an actual data client soon

pub fn new() -> Self {
// Initialize the global meter provider if not already done
let meter_provider = METER_PROVIDER.get_or_init(|| {
let endpoint = env::var("OTEL_EXPORTER_OTLP_ENDPOINT")

can we capture this env variable through the config file?
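One common way to honor this request while keeping the env var as a fallback is a small resolution helper. This is a sketch under assumed precedence rules (config first, then env, then a default); the function name and the `4317` default port are illustrative, though `OTEL_EXPORTER_OTLP_ENDPOINT` itself is the standard OTLP exporter variable:

```rust
use std::env;

/// Resolve the OTLP endpoint: explicit config value wins, then the
/// OTEL_EXPORTER_OTLP_ENDPOINT env var, then a local default.
fn resolve_otlp_endpoint(config_value: Option<&str>) -> String {
    config_value
        .map(str::to_string)
        .or_else(|| env::var("OTEL_EXPORTER_OTLP_ENDPOINT").ok())
        .unwrap_or_else(|| "http://localhost:4317".to_string())
}
```

This keeps deployments that already set the env var working while letting the gateway's config file become the primary source of truth.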

use opentelemetry_otlp::{WithExportConfig};

// Global meter provider for OpenTelemetry
static METER_PROVIDER: OnceCell<Arc<opentelemetry_sdk::metrics::SdkMeterProvider>> = OnceCell::new();

what are we trying to achieve here with Arc? since this smart pointer is basically the owner of this type, and we're making the only instance of it a singleton, what's the logic behind refcounting in an async context for this one?

and let's not use OnceCell; if it should be singly owned, then let's move it to the ServiceContext and use it from there. I'm not a fan of this library and the way it lets you redefine the value it holds
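The ownership move the reviewer suggests would look roughly like this. A minimal sketch, assuming simplified `ServiceContext` and `MeterProvider` shapes (the real types carry far more state); the point is that the provider lives in exactly one place with no global, no `static`, and no cell type at all:

```rust
/// Stand-in for the SdkMeterProvider; holds only what the sketch needs.
struct MeterProvider {
    endpoint: String,
}

/// The service context owns the meter provider outright, so its lifetime
/// is tied to the context rather than to a process-wide singleton.
struct ServiceContext {
    meter_provider: MeterProvider,
}

impl ServiceContext {
    fn new(endpoint: &str) -> Self {
        Self {
            meter_provider: MeterProvider {
                endpoint: endpoint.to_string(),
            },
        }
    }

    fn meter_endpoint(&self) -> &str {
        &self.meter_provider.endpoint
    }
}
```

With single ownership established at construction time, there is nothing left to "redefine" later, which is the property the reviewer is after.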


let traffic_counter = meter
.u64_counter("docdb_gateway_request_routing")
.with_description("Counts requests routed to primary or secondary regions")

but gw is not in charge of regional failover, right? so maybe let's use this meter to just count the number of requests that gw served?
