Add OpenTelemetry Monitoring to DocumentDB Gateway (Phase 1) #195
Conversation
Can we sync offline on what we're trying to achieve with the OTel gateway monitoring, please?
// Initialize the cluster monitor with the primary pool
Nit: it's a system pool, for system requests such as getting the GUCs from the backend.
run_server(service_context, certificate_options, None, token.clone(), None)
let telemetry_provider = Arc::new(documentdb_gateway::open_telemetry_provider::OpenTelemetryProvider::new());

let telemetry_clone = telemetry_provider.clone();
Let's use Arc::clone instead of the postfix method syntax, to show that it's not a heavyweight clone but rather a refcount increase.
@@ -85,7 +85,21 @@ async fn main() {
    .await
    .unwrap();

run_server(service_context, certificate_options, None, token.clone(), None)
let telemetry_provider = Arc::new(documentdb_gateway::open_telemetry_provider::OpenTelemetryProvider::new());
Why is it Arc in the first place? I thought the telemetry provider is owned by the service context, or is my mental model off?
use crate::postgres::Pool;
use crate::telemetry::TelemetryProvider;

/// ClusterMonitor periodically checks the availability of primary and secondary clusters
Can you specify which primary and secondary clusters we're connecting to? In the current layout we only have one Postgres instance per gateway instance on the host machine, so I'm confused to see the secondary notation here.
pub struct ClusterMonitor {
    primary_pool: Arc<Pool>,
    secondary_pool: Option<Arc<Pool>>,
    telemetry: Arc<dyn TelemetryProvider>,
Any chance of having it as a generic instead of a trait object? We know the concrete type of the telemetry provider in main.rs at compile time anyway.
/// Check if a specific connection pool is available by attempting to get a connection
async fn check_pool_availability(&self, pool: &Pool, cluster_name: &str) -> bool {
    match pool.get().await {
        Ok(mut client) => {
Let's not name a pg connection a client; let's name it connection instead, since we'll have an actual data client soon.
pub fn new() -> Self {
    // Initialize the global meter provider if not already done
    let meter_provider = METER_PROVIDER.get_or_init(|| {
        let endpoint = env::var("OTEL_EXPORTER_OTLP_ENDPOINT")
can we capture this env variable through the config file?
use opentelemetry_otlp::{WithExportConfig};

// Global meter provider for OpenTelemetry
static METER_PROVIDER: OnceCell<Arc<opentelemetry_sdk::metrics::SdkMeterProvider>> = OnceCell::new();
What are we trying to achieve here with Arc? Since this smart pointer is basically the owner of this type, and we're making the only instance of it a singleton, what's the logic behind ref counting in an async context for this one?
And let's not use OnceCell. If it should be single-owned, then let's move it to the ServiceContext and use it from there. I'm not a fan of this library and the way it lets you redefine the value it holds.
let traffic_counter = meter
    .u64_counter("docdb_gateway_request_routing")
    .with_description("Counts requests routed to primary or secondary regions")
But the gateway is not in charge of regional failover, right? So maybe let's use this meter to just count the number of requests the gateway served?
This PR implements Phase 1 of OpenTelemetry instrumentation for the DocumentDB Gateway. The implementation focuses on two critical metrics that provide immediate operational visibility:

- Cluster Availability Tracking, with cluster_id labels
- Request Routing Monitoring, with target_region and status labels

The PR includes:

This implementation creates a foundation for enhanced observability that will help detect and diagnose failover events and cluster availability issues in multi-region deployments.