317 Refactor Observability Stack with OpenTelemetry #371
Pull request overview
This PR integrates OpenTelemetry-based metrics collection into Portal Backend, Orchestration Engine, and Policy Decision Point services, enabling vendor-agnostic observability with support for Prometheus (default), Datadog, New Relic, and other OTLP-compatible backends.
Key Changes:
- Created shared monitoring package (exchange/shared/monitoring/) with OpenTelemetry instrumentation
- Added portal-backend middleware (v1/middleware/) for metrics collection
- Configured Prometheus scraping and added sample traffic generation script
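To illustrate what a metrics-collection middleware has to do, here is a minimal sketch using only the standard library. It is not the PR's OpenTelemetry middleware: `requestCounter`, `countingMiddleware`, and the `/health` route are hypothetical names, and an in-memory map stands in for a real counter instrument. The point is the status-capturing wrapper, which is needed because `http.ResponseWriter` does not expose the status code after the handler returns.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"sync"
)

// statusRecorder wraps http.ResponseWriter and remembers the status code.
type statusRecorder struct {
	http.ResponseWriter
	status int
}

func (r *statusRecorder) WriteHeader(code int) {
	r.status = code
	r.ResponseWriter.WriteHeader(code)
}

// requestCounter is a hypothetical in-memory stand-in for a counter metric
// labelled by method, route, and status code.
type requestCounter struct {
	mu     sync.Mutex
	counts map[string]int
}

func newRequestCounter() *requestCounter {
	return &requestCounter{counts: make(map[string]int)}
}

func key(method, route string, status int) string {
	return fmt.Sprintf("%s %s %d", method, route, status)
}

func (c *requestCounter) inc(method, route string, status int) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.counts[key(method, route, status)]++
}

func (c *requestCounter) get(method, route string, status int) int {
	c.mu.Lock()
	defer c.mu.Unlock()
	return c.counts[key(method, route, status)]
}

// countingMiddleware records one count per request after the handler runs.
// The route label is passed in explicitly so raw URL paths never become labels.
func countingMiddleware(c *requestCounter, route string, next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		// Default to 200: handlers that never call WriteHeader implicitly send 200.
		rec := &statusRecorder{ResponseWriter: w, status: http.StatusOK}
		next.ServeHTTP(rec, r)
		c.inc(r.Method, route, rec.status)
	})
}

// demoCount sends one request through the middleware and returns the count.
func demoCount() int {
	counter := newRequestCounter()
	handler := countingMiddleware(counter, "/health",
		http.HandlerFunc(func(w http.ResponseWriter, _ *http.Request) {
			w.WriteHeader(http.StatusNoContent)
		}))
	handler.ServeHTTP(httptest.NewRecorder(), httptest.NewRequest(http.MethodGet, "/health", nil))
	return counter.get(http.MethodGet, "/health", http.StatusNoContent)
}

func main() {
	fmt.Println("GET /health 204 count:", demoCount())
}
```

In a real service the map would be replaced by the metrics library's counter, with the same three values passed as labels.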
Reviewed changes
Copilot reviewed 20 out of 24 changed files in this pull request and generated 8 comments.
| File | Description |
|---|---|
| exchange/shared/monitoring/otel_metrics.go | Core OpenTelemetry metrics implementation for exchange services with support for Prometheus and OTLP exporters |
| exchange/shared/monitoring/metrics.go | Backward-compatible API wrapper with route normalization to prevent cardinality explosion |
| portal-backend/v1/middleware/otel_metrics.go | Portal-specific OpenTelemetry middleware implementation |
| portal-backend/main.go | Integrates metrics middleware and handler into portal-backend |
| exchange/policy-decision-point/main.go | Adds metrics instrumentation to PDP service |
| exchange/orchestration-engine/server/server.go | Adds metrics instrumentation to orchestration engine |
| observability/prometheus/prometheus.yml | Updates scrape configuration for instrumented services |
| observability/generate_sample_traffic.sh | Script to generate sample HTTP traffic for testing metrics |
| exchange/shared/monitoring/go.mod | New module definition with invalid Go version 1.24.6 |
| exchange/orchestration-engine/go.mod | Updated with monitoring dependency and invalid Go version 1.25.0 |
| portal-backend/go.mod | Updated with OpenTelemetry dependencies |
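The route normalization mentioned for metrics.go can be sketched roughly as follows. This is not the PR's implementation; `normalizeRoute`, the `:id` placeholder, and the example paths are assumptions chosen for illustration. The idea is to collapse path segments that look like identifiers so that each distinct ID does not create a new `http_route` label value.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

var (
	// numericSegment matches path segments made only of digits, e.g. "42".
	numericSegment = regexp.MustCompile(`^[0-9]+$`)
	// uuidSegment matches canonical 8-4-4-4-12 hexadecimal UUIDs.
	uuidSegment = regexp.MustCompile(
		`^[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}$`)
)

// normalizeRoute strips the query string and replaces ID-like segments
// with the placeholder ":id" (a hypothetical convention).
func normalizeRoute(path string) string {
	if i := strings.IndexByte(path, '?'); i >= 0 {
		path = path[:i]
	}
	segments := strings.Split(path, "/")
	for i, s := range segments {
		if numericSegment.MatchString(s) || uuidSegment.MatchString(s) {
			segments[i] = ":id"
		}
	}
	return strings.Join(segments, "/")
}

func main() {
	for _, p := range []string{
		"/api/v1/members/42",
		"/consents/123e4567-e89b-12d3-a456-426614174000/status",
		"/health?probe=1",
	} {
		fmt.Println(p, "->", normalizeRoute(p))
	}
}
```

A pattern-based approach like this still lets through dynamic segments that are neither numeric nor UUIDs (slugs, emails, tokens), so cardinality stays unbounded unless unknown paths fall back to a fixed label or the router's matched route template is used instead.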
mushrafmim
left a comment
Other than that, this is looking good. Since the consent-engine has been added, I will validate that it works properly alongside these changes and then approve the PR.
Code Review
This pull request is a significant and well-executed refactoring to introduce a vendor-agnostic observability stack using OpenTelemetry. The new shared monitoring package is well-structured, and the documentation updates are excellent and very thorough. My review focuses on a few key areas to further improve the robustness and security of the implementation. I've identified a critical security regression in the Nginx configuration, a high-risk issue with route normalization that could lead to metric cardinality explosion, and a medium-severity inconsistency in histogram bucket configuration. Addressing these points will make this already strong contribution even better.
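On the histogram bucket point: latency histograms from different services can only be aggregated or compared cleanly when they share the same bucket boundaries. The sketch below shows how observations fall into cumulative, upper-bound-inclusive buckets in the way Prometheus exposes them. The `histogram` type is hypothetical and the boundaries in `sharedBuckets` are an assumption (they mirror the Prometheus Go client's default buckets, not necessarily the values used in this PR).

```go
package main

import (
	"fmt"
	"sort"
)

// sharedBuckets holds upper bounds in seconds. Assumed values for illustration;
// every service would need to use the same slice for consistent histograms.
var sharedBuckets = []float64{0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10}

// histogram is a hypothetical minimal histogram with one extra +Inf bucket.
type histogram struct {
	bounds []float64
	counts []uint64 // per-bucket (non-cumulative) counts; last entry is +Inf
}

func newHistogram(bounds []float64) *histogram {
	b := append([]float64(nil), bounds...) // copy so callers cannot mutate it
	sort.Float64s(b)
	return &histogram{bounds: b, counts: make([]uint64, len(b)+1)}
}

// observe places v in the first bucket whose upper bound is >= v.
func (h *histogram) observe(v float64) {
	// SearchFloat64s returns the smallest index i with bounds[i] >= v,
	// or len(bounds) when v exceeds every bound (the +Inf bucket).
	i := sort.SearchFloat64s(h.bounds, v)
	h.counts[i]++
}

// cumulative returns running totals, so each bucket includes all smaller ones.
func (h *histogram) cumulative() []uint64 {
	out := make([]uint64, len(h.counts))
	var running uint64
	for i, c := range h.counts {
		running += c
		out[i] = running
	}
	return out
}

func main() {
	h := newHistogram(sharedBuckets)
	for _, seconds := range []float64{0.003, 0.05, 0.3, 20} {
		h.observe(seconds)
	}
	fmt.Println(h.cumulative())
}
```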
Summary
This PR connects Go services to the observability stack by implementing OpenTelemetry-based metrics. This enables Prometheus to scrape metrics from these services and display them in Grafana dashboards. Services now use vendor-agnostic OpenTelemetry instrumentation to allow seamless switching between Prometheus (default for local dev), Datadog, New Relic, or any OTLP-compatible backend without changing code - just environment variables.
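A rough sketch of how exporter selection by environment variable might look is shown below. The variable names this PR actually reads are not shown in this description, so the sketch borrows `OTEL_METRICS_EXPORTER` from the OpenTelemetry specification as an assumption; `exporterKind` and `selectExporter` are hypothetical names. Following the description above, an unset variable falls back to Prometheus.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// exporterKind names a metrics backend family (hypothetical type).
type exporterKind string

const (
	exporterPrometheus exporterKind = "prometheus"
	exporterOTLP       exporterKind = "otlp"
)

// selectExporter picks an exporter from the environment. The lookup function
// is injected (os.LookupEnv in production) so the logic can be tested.
// Assumption: the variable is OTEL_METRICS_EXPORTER; the PR may use another.
func selectExporter(lookup func(string) (string, bool)) (exporterKind, error) {
	raw, ok := lookup("OTEL_METRICS_EXPORTER")
	value := strings.ToLower(strings.TrimSpace(raw))
	if !ok || value == "" {
		// No configuration: default to Prometheus for local development.
		return exporterPrometheus, nil
	}
	switch exporterKind(value) {
	case exporterPrometheus, exporterOTLP:
		return exporterKind(value), nil
	default:
		return "", fmt.Errorf("unsupported metrics exporter %q", raw)
	}
}

func main() {
	kind, err := selectExporter(os.LookupEnv)
	if err != nil {
		fmt.Println("configuration error:", err)
		return
	}
	fmt.Println("metrics exporter:", kind)
}
```

Failing on an unknown value, instead of silently falling back, makes a mistyped backend name visible at startup.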
All services now expose the following Prometheus metrics:
- http_requests_total{http_method, http_route, http_status_code} - Total HTTP request count by method, route, and status code
- http_request_duration_seconds{http_method, http_route} - HTTP request latency histogram by method and route
- external_calls_total{external_target, external_operation} - External service call metrics (exchange services)
- business_events_total{business_action, business_outcome} - Business event metrics (exchange services)
Why these changes are needed:
Type of Change
Architectural Changes
- Shared monitoring package exchange/shared/monitoring/ for all exchange services
- Existing API (monitoring.Handler(), monitoring.HTTPMetricsMiddleware()) still works and delegates to OpenTelemetry
Testing
Test Results
Runtime Testing
To verify the observability stack is working:
Start observability stack:
Check Prometheus targets:
Generate sample traffic:
    cd observability
    ./generate_sample_traffic.sh
View metrics in Grafana:
Checklist
opendif-network
Related Issues
Deployment Notes
Pre-Deployment Checklist
- Service Restart Required: Services must be restarted to load the new monitoring code
- Network Setup: Ensure opendif-network exists before starting services
- No Configuration Changes Required: Services use Prometheus exporter by default (no env vars needed for local dev)
- Prometheus Already Configured: Prometheus is already configured to scrape these services (see observability/prometheus/prometheus.yml)
- Grafana Dashboard Ready: Grafana dashboard is already configured to display these metrics
Environment Variables (Optional)
For local development, no environment variables are needed (Prometheus is default).
To switch to other backends (Datadog, New Relic, etc.), set:
Post-Deployment Verification
Check Metrics Endpoints:
Verify Prometheus Scraping:
View in Grafana:
Generate Sample Traffic:
    cd observability
    ./generate_sample_traffic.sh
Migration Notes
Existing code using monitoring.Handler() or monitoring.HTTPMetricsMiddleware() works without changes
Future Work