317 Refactor Observability Stack with OpenTelemetry#371

Merged
ginaxu1 merged 1 commit into main from 317-part2-connect
Dec 23, 2025
Collaborator

@ginaxu1 ginaxu1 commented Dec 5, 2025

Summary

This PR connects Go services to the observability stack by implementing OpenTelemetry-based metrics, which lets Prometheus scrape these services and Grafana display the results in dashboards. Because the instrumentation is vendor-agnostic, the backend can be switched between Prometheus (the default for local dev), Datadog, New Relic, or any OTLP-compatible backend through environment variables alone, with no code changes.

All services now expose the following Prometheus metrics:

  • http_requests_total{http_method, http_route, http_status_code} - Total HTTP request count by method, route, and status code
  • http_request_duration_seconds{http_method, http_route} - HTTP request latency histogram by method and route
  • external_calls_total{external_target, external_operation} - External service call metrics (exchange services)
  • business_events_total{business_action, business_outcome} - Business event metrics (exchange services)

Why these changes are needed:

  • The observability stack (Prometheus + Grafana) is configured but cannot collect data without service instrumentation
  • OpenTelemetry provides vendor-agnostic instrumentation, allowing teams to choose their observability backend (Prometheus, Datadog, New Relic, etc.) without code changes

Type of Change

  • New feature (non-breaking change which adds functionality)
  • Bug fix (non-breaking change which fixes an issue)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation update
  • Refactoring (no functional changes)
  • Performance improvement
  • Other (please describe):

Architectural Changes

  • OpenTelemetry Integration: All services now use OpenTelemetry SDK for vendor-agnostic metrics collection
  • Shared Monitoring Package: Created reusable monitoring package in exchange/shared/monitoring/ for all exchange services
  • Backward-Compatible API: The old API (monitoring.Handler(), monitoring.HTTPMetricsMiddleware()) still works and now delegates to OpenTelemetry
  • Auto-Initialization: Metrics initialize automatically when first used (no explicit initialization needed)
  • Non-Breaking Integration: Metrics are added without modifying existing handler logic
  • Vendor-Agnostic: Switch between Prometheus, Datadog, New Relic via environment variables (no code changes)

Testing

  • I have tested this change locally
  • I have added unit tests for new functionality
  • I have tested edge cases
  • All existing tests pass

Test Results

Runtime Testing

To verify the observability stack is working:

  1. Start observability stack:

    cd observability
    ./start-grafana.sh  # or: docker compose up -d
  2. Check Prometheus targets:

  3. Generate sample traffic:

    cd observability
    ./generate_sample_traffic.sh
  4. View metrics in Grafana:

Checklist

  • My code follows the project's style guidelines
  • I have performed a self-review of my code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have checked that there are no merge conflicts
  • I have verified all services are on opendif-network
  • I have verified Prometheus can scrape all service endpoints

Related Issues

Deployment Notes

Pre-Deployment Checklist

  1. Service Restart Required: Services must be restarted to load the new monitoring code

    # Stop existing services
    # Rebuild: go build .
    # Restart services
  2. Network Setup: Ensure opendif-network exists before starting services

    docker network create opendif-network  # if it doesn't exist
    # Or use: cd observability && ./start-grafana.sh
  3. No Configuration Changes Required: Services use Prometheus exporter by default (no env vars needed for local dev)

  4. Prometheus Already Configured: Prometheus is already configured to scrape these services (see observability/prometheus/prometheus.yml)

  5. Grafana Dashboard Ready: Grafana dashboard is already configured to display these metrics

Environment Variables (Optional)

For local development, no environment variables are needed (Prometheus is default).

To switch to other backends (Datadog, New Relic, etc.), set:

export OTEL_METRICS_EXPORTER=otlp
export OTEL_EXPORTER_OTLP_ENDPOINT=<your-endpoint>
export OTEL_EXPORTER_OTLP_HEADERS="<your-headers>"
export SERVICE_NAME=<service-name>
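
To make the environment-driven switch concrete, the following sketch shows one way a service might resolve these variables at startup. It is standard-library Go only. The variable names come from the block above, while exporterConfig, loadExporterConfig, and the decision to reject an OTLP configuration without an endpoint are assumptions, not the PR's confirmed behavior.

```go
package main

import (
	"errors"
	"fmt"
	"os"
)

// exporterConfig holds the settings read from the environment (hypothetical type).
type exporterConfig struct {
	Exporter string // "prometheus" (default) or "otlp"
	Endpoint string // only meaningful for "otlp"
	Headers  string // raw header string, passed through unparsed
	Service  string
}

// loadExporterConfig reads the variables through getenv so it can be
// tested without touching the real process environment.
func loadExporterConfig(getenv func(string) string) (exporterConfig, error) {
	cfg := exporterConfig{
		Exporter: getenv("OTEL_METRICS_EXPORTER"),
		Endpoint: getenv("OTEL_EXPORTER_OTLP_ENDPOINT"),
		Headers:  getenv("OTEL_EXPORTER_OTLP_HEADERS"),
		Service:  getenv("SERVICE_NAME"),
	}
	if cfg.Exporter == "" {
		cfg.Exporter = "prometheus" // local-dev default described above
	}
	switch cfg.Exporter {
	case "prometheus":
		return cfg, nil
	case "otlp":
		if cfg.Endpoint == "" {
			// Assumption: fail fast instead of silently exporting nowhere.
			return cfg, errors.New("OTEL_EXPORTER_OTLP_ENDPOINT is required for otlp")
		}
		return cfg, nil
	default:
		return cfg, fmt.Errorf("unsupported OTEL_METRICS_EXPORTER %q", cfg.Exporter)
	}
}

func main() {
	cfg, err := loadExporterConfig(os.Getenv)
	if err != nil {
		fmt.Println("config error:", err)
		return
	}
	fmt.Printf("exporter=%s service=%s\n", cfg.Exporter, cfg.Service)
}
```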

Post-Deployment Verification

  1. Check Metrics Endpoints:

    curl http://localhost:4000/metrics | grep http_requests_total
    curl http://localhost:8082/metrics | grep http_requests_total
    curl http://localhost:3000/metrics | grep http_requests_total
  2. Verify Prometheus Scraping:

  3. View in Grafana:

  4. Generate Sample Traffic:

    cd observability
    ./generate_sample_traffic.sh
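
The curl and grep check in step 1 can also be scripted. The sketch below is a standard-library Go version that looks for a metric name in Prometheus text exposition output. The helper names (hasMetric, hasMetricInText, checkEndpoint) are hypothetical, and the sample text in main is made up for illustration rather than captured from a running service.

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"net/http"
	"strings"
)

// hasMetric reports whether any sample line for the exact metric name
// appears in r. Comment lines (# HELP, # TYPE) are ignored.
func hasMetric(r io.Reader, name string) bool {
	sc := bufio.NewScanner(r)
	for sc.Scan() {
		line := sc.Text()
		if strings.HasPrefix(line, "#") || !strings.HasPrefix(line, name) {
			continue
		}
		rest := line[len(name):]
		// The name must be followed by a label set or a value, so that
		// "foo_total" does not match "foo_total_extra".
		if rest != "" && (rest[0] == '{' || rest[0] == ' ') {
			return true
		}
	}
	return false
}

// hasMetricInText is a convenience wrapper for in-memory text.
func hasMetricInText(text, name string) bool {
	return hasMetric(strings.NewReader(text), name)
}

// checkEndpoint fetches a /metrics URL and looks for the metric.
func checkEndpoint(url, name string) (bool, error) {
	resp, err := http.Get(url)
	if err != nil {
		return false, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return false, fmt.Errorf("unexpected status %s", resp.Status)
	}
	return hasMetric(resp.Body, name), nil
}

func main() {
	// Illustrative sample only; real output will differ.
	sample := "# TYPE http_requests_total counter\n" +
		"http_requests_total{http_method=\"GET\",http_route=\"/health\",http_status_code=\"200\"} 7\n"
	fmt.Println(hasMetricInText(sample, "http_requests_total"))
}
```

Against a running service, checkEndpoint("http://localhost:4000/metrics", "http_requests_total") would perform the same check as the first curl command above.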

Migration Notes

  • Backward Compatible: The API remains the same - existing code continues to work
  • Auto-initialization: Metrics initialize automatically when first used
  • No code changes required: Services using monitoring.Handler() or monitoring.HTTPMetricsMiddleware() work without changes
  • OpenTelemetry Under the Hood: The Prometheus client is now an indirect dependency, pulled in via the OpenTelemetry Prometheus exporter

Future Work

  • Add metrics instrumentation to Consent Engine and Audit Service
  • Add custom business metrics for specific use cases
  • Configure alerting rules in Prometheus
  • Add distributed tracing (OpenTelemetry traces)

@ginaxu1 ginaxu1 requested review from mushrafmim and sthanikan2000 and removed request for sthanikan2000 December 5, 2025 07:17
@ginaxu1 ginaxu1 changed the title 317 part2 connect 317 Observability connect with OE and PDP Dec 5, 2025
@ginaxu1 ginaxu1 changed the title 317 Observability connect with OE and PDP 317 Connect OE, PDP, Portal Backend to Observability Stack with OpenTelemetry Dec 8, 2025
@sthanikan2000 sthanikan2000 requested a review from Copilot December 8, 2025 07:55
Contributor

Copilot AI left a comment

Pull request overview

This PR integrates OpenTelemetry-based metrics collection into Portal Backend, Orchestration Engine, and Policy Decision Point services, enabling vendor-agnostic observability with support for Prometheus (default), Datadog, New Relic, and other OTLP-compatible backends.

Key Changes:

  • Created shared monitoring package (exchange/shared/monitoring/) with OpenTelemetry instrumentation
  • Added portal-backend middleware (v1/middleware/) for metrics collection
  • Configured Prometheus scraping and added sample traffic generation script

Reviewed changes

Copilot reviewed 20 out of 24 changed files in this pull request and generated 8 comments.

| File | Description |
| --- | --- |
| exchange/shared/monitoring/otel_metrics.go | Core OpenTelemetry metrics implementation for exchange services with support for Prometheus and OTLP exporters |
| exchange/shared/monitoring/metrics.go | Backward-compatible API wrapper with route normalization to prevent cardinality explosion |
| portal-backend/v1/middleware/otel_metrics.go | Portal-specific OpenTelemetry middleware implementation |
| portal-backend/main.go | Integrates metrics middleware and handler into portal-backend |
| exchange/policy-decision-point/main.go | Adds metrics instrumentation to PDP service |
| exchange/orchestration-engine/server/server.go | Adds metrics instrumentation to orchestration engine |
| observability/prometheus/prometheus.yml | Updates scrape configuration for instrumented services |
| observability/generate_sample_traffic.sh | Script to generate sample HTTP traffic for testing metrics |
| exchange/shared/monitoring/go.mod | New module definition with invalid Go version 1.24.6 |
| exchange/orchestration-engine/go.mod | Updated with monitoring dependency and invalid Go version 1.25.0 |
| portal-backend/go.mod | Updated with OpenTelemetry dependencies |
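
For readers unfamiliar with route normalization, the idea is to replace variable path segments with a placeholder so each distinct ID does not create a new http_route label value. The sketch below is a generic, standard-library illustration and not the logic in metrics.go. The function names and the ":id" placeholder are assumptions.

```go
package main

import (
	"fmt"
	"strings"
)

// isDigits reports whether s is a non-empty run of ASCII digits.
func isDigits(s string) bool {
	if s == "" {
		return false
	}
	for i := 0; i < len(s); i++ {
		if s[i] < '0' || s[i] > '9' {
			return false
		}
	}
	return true
}

// looksLikeUUID checks the 8-4-4-4-12 hexadecimal layout.
func looksLikeUUID(s string) bool {
	if len(s) != 36 {
		return false
	}
	for i := 0; i < len(s); i++ {
		c := s[i]
		switch i {
		case 8, 13, 18, 23:
			if c != '-' {
				return false
			}
		default:
			isHex := (c >= '0' && c <= '9') || (c >= 'a' && c <= 'f') || (c >= 'A' && c <= 'F')
			if !isHex {
				return false
			}
		}
	}
	return true
}

// normalizeRoute replaces ID-like segments with ":id" so the route
// label has a bounded number of distinct values.
func normalizeRoute(path string) string {
	segs := strings.Split(path, "/")
	for i, s := range segs {
		if isDigits(s) || looksLikeUUID(s) {
			segs[i] = ":id"
		}
	}
	return strings.Join(segs, "/")
}

func main() {
	fmt.Println(normalizeRoute("/api/v1/consents/42"))
	fmt.Println(normalizeRoute("/users/123e4567-e89b-12d3-a456-426614174000/profile"))
}
```

A heuristic like this misses opaque string IDs such as slugs or tokens, so a fallback bucket for unmatched routes is a common complement.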

@ginaxu1 ginaxu1 changed the title 317 Connect OE, PDP, Portal Backend to Observability Stack with OpenTelemetry 317 Connect OE, PDP to Observability Stack with OpenTelemetry Dec 8, 2025
@sthanikan2000 sthanikan2000 requested a review from Copilot December 8, 2025 12:25

@ginaxu1 ginaxu1 changed the title 317 Connect OE, PDP to Observability Stack with OpenTelemetry 317 Refactor Observability Stack with OpenTelemetry Dec 9, 2025
Member

@mushrafmim mushrafmim left a comment

Other than that, this is looking good. Since the consent-engine has been added, I will validate that it works properly and then approve the PR.

Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request is a significant and well-executed refactoring to introduce a vendor-agnostic observability stack using OpenTelemetry. The new shared monitoring package is well-structured, and the documentation updates are excellent and very thorough. My review focuses on a few key areas to further improve the robustness and security of the implementation. I've identified a critical security regression in the Nginx configuration, a high-risk issue with route normalization that could lead to metric cardinality explosion, and a medium-severity inconsistency in histogram bucket configuration. Addressing these points will make this already strong contribution even better.

@ginaxu1 ginaxu1 force-pushed the 317-part2-connect branch 2 times, most recently from eba3284 to 3825981 Compare December 19, 2025 06:45
@OpenDIF OpenDIF deleted a comment from sthanikan2000 Dec 23, 2025
@ginaxu1 ginaxu1 merged commit eef05e8 into main Dec 23, 2025
11 checks passed
@ginaxu1 ginaxu1 deleted the 317-part2-connect branch December 23, 2025 15:32