Skip to content

Add OLAP metrics via OpenTelemetry and improve error handling#618

Merged
jfrench9 merged 4 commits intomainfrom
feature/olap-otel
Apr 15, 2026
Merged

Add OLAP metrics via OpenTelemetry and improve error handling#618
jfrench9 merged 4 commits intomainfrom
feature/olap-otel

Conversation

@jfrench9
Copy link
Copy Markdown
Member

Summary

This PR integrates OpenTelemetry (OTel) observability into the OLAP extensions layer, adding metrics instrumentation and improving error handling across the middleware and router extensions. A new application entrypoint (main.py) is introduced alongside dependency and configuration updates.

Key Accomplishments

Observability & Metrics

  • Integrated OpenTelemetry instrumentation for OLAP-related extensions, enabling metrics collection and tracing across the middleware pipeline
  • Enhanced the extensions.py middleware with metrics hooks and improved error handling to provide better observability into request lifecycles

Extension Enhancements

  • RoboInvestor: Added new operations with structured metric reporting
  • RoboLedger: Significantly expanded operations (~94 lines added) and updated views to support new functionality
  • Updated GraphQL schema to accommodate new extension capabilities

Application Infrastructure

  • Added main.py (228 lines) as a new application entrypoint, centralizing startup configuration and OTel bootstrapping
  • Updated pyproject.toml with revised dependencies (including OTel packages) and project configuration adjustments
  • Updated uv.lock to reflect dependency changes

Documentation

  • Minor README update to reflect new capabilities

Breaking Changes

  • Dependency changes: Updated and new dependencies in pyproject.toml may require a fresh environment setup (uv.lock updated accordingly)
  • Middleware behavior: The extensions middleware now includes metrics and enhanced error handling, which may change error propagation semantics — downstream consumers should verify compatibility
  • GraphQL schema modifications: Schema changes in robosystems/graphql/schema.py may affect existing GraphQL clients

Testing Notes

  • Existing middleware extension tests have been updated (tests/middleware/test_extensions.py) to reflect the new metrics and error handling behavior
  • Verify that OpenTelemetry collector/exporter configuration is properly set up in target environments to receive emitted metrics and traces
  • Validate that RoboLedger and RoboInvestor operations behave correctly under both success and failure scenarios, ensuring metrics are emitted as expected

Infrastructure Considerations

  • An OpenTelemetry-compatible collector or backend must be available in deployed environments to ingest the new metrics and trace data
  • Environment variables or configuration for OTel exporters (endpoint, sampling rate, etc.) should be provisioned in all target deployment environments
  • The new entrypoint may require updates to deployment manifests, container definitions, or process managers

🤖 Generated with Claude Code

Branch Info:

  • Source: feature/olap-otel
  • Target: main
  • Type: feature

Co-Authored-By: Claude noreply@anthropic.com

- Added middleware for request-level metrics on the GraphQL endpoint, enabling Prometheus integration for monitoring GraphQL operations.
- Introduced a standardized error response schema for extensions operations, ensuring consistent error handling across various endpoints.
- Updated the `OperationEnvelope` model to include `created_by` for better audit correlation.
- Enhanced multiple operations in RoboInvestor and RoboLedger with metrics decorators to track business events.
- Revised dependencies in `pyproject.toml` and `uv.lock` for compatibility and performance improvements.
- Updated README.md to include new demo commands and clarify usage.
@jfrench9
Copy link
Copy Markdown
Member Author

@claude please review this PR

@claude
Copy link
Copy Markdown

claude Bot commented Apr 15, 2026

Claude finished @jfrench9's task in 3m 3s —— View job


PR Review: Add OLAP metrics via OpenTelemetry and improve error handling

Tasks:

  • Gather context and understand the request
  • Read changed files
  • Analyze middleware/extensions.py changes
  • Analyze main.py (new entrypoint)
  • Analyze router changes (roboledger + roboinvestor)
  • Analyze schema.py changes
  • Analyze pyproject.toml dependency changes
  • Analyze test changes
  • Post comprehensive review

Overall Assessment

This is a well-structured PR that adds OTel observability to the extensions layer. The created_by audit-provenance field, the OpenTelemetryExtensionSync GraphQL integration, and the cardinality-bounded endpoint labels are all solid additions. A few issues need attention before merge — one is a metrics correctness bug that would quietly produce useless dimensions.


Issues

Bug: endpoint_metrics_decorator always emits method="UNKNOWN" on every decorated endpoint

robosystems/routers/extensions/roboledger/operations.py, roboinvestor/operations.py, views.py

The decorator scans *args and **kwargs for a FastAPI.Request object, but FastAPI injects route handler parameters as typed keyword arguments (body, graph_id, user, cache, etc.) — never as a raw Request unless the handler explicitly declares request: Request. So request_obj is always None, method stays "UNKNOWN", and user_id goes unextracted.

# middleware/otel/metrics.py:876-890 — request_obj will always be None here
for arg in args:
    if isinstance(arg, Request):        # never true
        ...
for key, value in kwargs.items():
    if isinstance(value, Request):      # never true for these handlers
        ...

Since all decorated operations are POST-only, the quickest fix is a method parameter on the decorator (already used for the fixed label):

@endpoint_metrics_decorator(
    "/extensions/roboledger/{graph_id}/operations/update-entity",
    method="POST",                        # new param, avoids runtime scan
    business_event_type="ledger_update_entity",
)

Or add request: Request to each route handler — but that's more invasive across 23 endpoints.

Fix this →


Bug: response.body check in security_headers_middleware is always False

main.py:379-383

if path.startswith("/v1/") and any(
    keyword in str(response.body).lower() if hasattr(response, "body") else False
    for keyword in ["token", "password", "key"]
):
    response.headers["Cache-Control"] = "no-store, ..."

call_next() in Starlette middleware returns a StreamingResponse. Streaming responses do not expose a .body attribute — hasattr(response, "body") is False for virtually every real response. The Cache-Control sensitive-data header is therefore never set, even though the code appears to provide that protection. Either remove this dead branch or replace it with a route-prefix-only check (e.g., always set no-store for /v1/auth and /v1/billing).

Fix this →


Minor: import time inside middleware body

main.py:266

async def extensions_graphql_metrics_middleware(request: Request, call_next):
    path = request.url.path
    if not (...):
        return await call_next(request)

    import time        # ← deferred import inside hot path
    from robosystems.middleware.otel.metrics import (...)

Python caches module imports so this is not a correctness bug, but the deferred imports are unusual and could confuse readers. Move them to the top of main.py. The OTel imports from the conditional fast-exit path can instead be guarded by a module-level try/except ImportError.


Design: Business events fire on idempotent cache replays

middleware/otel/metrics.py:904-916 / middleware/extensions.py

When execute_operation returns a cached envelope (idempotent replay), it returns early — on_fresh_success is skipped and audit status is idempotent_replay=True. But the endpoint_metrics_decorator wrapping the route function fires the business_event_type counter regardless of whether the response came from cache:

# decorator fires this even for replayed operations
if business_event_type:
    metrics_instance.record_business_event(...)

This inflates business event counts for retried clients. Consider suppressing the business event when execute_operation returns a status="completed" envelope that has its operation_id matching a cached one — or expose an idempotent_replay flag through the envelope for the decorator to check.


Minor: status_code type inconsistency between record_request and record_request_duration

middleware/otel/metrics.py:343-355

# record_request: status_code converted to str ✓
"status_code": str(status_code),

# record_request_duration: status_code left as int ✗
"status_code": status_code,        # line 350

OTel attribute types must be consistent across recordings of the same instrument. If both methods record to _request_duration, having status_code as int in one path and str in another will cause attribute key collisions or silent dimension splits in Prometheus.


Positives

  • created_by in OperationEnvelope — clean, well-motivated addition with proper alias (createdBy), full coverage in the test matrix, and correct propagation through execute_operation. The docstring clearly explains the legacy-caller None case.
  • OpenTelemetryExtensionSync in schema.py — the comment explaining why the Sync variant is required (tests use execute_sync) is excellent, and the reasoning about async vs sync execution overhead is correct.
  • GraphQL metrics middleware cardinality design — normalizing to "/extensions/{graph_id}/graphql" rather than the actual path keeps Prometheus series count bounded. The graph_id appears only on the business event (event_graph_id), which is the right tradeoff.
  • Exception re-raise pattern in endpoint_metrics_decorator — error metrics are recorded before the exception propagates, so failures are visible in OTel even when the caller gets a 4xx/5xx.
  • app.state.current_time — consider renaming to startup_time or started_at to avoid the implication that it updates.
  • @app.on_event("startup") / @app.on_event("shutdown") — these are deprecated in recent FastAPI in favor of the lifespan async context manager. Not blocking, but worth flagging for a follow-up.

- Improved cache control for sensitive API responses by implementing a path-prefix allowlist, ensuring sensitive data is not cached.
- Updated the `OperationEnvelope` model to include an `idempotent_replay` attribute, allowing clients to distinguish between fresh executions and cached responses.
- Enhanced the `endpoint_metrics_decorator` to support explicit HTTP method labels, improving metrics accuracy for endpoints without a `Request` parameter.
- Added tests to verify the correct behavior of the new `idempotent_replay` feature and the explicit method handling in metrics recording.
- Introduced a new `lifespan` async context manager to handle startup and shutdown events, replacing deprecated event handlers.
- Added validation for environment configuration during startup, with logging for successful or failed validation.
- Initialized query queue executor and Redis SSE event subscriber during startup, with error handling for each step.
- Updated the `is_relaxed_csp_path` function to simplify its docstring and improve clarity.
- Enhanced custom documentation routes for Swagger and ReDoc with dark theme support.
@jfrench9
Copy link
Copy Markdown
Member Author

@claude can you review the latest changes

@claude
Copy link
Copy Markdown

claude Bot commented Apr 15, 2026

Claude finished @jfrench9's task in 3m 46s —— View job


PR Review: Add OLAP metrics via OpenTelemetry and improve error handling

Tasks:

  • Gather context and understand the request
  • Check latest commits and diff vs origin/main
  • Read changed files (middleware, routers, main.py, tests)
  • Analyze changes and identify issues
  • Post comprehensive review

Overall Assessment

The three commits since the previous review have addressed every issue from the earlier round — the "UNKNOWN" method bug, the StreamingResponse.body false security guard, the status_code type inconsistency, and the deprecated @app.on_event handlers. The new idempotent_replay / created_by machinery is well-designed, and the test coverage for both additions is solid. One bug in the async-dispatch path and a few minor items still need attention.


Previous Issues — Status

Issue Status
endpoint_metrics_decorator emits method="UNKNOWN" ✅ Fixed — method= param added; all 23 endpoints decorated with method="POST"
response.body check never fires on StreamingResponse ✅ Fixed — replaced with path-prefix allowlist
status_code type inconsistency (int vs str) ✅ Fixed — record_request_duration now uses str(status_code)
@app.on_event deprecated ✅ Fixed — migrated to lifespan async context manager
Business events fire on idempotent cache replays ✅ Fixed — idempotent_replay flag; decorator suppresses counter on replays
Deferred import time inside middleware hot path ✅ Fixed — moved to module level

Issues

Bug: auto_map_elements_op idempotency replay path doesn't set idempotent_replay=True

robosystems/routers/extensions/roboledger/operations.py:1111-1122

auto_map_elements_op is the only operation that handles its own idempotency cache manually (it's async/Dagster-dispatched and can't go through execute_operation). The cached-hit branch returns the envelope directly without applying the idempotent_replay=True copy that execute_operation correctly applies at extensions.py:582:

# execute_operation does this correctly:
cached = cached.model_copy(update={"idempotent_replay": True})

# auto_map_elements_op does NOT:
if cached is not None:
    log_operation_audit(..., idempotent_replay=True)  # ← audit log ✓
    return cached  # ← envelope.idempotent_replay stays False ✗

Consequences:

  1. The endpoint_metrics_decorator wrapping this route reads getattr(result, "idempotent_replay", False) — it stays False, so ledger_auto_map_elements business events are not suppressed on replays, inflating the counter.
  2. Clients that inspect the envelope to distinguish "new enqueue" from "retry replay" get a False even though the task was not re-enqueued.

Fix is one line — mirror what execute_operation does:

if cached is not None:
    log_operation_audit(...)
    return cached.model_copy(update={"idempotent_replay": True})  # add this

Fix this →


Minor: _sensitive_path_prefixes tuple allocated on every request

main.py:322

async def security_headers_middleware(request: Request, call_next):
    ...
    _sensitive_path_prefixes = ("/v1/auth", "/v1/user", "/v1/billing", "/v1/orgs")  # ← per request

This tuple is re-created on every HTTP request. Move it to module level (or inside create_app as a closure constant) so it's allocated once:

_SENSITIVE_PATH_PREFIXES = ("/v1/auth", "/v1/user", "/v1/billing", "/v1/orgs")

Flag: Major dependency version jumps — verify numpy 2.x compatibility

pyproject.toml

Several version constraints moved significantly:

Package Before After
numpy >=1.26.0,<3.0 >=2.4.0,<3.0
strawberry-graphql >=0.240.0,<1.0 >=0.314.0,<1.0
fastapi >=0.116.0,<1.0 >=0.135.0,<1.0
dagster >=1.9.0,<2.0 >=1.13.0,<2.0
pyarrow >=20.0.0 >=23.0.0

numpy 2.x dropped the legacy np.bool, np.int, np.float, np.complex aliases and changed several dtype behaviors — if any adapter or analytics code uses those, it will fail at runtime. The arelle-release==2.39.5 pin (XBRL processor) and onnxruntime transitive dep (via fastembed) both use numpy internals; worth verifying those are tested under the new constraint.


Positives

  • lifespan migration — the async context manager form is clean; error paths in startup correctly distinguish prod (fail fast) vs. dev (warn and continue). The SSE subscriber stop is guarded so shutdown doesn't crash on startup-failure scenarios.

  • idempotent_replay + created_by in OperationEnvelope — the model_copy approach to avoid mutating the cached instance is thoughtful (comment at extensions.py:579 explains the future-proofing rationale well). Both fields have correct alias for camelCase wire format, and the test matrix in test_extensions.py covers the alias round-trip explicitly.

  • auto_map_elements_op created_by propagation — the only operation that manually calls wrap_pending correctly passes created_by=user_id so its envelopes carry the same provenance as dispatcher-routed ones.

  • OpenAPI enrichment — injecting OperationError, shared error responses, Idempotency-Key header, and GraphQL auth notes into every matching path is the right place for it (keeps router files thin). The _is_operation_path / _is_graphql_path helpers make the dispatch readable, and using setdefault avoids overwriting per-operation overrides.

  • OpenTelemetryExtensionSync in schema — the comment explaining why the Sync variant is required (tests use execute_sync; async variant breaks sync introspection) is exactly the kind of non-obvious context that prevents a future maintainer from "upgrading" to the async version and breaking the test suite.

  • graph_id kwargs fallback in decorator — adding the kwargs.get("graph_id") path correctly handles the case where no Request object is declared on the handler but graph_id is still a typed path parameter. This gives business events correct event_graph_id dimensions without requiring a request: Request parameter on every route.


- Centralized sensitive path prefixes in a single constant for improved maintainability and clarity in cache control logic.
- Updated the `auto_map_elements_op` function to ensure proper handling of idempotent replay, preventing double-counting of business events and allowing clients to distinguish between fresh and replayed tasks.
- Added comprehensive tests for `auto_map_elements_op` to validate the new idempotent replay behavior and ensure cache integrity.
@jfrench9 jfrench9 merged commit 98e7e9c into main Apr 15, 2026
7 checks passed
@jfrench9 jfrench9 deleted the feature/olap-otel branch April 15, 2026 05:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant