Add OLAP metrics via OpenTelemetry and improve error handling by jfrench9 · Pull Request #618 · RoboFinSystems/robosystems

jfrench9 · 2026-04-15T04:57:35Z

Summary

This PR integrates OpenTelemetry (OTel) observability into the OLAP extensions layer, adding metrics instrumentation and improving error handling across the middleware and router extensions. A new application entrypoint (main.py) is introduced alongside dependency and configuration updates.

Key Accomplishments

Observability & Metrics

Integrated OpenTelemetry instrumentation for OLAP-related extensions, enabling metrics collection and tracing across the middleware pipeline
Enhanced the extensions.py middleware with metrics hooks and improved error handling to provide better observability into request lifecycles

Extension Enhancements

RoboInvestor: Added new operations with structured metric reporting
RoboLedger: Significantly expanded operations (~94 lines added) and updated views to support new functionality
Updated GraphQL schema to accommodate new extension capabilities

Application Infrastructure

Added main.py (228 lines) as a new application entrypoint, centralizing startup configuration and OTel bootstrapping
Updated pyproject.toml with revised dependencies (including OTel packages) and project configuration adjustments
Updated uv.lock to reflect dependency changes

Documentation

Minor README update to reflect new capabilities

Breaking Changes

Dependency changes: Updated and new dependencies in pyproject.toml may require a fresh environment setup (uv.lock updated accordingly)
Middleware behavior: The extensions middleware now includes metrics and enhanced error handling, which may change error propagation semantics — downstream consumers should verify compatibility
GraphQL schema modifications: Schema changes in robosystems/graphql/schema.py may affect existing GraphQL clients

Testing Notes

Existing middleware extension tests have been updated (tests/middleware/test_extensions.py) to reflect the new metrics and error handling behavior
Verify that OpenTelemetry collector/exporter configuration is properly set up in target environments to receive emitted metrics and traces
Validate that RoboLedger and RoboInvestor operations behave correctly under both success and failure scenarios, ensuring metrics are emitted as expected

Infrastructure Considerations

An OpenTelemetry-compatible collector or backend must be available in deployed environments to ingest the new metrics and trace data
Environment variables or configuration for OTel exporters (endpoint, sampling rate, etc.) should be provisioned in all target deployment environments
The new entrypoint may require updates to deployment manifests, container definitions, or process managers

🤖 Generated with Claude Code

Branch Info:

Source: feature/olap-otel
Target: main
Type: feature

Co-Authored-By: Claude noreply@anthropic.com

- Added middleware for request-level metrics on the GraphQL endpoint, enabling Prometheus integration for monitoring GraphQL operations. - Introduced a standardized error response schema for extensions operations, ensuring consistent error handling across various endpoints. - Updated the `OperationEnvelope` model to include `created_by` for better audit correlation. - Enhanced multiple operations in RoboInvestor and RoboLedger with metrics decorators to track business events. - Revised dependencies in `pyproject.toml` and `uv.lock` for compatibility and performance improvements. - Updated README.md to include new demo commands and clarify usage.

jfrench9 · 2026-04-15T04:57:36Z

@claude please review this PR

claude · 2026-04-15T04:58:00Z

Claude finished @jfrench9's task in 3m 3s —— View job

PR Review: Add OLAP metrics via OpenTelemetry and improve error handling

Tasks:

Gather context and understand the request
Read changed files
Analyze middleware/extensions.py changes
Analyze main.py (new entrypoint)
Analyze router changes (roboledger + roboinvestor)
Analyze schema.py changes
Analyze pyproject.toml dependency changes
Analyze test changes
Post comprehensive review

Overall Assessment

This is a well-structured PR that adds OTel observability to the extensions layer. The created_by audit-provenance field, the OpenTelemetryExtensionSync GraphQL integration, and the cardinality-bounded endpoint labels are all solid additions. A few issues need attention before merge — one is a metrics correctness bug that would quietly produce useless dimensions.

Issues

Bug: `endpoint_metrics_decorator` always emits `method="UNKNOWN"` on every decorated endpoint

robosystems/routers/extensions/roboledger/operations.py, roboinvestor/operations.py, views.py

The decorator scans *args and **kwargs for a FastAPI.Request object, but FastAPI injects route handler parameters as typed keyword arguments (body, graph_id, user, cache, etc.) — never as a raw Request unless the handler explicitly declares request: Request. So request_obj is always None, method stays "UNKNOWN", and user_id goes unextracted.

# middleware/otel/metrics.py:876-890 — request_obj will always be None here
for arg in args:
    if isinstance(arg, Request):        # never true
        ...
for key, value in kwargs.items():
    if isinstance(value, Request):      # never true for these handlers
        ...

Since all decorated operations are POST-only, the quickest fix is a method parameter on the decorator (already used for the fixed label):

@endpoint_metrics_decorator(
    "/extensions/roboledger/{graph_id}/operations/update-entity",
    method="POST",                        # new param, avoids runtime scan
    business_event_type="ledger_update_entity",
)

Or add request: Request to each route handler — but that's more invasive across 23 endpoints.

Fix this →

Bug: `response.body` check in `security_headers_middleware` is always `False`

main.py:379-383

if path.startswith("/v1/") and any(
    keyword in str(response.body).lower() if hasattr(response, "body") else False
    for keyword in ["token", "password", "key"]
):
    response.headers["Cache-Control"] = "no-store, ..."

call_next() in Starlette middleware returns a StreamingResponse. Streaming responses do not expose a .body attribute — hasattr(response, "body") is False for virtually every real response. The Cache-Control sensitive-data header is therefore never set, even though the code appears to provide that protection. Either remove this dead branch or replace it with a route-prefix-only check (e.g., always set no-store for /v1/auth and /v1/billing).

Fix this →

Minor: `import time` inside middleware body

main.py:266

async def extensions_graphql_metrics_middleware(request: Request, call_next):
    path = request.url.path
    if not (...):
        return await call_next(request)

    import time        # ← deferred import inside hot path
    from robosystems.middleware.otel.metrics import (...)

Python caches module imports so this is not a correctness bug, but the deferred imports are unusual and could confuse readers. Move them to the top of main.py. The OTel imports from the conditional fast-exit path can instead be guarded by a module-level try/except ImportError.

Design: Business events fire on idempotent cache replays

middleware/otel/metrics.py:904-916 / middleware/extensions.py

When execute_operation returns a cached envelope (idempotent replay), it returns early — on_fresh_success is skipped and audit status is idempotent_replay=True. But the endpoint_metrics_decorator wrapping the route function fires the business_event_type counter regardless of whether the response came from cache:

# decorator fires this even for replayed operations
if business_event_type:
    metrics_instance.record_business_event(...)

This inflates business event counts for retried clients. Consider suppressing the business event when execute_operation returns a status="completed" envelope that has its operation_id matching a cached one — or expose an idempotent_replay flag through the envelope for the decorator to check.

Minor: `status_code` type inconsistency between `record_request` and `record_request_duration`

middleware/otel/metrics.py:343-355

# record_request: status_code converted to str ✓
"status_code": str(status_code),

# record_request_duration: status_code left as int ✗
"status_code": status_code,        # line 350

OTel attribute types must be consistent across recordings of the same instrument. If both methods record to _request_duration, having status_code as int in one path and str in another will cause attribute key collisions or silent dimension splits in Prometheus.

Positives

created_by in OperationEnvelope — clean, well-motivated addition with proper alias (createdBy), full coverage in the test matrix, and correct propagation through execute_operation. The docstring clearly explains the legacy-caller None case.
OpenTelemetryExtensionSync in schema.py — the comment explaining why the Sync variant is required (tests use execute_sync) is excellent, and the reasoning about async vs sync execution overhead is correct.
GraphQL metrics middleware cardinality design — normalizing to "/extensions/{graph_id}/graphql" rather than the actual path keeps Prometheus series count bounded. The graph_id appears only on the business event (event_graph_id), which is the right tradeoff.
Exception re-raise pattern in endpoint_metrics_decorator — error metrics are recorded before the exception propagates, so failures are visible in OTel even when the caller gets a 4xx/5xx.
app.state.current_time — consider renaming to startup_time or started_at to avoid the implication that it updates.
@app.on_event("startup") / @app.on_event("shutdown") — these are deprecated in recent FastAPI in favor of the lifespan async context manager. Not blocking, but worth flagging for a follow-up.

- Improved cache control for sensitive API responses by implementing a path-prefix allowlist, ensuring sensitive data is not cached. - Updated the `OperationEnvelope` model to include an `idempotent_replay` attribute, allowing clients to distinguish between fresh executions and cached responses. - Enhanced the `endpoint_metrics_decorator` to support explicit HTTP method labels, improving metrics accuracy for endpoints without a `Request` parameter. - Added tests to verify the correct behavior of the new `idempotent_replay` feature and the explicit method handling in metrics recording.

- Introduced a new `lifespan` async context manager to handle startup and shutdown events, replacing deprecated event handlers. - Added validation for environment configuration during startup, with logging for successful or failed validation. - Initialized query queue executor and Redis SSE event subscriber during startup, with error handling for each step. - Updated the `is_relaxed_csp_path` function to simplify its docstring and improve clarity. - Enhanced custom documentation routes for Swagger and ReDoc with dark theme support.

jfrench9 · 2026-04-15T05:28:02Z

@claude can you review the latest changes

claude · 2026-04-15T05:28:27Z

Claude finished @jfrench9's task in 3m 46s —— View job

PR Review: Add OLAP metrics via OpenTelemetry and improve error handling

Tasks:

Gather context and understand the request
Check latest commits and diff vs origin/main
Read changed files (middleware, routers, main.py, tests)
Analyze changes and identify issues
Post comprehensive review

Overall Assessment

The three commits since the previous review have addressed every issue from the earlier round — the "UNKNOWN" method bug, the StreamingResponse.body false security guard, the status_code type inconsistency, and the deprecated @app.on_event handlers. The new idempotent_replay / created_by machinery is well-designed, and the test coverage for both additions is solid. One bug in the async-dispatch path and a few minor items still need attention.

Previous Issues — Status

Issue	Status
`endpoint_metrics_decorator` emits `method="UNKNOWN"`	✅ Fixed — `method=` param added; all 23 endpoints decorated with `method="POST"`
`response.body` check never fires on `StreamingResponse`	✅ Fixed — replaced with path-prefix allowlist
`status_code` type inconsistency (`int` vs `str`)	✅ Fixed — `record_request_duration` now uses `str(status_code)`
`@app.on_event` deprecated	✅ Fixed — migrated to `lifespan` async context manager
Business events fire on idempotent cache replays	✅ Fixed — `idempotent_replay` flag; decorator suppresses counter on replays
Deferred `import time` inside middleware hot path	✅ Fixed — moved to module level

Issues

Bug: `auto_map_elements_op` idempotency replay path doesn't set `idempotent_replay=True`

robosystems/routers/extensions/roboledger/operations.py:1111-1122

auto_map_elements_op is the only operation that handles its own idempotency cache manually (it's async/Dagster-dispatched and can't go through execute_operation). The cached-hit branch returns the envelope directly without applying the idempotent_replay=True copy that execute_operation correctly applies at extensions.py:582:

# execute_operation does this correctly:
cached = cached.model_copy(update={"idempotent_replay": True})

# auto_map_elements_op does NOT:
if cached is not None:
    log_operation_audit(..., idempotent_replay=True)  # ← audit log ✓
    return cached  # ← envelope.idempotent_replay stays False ✗

Consequences:

The endpoint_metrics_decorator wrapping this route reads getattr(result, "idempotent_replay", False) — it stays False, so ledger_auto_map_elements business events are not suppressed on replays, inflating the counter.
Clients that inspect the envelope to distinguish "new enqueue" from "retry replay" get a False even though the task was not re-enqueued.

Fix is one line — mirror what execute_operation does:

if cached is not None:
    log_operation_audit(...)
    return cached.model_copy(update={"idempotent_replay": True})  # add this

Fix this →

Minor: `_sensitive_path_prefixes` tuple allocated on every request

main.py:322

async def security_headers_middleware(request: Request, call_next):
    ...
    _sensitive_path_prefixes = ("/v1/auth", "/v1/user", "/v1/billing", "/v1/orgs")  # ← per request

This tuple is re-created on every HTTP request. Move it to module level (or inside create_app as a closure constant) so it's allocated once:

_SENSITIVE_PATH_PREFIXES = ("/v1/auth", "/v1/user", "/v1/billing", "/v1/orgs")

Flag: Major dependency version jumps — verify numpy 2.x compatibility

pyproject.toml

Several version constraints moved significantly:

Package	Before	After
`numpy`	`>=1.26.0,<3.0`	`>=2.4.0,<3.0`
`strawberry-graphql`	`>=0.240.0,<1.0`	`>=0.314.0,<1.0`
`fastapi`	`>=0.116.0,<1.0`	`>=0.135.0,<1.0`
`dagster`	`>=1.9.0,<2.0`	`>=1.13.0,<2.0`
`pyarrow`	`>=20.0.0`	`>=23.0.0`

numpy 2.x dropped the legacy np.bool, np.int, np.float, np.complex aliases and changed several dtype behaviors — if any adapter or analytics code uses those, it will fail at runtime. The arelle-release==2.39.5 pin (XBRL processor) and onnxruntime transitive dep (via fastembed) both use numpy internals; worth verifying those are tested under the new constraint.

Positives

lifespan migration — the async context manager form is clean; error paths in startup correctly distinguish prod (fail fast) vs. dev (warn and continue). The SSE subscriber stop is guarded so shutdown doesn't crash on startup-failure scenarios.
idempotent_replay + created_by in OperationEnvelope — the model_copy approach to avoid mutating the cached instance is thoughtful (comment at extensions.py:579 explains the future-proofing rationale well). Both fields have correct alias for camelCase wire format, and the test matrix in test_extensions.py covers the alias round-trip explicitly.
auto_map_elements_op created_by propagation — the only operation that manually calls wrap_pending correctly passes created_by=user_id so its envelopes carry the same provenance as dispatcher-routed ones.
OpenAPI enrichment — injecting OperationError, shared error responses, Idempotency-Key header, and GraphQL auth notes into every matching path is the right place for it (keeps router files thin). The _is_operation_path / _is_graphql_path helpers make the dispatch readable, and using setdefault avoids overwriting per-operation overrides.
OpenTelemetryExtensionSync in schema — the comment explaining why the Sync variant is required (tests use execute_sync; async variant breaks sync introspection) is exactly the kind of non-obvious context that prevents a future maintainer from "upgrading" to the async version and breaking the test suite.
graph_id kwargs fallback in decorator — adding the kwargs.get("graph_id") path correctly handles the case where no Request object is declared on the handler but graph_id is still a typed path parameter. This gives business events correct event_graph_id dimensions without requiring a request: Request parameter on every route.

- Centralized sensitive path prefixes in a single constant for improved maintainability and clarity in cache control logic. - Updated the `auto_map_elements_op` function to ensure proper handling of idempotent replay, preventing double-counting of business events and allowing clients to distinguish between fresh and replayed tasks. - Added comprehensive tests for `auto_map_elements_op` to validate the new idempotent replay behavior and ensure cache integrity.

jfrench9 added 2 commits April 15, 2026 00:21

jfrench9 merged commit 98e7e9c into main Apr 15, 2026
7 checks passed

jfrench9 deleted the feature/olap-otel branch April 15, 2026 05:58

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add OLAP metrics via OpenTelemetry and improve error handling#618

Add OLAP metrics via OpenTelemetry and improve error handling#618
jfrench9 merged 4 commits intomainfrom
feature/olap-otel

jfrench9 commented Apr 15, 2026

Uh oh!

jfrench9 commented Apr 15, 2026

Uh oh!

claude Bot commented Apr 15, 2026 •

edited

Loading

Uh oh!

jfrench9 commented Apr 15, 2026

Uh oh!

claude Bot commented Apr 15, 2026 •

edited

Loading

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

jfrench9 commented Apr 15, 2026

Summary

Key Accomplishments

Observability & Metrics

Extension Enhancements

Application Infrastructure

Documentation

Breaking Changes

Testing Notes

Infrastructure Considerations

Uh oh!

jfrench9 commented Apr 15, 2026

Uh oh!

claude Bot commented Apr 15, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

PR Review: Add OLAP metrics via OpenTelemetry and improve error handling

Overall Assessment

Issues

Bug: endpoint_metrics_decorator always emits method="UNKNOWN" on every decorated endpoint

Bug: response.body check in security_headers_middleware is always False

Minor: import time inside middleware body

Design: Business events fire on idempotent cache replays

Minor: status_code type inconsistency between record_request and record_request_duration

Positives

Uh oh!

jfrench9 commented Apr 15, 2026

Uh oh!

claude Bot commented Apr 15, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

PR Review: Add OLAP metrics via OpenTelemetry and improve error handling

Overall Assessment

Previous Issues — Status

Issues

Bug: auto_map_elements_op idempotency replay path doesn't set idempotent_replay=True

Minor: _sensitive_path_prefixes tuple allocated on every request

Flag: Major dependency version jumps — verify numpy 2.x compatibility

Positives

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

claude Bot commented Apr 15, 2026 •

edited

Loading

Bug: `endpoint_metrics_decorator` always emits `method="UNKNOWN"` on every decorated endpoint

Bug: `response.body` check in `security_headers_middleware` is always `False`

Minor: `import time` inside middleware body

Minor: `status_code` type inconsistency between `record_request` and `record_request_duration`

claude Bot commented Apr 15, 2026 •

edited

Loading

Bug: `auto_map_elements_op` idempotency replay path doesn't set `idempotent_replay=True`

Minor: `_sensitive_path_prefixes` tuple allocated on every request