Add OLAP metrics via OpenTelemetry and improve error handling#618
- Added middleware for request-level metrics on the GraphQL endpoint, enabling Prometheus integration for monitoring GraphQL operations.
- Introduced a standardized error response schema for extensions operations, ensuring consistent error handling across various endpoints.
- Updated the `OperationEnvelope` model to include `created_by` for better audit correlation.
- Enhanced multiple operations in RoboInvestor and RoboLedger with metrics decorators to track business events.
- Revised dependencies in `pyproject.toml` and `uv.lock` for compatibility and performance improvements.
- Updated README.md to include new demo commands and clarify usage.
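A standardized error schema of the kind this commit describes might look roughly like the sketch below. The commit message only says that a shared schema exists, so the field names (`code`, `message`, `detail`) and the `error_response` helper are assumptions made for illustration.

```python
from dataclasses import asdict, dataclass, field
from typing import Any, Dict, Tuple


@dataclass(frozen=True)
class OperationError:
    """Hypothetical shared error body; the field names are assumptions."""

    code: str
    message: str
    detail: Dict[str, Any] = field(default_factory=dict)


def error_response(status: int, code: str, message: str, **detail: Any) -> Tuple[int, Dict[str, Any]]:
    """Build a (status, body) pair so every endpoint returns the same error shape."""
    return status, {"error": asdict(OperationError(code, message, detail))}


status, body = error_response(
    409, "idempotency_conflict", "Key reused with a different payload", key="abc"
)
```

Routing every failure through one helper is what keeps the error shape consistent: a client can always read `body["error"]["code"]` without checking which endpoint produced the response.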
@claude please review this PR
Claude finished @jfrench9's task in 3m 3s

PR Review: Add OLAP metrics via OpenTelemetry and improve error handling

Tasks:

Overall Assessment

This is a well-structured PR that adds OTel observability to the extensions layer. The

Issues

Bug:
- Improved cache control for sensitive API responses by implementing a path-prefix allowlist, ensuring sensitive data is not cached.
- Updated the `OperationEnvelope` model to include an `idempotent_replay` attribute, allowing clients to distinguish between fresh executions and cached responses.
- Enhanced the `endpoint_metrics_decorator` to support explicit HTTP method labels, improving metrics accuracy for endpoints without a `Request` parameter.
- Added tests to verify the correct behavior of the new `idempotent_replay` feature and the explicit method handling in metrics recording.
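The path-prefix approach to cache control can be sketched as follows. The prefix values, the function names, and the non-sensitive default header are all hypothetical; only the idea of matching request paths against a list of sensitive prefixes comes from the commit message.

```python
from typing import Tuple

# Hypothetical prefixes; the real list lives in the project's middleware.
SENSITIVE_PATH_PREFIXES: Tuple[str, ...] = ("/v1/auth", "/v1/user")


def is_sensitive_path(path: str) -> bool:
    """Match whole path segments so "/v1/authors" is not caught by "/v1/auth"."""
    return any(
        path == prefix or path.startswith(prefix + "/")
        for prefix in SENSITIVE_PATH_PREFIXES
    )


def cache_control_header(path: str) -> str:
    """Return the Cache-Control value for a response served at `path`."""
    if is_sensitive_path(path):
        return "no-store"
    return "public, max-age=60"  # assumed default for non-sensitive routes
```

Matching on a segment boundary rather than a bare `startswith` is a design choice of this sketch: it avoids marking unrelated routes as sensitive simply because they share leading characters with a listed prefix.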
- Introduced a new `lifespan` async context manager to handle startup and shutdown events, replacing deprecated event handlers.
- Added validation for environment configuration during startup, with logging for successful or failed validation.
- Initialized query queue executor and Redis SSE event subscriber during startup, with error handling for each step.
- Updated the `is_relaxed_csp_path` function to simplify its docstring and improve clarity.
- Enhanced custom documentation routes for Swagger and ReDoc with dark theme support.
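The startup and shutdown flow described in this commit can be sketched with the standard library alone. `DemoSubscriber`, the `DATABASE_URL` key, and the `state` dictionary are hypothetical stand-ins; in FastAPI the function would take the app instance and be passed as `FastAPI(lifespan=...)`.

```python
import asyncio
import logging
from contextlib import asynccontextmanager

logger = logging.getLogger("app.lifespan")


class DemoSubscriber:
    """Hypothetical stand-in for a background event subscriber."""

    def __init__(self) -> None:
        self.running = False

    async def start(self) -> None:
        self.running = True

    async def stop(self) -> None:
        self.running = False


@asynccontextmanager
async def lifespan(state: dict, *, env: dict, production: bool):
    # Startup: validate configuration; fail fast in production, warn otherwise.
    missing = [key for key in ("DATABASE_URL",) if key not in env]
    if missing:
        if production:
            raise RuntimeError(f"missing configuration: {missing}")
        logger.warning("continuing without: %s", missing)

    subscriber = DemoSubscriber()
    try:
        await subscriber.start()
        state["subscriber"] = subscriber
    except Exception:
        logger.exception("subscriber failed to start")
        if production:
            raise

    try:
        yield state
    finally:
        # Shutdown: only stop what actually started.
        started = state.get("subscriber")
        if started is not None and started.running:
            await started.stop()


async def demo():
    state: dict = {}
    async with lifespan(state, env={}, production=False):
        during = state["subscriber"].running
    return during, state["subscriber"].running


during, after = asyncio.run(demo())
```

Looking the subscriber up in `state` during shutdown, rather than assuming it exists, is what keeps teardown from raising when startup only partially succeeded.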
@claude can you review the latest changes
Claude finished @jfrench9's task in 3m 46s

PR Review: Add OLAP metrics via OpenTelemetry and improve error handling

Tasks:

Overall Assessment

The three commits since the previous review have addressed every issue from the earlier round: the

Previous Issues: Status

Issues

Bug:
| Package | Before | After |
|---|---|---|
| numpy | >=1.26.0,<3.0 | >=2.4.0,<3.0 |
| strawberry-graphql | >=0.240.0,<1.0 | >=0.314.0,<1.0 |
| fastapi | >=0.116.0,<1.0 | >=0.135.0,<1.0 |
| dagster | >=1.9.0,<2.0 | >=1.13.0,<2.0 |
| pyarrow | >=20.0.0 | >=23.0.0 |
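numpy 2.x removed several legacy aliases (for example `np.float_`, `np.complex_`, `np.string_`, and `np.unicode_`) and changed type-promotion behavior. The older `np.int`, `np.float`, and `np.complex` aliases were already removed in numpy 1.24, and `np.bool` is available again in 2.x. If any adapter or analytics code uses a removed name, it will fail at runtime with an `AttributeError`. The `arelle-release==2.39.5` pin (XBRL processor) and the `onnxruntime` transitive dependency (via `fastembed`) both use numpy internals, so it is worth verifying that those are tested under the new constraint.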
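One low-cost way to check first-party code for alias names that no longer exist under numpy 2.x is a text scan like the sketch below. The alias lists are a partial selection chosen for illustration, not the complete set (the NumPy 2.0 migration guide is the authority), and `find_removed_aliases` is a hypothetical helper name.

```python
import re
from typing import List, Tuple

# Removed in NumPy 1.24 after a long deprecation. np.bool is deliberately
# not listed: it is a valid name again in NumPy 2.x.
REMOVED_IN_1_24 = ("int", "float", "complex", "object", "str")
# Removed in NumPy 2.0 (partial list).
REMOVED_IN_2_0 = ("float_", "complex_", "unicode_", "string_", "NaN", "Inf")

# The trailing \b keeps valid names such as np.float64 or np.str_ from matching.
_PATTERN = re.compile(
    r"\b(?:np|numpy)\.(?:" + "|".join(REMOVED_IN_1_24 + REMOVED_IN_2_0) + r")\b"
)


def find_removed_aliases(source: str) -> List[Tuple[int, str]]:
    """Return (line number, matched name) for each removed alias in `source`."""
    hits = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        for match in _PATTERN.finditer(line):
            hits.append((lineno, match.group(0)))
    return hits


sample = "a = np.float_(1.0)\nb = np.float64(2.0)\nc = np.int(3)\n"
hits = find_removed_aliases(sample)
```

Because this is a plain regular-expression scan, it will also flag matches inside comments and string literals, and it cannot see into third-party packages; running the test suite against the new pin remains the real check.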
Positives
- `lifespan` migration: the async context manager form is clean; error paths in startup correctly distinguish prod (fail fast) vs. dev (warn and continue). The SSE subscriber stop is guarded so shutdown doesn't crash on startup-failure scenarios.
- `idempotent_replay` + `created_by` in `OperationEnvelope`: the `model_copy` approach to avoid mutating the cached instance is thoughtful (comment at `extensions.py:579` explains the future-proofing rationale well). Both fields have correct `alias` for camelCase wire format, and the test matrix in `test_extensions.py` covers the alias round-trip explicitly.
- `auto_map_elements_op` `created_by` propagation: the only operation that manually calls `wrap_pending` correctly passes `created_by=user_id` so its envelopes carry the same provenance as dispatcher-routed ones.
- OpenAPI enrichment: injecting `OperationError`, shared error responses, `Idempotency-Key` header, and GraphQL auth notes into every matching path is the right place for it (keeps router files thin). The `_is_operation_path` / `_is_graphql_path` helpers make the dispatch readable, and using `setdefault` avoids overwriting per-operation overrides.
- `OpenTelemetryExtensionSync` in schema: the comment explaining why the `Sync` variant is required (tests use `execute_sync`; async variant breaks sync introspection) is exactly the kind of non-obvious context that prevents a future maintainer from "upgrading" to the async version and breaking the test suite.
- `graph_id` kwargs fallback in decorator: adding the `kwargs.get("graph_id")` path correctly handles the case where no `Request` object is declared on the handler but `graph_id` is still a typed path parameter. This gives business events correct `event_graph_id` dimensions without requiring a `request: Request` parameter on every route.
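The `graph_id` fallback and the explicit method label praised above can be illustrated with a small synchronous sketch. `record_business_event`, the in-memory `EVENTS` counter, and `create_portfolio` are hypothetical names; the project's real decorator is `endpoint_metrics_decorator`, whose signature is not shown in this review.

```python
import functools
from collections import Counter
from typing import Optional

EVENTS: Counter = Counter()  # in-memory stand-in for a metrics backend


def record_business_event(event: str, *, method: Optional[str] = None):
    """Hypothetical decorator: count successful calls by event, method, graph_id."""

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            request = kwargs.get("request")
            # Prefer the explicit label; otherwise read it from a declared request.
            http_method = method or getattr(request, "method", "UNKNOWN")
            # Fall back to the keyword path parameter when no request is available.
            graph_id = kwargs.get("graph_id", "unknown")
            result = func(*args, **kwargs)
            # Recorded only after the handler returns without raising.
            EVENTS[(event, http_method, graph_id)] += 1
            return result

        return wrapper

    return decorator


@record_business_event("portfolio_created", method="POST")
def create_portfolio(graph_id: str, name: str) -> dict:
    return {"graph_id": graph_id, "name": name}


create_portfolio(graph_id="g1", name="core")
```

The handler here declares no `request` parameter, yet the recorded event still carries both the HTTP method and the graph identifier, which is the behavior the review item describes.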
- Centralized sensitive path prefixes in a single constant for improved maintainability and clarity in cache control logic.
- Updated the `auto_map_elements_op` function to ensure proper handling of idempotent replay, preventing double-counting of business events and allowing clients to distinguish between fresh and replayed tasks.
- Added comprehensive tests for `auto_map_elements_op` to validate the new idempotent replay behavior and ensure cache integrity.
Summary
This PR integrates OpenTelemetry (OTel) observability into the OLAP extensions layer, adding metrics instrumentation and improving error handling across the middleware and router extensions. A new application entrypoint (`main.py`) is introduced alongside dependency and configuration updates.

Key Accomplishments

Observability & Metrics

- `extensions.py` middleware with metrics hooks and improved error handling to provide better observability into request lifecycles

Extension Enhancements

Application Infrastructure

- `main.py` (228 lines) as a new application entrypoint, centralizing startup configuration and OTel bootstrapping
- `pyproject.toml` with revised dependencies (including OTel packages) and project configuration adjustments
- `uv.lock` to reflect dependency changes

Documentation

Breaking Changes

- `pyproject.toml` may require a fresh environment setup (`uv.lock` updated accordingly)
- `robosystems/graphql/schema.py` may affect existing GraphQL clients

Testing Notes

- `tests/middleware/test_extensions.py`) to reflect the new metrics and error handling behavior

Infrastructure Considerations
🤖 Generated with Claude Code
Branch Info:

- `feature/olap-otel`
- `main`

Co-Authored-By: Claude noreply@anthropic.com