feat(observability): Sprint 16 §5.6 — Prometheus metrics + OpenTelemetry tracing#14
feat(observability): Sprint 16 §5.6 — Prometheus metrics + OpenTelemetry tracing#14dobrodob wants to merge 4 commits intofeat/sprint-16-a-slow-query-logfrom
Conversation
|
Important Review skippedAuto reviews are disabled on base/target branches other than the default branch. Please check the settings in the CodeRabbit UI or the ⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Plus Run ID: You can disable this status message by setting the Use the checkbox below for a quick retry:
📝 WalkthroughWalkthroughThis pull request adds Prometheus metrics and OpenTelemetry tracing infrastructure to the application. It introduces an HTTP metrics interceptor for RED (Request/Error/Duration) tracking, exposes metrics at a Changes
Sequence Diagram(s)sequenceDiagram
participant Client
participant HttpMetricsInterceptor
participant Controller
participant MetricsService
participant PrometheusRegistry
participant Client2 as Client (/metrics)
Client->>HttpMetricsInterceptor: HTTP GET /health
Note over HttpMetricsInterceptor: Capture timestamp, extract req/res
HttpMetricsInterceptor->>Controller: Forward request
Controller->>Controller: Process request
Controller-->>HttpMetricsInterceptor: Return response
HttpMetricsInterceptor->>MetricsService: recordHttp(method, route, status, duration)
MetricsService->>PrometheusRegistry: Increment counter + record histogram
HttpMetricsInterceptor-->>Client: HTTP 200
Client2->>Controller: GET /metrics
Controller->>PrometheusRegistry: register.metrics()
PrometheusRegistry-->>Controller: Prometheus exposition text
Controller-->>Client2: HTTP 200 + exposition data
sequenceDiagram
participant Process as Node Process
participant TracingModule as tracing.ts
participant OTelSDK as OpenTelemetry SDK
participant AutoInstrument as Auto-Instrumentations
participant AppModule
Process->>TracingModule: import (trigger side-effect)
TracingModule->>TracingModule: Check OTEL_EXPORTER_OTLP_ENDPOINT
alt OTEL_EXPORTER_OTLP_ENDPOINT set
TracingModule->>OTelSDK: new NodeSDK({ resource, traceExporter })
TracingModule->>OTelSDK: sdk.start()
OTelSDK->>AutoInstrument: Initialize auto-instrumentations
AutoInstrument->>Process: Patch modules (http, fs, etc.)
OTelSDK-->>TracingModule: SDK initialized
TracingModule->>Process: Register SIGTERM/SIGINT handlers
else OTEL_EXPORTER_OTLP_ENDPOINT not set
TracingModule-->>TracingModule: Return early (tracing disabled)
end
Process->>AppModule: Load (after tracing patches installed)
AppModule-->>Process: App initialized
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~28 minutes Poem
🚥 Pre-merge checks | ✅ 3✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
78600c6 to
a0a24e5
Compare
|
@coderabbitai review — this PR was marked as skipped earlier. Please review explicitly. |
|
🧠 Learnings used✅ Actions performedReview triggered.
|
There was a problem hiding this comment.
Actionable comments posted: 3
🧹 Nitpick comments (2)
apps/core/test/observability.integration.spec.ts (1)
59-60: Tighten the metric assertion to includerouteand avoid/metricsfalse positives.The current regex can match the scrape request itself (
GET /metrics,200) even if/healthwasn’t recorded correctly.Suggested assertion shape
- /publy_http_requests_total\{[^}]*method="GET"[^}]*status="200"[^}]*\}\s+[1-9]\d*/ + /publy_http_requests_total\{[^}]*method="GET"[^}]*route="(?:\/health|\/)"[^}]*status="200"[^}]*\}\s+[1-9]\d*/🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@apps/core/test/observability.integration.spec.ts` around lines 59 - 60, The metric assertion currently can match the scrape request itself; tighten the regex in the test (observability.integration.spec.ts) so the prometheus metric line includes the route label for the endpoint you expect (e.g., route="/health") in addition to method="GET" and status="200" — update the expect(res.text).toMatch(...) for publy_http_requests_total to require route="..."/health so the test cannot be satisfied by the /metrics scrape line.apps/core/src/modules/observability/metrics.service.ts (1)
29-31: Update the comment: it referencesfinalize(), but interceptor currently records viatap.Small docs drift, but worth aligning to avoid confusion during future changes.
Suggested doc tweak
- * in the `finalize()` branch so success + error paths both flow through + * in the interceptor's `tap({ next, error })` callbacks so success + error + * paths both flow through🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@apps/core/src/modules/observability/metrics.service.ts` around lines 29 - 31, Update the doc comment above the MetricsService method that records completed HTTP requests to remove the stale reference to finalize() and instead mention that the interceptor records via tap in HttpMetricsInterceptor; specifically, change the wording in the comment that currently says "Called from `HttpMetricsInterceptor` in the `finalize()` branch" to something like "Called from `HttpMetricsInterceptor`, which records via `tap()` so both success and error paths flow here with the final response status code", ensuring the comment references HttpMetricsInterceptor and the MetricsService record method name so future readers can locate the flow.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@apps/core/src/modules/observability/http-metrics.interceptor.ts`:
- Around line 44-47: The fallback for route labels currently uses raw req.path
which can create unbounded metric cardinality; change the fallback logic in the
http-metrics.interceptor (where route is computed and passed to
this.metrics.recordHttp) to return a bounded, constant label (e.g. "<unmatched>"
or "<unknown_route>") or a sanitized, limited form instead of req.path for cases
where req.route?.path is undefined, ensuring all 404/unmatched requests use that
constant/sanitized label.
In `@apps/core/test/observability.integration.spec.ts`:
- Around line 28-32: The test suite currently initializes the Nest app in
beforeAll but never resets DB state between tests; add a beforeEach hook that
calls the shared resetDatabase() helper so each integration test runs against a
clean DB. Specifically, add beforeEach(async () => { await resetDatabase(); });
alongside the existing beforeAll that creates moduleRef/app; ensure the
resetDatabase symbol is imported from your test helpers and invoked before each
test so moduleRef/app initialization remains unchanged.
In `@package.json`:
- Around line 54-59: Update the `@opentelemetry/api` dependency entry in
package.json to constrain the version so it cannot resolve to 1.10.0+ (replace
the caret "^1.9.1" for the "@opentelemetry/api" dependency with either a tilde
"~1.9.1" or an explicit range ">=1.9.1 <1.10.0"); after changing the dependency
string, run your package manager to update the lockfile (npm/yarn/pnpm) so the
locked dependency respects the new constraint.
---
Nitpick comments:
In `@apps/core/src/modules/observability/metrics.service.ts`:
- Around line 29-31: Update the doc comment above the MetricsService method that
records completed HTTP requests to remove the stale reference to finalize() and
instead mention that the interceptor records via tap in HttpMetricsInterceptor;
specifically, change the wording in the comment that currently says "Called from
`HttpMetricsInterceptor` in the `finalize()` branch" to something like "Called
from `HttpMetricsInterceptor`, which records via `tap()` so both success and
error paths flow here with the final response status code", ensuring the comment
references HttpMetricsInterceptor and the MetricsService record method name so
future readers can locate the flow.
In `@apps/core/test/observability.integration.spec.ts`:
- Around line 59-60: The metric assertion currently can match the scrape request
itself; tighten the regex in the test (observability.integration.spec.ts) so the
prometheus metric line includes the route label for the endpoint you expect
(e.g., route="/health") in addition to method="GET" and status="200" — update
the expect(res.text).toMatch(...) for publy_http_requests_total to require
route="..."/health so the test cannot be satisfied by the /metrics scrape line.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro Plus
Run ID: abf817e6-433f-4589-bd06-429ef8834ce1
⛔ Files ignored due to path filters (1)
package-lock.jsonis excluded by!**/package-lock.json
📒 Files selected for processing (12)
.env.exampleapps/core/src/app.module.tsapps/core/src/main.tsapps/core/src/modules/observability/http-metrics.interceptor.tsapps/core/src/modules/observability/metrics.controller.tsapps/core/src/modules/observability/metrics.service.tsapps/core/src/modules/observability/observability.module.tsapps/core/src/observability/tracing.tsapps/core/test/observability.integration.spec.tsdocs/discours-patterns/42-observability-prom-otel.mddocs/discours-patterns/README.mdpackage.json
| // Express sets `req.route?.path` after matching — fall back to the raw | ||
| // URL path (stripped of query string) for 404s where no route matched. | ||
| const route = req.route?.path ?? (typeof req.path === 'string' ? req.path : '<unknown>'); | ||
| this.metrics.recordHttp(req.method, route, res.statusCode, durationSec); |
There was a problem hiding this comment.
Bound the fallback route label; avoid raw req.path on unmatched routes.
Using raw path for 404/unmatched traffic creates unbounded label cardinality (e.g., scanners/random URLs), which can degrade or break metrics storage.
Suggested fix
- const route = req.route?.path ?? (typeof req.path === 'string' ? req.path : '<unknown>');
+ const route = req.route?.path ?? '<unmatched>';📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| // Express sets `req.route?.path` after matching — fall back to the raw | |
| // URL path (stripped of query string) for 404s where no route matched. | |
| const route = req.route?.path ?? (typeof req.path === 'string' ? req.path : '<unknown>'); | |
| this.metrics.recordHttp(req.method, route, res.statusCode, durationSec); | |
| // Express sets `req.route?.path` after matching — fall back to the raw | |
| // URL path (stripped of query string) for 404s where no route matched. | |
| const route = req.route?.path ?? '<unmatched>'; | |
| this.metrics.recordHttp(req.method, route, res.statusCode, durationSec); |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@apps/core/src/modules/observability/http-metrics.interceptor.ts` around lines
44 - 47, The fallback for route labels currently uses raw req.path which can
create unbounded metric cardinality; change the fallback logic in the
http-metrics.interceptor (where route is computed and passed to
this.metrics.recordHttp) to return a bounded, constant label (e.g. "<unmatched>"
or "<unknown_route>") or a sanitized, limited form instead of req.path for cases
where req.route?.path is undefined, ensuring all 404/unmatched requests use that
constant/sanitized label.
| beforeAll(async () => { | ||
| moduleRef = await Test.createTestingModule({ imports: [AppModule] }).compile(); | ||
| app = moduleRef.createNestApplication({ bufferLogs: true }); | ||
| await app.init(); | ||
| }); |
There was a problem hiding this comment.
Add resetDatabase() in beforeEach for this integration spec.
This suite currently boots once but never resets DB state between tests. That breaks the test contract for integration specs and can introduce cross-test coupling/flakes.
As per coding guidelines, “Integration tests must be located at apps/core/test/<name>.integration.spec.ts (NOT co-located with service). They must run against the real dockerized stack and reset the DB via resetDatabase() in beforeEach.”
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@apps/core/test/observability.integration.spec.ts` around lines 28 - 32, The
test suite currently initializes the Nest app in beforeAll but never resets DB
state between tests; add a beforeEach hook that calls the shared resetDatabase()
helper so each integration test runs against a clean DB. Specifically, add
beforeEach(async () => { await resetDatabase(); }); alongside the existing
beforeAll that creates moduleRef/app; ensure the resetDatabase symbol is
imported from your test helpers and invoked before each test so moduleRef/app
initialization remains unchanged.
| "@opentelemetry/api": "^1.9.1", | ||
| "@opentelemetry/auto-instrumentations-node": "^0.73.0", | ||
| "@opentelemetry/exporter-trace-otlp-http": "^0.215.0", | ||
| "@opentelemetry/resources": "^2.7.0", | ||
| "@opentelemetry/sdk-node": "^0.215.0", | ||
| "@opentelemetry/semantic-conventions": "^1.40.0", |
There was a problem hiding this comment.
🧩 Analysis chain
🌐 Web query:
OpenTelemetry JS SDK 0.215.0 compatible API version
💡 Result:
The OpenTelemetry JS SDK experimental version 0.215.0 (including @opentelemetry/sdk-node 0.215.0) is compatible with @opentelemetry/api versions >=1.3.0 and <1.10.0.
Citations:
- 1: https://www.npmjs.com/package/@opentelemetry/sdk-node
- 2: https://registry.npmjs.org/%40opentelemetry%2Fsdk-node
Constrain @opentelemetry/api version to ensure long-term SDK 0.215.0 compatibility.
The @opentelemetry/sdk-node 0.215.0 is compatible with @opentelemetry/api versions >=1.3.0 and <1.10.0. While the current caret range ^1.9.1 is compatible with the currently released 1.9.1, it could pull in incompatible 1.10.0+ versions when released. Use a tilde range (~1.9.1) or explicit upper bound (>=1.9.1 <1.10.0) instead.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@package.json` around lines 54 - 59, Update the `@opentelemetry/api` dependency
entry in package.json to constrain the version so it cannot resolve to 1.10.0+
(replace the caret "^1.9.1" for the "@opentelemetry/api" dependency with either
a tilde "~1.9.1" or an explicit range ">=1.9.1 <1.10.0"); after changing the
dependency string, run your package manager to update the lockfile
(npm/yarn/pnpm) so the locked dependency respects the new constraint.
…dings 1. **http-metrics.interceptor.ts** — collapse the route fallback from `req.path` (unbounded cardinality for 404s, scanner traffic) to a single `<unmatched>` label. Prevents metrics-store growth under hostile input. 2. **observability.integration.spec.ts** — add the standard `resetDatabase()` + re-seed-publy pattern in `beforeEach`. This suite doesn't write, but the integration-spec contract is "DB reset before each test"; conformance beats "we happen not to need it". Short-circuit guard handles the cross-suite race where a parallel file leaves publy between reset and create. 3. **package.json** — tighten `@opentelemetry/api` from `^1.9.1` to `~1.9.1`. `@opentelemetry/sdk-node@0.215.0` is compatible with `@opentelemetry/api` `>=1.3.0 <1.10.0` per upstream; a future caret-satisfying 1.10 release would break our SDK. Verification: 2/2 observability spec passing, tsc clean, biome clean. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…dings 1. **http-metrics.interceptor.ts** — collapse the route fallback from `req.path` (unbounded cardinality for 404s, scanner traffic) to a single `<unmatched>` label. Prevents metrics-store growth under hostile input. 2. **observability.integration.spec.ts** — add the standard `resetDatabase()` + re-seed-publy pattern in `beforeEach`. This suite doesn't write, but the integration-spec contract is "DB reset before each test"; conformance beats "we happen not to need it". Short-circuit guard handles the cross-suite race where a parallel file leaves publy between reset and create. 3. **package.json** — tighten `@opentelemetry/api` from `^1.9.1` to `~1.9.1`. `@opentelemetry/sdk-node@0.215.0` is compatible with `@opentelemetry/api` `>=1.3.0 <1.10.0` per upstream; a future caret-satisfying 1.10 release would break our SDK. Verification: 2/2 observability spec passing, tsc clean, biome clean. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
8e3c542 to
dddf31f
Compare
…try tracing Adds the second half of the observability foundation alongside pattern 41 (slow-query logs). Logs explain WHAT happened; metrics aggregate; traces stitch sequence — together they cover the operator's "why is it slow?" path. Prometheus layer: - ObservabilityModule (@global) — `@willsoto/nestjs-prometheus` for DI, `prom-client` default registry for the exposition endpoint. - `/metrics` endpoint via our own @public controller — the library's default would trip the global JwtAuthGuard. - MetricsService — thin facade over `publy_http_requests_total` (Counter) and `publy_http_request_duration_seconds` (Histogram with 10ms–10s buckets tuned for our p95 target). - HttpMetricsInterceptor (APP_INTERCEPTOR) — records RED sample on every completed request. `route` label uses `req.route?.path` (pattern, not URL) to keep cardinality bounded. - Default node/process metrics enabled (cpu, rss, heap) — day-1 baseline. OpenTelemetry layer: - `apps/core/src/observability/tracing.ts` — NodeSDK + auto-instrumentations (HTTP, Express, Prisma, ioredis). Env-gated on OTEL_EXPORTER_OTLP_ENDPOINT — unset is a no-op (dev/test default), set activates OTLP/HTTP push. - Side-effect `startTracing()` at module bottom + `import './observability/ tracing'` as the FIRST statement in main.ts so require-hooks patch upstream modules before any app code loads. - Graceful shutdown (SIGTERM/SIGINT) flushes in-flight spans. Tests + docs: - observability.integration.spec.ts — smoke-level: /metrics is unauth, emits Prometheus format, HTTP counter increments after a /health hit. - 42-observability-prom-otel.md — pattern doc, cardinality guidance, import-order footgun warning. - .env.example — OTEL_EXPORTER_OTLP_ENDPOINT + OTEL_SERVICE_NAME docs. Dependencies added: @willsoto/nestjs-prometheus, prom-client, @opentelemetry/{api,sdk-node,auto-instrumentations-node,exporter-trace- otlp-http,resources,semantic-conventions}. +18 + 141 packages, 0 vulns. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…dings 1. **http-metrics.interceptor.ts** — collapse the route fallback from `req.path` (unbounded cardinality for 404s, scanner traffic) to a single `<unmatched>` label. Prevents metrics-store growth under hostile input. 2. **observability.integration.spec.ts** — add the standard `resetDatabase()` + re-seed-publy pattern in `beforeEach`. This suite doesn't write, but the integration-spec contract is "DB reset before each test"; conformance beats "we happen not to need it". Short-circuit guard handles the cross-suite race where a parallel file leaves publy between reset and create. 3. **package.json** — tighten `@opentelemetry/api` from `^1.9.1` to `~1.9.1`. `@opentelemetry/sdk-node@0.215.0` is compatible with `@opentelemetry/api` `>=1.3.0 <1.10.0` per upstream; a future caret-satisfying 1.10 release would break our SDK. Verification: 2/2 observability spec passing, tsc clean, biome clean. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
### T1.2 — content-type regex tightened The old regex OR'd `text/plain` with `version=0.0.4` — a broken future content-type missing either piece would silently pass. Now asserts both parts separately, so regressions in the Prometheus exposition format surface immediately. ### T1.3 — OTel deployment.environment resource attribute Adds `deployment.environment.name` to the tracer resource, sourced from (in order) `OTEL_DEPLOYMENT_ENVIRONMENT` env → `NODE_ENV` → `"development"`. Lets Grafana/Tempo split dashboards by environment with a single label. Sampling is deliberately NOT hard-coded — the OTel SDK natively honors `OTEL_TRACES_SAMPLER=traceidratio` + `OTEL_TRACES_SAMPLER_ARG=0.1`, the standard way to tune cost in prod. Keeps operators in control without code deploys. ### T2.3 — Prisma query duration histogram - `libs/prisma/src/prisma.service.ts`: new `PrismaMetricsSink` interface + `PRISMA_METRICS_SINK` DI token. Optional — library stays standalone. Every query flows through the sink; slow ones also log. - `metrics.service.ts`: `recordPrismaQuery(durationMs, target, slow)` observes `publy_prisma_query_duration_ms` (1 ms → 10 s buckets) and ticks `publy_prisma_slow_queries_total` on slow rows. - `observability.module.ts`: factory-provides the sink against MetricsService — one-way coupling (libs/prisma knows nothing about Prometheus). ### T2.4 — EventRouter Prometheus gauges Four new metrics expose the SSE fan-out's internal counters: - `publy_event_router_clients_connected` (gauge) - `publy_event_router_clients_peak` (gauge) - `publy_event_router_events_received_total` (counter) - `publy_event_router_events_routed_total` (counter) Gauges use prom-client's `collect` callback pattern (pull at scrape time, not push per event). Counters compute deltas from a snapshot cache so they stay monotonic even though EventRouter itself exposes lifetime totals. Sink-in-metrics-service + collect-at-scrape-time avoid adding Prometheus as a dependency of EventRouterService — keeps that service's failure surface tight. Verification: /metrics integration spec 2/2, tsc clean, biome clean. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PR #12 added `@sentry/node` to main's lockfile. When this branch rebased onto the new main, we hit a package-lock.json merge conflict (additions on both sides) and took `theirs` to keep our OTel and Prometheus entries. That resolution dropped Sentry's entries. Running `npm install` restores them — the lockfile now contains every dependency package.json declares, on any of the 7 sprint-16 branches, so each branch installs cleanly in isolation. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
41bc13c to
a2bc573
Compare
dddf31f to
1e7a5c1
Compare
Summary
Sprint 16 §5.6 — adds the second half of the observability foundation alongside PR #13 (pattern 41 slow-query logs). Logs explain what happened; metrics aggregate; traces stitch sequence — together they cover the operator's "why is it slow?" path without external APM dependency.
PR 2 of 7 for Sprint 16 — stacked on top of #13.
What's in it
Prometheus layer
apps/core/src/modules/observability/withObservabilityModule(@Global).@willsoto/nestjs-prometheusfor DI;prom-client's default registry for the exposition endpoint./metricsendpoint via a custom@Publiccontroller (the library's default would trip the globalJwtAuthGuard).MetricsServicefacade +HttpMetricsInterceptor(APP_INTERCEPTOR) recording `publy_http_requests_total` (Counter) and `publy_http_request_duration_seconds` (Histogram, 10ms–10s buckets).OpenTelemetry layer
Dependencies added
Test plan
Verification after merge
```bash
curl http://localhost:3000/metrics | grep publy_
publy_http_request_duration_seconds_bucket{...,le="0.5"} 1
publy_http_request_duration_seconds_count{...} 1
publy_http_requests_total{method="GET",route="/health",status="200"} 1
To enable tracing:
OTEL_EXPORTER_OTLP_ENDPOINT=http://tempo:4318 npm run start:dev
```
Docs
🤖 Generated with Claude Code
Summary by CodeRabbit
New Features
/metricsendpoint exposing application and HTTP request metrics (method, route, status code, and duration).Documentation
Chores