
feat(observability): Sprint 16 §5.6 — Prometheus metrics + OpenTelemetry tracing #14

Open
dobrodob wants to merge 4 commits into feat/sprint-16-a-slow-query-log from feat/sprint-16-b-observability

Conversation

@dobrodob
Member

@dobrodob dobrodob commented Apr 19, 2026

Summary

Sprint 16 §5.6 — adds the second half of the observability foundation alongside PR #13 (pattern 41 slow-query logs). Logs explain what happened; metrics aggregate; traces stitch sequence — together they cover the operator's "why is it slow?" path without external APM dependency.

PR 2 of 7 for Sprint 16 — stacked on top of #13.

What's in it

Prometheus layer

  • New apps/core/src/modules/observability/ with ObservabilityModule (@Global).
  • @willsoto/nestjs-prometheus for DI; prom-client's default registry for the exposition endpoint.
  • /metrics endpoint via a custom @Public controller (the library's default would trip the global JwtAuthGuard).
  • MetricsService facade + HttpMetricsInterceptor (APP_INTERCEPTOR) recording `publy_http_requests_total` (Counter) and `publy_http_request_duration_seconds` (Histogram, 10ms–10s buckets).
  • Default Node/process metrics enabled (cpu, rss, heap).
  • Route label uses Express's `req.route?.path` pattern, not URL — bounded cardinality.
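The `le="0.5"` sample in the verification output reads more easily with the cumulative nature of Prometheus histogram buckets in mind: one observation increments every bucket whose upper bound is at or above the observed value. The following is a minimal sketch in plain TypeScript, not the prom-client implementation. The bucket bounds are assumed for illustration, since the PR states only the 10 ms to 10 s range.

```typescript
// Illustrative only: a hand-rolled cumulative histogram, not prom-client.
// The bucket bounds below are an assumption; the PR gives only the overall range.
interface Bucket {
  le: number;    // upper bound in seconds ("less than or equal")
  count: number; // number of observations <= le
}

interface HistogramState {
  buckets: Bucket[];
  count: number; // total observations (the implicit +Inf bucket)
  sum: number;   // sum of observed values, in seconds
}

function emptyHistogram(bounds: number[]): HistogramState {
  return { buckets: bounds.map((le) => ({ le, count: 0 })), count: 0, sum: 0 };
}

function observe(state: HistogramState, seconds: number): void {
  for (const bucket of state.buckets) {
    // Cumulative: every bucket whose bound covers the value is incremented.
    if (seconds <= bucket.le) bucket.count += 1;
  }
  state.count += 1;
  state.sum += seconds;
}

const histogram = emptyHistogram([0.01, 0.05, 0.1, 0.5, 1, 5, 10]);
observe(histogram, 0.032); // a single 32 ms request
```

After that single observation the `0.01` bucket stays at 0 while `0.05` and every larger bucket, including `0.5`, hold 1.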

OpenTelemetry layer

  • `apps/core/src/observability/tracing.ts` — NodeSDK + auto-instrumentations (HTTP, Express, Prisma, ioredis).
  • Env-gated on `OTEL_EXPORTER_OTLP_ENDPOINT` — unset = no-op (dev/test); set = OTLP/HTTP push to the collector URL.
  • Side-effect `startTracing()` at module bottom + `import './observability/tracing'` as the FIRST statement in `main.ts` so require-hooks patch upstream modules before any app code loads.
  • Graceful SIGTERM/SIGINT shutdown flushes in-flight spans.
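The env gate described above can be sketched as a pure function. This is only the decision logic; the NodeSDK wiring in `tracing.ts` is not reproduced here. The function name `resolveTracingConfig` and the fallback service name `publy-core` are hypothetical and not taken from the PR.

```typescript
// Sketch of the env gate only; the real module hands the result to NodeSDK.
type Env = Record<string, string | undefined>;

interface TracingConfig {
  endpoint: string;    // OTLP/HTTP collector base URL
  serviceName: string;
}

// Returns null when tracing should stay a no-op (endpoint unset or blank).
function resolveTracingConfig(env: Env): TracingConfig | null {
  const endpoint = env['OTEL_EXPORTER_OTLP_ENDPOINT']?.trim();
  if (!endpoint) return null;
  return {
    endpoint,
    // 'publy-core' is a hypothetical default, not taken from the PR.
    serviceName: env['OTEL_SERVICE_NAME']?.trim() || 'publy-core',
  };
}

// In the application the argument would be process.env.
const disabled = resolveTracingConfig({});
const enabled = resolveTracingConfig({ OTEL_EXPORTER_OTLP_ENDPOINT: 'http://tempo:4318' });
```

Keeping the decision in one place like this makes the "unset means no-op" behaviour easy to unit-test without loading any instrumentation.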

Dependencies added

  • `@willsoto/nestjs-prometheus`, `prom-client` (+18 packages total)
  • `@opentelemetry/{api,sdk-node,auto-instrumentations-node,exporter-trace-otlp-http,resources,semantic-conventions}` (+141 packages total)
  • 0 vulnerabilities

Test plan

  • `npx vitest run apps/core/test/observability.integration.spec.ts` — 2/2 passing (GET /metrics emits Prometheus format unauth; `publy_http_requests_total` increments after a /health hit)
  • Full suite — 457/457 still green
  • `nest build core` succeeds

Verification after merge

```bash
curl http://localhost:3000/metrics | grep publy_
# publy_http_request_duration_seconds_bucket{...,le="0.5"} 1
# publy_http_request_duration_seconds_count{...} 1
# publy_http_requests_total{method="GET",route="/health",status="200"} 1

# To enable tracing:
OTEL_EXPORTER_OTLP_ENDPOINT=http://tempo:4318 npm run start:dev
```

Docs

  • New pattern: `docs/discours-patterns/42-observability-prom-otel.md` — cardinality guidance, OTel import-order footgun, fail-closed init semantics.

🤖 Generated with Claude Code

Summary by CodeRabbit

  • New Features

    • Added a /metrics endpoint exposing application and HTTP request metrics (method, route, status code, and duration).
    • Added optional OpenTelemetry tracing support, activated via environment configuration.
  • Documentation

    • Added observability pattern documentation covering metrics and tracing setup.
  • Chores

    • Added OpenTelemetry and Prometheus monitoring dependencies.

@coderabbitai

coderabbitai bot commented Apr 19, 2026

Important

Review skipped

Auto reviews are disabled on base/target branches other than the default branch.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: 0a44b9be-5bc5-4554-b41e-0f9bb02e5935

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

  • 🔍 Trigger review
📝 Walkthrough


This pull request adds Prometheus metrics and OpenTelemetry tracing infrastructure to the application. It introduces an HTTP metrics interceptor for RED (Rate/Errors/Duration) tracking, exposes metrics at a /metrics endpoint, bootstraps the OpenTelemetry Node.js SDK at startup with OTLP export, and adds corresponding configuration and dependencies.

Changes

| Cohort / File(s) | Summary |
|---|---|
| **Environment & Dependencies**<br>`.env.example`, `package.json` | Added OTEL_EXPORTER_OTLP_ENDPOINT and OTEL_SERVICE_NAME environment variables; added OpenTelemetry packages (@opentelemetry/*) and Prometheus integration packages (@willsoto/nestjs-prometheus, prom-client). |
| **Observability Core Modules**<br>`apps/core/src/observability/tracing.ts`, `apps/core/src/modules/observability/observability.module.ts`, `apps/core/src/modules/observability/metrics.service.ts`, `apps/core/src/modules/observability/metrics.controller.ts`, `apps/core/src/modules/observability/http-metrics.interceptor.ts` | Implemented OpenTelemetry SDK bootstrap with conditional initialization based on OTEL_EXPORTER_OTLP_ENDPOINT; created global ObservabilityModule with Prometheus metric registration, HTTP RED metrics interceptor, metrics controller exposing /metrics endpoint, and metrics service for recording observations. |
| **Application Integration**<br>`apps/core/src/main.ts`, `apps/core/src/app.module.ts` | Added early import of observability/tracing in main.ts to bootstrap instrumentation before app module loads; registered ObservabilityModule in AppModule imports. |
| **Testing & Documentation**<br>`apps/core/test/observability.integration.spec.ts`, `docs/discours-patterns/42-observability-prom-otel.md`, `docs/discours-patterns/README.md` | Added integration test validating /metrics endpoint responses and HTTP metric recording; added pattern documentation describing the two-layer observability approach (Prometheus metrics + OTel tracing); updated pattern index. |

Sequence Diagram(s)

sequenceDiagram
    participant Client
    participant HttpMetricsInterceptor
    participant Controller
    participant MetricsService
    participant PrometheusRegistry
    participant Client2 as Client (/metrics)

    Client->>HttpMetricsInterceptor: HTTP GET /health
    Note over HttpMetricsInterceptor: Capture timestamp, extract req/res
    HttpMetricsInterceptor->>Controller: Forward request
    Controller->>Controller: Process request
    Controller-->>HttpMetricsInterceptor: Return response
    HttpMetricsInterceptor->>MetricsService: recordHttp(method, route, status, duration)
    MetricsService->>PrometheusRegistry: Increment counter + record histogram
    HttpMetricsInterceptor-->>Client: HTTP 200

    Client2->>Controller: GET /metrics
    Controller->>PrometheusRegistry: register.metrics()
    PrometheusRegistry-->>Controller: Prometheus exposition text
    Controller-->>Client2: HTTP 200 + exposition data
sequenceDiagram
    participant Process as Node Process
    participant TracingModule as tracing.ts
    participant OTelSDK as OpenTelemetry SDK
    participant AutoInstrument as Auto-Instrumentations
    participant AppModule

    Process->>TracingModule: import (trigger side-effect)
    TracingModule->>TracingModule: Check OTEL_EXPORTER_OTLP_ENDPOINT
    alt OTEL_EXPORTER_OTLP_ENDPOINT set
        TracingModule->>OTelSDK: new NodeSDK({ resource, traceExporter })
        TracingModule->>OTelSDK: sdk.start()
        OTelSDK->>AutoInstrument: Initialize auto-instrumentations
        AutoInstrument->>Process: Patch modules (http, fs, etc.)
        OTelSDK-->>TracingModule: SDK initialized
        TracingModule->>Process: Register SIGTERM/SIGINT handlers
    else OTEL_EXPORTER_OTLP_ENDPOINT not set
        TracingModule-->>TracingModule: Return early (tracing disabled)
    end
    Process->>AppModule: Load (after tracing patches installed)
    AppModule-->>Process: App initialized

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~28 minutes

Poem

🐰 Metrics and traces now flow,
Red indicators glow and grow,
From /metrics we see,
How fast our requests be,
Observability steals the show! 📊✨

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The PR title accurately and specifically describes the main change: adding Prometheus metrics and OpenTelemetry tracing observability infrastructure as part of Sprint 16 deliverable §5.6. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch feat/sprint-16-b-observability

Comment @coderabbitai help to get the list of available commands and usage tips.

@dobrodob
Member Author

@coderabbitai review — this PR was marked as skipped earlier. Please review explicitly.

@coderabbitai

coderabbitai bot commented Apr 19, 2026

@dobrodob, sure! Kicking off an explicit review of PR #14 now.


🧠 Learnings used
Learnt from: CR
Repo: Discours/publy-backend PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-04-19T12:24:18.182Z
Learning: Stack: NestJS 11, Node.js 20 LTS, TypeScript 5.7 strict mode, Prisma 6 + PostgreSQL 16, Redis 7, Pino logger, Biome for linting/formatting, Vitest + Supertest for testing.
✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

🧹 Nitpick comments (2)
apps/core/test/observability.integration.spec.ts (1)

59-60: Tighten the metric assertion to include route and avoid /metrics false positives.

The current regex can match the scrape request itself (GET /metrics, 200) even if /health wasn’t recorded correctly.

Suggested assertion shape
- /publy_http_requests_total\{[^}]*method="GET"[^}]*status="200"[^}]*\}\s+[1-9]\d*/
+ /publy_http_requests_total\{[^}]*method="GET"[^}]*route="(?:\/health|\/)"[^}]*status="200"[^}]*\}\s+[1-9]\d*/
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@apps/core/test/observability.integration.spec.ts` around lines 59 - 60, The
metric assertion currently can match the scrape request itself; tighten the
regex in the test (observability.integration.spec.ts) so the prometheus metric
line includes the route label for the endpoint you expect (e.g.,
route="/health") in addition to method="GET" and status="200" — update the
expect(res.text).toMatch(...) for publy_http_requests_total to require
route="/health" so the test cannot be satisfied by the /metrics scrape line.
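The false positive is easy to reproduce in isolation. The sketch below runs both regexes from the suggestion against two hand-written exposition lines; the sample lines are assumptions that follow the label order shown in the PR's verification output.

```typescript
// Sample exposition lines (hand-written for illustration).
const scrapeOnly = 'publy_http_requests_total{method="GET",route="/metrics",status="200"} 1';
const healthHit = 'publy_http_requests_total{method="GET",route="/health",status="200"} 1';

// Original assertion: pins method and status only.
const looseRe =
  /publy_http_requests_total\{[^}]*method="GET"[^}]*status="200"[^}]*\}\s+[1-9]\d*/;

// Suggested assertion: also pins the route label.
const strictRe =
  /publy_http_requests_total\{[^}]*method="GET"[^}]*route="(?:\/health|\/)"[^}]*status="200"[^}]*\}\s+[1-9]\d*/;

const looseMatchesScrape = looseRe.test(scrapeOnly);   // the false positive
const strictMatchesScrape = strictRe.test(scrapeOnly); // rejected by the route label
const strictMatchesHealth = strictRe.test(healthHit);  // the line the test cares about
```

The loose pattern accepts the `/metrics` line on its own, while the strict pattern accepts only a line whose route is `/health` or `/`.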
apps/core/src/modules/observability/metrics.service.ts (1)

29-31: Update the comment: it references finalize(), but interceptor currently records via tap.

Small docs drift, but worth aligning to avoid confusion during future changes.

Suggested doc tweak
- * in the `finalize()` branch so success + error paths both flow through
+ * in the interceptor's `tap({ next, error })` callbacks so success + error
+ * paths both flow through
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@apps/core/src/modules/observability/metrics.service.ts` around lines 29 - 31,
Update the doc comment above the MetricsService method that records completed
HTTP requests to remove the stale reference to finalize() and instead mention
that the interceptor records via tap in HttpMetricsInterceptor; specifically,
change the wording in the comment that currently says "Called from
`HttpMetricsInterceptor` in the `finalize()` branch" to something like "Called
from `HttpMetricsInterceptor`, which records via `tap()` so both success and
error paths flow here with the final response status code", ensuring the comment
references HttpMetricsInterceptor and the MetricsService record method name so
future readers can locate the flow.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@apps/core/src/modules/observability/http-metrics.interceptor.ts`:
- Around line 44-47: The fallback for route labels currently uses raw req.path
which can create unbounded metric cardinality; change the fallback logic in the
http-metrics.interceptor (where route is computed and passed to
this.metrics.recordHttp) to return a bounded, constant label (e.g. "<unmatched>"
or "<unknown_route>") or a sanitized, limited form instead of req.path for cases
where req.route?.path is undefined, ensuring all 404/unmatched requests use that
constant/sanitized label.

In `@apps/core/test/observability.integration.spec.ts`:
- Around line 28-32: The test suite currently initializes the Nest app in
beforeAll but never resets DB state between tests; add a beforeEach hook that
calls the shared resetDatabase() helper so each integration test runs against a
clean DB. Specifically, add beforeEach(async () => { await resetDatabase(); });
alongside the existing beforeAll that creates moduleRef/app; ensure the
resetDatabase symbol is imported from your test helpers and invoked before each
test so moduleRef/app initialization remains unchanged.

In `@package.json`:
- Around line 54-59: Update the `@opentelemetry/api` dependency entry in
package.json to constrain the version so it cannot resolve to 1.10.0+ (replace
the caret "^1.9.1" for the "@opentelemetry/api" dependency with either a tilde
"~1.9.1" or an explicit range ">=1.9.1 <1.10.0"); after changing the dependency
string, run your package manager to update the lockfile (npm/yarn/pnpm) so the
locked dependency respects the new constraint.

---

Nitpick comments:
In `@apps/core/src/modules/observability/metrics.service.ts`:
- Around line 29-31: Update the doc comment above the MetricsService method that
records completed HTTP requests to remove the stale reference to finalize() and
instead mention that the interceptor records via tap in HttpMetricsInterceptor;
specifically, change the wording in the comment that currently says "Called from
`HttpMetricsInterceptor` in the `finalize()` branch" to something like "Called
from `HttpMetricsInterceptor`, which records via `tap()` so both success and
error paths flow here with the final response status code", ensuring the comment
references HttpMetricsInterceptor and the MetricsService record method name so
future readers can locate the flow.

In `@apps/core/test/observability.integration.spec.ts`:
- Around line 59-60: The metric assertion currently can match the scrape request
itself; tighten the regex in the test (observability.integration.spec.ts) so the
prometheus metric line includes the route label for the endpoint you expect
(e.g., route="/health") in addition to method="GET" and status="200" — update
the expect(res.text).toMatch(...) for publy_http_requests_total to require
route="/health" so the test cannot be satisfied by the /metrics scrape line.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: abf817e6-433f-4589-bd06-429ef8834ce1

📥 Commits

Reviewing files that changed from the base of the PR and between 0bdb309 and a0a24e5.

⛔ Files ignored due to path filters (1)
  • package-lock.json is excluded by !**/package-lock.json
📒 Files selected for processing (12)
  • .env.example
  • apps/core/src/app.module.ts
  • apps/core/src/main.ts
  • apps/core/src/modules/observability/http-metrics.interceptor.ts
  • apps/core/src/modules/observability/metrics.controller.ts
  • apps/core/src/modules/observability/metrics.service.ts
  • apps/core/src/modules/observability/observability.module.ts
  • apps/core/src/observability/tracing.ts
  • apps/core/test/observability.integration.spec.ts
  • docs/discours-patterns/42-observability-prom-otel.md
  • docs/discours-patterns/README.md
  • package.json

Comment on lines +44 to +47
// Express sets `req.route?.path` after matching — fall back to the raw
// URL path (stripped of query string) for 404s where no route matched.
const route = req.route?.path ?? (typeof req.path === 'string' ? req.path : '<unknown>');
this.metrics.recordHttp(req.method, route, res.statusCode, durationSec);

⚠️ Potential issue | 🟠 Major

Bound the fallback route label; avoid raw req.path on unmatched routes.

Using raw path for 404/unmatched traffic creates unbounded label cardinality (e.g., scanners/random URLs), which can degrade or break metrics storage.

Suggested fix
- const route = req.route?.path ?? (typeof req.path === 'string' ? req.path : '<unknown>');
+ const route = req.route?.path ?? '<unmatched>';
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
  // Express sets `req.route?.path` after matching — fall back to the raw
  // URL path (stripped of query string) for 404s where no route matched.
- const route = req.route?.path ?? (typeof req.path === 'string' ? req.path : '<unknown>');
+ const route = req.route?.path ?? '<unmatched>';
  this.metrics.recordHttp(req.method, route, res.statusCode, durationSec);
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@apps/core/src/modules/observability/http-metrics.interceptor.ts` around lines
44 - 47, The fallback for route labels currently uses raw req.path which can
create unbounded metric cardinality; change the fallback logic in the
http-metrics.interceptor (where route is computed and passed to
this.metrics.recordHttp) to return a bounded, constant label (e.g. "<unmatched>"
or "<unknown_route>") or a sanitized, limited form instead of req.path for cases
where req.route?.path is undefined, ensuring all 404/unmatched requests use that
constant/sanitized label.
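A standalone sketch of the bounded fallback follows. `RequestLike` is a hypothetical structural stand-in for the Express request, and `routeLabel` is an illustrative helper name rather than a function from the PR.

```typescript
// Minimal structural type standing in for Express's Request; only the fields used here.
interface RequestLike {
  route?: { path?: string };
  path?: string;
}

const UNMATCHED_ROUTE = '<unmatched>';

// Uses the matched route pattern when present, otherwise one constant label.
// The raw request path is never used, so hostile URLs cannot create new label values.
function routeLabel(req: RequestLike): string {
  return req.route?.path ?? UNMATCHED_ROUTE;
}

const matched = routeLabel({ route: { path: '/users/:id' }, path: '/users/42' });
const scanner = routeLabel({ path: '/wp-login.php' });
```

Every unmatched request, regardless of its URL, collapses into the single `<unmatched>` series, so the number of `route` label values is bounded by the number of registered routes plus one.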

Comment on lines +28 to +32
beforeAll(async () => {
moduleRef = await Test.createTestingModule({ imports: [AppModule] }).compile();
app = moduleRef.createNestApplication({ bufferLogs: true });
await app.init();
});

⚠️ Potential issue | 🟠 Major

Add resetDatabase() in beforeEach for this integration spec.

This suite currently boots once but never resets DB state between tests. That breaks the test contract for integration specs and can introduce cross-test coupling/flakes.

As per coding guidelines, “Integration tests must be located at apps/core/test/<name>.integration.spec.ts (NOT co-located with service). They must run against the real dockerized stack and reset the DB via resetDatabase() in beforeEach.”

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@apps/core/test/observability.integration.spec.ts` around lines 28 - 32, The
test suite currently initializes the Nest app in beforeAll but never resets DB
state between tests; add a beforeEach hook that calls the shared resetDatabase()
helper so each integration test runs against a clean DB. Specifically, add
beforeEach(async () => { await resetDatabase(); }); alongside the existing
beforeAll that creates moduleRef/app; ensure the resetDatabase symbol is
imported from your test helpers and invoked before each test so moduleRef/app
initialization remains unchanged.

Comment thread package.json Outdated
Comment on lines +54 to +59
"@opentelemetry/api": "^1.9.1",
"@opentelemetry/auto-instrumentations-node": "^0.73.0",
"@opentelemetry/exporter-trace-otlp-http": "^0.215.0",
"@opentelemetry/resources": "^2.7.0",
"@opentelemetry/sdk-node": "^0.215.0",
"@opentelemetry/semantic-conventions": "^1.40.0",

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🌐 Web query:

OpenTelemetry JS SDK 0.215.0 compatible API version

💡 Result:

The OpenTelemetry JS SDK experimental version 0.215.0 (including @opentelemetry/sdk-node 0.215.0) is compatible with @opentelemetry/api versions >=1.3.0 and <1.10.0.

Citations:


Constrain @opentelemetry/api version to ensure long-term SDK 0.215.0 compatibility.

The @opentelemetry/sdk-node 0.215.0 is compatible with @opentelemetry/api versions >=1.3.0 and <1.10.0. While the current caret range ^1.9.1 is compatible with the currently released 1.9.1, it could pull in incompatible 1.10.0+ versions when released. Use a tilde range (~1.9.1) or explicit upper bound (>=1.9.1 <1.10.0) instead.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@package.json` around lines 54 - 59, Update the `@opentelemetry/api` dependency
entry in package.json to constrain the version so it cannot resolve to 1.10.0+
(replace the caret "^1.9.1" for the "@opentelemetry/api" dependency with either
a tilde "~1.9.1" or an explicit range ">=1.9.1 <1.10.0"); after changing the
dependency string, run your package manager to update the lockfile
(npm/yarn/pnpm) so the locked dependency respects the new constraint.
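The difference between the two ranges can be checked with a small matcher. This is a simplified stand-in for npm's semver logic, written only for illustration: it handles plain `^x.y.z` and `~x.y.z` ranges with a major version of 1 or higher and ignores prerelease tags.

```typescript
// Simplified stand-in for npm's semver matching; not the `semver` package.
type Version = [number, number, number];

function parseVersion(v: string): Version {
  const [major = 0, minor = 0, patch = 0] = v.split('.').map(Number);
  return [major, minor, patch];
}

function compare(a: Version, b: Version): number {
  return a[0] - b[0] || a[1] - b[1] || a[2] - b[2];
}

// Supports only "^x.y.z" and "~x.y.z"; any other prefix is treated like "~".
function satisfies(version: string, range: string): boolean {
  const operator = range.charAt(0);
  const base = parseVersion(range.slice(1));
  const candidate = parseVersion(version);
  if (compare(candidate, base) < 0) return false;
  // "^1.9.1" allows anything below 2.0.0; "~1.9.1" allows anything below 1.10.0.
  const upper: Version =
    operator === '^' ? [base[0] + 1, 0, 0] : [base[0], base[1] + 1, 0];
  return compare(candidate, upper) < 0;
}

const caretAllows110 = satisfies('1.10.0', '^1.9.1');
const tildeAllows110 = satisfies('1.10.0', '~1.9.1');
```

Under the caret range a future 1.10.0 would be accepted, which is the case the review comment warns about; under the tilde range it is rejected while 1.9.x patch releases still resolve.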

dobrodob added a commit that referenced this pull request Apr 19, 2026
…dings

1. **http-metrics.interceptor.ts** — collapse the route fallback from
   `req.path` (unbounded cardinality for 404s, scanner traffic) to a
   single `<unmatched>` label. Prevents metrics-store growth under
   hostile input.

2. **observability.integration.spec.ts** — add the standard
   `resetDatabase()` + re-seed-publy pattern in `beforeEach`. This
   suite doesn't write, but the integration-spec contract is "DB reset
   before each test"; conformance beats "we happen not to need it".
   Short-circuit guard handles the cross-suite race where a parallel
   file leaves publy between reset and create.

3. **package.json** — tighten `@opentelemetry/api` from `^1.9.1` to
   `~1.9.1`. `@opentelemetry/sdk-node@0.215.0` is compatible with
   `@opentelemetry/api` `>=1.3.0 <1.10.0` per upstream; a future
   caret-satisfying 1.10 release would break our SDK.

Verification: 2/2 observability spec passing, tsc clean, biome clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@dobrodob dobrodob force-pushed the feat/sprint-16-b-observability branch from 8e3c542 to dddf31f Compare April 19, 2026 23:57
dobrodob and others added 4 commits April 20, 2026 02:47
…try tracing

Adds the second half of the observability foundation alongside pattern 41
(slow-query logs). Logs explain WHAT happened; metrics aggregate; traces
stitch sequence — together they cover the operator's "why is it slow?" path.

Prometheus layer:
- ObservabilityModule (@Global): `@willsoto/nestjs-prometheus` for DI,
  `prom-client` default registry for the exposition endpoint.
- `/metrics` endpoint via our own @Public controller, because the library's
  default would trip the global JwtAuthGuard.
- MetricsService — thin facade over `publy_http_requests_total` (Counter)
  and `publy_http_request_duration_seconds` (Histogram with 10ms–10s
  buckets tuned for our p95 target).
- HttpMetricsInterceptor (APP_INTERCEPTOR) — records RED sample on every
  completed request. `route` label uses `req.route?.path` (pattern, not
  URL) to keep cardinality bounded.
- Default node/process metrics enabled (cpu, rss, heap) — day-1 baseline.

OpenTelemetry layer:
- `apps/core/src/observability/tracing.ts` — NodeSDK + auto-instrumentations
  (HTTP, Express, Prisma, ioredis). Env-gated on OTEL_EXPORTER_OTLP_ENDPOINT
  — unset is a no-op (dev/test default), set activates OTLP/HTTP push.
- Side-effect `startTracing()` at module bottom +
  `import './observability/tracing'` as the FIRST statement in main.ts so
  require-hooks patch upstream modules before any app code loads.
- Graceful shutdown (SIGTERM/SIGINT) flushes in-flight spans.

Tests + docs:
- observability.integration.spec.ts — smoke-level: /metrics is unauth,
  emits Prometheus format, HTTP counter increments after a /health hit.
- 42-observability-prom-otel.md — pattern doc, cardinality guidance,
  import-order footgun warning.
- .env.example — OTEL_EXPORTER_OTLP_ENDPOINT + OTEL_SERVICE_NAME docs.

Dependencies added: @willsoto/nestjs-prometheus, prom-client,
@opentelemetry/{api,sdk-node,auto-instrumentations-node,
exporter-trace-otlp-http,resources,semantic-conventions}.
+18 + 141 packages, 0 vulns.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…dings

1. **http-metrics.interceptor.ts** — collapse the route fallback from
   `req.path` (unbounded cardinality for 404s, scanner traffic) to a
   single `<unmatched>` label. Prevents metrics-store growth under
   hostile input.

2. **observability.integration.spec.ts** — add the standard
   `resetDatabase()` + re-seed-publy pattern in `beforeEach`. This
   suite doesn't write, but the integration-spec contract is "DB reset
   before each test"; conformance beats "we happen not to need it".
   Short-circuit guard handles the cross-suite race where a parallel
   file leaves publy between reset and create.

3. **package.json** — tighten `@opentelemetry/api` from `^1.9.1` to
   `~1.9.1`. `@opentelemetry/sdk-node@0.215.0` is compatible with
   `@opentelemetry/api` `>=1.3.0 <1.10.0` per upstream; a future
   caret-satisfying 1.10 release would break our SDK.

Verification: 2/2 observability spec passing, tsc clean, biome clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
### T1.2 — content-type regex tightened

The old regex OR'd `text/plain` with `version=0.0.4` — a broken future
content-type missing either piece would silently pass. Now asserts both
parts separately, so regressions in the Prometheus exposition format
surface immediately.
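A sketch of why the OR'd form is too permissive is shown here. The exact regexes in the spec are not part of this page, so both checks are assumed shapes, and the sample header values are written by hand.

```typescript
// Assumed shapes of the two assertions; the real spec's regexes are not shown in this PR.
const oldCheck = (contentType: string): boolean =>
  /text\/plain|version=0\.0\.4/.test(contentType);

const newCheck = (contentType: string): boolean =>
  /text\/plain/.test(contentType) && /version=0\.0\.4/.test(contentType);

const valid = 'text/plain; version=0.0.4; charset=utf-8';
const missingVersion = 'text/plain; charset=utf-8';

const oldAcceptsBroken = oldCheck(missingVersion); // passes although the version is gone
const newAcceptsBroken = newCheck(missingVersion); // fails, as intended
const newAcceptsValid = newCheck(valid);
```

With alternation, either half alone satisfies the pattern, so a header that has lost its `version` parameter still passes; requiring both matches closes that gap.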

### T1.3 — OTel deployment.environment resource attribute

Adds `deployment.environment.name` to the tracer resource, sourced from
(in order) `OTEL_DEPLOYMENT_ENVIRONMENT` env → `NODE_ENV` → `"development"`.
Lets Grafana/Tempo split dashboards by environment with a single label.

Sampling is deliberately NOT hard-coded — the OTel SDK natively honors
`OTEL_TRACES_SAMPLER=traceidratio` + `OTEL_TRACES_SAMPLER_ARG=0.1`, the
standard way to tune cost in prod. Keeps operators in control without
code deploys.
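The precedence chain for the environment attribute can be sketched as a pure function. The helper names are hypothetical, and using `||` so that an empty string also falls through is an assumption about the intended behaviour, not something the commit states.

```typescript
type Env = Record<string, string | undefined>;

// Resolution order from the commit message:
// OTEL_DEPLOYMENT_ENVIRONMENT, then NODE_ENV, then "development".
function resolveDeploymentEnvironment(env: Env): string {
  // `||` rather than `??` so an empty string also falls through (an assumption).
  return env['OTEL_DEPLOYMENT_ENVIRONMENT'] || env['NODE_ENV'] || 'development';
}

// Shape of the attribute as it would be attached to the tracer resource.
function environmentAttributes(env: Env): Record<string, string> {
  return { 'deployment.environment.name': resolveDeploymentEnvironment(env) };
}

const explicit = environmentAttributes({ OTEL_DEPLOYMENT_ENVIRONMENT: 'staging', NODE_ENV: 'production' });
const fromNodeEnv = environmentAttributes({ NODE_ENV: 'production' });
const fallback = environmentAttributes({});
```

In the application the argument would be `process.env`; sampling stays outside this function because the SDK reads `OTEL_TRACES_SAMPLER` and `OTEL_TRACES_SAMPLER_ARG` itself.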

### T2.3 — Prisma query duration histogram

- `libs/prisma/src/prisma.service.ts`: new `PrismaMetricsSink` interface
  + `PRISMA_METRICS_SINK` DI token. Optional — library stays standalone.
  Every query flows through the sink; slow ones also log.
- `metrics.service.ts`: `recordPrismaQuery(durationMs, target, slow)`
  observes `publy_prisma_query_duration_ms` (1 ms → 10 s buckets) and
  ticks `publy_prisma_slow_queries_total` on slow rows.
- `observability.module.ts`: factory-provides the sink against
  MetricsService — one-way coupling (libs/prisma knows nothing about
  Prometheus).
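The optional-sink arrangement might look roughly like the sketch below. The commit names `PrismaMetricsSink` and `recordPrismaQuery(durationMs, target, slow)` but does not show the interface body, so the interface shape, the `reportQuery` helper, the 200 ms threshold, and the in-memory sink are all illustrative assumptions.

```typescript
// Hypothetical interface shape; only the names come from the commit message.
interface PrismaMetricsSink {
  recordPrismaQuery(durationMs: number, target: string, slow: boolean): void;
}

const SLOW_QUERY_THRESHOLD_MS = 200; // assumed threshold, not from the PR

// Library side: reports to the sink only if one was provided,
// so the library keeps working when no metrics layer is wired in.
function reportQuery(durationMs: number, target: string, sink?: PrismaMetricsSink): void {
  const slow = durationMs >= SLOW_QUERY_THRESHOLD_MS;
  sink?.recordPrismaQuery(durationMs, target, slow);
}

// App side: an in-memory sink standing in for MetricsService.
class InMemorySink implements PrismaMetricsSink {
  readonly durations: number[] = [];
  slowQueries = 0;

  recordPrismaQuery(durationMs: number, _target: string, slow: boolean): void {
    this.durations.push(durationMs); // would be histogram.observe(...) in the real service
    if (slow) this.slowQueries += 1; // would be counter.inc() in the real service
  }
}

const sink = new InMemorySink();
reportQuery(12, 'User.findMany', sink);
reportQuery(450, 'Post.findMany', sink);
reportQuery(5, 'User.findUnique'); // no sink supplied: silently skipped
```

The dependency points one way only: the metrics layer implements the interface that the library declares, and the library never imports anything Prometheus-related.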

### T2.4 — EventRouter Prometheus gauges

Four new metrics expose the SSE fan-out's internal counters:
- `publy_event_router_clients_connected` (gauge)
- `publy_event_router_clients_peak` (gauge)
- `publy_event_router_events_received_total` (counter)
- `publy_event_router_events_routed_total` (counter)

Gauges use prom-client's `collect` callback pattern (pull at scrape
time, not push per event). Counters compute deltas from a snapshot
cache so they stay monotonic even though EventRouter itself exposes
lifetime totals.

Sink-in-metrics-service + collect-at-scrape-time avoid adding Prometheus
as a dependency of EventRouterService — keeps that service's failure
surface tight.
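The snapshot-and-delta idea for the two counters can be sketched without prom-client. `MonotonicCounter` is a stand-in for a Prometheus counter, `DeltaTracker` is a hypothetical name, and the handling of a source reset is an assumption added for illustration rather than behaviour described in the commit.

```typescript
// Stand-in for a Prometheus counter: it can only be incremented, never set.
class MonotonicCounter {
  private total = 0;

  inc(by: number): void {
    if (by < 0) throw new Error('counters cannot decrease');
    this.total += by;
  }

  get value(): number {
    return this.total;
  }
}

// Feeds a counter from a source that only exposes lifetime totals.
class DeltaTracker {
  private readonly counter: MonotonicCounter;
  private lastSeen = 0;

  constructor(counter: MonotonicCounter) {
    this.counter = counter;
  }

  // Call at scrape time with the source's current lifetime total.
  sync(lifetimeTotal: number): void {
    const delta = lifetimeTotal - this.lastSeen;
    // Assumption: a smaller total means the source was reset, so count it from zero.
    this.counter.inc(delta >= 0 ? delta : lifetimeTotal);
    this.lastSeen = lifetimeTotal;
  }
}

const eventsReceived = new MonotonicCounter();
const tracker = new DeltaTracker(eventsReceived);
tracker.sync(5);  // first scrape: +5
tracker.sync(5);  // nothing new: +0
tracker.sync(12); // +7
```

Because only the difference since the previous scrape is added, repeated scrapes never double-count, and the exported counter never goes backwards.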

Verification: /metrics integration spec 2/2, tsc clean, biome clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PR #12 added `@sentry/node` to main's lockfile. When this branch
rebased onto the new main, we hit a package-lock.json merge conflict
(additions on both sides) and took `theirs` to keep our OTel and
Prometheus entries. That resolution dropped Sentry's entries.

Running `npm install` restores them — the lockfile now contains every
dependency package.json declares, on any of the 7 sprint-16 branches,
so each branch installs cleanly in isolation.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@dobrodob dobrodob force-pushed the feat/sprint-16-a-slow-query-log branch from 41bc13c to a2bc573 Compare April 20, 2026 00:53
@dobrodob dobrodob force-pushed the feat/sprint-16-b-observability branch from dddf31f to 1e7a5c1 Compare April 20, 2026 00:53