feat(observability): add Prometheus metrics, audit logging, and OpenTelemetry tracing #21

Merged
DeFiVC merged 2 commits into ChainLearnOfficial:main from brightpixel-dev:feat/issue-9-observability-stack
Jun 20, 2026

@brightpixel-dev brightpixel-dev commented Jun 20, 2026

Closes #9

What changed

  • src/metrics/index.ts — prom-client registry with 6 metrics: HTTP request counter, HTTP latency histogram, Stellar tx duration histogram (per-method, success/error), quiz submission counter (passed/failed), reward claims counter (success/queued), credentials minted counter; default Node.js process metrics collected
  • src/metrics/fastify-hook.ts — Fastify onRequest/onResponse hooks that record count and latency for every route, labelled by method, route pattern, and status code
  • src/server.ts — registers metrics hook, exposes GET /metrics (Prometheus text format), imports initTracing() before Fastify setup, calls shutdownTracing() on clean exit
  • src/audit/index.ts — typed auditLog(event, fields) helper that emits a pino log entry with audit: true; events: quiz.submitted, reward.claimed, reward.queued, credential.minted, auth.login, auth.login_failed
  • src/tracing.ts — OTel NodeSDK with OTLP/HTTP exporter (configurable via OTEL_EXPORTER_OTLP_ENDPOINT), Fastify, pg, and ioredis auto-instrumentations; disabled when OTEL_SDK_DISABLED=true
  • reward.service.ts — Stellar tx duration measured around invokeContract, rewardClaimsTotal incremented on success/queued, auditLog called on both paths
  • credential.service.ts — same Stellar tx timing pattern for mint_credential, credentialsMintedTotal incremented, auditLog on success
  • quiz.service.ts: quizSubmissionsTotal incremented with passed/failed label, auditLog on submission
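
The Stellar timing pattern described for reward.service.ts and credential.service.ts can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the PR's actual code: startStellarTimer and timedInvoke are hypothetical names, DurationHistogram is modeled on the observe(labels, value) method of prom-client's Histogram, and the method/status label names are inferred from the description above.

```typescript
// Structural stand-in (assumption): modeled on prom-client's Histogram#observe(labels, value).
interface DurationHistogram {
  observe(labels: Record<string, string>, seconds: number): void;
}

type TxStatus = "success" | "error";

// Hypothetical helper: starts a timer and returns a function that records the
// elapsed time in seconds once the outcome of the contract call is known.
function startStellarTimer(
  histogram: DurationHistogram,
  method: string,
  nowMs: () => number = Date.now, // injectable clock, useful in tests
): (status: TxStatus) => number {
  const startedAt = nowMs();
  return (status: TxStatus): number => {
    const seconds = (nowMs() - startedAt) / 1000;
    histogram.observe({ method, status }, seconds);
    return seconds;
  };
}

// Hypothetical wrapper: times any async contract call, labelling it
// success or error, and rethrows failures so callers still see them.
async function timedInvoke<T>(
  histogram: DurationHistogram,
  method: string,
  call: () => Promise<T>,
): Promise<T> {
  const stop = startStellarTimer(histogram, method);
  try {
    const result = await call();
    stop("success");
    return result;
  } catch (err) {
    stop("error");
    throw err;
  }
}
```

Recording the duration on both paths keeps failed contract calls visible in the histogram instead of silently dropping them.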

Why

Closes issue #9 — without metrics there is no visibility into request latency, Stellar contract call health, or financial event throughput. Audit logs provide a tamper-evident record of every on-chain operation for compliance. OTel traces allow distributed request correlation across Fastify → Postgres → Redis → Stellar.

How to test

# Start the server
npm run dev

# Scrape metrics endpoint
curl http://localhost:3000/metrics

# Look for these metric families:
#   http_requests_total
#   http_request_duration_seconds
#   stellar_tx_duration_seconds
#   quiz_submissions_total
#   reward_claims_total
#   credentials_minted_total
#   process_cpu_user_seconds_total   (default Node metrics)

# For tracing: point OTEL_EXPORTER_OTLP_ENDPOINT at a local Jaeger/Tempo
# For audit logs: grep for "audit":true in pino output after a quiz submit
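
The manual check above could also be automated with a small script. The sketch below is only an illustration: metricFamilies, missingFamilies, and EXPECTED_FAMILIES are hypothetical names, and it assumes the endpoint emits the usual "# TYPE <name> <type>" lines of the Prometheus text format.

```typescript
// Collects metric family names from "# TYPE <name> <type>" lines.
function metricFamilies(exposition: string): Set<string> {
  const families = new Set<string>();
  for (const line of exposition.split("\n")) {
    const match = /^# TYPE (\S+) \S+/.exec(line.trim());
    const name = match?.[1];
    if (name) families.add(name);
  }
  return families;
}

// Returns the expected families that are absent from the scraped text.
function missingFamilies(exposition: string, expected: readonly string[]): string[] {
  const present = metricFamilies(exposition);
  return expected.filter((name) => !present.has(name));
}

// Family names taken from the checklist in this PR description.
const EXPECTED_FAMILIES: readonly string[] = [
  "http_requests_total",
  "http_request_duration_seconds",
  "stellar_tx_duration_seconds",
  "quiz_submissions_total",
  "reward_claims_total",
  "credentials_minted_total",
  "process_cpu_user_seconds_total",
];
```

With the server running, the response body of GET /metrics can be passed to missingFamilies together with EXPECTED_FAMILIES; an empty result means every listed family was exposed.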

@DeFiVC DeFiVC left a comment

Solid observability implementation that covers metrics, audit logging, and tracing — but must resolve merge conflicts before merging.

Blocking Issues

  1. ❌ Merge Conflicts: mergeStateStatus is "DIRTY". The PR is based on stale main. src/server.ts has new health check endpoints (/health, /health/live, /health/ready) and a stellarClient import that the PR's diff doesn't account for. Rebase needed:

    git fetch origin
    git rebase origin/main
    # resolve conflicts in src/server.ts
    git rebase --continue
  2. ⚠️ CI Not Run: statusCheckRollup is empty (likely due to conflict state). Must verify npm run typecheck, npm run lint, and npm test pass after rebase.

Code Issues

  1. Missing system gauges — Issue #9 requested db_active_connections and redis_connected gauges for infrastructure health. Not implemented.

  2. src/audit/index.ts — Missing ip and userAgent fields in AuditFields for security event tracing (issue #9 specified these for failed auth attempts).

What's Good

  • Clean module separation: metrics/, audit/, tracing/
  • Correct metric types: counters for totals, histograms for latency with sensible buckets
  • Audit logging covers all financial events (reward claims, credential mints, quiz submissions)
  • Tracing can be switched off via OTEL_SDK_DISABLED=true (safe for dev)
  • initTracing() before Fastify setup captures full request lifecycle
  • shutdownTracing() on clean exit prevents span data loss
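
As an illustration of the two environment variables involved in tracing, the sketch below resolves them into a plain config object. It assumes the behaviour stated in the PR description (tracing is off only when OTEL_SDK_DISABLED=true) and the OTLP/HTTP convention of a http://localhost:4318 default base with /v1/traces appended for traces. resolveTracingConfig and TracingConfig are hypothetical names, and the real exporter performs this resolution itself.

```typescript
interface TracingConfig {
  enabled: boolean;
  tracesUrl: string;
}

// Hypothetical helper: derives the effective tracing settings from env vars.
function resolveTracingConfig(env: Record<string, string | undefined>): TracingConfig {
  // Only the literal value "true" (case-insensitive) disables the SDK.
  const disabled = (env["OTEL_SDK_DISABLED"] ?? "").trim().toLowerCase() === "true";
  // Strip trailing slashes so the signal path is appended exactly once.
  const base = (env["OTEL_EXPORTER_OTLP_ENDPOINT"] ?? "http://localhost:4318").replace(/\/+$/, "");
  return { enabled: !disabled, tracesUrl: `${base}/v1/traces` };
}
```

A helper like this is mainly useful for logging the effective settings at startup, before initTracing() runs.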

Please rebase on current main, resolve conflicts, and verify CI passes.

…elemetry tracing

- Add prom-client registry with HTTP latency histogram, Stellar tx duration
  histogram, quiz submissions counter, reward claims counter, and credentials
  minted counter; expose /metrics endpoint for scraping
- Register Fastify onRequest/onResponse hooks to instrument every route with
  request count and duration labels (method, route, status_code)
- Add structured audit logger (src/audit/index.ts) emitting pino entries with
  audit: true for quiz.submitted, reward.claimed/queued, and credential.minted
- Initialize OpenTelemetry NodeSDK in src/tracing.ts with OTLP exporter and
  Fastify, pg, and ioredis auto-instrumentations; call initTracing() before
  Fastify setup and shutdownTracing() on SIGTERM/SIGINT
- Instrument reward.service, credential.service, and quiz.service with metrics
  increments, Stellar tx timing, and audit log calls on every financial event

Closes ChainLearnOfficial#9
…gent, resolve conflict

- Add db_active_connections Gauge (samples pool.totalCount - pool.idleCount on
  each /metrics scrape) and redis_connected Gauge (0/1 via ioredis connect/close
  events) as requested in issue ChainLearnOfficial#9 and reviewer feedback
- Export pool from config/database.ts so setupInfraMetrics() can bind to it;
  call setupInfraMetrics(pool, redis) in buildApp() before route registration
- Add ip and userAgent fields to AuditFields interface for security event tracing
- Resolve rebase conflict in quiz.service.ts: keep generateQuizFromAI import
  from upstream AI-quiz PR alongside observability imports

Closes ChainLearnOfficial#9
@brightpixel-dev brightpixel-dev force-pushed the feat/issue-9-observability-stack branch from fc9cab2 to ef4c985 Compare June 20, 2026 13:13
@brightpixel-dev

Hi @DeFiVC — all review items have been addressed and the branch has been force-pushed. Here's a summary of what was done:

1. Merge conflict resolved
Rebased on current main. The only conflict was in quiz.service.ts imports — the AI-quiz PR (#20) had added generateQuizFromAI. Kept both that import and the observability imports.

2. Missing infra gauges added

  • db_active_connections — prom-client collect() callback samples pool.totalCount - pool.idleCount on each /metrics scrape (zero overhead at idle)
  • redis_connected — 0/1 gauge updated via ioredis connect/close/error events
  • Exported pool from config/database.ts and call setupInfraMetrics(pool, redis) in buildApp() before route registration
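
The two gauges follow different update strategies, which the sketch below tries to make concrete. It is dependency-free and only approximates the PR: PoolLike, GaugeLike, and RedisLike are structural stand-ins (assumed to match pg's pool.totalCount / pool.idleCount, prom-client's Gauge#set, and ioredis's connect/close/error events), and sampleActiveConnections and bindRedisGauge are hypothetical names.

```typescript
// Structural stand-ins (assumptions) for pg's Pool, prom-client's Gauge, and an ioredis client.
interface PoolLike {
  totalCount: number;
  idleCount: number;
}
interface GaugeLike {
  set(value: number): void;
}
interface RedisLike {
  on(event: "connect" | "close" | "error", listener: () => void): unknown;
}

// Lazy sampling: meant to be called from the gauge's collect() callback,
// so the pool is only inspected when /metrics is scraped.
function sampleActiveConnections(pool: PoolLike, gauge: GaugeLike): void {
  gauge.set(pool.totalCount - pool.idleCount);
}

// Event-driven: the gauge flips between 1 and 0 as the connection state changes.
function bindRedisGauge(redis: RedisLike, gauge: GaugeLike): void {
  gauge.set(0); // reported as disconnected until the first "connect" event
  redis.on("connect", () => gauge.set(1));
  redis.on("close", () => gauge.set(0));
  redis.on("error", () => gauge.set(0));
}
```

Sampling the pool at scrape time avoids a background timer, while the Redis gauge has to be event-driven because the connection state is not a counter that can be read off the client on demand in this sketch.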

3. ip and userAgent added to AuditFields
Both fields are now in the interface so auth routes can pass request.ip and request.headers['user-agent'] when calling auditLog("auth.login_failed", { ip, userAgent, ... }).
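
A rough sketch of how the typed helper and the new fields might fit together is shown below. The event names come from this PR; everything else is an assumption for illustration: makeAuditLog and AuditSink are hypothetical, the sink stands in for a pino logger's info(obj) call so the example has no dependencies, and userId is an invented field.

```typescript
type AuditEvent =
  | "quiz.submitted"
  | "reward.claimed"
  | "reward.queued"
  | "credential.minted"
  | "auth.login"
  | "auth.login_failed";

// Only ip and userAgent are confirmed by the PR; userId is an illustrative assumption.
interface AuditFields {
  userId?: string;
  ip?: string;
  userAgent?: string;
  [extra: string]: unknown;
}

// Stand-in for a logger call such as pino's logger.info(obj).
type AuditSink = (entry: Record<string, unknown>) => void;

// Hypothetical factory: binds the helper to a sink and returns auditLog(event, fields).
function makeAuditLog(sink: AuditSink): (event: AuditEvent, fields?: AuditFields) => void {
  return (event: AuditEvent, fields: AuditFields = {}): void => {
    // audit and event are written last so caller-supplied fields cannot override them.
    sink({ ...fields, audit: true, event });
  };
}
```

Because AuditEvent is a closed union, a misspelled event name fails at compile time rather than producing an audit entry that no query will ever match.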

4. Typecheck passes
npx tsc --noEmit exits clean after rebase and all changes.

Ready for re-review.

@DeFiVC DeFiVC left a comment

All blocking issues from the previous review have been addressed. The PR now delivers a complete observability stack that fully satisfies issue #9.

What's Good

  • Infrastructure gauges: db_active_connections (sampled lazily as pool.totalCount - pool.idleCount through prom-client's collect() callback) and redis_connected (updated on connect/close/error events)
  • Audit fields complete: ip and userAgent now present in AuditFields for security event tracing
  • Clean rebase: Conflicts resolved, upstream AI-quiz PR changes incorporated
  • CI green: Lint & Typecheck and Test both pass

Summary

  • 6 Prometheus metrics (HTTP, Stellar tx, quiz, reward, credential) + 2 infrastructure gauges
  • Structured audit logging for all financial events
  • OpenTelemetry tracing with Fastify, pg, and ioredis auto-instrumentation
  • /metrics endpoint for Prometheus scraping

Approving — the implementation is solid, well-structured, and addresses all requirements from issue #9.

@DeFiVC DeFiVC merged commit 3178be9 into ChainLearnOfficial:main Jun 20, 2026
2 checks passed

Linked issue

[Expert] Implement full observability stack: Prometheus metrics, structured audit logging, and OpenTelemetry tracing