[Expert] Implement full observability stack: Prometheus metrics, structured audit logging, and OpenTelemetry tracing #9

Description

@DeFiVC

The API has no observability infrastructure beyond basic Pino request logging. There are no Prometheus metrics, no structured audit trails for financial operations, and no distributed tracing. This makes it impossible to monitor the system, detect anomalies, debug production issues, or meet compliance requirements for a financial platform.

Problem Analysis

What is missing

  1. No metrics: No request latency histograms, no error rate counters, no business metrics (quiz completions/hour, reward claims/day)
  2. No audit trail: Reward claims and credential mints modify both DB and blockchain, but there is no audit log recording who did what, when, and the full transaction lifecycle
  3. No distributed tracing: A single request spans JWT verify → DB query → Redis lookup → Stellar RPC → DB update, with no way to trace the full path
  4. No Prometheus endpoint: There is no /metrics endpoint for Prometheus (or a compatible agent feeding Grafana or Datadog) to scrape
  5. No alerting hooks: No way to detect spikes in error rates or latency

Current logging

The codebase uses Pino (src/utils/logger.ts) for basic request logging, but:

  • Logs are not structured for aggregation (no consistent field names)
  • No business event logging (e.g., "reward claimed" with amount, user, tx hash)
  • No security event logging (e.g., "failed auth attempt" with IP, address)

Required Implementation

A. Prometheus Metrics

Install: npm install prom-client

// New file: src/metrics/index.ts
import { Registry, Counter, Histogram, Gauge } from "prom-client";

export const register = new Registry();

// HTTP metrics
export const httpRequestDuration = new Histogram({
  name: "http_request_duration_seconds",
  help: "Duration of HTTP requests in seconds",
  labelNames: ["method", "route", "status_code"],
  buckets: [0.01, 0.05, 0.1, 0.5, 1, 2, 5, 10],
  registers: [register],
});

export const httpRequestTotal = new Counter({
  name: "http_requests_total",
  help: "Total number of HTTP requests",
  labelNames: ["method", "route", "status_code"],
  registers: [register],
});

// Stellar metrics
export const stellarTxTotal = new Counter({
  name: "stellar_transactions_total",
  help: "Total Stellar transactions submitted",
  labelNames: ["method", "status"],
  registers: [register],
});

export const stellarTxDuration = new Histogram({
  name: "stellar_transaction_duration_seconds",
  help: "Duration of Stellar transaction submission",
  labelNames: ["method"],
  buckets: [0.5, 1, 2, 5, 10, 30],
  registers: [register],
});

// Business metrics
export const rewardClaimsTotal = new Counter({
  name: "reward_claims_total",
  help: "Total reward claims",
  labelNames: ["status", "amount_bucket"],
  registers: [register],
});

export const credentialsMintedTotal = new Counter({
  name: "credentials_minted_total",
  help: "Total credentials minted",
  labelNames: ["course_id"],
  registers: [register],
});

export const quizzesCompletedTotal = new Counter({
  name: "quizzes_completed_total",
  help: "Total quizzes completed",
  labelNames: ["passed"],
  registers: [register],
});

// System metrics
export const activeConnections = new Gauge({
  name: "db_active_connections",
  help: "Number of active database connections",
  registers: [register],
});

export const redisConnected = new Gauge({
  name: "redis_connected",
  help: "Redis connection status (1=connected, 0=disconnected)",
  registers: [register],
});
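
The amount_bucket label on reward_claims_total only stays useful if it takes a small, fixed set of values, since every distinct label value creates a new time series. A minimal sketch of one way to derive it is below; the helper name amountBucket and the thresholds are assumptions for illustration, not part of the existing codebase.

```typescript
// Hypothetical thresholds; tune them to the actual reward sizes.
const AMOUNT_BUCKET_BOUNDS = [10, 100, 1000] as const;

// Maps a raw reward amount to one of a fixed set of label values so the
// amount_bucket label cannot grow without bound.
export function amountBucket(amount: number): string {
  if (!Number.isFinite(amount) || amount < 0) return "invalid";
  for (const bound of AMOUNT_BUCKET_BOUNDS) {
    if (amount <= bound) return `le_${bound}`;
  }
  const last = AMOUNT_BUCKET_BOUNDS[AMOUNT_BUCKET_BOUNDS.length - 1];
  return `gt_${last}`;
}
```

In the reward service this could be used as rewardClaimsTotal.inc({ status: "success", amount_bucket: amountBucket(REWARD_AMOUNT) }).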

B. Metrics Endpoint

// In server.ts
import { register } from "./metrics/index.js";

app.get("/metrics", async (request, reply) => {
  reply.header("Content-Type", register.contentType);
  return reply.send(await register.metrics());
});

C. Fastify Metrics Hook

// New file: src/metrics/fastify-hook.ts
import type { FastifyInstance } from "fastify";
import { httpRequestDuration, httpRequestTotal } from "./index.js";

export function registerMetricsHook(app: FastifyInstance) {
  app.addHook("onResponse", async (request, reply) => {
    const duration = (reply.elapsedTime || 0) / 1000;
    const labels = {
      method: request.method,
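      // Use the route template, never the raw URL, so unmatched paths cannot create unbounded label values.
      route: request.routeOptions?.url ?? "unmatched",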
      status_code: reply.statusCode,
    };

    httpRequestDuration.observe(labels, duration);
    httpRequestTotal.inc(labels);
  });
}

D. Structured Audit Logging

// New file: src/audit/index.ts
import { logger } from "../utils/logger.js";

export interface AuditEvent {
  event: string;
  userId?: string;
  stellarAddress?: string;
  resource: string;
  resourceId?: string;
  action: string;
  result: "success" | "failure";
  txHash?: string;
  amount?: number;
  metadata?: Record<string, unknown>;
  ip?: string;
  userAgent?: string;
}

export function auditLog(event: AuditEvent) {
  logger.info({
    audit: true,
    ...event,
    timestamp: new Date().toISOString(),
  }, `[AUDIT] ${event.event}`);
}

Usage in reward service:

auditLog({
  event: "reward_claimed",
  userId,
  stellarAddress: user.stellarAddress,
  resource: "quiz_submission",
  resourceId: submissionId,
  action: "claim_reward",
  result: "success",
  txHash,
  amount: REWARD_AMOUNT,
  ip: request.ip,
});
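
The usage above covers only the success path, while the AuditEvent type also allows result: "failure". A minimal sketch of building the event for either outcome follows, so a failed claim leaves a trail too. The helper name rewardClaimAuditEvent, the ClaimOutcome type, and the "reward_claim_failed" event name are assumptions for illustration; the interface is a trimmed copy of AuditEvent so the block stands alone.

```typescript
// Trimmed copy of the AuditEvent shape from src/audit/index.ts.
interface AuditEvent {
  event: string;
  userId?: string;
  resource: string;
  resourceId?: string;
  action: string;
  result: "success" | "failure";
  txHash?: string;
  amount?: number;
  metadata?: Record<string, unknown>;
}

// Hypothetical result type for a reward claim attempt.
type ClaimOutcome =
  | { ok: true; txHash: string }
  | { ok: false; error: unknown };

// Builds the audit event for a claim, whether it succeeded or failed.
export function rewardClaimAuditEvent(
  base: { userId: string; submissionId: string; amount: number },
  outcome: ClaimOutcome,
): AuditEvent {
  const common = {
    event: outcome.ok ? "reward_claimed" : "reward_claim_failed",
    userId: base.userId,
    resource: "quiz_submission",
    resourceId: base.submissionId,
    action: "claim_reward",
    amount: base.amount,
  };
  if (outcome.ok) {
    return { ...common, result: "success", txHash: outcome.txHash };
  }
  // Record why the claim failed without assuming the thrown value is an Error.
  const reason =
    outcome.error instanceof Error ? outcome.error.message : String(outcome.error);
  return { ...common, result: "failure", metadata: { reason } };
}
```

The reward service could wrap the claim in try/catch and pass the resulting event to auditLog() in both branches.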

E. OpenTelemetry Distributed Tracing

Install: npm install @opentelemetry/sdk-node @opentelemetry/api @opentelemetry/sdk-trace-node @opentelemetry/sdk-trace-base @opentelemetry/instrumentation @opentelemetry/instrumentation-fastify @opentelemetry/instrumentation-pg @opentelemetry/instrumentation-redis

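// New file: src/tracing.ts (must be imported FIRST in server.ts)
import { registerInstrumentations } from "@opentelemetry/instrumentation";
import { NodeTracerProvider } from "@opentelemetry/sdk-trace-node";
import { SimpleSpanProcessor, ConsoleSpanExporter } from "@opentelemetry/sdk-trace-base";
import { FastifyInstrumentation } from "@opentelemetry/instrumentation-fastify";
import { PgInstrumentation } from "@opentelemetry/instrumentation-pg";
import { RedisInstrumentation } from "@opentelemetry/instrumentation-redis";

// Span processors are passed to the constructor; SDK 2.x removed provider.addSpanProcessor().
// ConsoleSpanExporter is for local debugging only; use an OTLP exporter in production.
const provider = new NodeTracerProvider({
  spanProcessors: [new SimpleSpanProcessor(new ConsoleSpanExporter())],
});
provider.register();

// NodeTracerProvider does not accept an `instrumentations` option, so register them separately.
registerInstrumentations({
  tracerProvider: provider,
  instrumentations: [
    new FastifyInstrumentation(),
    new PgInstrumentation(),
    new RedisInstrumentation(),
  ],
});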
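
The instrumentations listed here cover Fastify, Postgres, and Redis, but not the Stellar RPC hop in the request path described earlier, so that call would need a manual span. The sketch below shows one possible wrapper. SpanLike and TracerLike are hypothetical stand-ins that mirror only the subset of the @opentelemetry/api Tracer and Span methods used here (startActiveSpan, setAttribute, recordException, setStatus, end), so the block runs without the package installed; the helper name traceStellarCall and the span and attribute names are likewise assumptions.

```typescript
// Minimal stand-ins for the parts of @opentelemetry/api this sketch relies on.
interface SpanLike {
  setAttribute(key: string, value: string | number | boolean): unknown;
  recordException(error: Error): void;
  setStatus(status: { code: number; message?: string }): unknown;
  end(): void;
}

interface TracerLike {
  startActiveSpan<T>(name: string, fn: (span: SpanLike) => T): T;
}

// Numeric value of SpanStatusCode.ERROR in @opentelemetry/api.
const SPAN_STATUS_ERROR = 2;

// Runs an async Stellar call inside a span, marks the span as failed if the
// call throws, and always ends the span.
export function traceStellarCall<T>(
  tracer: TracerLike,
  method: string,
  call: () => Promise<T>,
): Promise<T> {
  return tracer.startActiveSpan(`stellar.${method}`, async (span) => {
    span.setAttribute("stellar.method", method);
    try {
      return await call();
    } catch (err) {
      const error = err instanceof Error ? err : new Error(String(err));
      span.recordException(error);
      span.setStatus({ code: SPAN_STATUS_ERROR, message: error.message });
      throw err; // rethrow so callers still see the original failure
    } finally {
      span.end();
    }
  });
}
```

With the real API, the tracer would typically come from trace.getTracer(...) in @opentelemetry/api, and SpanStatusCode.ERROR would replace the local constant.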

F. Business Event Dashboard Queries

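-- Note: these queries assume audit events are also persisted to an audit_events table;
-- the auditLog() helper in section D only writes to the Pino log stream.
-- Reward claims per hour (last 24h)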
SELECT date_trunc('hour', created_at) AS hour, COUNT(*), SUM(amount)
FROM audit_events WHERE event = 'reward_claimed'
AND created_at > NOW() - INTERVAL '24 hours'
GROUP BY hour ORDER BY hour;

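-- Error rate per endpoint (last 1h)
-- Note: assumes request metrics are exported to a SQL table named metrics. Against Prometheus itself, a PromQL equivalent is:
--   sum by (route, status_code) (increase(http_requests_total{status_code=~"5.."}[1h]))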
SELECT route, status_code, COUNT(*)
FROM metrics WHERE timestamp > NOW() - INTERVAL '1 hour'
AND status_code >= 500
GROUP BY route, status_code;

-- Active users per day
SELECT date_trunc('day', created_at) AS day, COUNT(DISTINCT user_id)
FROM audit_events WHERE event IN ('reward_claimed', 'quiz_submitted')
GROUP BY day ORDER BY day DESC;
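
The queries above read snake_case columns (user_id, created_at, amount), whereas AuditEvent uses camelCase fields and is only written to the log. If audit events were also stored in an audit_events table, a mapping along these lines would be needed. This is a sketch under that assumption: the table, the row shape, the tx_hash column, and the helper name toAuditRow are all hypothetical.

```typescript
// Subset of AuditEvent fields that the dashboard queries rely on.
interface AuditEventInput {
  event: string;
  userId?: string;
  amount?: number;
  txHash?: string;
  result: "success" | "failure";
}

// Hypothetical row shape for an audit_events table.
interface AuditEventRow {
  event: string;
  user_id: string | null;
  amount: number | null;
  tx_hash: string | null;
  result: string;
  created_at: string; // ISO 8601 timestamp in UTC
}

// Converts an audit event into the column names the queries expect.
// Missing optional fields become SQL NULLs rather than being dropped.
export function toAuditRow(e: AuditEventInput, now: Date = new Date()): AuditEventRow {
  return {
    event: e.event,
    user_id: e.userId ?? null,
    amount: e.amount ?? null,
    tx_hash: e.txHash ?? null,
    result: e.result,
    created_at: now.toISOString(),
  };
}
```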

Files to create or modify

  • New: src/metrics/index.ts — Prometheus metrics definitions
  • New: src/metrics/fastify-hook.ts — Fastify metrics collection
  • New: src/audit/index.ts — Structured audit logging
  • New: src/tracing.ts — OpenTelemetry setup
  • Modify: src/server.ts — register metrics hook, add /metrics endpoint, import tracing
  • Modify: src/modules/rewards/reward.service.ts — add audit logging
  • Modify: src/modules/credentials/credential.service.ts — add audit logging
  • Modify: src/modules/quizzes/quiz.service.ts — add business metrics

Dependencies to Add

npm install prom-client @opentelemetry/sdk-node @opentelemetry/api \
  @opentelemetry/sdk-trace-node @opentelemetry/sdk-trace-base \
  @opentelemetry/instrumentation \
  @opentelemetry/instrumentation-fastify @opentelemetry/instrumentation-pg \
  @opentelemetry/instrumentation-redis

Testing Requirements

  • Verify /metrics endpoint returns valid Prometheus text format
  • Verify the http_request_duration_seconds histogram records observations in the correct buckets
  • Verify audit logs contain all required fields
  • Verify OpenTelemetry spans are created for Fastify requests
  • Load test: verify metrics collection does not add significant latency (< 1ms per request)
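
For the histogram check, it helps to remember that Prometheus histogram buckets are cumulative: an observation increments every bucket whose upper bound (le) is greater than or equal to the value, plus the +Inf bucket. The sketch below computes which le labels a test should expect to see incremented for a given duration. The helper name expectedBucketLabels is an assumption; the bounds are copied from the http_request_duration_seconds definition.

```typescript
// Bucket bounds from the http_request_duration_seconds definition.
const HTTP_BUCKETS: readonly number[] = [0.01, 0.05, 0.1, 0.5, 1, 2, 5, 10];

// Returns the `le` label values whose counters a single observation of
// `seconds` should increment. Bounds are inclusive, and +Inf always matches.
export function expectedBucketLabels(
  seconds: number,
  bounds: readonly number[] = HTTP_BUCKETS,
): string[] {
  const hit = bounds.filter((bound) => seconds <= bound).map(String);
  return [...hit, "+Inf"];
}
```

A test could observe a known duration, scrape /metrics, and compare the incremented http_request_duration_seconds_bucket series against this list.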

Labels

GrantFox OSS, Maybe Rewarded, Official Campaign, advanced, enhancement, typescript
