A production-grade learning system that simulates Kafka-like reliability, ACK/replay protocols, and real-time observability — all in one monolithic repo.
```
┌──────────────────────────────────────────────────────────────┐
│                    TRACEHUB ARCHITECTURE                     │
│                                                              │
│  ┌──────────────┐   batch+seq    ┌──────────────────┐        │
│  │ auth-svc     │───────────────▶│                  │        │
│  │ payment-svc  │  POST /ingest  │   Backend API    │        │
│  │ notif-svc    │◀───────────────│   (Express.js)   │        │
│  └──────────────┘  ACK/missing   └────────┬─────────┘        │
│  (Producer Sim)                           │                  │
│                                    ┌──────▼─────┐            │
│                                    │   Redis    │            │
│                                    │ queue:logs │            │
│                                    │ queue:retry│            │
│                                    │ producer:* │            │
│                                    └──────┬─────┘            │
│                                           │                  │
│                                    ┌──────▼─────┐            │
│                                    │ Log Worker │            │
│                                    │   (batch   │            │
│                                    │   insert)  │            │
│                                    └──────┬─────┘            │
│                                           │                  │
│                                    ┌──────▼─────┐            │
│                                    │ PostgreSQL │            │
│                                    │ logs table │            │
│                                    │ ack_state  │            │
│                                    │ replays    │            │
│                                    └────────────┘            │
│                                                              │
│  ┌──────────────┐    Socket.io   ┌──────────────────┐        │
│  │   Next.js    │◀───────────────│   Backend WS     │        │
│  │  Dashboard   │                │   (real-time)    │        │
│  └──────────────┘                └──────────────────┘        │
└──────────────────────────────────────────────────────────────┘
```
```bash
# Clone and start everything
git clone <repo>
cd tracehub
docker-compose up --build

# Services:
#   Dashboard:  http://localhost:3000
#   Backend:    http://localhost:3001
#   PostgreSQL: localhost:5432
#   Redis:      localhost:6379
```

```
tracehub/
├── apps/
│   ├── backend/                  # Node.js + Express + Socket.io
│   │   └── src/
│   │       ├── index.ts          # App entry + Socket.io setup
│   │       ├── routes/
│   │       │   ├── ingest.ts     # POST /ingest — receive log batches
│   │       │   ├── logs.ts       # GET /logs — query + replay API
│   │       │   └── metrics.ts    # GET /metrics + POST /control
│   │       ├── services/
│   │       │   ├── database.ts   # PostgreSQL pool + helpers
│   │       │   ├── redis.ts      # Redis client + queue ops
│   │       │   ├── ackService.ts # ACK/replay protocol
│   │       │   └── metricsService.ts
│   │       └── workers/
│   │           └── logWorker.ts  # Queue consumer + batch insert
│   │
│   ├── dashboard/                # Next.js 14 + Tailwind + Recharts
│   │   └── src/
│   │       ├── app/
│   │       │   ├── page.tsx      # Overview
│   │       │   ├── live-logs/    # Real-time log stream
│   │       │   ├── replay/       # Replay center
│   │       │   ├── queue/        # Queue monitor
│   │       │   └── services/     # Per-service metrics
│   │       ├── components/
│   │       │   ├── SocketProvider.tsx
│   │       │   ├── dashboard/
│   │       │   └── ui/
│   │       └── hooks/
│   │           └── useMetrics.ts # Socket.io hooks
│   │
│   └── producer-simulator/       # Fake EC2 services
│       └── src/index.ts          # auth + payment + notification producers
│
├── postgres/init.sql             # Schema: logs, ack_state, replay_requests
├── redis/redis.conf
├── shared/src/index.ts           # Shared TypeScript types
└── docker-compose.yml
```
Every log carries a monotonically increasing seq number per service:
```json
{
  "seq": 1001,
  "service": "payment-service",
  "level": "error",
  "message": "payment failed",
  "requestId": "req_abc123",
  "timestamp": "2024-01-15T10:30:00.000Z"
}
```

Producer → Backend flow:
- Producer generates logs and saves them in a Redis sorted set (`producer:{service}`) keyed by `seq`
- Producer sends the batch via `POST /ingest`
- Backend enqueues the logs into the Redis `queue:logs` list
- Backend computes the ACK response: `ackTill` = highest contiguous seq received; `missing` = gaps in the sequence window
- Producer removes ACKed logs from its buffer
- Missing seqs are queued in the `replay_requests` table
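The ACK computation in that flow ("highest contiguous seq received" plus gap detection) can be sketched as a pure function. This is an illustrative reconstruction, not the actual `ackService.ts` code; `computeAck` and its parameter names are assumptions:

```typescript
// Sketch: given the last ACKed seq and the seqs seen in the current
// window, advance ackTill through the contiguous prefix and report
// every gap below the highest seq observed.
interface AckResult {
  ackTill: number;   // highest contiguous seq received
  missing: number[]; // gaps between ackTill and the highest seen seq
  status: "ok" | "partial";
}

function computeAck(lastAcked: number, seenSeqs: number[]): AckResult {
  const seen = new Set(seenSeqs);
  const maxSeen = Math.max(lastAcked, ...seenSeqs);

  // Advance ackTill while the next expected seq is present.
  let ackTill = lastAcked;
  while (seen.has(ackTill + 1)) ackTill++;

  // Anything between ackTill and maxSeen that never arrived is missing.
  const missing: number[] = [];
  for (let s = ackTill + 1; s <= maxSeen; s++) {
    if (!seen.has(s)) missing.push(s);
  }
  return { ackTill, missing, status: missing.length ? "partial" : "ok" };
}

// With seqs 1001-1002 and 1004-1005 received but 1003 dropped:
// ackTill stops at 1002 and 1003 is reported as missing.
const ack = computeAck(1000, [1001, 1002, 1004, 1005]);
```

A single dropped seq therefore pins `ackTill` in place until the replay fills the gap, which is what forces the producer to keep un-ACKed logs buffered.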
ACK Response Format:
```json
{
  "batchId": "ack_1705312200000",
  "ackTill": 1099,
  "missing": [1088, 1092, 1095],
  "status": "partial"
}
```

Redis key layout:

```
queue:logs      LIST   → main processing queue (RPUSH / LPOP)
queue:retry     LIST   → failed/retry queue
producer:{svc}  ZSET   → producer buffer sorted by seq (for replay lookup)
ack:{svc}       STRING → current ACK state per service
```
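The list-based queues reduce to `RPUSH` on ingest and `LPOP` in the worker. A minimal in-memory stand-in (not the real `redis.ts`, which would go through an actual Redis client) shows the FIFO semantics the worker depends on:

```typescript
// In-memory stand-in for the three list operations the pipeline uses.
// Real code would issue these commands against Redis; this only
// demonstrates the tail-append / head-drain contract.
class FakeRedisList<T> {
  private items: T[] = [];

  rpush(...values: T[]): number {
    this.items.push(...values);         // append to tail, like RPUSH
    return this.items.length;
  }

  lpop(count = 1): T[] {
    return this.items.splice(0, count); // drain from head, like LPOP with COUNT
  }

  llen(): number {
    return this.items.length;           // queue depth, like LLEN
  }
}

// Ingest enqueues a batch; the worker later drains up to 50 entries.
const queueLogs = new FakeRedisList<string>();
queueLogs.rpush("auth-svc:1", "auth-svc:2", "payment-svc:7");
const batch = queueLogs.lpop(50);
```

Because producers push to the tail and the worker pops from the head, ordering within a single service's batch is preserved end to end.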
The LogWorker runs continuously:
- Pops up to 50 logs from `queue:logs`
- Deduplicates by `service:seq`
- Batch inserts into PostgreSQL
- Checks `replay_requests` for pending replays
- Fetches buffered logs from Redis for replay
- Marks replays completed

Deduplication is layered:
- In-memory sliding window Set of `service:seq` pairs
- PostgreSQL `ON CONFLICT DO NOTHING` on insert
- Maximum 10,000 entries tracked before eviction
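The in-memory layer of that dedup can be sketched as a bounded Set with oldest-first eviction. This is an illustrative sketch (`SeqDeduper` is a made-up name, not the worker's actual class); evicted entries are still caught by the `ON CONFLICT` backstop:

```typescript
// Sliding-window deduplicator: remembers recent service:seq keys and
// evicts the oldest once the cap is hit. JavaScript Sets iterate in
// insertion order, so the first value is always the oldest entry.
class SeqDeduper {
  private seen = new Set<string>();

  constructor(private maxEntries = 10_000) {}

  /** Returns true if the entry is new and should be inserted. */
  check(service: string, seq: number): boolean {
    const key = `${service}:${seq}`;
    if (this.seen.has(key)) return false; // duplicate within the window

    if (this.seen.size >= this.maxEntries) {
      const oldest = this.seen.values().next().value; // oldest key
      if (oldest !== undefined) this.seen.delete(oldest);
    }
    this.seen.add(key);
    return true;
  }
}

const dedupe = new SeqDeduper(10_000);
dedupe.check("payment-service", 1001); // first sighting → insert
dedupe.check("payment-service", 1001); // duplicate → dropped
```

The cap bounds memory at the cost of letting very old duplicates through, which is exactly why the database-level `ON CONFLICT DO NOTHING` is still required.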
| Control | Effect |
|---|---|
| Crash Worker | Stops log processing for 5s, auto-recovers |
| Network Failure | Drops all /ingest requests for 10s |
| Delay ACK | Adds 3s latency to ACK responses |
| Flush Queue | Drops all queued logs (demonstrates data loss) |
| Reset All | Clears all failure states |
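One plausible way to model these timed faults (hypothetical; not the actual `metricsService` code) is as per-fault expiry timestamps that hot paths check on every request, which gives the auto-recovery behavior for free:

```typescript
// Hypothetical fault-injection model: each control records an expiry
// timestamp; a fault is "active" until the clock passes it, so recovery
// needs no timers or cleanup.
type Fault = "crashWorker" | "networkFailure" | "delayAck";

class FaultInjector {
  private until = new Map<Fault, number>();

  trigger(fault: Fault, durationMs: number, now = Date.now()): void {
    this.until.set(fault, now + durationMs); // e.g. 5000 for Crash Worker
  }

  isActive(fault: Fault, now = Date.now()): boolean {
    return (this.until.get(fault) ?? 0) > now; // auto-recovers on expiry
  }

  reset(): void {
    this.until.clear(); // "Reset All": clear every failure state
  }
}

// An /ingest handler would consult this before accepting a batch:
const faults = new FaultInjector();
faults.trigger("networkFailure", 10_000);
const dropRequest = faults.isActive("networkFailure");
```

Passing `now` explicitly keeps the expiry logic deterministic and testable without real sleeps.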
| Page | Path | Features |
|---|---|---|
| Overview | `/` | System metrics, EPS chart, service status, fault controls |
| Live Logs | `/live-logs` | Real-time log stream, filters, search |
| Replay Center | `/replay` | Replay requests, manual trigger, live events |
| Queue Monitor | `/queue` | Queue depth charts, worker status, sim events |
| Service Metrics | `/services` | Per-service EPS, errors, ACK state, charts |
| Endpoint | Description |
|---|---|
| `POST /ingest` | Receive log batch, return ACK |
| `GET /logs` | Query logs (filters: `service`, `level`, `search`, `from`, `to`) |
| `GET /logs/replay` | List replay requests |
| `POST /logs/replay` | Trigger manual replay |
| `GET /metrics` | System metrics snapshot |
| `GET /metrics/queue` | Queue depth + worker stats |
| `GET /metrics/replay` | Replay stats by status |
| `POST /metrics/control` | Fault injection controls |
| `GET /health` | Health check |
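The `GET /logs` filters can be pictured as a predicate chain over rows. This in-memory sketch assumes the filters combine with AND (the real route would translate them into a SQL `WHERE` clause); the type and function names are illustrative:

```typescript
// Illustrative in-memory version of the GET /logs filters.
interface LogRow {
  service: string;
  level: string;
  message: string;
  timestamp: string; // ISO 8601, so string comparison orders correctly
}

interface LogQuery {
  service?: string;
  level?: string;
  search?: string; // substring match on message
  from?: string;   // inclusive ISO timestamp bounds
  to?: string;
}

function filterLogs(rows: LogRow[], q: LogQuery): LogRow[] {
  return rows.filter((r) =>
    (!q.service || r.service === q.service) &&
    (!q.level || r.level === q.level) &&
    (!q.search || r.message.includes(q.search)) &&
    (!q.from || r.timestamp >= q.from) &&
    (!q.to || r.timestamp <= q.to)
  );
}
```

Omitted filters fall through as always-true, so an empty query returns everything, mirroring how optional query-string parameters typically behave.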
| Event | Direction | Payload |
|---|---|---|
| `metrics:snapshot` | Server → Client | Full `SystemMetrics` every 2s |
| `log:new` | Server → Client | Individual `LogEntry` |
| `ack:sent` | Server → Client | ACK details |
| `replay:completed` | Server → Client | Replay result |
| `worker:crashed` | Server → Client | Crash notification |
| `worker:recovered` | Server → Client | Recovery notification |
| `sim:control` | Server → Client | Fault injection event |
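This event contract can be captured as a typed event map on the client. The payload shapes below are illustrative stand-ins (the real types live in `shared/src/index.ts`), and `TypedEmitter` is a minimal dispatcher with the same `on`/`emit` shape Socket.io exposes, not the library itself:

```typescript
// Typed map of the server→client events above. Giving each event a
// handler signature lets the compiler catch payload mismatches.
interface ServerToClientEvents {
  "metrics:snapshot": (m: { eps: number; queueDepth: number }) => void;
  "log:new": (log: { service: string; seq: number; message: string }) => void;
  "ack:sent": (ack: { ackTill: number; missing: number[] }) => void;
  "replay:completed": (r: { service: string; replayed: number }) => void;
  "worker:crashed": () => void;
  "worker:recovered": () => void;
  "sim:control": (c: { fault: string; active: boolean }) => void;
}

// Minimal dispatcher demonstrating how a SocketProvider-style component
// could fan events out to hooks like useMetrics.
class TypedEmitter {
  private handlers = new Map<keyof ServerToClientEvents, Function[]>();

  on<E extends keyof ServerToClientEvents>(event: E, fn: ServerToClientEvents[E]): void {
    const list = this.handlers.get(event) ?? [];
    list.push(fn);
    this.handlers.set(event, list);
  }

  emit<E extends keyof ServerToClientEvents>(
    event: E,
    ...args: Parameters<ServerToClientEvents[E]>
  ): void {
    for (const fn of this.handlers.get(event) ?? []) fn(...args);
  }
}
```

Socket.io itself accepts such an event-map interface as a type parameter, so the same `ServerToClientEvents` shape can type both the real socket and tests.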
This monolith is structured to extract into:
```
tracehub-ingest-service   → /ingest route
tracehub-worker-service   → logWorker.ts
tracehub-query-service    → /logs route
tracehub-metrics-service  → metricsService.ts
tracehub-replay-service   → ackService.ts + replay worker
```
Each shares the same PostgreSQL schema and Redis keys, making migration additive rather than disruptive.
```bash
# Start dependencies
docker-compose up postgres redis -d

# Backend
cd apps/backend
npm install
npm run dev

# Producer
cd apps/producer-simulator
npm install
npm run dev

# Dashboard
cd apps/dashboard
npm install
npm run dev
```