Architecture Overview

Architecture Deep Dive

Complete architectural documentation for the DevStack Core infrastructure project.


Table of Contents

  1. Overview
  2. Architecture Philosophy
  3. System Components
  4. Network Architecture
  5. Security Architecture
  6. Data Flow
  7. Observability Architecture
  8. Service Dependencies
  9. Deployment Architecture
  10. Scaling Considerations
  11. Architectural Patterns
  12. Future Architecture Considerations

Overview

DevStack Core is a container-native, infrastructure-as-code project providing a complete local development environment optimized for Apple Silicon Macs.

Core Characteristics

  • Container Runtime: Colima (Lima + containerd/Docker)
  • Orchestration: Docker Compose
  • Service Count: 23 containerized services
  • Network Model: Bridge network with static IP assignments
  • Security Model: Vault-managed credentials with optional TLS
  • Target Environment: Local development (NOT production)

Design Goals

  1. Completeness - All services needed for modern development
  2. Security - Vault-managed secrets, TLS support
  3. Observability - Full metrics, logs, and visualization stack
  4. Educational - Multiple reference implementations demonstrating patterns
  5. Reproducibility - Infrastructure as code, Docker Compose

Architecture Philosophy

Infrastructure-First Approach

Services are defined declaratively in docker-compose.yml with:

  • Explicit dependencies
  • Health checks for all services
  • Static IP assignments for predictability
  • Volume persistence for stateful services
  • Environment-based configuration

Security by Design

  • No hardcoded credentials - All passwords in Vault
  • AppRole authentication - Core services (7/16) use least-privilege AppRole auth
  • TLS optional but supported - PKI infrastructure via Vault
  • Secrets at runtime - Services fetch credentials on startup
  • Network isolation - 4-tier network segmentation (vault/data/app/observability)
  • Secret scanning - Pre-commit hooks and CI/CD

AppRole Adoption Status (as of November 2025):

  • ✅ Core Data Tier (7 services): PostgreSQL, MySQL, MongoDB, Redis (x3), RabbitMQ, Forgejo, FastAPI
  • ⚠️ Infrastructure (9 services): PgBouncer, additional reference apps, exporters, Vector - still use the root VAULT_TOKEN
  • 🎯 Target: 95%+ adoption via Phase 4 migration
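
To verify adoption locally, you can list the registered AppRoles directly. A minimal sketch, assuming the Vault CLI is installed and pointed at the dev instance (VAULT_ADDR plus a token permitted to list roles); the role name in the second command is assumed for illustration:

# List the AppRoles registered on the approle auth mount
vault list auth/approle/role

# Inspect one role's policies and token TTL (role name assumed)
vault read auth/approle/role/postgres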

Container-Native

  • All services run in containers
  • No host dependencies (except Colima/Docker)
  • Portable across developers
  • Easy to reset/rebuild

System Components

Component Hierarchy

graph TB
    subgraph Colima["Colima VM (Lima)"]
        Docker["Docker Daemon"]

        subgraph Network["Docker Compose Network<br/>(dev-services: 172.20.0.0/16)"]
            Vault["Secrets Management<br/>- Vault (PKI + KV)"]

            subgraph DataServices["Data Services"]
                PostgreSQL["PostgreSQL"]
                MySQL["MySQL"]
                MongoDB["MongoDB"]
                RedisCluster["Redis Cluster (3 nodes)"]
                RabbitMQ["RabbitMQ"]
            end

            subgraph AppServices["Application Services"]
                Forgejo["Forgejo (Git)"]
                PgBouncer["PgBouncer"]
                APIs["5 Reference APIs"]
            end

            subgraph Observability["Observability Stack"]
                Prometheus["Prometheus (metrics)"]
                Grafana["Grafana (visualization)"]
                Loki["Loki (logs)"]
                Vector["Vector (pipeline)"]
                Promtail["Promtail (log collector)"]
                cAdvisor["cAdvisor (container)"]
                Exporters["Exporters (3x Redis)"]
            end
        end
    end

    Docker --> Network
    Vault -->|all services depend| DataServices
    DataServices --> AppServices
    AppServices --> Observability

Service Catalog

| Service | Type | Port(s) | Purpose |
|---------|------|---------|---------|
| vault | Secrets | 8200 | Secrets management & PKI |
| postgres | Database | 5432 | PostgreSQL (Forgejo backend) |
| pgbouncer | Proxy | 6432 | PostgreSQL connection pooler |
| mysql | Database | 3306 | MySQL database |
| mongodb | Database | 27017 | MongoDB NoSQL |
| redis-1/2/3 | Cache | 6379+ | Redis cluster (3 masters) |
| rabbitmq | Queue | 5672, 15672 | Message queue + mgmt UI |
| forgejo | Git | 3000, 2222 | Git server |
| reference-api | App | 8000, 8443 | FastAPI code-first |
| api-first | App | 8001, 8444 | FastAPI API-first |
| golang-api | App | 8002, 8445 | Go reference |
| nodejs-api | App | 8003, 8446 | Node.js/Express reference |
| rust-api | App | 8004, 8447 | Rust/Actix-web reference |
| prometheus | Metrics | 9090 | Metrics collection |
| grafana | Viz | 3001 | Dashboard & visualization |
| loki | Logs | 3100 | Log aggregation |
| promtail | Logs | - | Log shipping (internal) |
| vector | Pipeline | 8686 | Unified observability |
| cadvisor | Metrics | 8080 | Container metrics |
| redis-exporter-1/2/3 | Metrics | 9121+ | Redis metrics (per node) |
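
A quick way to confirm the catalog matches what is actually running; a sketch assuming a recent Docker Compose v2 that supports the --format flag:

# Show each service's health status and published ports
docker compose ps --format "table {{.Name}}\t{{.Status}}\t{{.Ports}}"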

Reference API Implementations

This project includes 5 reference API implementations demonstrating identical functionality across different technology stacks. Each implementation showcases best practices for integrating with the infrastructure services.

Purpose and Philosophy

Why 5 Implementations?

  1. Educational - Demonstrate patterns across different languages and frameworks
  2. Comparison - Enable performance and architecture comparisons
  3. Best Practices - Show idiomatic approaches in each ecosystem
  4. Pattern Library - Reference implementations for common integration patterns

Shared Functionality:

  • HashiCorp Vault integration for secrets management
  • PostgreSQL, MySQL, and MongoDB database connections
  • Redis cluster integration with connection pooling
  • RabbitMQ message queue integration
  • Comprehensive health checks for all services
  • Structured logging with security best practices
  • Prometheus metrics exposition
  • Optional TLS/SSL support
  • RESTful API design

Implementation Details

1. Python FastAPI (Code-First) - Port 8000/8443

Location: reference-apps/fastapi/

Technology Stack:

  • Framework: FastAPI (async/await, Pydantic validation)
  • Language: Python 3.11+
  • Approach: Code-first (define routes in code, generate OpenAPI)
  • Key Libraries: asyncpg, motor (MongoDB), redis-py, aio-pika

Characteristics:

  • Fully asynchronous architecture
  • Type hints and Pydantic models for validation
  • Auto-generated OpenAPI/Swagger documentation
  • Comprehensive endpoint coverage (all services)
  • Production-ready logging and error handling

Use Cases:

  • Rapid prototyping and development
  • ML/AI integration scenarios
  • Data-heavy applications
  • Teams familiar with Python ecosystem

2. Python FastAPI (API-First) - Port 8001/8444

Location: reference-apps/fastapi-api-first/

Technology Stack:

  • Framework: FastAPI (async/await, Pydantic validation)
  • Language: Python 3.11+
  • Approach: API-first (OpenAPI spec → code generation)
  • Key Libraries: asyncpg, motor, redis-py, aio-pika

Characteristics:

  • OpenAPI specification drives implementation
  • Contract-first design methodology
  • Identical runtime behavior to code-first
  • Demonstrates spec-driven development workflow
  • Scaffolded structure for code generation

Use Cases:

  • Contract-first API development
  • Multi-team coordination (API contracts)
  • Client SDK generation scenarios
  • Governance and compliance requirements

3. Go (Gin Framework) - Port 8002/8445

Location: reference-apps/golang/

Technology Stack:

  • Framework: Gin (HTTP router and middleware)
  • Language: Go 1.23+
  • Approach: Code-first with strong typing
  • Key Libraries: pgx (PostgreSQL), mongo-go-driver, go-redis, amqp091-go

Characteristics:

  • Compiled binary for fast startup
  • Strong static typing and compile-time checks
  • Excellent concurrency with goroutines
  • Low memory footprint
  • Structured logging with logrus

Use Cases:

  • High-performance requirements
  • Microservices architectures
  • Cloud-native deployments
  • Systems programming background teams

4. Node.js (Express) - Port 8003/8446

Location: reference-apps/nodejs/

Technology Stack:

  • Framework: Express (minimalist web framework)
  • Language: Node.js (JavaScript/TypeScript)
  • Approach: Code-first with async/await
  • Key Libraries: pg, mongodb, ioredis, amqplib

Characteristics:

  • Event-driven, non-blocking I/O
  • Large ecosystem (npm)
  • Async/await for clean asynchronous code
  • Full infrastructure integration
  • JSON-native processing

Use Cases:

  • JavaScript/TypeScript-centric teams
  • Real-time applications (WebSockets)
  • Rapid iteration and prototyping
  • Microservices with npm ecosystem

5. Rust (Actix-web) - Port 8004/8447

Location: reference-apps/rust/

Technology Stack:

  • Framework: Actix-web (async actor framework)
  • Language: Rust (memory-safe systems language)
  • Approach: Partial implementation (~40% complete) with comprehensive testing
  • Key Libraries: tokio, serde, reqwest, actix-cors

Characteristics:

  • Zero-cost abstractions and memory safety
  • Exceptional performance and low latency
  • Compile-time guarantees (memory and thread safety)
  • Comprehensive test coverage (5 unit + 11 integration tests)
  • High-performance async runtime (Tokio)
  • Production-ready patterns (CORS, logging, environment config)

Use Cases:

  • Ultra-high-performance requirements
  • Safety-critical applications
  • Resource-constrained environments
  • Teams prioritizing performance and safety

API Parity and Testing

Parity Tests: tests/api-parity-tests.sh

  • Validates identical behavior across implementations
  • Tests all common endpoints
  • Ensures consistent responses and error handling

Performance Benchmarks: tests/performance-benchmark.sh

  • Compares throughput and latency
  • Measures resource utilization
  • Identifies performance characteristics per stack
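
For an informal spot check outside the benchmark script, a simple loop can compare response times across the five implementations; a sketch, assuming each API exposes a /health endpoint on its catalog port (endpoint path assumed):

# Rough per-stack latency probe (not a substitute for performance-benchmark.sh)
for port in 8000 8001 8002 8003 8004; do
  t=$(curl -s -o /dev/null -w "%{time_total}" "http://localhost:${port}/health")
  echo "port ${port}: ${t}s"
done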

Comparison Matrix

| Feature | FastAPI (Code) | FastAPI (API) | Go/Gin | Node.js | Rust |
|---------|----------------|---------------|--------|---------|------|
| Startup Time | Medium (~2s) | Medium (~2s) | Fast (<1s) | Fast (~1s) | Fast (<1s) |
| Memory Footprint | Medium (~80MB) | Medium (~80MB) | Low (~20MB) | Medium (~60MB) | Very Low (~10MB) |
| Development Speed | Fast | Medium | Medium | Fast | Slow |
| Type Safety | Runtime | Runtime | Compile-time | Runtime* | Compile-time |
| Concurrency Model | async/await | async/await | Goroutines | Event loop | async/await |
| Ecosystem Size | Large (PyPI) | Large (PyPI) | Medium | Very Large (npm) | Growing |
| Learning Curve | Low | Medium | Medium | Low | High |
| Production Maturity | High | High | Very High | Very High | High |

*Runtime by default; TypeScript adds compile-time type checking.


Network Architecture

Network Topology

DevStack Core uses 4-tier network segmentation for security isolation and logical service grouping:

4-Tier Network Segmentation:

graph TB
    subgraph VaultNet["vault-network (172.20.1.0/24)"]
        Vault1[Vault<br/>172.20.1.10]
        PG_Auth[PostgreSQL AppRole]
        MySQL_Auth[MySQL AppRole]
        Redis_Auth[Redis AppRole]
    end

    subgraph DataNet["data-network (172.20.2.0/24)"]
        PG[PostgreSQL<br/>172.20.2.10]
        PGBOUNCER[PgBouncer<br/>172.20.2.11]
        MySQL[MySQL<br/>172.20.2.12]
        Redis1[Redis-1<br/>172.20.2.13]
        Redis2[Redis-2<br/>172.20.2.16]
        Redis3[Redis-3<br/>172.20.2.17]
        RabbitMQ[RabbitMQ<br/>172.20.2.14]
        MongoDB[MongoDB<br/>172.20.2.15]
    end

    subgraph AppNet["app-network (172.20.3.0/24)"]
        Forgejo[Forgejo<br/>172.20.3.10]
        RefAPI[Reference APIs<br/>172.20.3.20-24]
    end

    subgraph ObsNet["observability-network (172.20.4.0/24)"]
        Prometheus[Prometheus<br/>172.20.4.10]
        Grafana[Grafana<br/>172.20.4.11]
        Loki[Loki<br/>172.20.4.12]
        Vector[Vector<br/>172.20.4.13]
    end

    Vault1 -.->|AppRole Auth| PG
    Vault1 -.->|AppRole Auth| MySQL
    Vault1 -.->|AppRole Auth| Redis1
    Vault1 -.->|AppRole Auth| Redis2
    Vault1 -.->|AppRole Auth| Redis3
    AppNet -.->|Query| DataNet
    Forgejo -->|Metadata| PG
    RefAPI -.->|Connect| DataNet
    ObsNet -.->|Scrape Metrics| DataNet
    ObsNet -.->|Scrape Metrics| AppNet

    style VaultNet fill:#ffa726,stroke:#f57c00,stroke-width:3px
    style DataNet fill:#66bb6a,stroke:#388e3c,stroke-width:3px
    style AppNet fill:#42a5f5,stroke:#1976d2,stroke-width:3px
    style ObsNet fill:#ab47bc,stroke:#7b1fa2,stroke-width:3px

Network Isolation:

  • vault-network (172.20.1.0/24): Isolated for secrets management and AppRole authentication
  • data-network (172.20.2.0/24): Database, cache, and message queue services
  • app-network (172.20.3.0/24): Application services (Forgejo, reference APIs)
  • observability-network (172.20.4.0/24): Monitoring and logging infrastructure

Static IP Assignments

Vault Network (172.20.1.0/24):
  172.20.1.10 - vault

Data Network (172.20.2.0/24):
  172.20.2.10 - postgres
  172.20.2.11 - pgbouncer
  172.20.2.12 - mysql
  172.20.2.13 - redis-1
  172.20.2.14 - rabbitmq
  172.20.2.15 - mongodb
  172.20.2.16 - redis-2
  172.20.2.17 - redis-3

Application Network (172.20.3.0/24):
  172.20.3.10 - forgejo
  172.20.3.20 - reference-api (FastAPI code-first)
  172.20.3.21 - api-first (FastAPI API-first)
  172.20.3.22 - golang-api (Go reference)
  172.20.3.23 - nodejs-api (Node.js/Express reference)
  172.20.3.24 - rust-api (Rust/Actix-web reference)

Observability Network (172.20.4.0/24):
  172.20.4.10 - prometheus
  172.20.4.11 - grafana
  172.20.4.12 - loki
  172.20.4.13 - vector
  172.20.4.14 - promtail
  172.20.4.15 - cadvisor
  172.20.4.16 - redis-exporter-1
  172.20.4.17 - redis-exporter-2
  172.20.4.18 - redis-exporter-3

Port Exposure Strategy

Exposed to Host:

  • Web UIs: Grafana (3001), RabbitMQ (15672), Prometheus (9090), Loki (3100)
  • Databases: PostgreSQL (5432), MySQL (3306), MongoDB (27017), Redis (6379+)
  • Applications: APIs on 8000-8004 (HTTP) and 8443-8447 (HTTPS)
  • Git: Forgejo HTTP (3000), SSH (2222)
  • Vault: 8200

Internal Only:

  • Container metrics (cAdvisor)
  • Log shipping (Promtail)
  • Exporters (internal scraping)

DNS Resolution

Services resolve each other by service name across networks:

  • vault resolves to 172.20.1.10 (vault-network)
  • postgres resolves to 172.20.2.10 (data-network)
  • forgejo resolves to 172.20.3.10 (app-network)
  • prometheus resolves to 172.20.4.10 (observability-network)

Docker's embedded DNS handles resolution across all networks. Services connected to multiple networks can reach services on any of their connected networks.
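
A sketch of how to observe this from inside a running container (the container name follows the dev-* convention used later on this page; getent availability depends on the image):

# Resolve service names from a container attached to multiple networks
docker exec dev-reference-api getent hosts vault postgres
# Expected, per the static assignments above:
# 172.20.1.10  vault
# 172.20.2.10  postgres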


Security Architecture

Secrets Management Flow

graph TD
    Init["Vault Init<br/>(One-time: creates unseal keys & root token)"]
    Unseal["Vault Unseal<br/>(Auto: runs on container start)"]
    Bootstrap["Vault Bootstrap<br/>(Required: populates credentials)"]
    KV["Enable KV engine (secret/)"]
    PKI["Setup PKI (Root + Intermediate CA)"]
    Roles["Create certificate roles (9 services)"]
    Passwords["Generate & store passwords"]
    Policies["Create Vault policies"]
    Export["Export CA certificates"]
    Services["Services fetch credentials on startup:<br/>service → Vault API → secret/{service-name} → credentials"]

    Init --> Unseal
    Unseal --> Bootstrap
    Bootstrap --> KV
    Bootstrap --> PKI
    Bootstrap --> Roles
    Bootstrap --> Passwords
    Bootstrap --> Policies
    Bootstrap --> Export
    Export --> Services
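
The final step of the flow is an ordinary KV v2 read. A minimal sketch using curl, assuming VAULT_ADDR and a valid token are set in the environment:

# Fetch a service's credentials from the KV engine (path per the flow above)
curl -s -H "X-Vault-Token: ${VAULT_TOKEN}" \
     "${VAULT_ADDR:-http://localhost:8200}/v1/secret/data/postgresql" | jq '.data.data'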

PKI Architecture

Two-Tier Certificate Authority:

graph TD
    RootCA["Root CA (pki/)<br/>- TTL: 10 years (87600h)<br/>- Key: RSA 2048"]
    IntermediateCA["Intermediate CA (pki_int/)<br/>- TTL: 5 years (43800h)<br/>- Key: RSA 2048"]
    ServiceCerts["Service Certificates<br/>- TTL: 1 year (8760h)<br/>- Roles: postgres-role, mysql-role, redis-1-role, etc.<br/>- SANs: service name, IP address, localhost"]

    RootCA -->|Signs| IntermediateCA
    IntermediateCA -->|Issues| ServiceCerts

Certificate Issuance Flow:

  1. Service requests cert from Vault PKI
  2. Vault validates request against role
  3. Intermediate CA signs certificate
  4. Service receives cert + private key
  5. Service configures TLS with cert
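
Steps 1-4 map to a single Vault PKI call. A sketch using the Vault CLI, with the role name, SANs, and TTL taken from the role description above:

# Issue a service certificate from the intermediate CA
vault write pki_int/issue/postgres-role \
    common_name="postgres" \
    alt_names="localhost" \
    ip_sans="172.20.2.10,127.0.0.1" \
    ttl="8760h"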

TLS Configuration

Optional TLS (Development Mode):

  • Controlled by tls_enabled flag in Vault
  • Default: true for all services
  • Services check flag on startup
  • If enabled: configure TLS
  • If disabled: plain connections
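
How a startup script might branch on the flag; a sketch assuming the KV layout shown under Credential Storage below:

# Read tls_enabled for a service and branch accordingly
TLS=$(curl -s -H "X-Vault-Token: ${VAULT_TOKEN}" \
      "${VAULT_ADDR}/v1/secret/data/postgresql" | jq -r '.data.data.tls_enabled')
if [ "$TLS" = "true" ]; then
  echo "configuring TLS listeners"
else
  echo "starting with plain connections"
fi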

TLS Endpoints:

  • PostgreSQL: Port 5432 (TLS)
  • MySQL: Port 3306 (TLS)
  • MongoDB: Port 27017 (preferTLS)
  • Redis: Ports 6390-6392 (TLS on separate ports)
  • RabbitMQ: Port 5671 (TLS)
  • APIs: Ports 8443-8447 (HTTPS)

Credential Storage

In Vault (secret/ KV engine):

secret/postgresql
  ├─ username: dev_admin
  ├─ password: <25-char random>
  ├─ database: dev_database
  └─ tls_enabled: true

secret/mysql
  ├─ root_password: <25-char random>
  ├─ username: dev_user
  ├─ password: <25-char random>
  ├─ database: dev_database
  └─ tls_enabled: true

secret/redis-1, redis-2, redis-3
  ├─ password: <shared 25-char random>
  └─ tls_enabled: true

secret/rabbitmq
  ├─ username: dev_user
  ├─ password: <25-char random>
  ├─ vhost: dev_vhost
  └─ tls_enabled: true

secret/mongodb
  ├─ username: dev_user
  ├─ password: <25-char random>
  ├─ database: dev_database
  └─ tls_enabled: true

Network Security

  • No host network mode - All services use bridge
  • Static IPs - Predictable, no dynamic assignment
  • Internal-only services - Many services not exposed to host
  • Firewall-ready - Port exposure controlled via Docker

Data Flow

Service Startup Flow

graph TD
    ColimaStart["1. Colima VM starts"]
    Docker["Docker daemon initializes"]

    VaultStart["Step 1: Vault Container starts"]
    VaultUnseal["Auto-unseal script runs"]
    VaultAPI["Vault API becomes available"]
    VaultHealthy["Health check: healthy"]

    DataStart["Step 2: Data Services (parallel)"]
    PG["PostgreSQL:<br/>- Waits for Vault<br/>- Fetches credentials<br/>- Initializes database<br/>- Configures TLS<br/>- Health check: healthy"]
    MySQL_DB["MySQL (same pattern)"]
    Mongo["MongoDB (same pattern)"]
    Redis["Redis-1/2/3 (same pattern)"]
    Rabbit["RabbitMQ (same pattern)"]

    AppStart["Step 3: Application Services"]
    Forgejo_App["Forgejo (depends on PostgreSQL)"]
    PgBouncer_App["PgBouncer (depends on PostgreSQL)"]
    APIs_App["Reference APIs (depend on all data services)"]

    ObsStart["Step 4: Observability (parallel)"]
    Prom["Prometheus (scrapes metrics)"]
    Graf["Grafana (visualizes from Prometheus)"]
    Loki_Obs["Loki (receives logs)"]
    Vector_Obs["Vector (collects & forwards)"]
    Promtail_Obs["Promtail (ships logs to Loki)"]
    cAdvisor_Obs["cAdvisor (collects container metrics)"]
    Exporters_Obs["Redis Exporters (expose Redis metrics)"]

    ColimaStart --> Docker
    Docker --> VaultStart
    VaultStart --> VaultUnseal
    VaultUnseal --> VaultAPI
    VaultAPI --> VaultHealthy

    VaultHealthy --> DataStart
    DataStart --> PG
    DataStart --> MySQL_DB
    DataStart --> Mongo
    DataStart --> Redis
    DataStart --> Rabbit

    PG --> AppStart
    MySQL_DB --> AppStart
    Mongo --> AppStart
    Redis --> AppStart
    Rabbit --> AppStart

    AppStart --> Forgejo_App
    AppStart --> PgBouncer_App
    AppStart --> APIs_App

    APIs_App --> ObsStart
    ObsStart --> Prom
    ObsStart --> Graf
    ObsStart --> Loki_Obs
    ObsStart --> Vector_Obs
    ObsStart --> Promtail_Obs
    ObsStart --> cAdvisor_Obs
    ObsStart --> Exporters_Obs

Request Flow (FastAPI Example)

graph LR
    Client["Client"]
    FastAPI["FastAPI API (port 8000)"]

    subgraph HealthCheck["Health Check Request"]
        CheckVault["Check Vault connectivity"]
        CheckPG["Check PostgreSQL connectivity"]
        CheckMySQL["Check MySQL connectivity"]
        CheckMongo["Check MongoDB connectivity"]
        CheckRedis["Check Redis cluster status"]
        CheckRabbit["Check RabbitMQ connectivity"]
        ReturnHealth["Return aggregated health status"]
    end

    subgraph DBQuery["Database Query Request"]
        FetchCredsDB["Fetch credentials from Vault (cached)"]
        ConnectDB["Connect to database (connection pool)"]
        ExecQuery["Execute query over TLS"]
        ReturnResults["Return results"]
        RecordMetricsDB["Record metrics (Prometheus)"]
    end

    subgraph CacheOp["Cache Operation Request"]
        FetchCredsCache["Fetch Redis credentials from Vault (cached)"]
        ConnectCache["Connect to Redis cluster"]
        ExecCmd["Execute command (redirected to correct node)"]
        ReturnCache["Return result"]
        RecordMetricsCache["Record metrics"]
    end

    Client --> FastAPI
    FastAPI -.-> CheckVault
    CheckVault --> CheckPG
    CheckPG --> CheckMySQL
    CheckMySQL --> CheckMongo
    CheckMongo --> CheckRedis
    CheckRedis --> CheckRabbit
    CheckRabbit --> ReturnHealth

    FastAPI -.-> FetchCredsDB
    FetchCredsDB --> ConnectDB
    ConnectDB --> ExecQuery
    ExecQuery --> ReturnResults
    ReturnResults --> RecordMetricsDB

    FastAPI -.-> FetchCredsCache
    FetchCredsCache --> ConnectCache
    ConnectCache --> ExecCmd
    ExecCmd --> ReturnCache
    ReturnCache --> RecordMetricsCache

Metrics Collection Flow

graph TD
    Services["Services expose metrics (Prometheus format)"]
    FastAPI_Metrics["FastAPI: /metrics"]
    Redis_Metrics["Redis Exporters: :9121/metrics (per node)"]
    cAdvisor_Metrics["cAdvisor: :8080/metrics"]
    App_Metrics["Application custom metrics"]

    Prometheus["Prometheus scrapes every 15s"]
    Store["Stores time-series data"]
    Query["Makes available for querying"]

    Grafana["Grafana queries Prometheus"]
    PromQL["Dashboard panels execute PromQL"]
    Visualize["Visualize metrics over time"]
    Present["Present to user (port 3001)"]

    Services --> FastAPI_Metrics
    Services --> Redis_Metrics
    Services --> cAdvisor_Metrics
    Services --> App_Metrics

    FastAPI_Metrics --> Prometheus
    Redis_Metrics --> Prometheus
    cAdvisor_Metrics --> Prometheus
    App_Metrics --> Prometheus

    Prometheus --> Store
    Prometheus --> Query

    Query --> Grafana
    Grafana --> PromQL
    PromQL --> Visualize
    Visualize --> Present

Log Collection Flow

graph TD
    Stdout["Container stdout/stderr"]
    DockerLog["Docker logging driver"]
    Promtail["Promtail (reads Docker logs)"]
    Parse["Parses log format"]
    Labels["Adds labels (container, service)"]
    Ship["Ships to Loki"]

    Loki["Loki aggregates logs"]
    Index["Indexes by labels (not content)"]
    StoreLogs["Stores log data"]
    QueryLogs["Makes available for querying"]

    GrafanaLogs["Grafana queries Loki"]
    LogQL["LogQL queries"]
    Filter["Filter by service, time, etc."]
    Display["Display logs in Explore view"]

    Stdout --> DockerLog
    DockerLog --> Promtail
    Promtail --> Parse
    Parse --> Labels
    Labels --> Ship

    Ship --> Loki
    Loki --> Index
    Loki --> StoreLogs
    Loki --> QueryLogs

    QueryLogs --> GrafanaLogs
    GrafanaLogs --> LogQL
    LogQL --> Filter
    Filter --> Display

Observability Architecture

Three Pillars

  1. Metrics (Prometheus + Grafana)
  2. Logs (Loki + Promtail + Grafana)
  3. Traces (Future: OpenTelemetry)

Metrics Pipeline

graph TD
    Prometheus["Prometheus (collector)"]
    FastAPI_Met["FastAPI /metrics"]
    cAdvisor_Met["cAdvisor metrics"]
    Redis_Met["Redis Exporter"]

    FastAPI_Met -->|scrapes every 15s| Prometheus
    cAdvisor_Met -->|scrapes every 15s| Prometheus
    Redis_Met -->|scrapes every 15s| Prometheus

Metric Types:

  • Counters: Request counts, error counts
  • Gauges: Active connections, memory usage
  • Histograms: Request durations, response sizes
  • Summaries: Percentiles (p50, p95, p99)
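
These types can be queried directly through Prometheus's HTTP API; a sketch, with the histogram metric name assumed for illustration:

# p95 request latency over the last 5 minutes (metric name assumed)
curl -sG "http://localhost:9090/api/v1/query" \
     --data-urlencode 'query=histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))'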

Grafana Dashboards

Pre-configured dashboards in configs/grafana/dashboards/:

  1. redis-cluster-dashboard.json - Redis cluster health
  2. postgres-dashboard.json - PostgreSQL metrics
  3. mysql-dashboard.json - MySQL metrics
  4. mongodb-dashboard.json - MongoDB metrics
  5. application-metrics.json - API metrics
  6. infrastructure-overview.json - Overall health

Log Aggregation Strategy

Structured Logging:

  • JSON format for all application logs
  • Consistent fields: timestamp, level, message, service, request_id
  • Easy to parse and query

Label Strategy:

{service="fastapi", container="dev-reference-api", level="error"}
{service="postgres", container="dev-postgres"}
{service="redis-1", container="dev-redis-1"}

Retention:

  • Development: 7 days (configurable)
  • Logs stored in Docker volumes

Service Dependencies

Dependency Graph

graph TD
    Vault["Vault (no dependencies)"]

    Vault --> Postgres["postgres"]
    Vault --> MySQL["mysql"]
    Vault --> MongoDB["mongodb"]
    Vault --> Redis["redis-1/2/3"]
    Vault --> RabbitMQ["rabbitmq"]

    Postgres --> Forgejo["forgejo"]
    Postgres --> PgBouncer["pgbouncer"]

    Redis --> RedisExporter["redis-exporter-1/2/3"]

    Postgres --> RefAPIs["5 Reference APIs<br/>(FastAPI x2, Go, Node.js, Rust)<br/>(depend on all data services)"]
    MySQL --> RefAPIs
    MongoDB --> RefAPIs
    Redis --> RefAPIs
    RabbitMQ --> RefAPIs

Health Check Cascade

Each service has a health check that validates:

  1. Process is running
  2. Port is listening
  3. Service-specific checks (e.g., DB can execute queries)

Docker Compose starts a dependent service only after its dependencies report healthy (depends_on with condition: service_healthy).
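
The same health state Compose waits on can be read directly; a sketch using the dev-* container naming shown in the label examples earlier:

# Inspect a dependency's current health status
docker inspect --format '{{.State.Health.Status}}' dev-postgres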

AppRole Authentication Flow

Services using AppRole follow this authentication sequence:

sequenceDiagram
    participant Container as Service Container
    participant Script as init-approle.sh
    participant FS as Filesystem
    participant Vault as Vault API
    participant Service as Service Process

    Note over Container: Container starts with init-approle.sh entrypoint

    Container->>Script: Execute wrapper script
    Script->>Vault: Wait for Vault health check
    Vault-->>Script: Vault healthy (200 OK)

    Script->>FS: Read /vault-approles/{service}/role-id
    FS-->>Script: role-id (e.g., abc123...)

    Script->>FS: Read /vault-approles/{service}/secret-id
    FS-->>Script: secret-id (e.g., xyz789...)

    Script->>Vault: POST /v1/auth/approle/login
    Note over Script,Vault: {"role_id": "abc123...", "secret_id": "xyz789..."}
    Vault-->>Script: Service token (hvs.CAESIE..., 1h TTL)

    Script->>Vault: GET /v1/secret/data/{service}
    Note over Script,Vault: X-Vault-Token: hvs.CAESIE...
    Vault-->>Script: Credentials (user, password, database)

    Script->>Script: Export environment variables
    Note over Script: POSTGRES_USER=devuser<br/>POSTGRES_PASSWORD=***<br/>POSTGRES_DB=devdb

    Script->>Service: exec docker-entrypoint.sh
    Service->>Service: Service starts with credentials

    Note over Container,Service: Service token expires after 1 hour (renewable)
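
The same exchange expressed as shell commands; a sketch mirroring the diagram, with SERVICE standing in for a concrete service name:

# AppRole login, then a scoped secret read with the returned token
SERVICE=postgres
ROLE_ID=$(cat /vault-approles/${SERVICE}/role-id)
SECRET_ID=$(cat /vault-approles/${SERVICE}/secret-id)
TOKEN=$(curl -s --request POST \
    --data "{\"role_id\":\"${ROLE_ID}\",\"secret_id\":\"${SECRET_ID}\"}" \
    "${VAULT_ADDR}/v1/auth/approle/login" | jq -r '.auth.client_token')
curl -s -H "X-Vault-Token: ${TOKEN}" \
     "${VAULT_ADDR}/v1/secret/data/${SERVICE}" | jq '.data.data'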

AppRole Security Benefits:

  1. No Root Token in Containers - Core services never see root token
  2. Least Privilege - Each service policy allows access ONLY to own secrets
  3. Short-Lived Tokens - Service tokens expire after 1 hour
  4. Audit Trail - All AppRole logins logged by Vault
  5. Policy Enforcement - Cross-service access prevented by Vault policies

Services Using AppRole (7):

  • PostgreSQL, MySQL, MongoDB, Redis (3 nodes), RabbitMQ, Forgejo, Reference API (FastAPI)

Services Using Root Token (9):

  • PgBouncer, API-First, Golang API, Node.js API, Rust API, Redis Exporters (3), Vector

Startup Order

sequenceDiagram
    participant Colima as Colima VM
    participant Vault as Vault
    participant PG as PostgreSQL
    participant MYSQL as MySQL
    participant REDIS as Redis Cluster
    participant RABBIT as RabbitMQ
    participant MONGO as MongoDB
    participant FORGEJO as Forgejo
    participant API as Reference APIs
    participant PROM as Prometheus

    Note over Colima: User runs: ./devstack start --profile standard

    Colima->>Colima: Start VM (5-10s)
    Colima->>Vault: Start container
    Note over Vault: Initialize & Unseal (5-10s)
    Vault->>Vault: Vault healthy ✓

    par Data Services Start (depend on Vault)
        Vault->>PG: AppRole auth
        Note over PG: Fetch credentials from Vault
        PG->>PG: Initialize database (10-15s)
        PG->>PG: PostgreSQL healthy ✓

        Vault->>MYSQL: AppRole auth
        Note over MYSQL: Fetch credentials from Vault
        MYSQL->>MYSQL: Initialize database (10-15s)
        MYSQL->>MYSQL: MySQL healthy ✓

        Vault->>REDIS: AppRole auth (all 3 nodes)
        Note over REDIS: Fetch credentials from Vault
        REDIS->>REDIS: Start 3 nodes (10s)
        Note over REDIS: redis-cluster-init required
        REDIS->>REDIS: Redis nodes healthy ✓

        Vault->>RABBIT: AppRole auth
        Note over RABBIT: Fetch credentials from Vault
        RABBIT->>RABBIT: Initialize (15-20s)
        RABBIT->>RABBIT: RabbitMQ healthy ✓

        Vault->>MONGO: AppRole auth
        Note over MONGO: Fetch credentials from Vault
        MONGO->>MONGO: Initialize (10-15s)
        MONGO->>MONGO: MongoDB healthy ✓
    end

    par Application Services Start (depend on databases)
        PG->>FORGEJO: Database ready
        Vault->>FORGEJO: AppRole auth
        FORGEJO->>FORGEJO: Initialize (10-15s)
        FORGEJO->>FORGEJO: Forgejo healthy ✓

        PG-->>API: All data services ready
        MYSQL-->>API: All data services ready
        REDIS-->>API: All data services ready
        RABBIT-->>API: All data services ready
        MONGO-->>API: All data services ready
        Vault->>API: AppRole auth
        API->>API: Start 5 APIs (5-10s)
        API->>API: All APIs healthy ✓
    end

    Note over PROM: Observability starts independently
    PROM->>PROM: Start Prometheus, Grafana, Loki (5s)
    PROM->>PG: Begin scraping metrics
    PROM->>MYSQL: Begin scraping metrics
    PROM->>REDIS: Begin scraping metrics

    Note over Colima,PROM: Total Startup Time: ~90-120 seconds

Startup Sequence Summary:

1. Vault (5-10s to unseal)
2. Data Services (30-60s for initialization)
   - PostgreSQL, MySQL, MongoDB
   - Redis cluster (needs all 3 nodes)
   - RabbitMQ
3. Application Services (10-20s)
   - Forgejo (waits for PostgreSQL)
   - PgBouncer (waits for PostgreSQL)
   - 5 Reference APIs (wait for all data services)
     * FastAPI code-first (port 8000)
     * FastAPI API-first (port 8001)
     * Go/Gin (port 8002)
     * Node.js/Express (port 8003)
     * Rust/Actix-web (port 8004)
4. Observability (starts immediately, waits for targets)
   - Prometheus, Grafana, Loki start fast
   - Begin scraping/collecting once targets available

Total Startup Time: ~90-120 seconds from cold start


Deployment Architecture

Colima VM Specifications

Default Configuration:

  • CPU: 4 cores
  • Memory: 8 GB
  • Disk: 60 GB
  • Architecture: ARM64 (Apple Silicon)
  • Runtime: Docker
  • Networking: Bridged (VZ framework)

Customizable via devstack.sh:

COLIMA_CPU=8 COLIMA_MEMORY=16 COLIMA_DISK=100 ./devstack.sh start

Volume Strategy

Named Volumes (Persistent):

  • postgres_data - PostgreSQL database files
  • mysql_data - MySQL database files
  • mongodb_data - MongoDB database files
  • redis_data_1/2/3 - Redis persistence (3 volumes)
  • rabbitmq_data - RabbitMQ message store
  • vault_data - Vault storage backend
  • forgejo_data - Git repositories
  • prometheus_data - Time-series metrics
  • grafana_data - Dashboard configs
  • loki_data - Log storage

Bind Mounts (Configuration):

  • ./configs/{service}/ → Container config directories
  • Configuration files are version-controlled

Benefits:

  • Data persists across container restarts
  • Can backup volumes independently
  • Easy to reset individual services
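
Backups follow the standard Docker volume pattern; a sketch for one volume from the list above:

# Archive postgres_data to the current directory via a throwaway container
docker run --rm \
  -v postgres_data:/data:ro \
  -v "$(pwd)":/backup \
  alpine tar czf /backup/postgres_data.tar.gz -C /data .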

Resource Allocation

Per-Service Limits (if configured):

  • Not set by default (development mode)
  • Can add via deploy.resources in docker-compose.yml
  • Recommended for resource-constrained environments

Observed Resource Usage (28 services):

  • Total Memory: ~4-6 GB
  • Total CPU: ~1-2 cores average
  • Disk: ~10-15 GB (with data)

Scaling Considerations

Current Limitations (Development Mode)

  1. Redis Cluster Without Replicas

    • 3 masters, no replicas
    • No high availability
    • Suitable for development only
  2. Single Instance Per Service

    • No load balancing
    • No redundancy
    • Fast restarts instead
  3. File-Based Vault Storage

    • Not HA-capable
    • Single point of failure
    • Fine for development

Production Adaptation Strategies

If adapting for production:

  1. Redis Cluster

    • Add replicas: 3 masters + 3 replicas minimum
    • Enable cluster failover
    • Use Redis Sentinel or Redis Cluster mode
  2. Database Replication

    • PostgreSQL: Streaming replication (primary + standby)
    • MySQL: Primary-replica replication or Galera cluster
    • MongoDB: Replica sets (3+ nodes)
  3. Vault

    • Consul or etcd storage backend
    • 3+ Vault nodes for HA
    • Auto-unsealing via cloud KMS
  4. Load Balancing

    • Add nginx/traefik for API load balancing
    • Multiple API instances
    • Session affinity if needed
  5. Observability

    • Prometheus federation for multiple clusters
    • Remote write to long-term storage (Thanos, Cortex)
    • Centralized Loki for multi-cluster logs

Horizontal Scaling

Services that can scale horizontally:

  • ✅ Reference APIs (stateless)
  • ✅ PgBouncer (connection pooler)
  • ⚠️ Forgejo (needs shared storage)

Services that require special handling:

  • ❌ Databases (need replication setup)
  • ❌ Redis (needs cluster reconfiguration)
  • ❌ RabbitMQ (needs cluster mode)
  • ❌ Vault (needs HA storage backend)

Architectural Patterns

Initialization Pattern

All stateful services follow this pattern:

#!/bin/bash
# init.sh — simplified sketch of the shared startup wrapper
# (SERVICE_NAME, VAULT_ADDR, VAULT_TOKEN are supplied by the compose environment)

# 1. Wait for Vault to be ready (health check loop)
until curl -sf "${VAULT_ADDR}/v1/sys/health" > /dev/null; do sleep 2; done

# 2. Fetch credentials from Vault (secret/{service}) and 3. parse them with jq
CREDS=$(curl -sf -H "X-Vault-Token: ${VAULT_TOKEN}" \
  "${VAULT_ADDR}/v1/secret/data/${SERVICE_NAME}" | jq '.data.data')

# 4. Configure the service with the fetched credentials
export SERVICE_PASSWORD=$(echo "$CREDS" | jq -r '.password')

# 5. Start the service process; 6. Docker's healthcheck then validates readiness
exec docker-entrypoint.sh "$@"

Configuration Pattern

Environment Variables (from docker-compose.yml)
  │
  ▼
Service init script (./init.sh)
  │
  ├─► Fetch secrets from Vault
  ├─► Generate config files
  └─► Export environment

Service starts with configuration

Health Check Pattern

healthcheck:
  test: ["CMD", "command", "to", "test", "health"]
  interval: 10s
  timeout: 5s
  retries: 5
  start_period: 30s

Progressive Health Checks:

  1. Start period: Service initialization time
  2. Interval: How often to check
  3. Retries: How many failures before unhealthy
  4. Timeout: Max time for check command

Future Architecture Considerations

Potential Enhancements

  1. Service Mesh (Istio/Linkerd)

    • mTLS between services
    • Advanced traffic management
    • Observability built-in
  2. Kubernetes Migration

    • Convert docker-compose to K8s manifests
    • Use Helm charts
    • Enable true cloud-native operations
  3. GitOps Integration

    • ArgoCD or Flux
    • Declarative configuration management
    • Automated drift detection
  4. Multi-Environment Support

    • Dev, staging, production configs
    • Environment-specific overrides
    • Promotion workflows

Reference Documentation


For operational procedures, see TROUBLESHOOTING.md. For performance optimization, see PERFORMANCE_TUNING.md.
