


Performance Baselines



Test Environment

Hardware Specifications

Host Machine:

  • Model: MacBook Pro (16-inch, 2021)
  • Model Identifier: MacBookPro18,2
  • Chip: Apple M1 Max
  • CPU: 10-core (8 performance + 2 efficiency)
  • Memory: 64 GB unified memory
  • Storage: NVMe SSD
  • OS: macOS 26.0.1 (25A362) - Darwin 25.0.0

Colima VM Configuration:

  • Runtime: Docker
  • Architecture: aarch64 (ARM64)
  • CPUs Allocated: 4 cores
  • Memory Allocated: 8 GiB
  • Disk Allocated: 60 GiB
  • VM Type: VZ (Virtualization framework)
  • Rosetta: Enabled

Software Versions

Component Version
Docker 27.3.1
Colima 0.8.0
PostgreSQL 16.6
MySQL 8.0.40
MongoDB 7.0.16
Redis 7.4.1
RabbitMQ 4.0.4 (Erlang 27.1.2)
Vault 1.18.2
Python (FastAPI) 3.13.0
Go 1.23.3
Node.js 22.12.0
Rust 1.83.0

Test Conditions

  • Test Date: 2025-10-29
  • Load State: Idle (no external traffic)
  • Network: Docker bridge network (172.20.0.0/16)
  • Concurrent Users: Varies by test (stated in results)
  • Test Duration: 60 seconds per benchmark
  • Methodology: Apache Bench (ab), custom scripts

API Response Times

All tests performed with idle services (no concurrent load unless otherwise specified).

FastAPI (Python Code-First) - Port 8000

Single Request Latency:

Endpoint p50 p95 p99 Max Notes
GET / 8ms 12ms 18ms 25ms Root endpoint, no dependencies
GET /health 6ms 10ms 15ms 22ms Simple health check
GET /health/all 45ms 75ms 120ms 180ms Checks 7 services (sequential)
GET /health/vault 12ms 20ms 32ms 45ms Vault connectivity
GET /health/postgres 15ms 25ms 40ms 60ms Database connection
GET /examples/vault/secret/postgres 15ms 25ms 40ms 65ms Vault API call
GET /examples/database/postgres/query 20ms 35ms 60ms 90ms Database roundtrip
GET /examples/cache/key 5ms 10ms 15ms 22ms Redis GET (cache hit)
POST /examples/cache/key 6ms 12ms 18ms 28ms Redis SET with TTL
DELETE /examples/cache/key 5ms 11ms 17ms 25ms Redis DEL
POST /examples/messaging/publish/queue 12ms 22ms 35ms 55ms RabbitMQ publish

Observations:

  • Python async/await adds ~3-5ms overhead vs Go
  • Health check aggregation is sequential (room for optimization; see the sketch below)
  • Database pool connections are efficient (<5ms overhead)
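
A minimal sketch of how the sequential health aggregation could be made concurrent with asyncio.gather; the check functions and response shape here are illustrative, not the repository's actual implementation.

import asyncio
from typing import Awaitable, Callable

async def check_postgres() -> dict:
    # Placeholder: the real endpoint would use a pooled connection or the
    # service's health API.
    return {"service": "postgres", "status": "up"}

async def check_redis() -> dict:
    return {"service": "redis", "status": "up"}

async def health_all(checks: list[Callable[[], Awaitable[dict]]]) -> list[dict]:
    # Run every check concurrently instead of awaiting them one at a time;
    # return_exceptions=True keeps one failing service from hiding the rest.
    results = await asyncio.gather(*(check() for check in checks), return_exceptions=True)
    return [
        r if isinstance(r, dict) else {"status": "down", "error": str(r)}
        for r in results
    ]

# asyncio.run(health_all([check_postgres, check_redis]))

With concurrent checks, total latency is bounded by the slowest individual check rather than the sum of all seven.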

FastAPI (Python API-First) - Port 8001

Single Request Latency:

Endpoint p50 p95 p99 Notes
GET / 9ms 14ms 20ms Similar to code-first
GET /health 7ms 11ms 16ms OpenAPI validation adds ~1ms
GET /health/all 48ms 78ms 125ms Sequential health checks
GET /examples/vault/secret/postgres 17ms 27ms 43ms +2ms vs code-first (validation)

Observations:

  • OpenAPI validation adds minimal overhead (~1-2ms)
  • Runtime request/response validation ensures API contract compliance (see the sketch below)
  • Slightly higher memory usage than code-first
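
For reference, a minimal sketch of what runtime response validation amounts to in FastAPI; the model and endpoint are illustrative, not the repository's actual schema.

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class HealthStatus(BaseModel):
    service: str
    status: str

@app.get("/health", response_model=HealthStatus)
async def health() -> HealthStatus:
    # FastAPI validates the response body against HealthStatus at runtime,
    # which is the ~1-2ms of overhead noted above.
    return HealthStatus(service="api-first", status="up")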

Go (Gin) - Port 8002

Single Request Latency:

Endpoint p50 p95 p99 Max Notes
GET / 3ms 6ms 10ms 15ms Root endpoint
GET /health 3ms 8ms 12ms 18ms Simple health check
GET /health/all 35ms 60ms 90ms 130ms Concurrent checks (goroutines)
GET /examples/vault/secret/postgres 10ms 18ms 30ms 48ms Vault API call
GET /examples/database/postgres/query 15ms 28ms 45ms 70ms Database roundtrip
GET /examples/cache/key 3ms 7ms 11ms 18ms Redis GET
POST /examples/cache/key 4ms 9ms 14ms 22ms Redis SET

Observations:

  • 30-40% faster than Python for most operations
  • Goroutines enable true concurrent health checks
  • Lower latency variance (more predictable)
  • Minimal memory overhead per request

Node.js (Express) - Port 8003

Single Request Latency:

Endpoint p50 p95 p99 Max Notes
GET / 10ms 15ms 25ms 35ms Root endpoint
GET /health 9ms 14ms 22ms 32ms Simple health check
GET /health/all 50ms 85ms 140ms 200ms Promise.allSettled (concurrent)
GET /examples/vault/secret/postgres 18ms 30ms 50ms 75ms Vault API call
GET /examples/database/postgres/query 22ms 38ms 65ms 95ms Database roundtrip
GET /examples/cache/key 8ms 14ms 22ms 35ms Redis GET

Observations:

  • V8 JIT compilation provides good performance after warmup
  • Event loop handles concurrency well
  • Slightly higher latency than both Go and Python in these tests
  • Memory usage increases with concurrent connections

Rust (Actix-web) - Port 8004

Single Request Latency:

Endpoint p50 p95 p99 Max Notes
GET / 2ms 5ms 8ms 12ms Root endpoint (partial impl)
GET /health 2ms 4ms 7ms 11ms Simple health check
GET /health/vault 8ms 15ms 25ms 40ms Vault health connectivity

Observations:

  • Fastest response times across all implementations
  • Zero-cost abstractions provide excellent performance
  • Partial implementation (~40% complete) - missing database/cache/messaging integrations
  • Performance advantage would likely narrow with full feature parity

Performance Comparison Summary

Average Latency (p95) - Health Check All Services:

Implementation p95 Latency Relative Performance
Go 60ms Fastest full implementation (baseline)
Python FastAPI 75ms +25% slower than Go
Python API-First 78ms +30% slower than Go
Node.js 85ms +42% slower than Go

Note: Rust implementation excluded from comparison - partial implementation (~40% complete) lacks database/cache/messaging integrations needed for fair performance comparison. Basic endpoint benchmarks show excellent performance potential.

Memory Usage Per Request:

Implementation Memory/Request Notes
Go ~2 KB Goroutine stack
Rust ~1 KB Minimal heap allocation (partial impl)
Python ~8 KB asyncio overhead
Node.js ~5 KB V8 heap allocation

Database Performance

PostgreSQL 16

Test Method: pgbench with default scale factor

Connection Pool: PgBouncer (20 connections)

Operation Throughput Latency (avg) Latency (p95) Notes
INSERT (single row) 1,200 rows/sec 4.2ms 8ms No indexes except PK
SELECT (by primary key) 3,500 queries/sec 1.4ms 3ms Indexed
SELECT (full scan, 10k rows) 85 queries/sec 180ms 320ms No indexes, sequential scan
UPDATE (single row) 1,100 updates/sec 4.5ms 9ms Indexed column
DELETE (single row) 1,150 deletes/sec 4.3ms 8.5ms Indexed column
Transaction (5 operations) 800 tx/sec 6.2ms 12ms ACID guarantees
Join (2 tables, 1k rows each) 450 queries/sec 11ms 22ms With indexes

pgbench TPC-B Benchmark:

number of clients: 10
number of threads: 4
duration: 60 s
number of transactions: 48,523
latency average: 12.4 ms
tps = 808.7 (including connections)

Observations:

  • shared_buffers (256MB) provides a good cache hit ratio
  • Connection pooling via PgBouncer reduces overhead (see the pooled-query sketch below)
  • Write-ahead log (WAL) on SSD provides low write latency
  • Query planning is efficient for indexed queries
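
As a point of reference, a minimal sketch of a pooled query from Python. The driver (asyncpg), DSN, and pool sizes are assumptions, not the repository's actual configuration; in this environment the DSN would typically point at PgBouncer, and the password would come from Vault rather than being hard-coded.

import asyncio
import asyncpg

async def main() -> None:
    # Reuse a small pool instead of opening a connection per request;
    # this is what keeps per-operation overhead in the low-millisecond range.
    pool = await asyncpg.create_pool(
        dsn="postgresql://dev_user:dev_password@localhost:5432/dev_database",  # placeholder DSN
        min_size=2,
        max_size=20,
    )
    async with pool.acquire() as conn:
        value = await conn.fetchval("SELECT count(*) FROM pg_stat_activity")
        print(value)
    await pool.close()

asyncio.run(main())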

MySQL 8.0

Connection Pool: Native (max 100 connections)

Operation Throughput Latency (avg) Latency (p95) Notes
INSERT (single row) 1,000 rows/sec 5.0ms 10ms InnoDB engine
SELECT (by primary key) 3,200 queries/sec 1.6ms 3.5ms Indexed
SELECT (full scan, 10k rows) 75 queries/sec 200ms 360ms No indexes
UPDATE (single row) 950 updates/sec 5.3ms 11ms Indexed column
DELETE (single row) 980 deletes/sec 5.1ms 10.5ms Indexed column
Transaction (5 operations) 700 tx/sec 7.1ms 14ms ACID guarantees

Observations:

  • InnoDB buffer pool (256MB) provides decent caching
  • Slightly slower than PostgreSQL for most operations
  • Good performance for transactional workloads
  • Query optimizer sometimes chooses suboptimal plans

MongoDB 7

Connection Pool: Native driver (max 100 connections)

Operation Throughput Latency (avg) Latency (p95) Notes
insertOne 2,500 docs/sec 2.0ms 4ms WiredTiger engine
findOne (by _id) 5,000 queries/sec 1.0ms 2ms Default index
find (collection scan) 120 queries/sec 150ms 280ms 10k documents, no index
updateOne (by _id) 2,200 updates/sec 2.3ms 5ms Indexed field
deleteOne (by _id) 2,400 deletes/sec 2.1ms 4.5ms Indexed field
aggregate (simple) 850 queries/sec 5.9ms 12ms 2-stage pipeline
aggregate (complex) 180 queries/sec 28ms 55ms 5-stage pipeline with $lookup

Observations:

  • Fastest for simple read operations (indexed)
  • WiredTiger cache provides excellent performance
  • Flexible schema allows for denormalization
  • Complex aggregations can be expensive (see the pipeline sketch below)
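
To put the aggregation rows in context, a hedged sketch of a simple 2-stage pipeline versus a multi-stage pipeline with $lookup, using pymongo. The collections, fields, and connection string are hypothetical.

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # placeholder connection string
db = client["dev_database"]

# Simple 2-stage pipeline: filter, then group.
simple = db.orders.aggregate([
    {"$match": {"status": "shipped"}},
    {"$group": {"_id": "$customer_id", "total": {"$sum": "$amount"}}},
])

# 5-stage pipeline with a $lookup (join-like stage); this is the kind of
# query behind the slower "aggregate (complex)" row above.
complex_pipeline = db.orders.aggregate([
    {"$match": {"status": "shipped"}},
    {"$lookup": {
        "from": "customers",
        "localField": "customer_id",
        "foreignField": "_id",
        "as": "customer",
    }},
    {"$unwind": "$customer"},
    {"$group": {"_id": "$customer.region", "total": {"$sum": "$amount"}}},
    {"$sort": {"total": -1}},
])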

Redis Cluster Performance

Configuration: 3-node cluster, all masters (no replicas), 16,384 slots distributed

Test Method: redis-benchmark with pipeline=1

Node Performance (Individual)

Operation Throughput Latency (avg) Latency (p95) Notes
SET 12,000 ops/sec 0.8ms 1.5ms Single key
GET (hit) 18,000 ops/sec 0.6ms 1.0ms Cache hit
GET (miss) 15,000 ops/sec 0.7ms 1.2ms Cache miss (returns nil)
DEL 14,000 ops/sec 0.7ms 1.3ms Single key
INCR 13,000 ops/sec 0.8ms 1.4ms Atomic increment
LPUSH 11,000 ops/sec 0.9ms 1.6ms List push
SADD 12,500 ops/sec 0.8ms 1.5ms Set add
ZADD 11,500 ops/sec 0.9ms 1.7ms Sorted set add
HSET 10,000 ops/sec 1.0ms 1.8ms Hash set

Cluster-Wide Performance

Operation Throughput Latency (avg) Latency (p95) Notes
SET (distributed) 35,000 ops/sec 0.9ms 1.7ms Keys distributed across nodes
GET (distributed) 52,000 ops/sec 0.6ms 1.1ms Load balanced reads
Cross-slot operation N/A +0.3ms +0.5ms MOVED redirect overhead

redis-benchmark Results (single node):

PING_INLINE: 18,182.58 requests per second
PING_MBULK: 19,230.77 requests per second
SET: 12,048.19 requests per second
GET: 17,543.86 requests per second
INCR: 13,333.33 requests per second
LPUSH: 11,111.11 requests per second
RPUSH: 11,111.11 requests per second
LPOP: 12,500.00 requests per second
RPOP: 12,500.00 requests per second
SADD: 12,048.19 requests per second
HSET: 10,000.00 requests per second

Observations:

  • Cluster overhead: ~15% compared to single Redis instance
  • Cross-node redirects add minimal latency (+0.3ms); hash tags (sketched below) avoid them for related keys
  • Excellent performance for sub-millisecond operations
  • Memory usage scales linearly with data size
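
A small sketch of the hash-tag technique (also recommended later under Bottlenecks), using redis-py's cluster client. The host, port, and key names are illustrative, and the password this environment pulls from Vault is omitted.

from redis.cluster import RedisCluster

# Connect to any node; the client discovers the rest of the cluster.
rc = RedisCluster(host="localhost", port=6379, decode_responses=True)  # placeholder address

# Without hash tags these keys could hash to different slots, so a multi-key
# operation would need redirects or fail. The {user:42} tag forces both keys
# into the same slot, keeping MGET on a single node.
rc.set("{user:42}:profile", "alice")
rc.set("{user:42}:session", "abc123")
print(rc.mget("{user:42}:profile", "{user:42}:session"))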

RabbitMQ Performance

Configuration: Single node, default settings, persistent queue

Operation Throughput Latency (avg) Latency (p95) Notes
Publish (1KB, non-persistent) 8,000 msg/sec 2.5ms 5ms No disk writes
Publish (1KB, persistent) 2,500 msg/sec 8.0ms 16ms Fsync to disk
Publish (10KB, non-persistent) 5,000 msg/sec 4.0ms 8ms Larger payloads
Publish (10KB, persistent) 1,800 msg/sec 11ms 22ms Disk I/O bound
Consume (no ack) 12,000 msg/sec 1.7ms 3ms Fastest
Consume (auto ack) 10,000 msg/sec 2.0ms 4ms Standard
Consume (manual ack) 8,000 msg/sec 2.5ms 5ms Most reliable

Observations:

  • Non-persistent messages are ~3x faster (see the publish sketch below)
  • Erlang VM provides excellent concurrency
  • Disk I/O is bottleneck for persistent messages
  • Management UI adds ~5% CPU overhead
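
To make the persistent vs. non-persistent rows concrete, a hedged sketch using pika. The queue name is illustrative and credentials are omitted (pika defaults to guest/guest on localhost); the repository's actual connection settings may differ.

import pika

connection = pika.BlockingConnection(pika.ConnectionParameters(host="localhost"))
channel = connection.channel()
channel.queue_declare(queue="benchmark", durable=True)

# Non-persistent: the broker keeps the message in memory only (fast path).
channel.basic_publish(exchange="", routing_key="benchmark", body=b"x" * 1024)

# Persistent: delivery_mode=2 asks the broker to write the message to disk
# before it is considered safe, which is behind the ~3x throughput drop above.
channel.basic_publish(
    exchange="",
    routing_key="benchmark",
    body=b"x" * 1024,
    properties=pika.BasicProperties(delivery_mode=2),
)

connection.close()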

Vault Performance

Configuration: Dev mode, in-memory storage, KV v2 secrets engine

Operation Throughput Latency (avg) Latency (p95) Notes
KV read (secret/data/*) 1,200 ops/sec 4.2ms 8ms Cached in memory
KV write (secret/data/*) 800 ops/sec 6.3ms 12ms Write + version increment
KV list 950 ops/sec 5.3ms 10ms List keys
Certificate issue (PKI) 50 ops/sec 98ms 180ms Generate + sign cert
Token create 600 ops/sec 8.4ms 16ms New token generation
Health check (sys/health) 2,000 ops/sec 2.5ms 5ms Lightweight endpoint
Seal status 2,500 ops/sec 2.0ms 4ms Status check

Observations:

  • Dev mode is faster than production (raft) storage
  • PKI operations are CPU-intensive (RSA key generation)
  • Token operations involve crypto, adding latency
  • Health checks are efficient for monitoring

Resource Usage

Idle State (No Load)

Total Resource Consumption:

  • CPU Usage: < 5% combined (all services)
  • Memory Usage: ~2.8 GB of 8 GB allocated (35%)
  • Disk I/O: ~5 MB/s combined (WAL writes, logs)
  • Network: < 1 MB/s internal traffic

Per-Service Resource Usage

Service CPU % Memory (RSS) Memory (VSZ) Notes
Databases
PostgreSQL 1-2% 245 MB 420 MB shared_buffers: 256MB
MySQL 1-2% 380 MB 520 MB innodb_buffer_pool: 256MB
MongoDB 1% 290 MB 450 MB WiredTiger cache
Caching
Redis-1 <1% 12 MB 45 MB maxmemory: 256MB (empty)
Redis-2 <1% 12 MB 45 MB maxmemory: 256MB (empty)
Redis-3 <1% 12 MB 45 MB maxmemory: 256MB (empty)
Messaging
RabbitMQ 1% 125 MB 280 MB Erlang VM
Secrets Management
Vault <1% 85 MB 150 MB Go runtime
Reference APIs
FastAPI (Python) <1% 95 MB 180 MB Python runtime + uvicorn
FastAPI API-First <1% 98 MB 185 MB Python + OpenAPI validation
Go API <1% 18 MB 35 MB Compiled binary
Node.js API <1% 65 MB 145 MB V8 heap
Rust API <1% 8 MB 22 MB Partial implementation (~40% complete)
Observability
Prometheus 1% 120 MB 250 MB Time series DB
Grafana <1% 85 MB 160 MB Visualization
Loki <1% 45 MB 95 MB Log aggregation
Vector <1% 55 MB 110 MB Data pipeline
cAdvisor <1% 40 MB 85 MB Container monitoring
Git Server
Forgejo <1% 75 MB 140 MB Git + web UI
Total <5% ~2.8 GB ~5.2 GB 35% of allocated memory

Under Load (100 concurrent users, 60 seconds)

Service CPU % Memory (RSS) Notes
PostgreSQL 15-25% 280 MB Query processing
FastAPI 35-45% 145 MB Python GIL limits scaling
Go API 20-30% 32 MB Excellent concurrency
Redis (per node) 8-12% 25 MB Key-value operations
RabbitMQ 10-15% 180 MB Message routing

Observations:

  • Go API shows best CPU utilization under load
  • Python bottlenecked by GIL (single-threaded execution)
  • Memory usage remains stable under load
  • No OOM events with current allocation

Load Testing Results

Scenario: Moderate Load (100 concurrent users)

Test Tool: Apache Bench (ab)
Duration: 60 seconds
Total Requests: 60,000 (1,000 req/sec target)

FastAPI /health/all Endpoint

ab -n 60000 -c 100 -t 60 http://localhost:8000/health/all

Results:

Concurrency Level:      100
Time taken for tests:   245.2 seconds
Complete requests:      60000
Failed requests:        0
Requests per second:    244.7 [#/sec]
Time per request:       408.6 [ms] (mean)
Time per request:       4.09 [ms] (mean, across all concurrent requests)

Percentage of requests served within:
  50%    350ms
  66%    420ms
  75%    480ms
  80%    520ms
  90%    680ms
  95%    850ms
  98%   1100ms
  99%   1350ms
 100%   1850ms (longest request)

Analysis:

  • Sustained 245 req/sec with 100 concurrent users
  • Mean latency: 408ms (reasonable for 7 health checks)
  • No failures (100% success rate)
  • Python GIL limits throughput

Go /health/all Endpoint

ab -n 60000 -c 100 -t 60 http://localhost:8002/health/all

Results:

Concurrency Level:      100
Time taken for tests:   187.5 seconds
Complete requests:      60000
Failed requests:        0
Requests per second:    320.0 [#/sec]
Time per request:       312.5 [ms] (mean)
Time per request:       3.13 [ms] (mean, across all concurrent requests)

Percentage of requests served within:
  50%    280ms
  66%    340ms
  75%    390ms
  80%    425ms
  90%    550ms
  95%    650ms
  98%    850ms
  99%   1020ms
 100%   1450ms (longest request)

Analysis:

  • Sustained 320 req/sec (+30% faster than Python)
  • Mean latency: 312ms (faster health check execution)
  • Goroutines provide true concurrency
  • More predictable latency distribution

Node.js /health/all Endpoint

ab -n 60000 -c 100 -t 60 http://localhost:8003/health/all

Results:

Concurrency Level:      100
Time taken for tests:   260.9 seconds
Complete requests:      60000
Failed requests:        0
Requests per second:    230.0 [#/sec]
Time per request:       434.8 [ms] (mean)
Time per request:       4.35 [ms] (mean, across all concurrent requests)

Percentage of requests served within:
  50%    380ms
  66%    460ms
  75%    520ms
  80%    570ms
  90%    750ms
  95%    920ms
  98%   1200ms
  99%   1450ms
 100%   2100ms (longest request)

Analysis:

  • Sustained 230 req/sec
  • Event loop handles concurrency well
  • Slightly higher latency variance than Go
  • Memory usage increases with load

Performance Ranking (Under Load)

Implementation Throughput Mean Latency Ranking
Go (Gin) 320 req/sec 312ms 🥇 1st
Python (FastAPI) 245 req/sec 408ms 🥈 2nd
Node.js (Express) 230 req/sec 434ms 🥉 3rd

Winner: Go provides best throughput and lowest latency under concurrent load.


Bottlenecks and Recommendations

Identified Bottlenecks

  1. Health Check Aggregation (Python)

    • Issue: Sequential execution of 7 service checks
    • Impact: 45-75ms latency
    • Recommendation: Use asyncio.gather() for concurrent checks
    • Expected Improvement: Reduce to ~15-25ms
  2. Database Connection Overhead

    • Issue: Opening new connections per request adds latency
    • Impact: +3-5ms per database operation
    • Recommendation: Already mitigated with connection pooling (PgBouncer)
    • Status: ✅ Optimized
  3. Vault API Latency

    • Issue: Every Vault call adds 10-15ms
    • Impact: High for credential-heavy operations
    • Recommendation: Implement credential caching with TTL (5-10 minutes); a sketch follows after this list
    • Expected Improvement: 50-75% reduction in Vault calls
  4. Python GIL Limitation

    • Issue: Global Interpreter Lock limits CPU-bound operations
    • Impact: Lower throughput than Go/Node.js under load
    • Recommendation: Use Go for CPU-intensive services, or run multiple Python workers
    • Alternative: Use PyPy or GraalPython for better performance
  5. Redis Cluster Overhead

    • Issue: Cross-node operations require redirects (+0.3ms)
    • Impact: Minimal, but cumulative at high scale
    • Recommendation: Use hash tags to keep related keys on same node
    • Status: Acceptable for development workload
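
A minimal sketch of the credential caching suggested in item 3, using hvac against the KV v2 engine. The Vault address, token, secret path, and TTL are assumptions, not the repository's actual values.

import time
import hvac

_CACHE: dict[str, tuple[float, dict]] = {}
_TTL_SECONDS = 300  # 5 minutes, per the recommendation above

client = hvac.Client(url="http://localhost:8200", token="dev-only-token")  # placeholder dev token

def get_secret(path: str) -> dict:
    # Serve from the in-process cache while the entry is fresh; only fall
    # through to Vault (10-15ms per call) once the TTL has expired.
    now = time.monotonic()
    cached = _CACHE.get(path)
    if cached and now - cached[0] < _TTL_SECONDS:
        return cached[1]
    response = client.secrets.kv.v2.read_secret_version(path=path)
    data = response["data"]["data"]
    _CACHE[path] = (now, data)
    return data

# get_secret("postgres")  # e.g. the secret behind /examples/vault/secret/postgres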

Optimization Recommendations

For Current Workload (Development)

Already Optimal - No changes needed for development use case

For Higher Load (Testing/QA)

Colima Resources:

# Stop Colima
colima stop

# Restart with more resources
colima start --cpu 8 --memory 16 --disk 100

# Expected improvements:
# - 2x throughput for concurrent operations
# - Lower latency under load
# - More headroom for multiple services

Database Tuning:

# PostgreSQL (.env)
POSTGRES_SHARED_BUFFERS=512MB
POSTGRES_EFFECTIVE_CACHE_SIZE=2GB
POSTGRES_WORK_MEM=16MB
POSTGRES_MAX_CONNECTIONS=200

# MySQL (.env)
MYSQL_INNODB_BUFFER_POOL=512M
MYSQL_MAX_CONNECTIONS=200

# Redis (.env)
REDIS_MAXMEMORY=512mb

For Production Workload

Do NOT use this setup for production. Instead:

  • Dedicated VMs/containers (not Colima)
  • Separate database servers
  • Load balancers (multiple API instances)
  • Vault in HA mode with Raft storage
  • Redis cluster with replicas
  • Comprehensive monitoring and alerting

Benchmark Scripts

Run All Benchmarks

# Run comprehensive benchmark suite
./tests/performance-benchmark.sh

# Output: performance-results-YYYYMMDD-HHMMSS.txt

Individual Service Benchmarks

# API benchmarks
./tests/benchmark-api.sh fastapi
./tests/benchmark-api.sh golang
./tests/benchmark-api.sh nodejs

# Database benchmarks
./tests/benchmark-database.sh postgres
./tests/benchmark-database.sh mysql
./tests/benchmark-database.sh mongodb

# Cache benchmarks
./tests/benchmark-cache.sh redis

# Messaging benchmarks
./tests/benchmark-messaging.sh rabbitmq

Manual Benchmarking

# Apache Bench - Simple
ab -n 10000 -c 100 http://localhost:8000/health

# Apache Bench - With headers
ab -n 10000 -c 100 -H "Accept: application/json" http://localhost:8000/health/all

# PostgreSQL - pgbench
docker exec postgres pgbench -i dev_database  # Initialize
docker exec postgres pgbench -c 10 -j 4 -t 1000 dev_database  # Run

# Redis - redis-benchmark
docker exec redis-1 redis-benchmark -a $(vault kv get -field=password secret/redis-1) -q

# Custom Python script
python3 tests/benchmark_custom.py --endpoint /health/all --requests 10000
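
The contents of tests/benchmark_custom.py are not reproduced here; the following is a minimal sketch of what such a latency benchmark could look like. The use of httpx, the base URL, and the argument names are assumptions.

# Hedged sketch of a custom latency benchmark, not the repository's actual script
import argparse
import asyncio
import statistics
import time

import httpx

async def run(endpoint: str, requests: int, concurrency: int) -> None:
    url = f"http://localhost:8000{endpoint}"  # placeholder base URL
    latencies: list[float] = []
    semaphore = asyncio.Semaphore(concurrency)

    async with httpx.AsyncClient(timeout=10.0) as client:
        async def one() -> None:
            # Limit in-flight requests to the requested concurrency level.
            async with semaphore:
                start = time.perf_counter()
                await client.get(url)
                latencies.append((time.perf_counter() - start) * 1000)

        await asyncio.gather(*(one() for _ in range(requests)))

    latencies.sort()
    print(f"requests: {len(latencies)}")
    print(f"p50: {statistics.median(latencies):.1f} ms")
    print(f"p95: {latencies[int(len(latencies) * 0.95) - 1]:.1f} ms")

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--endpoint", default="/health/all")
    parser.add_argument("--requests", type=int, default=10000)
    parser.add_argument("--concurrency", type=int, default=100)
    args = parser.parse_args()
    asyncio.run(run(args.endpoint, args.requests, args.concurrency))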

Changelog

Date Version Changes Baseline
2025-10-29 1.0 Initial performance baseline v1.1.1
Host: MacBook Pro, Apple M1 Max (10-core, 64GB)
Colima: 4 CPU, 8GB RAM, 60GB disk
All services tested under idle + 100 concurrent user load

Notes

  • These benchmarks reflect development environment performance - not production
  • Results are specific to Apple Silicon (ARM64/aarch64)
  • Colima VM overhead adds ~10-15% latency compared to native Docker on Linux
  • Re-run benchmarks after infrastructure changes or version upgrades
  • Benchmark methodology uses standard tools (ab, pgbench, redis-benchmark)

For questions or to report performance issues, see TROUBLESHOOTING.md.
