[Epic] ⚙️ Performance - Production Server Tuning #1297

@crivetimihai

Description

⚙️ Performance - Production Server Tuning

Goal

Optimize default production server configuration for performance and reliability:

  1. Enable Gunicorn preload_app by default for memory efficiency
  2. Create/update gunicorn.config.py with optimized production settings
  3. Update .env.example with all performance-related configuration flags
  4. Update run-gunicorn.sh with better production defaults
  5. Add performance tuning documentation explaining all settings
  6. Create benchmarking guide for load testing and optimization

This ensures optimal resource usage, better reliability, and predictable performance under load with production-ready defaults out of the box.

Why Now?

Production configuration directly impacts performance, cost, and reliability:

  1. Memory Efficiency: Preload reduces memory footprint by 30-50% per worker
  2. Resource Optimization: Proper worker count balances CPU/memory usage effectively
  3. Reliability: Correct timeouts prevent worker crashes and deadlocks
  4. Cost Savings: Optimized config means fewer servers needed (20-40% reduction)
  5. Predictable Performance: Well-tuned servers handle load spikes gracefully
  6. Best Practices: Production-ready defaults out of the box reduce deployment issues

📖 User Stories

US-1: DevOps Engineer - Deploy with Optimal Defaults

As a DevOps Engineer
I want production-ready default configuration
So that deployments perform well without manual tuning

Acceptance Criteria:

Given I am deploying MCP Gateway for the first time
When I use default configuration
Then Gunicorn should use optimal worker count (2*CPU+1)
And preload_app should be enabled for memory efficiency
And all timeouts should be reasonable for typical workloads
And the server should handle 1000+ concurrent connections

Given I am running on a 4-CPU server
When the gateway starts
Then it should spawn 9 workers (2*4+1)
And each worker should use ~60MB RAM (with preload)
And total memory usage should be <1GB
And CPU usage should be balanced across all cores

Given I need to adjust configuration
When I set environment variables
Then changes should apply without code modifications
And settings should be documented in .env.example
And I should understand the impact of each setting

Technical Requirements:

  • Comprehensive gunicorn.config.py with production settings
  • Preload enabled by default in run-gunicorn.sh
  • Environment variable overrides for all settings
  • Documentation explaining each configuration option
US-2: Site Reliability Engineer - Monitor and Optimize Performance

As a Site Reliability Engineer
I want to monitor performance metrics and tune configuration
So that I can optimize resource usage and ensure reliability

Acceptance Criteria:

Given I am monitoring the production gateway
When I view Prometheus metrics
Then I should see worker count and memory usage
And I should see request rates and latency percentiles
And I should see database connection pool usage
And I should see cache hit ratios

Given I notice high memory usage per worker
When I adjust max_requests setting
Then workers should restart more frequently
And memory leaks should be prevented
And total memory usage should stabilize

Given I need to perform load testing
When I run benchmarks
Then I should have documented load test procedures
And I should be able to compare against baselines
And I should identify bottlenecks (CPU, memory, DB, network)

Technical Requirements:

  • Enhanced health check with performance metrics
  • Prometheus metrics for all key indicators
  • Grafana dashboard template for monitoring
  • Load testing scripts (wrk, locust)
  • Performance tuning guide
US-3: Production Deployment - Zero-Downtime Updates

As an operator running a production deployment
I want graceful shutdown and reload capabilities
So that updates don't disrupt active users

Acceptance Criteria:

Given I am deploying a new version
When I send a SIGHUP signal to Gunicorn
Then workers should reload gracefully
And in-flight requests should complete
And no requests should be dropped
And downtime should be zero

Given a worker is processing a long request
When a graceful shutdown begins
Then the worker should be allowed to finish within graceful_timeout
And a warning should be logged if the timeout is exceeded
And the worker should not be force-killed prematurely

Given the application is shutting down
When lifespan shutdown runs
Then all connections should close cleanly
And caches should be flushed
And database connections should be released
And shutdown should complete within 30 seconds

Technical Requirements:

  • Graceful timeout configured (30s)
  • Lifespan shutdown cleanup
  • Worker reload without dropping requests
  • Production deployment checklist

🏗 Architecture

Worker Process Model

graph TD
    A[Gunicorn Master Process] --> B[Worker 1: UvicornWorker]
    A --> C[Worker 2: UvicornWorker]
    A --> D[Worker 3: UvicornWorker]
    A --> E[Worker N: UvicornWorker]

    B --> F[FastAPI App Instance]
    C --> G[FastAPI App Instance]
    D --> H[FastAPI App Instance]
    E --> I[FastAPI App Instance]

    J[Preload App] -.->|Shared Memory| F
    J -.->|Shared Memory| G
    J -.->|Shared Memory| H
    J -.->|Shared Memory| I

    style A fill:#FFE4B5
    style J fill:#90EE90

Resource Usage with Preload

graph LR
    A[Without Preload] --> B[Worker 1: 100MB]
    A --> C[Worker 2: 100MB]
    A --> D[Worker 3: 100MB]
    A --> E[Worker 4: 100MB]
    A --> F[Total: 400MB]

    G[With Preload] --> H[Shared Code: 60MB]
    G --> I[Worker 1: 40MB]
    G --> J[Worker 2: 40MB]
    G --> K[Worker 3: 40MB]
    G --> L[Worker 4: 40MB]
    G --> M[Total: 220MB]

    style F fill:#FFB6C1
    style M fill:#90EE90

Graceful Shutdown Flow

sequenceDiagram
    participant Master as Gunicorn Master
    participant Worker as Worker Process
    participant App as FastAPI App
    participant DB as Database

    Note over Master,DB: Graceful Shutdown (SIGTERM)
    Master->>Worker: SIGTERM signal
    Worker->>Worker: Stop accepting new requests
    Worker->>App: Call lifespan shutdown
    App->>DB: Close connections
    App->>App: Flush caches
    App->>App: Cleanup resources
    App->>Worker: Shutdown complete
    Worker->>Worker: Finish in-flight requests
    Worker->>Worker: Wait up to graceful_timeout (30s)
    Worker->>Master: Exit cleanly
    Master->>Master: All workers stopped

Implementation Examples

# gunicorn.config.py - Comprehensive Production Configuration

import multiprocessing
import os

# =====================================
# Server Socket Configuration
# =====================================
bind = f"0.0.0.0:{os.getenv('PORT', '4444')}"
backlog = 2048  # Maximum number of pending connections

# =====================================
# Worker Process Configuration
# =====================================

# Worker count formula: (2 * CPU cores) + 1
# This is optimal for I/O-bound applications like API servers
# CPU-bound apps should use: CPU cores + 1
cpu_count = multiprocessing.cpu_count()
default_workers = min((2 * cpu_count) + 1, 16)  # Cap at 16 workers
# "auto" (the run-gunicorn.sh default) or an empty value falls back to the formula above
_workers_env = os.getenv("GUNICORN_WORKERS", "auto")
workers = default_workers if _workers_env in ("", "auto") else int(_workers_env)

# Worker class - UvicornWorker runs the ASGI app (FastAPI) via Uvicorn inside each worker
worker_class = os.getenv("GUNICORN_WORKER_CLASS", "uvicorn.workers.UvicornWorker")

# Max simultaneous connections per worker
worker_connections = int(os.getenv("GUNICORN_WORKER_CONNECTIONS", "1000"))

# =====================================
# Worker Lifecycle Configuration
# =====================================

# Restart worker after N requests (prevents memory leaks)
# Default: 100,000 requests
max_requests = int(os.getenv("GUNICORN_MAX_REQUESTS", "100000"))

# Add jitter to max_requests (prevent thundering herd)
# Each worker restarts after max_requests + randint(0, jitter) requests, staggering restarts
max_requests_jitter = int(os.getenv("GUNICORN_MAX_REQUESTS_JITTER", "100"))

# =====================================
# Timeout Configuration
# =====================================

# Worker silent for this long → restart
# Default: 600 seconds (10 minutes) for long-running operations
timeout = int(os.getenv("GUNICORN_TIMEOUT", "600"))

# Time to finish requests during graceful shutdown
# Workers get this much time to complete in-flight requests
graceful_timeout = int(os.getenv("GUNICORN_GRACEFUL_TIMEOUT", "30"))

# HTTP Keep-Alive timeout
# Keeps connections alive for this many seconds
keepalive = int(os.getenv("GUNICORN_KEEPALIVE", "5"))

# =====================================
# Preloading Configuration
# =====================================

# Preload app before forking workers
# Benefits: 30-50% memory savings (shared code), faster startup
# Drawbacks: Slower reload (must restart all workers)
preload_app = os.getenv("GUNICORN_PRELOAD_APP", "true").lower() == "true"

# =====================================
# Logging Configuration
# =====================================

accesslog = os.getenv("GUNICORN_ACCESS_LOG", "-")  # stdout
errorlog = os.getenv("GUNICORN_ERROR_LOG", "-")   # stderr
loglevel = os.getenv("LOG_LEVEL", "info").lower()

# Access log format with timing information
access_log_format = os.getenv(
    "GUNICORN_ACCESS_LOG_FORMAT",
    '%(h)s %(l)s %(u)s %(t)s "%(r)s" %(s)s %(b)s "%(f)s" "%(a)s" %(D)s'
)

# =====================================
# Process Naming & Management
# =====================================

proc_name = "mcpgateway"
daemon = False
pidfile = os.getenv("GUNICORN_PID_FILE", "/tmp/mcpgateway-gunicorn.pid")

# =====================================
# SSL/TLS Configuration (Optional)
# =====================================

if os.getenv("SSL", "false").lower() == "true":
    keyfile = os.getenv("KEY_FILE", "certs/key.pem")
    certfile = os.getenv("CERT_FILE", "certs/cert.pem")

# =====================================
# Server Hooks
# =====================================

def on_starting(server):
    """Called just before the master process is initialized."""
    print(f"🚀 MCP Gateway starting with {workers} workers")
    print(f"   Preload: {preload_app}")
    print(f"   Worker class: {worker_class}")

def on_reload(server):
    """Called to recycle workers during a reload."""
    print("♻️  Reloading workers...")

def worker_int(worker):
    """Called when a worker receives INT or QUIT signal."""
    print(f"⚠️  Worker {worker.pid} interrupted")

def worker_abort(worker):
    """Called when a worker times out."""
    print(f"❌ Worker {worker.pid} aborted (timeout after {timeout}s)")

def post_worker_init(worker):
    """Called after a worker is initialized."""
    print(f"✅ Worker {worker.pid} initialized")

def worker_exit(server, worker):
    """Called when a worker exits."""
    print(f"👋 Worker {worker.pid} exited")
# run-gunicorn.sh - Updated with production defaults

#!/bin/bash
set -e

# Default configuration (can be overridden by environment)
export GUNICORN_WORKERS=${GUNICORN_WORKERS:-auto}
export GUNICORN_TIMEOUT=${GUNICORN_TIMEOUT:-600}
export GUNICORN_PRELOAD_APP=${GUNICORN_PRELOAD_APP:-true}  # Changed to true
export GUNICORN_KEEPALIVE=${GUNICORN_KEEPALIVE:-5}
export GUNICORN_MAX_REQUESTS=${GUNICORN_MAX_REQUESTS:-100000}

# Start Gunicorn with config file
if [ -f "gunicorn.config.py" ]; then
    echo "Starting Gunicorn with gunicorn.config.py"
    exec gunicorn -c gunicorn.config.py mcpgateway.main:app
else
    echo "Warning: gunicorn.config.py not found, using inline config"
    # Resolve the "auto" default to the (2 * CPU) + 1 formula for the inline fallback
    if [ "${GUNICORN_WORKERS}" = "auto" ]; then
        GUNICORN_WORKERS=$(( 2 * $(nproc 2>/dev/null || getconf _NPROCESSORS_ONLN) + 1 ))
    fi
    exec gunicorn mcpgateway.main:app \
        --bind 0.0.0.0:${PORT:-4444} \
        --workers ${GUNICORN_WORKERS} \
        --worker-class uvicorn.workers.UvicornWorker \
        --timeout ${GUNICORN_TIMEOUT} \
        --preload \
        --keepalive ${GUNICORN_KEEPALIVE}
fi

📋 Implementation Tasks

Phase 1: Enable Preload by Default ✅

  • Update run-gunicorn.sh

    • Change GUNICORN_PRELOAD_APP default from false to true (line ~254)
    • Add comment explaining preload benefits (30-50% memory savings)
    • Add comment about preload trade-offs (slower reload)
    • Test with preload enabled: make serve
    • Measure memory usage per worker with ps aux | grep gunicorn
  • Verify Preload Behavior

    • Test startup time (should be slightly slower)
    • Test memory usage (should be 30-50% lower per worker)
    • Test reload behavior (SIGHUP should reload all workers)
    • Verify no issues with shared state

Phase 2: Create Comprehensive gunicorn.config.py ✅

  • Create Configuration File

    • Create gunicorn.config.py in project root
    • Add comprehensive comments for each section
    • Configure server socket (bind, backlog)
    • Configure workers (count, class, connections)
    • Configure worker lifecycle (max_requests, jitter)
    • Configure timeouts (timeout, graceful_timeout, keepalive)
    • Configure logging (access_log, error_log, format)
    • Configure SSL/TLS (optional, env-based)
  • Add Server Hooks

    • Implement on_starting() hook (startup message)
    • Implement on_reload() hook (reload message)
    • Implement worker_int() hook (interrupt handling)
    • Implement worker_abort() hook (timeout handling)
    • Implement post_worker_init() hook (worker ready message)
    • Implement worker_exit() hook (worker cleanup message)
  • Test Configuration

    • Test with gunicorn --check-config -c gunicorn.config.py mcpgateway.main:app
    • Start server: gunicorn -c gunicorn.config.py mcpgateway.main:app
    • Verify all hooks execute correctly
    • Verify environment variable overrides work

Phase 3: Optimize Worker Configuration ✅

  • Document Worker Count Formula

    • Add formula explanation in gunicorn.config.py
    • CPU-bound: workers = CPU cores + 1
    • I/O-bound (API): workers = (2 * CPU cores) + 1
    • Max recommended: 16 workers (diminishing returns)
    • Formula: min((2 * CPU + 1), 16)
  • Add Environment Variable Overrides

    • GUNICORN_WORKERS (default: auto-calculated)
    • GUNICORN_WORKER_CLASS (default: UvicornWorker)
    • GUNICORN_WORKER_CONNECTIONS (default: 1000)
    • Document all overrides in .env.example
  • Benchmark Worker Counts

    • Test with 1, 2, 4, 8 workers
    • Run load test: wrk -t4 -c100 -d30s http://localhost:4444/tools
    • Measure requests/sec, latency, CPU usage
    • Document optimal count for typical hardware

Phase 4: Optimize Timeout Settings ✅

  • Document Timeout Configuration

    • timeout: Worker silent for this long → restart (default: 600s)
    • graceful_timeout: Time to finish requests during shutdown (default: 30s)
    • keepalive: HTTP Keep-Alive timeout (default: 5s)
    • Add comprehensive comments in gunicorn.config.py
  • Add Environment Variable Overrides

    • GUNICORN_TIMEOUT (default: 600)
    • GUNICORN_GRACEFUL_TIMEOUT (default: 30)
    • GUNICORN_KEEPALIVE (default: 5)
    • Document all timeout settings in .env.example
  • Tune Timeouts for Scenarios

    • Long operations: timeout=3600 (1 hour)
    • Standard API: timeout=600 (10 minutes)
    • Fast responses: timeout=120 (2 minutes)
    • Document recommendations in performance guide

Phase 5: Configure Worker Lifecycle ✅

  • Document Worker Restart Settings

    • max_requests: Restart worker after N requests (prevents memory leaks)
    • max_requests_jitter: Add randomness (prevent all workers restarting at once)
    • Example: max_requests=100000, jitter=100 → each worker restarts after 100,000-100,100 requests
    • Add comprehensive comments in gunicorn.config.py
  • Tune max_requests Based on Profiling

    • Run load test and monitor memory growth per worker
    • Set max_requests to prevent workers exceeding memory limit
    • Default 100,000 is safe for most applications
    • Add jitter to prevent thundering herd (100-1000)
  • Test Worker Lifecycle

    • Verify workers restart after max_requests
    • Verify jitter prevents synchronized restarts
    • Verify no dropped requests during restart
    • Monitor logs for restart messages

Phase 6: Update .env.example ✅

  • Add Performance Configuration Section

    • Add section header: "# Performance Optimization"
    • Add subsection: "# Gunicorn Server Configuration"
    • Add subsection: "# HTTP/2 Support"
    • Add subsection: "# Response Compression"
    • Add subsection: "# JSON Serialization"
    • Add subsection: "# Static Asset Caching"
    • Add subsection: "# Response Caching (Redis)"
    • Add subsection: "# Database Connection Pooling"
    • Add subsection: "# Access Logging"
  • Document All Performance Settings

    • GUNICORN_WORKERS=auto (with explanation)
    • GUNICORN_WORKER_CLASS=uvicorn.workers.UvicornWorker
    • GUNICORN_TIMEOUT=600
    • GUNICORN_MAX_REQUESTS=100000
    • GUNICORN_MAX_REQUESTS_JITTER=100
    • GUNICORN_PRELOAD_APP=true
    • GUNICORN_GRACEFUL_TIMEOUT=30
    • GUNICORN_KEEPALIVE=5
    • HTTP2_ENABLED=true
    • COMPRESSION_ENABLED=true
    • COMPRESSION_MINIMUM_SIZE=500
    • COMPRESSION_LEVEL=6
    • JSON_SERIALIZER=orjson
    • STATIC_CACHE_MAX_AGE=31536000
    • CACHE_DEFAULT_TTL=300
    • CACHE_STALE_TTL=3600
    • CACHE_WARMING_ENABLED=true
    • DB_POOL_SIZE=200
    • DB_MAX_OVERFLOW=10
    • DB_POOL_TIMEOUT=30
    • DB_POOL_RECYCLE=3600
    • DISABLE_ACCESS_LOG=false

Phase 7: Performance Documentation ✅

  • Create Performance Tuning Guide

    • Create docs/performance/tuning-guide.md
    • Section: Worker Configuration (count, class, connections)
    • Section: Timeout Configuration (timeout, graceful, keepalive)
    • Section: Preload App (benefits, trade-offs, when to use)
    • Section: Worker Lifecycle (max_requests, memory leak prevention)
    • Section: Monitoring (metrics, dashboards, alerts)
    • Section: Scaling Strategies (vertical, horizontal, database, caching)
  • Create Benchmarking Guide

    • Create docs/performance/benchmarking-guide.md
    • Section: Tools (wrk, locust, ab, hey)
    • Section: Scenarios (single endpoint, mixed workload, concurrent users)
    • Section: Metrics (requests/sec, p50/p95/p99 latency, error rate)
    • Section: Baseline Benchmarks (for comparison)
    • Section: Interpreting Results (bottlenecks, optimization strategies)

Phase 8: Enhanced Health Check ✅

  • Enhance GET /health Endpoint

    • Add uptime_seconds field
    • Add performance section with worker count
    • Add cache_hit_ratio metric
    • Add avg_response_time_ms metric
    • Add db_pool_available connections
    • Add memory_used_mb metric
    • Return as JSON response
  • Add Detailed Health Check

    • Support GET /health?detailed=true (detailed mode via query parameter)
    • Include all metrics from basic health check
    • Add database connection status
    • Add Redis connection status
    • Add recent error counts
    • Document format in API docs
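
A minimal sketch of what the enhanced endpoint from this phase might return. The psutil dependency and the placeholder metric values are assumptions for illustration; a real handler would pull these figures from the gateway's own pools and counters.

# Hypothetical sketch of the enhanced /health endpoint (not the gateway's current code)
import os
import time

import psutil  # assumed optional dependency for process memory stats
from fastapi import FastAPI

app = FastAPI()
START_TIME = time.monotonic()

@app.get("/health")
async def health(detailed: bool = False):
    process = psutil.Process(os.getpid())
    payload = {
        "status": "healthy",
        "uptime_seconds": round(time.monotonic() - START_TIME, 1),
        "performance": {
            "memory_used_mb": round(process.memory_info().rss / 1024 / 1024, 1),
            # Placeholder values - wire these to real counters in the app
            "cache_hit_ratio": 0.0,
            "avg_response_time_ms": 0.0,
            "db_pool_available": 0,
        },
    }
    if detailed:
        payload["dependencies"] = {
            "database": "ok",  # replace with a real connectivity probe
            "redis": "ok",
        }
    return payload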

Phase 9: Prometheus Metrics ✅

  • Add Performance Metrics

    • worker_count gauge (number of workers)
    • worker_memory_bytes gauge per worker
    • request_duration_seconds histogram (p50, p95, p99)
    • active_requests gauge (current in-flight)
    • worker_restarts_total counter
  • Integrate into /metrics Endpoint

    • Verify metrics exposed on GET /metrics
    • Test Prometheus scraping
    • Verify labels work correctly
    • Document metrics in Prometheus format
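
A sketch of the Phase 9 metrics using the prometheus_client library. The metric names follow the list above; how they are registered and exposed on the gateway's /metrics endpoint is assumed for illustration.

# Hypothetical sketch of the Phase 9 performance metrics
from prometheus_client import Counter, Gauge, Histogram, generate_latest

WORKER_COUNT = Gauge("worker_count", "Number of Gunicorn workers")
WORKER_MEMORY = Gauge("worker_memory_bytes", "Resident memory per worker", ["pid"])
ACTIVE_REQUESTS = Gauge("active_requests", "Requests currently in flight")
WORKER_RESTARTS = Counter("worker_restarts_total", "Total worker restarts")
REQUEST_DURATION = Histogram(
    "request_duration_seconds",
    "Request latency in seconds",
    buckets=(0.01, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0, 10.0),
)

def observe_request(duration_seconds: float) -> None:
    """Record one request; p50/p95/p99 are derived from the histogram by Prometheus."""
    REQUEST_DURATION.observe(duration_seconds)

def metrics_payload() -> bytes:
    """Body for GET /metrics in the Prometheus text exposition format."""
    return generate_latest()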

Phase 10: Grafana Dashboard ✅

  • Create Dashboard Template

    • Create deployment/grafana/mcpgateway-dashboard.json
    • Panel: Request rate (requests/sec)
    • Panel: Response latency (p50, p95, p99)
    • Panel: Worker count and memory usage
    • Panel: Cache hit ratio
    • Panel: Database connection pool usage
    • Panel: Error rate (4xx, 5xx)
  • Document Dashboard Import

    • Add import instructions to docs
    • Document Prometheus data source configuration
    • Add screenshots of dashboard
    • Document alerting rules

Phase 11: Load Testing Scripts ✅

  • Create wrk Load Test Script

    • Create scripts/load-test/wrk-simple.lua
    • Add authorization header handling
    • Add custom header support
    • Document usage and examples
  • Create Locust Load Test Script (see the sketch after this phase)

    • Create scripts/load-test/locustfile.py
    • Define GatewayUser with wait times
    • Add task for GET /tools (weight: 3)
    • Add task for GET /servers (weight: 2)
    • Add task for GET /health (weight: 1)
    • Add task for POST /tools (weight: 1)
  • Create Load Test Documentation

    • Create scripts/load-test/README.md
    • Document wrk usage and examples
    • Document locust usage and examples
    • Document how to interpret results
    • Add baseline benchmark results
  • Run Baseline Load Tests

    • Run wrk: wrk -t4 -c100 -d30s http://localhost:4444/tools
    • Run locust: locust -f locustfile.py --host http://localhost:4444
    • Document results (req/s, latency, errors)
    • Create benchmark report
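
A sketch of the locustfile described above. The endpoint paths, task weights, and bearer-token header follow the task list; the POST payload shape and the GATEWAY_TOKEN variable are assumptions.

# Hypothetical sketch of scripts/load-test/locustfile.py
import os

from locust import HttpUser, between, task

class GatewayUser(HttpUser):
    wait_time = between(1, 3)  # seconds between tasks per simulated user

    def on_start(self):
        token = os.getenv("GATEWAY_TOKEN", "")
        if token:
            self.client.headers.update({"Authorization": f"Bearer {token}"})

    @task(3)
    def list_tools(self):
        self.client.get("/tools")

    @task(2)
    def list_servers(self):
        self.client.get("/servers")

    @task(1)
    def health(self):
        self.client.get("/health")

    @task(1)
    def create_tool(self):
        # Illustrative payload only - adjust to the gateway's tool schema
        self.client.post("/tools", json={"name": "load-test-tool", "url": "http://example.com"})

Run it against a local gateway with: locust -f scripts/load-test/locustfile.py --host http://localhost:4444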

Phase 12: Graceful Shutdown ✅

  • Review Lifespan Shutdown

    • Open mcpgateway/main.py lifespan function
    • Verify shutdown logic exists
    • Add connection drain period (await asyncio.sleep(1))
    • Ensure database connections closed
    • Ensure Redis connections closed
    • Ensure cache flushed (if needed)
  • Test Graceful Shutdown

    • Start server: make serve
    • Send SIGTERM: kill -TERM <pid>
    • Verify shutdown message in logs
    • Verify connections closed cleanly
    • Verify no error traces during shutdown
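
A sketch of a FastAPI lifespan handler with the shutdown steps from this phase. The drain sleep matches the task above; the database, Redis, and cache handles are placeholders for whatever mcpgateway/main.py actually holds.

# Hypothetical sketch of a lifespan with graceful-shutdown cleanup
import asyncio
from contextlib import asynccontextmanager

from fastapi import FastAPI

@asynccontextmanager
async def lifespan(app: FastAPI):
    # Startup: initialize pools, caches, background tasks here
    yield
    # Shutdown: brief drain so the load balancer stops routing new traffic
    await asyncio.sleep(1)
    # Then release resources (placeholders - use the app's real handles):
    # await db_engine.dispose()
    # await redis_client.aclose()
    # flush_caches()

app = FastAPI(lifespan=lifespan)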

Phase 13: Production Deployment Checklist ✅

  • Create Deployment Checklist
    • Create docs/deployment/production-checklist.md
    • ☑ Environment variables configured
    • ☑ Secrets rotated from defaults
    • ☑ TLS/SSL certificates valid
    • ☑ Worker count optimized for hardware
    • ☑ Timeouts appropriate for workload
    • ☑ Preload enabled (unless hot-reload needed)
    • ☑ Compression enabled
    • ☑ HTTP/2 enabled
    • ☑ Redis caching configured
    • ☑ Database pool sized correctly
    • ☑ Monitoring/metrics enabled
    • ☑ Health checks configured
    • ☑ Log aggregation setup
    • ☑ Backup/restore tested
    • ☑ Load testing completed
    • ☑ Rollback plan documented

Phase 14: Load Testing & Benchmarking ✅

  • Baseline Load Test (Default Settings)

    • Run wrk -t4 -c100 -d30s http://localhost:4444/health
    • Record: requests/sec, latency percentiles (p50, p95, p99)
    • Record: CPU usage, memory usage
    • Record: error rate
  • Optimized Load Test

    • Enable all optimizations (preload, compression, caching, HTTP/2, orjson)
    • Run same wrk test
    • Measure improvement percentage
    • Compare CPU/memory usage
  • Stress Test to Find Breaking Point

    • Gradually increase concurrent users (100, 500, 1000, 2000)
    • Identify bottlenecks (CPU, memory, DB, network)
    • Document breaking point and limits
    • Document recommendations for scaling

Phase 15: Documentation ✅

  • Update CLAUDE.md

    • Add section on production tuning
    • Document gunicorn.config.py configuration
    • Explain preload_app benefits
    • Add performance optimization checklist
  • Create Performance Guide

    • Consolidate tuning guide
    • Consolidate benchmarking guide
    • Add troubleshooting section
    • Add optimization flowchart

Phase 16: Quality Assurance ✅

  • Code Quality

    • Run make autoflake isort black to format code
    • Run make flake8 and fix any issues
    • Run make pylint and address warnings
    • Pass make verify checks
  • Testing

    • Verify all existing tests still pass
    • Test graceful shutdown
    • Test worker reload (SIGHUP)
    • Test with different worker counts
    • Test under load

✅ Success Criteria

  • Gunicorn preload enabled by default in run-gunicorn.sh
  • Comprehensive gunicorn.config.py created with all production settings
  • .env.example updated with all performance flags and documentation
  • Performance tuning guide written with worker/timeout/preload documentation
  • Benchmarking guide created with tools and scenarios
  • Load testing scripts created (wrk, locust) and tested
  • Production deployment checklist complete
  • Baseline benchmarks documented for comparison
  • Performance monitoring configured (Prometheus, Grafana)
  • Graceful shutdown tested and working
  • Load tests show measurable improvements with optimizations
  • Memory usage reduced by 30-50% with preload
  • Documentation complete and accurate

🏁 Definition of Done

  • run-gunicorn.sh updated with preload=true default
  • gunicorn.config.py created with optimized production settings
  • Worker configuration optimized and documented (formula, overrides)
  • Timeout settings tuned and documented (timeout, graceful, keepalive)
  • Worker lifecycle configured (max_requests, jitter)
  • .env.example updated with comprehensive performance section
  • Performance tuning guide written (worker, timeout, preload, monitoring)
  • Benchmarking guide written (tools, scenarios, metrics, interpretation)
  • Load testing scripts created (wrk, locust) with documentation
  • Production deployment checklist created
  • Graceful shutdown tested (SIGTERM, connection cleanup)
  • Health check enhanced with performance metrics
  • Performance monitoring dashboard template created (Grafana)
  • Baseline load testing completed with results documented
  • Code passes make verify checks
  • Documentation complete (CLAUDE.md, guides, checklist)
  • Ready for production deployment

📝 Additional Notes

🔹 Worker Count Guidelines:

  • Development: 1-2 workers (easier debugging, hot reload)
  • Small deployment (2GB RAM, 2 CPU): 2-4 workers
  • Medium deployment (4GB RAM, 4 CPU): 4-8 workers
  • Large deployment (8GB+ RAM, 8+ CPU): 8-16 workers
  • Rule: Don't exceed 16 workers per instance (use horizontal scaling instead)
  • Formula: min((2 * CPU + 1), 16) for I/O-bound apps

🔹 Memory Usage (typical):

  • Base app: ~100MB per worker without preload
  • With preload: ~60MB per worker (40% reduction)
  • Under load: +20-50MB per worker (request processing)
  • Total estimate: (workers * 60MB) + overhead with preload
  • Example: 8 workers = 480MB + overhead = ~600MB total

🔹 Performance Bottlenecks (ranked by impact):

  1. Database queries: Cache aggressively, use indexes, optimize queries
  2. JSON serialization: Use orjson (2-3x faster than stdlib)
  3. Network I/O: Enable compression, HTTP/2, connection pooling
  4. Python GIL: Use multiple workers (not threads) for parallelism
  5. Memory allocation: Enable preload, tune max_requests to prevent leaks
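
As a concrete example of bottleneck 2, FastAPI can serve orjson-encoded responses by default. This assumes the optional orjson dependency is installed; wiring it to the JSON_SERIALIZER=orjson flag is left to the gateway.

# Switching FastAPI's default JSON responses to orjson
from fastapi import FastAPI
from fastapi.responses import ORJSONResponse  # requires the orjson package

app = FastAPI(default_response_class=ORJSONResponse)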

🔹 Monitoring Checklist:

  • ✅ Request rate (requests/sec) - track trends over time
  • ✅ Response time (p50, p95, p99) - identify slow requests
  • ✅ Error rate (4xx, 5xx) - catch issues early
  • ✅ Worker count and restarts - detect crashes
  • ✅ Memory usage per worker - detect memory leaks
  • ✅ CPU utilization - identify CPU bottlenecks
  • ✅ Database connection pool - prevent exhaustion
  • ✅ Cache hit ratio - optimize TTLs
  • ✅ Disk I/O (logs, database) - prevent I/O bottlenecks

🔹 Scaling Strategies:

  • Vertical: Increase CPU/RAM, add more workers (limited by hardware)
  • Horizontal: Add more servers behind load balancer (unlimited scaling)
  • Database: Read replicas, connection pooling, query optimization
  • Caching: Redis cluster, CDN for static assets, aggressive TTLs
  • Microservices: Split gateway into smaller services (tools, resources, admin)

🔹 Preload Trade-offs:

  • Benefits: 30-50% memory savings, faster worker startup
  • Drawbacks: Slower reload (must restart all workers), shared state issues
  • When to use: Production (memory efficiency matters)
  • When to disable: Development (need hot reload), code has shared mutable state

🔹 Timeout Guidelines:

  • timeout=600: Good default for most API operations (10 minutes)
  • timeout=3600: For long-running operations (1 hour)
  • timeout=120: For fast APIs (2 minutes)
  • graceful_timeout=30: Standard graceful shutdown period
  • keepalive=5: Standard HTTP Keep-Alive (5-10 seconds)

🔹 Load Testing Best Practices:

  • Start low: Begin with low concurrency, gradually increase
  • Warm up: Run for 1-2 minutes before measuring
  • Measure steady state: Record metrics after warm-up
  • Monitor resources: Watch CPU, memory, network, disk I/O
  • Test scenarios: Single endpoint, mixed workload, edge cases
  • Compare baselines: Always compare against previous benchmarks


Labels

enhancement (New feature or request), performance (Performance related items)
