🌐 Live Demo: auto-sre.vercel.app
A production-like Site Reliability Engineering platform built to demonstrate observability, failure handling, and automated recovery in distributed systems.
AutoSRE simulates real-world production challenges that SRE teams face daily. It's a working demonstration of how to design, monitor, break, and heal microservices at scale.
Built to showcase:
- Instrumentation and observability (Prometheus metrics)
- Self-healing patterns (circuit breakers, retries)
- Chaos engineering (controlled failure injection)
- Incident response (automated recovery + postmortems)
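Self-healing here means patterns like retries with exponential backoff. A minimal sketch of the idea (names and thresholds are illustrative, not the project's actual code):

```python
import random
import time

def retry(attempts=3, base_delay=0.1):
    """Retry a flaky call with exponential backoff and jitter."""
    def decorator(fn):
        def wrapper(*args, **kwargs):
            for attempt in range(attempts):
                try:
                    return fn(*args, **kwargs)
                except Exception:
                    if attempt == attempts - 1:
                        raise  # retry budget exhausted, surface the error
                    # back off: 0.1s, 0.2s, 0.4s... plus a little jitter
                    time.sleep(base_delay * 2**attempt + random.uniform(0, 0.05))
        return wrapper
    return decorator

calls = {"n": 0}

@retry(attempts=3)
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient failure")
    return "ok"

print(flaky())  # succeeds on the third attempt: prints "ok"
```

The jitter prevents a fleet of retrying clients from hammering a recovering service in lockstep.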
Phase 1: Instrumented API Service (Complete)
- Instrumented API Service - FastAPI with RED metrics (Rate, Errors, Duration)
- Prometheus Integration - Counter, Histogram, and Gauge metrics with proper labeling
- Health Checks - Liveness probes for container orchestration
- Docker Containerization - Reproducible deployments across environments
- Auto-generated API Docs - Interactive documentation at `/docs`
Phase 2: Observability Stack (In Progress)
- Prometheus auto-scraping at a 15s interval
- Grafana dashboards with SLO tracking
- Alert definitions based on SLI violations
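The 15s scrape setup looks roughly like this in `prometheus.yml` (the job and target names are assumptions based on a typical docker-compose service layout):

```yaml
# prometheus.yml (sketch; the API service is assumed to be named `api` in docker-compose)
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: "api"
    metrics_path: /metrics
    static_configs:
      - targets: ["api:8000"]
```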
Phase 3: Multi-Service Architecture
- Worker service (async job processing)
- Auth service (with circuit breaker pattern)
- PostgreSQL + Redis integration
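The circuit breaker pattern mentioned for the auth service can be sketched in a few lines of Python (thresholds and names are illustrative):

```python
import time

class CircuitBreaker:
    """Open the circuit after N consecutive failures; probe again after a cooldown."""

    def __init__(self, max_failures=3, reset_timeout=30.0):
        self.max_failures = max_failures
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None  # half-open: let one probe call through
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0  # a success closes the circuit fully
        return result
```

With a breaker in front of a struggling dependency, repeated failures turn into instant errors instead of piled-up timeouts, giving the dependency room to recover.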
Phase 4: Chaos Engineering
- Failure injection endpoints
- Latency injection
- Database connection failures
- Self-healing validation
Phase 5: Production Readiness
- Incident postmortems (3 documented scenarios)
- Runbook automation
- Load testing results
- Performance benchmarks
Prerequisites:
- Docker Desktop installed
- Git
```bash
# Clone the repository
git clone https://github.com/Raynzler/Auto-SRE.git
cd Auto-SRE

# Start all services
docker-compose up --build
```
- API Service: http://localhost:8000
- API Docs: http://localhost:8000/docs
- Metrics Endpoint: http://localhost:8000/metrics
- Prometheus: http://localhost:9090 (Phase 2)
- Grafana: http://localhost:3000 (Phase 2)
```bash
# Health check
curl http://localhost:8000/health

# Create an order
curl -X POST http://localhost:8000/orders \
  -H "Content-Type: application/json" \
  -d '{"item": "laptop", "quantity": 1}'

# View metrics
curl http://localhost:8000/metrics
```

Backend: Python, FastAPI, Pydantic
Observability: Prometheus, Grafana
Infrastructure: Docker, Docker Compose
Database: PostgreSQL (Phase 3)
Cache: Redis (Phase 3)
Proxy: Nginx (Phase 3)
Deployment: Vercel (landing page), Docker (services)
- Request Rate: Requests per second by endpoint and method
- Error Rate: Percentage of failed requests (4xx, 5xx)
- Latency: p50, p95, p99 response times
- Service Health: Binary indicator (up/down)
- Saturation: Resource usage (CPU, memory) (Phase 3)
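In PromQL, the latency percentiles and error rate above come out of the RED metrics directly (the metric names here are assumptions matching a typical histogram/counter setup):

```promql
# p95 latency per endpoint over the last 5 minutes
histogram_quantile(
  0.95,
  sum by (le, endpoint) (rate(http_request_duration_seconds_bucket[5m]))
)

# Error rate: fraction of requests returning 5xx
sum(rate(http_requests_total{status=~"5.."}[5m]))
  / sum(rate(http_requests_total[5m]))
```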
This project demonstrates:
- ✅ SRE fundamentals (SLIs, SLOs, error budgets)
- ✅ Instrumentation best practices (RED metrics)
- ✅ Container orchestration
- ✅ Failure mode analysis
- ✅ Automated recovery patterns
- ✅ Incident documentation
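As a concrete example of the error-budget arithmetic behind SLOs (a 99.9% availability target over 30 days is assumed here; the project's actual targets may differ):

```python
SLO = 0.999        # availability target
WINDOW_DAYS = 30   # rolling SLO window

window_minutes = WINDOW_DAYS * 24 * 60
# The error budget is the downtime the SLO permits within the window.
budget_minutes = (1 - SLO) * window_minutes

print(f"{budget_minutes:.1f} minutes of downtime allowed per {WINDOW_DAYS} days")
# prints "43.2 minutes of downtime allowed per 30 days"
```

Every failed request or outage minute spends budget; when the budget is gone, the usual SRE response is to freeze risky launches and invest in reliability.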
- Architecture Decisions (coming soon)
- Incident Postmortems (Phase 5)
- Runbook (Phase 5)
This is a personal learning project, but feedback is welcome! Open an issue or PR if you spot improvements.
Built by: Hamza Shaikh
GitHub: @Raynzler
Project Status: 🟢 Active Development
This project is designed for portfolio demonstration and SRE role interviews. It mirrors production patterns used at companies like Google, Netflix, and Stripe.