
Invariants0/Vybe


Vybe 🔗

Production-grade URL Shortener built for resilience, observability, and scale. Engineered for the MLH Production Engineering Quest with full incident response and reliability testing.



🚀 Quick Start

Get Vybe up and running in 30 seconds:

git clone https://github.com/Invariants0/Vybe.git && cd Vybe
just dev-up

Note

Make sure you have `overmind` and `just` installed.
Visit http://localhost to access the dashboard.
API Docs: http://localhost/api/v1/docs
Monitoring: Grafana at http://localhost:3001 (admin/admin)


✨ Features at a Glance

  • ⚡ Blazing Fast: 45ms P95 latency at 500 RPS.
  • 🛡️ Built for Failure: Resilient to DB/Cache outages and container crashes.
  • 👁️ Full Visibility: Prometheus metrics, Grafana dashboards, and structured JSON logs.
  • 🧪 Battle Tested: 7 verified failure scenarios and automated integration tests.
  • 📖 Operator-First: Comprehensive runbooks, architecture guides, and capacity plans.

🏗️ Architecture (Quick Overview)

graph TD
    User([External Request]) --> Nginx[NGINX Reverse Proxy]
    Nginx --> App1[Flask App Instance 1]
    Nginx --> App2[Flask App Instance 2]
    
    App1 --> DB[(PostgreSQL)]
    App2 --> DB
    App1 -.-> Cache((Redis))
    App2 -.-> Cache
    
    subgraph Observability
        App1 & App2 & DB --> Prom[Prometheus]
        Prom --> Graf[Grafana + AlertManager]
    end
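
The dotted edges to Redis in the diagram suggest the cache is optional on the request path. A minimal sketch of that idea follows, assuming a cache-aside read that degrades to PostgreSQL when Redis is unreachable. The names `resolve_short_code`, `CacheUnavailable`, and the callables passed in are hypothetical and are not taken from the Vybe codebase.

```python
from typing import Callable, Optional


class CacheUnavailable(Exception):
    """Raised by the (hypothetical) cache client when Redis cannot be reached."""


def resolve_short_code(
    code: str,
    cache_get: Callable[[str], Optional[str]],
    cache_set: Callable[[str, str], None],
    db_get: Callable[[str], Optional[str]],
) -> Optional[str]:
    """Cache-aside lookup: try the cache first, fall back to the database."""
    try:
        cached = cache_get(code)
        if cached is not None:
            return cached
    except CacheUnavailable:
        pass  # Cache outage: keep serving from the database only.

    url = db_get(code)
    if url is not None:
        try:
            cache_set(code, url)  # Populate the cache for the next reader.
        except CacheUnavailable:
            pass  # A failed cache write must not fail the request.
    return url


if __name__ == "__main__":
    # Plain dicts stand in for PostgreSQL and Redis in this sketch.
    db = {"abc": "https://example.com"}
    cache = {}
    print(resolve_short_code("abc", cache.get, cache.__setitem__, db.get))
```

Passing the cache and database accessors in as callables keeps the sketch independent of any particular Redis or SQLAlchemy client.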

📚 Documentation

👨‍💻 For Developers

Understand the codebase and start contributing:

  1. Quick Start Index - Orientation (5m)
  2. Architecture Guide - Deep dive (15m)
  3. API Reference - All 18 endpoints
  4. Local Dev Setup - Environment config

🛠️ For DevOps / SRE

Operational guidance for production:

  1. Deployment Guide - Local to Cloud
  2. Config Reference - Env var tuning
  3. Capacity Plan - Scaling limits
  4. Runbooks - Incident procedures

🚨 For On-Call Engineers

Fast response when things break:

  1. Incident Runbooks - Step-by-step fixes
  2. Troubleshooting Guide - Root cause diagnosis
  3. Alert Definitions - What each alert means

📑 Complete Documentation Index

| Document | Audience | Time |
|---|---|---|
| Architecture | Engineers | 40m |
| API Reference | Devs | 20m |
| Deployment | SRE | 30m |
| Troubleshooting | On-Call | 25m |
| Runbooks | On-Call | 30m |
| Decision Log | Architects | 20m |

⚙️ Setup & Operations

🛠️ Manual Development Setup

  1. Prepare Environment:
    uv sync
    cp backend/.env.example backend/.env
  2. Start Dependencies:
    docker compose up -d db redis
  3. Init & Run:
    uv run python scripts/init_db.py
    uv run python run.py

Important

Ensure your .env contains the correct database credentials. The system uses specific passwords for Redis and Grafana by default (see docker-compose.yml).

Tip

To optimize performance for 500+ RPS, ensure Redis is healthy and REDIS_CACHE_ENABLED is set to true.
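
How the backend reads `REDIS_CACHE_ENABLED` is not shown in this README. As a rough sketch, a boolean environment flag is often parsed along these lines; the helper name `env_flag` and the accepted truthy spellings are assumptions, not Vybe's actual behavior.

```python
import os
from typing import Mapping, Optional

# Spellings treated as "enabled" in this sketch (an assumption).
_TRUTHY = {"1", "true", "yes", "on"}


def env_flag(
    name: str,
    default: bool = False,
    environ: Optional[Mapping[str, str]] = None,
) -> bool:
    """Read a boolean flag from the environment, case-insensitively."""
    env = os.environ if environ is None else environ
    raw = env.get(name)
    if raw is None:
        return default
    return raw.strip().lower() in _TRUTHY


if __name__ == "__main__":
    # Reports whether the cache would be enabled under this parsing rule.
    print(env_flag("REDIS_CACHE_ENABLED"))
```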

📁 Project Structure
backend/       # Flask API, SQLAlchemy models, Business logic
frontend/      # Next.js Dashboard and UI
infra/         # Nginx configs, Dockerfiles
monitoring/    # Prometheus & Grafana provisioning
scripts/       # DB init, Chaos testing, Load testing
tests/         # Unit and Integration test suites

📊 Observability & Monitoring

Vybe tracks RED metrics (Rate, Errors, Duration) for every request.

| Alert | Trigger | Recovery Action |
|---|---|---|
| Instance Down | 0 healthy targets | Auto-failover by Nginx |
| High Error Rate | >5% failures | Log analysis via Grafana |
| P95 Latency | >1s duration | Scale app or check DB load |
| DB Pool Exhaust | >90% usage | Increase DB_POOL_SIZE |
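
The thresholds in the alert table can be illustrated with a small, self-contained sketch. In Vybe these alerts are evaluated from Prometheus metrics, so the code below is only an approximation of the rules: the `Request` record, the `firing_alerts` function, the nearest-rank P95 method, and counting "failures" as HTTP 5xx responses are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Request:
    duration_s: float  # Request duration in seconds.
    status: int        # HTTP status code.


def p95(durations: Sequence[float]) -> float:
    """Nearest-rank 95th percentile; returns 0.0 for an empty sample."""
    if not durations:
        return 0.0
    ordered = sorted(durations)
    rank = (95 * len(ordered) + 99) // 100  # Integer ceil(0.95 * n).
    return ordered[rank - 1]


def firing_alerts(
    requests: Sequence[Request],
    healthy_targets: int,
    pool_in_use: int,
    pool_size: int,
) -> List[str]:
    """Return the names of the alert rules whose thresholds are exceeded."""
    alerts: List[str] = []
    if healthy_targets == 0:
        alerts.append("Instance Down")
    if requests:
        # Assumption: a "failure" is any 5xx response.
        errors = sum(1 for r in requests if r.status >= 500)
        if errors / len(requests) > 0.05:
            alerts.append("High Error Rate")
        if p95([r.duration_s for r in requests]) > 1.0:
            alerts.append("P95 Latency")
    if pool_size > 0 and pool_in_use / pool_size > 0.90:
        alerts.append("DB Pool Exhaust")
    return alerts


if __name__ == "__main__":
    sample = [Request(0.05, 200)] * 94 + [Request(2.0, 500)] * 6
    print(firing_alerts(sample, healthy_targets=2, pool_in_use=5, pool_size=10))
```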
🧪 Testing & Coverage
uv run pytest tests/ --cov=backend  # All tests
uv run pytest tests/unit/           # Unit only

Warning

Current coverage is 45%. Priority: increase coverage for the url_service and auth modules.

⚠️ Safety & Best Practices

Caution

Never run manual DELETE statements against the PostgreSQL urls table in production. Rows removed this way can still be served from the cache, which causes cache inconsistency. Use the SCRUB_DATA API endpoint instead.
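
To see why a direct DELETE is risky, consider a toy model in which reads go through a cache. This is only a sketch: `UrlStore` and its methods are hypothetical in-memory stand-ins, and the real behavior of the SCRUB_DATA endpoint is not documented in this README.

```python
from typing import Dict, Optional


class UrlStore:
    """In-memory stand-in for the urls table (db) and its cache."""

    def __init__(self) -> None:
        self.db: Dict[str, str] = {}
        self.cache: Dict[str, str] = {}

    def create(self, code: str, url: str) -> None:
        self.db[code] = url

    def resolve(self, code: str) -> Optional[str]:
        """Read through the cache, populating it on a miss."""
        if code in self.cache:
            return self.cache[code]
        url = self.db.get(code)
        if url is not None:
            self.cache[code] = url
        return url

    def delete(self, code: str) -> None:
        """Application-level delete: remove the row and invalidate the cache."""
        self.db.pop(code, None)
        self.cache.pop(code, None)


if __name__ == "__main__":
    store = UrlStore()
    store.create("abc", "https://example.com")
    store.resolve("abc")           # Warms the cache.
    del store.db["abc"]            # Simulates a manual DELETE on the table.
    print(store.resolve("abc"))    # The cached entry is still served.
    store.delete("abc")            # Going through the application path.
    print(store.resolve("abc"))    # The entry is gone from both layers.
```

The manual delete bypasses the invalidation step, so the cache keeps answering for a row that no longer exists. Routing deletions through an application-level path avoids that gap.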


🏆 Quest Status (Tiers Complete)

  • Bronze: Architecture & Data Flow documented.
  • Silver: 45-min Deployment guide & Config reference complete.
  • Gold: 7 failure scenarios tested and documented in Runbooks.

Incident Verification (Apr 5, 2026): tested Database Down, CPU Spike, Redis Loss, and High Error Rate. Result: 100% Resilience Success.


🔮 Future Scope

  • Redis Cluster: High availability for caching.
  • Read Replicas: Scale to 2000+ RPS.
  • Rate Limiting: Per-IP/User throttling.
  • Custom Domains: Enterprise link support.

❓ FAQ

Is this production-grade?
Yes. It includes graceful failure handling, health checks, connection pooling, and automated failover.

How do I test the resilience?
Run `bash scripts/chaos.sh` to simulate failures and watch the Grafana alerts trigger.

📄 Support & License

  • Emergency? Check Runbooks.
  • Issues? Raise a GitHub Issue.
  • License: Apache 2.0 (Commercial use allowed).

Maintained for Production Performance (April 2026)
