SentinelOps AI is an enterprise-grade, autonomous SRE (Site Reliability Engineering) and DevOps incident response platform. Built on an agent swarm architecture using LangGraph, it orchestrates specialized AI nodes to detect anomalies, query RAG databases for runbooks, investigate git commit histories, execute remediation plans, and generate production-ready Markdown postmortem reports.
[OUTAGE INJECTED]
│
▼
┌───────────────┐ Streams Logs, Metrics, & Alerts
│ Incident ├────────────────────────────────────────┐
│ Simulator │ │
└───────────────┘ ▼
┌───────────────────┐
│ Monitoring Agent │
└─────────┬─────────┘
│ (Anomalous Service Isolated)
▼
┌───────────────────┐
│ RCA Agent │ ◄─── Reviews commits, diffs, & logs
└─────────┬─────────┘
│
┌─────────────┴─────────────┐
│ Is confidence > 80%? │
├─────────────┬─────────────┤
│ Yes │ No │
▼ ▼ ▼
┌──────────────┐ ┌──────────────┐
│ Resolution │ │ Retrieval │ ◄─── Pulls SRE runbooks & history
│ Agent │ │ Agent │
└──────┬───────┘ └──────┬───────┘
│ │
│ ▼
│ ┌──────────────┐
│ │ RCA Agent │ (Re-evaluates with retrieved guides)
│ └──────┬───────┘
│ │
▼◄───────────────────┘
┌──────────────┐
│ Postmortem │
│ Agent │ ◄─── Compiles Markdown SRE Incident Postmortem Report
└──────────────┘
Our LangGraph plexus orchestrates four specialized nodes for investigation and mitigation:
- Monitoring and Detection Agent: Parses dynamic telemetry streams and log records to identify anomalies and isolate the affected microservice.
- Root Cause Analysis (RCA) Agent: Examines system error traces, container environments, and git repository commits to isolate buggy lines.
- SRE RAG Retrieval Agent: Queries a localized semantic vector store of past outages and Kubernetes cheat sheets to fetch relevant runbooks.
- Postmortem and Resolution Agent: Formulates step-by-step remediation plans and compiles detailed incident reports.
An interactive, high-fidelity SVG network topology graph replaces simple visualizations:
- Active Node Highlighting: Connects directly to the backend SSE events to illuminate the executing agent node.
- Animated Data Packets: Renders sliding packet elements flowing along connection vectors to illustrate system data paths.
- Dynamic Context Inspector: Includes an interactive sidebar that displays CPU, memory, and functional specs when clicking nodes.
A secure, hands-free voice command system powered by Web Audio and Web Speech APIs:
- Clap Spike Trigger: Uses Web Audio API analyser nodes to monitor sound slopes. A sharp clap or snap spikes volume and automatically powers the uplink.
- Wake Word Trigger: Listens silently for the keyword "Sentinel Activate" or "Sentinel" to toggle listening.
- Persistent Connection: Uses reference-state monitoring (
isListeningRef) to automatically restart speech recognition on silence timeouts, ensuring an infinite hands-free link. - SRE Action Parser: Voice commands map directly to dashboard actions:
- "Trigger incident" / "Inject fault" -> Initiates random outage simulation.
- "Deploy swarm" / "Analyze logs" -> Spins up the LangGraph diagnostic team.
- "Resolve outage" / "Rollback commit" -> Reverts the bug and restores systems.
- "Status check" / "Explain status" -> Synthesizes vocal briefings.
Provides five curated, state-of-the-art themes that persist via localStorage and dynamically transition background colors, highlighting colors, border-radius layouts, and font hierarchies:
- Cyber Obsidian (Default): Sleek green highlights, round borders, modern sans-serif fonts.
- Nebula Abyss: Deep space violet/magenta hues, high rounded container headers, Orbitron sci-fi fonts.
- Crimson Protocol: Cyber-alert rose/maroon shades, industrial sharp square panels, Share Tech Mono monospace fonts.
- Matrix Code: Retro neon lime highlights, light technical borders, Japanese DotGothic grid styling.
- Solar Flare: Aerospace technical amber/orange highlights, compact container sizing, Fira Code font stack.
SentinelOps AI simulates real-world production crash scenarios to prove agent reasoning depth:
DB_CONNECTION_EXHAUSTION(Severity: CRITICAL): An unreleased database socket inpayment-servicethreadpool workers starves Hikari connections, triggering 504 Gateway errors.AUTH_SERVICE_MEMORY_LEAK(Severity: HIGH): An unbounded global caching variable inauth-serviceleaks V8 heap space, provoking cgroup memory eviction (OOMKilled, ExitCode 137).API_GATEWAY_TIMEOUT(Severity: HIGH): Nginx upstream route updates point towards an unregistered internal cluster hostname, leading to proxy read timeouts.MISSING_ENV_CONFIG(Severity: CRITICAL): A recently rate-limiter upgrade asserts strict environment checks (os.environ['REDIS_URL']) without configuration definitions, placing pods in a persistentCrashLoopBackOff.DISK_SPACE_EXHAUSTION_LOGGER(Severity: MEDIUM): Disabled log rotation merged with excessive DEBUG level transaction traces triggers host system DiskPressure warnings.
To boot up the SRE Command Center in your local development environment:
Initialize the virtual environment and boot the Uvicorn server:
cd backend
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
python run.pyThe server will start on http://localhost:8000 with active Server-Sent Events (SSE) channels.
Install npm dependencies and spin up Vite:
cd frontend
npm install
npm run devThe dashboard will be served on http://localhost:5173.
Developed and maintained by Arjun R.