AI-powered incident root cause analysis platform for DevOps and SRE teams.
Problem: During outages, engineers lose valuable time searching logs, dashboards, and alerts to identify the root cause.
Solution: An AI agent that connects to monitoring tools such as Datadog, Grafana, or New Relic, analyzes logs and incidents in real time, identifies probable root causes, and suggests fixes.
- Multi-agent workflow orchestration using LangGraph
- Retrieval-Augmented Generation (RAG) for historical incident matching
- FAISS vector similarity search
- Monitoring platform connector architecture
- Automated incident timeline generation
- Impacted service detection
- Dynamic incident metrics visualization
- AI system evaluation dashboard
- Downloadable incident reports
- Streamlit-based observability dashboard
- Python
- Streamlit
- LangGraph
- FAISS
- Groq LLM API
- SentenceTransformers
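
The retrieval step behind historical incident matching can be sketched as follows. This is a minimal, dependency-free illustration and not the project's implementation: it swaps SentenceTransformers embeddings for simple bag-of-words vectors and the FAISS index for a brute-force cosine-similarity scan. The incident records and the names `embed`, `cosine`, and `top_k_similar` are hypothetical.

```python
import math
from collections import Counter

# Hypothetical historical incidents (illustrative data, not from the project).
HISTORY = [
    {"id": "INC-1", "summary": "database connection pool exhausted", "rca": "pool size too small"},
    {"id": "INC-2", "summary": "payment service high latency after deploy", "rca": "slow query introduced"},
    {"id": "INC-3", "summary": "disk full on logging node", "rca": "log rotation disabled"},
]


def embed(text):
    """Stand-in for a sentence embedding: lowercase token counts."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse token-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k_similar(query, incidents, k=2):
    """Return the k (score, incident) pairs most similar to the query."""
    q = embed(query)
    scored = [(cosine(q, embed(inc["summary"])), inc) for inc in incidents]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


matches = top_k_similar("checkout errors: database connection pool exhausted", HISTORY)
for score, incident in matches:
    print(f"{incident['id']} score={score:.2f} past RCA: {incident['rca']}")
```

In a RAG setup, the retrieved incidents and their past root causes would then be passed to the LLM as context for the new incident.
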
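Clone the repository and create a virtual environment:

    git clone https://github.com/Anucool419/OpsMind-AI.git
    cd OpsMind-AI
    python -m venv venv

Activate environment (Windows):

    venv\Scripts\activate

Activate environment (macOS/Linux):

    source venv/bin/activate

Install dependencies:

    pip install -r requirements.txt

Create a .env file:

    GROQ_API_KEY=your_api_key

Run the app:

    streamlit run app/streamlit_app.py
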
Video link: https://www.youtube.com/watch?v=OTj5cE5ortQ

Deployed link: https://opsmind-ai-fuonkmwfprksqhivxcddh6.streamlit.app/

OpsMind AI includes an evaluation layer to measure system reliability and incident analysis quality.
| Metric | Description |
|---|---|
| Retrieval Accuracy | Measures whether relevant historical incidents were retrieved correctly |
| RCA Match Accuracy | Measures similarity between generated RCA and expected RCA |
| Severity Accuracy | Evaluates incident severity classification correctness |
| Average Latency | Measures end-to-end AI analysis response time |
| Correlation Confidence | Indicates confidence in incident correlation analysis |
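
As a rough illustration of how the metrics in the table above might be computed from labeled test cases, here is a minimal sketch. The `EvalCase` fields, the `evaluate` function, and the use of `difflib.SequenceMatcher` as the RCA similarity measure are all assumptions made for this example; the project may compute these differently. Correlation Confidence is left out because the table does not say how it is derived.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from statistics import mean


@dataclass
class EvalCase:
    """One labeled evaluation case (hypothetical schema)."""
    expected_incident_id: str
    retrieved_incident_ids: list
    expected_rca: str
    generated_rca: str
    expected_severity: str
    predicted_severity: str
    latency_seconds: float


def evaluate(cases):
    """Aggregate per-case results into the dashboard metrics."""
    retrieval_hits = [
        1.0 if c.expected_incident_id in c.retrieved_incident_ids else 0.0
        for c in cases
    ]
    rca_scores = [
        SequenceMatcher(None, c.expected_rca.lower(), c.generated_rca.lower()).ratio()
        for c in cases
    ]
    severity_hits = [
        1.0 if c.expected_severity == c.predicted_severity else 0.0
        for c in cases
    ]
    return {
        "retrieval_accuracy": mean(retrieval_hits),
        "rca_match_accuracy": mean(rca_scores),
        "severity_accuracy": mean(severity_hits),
        "average_latency_seconds": mean(c.latency_seconds for c in cases),
    }


# Illustrative cases only; not real evaluation data.
cases = [
    EvalCase("INC-1", ["INC-1", "INC-3"], "pool size too small",
             "pool size too small", "high", "high", 2.0),
    EvalCase("INC-2", ["INC-3"], "slow query introduced",
             "disk full", "medium", "high", 4.0),
]
report = evaluate(cases)
print(report)
```

A string-similarity ratio is only a crude proxy for RCA quality; an embedding-based or human-reviewed score would be a reasonable alternative.
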
- Real-time observability ingestion
- Slack/MS Teams alert integrations
- Kubernetes event streaming
- Live Datadog/New Relic APIs
- Autonomous remediation agents
- Multi-tenant incident analytics
This project uses simulated observability logs and monitoring connectors to demonstrate incident analysis workflows in a production-inspired environment. The architecture is designed to support integration with real monitoring platforms such as Datadog, Grafana, and New Relic APIs.
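
One way such a connector layer could be structured is sketched below: simulated and real sources share one interface, so the analysis code does not change when a live platform is plugged in. `LogEvent`, `MonitoringConnector`, `SimulatedConnector`, `impacted_services`, and `build_timeline` are hypothetical names for this example, and no real Datadog, Grafana, or New Relic API calls are shown.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class LogEvent:
    timestamp: datetime
    service: str
    level: str
    message: str


class MonitoringConnector(ABC):
    """Common interface every monitoring source would implement."""

    @abstractmethod
    def fetch_events(self, since):
        """Return LogEvent objects with timestamp >= since."""


class SimulatedConnector(MonitoringConnector):
    """Serves pre-generated events instead of calling a live API."""

    def __init__(self, events):
        self._events = list(events)

    def fetch_events(self, since):
        return [e for e in self._events if e.timestamp >= since]


def impacted_services(events):
    """Services with at least one ERROR-level event, sorted by name."""
    return sorted({e.service for e in events if e.level == "ERROR"})


def build_timeline(events):
    """Chronological, human-readable incident timeline."""
    ordered = sorted(events, key=lambda e: e.timestamp)
    return [f"{e.timestamp:%H:%M:%S} [{e.service}] {e.level}: {e.message}" for e in ordered]


# Illustrative simulated data.
start = datetime(2024, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
connector = SimulatedConnector([
    LogEvent(start + timedelta(minutes=1), "db", "ERROR", "connection pool exhausted"),
    LogEvent(start - timedelta(minutes=10), "auth", "INFO", "login ok"),
    LogEvent(start, "payments", "ERROR", "timeout calling db"),
])
events = connector.fetch_events(since=start)
services = impacted_services(events)
timeline = build_timeline(events)
print(services)
print("\n".join(timeline))
```

A connector for a real platform would subclass `MonitoringConnector` and translate that platform's API responses into `LogEvent` objects.
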
- Ananya Srinivasan
- AI Agent Workflow
- RAG + FAISS Retrieval
- LangGraph Orchestration
- Streamlit Dashboard
- Evaluation Framework