Anucool419/OpsMind-AI

OpsMind AI — Multi-Agent Incident RCA Architecture

AI-powered incident root cause analysis platform for DevOps and SRE teams.

Problem Statement

During outages, engineers waste valuable time searching logs, dashboards, and alerts to identify the root cause.

Solution: an AI agent that connects to monitoring tools such as Datadog, Grafana, or New Relic, analyzes logs and incidents in real time, identifies probable root causes, and suggests fixes.

Features

  • Multi-agent workflow orchestration using LangGraph
  • Retrieval-Augmented Generation (RAG) for historical incident matching
  • FAISS vector similarity search
  • Monitoring platform connector architecture
  • Automated incident timeline generation
  • Impacted service detection
  • Dynamic incident metrics visualization
  • AI system evaluation dashboard
  • Downloadable incident reports
  • Streamlit-based observability dashboard
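As a rough illustration of two of the features above, timeline generation and impacted service detection, here is a minimal sketch. The `LogEvent` shape, the function names, and the sample events are all hypothetical and are not taken from the repository; the actual agents may work quite differently.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime


@dataclass
class LogEvent:
    # Hypothetical record shape; the real project may store logs differently.
    timestamp: datetime
    service: str
    level: str
    message: str


def build_timeline(events):
    """Return one human-readable line per event, oldest first."""
    ordered = sorted(events, key=lambda e: e.timestamp)
    return [
        f"{e.timestamp:%H:%M:%S} [{e.level}] {e.service}: {e.message}"
        for e in ordered
    ]


def impacted_services(events, min_errors=1):
    """Return services with at least `min_errors` ERROR events, most errors first."""
    counts = Counter(e.service for e in events if e.level == "ERROR")
    return [svc for svc, n in counts.most_common() if n >= min_errors]


# Invented sample events, for illustration only.
events = [
    LogEvent(datetime(2030, 1, 1, 9, 0, 25), "checkout", "ERROR", "payment timeout"),
    LogEvent(datetime(2030, 1, 1, 9, 0, 11), "payments", "ERROR", "db pool exhausted"),
    LogEvent(datetime(2030, 1, 1, 9, 0, 39), "payments", "ERROR", "db pool exhausted"),
    LogEvent(datetime(2030, 1, 1, 9, 0, 0), "gateway", "INFO", "deploy finished"),
]

for line in build_timeline(events):
    print(line)
print("Impacted:", impacted_services(events))
```

Counting ERROR-level events per service is only one possible heuristic; a real detector would probably also weigh alert data and service dependencies.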

Architecture

[Architecture diagram]

Tech Stack

  • Python
  • Streamlit
  • LangGraph
  • FAISS
  • Groq LLM API
  • SentenceTransformers
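To show the idea behind matching a new incident against historical ones, here is a dependency-free sketch of similarity search. It deliberately swaps in a toy bag-of-words vector and a brute-force cosine scan in place of SentenceTransformers embeddings and a FAISS index, so it only illustrates the retrieval step and is not the project's implementation. The incident IDs and texts are invented.

```python
import math
from collections import Counter


def embed(text):
    """Toy stand-in for a sentence embedding: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query, incidents, k=2):
    """Return the k most similar (incident_id, score) pairs, best first."""
    q = embed(query)
    scored = [(cosine(q, embed(text)), incident_id) for incident_id, text in incidents.items()]
    scored.sort(reverse=True)
    return [(incident_id, round(score, 3)) for score, incident_id in scored[:k]]


# Invented historical incidents, for illustration only.
history = {
    "INC-001": "database connection pool exhausted on payments service",
    "INC-002": "expired tls certificate on api gateway",
    "INC-003": "payments service latency spike after deploy",
}

print(top_k("payments database connection pool exhausted", history))
```

With real embeddings the vectors would be dense floats and the scan would be replaced by an index lookup, but the ranking logic is the same: embed the query, score it against stored incidents, and keep the best matches.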

Installation

1. Clone the Repository

git clone https://github.com/Anucool419/OpsMind-AI.git

cd OpsMind-AI

2. Create Virtual Environment

python -m venv venv

Activate environment:

Windows

venv\Scripts\activate

Mac/Linux

source venv/bin/activate

3. Install Dependencies

pip install -r requirements.txt

4. Configure Environment Variables

Create a .env file:

GROQ_API_KEY=your_api_key
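
For reference, this is one way an application could read that key at startup using only the standard library. It is a sketch under assumptions: `load_env_file` and `get_groq_api_key` are hypothetical helpers, and the project itself may load the file differently (the `python-dotenv` package is a common choice).

```python
import os


def load_env_file(path=".env"):
    """Copy KEY=VALUE pairs from a file into os.environ without overriding existing values."""
    if not os.path.exists(path):
        return
    with open(path, encoding="utf-8") as fh:
        for raw in fh:
            line = raw.strip()
            # Skip blank lines, comments, and lines without an assignment.
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            os.environ.setdefault(key.strip(), value.strip().strip('"').strip("'"))


def get_groq_api_key():
    """Return the configured key, or fail early with a clear message."""
    key = os.environ.get("GROQ_API_KEY", "")
    if not key or key == "your_api_key":
        raise RuntimeError("GROQ_API_KEY is not set; add it to .env or the environment.")
    return key
```

Failing early when the placeholder value is still present tends to give a clearer error than a rejected API call later on.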

5. Run the Application

streamlit run app/streamlit_app.py

Screenshots

[Three screenshots, captured 2026-05-24]

Evaluation Metrics

OpsMind AI includes an evaluation layer to measure system reliability and incident analysis quality.

Metrics Tracked

| Metric | Description |
| --- | --- |
| Retrieval Accuracy | Measures whether relevant historical incidents were retrieved correctly |
| RCA Match Accuracy | Measures similarity between generated RCA and expected RCA |
| Severity Accuracy | Evaluates incident severity classification correctness |
| Average Latency | Measures end-to-end AI analysis response time |
| Correlation Confidence | Indicates confidence in incident correlation analysis |
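
The README does not give formulas for these metrics, so the following is only one plausible reading of three of them, written as a small sketch. The function names and the definition of retrieval accuracy (a case counts as a hit when at least one relevant incident is retrieved) are assumptions.

```python
from statistics import mean


def accuracy(predicted, expected):
    """Fraction of positions where the prediction equals the expected label."""
    if len(predicted) != len(expected) or not expected:
        raise ValueError("inputs must be non-empty and the same length")
    return sum(p == e for p, e in zip(predicted, expected)) / len(expected)


def retrieval_accuracy(retrieved_ids, relevant_ids):
    """Fraction of test cases where at least one relevant incident was retrieved."""
    hits = [bool(set(got) & set(want)) for got, want in zip(retrieved_ids, relevant_ids)]
    return sum(hits) / len(hits)


def average_latency(latencies_seconds):
    """Mean end-to-end response time in seconds."""
    return mean(latencies_seconds)


# Severity accuracy reuses the generic accuracy helper on severity labels.
severity = accuracy(["high", "low", "high", "low"], ["high", "low", "low", "low"])
print(f"Severity accuracy: {severity:.2f}")
```

RCA match accuracy and correlation confidence are left out here because they depend on how the project compares free-text RCAs and scores correlations, which the README does not specify.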

Future Improvements

  • Real-time observability ingestion
  • Slack/MS Teams alert integrations
  • Kubernetes event streaming
  • Live Datadog/New Relic APIs
  • Autonomous remediation agents
  • Multi-tenant incident analytics

Note

This project uses simulated observability logs and simulated monitoring connectors to demonstrate incident analysis workflows in a production-inspired environment. It does not call live monitoring APIs yet; the architecture is designed so that real platforms such as Datadog, Grafana, and New Relic can be integrated through their APIs.
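
To make the connector idea concrete, here is a minimal sketch of how a simulated connector and a future real one could share an interface. `MonitoringConnector`, `SimulatedConnector`, and `collect_logs` are hypothetical names chosen for illustration; the repository's actual classes are not shown in this README.

```python
from abc import ABC, abstractmethod


class MonitoringConnector(ABC):
    """Hypothetical interface; names are illustrative, not taken from the repository."""

    name: str

    @abstractmethod
    def fetch_logs(self, service):
        """Return a list of log dicts for `service`."""


class SimulatedConnector(MonitoringConnector):
    """Serves canned logs instead of calling a real monitoring API."""

    def __init__(self, name, canned_logs):
        self.name = name
        self._logs = canned_logs

    def fetch_logs(self, service):
        # Tag each matching log with the connector it came from.
        return [dict(log, source=self.name) for log in self._logs if log["service"] == service]


def collect_logs(connectors, service):
    """Gather logs for one service from every configured connector."""
    logs = []
    for connector in connectors:
        logs.extend(connector.fetch_logs(service))
    return logs


# Invented sample data, for illustration only.
sim = SimulatedConnector(
    "simulated-datadog",
    [
        {"service": "payments", "message": "db pool exhausted"},
        {"service": "gateway", "message": "ok"},
    ],
)
print(collect_logs([sim], "payments"))
```

Under this design, a real connector would subclass the same interface and replace the canned list with API calls, leaving the rest of the pipeline unchanged.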

Contributors

  • Ananya Srinivasan
    • AI Agent Workflow
    • RAG + FAISS Retrieval
    • LangGraph Orchestration
    • Streamlit Dashboard
    • Evaluation Framework
