An intelligent agent designed to assist Site Reliability Engineers (SREs) during incidents. This system leverages Large Language Models (LLMs) and Retrieval-Augmented Generation (RAG) to analyze logs and metrics, diagnose root causes, and propose remediation plans grounded in your runbooks.
- Automated Context Collection: Fetches logs and metrics for the affected service.
- Intelligent Diagnosis: Analyzes incident data to identify likely root causes with confidence scores.
- RAG-Grounded Remediation: Generates remediation plans strictly based on existing runbooks to ensure safety and compliance.
- Human-in-the-Loop: Includes a mandatory approval step where SREs can review, approve, or reject (with feedback) the proposed plan.
- Iterative Refinement: If a plan is rejected, the agent uses the feedback to re-diagnose and generate a new plan.
- Automated Execution: Once approved, the agent executes the remediation steps automatically.
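To make the "RAG-grounded" feature concrete, here is a minimal sketch of one way a plan could be checked against retrieved runbook sections before it reaches the approval step. It is an illustration under assumptions, not the code in this repository: `PlanStep`, `split_grounded`, the section IDs, and the actions are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PlanStep:
    """One proposed remediation step (hypothetical shape, for illustration)."""
    action: str
    runbook_section: Optional[str]  # ID of the runbook section that justifies the step


def split_grounded(steps, retrieved_sections):
    """Separate steps backed by a retrieved runbook section from those that are not."""
    grounded, ungrounded = [], []
    for step in steps:
        if step.runbook_section in retrieved_sections:
            grounded.append(step)
        else:
            ungrounded.append(step)
    return grounded, ungrounded


# Illustrative data only: the section IDs and actions below are made up.
retrieved = {"db-pool-exhaustion", "restart-service"}
proposed = [
    PlanStep("Increase the connection pool size", "db-pool-exhaustion"),
    PlanStep("Restart the payments service", "restart-service"),
    PlanStep("Drop the sessions table", None),
]

grounded, ungrounded = split_grounded(proposed, retrieved)
```

In this sketch the third step cites no runbook section, so it lands in `ungrounded` and could be dropped or flagged for the reviewer instead of being executed.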
The system is built using LangGraph to manage the stateful workflow:
- Retrieve Context: Gathers logs, metrics, and relevant runbook sections.
- Diagnose: Uses an LLM to determine the root cause.
- Plan: Creates a step-by-step remediation plan.
- Human Approval: Pauses for user input.
  - Approve: Proceed to execution.
  - Reject: Provide feedback and loop back to diagnosis.
- Execute: Runs the approved actions.
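The control flow of the steps above can be sketched in plain Python. This is a dependency-free illustration of the approve/reject loop, not the LangGraph graph defined in graph.py. The node bodies are stand-ins for the real LLM and tool calls, and `IncidentState`, `run_workflow`, and the `max_rounds` cap are assumptions made for the example.

```python
from typing import Optional, TypedDict


class IncidentState(TypedDict, total=False):
    """Hypothetical state passed between nodes."""
    service: str
    context: str
    diagnosis: str
    plan: list
    feedback: Optional[str]
    executed: list
    attempts: int


def retrieve_context(state):
    # Stand-in for fetching logs, metrics, and runbook sections.
    state["context"] = f"logs and metrics for {state['service']}"
    return state


def diagnose(state):
    # Stand-in for the LLM diagnosis; reviewer feedback changes the result.
    hint = state.get("feedback")
    if hint is None:
        state["diagnosis"] = "connection pool exhausted"
    else:
        state["diagnosis"] = f"re-diagnosed after feedback: {hint}"
    return state


def plan(state):
    # Stand-in for the LLM planner.
    state["plan"] = [f"step for: {state['diagnosis']}"]
    return state


def execute(state):
    # Stand-in for running the approved actions.
    state["executed"] = list(state["plan"])
    return state


def run_workflow(service, approve, max_rounds=3):
    """Run retrieve -> diagnose -> plan -> approval, looping back on rejection.

    `approve` takes the state and returns (approved, feedback).
    """
    state = retrieve_context({"service": service, "feedback": None, "attempts": 0})
    for _ in range(max_rounds):
        state = plan(diagnose(state))
        state["attempts"] += 1
        approved, feedback = approve(state)
        if approved:
            return execute(state)
        state["feedback"] = feedback  # rejection: loop back to diagnosis
    return state  # never approved, so nothing is executed


# A scripted reviewer that rejects the first plan and approves the second.
decisions = iter([(False, "check the cache layer first"), (True, None)])
final = run_workflow("payments", lambda state: next(decisions))
```

The point to notice is that `execute` is reachable only through an approval, and a rejection feeds the reviewer's feedback back into the diagnosis step.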
- Python 3.10+
- pip
- Access to necessary LLM APIs (configured via .env)
- Clone the repository:

      git clone <repository-url>
      cd sre_agent

- Create and activate a virtual environment:

      python -m venv .venv
      source .venv/bin/activate  # On Windows: .venv\Scripts\activate

- Install dependencies:

      pip install -r requirements.txt

- Set up environment variables: create a .env file in the root directory and add your API keys:

      # Example
      OPENAI_API_KEY=your_api_key_here
      # Add other necessary keys as per 'model.py' and 'tools.py'
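Before running the agent, it can help to confirm that the expected keys are visible to the process. The following is a small standard-library sketch, not part of the repository. It assumes the .env values have already been loaded into the environment by whatever loader the project uses, and `missing_keys` and `REQUIRED_KEYS` are hypothetical names.

```python
import os


def missing_keys(required, env=None):
    """Return the names in `required` that are unset or blank in `env`."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name, "").strip()]


# OPENAI_API_KEY is the only key the setup example names; extend this
# list with whatever 'model.py' and 'tools.py' actually read.
REQUIRED_KEYS = ["OPENAI_API_KEY"]

missing = missing_keys(REQUIRED_KEYS)
if missing:
    print("Missing environment variables:", ", ".join(missing))
```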
Run the agent from the command line:

    python main.py

The agent will start, analyze the mock incident (configured in main.py), and present you with a diagnosis and remediation plan for approval.
- main.py: Entry point for the application.
- graph.py: Defines the LangGraph workflow and nodes.
- agents.py: Contains the logic for Diagnosis and Planning agents (LLM chains).
- tools.py: Tools for fetching logs, metrics, and searching docs.
- state.py: Defines the execution state schema.
- mock_data.py: Mock data for testing the agent without live systems.
- rag.py: RAG implementation details.
