AI Incident Copilot

AI Incident Copilot is an agent that assists Site Reliability Engineers (SREs) during incidents. It uses Large Language Models (LLMs) and Retrieval-Augmented Generation (RAG) to analyze logs and metrics, diagnose likely root causes, and propose remediation plans grounded in your runbooks.

Features

Automated Context Collection: Fetches logs and metrics for the affected service.
Intelligent Diagnosis: Analyzes incident data to identify likely root causes with confidence scores.
RAG-Grounded Remediation: Generates remediation plans strictly based on existing runbooks to ensure safety and compliance.
Human-in-the-Loop: Includes a mandatory approval step where SREs can review, approve, or reject (with feedback) the proposed plan.
Iterative Refinement: If a plan is rejected, the agent uses the feedback to re-diagnose and generate a new plan.
Automated Execution: Once approved, the agent executes the remediation steps automatically.
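
To make the "grounded in runbooks" constraint concrete, here is a minimal sketch. It is not taken from this repository's rag.py. It assumes runbooks are plain dictionaries, retrieves by naive keyword overlap rather than embeddings, and flags any plan step that does not appear verbatim in the retrieved runbook. All names (`RUNBOOKS`, `retrieve_runbook`, `validate_plan`) and the sample runbook contents are hypothetical.

```python
import re

# Hypothetical runbooks; the real project loads its own documents.
RUNBOOKS = {
    "high-memory": {
        "symptoms": "memory usage high oom killed container restart",
        "steps": [
            "Capture a heap dump",
            "Restart the affected pod",
            "Scale replicas up by one",
        ],
    },
    "db-connection-pool": {
        "symptoms": "database connection pool exhausted timeout",
        "steps": [
            "Check active connections",
            "Increase pool size",
            "Restart the service",
        ],
    },
}


def tokenize(text: str) -> set[str]:
    """Lowercase the text and split it into a set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve_runbook(incident_text: str) -> str:
    """Return the runbook whose symptoms share the most words with the incident."""
    query = tokenize(incident_text)
    return max(
        RUNBOOKS,
        key=lambda name: len(query & tokenize(RUNBOOKS[name]["symptoms"])),
    )


def validate_plan(plan: list[str], runbook_name: str) -> list[str]:
    """Return plan steps NOT backed by the runbook (empty list means grounded)."""
    allowed = set(RUNBOOKS[runbook_name]["steps"])
    return [step for step in plan if step not in allowed]


incident = "Pod OOM killed repeatedly, memory usage climbing"
name = retrieve_runbook(incident)
ungrounded = validate_plan(["Restart the affected pod", "Delete the database"], name)
print(name, ungrounded)
```

A check like `validate_plan` could run before the approval step, so that the reviewer only sees plans whose steps all trace back to a runbook.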

Architecture

The system is built using LangGraph to manage the stateful workflow:

Retrieve Context: Gathers logs, metrics, and relevant runbook sections.
Diagnose: Uses an LLM to determine the root cause.
Plan: Creates a step-by-step remediation plan.
Human Approval: Pauses for user input.
- Approve: Proceed to execution.
- Reject: Provide feedback and loop back to diagnosis.
Execute: Runs the approved actions.
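
The control flow above, including the reject-and-retry loop, can be sketched in plain Python. This is an illustration of the workflow's shape only: it does not use the LangGraph API and is not the code in graph.py. The function name `run_workflow`, the `ask_human` callback, the state keys, and the `max_rounds` cap are all assumptions made for the example.

```python
from typing import Callable, Optional


def run_workflow(
    incident: str,
    ask_human: Callable[[str, list[str]], Optional[str]],
    max_rounds: int = 3,
) -> dict:
    """Run retrieve -> diagnose -> plan -> approve -> execute.

    ask_human returns None to approve, or a feedback string to reject.
    max_rounds is a hypothetical safety cap on the number of re-plans.
    """
    state = {"incident": incident, "feedback": [], "executed": [], "rounds": 0}

    # Retrieve Context: stand-in for fetching logs, metrics, and runbook sections.
    state["context"] = f"logs and metrics for: {incident}"

    while state["rounds"] < max_rounds:
        state["rounds"] += 1

        # Diagnose: a real node would call an LLM with the context and feedback.
        notes = len(state["feedback"])
        state["diagnosis"] = f"diagnosis #{state['rounds']} ({notes} feedback notes)"

        # Plan: a real node would derive steps from retrieved runbook sections.
        state["plan"] = [f"step {i} for {state['diagnosis']}" for i in (1, 2)]

        # Human Approval: pause for a decision.
        feedback = ask_human(state["diagnosis"], state["plan"])
        if feedback is None:
            # Execute: the approved actions are recorded here, not performed.
            state["executed"] = list(state["plan"])
            state["status"] = "executed"
            return state

        # Reject: keep the feedback and loop back to diagnosis.
        state["feedback"].append(feedback)

    state["status"] = "abandoned"
    return state


# Scripted reviewer: reject the first plan, approve the second.
answers = iter(["Check the database first", None])
result = run_workflow("checkout-service 5xx spike", lambda d, p: next(answers))
print(result["status"], result["rounds"], result["feedback"])
```

Nothing reaches the execute branch without an explicit approval, and every rejection carries its feedback into the next diagnosis round.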

Getting Started

Prerequisites

Python 3.10+
pip
Access to necessary LLM APIs (configured via .env)

Installation

Clone the repository:
```
git clone <repository-url>
cd sre_agent
```

Create and activate a virtual environment:
```
python -m venv .venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate
```

Install dependencies:
```
pip install -r requirements.txt
```

Set up environment variables: Create a .env file in the root directory and add your API keys:
```
# Example
OPENAI_API_KEY=your_api_key_here
# Add other necessary keys as per 'model.py' and 'tools.py'
```
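
A quick way to catch a missing key before the agent starts is to parse the file and check for required names. The sketch below uses only the standard library and handles simple `KEY=VALUE` lines. How the project itself loads .env is not shown in this README, so `parse_env` and `missing_keys` are hypothetical helpers, not part of the codebase.

```python
def parse_env(text: str) -> dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blanks and # comments."""
    values: dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_keys(env: dict[str, str], required: list[str]) -> list[str]:
    """Return the required keys that are absent or empty."""
    return [key for key in required if not env.get(key)]


# In practice the text would come from open(".env").read().
sample = """
# Example
OPENAI_API_KEY=your_api_key_here
"""
env = parse_env(sample)
print(missing_keys(env, ["OPENAI_API_KEY"]))
```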

Usage

Run the agent from the command line:
```
python main.py
```

The agent will start, analyze the mock incident (configured in main.py), and present you with a diagnosis and remediation plan for approval.

Project Structure

main.py: Entry point for the application.
graph.py: Defines the LangGraph workflow and nodes.
agents.py: Contains the logic for Diagnosis and Planning agents (LLM chains).
tools.py: Tools for fetching logs, metrics, and searching docs.
state.py: Defines the execution state schema.
mock_data.py: Mock data for testing the agent without live systems.
rag.py: RAG implementation details.
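
For orientation, an execution state schema of the kind state.py describes might look like the sketch below. The field names are guesses based on the workflow described in this README, not the actual contents of state.py, and `apply_update` is a hypothetical helper showing how a node's partial output could be merged into the shared state.

```python
from typing import TypedDict


class IncidentState(TypedDict, total=False):
    """Hypothetical state shared between workflow nodes."""

    service: str
    logs: list[str]
    metrics: dict[str, float]
    runbook_sections: list[str]
    root_cause: str
    confidence: float
    plan: list[str]
    approved: bool
    feedback: str


def apply_update(state: IncidentState, update: IncidentState) -> IncidentState:
    """Merge a node's partial update into the state without mutating the input."""
    return {**state, **update}


state: IncidentState = {"service": "checkout", "logs": ["OOMKilled"]}
state = apply_update(state, {"root_cause": "memory leak", "confidence": 0.8})
print(sorted(state))
```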
