RKKgithub/sre-copilot-agent

AI Incident Copilot

AI Incident Copilot is an agent that assists Site Reliability Engineers (SREs) during incidents. It uses Large Language Models (LLMs) and Retrieval-Augmented Generation (RAG) to analyze logs and metrics, diagnose likely root causes, and propose remediation plans grounded in your runbooks.

Architecture Diagram

Features

  • Automated Context Collection: Fetches logs and metrics for the affected service.
  • Intelligent Diagnosis: Analyzes incident data to identify likely root causes with confidence scores.
  • RAG-Grounded Remediation: Generates remediation plans strictly based on existing runbooks to ensure safety and compliance.
  • Human-in-the-Loop: Includes a mandatory approval step where SREs can review, approve, or reject (with feedback) the proposed plan.
  • Iterative Refinement: If a plan is rejected, the agent uses the feedback to re-diagnose and generate a new plan.
  • Automated Execution: Once approved, the agent executes the remediation steps automatically.
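
To illustrate what grounding a plan in runbooks can look like, here is a minimal, dependency-free sketch. It is not the code in rag.py: the runbook text, the `retrieve_sections` and `plan_from_runbook` helpers, and the keyword-overlap scoring are all assumptions made for illustration. A real RAG pipeline would typically use embeddings and a vector store instead of word overlap.

```python
import re

# Hypothetical runbook sections; the real project loads its own runbooks.
RUNBOOK = {
    "high-latency": "If latency is high, check database connection pool saturation and restart the pool.",
    "disk-full": "If disk usage exceeds 90 percent, rotate logs and clear the temp directory.",
    "oom-kill": "If the service is OOM killed, raise the memory limit and restart the pod.",
}


def tokenize(text: str) -> set[str]:
    """Lowercase the text and split it into a set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve_sections(query: str, runbook: dict[str, str], top_k: int = 1) -> list[str]:
    """Rank sections by word overlap with the query (a stand-in for vector search)."""
    query_words = tokenize(query)
    scored = [(len(query_words & tokenize(body)), name) for name, body in runbook.items()]
    # Drop sections with no overlap so unrelated queries return nothing.
    scored = [(score, name) for score, name in scored if score > 0]
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [name for _, name in scored[:top_k]]


def plan_from_runbook(diagnosis: str, runbook: dict[str, str]) -> list[str]:
    """Return only steps that exist in the runbook; an empty list means no grounded plan."""
    return [runbook[name] for name in retrieve_sections(diagnosis, runbook)]
```

Returning an empty plan when nothing matches is a deliberate choice in this sketch: a grounded system should decline to propose steps rather than invent them.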

Architecture

The system is built using LangGraph to manage the stateful workflow:

  1. Retrieve Context: Gathers logs, metrics, and relevant runbook sections.
  2. Diagnose: Uses an LLM to determine the root cause.
  3. Plan: Creates a step-by-step remediation plan.
  4. Human Approval: Pauses for user input.
    • Approve: Proceed to execution.
    • Reject: Provide feedback and loop back to diagnosis.
  5. Execute: Runs the approved actions.
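
The control flow of the five steps above can be sketched in plain Python. This is not the LangGraph definition in graph.py, and it does not use the LangGraph API; it only mirrors the approve/reject loop. The names `run_incident_workflow` and `ask_human`, the stubbed node bodies, and the `max_rounds` cap with its "escalated" outcome are all hypothetical.

```python
from typing import Callable


def run_incident_workflow(
    service: str,
    ask_human: Callable[[dict], tuple[bool, str]],
    max_rounds: int = 3,  # assumption: cap the reject/re-diagnose loop
) -> dict:
    """Walk the five steps; the stub strings stand in for real node logic."""
    state = {"service": service, "feedback": [], "executed": []}

    # 1. Retrieve context (stubbed)
    state["context"] = f"logs and metrics for {service}"

    for round_no in range(1, max_rounds + 1):
        # 2. Diagnose, taking any earlier feedback into account
        notes = len(state["feedback"])
        state["diagnosis"] = f"diagnosis v{round_no} ({notes} feedback notes)"

        # 3. Plan
        state["plan"] = [f"step for {state['diagnosis']}"]

        # 4. Human approval: returns (approved, feedback)
        approved, feedback = ask_human(state)
        if approved:
            # 5. Execute the approved plan
            state["executed"] = list(state["plan"])
            state["status"] = "executed"
            return state

        # Rejected: record the feedback and loop back to diagnosis
        state["feedback"].append(feedback)

    state["status"] = "escalated"
    return state
```

Passing the approval step in as a callable keeps the loop testable: in an interactive run it could wrap `input()`, while a test can supply scripted answers.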

Getting Started

Prerequisites

  • Python 3.10+
  • pip
  • Access to necessary LLM APIs (configured via .env)

Installation

  1. Clone the repository:

    git clone <repository-url>
    cd sre-copilot-agent
  2. Create and activate a virtual environment:

    python -m venv .venv
    source .venv/bin/activate  # On Windows: .venv\Scripts\activate
  3. Install dependencies:

    pip install -r requirements.txt
  4. Set up environment variables: Create a .env file in the root directory and add your API keys:

    # Example
    OPENAI_API_KEY=your_api_key_here
    # Add other necessary keys as per 'model.py' and 'tools.py'
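
A quick way to confirm the keys are visible before running the agent is a small check like the one below. It is a sketch, not part of the repository: `REQUIRED_KEYS` and `missing_env_keys` are hypothetical names, and the check assumes the variables have already been loaded into the process environment (this snippet does not read the .env file itself).

```python
import os

# Assumption: extend this list with whatever keys model.py and tools.py need.
REQUIRED_KEYS = ["OPENAI_API_KEY"]


def missing_env_keys(required, environ=None):
    """Return the required keys that are unset or blank in the environment."""
    env = os.environ if environ is None else environ
    return [key for key in required if not env.get(key, "").strip()]


if __name__ == "__main__":
    missing = missing_env_keys(REQUIRED_KEYS)
    if missing:
        raise SystemExit(f"Missing environment variables: {', '.join(missing)}")
    print("All required environment variables are set.")
```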

Usage

Run the agent from the command line:

    python main.py

The agent will start, analyze the mock incident (configured in main.py), and present you with a diagnosis and remediation plan for approval.

Project Structure

  • main.py: Entry point for the application.
  • graph.py: Defines the LangGraph workflow and nodes.
  • agents.py: Contains the logic for Diagnosis and Planning agents (LLM chains).
  • tools.py: Tools for fetching logs, metrics, and searching docs.
  • state.py: Defines the execution state schema.
  • mock_data.py: Mock data for testing the agent without live systems.
  • rag.py: RAG implementation details.
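
For orientation, an execution state schema for a workflow like this one might look as follows. This is a guess at the shape, not the contents of state.py: the `IncidentState` name, every field, and the `new_state` helper are assumptions inferred from the features described above.

```python
from typing import TypedDict


class IncidentState(TypedDict, total=False):
    """Hypothetical state passed between workflow nodes; all fields are optional."""
    service: str
    logs: list[str]
    metrics: dict[str, float]
    runbook_sections: list[str]
    diagnosis: str
    confidence: float  # assumed to be in the range 0.0 to 1.0
    plan: list[str]
    approved: bool
    feedback: str


def new_state(service: str) -> IncidentState:
    """Start with empty context; later nodes fill in diagnosis, plan, and feedback."""
    return IncidentState(
        service=service,
        logs=[],
        metrics={},
        runbook_sections=[],
        plan=[],
        approved=False,
    )
```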

About

This repo contains the source code for my SRE copilot agent project built using LangChain for components, LangGraph for orchestration, and LangSmith for observability.
