# Incident Diagnosis Using Edge Clues

This notebook demonstrates root cause localization for incidents using edge-based clues from service call relationships. The approach analyzes failure patterns in service-to-service calls to identify which service caused an incident.

**Key Concepts:**
- **Edge Clues**: Information extracted from service call relationships (e.g., failure counts)
- **Root Cause Localization**: Algorithm to identify which service is the source of an incident
- **Edge Backward Factor**: Parameter controlling how much upstream failures influence downstream services

**Process Overview:**
1. Load labeled incident topologies
2. Extract and prepare edge-based failure information
3. Configure diagnosis parameters
4. Apply root cause localization algorithm
5. Evaluate prediction accuracy

## Step 1: Import Required Libraries

Import libraries for graph analysis, data processing, and the custom incident diagnosis algorithms.

In [1]:
# Core libraries for data processing and analysis
from __future__ import division
import networkx as nx  # Graph analysis for service topology
import pickle          # Data serialization
import math
import matplotlib.pyplot as plt  # Visualization
import pandas as pd    # Data manipulation
import json
import datetime
import numpy as np

# Import custom incident diagnosis algorithms
import sys
sys.path.append('../src')
from incident_diagnosis.incident_diagnosis import root_cause_localization, explain, optimize, get_weight_from_edge_info



## Step 2: Load Labeled Incident Topologies

Load the topology data produced by the data labeling step. Each topology contains the incident structure along with its ground truth root cause.

In [2]:
# Load the labeled topology data from the data labeling step
# This contains incidents with known root causes for evaluation
with open('../data/issue_topoloies.pkl', 'rb') as f:
    Topologies = pickle.load(f)
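
Later cells read a small set of fields from every case, so it can help to check them right after loading. The following is a minimal sketch, assuming each case is a dict whose `edges_info` entries carry the keys used further down in this notebook (`src`, `des`, `FailCount`, `YesterFailCount`). The helper `missing_edge_fields` and the toy case are hypothetical and not part of `incident_diagnosis`.

```python
# Fields this notebook reads from each edge in later cells.
REQUIRED_EDGE_FIELDS = {"src", "des", "FailCount", "YesterFailCount"}


def missing_edge_fields(case):
    """Return (edge index, sorted missing field names) for each incomplete edge."""
    problems = []
    for index, edge in enumerate(case.get("edges_info", [])):
        missing = REQUIRED_EDGE_FIELDS - set(edge)
        if missing:
            problems.append((index, sorted(missing)))
    return problems


# Hypothetical toy case; real cases come from the pickle loaded above.
toy_case = {
    "y": 1,
    "root_cause": "payments",
    "edges_info": [
        {"src": "orders", "des": "payments", "FailCount": 95, "YesterFailCount": 2},
        {"src": "gateway", "des": "orders", "FailCount": 120},  # baseline missing
    ],
}

problems = missing_edge_fields(toy_case)
print(problems)
```

Running the same helper over `Topologies` would flag any case that would raise a `KeyError` in the edge preparation step.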

## Step 3: Filter Relevant Cases

Extract only the cases that are labeled as incidents (y=1) and have identifiable root causes. This excludes normal operations and system-wide faults that cannot be attributed to specific services.

In [3]:
# Filter for actual incidents with identifiable root causes
# y=1: labeled as incident
# 'root_cause' exists and != 'All': has specific root cause (not system-wide)
all_cases = [case for case in Topologies if case['y'] == 1 and 'root_cause' in case and case['root_cause']!='All']
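
To see how many cases each filter condition removes, the same three conditions can be expressed as a labeling function and tallied. This is a sketch on hypothetical toy data. `filter_reason` is an illustrative helper, and unlike the cell above it uses `.get` for `y`, so a case without that key is treated as a non-incident instead of raising.

```python
from collections import Counter


def filter_reason(case):
    """Label a case with the reason it is kept or dropped by the filter."""
    if case.get("y") != 1:
        return "not_incident"
    if "root_cause" not in case:
        return "no_root_cause"
    if case["root_cause"] == "All":
        return "system_wide"
    return "kept"


# Hypothetical toy topologies covering each branch.
toy_topologies = [
    {"y": 1, "root_cause": "svc-a"},
    {"y": 0},
    {"y": 1},
    {"y": 1, "root_cause": "All"},
    {"y": 1, "root_cause": "svc-b"},
]

reason_counts = Counter(filter_reason(case) for case in toy_topologies)
kept_cases = [case for case in toy_topologies if filter_reason(case) == "kept"]
print(dict(reason_counts))
```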

## Step 4: Prepare Edge Information

Transform the edge information from the original format into a structured format suitable for the diagnosis algorithm. This includes:
- **Current edge info**: Failure counts during the incident
- **Historical edge info**: Baseline failure counts from previous periods for comparison

In [4]:
# Prepare edge information for each case
for case in all_cases:
    # Initialize edge information dictionaries
    case['edge_info'] = {}        # Current incident period
    case['edge_mount_info'] = {}  # Historical baseline

    # Process each edge in the topology
    for edge in case['edges_info']:
        # Create edge identifier (source_destination)
        edge_key = edge['src']+'_'+edge['des']
        
        # Store current failure count for this edge
        case['edge_info'][edge_key] = {}
        case['edge_info'][edge_key]['FailCount'] = edge['FailCount']

        # Store historical failure count for baseline comparison
        case['edge_mount_info'][edge_key] = {}
        case['edge_mount_info'][edge_key]['FailCount'] = edge['YesterFailCount']
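
The resulting structure is easier to see on a small example. The sketch below wraps the same transformation in a hypothetical helper, `build_edge_info`, and applies it to two made-up edges. The service names and counts are illustrative only.

```python
def build_edge_info(edges_info):
    """Split raw edge records into current and baseline FailCount dicts."""
    current, baseline = {}, {}
    for edge in edges_info:
        key = edge["src"] + "_" + edge["des"]  # same key format as the cell above
        current[key] = {"FailCount": edge["FailCount"]}
        baseline[key] = {"FailCount": edge["YesterFailCount"]}
    return current, baseline


# Hypothetical edges: gateway calls orders, orders calls payments.
toy_edges = [
    {"src": "gateway", "des": "orders", "FailCount": 120, "YesterFailCount": 3},
    {"src": "orders", "des": "payments", "FailCount": 95, "YesterFailCount": 2},
]

edge_info, edge_mount_info = build_edge_info(toy_edges)
print(edge_info)
print(edge_mount_info)
```

One caveat with the `src_des` key format: if service names themselves contain underscores, two different edges can map to the same key (for example `a_b` calling `c` and `a` calling `b_c`). A tuple key such as `(src, des)` would avoid the collision, though whether the diagnosis functions accept tuple keys is not confirmed here.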

## Step 5: Configure Diagnosis Parameters

Set up the parameters for the root cause localization algorithm, then run it on every case. The parameters are:
- **Edge clue tags**: Types of edge information to use (FailCount)
- **Node clue tags**: Types of node information to use (empty for edge-only approach)
- **Weights**: Importance of each clue type
- **Edge backward factor**: Controls how upstream failures influence downstream services

In [5]:
# Configure algorithm parameters
init_clue_tag = 'FailCount'     # Primary clue type for initialization (not passed to the call below)
node_clue_tags = []             # No node-based clues in this approach
edge_clue_tags = ['FailCount']  # Use failure count as edge clue

# Set weights for each clue type (all equal weight = 1)
a = {}
for clue_tag in edge_clue_tags:
    a[clue_tag] = 1
for clue_tag in node_clue_tags:
    a[clue_tag] = 1

# Edge backward factor: controls upstream influence
# Lower values (e.g. 0.3) mean upstream failures have less impact on downstream blame
edge_backward_factor = 0.3

# Apply root cause localization to each case
for case in all_cases:
    case['pred'] = root_cause_localization(
        case, 
        node_clue_tags, 
        edge_clue_tags, 
        a, 
        get_edge_weight=get_weight_from_edge_info, 
        edge_backward_factor=edge_backward_factor
    )
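
The value `0.3` is fixed by hand above. One way to check how sensitive the results are to it is a small sweep over candidate values. The sketch below is generic: `sweep_factor` takes any `localize(case, factor)` callable, and `toy_localize` is a hypothetical stand-in so the example runs on its own. It does not reproduce the behavior of `root_cause_localization`.

```python
def sweep_factor(cases, localize, factors):
    """Return {factor: accuracy} for a localize(case, factor) callable."""
    if not cases:
        raise ValueError("need at least one case to compute accuracy")
    results = {}
    for factor in factors:
        hits = sum(localize(case, factor) == case["root_cause"] for case in cases)
        results[factor] = hits / len(cases)
    return results


def toy_localize(case, factor):
    """Hypothetical localizer: blames the caller once the factor is large."""
    return case["upstream"] if factor >= 0.5 else case["downstream"]


# Hypothetical cases with a known root cause and two candidate services.
toy_cases = [
    {"root_cause": "payments", "upstream": "orders", "downstream": "payments"},
    {"root_cause": "orders", "upstream": "orders", "downstream": "inventory"},
    {"root_cause": "db", "upstream": "api", "downstream": "db"},
]

accuracy_by_factor = sweep_factor(toy_cases, toy_localize, [0.1, 0.3, 0.5, 0.7])
best_factor = max(accuracy_by_factor, key=accuracy_by_factor.get)
print(accuracy_by_factor, best_factor)
```

In this notebook, `localize` could be a small wrapper that calls `root_cause_localization(case, node_clue_tags, edge_clue_tags, a, get_edge_weight=get_weight_from_edge_info, edge_backward_factor=factor)`. Selecting the factor on the same cases used for evaluation tends to overstate accuracy, so a held-out split is worth considering.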

## Step 6: Evaluate Prediction Accuracy

Compare the algorithm's predictions with the ground truth root causes to assess performance.

In [6]:
# Compare predictions with ground truth
# Mark each case as correct (True) or incorrect (False)
for case in all_cases:
    case['right'] = case['pred'] == case['root_cause']
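
Beyond a correct or incorrect flag, it is often useful to see which services get confused with which. A minimal sketch, assuming `pred` and `root_cause` are hashable values such as service names. The helper `miss_pairs` and the toy cases are hypothetical.

```python
from collections import Counter


def miss_pairs(cases):
    """Count (true root cause, predicted root cause) pairs for wrong predictions."""
    return Counter(
        (case["root_cause"], case["pred"])
        for case in cases
        if case["pred"] != case["root_cause"]
    )


# Hypothetical predictions: one hit, three misses.
toy_results = [
    {"pred": "orders", "root_cause": "orders"},
    {"pred": "gateway", "root_cause": "payments"},
    {"pred": "gateway", "root_cause": "payments"},
    {"pred": "db", "root_cause": "cache"},
]

misses = miss_pairs(toy_results)
print(misses.most_common(1))
```

Applied to `all_cases`, the most common pairs point at the services the algorithm most often blames in place of the true root cause.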

## Step 7: Display Results

Tally correct and incorrect predictions to gauge the overall accuracy of the edge-based root cause localization approach.

In [7]:
# Display accuracy results
# True: Correct predictions
# False: Incorrect predictions
pd.Series([case['right'] for case in all_cases]).value_counts()

True     384
False     18
Name: count, dtype: int64
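
The counts above can be turned into a single accuracy figure. The sketch below rebuilds the flags from the reported counts (384 correct, 18 incorrect) so it runs on its own. The helper name `accuracy` is illustrative.

```python
def accuracy(flags):
    """Fraction of True values in a sequence of correctness flags."""
    flags = list(flags)
    if not flags:
        raise ValueError("no predictions to evaluate")
    return sum(flags) / len(flags)


# Flags reconstructed from the value_counts output above.
right_flags = [True] * 384 + [False] * 18

acc = accuracy(right_flags)
print(f"{acc:.1%}")  # 95.5%
```

In the notebook itself, the equivalent call would be `accuracy(case['right'] for case in all_cases)`.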