# Incident Diagnosis Using Node Clues with Continual Optimization - AIOPS2021

This notebook demonstrates an incident diagnosis approach that uses node-level clues (such as alert counts) together with continual optimization, aiming to improve root cause localization accuracy as more incidents are processed.

**Key Features:**
- Uses node clues (AlertCount) for initial diagnosis
- Implements continual learning through optimization when predictions fail
- Evaluates performance across different data splits
- Compares the optimized approach with the baseline AlertCount method

**Dataset:** AIOPS2021 competition dataset
**Evaluation Metric:** A@1 (Accuracy at rank 1)

## Step 1: Import Required Libraries

Import all necessary libraries for data processing, machine learning, graph analysis, and the custom incident diagnosis modules.

In [1]:
# Data processing and visualization
import pandas as pd
import matplotlib.pyplot as plt
import os
import json
from tqdm import tqdm
import numpy as np
import pickle

# PyTorch Geometric for graph data handling
from torch_geometric.data import Data, DataLoader
import torch

# Parallel processing and optimization
from joblib import Parallel, delayed
import networkx as nx
import math
import optuna  # Hyperparameter optimization framework

# Custom incident diagnosis modules
import sys
sys.path.append('../src')
from incident_diagnosis.incident_diagnosis import root_cause_localization, explain, optimize, get_weight_from_edge_info



## Step 2: Load AIOPS2021 Dataset

Load the preprocessed incident topologies from the AIOPS2021 dataset. Each topology contains:
- Network structure with nodes and edges
- Node features (like AlertCount)
- Ground truth root cause labels

In [2]:
# Load the AIOPS2021 incident topology dataset
# This contains labeled incidents with known root causes for evaluation
with open('../data/AIOPS2021.pkl', 'rb') as f:
    incident_topologies = pickle.load(f)
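
Before running the diagnosis, it can help to confirm that every loaded case carries the keys the later cells rely on. The sketch below is a minimal check under stated assumptions: only the `root_cause` key is taken from this notebook, while `toy_cases`, the `nodes` key, and the helper `summarize_cases` are hypothetical stand-ins, since the real case structure is not shown here.

```python
from collections import Counter


def summarize_cases(cases, required_keys=("root_cause",)):
    """Count cases, list indices missing a required key, and tally the keys seen."""
    missing = [
        i for i, case in enumerate(cases)
        if any(key not in case for key in required_keys)
    ]
    key_counts = Counter(key for case in cases for key in case)
    return {"n_cases": len(cases), "missing": missing, "key_counts": dict(key_counts)}


# Hypothetical toy data; the real cases come from AIOPS2021.pkl.
toy_cases = [
    {"root_cause": "node_a", "nodes": ["node_a", "node_b"]},
    {"nodes": ["node_c"]},  # deliberately lacks 'root_cause'
]
report = summarize_cases(toy_cases)
print(report)
```

On the real data the same call would be `summarize_cases(incident_topologies)`, assuming each case behaves like a dictionary.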

## Step 3: Verify Dataset Size

Check the number of incident cases loaded to ensure data integrity.

In [3]:
# Display the total number of incident cases in the dataset
len(incident_topologies)

133

## Step 4: Configure Continual Learning Parameters

Set up the initial configuration for the continual learning approach:
- **init_clue_tag**: Starting clue type for diagnosis
- **node_clue_tags**: List of node-level features to use
- **edge_clue_tags**: List of edge-level features (empty in this case)
- **edge_backward_factor**: Weight for backward propagation in the graph

In [4]:
import logging

# Suppress verbose optimization logs to keep output clean
optuna.logging.set_verbosity(optuna.logging.WARNING)

# Initialize diagnosis configuration
init_clue_tag = 'AlertCount'  # Primary clue type for initial diagnosis
node_clue_tags = ['AlertCount']  # Node-level features to consider
edge_clue_tags = []  # Edge-level features (none used here)

# Initialize weight dictionary for different clue types
a = {}
for clue_tag in edge_clue_tags:
    a[clue_tag] = 1
for clue_tag in node_clue_tags:
    a[clue_tag] = 1

# Configuration for graph-based diagnosis
get_edge_weight = None  # No custom edge weighting function
edge_backward_factor = 0.3  # Factor for backward propagation in graph
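
The two loops that fill `a` can be written as one comprehension. The sketch below shows that equivalent form, plus a purely illustrative weighted sum. `init_weights` and `weighted_score` are hypothetical helpers: how `root_cause_localization` actually combines the weights is not shown in this notebook, so the scoring function is an assumption for intuition only.

```python
def init_weights(node_clue_tags, edge_clue_tags, initial=1):
    """Give every edge and node clue tag the same starting weight."""
    return {tag: initial for tag in [*edge_clue_tags, *node_clue_tags]}


def weighted_score(node_clues, weights):
    """Illustrative only: weighted sum of a node's clue values.

    Clues without a weight contribute nothing.
    """
    return sum(weights.get(tag, 0) * value for tag, value in node_clues.items())


a = init_weights(["AlertCount"], [])
# 'CPULoad' is a hypothetical clue with no weight yet, so it is ignored.
example_score = weighted_score({"AlertCount": 4, "CPULoad": 0.9}, a)
print(a, example_score)
```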

## Step 5: Continual Learning Loop

This is the core of the continual optimization approach:
1. **Predict**: Use current model to diagnose each incident
2. **Evaluate**: Check if prediction matches ground truth
3. **Optimize**: If the prediction misses the root cause (and the localization did not return `'None'`), re-optimize the clue set and weights using all cases seen so far

This makes the procedure an online learner: parameters are re-tuned after each misclassification, using only the cases seen up to that point.
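
The predict, evaluate, optimize cycle can be sketched on toy data as follows. This is a simplified illustration, not the library's implementation: `predict` and `reoptimize` are hypothetical stand-ins for `root_cause_localization` and `optimize`, and the cases, clue names, and candidate weights are invented.

```python
def predict(case, weights):
    """Toy stand-in for root_cause_localization: return the top-scoring node as a set."""
    scores = {
        node: sum(weights.get(tag, 0.0) * value for tag, value in clues.items())
        for node, clues in case["nodes"].items()
    }
    return {max(scores, key=scores.get)}


def reoptimize(seen_cases, weights, candidates):
    """Toy stand-in for optimize: keep the weights with the best accuracy on seen cases."""
    def accuracy(w):
        return sum(c["root_cause"] in predict(c, w) for c in seen_cases) / len(seen_cases)
    # Ties favor the current weights because they come first in the list.
    return max([weights, *candidates], key=accuracy)


cases = [
    {"root_cause": "db", "nodes": {"db": {"AlertCount": 1, "DiskBusy": 9},
                                   "web": {"AlertCount": 3, "DiskBusy": 0}}},
    {"root_cause": "db", "nodes": {"db": {"AlertCount": 2, "DiskBusy": 8},
                                   "web": {"AlertCount": 4, "DiskBusy": 1}}},
]
weights = {"AlertCount": 1.0}
candidates = [{"AlertCount": 1.0, "DiskBusy": 1.0}]
updates = 0

for i, case in enumerate(cases):
    case["pred"] = predict(case, weights)           # 1. predict with current weights
    if case["root_cause"] not in case["pred"]:      # 2. evaluate
        weights = reoptimize(cases[: i + 1], weights, candidates)  # 3. optimize
        updates += 1

print(updates, weights)
```

As in the notebook's loop, the prediction stored for a failed case is the one made before the update, so accuracy computed from the stored predictions reflects predict-then-learn performance.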

In [5]:
# Main continual learning loop
for i, case in enumerate(incident_topologies):
    
    # Step 5.1: Perform root cause localization with current parameters
    case['pred_incremental'] = root_cause_localization(
        case, node_clue_tags, edge_clue_tags, a, get_edge_weight, edge_backward_factor
    )

    # Step 5.2: Check if prediction failed (root cause not in predicted set)
    if case['root_cause'] not in case['pred_incremental'] and case['pred_incremental'] != 'None':
        
        # Step 5.3: Optimize parameters using all cases seen so far (continual learning)
        # This updates both node_clue_tags and weights (a) based on historical performance
        node_clue_tags, a = optimize(
            case,                    # Current failed case
            node_clue_tags,         # Current node clue configuration
            edge_clue_tags,         # Current edge clue configuration
            a,                      # Current weight dictionary
            get_edge_weight,        # Edge weight function
            edge_backward_factor,   # Backward propagation factor
            incident_topologies[:i+1],  # All cases seen so far (training set)
            init_clue_tag           # Initial clue tag for reference
        )

100%|██████████| 100/100 [00:01<00:00, 54.47it/s]


best trial: FrozenTrial(number=73, state=TrialState.COMPLETE, values=[0.8333333333333334, 1.0], datetime_start=datetime.datetime(2025, 6, 4, 13, 37, 40, 704098), datetime_complete=datetime.datetime(2025, 6, 4, 13, 37, 40, 723261), params={'a:AlertCount': 0.008428739408411101, 'a:OSLinux-OSLinux_LOCALDISK_LOCALDISK-sdb_DSKRead': 0.7767187040175068}, user_attrs={}, system_attrs={}, intermediate_values={}, distributions={'a:AlertCount': FloatDistribution(high=5.0, log=False, low=0.0, step=None), 'a:OSLinux-OSLinux_LOCALDISK_LOCALDISK-sdb_DSKRead': FloatDistribution(high=5.0, log=False, low=0.0, step=None)}, trial_id=73, value=None)
A better solution found
a: {'AlertCount': 0.008428739408411101, 'OSLinux-OSLinux_LOCALDISK_LOCALDISK-sdb_DSKRead': 0.7767187040175068}


100%|██████████| 100/100 [00:02<00:00, 38.68it/s]


best trial: FrozenTrial(number=0, state=TrialState.COMPLETE, values=[0.5882352941176471, 1], datetime_start=datetime.datetime(2025, 6, 4, 13, 37, 41, 330669), datetime_complete=datetime.datetime(2025, 6, 4, 13, 37, 41, 330669), params={'a:AlertCount': 1, 'a:OSLinux-OSLinux_LOCALDISK_LOCALDISK-sdb_DSKRead': 0, 'a:JVM-Operating System_7779_JVM_JVM_CPULoad': 0}, user_attrs={}, system_attrs={}, intermediate_values={}, distributions={'a:AlertCount': FloatDistribution(high=5.0, log=False, low=0.0, step=None), 'a:OSLinux-OSLinux_LOCALDISK_LOCALDISK-sdb_DSKRead': FloatDistribution(high=5.0, log=False, low=0.0, step=None), 'a:JVM-Operating System_7779_JVM_JVM_CPULoad': FloatDistribution(high=5.0, log=False, low=0.0, step=None)}, trial_id=0, value=None)
a: {'AlertCount': 1, 'OSLinux-OSLinux_LOCALDISK_LOCALDISK-sdb_DSKRead': 0, 'JVM-Operating System_7779_JVM_JVM_CPULoad': 0}


100%|██████████| 100/100 [00:03<00:00, 29.63it/s]


best trial: FrozenTrial(number=90, state=TrialState.COMPLETE, values=[0.6111111111111112, 2.746827572678805], datetime_start=datetime.datetime(2025, 6, 4, 13, 37, 46, 870983), datetime_complete=datetime.datetime(2025, 6, 4, 13, 37, 46, 908331), params={'a:AlertCount': 0.5951518262529558, 'a:OSLinux-OSLinux_LOCALDISK_LOCALDISK-sdb_DSKRead': 0.49555306058886855, 'a:JVM-Operating System_7779_JVM_JVM_CPULoad': 0.7443333197061959, 'a:OSLinux-OSLinux_LOCALDISK_LOCALDISK-sda_DSKTps': 0.9117893661307849}, user_attrs={}, system_attrs={}, intermediate_values={}, distributions={'a:AlertCount': FloatDistribution(high=5.0, log=False, low=0.0, step=None), 'a:OSLinux-OSLinux_LOCALDISK_LOCALDISK-sdb_DSKRead': FloatDistribution(high=5.0, log=False, low=0.0, step=None), 'a:JVM-Operating System_7779_JVM_JVM_CPULoad': FloatDistribution(high=5.0, log=False, low=0.0, step=None), 'a:OSLinux-OSLinux_LOCALDISK_LOCALDISK-sda_DSKTps': FloatDistribution(high=5.0, log=False, low=0.0, step=None)}, trial_id=90, valu

100%|██████████| 100/100 [00:04<00:00, 21.49it/s]


best trial: FrozenTrial(number=46, state=TrialState.COMPLETE, values=[0.6666666666666666, 7.693714699433357], datetime_start=datetime.datetime(2025, 6, 4, 13, 37, 49, 252814), datetime_complete=datetime.datetime(2025, 6, 4, 13, 37, 49, 299032), params={'a:AlertCount': 0.6364573260296996, 'a:OSLinux-OSLinux_LOCALDISK_LOCALDISK-sdb_DSKRead': 3.6341649616933878, 'a:JVM-Operating System_7779_JVM_JVM_CPULoad': 1.1799948682528518, 'a:OSLinux-OSLinux_LOCALDISK_LOCALDISK-sda_DSKTps': 0.812065432275479, 'a:OSLinux-OSLinux_LOCALDISK_LOCALDISK-sdb_DSKPercentBusy': 1.431032111181939}, user_attrs={}, system_attrs={}, intermediate_values={}, distributions={'a:AlertCount': FloatDistribution(high=5.0, log=False, low=0.0, step=None), 'a:OSLinux-OSLinux_LOCALDISK_LOCALDISK-sdb_DSKRead': FloatDistribution(high=5.0, log=False, low=0.0, step=None), 'a:JVM-Operating System_7779_JVM_JVM_CPULoad': FloatDistribution(high=5.0, log=False, low=0.0, step=None), 'a:OSLinux-OSLinux_LOCALDISK_LOCALDISK-sda_DSKTps': F

100%|██████████| 100/100 [00:06<00:00, 15.49it/s]


best trial: FrozenTrial(number=68, state=TrialState.COMPLETE, values=[0.71875, 3.985230156021407], datetime_start=datetime.datetime(2025, 6, 4, 13, 37, 56, 226818), datetime_complete=datetime.datetime(2025, 6, 4, 13, 37, 56, 293800), params={'a:AlertCount': 0.547274489769302, 'a:OSLinux-OSLinux_LOCALDISK_LOCALDISK-sdb_DSKRead': 0.382232720797675, 'a:JVM-Operating System_7779_JVM_JVM_CPULoad': 0.600102556289185, 'a:OSLinux-OSLinux_LOCALDISK_LOCALDISK-sda_DSKTps': 0.7614948972779474, 'a:OSLinux-OSLinux_LOCALDISK_LOCALDISK-sdb_DSKPercentBusy': 0.9563042926877249, 'a:ig_post': 0.7378211991995727}, user_attrs={}, system_attrs={}, intermediate_values={}, distributions={'a:AlertCount': FloatDistribution(high=5.0, log=False, low=0.0, step=None), 'a:OSLinux-OSLinux_LOCALDISK_LOCALDISK-sdb_DSKRead': FloatDistribution(high=5.0, log=False, low=0.0, step=None), 'a:JVM-Operating System_7779_JVM_JVM_CPULoad': FloatDistribution(high=5.0, log=False, low=0.0, step=None), 'a:OSLinux-OSLinux_LOCALDISK_LOC

100%|██████████| 100/100 [00:11<00:00,  8.58it/s]


best trial: FrozenTrial(number=1, state=TrialState.COMPLETE, values=[0.8, 3.985230156021407], datetime_start=datetime.datetime(2025, 6, 4, 13, 37, 58, 869131), datetime_complete=datetime.datetime(2025, 6, 4, 13, 37, 58, 869131), params={'a:AlertCount': 0.547274489769302, 'a:OSLinux-OSLinux_LOCALDISK_LOCALDISK-sdb_DSKRead': 0.382232720797675, 'a:JVM-Operating System_7779_JVM_JVM_CPULoad': 0.600102556289185, 'a:OSLinux-OSLinux_LOCALDISK_LOCALDISK-sda_DSKTps': 0.7614948972779474, 'a:OSLinux-OSLinux_LOCALDISK_LOCALDISK-sdb_DSKPercentBusy': 0.9563042926877249, 'a:ig_post': 0.7378211991995727, 'a:OSLinux-OSLinux_LOCALDISK_LOCALDISK-sdb_DSKWrite': 0}, user_attrs={}, system_attrs={}, intermediate_values={}, distributions={'a:AlertCount': FloatDistribution(high=5.0, log=False, low=0.0, step=None), 'a:OSLinux-OSLinux_LOCALDISK_LOCALDISK-sdb_DSKRead': FloatDistribution(high=5.0, log=False, low=0.0, step=None), 'a:JVM-Operating System_7779_JVM_JVM_CPULoad': FloatDistribution(high=5.0, log=False, l

100%|██████████| 100/100 [00:16<00:00,  6.17it/s]


best trial: FrozenTrial(number=1, state=TrialState.COMPLETE, values=[0.8313253012048193, 3.985230156021407], datetime_start=datetime.datetime(2025, 6, 4, 13, 38, 10, 831301), datetime_complete=datetime.datetime(2025, 6, 4, 13, 38, 10, 831301), params={'a:AlertCount': 0.547274489769302, 'a:OSLinux-OSLinux_LOCALDISK_LOCALDISK-sdb_DSKRead': 0.382232720797675, 'a:JVM-Operating System_7779_JVM_JVM_CPULoad': 0.600102556289185, 'a:OSLinux-OSLinux_LOCALDISK_LOCALDISK-sda_DSKTps': 0.7614948972779474, 'a:OSLinux-OSLinux_LOCALDISK_LOCALDISK-sdb_DSKPercentBusy': 0.9563042926877249, 'a:ig_post': 0.7378211991995727, 'a:OSLinux-OSLinux_LOCALDISK_LOCALDISK-sdb_DSKWrite': 0, 'a:severe': 0}, user_attrs={}, system_attrs={}, intermediate_values={}, distributions={'a:AlertCount': FloatDistribution(high=5.0, log=False, low=0.0, step=None), 'a:OSLinux-OSLinux_LOCALDISK_LOCALDISK-sdb_DSKRead': FloatDistribution(high=5.0, log=False, low=0.0, step=None), 'a:JVM-Operating System_7779_JVM_JVM_CPULoad': FloatDistr

100%|██████████| 100/100 [00:18<00:00,  5.44it/s]


best trial: FrozenTrial(number=1, state=TrialState.COMPLETE, values=[0.8314606741573034, 3.985230156021407], datetime_start=datetime.datetime(2025, 6, 4, 13, 38, 27, 358004), datetime_complete=datetime.datetime(2025, 6, 4, 13, 38, 27, 358004), params={'a:AlertCount': 0.547274489769302, 'a:OSLinux-OSLinux_LOCALDISK_LOCALDISK-sdb_DSKRead': 0.382232720797675, 'a:JVM-Operating System_7779_JVM_JVM_CPULoad': 0.600102556289185, 'a:OSLinux-OSLinux_LOCALDISK_LOCALDISK-sda_DSKTps': 0.7614948972779474, 'a:OSLinux-OSLinux_LOCALDISK_LOCALDISK-sdb_DSKPercentBusy': 0.9563042926877249, 'a:ig_post': 0.7378211991995727, 'a:OSLinux-OSLinux_LOCALDISK_LOCALDISK-sdb_DSKWrite': 0, 'a:severe': 0, 'a:OSLinux-OSLinux_LOCALDISK_LOCALDISK-sdb_DSKWTps': 0}, user_attrs={}, system_attrs={}, intermediate_values={}, distributions={'a:AlertCount': FloatDistribution(high=5.0, log=False, low=0.0, step=None), 'a:OSLinux-OSLinux_LOCALDISK_LOCALDISK-sdb_DSKRead': FloatDistribution(high=5.0, log=False, low=0.0, step=None), '

100%|██████████| 100/100 [00:20<00:00,  4.80it/s]


best trial: FrozenTrial(number=82, state=TrialState.COMPLETE, values=[0.8350515463917526, 14.318742164322975], datetime_start=datetime.datetime(2025, 6, 4, 13, 39, 2, 307172), datetime_complete=datetime.datetime(2025, 6, 4, 13, 39, 2, 524650), params={'a:AlertCount': 1.2387638882250087, 'a:OSLinux-OSLinux_LOCALDISK_LOCALDISK-sdb_DSKRead': 0.6055077542892814, 'a:JVM-Operating System_7779_JVM_JVM_CPULoad': 2.0232380163735817, 'a:OSLinux-OSLinux_LOCALDISK_LOCALDISK-sda_DSKTps': 0.3528424159508297, 'a:OSLinux-OSLinux_LOCALDISK_LOCALDISK-sdb_DSKPercentBusy': 1.2951392412742593, 'a:ig_post': 4.224479929559377, 'a:OSLinux-OSLinux_LOCALDISK_LOCALDISK-sdb_DSKWrite': 0.3061519665418823, 'a:severe': 3.9773045895561054, 'a:OSLinux-OSLinux_LOCALDISK_LOCALDISK-sdb_DSKWTps': 0.13916339024266688, 'a:OSLinux-OSLinux_NETWORK_NETWORK_TCP-FIN-WAIT': 0.1561509723099836}, user_attrs={}, system_attrs={}, intermediate_values={}, distributions={'a:AlertCount': FloatDistribution(high=5.0, log=False, low=0.0, st

100%|██████████| 100/100 [00:26<00:00,  3.83it/s]


best trial: FrozenTrial(number=1, state=TrialState.COMPLETE, values=[0.8348623853211009, 14.318742164322975], datetime_start=datetime.datetime(2025, 6, 4, 13, 39, 7, 398491), datetime_complete=datetime.datetime(2025, 6, 4, 13, 39, 7, 398491), params={'a:AlertCount': 1.2387638882250087, 'a:OSLinux-OSLinux_LOCALDISK_LOCALDISK-sdb_DSKRead': 0.6055077542892814, 'a:JVM-Operating System_7779_JVM_JVM_CPULoad': 2.0232380163735817, 'a:OSLinux-OSLinux_LOCALDISK_LOCALDISK-sda_DSKTps': 0.3528424159508297, 'a:OSLinux-OSLinux_LOCALDISK_LOCALDISK-sdb_DSKPercentBusy': 1.2951392412742593, 'a:ig_post': 4.224479929559377, 'a:OSLinux-OSLinux_LOCALDISK_LOCALDISK-sdb_DSKWrite': 0.3061519665418823, 'a:severe': 3.9773045895561054, 'a:OSLinux-OSLinux_LOCALDISK_LOCALDISK-sdb_DSKWTps': 0.13916339024266688, 'a:OSLinux-OSLinux_NETWORK_NETWORK_TCP-FIN-WAIT': 0.1561509723099836}, user_attrs={}, system_attrs={}, intermediate_values={}, distributions={'a:AlertCount': FloatDistribution(high=5.0, log=False, low=0.0, ste

100%|██████████| 100/100 [00:31<00:00,  3.17it/s]

best trial: FrozenTrial(number=22, state=TrialState.COMPLETE, values=[0.8267716535433071, 6.476223985491053], datetime_start=datetime.datetime(2025, 6, 4, 13, 39, 39, 886914), datetime_complete=datetime.datetime(2025, 6, 4, 13, 39, 40, 195793), params={'a:AlertCount': 0.46503665532967553, 'a:OSLinux-OSLinux_LOCALDISK_LOCALDISK-sdb_DSKRead': 0.5924735168814377, 'a:JVM-Operating System_7779_JVM_JVM_CPULoad': 0.5811546871894936, 'a:OSLinux-OSLinux_LOCALDISK_LOCALDISK-sda_DSKTps': 0.7708957831212384, 'a:OSLinux-OSLinux_LOCALDISK_LOCALDISK-sdb_DSKPercentBusy': 0.321977622220258, 'a:ig_post': 0.5657085840642232, 'a:OSLinux-OSLinux_LOCALDISK_LOCALDISK-sdb_DSKWrite': 0.26269341867717433, 'a:severe': 0.054478300562137774, 'a:OSLinux-OSLinux_LOCALDISK_LOCALDISK-sdb_DSKWTps': 1.0106743003797187, 'a:OSLinux-OSLinux_NETWORK_NETWORK_TCP-FIN-WAIT': 0.4011953702888691, 'a:OSLinux-CPU_CPU_CPUUserTime': 1.449935746776827}, user_attrs={}, system_attrs={}, intermediate_values={}, distributions={'a:AlertCo
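
The logs above show each weight `a:<clue>` being drawn from a float range of 0.0 to 5.0, and the progress bars report 100 steps per optimization round. As a rough illustration of that search space, the sketch below runs a plain random search using only the standard library. It is not Optuna's sampler: it optimizes a single hypothetical accuracy objective, whereas the logged trials carry two objective values. The toy cases and helper names are assumptions.

```python
import random


def top_node(case, weights):
    """Return the node with the highest weighted clue sum (illustrative scoring)."""
    scores = {
        node: sum(weights.get(tag, 0.0) * value for tag, value in clues.items())
        for node, clues in case["nodes"].items()
    }
    return max(scores, key=scores.get)


def random_search(tags, objective, n_trials=100, low=0.0, high=5.0, seed=0):
    """Sample each weight uniformly in [low, high] and keep the best-scoring set."""
    rng = random.Random(seed)
    best_weights, best_score = None, float("-inf")
    for _ in range(n_trials):
        weights = {tag: rng.uniform(low, high) for tag in tags}
        score = objective(weights)
        if score > best_score:
            best_weights, best_score = weights, score
    return best_weights, best_score


# Hypothetical toy cases; clue names and values are invented.
toy_cases = [
    {"root_cause": "db", "nodes": {"db": {"AlertCount": 1, "DiskBusy": 9},
                                   "web": {"AlertCount": 3, "DiskBusy": 0}}},
    {"root_cause": "web", "nodes": {"db": {"AlertCount": 1, "DiskBusy": 0},
                                    "web": {"AlertCount": 5, "DiskBusy": 1}}},
]


def toy_accuracy(weights):
    """Fraction of toy cases whose true root cause is the top-scoring node."""
    return sum(top_node(c, weights) == c["root_cause"] for c in toy_cases) / len(toy_cases)


best_weights, best_score = random_search(["AlertCount", "DiskBusy"], toy_accuracy)
print(best_weights, best_score)
```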




## Step 6: Evaluate Continual Learning Results

Mark each case as correct when the ground-truth root cause appears in the predicted candidate set. The accuracy metrics in the following steps are computed from these flags.

In [6]:
# Mark each case as correct/incorrect based on whether root cause was found
for case in incident_topologies:
    case['right_incremental'] = case['root_cause'] in case['pred_incremental']

## Step 7: Performance Evaluation - Full Dataset

Evaluate the continual learning approach on the entire dataset to get overall performance metrics.

In [7]:
# Evaluate performance on the entire dataset (from beginning)
test_target = 'incremental'
begin_to_test_ratio = 0  # Start from the beginning (0% offset)

# Calculate accuracy statistics
summary = pd.Series([
    case['right_' + test_target] 
    for case in incident_topologies[int(begin_to_test_ratio * len(incident_topologies)):]
]).value_counts()

# Calculate A@1 (Accuracy at rank 1) metric
summary['A@1'] = summary[True] / (summary[True] + summary[False])
print(summary)

True     105.000000
False     28.000000
A@1        0.789474
Name: count, dtype: float64
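
The same calculation is repeated in the following steps with different start ratios, so it can be wrapped in a small helper. The sketch below is a plain-Python variant under the assumption that the per-case flags are booleans. The name `a_at_1` is hypothetical. Unlike the `summary[True] / (summary[True] + summary[False])` expression, it does not depend on both `True` and `False` being present in the counts.

```python
def a_at_1(flags, begin_ratio=0.0):
    """Compute hit/miss counts and A@1 over the cases from begin_ratio onward."""
    start = int(begin_ratio * len(flags))
    window = flags[start:]
    hits = sum(bool(flag) for flag in window)
    accuracy = hits / len(window) if window else float("nan")
    return {"True": hits, "False": len(window) - hits, "A@1": accuracy}


# Flags reconstructed from the counts printed above (105 correct, 28 incorrect);
# their order here is arbitrary, which does not matter for begin_ratio=0.
flags = [True] * 105 + [False] * 28
full = a_at_1(flags)
print(full["True"], full["False"], round(full["A@1"], 6))
```

With the notebook's data the call would be `a_at_1([c['right_incremental'] for c in incident_topologies], 0.3)`.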


## Step 8: Performance Evaluation - 30% Split

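Evaluate on the cases from the 30% mark onward (the last 94 of 133). By the time these cases are diagnosed, the parameters may already have been re-tuned on earlier failures, which approximates a model that has seen some training data.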

In [8]:
# Evaluate performance starting from 30% of the dataset
test_target = 'incremental'
begin_to_test_ratio = 0.3  # Skip first 30% (use as implicit training)

# Calculate accuracy on the remaining 70% of data
summary = pd.Series([
    case['right_' + test_target] 
    for case in incident_topologies[int(begin_to_test_ratio * len(incident_topologies)):]
]).value_counts()

summary['A@1'] = summary[True] / (summary[True] + summary[False])
print(summary)

True     81.000000
False    13.000000
A@1       0.861702
Name: count, dtype: float64


## Step 9: Performance Evaluation - 60% Split

Evaluate on the cases from the 60% mark onward (the last 54 of 133) to see how the approach performs once the first 60% of cases have been available for optimization.

In [9]:
# Evaluate performance starting from 60% of the dataset
test_target = 'incremental'
begin_to_test_ratio = 0.6  # Skip first 60% (use as implicit training)

# Calculate accuracy on the remaining 40% of data
summary = pd.Series([
    case['right_' + test_target] 
    for case in incident_topologies[int(begin_to_test_ratio * len(incident_topologies)):]
]).value_counts()

summary['A@1'] = summary[True] / (summary[True] + summary[False])
print(summary)

True     43.000000
False    11.000000
A@1       0.796296
Name: count, dtype: float64


## Step 10: Baseline Comparison - AlertCount Only

Generate predictions using only the basic AlertCount feature without any optimization for comparison with the continual learning approach.

In [10]:
# Generate baseline predictions using only AlertCount (no optimization)
for case in incident_topologies:
    case['pred_alertcount'] = root_cause_localization(
        case, 
        ['AlertCount'],  # Only use AlertCount feature
        [],              # No edge clues
        None             # No custom weights
    )

## Step 11: Evaluate Baseline Performance

Mark each baseline prediction as correct or incorrect. The accuracy metrics for the AlertCount-only approach are computed from these flags in the following steps.

In [11]:
# Mark baseline predictions as correct/incorrect
for case in incident_topologies:
    case['right_alertcount'] = case['root_cause'] in case['pred_alertcount']

## Step 12: Baseline Performance - Full Dataset

Evaluate baseline AlertCount performance on the entire dataset.

In [12]:
# Evaluate baseline performance on entire dataset
test_target = 'alertcount'
ratio = 0  # Full dataset

summary = pd.Series([
    case['right_' + test_target] 
    for case in incident_topologies[int(ratio * len(incident_topologies)):]
]).value_counts()

summary['A@1'] = summary[True] / (summary[True] + summary[False])
print(summary)

True     103.000000
False     30.000000
A@1        0.774436
Name: count, dtype: float64


## Step 13: Baseline Performance - 30% Split

Evaluate baseline performance starting from 30% of the dataset for fair comparison.

In [13]:
# Evaluate baseline performance from 30% onwards
test_target = 'alertcount'
ratio = 0.3

summary = pd.Series([
    case['right_' + test_target] 
    for case in incident_topologies[int(ratio * len(incident_topologies)):]
]).value_counts()

summary['A@1'] = summary[True] / (summary[True] + summary[False])
print(summary)

True     78.000000
False    16.000000
A@1       0.829787
Name: count, dtype: float64


## Step 14: Baseline Performance - 60% Split

Evaluate baseline performance starting from 60% of the dataset.

In [14]:
# Evaluate baseline performance from 60% onwards
test_target = 'alertcount'
ratio = 0.6

summary = pd.Series([
    case['right_' + test_target] 
    for case in incident_topologies[int(ratio * len(incident_topologies)):]
]).value_counts()

summary['A@1'] = summary[True] / (summary[True] + summary[False])
print(summary)

True     40.000000
False    14.000000
A@1       0.740741
Name: count, dtype: float64
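
To compare the two approaches side by side, the counts printed in Steps 7-9 and 12-14 can be placed in one table. The sketch below hard-codes those reported counts; the `reported` dictionary and the row layout are illustrative choices.

```python
# (correct, incorrect) counts copied from the outputs above.
reported = {
    "0%":  {"incremental": (105, 28), "alertcount": (103, 30)},
    "30%": {"incremental": (81, 13),  "alertcount": (78, 16)},
    "60%": {"incremental": (43, 11),  "alertcount": (40, 14)},
}

rows = []
for start, methods in reported.items():
    acc = {name: hit / (hit + miss) for name, (hit, miss) in methods.items()}
    rows.append((start, acc["incremental"], acc["alertcount"],
                 acc["incremental"] - acc["alertcount"]))

print(f"{'start':>5}  {'incremental':>11}  {'alertcount':>10}  {'gain':>6}")
for start, inc, base, gain in rows:
    print(f"{start:>5}  {inc:>11.4f}  {base:>10.4f}  {gain:>6.4f}")
```

In each window the continual approach gets two or three more cases right than the baseline (105 vs. 103, 81 vs. 78, 43 vs. 40).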


## Step 15: Re-run Continual Learning (Verification)

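Run the continual learning loop a second time as a consistency check, starting from the clue set and weights left by Step 5 rather than from the initial configuration. This pass overwrites `case['pred_incremental']`, so the `right_incremental` flags and the metrics from Steps 6-9 must be recomputed to reflect it.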
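
The loop in this step writes to the same `pred_incremental` key used in Step 5. One option for keeping the first-pass results available for comparison is to copy them to a separate key beforehand. The sketch below is a minimal example on toy data: `snapshot` and the `pred_incremental_pass1` key are hypothetical names, not part of the `incident_diagnosis` module.

```python
import copy


def snapshot(cases, src_key, dst_key):
    """Copy each case's value under src_key to dst_key so a later pass cannot overwrite it."""
    for case in cases:
        if src_key in case:
            case[dst_key] = copy.copy(case[src_key])
    return cases


# Hypothetical toy cases standing in for incident_topologies.
toy_cases = [{"pred_incremental": ["node_a"]}, {"pred_incremental": ["node_b"]}]
snapshot(toy_cases, "pred_incremental", "pred_incremental_pass1")

# A second pass may now overwrite the original key without losing the first result.
toy_cases[0]["pred_incremental"] = ["node_c"]
print(toy_cases[0])
```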

In [15]:
# Re-run the continual learning process for verification
for i, case in enumerate(incident_topologies):
    
    # Generate prediction with current optimized parameters
    case['pred_incremental'] = root_cause_localization(
        case, node_clue_tags, edge_clue_tags, a, get_edge_weight, edge_backward_factor
    )

    # Continue optimization if prediction fails
    if case['root_cause'] not in case['pred_incremental'] and case['pred_incremental'] != 'None':
        node_clue_tags, a = optimize(
            case, node_clue_tags, edge_clue_tags, a, get_edge_weight, 
            edge_backward_factor, incident_topologies[:i+1], init_clue_tag
        )

100%|██████████| 100/100 [00:06<00:00, 14.88it/s]


best trial: FrozenTrial(number=11, state=TrialState.COMPLETE, values=[0.8333333333333334, 6.929056469403173], datetime_start=datetime.datetime(2025, 6, 4, 13, 40, 6, 114160), datetime_complete=datetime.datetime(2025, 6, 4, 13, 40, 6, 171486), params={'a:AlertCount': 0.1941146468820265, 'a:OSLinux-OSLinux_LOCALDISK_LOCALDISK-sdb_DSKRead': 4.996074309726312, 'a:JVM-Operating System_7779_JVM_JVM_CPULoad': 0.09177989693035199, 'a:OSLinux-OSLinux_LOCALDISK_LOCALDISK-sda_DSKTps': 0.029959717654251, 'a:OSLinux-OSLinux_LOCALDISK_LOCALDISK-sdb_DSKPercentBusy': 0.002167078947019063, 'a:ig_post': 1.1434076832621523, 'a:OSLinux-OSLinux_LOCALDISK_LOCALDISK-sdb_DSKWrite': 0.027984140481235542, 'a:severe': 0.1822031374258544, 'a:OSLinux-OSLinux_LOCALDISK_LOCALDISK-sdb_DSKWTps': 0.2210645356266649, 'a:OSLinux-OSLinux_NETWORK_NETWORK_TCP-FIN-WAIT': 0.01576563435537286, 'a:OSLinux-CPU_CPU_CPUUserTime': 0.024535688111932723}, user_attrs={}, system_attrs={}, intermediate_values={}, distributions={'a:Alert

100%|██████████| 100/100 [00:07<00:00, 14.10it/s]


best trial: FrozenTrial(number=20, state=TrialState.COMPLETE, values=[0.6470588235294118, 22.64554775172257], datetime_start=datetime.datetime(2025, 6, 4, 13, 40, 13, 466630), datetime_complete=datetime.datetime(2025, 6, 4, 13, 40, 13, 527396), params={'a:AlertCount': 0.5402965530877424, 'a:OSLinux-OSLinux_LOCALDISK_LOCALDISK-sdb_DSKRead': 4.071782573281549, 'a:JVM-Operating System_7779_JVM_JVM_CPULoad': 1.3281555274581707, 'a:OSLinux-OSLinux_LOCALDISK_LOCALDISK-sda_DSKTps': 2.009952100742306, 'a:OSLinux-OSLinux_LOCALDISK_LOCALDISK-sdb_DSKPercentBusy': 2.5467403260149055, 'a:ig_post': 1.8614974235566586, 'a:OSLinux-OSLinux_LOCALDISK_LOCALDISK-sdb_DSKWrite': 1.283227140193638, 'a:severe': 4.779648946725299, 'a:OSLinux-OSLinux_LOCALDISK_LOCALDISK-sdb_DSKWTps': 1.4533671689664385, 'a:OSLinux-OSLinux_NETWORK_NETWORK_TCP-FIN-WAIT': 1.5217339376978345, 'a:OSLinux-CPU_CPU_CPUUserTime': 1.249146053998031}, user_attrs={}, system_attrs={}, intermediate_values={}, distributions={'a:AlertCount': F

100%|██████████| 100/100 [00:08<00:00, 12.26it/s]


best trial: FrozenTrial(number=77, state=TrialState.COMPLETE, values=[0.6666666666666666, 18.244953485509992], datetime_start=datetime.datetime(2025, 6, 4, 13, 40, 25, 681339), datetime_complete=datetime.datetime(2025, 6, 4, 13, 40, 25, 771321), params={'a:AlertCount': 0.3627077272754309, 'a:OSLinux-OSLinux_LOCALDISK_LOCALDISK-sdb_DSKRead': 0.5961264601175463, 'a:JVM-Operating System_7779_JVM_JVM_CPULoad': 0.03687263357559081, 'a:OSLinux-OSLinux_LOCALDISK_LOCALDISK-sda_DSKTps': 2.5075072422414633, 'a:OSLinux-OSLinux_LOCALDISK_LOCALDISK-sdb_DSKPercentBusy': 4.66535440609369, 'a:ig_post': 1.6994206383416026, 'a:OSLinux-OSLinux_LOCALDISK_LOCALDISK-sdb_DSKWrite': 0.5114719368264726, 'a:severe': 0.9467859166976407, 'a:OSLinux-OSLinux_LOCALDISK_LOCALDISK-sdb_DSKWTps': 0.43096539990931737, 'a:OSLinux-OSLinux_NETWORK_NETWORK_TCP-FIN-WAIT': 3.1209400865636776, 'a:OSLinux-CPU_CPU_CPUUserTime': 0.41139721430588294, 'a:JVM-Operating System_7778_JVM_JVM_CPULoad': 2.955403823561677}, user_attrs={}, 

100%|██████████| 100/100 [00:10<00:00,  9.99it/s]


best trial: FrozenTrial(number=51, state=TrialState.COMPLETE, values=[0.7096774193548387, 8.870317900616229], datetime_start=datetime.datetime(2025, 6, 4, 13, 40, 32, 543526), datetime_complete=datetime.datetime(2025, 6, 4, 13, 40, 32, 646117), params={'a:AlertCount': 0.97225288482687, 'a:OSLinux-OSLinux_LOCALDISK_LOCALDISK-sdb_DSKRead': 0.5418739000332797, 'a:JVM-Operating System_7779_JVM_JVM_CPULoad': 0.1388234993280597, 'a:OSLinux-OSLinux_LOCALDISK_LOCALDISK-sda_DSKTps': 0.40505438315424636, 'a:OSLinux-OSLinux_LOCALDISK_LOCALDISK-sdb_DSKPercentBusy': 2.6031381690247732, 'a:ig_post': 1.5217405364458585, 'a:OSLinux-OSLinux_LOCALDISK_LOCALDISK-sdb_DSKWrite': 0.5329269064488631, 'a:severe': 0.9248021844760329, 'a:OSLinux-OSLinux_LOCALDISK_LOCALDISK-sdb_DSKWTps': 0.5142754205223579, 'a:OSLinux-OSLinux_NETWORK_NETWORK_TCP-FIN-WAIT': 0.2044807313956351, 'a:OSLinux-CPU_CPU_CPUUserTime': 0.5090735633986255, 'a:JVM-Operating System_7778_JVM_JVM_CPULoad': 0.001875721561622079}, user_attrs={}, 

100%|██████████| 100/100 [00:15<00:00,  6.37it/s]


best trial: FrozenTrial(number=1, state=TrialState.COMPLETE, values=[0.8, 8.870317900616229], datetime_start=datetime.datetime(2025, 6, 4, 13, 40, 38, 488234), datetime_complete=datetime.datetime(2025, 6, 4, 13, 40, 38, 488234), params={'a:AlertCount': 0.97225288482687, 'a:OSLinux-OSLinux_LOCALDISK_LOCALDISK-sdb_DSKRead': 0.5418739000332797, 'a:JVM-Operating System_7779_JVM_JVM_CPULoad': 0.1388234993280597, 'a:OSLinux-OSLinux_LOCALDISK_LOCALDISK-sda_DSKTps': 0.40505438315424636, 'a:OSLinux-OSLinux_LOCALDISK_LOCALDISK-sdb_DSKPercentBusy': 2.6031381690247732, 'a:ig_post': 1.5217405364458585, 'a:OSLinux-OSLinux_LOCALDISK_LOCALDISK-sdb_DSKWrite': 0.5329269064488631, 'a:severe': 0.9248021844760329, 'a:OSLinux-OSLinux_LOCALDISK_LOCALDISK-sdb_DSKWTps': 0.5142754205223579, 'a:OSLinux-OSLinux_NETWORK_NETWORK_TCP-FIN-WAIT': 0.2044807313956351, 'a:OSLinux-CPU_CPU_CPUUserTime': 0.5090735633986255, 'a:JVM-Operating System_7778_JVM_JVM_CPULoad': 0.001875721561622079}, user_attrs={}, system_attrs={},

100%|██████████| 100/100 [00:20<00:00,  4.93it/s]


best trial: FrozenTrial(number=83, state=TrialState.COMPLETE, values=[0.8433734939759037, 17.979902453976468], datetime_start=datetime.datetime(2025, 6, 4, 13, 41, 10, 714939), datetime_complete=datetime.datetime(2025, 6, 4, 13, 41, 10, 943206), params={'a:AlertCount': 1.2027918732185863, 'a:OSLinux-OSLinux_LOCALDISK_LOCALDISK-sdb_DSKRead': 0.0010881116699530402, 'a:JVM-Operating System_7779_JVM_JVM_CPULoad': 0.008721296832074933, 'a:OSLinux-OSLinux_LOCALDISK_LOCALDISK-sda_DSKTps': 1.8447023044687707, 'a:OSLinux-OSLinux_LOCALDISK_LOCALDISK-sdb_DSKPercentBusy': 2.6419308086244744, 'a:ig_post': 4.327002538170563, 'a:OSLinux-OSLinux_LOCALDISK_LOCALDISK-sdb_DSKWrite': 0.28147874899070446, 'a:severe': 2.754783433455171, 'a:OSLinux-OSLinux_LOCALDISK_LOCALDISK-sdb_DSKWTps': 0.21579755962023212, 'a:OSLinux-OSLinux_NETWORK_NETWORK_TCP-FIN-WAIT': 1.1488296145978938, 'a:OSLinux-CPU_CPU_CPUUserTime': 0.2870202009540499, 'a:JVM-Operating System_7778_JVM_JVM_CPULoad': 3.2657559633739974}, user_attrs

100%|██████████| 100/100 [00:22<00:00,  4.45it/s]


best trial: FrozenTrial(number=55, state=TrialState.COMPLETE, values=[0.8426966292134831, 15.882964753654706], datetime_start=datetime.datetime(2025, 6, 4, 13, 41, 27, 23865), datetime_complete=datetime.datetime(2025, 6, 4, 13, 41, 27, 247558), params={'a:AlertCount': 0.08644202096690612, 'a:OSLinux-OSLinux_LOCALDISK_LOCALDISK-sdb_DSKRead': 0.022184709854909765, 'a:JVM-Operating System_7779_JVM_JVM_CPULoad': 0.5235356643136586, 'a:OSLinux-OSLinux_LOCALDISK_LOCALDISK-sda_DSKTps': 0.2533902915989855, 'a:OSLinux-OSLinux_LOCALDISK_LOCALDISK-sdb_DSKPercentBusy': 3.0715405481008307, 'a:ig_post': 0.005887414243080959, 'a:OSLinux-OSLinux_LOCALDISK_LOCALDISK-sdb_DSKWrite': 2.9083307067172, 'a:severe': 0.6299287301802545, 'a:OSLinux-OSLinux_LOCALDISK_LOCALDISK-sdb_DSKWTps': 0.7346712073816575, 'a:OSLinux-OSLinux_NETWORK_NETWORK_TCP-FIN-WAIT': 0.6437507193380776, 'a:OSLinux-CPU_CPU_CPUUserTime': 3.2361289473941923, 'a:JVM-Operating System_7778_JVM_JVM_CPULoad': 3.7671737935649525}, user_attrs={},

100%|██████████| 100/100 [00:25<00:00,  3.90it/s]


best trial: FrozenTrial(number=1, state=TrialState.COMPLETE, values=[0.8365384615384616, 15.882964753654706], datetime_start=datetime.datetime(2025, 6, 4, 13, 41, 38, 132749), datetime_complete=datetime.datetime(2025, 6, 4, 13, 41, 38, 132749), params={'a:AlertCount': 0.08644202096690612, 'a:OSLinux-OSLinux_LOCALDISK_LOCALDISK-sdb_DSKRead': 0.022184709854909765, 'a:JVM-Operating System_7779_JVM_JVM_CPULoad': 0.5235356643136586, 'a:OSLinux-OSLinux_LOCALDISK_LOCALDISK-sda_DSKTps': 0.2533902915989855, 'a:OSLinux-OSLinux_LOCALDISK_LOCALDISK-sdb_DSKPercentBusy': 3.0715405481008307, 'a:ig_post': 0.005887414243080959, 'a:OSLinux-OSLinux_LOCALDISK_LOCALDISK-sdb_DSKWrite': 2.9083307067172, 'a:severe': 0.6299287301802545, 'a:OSLinux-OSLinux_LOCALDISK_LOCALDISK-sdb_DSKWTps': 0.7346712073816575, 'a:OSLinux-OSLinux_NETWORK_NETWORK_TCP-FIN-WAIT': 0.6437507193380776, 'a:OSLinux-CPU_CPU_CPUUserTime': 3.2361289473941923, 'a:JVM-Operating System_7778_JVM_JVM_CPULoad': 3.7671737935649525, 'a:OSLinux-OSLi

100%|██████████| 100/100 [00:28<00:00,  3.56it/s]

best trial: FrozenTrial(number=60, state=TrialState.COMPLETE, values=[0.8389830508474576, 23.65685878767432], datetime_start=datetime.datetime(2025, 6, 4, 13, 42, 19, 976995), datetime_complete=datetime.datetime(2025, 6, 4, 13, 42, 20, 263268), params={'a:AlertCount': 0.398412682827161, 'a:OSLinux-OSLinux_LOCALDISK_LOCALDISK-sdb_DSKRead': 2.4211374104291368, 'a:JVM-Operating System_7779_JVM_JVM_CPULoad': 2.522823815616296, 'a:OSLinux-OSLinux_LOCALDISK_LOCALDISK-sda_DSKTps': 0.16759971028671505, 'a:OSLinux-OSLinux_LOCALDISK_LOCALDISK-sdb_DSKPercentBusy': 1.8085099743373978, 'a:ig_post': 2.530527985834182, 'a:OSLinux-OSLinux_LOCALDISK_LOCALDISK-sdb_DSKWrite': 0.4300935358474801, 'a:severe': 0.19140970040081473, 'a:OSLinux-OSLinux_LOCALDISK_LOCALDISK-sdb_DSKWTps': 0.304604416673523, 'a:OSLinux-OSLinux_NETWORK_NETWORK_TCP-FIN-WAIT': 1.1996827228964368, 'a:OSLinux-CPU_CPU_CPUUserTime': 3.8260853872781144, 'a:JVM-Operating System_7778_JVM_JVM_CPULoad': 4.97254344274366, 'a:OSLinux-OSLinux_FI




## Step 16: Explanation Analysis (Optional)

Analyze the explanation power of different features for a specific case to understand what drives the diagnosis decisions.

In [16]:
# Analyze explanation power for the first case
# This shows which features contribute most to the root cause prediction
case = incident_topologies[0]
sorted_refined_explanation_power = explain(case, 'root_cause')
print(sorted_refined_explanation_power)

