# Data Labelling for Incident Detection

This notebook applies labeling rules to classify the extracted topologies as incidents or non-incidents. The labeling process uses three main rules:

1. **Rule 1**: Match topologies with known injected faults and platform faults
2. **Rule 2**: Apply business logic (services with degradation policies)
3. **Rule 3**: Filter out low-impact performance issues (CPU or latency faults with low failure counts, as typically seen during low-traffic periods)

**Label Meanings:**
- `1`: True incident (requires attention)
- `0`: Normal operation or handled gracefully
- `-1`: Undetermined (inside a known fault window but not selected as the incident)
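
The three label values can be given names so that later cells are easier to read. This is a minimal sketch; the constant names and the `describe_labels` helper are hypothetical and are not used elsewhere in this notebook.

```python
from collections import Counter

# Hypothetical names for the label values used throughout this notebook
LABEL_INCIDENT = 1        # true incident (requires attention)
LABEL_NORMAL = 0          # normal operation or handled gracefully
LABEL_UNDETERMINED = -1   # could not be decided by the rules

LABEL_NAMES = {
    LABEL_INCIDENT: "incident",
    LABEL_NORMAL: "normal",
    LABEL_UNDETERMINED: "undetermined",
}

def describe_labels(labels):
    """Count labels by their human-readable name."""
    return dict(Counter(LABEL_NAMES[y] for y in labels))

summary = describe_labels([1, 0, 0, -1])
```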

## Step 1: Import Required Libraries

Import libraries for data processing, clustering, visualization, and file handling.

In [1]:
# Data processing and analysis
import pandas as pd
import numpy as np

# Clustering for anomaly grouping
from sklearn.cluster import DBSCAN

# Visualization and graph analysis
import matplotlib.pyplot as plt
import networkx as nx

# Data serialization
import pickle
import json

## Step 2: Load Raw Topologies

Load the topology data extracted from the previous anomaly detection step. These topologies represent potential incidents that need to be labeled.

In [2]:
# Load the raw topologies extracted from anomaly detection
# Each topology represents a potential incident with its impact structure
with open('../data/raw_topoloies.pkl', 'rb') as f:
    Topologies = pickle.load(f)

## Step 3: Verify Data Loading

Check the number of topologies loaded to ensure data integrity.

In [3]:
# Verify the number of topologies loaded
# This should match the output from the anomaly detection step
len(Topologies)

26132
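
Beyond the count, it can help to confirm that every topology carries the fields the labeling cells read. The sketch below assumes each topology is a dict and that the required keys are `time`, `TimeStamp`, `MaxFail`, and `nodes` (inferred from how later cells index into the topologies); `find_malformed` and the toy data are hypothetical.

```python
# Keys that later cells read from each topology (inferred from this notebook)
REQUIRED_KEYS = {"time", "TimeStamp", "MaxFail", "nodes"}

def find_malformed(topologies, required=REQUIRED_KEYS):
    """Return (index, missing_keys) pairs for topologies lacking required keys."""
    problems = []
    for i, topo in enumerate(topologies):
        missing = required - set(topo)
        if missing:
            problems.append((i, sorted(missing)))
    return problems

# Toy data standing in for the real pickle contents
sample = [
    {"time": 0, "TimeStamp": "2024-01-01 00:00:00", "MaxFail": 3, "nodes": ["frontend"]},
    {"time": 1, "MaxFail": 5},  # deliberately incomplete
]
malformed = find_malformed(sample)
```

On the real data, `find_malformed(Topologies)` returning an empty list would indicate that all topologies have the expected fields.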

# LABEL RULE 1: Match with Known Faults

This rule matches topologies with known injected faults and platform faults to create ground truth labels.

## Step 4: Process Injected Faults

Load and process the injected fault data to identify which topologies correspond to real incidents. Injected faults are controlled experiments where faults were deliberately introduced.

In [4]:
# Load the injected fault data
# This contains information about deliberately introduced faults for testing
inject_df = pd.read_csv('../data/injected_faults.csv')
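
Before iterating over the faults, a quick check that the expected columns are present can catch a renamed or misspelled header early. This is a sketch under the assumption that the labeling loop needs exactly the columns `time`, `inject_type`, and `inject_serive` (spelled as the loop in Step 6 reads it); `missing_columns` is a hypothetical helper.

```python
# Columns the labeling loop reads; 'inject_serive' is spelled as in that loop
EXPECTED_COLUMNS = ["time", "inject_type", "inject_serive"]

def missing_columns(columns, expected=EXPECTED_COLUMNS):
    """Return the expected column names that are absent from `columns`."""
    present = set(columns)
    return [name for name in expected if name not in present]

# Toy header in which the service column has a different spelling
toy_header = ["time", "inject_type", "inject_service"]
absent = missing_columns(toy_header)
```

With the real DataFrame this would be called as `missing_columns(inject_df.columns)`.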

## Step 5: Initialize Labels

Initialize all topologies with label 0 (normal operation). Labels will be updated based on the rules.

In [5]:
# Initialize all topologies with label 0 (normal/non-incident)
# Labels will be updated based on the labeling rules
for topo in Topologies:
    topo['y'] = 0

## Step 6: Apply Injected Fault Labels

For each injected fault, find the corresponding topologies and label them:
- Select topologies that occur within the 15 minutes after fault injection (injection time inclusive, end of the window exclusive)
- Label the topology with the highest `MaxFail` that contains the injected service as an incident (y=1)
- Label the other topologies in the window as undetermined (y=-1)

**Special handling for 'excessive flow' faults**: These affect the entire system, so the injected-service check is skipped and the root cause is recorded as `'All'`.

In [6]:
# Process each injected fault to label corresponding topologies
for index, inject_item in inject_df.iterrows():
    # Find topologies that occur within 15 minutes after fault injection
    # This time window captures the fault's impact period
    corresponding_topo_i_s = [
        i for i, topo in enumerate(Topologies)
        if inject_item['time'] <= topo['time'] < inject_item['time'] + 15
    ]

    # 'excessive flow' faults have system-wide impact, so every topology in
    # the window is a candidate; other faults only match topologies that
    # contain the injected service
    system_wide = inject_item['inject_type'] == 'excessive flow'

    MaxFail, MaxFail_ci = 0, None

    # Find the candidate topology with maximum failure count (most severe impact)
    for ci in corresponding_topo_i_s:
        if (Topologies[ci]['MaxFail'] > MaxFail and
            (system_wide or inject_item['inject_serive'] in Topologies[ci]['nodes'])):
            MaxFail_ci = ci
            MaxFail = Topologies[ci]['MaxFail']
        # Mark all related topologies as undetermined first
        Topologies[ci]['y'] = -1

    # Label the most severe candidate as a true incident
    if MaxFail_ci is not None:
        Topologies[MaxFail_ci]['y'] = 1
        Topologies[MaxFail_ci]['root_cause'] = (
            'All' if system_wide else inject_item['inject_serive']
        )
        Topologies[MaxFail_ci]['root_cause_type'] = inject_item['inject_type']
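
The window-and-maximum logic of this step can be tried on a handful of toy topologies. The sketch below is illustrative only: `label_window` is a hypothetical helper, the toy records are made up, and passing `service=None` stands in for the system-wide 'excessive flow' case.

```python
def label_window(topologies, start, width, service=None):
    """Label topologies with start <= time < start + width.

    Every topology in the window becomes -1; the one with the highest
    MaxFail (restricted to those containing `service`, if given) becomes 1.
    Returns the index of the topology labeled 1, or None.
    """
    best_fail, best_i = 0, None
    for i, topo in enumerate(topologies):
        if not (start <= topo["time"] < start + width):
            continue
        relevant = service is None or service in topo["nodes"]
        if relevant and topo["MaxFail"] > best_fail:
            best_fail, best_i = topo["MaxFail"], i
        topo["y"] = -1
    if best_i is not None:
        topologies[best_i]["y"] = 1
    return best_i

# Made-up topologies: three inside the window [100, 115), one outside
toy = [
    {"time": 100, "MaxFail": 5, "nodes": ["frontend", "cartservice"], "y": 0},
    {"time": 105, "MaxFail": 9, "nodes": ["frontend"], "y": 0},
    {"time": 110, "MaxFail": 7, "nodes": ["cartservice"], "y": 0},
    {"time": 200, "MaxFail": 50, "nodes": ["cartservice"], "y": 0},
]
winner = label_window(toy, start=100, width=15, service="cartservice")
labels = [t["y"] for t in toy]
```

Here the topology at index 1 has the highest failure count in the window but does not contain `cartservice`, so index 2 is labeled as the incident and the other two in-window topologies become undetermined.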

## Step 7: Process Platform Faults

Platform faults are infrastructure-level issues that affect service operation. These are processed similarly to injected faults but use timestamp ranges instead of fixed time windows.

In [7]:
# Load platform fault data
# These are infrastructure-level faults that affect service operation
platform_fault_df = pd.read_csv('../data/platform_faults.csv')

## Step 8: Apply Platform Fault Labels

Process platform faults using their begin and end timestamps to find affected topologies.

In [8]:
# Process each platform fault
for index, platform_fault in platform_fault_df.iterrows():
    # Parse the fault window once instead of once per topology
    begin = pd.to_datetime(platform_fault['BeginTimeStamp'])
    end = pd.to_datetime(platform_fault['EndTimeStamp'])

    # Find topologies within the platform fault time window (both ends inclusive)
    corresponding_topo_i_s = [
        i for i, topo in enumerate(Topologies)
        if begin <= pd.to_datetime(topo['TimeStamp']) <= end
    ]

    MaxFail, MaxFail_ci = 0, None

    # Find the most severe topology involving the affected service
    for ci in corresponding_topo_i_s:
        # Only consider topologies that:
        # 1. Involve the affected service
        # 2. Haven't already been labeled as incidents (y != 1)
        # 3. Have higher failure count than current maximum
        if (Topologies[ci]['MaxFail'] > MaxFail and
            platform_fault['service'] in Topologies[ci]['nodes'] and
            Topologies[ci]['y'] != 1):
            MaxFail_ci = ci
            MaxFail = Topologies[ci]['MaxFail']

        # Mark as undetermined if not already labeled as incident
        if Topologies[ci]['y'] != 1:
            Topologies[ci]['y'] = -1

    # Label the most severe topology as incident
    if MaxFail_ci is not None:
        Topologies[MaxFail_ci]['y'] = 1
        Topologies[MaxFail_ci]['root_cause'] = platform_fault['service']
        Topologies[MaxFail_ci]['root_cause_type'] = 'platform_fault'
        print(MaxFail_ci)  # Print index for tracking

21304
22570
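
The main difference from the injected-fault step is that incidents labeled earlier are left untouched. The following sketch shows that guard on toy data. It uses the standard library's `datetime.fromisoformat` in place of `pd.to_datetime` to stay self-contained, and assumes ISO-formatted timestamp strings; `apply_platform_fault`, the service names, and the timestamps are all made up for illustration.

```python
from datetime import datetime

def apply_platform_fault(topologies, begin, end, service):
    """Label topologies with begin <= TimeStamp <= end, keeping existing incidents."""
    begin_t, end_t = datetime.fromisoformat(begin), datetime.fromisoformat(end)
    best_fail, best_i = 0, None
    for i, topo in enumerate(topologies):
        if not (begin_t <= datetime.fromisoformat(topo["TimeStamp"]) <= end_t):
            continue
        if topo["y"] == 1:
            continue  # keep incidents labeled by earlier faults
        if service in topo["nodes"] and topo["MaxFail"] > best_fail:
            best_fail, best_i = topo["MaxFail"], i
        topo["y"] = -1
    if best_i is not None:
        topologies[best_i]["y"] = 1
    return best_i

toy = [
    {"TimeStamp": "2024-01-01 10:00:00", "MaxFail": 30, "nodes": ["cartservice"], "y": 1},
    {"TimeStamp": "2024-01-01 10:05:00", "MaxFail": 12, "nodes": ["cartservice", "frontend"], "y": 0},
    {"TimeStamp": "2024-01-01 10:10:00", "MaxFail": 4, "nodes": ["frontend"], "y": 0},
    {"TimeStamp": "2024-01-01 11:00:00", "MaxFail": 99, "nodes": ["cartservice"], "y": 0},
]
chosen = apply_platform_fault(
    toy, "2024-01-01 10:00:00", "2024-01-01 10:10:00", "cartservice"
)
labels = [t["y"] for t in toy]
```

The first topology keeps its existing incident label and is skipped as a candidate, so the second one is chosen even though its failure count is lower.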


# LABEL RULE 2: Business Logic Application

Apply business-specific rules based on system design. Services with degradation policies can handle faults gracefully without affecting core functionality.

## Step 9: Apply Degradation Policy Rule

Services like 'adservice' and 'emailservice' have degradation policies implemented. When these services fail, the system continues to function with reduced features rather than failing completely. Therefore, incidents whose recorded root cause is one of these services are reclassified as non-incidents.

In [9]:
# Apply business logic: services with degradation policies
# These services have fallback mechanisms that prevent system-wide impact
for topo in Topologies:
    # If the root cause is a service with degradation policy,
    # reclassify as non-incident since the system handles it gracefully
    if (topo['y'] == 1 and 
        (topo['root_cause'] == 'adservice' or topo['root_cause'] == 'emailservice')):
        topo['y'] = 0  # Change from incident to normal
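
If more services gain degradation policies, keeping their names in one set makes the rule easier to extend. This is a sketch of that variant, not the notebook's own cell: `DEGRADABLE_SERVICES` and `apply_degradation_rule` are hypothetical names, and `.get` is used so that topologies without a `root_cause` key are skipped rather than raising an error.

```python
# Services named in this notebook as having degradation policies
DEGRADABLE_SERVICES = {"adservice", "emailservice"}

def apply_degradation_rule(topologies, degradable=DEGRADABLE_SERVICES):
    """Relabel incidents rooted in a degradable service as 0; return how many changed."""
    changed = 0
    for topo in topologies:
        # .get avoids a KeyError for topologies that never received a root cause
        if topo["y"] == 1 and topo.get("root_cause") in degradable:
            topo["y"] = 0
            changed += 1
    return changed

toy = [
    {"y": 1, "root_cause": "adservice"},
    {"y": 1, "root_cause": "cartservice"},
    {"y": -1},
    {"y": 0},
]
n_changed = apply_degradation_rule(toy)
labels = [t["y"] for t in toy]
```

Returning the number of relabeled topologies gives a direct measure of the rule's effect without comparing two label distributions.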

## Step 10: Check Label Distribution After Rule 2

Verify how the label distribution changed after applying the degradation policy rule.

In [10]:
# Check the distribution of labels after applying Rule 2
# Incidents reclassified by the degradation policy rule now count toward label 0
pd.Series([topo['y'] for topo in Topologies]).value_counts()

-1    20012
 0     5417
 1      703
Name: count, dtype: int64

# LABEL RULE 3: Filter Low-Impact Issues

Performance issues during low-traffic periods (e.g., 00:00-06:00) often have minimal impact and may not require immediate attention.

## Step 11: Apply Low-Impact Filter

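Reclassify performance incidents (root cause type `cpu` or `latency`) with a failure count below 50 as non-incidents. These issues, while technically anomalous, don't significantly impact user experience. The cell below applies only the failure-count threshold; it does not check the time of day.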

In [11]:
# Filter out low-impact performance issues
# Performance issues with low failure counts during low-traffic periods
# are often not significant enough to be considered incidents
for topo in Topologies:
    # Check if this is a performance-related incident with low impact
    if (topo['y'] == 1 and 
        (topo['root_cause_type'] == 'cpu' or topo['root_cause_type'] == 'latency') and 
        topo['MaxFail'] < 50):  # Threshold for "low impact"
        topo['y'] = 0  # Reclassify as normal operation
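
If the rule should also require that the topology falls in a low-traffic period, a time-of-day check could be combined with the threshold. The sketch below is a hypothetical variant, not what the cell above does: it assumes the 00:00-06:00 window given as an example in this section, assumes `TimeStamp` is an ISO-formatted string, and introduces the helper names `is_low_traffic` and `is_low_impact` for illustration. Applying it would change the label counts reported in this notebook.

```python
from datetime import datetime

LOW_TRAFFIC_START_HOUR = 0   # assumed, from the "00:00-06:00" example
LOW_TRAFFIC_END_HOUR = 6     # assumed, end of window is exclusive
LOW_IMPACT_MAX_FAIL = 50     # same threshold as the notebook's rule
PERFORMANCE_TYPES = {"cpu", "latency"}

def is_low_traffic(timestamp):
    """True if the ISO-formatted timestamp falls in the assumed low-traffic window."""
    hour = datetime.fromisoformat(timestamp).hour
    return LOW_TRAFFIC_START_HOUR <= hour < LOW_TRAFFIC_END_HOUR

def is_low_impact(topo):
    """True for a performance incident with few failures during low traffic."""
    return (
        topo["y"] == 1
        and topo.get("root_cause_type") in PERFORMANCE_TYPES
        and topo["MaxFail"] < LOW_IMPACT_MAX_FAIL
        and is_low_traffic(topo["TimeStamp"])
    )

night = {"y": 1, "root_cause_type": "cpu", "MaxFail": 10, "TimeStamp": "2024-01-01 03:30:00"}
day = {"y": 1, "root_cause_type": "cpu", "MaxFail": 10, "TimeStamp": "2024-01-01 14:00:00"}
severe = {"y": 1, "root_cause_type": "latency", "MaxFail": 80, "TimeStamp": "2024-01-01 03:30:00"}
results = [is_low_impact(t) for t in (night, day, severe)]
```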

## Step 12: Final Label Distribution

Check the final distribution of labels after applying all three labeling rules.

In [12]:
# Display final label distribution
# -1: Undetermined
#  0: Normal operation
#  1: True incidents requiring attention
pd.Series([topo['y'] for topo in Topologies]).value_counts()

-1    20012
 0     5494
 1      626
Name: count, dtype: int64
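
As a consistency check, the two label distributions printed in this notebook can be compared directly. The numbers below are copied from the outputs of Steps 3, 10, and 12; only the variable names are new.

```python
# Label counts copied from the two value_counts() outputs in this notebook
after_rule_2 = {-1: 20012, 0: 5417, 1: 703}
after_rule_3 = {-1: 20012, 0: 5494, 1: 626}
total_topologies = 26132  # from len(Topologies) in Step 3

# Every topology has exactly one label, so both distributions sum to the total
assert sum(after_rule_2.values()) == total_topologies
assert sum(after_rule_3.values()) == total_topologies

# Rule 3 only moves topologies from label 1 to label 0
reclassified = after_rule_2[1] - after_rule_3[1]
assert reclassified == after_rule_3[0] - after_rule_2[0]
print(reclassified)  # 77
```

Rule 3 therefore reclassified 77 incidents as normal operation and left the undetermined count unchanged.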

## Step 13: Save Labeled Data

Save the labeled topologies for use in feature engineering and model training. This dataset now contains ground truth labels for supervised learning.

In [13]:
# Save the labeled topologies
# This labeled dataset will be used for:
# 1. Feature engineering
# 2. Training machine learning models
# 3. Evaluating incident detection performance
with open('../data/issue_topoloies.pkl', 'wb') as f:
    pickle.dump(Topologies, f)
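
After saving, reloading the file and comparing it with the in-memory object is a cheap way to confirm the write succeeded. This is a minimal sketch on toy data written to a temporary directory; `save_and_verify` is a hypothetical helper, and the equality check assumes the topologies contain only values that compare cleanly with `==` (plain dicts, lists, numbers, strings).

```python
import os
import pickle
import tempfile

def save_and_verify(obj, path):
    """Pickle `obj` to `path`, reload it, and report whether the copy is equal."""
    with open(path, "wb") as f:
        pickle.dump(obj, f)
    with open(path, "rb") as f:
        reloaded = pickle.load(f)
    return reloaded == obj

# Toy labeled topologies standing in for the real list
toy = [
    {"time": 1, "y": 1, "root_cause": "cartservice"},
    {"time": 2, "y": 0},
]
with tempfile.TemporaryDirectory() as tmp:
    ok = save_and_verify(toy, os.path.join(tmp, "labeled_toy.pkl"))
```

If the topologies hold array-valued fields, the final comparison would need an element-wise check instead of a single `==`.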