# Feature Engineering for Incident Detection

This notebook transforms the labeled topology data into machine learning-ready features. The process includes:

1. **Node Feature Engineering**: Create features for each service using CMDB information
2. **Edge Feature Engineering**: Extract time-series and statistical features from service call relationships
3. **Global Feature Engineering**: Create topology-level features
4. **Data Normalization**: Standardize features for machine learning
5. **Graph Structure Preparation**: Prepare data for graph neural networks

The output will be training and testing datasets ready for incident detection models.

## Step 1: Import Required Libraries

Import libraries for data processing, feature engineering, and machine learning preprocessing.

In [1]:
# Data serialization and file I/O
import pickle

# Data processing and analysis
import pandas as pd
import numpy as np
import math

# Machine learning preprocessing
from sklearn.preprocessing import StandardScaler

# Visualization
import matplotlib.pyplot as plt

## Step 2: Load Labeled Topology Data

Load the labeled topologies from the data labeling step. These contain the ground truth labels and topology structures.

In [2]:
# Load the labeled topology data
# This data contains topologies with ground truth labels (incident vs non-incident)
with open('../data/issue_topoloies.pkl', 'rb') as f:
    Topologies = pickle.load(f)

## Step 3: Filter Valid Cases

Extract only the cases that are labeled as either incidents (y=1) or normal operations (y=0), excluding the anomalies that were marked as undetermined (y=-1).

In [3]:
# Drop cases labeled -1 (anomalies whose status is undetermined)
# We only want cases labeled as 0 (normal) or 1 (incident) for binary classification
all_cases = [case for case in Topologies if case['y'] != -1]
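
Before dropping the undetermined cases, it can help to see how many cases carry each label. The following is a minimal sketch, assuming each case is a dict with a `y` key as in the cell above; `toy_topologies` is made-up stand-in data, not the real dataset.

```python
from collections import Counter

# Hypothetical stand-in for the loaded Topologies list
toy_topologies = [{'y': 1}, {'y': 0}, {'y': -1}, {'y': 0}, {'y': -1}]

# Count cases per label, including the -1 cases that the filter removes
label_counts = Counter(case['y'] for case in toy_topologies)
kept = [case for case in toy_topologies if case['y'] != -1]

print(label_counts)
print(len(toy_topologies) - len(kept))  # number of cases dropped by the filter
```

Running the same count on `Topologies` would show how much of the original data the binary classification task actually uses.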

## Step 4: Verify Dataset Size

Check the number of valid cases for feature engineering.

In [4]:
# Display the number of valid cases for feature engineering
# This should be significantly smaller than the original topology count
len(all_cases)

6120

## Step 5: Create Service CMDB (Configuration Management Database)

Define service metadata that will be used as node features. This includes:
- **Fault Tolerance Type**: How the service handles failures (`retry`, `degrade`, or `no`)
- **Importance Level**: Business criticality of the service
- **Product Name**: Which product/domain the service belongs to
- **Status**: Current operational status
- **Type Name**: Technical category of the service

In [5]:
# Initialize the service metadata dictionary
# This acts as a Configuration Management Database (CMDB) for our services
nodes_cmdb = {}

In [6]:
# Define service characteristics based on system architecture

# Fault tolerance mechanisms implemented by each service
nodes_cmdb['fault_tolerance_type'] = {
    'frontend': 'no',                    # No special fault tolerance
    'cartservice': 'retry',              # Implements retry logic
    'productcatalogservice': 'retry',    # Implements retry logic
    'currencyservice': 'retry',          # Implements retry logic
    'paymentservice': 'retry',           # Implements retry logic
    'shippingservice': 'retry',          # Implements retry logic
    'emailservice': 'degrade',           # Graceful degradation
    'checkoutservice': 'retry',          # Implements retry logic
    'recommendationservice': 'retry',    # Implements retry logic
    'adservice': 'degrade',              # Graceful degradation
    'mysql': 'retry',                    # Database retry mechanisms
    'redis-cart': 'retry'                # Cache retry mechanisms
}

# Business importance level of each service
nodes_cmdb['importance_level'] = {
    'frontend': 'important',             # User-facing interface
    'cartservice': 'important',          # Core shopping functionality
    'productcatalogservice': 'important', # Core product data
    'currencyservice': 'important',      # Financial calculations
    'paymentservice': 'important',       # Payment processing
    'shippingservice': 'important',      # Order fulfillment
    'emailservice': 'ordinary',          # Non-critical notifications
    'checkoutservice': 'important',      # Core purchase flow
    'recommendationservice': 'important', # User experience
    'adservice': 'important',            # Revenue generation
    'mysql': 'important',                # Primary data store
    'redis-cart': 'ordinary'             # Cache layer
}

# Product domain classification
nodes_cmdb['product_name'] = {
    'frontend': 'basic',                 # Basic infrastructure
    'cartservice': 'shopping',           # Shopping domain
    'productcatalogservice': 'basic',    # Basic infrastructure
    'currencyservice': 'shopping',       # Shopping domain
    'paymentservice': 'shopping',        # Shopping domain
    'shippingservice': 'shopping',       # Shopping domain
    'emailservice': 'shopping',          # Shopping domain
    'checkoutservice': 'shopping',       # Shopping domain
    'recommendationservice': 'recommendation', # Recommendation domain
    'adservice': 'ad',                   # Advertisement domain
    'mysql': 'basic',                    # Basic infrastructure
    'redis-cart': 'shopping'             # Shopping domain
}

# Operational status (all services are currently online)
nodes_cmdb['status'] = {service: 'online' for service in nodes_cmdb['fault_tolerance_type'].keys()}

# Technical service type classification
nodes_cmdb['type_name'] = {
    'frontend': 'httpapi',               # HTTP API gateway
    'cartservice': 'appsvr',             # Application server
    'productcatalogservice': 'appsvr',   # Application server
    'currencyservice': 'appsvr',         # Application server
    'paymentservice': 'appsvr',          # Application server
    'shippingservice': 'appsvr',         # Application server
    'emailservice': 'appsvr',            # Application server
    'checkoutservice': 'appsvr',         # Application server
    'recommendationservice': 'appsvr',   # Application server
    'adservice': 'appsvr',               # Application server
    'mysql': 'mysql',                    # Database server
    'redis-cart': 'cache'                # Cache server
}

## Step 6: Convert CMDB to DataFrame and One-Hot Encode

Transform the service metadata into a structured format and apply one-hot encoding to categorical variables for machine learning compatibility.

In [7]:
# Convert the CMDB dictionary to a pandas DataFrame
# Each row represents a service, each column represents a characteristic
nodes_cmdb = pd.DataFrame(nodes_cmdb)

In [8]:
# Apply one-hot encoding to categorical variables
# This converts categorical features into binary vectors for ML algorithms
# Each unique category becomes a separate binary feature
nodes_cmdb = pd.get_dummies(nodes_cmdb)
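
Step 13 later reads the "important" flag by position (column index 3), so it is worth checking which column ends up where after encoding. The sketch below uses a small made-up CMDB (`toy_cmdb` is hypothetical) to show the column naming and ordering that `pd.get_dummies` produces, and looks the column up by name instead of by position. Passing `dtype=int` is an optional choice made here so the features are 0/1 integers on every pandas version; the notebook itself uses the default.

```python
import pandas as pd

# Hypothetical three-service CMDB with two categorical attributes
toy_cmdb = pd.DataFrame({
    'fault_tolerance_type': {'frontend': 'no', 'cartservice': 'retry', 'emailservice': 'degrade'},
    'importance_level': {'frontend': 'important', 'cartservice': 'important', 'emailservice': 'ordinary'},
})

# One column per (attribute, category) pair, named "<attribute>_<category>"
encoded = pd.get_dummies(toy_cmdb, dtype=int)
print(list(encoded.columns))

# Look up the column position by name rather than hard-coding it
important_col = encoded.columns.get_loc('importance_level_important')
print(important_col)
```

Columns follow the original attribute order, with categories sorted alphabetically inside each attribute, which is why `importance_level_important` lands at index 3 when `fault_tolerance_type` has three categories.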

## Step 7: Create Node and Edge Features for Graph Structure

Transform each topology case into a graph structure with:
- **Node features**: Service characteristics from CMDB
- **Edge indices**: Connections between services
- **Node mapping**: Service name to node index mapping

In [9]:
# Create graph structure for each topology case
for case in all_cases:
    # Initialize mapping from service names to node indices
    nodes_map = {}
    nodes_num = 0
    nodes_feature = []  # Node feature matrix
    edge_index = []     # Edge connectivity list
    
    # Process each edge in the topology
    for edge in case['edges_info']:
        # Add source node if not already present
        if edge['src'] not in nodes_map:
            nodes_map[edge['src']] = nodes_num
            # Get node features from CMDB (one-hot encoded service characteristics)
            nodes_feature.append(list(nodes_cmdb.loc[edge['src']].values))
            nodes_num += 1
        
        # Add destination node if not already present
        if edge['des'] not in nodes_map:
            nodes_map[edge['des']] = nodes_num
            # Get node features from CMDB
            nodes_feature.append(list(nodes_cmdb.loc[edge['des']].values))
            nodes_num += 1
        
        # Add edge connection using node indices
        edge_index.append([nodes_map[edge['src']], nodes_map[edge['des']]])
    
    # Store graph structure in the case
    case['x'] = nodes_feature      # Node feature matrix
    case['edge_index'] = edge_index # Edge connectivity list
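
To make the index assignment concrete, here is a small sketch of the same mapping logic on three made-up edges. The helper name `build_node_index` and the toy edges are illustrative only; node features are left out to keep the focus on the indices.

```python
def build_node_index(edges):
    """Assign node indices in order of first appearance; return (nodes_map, edge_index)."""
    nodes_map = {}
    edge_index = []
    for edge in edges:
        # Source is registered before destination, as in the cell above
        for name in (edge['src'], edge['des']):
            if name not in nodes_map:
                nodes_map[name] = len(nodes_map)
        edge_index.append([nodes_map[edge['src']], nodes_map[edge['des']]])
    return nodes_map, edge_index


toy_edges = [
    {'src': 'frontend', 'des': 'cartservice'},
    {'src': 'frontend', 'des': 'checkoutservice'},
    {'src': 'checkoutservice', 'des': 'cartservice'},
]
nodes_map, edge_index = build_node_index(toy_edges)
print(nodes_map)
print(edge_index)
```

Because indices depend on the order in which services first appear, the same service can receive a different index in different cases; the row order of `case['x']` follows the same per-case numbering.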

## Step 8: Split Data into Training and Testing Sets

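Split the cases by position: the first half becomes the training set and the second half the test set. No shuffling is applied, so the split preserves the order of the cases in the loaded file.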

In [10]:
# Split data into training and testing sets (50-50 split)
# First half for training
train_cases = all_cases[:len(all_cases)//2]

In [11]:
# Second half for testing
test_cases = all_cases[len(all_cases)//2:]
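
Since the split is positional, the two halves may not contain the same proportion of incidents. A quick way to check, sketched here on made-up labels (`positive_share` and `toy_cases` are hypothetical names), is to compare the share of `y == 1` cases in each half.

```python
def positive_share(cases):
    """Fraction of cases labeled as incidents (y == 1); 0.0 for an empty list."""
    if not cases:
        return 0.0
    return sum(1 for case in cases if case['y'] == 1) / len(cases)


toy_cases = [{'y': 1}, {'y': 0}, {'y': 0}, {'y': 0}, {'y': 1}, {'y': 1}]
half = len(toy_cases) // 2
toy_train, toy_test = toy_cases[:half], toy_cases[half:]

train_share = positive_share(toy_train)
test_share = positive_share(toy_test)
print(train_share, test_share)
```

A large gap between the two shares would be worth keeping in mind when reading evaluation metrics.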

## Step 9: Extract Edge Features from Time Series Data

Derive seven features per edge from the time series of each service call relationship. They fall into four groups:

1. **Failure Level Binning** (feature 1): Categorize the latest failure count into severity bins
2. **Temporal Ratios** (features 2-4 and 6): Compare the latest failure count with earlier windows and with yesterday's counts
3. **Workload Ratio** (feature 5): Latest failure count relative to request volume
4. **Trend Analysis** (feature 7): Peak of the three preceding points relative to the latest count

In [12]:
# Extract edge features for training data
for case in train_cases:
    case['edge_fea'] = []
    
    for i, edge_attr in enumerate(case['edges_info']):
        case['edge_fea'].append([])
        
        # Feature 1: Failure Level Binning
        # Categorize the current failure level into severity bins
        final_fail_level = np.max(edge_attr['FailCount'][-1:])
        
        # Binning thresholds are system-specific
        # For larger systems, use higher thresholds (e.g., [5000, 10000, 30000, 70000])
        if final_fail_level < 50:
            case['edge_fea'][i].append(0)                # Low severity
        elif final_fail_level < 100:
            case['edge_fea'][i].append(1e2 - 1)          # Medium-low severity
        elif final_fail_level < 300:
            case['edge_fea'][i].append(1e4 - 1)          # Medium severity
        elif final_fail_level < 700:
            case['edge_fea'][i].append(1e7 - 1)          # High severity
        else:
            case['edge_fea'][i].append(1e10 - 1)         # Critical severity
        
        # Feature 2-4: Temporal Comparison Ratios
        # Compare current failure rate with different historical periods
        points_length = len(edge_attr['FailCount'])
        
        # Current vs early period ratio
        early_period_mean = np.mean(edge_attr['FailCount'][:-points_length//2])
        current_vs_early = (np.max(edge_attr['FailCount'][-1:]) + 1) / (early_period_mean + 1)
        case['edge_fea'][i].append(current_vs_early)
        
        # Current vs middle period ratio
        middle_period_mean = np.mean(edge_attr['FailCount'][-points_length//2:-points_length//4])
        current_vs_middle = (np.max(edge_attr['FailCount'][-1:]) + 1) / (middle_period_mean + 1)
        case['edge_fea'][i].append(current_vs_middle)
        
        # Current vs yesterday ratio (yesterday's peak over the second half of the window)
        yesterday_max = np.max(edge_attr['YesterFailCount'][points_length//2:])
        current_vs_yesterday = (np.max(edge_attr['FailCount'][-1:]) + 1) / (yesterday_max + 1)
        case['edge_fea'][i].append(current_vs_yesterday)
        
        # Feature 5: Failure Rate vs Workload
        # Calculate failure rate relative to request volume
        current_workload = np.array(edge_attr['Workload'])[-1:]
        current_failures = np.array(edge_attr['FailCount'])[-1:]
        failure_rate = np.mean((current_failures + 1) / (current_workload + 1))
        case['edge_fea'][i].append(failure_rate)
        
        # Feature 6: Current vs Yesterday Failure Count
        # Ratio of the latest failure count to yesterday's count at the same index
        current_vs_yesterday_rate = np.mean(
            ((np.array(edge_attr['FailCount']) + 1) / 
            (np.array(edge_attr['YesterFailCount']) + 1))[-1:]
        )
        case['edge_fea'][i].append(current_vs_yesterday_rate)
        
        # Feature 7: Recent Trend Analysis
        # Compare the peak of the three preceding points with the current level
        recent_peak = np.max(np.array(edge_attr['FailCount'])[-4:-1])
        current_level = edge_attr['FailCount'][-1]
        trend_ratio = (recent_peak + 1) / (current_level + 1)
        case['edge_fea'][i].append(trend_ratio)
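
The slice expressions above rely on floor division of a negative number, which is easy to misread. The sketch below applies the same four slices to a list of positions so the covered indices are visible; `windows` is a hypothetical helper, not part of the notebook.

```python
def windows(series):
    """Return the slices used by the edge features, keyed by a descriptive name."""
    n = len(series)
    return {
        'early': series[:-n // 2],          # used for the "early period" mean
        'middle': series[-n // 2:-n // 4],  # used for the "middle period" mean
        'recent': series[-4:-1],            # three points before the latest one
        'current': series[-1:],             # latest point only
    }


even = windows(list(range(8)))
odd = windows(list(range(7)))
print(even)
print(odd)
```

Note that `-n // 2` is evaluated as `(-n) // 2`, so for odd lengths the boundary rounds away from zero (for `n = 7` it is `-4`). With fewer than three points at least one window is empty, and `np.mean` of an empty slice yields NaN, so very short series would need separate handling.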

## Step 10: Extract Edge Features for Test Data

Apply the same feature extraction process to the test dataset.

In [13]:
# Extract edge features for test data using the same logic
# (Same code as training data feature extraction)
for case in test_cases:
    case['edge_fea'] = []
    
    for i, edge_attr in enumerate(case['edges_info']):
        case['edge_fea'].append([])
        
        # Apply the same feature extraction logic as training data
        final_fail_level = np.max(edge_attr['FailCount'][-1:])
        
        if final_fail_level < 50:
            case['edge_fea'][i].append(0)
        elif final_fail_level < 100:
            case['edge_fea'][i].append(1e2 - 1)
        elif final_fail_level < 300:
            case['edge_fea'][i].append(1e4 - 1)
        elif final_fail_level < 700:
            case['edge_fea'][i].append(1e7 - 1)
        else:
            case['edge_fea'][i].append(1e10 - 1)
        
        points_length = len(edge_attr['FailCount'])
        case['edge_fea'][i].append((np.max(edge_attr['FailCount'][-1:]) + 1)/(np.mean(edge_attr['FailCount'][:-points_length//2])+1))
        case['edge_fea'][i].append((np.max(edge_attr['FailCount'][-1:]) + 1)/(np.mean(edge_attr['FailCount'][-points_length//2:-points_length//4])+1))
        case['edge_fea'][i].append((np.max(edge_attr['FailCount'][-1:]) + 1)/(np.max(edge_attr['YesterFailCount'][points_length//2:])+1))
        case['edge_fea'][i].append(np.mean(((np.array(edge_attr['FailCount'])+1)/(np.array(edge_attr['Workload'])+1))[-1:]))
        case['edge_fea'][i].append(np.mean(((np.array(edge_attr['FailCount'])+1)/(np.array(edge_attr['YesterFailCount'])+1))[-1:]))
        case['edge_fea'][i].append((np.max(np.array(edge_attr['FailCount'])[-4:-1])+1)/(edge_attr['FailCount'][-1]+1))
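
Steps 9 and 10 repeat the same logic for two lists of cases. One way to keep them in sync is a single helper applied to both. The sketch below is an assumption about how that could look, not the notebook's own code: `extract_edge_features`, `FAIL_THRESHOLDS`, and `FAIL_BIN_VALUES` are hypothetical names, and it uses plain Python (`bisect`, `statistics.fmean`) in place of the NumPy calls. It expects `FailCount`, `YesterFailCount`, and `Workload` to be equal-length sequences, as the cells above do.

```python
from bisect import bisect_right
from statistics import fmean

# Same thresholds and bin values as the if/elif chain in the cells above
FAIL_THRESHOLDS = [50, 100, 300, 700]
FAIL_BIN_VALUES = [0, 1e2 - 1, 1e4 - 1, 1e7 - 1, 1e10 - 1]


def extract_edge_features(edge_attr):
    """Return the seven raw edge features for one edge."""
    fail = list(edge_attr['FailCount'])
    yester = list(edge_attr['YesterFailCount'])
    workload = list(edge_attr['Workload'])
    n = len(fail)
    current = fail[-1]
    return [
        FAIL_BIN_VALUES[bisect_right(FAIL_THRESHOLDS, current)],  # 1: binned level
        (current + 1) / (fmean(fail[:-n // 2]) + 1),              # 2: vs early mean
        (current + 1) / (fmean(fail[-n // 2:-n // 4]) + 1),       # 3: vs middle mean
        (current + 1) / (max(yester[n // 2:]) + 1),               # 4: vs yesterday's peak
        (current + 1) / (workload[-1] + 1),                       # 5: vs workload
        (current + 1) / (yester[-1] + 1),                         # 6: vs yesterday, same index
        (max(fail[-4:-1]) + 1) / (current + 1),                   # 7: recent peak vs current
    ]


toy_edge = {
    'FailCount': [0, 0, 0, 0, 2, 2, 4, 40],
    'YesterFailCount': [0, 0, 0, 0, 1, 3, 1, 1],
    'Workload': [100, 100, 100, 100, 100, 100, 100, 409],
}
features = extract_edge_features(toy_edge)
print(features)
```

With such a helper, both loops would reduce to `case['edge_fea'] = [extract_edge_features(e) for e in case['edges_info']]`, and a threshold change would only need to be made once.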

## Step 11: Apply Logarithmic Transformation

Apply log transformation to edge features to handle large value ranges and improve model stability.

In [14]:
# Apply logarithmic transformation to training edge features
# This helps normalize large value ranges and improves model convergence
for case in train_cases:
    for i, edge_fea in enumerate(case['edge_fea']):
        # Apply log base 10 transformation (add 1 to avoid log(0))
        case['edge_fea'][i] = [math.log(f + 1, 10) for f in edge_fea]

In [15]:
# Apply logarithmic transformation to test edge features
for case in test_cases:
    for i, edge_fea in enumerate(case['edge_fea']):
        case['edge_fea'][i] = [math.log(f + 1, 10) for f in edge_fea]
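
The transformation also explains the unusual bin values chosen in Step 9: each is one less than a power of ten, so after `log10(f + 1)` the bins land on small round numbers. A short check, using only the values that appear in the cells above:

```python
import math

# Bin values assigned by the failure-level feature in Step 9
bin_values = [0, 1e2 - 1, 1e4 - 1, 1e7 - 1, 1e10 - 1]

# Same transformation as the cells above
logged = [math.log(v + 1, 10) for v in bin_values]
print([round(v, 6) for v in logged])  # [0.0, 2.0, 4.0, 7.0, 10.0]

# A ratio feature of exactly 1 (current equals baseline) maps to log10(2)
print(round(math.log(1 + 1, 10), 3))  # 0.301
```

The binned feature therefore spans roughly 0 to 10 after the transform, with uneven gaps between severity levels.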

## Step 12: Standardize Edge Features

Apply standardization (z-score normalization) to edge features using training data statistics.

In [16]:
# Collect all edge features from training data for standardization
# This ensures we use only training data statistics for normalization
all_fea_for_norm = []
for case in train_cases:
    for i, edge_fea in enumerate(case['edge_fea']):
        all_fea_for_norm.append(edge_fea)

In [17]:
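# Fit StandardScaler on training data
# This calculates mean and standard deviation for each feature
# fit_transform also returns the standardized training matrix (shown below);
# only the fitted scaler `ss` is reused in the next cells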
ss = StandardScaler()
ss.fit_transform(all_fea_for_norm)

array([[-0.42218303, -0.3151539 , -0.45919992, ..., -0.49623995,
        -0.10882742, -0.1122749 ],
       [-0.42218303, -0.57792549, -0.68802945, ..., -0.50003458,
        -0.43365717,  1.02474846],
       [-0.42218303, -0.32928518, -0.68802945, ..., -0.50130648,
        -0.43365717,  0.51993701],
       ...,
       [-0.42218303, -0.44279358, -0.52983737, ..., -0.46424418,
        -0.58531977,  0.29108647],
       [-0.42218303, -0.44430415, -0.51369982, ..., -0.4926088 ,
        -0.52997535, -0.9963308 ],
       [-0.42218303, -0.78025256, -0.85850706, ..., -0.48451811,
        -0.8465918 ,  2.79943788]])

In [18]:
# Apply standardization to training edge features
for case in train_cases:
    case['edge_fea'] = ss.transform(case['edge_fea'])

In [19]:
# Apply the same standardization to test edge features
# Important: Use training data statistics, not test data statistics
for case in test_cases:
    case['edge_fea'] = ss.transform(case['edge_fea'])
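
For readers who want to see the arithmetic, the sketch below reproduces what `StandardScaler` does with its default settings, using NumPy only: per-column mean and population standard deviation (`ddof=0`) are computed from the training rows and then applied unchanged to the test rows. The arrays are made-up numbers, and the sketch ignores the scaler's special handling of zero-variance columns.

```python
import numpy as np

# Made-up feature matrices: rows are edges, columns are features
train = np.array([[0.0, 1.0], [2.0, 3.0], [4.0, 5.0]])
test = np.array([[2.0, 3.0], [6.0, 7.0]])

# Statistics come from the training data only
mean = train.mean(axis=0)
std = train.std(axis=0)  # population standard deviation (ddof=0)

train_z = (train - mean) / std
test_z = (test - mean) / std  # test rows reuse the training mean and std

print(train_z)
print(test_z)
```

The standardized training columns have mean 0 and standard deviation 1 by construction; the test columns generally do not, which is expected.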

## Step 13: Create Global Topology Features

Extract topology-level features that characterize the overall structure and severity of each topology:

1. **Number of Services**: Topology size (log-transformed)
2. **Number of Relationships**: Topology complexity (log-transformed)
3. **Important Services Count**: Number of business-critical services involved
4. **Overall Severity Level**: Binned total failure count across the topology (appended in Step 14, after the first three features are standardized)

In [20]:
# Extract global features for training data
for case in train_cases:
    # Feature 1: Log of number of services in topology
    num_services = math.log(len(case['x']) + 1, 10)
    
    # Feature 2: Log of number of calling relationships
    num_relationships = math.log(len(case['edge_index']) + 1, 10)
    
    # Feature 3: Number of important services involved
    # Index 3 corresponds to 'importance_level_important' from one-hot encoding
    num_important_services = np.sum([node_x[3] for node_x in case['x']])
    
    case['global_fea'] = [num_services, num_relationships, num_important_services]

# Standardize global features using training data
global_fea_for_norm = [case['global_fea'] for case in train_cases]
ss_global = StandardScaler()
norm_rst = ss_global.fit_transform(global_fea_for_norm)

# Apply standardized global features to training data
for i, case in enumerate(train_cases):
    case['global_fea'] = norm_rst[i]

In [21]:
# Extract and standardize global features for test data
for case in test_cases:
    # Apply same feature extraction logic
    num_services = math.log(len(case['x']) + 1, 10)
    num_relationships = math.log(len(case['edge_index']) + 1, 10)
    num_important_services = np.sum([node_x[3] for node_x in case['x']])
    
    case['global_fea'] = [num_services, num_relationships, num_important_services]

# Apply training data standardization to test data
# transform() returns a (1, 3) array; take row 0 so the shape matches the training features
for case in test_cases:
    case['global_fea'] = ss_global.transform(np.array(case['global_fea']).reshape(1, -1))[0]
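
As a worked example of the three raw global features before standardization, the sketch below computes them for a made-up case with three services and two calls. `toy_case` and `IMPORTANT_COL` are illustrative; the column position 3 is the one the cells above use for `importance_level_important`.

```python
import math

# Made-up case: three nodes with five one-hot columns each, two edges
toy_case = {
    'x': [[0, 1, 0, 1, 0], [0, 0, 1, 1, 0], [1, 0, 0, 0, 1]],
    'edge_index': [[0, 1], [0, 2]],
}
IMPORTANT_COL = 3  # position of the "important" flag, as in the cells above

num_services = math.log(len(toy_case['x']) + 1, 10)                 # log10(3 + 1)
num_relationships = math.log(len(toy_case['edge_index']) + 1, 10)   # log10(2 + 1)
num_important = sum(row[IMPORTANT_COL] for row in toy_case['x'])    # two flagged nodes

raw_global_fea = [num_services, num_relationships, num_important]
print(raw_global_fea)
```

Only the first two features are log-scaled; the count of important services enters the scaler as a plain integer.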

## Step 14: Add Overall Severity Feature

Add a binned feature representing the overall severity level of the topology based on total failure count.

In [22]:
# Add overall severity level feature to training data
for case in train_cases:
    # Calculate total failure count across all edges in the topology
    sum_fail = 0
    for i, edge_attr in enumerate(case['edges_info']):
        sum_fail += np.mean(edge_attr['FailCount'][-1:])
    
    # Bin the total failure count into severity levels
    if sum_fail < 50:
        severity_level = 0      # Very low severity
    elif sum_fail < 100:
        severity_level = 0.2    # Low severity
    elif sum_fail < 150:
        severity_level = 0.4    # Medium severity
    elif sum_fail < 700:
        severity_level = 0.7    # High severity
    else:
        severity_level = 1      # Critical severity
    
    # Append severity level to global features
    case['global_fea'] = np.append(case['global_fea'], severity_level)

In [23]:
# Add overall severity level feature to test data
for case in test_cases:
    # Apply same severity binning logic
    sum_fail = 0
    for i, edge_attr in enumerate(case['edges_info']):
        sum_fail += np.mean(edge_attr['FailCount'][-1:])
    
    if sum_fail < 50:
        severity_level = 0
    elif sum_fail < 100:
        severity_level = 0.2
    elif sum_fail < 150:
        severity_level = 0.4
    elif sum_fail < 700:
        severity_level = 0.7
    else:
        severity_level = 1
    
    case['global_fea'] = np.append(case['global_fea'], severity_level)
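
The severity binning is identical for both datasets, so it could live in one small function. This is a sketch under that assumption, with hypothetical names (`SEVERITY_BINS`, `severity_level`) and a made-up case; the thresholds and levels are copied from the cells above.

```python
# (upper bound, level) pairs in ascending order, copied from the if/elif chain
SEVERITY_BINS = [(50, 0), (100, 0.2), (150, 0.4), (700, 0.7)]


def severity_level(case):
    """Bin the sum of the latest FailCount over all edges of a case."""
    total = sum(edge['FailCount'][-1] for edge in case['edges_info'])
    for upper, level in SEVERITY_BINS:
        if total < upper:
            return level
    return 1  # total of 700 or more


toy_case = {'edges_info': [
    {'FailCount': [0, 5, 30]},
    {'FailCount': [1, 2, 40]},
    {'FailCount': [0, 0, 50]},
]}
print(severity_level(toy_case))  # latest counts sum to 120, which falls in the 0.4 bin
```

Unlike the first three global features, this value is appended after standardization and stays on its original 0 to 1 scale.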

## Step 15: Create Bidirectional Graph Structure

Convert directed edges to bidirectional edges for graph neural network processing. This allows information to flow in both directions along service call relationships.

In [24]:
# Create bidirectional edge indices for training data
# Add reverse edges to allow bidirectional information flow
for case in train_cases:
    # Original edges + reverse edges
    case['bi_edge_index'] = case['edge_index'] + [[edge[1], edge[0]] for edge in case['edge_index']]

In [25]:
# Create bidirectional edge indices for test data
for case in test_cases:
    case['bi_edge_index'] = case['edge_index'] + [[edge[1], edge[0]] for edge in case['edge_index']]

## Step 16: Create Bidirectional Edge Features

Duplicate the edge features to match the bidirectional edge structure, so that each reverse edge carries the same feature vector as its forward edge.

In [26]:
# Duplicate edge features for bidirectional edges in training data
# Each edge feature is used for both directions
for case in train_cases:
    case['bi_edge_fea'] = np.concatenate((case['edge_fea'], case['edge_fea']), axis=0)

In [27]:
# Duplicate edge features for bidirectional edges in test data
for case in test_cases:
    case['bi_edge_fea'] = np.concatenate((case['edge_fea'], case['edge_fea']), axis=0)
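
The two bidirectional structures stay aligned because both append the reversed half after the original half. A small sketch on made-up values shows the resulting layout: for a case with `E` original edges, row `k + E` of the feature matrix belongs to the reverse of edge `k`.

```python
import numpy as np

# Made-up case with two directed edges and two features per edge
edge_index = [[0, 1], [0, 2]]
edge_fea = np.array([[0.1, 0.2], [0.3, 0.4]])

# Same construction as the cells above
bi_edge_index = edge_index + [[dst, src] for src, dst in edge_index]
bi_edge_fea = np.concatenate((edge_fea, edge_fea), axis=0)

num_edges = len(edge_index)
for k in range(num_edges):
    # Forward edge k and its reverse share one feature vector
    print(bi_edge_index[k], bi_edge_index[k + num_edges], bi_edge_fea[k])
```

Any later reordering of `bi_edge_index` would have to be applied to `bi_edge_fea` as well to preserve this correspondence.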

## Step 17: Save Processed Datasets

Save the feature-engineered training and testing datasets for use in machine learning models.

In [28]:
# Save the processed training dataset
# This dataset is ready for machine learning model training
with open('../data/train_cases.pkl', 'wb') as f:
    pickle.dump(train_cases, f)

In [29]:
# Save the processed test dataset
# This dataset is ready for model evaluation
with open('../data/test_cases.pkl', 'wb') as f:
    pickle.dump(test_cases, f)
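
A quick round-trip check can confirm that a saved file loads back with the fields downstream code expects. The sketch below writes a made-up case to a temporary directory instead of `../data/`; the set of expected keys is an assumption chosen for the example and should be adapted to whatever the model code reads.

```python
import os
import pickle
import tempfile

# Made-up case with a few of the fields produced by this notebook
toy_cases = [{'y': 1, 'edge_index': [[0, 1]], 'global_fea': [0.5, 0.2, 1.0, 0.4]}]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'toy_cases.pkl')
    with open(path, 'wb') as f:
        pickle.dump(toy_cases, f)
    with open(path, 'rb') as f:
        reloaded = pickle.load(f)

# Indices of cases that lack any of the expected fields
expected_keys = {'y', 'edge_index', 'global_fea'}
missing = [i for i, case in enumerate(reloaded) if not expected_keys.issubset(case)]
print(len(reloaded), missing)
```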