# Loop Order Prediction and Misclassification Analysis

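This notebook trains an XGBoost model to predict loop order (COEFFICIENTS) from graph features, evaluates it on loop 11, and then analyzes misclassifications in the training set. The data overview lists loops 5-10 as training data, but the run recorded here loads only `9loopfeats.csv`.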

## Data Overview
- **Training data**: 5loopfeats.csv, 6loopfeats.csv, 7loopfeats.csv, 8loopfeats.csv, 9loopfeats.csv, 10loopfeats.csv
- **Test data**: 11loopfeats.csv
- **Target variable**: COEFFICIENTS (first column)
- **Features**: Graph structural features (Basic, Connectivity, Centrality, Core, Robust, Cycle, Spectral, Kirchhoff, Planarity, Symmetry)
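
The file names above follow a `<loop>loopfeats.csv` pattern, so several loops can be loaded with one helper. This is a minimal sketch that assumes all files sit in a single directory; the function name `load_loop_features` is illustrative and not part of the notebook.

```python
from pathlib import Path

import pandas as pd


def load_loop_features(data_dir, loops):
    """Read `<loop>loopfeats.csv` for each loop and stack the rows.

    Hypothetical helper: assumes every file shares the same columns.
    """
    frames = []
    for loop in loops:
        df = pd.read_csv(Path(data_dir) / f"{loop}loopfeats.csv")
        df["source_loop"] = str(loop)  # remember which file each row came from
        frames.append(df)
    return pd.concat(frames, ignore_index=True)
```

Calling `load_loop_features(data_dir, range(5, 11))` would cover loops 5-10, with `data_dir` pointing at the feature directory.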


In [29]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
from sklearn.preprocessing import StandardScaler
import xgboost as xgb
import warnings
warnings.filterwarnings('ignore')

# Set up plotting style
plt.style.use('seaborn-v0_8')
sns.set_palette("husl")


In [36]:
# Define data path
data_path = "/Users/rezadoobary/Documents/ML-correlator/Tree classifier for graphs/mixed_loops/features_tabular"

# Load training data (the overview lists loops 5-10; this run loads only loop 9)
training_files = ['9loopfeats.csv']
training_data = []

print("Loading training data...")
for file in training_files:
    print(f"Loading {file}...")
    df = pd.read_csv(f"{data_path}/{file}")
    df['source_loop'] = file.replace('loopfeats.csv', '')  # Add source loop identifier
    training_data.append(df)
    print(f"  Shape: {df.shape}")

# Combine all training data
train_df = pd.concat(training_data, ignore_index=True)
print(f"\nCombined training data shape: {train_df.shape}")
print(f"Target variable (COEFFICIENTS) unique values: {sorted(train_df['COEFFICIENTS'].unique())}")
print("Target variable distribution:")
print(train_df['COEFFICIENTS'].value_counts().sort_index())


Loading training data...
Loading 9loopfeats.csv...
  Shape: (13972, 58)

Combined training data shape: (13972, 58)
Target variable (COEFFICIENTS) unique values: [np.int64(0), np.int64(1)]
Target variable distribution:
COEFFICIENTS
0    8311
1    5661
Name: count, dtype: int64


In [37]:
# Load test data (loop 11)
print("Loading test data...")
test_df = pd.read_csv(f"{data_path}/11loopfeats.csv")
test_df['source_loop'] = '11'  # Add source loop identifier
print(f"Test data shape: {test_df.shape}")
print(f"Test target variable (COEFFICIENTS) unique values: {sorted(test_df['COEFFICIENTS'].unique())}")
print("Test target variable distribution:")
print(test_df['COEFFICIENTS'].value_counts().sort_index())


Loading test data...
Test data shape: (1697302, 58)
Test target variable (COEFFICIENTS) unique values: [np.int64(0), np.int64(1)]
Test target variable distribution:
COEFFICIENTS
0    1207883
1     489419
Name: count, dtype: int64


In [38]:
# Prepare features and target
# Remove non-feature columns
feature_columns = [col for col in train_df.columns if col not in ['COEFFICIENTS', 'source_loop']]

X_train = train_df[feature_columns]
y_train = train_df['COEFFICIENTS']
X_test = test_df[feature_columns]
y_test = test_df['COEFFICIENTS']

print(f"Number of features: {len(feature_columns)}")
print(f"Training set: {X_train.shape[0]} samples")
print(f"Test set: {X_test.shape[0]} samples")

# Check for missing values
print(f"\nMissing values in training set: {X_train.isnull().sum().sum()}")
print(f"Missing values in test set: {X_test.isnull().sum().sum()}")

# No imputation is applied here: missing values stay as NaN, which XGBoost handles natively


Number of features: 56
Training set: 13972 samples
Test set: 1697302 samples

Missing values in training set: 21
Missing values in test set: 3394771
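
The counts above show missing values in both sets (21 in training, 3394771 in test). If explicit median imputation is wanted, one option is sketched below. It assumes medians are computed on the training features only and reused for the test features, so that no test information enters preprocessing. The function name `impute_with_train_median` and the toy frames are illustrative.

```python
import numpy as np
import pandas as pd


def impute_with_train_median(X_train, X_test):
    """Fill NaNs in both frames using medians computed from X_train only."""
    medians = X_train.median(numeric_only=True)
    return X_train.fillna(medians), X_test.fillna(medians)


# Toy frames standing in for the real feature matrices
toy_train = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [10.0, 20.0, 30.0]})
toy_test = pd.DataFrame({"a": [np.nan, 5.0], "b": [np.nan, 40.0]})

filled_train, filled_test = impute_with_train_median(toy_train, toy_test)
print(filled_train.isnull().sum().sum(), filled_test.isnull().sum().sum())
```

A column that is entirely NaN in the training set has no median and would stay unfilled. Because XGBoost accepts NaN inputs directly, this step is optional for the model trained below.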


In [39]:
# Train XGBoost model
print("Training XGBoost model...")

# Set up XGBoost parameters
xgb_params = {
    'objective': 'multi:softmax',  # Softmax objective; COEFFICIENTS is binary here, so num_class is 2
    'num_class': len(y_train.unique()),
    'max_depth': 6,
    'learning_rate': 0.1,
    'n_estimators': 100,
    'subsample': 0.8,
    'colsample_bytree': 0.8,
    'random_state': 42,
    'n_jobs': -1
}

# Create and train the model
model = xgb.XGBClassifier(**xgb_params)
model.fit(X_train, y_train)

print("Model training completed!")


Training XGBoost model...
Model training completed!


In [40]:
# Make predictions on training and test sets
print("Making predictions...")

# Training set predictions
y_train_pred = model.predict(X_train)
train_accuracy = accuracy_score(y_train, y_train_pred)

# Test set predictions
y_test_pred = model.predict(X_test)
test_accuracy = accuracy_score(y_test, y_test_pred)

print(f"Training accuracy: {train_accuracy:.4f}")
print(f"Test accuracy: {test_accuracy:.4f}")

# Add predictions to dataframes for analysis
train_df['predicted'] = y_train_pred
train_df['correct'] = (train_df['COEFFICIENTS'] == train_df['predicted'])
test_df['predicted'] = y_test_pred
test_df['correct'] = (test_df['COEFFICIENTS'] == test_df['predicted'])


Making predictions...
Training accuracy: 0.9103
Test accuracy: 0.7418
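
These accuracies are easier to interpret next to the majority-class baseline, that is, the accuracy of always predicting the most frequent label. The sketch below computes that baseline from the class counts printed by the loading cells; the helper name `majority_baseline` is illustrative.

```python
def majority_baseline(class_counts):
    """Accuracy obtained by always predicting the most frequent class."""
    return max(class_counts.values()) / sum(class_counts.values())


# Class counts copied from the value_counts() outputs above
train_counts = {0: 8311, 1: 5661}
test_counts = {0: 1207883, 1: 489419}

print(f"Train majority baseline: {majority_baseline(train_counts):.4f}")
print(f"Test majority baseline:  {majority_baseline(test_counts):.4f}")
```

For the test set the baseline is about 0.71, so the 0.7418 test accuracy is only a few points above always predicting class 0. On the training set the baseline is about 0.59 against an accuracy of 0.9103.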


## Misclassification Analysis

Now let's analyze the misclassifications in the training set to understand where the model struggles.
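
One way to see where the model struggles is to split errors by direction: false positives (true 0, predicted 1) versus false negatives (true 1, predicted 0). Below is a minimal sketch using `sklearn.metrics.confusion_matrix` on toy labels; in this notebook the inputs would be `y_train` and `y_train_pred`. The helper name `error_breakdown` is illustrative.

```python
import numpy as np
from sklearn.metrics import confusion_matrix


def error_breakdown(y_true, y_pred):
    """Return counts of false positives and false negatives for labels {0, 1}."""
    # With labels=[0, 1], rows are true classes and columns are predictions
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    return {"false_positives": int(fp), "false_negatives": int(fn)}


# Toy labels standing in for y_train and y_train_pred
y_true = np.array([0, 0, 0, 1, 1, 0])
y_pred = np.array([1, 0, 1, 1, 0, 0])
print(error_breakdown(y_true, y_pred))
```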


In [41]:
# Overall misclassification statistics
misclassified = train_df[~train_df['correct']]
total_samples = len(train_df)
misclassified_count = len(misclassified)

print(f"Total training samples: {total_samples}")
print(f"Misclassified samples: {misclassified_count}")
print(f"Misclassification rate: {misclassified_count/total_samples:.4f}")

# Misclassification by source loop
print("\nMisclassification by source loop:")
misclass_by_loop = train_df.groupby('source_loop').agg({
    'correct': ['count', 'sum', lambda x: (~x).sum()]  # total, correct, misclassified
})
misclass_by_loop.columns = ['Total', 'Correct', 'Misclassified']
misclass_by_loop['Misclass_Rate'] = misclass_by_loop['Misclassified'] / misclass_by_loop['Total']
print(misclass_by_loop)


Total training samples: 13972
Misclassified samples: 1253
Misclassification rate: 0.0897

Misclassification by source loop:
             Total  Correct  Misclassified  Misclass_Rate
source_loop                                              
9            13972    12719           1253       0.089679


In [44]:
# Display the misclassified samples dataframe with all features
print("Misclassified Samples DataFrame:")
print("=" * 50)

# Show the first 20 misclassified samples with all their features
print(f"Showing first 20 out of {len(misclassified)} misclassified samples:")
print(f"Total columns: {len(misclassified.columns)}")

# Display the dataframe
display(misclassified.head(20))

# Also show some basic info about the misclassified samples
print("\nMisclassified samples info:")
print(f"Shape: {misclassified.shape}")
print(f"Columns: {list(misclassified.columns)}")
print(f"Data types:\n{misclassified.dtypes}")


Misclassified Samples DataFrame:
Showing first 20 out of 1253 misclassified samples:
Total columns: 60


|  | COEFFICIENTS | Basic_num_nodes | Basic_num_edges | Basic_min_degree | Basic_max_degree | Basic_avg_degree | Basic_degree_std | Basic_degree_skew | Basic_density | Basic_edge_to_node_ratio | ... | Kirchhoff_index | Planarity_num_faces | Planarity_face_size_mean | Planarity_face_size_max | Symmetry_automorphism_group_order | Symmetry_num_orbits | Symmetry_orbit_size_max | source_loop | predicted | correct |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 51 | 0 | 13 | 30 | 4 | 6 | 4.615385 | 0.73782 | 0.747932 | 0.384615 | 2.307692 | ... | 38.792243 | 19 | 3.157895 | 4 | 1 | 13 | 1 | 9 | 1 | False |
| 56 | 0 | 13 | 31 | 4 | 6 | 4.769231 | 0.799408 | 0.438359 | 0.397436 | 2.384615 | ... | 37.321277 | 20 | 3.1 | 4 | 1 | 13 | 1 | 9 | 1 | False |
| 71 | 1 | 13 | 31 | 4 | 6 | 4.769231 | 0.696568 | 0.347455 | 0.397436 | 2.384615 | ... | 37.346345 | 20 | 3.1 | 4 | 1 | 13 | 1 | 9 | 0 | False |
| 75 | 0 | 13 | 31 | 4 | 6 | 4.769231 | 0.696568 | 0.347455 | 0.397436 | 2.384615 | ... | 37.342107 | 20 | 3.1 | 4 | 1 | 13 | 1 | 9 | 1 | False |
| 85 | 0 | 13 | 31 | 4 | 7 | 4.769231 | 0.890449 | 1.12174 | 0.397436 | 2.384615 | ... | 37.266592 | 20 | 3.1 | 4 | 1 | 13 | 1 | 9 | 1 | False |
| 87 | 0 | 13 | 30 | 4 | 7 | 4.615385 | 0.835598 | 1.610225 | 0.384615 | 2.307692 | ... | 38.766478 | 19 | 3.157895 | 4 | 1 | 13 | 1 | 9 | 1 | False |
| 88 | 0 | 13 | 29 | 4 | 6 | 4.461538 | 0.634324 | 1.04861 | 0.371795 | 2.230769 | ... | 40.212789 | 18 | 3.222222 | 4 | 2 | 7 | 2 | 9 | 1 | False |
| 92 | 0 | 13 | 31 | 4 | 6 | 4.769231 | 0.799408 | 0.438359 | 0.397436 | 2.384615 | ... | 37.485907 | 20 | 3.1 | 5 | 2 | 8 | 2 | 9 | 1 | False |
| 99 | 0 | 13 | 31 | 4 | 6 | 4.769231 | 0.799408 | 0.438359 | 0.397436 | 2.384615 | ... | 37.213093 | 20 | 3.1 | 4 | 1 | 13 | 1 | 9 | 1 | False |
| 106 | 0 | 13 | 31 | 4 | 6 | 4.769231 | 0.799408 | 0.438359 | 0.397436 | 2.384615 | ... | 37.363579 | 20 | 3.1 | 4 | 1 | 13 | 1 | 9 | 1 | False |



Misclassified samples info:
Shape: (1253, 60)
Columns: ['COEFFICIENTS', 'Basic_num_nodes', 'Basic_num_edges', 'Basic_min_degree', 'Basic_max_degree', 'Basic_avg_degree', 'Basic_degree_std', 'Basic_degree_skew', 'Basic_density', 'Basic_edge_to_node_ratio', 'Basic_degree_entropy', 'Connectivity_is_connected', 'Connectivity_num_components', 'Connectivity_diameter', 'Connectivity_radius', 'Connectivity_avg_shortest_path_length', 'Connectivity_wiener_index', 'Centrality_betweenness_mean', 'Centrality_betweenness_max', 'Centrality_betweenness_std', 'Centrality_betweenness_skew', 'Centrality_closeness_mean', 'Centrality_closeness_max', 'Centrality_closeness_std', 'Centrality_closeness_skew', 'Centrality_eigenvector_mean', 'Centrality_eigenvector_max', 'Centrality_eigenvector_std', 'Centrality_eigenvector_skew', 'Core_max_core_index', 'Core_core_index_mean', 'Robust_articulation_points', 'Robust_bridge_count', 'Cycle_num_cycles_len_5', 'Cycle_num_cycles_len_6', 'Spectral_algebraic_connectivit