# Network Congestion Analysis with Machine Learning

This notebook analyzes network congestion data with several machine learning classifiers. The workflow is:

1. Setup and Data Loading
2. Data Preprocessing and Feature Engineering
3. Exploratory Data Analysis (EDA)
4. Feature Engineering & Selection
5. Model Training & Cross-Validation
6. Model Evaluation & Performance Metrics
7. Feature Importance Analysis
8. Prediction Analysis & Visualization
9. Model Application: Predicting Congestion Level for New Data

## 1. Setup and Data Loading

In [None]:
# Import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
from datetime import datetime

# For preprocessing
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score, StratifiedKFold
from sklearn.impute import SimpleImputer

# Machine Learning models
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
import xgboost as xgb
import lightgbm as lgb

# For evaluation
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.metrics import confusion_matrix, classification_report, roc_curve, auc, precision_recall_curve
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# For feature importance
from sklearn.inspection import permutation_importance

# Set display options for pandas
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 20)

# Set visualization style
sns.set(style="whitegrid")
plt.style.use('seaborn-v0_8-whitegrid')

# Set random seed for reproducibility
np.random.seed(42)

In [None]:
# Load the network congestion dataset
data = pd.read_csv('Network Congestion Dataset.csv')

# Display basic information about the dataset
print(f"Dataset Shape: {data.shape}")
print("\nFirst 5 rows:")
data.head()

In [None]:
# Check data information
data.info()

In [None]:
# Check for missing values
missing_values = data.isnull().sum()
print("Missing values per column:")
print(missing_values[missing_values > 0] if (missing_values > 0).any() else "No missing values found.")

In [None]:
# Basic statistical summary
data.describe().T

## 2. Data Preprocessing and Feature Engineering

In [None]:
# Convert timestamp to datetime
data['Timestamp'] = pd.to_datetime(data['Timestamp'])

# Extract additional time-based features
data['Hour'] = data['Timestamp'].dt.hour
data['Day'] = data['Timestamp'].dt.day
data['DayOfWeek'] = data['Timestamp'].dt.dayofweek
# dayofweek counts Monday as 0, so 5 and 6 are Saturday and Sunday
data['IsWeekend'] = (data['DayOfWeek'] >= 5).astype(int)

# Display the updated dataset with new features
data.head()
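
The time-derived columns above can be sanity-checked on a few timestamps whose weekday is known in advance. This is a minimal sketch that assumes hand-written example timestamps; the dates and the `feat` frame are illustrative and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical timestamps: a Friday evening, a Saturday morning, a Monday noon.
ts = pd.to_datetime(pd.Series(["2024-01-05 23:15", "2024-01-06 08:00", "2024-01-08 12:30"]))

feat = pd.DataFrame({
    "Hour": ts.dt.hour,
    "DayOfWeek": ts.dt.dayofweek,                      # Monday = 0 ... Sunday = 6
    "IsWeekend": (ts.dt.dayofweek >= 5).astype(int),   # 1 for Saturday/Sunday
})
print(feat)
```

Only the Saturday row should carry `IsWeekend = 1`, which confirms the `>= 5` threshold used in the cell above.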

In [None]:
# Check unique values in categorical columns
categorical_cols = ['Source_Node', 'Destination_Node', 'Admin_Contact', 'Region_Code']
for col in categorical_cols:
    print(f"\nUnique values in {col}:")
    print(data[col].value_counts())

In [None]:
# Create connection pairs for analysis
data['Connection'] = data['Source_Node'] + ' → ' + data['Destination_Node']

# Check outliers in numerical columns
numerical_cols = ['Packet_Loss_Rate', 'Average_Latency_ms', 'Node_Betweenness_Centrality', 
                  'Traffic_Volume_MBps', 'Link_Stability_Score']

# Boxplot for numerical features to identify outliers
plt.figure(figsize=(15, 10))
for i, col in enumerate(numerical_cols):
    plt.subplot(3, 2, i+1)
    sns.boxplot(x=data[col])
    plt.title(f'Boxplot of {col}')
# Adjust the layout once, after all subplots are drawn
plt.tight_layout()
plt.show()

In [None]:
# Function to identify outliers using IQR method
def identify_outliers(data, column):
    Q1 = data[column].quantile(0.25)
    Q3 = data[column].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    outliers = data[(data[column] < lower_bound) | (data[column] > upper_bound)]
    return outliers, lower_bound, upper_bound, len(outliers)

# Check for outliers in each numerical column
for col in numerical_cols:
    outliers, lower, upper, count = identify_outliers(data, col)
    print(f"\nOutliers in {col}:")
    print(f"Lower bound: {lower:.4f}, Upper bound: {upper:.4f}")
    print(f"Number of outliers: {count} ({(count/len(data))*100:.2f}% of data)")

In [None]:
# Function to handle outliers using capping
def cap_outliers(data, column):
    Q1 = data[column].quantile(0.25)
    Q3 = data[column].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    # Clip values to the IQR fences (note: this modifies the passed DataFrame in place)
    data[column] = data[column].clip(lower=lower_bound, upper=upper_bound)
    return data

# Handle outliers for each numerical column
data_processed = data.copy()
for col in numerical_cols:
    data_processed = cap_outliers(data_processed, col)

# Verify outliers are handled
plt.figure(figsize=(15, 10))
for i, col in enumerate(numerical_cols):
    plt.subplot(3, 2, i+1)
    sns.boxplot(x=data_processed[col])
    plt.title(f'Boxplot of {col} (After Handling Outliers)')
# Adjust the layout once, after all subplots are drawn
plt.tight_layout()
plt.show()
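
To make the IQR rule used above concrete, here is a small worked example. It is a sketch on a made-up latency sample (the values and the `toy` name are hypothetical), showing how the fences are computed and how capping moves an outlier onto the upper fence.

```python
import pandas as pd

# Toy latency sample (hypothetical values) with one obvious outlier.
toy = pd.Series([10.0, 12.0, 11.0, 13.0, 12.0, 11.0, 95.0], name="latency_ms")

q1, q3 = toy.quantile(0.25), toy.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Flag values outside the fences, then cap them at the fences.
is_outlier = (toy < lower) | (toy > upper)
capped = toy.clip(lower=lower, upper=upper)

print(f"Fences: [{lower:.2f}, {upper:.2f}]")
print("Outliers:", toy[is_outlier].tolist())
print("Max after capping:", capped.max())
```

Capping keeps every row but compresses the tails, so the boxplots after capping show no points beyond the whiskers by construction.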

In [None]:
# Create a congestion level target variable based on combined factors
# We'll use Packet_Loss_Rate and Average_Latency_ms as our main indicators

# Normalize the factors for weighted scoring
loss_norm = (data_processed['Packet_Loss_Rate'] - data_processed['Packet_Loss_Rate'].min()) / \
            (data_processed['Packet_Loss_Rate'].max() - data_processed['Packet_Loss_Rate'].min())

latency_norm = (data_processed['Average_Latency_ms'] - data_processed['Average_Latency_ms'].min()) / \
               (data_processed['Average_Latency_ms'].max() - data_processed['Average_Latency_ms'].min())

# Combined congestion score (weighted average)
data_processed['Congestion_Score'] = 0.6 * loss_norm + 0.4 * latency_norm

# Create congestion level categories (Low, Medium, High)
data_processed['Congestion_Level'] = pd.qcut(data_processed['Congestion_Score'], 
                                             q=[0, 0.33, 0.67, 1.0], 
                                             labels=['Low', 'Medium', 'High'])

# Display the distribution of congestion levels
plt.figure(figsize=(10, 6))
sns.countplot(x='Congestion_Level', data=data_processed)
plt.title('Distribution of Network Congestion Levels')
plt.xlabel('Congestion Level')
plt.ylabel('Count')
plt.show()

# Verify the resulting dataset
data_processed[['Packet_Loss_Rate', 'Average_Latency_ms', 'Congestion_Score', 'Congestion_Level']].head()
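
The target construction above combines min-max normalization, a 60/40 weighted average, and quantile binning. The following sketch reproduces those three steps on six made-up rows so the arithmetic can be followed by hand; the values and the `min_max` helper are illustrative assumptions, not part of the dataset.

```python
import pandas as pd

def min_max(s: pd.Series) -> pd.Series:
    """Rescale a series linearly to the [0, 1] range."""
    return (s - s.min()) / (s.max() - s.min())

# Hypothetical measurements, evenly spaced so the scores are easy to verify.
toy = pd.DataFrame({
    "Packet_Loss_Rate":   [0.00, 0.01, 0.02, 0.03, 0.04, 0.05],
    "Average_Latency_ms": [20.0, 30.0, 40.0, 50.0, 60.0, 70.0],
})

# Same weights as the notebook: 60% packet loss, 40% latency.
score = 0.6 * min_max(toy["Packet_Loss_Rate"]) + 0.4 * min_max(toy["Average_Latency_ms"])

# Quantile cut at the 33rd and 67th percentiles of the score.
level = pd.qcut(score, q=[0, 0.33, 0.67, 1.0], labels=["Low", "Medium", "High"])

print(pd.concat([score.rename("score"), level.rename("level")], axis=1))
```

Because `pd.qcut` bins by rank rather than by fixed thresholds, the three classes are roughly balanced regardless of how the score is distributed.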

## 3. Exploratory Data Analysis (EDA)

In [None]:
# Distribution of numerical features
plt.figure(figsize=(20, 15))
for i, col in enumerate(numerical_cols):
    plt.subplot(3, 2, i+1)
    sns.histplot(data_processed[col], kde=True)
    plt.title(f'Distribution of {col}')
plt.tight_layout()
plt.show()

In [None]:
# Correlation matrix for numerical features
numerical_data = data_processed[numerical_cols + ['Congestion_Score']]
correlation_matrix = numerical_data.corr()

plt.figure(figsize=(12, 10))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt='.2f', linewidths=0.5)
plt.title('Correlation Matrix of Numerical Features')
plt.tight_layout()
plt.show()

In [None]:
# Interactive correlation heatmap with Plotly
fig = px.imshow(correlation_matrix, 
                text_auto='.2f',
                color_continuous_scale='RdBu_r',
                title='Correlation Matrix of Numerical Features')
fig.update_layout(width=800, height=800)
fig.show()

In [None]:
# Hourly patterns in congestion
hourly_congestion = data_processed.groupby('Hour')['Congestion_Score'].mean().reset_index()

plt.figure(figsize=(12, 6))
plt.plot(hourly_congestion['Hour'], hourly_congestion['Congestion_Score'], marker='o', linestyle='-')
plt.title('Average Congestion Score by Hour of Day')
plt.xlabel('Hour of Day')
plt.ylabel('Average Congestion Score')
plt.xticks(range(0, 24))
plt.grid(True, alpha=0.3)
plt.show()

In [None]:
# Interactive hourly pattern with Plotly
fig = px.line(hourly_congestion, x='Hour', y='Congestion_Score', markers=True,
              title='Average Congestion Score by Hour of Day')
fig.update_layout(xaxis_title='Hour of Day', 
                  yaxis_title='Average Congestion Score',
                  xaxis=dict(tickmode='linear', tick0=0, dtick=1))
fig.show()

In [None]:
# Analyze congestion by day of week
day_mapping = {0: 'Monday', 1: 'Tuesday', 2: 'Wednesday', 3: 'Thursday', 
               4: 'Friday', 5: 'Saturday', 6: 'Sunday'}
data_processed['DayName'] = data_processed['DayOfWeek'].map(day_mapping)

daily_congestion = data_processed.groupby('DayName')['Congestion_Score'].mean().reindex(
    ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
).reset_index()

plt.figure(figsize=(12, 6))
sns.barplot(x='DayName', y='Congestion_Score', data=daily_congestion)
plt.title('Average Congestion Score by Day of Week')
plt.xlabel('Day of Week')
plt.ylabel('Average Congestion Score')
plt.grid(True, alpha=0.3)
plt.show()

In [None]:
# Regional analysis
region_congestion = data_processed.groupby('Region_Code')['Congestion_Score'].mean().sort_values(ascending=False).reset_index()

plt.figure(figsize=(12, 6))
sns.barplot(x='Region_Code', y='Congestion_Score', data=region_congestion)
plt.title('Average Congestion Score by Region')
plt.xlabel('Region')
plt.ylabel('Average Congestion Score')
plt.grid(True, alpha=0.3)
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

In [None]:
# Connection pair analysis
connection_congestion = data_processed.groupby('Connection')['Congestion_Score'].mean().sort_values(ascending=False).reset_index()
top_10_congested = connection_congestion.head(10)

plt.figure(figsize=(14, 7))
sns.barplot(x='Congestion_Score', y='Connection', data=top_10_congested)
plt.title('Top 10 Most Congested Connection Pairs')
plt.xlabel('Average Congestion Score')
plt.ylabel('Connection Pair')
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

In [None]:
# Scatter plot to examine relationship between key variables
plt.figure(figsize=(15, 10))
plt.subplot(2, 2, 1)
sns.scatterplot(x='Packet_Loss_Rate', y='Average_Latency_ms', hue='Congestion_Level', data=data_processed)
plt.title('Packet Loss Rate vs. Average Latency')

plt.subplot(2, 2, 2)
sns.scatterplot(x='Traffic_Volume_MBps', y='Packet_Loss_Rate', hue='Congestion_Level', data=data_processed)
plt.title('Traffic Volume vs. Packet Loss Rate')

plt.subplot(2, 2, 3)
sns.scatterplot(x='Node_Betweenness_Centrality', y='Packet_Loss_Rate', hue='Congestion_Level', data=data_processed)
plt.title('Node Betweenness Centrality vs. Packet Loss Rate')

plt.subplot(2, 2, 4)
sns.scatterplot(x='Link_Stability_Score', y='Packet_Loss_Rate', hue='Congestion_Level', data=data_processed)
plt.title('Link Stability Score vs. Packet Loss Rate')

plt.tight_layout()
plt.show()

In [None]:
# Interactive 3D scatter plot to visualize multiple dimensions
fig = px.scatter_3d(data_processed, x='Packet_Loss_Rate', y='Average_Latency_ms', z='Traffic_Volume_MBps',
                  color='Congestion_Level', opacity=0.7,
                  title='3D Visualization of Network Congestion Factors')
fig.update_layout(scene=dict(
    xaxis_title='Packet Loss Rate',
    yaxis_title='Average Latency (ms)',
    zaxis_title='Traffic Volume (MBps)'),
    width=900, height=700)
fig.show()

In [None]:
# Create a pair plot for key numerical features
# pairplot builds its own figure, so a preceding plt.figure() would only leave an empty figure
g = sns.pairplot(data_processed[numerical_cols + ['Congestion_Level']], hue='Congestion_Level', height=2.5)
g.fig.suptitle('Pair Plot of Key Network Metrics', y=1.02, fontsize=16)
plt.show()

## 4. Feature Engineering & Selection

In [None]:
# Create interaction features
data_processed['Loss_Latency_Interaction'] = data_processed['Packet_Loss_Rate'] * data_processed['Average_Latency_ms']
# Assumes Link_Stability_Score is never zero; a zero would yield inf and break scaling later
data_processed['Traffic_Stability_Ratio'] = data_processed['Traffic_Volume_MBps'] / data_processed['Link_Stability_Score']
data_processed['Centrality_Loss_Interaction'] = data_processed['Node_Betweenness_Centrality'] * data_processed['Packet_Loss_Rate']

# Extended list of numerical feature columns (scaling is applied after the train/test split)
numerical_cols_extended = numerical_cols + ['Loss_Latency_Interaction', 'Traffic_Stability_Ratio', 'Centrality_Loss_Interaction', 
                                           'Hour', 'Day', 'DayOfWeek']

# Prepare data for modeling
# Convert categorical 'Congestion_Level' to numeric for ML models
level_mapping = {'Low': 0, 'Medium': 1, 'High': 2}
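# .map() on a categorical column returns a categorical; cast to int so corr() and the models get plain integers
data_processed['Congestion_Level_Numeric'] = data_processed['Congestion_Level'].map(level_mapping).astype(int)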

# Select features and target for modeling
X = data_processed[numerical_cols_extended]
y = data_processed['Congestion_Level_Numeric']

# Display the new features
X.head()
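
Because `Congestion_Level` is computed from `Packet_Loss_Rate` and `Average_Latency_ms`, any feature set that contains those two columns lets a classifier largely reconstruct the label, so the resulting accuracy mostly reflects how well the model recovers the scoring formula. The sketch below illustrates the effect on synthetic data; the column names, the uniform distributions, and the sample size are all assumptions made for the demonstration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(42)
n = 600

# Synthetic stand-ins: "traffic" is unrelated to the label by construction.
toy = pd.DataFrame({
    "loss": rng.uniform(0, 1, n),
    "latency": rng.uniform(0, 1, n),
    "traffic": rng.uniform(0, 1, n),
})

# Label built the same way as in the notebook: weighted score, then tercile bins.
score = 0.6 * toy["loss"] + 0.4 * toy["latency"]
label = pd.qcut(score, q=3, labels=False)

clf = RandomForestClassifier(n_estimators=100, random_state=42)
acc_with = cross_val_score(clf, toy[["loss", "latency", "traffic"]], label, cv=5).mean()
acc_without = cross_val_score(clf, toy[["traffic"]], label, cv=5).mean()

print(f"Accuracy with label-defining features:    {acc_with:.3f}")
print(f"Accuracy without label-defining features: {acc_without:.3f}")
```

If the goal is to predict congestion from the remaining signals, one option is to compare results with and without the two label-defining columns (and the interaction features derived from them).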

In [None]:
# Check for correlation in extended features
correlation_extended = X.corr()

plt.figure(figsize=(16, 14))
sns.heatmap(correlation_extended, annot=True, cmap='coolwarm', fmt='.2f', linewidths=0.5)
plt.title('Correlation Matrix of Extended Features')
plt.tight_layout()
plt.show()

In [None]:
# Find highly correlated features (above 0.8)
def get_highly_correlated_pairs(corr_matrix, threshold=0.8):
    corr_pairs = []
    for i in range(len(corr_matrix.columns)):
        for j in range(i):
            if abs(corr_matrix.iloc[i, j]) > threshold:
                corr_pairs.append((corr_matrix.columns[i], corr_matrix.columns[j], corr_matrix.iloc[i, j]))
    return corr_pairs

high_corr_pairs = get_highly_correlated_pairs(correlation_extended, threshold=0.8)
print("Highly correlated feature pairs:")
for feat1, feat2, corr in high_corr_pairs:
    print(f"{feat1} - {feat2}: {corr:.4f}")
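
The nested loop above walks the lower triangle of the correlation matrix. An equivalent vectorized formulation masks the upper triangle with NumPy and stacks it into a series. This is a sketch on synthetic columns (`a`, `b`, `c` and the `correlated_pairs` helper are hypothetical), and unlike the loop it returns absolute correlations.

```python
import numpy as np
import pandas as pd

def correlated_pairs(df: pd.DataFrame, threshold: float = 0.8) -> pd.Series:
    """Return feature pairs whose absolute correlation exceeds the threshold."""
    corr = df.corr().abs()
    # Keep only the strict upper triangle so each pair appears exactly once.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack()
    return pairs[pairs > threshold].sort_values(ascending=False)

rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "b": 2 * a + rng.normal(scale=0.1, size=200),  # nearly a linear copy of "a"
    "c": rng.normal(size=200),                     # independent noise
})

result = correlated_pairs(toy)
print(result)
```

Only the `("a", "b")` pair should be reported, since `c` is generated independently of the other two columns.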

In [None]:
# Drop one feature from each highly correlated pair to reduce multicollinearity
# Derived features are dropped first; otherwise the feature less correlated with the target is dropped
features_to_drop = []
for feat1, feat2, _ in high_corr_pairs:
    # Strategy: Keep the original feature, drop the derived one
    if feat1 in ['Loss_Latency_Interaction', 'Traffic_Stability_Ratio', 'Centrality_Loss_Interaction']:
        features_to_drop.append(feat1)
    elif feat2 in ['Loss_Latency_Interaction', 'Traffic_Stability_Ratio', 'Centrality_Loss_Interaction']:
        features_to_drop.append(feat2)
    # If both are original, drop the one with less correlation to target
    else:
        corr1 = abs(data_processed[feat1].corr(data_processed['Congestion_Level_Numeric']))
        corr2 = abs(data_processed[feat2].corr(data_processed['Congestion_Level_Numeric']))
        features_to_drop.append(feat1 if corr1 < corr2 else feat2)

# Remove duplicates from the list
features_to_drop = list(set(features_to_drop))
print(f"Features to drop due to high correlation: {features_to_drop}")

# Remove the correlated features
X_reduced = X.drop(columns=features_to_drop)
print(f"Reduced feature set shape: {X_reduced.shape}")

In [None]:
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X_reduced, y, test_size=0.25, random_state=42, stratify=y)

# Scale the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(f"Training set shape: {X_train_scaled.shape}")
print(f"Testing set shape: {X_test_scaled.shape}")
print(f"Class distribution in training set:\n{pd.Series(y_train).value_counts(normalize=True)}")
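
The scaler above is fitted once on the full training split, so during cross-validation each validation fold has already contributed to the scaling statistics. For standardization the effect is usually small, but one common way to avoid it is to wrap the scaler and the model in a `Pipeline`, so that the scaler is refitted inside every fold. A minimal sketch on synthetic data follows; `X_demo` and `y_demo` come from `make_classification` and are stand-ins for the notebook's features and target.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic three-class problem standing in for the congestion data.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=4, n_classes=3, random_state=42
)

# The scaler is part of the estimator, so it is fitted on the training folds only.
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(pipe, X_demo, y_demo, cv=cv, scoring="accuracy")
print(f"Cross-validation accuracy: {scores.mean():.4f} ± {scores.std():.4f}")
```

The same pipeline object can be passed to `GridSearchCV`, with parameter names prefixed by the step name (for example `clf__C`).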

## 5. Model Training & Cross-Validation

In [None]:
# Define models to be evaluated
models = {
    'Logistic Regression': LogisticRegression(max_iter=1000, random_state=42),
    'K-Nearest Neighbors': KNeighborsClassifier(),
    'Support Vector Machine': SVC(probability=True, random_state=42),
    'Random Forest': RandomForestClassifier(random_state=42),
    'Gradient Boosting': GradientBoostingClassifier(random_state=42),
    'XGBoost': xgb.XGBClassifier(random_state=42),
    'LightGBM': lgb.LGBMClassifier(random_state=42)
}

# Function to evaluate models using cross-validation
def evaluate_models(models, X, y, cv=5):
    results = {}
    for name, model in models.items():
        print(f"\nEvaluating {name}...")
        
        # Cross-validation
        cv_scores = cross_val_score(model, X, y, cv=cv, scoring='accuracy')
        results[name] = {
            'cv_accuracy_mean': cv_scores.mean(),
            'cv_accuracy_std': cv_scores.std()
        }
        
        print(f"Cross-validation accuracy: {cv_scores.mean():.4f} ± {cv_scores.std():.4f}")
        
        # Train on the full training set
        model.fit(X, y)
        
    return results

# Evaluate all models using cross-validation
cv_results = evaluate_models(models, X_train_scaled, y_train, cv=5)

In [None]:
# Visualize cross-validation results
cv_means = [result['cv_accuracy_mean'] for result in cv_results.values()]
cv_stds = [result['cv_accuracy_std'] for result in cv_results.values()]
model_names = list(cv_results.keys())

# Sort results by mean accuracy
sorted_indices = np.argsort(cv_means)[::-1]  # Descending order
cv_means = [cv_means[i] for i in sorted_indices]
cv_stds = [cv_stds[i] for i in sorted_indices]
model_names = [model_names[i] for i in sorted_indices]

plt.figure(figsize=(12, 8))
plt.barh(model_names, cv_means, xerr=cv_stds, capsize=5, alpha=0.7, color='skyblue')
plt.xlabel('Mean Cross-Validation Accuracy')
plt.ylabel('Model')
plt.title('Cross-Validation Accuracy Comparison')
plt.grid(axis='x', alpha=0.3)
plt.xlim(min(cv_means) - 0.05, 1.0)

# Add text annotations for mean accuracy values
for i, mean in enumerate(cv_means):
    plt.text(mean + 0.01, i, f"{mean:.4f}", va='center')

plt.tight_layout()
plt.show()

In [None]:
# Identify the top-performing models
top_models = {
    name: models[name] for name in model_names[:3]  # Select top 3 models
}

print("Top performing models for hyperparameter tuning:")
for name in top_models.keys():
    print(f"- {name}")

In [None]:
# Hyperparameter tuning for the top models
def hyperparameter_tuning(model_name, model, X_train, y_train):
    print(f"\nTuning hyperparameters for {model_name}...")
    
    param_grid = {}
    
    if model_name == 'Random Forest':
        param_grid = {
            'n_estimators': [100, 200, 300],
            'max_depth': [None, 10, 20, 30],
            'min_samples_split': [2, 5, 10],
            'min_samples_leaf': [1, 2, 4]
        }
    elif model_name == 'XGBoost':
        param_grid = {
            'n_estimators': [100, 200, 300],
            'max_depth': [3, 5, 7],
            'learning_rate': [0.01, 0.1, 0.2],
            'subsample': [0.8, 0.9, 1.0]
        }
    elif model_name == 'LightGBM':
        param_grid = {
            'n_estimators': [100, 200, 300],
            'max_depth': [3, 5, 7],
            'learning_rate': [0.01, 0.1, 0.2],
            'num_leaves': [31, 50, 70]
        }
    elif model_name == 'Gradient Boosting':
        param_grid = {
            'n_estimators': [100, 200, 300],
            'max_depth': [3, 5, 7],
            'learning_rate': [0.01, 0.1, 0.2],
            'min_samples_split': [2, 5, 10]
        }
    elif model_name == 'Support Vector Machine':
        param_grid = {
            'C': [0.1, 1, 10, 100],
            'gamma': ['scale', 'auto', 0.1, 0.01],
            'kernel': ['rbf', 'poly', 'sigmoid']
        }
    elif model_name == 'K-Nearest Neighbors':
        param_grid = {
            'n_neighbors': [3, 5, 7, 9, 11],
            'weights': ['uniform', 'distance'],
            'metric': ['euclidean', 'manhattan', 'minkowski']
        }
    else:  # Logistic Regression
        param_grid = {
            'C': [0.001, 0.01, 0.1, 1, 10, 100],
            'solver': ['liblinear', 'saga'],
            'penalty': ['l1', 'l2']
        }
    
    # Use stratified k-fold cross-validation
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
    
    # Create grid search
    grid_search = GridSearchCV(model, param_grid, cv=cv, scoring='accuracy', n_jobs=-1, verbose=1)
    
    # Fit grid search
    grid_search.fit(X_train, y_train)
    
    # Get best parameters and score
    best_params = grid_search.best_params_
    best_score = grid_search.best_score_
    
    print(f"Best parameters for {model_name}: {best_params}")
    print(f"Best cross-validation accuracy: {best_score:.4f}")
    
    # Return best model
    return grid_search.best_estimator_, best_params, best_score
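
Grid search cost grows multiplicatively with every hyperparameter list, so it can help to count the candidates before starting a search. The sketch below counts them with scikit-learn's `ParameterGrid`, reusing the Random Forest grid from the function above and assuming the same 5-fold setting.

```python
from sklearn.model_selection import ParameterGrid

# Same Random Forest grid as in hyperparameter_tuning().
rf_grid = {
    "n_estimators": [100, 200, 300],
    "max_depth": [None, 10, 20, 30],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
}

n_candidates = len(ParameterGrid(rf_grid))  # 3 * 4 * 3 * 3 combinations
n_folds = 5
n_fits = n_candidates * n_folds             # model fits, excluding the final refit

print(f"{n_candidates} candidates x {n_folds} folds = {n_fits} fits")
```

When the fit count becomes too large, `RandomizedSearchCV` with a fixed `n_iter` is a common alternative that samples the grid instead of enumerating it.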

In [None]:
# Tune hyperparameters for top models
tuned_models = {}
for name, model in top_models.items():
    best_model, best_params, best_score = hyperparameter_tuning(name, model, X_train_scaled, y_train)
    tuned_models[name] = {
        'model': best_model,
        'params': best_params,
        'cv_score': best_score
    }

## 6. Model Evaluation & Performance Metrics

In [None]:
# Evaluate the tuned models on the test set
def evaluate_on_test_set(tuned_models, X_test, y_test):
    results = {}
    
    for name, model_info in tuned_models.items():
        model = model_info['model']
        
        # Make predictions
        y_pred = model.predict(X_test)
        y_pred_proba = model.predict_proba(X_test)
        
        # Calculate metrics
        accuracy = accuracy_score(y_test, y_pred)
        precision = precision_score(y_test, y_pred, average='weighted')
        recall = recall_score(y_test, y_pred, average='weighted')
        f1 = f1_score(y_test, y_pred, average='weighted')
        conf_matrix = confusion_matrix(y_test, y_pred)
        
        # Store results
        results[name] = {
            'accuracy': accuracy,
            'precision': precision,
            'recall': recall,
            'f1_score': f1,
            'confusion_matrix': conf_matrix,
            'predictions': y_pred,
            'probabilities': y_pred_proba
        }
        
        # Print evaluation results
        print(f"\nEvaluation results for {name}:")
        print(f"Accuracy: {accuracy:.4f}")
        print(f"Precision: {precision:.4f}")
        print(f"Recall: {recall:.4f}")
        print(f"F1 Score: {f1:.4f}")
        print("\nConfusion Matrix:")
        print(conf_matrix)
        print("\nClassification Report:")
        print(classification_report(y_test, y_pred, target_names=['Low', 'Medium', 'High']))
        
    return results

# Evaluate models on test set
test_results = evaluate_on_test_set(tuned_models, X_test_scaled, y_test)

In [None]:
# Visualize confusion matrices for each model
def plot_confusion_matrix(conf_matrix, model_name):
    plt.figure(figsize=(8, 6))
    sns.heatmap(conf_matrix, annot=True, fmt='d', cmap='Blues', 
                xticklabels=['Low', 'Medium', 'High'],
                yticklabels=['Low', 'Medium', 'High'])
    plt.xlabel('Predicted Label')
    plt.ylabel('True Label')
    plt.title(f'Confusion Matrix - {model_name}')
    plt.tight_layout()
    plt.show()

# Plot confusion matrix for each model
for name, result in test_results.items():
    plot_confusion_matrix(result['confusion_matrix'], name)

In [None]:
# Compare model performances
metrics = ['accuracy', 'precision', 'recall', 'f1_score']

# Build the table with a float dtype so that idxmax() and the plots below work on numeric data
model_comparison = pd.DataFrame(
    {name: [result[metric] for metric in metrics] for name, result in test_results.items()},
    index=metrics
).astype(float)

# Display model comparison table
model_comparison

In [None]:
# Visualize model comparison
plt.figure(figsize=(14, 10))

# Create a grouped bar chart for model comparison
bar_width = 0.2
index = np.arange(len(test_results.keys()))

for i, metric in enumerate(metrics):
    plt.bar(index + i*bar_width, model_comparison.loc[metric], bar_width, 
            label=metric.capitalize())

plt.xlabel('Model')
plt.ylabel('Score')
plt.title('Model Performance Comparison')
plt.xticks(index + bar_width*1.5, list(test_results.keys()), rotation=45)
plt.legend()
plt.ylim(0.5, 1.0)  # Adjust as needed for your results
plt.grid(axis='y', alpha=0.3)
plt.tight_layout()
plt.show()

In [None]:
# Interactive model comparison with Plotly
fig = go.Figure()

colors = ['#1f77b4', '#ff7f0e', '#2ca02c', '#d62728']

for i, metric in enumerate(metrics):
    fig.add_trace(go.Bar(
        x=list(test_results.keys()),
        y=model_comparison.loc[metric],
        name=metric.capitalize(),
        marker_color=colors[i]
    ))

fig.update_layout(
    title='Model Performance Comparison',
    xaxis_title='Model',
    yaxis_title='Score',
    yaxis=dict(range=[0.5, 1.0]),  # Adjust as needed
    barmode='group',
    width=900,
    height=600
)

fig.show()

In [None]:
# ROC curves for each model (for multi-class, we'll use One-vs-Rest approach)
def plot_roc_curves(tuned_models, X_test, y_test):
    # Create binary labels for each class
    n_classes = 3  # Low, Medium, High
    y_test_bin = np.zeros((len(y_test), n_classes))
    for i in range(n_classes):
        y_test_bin[:, i] = (y_test == i).astype(int)
    
    plt.figure(figsize=(12, 8))
    
    for name, model_info in tuned_models.items():
        model = model_info['model']
        y_pred_proba = model.predict_proba(X_test)
        
        # Calculate ROC curve and AUC for each class
        fpr = dict()
        tpr = dict()
        roc_auc = dict()
        
        for i in range(n_classes):
            fpr[i], tpr[i], _ = roc_curve(y_test_bin[:, i], y_pred_proba[:, i])
            roc_auc[i] = auc(fpr[i], tpr[i])
        
        # Compute micro-average ROC curve and AUC by flattening labels and scores
        fpr["micro"], tpr["micro"], _ = roc_curve(y_test_bin.ravel(), y_pred_proba.ravel())
        roc_auc["micro"] = auc(fpr["micro"], tpr["micro"])
        
        # Plot micro-average ROC curve
        plt.plot(fpr["micro"], tpr["micro"], label=f'{name} (AUC = {roc_auc["micro"]:.4f})')
    
    plt.plot([0, 1], [0, 1], 'k--', lw=2)
    plt.xlim([0.0, 1.0])
    plt.ylim([0.0, 1.05])
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.title('Receiver Operating Characteristic (ROC) Curves')
    plt.legend(loc="lower right")
    plt.grid(alpha=0.3)
    plt.tight_layout()
    plt.show()

# Plot ROC curves
plot_roc_curves(tuned_models, X_test_scaled, y_test)
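
The manual one-hot loop in `plot_roc_curves` can also be expressed with scikit-learn helpers: `label_binarize` builds the indicator matrix, and `roc_auc_score` computes a one-vs-rest AUC directly. The sketch below compares micro- and macro-averaged AUC on synthetic data; the dataset and the logistic regression model are placeholders for the tuned models.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import auc, roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import label_binarize

# Synthetic three-class problem standing in for Low / Medium / High.
X_demo, y_demo = make_classification(
    n_samples=400, n_features=8, n_informative=4, n_classes=3, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=42
)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
proba = clf.predict_proba(X_te)

# Micro-average: pool every (sample, class) decision into one binary problem.
y_bin = label_binarize(y_te, classes=[0, 1, 2])
fpr, tpr, _ = roc_curve(y_bin.ravel(), proba.ravel())
micro_auc = auc(fpr, tpr)

# Macro-average: mean of the three one-vs-rest AUC values.
macro_auc = roc_auc_score(y_te, proba, multi_class="ovr", average="macro")

print(f"Micro-average AUC: {micro_auc:.4f}")
print(f"Macro-average AUC: {macro_auc:.4f}")
```

With balanced classes, as produced by the quantile binning in this notebook, the two averages tend to be close; they diverge when class frequencies differ.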

## 7. Feature Importance Analysis

In [None]:
# Get feature importance from the best model
def analyze_feature_importance(model_info, feature_names):
    model = model_info['model']
    feature_importance = None
    
    # Extract feature importance based on model type
    if hasattr(model, 'feature_importances_'):  # Tree-based models
        feature_importance = model.feature_importances_
    elif hasattr(model, 'coef_'):  # Linear models
        feature_importance = np.abs(model.coef_).mean(axis=0) if model.coef_.ndim > 1 else np.abs(model.coef_)
    else:  # Use permutation importance
        perm_importance = permutation_importance(model, X_test_scaled, y_test, n_repeats=10, random_state=42)
        feature_importance = perm_importance.importances_mean
    
    # Create DataFrame for feature importance
    importance_df = pd.DataFrame({
        'Feature': feature_names,
        'Importance': feature_importance
    })
    
    # Sort by importance
    importance_df = importance_df.sort_values('Importance', ascending=False).reset_index(drop=True)
    
    return importance_df

# Get the best model based on F1 score
# Cast to float first: idxmax() is not supported on object-dtype rows
best_model_name = model_comparison.loc['f1_score'].astype(float).idxmax()
print(f"Best model based on F1 score: {best_model_name}")

# Analyze feature importance
feature_names = X_reduced.columns
importance_df = analyze_feature_importance(tuned_models[best_model_name], feature_names)

# Display feature importance
print("\nFeature Importance:")
importance_df.head(10)
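
`analyze_feature_importance` falls back to permutation importance only when the model exposes neither `feature_importances_` nor `coef_`. The two measures answer different questions: impurity-based importance is computed from the training data, while permutation importance measures the drop in score on held-out data when a column is shuffled. The sketch below puts both side by side on synthetic data; the feature names `f0` to `f5` and the dataset are hypothetical.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data: three informative features plus three noise features.
X_demo, y_demo = make_classification(
    n_samples=400, n_features=6, n_informative=3, n_redundant=0,
    n_classes=3, n_clusters_per_class=1, random_state=42
)
names = [f"f{i}" for i in range(6)]
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=42
)

rf = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_tr, y_tr)

# Shuffle each column of the held-out set and record the mean accuracy drop.
perm = permutation_importance(rf, X_te, y_te, n_repeats=10, random_state=42)

table = pd.DataFrame({
    "Feature": names,
    "Impurity": rf.feature_importances_,
    "Permutation": perm.importances_mean,
}).sort_values("Permutation", ascending=False)
print(table)
```

Reporting both columns makes it easier to spot features that a tree ensemble splits on frequently but that contribute little to held-out accuracy.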

In [None]:
# Visualize feature importance
plt.figure(figsize=(12, 8))
sns.barplot(x='Importance', y='Feature', data=importance_df.head(10), palette='viridis')
plt.title(f'Top 10 Feature Importance ({best_model_name})')
plt.xlabel('Importance')
plt.ylabel('Feature')
plt.grid(axis='x', alpha=0.3)
plt.tight_layout()
plt.show()

In [None]:
# Interactive feature importance visualization with Plotly
fig = px.bar(importance_df.head(10),
            x='Importance', y='Feature',
            orientation='h',
            title=f'Top 10 Feature Importance ({best_model_name})',
            color='Importance',
            color_continuous_scale='viridis')

fig.update_layout(
    xaxis_title='Importance',
    yaxis_title='Feature',
    height=600,
    width=900
)

fig.show()

## 8. Prediction Analysis & Visualization

In [None]:
# Get the best model for prediction analysis
best_model = tuned_models[best_model_name]['model']
y_pred = test_results[best_model_name]['predictions']
y_pred_proba = test_results[best_model_name]['probabilities']

# Create a DataFrame with actual and predicted values
prediction_df = pd.DataFrame({
    'Actual': y_test,
    'Predicted': y_pred,
    'Prob_Low': y_pred_proba[:, 0],
    'Prob_Medium': y_pred_proba[:, 1],
    'Prob_High': y_pred_proba[:, 2]
})

# Add a column to indicate correct/incorrect predictions
prediction_df['Correct'] = prediction_df['Actual'] == prediction_df['Predicted']

# Display sample of predictions
prediction_df.head(10)

In [None]:
# Visualize prediction distribution
plt.figure(figsize=(15, 6))

plt.subplot(1, 2, 1)
sns.countplot(x='Actual', hue='Predicted', data=prediction_df,
              palette='viridis', alpha=0.7)
plt.title('Actual vs Predicted Congestion Levels')
plt.xlabel('Actual Congestion Level')
plt.ylabel('Count')
plt.legend(title='Predicted', loc='upper right')
plt.xticks([0, 1, 2], ['Low', 'Medium', 'High'])

plt.subplot(1, 2, 2)
sns.countplot(x='Correct', data=prediction_df, palette=['red', 'green'])
plt.title('Prediction Accuracy')
plt.xlabel('Prediction Correct')
plt.ylabel('Count')
plt.xticks([0, 1], ['Incorrect', 'Correct'])
for i, p in enumerate(plt.gca().patches):
    height = p.get_height()
    plt.text(p.get_x() + p.get_width()/2., height + 5,
             f'{height} ({height/len(prediction_df)*100:.1f}%)', ha="center")

plt.tight_layout()
plt.show()

In [None]:
# Analyze prediction probabilities
plt.figure(figsize=(15, 10))

# Prediction probability distributions
plt.subplot(3, 1, 1)
# fill=True replaces the deprecated shade=True; groups are split by the actual class
sns.kdeplot(data=prediction_df[prediction_df['Actual'] == 0]['Prob_Low'], 
           label='Actual Low', color='green', fill=True)
sns.kdeplot(data=prediction_df[prediction_df['Actual'] != 0]['Prob_Low'], 
           label='Actual Not Low', color='red', fill=True)
plt.title('Probability Distribution for Low Congestion Predictions')
plt.xlabel('Probability')
plt.ylabel('Density')
plt.legend()

plt.subplot(3, 1, 2)
sns.kdeplot(data=prediction_df[prediction_df['Actual'] == 1]['Prob_Medium'], 
           label='Actual Medium', color='green', fill=True)
sns.kdeplot(data=prediction_df[prediction_df['Actual'] != 1]['Prob_Medium'], 
           label='Actual Not Medium', color='red', fill=True)
plt.title('Probability Distribution for Medium Congestion Predictions')
plt.xlabel('Probability')
plt.ylabel('Density')
plt.legend()

plt.subplot(3, 1, 3)
sns.kdeplot(data=prediction_df[prediction_df['Actual'] == 2]['Prob_High'], 
           label='Actual High', color='green', fill=True)
sns.kdeplot(data=prediction_df[prediction_df['Actual'] != 2]['Prob_High'], 
           label='Actual Not High', color='red', fill=True)
plt.title('Probability Distribution for High Congestion Predictions')
plt.xlabel('Probability')
plt.ylabel('Density')
plt.legend()

plt.tight_layout()
plt.show()

In [None]:
# Function to get misclassified instances
def analyze_misclassifications(prediction_df, X_test, feature_names):
    # Get misclassified instances
    misclassified = prediction_df[~prediction_df['Correct']].copy()
    
    # Add feature values to misclassified DataFrame
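    # Convert the mask to a NumPy array so it indexes X_test by position, not by pandas label
    misclassified_features = pd.DataFrame(X_test[(~prediction_df['Correct']).to_numpy()], 
                                          columns=feature_names)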
    
    misclassified_data = pd.concat([misclassified.reset_index(drop=True), 
                                    misclassified_features.reset_index(drop=True)], 
                                   axis=1)
    
    return misclassified_data

# Analyze misclassifications
misclassified_data = analyze_misclassifications(prediction_df, X_test_scaled, X_reduced.columns)

# Display summary of misclassifications
print(f"Total misclassifications: {len(misclassified_data)} out of {len(prediction_df)} ({len(misclassified_data)/len(prediction_df)*100:.2f}%)\n")

# Group misclassifications by actual and predicted classes
misclass_summary = pd.crosstab(misclassified_data['Actual'], misclassified_data['Predicted'],
                               rownames=['Actual'], colnames=['Predicted'])
print("Misclassification summary:")
misclass_summary

In [None]:
# Visualize misclassification patterns
# DataFrame.plot opens a new figure unless an Axes is passed, so draw on explicit axes
fig, axes = plt.subplots(1, 2, figsize=(14, 6))

# Label only the classes that actually appear among the misclassified rows
level_names = {0: 'Low', 1: 'Medium', 2: 'High'}
tick_labels = [level_names.get(level, level) for level in misclass_summary.index]

misclass_summary.plot(kind='bar', stacked=True, colormap='viridis', ax=axes[0])
axes[0].set_title('Misclassification Patterns')
axes[0].set_xlabel('Actual Congestion Level')
axes[0].set_ylabel('Count')
axes[0].set_xticklabels(tick_labels, rotation=0)
axes[0].legend(title='Predicted as')

# Convert to percentages for clearer visualization
misclass_percent = misclass_summary.div(misclass_summary.sum(axis=1), axis=0) * 100

misclass_percent.plot(kind='bar', stacked=True, colormap='viridis', ax=axes[1])
axes[1].set_title('Misclassification Percentages')
axes[1].set_xlabel('Actual Congestion Level')
axes[1].set_ylabel('Percentage')
axes[1].set_xticklabels(tick_labels, rotation=0)
axes[1].legend(title='Predicted as')

plt.tight_layout()
plt.show()
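
Beyond counting errors, it can help to check whether misclassified rows differ systematically in their feature values. The following is a minimal sketch, assuming a scaled feature matrix and a boolean correctness mask similar to `X_test_scaled` and `prediction_df['Correct']`; the function name `compare_feature_means`, the feature names, and the synthetic data are hypothetical.

```python
import numpy as np
import pandas as pd


def compare_feature_means(X, correct_mask, feature_names):
    """Mean of each feature for misclassified vs. correct rows, sorted by gap."""
    features = pd.DataFrame(np.asarray(X), columns=list(feature_names))
    mask = np.asarray(correct_mask, dtype=bool)
    comparison = pd.DataFrame({
        "mean_correct": features[mask].mean(),
        "mean_misclassified": features[~mask].mean(),
    })
    # On standardised features the gap is expressed in standard deviations
    comparison["abs_gap"] = (comparison["mean_misclassified"]
                             - comparison["mean_correct"]).abs()
    return comparison.sort_values("abs_gap", ascending=False)


# Synthetic stand-in data: errors are made more likely when feature_a is large
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(500, 3))
error_probability = 1 / (1 + np.exp(-2 * X_demo[:, 0]))
correct_demo = rng.random(500) > error_probability

gaps = compare_feature_means(X_demo, correct_demo,
                             ["feature_a", "feature_b", "feature_c"])
print(gaps)
```

Features with a large gap are candidates for closer inspection, for example by plotting their distributions separately for correct and misclassified rows. A gap on its own does not establish that the feature causes the errors.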

## 9. Model Application: Predicting Congestion Level for New Data

In [None]:
# Example: Predicting congestion level for synthetic test cases
# Let's create some sample data points with different characteristics

# Create a function to prepare new data for prediction
def prepare_data_for_prediction(new_data, scaler):
    # Extract required features
    required_features = X_reduced.columns
    
    # Ensure all required features are present
    for feature in required_features:
        if feature not in new_data.columns:
            raise ValueError(f"Feature '{feature}' is missing from the input data")
    
    # Select only the required features and in the correct order
    X_new = new_data[required_features]
    
    # Scale the features
    X_new_scaled = scaler.transform(X_new)
    
    return X_new_scaled

# Create sample test cases
test_cases = [
    {"Packet_Loss_Rate": 0.01, "Average_Latency_ms": 10.0, "Node_Betweenness_Centrality": 0.5, 
     "Traffic_Volume_MBps": 100.0, "Link_Stability_Score": 0.9, "Hour": 12, "Day": 15, "DayOfWeek": 2},  # Expected: Low
    
    {"Packet_Loss_Rate": 0.08, "Average_Latency_ms": 120.0, "Node_Betweenness_Centrality": 0.6, 
     "Traffic_Volume_MBps": 500.0, "Link_Stability_Score": 0.7, "Hour": 18, "Day": 20, "DayOfWeek": 4},  # Expected: Medium
    
    {"Packet_Loss_Rate": 0.19, "Average_Latency_ms": 250.0, "Node_Betweenness_Centrality": 0.8, 
     "Traffic_Volume_MBps": 900.0, "Link_Stability_Score": 0.5, "Hour": 8, "Day": 25, "DayOfWeek": 1}   # Expected: High
]

# Create additional derived features (if they were in the model)
for case in test_cases:
    if 'Loss_Latency_Interaction' in X_reduced.columns:
        case['Loss_Latency_Interaction'] = case['Packet_Loss_Rate'] * case['Average_Latency_ms']
    if 'Traffic_Stability_Ratio' in X_reduced.columns:
        case['Traffic_Stability_Ratio'] = case['Traffic_Volume_MBps'] / case['Link_Stability_Score']
    if 'Centrality_Loss_Interaction' in X_reduced.columns:
        case['Centrality_Loss_Interaction'] = case['Node_Betweenness_Centrality'] * case['Packet_Loss_Rate']
    if 'IsWeekend' in X_reduced.columns:
        case['IsWeekend'] = 1 if case['DayOfWeek'] >= 5 else 0

# Convert to DataFrame
test_cases_df = pd.DataFrame(test_cases)

# Prepare data for prediction
X_test_cases_scaled = prepare_data_for_prediction(test_cases_df, scaler)

# Predict congestion levels
predictions = best_model.predict(X_test_cases_scaled)
probabilities = best_model.predict_proba(X_test_cases_scaled)

# Create a DataFrame with predictions
results_df = test_cases_df.copy()
results_df['Predicted_Level_Numeric'] = predictions
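# A Python list cannot be indexed with an array, so map the labels through a NumPy array
level_labels = np.array(['Low', 'Medium', 'High'])
results_df['Predicted_Level'] = level_labels[np.asarray(predictions).astype(int)]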
results_df['Probability_Low'] = probabilities[:, 0]
results_df['Probability_Medium'] = probabilities[:, 1]
results_df['Probability_High'] = probabilities[:, 2]

# Display results
print(f"Predictions using {best_model_name}:")
results_df[['Packet_Loss_Rate', 'Average_Latency_ms', 'Traffic_Volume_MBps', 
           'Predicted_Level', 'Probability_Low', 'Probability_Medium', 'Probability_High']]

In [None]:
# Visualize the prediction probabilities
plt.figure(figsize=(12, 8))

x = np.arange(len(results_df))
width = 0.25

plt.bar(x - width, results_df['Probability_Low'], width, label='Low', color='green', alpha=0.7)
plt.bar(x, results_df['Probability_Medium'], width, label='Medium', color='orange', alpha=0.7)
plt.bar(x + width, results_df['Probability_High'], width, label='High', color='red', alpha=0.7)

plt.xlabel('Test Case')
plt.ylabel('Probability')
plt.title('Prediction Probabilities for Test Cases')
plt.xticks(x, [f'Case {i+1}' for i in range(len(results_df))])
plt.legend()
plt.grid(axis='y', alpha=0.3)

# Leave headroom above the bars so the annotations stay inside the axes
plt.ylim(0, 1.15)

for i, p in enumerate(results_df['Predicted_Level']):
    plt.text(i, 1.05, f"Predicted: {p}", ha='center', fontweight='bold')

plt.tight_layout()
plt.show()
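
When scoring new data, one way to avoid applying the scaler and the model as two separate steps is to bundle them into a single scikit-learn `Pipeline`, which can then be saved and reloaded as one object. The sketch below illustrates the idea on synthetic data. The `RandomForestClassifier` merely stands in for `best_model`, and the file name `congestion_pipeline.joblib` is a placeholder rather than an artifact produced earlier in this notebook.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic three-class data standing in for the congestion features
X_demo, y_demo = make_classification(n_samples=300, n_features=8, n_informative=5,
                                     n_classes=3, random_state=42)

# Scaling and classification are fitted together and travel as one object
pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("model", RandomForestClassifier(n_estimators=50, random_state=42)),
])
pipeline.fit(X_demo, y_demo)

# Persist and reload; a temporary directory keeps the example self-contained
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "congestion_pipeline.joblib")
    joblib.dump(pipeline, model_path)
    restored = joblib.load(model_path)

# The restored pipeline accepts raw (unscaled) features directly
labels = np.array(["Low", "Medium", "High"])
new_rows = X_demo[:3]
print(labels[restored.predict(new_rows)])
print(restored.predict_proba(new_rows).round(3))
```

Keeping preprocessing inside the pipeline reduces the risk of scoring unscaled inputs by mistake. Derived features such as the interaction terms would still need to be computed before calling `predict`, unless they are added to the pipeline as a further step.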

## 10. Conclusion and Next Steps

### Summary of Findings

In this analysis, we have:

1. **Preprocessed the network congestion dataset** by handling outliers, creating new features, and preparing the data for modeling.

2. **Performed exploratory data analysis** to understand patterns in network congestion, including temporal patterns and regional variations.

3. **Engineered features** such as time-based attributes and interaction terms to enhance model performance.

4. **Trained multiple machine learning models** including Random Forest, XGBoost, LightGBM, and others.

5. **Evaluated model performance** using accuracy, precision, recall, and F1 score, and selected the best-performing model for the prediction analysis.

6. **Analyzed feature importance** to understand key drivers of network congestion.

7. **Examined model predictions** and analyzed misclassifications to identify patterns.

8. **Applied the model** to predict congestion levels for new data.

### Key Insights

1. The most important factors for predicting network congestion include packet loss rate, latency, and traffic volume.

2. There are distinct patterns in congestion based on time of day and day of week.

3. Some connection pairs consistently experience higher congestion than others.

4. Regional variations in congestion suggest that infrastructure quality or demand patterns differ across regions.

### Next Steps

1. **Model Deployment**: Deploy the best model in a production environment for real-time congestion prediction.

2. **Continuous Learning**: Implement a feedback loop to continuously improve the model with new data.

3. **Additional Features**: Consider incorporating additional features such as weather data, special events, or hardware specifications.

4. **Predictive Maintenance**: Use the model to identify network segments at risk of congestion before issues occur.

5. **Time Series Analysis**: Implement more sophisticated time series models to better capture temporal patterns.

6. **User Interface**: Develop a dashboard for network administrators to visualize predictions and take preventive actions.
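
As a starting point for the time series direction listed above, lagged and rolling features can expose recent history to the classifiers already used in this notebook. The following is a minimal sketch under stated assumptions: the column names `Timestamp` and `Link_ID`, the helper `add_lag_features`, and the synthetic readings are illustrative, and the actual dataset may be structured differently.

```python
import numpy as np
import pandas as pd


def add_lag_features(df, group_col, value_col, lags=(1, 2), window=3):
    """Add per-group lag and rolling-mean features without using future values."""
    out = df.sort_values([group_col, "Timestamp"]).copy()
    grouped = out.groupby(group_col)[value_col]
    for lag in lags:
        out[f"{value_col}_lag{lag}"] = grouped.shift(lag)
    # Shift by one before rolling so the current row is excluded from its own mean
    out[f"{value_col}_roll{window}"] = grouped.transform(
        lambda s: s.shift(1).rolling(window).mean()
    )
    return out


# Synthetic hourly readings for two hypothetical links
rng = np.random.default_rng(1)
timestamps = pd.date_range("2024-01-01", periods=24, freq=pd.Timedelta(hours=1))
demo = pd.DataFrame({
    "Timestamp": np.tile(timestamps, 2),
    "Link_ID": np.repeat(["A", "B"], 24),
    "Traffic_Volume_MBps": rng.uniform(100, 900, size=48),
})

with_lags = add_lag_features(demo, "Link_ID", "Traffic_Volume_MBps")
lagged = with_lags.dropna()  # the first rows of each link have no history yet

# Chronological split: train on earlier timestamps, evaluate on later ones
cutoff = timestamps[18]
train = lagged[lagged["Timestamp"] < cutoff]
test = lagged[lagged["Timestamp"] >= cutoff]
print(len(train), len(test))
```

Splitting by time rather than at random keeps information from later observations out of the training set, which matters once lagged features are involved.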