### Multiplex Network Construction Documentation

In this document, we describe the construction of a multiplex network based on the incident data from the Oklahoma Gas and Electric company. The multiplex network consists of multiple layers, each representing different types of connections between the substations.

#### Layers in the Multiplex Network

1. **Job Region**
   - **Description**: This layer represents the geographical regions where the substations are located. Nodes (substations) are connected if they belong to the same job region.
   
2. **Month/Day/Year**
   - **Description**: This temporal layer represents the date on which incidents occurred. Nodes are connected if incidents at these substations occurred on the same day.
   
3. **Custs Affected Interval**
   - **Description**: This layer categorizes incidents by the number of customers affected. Nodes are connected if their incidents fall within the same interval: Very Low (0 to 50), Low (51 to 100), Medium (101 to 500), or High (more than 500).
   
4. **OGE Causes**
   - **Description**: This layer categorizes incidents based on their causes as defined by the Oklahoma Gas and Electric company. Nodes are connected if incidents share the same cause.
   
5. **Major Storm Event (Yes or No)**
   - **Description**: This layer represents whether an incident occurred during a major storm event. Nodes are connected if their incidents carry the same flag value (Yes or No).
   
6. **Distribution, Substation, Transmission Type**
   - **Description**: This layer represents the type of infrastructure associated with the incidents. Nodes are connected if they belong to the same type, such as distribution, substation, or transmission.

These layers collectively provide a comprehensive view of the different relationships and interactions between the substations based on various criteria, enabling a detailed analysis of the incident data.
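
The edge rule shared by all six layers (connect two substations when their incidents share an attribute value) can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the notebook's implementation: the records, the `substation` and `region` keys, and the helper `build_layer_edges` are all hypothetical.

```python
from itertools import combinations

# Toy incident records; substation names and regions are made up for illustration.
incidents = [
    {"substation": "Sub_A", "region": "North"},
    {"substation": "Sub_B", "region": "North"},
    {"substation": "Sub_C", "region": "South"},
    {"substation": "Sub_A", "region": "North"},  # repeat incident at Sub_A
    {"substation": "Sub_D", "region": "South"},
]

def build_layer_edges(records, attribute):
    """Connect every pair of distinct substations that share an attribute value."""
    groups = {}
    for record in records:
        # A set per attribute value keeps each substation once per group.
        groups.setdefault(record[attribute], set()).add(record["substation"])
    edges = set()
    for members in groups.values():
        # Sorting gives each undirected edge one canonical (a, b) form.
        for a, b in combinations(sorted(members), 2):
            edges.add((a, b))
    return edges

region_edges = build_layer_edges(incidents, "region")
print(sorted(region_edges))
```

Each group becomes a clique, so a group with many substations contributes a number of edges that grows quadratically with its size. Layers with few distinct values, such as the storm flag, will therefore be very dense.
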

#### Target Columns

1. **Job Area (DISTRICT)**
   - **Description**: The first target column. It is used to predict the Job Area, i.e. the district.

2. **Extent**
   - **Description**: The second target column. It is used to predict the extent of an incident, that is, how large or small in scale the incident is.


In [None]:
# Cell 01 {Loading libraries}
import pandas as pd
import networkx as nx
import itertools
import numpy as np
import concurrent.futures
import os

In [None]:
# Cell 02 Multiplex Network Construction 

file_path = 'Incidents.xlsx'
data = pd.read_excel(file_path, engine='openpyxl')

# Preprocess data: replace spaces in 'Job Substation' names with underscores
data['Job Substation'] = data['Job Substation'].str.replace(' ', '_')

# Define intervals for 'Custs Affected Interval'
custs_intervals = {
    'Very Low': (0, 50),
    'Low': (51, 100),
    'Medium': (101, 500),
    'High': (501, float('inf'))
}

def categorize_custs_affected(affected):
    for category, (low, high) in custs_intervals.items():
        if low <= affected <= high:
            return category
    return 'Unknown'

# Add a column for categorized customer affected intervals
data['Custs Affected Interval'] = data['Custs Affected'].apply(categorize_custs_affected)

# Initialize a dictionary to hold each layer's graph
layers = {
    'Job Region': nx.Graph(),
    'Time': nx.Graph(),
    'Custs Affected Interval': nx.Graph(),
    'OGE Causes': nx.Graph(),
    'Major Storm Event': nx.Graph(),
    'Distribution, Substation, Transmission': nx.Graph()
}

# Add nodes to each layer
nodes = data['Job Substation'].unique()
for layer in layers.values():
    layer.add_nodes_from(nodes)

# Group data once for each layer
grouped_data = {
    'Job Region': data.groupby('Job Region'),
    'Time': data.groupby('Month/Day/Year'),
    'Custs Affected Interval': data.groupby('Custs Affected Interval'),
    'OGE Causes': data.groupby('OGE Causes'),
    'Major Storm Event': data.groupby('Major Storm Event  Y (Yes) or N (No)'),
    'Distribution, Substation, Transmission': data.groupby('Distribution, Substation, Transmission')
}

# Add edges between every pair of distinct substations in the same group
def add_edges_by_group(layer, groups):
    for _, group in groups:
        # De-duplicate so repeat incidents at one substation do not create self-loops
        group_nodes = group['Job Substation'].unique().tolist()
        if len(group_nodes) > 1:
            layer.add_edges_from(itertools.combinations(group_nodes, 2))

# Use ThreadPoolExecutor to parallelize edge addition
with concurrent.futures.ThreadPoolExecutor(max_workers=6) as executor:
    futures = []
    for layer_name, group in grouped_data.items():
        future = executor.submit(add_edges_by_group, layers[layer_name], group)
        futures.append(future)
    
    # Ensure all futures are completed
    for future in concurrent.futures.as_completed(futures):
        future.result()

# Custom class to represent a multiplex network
class MultiplexNetwork:
    def __init__(self):
        self.layers = {}
        self.node_set = set()
        
    def add_layer(self, layer_name, graph):
        self.layers[layer_name] = graph
        self.node_set.update(graph.nodes)
        
    def get_layer(self, layer_name):
        return self.layers.get(layer_name, None)
    
    def nodes(self):
        return self.node_set
    
    def edges(self, layer_name=None):
        if layer_name:
            return self.layers[layer_name].edges
        else:
            all_edges = {}
            for layer, graph in self.layers.items():
                all_edges[layer] = list(graph.edges)
            return all_edges

# Create the multiplex network
multiplex_network = MultiplexNetwork()
for layer_name, graph in layers.items():
    multiplex_network.add_layer(layer_name, graph)

# Print the number of nodes and edges in each layer
for layer_name in layers:
    print(f"Number of nodes in {layer_name} layer: {len(layers[layer_name].nodes)}")
    print(f"Number of edges in {layer_name} layer: {len(layers[layer_name].edges)}")
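
The interval lookup in Cell 02 uses closed integer ranges, which works when customer counts are whole numbers. A fractional value such as 50.5 would fall between two ranges and come back as 'Unknown'. The sketch below is one possible alternative, assuming the same four labels and boundaries; the names `UPPER_BOUNDS`, `LABELS`, and `categorize` are hypothetical and do not appear in the notebook.

```python
from bisect import bisect_left

# Inclusive upper bounds of the first three bins; anything above 500 is 'High'.
UPPER_BOUNDS = [50, 100, 500]
LABELS = ["Very Low", "Low", "Medium", "High"]

def categorize(affected):
    """Return the interval label, or 'Unknown' for missing or negative counts."""
    # NaN is the only value that is not equal to itself.
    if affected is None or affected != affected or affected < 0:
        return "Unknown"
    # bisect_left returns the index of the first bound >= affected,
    # so 50 maps to bin 0 and 50.5 maps to bin 1.
    return LABELS[bisect_left(UPPER_BOUNDS, affected)]

for value in (0, 50, 50.5, 100, 101, 500, 501, float("nan")):
    print(value, categorize(value))
```

For whole-number inputs this gives the same labels as the range table in Cell 02.
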


In [None]:
# Cell 3:  Save adjacency matrices of each layer as CSV files

# Define the directory to save the adjacency matrices
output_dir = '/mmfs1/home/muhammad.kazim/Embeddings_EMGNN_Paper_Codes/Full_SPP_NW_Adj'
os.makedirs(output_dir, exist_ok=True)

# Function to save adjacency matrix of each layer
def save_adjacency_matrices(multiplex_network, output_dir):
    for layer_name, graph in multiplex_network.layers.items():
        # Create adjacency matrix
        adjacency_matrix = nx.to_numpy_array(graph)
        
        # Convert adjacency matrix to DataFrame for CSV export
        adjacency_df = pd.DataFrame(adjacency_matrix, index=graph.nodes, columns=graph.nodes)
        
        # Define file path
        file_path = os.path.join(output_dir, f"{layer_name.replace(' ', '_')}_adjacency_matrix.csv")
        
        # Save adjacency matrix as .csv file
        adjacency_df.to_csv(file_path)
        print(f"Adjacency matrix for layer '{layer_name}' saved at: {file_path}")

# Assuming `multiplex_network` is your existing multiplex network object
save_adjacency_matrices(multiplex_network, output_dir)


After creating the adjacency matrices, we constructed the network embeddings with the multi-node2vec algorithm.
```bash
python multi_node2vec.py --dir /path/to/your/data --output /path/to/output --d 100 --walk_length 100 --window_size 10 --n_samples 1 --thresh 0.5 --w2v_workers 8 --rvals 0.25 --pvals 1 --qvals 0.5
```
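
Before running the embedding step, it may be worth sanity-checking each exported matrix. The layers are undirected, so every matrix should be symmetric, and a nonzero diagonal entry indicates a self-loop (a substation paired with itself). The sketch below uses plain nested lists and a hypothetical helper `check_layer`. Whether multi-node2vec tolerates self-loops or asymmetric input is not verified here.

```python
def check_layer(matrix):
    """Return a list of problems found in a square adjacency matrix."""
    n = len(matrix)
    if any(len(row) != n for row in matrix):
        return ["matrix is not square"]
    problems = []
    for i in range(n):
        if matrix[i][i] != 0:
            problems.append(f"self-loop at index {i}")
        # Only the upper triangle needs to be compared against the lower one.
        for j in range(i + 1, n):
            if matrix[i][j] != matrix[j][i]:
                problems.append(f"asymmetric entry at ({i}, {j})")
    return problems

# Toy matrices for illustration only.
good = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
bad = [[1, 1, 0], [1, 0, 0], [0, 1, 0]]

print(check_layer(good))
print(check_layer(bad))
```

To apply the same check to the saved files, each CSV could be loaded with `pd.read_csv(path, index_col=0)` and passed in as `df.values.tolist()`.
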

In [None]:

# Cell 4: Merging Embeddings with the Feature/Raw Data

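# Load embeddings and raw data
embeddings = pd.read_csv('/mmfs1/home/muhammad.kazim/Multiplex_NW_Emd/r0.25/mltn2v_results.csv', header=None)
raw_data = pd.read_excel('Incidents.xlsx', engine='openpyxl')

# Apply the same name normalization as in Cell 02 so raw names match the node names
# (assumes the embeddings file keys its rows by those node names)
raw_data['Job Substation'] = raw_data['Job Substation'].str.replace(' ', '_')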

# Check the structure of the embeddings DataFrame
print("Embeddings DataFrame head:")
print(embeddings.head())

# Treat the first column as the substation name and the remaining columns as embedding dimensions
substation_column = embeddings.columns[0]
embedding_columns = embeddings.columns[1:]

# Map Job Substation to embeddings
substation_mapping = dict(zip(embeddings[substation_column], embeddings[embedding_columns].values.tolist()))

# Function to map and add embeddings to raw data
def map_embeddings(row, mapping):
    substation = row['Job Substation']
    if substation in mapping:
        return mapping[substation]
    else:
        return [None] * (len(embedding_columns))

# Apply the function to the raw data
embeddings_columns = [f'Embedding_{i+1}' for i in range(len(embedding_columns))]
raw_data[embeddings_columns] = raw_data.apply(map_embeddings, axis=1, mapping=substation_mapping, result_type='expand')

# Select only the embeddings and the target column
augmented_data = raw_data[['Job Area (DISTRICT)'] + embeddings_columns]

# Handle missing values in embeddings
augmented_data = augmented_data.fillna(0)  # Fill missing values with 0 for simplicity

# Save the augmented data to a new file
augmented_data.to_csv('/mmfs1/home/muhammad.kazim/Multiplex_NW_Emd/augmented_data_with_embeddings_c.csv', index=False)

print("Augmented data saved to 'augmented_data_with_embeddings_c.csv'")
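
The merge in Cell 4 only finds a match when the substation name in the raw data is spelled exactly like the key in the embeddings file. The sketch below illustrates that with a toy dictionary. It assumes the embedding keys use the underscore form created in Cell 02; `normalize`, `lookup`, `embedding_by_node`, and the example names are hypothetical.

```python
def normalize(name):
    """Apply the same space-to-underscore substitution used for node names."""
    return name.replace(" ", "_")

# Hypothetical 3-dimensional embeddings keyed by normalized node name.
embedding_by_node = {
    "Cedar_Lane": [0.1, 0.2, 0.3],
    "Elm": [0.4, 0.5, 0.6],
}
DIM = 3

def lookup(raw_name):
    """Return the embedding for a raw substation name, or zeros if absent."""
    return embedding_by_node.get(normalize(raw_name), [0.0] * DIM)

print(lookup("Cedar Lane"))   # found only because the name is normalized first
print(lookup("Unknown Sub"))  # falls back to a zero vector
```

Because unmatched rows are filled with zeros, a naming mismatch fails silently. Counting the all-zero embedding rows after the merge is a cheap way to detect it.
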


In [None]:
# Load both CSV files to check the columns
first_csv = pd.read_csv('/mmfs1/home/muhammad.kazim/Embeddings_EMGNN_Paper_Codes/Multiplex_NW_Emd/augmented_data_with_embeddings_c.csv')
second_csv = pd.read_csv('/mmfs1/home/muhammad.kazim/Embeddings_EMGNN_Paper_Codes/Multiplex_NW_Emd/augmented_data_with_embeddings_and_features.csv')

# Print the column names for both files
print("First CSV (Embeddings and Target) Columns:")
print(first_csv.columns.tolist())

print("\nSecond CSV (Embeddings and Additional Features) Columns:")
print(second_csv.columns.tolist())


In [5]:
import pandas as pd
import numpy as np
from cuml.ensemble import RandomForestClassifier as cuRF
from cuml.neighbors import KNeighborsClassifier as cuKNN
from xgboost import XGBClassifier  # XGBoost for Gradient Boosting with GPU support
from cuml.metrics import accuracy_score
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import KFold, cross_val_score
import cudf
import cupy as cp  # Import CuPy for GPU-based arrays

# File paths
raw_data_file_path = 'Incidents.xlsx'
embedding_file_path = '/mmfs1/home/muhammad.kazim/Multiplex_NW_Emd/augmented_data_with_embeddings_c.csv'
embedding_with_features_file_path = '/mmfs1/home/muhammad.kazim/Multiplex_NW_Emd/augmented_data_with_embeddings_and_features.csv'
output_file_path = '/mmfs1/home/muhammad.kazim/Multiplex_NW_Emd/model_comparison_OGE_Causes_corrected.txt'

# Load raw data using pandas for simplicity
raw_data_pandas = pd.read_excel(raw_data_file_path, engine='openpyxl')

# Ensure all columns are either numeric or categorical
for col in raw_data_pandas.columns:
    if raw_data_pandas[col].dtype == 'object':
        raw_data_pandas[col] = raw_data_pandas[col].astype(str)

# Convert to cuDF for GPU processing
raw_data = cudf.DataFrame.from_pandas(raw_data_pandas)
embedding_data = cudf.read_csv(embedding_file_path)
embedding_with_features_data = cudf.read_csv(embedding_with_features_file_path)

# Handle missing values
raw_data = raw_data.fillna(0)
embedding_data = embedding_data.fillna(0)
embedding_with_features_data = embedding_with_features_data.fillna(0)

# Full features from raw data
full_features = raw_data.columns.tolist()

# Selected 6 features (the attributes used to build the network layers)
selected_features = ['Job Region', 'Month/Day/Year', 'Custs Affected', 'OGE Causes', 
                     'Major Storm Event  Y (Yes) or N (No)', 'Distribution, Substation, Transmission']

# Encode labels
def encode_labels(y):
    le = LabelEncoder()
    y_encoded = le.fit_transform(y)
    return y_encoded, le

# Prepare data by ensuring it's all numeric and converted to float32
def prepare_data(data, target_column, features=None):
    if features:
        X = data[features].drop(columns=[target_column], errors='ignore')
    else:
        X = data.drop(columns=[target_column])
    
    # Keep only numeric columns (string and datetime columns are dropped, not encoded) and convert to float32
    X = X.select_dtypes(include=[np.number]).astype('float32')
    
    y = data[target_column].to_pandas()  # Convert to pandas to handle labels
    y_encoded, le = encode_labels(y)
    
    # Convert data to CuPy for cuML models
    X_cupy = X.to_cupy()
    y_cupy = cp.asarray(y_encoded)
    
    return X_cupy, y_cupy

# Perform cross-validation using cuML models (GPU)
def perform_cv(X, y, models, cv=5):
    kf = KFold(n_splits=cv, shuffle=True, random_state=42)
    results = {}

    for name, model in models.items():
        accuracy_scores = []

        for train_index, test_index in kf.split(X):
            X_train, X_test = X[train_index], X[test_index]
            y_train, y_test = y[train_index], y[test_index]

            # Train and evaluate model
            model.fit(X_train, y_train)
            predictions = model.predict(X_test)
            accuracy = accuracy_score(y_test, predictions)
            accuracy_scores.append(accuracy)
        
        mean_accuracy = np.mean(accuracy_scores)
        std_accuracy = np.std(accuracy_scores)
        results[name] = (mean_accuracy, std_accuracy)
    
    return results

# Define models for comparison
models = {
    'RandomForest': cuRF(n_estimators=100),
    'XGBoost': XGBClassifier(n_estimators=100, learning_rate=0.1, device='cuda', eval_metric='logloss'),
    'KNeighbors': cuKNN(n_neighbors=5)
}

# Perform predictions and comparison
results = []
for target in ['Job Area (DISTRICT)']:
    results.append(f"Target: {target}\n")
    
    # Prepare full raw data
    X_full, y_full = prepare_data(raw_data, target)
    
    # Prediction using full raw data
    full_data_results = perform_cv(X_full, y_full, models)
    results.append("Full Raw Data Results:\n")
    for model, (mean_acc, std_acc) in full_data_results.items():
        results.append(f"{model}: Mean Accuracy = {mean_acc:.4f}, Standard Deviation = {std_acc:.4f}\n")
    
    # Prepare raw data with selected features
    X_selected, y_selected = prepare_data(raw_data, target, selected_features)
    
    # Prediction using raw data with selected features
    selected_results = perform_cv(X_selected, y_selected, models)
    results.append("Raw Data with Selected Features Results:\n")
    for model, (mean_acc, std_acc) in selected_results.items():
        results.append(f"{model}: Mean Accuracy = {mean_acc:.4f}, Standard Deviation = {std_acc:.4f}\n")
    
    # Prepare embedding data
    X_embed, y_embed = prepare_data(embedding_data, target)
    
    # Prediction using embedding data
    embedding_results = perform_cv(X_embed, y_embed, models)
    results.append("Embedding Data Results:\n")
    for model, (mean_acc, std_acc) in embedding_results.items():
        results.append(f"{model}: Mean Accuracy = {mean_acc:.4f}, Standard Deviation = {std_acc:.4f}\n")
    
    # Prepare embeddings with full features data
    X_embed_full, y_embed_full = prepare_data(embedding_with_features_data, target)
    
    # Prediction using embedding with full features
    embedding_with_features_results = perform_cv(X_embed_full, y_embed_full, models)
    results.append("Embedding and Full Features Results:\n")
    for model, (mean_acc, std_acc) in embedding_with_features_results.items():
        results.append(f"{model}: Mean Accuracy = {mean_acc:.4f}, Standard Deviation = {std_acc:.4f}\n")
    
    results.append("\n")

# Write results to output file
with open(output_file_path, 'w') as f:
    f.writelines(results)

print("Prediction results have been written to:", output_file_path)


  ret_val = func(*args, **kwargs)
Defaulting to CPU-based Prediction. 
To predict on float-64 data, set parameter predict_model = 'CPU'


Prediction results have been written to: /mmfs1/home/muhammad.kazim/Embeddings_EMGNN_Paper_Codes/Multiplex_NW_Emd/model_comparison_OGE_Causes_corrected.txt
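
When reading the accuracy figures, it can help to compare them with a trivial baseline evaluated under the same kind of 5-fold split. The sketch below is a plain-Python illustration, not part of the pipeline above: it always predicts the most common training label and reports the mean and population standard deviation across folds (the same convention as `np.std`). The fold assignment is a simplified stand-in for scikit-learn's `KFold`, and the label distribution is invented.

```python
import random
from collections import Counter
from statistics import mean, pstdev

def kfold_indices(n_samples, n_splits=5, seed=42):
    """Yield (train, test) index lists after one seeded shuffle."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Interleaved slices give folds whose sizes differ by at most one.
    folds = [indices[i::n_splits] for i in range(n_splits)]
    for k in range(n_splits):
        train = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield train, folds[k]

def majority_baseline_cv(labels, n_splits=5):
    """Cross-validated accuracy of always predicting the majority training label."""
    scores = []
    for train, test in kfold_indices(len(labels), n_splits):
        majority = Counter(labels[i] for i in train).most_common(1)[0][0]
        scores.append(sum(labels[i] == majority for i in test) / len(test))
    return mean(scores), pstdev(scores)

# Invented, imbalanced district labels for illustration.
labels = ["D1"] * 70 + ["D2"] * 20 + ["D3"] * 10
mean_acc, std_acc = majority_baseline_cv(labels)
print(f"Majority baseline: Mean Accuracy = {mean_acc:.4f}, Standard Deviation = {std_acc:.4f}")
```

A model whose mean accuracy is close to this baseline has learned little beyond the class frequencies, so the baseline is a useful reference when the district labels are imbalanced.
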


In [9]:
import pandas as pd
import numpy as np
from cuml.ensemble import RandomForestClassifier as cuRF
from cuml.neighbors import KNeighborsClassifier as cuKNN
from xgboost import XGBClassifier  # XGBoost for Gradient Boosting with GPU support
from cuml.metrics import accuracy_score
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import KFold, cross_val_score
import cudf
import cupy as cp  # Import CuPy for GPU-based arrays

# File paths
raw_data_file_path = 'Incidents.xlsx'
embedding_file_path = '/mmfs1/home/muhammad.kazim/Embeddings_EMGNN_Paper_Codes/Multiplex_NW_Emd/augmented_data_with_embeddings_c.csv'
embedding_with_features_file_path = '/mmfs1/home/muhammad.kazim/Embeddings_EMGNN_Paper_Codes/Multiplex_NW_Emd/augmented_data_with_embeddings_and_features.csv'
output_file_path = '/mmfs1/home/muhammad.kazim/Embeddings_EMGNN_Paper_Codes/Final_Results_All_no_parameter.txt'

# Load raw data using pandas for simplicity
raw_data_pandas = pd.read_excel(raw_data_file_path, engine='openpyxl')

# Ensure all columns are either numeric or categorical
for col in raw_data_pandas.columns:
    if raw_data_pandas[col].dtype == 'object':
        raw_data_pandas[col] = raw_data_pandas[col].astype(str)

# Convert to cuDF for GPU processing
raw_data = cudf.DataFrame.from_pandas(raw_data_pandas)
embedding_data = cudf.read_csv(embedding_file_path)
embedding_with_features_data = cudf.read_csv(embedding_with_features_file_path)

# Handle missing values
raw_data = raw_data.fillna(0)
embedding_data = embedding_data.fillna(0)
embedding_with_features_data = embedding_with_features_data.fillna(0)

# Full features from raw data
full_features = raw_data.columns.tolist()

# Selected 6 features (the attributes used to build the network layers)
selected_features = ['Job Region', 'Month/Day/Year', 'Custs Affected', 'OGE Causes', 
                     'Major Storm Event  Y (Yes) or N (No)', 'Distribution, Substation, Transmission']

# Encode labels
def encode_labels(y):
    le = LabelEncoder()
    y_encoded = le.fit_transform(y)
    return y_encoded, le

# Prepare data by ensuring it's all numeric and converted to float32
def prepare_data(data, target_column, features=None):
    if features:
        X = data[features].drop(columns=[target_column], errors='ignore')
    else:
        X = data.drop(columns=[target_column])
    
    # Keep only numeric columns (string and datetime columns are dropped, not encoded) and convert to float32
    X = X.select_dtypes(include=[np.number]).astype('float32')
    
    y = data[target_column].to_pandas()  # Convert to pandas to handle labels
    y_encoded, le = encode_labels(y)
    
    # Convert data to CuPy for cuML models
    X_cupy = X.to_cupy()
    y_cupy = cp.asarray(y_encoded)
    
    return X_cupy, y_cupy

# Perform cross-validation using cuML models (GPU)
def perform_cv(X, y, models, cv=5):
    kf = KFold(n_splits=cv, shuffle=True, random_state=42)
    results = {}

    for name, model in models.items():
        accuracy_scores = []

        for train_index, test_index in kf.split(X):
            X_train, X_test = X[train_index], X[test_index]
            y_train, y_test = y[train_index], y[test_index]

            # Train and evaluate model
            model.fit(X_train, y_train)
            predictions = model.predict(X_test)
            accuracy = accuracy_score(y_test, predictions)
            accuracy_scores.append(accuracy)
        
        mean_accuracy = np.mean(accuracy_scores)
        std_accuracy = np.std(accuracy_scores)
        results[name] = (mean_accuracy, std_accuracy)
    
    return results

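# Models with explicit hyperparameters (as in the previous run), kept for reference
# models = {
#     'RandomForest': cuRF(n_estimators=100),
#     'XGBoost': XGBClassifier(n_estimators=100, learning_rate=0.1, device='cuda', eval_metric='logloss'),
#     'KNeighbors': cuKNN(n_neighbors=5)
# }

# Models with library default hyperparameters.
# XGBClassifier() does not set device='cuda' here, which is likely the source of the
# device-mismatch warning in this cell's output.
models = {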
    'RandomForest': cuRF(),
    'XGBoost': XGBClassifier(),
    'KNeighbors': cuKNN()
}

# Perform predictions and comparison
results = []
for target in ['Job Area (DISTRICT)']:
    results.append(f"Target: {target}\n")
    
    # Prepare full raw data
    X_full, y_full = prepare_data(raw_data, target)
    
    # Prediction using full raw data
    full_data_results = perform_cv(X_full, y_full, models)
    results.append("Full Raw Data Results:\n")
    for model, (mean_acc, std_acc) in full_data_results.items():
        results.append(f"{model}: Mean Accuracy = {mean_acc:.4f}, Standard Deviation = {std_acc:.4f}\n")
    
    # Prepare raw data with selected features
    X_selected, y_selected = prepare_data(raw_data, target, selected_features)
    
    # Prediction using raw data with selected features
    selected_results = perform_cv(X_selected, y_selected, models)
    results.append("Raw Data with Selected Features Results:\n")
    for model, (mean_acc, std_acc) in selected_results.items():
        results.append(f"{model}: Mean Accuracy = {mean_acc:.4f}, Standard Deviation = {std_acc:.4f}\n")
    
    # Prepare embedding data
    X_embed, y_embed = prepare_data(embedding_data, target)
    
    # Prediction using embedding data
    embedding_results = perform_cv(X_embed, y_embed, models)
    results.append("Embedding Data Results:\n")
    for model, (mean_acc, std_acc) in embedding_results.items():
        results.append(f"{model}: Mean Accuracy = {mean_acc:.4f}, Standard Deviation = {std_acc:.4f}\n")
    
    # Prepare embeddings with full features data
    X_embed_full, y_embed_full = prepare_data(embedding_with_features_data, target)
    
    # Prediction using embedding with full features
    embedding_with_features_results = perform_cv(X_embed_full, y_embed_full, models)
    results.append("Embedding and Full Features Results:\n")
    for model, (mean_acc, std_acc) in embedding_with_features_results.items():
        results.append(f"{model}: Mean Accuracy = {mean_acc:.4f}, Standard Deviation = {std_acc:.4f}\n")
    
    results.append("\n")

# Write results to output file
with open(output_file_path, 'w') as f:
    f.writelines(results)

print("Prediction results have been written to:", output_file_path)


Potential solutions:
- Use a data structure that matches the device ordinal in the booster.
- Set the device for booster before call to inplace_predict.




Prediction results have been written to: /mmfs1/home/muhammad.kazim/Embeddings_EMGNN_Paper_Codes/Final_Results_All_no_parameter.txt
