### Multiplex Network Construction Documentation

This document describes the construction of a multiplex network from the incident data of the Oklahoma Gas and Electric company. The network consists of several layers over the same set of nodes (substations); each layer encodes a different type of connection between substations.

#### Layers in the Multiplex Network

1. **Job Region**
   - **Description**: This layer represents the geographical regions where the substations are located. Nodes (substations) are connected if they belong to the same job region.
   
2. **Job Area (DISTRICT)**
   - **Description**: This layer represents the specific districts within the regions. Nodes are connected if they belong to the same job area or district.
   
3. **Month/Day/Year**
   - **Description**: This temporal layer (named `Time` in the code) represents the date on which incidents occurred. Nodes are connected if incidents at these substations occurred on the same day.
   
4. **Custs Affected Interval**
   - **Description**: This layer categorizes incidents based on the number of customers affected. Nodes are connected if the number of affected customers falls within the same interval: Very Low (0 to 50), Low (51 to 100), Medium (101 to 500), or High (501 and above).
   
5. **OGE Causes**
   - **Description**: This layer categorizes incidents based on their causes as defined by the Oklahoma Gas and Electric company. Nodes are connected if incidents share the same cause.
   
6. **Major Storm Event (Yes or No)**
   - **Description**: This layer represents whether an incident occurred during a major storm event. Nodes are connected if incidents at these substations share the same flag value (both Yes or both No); the layer does not distinguish between individual storms.
   
7. **Distribution, Substation, Transmission Type**
   - **Description**: This layer represents the type of infrastructure associated with the incidents. Nodes are connected if they belong to the same type, such as distribution, substation, or transmission.

Together, these layers describe the relationships between substations along seven criteria. All layers share the same node set (the substations) and differ only in their edges, so each criterion can be analyzed separately or in combination.
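
As a minimal sketch of the rule shared by all seven layers (group the incidents by an attribute, then connect every pair of distinct substations within a group), the following builds a single layer from a toy table. The data and the helper name `build_layer` are hypothetical and serve only to illustrate the idea.

```python
import itertools

import networkx as nx
import pandas as pd

# Hypothetical toy incidents: three substations in two regions
toy = pd.DataFrame({
    'Job Substation': ['A', 'B', 'C', 'A'],
    'Job Region': ['North', 'North', 'South', 'North'],
})

def build_layer(df, column):
    """Return a graph linking substations that share a value in `column`."""
    graph = nx.Graph()
    graph.add_nodes_from(df['Job Substation'].unique())
    for _, group in df.groupby(column):
        nodes = group['Job Substation'].unique()  # distinct substations only
        graph.add_edges_from(itertools.combinations(nodes, 2))
    return graph

region_layer = build_layer(toy, 'Job Region')
print(sorted(region_layer.edges))  # [('A', 'B')]
```

Substation `C` is alone in its region, so it stays in the layer as an isolated node.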


In [None]:
# Loading libraries
import pandas as pd
import networkx as nx
import itertools
import numpy as np
import os

In [None]:
# Cell 1: Multiplex Network Creation from the selected features

# Load the dataset
file_path = '/Volumes/Data/NDSU/PhD Work/Research/IME Research/AI-Energy/Data/SPP/Incidents_5000.xlsx'
data = pd.read_excel(file_path)

# Preprocess data: replace spaces in 'Job Substation' names with underscores
data['Job Substation'] = data['Job Substation'].str.replace(' ', '_')

# Define intervals for 'Custs Affected Interval'
custs_intervals = {
    'Very Low': (0, 50),
    'Low': (51, 100),
    'Medium': (101, 500),
    'High': (501, float('inf'))
}

def categorize_custs_affected(affected):
    for category, (low, high) in custs_intervals.items():
        if low <= affected <= high:
            return category
    return 'Unknown'

# Add a column for categorized customer affected intervals
data['Custs Affected Interval'] = data['Custs Affected'].apply(categorize_custs_affected)

# Initialize a dictionary to hold each layer's graph
layers = {
    'Job Region': nx.Graph(),
    'Job Area (DISTRICT)': nx.Graph(),
    'Time': nx.Graph(),
    'Custs Affected Interval': nx.Graph(),
    'OGE Causes': nx.Graph(),
    'Major Storm Event': nx.Graph(),
    'Distribution, Substation, Transmission': nx.Graph()
}

# Add every substation as a node in every layer
# (spaces in the names were already replaced with underscores above)
substations = data['Job Substation'].dropna().unique()
for layer in layers.values():
    layer.add_nodes_from(substations)

# Define functions to add edges to each layer based on criteria
def add_edges_by_column(layer_name, column):
    """Connect every pair of distinct substations that share a value in `column`."""
    layer = layers[layer_name]
    for _, group in data.groupby(column):
        # Deduplicate first: a substation with several incidents in the same
        # group would otherwise be paired with itself and create a self-loop
        nodes = group['Job Substation'].dropna().unique()
        layer.add_edges_from(itertools.combinations(nodes, 2))

def add_edges_by_date(layer_name):
    """Connect substations that had incidents on the same date."""
    add_edges_by_column(layer_name, 'Month/Day/Year')

# Add edges for each layer
add_edges_by_column('Job Region', 'Job Region')
add_edges_by_column('Job Area (DISTRICT)', 'Job Area (DISTRICT)')
add_edges_by_date('Time')
add_edges_by_column('Custs Affected Interval', 'Custs Affected Interval')
add_edges_by_column('OGE Causes', 'OGE Causes')
add_edges_by_column('Major Storm Event', 'Major Storm Event  Y (Yes) or N (No)')
add_edges_by_column('Distribution, Substation, Transmission', 'Distribution, Substation, Transmission')

# Custom class to represent a multiplex network
class MultiplexNetwork:
    def __init__(self):
        self.layers = {}
        self.node_set = set()
        
    def add_layer(self, layer_name, graph):
        self.layers[layer_name] = graph
        self.node_set.update(graph.nodes)
        
    def get_layer(self, layer_name):
        return self.layers.get(layer_name, None)
    
    def nodes(self):
        return self.node_set
    
    def edges(self, layer_name=None):
        if layer_name:
            return self.layers[layer_name].edges
        else:
            all_edges = {}
            for layer, graph in self.layers.items():
                all_edges[layer] = list(graph.edges)
            return all_edges

# Create the multiplex network
multiplex_network = MultiplexNetwork()
for layer_name, graph in layers.items():
    multiplex_network.add_layer(layer_name, graph)

# Interact with the multiplex network
#print(f"All nodes in multiplex network: {multiplex_network.nodes()}")
#for layer_name in layers:
    #print(f"Edges in {layer_name} layer: {multiplex_network.edges(layer_name)}")

# print the number of nodes and edges in each layer
for layer_name in layers:
    print(f"Number of nodes in {layer_name} layer: {len(layers[layer_name].nodes)}")
    print(f"Number of edges in {layer_name} layer: {len(layers[layer_name].edges)}")    
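
The interval assignment in `categorize_custs_affected` can also be expressed with `pd.cut`. The sketch below assumes that customer counts are non-negative integers, so the right-closed bins reproduce the same four intervals; the sample values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical customer counts, including the interval boundaries
counts = pd.Series([0, 50, 51, 100, 101, 500, 501, np.nan])

# Right-closed bins: (-1, 50], (50, 100], (100, 500], (500, inf]
intervals = pd.cut(
    counts,
    bins=[-1, 50, 100, 500, np.inf],
    labels=['Very Low', 'Low', 'Medium', 'High'],
)

# Missing counts fall outside every bin; label them 'Unknown'
intervals = intervals.cat.add_categories(['Unknown']).fillna('Unknown')
print(intervals.tolist())
```

Unlike the dictionary lookup, the bins leave no gap between 50 and 51, so a fractional count (should one ever occur) would still receive a category.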


In [None]:
# Cell 2:  Save adjacency matrices of each layer as CSV files

# Define the directory to save the adjacency matrices
output_dir = '/Volumes/Data/NDSU/PhD Work/Research/IME Research/AI-Energy/Data/SPP/Multiplex Network'
os.makedirs(output_dir, exist_ok=True)

# Function to save adjacency matrix of each layer
def save_adjacency_matrices(multiplex_network, output_dir):
    for layer_name, graph in multiplex_network.layers.items():
        # Create adjacency matrix
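        # Fix the node order so that rows and columns are labeled consistently
        nodes = list(graph.nodes)
        adjacency_matrix = nx.to_numpy_array(graph, nodelist=nodes)

        # Convert adjacency matrix to DataFrame for CSV export
        adjacency_df = pd.DataFrame(adjacency_matrix, index=nodes, columns=nodes)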
        
        # Define file path
        file_path = os.path.join(output_dir, f"{layer_name.replace(' ', '_')}_adjacency_matrix.csv")
        
        # Save adjacency matrix as .csv file
        adjacency_df.to_csv(file_path)
        print(f"Adjacency matrix for layer '{layer_name}' saved at: {file_path}")

# Save the layers of the multiplex network built in Cell 1
save_adjacency_matrices(multiplex_network, output_dir)
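
To check that a saved file can be turned back into a layer, the following sketch writes a small graph to CSV and reloads it. The toy graph and file name are hypothetical, and `nx.to_pandas_adjacency` is used here as a shortcut for the `to_numpy_array` plus `DataFrame` steps above.

```python
import os
import tempfile

import networkx as nx
import pandas as pd

# Hypothetical small layer standing in for one layer of the multiplex network
graph = nx.Graph()
graph.add_nodes_from(['A', 'B', 'C'])
graph.add_edge('A', 'B')

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'toy_adjacency_matrix.csv')
    nx.to_pandas_adjacency(graph).to_csv(path)

    # index_col=0 restores the node labels written as the first CSV column
    adjacency_df = pd.read_csv(path, index_col=0)

# Nonzero entries become edges; the entry value is stored as the 'weight' attribute
reloaded = nx.from_pandas_adjacency(adjacency_df)
print(sorted(reloaded.nodes), reloaded.number_of_edges())
```

Isolated nodes such as `C` survive the round trip because they still occupy a row and a column in the matrix.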


In [None]:
# Cell 3: Merging the Embeddings with the target column
# File paths
incidents_file_path = '/Volumes/Data/NDSU/PhD Work/Research/IME Research/AI-Energy/Data/SPP/Incidents_5000.xlsx'
embeddings_file_path = '/Volumes/Data/NDSU/PhD Work/Research/IME Research/AI-Energy/Data/SPP/Multiplex Network/r0.25/mltn2v_results.csv'
output_file_path = '/Volumes/Data/NDSU/PhD Work/Research/IME Research/AI-Energy/Data/SPP/Multiplex Network/merged_data.csv'

# Read the datasets
incidents_data = pd.read_excel(incidents_file_path)
embeddings_data = pd.read_csv(embeddings_file_path)

# Ensure 'Job Substation' in incidents_data matches the embedding keys
incidents_data['Job Substation'] = incidents_data['Job Substation'].str.replace(' ', '_')

# Reduce incidents_data to necessary columns.
# Note: this keeps one row per incident, so a substation with several
# incidents appears several times in the merged output.
reduced_incidents_data = incidents_data[['Job Substation', 'Job Area (DISTRICT)']]

# The first column of the embeddings file holds the substation identifier;
# rename it so that both tables share the same merge key
embeddings_data = embeddings_data.rename(columns={embeddings_data.columns[0]: 'Job Substation'})

# Merge the embeddings data with the reduced incidents data
merged_data = pd.merge(reduced_incidents_data, embeddings_data, on='Job Substation', how='inner')

# Rename columns for embeddings
embedding_columns = [f'Embedding_{i}' for i in range(1, merged_data.shape[1] - 1)]
merged_data.columns = ['Job Substation', 'Job Area (DISTRICT)'] + embedding_columns

# Save the merged data to a CSV file
merged_data.to_csv(output_file_path, index=False)

print(f"Merged data saved to {output_file_path}")
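
An inner merge silently drops substations that have no embedding. One way to see which ones are affected is a left merge with `indicator=True`; the sketch below uses hypothetical toy tables in place of the real files.

```python
import pandas as pd

# Hypothetical incidents and embeddings; substation 'C' has no embedding
incidents = pd.DataFrame({
    'Job Substation': ['A', 'B', 'C'],
    'Job Area (DISTRICT)': ['D1', 'D1', 'D2'],
})
embeddings = pd.DataFrame({
    'Job Substation': ['A', 'B'],
    'Embedding_1': [0.1, 0.2],
})

# A left merge with indicator=True marks rows that found no embedding
check = pd.merge(incidents, embeddings, on='Job Substation',
                 how='left', indicator=True)
missing = check.loc[check['_merge'] == 'left_only', 'Job Substation'].tolist()
print(missing)  # ['C']
```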


In [None]:
# Cell 4: Prediction through Network Embeddings

from sklearn.model_selection import cross_val_score, KFold
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import LabelEncoder
from sklearn.impute import SimpleImputer

# Path to the embedding dataset
embedding_file_path = '/Volumes/Data/NDSU/PhD Work/Research/IME Research/AI-Energy/Data/SPP/Multiplex Network/merged_data.csv'

# Load the embedding dataset
embedding_data = pd.read_csv(embedding_file_path)

# Prepare data for the embedding dataset
def prepare_embedding_data(data):
    X = data.iloc[:, 2:].values  # All columns except the first two
    y = data['Job Area (DISTRICT)'].values
    return X, y

# Encode labels
def encode_labels(y):
    le = LabelEncoder()
    y_encoded = le.fit_transform(y)
    return y_encoded, le

# Perform 10-fold cross-validation
def perform_cv(X, y, cv=10):
    # Fix the forest's seed as well, so that repeated runs give the same scores
    clf = RandomForestClassifier(random_state=42)
    kf = KFold(n_splits=cv, shuffle=True, random_state=42)
    scores = cross_val_score(clf, X, y, cv=kf, scoring='accuracy')
    return scores.mean(), scores.std()

# Process the embedding dataset
X, y = prepare_embedding_data(embedding_data)
y_encoded, le = encode_labels(y)

# Handle missing values (if any)
imputer = SimpleImputer(strategy='mean')
X = imputer.fit_transform(X)

# Perform cross-validation
mean_acc, std_acc = perform_cv(X, y_encoded, cv=10)

# Print the results
print(f'Multiplex Network: Mean Accuracy = {mean_acc:.4f}, Standard Deviation = {std_acc:.4f}')
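
Because the merge in Cell 3 keeps one row per incident, all rows of a substation carry the same embedding, and a shuffled `KFold` can place copies of one substation in both the training and the test fold. The sketch below shows a group-wise alternative with `GroupKFold`. It is not the evaluation used in this notebook, and the data are randomly generated placeholders.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GroupKFold, cross_val_score

rng = np.random.default_rng(0)

# Hypothetical data: 20 substations, 5 incidents each, 4-dimensional embeddings
n_substations, n_repeats = 20, 5
substation_embeddings = rng.normal(size=(n_substations, 4))
substation_labels = rng.integers(0, 2, size=n_substations)

# Repeat each substation's row once per incident, as the merge does
groups = np.repeat(np.arange(n_substations), n_repeats)
X = substation_embeddings[groups]
y = substation_labels[groups]

# GroupKFold never places the same group in both training and test folds
cv = GroupKFold(n_splits=5)
clf = RandomForestClassifier(random_state=42)
scores = cross_val_score(clf, X, y, groups=groups, cv=cv, scoring='accuracy')
print(f'Mean accuracy = {scores.mean():.4f}, Standard deviation = {scores.std():.4f}')
```

With the real data, the `Job Substation` column of `merged_data.csv` could serve as `groups`.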


In [None]:
# Cell 5: Prediction using the classical machine learning (selected features raw dataset)
import pandas as pd
from sklearn.model_selection import cross_val_score, KFold
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer

# Load the dataset
file_path = '/Volumes/Data/NDSU/PhD Work/Research/IME Research/AI-Energy/Data/SPP/Incidents_5000.xlsx'
data = pd.read_excel(file_path)

# Preprocess data: replace spaces in 'Job Substation' names with underscores
data['Job Substation'] = data['Job Substation'].str.replace(' ', '_')

# Define intervals for 'Custs Affected Interval'
custs_intervals = {
    'Very Low': (0, 50),
    'Low': (51, 100),
    'Medium': (101, 500),
    'High': (501, float('inf'))
}

def categorize_custs_affected(affected):
    for category, (low, high) in custs_intervals.items():
        if low <= affected <= high:
            return category
    return 'Unknown'

# Add a column for categorized customer affected intervals
data['Custs Affected Interval'] = data['Custs Affected'].apply(categorize_custs_affected)

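# Select relevant columns for prediction
columns_to_use = [
    'Job Region', 'Custs Affected Interval', 'OGE Causes',
    'Major Storm Event  Y (Yes) or N (No)', 'Distribution, Substation, Transmission', 'Month/Day/Year'
]

# Convert the date to a numeric day count so that it can be mean-imputed and
# used as a numeric feature (assumes the column parses as dates; entries that
# cannot be parsed become NaN and are imputed later)
dates = pd.to_datetime(data['Month/Day/Year'], errors='coerce')
data['Month/Day/Year'] = (dates - pd.Timestamp('1970-01-01')).dt.days

# Keep only the selected features and the target column
data = data[columns_to_use + ['Job Area (DISTRICT)']]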

# Split the raw dataset into features and target
def prepare_raw_data(data):
    X = data.iloc[:, :-1]  # All columns except the last one (the target)
    y = data['Job Area (DISTRICT)']
    return X, y

# Encode labels
def encode_labels(y):
    le = LabelEncoder()
    y_encoded = le.fit_transform(y)
    return y_encoded, le

# Process the raw dataset
X, y = prepare_raw_data(data)
y_encoded, le = encode_labels(y)

# Define preprocessing for numeric and categorical features
numeric_features = ['Month/Day/Year']
categorical_features = [
    'Job Region', 'Custs Affected Interval', 'OGE Causes',
    'Major Storm Event  Y (Yes) or N (No)', 'Distribution, Substation, Transmission'
]

# Preprocessing for numerical data
numeric_transformer = SimpleImputer(strategy='mean')

# Preprocessing for categorical data
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

# Bundle preprocessing for numerical and categorical data
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)
    ])

# Define the model pipeline
clf = Pipeline(steps=[('preprocessor', preprocessor),
                      ('classifier', RandomForestClassifier(random_state=42))])

# Perform 10-fold cross-validation
def perform_cv(X, y, cv=10):
    kf = KFold(n_splits=cv, shuffle=True, random_state=42)
    scores = cross_val_score(clf, X, y, cv=kf, scoring='accuracy')
    return scores.mean(), scores.std()

# Perform cross-validation
mean_acc, std_acc = perform_cv(X, y_encoded, cv=10)

# Print the results
print(f'Classical ML (raw features): Mean Accuracy = {mean_acc:.4f}, Standard Deviation = {std_acc:.4f}')
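
To illustrate what the preprocessing step of the pipeline produces, the following sketch applies the same imputation and one-hot encoding to a toy table. The column names and values are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy features: one numeric column and two categorical columns
toy = pd.DataFrame({
    'days': [10.0, np.nan, 30.0],
    'region': ['North', 'South', 'North'],
    'cause': ['Wind', 'Tree', 'Animal'],
})

preprocessor = ColumnTransformer(transformers=[
    ('num', SimpleImputer(strategy='mean'), ['days']),
    ('cat', Pipeline(steps=[
        ('imputer', SimpleImputer(strategy='most_frequent')),
        ('onehot', OneHotEncoder(handle_unknown='ignore')),
    ]), ['region', 'cause']),
])

features = preprocessor.fit_transform(toy)
# 1 numeric column + 2 region categories + 3 cause categories = 6 columns
print(features.shape)  # (3, 6)
```

With `handle_unknown='ignore'`, a category that appears only in a test fold is encoded as all zeros instead of raising an error during cross-validation.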
