# Parallel Computing Performance Analysis

This notebook analyzes parallel computing performance metrics from experimental data, including execution time, speedup, and efficiency across different problem sizes and numbers of processes.

## Data Loading and Preprocessing


### Imports

In [None]:
import pandas as pd
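# Import only the helpers this notebook actually uses
from utils import compute_metrics, make_pivot, plot_metrics, clustering_accuracy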

import warnings
warnings.filterwarnings("ignore", category=FutureWarning)

### Load the data from the file

In [None]:
# Load the process information data
DATA_DIR = "../../data"
df = pd.read_csv(DATA_DIR+'/algorithm_results/execution_info.csv')

# Display basic information about the dataset
print("Dataset Overview:")
print(f"Shape: {df.shape}")
print("\nFirst few rows:")
print(df.head())

## Calculation Overview

This section computes parallel-performance metrics from the experimental data.

- Dataset size: a `dataset_size` is the triple `[n_samples, n_features, n_clusters]`.

- Data grouping: mean execution time per `dataset_size` and `n_process` → `grouped`.

- Speedup: for each `dataset_size`, take the execution time with `n_process == 1` as T₁ and compute `speedup = T₁ / execution_time`.

- Efficiency: `efficiency = speedup / n_process`.

- Results are stored in `metrics_df` and summarized as pivot tables: `execution_time_table`, `speedup_table`, `efficiency_table`.

- Filtering: a subset of datasets is selected via `selected_datasets` / `filtered_df` for plotting.

- Visualization: `fig_speedup` and `fig_efficiency` show measured curves together with ideal reference lines (ideal speedup and acceptable efficiency).
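
The speedup and efficiency definitions above can be sketched on a toy table. This is only an illustration, not the `compute_metrics` helper from `utils`, whose implementation is not shown in this notebook. The name `compute_metrics_sketch` and the timing values are made up for the example, and `compute_time` is assumed to be the timing column.

```python
import pandas as pd

# Toy timings for one problem size (values are invented for illustration).
toy = pd.DataFrame({
    "n_samples": [1000, 1000, 1000],
    "n_features": [10, 10, 10],
    "n_clusters": [3, 3, 3],
    "n_process": [1, 2, 4],
    "compute_time": [8.0, 5.0, 4.0],
})

def compute_metrics_sketch(group: pd.DataFrame) -> pd.DataFrame:
    """Add speedup and efficiency columns to one problem-size group."""
    group = group.sort_values("n_process").copy()
    # T1 is the time of the single-process run for this problem size.
    t1 = group.loc[group["n_process"] == 1, "compute_time"].iloc[0]
    group["speedup"] = t1 / group["compute_time"]
    group["efficiency"] = group["speedup"] / group["n_process"]
    return group

toy_metrics = compute_metrics_sketch(toy)
print(toy_metrics[["n_process", "speedup", "efficiency"]])
```

With these toy numbers, doubling the processes from 2 to 4 raises the speedup only from 1.6 to 2.0, so the efficiency drops from 0.8 to 0.5.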

In [None]:
# Group by number of processes and problem size (n_samples, n_features, n_clusters),
# then average the remaining numeric columns (the measured times) over repeated runs
grouped = df.groupby(['n_process', 'n_samples', 'n_features', 'n_clusters']).mean(numeric_only=True).reset_index()

# Apply metrics calculation to each problem size group
metrics_df = grouped.groupby(['n_samples', 'n_features', 'n_clusters']).apply(compute_metrics).reset_index(drop=True)

## Execution Time Table

This table shows how the mean compute time (in seconds) changes with the number of processes for each dataset size.

In [None]:
# Sort processes in logical order
process_order = sorted(metrics_df['n_process'].unique())

execution_time_table = make_pivot(metrics_df, 
                                  value='compute_time', 
                                  index='n_process', 
                                  columns=['n_samples', 'n_features', 'n_clusters'],
                                  process_order=process_order)


print("Execution Time Table (seconds):")
display(execution_time_table)


## Speedup Analysis Table

This section creates a **pivot table** to visualize the average *speedup* achieved as a function of the number of processes (`n_process`) and the dataset size (`[n_samples, n_features, n_clusters]`).  
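
For readers without access to `utils`, the table layout can be approximated with plain pandas. The sketch below is an assumption about what a helper like `make_pivot` might do, not its actual implementation. `make_pivot_sketch` and the toy values are hypothetical.

```python
import pandas as pd

# Toy metrics for two problem sizes (values are invented for illustration).
toy_metrics = pd.DataFrame({
    "n_process": [1, 2, 1, 2],
    "n_samples": [1000, 1000, 2000, 2000],
    "n_features": [10, 10, 10, 10],
    "n_clusters": [3, 3, 3, 3],
    "speedup": [1.0, 1.8, 1.0, 1.9],
})

def make_pivot_sketch(df, value, index, columns, process_order):
    """Pivot `value` into a process-by-problem-size table."""
    table = df.pivot_table(values=value, index=index, columns=columns, aggfunc="mean")
    # Reorder the rows so that process counts appear in the requested order.
    return table.reindex(process_order)

toy_table = make_pivot_sketch(
    toy_metrics,
    value="speedup",
    index="n_process",
    columns=["n_samples", "n_features", "n_clusters"],
    process_order=[1, 2],
)
print(toy_table)
```

Each row is a process count and each column is one `[n_samples, n_features, n_clusters]` combination, so a column read from top to bottom gives the speedup curve for that problem size.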

In [None]:
speedup_table = make_pivot(metrics_df,
                           value='speedup',
                           index='n_process',
                           columns=['n_samples', 'n_features', 'n_clusters'], 
                           process_order=process_order)

print("Speedup Table (T₁/Tₚ):")
display(speedup_table)

## Efficiency Analysis

This section builds a **pivot table** to display the average *efficiency* for different numbers of processes (`n_process`) and dataset sizes (`[n_samples, n_features, n_clusters]`).  
An efficiency close to **1.0** indicates near-perfect scalability.

In [None]:
efficiency_table = make_pivot(metrics_df,
                              value='efficiency',
                              index='n_process',
                              columns=['n_samples', 'n_features', 'n_clusters'],
                              process_order=process_order)

print("Efficiency Table (Speedup/Processes):")
display(efficiency_table)

### E-Step and M-Step Efficiency

As in the previous section, here we build two **pivot tables**, one for the **efficiency** of the **E-Step** and one for the **M-Step**. Separating the two phases allows a more in-depth analysis of where efficiency is lost.


In [None]:
# Use a dedicated name so the overall efficiency_table is not overwritten
e_step_efficiency_table = make_pivot(metrics_df,
                                     value='e_step_efficiency',
                                     index='n_process',
                                     columns=['n_samples', 'n_features', 'n_clusters'],
                                     process_order=process_order)

print("Efficiency Table for E-Step:")
display(e_step_efficiency_table)

In [None]:
# Use a dedicated name so the overall efficiency_table is not overwritten
m_step_efficiency_table = make_pivot(metrics_df,
                                     value='m_step_efficiency',
                                     index='n_process',
                                     columns=['n_samples', 'n_features', 'n_clusters'],
                                     process_order=process_order)

print("Efficiency Table for M-Step:")
display(m_step_efficiency_table)

## Speedup Visualization

This section visualizes the *speedup* behavior for selected datasets using **Plotly**.  
You can choose which datasets to include by editing the `selected_n_samples`, `selected_n_features`, and `selected_n_clusters` lists.  
The plot compares the measured speedup with the **ideal linear speedup** (represented by the dashed red line).  

Each dataset is shown as a separate colored curve, allowing for an easy comparison of scalability across different problem sizes.
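
Because the selection lists are matched position by position, each dataset is a full `(n_samples, n_features, n_clusters)` triple. The sketch below uses invented toy data to show why that matters: independent `isin` tests accept every combination of the listed values, while an inner merge keeps only the listed triples. The names `toy_metrics`, `selected`, `independent`, and `paired` are illustrative only.

```python
import pandas as pd

# Toy metrics table covering four parameter combinations.
toy_metrics = pd.DataFrame({
    "n_samples": [1000, 1000, 2000, 2000],
    "n_features": [10, 20, 10, 20],
    "n_clusters": [3, 3, 3, 3],
    "speedup": [1.0, 1.1, 1.2, 1.3],
})

# Two selected datasets, matched by position: (1000, 10, 3) and (2000, 20, 3).
selected = pd.DataFrame({
    "n_samples": [1000, 2000],
    "n_features": [10, 20],
    "n_clusters": [3, 3],
})
keys = ["n_samples", "n_features", "n_clusters"]

# Independent membership tests accept every combination of the listed values.
independent = toy_metrics[
    toy_metrics["n_samples"].isin(selected["n_samples"])
    & toy_metrics["n_features"].isin(selected["n_features"])
    & toy_metrics["n_clusters"].isin(selected["n_clusters"])
]

# An inner merge on all three keys keeps only the listed triples.
paired = toy_metrics.merge(selected, on=keys, how="inner")

print(len(independent), len(paired))
```

The two approaches agree whenever `n_features` and `n_clusters` are the same for all selected datasets. They diverge as soon as those values differ between datasets.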


In [None]:
# Manual dataset selection:
# - selected_n_samples, selected_n_features, selected_n_clusters: lists of dataset parameters to include in the plot.
#   All lists must have the same length and be ordered consistently: 
#   the first element in selected_n_samples corresponds to the first element in selected_n_features 
#   and the first element in selected_n_clusters, and so on.

selected_n_samples  = [10000000, 2500000, 1250000]
selected_n_features = [50, 50, 50]          
selected_n_clusters = [15, 15, 15]

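# Filter metrics_df for the selected datasets. The merge matches full
# (n_samples, n_features, n_clusters) triples, so the lists stay paired.
selected_datasets = pd.DataFrame({
    'n_samples': selected_n_samples,
    'n_features': selected_n_features,
    'n_clusters': selected_n_clusters,
})
filtered_df = metrics_df.merge(selected_datasets,
                               on=['n_samples', 'n_features', 'n_clusters'])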

# Create and show the speedup figure
fig_speedup = plot_metrics(filtered_df, metric='speedup', fixed_parameters=['n_samples'])
fig_speedup.show()

## Efficiency Visualization

This section plots the *parallel efficiency* for the selected datasets using **Plotly**.  
Each curve represents how efficiently computational resources are used as the number of processes increases.  

The red dotted line marks an **acceptable efficiency threshold** at 0.75, helping to visually identify when performance begins to degrade.  
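
The same threshold can also be read off numerically. The sketch below uses an invented efficiency curve and finds the largest process count whose efficiency is still at or above 0.75. The data and the names `toy`, `THRESHOLD`, and `max_acceptable_processes` are assumptions for illustration.

```python
import pandas as pd

# Toy efficiency curve for one problem size (values are invented for illustration).
toy = pd.DataFrame({
    "n_process": [1, 2, 4, 8, 16],
    "efficiency": [1.00, 0.95, 0.85, 0.70, 0.50],
})

THRESHOLD = 0.75  # same value as the reference line in the plot

# Keep the runs at or above the threshold and report the largest process count among them.
acceptable = toy[toy["efficiency"] >= THRESHOLD]
max_acceptable_processes = int(acceptable["n_process"].max())
print(max_acceptable_processes)
```

Applied per problem size, this gives a single number to compare across datasets: larger inputs are generally expected to stay above the threshold for more processes.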


In [None]:
# Create Efficiency figure
fig_efficiency = plot_metrics(filtered_df, metric='efficiency')
# Show plot
fig_efficiency.show()

## Accuracy Analysis

The `clustering_accuracy` function computes the accuracy of clustering results by comparing predicted labels with the true labels.  
The function returns the accuracy as a **percentage**.
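
Cluster ids produced by a clustering algorithm are arbitrary, so a common approach is to match predicted clusters to true labels before counting agreements. The sketch below does this with the Hungarian algorithm from SciPy. It is not the `clustering_accuracy` implementation from `utils`, and whether that function performs such a matching is not stated here. The column names `true_label` and `predicted_label` and the function name are hypothetical.

```python
import pandas as pd
from scipy.optimize import linear_sum_assignment

def clustering_accuracy_sketch(df, true_col="true_label", pred_col="predicted_label"):
    """Return the accuracy (in percent) under the best one-to-one label matching."""
    # Contingency table: rows are true labels, columns are predicted cluster ids.
    contingency = pd.crosstab(df[true_col], df[pred_col]).to_numpy()
    # Choose the label-to-cluster assignment that maximizes the number of agreements.
    rows, cols = linear_sum_assignment(contingency, maximize=True)
    return 100.0 * contingency[rows, cols].sum() / len(df)

# Toy data: cluster ids are permuted relative to the true labels, with one mismatch.
toy = pd.DataFrame({
    "true_label":      [0, 0, 0, 1, 1, 1],
    "predicted_label": [1, 1, 0, 0, 0, 0],
})
toy_accuracy = clustering_accuracy_sketch(toy)
print(f"Clustering accuracy: {toy_accuracy:.2f}%")
```

A direct comparison of the two columns would score this toy example at 1 out of 6, whereas the matched comparison counts 5 out of 6 agreements.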

In [None]:
# DATA_DIR has no trailing slash, so the separator must be added explicitly
csv_file = DATA_DIR+"/algorithm_results/em_validation.csv"
df = pd.read_csv(csv_file)
acc = clustering_accuracy(df)
print(f"Clustering accuracy: {acc:.4f}%")