# Parallel Computing Performance Analysis

This notebook analyzes parallel computing performance metrics from experimental data, including execution time, speedup, and efficiency across different problem sizes and numbers of processes.

## Data Loading and Preprocessing


### Imports

In [None]:
import pandas as pd
import plotly.graph_objects as go
import plotly.express as px
import numpy as np
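# Helper functions used in this notebook (defined in the local utils module)
from utils import compute_metrics, make_pivot, plot_metrics, clustering_accuracy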
import warnings
warnings.filterwarnings("ignore", category=FutureWarning)

### Load the data from the file

In [None]:
# Load the process information data
df = pd.read_csv('../process_info.csv')

# Display basic information about the dataset
print("Dataset Overview:")
print(f"Shape: {df.shape}")
print("\nFirst few rows:")
print(df.head())

## Calculation Overview

This section computes parallel-performance metrics from the experimental data.

- Data grouping: mean execution time per `dataset_size` and `n_processes` → `grouped`.

- Speedup: for each `dataset_size`, take the execution time with `n_processes == 1` as T₁ and compute `speedup = T₁ / execution_time`.

- Efficiency: `efficiency = speedup / n_processes`.

- Results are stored in `metrics_df` and summarized as pivot tables: `execution_time_table`, `speedup_table`, `efficiency_table`.

- Filtering: a subset of datasets is selected via `selected_datasets` / `filtered_df` for plotting.

- Visualization: `fig_speedup` and `fig_efficiency` show the measured curves together with reference lines (the ideal linear speedup and the acceptable-efficiency threshold, respectively).
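
The speedup and efficiency steps above can be sketched as follows. This is a minimal illustration, not the actual `compute_metrics` from `utils`, whose implementation is not shown in this notebook. The function name `compute_metrics_sketch` and the timings in `example` are made up for the example; only the column names come from the notebook.

```python
import pandas as pd


def compute_metrics_sketch(group: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical stand-in for utils.compute_metrics (the real helper may differ)."""
    group = group.sort_values("n_processes").copy()

    # T1 is the (mean) execution time of the single-process run.
    baseline = group.loc[group["n_processes"] == 1, "total_execution_time"]
    if baseline.empty:
        raise ValueError("no single-process run available to use as T1")
    t1 = baseline.iloc[0]

    # speedup = T1 / Tp, efficiency = speedup / p
    group["speedup"] = t1 / group["total_execution_time"]
    group["efficiency"] = group["speedup"] / group["n_processes"]
    return group


# Made-up timings for a single dataset size, purely for illustration.
example = pd.DataFrame(
    {
        "dataset_size": [1000, 1000, 1000],
        "n_processes": [1, 2, 4],
        "total_execution_time": [8.0, 5.0, 2.5],
    }
)
example_metrics = compute_metrics_sketch(example)
print(example_metrics[["n_processes", "speedup", "efficiency"]])
```

With these illustrative timings, four processes give a speedup of 8.0 / 2.5 = 3.2 and an efficiency of 3.2 / 4 = 0.8.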

In [None]:
# Group by dataset_size and number of processes. Compute mean execution time for each process count
grouped = df.groupby(['dataset_size', 'n_processes'])['total_execution_time'].mean().reset_index()

# Apply metrics calculation to each problem size group
metrics_df = grouped.groupby('dataset_size').apply(compute_metrics).reset_index(drop=True)

## Execution Time Table

This table shows how the mean execution time changes with the number of processes for each dataset size.
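
For readers without access to `utils`, a pivot of this kind could be built roughly as below. This is a sketch under assumptions: `make_pivot_sketch` is a hypothetical name, the real `make_pivot` may aggregate or format differently, and the `toy` data is invented.

```python
import pandas as pd


def make_pivot_sketch(metrics: pd.DataFrame, value: str, process_order=None) -> pd.DataFrame:
    """Hypothetical stand-in for utils.make_pivot (the real helper may differ)."""
    # Rows: dataset sizes; columns: process counts; cells: mean of `value`.
    table = metrics.pivot_table(
        index="dataset_size",
        columns="n_processes",
        values=value,
        aggfunc="mean",
    )
    # Optionally force a specific column order.
    if process_order is not None:
        table = table.reindex(columns=process_order)
    return table


# Invented data, for illustration only.
toy = pd.DataFrame(
    {
        "dataset_size": [1000, 1000, 2000, 2000],
        "n_processes": [1, 2, 1, 2],
        "total_execution_time": [8.0, 5.0, 16.0, 9.0],
    }
)
toy_table = make_pivot_sketch(toy, value="total_execution_time", process_order=[1, 2])
print(toy_table)
```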

In [None]:
# Sort processes in logical order
process_order = sorted(metrics_df['n_processes'].unique())
execution_time_table = make_pivot(metrics_df, value='total_execution_time', process_order=process_order)

print("Execution Time Table (seconds):")
display(execution_time_table)

## Speedup Analysis Table

This section creates a **pivot table** of the average *speedup* as a function of the number of processes (`n_processes`) and the dataset size (`dataset_size`).

In [None]:
speedup_table = make_pivot(metrics_df, 'speedup', process_order=process_order)

print("Speedup Table (T₁/Tₚ):")
display(speedup_table)

## Efficiency Analysis

This section builds a **pivot table** to display the average *efficiency* for different numbers of processes (`n_processes`) and dataset sizes (`dataset_size`).  
An efficiency close to **1.0** indicates near-perfect scalability.

In [None]:
efficiency_table = make_pivot(metrics_df, 'efficiency', process_order=process_order)

print("Efficiency Table (Speedup/Processes):")
display(efficiency_table)

## Speedup Visualization

This section visualizes the *speedup* behavior for selected datasets using **Plotly**.  
You can manually specify which datasets to include by editing the `selected_datasets` list.  
The plot compares the measured speedup with the **ideal linear speedup** (represented by the dashed red line).  

Each dataset is shown as a separate colored curve, allowing for an easy comparison of scalability across different problem sizes.


In [None]:
# Manual dataset selection
selected_datasets = [100000, 1000000, 10000000]  # <- change this list to filter datasets
# Filter metrics_df for selected datasets
filtered_df = metrics_df[metrics_df['dataset_size'].isin(selected_datasets)]

# Create and show the speedup figure
fig_speedup = plot_metrics(filtered_df, metric='speedup')
fig_speedup.show()



## Efficiency Visualization

This section plots the *parallel efficiency* for the selected datasets using **Plotly**.  
Each curve represents how efficiently computational resources are used as the number of processes increases.  

The red dotted line marks an **acceptable efficiency threshold** at 0.75, helping to visually identify when performance begins to degrade.  
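
The same threshold can also be checked numerically. The sketch below reports, for each dataset size, the largest process count whose efficiency is still at or above 0.75. The helper `max_efficient_processes` and the `toy_eff` values are hypothetical and not part of `utils`.

```python
import pandas as pd

EFFICIENCY_THRESHOLD = 0.75  # same value as the threshold line in the plot


def max_efficient_processes(metrics: pd.DataFrame, threshold: float = EFFICIENCY_THRESHOLD) -> pd.Series:
    """Largest n_processes per dataset_size with efficiency >= threshold (hypothetical helper)."""
    acceptable = metrics[metrics["efficiency"] >= threshold]
    return acceptable.groupby("dataset_size")["n_processes"].max()


# Invented efficiencies, for illustration only.
toy_eff = pd.DataFrame(
    {
        "dataset_size": [1000, 1000, 1000, 2000, 2000, 2000],
        "n_processes": [1, 2, 4, 1, 2, 4],
        "efficiency": [1.0, 0.8, 0.6, 1.0, 0.9, 0.8],
    }
)
limits = max_efficient_processes(toy_eff)
print(limits)
```

In the notebook itself this could be applied to `filtered_df`, assuming it carries the `efficiency` column produced by `compute_metrics`.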


In [None]:
# Create Efficiency figure
fig_efficiency = plot_metrics(filtered_df, metric='efficiency')
# Show plot
fig_efficiency.show()
# Current filter info
print(f"Displayed datasets: {selected_datasets}")


## Accuracy Analysis

In [None]:
csv_file = "../em_validation.csv"
acc = clustering_accuracy(csv_file)
print(f"Clustering accuracy: {acc:.4f}%")