# Parallel Computing Performance Analysis

This notebook analyzes parallel computing performance metrics from experimental data, including execution time, speedup, and efficiency across different problem sizes and numbers of processes.

## Data Loading and Preprocessing


### Imports

In [186]:
import pandas as pd
import plotly.graph_objects as go
import plotly.express as px
import numpy as np

import warnings
warnings.filterwarnings("ignore", category=FutureWarning)

### Calculation Overview

This section computes parallel-performance metrics from the experimental data.

- Data grouping: mean execution time per `dataset_size` and `n_processes` → `grouped`.

- Speedup: for each `dataset_size`, take the execution time with `n_processes == 1` as T₁ and compute `speedup = T₁ / execution_time`.

- Efficiency: `efficiency = speedup / n_processes`.

- Results are stored in `metrics_df` and summarized as pivot tables: `execution_time_table`, `speedup_table`, `efficiency_table`.

- Filtering: a subset of datasets is selected via `selected_datasets` / `filtered_df` for plotting.

- Visualization: `fig_speedup` and `fig_efficiency` show measured curves together with reference lines (ideal linear speedup and the acceptable-efficiency threshold).

## Data Loading and Preprocessing


In [187]:
# Load the process information data
df = pd.read_csv('process_info.csv')

# Display basic information about the dataset
print("Dataset Overview:")
print(f"Shape: {df.shape}")
print("\nFirst few rows:")
print(df.head())

Dataset Overview:
Shape: (7, 3)

First few rows:
   n_processes  dataset_size  execution_time
0           64     102084863        0.562088
1           32     102084863        0.491992
2            8     102084863        1.734151
3            4     102084863        2.573479
4           16     102084863        1.475546


In [188]:
# Average repeated runs: mean execution time for each (dataset_size, n_processes) pair
grouped = df.groupby(['dataset_size', 'n_processes'])['execution_time'].mean().reset_index()

In [189]:
def calculate_metrics(group):
    """Calculate speedup and efficiency for a problem size group"""
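    # Get the sequential execution time (1 process); fail clearly if it is missing
    baseline = group.loc[group['n_processes'] == 1, 'execution_time']
    if baseline.empty:
        raise ValueError("No single-process run found; speedup needs T₁ as a baseline")
    t1 = baseline.iloc[0]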
    
    # Calculate speedup and efficiency
    group = group.copy()
    group['speedup'] = t1 / group['execution_time']
    group['efficiency'] = group['speedup'] / group['n_processes']
    
    return group

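# Apply metrics calculation to each problem size group.
# Iterating over the groups (instead of groupby.apply) keeps the dataset_size
# column regardless of how the installed pandas version treats grouping columns.
metrics_df = pd.concat(
    [calculate_metrics(group) for _, group in grouped.groupby('dataset_size')],
    ignore_index=True
)

# Select one problem size for spot-checking the calculations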
sample_size = df['dataset_size'].iloc[0]
sample_data = metrics_df[metrics_df['dataset_size'] == sample_size]
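
As a cross-check on the formulas, the same metrics can be computed column-wise by looking up each dataset's single-process time. The sketch below is a minimal illustration on made-up timings; `toy`, `baseline`, and all values in it are assumptions for illustration, not measurements from `process_info.csv`.

```python
import pandas as pd

# Hypothetical timings (seconds) for one dataset size; not real measurements.
toy = pd.DataFrame({
    "dataset_size": [1000, 1000, 1000],
    "n_processes": [1, 2, 4],
    "execution_time": [8.0, 4.0, 2.5],
})

# T1 per dataset size: the execution time measured with a single process.
baseline = toy.loc[toy["n_processes"] == 1].set_index("dataset_size")["execution_time"]

# speedup = T1 / Tp, efficiency = speedup / P
toy["speedup"] = toy["dataset_size"].map(baseline) / toy["execution_time"]
toy["efficiency"] = toy["speedup"] / toy["n_processes"]

print(toy)
```

With these toy numbers, 4 processes give a speedup of 8.0 / 2.5 = 3.2 and an efficiency of 3.2 / 4 = 0.8.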

## Execution Time Table

This table shows how execution time changes with the number of processes for each dataset size.

In [190]:
# Create pivot table for execution times
execution_time_table = metrics_df.pivot_table(
    index='n_processes', 
    columns='dataset_size', 
    values='execution_time',
    aggfunc='mean'
).round(2)

# Sort processes in logical order
process_order = sorted(metrics_df['n_processes'].unique())
execution_time_table = execution_time_table.reindex(process_order)

print("Execution Time Table (seconds):")
display(execution_time_table)

Execution Time Table (seconds):


dataset_size  102084863
n_processes
1                  5.05
2                  3.14
4                  2.57
8                  1.73
16                 1.48
32                 0.49
64                 0.56


## Speedup Analysis Table

This section creates a **pivot table** of the average *speedup* as a function of the number of processes (`n_processes`) and the dataset size (`dataset_size`).

In [191]:
# Create pivot table for speedup
speedup_table = metrics_df.pivot_table(
    index='n_processes', 
    columns='dataset_size', 
    values='speedup',
    aggfunc='mean'
).round(1)

# Sort processes in logical order (the 1-process row is 1.0 by definition, since it is T₁ / T₁)
speedup_table = speedup_table.reindex(process_order)

print("Speedup Table (T₁/Tₚ):")
display(speedup_table)

Speedup Table (T₁/Tₚ):


dataset_size  102084863
n_processes
1                   1.0
2                   1.6
4                   2.0
8                   2.9
16                  3.4
32                 10.3
64                  9.0
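
The table can also be queried directly instead of being read by eye. The sketch below copies the speedup values shown above into a standalone Series (the names `speedup`, `best_p`, and `gap` are illustrative, not part of the notebook) and reports where speedup peaks and how far each point falls short of ideal linear speedup.

```python
import pandas as pd

# Speedup values copied from the table above (dataset size 102084863).
speedup = pd.Series(
    [1.0, 1.6, 2.0, 2.9, 3.4, 10.3, 9.0],
    index=pd.Index([1, 2, 4, 8, 16, 32, 64], name="n_processes"),
    name="speedup",
)

# Process count with the highest measured speedup.
best_p = speedup.idxmax()

# Shortfall relative to ideal linear speedup, where ideal speedup equals P.
gap = speedup.index.to_series() - speedup

print(f"Best speedup at P = {best_p}")
print(gap)
```

For this dataset the peak is at 32 processes; going to 64 processes lowers the speedup from 10.3 to 9.0.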


## Efficiency Analysis

This section builds a **pivot table** to display the average *efficiency* for different numbers of processes (`n_processes`) and dataset sizes (`dataset_size`).  
An efficiency close to **1.0** indicates near-perfect scalability.

In [192]:
# Create pivot table for efficiency
efficiency_table = metrics_df.pivot_table(
    index='n_processes', 
    columns='dataset_size', 
    values='efficiency',
    aggfunc='mean'
).round(3)

# Sort processes in logical order
efficiency_table = efficiency_table.reindex(process_order)

print("Efficiency Table (Speedup/Processes):")
display(efficiency_table)

Efficiency Table (Speedup/Processes):


dataset_size  102084863
n_processes
1                 1.000
2                 0.804
4                 0.491
8                 0.364
16                0.214
32                0.321
64                0.140


## Speedup Visualization

This section visualizes the *speedup* behavior for selected datasets using **Plotly**.  
You can manually specify which datasets to include by editing the `selected_datasets` list.  
The plot compares the measured speedup with the **ideal linear speedup** (represented by the dashed red line).  

Each dataset is shown as a separate colored curve, allowing for an easy comparison of scalability across different problem sizes.
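
Because `isin` silently drops values that have no matching rows, a size typed into the selection list by mistake produces no error and no curve. The sketch below shows one way to detect this; `metrics_toy`, `selected`, and `missing` are hypothetical stand-ins rather than the notebook's own variables.

```python
import pandas as pd

# Hypothetical stand-in for metrics_df containing a single dataset size.
metrics_toy = pd.DataFrame({
    "dataset_size": [102084863, 102084863],
    "n_processes": [1, 2],
})
selected = [10, 102084863]

# Sizes that were requested but have no rows in the data.
available = set(metrics_toy["dataset_size"].unique().tolist())
missing = [size for size in selected if size not in available]

# isin keeps only the rows whose size is in the selection.
filtered = metrics_toy[metrics_toy["dataset_size"].isin(selected)]

print(f"Sizes without data: {missing}")
print(f"Rows kept: {len(filtered)}")
```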


In [193]:
# Manual dataset selection (sizes not present in metrics_df, such as 10 here, are silently ignored by isin)
selected_datasets = [10, 102084863]  # <- change this list to filter datasets

# Filter data
filtered_df = metrics_df[metrics_df['dataset_size'].isin(selected_datasets)]

# Color palette
colors = px.colors.qualitative.Set1

# Format dataset size compactly as a multiple of 10⁶ or 10³ (not normalized scientific notation)
def format_scientific(size):
    if size >= 1e6:
        return f"{size/1e6:.1f}×10⁶"
    elif size >= 1e3:
        return f"{size/1e3:.0f}×10³"
    else:
        return f"{size:.0f}"

# Create Speedup figure
fig_speedup = go.Figure()

# Ideal speedup line (y = x)
n_proc = np.sort(filtered_df['n_processes'].unique())
fig_speedup.add_trace(go.Scatter(
    x=n_proc, y=n_proc,
    mode='lines',
    name='Ideal Speedup',
    line=dict(dash='dash', color='red', width=2)
))

# Plot data for each dataset
for i, size in enumerate(filtered_df['dataset_size'].unique()):
    subset = filtered_df[filtered_df['dataset_size'] == size]
    label = format_scientific(size)
    fig_speedup.add_trace(go.Scatter(
        x=subset['n_processes'],
        y=subset['speedup'],
        mode='lines+markers',
        name=f"Dataset {label}",
        line=dict(width=3, color=colors[i % len(colors)]),
        marker=dict(size=8)
    ))

# Layout settings
fig_speedup.update_layout(
    title='Speedup Analysis',
    xaxis_title='Number of Processes (P)',
    yaxis_title='Speedup (T₁ / Tₚ)',
    template='plotly_white',
    font=dict(size=12),
    height=500,
    showlegend=True
)

# Show plot
fig_speedup.show()



## Efficiency Visualization

This section plots the *parallel efficiency* for the selected datasets using **Plotly**.  
Each curve represents how efficiently computational resources are used as the number of processes increases.  

The red dotted line marks an **acceptable efficiency threshold** at 0.75, helping to visually identify when performance begins to degrade.  
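
The threshold can also be applied numerically. The sketch below copies the values from the efficiency table into a standalone Series and lists the process counts that meet the 0.75 threshold; `efficiency`, `THRESHOLD`, and `acceptable` are illustrative names, not part of the notebook.

```python
import pandas as pd

# Efficiency values copied from the efficiency table (dataset size 102084863).
efficiency = pd.Series(
    [1.0, 0.804, 0.491, 0.364, 0.214, 0.321, 0.14],
    index=pd.Index([1, 2, 4, 8, 16, 32, 64], name="n_processes"),
)
THRESHOLD = 0.75  # same value as the dotted reference line in the plot

# Process counts whose efficiency is at or above the threshold.
acceptable = efficiency[efficiency >= THRESHOLD].index.tolist()

print(f"Process counts with efficiency >= {THRESHOLD}: {acceptable}")
```

For this dataset only 1 and 2 processes meet the threshold; from 4 processes onward efficiency is below 0.5.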


In [194]:
# Create Efficiency figure
fig_efficiency = go.Figure()

# Acceptable efficiency line (constant = 0.75)
n_proc = np.sort(filtered_df['n_processes'].unique())
fig_efficiency.add_trace(go.Scatter(
    x=n_proc,
    y=[0.75] * len(n_proc),
    mode='lines',
    name='Acceptable Efficiency (0.75)',
    line=dict(dash='dot', color='red', width=2)
))

# Plot data for each dataset
for i, size in enumerate(filtered_df['dataset_size'].unique()):
    subset = filtered_df[filtered_df['dataset_size'] == size]
    label = format_scientific(size)  # reuse the helper defined for the speedup plot
    fig_efficiency.add_trace(go.Scatter(
        x=subset['n_processes'],
        y=subset['efficiency'],
        mode='lines+markers',
        name=f"Dataset {label}",
        line=dict(width=3, color=colors[i % len(colors)]),
        marker=dict(size=8)
    ))

# Layout settings
fig_efficiency.update_layout(
    title='Parallel Efficiency Analysis',
    xaxis_title='Number of Processes (P)',
    yaxis_title='Efficiency (Speedup / P)',
    template='plotly_white',
    font=dict(size=12),
    height=500,
    showlegend=True,
    yaxis=dict(range=[0, 1.05])
)

# Show plot
fig_efficiency.show()

# Current filter info
print(f"Displayed datasets: {selected_datasets}")


Displayed datasets: [10, 102084863]
