# Comparing PyProBE Performance

This example demonstrates the performance benefits of PyProBE over Pandas, a popular dataframe library.

In [1]:
import pyprobe
import pandas as pd
import timeit
import numpy as np
import matplotlib.pyplot as plt


Setting up data analysis in PyProBE requires converting the raw cycler data into the PyProBE format. This is normally the most time-intensive step, but it only needs to be performed once.

In [2]:
info_dictionary = {'Name': 'Sample cell',
                   'Chemistry': 'NMC622',
                   'Nominal Capacity [Ah]': 0.04,
                   'Cycler number': 1,
                   'Channel number': 1,}

cell = pyprobe.Cell(info=info_dictionary)
data_directory = '../../../tests/sample_data/neware'
# cell.process_cycler_file(cycler='neware',
#                         folder_path=data_directory,
#                         input_filename='sample_data_neware.xlsx',
#                         output_filename='sample_data_neware.parquet')


We will measure the time PyProBE and Pandas take to read a parquet file and apply a sequence of filters. PyProBE provides built-in filtering methods, whereas with Pandas each filter must be written by hand.

In [3]:
def measure_pyprobe(repeats, file):
    """Time each PyProBE stage; times are cumulative from the start of each repeat."""
    steps = 5
    cumulative_time = np.zeros((steps, repeats))
    for repeat in range(repeats):
        start_time = timeit.default_timer()
        # Stage 0: read the parquet file into a procedure
        cell.add_procedure(procedure_name='Sample',
                           folder_path=data_directory,
                           filename=file)
        cumulative_time[0, repeat] = timeit.default_timer() - start_time

        # Stage 1: select the experiment
        experiment = cell.procedure['Sample'].experiment('Break-in Cycles')
        cumulative_time[1, repeat] = timeit.default_timer() - start_time

        # Stage 2: select the cycle at index 1
        cycle = experiment.cycle(1)
        cumulative_time[2, repeat] = timeit.default_timer() - start_time

        # Stage 3: select the first discharge step
        step = cycle.discharge(0)
        cumulative_time[3, repeat] = timeit.default_timer() - start_time

        # Stage 4: request the voltage data
        voltage = step.get("Voltage [V]")
        cumulative_time[4, repeat] = timeit.default_timer() - start_time

    return cumulative_time, voltage


def measure_pandas(repeats, file):
    """Time the equivalent Pandas stages; times are cumulative from the start of each repeat."""
    steps = 5
    cumulative_time = np.zeros((steps, repeats))
    for repeat in range(repeats):
        start_time = timeit.default_timer()
        # Stage 0: read the parquet file into a DataFrame
        df = pd.read_parquet(data_directory + '/' + file)
        cumulative_time[0, repeat] = timeit.default_timer() - start_time

        # Stage 1: select the experiment by its step numbers
        experiment = df[df['Step'].isin([4, 5, 6, 7])]
        cumulative_time[1, repeat] = timeit.default_timer() - start_time

        # Stage 2: select the cycle at index 1
        unique_cycles = experiment['Cycle'].unique()
        cycle = experiment[experiment['Cycle'] == unique_cycles[1]]
        cumulative_time[2, repeat] = timeit.default_timer() - start_time

        # Stage 3: select the first discharge event (negative current)
        step = cycle[cycle['Current [A]'] < 0]
        unique_events = step['Event'].unique()
        step = step[step['Event'] == unique_events[0]]
        cumulative_time[3, repeat] = timeit.default_timer() - start_time

        # Stage 4: extract the voltage data
        voltage = step['Voltage [V]'].values
        cumulative_time[4, repeat] = timeit.default_timer() - start_time

    return cumulative_time, voltage

def make_boxplots(total_time_polars, total_time_pandas):
    data_polars = [total_time_polars[i, :] for i in range(total_time_polars.shape[0])]
    data_pandas = [total_time_pandas[i, :] for i in range(total_time_pandas.shape[0])]

    # Create labels for the boxplots
    labels = ["Read file", "Select experiment", "Select cycle", "Select step", "Select voltage"]
    # Create the subplots
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 6), sharey=True)

    # Boxplot for PyProBE (Polars backend)
    ax1.boxplot(data_polars, labels=labels, vert=True, patch_artist=True)
    ax1.set_title('PyProBE Execution Time')
    ax1.set_ylabel('Time (seconds)')

    # Boxplot for Pandas
    ax2.boxplot(data_pandas, labels=labels, vert=True, patch_artist=True)
    ax2.set_title('Pandas Execution Time')
    ax2.yaxis.set_visible(False)  # Remove y-axis on the right-hand subplot

    # Adjust layout
    plt.tight_layout()
    plt.show()
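
Both measurement functions above record times that are cumulative from the start of each repeat, so every boxplot includes the cost of all earlier stages. If the cost of a single stage is of interest, the cumulative array can be differenced along the stage axis. The following is a minimal sketch: the helper `per_stage_durations` is hypothetical (it is not part of PyProBE), and the numbers are made up for illustration rather than measured.

```python
import numpy as np


def per_stage_durations(cumulative_time):
    """Convert cumulative timings of shape (stages, repeats) into per-stage durations."""
    # Prepending 0 keeps the first stage's duration equal to its cumulative time.
    return np.diff(cumulative_time, axis=0, prepend=0.0)


# Illustrative cumulative timings in seconds: 3 stages, 2 repeats (not real measurements).
cumulative = np.array([[0.10, 0.12],
                       [0.15, 0.16],
                       [0.18, 0.20]])
durations = per_stage_durations(cumulative)
print(durations)
```

Summing the result along the stage axis recovers the original cumulative values, which is a quick way to check the conversion.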


Running the tests shows that the initial overhead for PyProBE to read and filter the data is close to zero. This is because PyProBE evaluates lazily: all computation is deferred until the final request for data is made. Overall, it is faster than Pandas because the Polars backend can optimize the whole filtering query at once, instead of executing each filter one by one.

In [4]:
repeats = 10
total_time_polars, voltage_pyprobe = measure_pyprobe(repeats, 'sample_data_neware.parquet')
total_time_pandas, voltage_pandas = measure_pandas(repeats, 'sample_data_neware.parquet')
make_boxplots(total_time_polars, total_time_pandas)
assert np.allclose(voltage_pyprobe, voltage_pandas)


We will now extend the input data to demonstrate how much better the Polars approach scales.

In [5]:
df = pd.read_parquet(data_directory + '/sample_data_neware.parquet')
extended_df = pd.concat([df] * 25, ignore_index=True)
extended_df.to_parquet(data_directory + '/sample_data_neware_extended.parquet')
print(len(df))
print(len(extended_df))


With 25x more data points, the PyProBE implementation is almost 3x faster than manual filtering in Pandas.

In [6]:
total_time_polars, _ = measure_pyprobe(repeats, 'sample_data_neware_extended.parquet')
total_time_pandas, _ = measure_pandas(repeats, 'sample_data_neware_extended.parquet')
make_boxplots(total_time_polars, total_time_pandas)

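
A speedup figure like the one quoted above can be derived from the two timing arrays by comparing the final row, which holds the total time per repeat. The sketch below is one possible way to do this. The helper `median_speedup` is hypothetical, and the input numbers are made up for illustration; they are not measurements from this benchmark.

```python
import numpy as np


def median_speedup(time_baseline, time_candidate):
    """Ratio of median total times; the last row holds the cumulative total per repeat."""
    return float(np.median(time_baseline[-1]) / np.median(time_candidate[-1]))


# Illustrative arrays of shape (stages, repeats), in seconds (not real measurements).
baseline_times = np.array([[0.20, 0.22, 0.21],
                           [0.50, 0.52, 0.48]])
candidate_times = np.array([[0.00, 0.00, 0.00],
                            [0.25, 0.26, 0.24]])
speedup = median_speedup(baseline_times, candidate_times)
print(f"{speedup:.1f}x")
```

The median is used here rather than the mean so that a single slow repeat does not distort the ratio.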

The Polars lazy approach is best demonstrated by printing the optimized query plan:

In [7]:
lazyframe = cell.procedure['Sample'].experiment('Break-in Cycles').cycle(1).discharge(0).base_dataframe
print(lazyframe.explain(tree_format=True))
