# Benchmarks for Exdir #

This notebook contains a number of benchmarks for Exdir.
They compare the performance of Exdir with h5py.

The following functions are used to set up and tear down an Exdir or HDF5 file for benchmarking.

**Warning**: Please make sure the files are not created in a folder managed by Syncthing, Dropbox or any other file synchronization system. 
We will be making a large number of changes to the files and a file synchronization system will reduce performance and possibly become out of sync in the process.

In [1]:
import exdir
import os
import shutil
import h5py

def setup_exdir():
    testpath = "test.exdir"
    if os.path.exists(testpath):
        shutil.rmtree(testpath)
    f = exdir.File(testpath)
    return f, testpath

def setup_exdir_no_validation():
    testpath = "test.exdir"
    if os.path.exists(testpath):
        shutil.rmtree(testpath)
    f = exdir.File(testpath, name_validation=exdir.validation.minimal)
    return f, testpath

def teardown_exdir(f, testpath):
    f.close()
    shutil.rmtree(testpath)

def setup_h5py():
    testpath = "test.h5"
    if os.path.exists(testpath):
        os.remove(testpath)
    # open explicitly for writing; recent h5py versions default to read-only mode
    f = h5py.File(testpath, "w")
    return f, testpath

    
def teardown_h5py(f, testpath):
    f.close()  # release the file handle before deleting the file
    os.remove(testpath)

The following function is used to run the different benchmarks.
It takes a target function to test, optional setup and teardown functions that create and remove the file, and the number of iterations to average the run time over:

In [2]:
import time

def benchmark(target, setup=None, teardown=None, iterations=10):
    total_time = 0
    setup_teardown_start = time.time()
    for i in range(iterations):
        data = tuple()
        if setup is not None:
            data = setup()
        time.sleep(1) # allow changes to be flushed to disk
        start_time = time.time()
        target(*data)
        end_time = time.time()
        total_time += end_time - start_time
        if teardown is not None:
            teardown(*data)
    setup_teardown_end = time.time()
    total_setup_teardown = setup_teardown_end - setup_teardown_start
    
    mean = total_time / iterations
    
    return mean
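
As a rough illustration of how the runner is driven, the sketch below repeats the same setup/target/teardown pattern on a plain Python list instead of a file. The names `timed_mean`, `make_list`, `fill_list` and `clear_list` are hypothetical and not part of this notebook. `time.perf_counter` is used here in place of `time.time` because it is intended for measuring short durations, and the one-second sleep is left out to keep the example fast.

```python
import time

def timed_mean(target, setup=None, teardown=None, iterations=10):
    """Hypothetical stand-in for benchmark(): mean run time of target in seconds."""
    total_time = 0.0
    for _ in range(iterations):
        # setup() returns a tuple that is unpacked into target and teardown
        data = setup() if setup is not None else tuple()
        start_time = time.perf_counter()
        target(*data)
        total_time += time.perf_counter() - start_time
        if teardown is not None:
            teardown(*data)
    return total_time / iterations

def make_list():
    # must return a tuple, just like setup_exdir returns (f, testpath)
    return ([],)

def fill_list(items):
    for i in range(1000):
        items.append(i)

def clear_list(items):
    items.clear()

mean_seconds = timed_mean(fill_list, setup=make_list, teardown=clear_list, iterations=5)
print("mean time per run: {:.6f} s".format(mean_seconds))
```

Only the time spent inside the target is accumulated, so the cost of setup and teardown does not affect the reported mean.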

The following functions are used as wrappers to make it easy to run a benchmark of Exdir or h5py:

In [3]:
import pandas as pd
import numpy as np

all_results = []

def benchmark_both(function, iterations=10, name_validation=True):
    if name_validation:
        setup_exdir_ = setup_exdir
        name = function.__name__
    else:
        setup_exdir_ = setup_exdir_no_validation
        name = function.__name__ + " (minimal name validation)"
    
    exdir_mean = benchmark(
        target=lambda f, path: function(f),
        setup=setup_exdir_,
        teardown=teardown_exdir,
        iterations=iterations
    )
    hdf5_mean = benchmark(
        target=lambda f, path: function(f),
        setup=setup_h5py,
        teardown=teardown_h5py,
        iterations=iterations
    )
    
    result = pd.DataFrame(
        [(name, hdf5_mean, exdir_mean, hdf5_mean/exdir_mean)],
        columns=["Test", "h5py", "Exdir", "Ratio"]
    )
    all_results.append(result)
    return result

def benchmark_exdir(function, iterations=10):
    exdir_mean = benchmark(
        target=lambda f, path: function(f),
        setup=setup_exdir,
        teardown=teardown_exdir,
        iterations=iterations
    )
    result = pd.DataFrame(
        [(function.__name__, np.nan, exdir_mean, np.nan)],
        columns=["Test", "h5py", "Exdir", "Ratio"]
    )
    all_results.append(result)
    return result

We are now ready to start running the different benchmarks.

The following benchmark creates a small number of attributes.
This should be very fast with both h5py and Exdir:

In [4]:
def add_few_attributes(obj):
    for i in range(5):
        obj.attrs["hello" + str(i)] = "world"

benchmark_both(add_few_attributes)

Unnamed: 0,Test,h5py,Exdir,Ratio
0,add_few_attributes,0.002151,0.009924,0.216786


The following benchmark adds a larger number of attributes one-by-one.
Because Exdir needs to read back and rewrite the entire file on each write, in case someone changed it in the meantime, this is significantly slower with Exdir than with h5py:

In [5]:
def add_many_attributes(obj):
    for i in range(200):
        obj.attrs["hello" + str(i)] = "world"

benchmark_both(add_many_attributes, 10)

Unnamed: 0,Test,h5py,Exdir,Ratio
0,add_many_attributes,0.066113,3.653728,0.018095

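The cost pattern described above can be imitated without Exdir. The sketch below is only a simulation, under the assumption that every single-attribute write reads the whole file back and rewrites it. It uses a JSON file and the hypothetical helpers `set_one_by_one` and `set_all_at_once`, and it does not reproduce Exdir's actual implementation or file format.

```python
import json
import os
import tempfile

def set_one_by_one(path, attributes):
    """Read, update and rewrite the whole file for every attribute."""
    rewrites = 0
    for key, value in attributes.items():
        current = {}
        if os.path.exists(path):
            with open(path) as infile:
                current = json.load(infile)  # read back in case it changed
        current[key] = value
        with open(path, "w") as outfile:
            json.dump(current, outfile)  # rewrite everything
        rewrites += 1
    return rewrites

def set_all_at_once(path, attributes):
    """Write all attributes in a single operation."""
    with open(path, "w") as outfile:
        json.dump(attributes, outfile)
    return 1

attributes = {"hello" + str(i): "world" for i in range(200)}

with tempfile.TemporaryDirectory() as folder:
    slow_path = os.path.join(folder, "one_by_one.json")
    fast_path = os.path.join(folder, "all_at_once.json")
    slow_rewrites = set_one_by_one(slow_path, attributes)
    fast_rewrites = set_all_at_once(fast_path, attributes)
    with open(slow_path) as infile:
        slow_result = json.load(infile)
    with open(fast_path) as infile:
        fast_result = json.load(infile)

print(slow_rewrites, fast_rewrites, slow_result == fast_result)
```

Both approaches end up with the same content on disk, but the first rewrites a steadily growing file 200 times while the second writes it once.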

However, Exdir is capable of writing all attributes in one operation.
This makes writing the same attributes about as fast as, or even faster than, writing them one-by-one with h5py.
Writing a large number of attributes in a single operation is not possible with h5py.
We therefore need to run this only with Exdir:

In [6]:
def add_many_attributes_single_operation(obj):
    attributes = {}
    for i in range(200):
        attributes["hello" + str(i)] = "world"
    obj.attrs = attributes
    
benchmark_exdir(add_many_attributes_single_operation)

Unnamed: 0,Test,h5py,Exdir,Ratio
0,add_many_attributes_single_operation,,0.030075,


Exdir also supports adding nested attributes, such as Python dictionaries, which is not supported at all in HDF5.

In [7]:
def add_attribute_tree(obj):
    tree = {}
    for i in range(100):
        tree["hello" + str(i)] = "world"
    tree["intermediate"] = {}
    intermediate = tree["intermediate"]
    for level in range(10):
        level_str = "level" + str(level)
        intermediate[level_str] = {}
        intermediate = intermediate[level_str]
    intermediate["value"] = 42  # store a leaf value in the innermost dictionary
    obj.attrs["test"] = tree
    
benchmark_exdir(add_attribute_tree)

Unnamed: 0,Test,h5py,Exdir,Ratio
0,add_attribute_tree,,0.02044,

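To make the shape of such a nested attribute concrete, the sketch below builds a plain Python dictionary with ten nested levels and a value at the deepest level, and then walks back down to read it. It only uses built-in dictionaries. The helper name `build_nested` and the key `"value"` are hypothetical, and assigning the result to `obj.attrs` is left out so that the example runs without Exdir.

```python
def build_nested(levels, leaf):
    """Return a dict nested `levels` deep with `leaf` stored at the bottom."""
    tree = {}
    current = tree
    for level in range(levels):
        key = "level" + str(level)
        current[key] = {}
        current = current[key]
    # Assign into the innermost dict; rebinding `current` itself
    # would leave the tree unchanged.
    current["value"] = leaf
    return tree

nested = build_nested(10, 42)

# Walk down the ten levels to reach the innermost dictionary
node = nested
for level in range(10):
    node = node["level" + str(level)]

print(node["value"])
```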

The following tests benchmark adding a small, medium and large dataset:

In [8]:
def add_small_dataset(obj):
    data = np.zeros((100, 100, 100))
    obj.create_dataset("foo", data=data)
    obj.close()
    
benchmark_both(add_small_dataset)

Unnamed: 0,Test,h5py,Exdir,Ratio
0,add_small_dataset,0.008624,0.012962,0.665307


In [9]:
def add_medium_dataset(obj):
    data = np.zeros((1000, 100, 100))
    obj.create_dataset("foo", data=data)
    obj.close()
    
benchmark_both(add_medium_dataset, 10)

Unnamed: 0,Test,h5py,Exdir,Ratio
0,add_medium_dataset,0.05979,0.085022,0.703224


In [10]:
def add_large_dataset(obj):
    data = np.zeros((1000, 1000, 100))
    obj.create_dataset("foo", data=data)
    obj.close()
    
benchmark_both(add_large_dataset, 3)

Unnamed: 0,Test,h5py,Exdir,Ratio
0,add_large_dataset,0.376301,0.535744,0.70239


There is some overhead in creating the objects themselves.
This is rather small in h5py, but can be high in Exdir with name validation enabled, because the name of every created object must be checked against all the existing objects in the same group:

In [11]:
def create_many_objects(obj):
    for i in range(5000):
        group = obj.create_group("group{}".format(i))

benchmark_both(create_many_objects, 3)

Unnamed: 0,Test,h5py,Exdir,Ratio
0,create_many_objects,0.260548,8.084029,0.03223


With minimal name validation, this is considerably faster in Exdir, although in the run recorded below it still takes about four times as long as in h5py.
Minimal name validation only checks whether a file with the exact same name exists in the folder:

In [12]:
benchmark_both(create_many_objects, 3, name_validation=False)

Unnamed: 0,Test,h5py,Exdir,Ratio
0,create_many_objects (minimal name validation),0.25957,1.081869,0.239927

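The effect of checking each new name against every existing name can be estimated with a small counting sketch. It assumes, for illustration only, that full validation compares the new name case-insensitively with all names already in the group, while minimal validation performs a single exact-name lookup. The functions `thorough_check` and `minimal_check` are hypothetical and are not Exdir's validators, and the sketch uses 1000 names rather than 5000 to keep it quick.

```python
def thorough_check(name, existing):
    """Compare against every existing name; return the number of comparisons."""
    comparisons = 0
    for other in existing:
        comparisons += 1
        if other.lower() == name.lower():
            raise ValueError("name collides with " + other)
    return comparisons

def minimal_check(name, existing_set):
    """A single exact-name lookup in a set."""
    if name in existing_set:
        raise ValueError("name already exists: " + name)
    return 1

count = 1000
names = ["group{}".format(i) for i in range(count)]

thorough_total = 0
existing = []
for name in names:
    thorough_total += thorough_check(name, existing)
    existing.append(name)

minimal_total = 0
existing_set = set()
for name in names:
    minimal_total += minimal_check(name, existing_set)
    existing_set.add(name)

print(thorough_total, minimal_total)
```

The first total grows quadratically with the number of objects in the group, the second only linearly.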

It is not only the number of created objects that matters.
Creating them in a tree structure can also incur a performance penalty.
The following test creates an object tree:

In [13]:
def create_large_tree(obj, level=0):
    if level > 4:
        return
    for i in range(3):
        group = obj.create_group("group_{}_{}".format(i, level))
        data = np.zeros((10, 10, 10))
        group.create_dataset("dataset_{}_{}".format(i, level), data=data)
        create_large_tree(group, level + 1)
        
benchmark_both(create_large_tree)

Unnamed: 0,Test,h5py,Exdir,Ratio
0,create_large_tree,0.135256,0.343413,0.393858

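To get a sense of how large this tree is, the sketch below mirrors the recursion of `create_large_tree` but only counts the objects instead of creating them. The function `count_tree_objects` is a hypothetical helper and not part of this notebook.

```python
def count_tree_objects(level=0, max_level=4, branches=3):
    """Count the groups and datasets create_large_tree would create."""
    if level > max_level:
        return 0, 0
    groups = 0
    datasets = 0
    for _ in range(branches):
        groups += 1    # one create_group call
        datasets += 1  # one create_dataset call inside that group
        sub_groups, sub_datasets = count_tree_objects(level + 1, max_level, branches)
        groups += sub_groups
        datasets += sub_datasets
    return groups, datasets

groups, datasets = count_tree_objects()
print(groups, datasets)
```

With three branches on each of five levels this gives 3 + 9 + 27 + 81 + 243 = 363 groups and the same number of datasets.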

In [14]:
def write_slice(dataset):
    dataset[320:420, 0:300, 0:100] = np.ones((100, 300, 100))

def create_setup_dataset(setup_function):
    def setup():
        f, path = setup_function()
        data = np.zeros((1000, 500, 100))
        dataset = f.create_dataset("foo", data=data)
        time.sleep(1) # allow changes to get flushed to disk
        return dataset, f, path
    return setup

exdir_mean = benchmark(
    target=lambda dataset, f, path: write_slice(dataset),
    setup=create_setup_dataset(setup_exdir),
    teardown=lambda dataset, f, path: teardown_exdir(f, path),
    iterations=3
)

hdf5_mean = benchmark(
    target=lambda dataset, f, path: write_slice(dataset),
    setup=create_setup_dataset(setup_h5py),
    teardown=lambda dataset, f, path: teardown_h5py(f, path),
    iterations=3
)
result = pd.DataFrame(
    [("write_slice", hdf5_mean, exdir_mean, hdf5_mean/exdir_mean)],
    columns=["Test", "h5py", "Exdir", "Ratio"]
)
all_results.append(result)

result

Unnamed: 0,Test,h5py,Exdir,Ratio
0,write_slice,0.015987,0.014682,1.088835


## Benchmark summary ##

The results are summarized in the following table:

In [15]:
pd.concat(all_results)

| Test | h5py (s) | Exdir (s) | Ratio (h5py/Exdir) |
|---|---|---|---|
| add_few_attributes | 0.002151 | 0.009924 | 0.216786 |
| add_many_attributes | 0.066113 | 3.653728 | 0.018095 |
| add_many_attributes_single_operation | | 0.030075 | |
| add_attribute_tree | | 0.02044 | |
| add_small_dataset | 0.008624 | 0.012962 | 0.665307 |
| add_medium_dataset | 0.05979 | 0.085022 | 0.703224 |
| add_large_dataset | 0.376301 | 0.535744 | 0.70239 |
| create_many_objects | 0.260548 | 8.084029 | 0.03223 |
| create_many_objects (minimal name validation) | 0.25957 | 1.081869 | 0.239927 |
| create_large_tree | 0.135256 | 0.343413 | 0.393858 |

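The `Ratio` column is the h5py time divided by the Exdir time, so values below 1 mean that Exdir was slower. As a reading aid, the sketch below inverts that ratio for a few rows, using timings copied from the table above. The dictionary `timings` and the name `slowdown` are only used for this illustration.

```python
# (h5py seconds, Exdir seconds), copied from the summary table
timings = {
    "add_few_attributes": (0.002151, 0.009924),
    "add_many_attributes": (0.066113, 3.653728),
    "create_many_objects": (0.260548, 8.084029),
    "create_many_objects (minimal name validation)": (0.25957, 1.081869),
}

slowdown = {}
for test, (h5py_time, exdir_time) in timings.items():
    # how many times longer Exdir took than h5py
    slowdown[test] = exdir_time / h5py_time

for test, factor in slowdown.items():
    print("{}: Exdir took {:.1f} times as long as h5py".format(test, factor))
```

The largest gaps in these runs are `add_many_attributes` (roughly 55 times as long) and `create_many_objects` with full name validation (roughly 31 times as long).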

# Profiling the largest differences #

While the performance of Exdir in many cases is close to h5py, there are a few cases that can be worth investigating further.

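For instance, it might be interesting to know what takes most time in write_slice. In the run recorded above, Exdir and h5py took about the same time for this test (a ratio of about 1.09), so the profile mainly shows where that time is spent in Exdir: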

In [16]:
setup_exdir_dataset = create_setup_dataset(setup_exdir)
dataset, f, path = setup_exdir_dataset()
%prun write_slice(dataset)
teardown_exdir(f, path)

 
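`%prun` is an IPython magic and is only available in a notebook or IPython session. Outside of one, a similar profile can be collected with the standard library modules `cProfile` and `pstats`. The sketch below profiles a hypothetical stand-in function, `slow_sum`, since profiling `write_slice` itself would require the Exdir setup above.

```python
import cProfile
import io
import pstats

def slow_sum(count):
    """Hypothetical stand-in for the function to be profiled."""
    total = 0
    for i in range(count):
        total += i * i
    return total

profiler = cProfile.Profile()
profiler.enable()
result = slow_sum(100000)
profiler.disable()

# Collect the report in a string and list the five most expensive calls
stream = io.StringIO()
stats = pstats.Stats(profiler, stream=stream)
stats.sort_stats("cumulative").print_stats(5)
report = stream.getvalue()
print(report)
```

Sorting by cumulative time puts the calls that dominate the total run time at the top of the report.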

In [17]:
f, path = setup_exdir()
%prun create_large_tree(f)
teardown_exdir(f, path)

 