# Benchmarks for Exdir #

This notebook contains a number of benchmarks that compare the performance of Exdir with h5py.

The following functions are used to set up an Exdir or HDF5 file for benchmarking.

**Warning**: Please make sure the files are not created in a folder managed by Syncthing, Dropbox or any other file synchronization system.
The benchmarks make a large number of changes to the files, so a file synchronization system will reduce performance and may fall out of sync in the process.

In [13]:
import exdir
import os
import shutil
import h5py

def setup_exdir():
    testpath = "test.exdir"
    if os.path.exists(testpath):
        shutil.rmtree(testpath)
    f = exdir.File(testpath)
    return f, testpath

def setup_exdir_none():
    testpath = "test.exdir"
    if os.path.exists(testpath):
        shutil.rmtree(testpath)
    f = exdir.File(testpath, validation=exdir.validation.none)
    return f, testpath

def teardown_exdir(f, testpath):
    f.close()
    shutil.rmtree(testpath)


def setup_h5py():
    testpath = "test.h5"
    if os.path.exists(testpath):
        os.remove(testpath)
    # Open explicitly for writing; recent h5py versions default to read-only mode.
    f = h5py.File(testpath, "w")
    return f, testpath

    
def teardown_h5py(f, testpath):
    # Close the file before deleting it so no open handle is left behind.
    f.close()
    os.remove(testpath)
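
One way to follow the warning above is to create the benchmark files in a scratch directory under the system temporary location. The following is a minimal sketch, assuming that location is not itself managed by a synchronization tool; `make_benchmark_dir` and `benchmark_path` are hypothetical helper names and not part of Exdir or h5py:

```python
import os
import shutil
import tempfile

def make_benchmark_dir(prefix="exdir_benchmark_"):
    """Create a fresh scratch directory under the system temp location."""
    return tempfile.mkdtemp(prefix=prefix)

def benchmark_path(directory, filename):
    """Build the path of a benchmark file inside the scratch directory."""
    return os.path.join(directory, filename)

scratch_dir = make_benchmark_dir()
exdir_testpath = benchmark_path(scratch_dir, "test.exdir")
h5py_testpath = benchmark_path(scratch_dir, "test.h5")

# ... run the benchmarks against exdir_testpath and h5py_testpath here ...

# Remove the scratch directory and everything in it when done.
shutil.rmtree(scratch_dir)
```

The setup functions could then use these paths instead of the relative `"test.exdir"` and `"test.h5"` names.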

The following function is used to run the different benchmarks.
It takes a target function to time, optional setup and teardown functions that create and remove the file, and the number of iterations to average over. Only the time spent in the target function is measured:

In [14]:
import time

def benchmark(target, setup=None, teardown=None, iterations=1):
    # Accumulate only the time spent in target(); setup and teardown are excluded.
    total_time = 0
    for i in range(iterations):
        data = tuple()
        if setup is not None:
            data = setup()
        # perf_counter() is a monotonic, high-resolution clock suited to timing.
        start_time = time.perf_counter()
        target(*data)
        end_time = time.perf_counter()
        total_time += end_time - start_time
        if teardown is not None:
            teardown(*data)

    # Mean time per iteration, in seconds.
    mean = total_time / iterations

    return mean
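
A mean alone hides how much individual runs vary. As a sketch of the same setup/target/teardown pattern, the variant below also reports the standard deviation and is exercised with stand-in functions that touch no files. The name `benchmark_with_spread` and the dummy functions are assumptions made for illustration:

```python
import statistics
import time

def benchmark_with_spread(target, setup=None, teardown=None, iterations=1):
    """Time only the target call; return (mean, stdev) in seconds."""
    timings = []
    for _ in range(iterations):
        data = setup() if setup is not None else tuple()
        start = time.perf_counter()
        target(*data)
        timings.append(time.perf_counter() - start)
        if teardown is not None:
            teardown(*data)
    mean = statistics.mean(timings)
    # A sample standard deviation needs at least two measurements.
    stdev = statistics.stdev(timings) if len(timings) > 1 else 0.0
    return mean, stdev

# Stand-ins that count calls instead of creating files.
calls = {"setup": 0, "target": 0, "teardown": 0}

def dummy_setup():
    calls["setup"] += 1
    return ([], "dummy-path")

def dummy_target(storage, path):
    calls["target"] += 1
    storage.extend(range(1000))

def dummy_teardown(storage, path):
    calls["teardown"] += 1
    storage.clear()

mean, stdev = benchmark_with_spread(
    dummy_target, dummy_setup, dummy_teardown, iterations=5
)
print("mean: {:.6f} s, stdev: {:.6f} s".format(mean, stdev))
```

A standard deviation that is large relative to the mean suggests the iteration count is too low for a stable comparison.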

The following wrappers make it easy to run a benchmark against both Exdir and h5py, or against Exdir alone:

In [21]:
import pandas as pd
import numpy as np

all_results = []

def benchmark_both(function, iterations=100):
    exdir_mean = benchmark(
        target=lambda f, path: function(f),
        setup=setup_exdir,
        teardown=teardown_exdir,
        iterations=iterations
    )
    exdir_none_mean = benchmark(
        target=lambda f, path: function(f),
        setup=setup_exdir_none,
        teardown=teardown_exdir,
        iterations=iterations
    )
    hdf5_mean = benchmark(
        target=lambda f, path: function(f),
        setup=setup_h5py,
        teardown=teardown_h5py,
        iterations=iterations
    )
    result = pd.DataFrame(
        [(function.__name__, hdf5_mean, exdir_mean, exdir_none_mean)],
        columns=["Test", "h5py", "exdir (thorough)", "exdir (none)"]
    )
    all_results.append(result)
    return result

def benchmark_exdir(function, iterations=100):
    exdir_mean = benchmark(
        target=lambda f, path: function(f),
        setup=setup_exdir,
        teardown=teardown_exdir,
        iterations=iterations
    )
    exdir_none_mean = benchmark(
        target=lambda f, path: function(f),
        setup=setup_exdir_none,
        teardown=teardown_exdir,
        iterations=iterations
    )
    result = pd.DataFrame(
        [(function.__name__, np.nan, exdir_mean, exdir_none_mean)],
        columns=["Test", "h5py", "exdir (thorough)", "exdir (none)"]
    )
    all_results.append(result)
    return result

We are now ready to start running the different benchmarks.

The following benchmark creates a small number of attributes.
This should be very fast with both h5py and Exdir:

In [22]:
def add_few_attributes(obj):
    for i in range(5):
        obj.attrs["hello" + str(i)] = "world"

benchmark_both(add_few_attributes)

AttributeError: module 'exdir' has no attribute 'validation'

The following benchmark adds a larger number of attributes one-by-one.
Because Exdir reads back and rewrites the entire attribute file on every write, in case someone changed it in the meantime, this is significantly slower with Exdir than with h5py:

In [23]:
def add_many_attributes(obj):
    for i in range(200):
        obj.attrs["hello" + str(i)] = "world"

benchmark_both(add_many_attributes, 10)

AttributeError: module 'exdir' has no attribute 'validation'

However, Exdir is capable of writing all attributes in one operation.
This makes writing the same attributes about as fast as (or even faster than) h5py.
Writing a large number of attributes in a single operation is not possible with h5py.
We therefore need to run this only with Exdir:

In [18]:
def add_many_attributes_single_operation(obj):
    attributes = {}
    for i in range(200):
        attributes["hello" + str(i)] = "world"
    obj.attrs = attributes
    
benchmark_exdir(add_many_attributes_single_operation)

|   | Test | h5py | exdir |
|---|------|------|-------|
| 0 | add_many_attributes_single_operation | NaN | 0.012755 |
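
To illustrate why writing attributes one by one costs more than a single bulk write, the sketch below uses a plain JSON file as a stand-in for an attribute file. This is an assumption made for illustration only: Exdir's actual storage format and code path may differ, and `set_attribute` and `set_all_attributes` are hypothetical names.

```python
import json
import os
import tempfile

def set_attribute(path, key, value):
    """Read the whole file, change one key, and write the whole file back."""
    if os.path.exists(path):
        with open(path) as f:
            attributes = json.load(f)
    else:
        attributes = {}
    attributes[key] = value
    with open(path, "w") as f:
        json.dump(attributes, f)

def set_all_attributes(path, attributes):
    """Write every attribute in a single pass."""
    with open(path, "w") as f:
        json.dump(attributes, f)

attributes = {"hello" + str(i): "world" for i in range(200)}

with tempfile.TemporaryDirectory() as directory:
    one_by_one_path = os.path.join(directory, "one_by_one.json")
    bulk_path = os.path.join(directory, "bulk.json")

    # 200 read-modify-write cycles: each one re-reads and re-writes all
    # attributes stored so far, so total work grows roughly quadratically.
    for key, value in attributes.items():
        set_attribute(one_by_one_path, key, value)

    # One write: total work grows linearly with the number of attributes.
    set_all_attributes(bulk_path, attributes)

    with open(one_by_one_path) as f:
        one_by_one_result = json.load(f)
    with open(bulk_path) as f:
        bulk_result = json.load(f)

# Both approaches store the same data; only the amount of file I/O differs.
print(one_by_one_result == bulk_result)
```

Both files end up with identical content, so the difference measured by the benchmarks comes from the repeated read-modify-write cycles rather than from the data itself.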


In [19]:
def add_attribute_tree(obj):
    tree = {}
    for i in range(100):
        tree["hello" + str(i)] = "world"
    tree["intermediate"] = {}
    intermediate = tree["intermediate"]
    for level in range(10):
        level_str = "level" + str(level)
        intermediate[level_str] = {}
        intermediate = intermediate[level_str]
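    # Store a leaf value in the deepest level; a plain "intermediate = 42"
    # would only rebind the local name and leave the tree unchanged.
    intermediate["value"] = 42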
    obj.attrs["test"] = tree
    
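# h5py cannot store a nested dictionary as an attribute; benchmark_both fails with
# "TypeError: Object dtype dtype('O') has no native HDF5 equivalent",
# so this benchmark is run with Exdir only.
benchmark_exdir(add_attribute_tree)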

In [None]:
def add_small_dataset(obj):
    data = np.zeros((100, 100, 100))
    obj.create_dataset("foo", data=data)
    obj.close()
    
benchmark_both(add_small_dataset)

In [None]:
def add_medium_dataset(obj):
    data = np.zeros((1000, 100, 100))
    obj.create_dataset("foo", data=data)
    obj.close()
    
benchmark_both(add_medium_dataset, 10)

In [None]:
def add_large_dataset(obj):
    data = np.zeros((1000, 1000, 100))
    obj.create_dataset("foo", data=data)
    obj.close()
    
benchmark_both(add_large_dataset, 10)

In [None]:
def create_many_objects(obj):
    for i in range(5000):
        group = obj.create_group("group{}".format(i))

benchmark_both(create_many_objects, 3)

In [None]:
def iterate_objects(obj):
    i = 0
    for a in obj:
        i += 1
    return i

benchmark_both(iterate_objects)

In [None]:
def create_large_tree(obj, level=0):
    if level > 4:
        return
    for i in range(3):
        group = obj.create_group("group_{}_{}".format(i, level))
        data = np.zeros((10, 10, 10))
        group.create_dataset("dataset_{}_{}".format(i, level), data=data)
        create_large_tree(group, level + 1)
        
benchmark_both(create_large_tree)

In [None]:
pd.concat(all_results)