# Compare data saving and loading performance of QCoDeS SQLite backend vs HDF5 (h5py) and numpy npy files

This notebook measures the time it takes to save and load measurement data using the QCoDeS dataset compared with two other ways of storing data: HDF5 and numpy npy files. The motivation is that the performance of data saving (and loading) should not limit the performance of QCoDeS users' experiments.

HDF5 and numpy npy storage solutions are widely used in the scientific community, and are known for their efficiency.

In this notebook, we define convenience functions that generate data and save and load it with each of the backends of interest, as well as some infrastructure for measuring how long the saving and loading take.

# Preparations

## Imports

In [1]:
import time
import os
from tempfile import TemporaryFile
from functools import partial

import numpy
import h5py
from git import Repo

import qcodes
from qcodes import (
    initialise_or_create_database_at, load_or_create_experiment, 
    Measurement, Parameter,
    load_by_id
)
from qcodes.dataset.data_export import get_data_by_id


## Relevant environment information

In [2]:
qcodes.version.__version__

'0.14.0+107.g12c917c30'

In [3]:
# in case qcodes is installed from a local git repository
qcodes_repo_path = os.sep.join(qcodes.__path__[0].split(os.sep)[:-1])
qcodes_repo = Repo(qcodes_repo_path)
print(qcodes_repo.head.commit)

12c917c30ceaa0a1afe5d7cccccdca71b2276c37


In [4]:
print(h5py.version.info)

Summary of the h5py configuration
---------------------------------

h5py    2.10.0
HDF5    1.10.5
Python  3.7.7 (default, May  6 2020, 11:45:54) [MSC v.1916 64 bit (AMD64)]
sys.platform    win32
sys.maxsize     9223372036854775807
numpy   1.17.5



## Simulated measurement

For this study, we are going to take the case of sweeping 2 independent parameters (s1, s2) and measuring 2 dependent parameters (magnitude and phase). For simplicity, the number of datapoints per parameter is the same, and it is set in a variable. We are going to use the same generator function throughout the study for generating dummy data that we will be saving and loading.

In [5]:
# number of data points per parameter
n_pts_per_param = 20

In [6]:
def make_data_producer(n_pts_per_param):
    def produce_measurement_data():
        """
        This generator represents the code that obtains
        measurement data. For the sake of example, it
        just yields random dummy data: 4 parameters/dimensions,
        `n_pts_per_param` points per dimension (which becomes
        `n_pts_per_param**4` data points per parameter in total).

        The number of points per parameter/dimension,
        `n_pts_per_param`, is taken from the enclosing
        `make_data_producer` call.

        Yields:
            tuple of values of the 4 dimensions obtained
            at a single "measurement" iteration
        """
        for s1_val in range(n_pts_per_param):
            for s2_val in range(n_pts_per_param):
                magn_vals, phas_vals = numpy.meshgrid(
                    numpy.random.rand(n_pts_per_param),
                    numpy.random.rand(n_pts_per_param),
                )
                magn_vals = numpy.reshape(magn_vals, -1)
                phas_vals = numpy.reshape(phas_vals, -1)

                yield s1_val, s2_val, magn_vals, phas_vals
    return produce_measurement_data
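
To make the shape of the generated data concrete, here is a small sketch with fixed (non-random) inputs and `n = 2`. The variable names `a`, `b`, `n_chunks`, `chunk_len` and `total_points` are illustrative only and do not appear elsewhere in this notebook. It shows how `numpy.meshgrid` followed by flattening turns two vectors of length `n` into two vectors of length `n**2`, which is what each iteration of the producer yields.

```python
import numpy

n = 2  # stands in for n_pts_per_param

# fixed values instead of numpy.random.rand(n), so the result is easy to read
a = numpy.array([0.1, 0.2])
b = numpy.array([10.0, 20.0])

magn_grid, phas_grid = numpy.meshgrid(a, b)
magn_vals = numpy.reshape(magn_grid, -1)  # `a` repeated once per element of `b`
phas_vals = numpy.reshape(phas_grid, -1)  # each element of `b` repeated len(a) times

print(magn_vals)
print(phas_vals)

# the producer yields once per (s1, s2) pair, i.e. n * n times,
# and every yield carries n * n values per dependent parameter
n_chunks = n * n
chunk_len = magn_vals.size
total_points = n_chunks * chunk_len
print(n_chunks, chunk_len, total_points)
```

With `n_pts_per_param = 20` the same arithmetic gives 400 chunks of 400 values, that is `20**4 = 160000` values per parameter.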

## Measuring execution time

In most cases, we are going to use `timeit` to measure time.

In some cases, however, the `timeit` interface is not flexible enough: it does not let you choose the "start" and "stop" moments __within__ the code under test. Below is a custom decorator that overcomes this limitation.

In [7]:
import timeit
from IPython.core.magics.execution import TimeitResult
from copy import copy

In [8]:
def time_it(number=None,
            repeat=timeit.default_repeat):
    """
    Sometimes it is needed to define in the code itself
    where you want to start measuring the execution time
    of that piece of code and when you want to stop the 
    measurement. Unfortunately, `timeit` module does not
    support that out-of-the-box. Hence, this decorator.
    
    This decorator uses `timeit` infrastructure, but allows
    to profile a function that returns its execution time.
    This allows developers to define the start and stop moments
    in the code itself, and the `timeit` infrastructure will
    do the rest.
    
    To use this decorator, follow these steps:
    * implement a piece of code that you'd like to profile
      as a function
    * in the code of the function find the start and stop
      points where the time needs to be measured
    * use `time.perf_counter()` to get the time in seconds
      at those places
    * make the function return the difference between stop
      and start moments as its first return value
    * the function signature is not restricted in its input
      arguments, and is not restricted in its return values
      except for the first return value
    * decorate the function with this decorator
    * call your decorated function to see the results of 
      the profiling
    
    Args:
        number
            the function gets executed this `number` of times,
            and the average of the collected individual execution
            times is used (same as for `timeit`); if None, then
            the necessary number of execution times will be 
            inferred (see `timeit` module for more info)
        repeat
            the profiling measurement gets repeated `repeat`
            number of times (same as for `timeit`)
    """
    def time_sut(sut):
        """
        This is the actual decorator. "sut" stands for "system
        under test".
        """
        def wrapper(*args, **kwargs):
            """
            This wrapper function uses `timeit` infrastructure
            from `timeit` module and its implementation in Jupyter
            magics.
            
            Returns the `TimeitResult` object that contains all the
            information about the profiling results.
            """
            t = timeit.Timer()
            
            # define a function that the Timer class
            # can consume for profiling
            def inner(_it, _timer):
                """
                see the internals of the `timeit.Timer` class
                for more information
                """
                total_time = 0
                for _ in _it:
                    args_ = copy(args)
                    kwargs_ = copy(kwargs)
                    
                    returned_vals = sut(*args_, **kwargs_)
                    
                    total_time += (returned_vals[0]
                                   if isinstance(returned_vals, tuple)
                                   else returned_vals)
                return total_time
            
            t.inner = inner
            
            # execute the profiling
            try:
                if number is None:
                    number_, __ = t.autorange()
                else:
                    number_ = number
                
                all_runs = t.repeat(repeat, number_)
            except:
                t.print_exc()
                raise
            
            # pretty print the results
            best = min(all_runs) / number_
            worst = max(all_runs) / number_
            timeit_result = TimeitResult(number_, repeat, best, worst, all_runs, 0, 3)
            print(timeit_result)
            
            return timeit_result
        return wrapper
    return time_sut
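
The contract that the decorator expects from a function can be illustrated without `timeit` or IPython. The following is a minimal standard-library sketch, not a replacement for `time_it`: `self_timed_sum` and `summarize_self_timed` are hypothetical names introduced only for this illustration. The function under test marks its own start and stop moments and returns the elapsed time as its first return value; the helper simply calls it several times and aggregates what it reports.

```python
import statistics
import time


def self_timed_sum(n):
    """Return (elapsed_seconds, result); only the summation is timed."""
    values = list(range(n))       # setup, excluded from the timing
    t0 = time.perf_counter()      # start moment chosen inside the function
    total = sum(values)           # the part we want to measure
    t1 = time.perf_counter()      # stop moment chosen inside the function
    return t1 - t0, total


def summarize_self_timed(func, *args, repeat=5, **kwargs):
    """Hypothetical helper: call `func` `repeat` times and aggregate
    the elapsed times that it reports as its first return value."""
    timings = []
    for _ in range(repeat):
        returned = func(*args, **kwargs)
        # accept both `elapsed` and `(elapsed, ...)` return styles
        elapsed = returned[0] if isinstance(returned, tuple) else returned
        timings.append(elapsed)
    return statistics.mean(timings), statistics.pstdev(timings)


mean_s, std_s = summarize_self_timed(self_timed_sum, 100_000)
print(f"{mean_s * 1e3:.3f} ms ± {std_s * 1e3:.3f} ms per call")
```

The list construction in `self_timed_sum` plays the role of the setup code that a plain `timeit` call would have included in the measurement.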

# Defining test routines

Now let's define all the test routines for saving and loading data, which we will use to test the performance of the different backends. These routines use the data generation function defined above, and some of them conform to the interface required by the custom `time_it` decorator.

## QCoDeS dataset

First, we need to initialize a database file.

In [9]:
# initialize the database file for qcodes dataset

temp_db_file = TemporaryFile(suffix='.db')
temp_db_file.close()

initialise_or_create_database_at(temp_db_file.name)

load_or_create_experiment('save_load_speed_study', 'sqlite3_from_qcodes')

save_load_speed_study#sqlite3_from_qcodes#1@C:\Users\JENS-W~1\AppData\Local\Temp\tmp6dumqcbs.db
-----------------------------------------------------------------------------------------------
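
The cell above uses `TemporaryFile` only as a way to obtain a unique file name. On Windows, `TemporaryFile` is an alias of `NamedTemporaryFile`, so `.name` is a path; on POSIX systems the `.name` of a `TemporaryFile` is a file descriptor number instead. A portable alternative, sketched below with a hypothetical file name, is to create a temporary directory and build the path inside it.

```python
import os
import tempfile

# A temporary directory that is removed, with its contents, on cleanup.
tmp_dir = tempfile.TemporaryDirectory()

# The file name is hypothetical; only the directory exists at this point.
db_path = os.path.join(tmp_dir.name, "save_load_speed_study.db")

# `db_path` is a plain string on every platform, and nothing exists at
# that location yet, so a library is free to create the file itself.
print(db_path)
print(os.path.exists(db_path))

# Remove the directory once the files in it are no longer needed.
tmp_dir.cleanup()
```

A path obtained this way could be passed to `initialise_or_create_database_at` in place of `temp_db_file.name`.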

Next, we define a convenience function that performs all the usual steps that are necessary for a qcodes measurement and data saving.

Note that we exclude from the time measurement the parts that are related to setting up the `Measurement` object and starting the actual measurement. We do include the exiting of the `meas.run()` context, though, because the last pieces of data are flushed then.

We decorate it with our custom `time_it` decorator presented above (note that we want to keep the original function as well, hence the `@` syntax is not used for decoration).

In [10]:
def save_to_sqlite(create_data_generator, 
                   paramtype='numeric', 
                   write_period=10):
    """
    Use qcodes dataset with its sqlite backend to save dummy
    data, and measure the time this takes. The data that is being
    saved is 2 dependent and 2 independent parameters. The data
    for the measurement is generated by an iterator that is returned
    by calling the `create_data_generator` function.
    
    Args:
        create_data_generator
            a callable with no arguments that returns an iterator
            that in turn generates dummy data for 4 parameters
        paramtype
            controls the way the data of the 2 dependent parameters
            is stored in the sqlite database,
            see `Measurement.register_parameter` for more information
            (useful values in the context of this notebook are 'numeric'
            and 'array')
        write_period
            the data is written to the database at least every
            `write_period` number of seconds
            
    Returns:
        saving_time
            measured time it took to save the data, in seconds
        dataset
            the qcodes dataset object where the data was saved to;
            it is useful for accessing the data and measuring the
            time it takes to load it
    """
    data_generator = create_data_generator()
    
    # define parameters
    s1 = Parameter('s1', label='Setting 1', unit='V', get_cmd=None, set_cmd=None)
    s2 = Parameter('s2', label='Setting 2', unit='V', get_cmd=None, set_cmd=None)
    magn = Parameter('magn', label='Magnitude', unit='V', get_cmd=None, set_cmd=None)
    phas = Parameter('phas', label='Phase', unit='deg', get_cmd=None, set_cmd=None)
    
    meas = Measurement()
    
    # register parameters in the measurement object
    meas.register_parameter(s1)
    meas.register_parameter(s2)
    meas.register_parameter(magn, setpoints=(s1, s2), paramtype=paramtype)
    meas.register_parameter(phas, setpoints=(s1, s2), paramtype=paramtype)
    
    # set the write period to a large value, so that actual writing
    # to the database happens at the end of the "measurement"
    meas.write_period = write_period
    
    # perform the measurement
    with meas.run() as datasaver:
        t0 = time.perf_counter()  # <-----

        for s1_val, s2_val, magn_vals, phas_vals \
                in data_generator:
            
            datasaver.add_result((s1, s1_val), 
                                 (s2, s2_val),
                                 (magn, magn_vals),
                                 (phas, phas_vals))
    
    t1 = time.perf_counter()  # <-----
    saving_time = t1 - t0
    
    dataset = datasaver.dataset
    
    return saving_time, dataset


# decorate the function, and leave the original one intact
time_save_to_sqlite_numeric = time_it(number=3, repeat=2)(
    partial(save_to_sqlite, paramtype='numeric'))

time_save_to_sqlite_array = time_it()(
    partial(save_to_sqlite, paramtype='array'))
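
The pattern used above, calling the decorator by hand on a `partial` instead of using the `@` syntax, can be sketched in isolation. In this illustration `announce` is a hypothetical stand-in for a decorator such as `time_it(...)`, and `save` is a hypothetical stand-in for `save_to_sqlite`.

```python
from functools import partial


def announce(func):
    """Hypothetical stand-in for a decorator such as `time_it(...)`."""
    def wrapper(*args, **kwargs):
        return ("wrapped", func(*args, **kwargs))
    return wrapper


def save(data, mode="numeric"):
    """Hypothetical stand-in for a saving routine."""
    return f"saved {len(data)} points as {mode}"


# Applying the decorator by hand creates a new callable and leaves
# `save` itself untouched; `partial` pins the keyword argument first.
wrapped_save_array = announce(partial(save, mode="array"))

print(save([1, 2]))                   # the original is still usable
print(wrapped_save_array([1, 2, 3]))  # the decorated variant
```

Had `@announce` been written above `def save`, the name `save` would refer to the wrapper and the undecorated function would no longer be directly available.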

## HDF5 file

HDF5 files (thanks to `h5py`) behave very much like `numpy` arrays, so interfacing with them is very similar.

We are not going to use the custom `time_it` decorator here, because `timeit` itself is sufficient.

In [11]:
def save_to_hdf5(create_data_generator, filename):
    """
    Use HDF5 file to save dummy data. The data that is being
    saved is 2 dependent and 2 independent parameters. The
    resulting HDF5 file is going to contain a single 'dataset'
    with the name "results".

    The data for the measurement is generated by an iterator
    that is returned by calling the `create_data_generator`
    function.

    This function does not measure time itself and returns
    nothing; it is meant to be timed from the outside with
    `timeit`.

    Args:
        create_data_generator
            a callable with no arguments that returns an iterator
            that in turn generates dummy data for 4 parameters
        filename
            the name of the HDF5 file with the full path
    """
    data_generator = create_data_generator()

    with h5py.File(filename, 'w') as f:
        ds = f.create_dataset('results', shape=(4, 0), maxshape=(4, None))

        for s1_val, s2_val, magn_vals, phas_vals in data_generator:
            n_pts = len(magn_vals)
            
            # we simulate the fact that we don't 
            # know the full amount of data
            # that needs to be saved, hence 
            # we need to resize while saving
            n_cols, n_rows = ds.shape
            ds.resize((n_cols, n_rows + n_pts))

            ds[0, n_rows:n_rows+n_pts] = s1_val
            ds[1, n_rows:n_rows+n_pts] = s2_val
            ds[2, n_rows:n_rows+n_pts] = magn_vals
            ds[3, n_rows:n_rows+n_pts] = phas_vals

## Numpy npy file

We are going to use `numpy`'s `.npy` files together with the handy `open_memmap` function to save the data as it is produced by the data-generating iterator.

In [12]:
def save_to_npy(create_data_generator, filename):
    """
    Use numpy npy file to save dummy data. The data that is
    being saved is 2 dependent and 2 independent parameters.
    The data for the measurement is generated by an iterator
    that is returned by calling the `create_data_generator`
    function.

    This function does not measure time itself and returns
    nothing; it is meant to be timed from the outside with
    `timeit`.

    Args:
        create_data_generator
            a callable with no arguments that returns an iterator
            that in turn generates dummy data for 4 parameters
        filename
            the name of the npy file with the full path; it has to
            contain '.npy' extension, otherwise `numpy` will add it
            when saving data, and it will be impossible to refer
            to the actual file without manually appending the
            '.npy' extension to the `filename` in the code
            outside of this function
    """
    data_generator = create_data_generator()

    npy_mm = numpy.lib.format.open_memmap(
        filename, mode='w+', shape=(4, 0))
    
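    # this (possibly dangerous?) hack is needed to allow
    # resizing while adding data; note that `numpy.require`
    # returns an in-memory copy when the array does not already
    # own its data, so the resized array may no longer be
    # backed by the file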
    npy_mm = numpy.require(npy_mm, requirements=['OWNDATA'])

    for s1_val, s2_val, magn_vals, phas_vals in data_generator:
        n_pts = len(magn_vals)
        
        # we simulate the fact that we don't 
        # know the full amount of data
        # that needs to be saved, hence 
        # we need to resize while saving
        n_cols, n_rows = npy_mm.shape
        npy_mm.resize((n_cols, n_rows + n_pts))

        npy_mm[0, n_rows:n_rows+n_pts] = s1_val
        npy_mm[1, n_rows:n_rows+n_pts] = s2_val
        npy_mm[2, n_rows:n_rows+n_pts] = magn_vals
        npy_mm[3, n_rows:n_rows+n_pts] = phas_vals

    del npy_mm  # closes the file and performs final flushing
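
If the total number of points is known before the measurement starts, the resizing step can be avoided altogether. The following is a minimal sketch under that assumption, which differs from the unknown-size scenario that the function above simulates; the file name and the small `n` are illustrative only. The memory map is created with its final shape, filled chunk by chunk, flushed, and then read back with `numpy.load`.

```python
import os
import tempfile

import numpy

n = 4             # small stand-in for n_pts_per_param
n_total = n ** 4  # total number of points per column
chunk = n ** 2    # number of points delivered per iteration

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "results.npy")

    # Create the .npy file with its final shape up front.
    mm = numpy.lib.format.open_memmap(
        path, mode="w+", dtype=numpy.float64, shape=(4, n_total))

    start = 0
    for s1_val in range(n):
        for s2_val in range(n):
            stop = start + chunk
            mm[0, start:stop] = s1_val
            mm[1, start:stop] = s2_val
            mm[2, start:stop] = numpy.random.rand(chunk)
            mm[3, start:stop] = numpy.random.rand(chunk)
            start = stop

    mm.flush()  # write pending changes to disk
    del mm      # drop the memory map before the file is read and removed

    loaded = numpy.load(path)

print(loaded.shape)
```

Because the array is never resized, no `OWNDATA` workaround is needed and every assignment goes through the memory map to the file.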

# Measure saving times

## Save time of QCoDeS dataset

### with 'numeric' type

In [13]:
save_time_dataset_numeric = time_save_to_sqlite_numeric(
    make_data_producer(n_pts_per_param))

Starting experimental run with id: 1. 
Starting experimental run with id: 2. 
Starting experimental run with id: 3. 
Starting experimental run with id: 4. 
Starting experimental run with id: 5. 
Starting experimental run with id: 6. 
1.86 s ± 1.26 ms per loop (mean ± std. dev. of 2 runs, 3 loops each)


In [14]:
print("Data saving to dataset with 'numeric' paramtype took:")
print(save_time_dataset_numeric)

Data saving to dataset with 'numeric' paramtype took:
1.86 s ± 1.26 ms per loop (mean ± std. dev. of 2 runs, 3 loops each)


### with 'array' type

In [15]:
save_time_dataset_array = time_save_to_sqlite_array(
    make_data_producer(n_pts_per_param))

Starting experimental run with id: 7. 
Starting experimental run with id: 8. 
Starting experimental run with id: 9. 
Starting experimental run with id: 10. 
Starting experimental run with id: 11. 
Starting experimental run with id: 12. 
Starting experimental run with id: 13. 
Starting experimental run with id: 14. 
Starting experimental run with id: 15. 
Starting experimental run with id: 16. 
Starting experimental run with id: 17. 
Starting experimental run with id: 18. 
Starting experimental run with id: 19. 
155 ms ± 4.06 ms per loop (mean ± std. dev. of 5 runs, 2 loops each)


In [16]:
print("Data saving to dataset with 'array' paramtype took:")
print(save_time_dataset_array)

Data saving to dataset with 'array' paramtype took:
155 ms ± 4.06 ms per loop (mean ± std. dev. of 5 runs, 2 loops each)
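
One plausible contribution to the gap between the two paramtypes is the number of rows written: storing every value as its own row versus storing each chunk as one binary blob. This is an assumption and not something measured in this notebook. The sketch below illustrates the two layouts with the standard `sqlite3` module; the table layouts are hypothetical and are not the schema that QCoDeS uses.

```python
import sqlite3
from array import array

# tiny stand-ins for the 400 chunks of 400 values used in this notebook
n_chunks, chunk_len = 4, 5
chunks = [[float(i * chunk_len + j) for j in range(chunk_len)]
          for i in range(n_chunks)]

con = sqlite3.connect(":memory:")
# Hypothetical layouts, not the schema QCoDeS uses.
con.execute("CREATE TABLE per_point (chunk INTEGER, value REAL)")
con.execute("CREATE TABLE per_chunk (chunk INTEGER, data BLOB)")

# Layout 1: one row per data point.
con.executemany(
    "INSERT INTO per_point VALUES (?, ?)",
    [(i, v) for i, values in enumerate(chunks) for v in values])

# Layout 2: one row per chunk, with the values packed into a single blob.
con.executemany(
    "INSERT INTO per_chunk VALUES (?, ?)",
    [(i, array("d", values).tobytes()) for i, values in enumerate(chunks)])
con.commit()

rows_per_point = con.execute("SELECT COUNT(*) FROM per_point").fetchone()[0]
rows_per_chunk = con.execute("SELECT COUNT(*) FROM per_chunk").fetchone()[0]

# Unpack one blob back into floats to show that nothing is lost.
blob = con.execute(
    "SELECT data FROM per_chunk WHERE chunk = 1").fetchone()[0]
restored = array("d")
restored.frombytes(blob)
con.close()

print(rows_per_point, rows_per_chunk, list(restored))
```

With 400 chunks of 400 values, the first layout needs 160000 rows per parameter while the second needs 400.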


## Save time of HDF5

In [17]:
%%timeit outfile = TemporaryFile(); outfile.close()

save_to_hdf5(make_data_producer(n_pts_per_param), outfile.name)

178 ms ± 2 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


## Save time of npy

In [18]:
%%timeit outfile = TemporaryFile(suffix='.npy'); outfile.close()

save_to_npy(make_data_producer(n_pts_per_param), outfile.name)

398 ms ± 5.39 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


# Measure loading times

## Load time of QCoDeS dataset

The QCoDeS dataset offers two ways of simply loading the data: the `DataSet.get_data` method and the `get_data_by_id` function.

We are going to use both, but note that `get_data_by_id` does a bit more than just loading the data, which presumably makes it more popular among users.

A third way is to use `DataSet.get_values` to obtain the values of each parameter one by one. `get_data_by_id` already uses it internally, hence we are not going to profile it separately.

### of 'numeric' type

In [19]:
_, dataset_numeric = save_to_sqlite(
    make_data_producer(n_pts_per_param), 
    paramtype='numeric')

Starting experimental run with id: 20. 


In [20]:
%%timeit parameter_names = dataset_numeric.parameters.split(',')

data = dataset_numeric.get_data(*parameter_names)




2.5 s ± 44.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [21]:
%%timeit

data = get_data_by_id(dataset_numeric.run_id)

2.56 s ± 22.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


### of 'array' type

In [22]:
_, dataset_array = save_to_sqlite(
    make_data_producer(n_pts_per_param), 
    paramtype='array')

Starting experimental run with id: 21. 


In [23]:
%%timeit parameter_names = dataset_array.parameters.split(',')

data = dataset_array.get_data(*parameter_names)

137 ms ± 509 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)


In [24]:
%%timeit

data = get_data_by_id(dataset_array.run_id)

158 ms ± 648 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)


## Load time from HDF5

In [25]:
hdf5file = TemporaryFile()
hdf5file.close()
hdf5filename = hdf5file.name

_ = save_to_hdf5(make_data_producer(n_pts_per_param), hdf5filename)

In [26]:
%%timeit

with h5py.File(hdf5filename, 'r') as f:
    data = numpy.array(f['results'], copy=True)

5.24 ms ± 35.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


## Load time from npy

In [27]:
npyfile = TemporaryFile(suffix='.npy')
npyfile.close()
npyfilename = npyfile.name

_ = save_to_npy(make_data_producer(n_pts_per_param), npyfilename)

In [28]:
%%timeit

data = numpy.load(npyfilename)

251 µs ± 610 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)


# Summary of results

Let's summarize the measurement results. 

Note that the results are added manually so that playing around with the cells of this notebook will not result in changes to the summary.

The test data consists of `4 columns` with `160000 (20**4)` points per column. It is produced in a `for` loop over the first 2 parameters, where each iteration delivers the data for the last 2 parameters as arrays of `400 (20**2)` elements, and the data is saved in these chunks. The measured times for saving and loading that data with the different backends are:

| Backend | QCoDeS<br>dataset<br>'numeric' | QCoDeS<br>dataset<br>'array' | HDF5<br>file | .npy<br>file |
|---|---|---|---|---|
| Saving time  | ~ 1900 ms | ~ 160 ms | ~ 180 ms | ~ 400 ms |
| Loading time | ~ 2500 ms | ~ 150 ms |  ~ 5 ms | ~ 0.25 ms |

__Note__ that the numbers in this table carry the `~` sign, which means that they should be read as __"on the order of"__. The reason is that when you run the cells of this notebook multiple times, you might see a discrepancy of ~10-20% from these numbers, but (hopefully) not more.

__Conclusions__
* the 'numeric' type of the QCoDeS dataset should not be used for large data
* the saving speed of the 'array' type of the QCoDeS dataset is similar to that of HDF5
* the loading times of HDF5 and npy are significantly shorter than that of the QCoDeS dataset
* the saving time of npy may seem long, but that is most probably due to the resizing of the file on-the-fly while new data comes in; HDF5 seems to handle this about 2x better
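
The relative factors behind these conclusions can be recomputed from the rounded values in the table. This is only a small arithmetic sketch: the dictionaries repeat the table entries in milliseconds, and the helper `ratio` is a name introduced for this illustration.

```python
# Rounded values copied from the summary table, in milliseconds.
saving_ms = {"numeric": 1900, "array": 160, "hdf5": 180, "npy": 400}
loading_ms = {"numeric": 2500, "array": 150, "hdf5": 5, "npy": 0.25}

# 4 columns of 20**4 values each
n_values = 4 * 20 ** 4


def ratio(times, slow, fast):
    """How many times longer `slow` takes compared to `fast`."""
    return times[slow] / times[fast]


print(f"values stored: {n_values}")
print(f"saving,  numeric vs array: {ratio(saving_ms, 'numeric', 'array'):.1f}x")
print(f"saving,  npy vs hdf5:      {ratio(saving_ms, 'npy', 'hdf5'):.1f}x")
print(f"loading, numeric vs array: {ratio(loading_ms, 'numeric', 'array'):.1f}x")
print(f"loading, array vs hdf5:    {ratio(loading_ms, 'array', 'hdf5'):.1f}x")
```

Since the inputs are rounded and carry a 10-20% run-to-run spread, the resulting factors are order-of-magnitude indications only.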