# Performance

This notebook demonstrates the performance of AlphaTIMS.
It covers the following topics:
1. [**Samples**](#Samples)
2. [**Reading raw Bruker .d folders**](#Reading-raw-Bruker-.d-folders)
3. [**Saving HDF files**](#Saving-HDF-files)
4. [**Reading HDF files**](#Reading-HDF-files)
5. [**Slicing data**](#Slicing-data)
6. [**Final overview**](#Final-overview)

The following system was used:

In [1]:
import alphatims.utils
import pandas as pd
import numpy as np

alphatims.utils.set_threads(8)
log_file_name = alphatims.utils.set_logger(
    log_file_name="performance_log.txt",
    overwrite=True
)
alphatims.utils.show_platform_info()
alphatims.utils.show_python_info()
alphatims.utils.set_progress_callback(None)

2021-04-22 16:59:43> Platform information:
2021-04-22 16:59:43> system        - Darwin
2021-04-22 16:59:43> release       - 19.6.0
2021-04-22 16:59:43> version       - 10.15.7
2021-04-22 16:59:43> machine       - x86_64
2021-04-22 16:59:43> processor     - i386
2021-04-22 16:59:43> cpu count     - 8
2021-04-22 16:59:43> cpu frequency - 2300.00 Mhz
2021-04-22 16:59:43> ram           - 23.5/32.0 Gb (available/total)
2021-04-22 16:59:43> 
2021-04-22 16:59:43> Python information:
2021-04-22 16:59:43> alphatims  - 0.2.5
2021-04-22 16:59:43> bokeh      - 2.2.3
2021-04-22 16:59:43> click      - 7.1.2
2021-04-22 16:59:43> datashader - 0.12.1
2021-04-22 16:59:43> h5py       - 3.2.1
2021-04-22 16:59:43> hvplot     - 0.7.1
2021-04-22 16:59:43> numba      - 0.53.1
2021-04-22 16:59:43> pandas     - 1.2.3
2021-04-22 16:59:43> psutil     - 5.8.0
2021-04-22 16:59:43> python     - 3.8.8
2021-04-22 16:59:43> pyzstd     - 0.14.4
2021-04-22 16:59:43> selenium   - 3.141.0
2021-04-22 16:59:43> tqdm       - 
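
Comparable system information can be collected with the standard library alone, for instance on a machine where AlphaTIMS is not installed. This is only a rough sketch and not how `alphatims.utils.show_platform_info` is implemented. The function name `collect_platform_info` is hypothetical, and RAM and CPU frequency are left out because they would need a third-party package such as psutil.

```python
import os
import platform


def collect_platform_info():
    """Gather basic platform details using only the standard library."""
    return {
        "system": platform.system(),
        "release": platform.release(),
        "machine": platform.machine(),
        "processor": platform.processor(),
        "cpu count": os.cpu_count(),
    }


info = collect_platform_info()
for key, value in info.items():
    # Left-align the keys so the output resembles the log above.
    print(f"{key:<10} - {value}")
```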

## Samples

Five samples are used and compared:

In [2]:
file_names = {
    "DDA_6": "/Users/swillems/Data/alphatims_testing/20201016_tims03_Evo03_PS_MA_HeLa_200ng_DDA_06-15_5_6min_4cm_S1-A1_1_21717.d",
    "DIA_6": "/Users/swillems/Data/alphatims_testing/20201016_tims03_Evo03_PS_SA_HeLa_200ng_300-1200_2steps_16scans_06-15_5_6min_4cm_S1-A4_1_21720.d",
    "DDA_21": "/Users/swillems/Data/alphatims_testing/20201207_tims03_Evo03_PS_SA_HeLa_200ng_EvoSep_prot_DDA_21min_8cm_S1-C10_1_22476.d",
    "DIA_21": "/Users/swillems/Data/alphatims_testing/20201207_tims03_Evo03_PS_SA_HeLa_200ng_EvoSep_prot_high_speed_21min_8cm_S1-C8_1_22474.d",
    "DDA_120": "/Users/swillems/Data/alphatims_testing/HeLa_200ng_1428.d"
}
overview = pd.DataFrame(index=file_names.keys())
overview["Type"] = pd.Series({x: x.split("_")[0] for x in file_names})
overview["Gradient (min)"] = pd.Series({x: x.split("_")[1] for x in file_names})

We first load all these files to show their basic statistics before we do the actual timing:

In [3]:
import logging
import alphatims.bruker

timstof_objects = {}
detector_strikes = {}
for sample_id, file_name in file_names.items():
    logging.info(f"Initial loading of {sample_id}")
    timstof_objects[sample_id] = alphatims.bruker.TimsTOF(file_name)
    detector_strikes[sample_id] = len(timstof_objects[sample_id])
    logging.info("")

2021-04-22 16:59:43> 
2021-04-22 16:59:43> Initial loading of DDA_6
2021-04-22 16:59:43> Importing data from /Users/swillems/Data/alphatims_testing/20201016_tims03_Evo03_PS_MA_HeLa_200ng_DDA_06-15_5_6min_4cm_S1-A1_1_21717.d
2021-04-22 16:59:43> Reading frame metadata for /Users/swillems/Data/alphatims_testing/20201016_tims03_Evo03_PS_MA_HeLa_200ng_DDA_06-15_5_6min_4cm_S1-A1_1_21717.d
2021-04-22 16:59:44> Reading 2,978 frames with 214,172,697 detector strikes for /Users/swillems/Data/alphatims_testing/20201016_tims03_Evo03_PS_MA_HeLa_200ng_DDA_06-15_5_6min_4cm_S1-A1_1_21717.d
2021-04-22 16:59:45> Indexing /Users/swillems/Data/alphatims_testing/20201016_tims03_Evo03_PS_MA_HeLa_200ng_DDA_06-15_5_6min_4cm_S1-A1_1_21717.d...
2021-04-22 16:59:45> Succesfully imported data from /Users/swillems/Data/alphatims_testing/20201016_tims03_Evo03_PS_MA_HeLa_200ng_DDA_06-15_5_6min_4cm_S1-A1_1_21717.d
2021-04-22 16:59:45> 
2021-04-22 16:59:45> Initial loading of DIA_6
2021-04-22 16:59:45> Importing data

In [4]:
overview["Detector strikes"] = pd.Series(detector_strikes).apply(
    lambda x : f"{x:,}"
)

In summary, we thus consider the following samples:

In [5]:
overview

| | Type | Gradient (min) | Detector strikes |
|---|---|---|---|
| DDA_6 | DDA | 6 | 214,172,697 |
| DIA_6 | DIA | 6 | 158,552,099 |
| DDA_21 | DDA | 21 | 295,251,252 |
| DIA_21 | DIA | 21 | 730,564,765 |
| DDA_120 | DDA | 120 | 2,074,019,899 |


## Reading raw Bruker .d folders

To reduce the influence of unrelated system activity, we use the `%timeit` magic, which repeats each measurement several times and thereby gives a robust estimate of the loading times for raw Bruker .d folders:

In [6]:
alphatims.utils.set_logger(stream=False)
raw_load_times = {}
for sample_id, file_name in file_names.items():
    print(f"Time to load {sample_id} raw Bruker .d folder:")
    raw_load_time = %timeit -o tmp = alphatims.bruker.TimsTOF(file_name)
    raw_load_times[sample_id] = np.average(raw_load_time.timings)
    print("")

Time to load DDA_6 raw Bruker .d folder:
1.6 s ± 49.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Time to load DIA_6 raw Bruker .d folder:
1.14 s ± 9.95 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Time to load DDA_21 raw Bruker .d folder:
3.35 s ± 154 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Time to load DIA_21 raw Bruker .d folder:
5.01 s ± 364 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Time to load DDA_120 raw Bruker .d folder:
22.2 s ± 3.03 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
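
Outside IPython, where the `%timeit` magic is unavailable, a comparable measurement can be sketched with the standard library. The snippet below is a minimal sketch and not part of AlphaTIMS: `time_callable` and `dummy_load` are hypothetical names, and `dummy_load` merely stands in for an expensive call such as `alphatims.bruker.TimsTOF(file_name)`.

```python
import statistics
import timeit


def time_callable(func, repeat=7, number=1):
    """Return per-call timings (in seconds) for `repeat` independent runs.

    The defaults mirror the "7 runs, 1 loop each" layout reported above.
    """
    totals = timeit.repeat(func, repeat=repeat, number=number)
    # Each total covers `number` calls, so normalize to a per-call time.
    return [total / number for total in totals]


def dummy_load():
    """Hypothetical stand-in for an expensive loading call."""
    return sum(range(100_000))


timings = time_callable(dummy_load)
mean_time = statistics.mean(timings)
spread = statistics.stdev(timings)
print(f"{mean_time:#.3g} s ± {spread:#.3g} s per loop ({len(timings)} runs)")
```

Averaging the per-run timings corresponds to the `np.average(raw_load_time.timings)` call used in this notebook.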



In [7]:
overview["Load .d (s)"] = pd.Series(raw_load_times).apply(
    lambda x : f"{x:#.3g}"
)
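
The two format specifications used for the overview columns can be checked in isolation. This short sketch uses the detector strike count and load time of DDA_6 reported above. The `#` flag of the `g` presentation type keeps trailing zeros, so a timing is always shown with three significant digits.

```python
# Thousands separators, as used for the "Detector strikes" column.
strikes = f"{214172697:,}"

# Three significant digits; "#" keeps trailing zeros that plain "g" drops.
with_flag = f"{1.6:#.3g}"
without_flag = f"{1.6:.3g}"

print(strikes)       # 214,172,697
print(with_flag)     # 1.60
print(without_flag)  # 1.6
```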

## Saving HDF files

Each of the data files can also be exported to HDF files:

In [8]:
save_hdf_times = {}
for sample_id, data in timstof_objects.items():
    print(f"Time to export {sample_id} to HDF file:")
    path = data.directory
    file_name = f"{data.sample_name}.hdf"
    save_hdf_time = %timeit -o tmp = data.save_as_hdf(path, file_name, overwrite=True)
    save_hdf_times[sample_id] = np.average(save_hdf_time.timings)
    print("")

Time to export DDA_6 to HDF file:
569 ms ± 42.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Time to export DIA_6 to HDF file:
392 ms ± 29.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Time to export DDA_21 to HDF file:
758 ms ± 20.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Time to export DIA_21 to HDF file:
1.77 s ± 51.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Time to export DDA_120 to HDF file:
5.23 s ± 367 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)



In [9]:
overview["Save .hdf (s)"] = pd.Series(save_hdf_times).apply(
    lambda x : f"{x:#.3g}"
)

## Reading HDF files

Once these HDF files are created, they can be loaded much faster than raw Bruker .d folders, roughly 2 to 4 times faster in the measurements below:

In [10]:
import os
load_hdf_times = {}
for sample_id, data in timstof_objects.items():
    print(f"Time to load {sample_id} HDF file:")
    file_name = os.path.join(data.directory, f"{data.sample_name}.hdf")
    load_hdf_time = %timeit -o tmp = alphatims.bruker.TimsTOF(file_name)
    load_hdf_times[sample_id] = np.average(load_hdf_time.timings)
    print("")

Time to load DDA_6 HDF file:
466 ms ± 15.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Time to load DIA_6 HDF file:
300 ms ± 5.44 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Time to load DDA_21 HDF file:
758 ms ± 48.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Time to load DIA_21 HDF file:
1.93 s ± 179 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Time to load DDA_120 HDF file:
10 s ± 111 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)



In [11]:
overview["Load .hdf (s)"] = pd.Series(load_hdf_times).apply(
    lambda x : f"{x:#.3g}"
)

## Slicing data

Lastly, we can slice this data. Since slicing relies on Numba JIT compilation, the first call includes the compilation time. We therefore compile the relevant functions with an initial slice call before timing anything:

In [12]:
tmp = timstof_objects["DDA_6"][0]

Now we can time how long it takes to slice in different dimensions:
   * **LC:** $100.0 \leq \textrm{retention_time} \lt 100.5$
   * **TIMS:** $\textrm{scan_index} = 450$
   * **Quadrupole:** $700.0 \leq \textrm{quad_mz_values} \lt 710.0$
   * **TOF:** $621.9 \leq \textrm{tof_mz_values} \lt 622.1$
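
Slicing by a float range such as `data[100.0: 100.5]` means that values have to be translated into index ranges. One common way to do this on sorted values is a binary search. The following is a minimal sketch of that idea using the standard library, not the actual AlphaTIMS implementation. The helper `value_range_to_indices` and the example retention times are hypothetical.

```python
import bisect


def value_range_to_indices(sorted_values, lower, upper):
    """Return (start, stop) so that sorted_values[start:stop] holds
    exactly the values v with lower <= v < upper."""
    start = bisect.bisect_left(sorted_values, lower)
    stop = bisect.bisect_left(sorted_values, upper)
    return start, stop


# Hypothetical retention times (in seconds) of consecutive frames.
rt_values = [99.8, 99.9, 100.0, 100.1, 100.2, 100.3, 100.4, 100.5, 100.6]

start, stop = value_range_to_indices(rt_values, 100.0, 100.5)
selected = rt_values[start:stop]
print(start, stop, selected)
```

Using `bisect_left` for both bounds yields the half-open interval from the list above: the lower bound is included and the upper bound is excluded.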

In [13]:
import os
frame_slice_times = {}
scan_slice_times = {}
quad_slice_times = {}
tof_slice_times = {}
for sample_id, data in timstof_objects.items():
    print(f"Time to slice {sample_id}:")
    
    print("Testing slice data[100.0: 100.5]")
    frame_slice_time = %timeit -o tmp = data[100.:100.5]
    frame_slice_times[sample_id] = np.average(frame_slice_time.timings)
    
    print("Testing slice data[:, 450]")
    scan_slice_time = %timeit -o tmp = data[:, 450]
    scan_slice_times[sample_id] = np.average(scan_slice_time.timings)
    
    print("Testing slice data[:, :, 700.0: 710]")
    quad_slice_time = %timeit -o tmp = data[:, :, 700.0: 710]
    quad_slice_times[sample_id] = np.average(quad_slice_time.timings)
    
    print("Testing slice data[:, :, :, 621.9: 622.1]")
    tof_slice_time = %timeit -o tmp = data[:, :, :, 621.9: 622.1]
    tof_slice_times[sample_id] = np.average(tof_slice_time.timings)
    
    print("")

Time to slice DDA_6:
Testing slice data[100.0: 100.5]
1.64 ms ± 91 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Testing slice data[:, 450]
39.7 ms ± 6.66 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Testing slice data[:, :, 700.0: 710]
27.7 ms ± 1.71 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Testing slice data[:, :, :, 621.9: 622.1]
90.3 ms ± 7.46 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Time to slice DIA_6:
Testing slice data[100.0: 100.5]
7.31 ms ± 176 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Testing slice data[:, 450]
29.8 ms ± 1.17 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Testing slice data[:, :, 700.0: 710]
632 ms ± 10.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Testing slice data[:, :, :, 621.9: 622.1]
96.3 ms ± 1.42 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Time to slice DDA_21:
Testing slice data[100.0: 100.5]
1.71 ms ± 29.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops

In [14]:
overview["Slice LC (s)"] = pd.Series(frame_slice_times).apply(
    lambda x : f"{x:#.3g}"
)
overview["Slice TIMS (s)"] = pd.Series(scan_slice_times).apply(
    lambda x : f"{x:#.3g}"
)
overview["Slice QUAD (s)"] = pd.Series(quad_slice_times).apply(
    lambda x : f"{x:#.3g}"
)
overview["Slice TOF (s)"] = pd.Series(tof_slice_times).apply(
    lambda x : f"{x:#.3g}"
)

## Final overview

The timings for all samples can thus be summarized as follows:

In [15]:
overview

| | Type | Gradient (min) | Detector strikes | Load .d (s) | Save .hdf (s) | Load .hdf (s) | Slice LC (s) | Slice TIMS (s) | Slice QUAD (s) | Slice TOF (s) |
|---|---|---|---|---|---|---|---|---|---|---|
| DDA_6 | DDA | 6 | 214,172,697 | 1.6 | 0.569 | 0.466 | 0.00164 | 0.0397 | 0.0277 | 0.0903 |
| DIA_6 | DIA | 6 | 158,552,099 | 1.14 | 0.392 | 0.3 | 0.00731 | 0.0298 | 0.632 | 0.0963 |
| DDA_21 | DDA | 21 | 295,251,252 | 3.35 | 0.758 | 0.758 | 0.00171 | 0.0608 | 0.0988 | 0.174 |
| DIA_21 | DIA | 21 | 730,564,765 | 5.01 | 1.77 | 1.93 | 0.000849 | 0.12 | 4.97 | 0.393 |
| DDA_120 | DDA | 120 | 2,074,019,899 | 22.2 | 5.23 | 10.0 | 0.000681 | 0.359 | 0.577 | 1.15 |
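
To put these numbers into perspective, the ratio between loading a raw .d folder and loading the corresponding HDF file can be computed directly from the reported timings. The snippet below is a small sketch that copies the values from the overview by hand. The variable names are illustrative only.

```python
# Load times in seconds, copied from the overview table above.
load_d = {
    "DDA_6": 1.6, "DIA_6": 1.14, "DDA_21": 3.35,
    "DIA_21": 5.01, "DDA_120": 22.2,
}
load_hdf = {
    "DDA_6": 0.466, "DIA_6": 0.3, "DDA_21": 0.758,
    "DIA_21": 1.93, "DDA_120": 10.0,
}

# Speed-up factor of HDF loading relative to raw .d loading.
speedups = {
    sample_id: load_d[sample_id] / load_hdf[sample_id]
    for sample_id in load_d
}

for sample_id, factor in speedups.items():
    print(f"{sample_id}: {factor:.1f}x")
```

For these five samples, loading from HDF is between roughly 2.2 and 4.4 times faster than loading the raw folder.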
