# Performance

This notebook demonstrates the performance of AlphaTIMS.
It covers the following topics:
1. [**Samples**](#Samples)
2. [**Reading raw Bruker .d folders**](#Reading-raw-Bruker-.d-folders)
3. [**Saving HDF files**](#Saving-HDF-files)
4. [**Reading HDF files**](#Reading-HDF-files)
5. [**Slicing data**](#Slicing-data)
6. [**Final overview**](#Final-overview)

All timings were measured on the following system:

In [1]:
import alphatims.utils
import pandas as pd
import numpy as np

alphatims.utils.set_threads(8)
log_file_name = alphatims.utils.set_logger(
    log_file_name="performance_log.txt",
    overwrite=True
)
alphatims.utils.show_platform_info()
alphatims.utils.show_python_info()
alphatims.utils.set_progress_callback(None)

2021-04-22 13:40:30> Platform information:
2021-04-22 13:40:30> system     - Darwin
2021-04-22 13:40:30> release    - 19.6.0
2021-04-22 13:40:30> version    - 10.15.7
2021-04-22 13:40:30> machine    - x86_64
2021-04-22 13:40:30> processor  - i386
2021-04-22 13:40:30> cpu count  - 8
2021-04-22 13:40:30> ram        - 26.8/32.0 Gb (available/total)
2021-04-22 13:40:30> 
2021-04-22 13:40:30> Python information:
2021-04-22 13:40:30> alphatims  - 0.2.5
2021-04-22 13:40:30> bokeh      - 2.2.3
2021-04-22 13:40:30> click      - 7.1.2
2021-04-22 13:40:30> datashader - 0.12.1
2021-04-22 13:40:30> h5py       - 3.2.1
2021-04-22 13:40:30> hvplot     - 0.7.1
2021-04-22 13:40:30> numba      - 0.53.1
2021-04-22 13:40:30> pandas     - 1.2.3
2021-04-22 13:40:30> psutil     - 5.8.0
2021-04-22 13:40:30> python     - 3.8.8
2021-04-22 13:40:30> pyzstd     - 0.14.4
2021-04-22 13:40:30> selenium   - 3.141.0
2021-04-22 13:40:30> tqdm       - 4.59.0
2021-04-22 13:40:30> 


## Samples

Five samples are compared: three DDA and two DIA acquisitions, with gradient lengths of 6, 21, and 120 minutes:

In [2]:
file_names = {
    "DDA_6": "/Users/swillems/Data/alphatims_testing/20201016_tims03_Evo03_PS_MA_HeLa_200ng_DDA_06-15_5_6min_4cm_S1-A1_1_21717.d",
    "DIA_6": "/Users/swillems/Data/alphatims_testing/20201016_tims03_Evo03_PS_SA_HeLa_200ng_300-1200_2steps_16scans_06-15_5_6min_4cm_S1-A4_1_21720.d",
    "DDA_21": "/Users/swillems/Data/alphatims_testing/20201207_tims03_Evo03_PS_SA_HeLa_200ng_EvoSep_prot_DDA_21min_8cm_S1-C10_1_22476.d",
    "DIA_21": "/Users/swillems/Data/alphatims_testing/20201207_tims03_Evo03_PS_SA_HeLa_200ng_EvoSep_prot_high_speed_21min_8cm_S1-C8_1_22474.d",
    "DDA_120": "/Users/swillems/Data/alphatims_testing/HeLa_200ng_1428.d"
}
overview = pd.DataFrame(index=file_names.keys())
overview["Type"] = pd.Series({x: x.split("_")[0] for x in file_names})
overview["Gradient (min)"] = pd.Series({x: x.split("_")[1] for x in file_names})

Before timing anything, we load each file once to collect its basic statistics:

In [3]:
import logging
import alphatims.bruker

timstof_objects = {}
detector_strikes = {}
for sample_id, file_name in file_names.items():
    logging.info(f"Initial loading of {sample_id}")
    timstof_objects[sample_id] = alphatims.bruker.TimsTOF(file_name)
    detector_strikes[sample_id] = len(timstof_objects[sample_id])
    logging.info("")

2021-04-22 13:40:30> 
2021-04-22 13:40:31> Initial loading of DDA_6
2021-04-22 13:40:31> Importing data from /Users/swillems/Data/alphatims_testing/20201016_tims03_Evo03_PS_MA_HeLa_200ng_DDA_06-15_5_6min_4cm_S1-A1_1_21717.d
2021-04-22 13:40:31> Reading frame metadata for /Users/swillems/Data/alphatims_testing/20201016_tims03_Evo03_PS_MA_HeLa_200ng_DDA_06-15_5_6min_4cm_S1-A1_1_21717.d
2021-04-22 13:40:31> Reading 2,978 frames with 214,172,697 detector strikes for /Users/swillems/Data/alphatims_testing/20201016_tims03_Evo03_PS_MA_HeLa_200ng_DDA_06-15_5_6min_4cm_S1-A1_1_21717.d
2021-04-22 13:40:33> Indexing /Users/swillems/Data/alphatims_testing/20201016_tims03_Evo03_PS_MA_HeLa_200ng_DDA_06-15_5_6min_4cm_S1-A1_1_21717.d...
2021-04-22 13:40:33> Succesfully imported data from /Users/swillems/Data/alphatims_testing/20201016_tims03_Evo03_PS_MA_HeLa_200ng_DDA_06-15_5_6min_4cm_S1-A1_1_21717.d
2021-04-22 13:40:33> 
2021-04-22 13:40:33> Initial loading of DIA_6
2021-04-22 13:40:33> Importing data

In [4]:
overview["Detector strikes"] = pd.Series(detector_strikes).apply(
    lambda x : f"{x:,}"
)

In summary, we consider the following samples:

In [5]:
overview

| | Type | Gradient (min) | Detector strikes |
|---|---|---|---|
| DDA_6 | DDA | 6 | 214,172,697 |
| DIA_6 | DIA | 6 | 158,552,099 |
| DDA_21 | DDA | 21 | 295,251,252 |
| DIA_21 | DIA | 21 | 730,564,765 |
| DDA_120 | DDA | 120 | 2,074,019,899 |


## Reading raw Bruker .d folders

To limit the impact of unrelated system activity, we use the `%timeit` magic, which repeats each measurement several times, to get a robust estimate of loading times for raw Bruker .d folders:

In [6]:
alphatims.utils.set_logger(stream=False)
raw_load_times = {}
for sample_id, file_name in file_names.items():
    print(f"Time to load {sample_id} raw Bruker .d folder:")
    raw_load_time = %timeit -o tmp = alphatims.bruker.TimsTOF(file_name)
    raw_load_times[sample_id] = np.average(raw_load_time.timings)
    print("")

Time to load DDA_6 raw Bruker .d folder:
1.82 s ± 28.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Time to load DIA_6 raw Bruker .d folder:
1.24 s ± 18.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Time to load DDA_21 raw Bruker .d folder:
3.38 s ± 60.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Time to load DIA_21 raw Bruker .d folder:
5.33 s ± 288 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Time to load DDA_120 raw Bruker .d folder:
23.9 s ± 3.06 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
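
The `%timeit` magic is only available in IPython and Jupyter. As a rough sketch of a comparable measurement in a plain Python script, the standard-library `timeit` module can be used. The helper `time_call` and the stand-in `example_workload` below are hypothetical names introduced for illustration and are not part of AlphaTIMS:

```python
import statistics
import timeit


def time_call(func, repeat=7, number=1):
    """Return (mean, population std. dev.) in seconds per call of func."""
    # timeit.repeat returns one total time per run;
    # each run calls func `number` times.
    totals = timeit.repeat(func, repeat=repeat, number=number)
    per_call = [total / number for total in totals]
    return statistics.mean(per_call), statistics.pstdev(per_call)


def example_workload():
    # Hypothetical stand-in for an expensive call such as loading a file.
    return sum(i * i for i in range(100_000))


mean_s, std_s = time_call(example_workload)
print(f"{mean_s:#.3g} s ± {std_s:#.3g} s per loop (7 runs, 1 loop each)")
```

Passing a callable such as `lambda: alphatims.bruker.TimsTOF(file_name)` instead of `example_workload` would time the loading step itself.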



In [7]:
overview["Load .d (s)"] = pd.Series(raw_load_times).apply(
    lambda x : f"{x:#.3g}"
)
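
The two format specifications used for the overview columns behave as follows: `,` inserts thousands separators, and `#.3g` rounds to three significant digits while keeping trailing zeros. A minimal sketch with values taken from this notebook (the helper names `format_count` and `format_seconds` are hypothetical):

```python
def format_count(value):
    # "," groups digits in thousands.
    return f"{value:,}"


def format_seconds(value):
    # "#.3g" keeps three significant digits, including trailing zeros.
    return f"{value:#.3g}"


print(format_count(214172697))   # 214,172,697
print(format_seconds(23.9))      # 23.9
print(format_seconds(0.81))      # 0.810
print(f"{0.81:.3g}")             # 0.81 (without "#", trailing zeros are dropped)
print(format_seconds(0.002))     # 0.00200
```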

## Saving HDF files

Each dataset can also be exported to an HDF file:

In [8]:
save_hdf_times = {}
for sample_id, data in timstof_objects.items():
    print(f"Time to export {sample_id} to HDF file:")
    path = data.directory
    file_name = f"{data.sample_name}.hdf"
    save_hdf_time = %timeit -o tmp = data.save_as_hdf(path, file_name, overwrite=True)
    save_hdf_times[sample_id] = np.average(save_hdf_time.timings)
    print("")

Time to export DDA_6 to HDF file:
572 ms ± 17.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Time to export DIA_6 to HDF file:
415 ms ± 23.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Time to export DDA_21 to HDF file:
810 ms ± 36.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Time to export DIA_21 to HDF file:
1.86 s ± 106 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Time to export DDA_120 to HDF file:
5.18 s ± 166 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)



In [9]:
overview["Save .hdf (s)"] = pd.Series(save_hdf_times).apply(
    lambda x : f"{x:#.3g}"
)

## Reading HDF files

Once these HDF files exist, they load roughly 2 to 4 times faster than the raw Bruker .d folders in the measurements below:

In [11]:
import os
load_hdf_times = {}
for sample_id, data in timstof_objects.items():
    print(f"Time to load {sample_id} HDF file:")
    file_name = os.path.join(data.directory, f"{data.sample_name}.hdf")
    load_hdf_time = %timeit -o tmp = alphatims.bruker.TimsTOF(file_name)
    load_hdf_times[sample_id] = np.average(load_hdf_time.timings)
    print("")

Time to load DDA_6 HDF file:
508 ms ± 16.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Time to load DIA_6 HDF file:
359 ms ± 20.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Time to load DDA_21 HDF file:
871 ms ± 51.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Time to load DIA_21 HDF file:
2.06 s ± 108 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Time to load DDA_120 HDF file:
10.5 s ± 463 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
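
To put these numbers next to the raw loading times, the speed-up per sample can be computed directly. This sketch simply copies the mean values reported by `%timeit` in this notebook into two dictionaries (the variable names are illustrative):

```python
# Mean load times in seconds, copied from the %timeit output in this notebook.
load_raw_s = {
    "DDA_6": 1.82, "DIA_6": 1.24, "DDA_21": 3.38,
    "DIA_21": 5.33, "DDA_120": 23.9,
}
load_hdf_s = {
    "DDA_6": 0.508, "DIA_6": 0.359, "DDA_21": 0.871,
    "DIA_21": 2.06, "DDA_120": 10.5,
}

# Ratio of raw .d load time to HDF load time per sample.
speedups = {
    sample_id: load_raw_s[sample_id] / load_hdf_s[sample_id]
    for sample_id in load_raw_s
}
for sample_id, speedup in speedups.items():
    print(f"{sample_id}: HDF loads {speedup:.1f}x faster than .d")
```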



In [12]:
overview["Load .hdf (s)"] = pd.Series(load_hdf_times).apply(
    lambda x : f"{x:#.3g}"
)

## Slicing data

Lastly, we can slice the data. Slicing relies on Numba JIT compilation, so we first trigger compilation with an initial slice call to keep compile time out of the measurements:

In [13]:
tmp = timstof_objects["DDA_6"][0]
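
The reason for this warm-up call can be illustrated without Numba. The sketch below uses a hypothetical function that simulates a one-off compilation cost on its first call; it is not how Numba or AlphaTIMS work internally, only an illustration of why the first call should be excluded from timings:

```python
import time

_compiled = {}


def lazily_compiled_square_sum(n):
    """Hypothetical stand-in for a JIT-compiled function."""
    if "kernel" not in _compiled:
        time.sleep(0.05)  # simulated one-off compilation cost
        _compiled["kernel"] = lambda m: sum(i * i for i in range(m))
    return _compiled["kernel"](n)


def timed(func, *args):
    # Return the result and the elapsed wall-clock time in seconds.
    start = time.perf_counter()
    result = func(*args)
    return result, time.perf_counter() - start


_, first_call = timed(lazily_compiled_square_sum, 1000)
_, second_call = timed(lazily_compiled_square_sum, 1000)
print(f"first call:  {first_call:.4f} s (includes simulated compilation)")
print(f"second call: {second_call:.4f} s")
```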

Now we can time how long it takes to slice along each of the four dimensions:
   * **LC:** $100.0 \leq \textrm{retention_time} \lt 100.5$
   * **TIMS:** $\textrm{scan_index} = 450$
   * **Quadrupole:** $700.0 \leq \textrm{quad_mz_values} \lt 710.0$
   * **TOF:** $621.9 \leq \textrm{tof_mz_values} \lt 622.1$
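
For readers unfamiliar with multi-dimensional slicing syntax, the following sketch shows what Python passes to an object's `__getitem__` for each of these four slices. `KeyRecorder` is a hypothetical helper that only echoes its key; how AlphaTIMS interprets the keys is not shown here:

```python
class KeyRecorder:
    """Hypothetical helper that returns whatever key is passed to []."""

    def __getitem__(self, key):
        return key


recorder = KeyRecorder()

lc_key = recorder[100.0:100.5]            # one slice with float bounds
tims_key = recorder[:, 450]               # tuple: full slice, then an integer
quad_key = recorder[:, :, 700.0:710]      # tuple with three entries
tof_key = recorder[:, :, :, 621.9:622.1]  # tuple with four entries

print(lc_key)    # slice(100.0, 100.5, None)
print(tims_key)  # (slice(None, None, None), 450)
print(len(quad_key), quad_key[2])
print(len(tof_key), tof_key[3])
```

In the list above, the float bounds correspond to retention time and m/z values, while the integer `450` is a scan index.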

In [14]:
import os
frame_slice_times = {}
scan_slice_times = {}
quad_slice_times = {}
tof_slice_times = {}
for sample_id, data in timstof_objects.items():
    print(f"Time to slice {sample_id}:")
    
    print("Testing slice data[100.0: 100.5]")
    frame_slice_time = %timeit -o tmp = data[100.:100.5]
    frame_slice_times[sample_id] = np.average(frame_slice_time.timings)
    
    print("Testing slice data[:, 450]")
    scan_slice_time = %timeit -o tmp = data[:, 450]
    scan_slice_times[sample_id] = np.average(scan_slice_time.timings)
    
    print("Testing slice data[:, :, 700.0: 710]")
    quad_slice_time = %timeit -o tmp = data[:, :, 700.0: 710]
    quad_slice_times[sample_id] = np.average(quad_slice_time.timings)
    
    print("Testing slice data[:, :, :, 621.9: 622.1]")
    tof_slice_time = %timeit -o tmp = data[:, :, :, 621.9: 622.1]
    tof_slice_times[sample_id] = np.average(tof_slice_time.timings)
    
    print("")

Time to slice DDA_6:
Testing slice data[100.0: 100.5]
2 ms ± 142 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Testing slice data[:, 450]
45.7 ms ± 3.52 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Testing slice data[:, :, 700.0: 710]
28.4 ms ± 1.68 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Testing slice data[:, :, :, 621.9: 622.1]
87.5 ms ± 1.05 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Time to slice DIA_6:
Testing slice data[100.0: 100.5]
8.47 ms ± 679 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Testing slice data[:, 450]
33.5 ms ± 6.75 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Testing slice data[:, :, 700.0: 710]
734 ms ± 18.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Testing slice data[:, :, :, 621.9: 622.1]
106 ms ± 3.83 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Time to slice DDA_21:
Testing slice data[100.0: 100.5]
2.14 ms ± 188 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [26]:
overview["Slice LC (s)"] = pd.Series(frame_slice_times).apply(
    lambda x : f"{x:#.3g}"
)
overview["Slice TIMS (s)"] = pd.Series(scan_slice_times).apply(
    lambda x : f"{x:#.3g}"
)
overview["Slice QUAD (s)"] = pd.Series(quad_slice_times).apply(
    lambda x : f"{x:#.3g}"
)
overview["Slice TOF (s)"] = pd.Series(tof_slice_times).apply(
    lambda x : f"{x:#.3g}"
)

## Final overview

The measured timings can be summarized as follows:

In [37]:
overview

| | Type | Gradient (min) | Detector strikes | Load .d (s) | Save .hdf (s) | Load .hdf (s) | Slice frame (s) | Slice scan (s) | Slice quad (s) | Slice tof (s) |
|---|---|---|---|---|---|---|---|---|---|---|
| DDA_6 | DDA | 6 | 214,172,697 | 1.82 | 0.572 | 0.508 | 0.002 | 0.0457 | 0.0284 | 0.0875 |
| DIA_6 | DIA | 6 | 158,552,099 | 1.24 | 0.415 | 0.359 | 0.00847 | 0.0335 | 0.734 | 0.106 |
| DDA_21 | DDA | 21 | 295,251,252 | 3.38 | 0.81 | 0.871 | 0.00214 | 0.0729 | 0.115 | 0.203 |
| DIA_21 | DIA | 21 | 730,564,765 | 5.33 | 1.86 | 2.06 | 0.00102 | 0.135 | 5.58 | 0.414 |
| DDA_120 | DDA | 120 | 2,074,019,899 | 23.9 | 5.18 | 10.5 | 0.000746 | 0.364 | 0.617 | 1.25 |
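
Because the samples differ greatly in size, it can help to normalize loading time by the number of detector strikes. The sketch below derives a raw-loading throughput from the values in this overview; the variable names are illustrative and the numbers are copied from the table:

```python
# Detector strikes and mean raw .d load times (s), copied from the overview.
detector_strikes = {
    "DDA_6": 214172697, "DIA_6": 158552099, "DDA_21": 295251252,
    "DIA_21": 730564765, "DDA_120": 2074019899,
}
load_raw_s = {
    "DDA_6": 1.82, "DIA_6": 1.24, "DDA_21": 3.38,
    "DIA_21": 5.33, "DDA_120": 23.9,
}

# Throughput in millions of detector strikes loaded per second.
throughput = {
    sample_id: detector_strikes[sample_id] / load_raw_s[sample_id] / 1e6
    for sample_id in detector_strikes
}
for sample_id, value in throughput.items():
    print(f"{sample_id}: {value:.0f} million detector strikes per second")
```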
