# Performance

This notebook demonstrates the performance of AlphaTIMS.
It covers the following topics:
1. [**Samples**](#Samples)
2. [**Reading raw Bruker .d folders**](#Reading-raw-Bruker-.d-folders)
3. [**Saving HDF files**](#Saving-HDF-files)
4. [**Reading HDF files**](#Reading-HDF-files)
5. [**Slicing data**](#Slicing-data)
6. [**Final overview**](#Final-overview)

The following system was used:

In [1]:
import alphatims.utils
import pandas as pd
import numpy as np

alphatims.utils.set_threads(8)
log_file_name = alphatims.utils.set_logger(
    log_file_name="performance_log.txt",
    overwrite=True
)
alphatims.utils.show_platform_info()
alphatims.utils.show_python_info()
alphatims.utils.set_progress_callback(None)

2021-05-05 09:39:16> Platform information:
2021-05-05 09:39:16> system        - Darwin
2021-05-05 09:39:16> release       - 19.6.0
2021-05-05 09:39:16> version       - 10.15.7
2021-05-05 09:39:16> machine       - x86_64
2021-05-05 09:39:16> processor     - i386
2021-05-05 09:39:16> cpu count     - 8
2021-05-05 09:39:16> cpu frequency - 2300.00 Mhz
2021-05-05 09:39:16> ram           - 22.0/32.0 Gb (available/total)
2021-05-05 09:39:16> 
2021-05-05 09:39:16> Python information:
2021-05-05 09:39:16> alphatims  - 0.2.6
2021-05-05 09:39:16> bokeh      - 2.2.3
2021-05-05 09:39:16> click      - 7.1.2
2021-05-05 09:39:16> datashader - 0.12.1
2021-05-05 09:39:16> h5py       - 3.2.1
2021-05-05 09:39:16> hvplot     - 0.7.1
2021-05-05 09:39:16> numba      - 0.53.1
2021-05-05 09:39:16> pandas     - 1.2.3
2021-05-05 09:39:16> psutil     - 5.8.0
2021-05-05 09:39:16> python     - 3.8.8
2021-05-05 09:39:16> pyzstd     - 0.14.4
2021-05-05 09:39:16> selenium   - 3.141.0
2021-05-05 09:39:16> tqdm       - 

## Samples

Five samples are used and compared:

In [2]:
file_names = {
    "DDA_6": "/Users/swillems/Data/alphatims_testing/20201016_tims03_Evo03_PS_MA_HeLa_200ng_DDA_06-15_5_6min_4cm_S1-A1_1_21717.d",
    "DIA_6": "/Users/swillems/Data/alphatims_testing/20201016_tims03_Evo03_PS_SA_HeLa_200ng_300-1200_2steps_16scans_06-15_5_6min_4cm_S1-A4_1_21720.d",
    "DDA_21": "/Users/swillems/Data/alphatims_testing/20201207_tims03_Evo03_PS_SA_HeLa_200ng_EvoSep_prot_DDA_21min_8cm_S1-C10_1_22476.d",
    "DIA_21": "/Users/swillems/Data/alphatims_testing/20201207_tims03_Evo03_PS_SA_HeLa_200ng_EvoSep_prot_high_speed_21min_8cm_S1-C8_1_22474.d",
    "DDA_120": "/Users/swillems/Data/alphatims_testing/HeLa_200ng_1428.d"
}
overview = pd.DataFrame(index=file_names.keys())
overview["Type"] = pd.Series({x: x.split("_")[0] for x in file_names})
overview["Gradient (min)"] = pd.Series({x: x.split("_")[1] for x in file_names})

We first load all files once to collect their basic statistics before doing the actual timing:

In [3]:
import logging
import alphatims.bruker

timstof_objects = {}
detector_strikes = {}
for sample_id, file_name in file_names.items():
    logging.info(f"Initial loading of {sample_id}")
    timstof_objects[sample_id] = alphatims.bruker.TimsTOF(file_name)
    detector_strikes[sample_id] = len(timstof_objects[sample_id])
    logging.info("")

2021-05-05 09:39:16> 
2021-05-05 09:39:16> Initial loading of DDA_6
2021-05-05 09:39:16> Importing data from /Users/swillems/Data/alphatims_testing/20201016_tims03_Evo03_PS_MA_HeLa_200ng_DDA_06-15_5_6min_4cm_S1-A1_1_21717.d
2021-05-05 09:39:16> Reading frame metadata for /Users/swillems/Data/alphatims_testing/20201016_tims03_Evo03_PS_MA_HeLa_200ng_DDA_06-15_5_6min_4cm_S1-A1_1_21717.d
2021-05-05 09:39:17> Reading 2,978 frames with 214,172,697 detector strikes for /Users/swillems/Data/alphatims_testing/20201016_tims03_Evo03_PS_MA_HeLa_200ng_DDA_06-15_5_6min_4cm_S1-A1_1_21717.d
2021-05-05 09:39:18> Indexing /Users/swillems/Data/alphatims_testing/20201016_tims03_Evo03_PS_MA_HeLa_200ng_DDA_06-15_5_6min_4cm_S1-A1_1_21717.d...
2021-05-05 09:39:18> Succesfully imported data from /Users/swillems/Data/alphatims_testing/20201016_tims03_Evo03_PS_MA_HeLa_200ng_DDA_06-15_5_6min_4cm_S1-A1_1_21717.d
2021-05-05 09:39:18> 
2021-05-05 09:39:18> Initial loading of DIA_6
2021-05-05 09:39:18> Importing data

In [4]:
overview["Detector strikes"] = pd.Series(detector_strikes).apply(
    lambda x : f"{x:,}"
)

In summary, we thus consider the following samples:

In [5]:
overview

| Sample | Type | Gradient (min) | Detector strikes |
|---|---|---|---|
| DDA_6 | DDA | 6 | 214172697 |
| DIA_6 | DIA | 6 | 158552099 |
| DDA_21 | DDA | 21 | 295251252 |
| DIA_21 | DIA | 21 | 730564765 |
| DDA_120 | DDA | 120 | 2074019899 |


## Reading raw Bruker .d folders

To limit the influence of unrelated system activity, we use the `%timeit` magic, which repeats each measurement several times, to get a robust estimate of loading times for raw Bruker .d folders:

In [6]:
alphatims.utils.set_logger(stream=False)
raw_load_times = {}
for sample_id, file_name in file_names.items():
    print(f"Time to load {sample_id} raw Bruker .d folder:")
    raw_load_time = %timeit -o tmp = alphatims.bruker.TimsTOF(file_name)
    raw_load_times[sample_id] = np.average(raw_load_time.timings)
    print("")

Time to load DDA_6 raw Bruker .d folder:
1.48 s ± 7.67 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Time to load DIA_6 raw Bruker .d folder:
1.05 s ± 14.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Time to load DDA_21 raw Bruker .d folder:
2.95 s ± 23.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Time to load DIA_21 raw Bruker .d folder:
4.5 s ± 83.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Time to load DDA_120 raw Bruker .d folder:
21 s ± 1.78 s per loop (mean ± std. dev. of 7 runs, 1 loop each)



In [7]:
overview["Load .d (s)"] = pd.Series(raw_load_times).apply(
    lambda x : f"{x:#.3g}"
)
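
The `%timeit -o` magic used above is only available inside IPython or Jupyter. As a minimal sketch, a comparable measurement can be made in a plain Python script with the standard-library `timeit` module. The function `load_dummy_dataset` is a hypothetical stand-in for an expensive call such as loading a .d folder, and the use of the population standard deviation is an assumption made for this illustration.

```python
import statistics
import timeit


def load_dummy_dataset(size=100_000):
    """Hypothetical stand-in for an expensive loading call."""
    return sum(range(size))


# Mirror the "7 runs, 1 loop each" setup reported in the outputs above.
timings = timeit.repeat(load_dummy_dataset, repeat=7, number=1)

# Summarize the runs by their mean and (population) standard deviation.
mean_time = statistics.mean(timings)
std_time = statistics.pstdev(timings)
print(
    f"{mean_time:.3g} s ± {std_time:.3g} s per loop "
    f"(mean ± std. dev. of {len(timings)} runs)"
)
```

Repeating the call and reporting the spread makes it easier to spot runs that were disturbed by other processes than a single measurement would.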

## Saving HDF files

Each dataset can also be exported to an HDF file:

In [8]:
save_hdf_times = {}
for sample_id, data in timstof_objects.items():
    print(f"Time to export {sample_id} to HDF file:")
    path = data.directory
    file_name = f"{data.sample_name}.hdf"
    save_hdf_time = %timeit -o tmp = data.save_as_hdf(path, file_name, overwrite=True)
    save_hdf_times[sample_id] = np.average(save_hdf_time.timings)
    print("")

Time to export DDA_6 to HDF file:
562 ms ± 17.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Time to export DIA_6 to HDF file:
407 ms ± 23.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Time to export DDA_21 to HDF file:
761 ms ± 29.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Time to export DIA_21 to HDF file:
1.83 s ± 70.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Time to export DDA_120 to HDF file:
4.93 s ± 122 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)



In [9]:
overview["Save .hdf (s)"] = pd.Series(save_hdf_times).apply(
    lambda x : f"{x:#.3g}"
)
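
The format specifications used for the overview columns may be unfamiliar, so here is a short illustration of what they do. The `#` flag in `#.3g` keeps trailing zeros, so every timing is shown with three significant digits, and `,` inserts thousands separators. The sample values are taken from the outputs in this notebook.

```python
# Mean timings in seconds, as reported by the `%timeit` outputs above.
timings = [4.5, 21.0, 0.562]

# With '#', trailing zeros are kept: three significant digits everywhere.
with_flag = [f"{x:#.3g}" for x in timings]
# Without '#', trailing zeros are dropped.
without_flag = [f"{x:.3g}" for x in timings]

print(with_flag)     # ['4.50', '21.0', '0.562']
print(without_flag)  # ['4.5', '21', '0.562']

# Thousands separators for large counts such as detector strikes.
strikes = f"{214172697:,}"
print(strikes)       # 214,172,697
```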

## Reading HDF files

Once these HDF files are created, they can be loaded considerably faster than raw Bruker .d folders, roughly 2 to 4 times faster for the samples tested here:

In [10]:
import os
load_hdf_times = {}
for sample_id, data in timstof_objects.items():
    print(f"Time to load {sample_id} HDF file:")
    file_name = os.path.join(data.directory, f"{data.sample_name}.hdf")
    load_hdf_time = %timeit -o tmp = alphatims.bruker.TimsTOF(file_name)
    load_hdf_times[sample_id] = np.average(load_hdf_time.timings)
    print("")

Time to load DDA_6 HDF file:
458 ms ± 7.63 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Time to load DIA_6 HDF file:
307 ms ± 2.46 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Time to load DDA_21 HDF file:
731 ms ± 5.42 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Time to load DIA_21 HDF file:
1.89 s ± 116 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Time to load DDA_120 HDF file:
10 s ± 170 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)



In [11]:
overview["Load .hdf (s)"] = pd.Series(load_hdf_times).apply(
    lambda x : f"{x:#.3g}"
)
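
To put the two loading routes side by side, the following sketch computes the speed-up of loading an HDF file relative to loading the raw .d folder. The numbers are the mean times copied from the `%timeit` outputs in this notebook. The dictionary names are chosen for this illustration only.

```python
# Mean load times in seconds, copied from the `%timeit` outputs above.
load_d_seconds = {
    "DDA_6": 1.48, "DIA_6": 1.05, "DDA_21": 2.95,
    "DIA_21": 4.5, "DDA_120": 21.0,
}
load_hdf_seconds = {
    "DDA_6": 0.458, "DIA_6": 0.307, "DDA_21": 0.731,
    "DIA_21": 1.89, "DDA_120": 10.0,
}

# Speed-up factor: how many times faster the HDF file loads.
speedups = {
    sample_id: load_d_seconds[sample_id] / load_hdf_seconds[sample_id]
    for sample_id in load_d_seconds
}
for sample_id, factor in speedups.items():
    print(f"{sample_id:>8}: {factor:.1f}x faster from HDF")
```

For these five samples the factor ranges from about 2.1 (DDA_120) to about 4.0 (DDA_21).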

## Slicing data

Lastly, we can slice this data. Since this uses Numba JIT compilation, we first compile the relevant functions with an initial slice call:

In [12]:
tmp = timstof_objects["DDA_6"][0]

Now we can time how long it takes to slice in different dimensions:
   * **LC:** $100.0 \leq \textrm{retention_time} \lt 100.5$
   * **TIMS:** $\textrm{scan_index} = 450$
   * **Quadrupole:** $700.0 \leq \textrm{quad_mz_values} \lt 710.0$
   * **TOF:** $621.9 \leq \textrm{tof_mz_values} \lt 622.1$

In [13]:
frame_slice_times = {}
scan_slice_times = {}
quad_slice_times = {}
tof_slice_times = {}

frame_slice_counts = {}
scan_slice_counts = {}
quad_slice_counts = {}
tof_slice_counts = {}

for sample_id, data in timstof_objects.items():
    print(f"Time to slice {sample_id}:")
    
    count = len(data[100.:100.5])
    print(
        f"Testing LC slice data[100.0: 100.5] ({count:,} detector hits)."
    )
    frame_slice_time = %timeit -o tmp = data[100.:100.5]
    frame_slice_counts[sample_id] = count
    frame_slice_times[sample_id] = np.average(frame_slice_time.timings)
    
    count = len(data[:, 450])
    print(
        f"Testing TIMS slice data[:, 450] ({count:,} detector hits)."
    )
    scan_slice_time = %timeit -o tmp = data[:, 450]
    scan_slice_counts[sample_id] = count
    scan_slice_times[sample_id] = np.average(scan_slice_time.timings)
    
    count = len(data[:, :, 700.0: 710])
    print(
        f"Testing QUAD slice data[:, :, 700.0: 710] ({count:,} detector hits)."
    )
    quad_slice_time = %timeit -o tmp = data[:, :, 700.0: 710]
    quad_slice_counts[sample_id] = count
    quad_slice_times[sample_id] = np.average(quad_slice_time.timings)
    
    count = len(data[:, :, :, 621.9: 622.1])
    print(
        f"Testing TOF slice data[:, :, :, 621.9: 622.1] ({count:,} detector hits)."
    )
    tof_slice_time = %timeit -o tmp = data[:, :, :, 621.9: 622.1]
    tof_slice_counts[sample_id] = count
    tof_slice_times[sample_id] = np.average(tof_slice_time.timings)
    
    print("")

Time to slice DDA_6:
Testing LC slice data[100.0: 100.5] (18,034 detector hits).
1.66 ms ± 68.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Testing TIMS slice data[:, 450] (393,184 detector hits).
36.5 ms ± 3.18 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Testing QUAD slice data[:, :, 700.0: 710] (185,235 detector hits).
26.4 ms ± 645 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Testing TOF slice data[:, :, :, 621.9: 622.1] (52,422 detector hits).
76 ms ± 874 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

Time to slice DIA_6:
Testing LC slice data[100.0: 100.5] (86,623 detector hits).
6.18 ms ± 63.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Testing TIMS slice data[:, 450] (277,025 detector hits).
23 ms ± 969 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Testing QUAD slice data[:, :, 700.0: 710] (8,346,135 detector hits).
692 ms ± 11.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Testing TOF slice data[:, :

In [14]:
overview["Slice LC (s)"] = pd.Series(frame_slice_times).apply(
    lambda x : f"{x:#.3g}"
)
overview["Slice LC (hits)"] = pd.Series(frame_slice_counts).apply(
    lambda x : f"{x:,}"
)
overview["Slice TIMS (s)"] = pd.Series(scan_slice_times).apply(
    lambda x : f"{x:#.3g}"
)
overview["Slice TIMS (hits)"] = pd.Series(scan_slice_counts).apply(
    lambda x : f"{x:,}"
)
overview["Slice QUAD (s)"] = pd.Series(quad_slice_times).apply(
    lambda x : f"{x:#.3g}"
)
overview["Slice QUAD (hits)"] = pd.Series(quad_slice_counts).apply(
    lambda x : f"{x:,}"
)
overview["Slice TOF (s)"] = pd.Series(tof_slice_times).apply(
    lambda x : f"{x:#.3g}"
)
overview["Slice TOF (hits)"] = pd.Series(tof_slice_counts).apply(
    lambda x : f"{x:,}"
)
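
The slices timed above use half-open intervals: the lower bound is included and the upper bound is excluded, as in $100.0 \leq \textrm{retention_time} \lt 100.5$. The sketch below illustrates this convention on a small sorted list with the standard-library `bisect` module. It is not AlphaTIMS' implementation; the function `half_open_slice` and the retention times are hypothetical and serve only to show the inequality semantics.

```python
from bisect import bisect_left

# Hypothetical sorted retention times in seconds (not real data).
retention_times = [99.8, 100.0, 100.2, 100.4, 100.5, 100.7]


def half_open_slice(sorted_values, lower, upper):
    """Return all values v with lower <= v < upper from a sorted list."""
    # bisect_left finds the first index whose value is >= the bound,
    # so the lower bound is included and the upper bound is excluded.
    start = bisect_left(sorted_values, lower)
    stop = bisect_left(sorted_values, upper)
    return sorted_values[start:stop]


selected = half_open_slice(retention_times, 100.0, 100.5)
print(selected)  # [100.0, 100.2, 100.4]
```

Half-open intervals have the convenient property that adjacent slices such as `[100.0, 100.5)` and `[100.5, 101.0)` never share a value.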

## Final overview

The final time performance can thus be summarized as follows:

In [15]:
overview.T

| | DDA_6 | DIA_6 | DDA_21 | DIA_21 | DDA_120 |
|---|---|---|---|---|---|
| Type | DDA | DIA | DDA | DIA | DDA |
| Gradient (min) | 6 | 6 | 21 | 21 | 120 |
| Detector strikes | 214172697 | 158552099 | 295251252 | 730564765 | 2074019899 |
| Load .d (s) | 1.48 | 1.05 | 2.95 | 4.50 | 21.0 |
| Save .hdf (s) | 0.562 | 0.407 | 0.761 | 1.83 | 4.93 |
| Load .hdf (s) | 0.458 | 0.307 | 0.731 | 1.89 | 10.0 |
| Slice LC (s) | 0.00166 | 0.00618 | 0.00170 | 0.000870 | 0.000698 |
| Slice LC (hits) | 18034 | 86623 | 20923 | 5098 | 2331 |
| Slice TIMS (s) | 0.0365 | 0.0230 | 0.0638 | 0.128 | 0.342 |
| Slice TIMS (hits) | 393184 | 277025 | 658010 | 1353926 | 3988021 |
