In [1]:
# default_exp performance

# Performance

> Parallelization for GPU and CPU


AlphaPept deals with high-throughput data. As this can be computationally intensive, we try to make all functions as performant as possible. To do so, we rely on two principles:
* **Compilation**
* **Parallelization**

A first step of **compilation** can be achieved by using NumPy arrays, which are already heavily C-optimized. Next, we consider three kinds of compilation:
* **Python** This uses no compilation.
* **Numba** This uses just-in-time (JIT) compilation.
* **Cuda** This compiles functions for the GPU.

All of these compilation approaches can be combined with **parallelization** approaches. We consider the following possibilities:
* **No parallelization** Not all functionality can be parallelized.
* **Multithreading** This is only performant when Python's global interpreter lock (GIL) is released or when mostly using input/output (I/O) functions.
* **GPU** This is only available if an NVIDIA GPU is available and properly configured.

Note that not all compilation approaches can sensibly be combined with all parallelization approaches.

In [2]:
#export 

COMPILATION_MODE_OPTIONS = [
    "python",
    "python-multithread",
    "numba",
    "numba-multithread",
    "cuda", # Cuda is always multithreaded
]
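
The mode names above combine a compilation backend with an optional threading suffix. As a minimal sketch of that naming scheme, the hypothetical helper `split_compilation_mode` (not part of this module) splits a mode name into its two parts:

```python
# Hypothetical helper for illustration only; it is not part of the module.
def split_compilation_mode(mode: str) -> tuple:
    """Return (backend, multithreaded) for a mode such as 'numba-multithread'."""
    backend, _, suffix = mode.partition("-")
    # Following the comment in the list above, cuda is always treated as multithreaded.
    multithreaded = (suffix == "multithread") or (backend == "cuda")
    return backend, multithreaded


modes = ["python", "python-multithread", "numba", "numba-multithread", "cuda"]
parsed = {mode: split_compilation_mode(mode) for mode in modes}
for mode, (backend, multithreaded) in parsed.items():
    print(f"{mode:20s} backend={backend:7s} multithreaded={multithreaded}")
```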

Next, we import all libraries, taking into account that not every machine has a GPU (with NVIDIA CUDA cores) available:

In [3]:
#export 

import functools
import math
import os
import logging
import psutil
import ast

# Parallelization
import multiprocessing
import threading

# Compilation
import numpy as np
import numba
from numba import cuda
try:
    import cupy
    cuda.get_current_device()
    __GPU_AVAILABLE = True
except ModuleNotFoundError:
    __GPU_AVAILABLE = False
    cupy = None
    logging.info("Cupy is not available")
except cuda.CudaSupportError:
    __GPU_AVAILABLE = False
    logging.info("Cuda device is not available")

def is_valid_compilation_mode(compilation_mode: str) -> None:
    """Check if the provided string is a valid compilation mode.

    Args:
        compilation_mode (str): The compilation mode to verify.

    Raises:
        ModuleNotFoundError: When trying to use an unavailable GPU.
        NotImplementedError: When the compilation mode is not valid.

    """
    if compilation_mode.startswith("cuda") and not __GPU_AVAILABLE:
        raise ModuleNotFoundError('Cuda functions are not available.')
    if compilation_mode not in COMPILATION_MODE_OPTIONS:
        raise NotImplementedError(
            f"Compilation mode '{compilation_mode}' is not available, "
            "see COMPILATION_MODE_OPTIONS for available options."
        )

By default, we will use `cuda` if it is available. If not, `numba-multithread` will be used as default.

In [4]:
#export

if __GPU_AVAILABLE:
    COMPILATION_MODE = "cuda"
else:
    COMPILATION_MODE = "numba-multithread"

To consistently use multiple threads or processes, we can set a global MAX_WORKER_COUNT parameter.

In [5]:
#export 

MAX_WORKER_COUNT = psutil.cpu_count()

def set_worker_count(worker_count: int = 1, set_global: bool = True) -> int:
    """Parse and set the (global) number of threads.

    Args:
        worker_count (int): The number of workers.
            If larger than available cores, it is trimmed to the available maximum.
            If 0, it is set to the maximum cores available.
            If negative, it indicates how many cores NOT to use.
            Default is 1
        set_global (bool): If False, the number of workers is only parsed to a valid value.
            If True, the number of workers is saved as a global variable.
            Default is True.

    Returns:
        int: The parsed worker_count.

    """
    max_cpu_count = psutil.cpu_count()
    if worker_count > max_cpu_count:
        worker_count = max_cpu_count
    else:
        while worker_count <= 0:
            worker_count += max_cpu_count
    if set_global:
        global MAX_WORKER_COUNT
        MAX_WORKER_COUNT = worker_count
    return worker_count
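
To make the parsing rules concrete, here is a minimal sketch that mirrors the clamping logic for a fixed core count. The function name `parse_worker_count` and the explicit `max_cpu_count` argument are assumptions for illustration; the real function queries `psutil.cpu_count()` instead.

```python
def parse_worker_count(worker_count: int, max_cpu_count: int) -> int:
    """Mirror the clamping rules of `set_worker_count` for a fixed core count."""
    if worker_count > max_cpu_count:
        # More workers requested than cores available: trim to the maximum.
        return max_cpu_count
    # Zero or negative values wrap around: 0 means "all cores",
    # -2 means "all cores but two".
    while worker_count <= 0:
        worker_count += max_cpu_count
    return worker_count


# Assume a machine with 8 cores for this example.
examples = {
    count: parse_worker_count(count, max_cpu_count=8)
    for count in (4, 100, 0, -2, -8, -10)
}
print(examples)
```

Note that a negative value whose magnitude reaches or exceeds the core count wraps around repeatedly rather than raising an error.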

Compiled functions are intended to be very fast. However, they do not have the same flexibility as pure Python functions. In general, we recommend using statically compiled functions for optimal performance. We provide the option to define a default compilation mode for decorated functions, while also allowing the compilation mode to be defined for each individual function.

**NOTE**: Compiled functions are by default expected to run on a single thread. Thus, 'cuda' functions are always assumed to be device functions, which makes them callable from within the GPU, unless explicitly stated otherwise. Similarly, 'numba' functions are always assumed to be 'nopython' and 'nogil'.

**NOTE**: If the global compilation mode is set to Python, all decorators default to Python, even if a specific `compilation_mode` is provided.

In addition, we allow dynamic compilation to be enabled, meaning the compilation mode of functions can be changed at runtime. Do note that this comes at the cost of some performance, as compilation needs to be done at runtime as well. Moreover, functions that are defined with dynamic compilation cannot be called from within other compiled functions (with the exception of 'python' compilation, which means no compilation is actually performed).

**NOTE**: Dynamic compilation must be enabled before functions are decorated to take effect at runtime; otherwise they are statically compiled with the settings in effect at the time they are defined! Alternatively, statically compiled functions of an 'imported_module' can be reloaded (and thus statically recompiled) with the commands:
```
import importlib
importlib.reload(imported_module)
```
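
The difference between static and dynamic compilation can be illustrated without any compiler. In the sketch below, `fake_compile`, `static_decorator`, `dynamic_decorator` and the global `MODE` are hypothetical stand-ins: the static variant reads the mode once at decoration time, while the dynamic variant reads it again on every call.

```python
import functools

MODE = "numba"  # stand-in for the global COMPILATION_MODE


def fake_compile(func, mode):
    """Stand-in for real compilation: tag each result with the mode that was used."""
    @functools.wraps(func)
    def compiled(*args, **kwargs):
        return mode, func(*args, **kwargs)
    return compiled


def static_decorator(func):
    # The mode is read once, when the function is decorated.
    return fake_compile(func, MODE)


def dynamic_decorator(func):
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        # The mode is read, and the function "recompiled", on every call.
        return fake_compile(func, MODE)(*args, **kwargs)
    return wrapper


@static_decorator
def static_double(x):
    return 2 * x


@dynamic_decorator
def dynamic_double(x):
    return 2 * x


MODE = "python"  # change the global mode after decoration
static_result = static_double(3)    # still tagged with the decoration-time mode
dynamic_result = dynamic_double(3)  # tagged with the new mode
print(static_result, dynamic_result)
```

The per-call recompilation in the dynamic wrapper is also why dynamic compilation costs some performance.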

In [6]:
#export 

DYNAMIC_COMPILATION_ENABLED = False

def set_compilation_mode(
    compilation_mode: str = None,
    enable_dynamic_compilation: bool = None,
) -> None:
    """Set the global compilation mode to use.

    Args:
        compilation_mode (str): The compilation mode to use.
            Will be checked with `is_valid_compilation_mode`.
            Default is None
        enable_dynamic_compilation (bool): Enable dynamic compilation.
            If enabled, code will generally be slower and no other functions can
            be called from within a compiled function anymore, as they are compiled at runtime.
            WARNING: Enabling this is strongly discouraged in almost all cases!
            Default is None.

    """
    if enable_dynamic_compilation is not None:
        global DYNAMIC_COMPILATION_ENABLED
        DYNAMIC_COMPILATION_ENABLED = enable_dynamic_compilation
    if compilation_mode is not None:
        is_valid_compilation_mode(compilation_mode)
        global COMPILATION_MODE
        COMPILATION_MODE = compilation_mode


def compile_function(
    _func: callable = None,
    *,
    compilation_mode: str = None,
    **decorator_kwargs,
) -> callable:
    """A decorator to compile a given function.

    Numba functions are by default set to use `nogil=True` and `nopython=True`,
    unless explicitly defined otherwise.
    Cuda functions are by default set to use `device=True`,
    unless explicitly defined otherwise.

    Args:
        compilation_mode (str): The compilation mode to use.
            Will be checked with `is_valid_compilation_mode`.
            If None, the global COMPILATION_MODE will be used as soon as the function is decorated for static compilation.
            If DYNAMIC_COMPILATION_ENABLED, the function will always be compiled at runtime and
            thus by default returns a Python function.
            Static recompilation can be enforced by reimporting a module containing
            the function with importlib.reload(imported_module).
            If COMPILATION_MODE is Python and not DYNAMIC_COMPILATION_ENABLED, no compilation will be used.
            Default is None
        **decorator_kwargs: Keyword arguments that will be passed to numba.jit or cuda.jit compilation decorators.

    Returns:
        callable: A decorated function that is compiled.

    """
    if compilation_mode is None:
        if DYNAMIC_COMPILATION_ENABLED:
            compilation_mode = "dynamic"
        else:
            compilation_mode = COMPILATION_MODE
    elif COMPILATION_MODE.startswith("python"):
        compilation_mode = "python"
    else:
        is_valid_compilation_mode(compilation_mode)
    def parse_compilation(current_compilation_mode, func):
        if current_compilation_mode.startswith("python"):
            compiled_function = __copy_func(func)
        elif current_compilation_mode.startswith("numba"):
            if "nogil" in decorator_kwargs:
                if "nopython" in decorator_kwargs:
                    compiled_function = numba.jit(func, **decorator_kwargs)
                else:
                    compiled_function = numba.jit(func, **decorator_kwargs, nopython=True)
            elif "nopython" in decorator_kwargs:
                compiled_function = numba.jit(func, **decorator_kwargs, nogil=True)
            else:
                compiled_function = numba.jit(func, **decorator_kwargs, nogil=True, nopython=True)
        elif current_compilation_mode.startswith("cuda"):
            if "device" in decorator_kwargs:
                compiled_function = cuda.jit(func, **decorator_kwargs)
            else:
                compiled_function = cuda.jit(func, **decorator_kwargs, device=True)
        return compiled_function
    def decorated_function(func):
        if compilation_mode != "dynamic":
            is_valid_compilation_mode(compilation_mode)
            static_compiled_function = parse_compilation(compilation_mode, func)
            return functools.wraps(func)(static_compiled_function)
        else:
            def dynamic_compiled_function(*func_args, **func_kwargs):
                compiled_function = parse_compilation(COMPILATION_MODE, func)
                return compiled_function(*func_args, **func_kwargs)
            return functools.wraps(func)(dynamic_compiled_function)
    if _func is None:
        return decorated_function
    else:
        return decorated_function(_func)


import types
import functools

def __copy_func(f):
    """Based on http://stackoverflow.com/a/6528148/190597 (Glenn Maynard)"""
    g = types.FunctionType(f.__code__, f.__globals__, name=f.__name__,
                           argdefs=f.__defaults__,
                           closure=f.__closure__)
    g = functools.update_wrapper(g, f)
    g.__kwdefaults__ = f.__kwdefaults__
    return g

Testing yields the expected results:

In [7]:
import types

set_compilation_mode(compilation_mode="numba-multithread")

@compile_function(compilation_mode="python")
def test_func_python(x):
    """Docstring test"""
    x[0] += 1
    
@compile_function(compilation_mode="numba")
def test_func_numba(x):
    """Docstring test"""
    x[0] += 1

set_compilation_mode(enable_dynamic_compilation=True)

@compile_function
def test_func_dynamic_runtime(x):
    """Docstring test"""
    x[0] += 1

set_compilation_mode(enable_dynamic_compilation=False, compilation_mode="numba-multithread")

@compile_function
def test_func_static_runtime_numba(x):
    """Docstring test"""
    x[0] += 1

a = np.zeros(1, dtype=np.int64)
assert(isinstance(test_func_python, types.FunctionType))
test_func_python(a)
assert(np.all(a == np.ones(1)))

a = np.zeros(1)
assert(isinstance(test_func_numba, numba.core.registry.CPUDispatcher))
test_func_numba(a)
assert(np.all(a == np.ones(1)))

if __GPU_AVAILABLE:
    @compile_function(compilation_mode="cuda", device=None)
    def test_func_cuda(x):
        """Docstring test"""
        x[0] += 1

    # Cuda function cannot be tested from outside the GPU
    a = np.zeros(1)
    assert(isinstance(test_func_cuda, numba.cuda.compiler.Dispatcher))
    test_func_cuda.forall(1,1)(a)
    assert(np.all(a == np.ones(1)))

set_compilation_mode(compilation_mode="python")
a = np.zeros(1)
assert(isinstance(test_func_static_runtime_numba, numba.core.registry.CPUDispatcher))
test_func_static_runtime_numba(a)
assert(np.all(a == np.ones(1)))

set_compilation_mode(compilation_mode="python")
a = np.zeros(1)
assert(isinstance(test_func_dynamic_runtime, types.FunctionType))
test_func_dynamic_runtime(a)
assert(np.all(a == np.ones(1)))

set_compilation_mode(compilation_mode="numba")
a = np.zeros(1)
assert(isinstance(test_func_dynamic_runtime, types.FunctionType))
test_func_dynamic_runtime(a)
assert(np.all(a == np.ones(1)))

# # Cuda function cannot be tested from outside the GPU
# set_compilation_mode(compilation_mode="cuda")
# a = np.zeros(1)
# assert(isinstance(test_func_dynamic_runtime, types.FunctionType))
# test_func_dynamic_runtime.forall(1,1)(a)
# assert(np.all(a == np.ones(1)))

Next, we define the 'performance_function' decorator to take full advantage of both compilation and parallelization for maximal performance. Note that a 'performance_function' can not return values. Instead, it should store results in provided buffer arrays.
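
A minimal sketch of this calling convention, using only the standard library instead of compiled functions. The names `square_into` and `run_strided` are hypothetical. The strided split `indices[worker_id::worker_count]` mirrors how the decorator defined in the next cell distributes indices over threads.

```python
import threading


def square_into(index, in_values, out_values):
    # The index is the first argument; the result is written to a buffer, not returned.
    out_values[index] = in_values[index] ** 2


def run_strided(func, indices, worker_count, *func_args):
    """Apply func to every index; worker i handles indices i, i + n, i + 2n, ..."""
    def work(local_indices):
        for index in local_indices:
            func(index, *func_args)

    workers = [
        threading.Thread(target=work, args=(indices[worker_id::worker_count],))
        for worker_id in range(worker_count)
    ]
    for worker in workers:
        worker.start()
    for worker in workers:
        worker.join()


in_values = list(range(10))
out_values = [0] * len(in_values)
run_strided(square_into, range(len(in_values)), 3, in_values, out_values)
print(out_values)
```

Because every index is written by exactly one worker, the threads never write to the same buffer element. In pure Python this gives no speed-up because of the GIL; the benefit appears once the inner function is compiled with the GIL released.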

In [8]:
#export 

def performance_function(
    _func: callable = None,
    *,
    worker_count: int = None,
    compilation_mode: str = None,
    **decorator_kwargs,
) -> callable:
    """A decorator to compile a given function and allow multithreading over multiple indices.

    NOTE This should only be used on functions that are compilable.
    Functions that need to be decorated need to have an `index` argument as first argument.
    If an iterable is provided to the decorated function,
    the original (compiled) function will be applied to all elements of this iterable.
    The most efficient way to provide iterables are with ranges, but numpy arrays work as well.
    Functions cannot return values,
    results should be stored in buffer arrays inside the function instead.

    Args:
        worker_count (int): The number of workers to use for multithreading.
            If None, the global MAX_WORKER_COUNT is used at runtime.
            Default is None.
        compilation_mode (str): The compilation mode to use. Will be forwarded to the `compile_function` decorator.
        **decorator_kwargs: Keyword arguments that will be passed to numba.jit or cuda.jit compilation decorators.

    Returns:
        callable: A decorated function that is compiled and parallelized.

    """
    if worker_count is not None:
        worker_count = set_worker_count(worker_count, set_global=False)
    if compilation_mode is None:
        if DYNAMIC_COMPILATION_ENABLED:
            compilation_mode = "dynamic"
        else:
            compilation_mode = COMPILATION_MODE
    elif COMPILATION_MODE.startswith("python"):
        compilation_mode = "python"
    else:
        is_valid_compilation_mode(compilation_mode)
    def _decorated_function(func):
        if compilation_mode != "dynamic":
            compiled_function = compile_function(
                func,
                compilation_mode=compilation_mode,
                **decorator_kwargs
            )
        def _parallel_python(
            compiled_function,
            iterable,
            start,
            stop,
            step,
            *func_args
        ):
            if start != -1:
                for index in range(start, stop, step):
                    compiled_function(index, *func_args)
            else:
                for index in iterable:
                    compiled_function(index, *func_args)
        _parallel_numba = numba.njit(nogil=True)(_parallel_python)
        def _parallel_cuda(compiled_function, iterable, *func_args):
            cuda_func_dict = {"cuda": cuda, "compiled_function": compiled_function}
            # Cuda functions cannot handle tuple unpacking but need a fixed number of arguments.
            if isinstance(iterable, range):
                func_string = ", ".join(f"arg{i}" for i in range(len(func_args) + 3))
                cuda_string = (
                    f"@cuda.jit\n"
                    f"def cuda_func({func_string}):\n"
                    f"    index = arg0 + arg2 * cuda.grid(1)\n"
                    f"    compiled_function(index, {func_string[18:]})\n"
                )
                exec(cuda_string, cuda_func_dict)
                cuda_func_dict["cuda_func"].forall(len(iterable), 1)(
                    iterable.start,
                    iterable.stop,
                    iterable.step,
                    *func_args
                )
            else:
                func_string = ", ".join(f"arg{i}" for i in range(len(func_args) + 1))
                cuda_string = (
                    f"@cuda.jit\n"
                    f"def cuda_func({func_string}):\n"
                    f"    index = arg0[cuda.grid(1)]\n"
                    f"    compiled_function(index, {func_string[6:]})\n"
                )
                exec(cuda_string, cuda_func_dict)
                cuda_func_dict["cuda_func"].forall(len(iterable), 1)(iterable, *func_args)
        def _performance_function(iterable, *func_args):
            if compilation_mode == "dynamic":
                selected_compilation_mode = COMPILATION_MODE
                _compiled_function = compile_function(
                    func,
                    compilation_mode=selected_compilation_mode,
                    **decorator_kwargs
                )
            else:
                _compiled_function = compiled_function
                selected_compilation_mode = compilation_mode
            try:
                iter(iterable)
            except TypeError:
                iterable = np.array([iterable])
            if worker_count is None:
                selected_worker_count = MAX_WORKER_COUNT
            else:
                selected_worker_count = worker_count
            if selected_compilation_mode == "cuda":
                _parallel_cuda(_compiled_function, iterable, *func_args)
            else:
                if "python" in selected_compilation_mode:
                    parallel_function = _parallel_python
                elif "numba" in selected_compilation_mode:
                    parallel_function = _parallel_numba
                else:
                    raise NotImplementedError(
                        f"Compilation mode {selected_compilation_mode} is not valid. "
                        "This error should not be possible, something is seriously wrong!!!"
                    )
                if (selected_compilation_mode in ["python", "numba"]) or (selected_worker_count == 1):
                    iterable_is_range = isinstance(iterable, range)
                    parallel_function(
                        _compiled_function,
                        np.empty(0, dtype=np.int64) if iterable_is_range else iterable,
                        iterable.start if iterable_is_range else -1,
                        iterable.stop if iterable_is_range else -1,
                        iterable.step if iterable_is_range else -1,
                        *func_args
                    )
                else:
                    workers = []
                    for worker_id in range(selected_worker_count):
                        local_iterable = iterable[worker_id::selected_worker_count]
                        iterable_is_range = isinstance(local_iterable, range)
                        worker = threading.Thread(
                            target=parallel_function,
                            args=(
                                _compiled_function,
                                np.empty(0, dtype=np.int64) if iterable_is_range else local_iterable,
                                local_iterable.start if iterable_is_range else -1,
                                local_iterable.stop if iterable_is_range else -1,
                                local_iterable.step if iterable_is_range else -1,
                                *func_args
                            )
                        )
                        worker.start()
                        workers.append(worker)
                    for worker in workers:
                        worker.join()
                        del worker
        return functools.wraps(func)(_performance_function)
    if _func is None:
        return _decorated_function
    else:
        return _decorated_function(_func)

We test this function with a simple smoothing algorithm.

In [9]:
def smooth_func(index, in_array, out_array, window_size):
    min_index = max(index - window_size, 0)
    max_index = min(index + window_size + 1, len(in_array))
    smooth_value = 0
    for i in range(min_index, max_index):
        smooth_value += in_array[i]
    out_array[index] += smooth_value / (max_index - min_index)


set_compilation_mode(compilation_mode="numba-multithread")
set_worker_count(0)
array_size = 10**6
smooth_factor = 10**4

# python test
in_array = np.arange(array_size)
out_array = np.zeros_like(in_array)

func = performance_function(compilation_mode="python")(smooth_func)
%time func(range(in_array[::100].shape[0]), in_array[::100], out_array[::100], smooth_factor//10) #too slow to test otherwise

# numba test
in_array = np.arange(array_size)
out_array = np.zeros_like(in_array)

func = performance_function(compilation_mode="numba")(smooth_func)
%time func(range(in_array.shape[0]), in_array, out_array, smooth_factor)

# numba-multithread test
in_array = np.arange(array_size)
out_array = np.zeros_like(in_array)

func = performance_function(compilation_mode="numba-multithread")(smooth_func)
%time func(range(in_array.shape[0]), in_array, out_array, smooth_factor)

# cuda test
if __GPU_AVAILABLE:
    in_array = cupy.arange(array_size)
    out_array = cupy.zeros_like(in_array)

    func = performance_function(compilation_mode="cuda")(smooth_func)
    %time func(range(in_array.shape[0]), in_array, out_array, smooth_factor)
    %time tmp = out_array.get()

CPU times: total: 2.55 s
Wall time: 2.54 s
CPU times: total: 7.47 s
Wall time: 7.49 s
CPU times: total: 11 s
Wall time: 887 ms


Finally, we also provide functionality to use multiprocessing instead of multithreading.

**NOTE**: There are some inherent limitations on the number of processes that Python can spawn. As such, no process Pool should use more than 50 processes.

In [10]:
#export 
from multiprocessing import Pool

def AlphaPool(process_count: int) -> multiprocessing.Pool:
    """Create a multiprocessing.Pool object.

    Args:
        process_count (int): The number of processes.
            If larger than available cores, it is trimmed to the available maximum.
            It is additionally capped at 50 processes.

    Returns:
        multiprocessing.Pool: A Pool object to parallelize functions with multiple processes.

    """
    max_processes = psutil.cpu_count()
    new_max = min(process_count, 50, max_processes)
    
    if new_max < 1:
        new_max = 1
    logging.info(f"AlphaPool was set to {process_count} processes. Setting max to {new_max}.")

    return Pool(new_max)
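
The clamping can be checked in isolation without spawning any processes. The sketch below uses a hypothetical helper `capped_process_count` with an explicit `available` argument in place of `psutil.cpu_count()`. Clamping negative requests to 1 is an assumption of this sketch.

```python
def capped_process_count(requested: int, available: int, hard_cap: int = 50) -> int:
    """Clamp a requested process count to the range [1, min(available, hard_cap)]."""
    return max(1, min(requested, hard_cap, available))


# (requested, available) pairs for a few representative machines.
cases = [(4, 8), (16, 8), (200, 128), (0, 8)]
results = [capped_process_count(requested, available) for requested, available in cases]
print(results)
```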

In [11]:
#hide
from nbdev.export import *
notebook2script()

Converted 00_settings.ipynb.
Converted 01_chem.ipynb.
Converted 02_io.ipynb.
Converted 03_fasta.ipynb.
Converted 05_search.ipynb.
Converted 06_score.ipynb.
Converted 07_recalibration.ipynb.
Converted 08_quantification.ipynb.
Converted 09_matching.ipynb.
Converted 10_constants.ipynb.
Converted 11_interface.ipynb.
Converted 12_performance.ipynb.
Converted 13_export.ipynb.
Converted 14_display.ipynb.
Converted 15_label.ipynb.
Converted additional_code.ipynb.
Converted contributing.ipynb.
Converted file_formats.ipynb.
Converted index.ipynb.
