# Performance

In [None]:
# default_exp performance

AlphaPept deals with high-throughput data. As this can be computationally intensive, we try to make all functions as performant as possible. To do so, we rely on two principles:
* **Compilation**
* **Parallelization**

A first step of **compilation** can be achieved by using NumPy arrays, which are already heavily C-optimized. Next, we consider three kinds of compilation:
* **Python** No compilation is used.
* **Numba** Just-in-time (JIT) compilation is used.
* **Cuda** Functions are compiled for the GPU.
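
As a small illustration of the NumPy point above, the sketch below computes the same sum of squares once with an interpreted Python loop and once with a single NumPy call. The array size and variable names are arbitrary choices for this example, not part of AlphaPept.

```python
import numpy as np

values = np.arange(1_000, dtype=np.float64)

# Pure Python: the interpreter executes every single iteration.
loop_total = 0.0
for value in values:
    loop_total += value * value

# NumPy: the same reduction runs inside compiled C code.
vectorized_total = float(np.dot(values, values))

print(loop_total, vectorized_total)
```

Both variants give the same result; only the place where the loop is executed differs.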

All of these compilation approaches can be combined with **parallelization** approaches. We consider the following possibilities:
* **No parallelization** Not all functionality can be parallelized.
* **Multithreading** This is only performant when Python's global interpreter lock (GIL) is released or when mostly using input/output (IO) functions.
* **GPU** This is only available if an NVIDIA GPU is available and properly configured.

Note that not all compilation approaches can sensibly be combined with all parallelization approaches.
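
To make the multithreading remark concrete, here is a minimal sketch in which `time.sleep` stands in for a blocking IO call. Sleeping releases the GIL, so the threads overlap. The helper names `io_like_task` and `run_in_threads` are hypothetical and only serve this illustration.

```python
import threading
import time

def io_like_task(duration: float) -> None:
    # time.sleep releases the GIL, similar to a blocking IO call.
    time.sleep(duration)

def run_in_threads(task_count: int, duration: float) -> float:
    """Run sleeping tasks concurrently and return the elapsed wall time."""
    threads = [
        threading.Thread(target=io_like_task, args=(duration,))
        for _ in range(task_count)
    ]
    start = time.perf_counter()
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return time.perf_counter() - start

elapsed = run_in_threads(task_count=4, duration=0.25)
print(f"4 tasks of 0.25 s took {elapsed:.2f} s in total")
```

Run sequentially, the four tasks would need about one second. With threads, the total should stay close to the duration of a single task. A CPU-bound pure Python task would not show this benefit, since it holds the GIL.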

In [1]:
#export 

COMPILATION_MODE_OPTIONS = [
    "python",
    "python-multithread",
    "numba",
    "numba-multithread",
    "cuda", # Cuda is always multithreaded
]
COMPILATION_MODE = "numba-multithread"

Next, we import all libraries, taking into account that not every machine has an NVIDIA GPU available:

In [2]:
#export 

import functools
import math
import os
import logging
import psutil
import ast

# Parallelization
import multiprocessing
import threading

# Compilation
import numpy as np
import numba
from numba import cuda
try:
    import cupy
    cuda.get_current_device()
    __GPU_AVAILABLE = True
except ModuleNotFoundError:
    __GPU_AVAILABLE = False
    cupy = None
    logging.info("Cupy is not available")
except cuda.CudaSupportError:
    __GPU_AVAILABLE = False
    logging.info("Cuda device is not available")
    
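def is_valid_compilation_mode(compilation_mode: str):
    """Check if a compilation mode is valid and available.

    Raises a ModuleNotFoundError if a 'cuda' mode is requested without a GPU,
    and a NotImplementedError if the mode is not in COMPILATION_MODE_OPTIONS.
    """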
    if compilation_mode.startswith("cuda"):
        if not __GPU_AVAILABLE:
            raise ModuleNotFoundError('Cuda functions are not available.')
    if compilation_mode not in COMPILATION_MODE_OPTIONS:
        raise NotImplementedError(
            f"Compilation mode '{compilation_mode}' is not available, "
            "see COMPILATION_MODE_OPTIONS for available options."
        )

To consistently use multiple threads or processes, we can set a global MAX_WORKER_COUNT parameter.
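
The parsing rules for the worker count (trim to the maximum, treat 0 as "all cores", treat negative values as "cores not to use") can be illustrated in isolation. The helper `parse_worker_count` below is a hypothetical stand-in that takes the core count as an explicit argument instead of querying `psutil`, so the result does not depend on the machine.

```python
def parse_worker_count(worker_count: int, max_cpu_count: int) -> int:
    """Map a requested worker count to a value in 1..max_cpu_count."""
    if worker_count > max_cpu_count:
        # More workers requested than cores available: trim.
        return max_cpu_count
    while worker_count <= 0:
        # 0 means all cores, negative values leave that many cores unused.
        worker_count += max_cpu_count
    return worker_count

for requested in (1, 100, 0, -2):
    print(requested, "->", parse_worker_count(requested, max_cpu_count=8))
```

With 8 cores assumed, a request of -2 resolves to 6 workers and a request of 100 is trimmed to 8.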

In [3]:
#export 

MAX_WORKER_COUNT = 1

def set_worker_count(worker_count: int = 1, set_global: bool = True) -> int:
    """Parse and set the (global) number of threads.

    Parameters
    ----------
    worker_count : int
        The number of workers.
        If larger than available cores, it is trimmed to the available maximum.
        If 0, it is set to the maximum cores available.
        If negative, it indicates how many cores NOT to use.
        Default is 1
    set_global : bool
        If False, the number of workers is only parsed to a valid value.
        If True, the number of workers is saved as a global variable.
        Default is True.

    Returns
    -------
    : int
        The parsed worker_count.
    """
    max_cpu_count = psutil.cpu_count()
    if worker_count > max_cpu_count:
        worker_count = max_cpu_count
    else:
        while worker_count <= 0:
            worker_count += max_cpu_count
    if set_global:
        global MAX_WORKER_COUNT
        MAX_WORKER_COUNT = worker_count
    return worker_count

Compiled functions are intended to be very fast. However, they do not have the same flexibility as pure Python functions. In general, we recommend using statically defined compilation functions for optimal performance. We provide the option to define a default compilation mode for decorated functions, while also allowing the compilation mode to be defined for each individual function.

**NOTE**: Compiled functions are by default expected to run on a single thread. Thus, 'cuda' functions are always assumed to be device functions, which makes them callable from within the GPU, unless explicitly stated otherwise. Similarly, 'numba' functions are always assumed to be 'nopython' and 'nogil'.

In addition, dynamic compilation can be enabled, meaning the compilation mode of functions can be changed at runtime. Note that this comes at the cost of some performance, as compilation then needs to be done at runtime as well. Moreover, functions that are defined with dynamic compilation cannot be called from within other compiled functions (with the exception of 'python' compilation, which means no compilation is actually performed).

**NOTE**: Dynamic compilation must be enabled before functions are decorated to take effect at runtime, otherwise they are statically compiled with the settings that are current at the time they are defined! Alternatively, statically compiled functions of an 'imported_module' can be reloaded (and thus statically recompiled) with the commands:
```
import importlib
importlib.reload(imported_module)
```
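
The difference between static and dynamic compilation comes down to when the mode is read: once at decoration time, or again on every call. The toy sketch below shows this without any real compiler. `SETTINGS`, `toy_compile`, `static_decorator` and `dynamic_decorator` are hypothetical names for this illustration, and "compiling" here merely tags the result with the mode that was active.

```python
import functools

# Hypothetical global setting, standing in for a global compilation mode.
SETTINGS = {"mode": "fast"}

def toy_compile(func, mode):
    """Pretend to compile func by tagging its result with the mode."""
    return lambda *args: (mode, func(*args))

def static_decorator(func):
    # The mode is read once, when the function is decorated.
    return functools.wraps(func)(toy_compile(func, SETTINGS["mode"]))

def dynamic_decorator(func):
    # The mode is looked up (and the function "recompiled") on every call.
    @functools.wraps(func)
    def wrapper(*args):
        return toy_compile(func, SETTINGS["mode"])(*args)
    return wrapper

@static_decorator
def add_static(a, b):
    return a + b

@dynamic_decorator
def add_dynamic(a, b):
    return a + b

# Changing the mode afterwards only affects the dynamic variant.
SETTINGS["mode"] = "slow"
static_result = add_static(1, 2)
dynamic_result = add_dynamic(1, 2)
print(static_result, dynamic_result)
```

The static variant keeps the mode that was active when it was defined, while the dynamic variant picks up the change at the price of redoing the "compilation" step on each call.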

In [4]:
#export 

DYNAMIC_COMPILATION_ENABLED = False
    
def set_compilation_mode(
    compilation_mode: str = None,
    enable_dynamic_compilation: bool = None,
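) -> None:
    """Set the global compilation mode and/or toggle dynamic compilation.

    Arguments that are None leave the corresponding setting unchanged.
    """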
    if enable_dynamic_compilation is not None:
        global DYNAMIC_COMPILATION_ENABLED
        DYNAMIC_COMPILATION_ENABLED = enable_dynamic_compilation
    if compilation_mode is not None:
        is_valid_compilation_mode(compilation_mode)
        global COMPILATION_MODE
        COMPILATION_MODE = compilation_mode
    

def compile_function(
    _func: callable = None,
    *,
    compilation_mode: str = None,
    **decorator_kwargs,
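):
    """Decorator to compile a function with the given compilation mode.

    If no compilation_mode is given, the global COMPILATION_MODE is used,
    or the mode is resolved at call time if dynamic compilation is enabled.
    Additional keyword arguments are passed on to the numba or cuda jit.
    """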
    if compilation_mode is None:
        if DYNAMIC_COMPILATION_ENABLED:
            compilation_mode = "dynamic"
        else:
            compilation_mode = COMPILATION_MODE
    def parse_compilation(current_compilation_mode, func):
        if current_compilation_mode.startswith("python"):
            compiled_function = func
        elif current_compilation_mode.startswith("numba"):
            # Default to 'nogil' and 'nopython' unless explicitly overridden.
            numba_kwargs = {"nogil": True, "nopython": True, **decorator_kwargs}
            compiled_function = numba.jit(func, **numba_kwargs)
        elif current_compilation_mode.startswith("cuda"):
            # Default to a device function unless explicitly overridden.
            cuda_kwargs = {"device": True, **decorator_kwargs}
            compiled_function = cuda.jit(func, **cuda_kwargs)
        return compiled_function
    def decorated_function(func):
        if compilation_mode != "dynamic":
            is_valid_compilation_mode(compilation_mode)
            static_compiled_function = parse_compilation(compilation_mode, func)
            return functools.wraps(func)(static_compiled_function)
        else:
            def dynamic_compiled_function(*func_args, **func_kwargs):
                compiled_function = parse_compilation(COMPILATION_MODE, func)
                return compiled_function(*func_args, **func_kwargs)
            return functools.wraps(func)(dynamic_compiled_function)
    if _func is None:
        return decorated_function
    else:
        return decorated_function(_func)

Testing yields the expected results:

In [5]:
import types

set_compilation_mode(compilation_mode="numba-multithread")

@compile_function(compilation_mode="python")
def test_func_python(x):
    """Docstring test"""
    x[0] += 1
    
@compile_function(compilation_mode="numba")
def test_func_numba(x):
    """Docstring test"""
    x[0] += 1
    
@compile_function(compilation_mode="cuda")
def test_func_cuda(x):
    """Docstring test"""
    x[0] += 1

set_compilation_mode(enable_dynamic_compilation=True)

@compile_function
def test_func_dynamic_runtime(x):
    """Docstring test"""
    x[0] += 1

set_compilation_mode(enable_dynamic_compilation=False, compilation_mode="numba-multithread")

@compile_function
def test_func_static_runtime_numba(x):
    """Docstring test"""
    x[0] += 1

a = np.zeros(1, dtype=np.int64)
assert(isinstance(test_func_python, types.FunctionType))
test_func_python(a)
assert(np.all(a == np.ones(1)))

a = np.zeros(1)
assert(isinstance(test_func_numba, numba.core.registry.CPUDispatcher))
test_func_numba(a)
assert(np.all(a == np.ones(1)))

# # Cuda function cannot be tested from outside the GPU
# a = np.zeros(1)
# assert(isinstance(test_func_cuda, numba.cuda.compiler.Dispatcher))
# test_func_cuda.forall(1,1)(a)
# assert(np.all(a == np.ones(1)))

set_compilation_mode(compilation_mode="python")
a = np.zeros(1)
assert(isinstance(test_func_static_runtime_numba, numba.core.registry.CPUDispatcher))
test_func_static_runtime_numba(a)
assert(np.all(a == np.ones(1)))

set_compilation_mode(compilation_mode="python")
a = np.zeros(1)
assert(isinstance(test_func_dynamic_runtime, types.FunctionType))
test_func_dynamic_runtime(a)
assert(np.all(a == np.ones(1)))

set_compilation_mode(compilation_mode="numba")
a = np.zeros(1)
assert(isinstance(test_func_dynamic_runtime, types.FunctionType))
test_func_dynamic_runtime(a)
assert(np.all(a == np.ones(1)))

# # Cuda function cannot be tested from outside the GPU
# set_compilation_mode(compilation_mode="cuda")
# a = np.zeros(1)
# assert(isinstance(test_func_dynamic_runtime, types.FunctionType))
# test_func_dynamic_runtime.forall(1,1)(a)
# assert(np.all(a == np.ones(1)))

Next, we define the 'performance_function' decorator to take full advantage of both compilation and parallelization for maximal performance. Note that a 'performance_function' cannot return values. Instead, it should store results in provided buffer arrays.
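
The calling convention (one call per index, no return value, results written to a buffer) and the interleaved split of indices over threads can be sketched in plain Python. The names `square_into_buffer` and `run_threaded` are hypothetical and only mimic the pattern; no compilation is involved, so this sketch shows the structure rather than any speed-up.

```python
import threading

def square_into_buffer(index, in_values, out_values):
    # No return value: the result is written into the output buffer.
    out_values[index] = in_values[index] ** 2

def run_threaded(func, iterable, worker_count, *func_args):
    """Call func(index, *func_args) for every index, split over threads."""
    def worker(local_iterable):
        for index in local_iterable:
            func(index, *func_args)
    threads = []
    for worker_id in range(worker_count):
        # Strided slicing gives each worker an interleaved share of indices.
        local_iterable = iterable[worker_id::worker_count]
        thread = threading.Thread(target=worker, args=(local_iterable,))
        thread.start()
        threads.append(thread)
    for thread in threads:
        thread.join()

in_values = list(range(10))
out_values = [0] * len(in_values)
run_threaded(square_into_buffer, range(len(in_values)), 3, in_values, out_values)
print(out_values)
```

Because every index is handled by exactly one worker and each call writes to its own buffer position, the threads do not need any locking.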

In [47]:
#export 

def performance_function(
    _func: callable = None,
    *,
    worker_count: int = None,
    compilation_mode: str = None,
    **decorator_kwargs,
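):
    """Decorator to compile a function and run it over an iterable in parallel.

    The decorated function is called as func(iterable, *args) and applies the
    original function to each index of the iterable. It returns nothing;
    results need to be stored in provided buffer arrays.
    """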
    if worker_count is not None:
        worker_count = set_worker_count(worker_count, set_global=False)
    if compilation_mode is None:
        if DYNAMIC_COMPILATION_ENABLED:
            compilation_mode = "dynamic"
        else:
            compilation_mode = COMPILATION_MODE
    else:
        is_valid_compilation_mode(compilation_mode)
    def _decorated_function(func):
        if compilation_mode != "dynamic":
            compiled_function = compile_function(
                func,
                compilation_mode=compilation_mode,
                **decorator_kwargs
            )
        def _parallel_python(
            compiled_function,
            iterable,
            thread_id,
            start,
            stop,
            step,
            *func_args
        ):
            if len(iterable) == 0:
                for index in range(start, stop, step):
                    compiled_function(index, *func_args)
            else:
                for index in iterable:
                    compiled_function(index, *func_args)
        _parallel_numba = numba.njit(nogil=True)(_parallel_python)
        def _parallel_cuda(compiled_function, iterable, *func_args):
            cuda_func_dict = {"cuda": cuda, "compiled_function": compiled_function}
            func_string = ", ".join(f"arg{i}" for i in range(len(func_args)))
            cuda_string = (
                f"@cuda.jit\n"
                f"def cuda_func({func_string}):\n"
                f"    index = cuda.grid(1)\n"
                f"    compiled_function(index, {func_string})\n"
            )
            exec(cuda_string, cuda_func_dict)
            cuda_func_dict["cuda_func"].forall(len(iterable), 1)(*func_args)
#                 @cuda.jit
#                 def cuda_func(a1, a2, a3):
#                     index = cuda.grid(1)
#                     compiled_function(index, a1, a2, a3)
#                 cuda_func.forall(len(iterable), 1)(*func_args)
        def _performance_function(iterable, *func_args):
            if compilation_mode == "dynamic":
                selected_compilation_mode = COMPILATION_MODE
                _compiled_function = compile_function(
                    func,
                    compilation_mode=selected_compilation_mode,
                    **decorator_kwargs
                )
            else:
                _compiled_function = compiled_function
                selected_compilation_mode = compilation_mode
            try:
                iter(iterable)
            except TypeError:
                iterable = [iterable]
            if worker_count is None:
                selected_worker_count = MAX_WORKER_COUNT
            else:
                selected_worker_count = worker_count
            if selected_compilation_mode == "cuda":
                _parallel_cuda(_compiled_function, iterable, *func_args)
            else:
                if "python" in selected_compilation_mode:
                    parallel_function = _parallel_python
                elif "numba" in selected_compilation_mode:
                    parallel_function = _parallel_numba
                else:
                    raise NotImplementedError(
                        f"Compilation mode {selected_compilation_mode} is not valid. "
                        "This error should not be possible, something is seriously wrong!!!"
                    )
                if selected_compilation_mode in ["python", "numba"]:
                    iterable_is_range = isinstance(iterable, range)
                    parallel_function(
                        _compiled_function,
                        np.empty(0, dtype=np.int64) if iterable_is_range else iterable,
                        0,
                        iterable.start if iterable_is_range else -1,
                        iterable.stop if iterable_is_range else -1,
                        iterable.step if iterable_is_range else -1,
                        *func_args
                    )
                else:
                    workers = []
                    for worker_id in range(selected_worker_count):
                        local_iterable = iterable[worker_id::selected_worker_count]
                        iterable_is_range = isinstance(local_iterable, range)
                        worker = threading.Thread(
                            target=parallel_function,
                            args=(
                                _compiled_function,
                                np.empty(0, dtype=np.int64) if iterable_is_range else local_iterable,
                                worker_id,
                                local_iterable.start if iterable_is_range else -1,
                                local_iterable.stop if iterable_is_range else -1,
                                local_iterable.step if iterable_is_range else -1,
                                *func_args
                            )
                        )
                        worker.start()
                        workers.append(worker)
                    for worker in workers:
                        worker.join()
                        del worker
        return functools.wraps(func)(_performance_function)
    if _func is None:
        return _decorated_function
    else:
        return _decorated_function(_func)

In [82]:
def smooth_func(index, in_array, out_array, window_size):
    min_index = max(index - window_size, 0)
    max_index = min(index + window_size + 1, len(in_array))
    smooth_value = 0
    for i in range(min_index, max_index):
        smooth_value += 2 * in_array[i]
    out_array[index] += smooth_value / (max_index - min_index)

    
set_worker_count(0)
s = 10**6
in_array = cupy.arange(s)
out_array = cupy.zeros_like(in_array)

func = performance_function(compilation_mode="cuda")(smooth_func)
%time func(range(in_array.shape[0]), in_array, out_array, 100000)
# %time print(out_array)

Wall time: 12.5 s


In [88]:
s = 10**6
out_array

array([ 300000,  300003,  300006, ..., 5699988, 5699991, 5699994])

In [92]:
%time func(range(s), in_array, out_array, 100000)

Wall time: 12.5 s


In [None]:
d = {}
f = (
    "def z(a):\n"
    "  np.sum(a)\n"
    "  return a"
)
exec(f, d)
# d[z]

In [None]:
r = np.arange(3)

d["z"](r)

In [None]:
isinstance(r, np.ndarray)

In [None]:
d

In [None]:
fs = "a1,a2"
exec(
    "@cuda.jit\n"
    f"def cuda_func2({fs}):\n"
    "    index = cuda.grid(1)\n"
    f"    compiled_function(index, {fs})\n"
)
cuda_func2

In [None]:
d["z"]

In [None]:
#export 
import multiprocessing
import threading
import functools
import math
import numpy as np
from numba import njit
import psutil
from multiprocessing import Pool
import os
import logging

import numpy as numpy_
cupy = numpy_
from numba import cuda as cuda_
cuda = cuda_
import numba as numba_
numba = numba_

# We use jit_fun and jit_fun_gpu.
# This is to be able to distinguish GPU device functions and numba-optimized functions in a GPU setting

jit_fun = None
jit_fun_gpu = None
speed_mode = None


try:
    import cupy as cupy_
except ModuleNotFoundError:
    cupy_ = None

def dummy_decorator(func):
    """
    Dummy decorator that does nothing
    """
    return func

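def set_speed_mode(mode):
    """
    Set the global speed mode ('python', 'numba' or 'numba_gpu')
    and update the corresponding jit decorators and array library.
    """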
    global jit_fun
    global jit_fun_gpu
    global speed_mode
    global cupy
    
    speed_mode = mode
    
    if mode == 'python':
        jit_fun = dummy_decorator
        jit_fun_gpu = dummy_decorator
        cupy = numpy_
    elif mode == 'numba':
        jit_fun = njit
        jit_fun_gpu = njit
        cupy = numpy_
    elif mode == 'numba_gpu':
        jit_fun = njit
        jit_fun_gpu = cuda.jit(device=True)
        if cupy_ is not None:
            cupy = cupy_
        else:
            raise ModuleNotFoundError('Cupy not installed.')
    else:
        raise NotImplementedError(mode)
        
        
if cupy_ is not None:
    set_speed_mode('numba_gpu')
else:
    set_speed_mode('numba')
    
@numba.njit
def grid_1d(x): return -1
@numba.njit
def grid_2d(x): return -1, -1

def set_cuda_grid(dimensions=0):
    global cuda
    if dimensions == 0:
        cuda = cuda_
        cuda.grid = cuda_.grid
    if dimensions == 1:
        cuda = numba_
        cuda.grid = grid_1d
    if dimensions == 2:
        cuda = numba_
        cuda.grid = grid_2d
        
        
def numba_threaded(_func=None, *, cpu_threads=0):
    if cpu_threads <= 0:
        cpu_threads = multiprocessing.cpu_count()
        
    def parallel_compiled_func_inner(func):
        if speed_mode == 'python':
            numba_func = func
        else:
            numba_func = numba.njit(nogil=True)(func)

        def numba_func_parallel(thread, iterable, *args):
            for i in range(thread, len(iterable), cpu_threads):
                numba_func(i, iterable, *args)

        if speed_mode != 'python':
            numba_func_parallel = numba.njit(nogil=True)(numba_func_parallel)

        def wrapper(iterable, *args):
            threads = []
            for thread_id in range(cpu_threads):
                t = threading.Thread(
                    target=numba_func_parallel,
                    args=(thread_id, iterable, *args)
                )
                t.start()
                threads.append(t)
            for t in threads:
                t.join()
                del t
        return functools.wraps(func)(wrapper)
    
    if _func is None:
        return parallel_compiled_func_inner
    else:
        return parallel_compiled_func_inner(_func)
    
     
def parallel_compiled_func(
    _func=None,
    *,
    cpu_threads=None,
    dimensions=1,
):
    #set_cuda_grid()
    if dimensions not in (1, 2):
        raise ValueError("Only 1D and 2D are supported")

    if cpu_threads is not None:
        use_gpu = False
    else:
        try:
            cuda_.get_current_device()
        except cuda_.CudaSupportError:
            use_gpu = False
            cpu_threads = 0
        else:
            use_gpu = True
        try:
            import cupy
        except ModuleNotFoundError:
            use_gpu = False
            cpu_threads = 0
            
    if cpu_threads is None:
        cpu_threads = multiprocessing.cpu_count()
        
    if cpu_threads <= 0:
        cpu_threads = multiprocessing.cpu_count()

    if speed_mode == 'numba_gpu':
        use_gpu = True
    elif speed_mode == 'numba':
        use_gpu = False
    elif speed_mode == 'python':
        use_gpu = False
                
    if use_gpu:
        set_cuda_grid()
        def parallel_compiled_func_inner(func):
            cuda_func = cuda.jit(func)
            if dimensions == 1:
                def wrapper(iterable_1d, *args):
                    cuda_func.forall(iterable_1d.shape[0], 1)(
                        -1,
                        iterable_1d,
                        *args
                    )
            elif dimensions == 2:
                def wrapper(iterable_2d, *args):
                    threadsperblock = (
                        min(iterable_2d.shape[0], 16),
                        min(iterable_2d.shape[1], 16)
                    )
                    blockspergrid_x = math.ceil(
                        iterable_2d.shape[0] / threadsperblock[0]
                    )
                    blockspergrid_y = math.ceil(
                        iterable_2d.shape[1] / threadsperblock[1]
                    )
                    blockspergrid = (blockspergrid_x, blockspergrid_y)
                    cuda_func[blockspergrid, threadsperblock](
                        -1,
                        -1,
                        iterable_2d,
                        *args
                    )
            return functools.wraps(func)(wrapper)
    else:
        set_cuda_grid(dimensions)
        if cpu_threads <= 0:
            cpu_threads = multiprocessing.cpu_count()
        def parallel_compiled_func_inner(func):
            
            if speed_mode == 'python':
                numba_func = func
            else:
                numba_func = numba.njit(nogil=True)(func)

            if dimensions == 1:
                def numba_func_parallel(
                    thread,
                    iterable_1d,
                    *args
                ):
                    for i in range(
                        thread,
                        len(iterable_1d),
                        cpu_threads
                    ):
                        numba_func(i, iterable_1d, *args)
            elif dimensions == 2:
                def numba_func_parallel(
                    thread,
                    iterable_2d,
                    *args
                ):
                    for i in range(
                        thread,
                        iterable_2d.shape[0],
                        cpu_threads
                    ):
                        for j in range(iterable_2d.shape[1]):
                            numba_func(i, j, iterable_2d, *args)
                            
            if speed_mode != 'python':
                numba_func_parallel = numba.njit(nogil=True)(numba_func_parallel)

            def wrapper(iterable, *args):
                threads = []
                for thread_id in range(cpu_threads):
                    t = threading.Thread(
                        target=numba_func_parallel,
                        args=(thread_id, iterable, *args)
                    )
                    t.start()
                    threads.append(t)
                for t in threads:
                    t.join()
                    del t
            return functools.wraps(func)(wrapper)
    if _func is None:
        return parallel_compiled_func_inner
    else:
        return parallel_compiled_func_inner(_func)
    
def set_max_process(a):
    max_processes = psutil.cpu_count()
    new_max = min(a, max_processes)
    
    return new_max

def AlphaPool(a, *args, **kwargs):  
    max_processes = psutil.cpu_count()
    if a == -1:
        a = max_processes
    new_max = min(a, 50, max_processes)
    print(f"AlphaPool was set to {a} processes. Setting max to {new_max}.")

    return Pool(new_max)

In [None]:
#hide
from nbdev.export import *
notebook2script()