# Speeding up ToFu with **cython**

Laura S. Mendoza

## Table of contents

1. Cython's brief overview
2. My work on ToFu
3. How to optimize code: what I've learned
4. Parallelization with Cython
5. CI in ToFu with Cython

## A Cython overview

- A python library and a **language** to interface Python with C
- Aims to be easy to write as Python language and fast as C/C++
- Cython code is translated to C/C++ and compiled to Python module
- Easy way to link C/C++ external librairies to your Python code
- Python is valid Cython code
- Allows you to "remove" the python layer and access/compute directly with lowlevel C code

In [1]:
import numpy as np
%load_ext cython

In [2]:
%%cython -f

import numpy as np

# cellule cython : %%cython -f
def basic_python_func(tab, tab_len, scalar):
    res = 0.
    for i in range(tab_len):
        res += tab[i] * scalar
    return res

## My work on Tofu

- Optimize core functions written in Cython
- Improve CI/CD of the library
- Improve documentation

## My work on Tofu

- Optimize core functions written in Cython
- Improve CI/CD of the library
- Improve documentation

*Example 1: optimize Ray-tracing algorithm*

<div>
<img src="2012_LM_cython_speedup_images/tokamak.png" width="700"/>
</div>
<div>
<img src="2012_LM_cython_speedup_images/tab_complete.png" width="700"/>
</div>

## My work on Tofu

- Optimize core functions written in Cython
- Improve CI/CD of the library
- Improve documentation

*Example 2: Optimization of spatial integration routines*

<div>
<img src="2012_LM_cython_speedup_images/spatial_integ.png" width="500"/>
</div>

# Speeding up Cython functions

## Basic tutorials and tips

For the basis, this is a list of documentation that I found useful:
* https://cython.readthedocs.io/en/latest/src/tutorial/cython_tutorial.html
* https://cython.readthedocs.io/en/latest/src/tutorial/
    
Some basics tips that will speed up your code significantly:
* **Type** your variables : *all* variables, functions inputs, local variables, global variables, etc.
* Minimize functions called from other python libraries (avoid overheads)
* Try defining all local function as `inline`
* Learn the difference between cdef, def, and pcdef
* If not necessary, release the GIL and make it explicit (ie. use `nogil`)
* Use function decorators (e.g. ` @cython.boundscheck(False) `)
* Prefer using C libraries for math computations (from `libmath `: `sin`, `sqrt`, ...)  

## Anotating your code to see potential bottlenecks

In [3]:
%%cython --annotate
cimport cython
cimport numpy as np

def untyped_func(tab, tab_len, scalar):
    res = 0.
    for i in range(tab_len):
        res += tab[i] * scalar
    return res

def somewhat_typed_func(np.ndarray[double, ndim=1, mode="c"] tab not None, int tab_len, double scalar):
    cdef double res = 0.
    cdef int i
    for i in range(tab_len):
        res += tab[i] * scalar
    return res

@cython.boundscheck(False)  # Deactivate bounds checking
@cython.wraparound(False)   # Deactivate negative indexing.
cdef double typed_func(double[::1] tab, int tab_len, double scalar) nogil:
    cdef double res = 0.
    cdef int i
    for i in range(tab_len):
        res += tab[i] * scalar
    return res

<div>
<img src="2012_LM_cython_speedup_images/annotations.png" width="500"/>
</div>

The last version has virtually no yellow, thus less interfacing done by Cython, thus in theory is faster

### let's benchmark it

In [4]:
%%cython
import time
import sys

cimport cython
cimport numpy as np
import numpy as np


def untyped_func(tab, tab_len, scalar):
    res = 0.
    for i in range(tab_len):
        res += tab[i] * scalar
    return res

def somewhat_typed_func(np.ndarray[double, ndim=1, mode="c"] tab not None, int tab_len, double scalar):
    cdef double res = 0.
    cdef int i
    for i in range(tab_len):
        res += tab[i] * scalar
    return res

@cython.boundscheck(False)  # Deactivate bounds checking
@cython.wraparound(False)   # Deactivate negative indexing.
cdef double typed_func(double[::1] tab, int tab_len, double scalar) nogil:
    cdef double res = 0.
    cdef int i
    for i in range(tab_len):
        res += tab[i] * scalar
    return res

@cython.boundscheck(False)  # Deactivate bounds checking
@cython.wraparound(False)   # Deactivate negative indexing.
cdef inline double inline_typed_func(double[::1] tab, int tab_len, double scalar) nogil:
    cdef double res = 0.
    cdef int i
    for i in range(tab_len):
        res += tab[i] * scalar
    return res

cdef int L, i, loops = 1000
cdef double start, end, res
for L in [1000, 10000, 100000]:
    np_array = np.ones(L)
    print("For L = ", L)
    start = time.clock()
    res = untyped_func(np_array, L, 2.)
    end = time.clock()
    print(format((end-start) / loops * 1e6, "2f"), end=" ")
    sys.stdout.flush()
    print("μs, using the untyped_func")
    # ..................................................
    start = time.clock()
    res = somewhat_typed_func(np_array, L, 2.)
    end = time.clock()
    print(format((end-start) / loops * 1e6, "2f"), end=" ")
    sys.stdout.flush()
    print("μs, using the somewhat_typed_func")
    # ..................................................
    start = time.clock()
    res = typed_func(np_array, L, 2.)
    end = time.clock()
    print(format((end-start) / loops * 1e6, "2f"), end=" ")
    sys.stdout.flush()
    print("μs, using the typed_func")
    # ..................................................
    start = time.clock()
    res = inline_typed_func(np_array, L, 2.)
    end = time.clock()
    print(format((end-start) / loops * 1e6, "2f"), end=" ")
    sys.stdout.flush()
    print("μs, using the inline_typed_func")

For L =  1000
0.146000 μs, using the untyped_func
0.006000 μs, using the somewhat_typed_func
0.005000 μs, using the typed_func
0.005000 μs, using the inline_typed_func
For L =  10000
1.629000 μs, using the untyped_func
0.015000 μs, using the somewhat_typed_func
0.005000 μs, using the typed_func
0.005000 μs, using the inline_typed_func
For L =  100000
15.093000 μs, using the untyped_func
0.191000 μs, using the somewhat_typed_func
0.006000 μs, using the typed_func
0.005000 μs, using the inline_typed_func


## Handling numpy arrays

- Reading and writing from numpy arrays can be slow in cython. Some tutorials mentioned using memory views, other mention that C array give a clear improvement, and overall several different solutions are mentioned.
- A good benchmark: https://stackoverflow.com/questions/18462785/what-is-the-recommended-way-of-allocating-memory-for-a-typed-memory-view

In [5]:
%%cython
import time
import sys

from cpython.array cimport array, clone
from cython.view cimport array as cvarray
from libc.stdlib cimport malloc, free
import numpy as np
cimport numpy as np

cdef int loops

def timefunc(name):
    def timedecorator(f):
        cdef int L, i

        print("Running", name)
        for L in [1, 10, 100, 1000, 10000, 100000, 500000]:
            np_array = np.ones(L)
            start = time.clock()
            res_array = f(L, np_array)
            end = time.clock()
            print(format((end-start) / loops * 1e6, "2f"), end=" ")
            sys.stdout.flush()

        print("μs")
    return timedecorator

print()
print("-------- TESTS -------")
loops = 3000


@timefunc("numpy buffers")
def _(int L, np.ndarray[double, ndim=1, mode="c"] np_array not None):
    cdef int i, j
    cdef double d
    cdef np.ndarray arr = np.zeros_like(np_array)
    for i in range(loops):
        arr = np_array
        for j in range(L):
            d = arr[j]
            arr[j] = d*2.
    return arr
    
@timefunc("cpython.array buffer")
def _(int L, np.ndarray[double, ndim=1, mode="c"] np_array not None):
    cdef int i, j
    cdef double d
    cdef array[double] arr, template = array('d')

    for i in range(loops):
        arr = clone(template, L, False)
        for j in range(L):
            d =  arr[j]
            arr[j] = d*2.
    return np.asarray(arr)


@timefunc("cpython.array memoryview")
def _(int L, np.ndarray[double, ndim=1, mode="c"] np_array not None):
    cdef int i, j
    cdef double d
    cdef double[::1] arr

    for i in range(loops):
        arr = np_array
        for j in range(L):
            # usage
            d = arr[j]
            arr[j] = d*2.
    return arr
    

@timefunc("cpython.array raw C type with trick")
def _(int L, np.ndarray[double, ndim=1, mode="c"] np_array not None):
    cdef int i
    cdef array arr, template = array('d')

    for i in range(loops):
        arr = clone(template, L, False)
        for j in range(L):
            # initialization
            arr.data.as_doubles[j] = np_array[j]
            # usage
            d = arr.data.as_doubles[j]
            arr.data.as_doubles[j] = d*2.
    # Prevents dead code elimination
    return np.asarray(arr)


@timefunc("C pointers")
def _(int L, np.ndarray[double, ndim=1, mode="c"] np_array not None):
    cdef int i
    cdef double* arrptr

    for i in range(loops):
        arrptr = <double*>malloc(L*sizeof(double))
        for j in range(L):
            arrptr[j] = np_array[j]
            d = arrptr[j]
            arrptr[j] = d*2.
        free(arrptr)
    return np_array # np.asarray(<double[:L]>arrptr)


-------- TESTS -------
Running numpy buffers
0.093000 0.655667 6.197333 57.957333 575.113000 5822.956667 30140.638667 μs
Running cpython.array buffer
0.081333 0.068333 0.128333 0.645333 5.425000 50.012667 249.452333 μs
Running cpython.array memoryview
0.238000 0.218667 0.265333 0.742667 5.505667 52.231333 254.715000 μs
Running cpython.array raw C type with trick
0.048333 0.033333 0.100667 0.624667 5.489667 54.005333 282.038667 μs
Running C pointers
0.024667 0.019667 0.071000 0.603000 5.436000 52.125000 252.807000 μs


**In conclusion:**
 For all cases, you will gain a 2x factor speed up by using Memoryview or C pointers. Since the memory is already allocated for the numpy array, it is not necessary to use `malloc`. We will adopt the following declaration:
~~~~
cdef double* arrptr
arrptr = <double*> np_array.data
~~~~
 
**However** avoid using pointers if you need to return a numpy array since the casting of return is extremelly time consuming, if you are aming for readability and easy to write: use memoryviews (to initialize using `array`, which means only 1D arrays) 

## Parallelization: do and donts

- Edit your `setup.py` to use OpenMP
- Add compilation flags for faster math operations
- To parallelize a loop using OpenMP it's easy: `range` => `prange`
- Declare functions as `nogil` and use `with nogil` loops to explicitly declare that the gil is released
- "*The Python Global Interpreter Lock or GIL, in simple words, is a mutex (or a lock) that allows only one thread to hold the control of the Python interpreter.*"

In [6]:
import Cython.Compiler.Options as CO
CO.extra_compile_args = ["-O3", "-ffast-math", "-march=native", "-fopenmp" ]
CO.extra_link_args = ['-fopenmp']

Be careful with:
- parrallelizing depending loops: `tab[j+1] = tab[j]...`
- reductions (errors on Mac with clang): `i+=1`
- release the **gil**!! Don't use Python objects (some excepetions since Cython 3.0 late 2019), 
- always use appropiated flags: `boundscheck=False`, ...
- using the right scheduler (in my experience `dynamic` default one is the most reliable in different machines)
- `malloc` statements have to be done in `with parallel` loop

With cython you can also you directly **OpenMP** functions.


In [7]:
%%cython --compile=-fopenmp --link-args=-fopenmp

cimport cython

from cython.parallel cimport parallel, prange
from cython.parallel cimport threadid
from libc.stdio cimport stdout, fprintf
import time
import sys

from cpython.array cimport array, clone
from cython.view cimport array as cvarray
from libc.stdlib cimport malloc, free


@cython.boundscheck(False)  # Deactivate bounds checking
@cython.wraparound(False)   # Deactivate negative indexing.
cdef inline void seq_func(int L, double* arrptr):
    cdef int j

    for j in range(L):
        arrptr[j] = 2.0*arrptr[j]
    return

@cython.boundscheck(False)  # Deactivate bounds checking
@cython.wraparound(False)   # Deactivate negative indexing.
cdef inline void bad_par_func(int L, double* arrptr):
    cdef Py_ssize_t j
    cdef double d

    with nogil, parallel():
        arrptr[0] = 0
        for j in prange(1, L-1):
            # or any other operation that doesn't allow to the code parallelized
            arrptr[j+1] = 2.0*arrptr[j]-arrptr[j-1]
        arrptr[L-1] = 0
    return

@cython.boundscheck(False)  # Deactivate bounds checking
@cython.wraparound(False)   # Deactivate negative indexing.
cdef inline void good_par_func(int L, double* arrptr) nogil:
    cdef Py_ssize_t j

    for j in prange(L, nogil=True):
        arrptr[j] = 2.0*arrptr[j]
    return


cdef int L, i, loops = 1000, ilps
cdef double start, end, res
cdef double t0=0.0, t1=0.0, t2=0.0
cdef double* tab
for L in [1000, 10000, 100000]:
    tab = <double *> malloc(sizeof(double) * L)
    print("For L = ", L)
    for ilps in range(loops):
        # ..................................................
        start = time.clock()
        seq_func(L, tab)
        end = time.clock()
        t0 += (end - start) / loops
        # ..................................................
        start = time.clock()
        bad_par_func(L, tab)
        end = time.clock()
        t1 += (end - start) / loops
        # ..................................................
        start = time.clock()
        good_par_func(L, tab)
        end = time.clock()
        t2 += (end - start) / loops

    print(format(t0 * 1e6, "2f"), "μs, using the sequential loop")
    print(format(t1 * 1e6, "2f"), "μs, using the parallel 1 loop")
    print(format(t2 * 1e6, "2f"), "μs, using the parallel 2 loop")
    
    free(tab)

For L =  1000
48.983000 μs, using the sequential loop
92.317000 μs, using the parallel 1 loop
50.300000 μs, using the parallel 2 loop
For L =  10000
148.093000 μs, using the sequential loop
201.716000 μs, using the parallel 1 loop
56.730000 μs, using the parallel 2 loop
For L =  100000
1549.256000 μs, using the sequential loop
310.969000 μs, using the parallel 1 loop
77.598000 μs, using the parallel 2 loop


## One more tip

A benchmark showing the best way to declare a local small array

In [8]:
%%cython --compile-args=-openmp --link-args=-openmp -a

cimport cython

from cython.parallel import parallel, prange
from libc.stdlib cimport abort, malloc, free
import time, sys
import numpy as np
cimport numpy as np


cdef int loops

def timefunc(name):
    def timedecorator(f):
        cdef int L, i
        cdef np.ndarray np_array
        cdef np.ndarray[double] global_buf

        print("Running", name)
        for L in [10000, 1000000]:
            np_array = np.ones(L)
            global_buf = np_array
            start = time.clock()
            f(global_buf, L, <int>(L/2))
            end = time.clock()
            print(format((end-start) / loops * 1e6, "2f"), end=" ")
            sys.stdout.flush()

        print("μs")
    return timedecorator

print()
print("-------- TESTS -------")
loops = 1000

@cython.boundscheck(False)  # Deactivate bounds checking
@cython.wraparound(False)   # Deactivate negative indexing.
@timefunc("Static allocation for n=2")
def _(double[::1] global_buf not None, int n, int n2):
    cdef double[2] local_buf
    cdef int idx, i

    with nogil, parallel():
        for i in range(loops):
            for idx in prange(n2, schedule='guided'):
                local_buf[0] = global_buf[idx*2]
                local_buf[1] = global_buf[idx*2+1]
                func(local_buf)
    return

@cython.boundscheck(False)  # Deactivate bounds checking
@cython.wraparound(False)   # Deactivate negative indexing.
@timefunc("Dynamic allocation for n=2")
def _(double[::1] global_buf not None, int n, int n2):
    cdef double* local_buf
    cdef int idx, i

    with nogil, parallel():
        for i in range(loops):
            local_buf = <double *> malloc(sizeof(double) * 2)
            for idx in prange(n2, schedule='guided'):
                local_buf[0] = global_buf[idx*2]
                local_buf[1] = global_buf[idx*2+1]
                func(local_buf)
            free(local_buf)

@cython.boundscheck(False)  # Deactivate bounds checking
@cython.wraparound(False)   # Deactivate negative indexing.
@timefunc("Static allocation for n=4")
def _(double[::1] global_buf not None, int n, int n2):
    cdef double[4] local_buf
    cdef int idx, i, n4 = <int> (n2/2)

    with nogil, parallel():
        for i in range(loops):
            for idx in prange(n4, schedule='guided'):
                local_buf[0] = global_buf[idx*4]
                local_buf[1] = global_buf[idx*4+1]
                local_buf[2] = global_buf[idx*4+2]
                local_buf[3] = global_buf[idx*4+3]
                func(local_buf)
    return

@cython.boundscheck(False)  # Deactivate bounds checking
@cython.wraparound(False)   # Deactivate negative indexing.
@timefunc("Dynamic allocation for n=4")
def _(double[::1] global_buf not None, int n, int n2):
    cdef double* local_buf
    cdef int idx, i, n4 = <int> (n2/2)

    with nogil, parallel():
        for i in range(loops):
            local_buf = <double *> malloc(sizeof(double) * 4)
            for idx in prange(n4, schedule='guided'):
                local_buf[0] = global_buf[idx*4]
                local_buf[1] = global_buf[idx*4+1]
                local_buf[2] = global_buf[idx*4+2]
                local_buf[3] = global_buf[idx*4+3]
                func(local_buf)
            free(local_buf)
        
# ==============================================================================
# test function
cdef void func(double* local_buf) nogil:
    cdef int i=0
    return

LinkError: command 'gcc' failed with exit status 1

**Conclusion:** It might seem counter-intuitive but using a dynamic `malloc` (and `free`-ing accordingly) instead of declaring an array statically, will improve the performance of your code.

## Packaging a code with Cython

- Cython has to be added to your `install_requires` in `setup.py`
- All cython modules have to be compiled in `setup.py`
- Not all machines will have OpenMP installed: 
    * check for the OS of the machine to determine compiler
    * determine openmp flag
    * compile sall OpenMP snippet
    * check if errors (OpenMP not installed or not properlly installed)
    * compile your lib accordingly (don't forget the flags)
    * set a global variable to use or not OpenMP: if not run all code sequentially

