# Cython Speed-up notes

When I started coding in *Cython*, I came across a number of tips and tricks on what (not) to do. This is a collection of them.

## Basic tutorials and tips

For the basics, here is the documentation that I found useful:
* https://cython.readthedocs.io/en/latest/src/tutorial/cython_tutorial.html
* https://cython.readthedocs.io/en/latest/src/tutorial/
    
Some basic tips that will speed up your code significantly:
* **Type** your variables: *all* of them, including function arguments, local variables, and global variables.
* Minimize calls to functions from other Python libraries (each call adds interpreter overhead)
* Try defining all local functions as `inline`
* Learn the difference between `cdef`, `def`, and `cpdef`
* If a function does not need the GIL, release it and make that explicit (i.e. use `nogil`)
* Use compiler directives as function decorators (e.g. `@cython.boundscheck(False)`)

## Some examples

### Typing variables

An easy way to check whether all variables are typed (correctly) is to look at the annotated version of your source code. Let us compare the following three functions.

In [7]:
%load_ext Cython

The Cython extension is already loaded. To reload it, use:
  %reload_ext Cython


In [13]:
%%cython --annotate
import time
import sys

cimport cython
cimport numpy as np
import numpy as np


def untyped_func(tab, tab_len, scalar):
    res = 0.
    for i in range(tab_len):
        res += tab[i] * scalar
    return res

def somewhat_typed_func(np.ndarray[double, ndim=1, mode="c"] tab not None, int tab_len, double scalar):
    cdef double res = 0.
    cdef int i
    for i in range(tab_len):
        res += tab[i] * scalar
    return res

@cython.boundscheck(False)  # Deactivate bounds checking
@cython.wraparound(False)   # Deactivate negative indexing.
cdef double typed_func(double[::1] tab, int tab_len, double scalar) nogil:
    cdef double res = 0.
    cdef int i
    for i in range(tab_len):
        res += tab[i] * scalar
    return res

In the annotated output, the third function, `typed_func`, has far fewer yellow lines. Yellow marks lines that interact with the Python interpreter, so less yellow generally means faster code. Let's benchmark them.

In [17]:
%%cython
import time
import sys

cimport cython
cimport numpy as np
import numpy as np


def untyped_func(tab, tab_len, scalar):
    res = 0.
    for i in range(tab_len):
        res += tab[i] * scalar
    return res

def somewhat_typed_func(np.ndarray[double, ndim=1, mode="c"] tab not None, int tab_len, double scalar):
    cdef double res = 0.
    cdef int i
    for i in range(tab_len):
        res += tab[i] * scalar
    return res

@cython.boundscheck(False)  # Deactivate bounds checking
@cython.wraparound(False)   # Deactivate negative indexing.
cdef double typed_func(double[::1] tab, int tab_len, double scalar) nogil:
    cdef double res = 0.
    cdef int i
    for i in range(tab_len):
        res += tab[i] * scalar
    return res

@cython.boundscheck(False)  # Deactivate bounds checking
@cython.wraparound(False)   # Deactivate negative indexing.
cdef inline double inline_typed_func(double[::1] tab, int tab_len, double scalar) nogil:
    cdef double res = 0.
    cdef int i
    for i in range(tab_len):
        res += tab[i] * scalar
    return res

cdef int L, i, loops = 1000
cdef double start, end, res
# NOTE: each function is called once per measurement, so dividing by `loops`
# only rescales the single-call time; it does not average over repeated calls.
for L in [1000, 10000, 100000]:
    np_array = np.ones(L)
    print("For L = ", L)
    start = time.perf_counter()
    res = untyped_func(np_array, L, 2.)
    end = time.perf_counter()
    print(format((end-start) / loops * 1e6, "2f"), end=" ")
    sys.stdout.flush()
    print("μs, using the untyped_func")
    # ..................................................
    start = time.perf_counter()
    res = somewhat_typed_func(np_array, L, 2.)
    end = time.perf_counter()
    print(format((end-start) / loops * 1e6, "2f"), end=" ")
    sys.stdout.flush()
    print("μs, using the somewhat_typed_func")
    # ..................................................
    start = time.perf_counter()
    res = typed_func(np_array, L, 2.)
    end = time.perf_counter()
    print(format((end-start) / loops * 1e6, "2f"), end=" ")
    sys.stdout.flush()
    print("μs, using the typed_func")
    # ..................................................
    start = time.perf_counter()
    res = inline_typed_func(np_array, L, 2.)
    end = time.perf_counter()
    print(format((end-start) / loops * 1e6, "2f"), end=" ")
    sys.stdout.flush()
    print("μs, using the inline_typed_func")

For L =  1000
0.294000 μs, using the untyped_func
0.029000 μs, using the somewhat_typed_func
0.029000 μs, using the typed_func
0.013000 μs, using the inline_typed_func
For L =  10000
3.984000 μs, using the untyped_func
0.024000 μs, using the somewhat_typed_func
0.090000 μs, using the typed_func
0.021000 μs, using the inline_typed_func
For L =  100000
27.526000 μs, using the untyped_func
0.149000 μs, using the somewhat_typed_func
0.015000 μs, using the typed_func
0.012000 μs, using the inline_typed_func
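
Some of the typed timings above jump around between sizes (for example, `typed_func` is slower for L = 10000 than for L = 100000), which is typical when a measurement is very short. A common remedy is to repeat the call and report the mean time per call. The sketch below shows that pattern in plain Python; it is only an illustration, not part of the benchmark above, and the names `untyped_sum` and `time_per_call_us` are hypothetical.

```python
import time


def untyped_sum(tab, scalar):
    """Pure-Python stand-in for the kernels above (hypothetical helper)."""
    res = 0.0
    for x in tab:
        res += x * scalar
    return res


def time_per_call_us(func, *args, loops=1000):
    """Return the mean wall-clock time of one call to func, in microseconds."""
    start = time.perf_counter()
    for _ in range(loops):
        func(*args)
    end = time.perf_counter()
    return (end - start) / loops * 1e6


for L in (1000, 10000):
    data = [1.0] * L
    mean_us = time_per_call_us(untyped_sum, data, 2.0, loops=100)
    print("L =", L, format(mean_us, ".2f"), "μs per call")
```

Averaging over many calls smooths out timer resolution and one-off effects, at the cost of a longer benchmark run.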


## Working with numpy arrays in I/O

The first challenge I faced was handling NumPy arrays. The Cython part of our code takes NumPy arrays as input and should return NumPy arrays as well. However, reading from and writing to NumPy arrays can be slow in Cython. Some tutorials recommend memoryviews, others report that C arrays give a clear improvement, and overall several different solutions are in circulation. A StackOverflow answer gives a good benchmark of these solutions for code that only needs to create arrays (without taking any input) and return a NumPy array:
https://stackoverflow.com/questions/18462785/what-is-the-recommended-way-of-allocating-memory-for-a-typed-memory-view

Here, however, we need to focus on copying and accessing the data of an existing NumPy array.

In [1]:
%load_ext Cython

In [5]:
%%cython
import time
import sys

from cpython.array cimport array, clone
from cython.view cimport array as cvarray
from libc.stdlib cimport malloc, free
import numpy as np
cimport numpy as np

cdef int loops

def timefunc(name):
    def timedecorator(f):
        cdef int L, i

        print("Running", name)
        for L in [1, 10, 100, 1000, 10000, 100000, 1000000]:
            np_array = np.ones(L)
            start = time.perf_counter()
            res_array = f(L, np_array)
            end = time.perf_counter()
            print(format((end-start) / loops * 1e6, "2f"), end=" ")
            sys.stdout.flush()

        print("μs")
    return timedecorator

print()
print("-------- TESTS -------")
loops = 3000


@timefunc("numpy buffers")
def _(int L, np.ndarray[double, ndim=1, mode="c"] np_array not None):
    cdef int i, j
    cdef double d
    for i in range(loops):
        for j in range(L):
            d = np_array[j]
            np_array[j] = d*0.
    # Prevents dead code elimination
    str(np_array[0])
    return np_array
    
@timefunc("cpython.array buffer")
def _(int L, np.ndarray[double, ndim=1, mode="c"] np_array not None):
    cdef int i, j
    cdef double d
    cdef array[double] arr, template = array('d')

    for i in range(loops):
        arr = clone(template, L, False)
        for j in range(L):
            # initialization
            arr[j] = np_array[j]
            # access
            d = arr[j]
            arr[j] = d*2.
    # Prevents dead code elimination
    return np.asarray(arr)


@timefunc("cpython.array memoryview")
def _(int L, np.ndarray[double, ndim=1, mode="c"] np_array not None):
    cdef int i, j
    cdef double d
    cdef double[::1] arr

    for i in range(loops):
        arr = np_array
        for j in range(L):
            # usage
            d = arr[j]
            arr[j] = d*0.
    # Prevents dead code elimination
    return np_array
    

@timefunc("cpython.array raw C type with trick")
def _(int L, np.ndarray[double, ndim=1, mode="c"] np_array not None):
    cdef int i, j
    cdef double d
    cdef array arr, template = array('d')

    for i in range(loops):
        arr = clone(template, L, False)
        for j in range(L):
            # initialization
            arr.data.as_doubles[j] = np_array[j]
            # usage
            d = arr.data.as_doubles[j]
            arr.data.as_doubles[j] = d*2.
    # Prevents dead code elimination
    return np.asarray(arr)


@timefunc("C pointers")
def _(int L, np.ndarray[double, ndim=1, mode="c"] np_array not None):
    cdef int i, j
    cdef double d
    cdef double* arrptr

    for i in range(loops):
        arrptr = <double*> np_array.data
        for j in range(L):
            d = arrptr[j]
            arrptr[j] = d*0.

    return np_array

@timefunc("malloc memoryview")
def _(int L, np.ndarray[double, ndim=1, mode="c"] np_array not None):
    cdef int i, j
    cdef double d
    cdef double* arrptr
    cdef double[::1] arr

    for i in range(loops):
        arrptr = <double*> np_array.data
        arr = <double[:L]>arrptr
        for j in range(L):
            d = arrptr[j]
            arrptr[j] = d*0.

    return np_array

@timefunc("argument memoryview")
def _(int L, double[::1] np_array not None):
    cdef int i, j
    cdef double d

    for i in range(loops):
        for j in range(L):
            # usage
            d = np_array[j]
            np_array[j] = d*0.
    # Prevents dead code elimination
    return np_array


-------- TESTS -------
Running numpy buffers
0.012333 0.015000 0.086667 0.693000 5.882000 61.836667 699.461333 μs
Running cpython.array buffer
0.118333 0.112000 0.454667 1.292667 9.133333 94.626667 2712.874000 μs
Running cpython.array memoryview
1.057667 0.921333 1.100667 1.807000 7.421333 66.377667 698.120333 μs
Running cpython.array raw C type with trick
0.064667 0.082667 0.413000 1.584667 13.355333 132.008000 3072.307667 μs
Running C pointers
0.005000 0.007333 0.023333 0.253000 3.104333 34.064333 467.816000 μs
Running malloc memoryview
0.903000 0.998667 0.931000 1.401333 3.445667 34.143333 481.790667 μs
Running argument memoryview
0.011333 0.014667 0.088333 0.794667 6.753000 62.516667 700.184333 μs
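
To compare two variants, it helps to turn the raw timings into ratios. The short sketch below does this for the "numpy buffers" and "C pointers" rows, with the numbers copied from the output above; the variable names are only for illustration.

```python
# Array sizes used in the benchmark above.
sizes = [1, 10, 100, 1000, 10000, 100000, 1000000]

# Timings in microseconds, copied from the output above.
numpy_buffers = [0.012333, 0.015000, 0.086667, 0.693000,
                 5.882000, 61.836667, 699.461333]
c_pointers = [0.005000, 0.007333, 0.023333, 0.253000,
              3.104333, 34.064333, 467.816000]

# Ratio > 1 means the C pointer variant was faster for that size.
speedups = [nb / cp for nb, cp in zip(numpy_buffers, c_pointers)]

for L, s in zip(sizes, speedups):
    print(f"L = {L:>7}: C pointer is {s:.2f}x faster than the NumPy buffer")
```

The ratio shrinks for the largest arrays, where the run time is presumably dominated by memory traffic rather than by indexing overhead.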


**In conclusion:**
In these runs, the C pointer was the fastest variant for every array size, between about 1.5x and 3.7x faster than the NumPy buffer syntax. Since the memory is already allocated by the NumPy array, there is no need to call `malloc`. We will adopt the following declaration:
~~~~
cdef double* arrptr
arrptr = <double*> np_array.data
~~~~

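Note that in all functions the array argument is typed in the function header. For the pointer version this matters: `np.ndarray[double, ndim=1, mode="c"]` only accepts a C-contiguous array of doubles, which is exactly what the cast to `double*` assumes.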
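
Indexing a raw pointer as `arrptr[j]` reads the j-th double after the start of the buffer, so it is only correct when the elements are stored back to back. As a small illustration of what "contiguous" means, the sketch below uses the standard-library `array` and `memoryview` instead of NumPy (my substitution, to keep the example dependency-free); a strided view of the same data is not contiguous and could not be walked with a plain pointer.

```python
from array import array

# Six doubles stored back to back in one buffer.
buf = array('d', [0.0, 1.0, 2.0, 3.0, 4.0, 5.0])

full = memoryview(buf)          # every element
strided = memoryview(buf)[::2]  # every second element, same memory

# Contiguous view: consecutive elements are one item size apart.
print(full.c_contiguous, full.strides)

# Strided view: elements are two item sizes apart, so pointer[j]
# would land on the wrong values.
print(strided.c_contiguous, strided.strides)
print(strided.tolist())
```

In Cython, the `mode="c"` buffer declaration and the `double[::1]` memoryview type perform this check when the function is called.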


## Parallelization and arrays


After optimizing the serial code, the obvious next step to speed it up is to parallelize it. From the documentation this looks quite easy, but I discovered a few things to keep in mind. Let's start with an example taken from the official Cython documentation: https://cython.readthedocs.io/en/latest/src/userguide/parallelism.html

In [36]:
%load_ext Cython
import Cython.Compiler.Options as CO
CO.extra_compile_args = ['-fopenmp']
CO.extra_link_args = ['-fopenmp']



In [40]:
%%cython --compile-args=-fopenmp --link-args=-fopenmp --force

from cython.parallel import parallel, prange
from libc.stdlib cimport abort, malloc, free

cdef Py_ssize_t idx, i, n = 100
cdef int * local_buf
cdef size_t size = 10

with nogil, parallel():
    local_buf = <int *> malloc(sizeof(int) * size)
    if local_buf == NULL:
        abort()

    # populate our local buffer in a sequential loop
    for i in xrange(size):
        local_buf[i] = i * 2

    # share the work using the thread-local buffer(s)
    for idx in prange(n, schedule='guided'):
        func(local_buf)

    free(local_buf)


# I just simply added this to test it
cdef void func(int* local_buf) nogil:
    cdef int i=0
    return

CompileError: command 'gcc' failed with exit status 1