# Cython

One of the best ways to speed up Python code is to convert it into compiled C code. Luckily this is fairly easy, and we can even do it in a Jupyter notebook, which is handy for testing things (to use it in scripts we will need some extra steps).  First we install Cython using `conda install cython`, then we load the extension into the notebook:

In [1]:
%load_ext Cython

Now let's compare a normal Python function and its cython-ized version.  We will use the example of a function that calculates the Nth Fibonacci number:

In [None]:
def fib1(N):
    a,b = 0,1
    for i in range(N):
        a,b = b,a+b
    return a

In [None]:
%%cython
def fib2(N):
    a,b = 0,1
    for i in range(N):
        a,b = b,a+b
    return a

In [None]:
%%cython
def fib3(int N):
    cdef int i
    cdef int a=0,b=1
    for i in range(N):
        a,b = b,a+b
    return a

In [None]:
%timeit fib1(1000)
%timeit fib2(1000)
%timeit fib3(1000)

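So just adding `%%cython` gives us a ~2x speedup.  But if we also add types to our variables with `cdef`, this increases to a ~240x speedup!  Be aware that `fib3(1000)` no longer returns the right answer: the 1000th Fibonacci number is far too large for a C `int`, so the result overflows (see the Types section below).  Part of the speedup comes from no longer using Python's arbitrary-size integers.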
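
Outside a notebook the `%timeit` magic is not available, but the standard library `timeit` module gives a comparable measurement. The sketch below times only the plain Python `fib1`, since `fib2` and `fib3` need compiling first; the repeat counts and the variable names `n_calls` and `best` are arbitrary choices for illustration.

```python
import timeit


def fib1(N):
    # Plain Python version, identical to the cell above
    a, b = 0, 1
    for i in range(N):
        a, b = b, a + b
    return a


# Call fib1(1000) n_calls times per repeat and keep the fastest of 5 repeats.
n_calls = 1000
best = min(timeit.repeat(lambda: fib1(1000), number=n_calls, repeat=5))
print(f"fib1(1000): {best / n_calls * 1e6:.1f} microseconds per call")
```

Once the compiled versions exist, the same call with `fib2` or `fib3` in place of `fib1` gives the numbers to compare.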

This is because the function is dominated by the loop, which C can do much better.  The `%%cython` magic actually does something tricky in the background. It takes the cell and converts it to C code, then compiles it and stores the resulting compiled module in a temporary location.  We can see the actual C code generated using the annotate option by adding `-a` after the `%%cython`.  This gives us a window into how the code has been converted to C, with highlighting to show how much Python interaction is left for each line.   If we click the little '+' next to the line number it shows what that line has been converted to in C; the stronger the yellow, the more Python interaction remains.

We will come back to compilation later but let's look at the difference between writing cython and python code:

1. We don't have to do anything to cython-ise most python code.  We can put almost any python code through the cython compiler and it will work fine and usually run faster.

2. To access performance of C with Cython we usually only have to declare types using `cdef` and sometimes switch the default behaviour of some operations using simple flags.

3. In Cython we can now use all C libraries and easily access threaded parallelism by releasing the GIL.

So we see that there are very few differences.  Cython is a superset of Python so we don't have to change anything if we don't want to.  As Cython is effectively an optimisation tool, we should profile the code and only cythonise the slowest parts.  This is the main advantage. If you wanted to access the speed of C you would otherwise have to rewrite all your code in C, where lots of things can be significantly more difficult.  Instead we can use the convenience of Python for most of the code and only invoke C in the sections where performance is most important.
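
To find the slowest parts, the standard library `cProfile` module is usually enough. The sketch below is a minimal illustration: `cheap_step` and `main` are made-up stand-ins for the rest of a program, and `fib1` is the plain Python function from earlier.

```python
import cProfile
import io
import pstats


def fib1(N):
    a, b = 0, 1
    for i in range(N):
        a, b = b, a + b
    return a


def cheap_step():
    # Hypothetical inexpensive work done alongside the expensive call
    return sum(range(100))


def main():
    for _ in range(200):
        fib1(1000)
        cheap_step()


# Profile one run of main() and collect the statistics in memory.
profiler = cProfile.Profile()
profiler.enable()
main()
profiler.disable()

# Print the five entries with the largest cumulative time.
stream = io.StringIO()
stats = pstats.Stats(profiler, stream=stream)
stats.sort_stats("cumulative").print_stats(5)
report = stream.getvalue()
print(report)
```

In the report `fib1` should account for most of the time inside `main`, which marks it as the function worth cythonising.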

## Types
Using Cython in its basic form is pretty easy.  Let's look at the `cdef` statement a bit more. Here are the basic `cdef` types:

In [None]:
%%cython
cdef char i=1           # Oddly an 8 bit integer (-128 to 127) (it's enough to label all ASCII characters so can be used for strings)
cdef short j=2          # 16 bit integer (-32,768 to 32,767)
cdef int k=3            # 32 bit integer (-2,147,483,648 to 2,147,483,647)
cdef unsigned int l=4   # 32 bit +ve integer (0 to 4,294,967,295), "unsigned" can go in front of all integer types
cdef long int m=5       # 64 bit integer on most Linux/macOS systems (-9,223,372,036,854,775,808 to 9,223,372,036,854,775,807), only 32 bit on Windows
cdef float x=0.0        # 32 bit float (about 6 significant digits, max decimal exponent 38)
cdef double y = 0e0     # 64 bit float (about 15 significant digits, max decimal exponent 308)
cdef list a_list = [1,2,3]       # just a normal list (not much performance gain)
cdef dict a_dict = {'a':1,'b':2} # just a normal dict (not much performance gain)

Here once we define a type we have to stick with it, unlike Python, which dynamically changes types to accurately store any number you give it.  This means we are now in danger of overflow errors.  For example, if you declare `cdef short j` and then write `j = 200**2` you get:

In [None]:
%%cython
cdef short j
j = 200**2
print(j)

This is because 40,000 is larger than 32,767 so we wrap around to the negative part.  Similarly if we try:

In [None]:
%%cython
cdef unsigned int j
j = -1
print(j)

So we have to be a bit careful with our variables to avoid strange results.
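
The same wrap-around can be reproduced without compiling anything by using the standard library `ctypes` module, whose integer types do no overflow checking. This is only an illustration of the arithmetic, assuming the usual 16 bit `short` and 32 bit `unsigned int`; the variable names are made up.

```python
import ctypes

# 200**2 = 40000 does not fit in a signed 16 bit integer (max 32767),
# so it wraps around by 2**16: 40000 - 65536 = -25536.
wrapped_short = ctypes.c_short(200**2).value
print(wrapped_short)  # -25536

# -1 stored in an unsigned 32 bit integer wraps around to 2**32 - 1.
wrapped_uint = ctypes.c_uint(-1).value
print(wrapped_uint)  # 4294967295
```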

Strings are stored completely differently in C so there is no dedicated `cdef` type for them.  Instead they are just an array of `char`.  A `char*` is the address of the point in memory where the string begins. Also Python and C encode strings differently, so you have to `encode` and `decode` for them to be able to talk to each other.  It's best just to keep strings as Python variables.
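
The encode and decode round trip can be tried in plain Python first. The sketch below uses a made-up string with two non-ASCII characters to show that the number of bytes (what a `char*` would point at) can differ from the number of characters.

```python
# "naïve café" written with escapes so the code points are unambiguous
text = "na\u00efve caf\u00e9"

# str -> bytes: each non-ASCII character here takes two bytes in UTF-8
encoded = text.encode("utf-8")
print(len(text), len(encoded))  # 10 12

# bytes -> str: decoding with the same encoding recovers the original
decoded = encoded.decode("utf-8")
print(decoded == text)  # True
```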

In [None]:
%%cython
def test(input):
    input_byte = input.encode('utf-8')
    cdef char* c_string = input_byte
    cdef bytes py_string_byte = c_string
    output = py_string_byte.decode('utf-8')
    print(output)
test('Hello')

We can also use functions from the standard C math library with:

In [None]:
%%cython
from libc.math cimport sin

def sin_c(double x):
    return sin(x)

In [None]:
import math
x = 0.5
%timeit math.sin(x)
%timeit sin_c(x)

which are a bit faster.  To use numpy arrays we have to do:

In [None]:
%%cython
import numpy as np
cimport numpy as cnp

cdef cnp.ndarray array

but this won't give us all the speed improvement possible, because Cython doesn't know the shape and datatype of the array and so can't generate fast C code for accessing its elements.  Instead it is better to do:

In [None]:
%%cython
import numpy as np
cimport numpy as cnp

def matrix_dot(cnp.ndarray[cnp.int_t, ndim=2] array1, cnp.ndarray[cnp.int_t, ndim=2] array2):
    cdef cnp.ndarray[cnp.int_t, ndim=2] array3
    array3 = np.dot(array1,array2)
    return array3


In [None]:
import numpy as np
array1 = np.ones((100,100),dtype=np.int_)  # np.int was removed in NumPy 1.24
array2 = np.ones((100,100),dtype=np.int_)
%timeit array3 = matrix_dot(array1,array2)
%timeit array4 = np.dot(array1,array2)

Note: we had to put this declaration in a function because typed array declarations like this are only allowed for variables local to a function. You can also `cimport numpy as np`; I used a different name so you can see which is doing what.  Also NumPy is already written in C, so as expected wrapping it in Cython doesn't help.

We can also optimise the function call by specifying the return type.  For functions we have three choices: `def`, `cdef` and `cpdef`.  The first says it's callable from Python or Cython, the second is callable from Cython only with an optimised call, and the third is callable from both Python and Cython but optimised in the second case. If you use `cdef` or `cpdef` you should add the type of the return variable like below:

In [None]:
%%cython
import numpy as np
cimport numpy as cnp

cpdef cnp.ndarray[cnp.int_t, ndim=2] matrix_dot2(cnp.ndarray[cnp.int_t, ndim=2] array1, cnp.ndarray[cnp.int_t, ndim=2] array2):
    cdef cnp.ndarray[cnp.int_t, ndim=2] array3
    array3 = np.dot(array1,array2)
    return array3

## Cython for scripts

So using the `%%cython` magic is pretty cool but we can't use it in a script.  So how do we use Cython in our normal Python code?  It's a four step process (two more than normal):

1. Put your cython code in a file with extension `.pyx` like `cython_module.pyx`

2. Create a file called setup.py with the following:

In [None]:
from setuptools import setup  # distutils was removed in Python 3.12
from Cython.Build import cythonize

setup(
    ext_modules = cythonize("cython_module.pyx")
)

3. Now compile the code on the command line with:

In [None]:
python setup.py build_ext --inplace

4. Use the new cython functions with:

In [None]:
import cython_module as cym
cym.function_name(variables)

You are now free to use the functions in `cython_module` in Python.

**Example**
Do this with our simple Fibonacci function above.
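
As a starting point, the two files from steps 1 and 2 could be laid out with a short script like the one below. This is a sketch under assumptions: it writes into a temporary directory, `write_project` is a made-up helper name, and the setup file imports `setup` from `setuptools` because `distutils` was removed in Python 3.12. It does not build anything; step 3 still has to be run by hand.

```python
import tempfile
from pathlib import Path

# Contents of the Cython module: the typed Fibonacci function from earlier
PYX_SOURCE = '''\
def fib3(int N):
    cdef int i
    cdef int a = 0, b = 1
    for i in range(N):
        a, b = b, a + b
    return a
'''

# Contents of the build script from step 2
SETUP_SOURCE = '''\
from setuptools import setup
from Cython.Build import cythonize

setup(ext_modules=cythonize("cython_module.pyx"))
'''


def write_project(directory):
    """Write cython_module.pyx and setup.py into `directory`."""
    directory = Path(directory)
    pyx_path = directory / "cython_module.pyx"
    setup_path = directory / "setup.py"
    pyx_path.write_text(PYX_SOURCE)
    setup_path.write_text(SETUP_SOURCE)
    return pyx_path, setup_path


project_dir = tempfile.mkdtemp()
pyx_path, setup_path = write_project(project_dir)
print("Files written to", project_dir)
print("Next, in that directory: python setup.py build_ext --inplace")
```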


If we look in the directory we see two new files: `cython_module.c` and a `cython_module` `.so` file (the exact name usually includes a platform tag).  The `.c` file is the translation of our Cython code into C and the `.so` file is the compiled version of it.  If you open the `.c` file you will see that it is now about 2600 lines long.  Mostly it's definitions, with the actual calculation appearing around line 1070 and lasting about 80 lines.  It is clear from the `.c` code that it is doing a lot of the checks which Python does in the background, and which can slow the code down.  Again we can see how well we are doing by using the annotate option in our `setup.py` file:

In [None]:
%%file Code/setup.py
from setuptools import setup  # distutils was removed in Python 3.12
from Cython.Build import cythonize

setup(
    ext_modules = cythonize("cython_module.pyx", annotate=True)
)

This generates a `.html` file which shows us how much of our code has been converted to C.  It should have highlighted two lines: the `def fib3()` line and the `return a` line.  This is because we haven't specified what type the function should return.  We can correct this by changing the definition to `cpdef int fib3()`.  Now when we re-compile, the `return` line is white and the `def` line is a paler yellow.  The remaining yellow can't be removed, as we want the function to be available in Python so it must interact with it.

## Extensions
Now we have access to all of the functionality of C and C++.  This is a massive topic and I couldn't begin to address it here.  There are however a couple of options I will flag up for you to think about in future.

Here is a link to the compiler directives that can be specified in the setup file for all code, or using decorators (which we haven't discussed but are just lines above a function beginning with an @) for specific functions:
https://cython.readthedocs.io/en/latest/src/userguide/source_files_and_compilation.html#compiler-directives
Some common decorators are:
- @cython.boundscheck(False)  Remove checks that you are accessing valid array entries
- @cython.wraparound(False)   Remove the ability to use negative indexing in arrays
- @cython.cdivision(True)     Use C's version of division rather than Python's, so no more divide by zero errors

- @cython.profile(True)  This is necessary if you want to profile using cProfile

Turning these off and on can help you access more of the C speed by removing Python style checks.  If you turn the checks off, your code will usually just produce nonsense or explode when you do something wrong (like in C!) rather than raise an error (like in Python).  These can buy some speed but are only really important if they are blocking a loop from being converted to C (where it would vectorise), or if a particular loop contains only this type of calculation, but this is quite hard to set up. You probably don't need to worry about them much.
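
To make concrete which Python behaviours these directives switch off, here they are in plain Python. This is only an illustration of the checks themselves, not of Cython; the list `values` and the variable names are made up.

```python
values = [10, 20, 30]

# wraparound: a negative index counts back from the end of the list.
last = values[-1]
print(last)  # 30

# boundscheck: indexing past the end raises instead of reading stray memory.
try:
    values[3]
except IndexError as err:
    bounds_error = err
    print("IndexError:", err)

# cdivision: dividing by zero raises an error...
try:
    1 / 0
except ZeroDivisionError as err:
    zero_error = err
    print("ZeroDivisionError:", err)

# ...and integer division rounds down, where C would truncate towards zero.
floor_div = -7 // 2
print(floor_div)  # -4 (C integer division gives -3)
```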

The second is that we can now access thread parallelism through the Cython `prange` command.

In [13]:
%%cython
import numpy as np
cimport numpy as cnp
from cython.parallel import prange
import cython

@cython.cdivision(True)
@cython.boundscheck(False)
cpdef cnp.ndarray[cnp.double_t, ndim=1] func1(cnp.ndarray[cnp.double_t, ndim=1] Xin):
    cdef int i
    cdef int N = Xin.shape[0]
    cdef cnp.ndarray[cnp.double_t, ndim=1] Xout = np.empty_like(Xin)
    
    for i in range(N):
        Xout[i] = 1e0/Xin[i]
        
    return Xout

@cython.cdivision(True)
@cython.boundscheck(False)
cpdef cnp.ndarray[cnp.double_t, ndim=1] func2(cnp.ndarray[cnp.double_t, ndim=1] Xin):
    cdef int i
    cdef int N = Xin.shape[0]
    cdef cnp.ndarray[cnp.double_t, ndim=1] Xout = np.empty_like(Xin)
    
    for i in prange(N, nogil=True):
        Xout[i] = 1e0/Xin[i]
        
    return Xout

In [14]:
import numpy as np

Xin = np.random.random((10000))+1e0

%timeit func1(Xin)
%timeit func2(Xin)

39 µs ± 82.1 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
39.2 µs ± 87.2 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)


Note that the two timings above are essentially identical: that cell was compiled without the OpenMP flags (`--compile-args=-fopenmp --link-args=-fopenmp`), and without them `prange` runs as an ordinary serial loop.

Here you can easily run into issues if you want to sum numbers, as all threads have to access the same variable. Cython does make sure the answer is right (unlike in C) but the code becomes effectively serial, so it will run slower due to the overhead of creating the threads in the first place.  Try switching `out=i` to `out+=i` and run the timing again. Still, this can be an easy way to parallelise simple loops.  Note that this will not happen if there is any Python inside the loop; it has to be all Cython.
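
A common way to avoid contention on a shared total is a reduction: each thread adds into its own private subtotal and the subtotals are combined at the end. The sketch below illustrates only that idea in pure Python, with made-up helper names `partial_sum` and `chunked_sum`; it is not what Cython generates, and Python threads will not make this loop faster because of the GIL.

```python
from concurrent.futures import ThreadPoolExecutor


def partial_sum(bounds):
    # Each worker sums its own chunk into a private total
    start, stop = bounds
    total = 0
    for i in range(start, stop):
        total += i
    return total


def chunked_sum(n, n_workers=4):
    # Split range(n) into at most n_workers contiguous chunks
    step = max(1, -(-n // n_workers))  # ceiling division, at least 1
    chunks = [(lo, min(lo + step, n)) for lo in range(0, n, step)]
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        # Combine the private totals once all workers have finished
        return sum(pool.map(partial_sum, chunks))


print(chunked_sum(30))  # 435
```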

In [2]:
%%cython --compile-args=-fopenmp --link-args=-fopenmp
from cython.parallel import prange

cdef int i
cdef int n = 30
cdef int sum = 0

# for i in range(n):
for i in prange(n, nogil=True):
    sum += i

print(sum)

unable to execute 'gcc-8.2': No such file or directory


CompileError: command 'gcc-8.2' failed with exit status 1