# **Compiling Python code to C**




Besides restructuring the code itself, the most direct way to make a program faster is to have it execute fewer CPU instructions. Python offers several routes to that: compiling to C with Cython, LLVM (Low Level Virtual Machine) based compilation with Numba, and PyPy, an alternative interpreter with a built-in JIT (Just In Time) compiler.

Unfortunately, each of these approaches has downsides. Cython, for example, requires the developer to write the hot code segments in a hybrid of C and Python.

> Not every program gets faster when compiled. Code that is dominated by I/O operations or by calls into external libraries will not benefit much. Code that repeats the same operations many times, such as tight loops, gains the most.

<br>

### JIT (Just In Time) vs AOT (Ahead of Time) compilers

* With an AOT compiler, we build a static library ahead of time that is specialized for the machine it is compiled on. If the code uses numpy, scipy, etc., their compiled parts are likewise built ahead of time as required.
* A JIT compiler compiles the required parts at the time of use. This is very easy to use and needs little manual intervention, but it comes with a "cold start" cost: the first calls run slowly while the code is still being compiled.




Since Python is dynamically typed, a function has to be ready for whatever type of input it receives, and this generality makes programs run less efficiently. Declaring the data types up front therefore lets a compiler cut the execution time further.
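To make the cost of dynamic typing concrete, here is a small sketch (the function name `add` is illustrative and not part of the original code). The same bytecode serves every argument type, so the interpreter has to work out what `+` means on each call; a typed, compiled version can skip that lookup.

```python
import dis


def add(a, b):
    # One generic "add" instruction; the operand types are only known at run time.
    return a + b


# The same function body works for ints, floats, complex numbers, strings and lists,
# because the interpreter dispatches on the operand types on every call.
results = [add(1, 2), add(1.5, 2.5), add(1 + 2j, 3j), add("ab", "cd"), add([1], [2])]

if __name__ == "__main__":
    print(results)
    dis.dis(add)  # the bytecode contains a single generic binary-add instruction
```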

## Cython

Cython is a compiler that converts type-annotated Python into a compiled extension module. This extension can then be loaded with `import` just like any other Python module.

To install it we need a C compiler (for example gcc, MinGW or Visual C) and the Python package itself: `pip install Cython`.
[More details regarding the installation](https://cython.readthedocs.io/en/latest/src/quickstart/install.html)


Below is a sample usage of Cython for our earlier Julia set problem.

* juliaset.py --> initializes the input lists and calls the calculation part.
* cythonfn.pyx --> contains the CPU-bound calculation part, which we need to annotate.
* setup.py --> contains the build instructions.
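As a rough sketch of what the driver file from the list above might look like: it builds the input lists and times the calculation. The grid bounds, the constant `c`, and the helper names `build_inputs` and `run` are assumptions for illustration, not the original code. The pure-Python fallback is there only so the sketch runs before `cythonfn` has been compiled.

```python
import time

try:
    # Only available once cythonfn.pyx has been compiled.
    from cythonfn import calculate_juliaset_serial_ctypes as calculate
except ImportError:
    def calculate(maxiter, zs, cs):
        """Pure-Python fallback using the same Julia update rule."""
        output = [0] * len(zs)
        for i in range(len(zs)):
            n = 0
            z = zs[i]
            c = cs[i]
            while abs(z) < 2 and n < maxiter:
                z = z * z + c
                n += 1
            output[i] = n
        return output

# Assumed problem setup: a square grid in the complex plane and one fixed constant c.
X1, X2, Y1, Y2 = -1.8, 1.8, -1.8, 1.8
C_REAL, C_IMAG = -0.62772, -0.42193


def build_inputs(width):
    """Return (zs, cs): width*width starting points and the matching constants."""
    x_step = (X2 - X1) / width
    y_step = (Y2 - Y1) / width
    xs = [X1 + i * x_step for i in range(width)]
    ys = [Y1 + j * y_step for j in range(width)]
    zs = [complex(x, y) for y in ys for x in xs]
    cs = [complex(C_REAL, C_IMAG)] * len(zs)
    return zs, cs


def run(width=100, maxiter=300):
    """Build the inputs, time the calculation, return (output, seconds)."""
    zs, cs = build_inputs(width)
    start = time.perf_counter()
    output = calculate(maxiter, zs, cs)
    elapsed = time.perf_counter() - start
    return output, elapsed


if __name__ == "__main__":
    output, elapsed = run()
    print(f"{len(output)} points in {elapsed:.3f} s")
```

Timing only the call to `calculate` keeps the input setup out of the measurement, so the compiled and uncompiled versions can be compared on the CPU-bound part alone.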

After installing the requirements and making the code changes, we can compile the code with the command below.

<center> 

`python setup.py build_ext --inplace`
</center>
<br>

> **Check the `compiling related` folder for related code files.**

For the Julia set code, without any further code optimizations, compiling with Cython reduced the run time to 3.82 seconds.
(The Cython-compiled code ran for 3.82 seconds; the regular Python code ran for 5.71 seconds, so this is roughly a 1.5x speedup.)


> If our build does not need a complex setup file, we can also use the `pyximport` module, which ships with Cython, to handle the compilation directly. All we have to do is install Cython and modify the code slightly as follows.

<br>
<pre style="color:yellow">
    import pyximport
    pyximport.install(language_level=3) 
    # After this line, any subsequently imported .pyx file is compiled automatically.
    import cythonfn

    # ...followed by the usual code
</pre>

This gives the same performance improvement as before, and we do not have to write a setup.py file by hand.

We can check where our code calls back into Python rather than running as plain C using `cython -a file_name.pyx`. This generates an annotated HTML file that shows each code segment, like below.

<center><image src="./img/13.jpg" width="700"/></center>

In the annotated HTML above, the more yellow a line is, the more it interacts with Python. This shows us where to focus our optimization effort.

According to the annotated HTML above, almost every line calls back into the Python interpreter. To reduce that, we need to convert as much as we can to local C objects and, once the numerical part is done, convert the result back to regular Python objects.

> Those annotated local C objects are only understood by Cython, not by Python.

The Cython-annotated function `calculate_juliaset_serial_ctypes` looks like this.


<br>
<pre style="color:yellow">

    def calculate_juliaset_serial_ctypes(int maxiter, zs, cs):
        """Calculate output list using Julia update rule"""
        
        cdef unsigned int i, n
        cdef double complex z, c

        output = [0] * len(zs)
        for i in range(len(zs)):
            n = 0
            z = zs[i]
            c = cs[i]
            while abs(z) < 2 and n < maxiter:
                z = z * z + c
                n += 1
            output[i] = n
        return output
</pre>

Notice the `cdef`, `unsigned int` and `double complex` declarations. They tell the Cython compiler which C data type each variable has. After compiling the code and checking the Python interactions as before, the output is as follows.

<center><image src="./img/14.jpg" width="600"/></center>

Clearly, far fewer lines interact with the Python interpreter than before. The run times of the implementations are compared below.


<center><image src="./img/15.jpg" width="400"/></center>

Whew! Huge speedup!
The reason for this speedup is that most of the code now runs at the C level, which means the C compiler can optimize how the operations are carried out.

> **CHECK THE CYTHON DOCUMENTS FOR MORE DETAILS ABOUT USAGES! [CYTHON DOCS](https://cython.readthedocs.io/en/latest/src/quickstart/build.html)**

Also, in the code above, instead of `abs(z) < 2` we can use the equivalent test `z.real**2 + z.imag**2 < 4`. This is much faster because we no longer need to calculate a square root for each z inside the loop. Incorporating this in our code gives a further performance improvement.
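The equivalence follows from squaring both sides of `|z| < 2`, which is safe because both sides are non-negative. A quick plain-Python check on a few hand-picked points (the helper names and sample values are illustrative only):

```python
def inside_abs(z):
    # abs() of a complex number computes sqrt(re^2 + im^2).
    return abs(z) < 2


def inside_squared(z):
    # Squaring both sides of |z| < 2 removes the square root.
    return z.real * z.real + z.imag * z.imag < 4


# Points inside, outside and exactly on the radius-2 circle.
samples = [0j, 1 + 1j, 1.5 + 1.5j, -2.5 + 0j, 0.3 - 1.9j, 3j, 2 + 0j]
agree = all(inside_abs(z) == inside_squared(z) for z in samples)

if __name__ == "__main__":
    print("both tests agree on all samples:", agree)
```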

Also we can use numpy along with Cython type annotations like below.

> To build this we have to pass the numpy include directory to the setup function (check the setup.py file). Expect a lot of pain if you use incompatible package versions. :-(

```cython
# cythonfn.pyx
import numpy as np
cimport numpy as np

def calculate_z(int maxiter, double complex[:] zs, double complex[:] cs):
    """Calculate output list using Julia update rule"""

    cdef unsigned int i, n
    cdef double complex z, c
    # np.empty allocates a block of memory without initializing it (like C)
    cdef int[:] output = np.empty(len(zs), dtype=np.int32)

    for i in range(len(zs)):
        n = 0
        z = zs[i]
        c = cs[i]
        # Compare the squared magnitude with 4 to avoid a square root per iteration
        while n < maxiter and (z.real * z.real + z.imag * z.imag) < 4:
            z = z * z + c
            n += 1
        output[i] = n
    return output
```

Above, we have also declared the types of the array elements. This lets the Cython compiler optimize array operations by accessing the memory block directly (when it is contiguous) instead of calling back into the Python interpreter. Cython's typed memoryviews are built on the buffer protocol, which gives compiled code low-level access to any object that implements it, such as numpy arrays or Python's `array` objects. The function's second argument is `double complex[:] zs`: a double-precision complex element type followed by `[:]`, where the brackets select the buffer protocol and the single colon specifies a one-dimensional block of data.
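The buffer protocol itself can be explored from plain Python with the built-in `memoryview`. This is only an analogy for Cython's typed memoryviews, not the same object, and the variable names are illustrative. The sketch shows that a view exposes the element type and shape of a contiguous block and shares memory with the object it wraps.

```python
from array import array

values = array("d", [1.0, 2.0, 3.0, 4.0])  # contiguous block of C doubles
view = memoryview(values)                  # no copy: a window onto the same memory

# Metadata that compiled code can rely on instead of inspecting Python objects.
info = (view.format, view.itemsize, view.ndim, view.shape, view.contiguous)

view[0] = 10.0   # writes through to the underlying array
tail = view[2:]  # slicing a view also shares memory
tail[0] = 30.0

if __name__ == "__main__":
    print(info)
    print(values.tolist())
```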

## Python OpenMP

Just as Cython lets us use C objects, it also lets us use the OpenMP parallel processing API to speed up our program. Because of Python's GIL, however, we cannot run Python code in parallel threads directly. We first need to release the GIL and then use Cython's parallel constructs, as below.

```cython
# cythonfn.pyx
from cython.parallel import prange
import numpy as np
cimport numpy as np

def calculate_z(int maxiter, double complex[:] zs, double complex[:] cs):
    """Calculate output list using Julia update rule"""

    cdef unsigned int i, length
    cdef double complex z, c
    cdef int[:] output = np.empty(len(zs), dtype=np.int32)
    length = len(zs)

    with nogil: # To disable python GIL
        for i in prange(length, schedule="guided"):
            z = zs[i]
            c = cs[i]
            output[i] = 0
            while output[i] < maxiter and (z.real * z.real + z.imag * z.imag) < 4:
                z = z * z + c
                output[i] += 1
    return output
```

In this code we use `prange`, the parallel range function, instead of `range`. The `schedule` parameter defines how iterations are assigned to threads and accepts `static`, `dynamic` or `guided`. With `static`, the iterations are divided among the threads up front; with `dynamic`, threads request new chunks of work as they finish; `guided` is like `dynamic`, but the chunks shrink over time. Choose the configuration that suits your workload!
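To get a feel for how the schedules differ, the following plain-Python sketch models the chunk sizes that `static` and `guided` might hand out. It is a simplified model under stated assumptions, not how any particular OpenMP runtime is implemented: the function names are hypothetical, and the guided rule used here (each chunk is about `remaining / n_threads`) only approximates the real behaviour.

```python
import math


def static_chunks(n_iterations, n_threads):
    """Split the iterations into one near-equal block per thread, fixed up front."""
    base, extra = divmod(n_iterations, n_threads)
    return [base + (1 if t < extra else 0) for t in range(n_threads)]


def guided_chunks(n_iterations, n_threads, min_chunk=1):
    """Hand out shrinking chunks: each one is roughly remaining / n_threads."""
    chunks = []
    remaining = n_iterations
    while remaining > 0:
        size = max(min_chunk, math.ceil(remaining / n_threads))
        size = min(size, remaining)  # never hand out more than is left
        chunks.append(size)
        remaining -= size
    return chunks


if __name__ == "__main__":
    print("static:", static_chunks(100, 4))
    print("guided:", guided_chunks(100, 4))
```

Large early chunks keep the scheduling overhead low, while the small late chunks help balance the load when some iterations (here, points that take many steps to escape) are much more expensive than others.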

To compile the code above, we also need to modify the setup file as below.

```python
# setup.py
# distutils was removed in Python 3.12; setuptools provides the same setup/Extension.
from setuptools import Extension, setup
from Cython.Build import cythonize
import numpy as np

ext_modules = [
    Extension(
        "cythonfn",
        ["cythonfn.pyx"],
        include_dirs=[np.get_include()],  # numpy headers needed by `cimport numpy`
        extra_compile_args=["-fopenmp"],  # enable OpenMP when compiling (gcc-style flag)
        extra_link_args=["-fopenmp"],     # and when linking
    )
]

setup(
    ext_modules=cythonize(
        ext_modules,
        compiler_directives={"language_level": "3"},
    )
)
```