---
Compiling Python Code
---

While it might seem counterintuitive to talk about compiling an interpreted language, compilation is often an easy and overlooked way to speed up Python programs. Because Python is interpreted, most Python compilers perform [Just-In-Time (JIT) compilation](https://en.wikipedia.org/wiki/Just-in-time_compilation), much like [PyPy](http://pypy.org) does.

### Cython

We'll first have a look at Cython (not to be confused with CPython). It has both pre-compilation and just-in-time compilation modes. We will use the former for now, as it will help us understand what Cython is doing and make better use of it.

It is important to note that Cython scripts use extensions to the Python language, so they should not end with the .py extension. The recommended extension is .pyx. In Jupyter notebooks, we can load the Cython extension with %load_ext Cython and mark a cell for compilation with %%cython.

In [None]:
%load_ext Cython

In [None]:
%%cython
import sys
import math
import time

def approx_pi(intervals):
    pi = 0.0
    for i in range(intervals):
        pi += (4 - 8 * (i % 2)) / (float)(2 * i + 1)
    return pi

if __name__ == "__main__":
    if len(sys.argv) != 2:
        sys.stderr.write("usage: {0} <intervals>\n".format(sys.argv[0]))
        sys.exit(1)

    t1 = time.perf_counter()
    pi = approx_pi(int(sys.argv[1]))
    t2 = time.perf_counter()
    print("PI is approximately %.16f, Error is %.16f"%(pi, abs(pi - math.pi)))
    print("Time = %.16f sec\n"%(t2 - t1))

To define compilation steps, we must create a compilation script, written in Python. A simple one, found in the setup_cython.py file, would look like this:

~~~ {.python}
from setuptools import setup
from Cython.Build import cythonize
setup(ext_modules = cythonize("*.pyx"))
~~~

And to proceed with the compilation:

~~~ {.input}
$ python setup_cython.py build_ext --inplace
~~~

After some compilation steps involving your C compiler (GCC, clang, icc, ...), you will get, on Unix platforms, a shared library named approx_pi_cython.so. This is very different from what we did with PyPy: the result is not immediately executable. It is only a library exposing functions, so our main timing code is not run. In Jupyter, the %%cython magic takes care of all these steps.
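
Because the compiled library only exposes functions, the timing code has to live in a separate plain Python script that imports it. A minimal sketch of such a driver follows. It assumes the module can be imported as approx_pi_cython (inferred from the library name above). The helper time_approx_pi is hypothetical, and the pure-Python fallback is there only so the sketch still runs when the extension has not been built.

```python
import math
import time

try:
    # Assumption: the compiled shared library is importable under this name.
    from approx_pi_cython import approx_pi
except ImportError:
    # Fallback for illustration only: the same function in plain Python.
    def approx_pi(intervals):
        pi = 0.0
        for i in range(intervals):
            pi += (4 - 8 * (i % 2)) / float(2 * i + 1)
        return pi


def time_approx_pi(intervals):
    """Hypothetical helper: return (estimate, elapsed seconds) for one call."""
    start = time.perf_counter()
    estimate = approx_pi(intervals)
    elapsed = time.perf_counter() - start
    return estimate, elapsed


if __name__ == "__main__":
    estimate, elapsed = time_approx_pi(1000000)
    print("PI is approximately %.16f, Error is %.16f"
          % (estimate, abs(estimate - math.pi)))
    print("Time = %.6f sec" % elapsed)
```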

In [None]:
%timeit approx_pi(100000000)

As you can see, we are not even twice as fast as our original Python code under CPython. To see why, we have to look at the annotated code, which shows an analysis of the compiled code:

In [None]:
%%cython --annotate
def approx_pi(intervals):
    pi = 0.0
    for i in range(intervals):
        pi += (4 - 8 * (i % 2)) / (float)(2 * i + 1)
    return pi

This is only a snippet of the entire code but it's enough to understand what's going on.
First, you'll notice a C comment with an arrow pointing to the Python line that the following C code was generated from. This helps you see how a line or chunk of code has been translated to C.
Second, we notice that our pi variable is not a native double, as we might expect, but a Python object. That means no operation on that variable can be native C code; each one must go back through the Python runtime, as seen in this snippet:

~~~ {.c}
...
    __pyx_t_5 = PyNumber_Multiply(__pyx_int_8, __pyx_t_2); ...
...
~~~

So even for basic arithmetic operations like multiplication, Python is involved. Going back and forth between C and Python this way explains why performance barely improves.
But there is a way to help the Cython compiler by giving it hints about data types. This is where we begin using language extensions, as in the approx_pi_cython2.pyx file:

In [None]:
%%cython 
def approx_pi(int intervals):
    cdef double pi
    cdef int i
    pi = 0.0
    for i in range(intervals):
        pi += (4 - 8 * (i % 2)) / (float)(2 * i + 1)
    return pi

All we did was add a type to the input parameter (int), as well as to the two local variables pi (cdef double) and i (cdef int). Let's compile and run it to compare:

In [None]:
%time approx_pi(100000000)

We are now on par with the PyPy interpreter. One could argue that using PyPy is easier than compiling with Cython, and they would have a point: PyPy requires neither a C compiler nor a setup script. However, Cython integrates with other C extensions. Let's try to do better with Cython by looking again at the generated C code:

In [None]:
%%cython --annotate
def approx_pi(int intervals):
    cdef double pi
    cdef int i
    pi = 0.0
    for i in range(intervals):
        pi += (4 - 8 * (i % 2)) / (float)(2 * i + 1)
    return pi

Everything looks almost right. Our variables are now native types (double and int). The only thing left is this call to __Pyx_mod_long instead of the (way faster) C modulo operator (%). This is done mainly because of different behavior when using negative numbers. In C, -1%10 == -1 and in Python, -1%10 == 9. Since we know we won't have any negative numbers going from 0 to intervals-1, we can safely tell the Cython compiler to use the native modulo operator:

In [None]:
%%cython --annotate
#cython:cdivision=True

def approx_pi(int intervals):
    cdef double pi
    cdef int i
    pi = 0.0
    for i in range(intervals):
        pi += (4 - 8 * (i % 2)) / (float)(2 * i + 1)
    return pi

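Since the sign simply alternates between consecutive terms, we can also drop the modulo altogether by handling one positive and one negative term per iteration (assuming intervals is even):

In [None]: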
%%cython --annotate
#cython:cdivision=True

def approx_pi(int intervals):
    cdef double pi
    cdef int i
    pi = 0.0
    for i in range(intervals // 2):
        pi += 4 / (float)(4 * i + 1)
        pi -= 4 / (float)(4 * i + 3)
    return pi
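
The two loop formulations can be checked against each other in plain Python before compiling anything. The sketch below uses hypothetical function names (approx_pi_modulo and approx_pi_unrolled) and assumes Python 3 division. It also shows that the two-terms-per-iteration form only matches when intervals is even.

```python
def approx_pi_modulo(intervals):
    """Leibniz series with the sign derived from i % 2."""
    pi = 0.0
    for i in range(intervals):
        pi += (4 - 8 * (i % 2)) / float(2 * i + 1)
    return pi


def approx_pi_unrolled(intervals):
    """Same series, two terms per iteration, no modulo."""
    pi = 0.0
    for i in range(intervals // 2):
        pi += 4 / float(4 * i + 1)
        pi -= 4 / float(4 * i + 3)
    return pi


# Even number of terms: both functions sum exactly the same terms.
even_difference = abs(approx_pi_modulo(10000) - approx_pi_unrolled(10000))

# Odd number of terms: the unrolled loop silently drops the last term.
odd_difference = abs(approx_pi_modulo(10001) - approx_pi_unrolled(10001))

print("even:", even_difference)
print("odd: ", odd_difference)
```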

Note that the #cython compiler directive must appear at the top of your .pyx file, before any code. For the list of valid compiler directives, head over to the [Cython documentation page](http://docs.cython.org/src/reference/compilation.html#compiler-directives).

In [None]:
%time approx_pi(100000000)

We are now as fast as our first C code. I will let you have a look for yourself at the generated C code to confirm that the C modulo operator was indeed used.
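
The difference between the two modulo conventions mentioned earlier can be reproduced from plain Python. This is only an illustrative sketch: c_style_mod is a hypothetical helper built on math.fmod, which follows the C convention (it works on floats, so it is only suitable for small integers like these).

```python
import math


def c_style_mod(a, b):
    """Remainder with C semantics: the result takes the sign of the dividend."""
    return int(math.fmod(a, b))


python_result = -1 % 10         # Python: result takes the sign of the divisor
c_result = c_style_mod(-1, 10)  # C: result takes the sign of the dividend

# For non-negative operands, as in our loop, both conventions agree,
# which is why switching to the native operator is safe here.
agree_when_non_negative = all(i % 2 == c_style_mod(i, 2) for i in range(1000))

print(python_result, c_result, agree_when_non_negative)
```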


> ## Data Processing {.challenge}
>
> Try optimizing the exercices/process_data.py script using the Cython compiler. What speedup can you achieve?
>
> __Tip__: Before running the script, make a copy, generate a random sample and work on the pyx file:
>
> ~~~ {.input}
> cd exercices/ && python gen_inputs.py && cp process_data.py process_data.pyx
> ~~~
>
> A possible solution can be found in the solutions/process_data_cython.py file.

In [None]:
# %load ../exercices/process_data.py
from __future__ import division

def read_data():
    data = []
    fp = open("inputs.dat", "r")

    line = 1
    while line:
        line = fp.readline()
        if line:
            row = []
            for elem in line.split(','):
                elem = elem.strip()
                if elem:
                    row.append(float(elem))
            data.append(row)
        
    fp.close()
    return data

def process_A(data):
    """
    Return a new matrix of the same shape as data, with each original
    element raised to the power of its transposed counterpart.

    result[i][j] = data[i][j] ** data[j][i]
    """
    result = []
    for i in range(len(data)):
        row = []
        for j in range(len(data[i])):
            row.append(data[i][j] ** data[j][i])
        result.append(row)
    return result

def process_B(m1, m2):
    """
    Return the sum of the differences between corresponding
    elements of two square matrices.

    diff = (m2[0][0] - m1[0][0]) + (m2[0][1] - m1[0][1]) + ...
    """

    diff = 0.
    for i in range(len(m1)):
        for j in range(len(m1[i])):
            diff += m2[i][j] - m1[i][j]
    return diff

def main():
    data = read_data()
    result_1 = process_A(data)
    print("Difference is: %s" % process_B(data, result_1))


if __name__ == "__main__":
    main()


### Numba

Another option for JIT compilation is the [Numba project](http://numba.pydata.org/). The Numba compiler is provided by Continuum Analytics (now Anaconda, Inc.), which also distributes the Anaconda Python distribution. In its simplest form, you only need to add the @jit decorator to the code you want to speed up:

In [None]:
from numba import jit
import numpy

@jit
def approx_pi(intervals):
    pi = 0.0
    for i in range(intervals):
        pi += (4 - 8 * (i % 2)) / (float)(2 * i + 1)
    return pi

%timeit approx_pi(100000000)

This would result in the following execution:

In [None]:
%timeit approx_pi(100000000)
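
One detail worth knowing when timing Numba code: a function is compiled the first time it is called with a given argument type, so a single measurement of the first call includes compilation time. The sketch below is a rough illustration, assuming Numba is installed. The no-op fallback decorator and the timed_call helper are hypothetical and exist only so the code still runs (uncompiled) without Numba.

```python
import time

try:
    from numba import jit
except ImportError:
    # Fallback for this sketch only: a no-op decorator, so the function
    # simply runs as ordinary Python when Numba is not installed.
    def jit(*args, **kwargs):
        if len(args) == 1 and callable(args[0]) and not kwargs:
            return args[0]
        return lambda func: func


@jit(nopython=True)
def approx_pi(intervals):
    pi = 0.0
    for i in range(intervals):
        pi += (4 - 8 * (i % 2)) / float(2 * i + 1)
    return pi


def timed_call(intervals):
    """Hypothetical helper: return (value, elapsed seconds) for one call."""
    start = time.perf_counter()
    value = approx_pi(intervals)
    return value, time.perf_counter() - start


# Under Numba, the first call includes compilation; the second reuses it.
first_value, first_seconds = timed_call(100000)
second_value, second_seconds = timed_call(100000)

print("first call:  %.6f sec" % first_seconds)
print("second call: %.6f sec" % second_seconds)
```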

Also, keep in mind that, although it might be worth a try, applying the Numba @jit decorator doesn't provide much additional gain when your code already uses NumPy:

In [None]:
@jit
def approx_pi(intervals):
    pi1 = 4/numpy.arange(1, intervals, 4)
    pi2 = -4/numpy.arange(3, intervals, 4)
    return numpy.sum(pi1) + numpy.sum(pi2)

%time approx_pi(100000000)