# Intro to Cython and Numba

# Why Cython

![DevTime](whycython.png)

# Why **not** Cython
## (When not Cython)



## Outline

* Cython
* Numba
* Fight!


# Part 1: cython

We want to integrate the function $f(x) = x^4 - 3x$.

In [None]:
def f(x):
    y = x**4 - 3*x
    return y
    
def integrate_f(a, b, n):
    dx = (b - a) / n
    dx2 = dx / 2
    s = f(a) * dx2
    for i in range(1, n):
        s += f(a + i * dx) * dx
    s += f(b) * dx2
    return s

integrate_f(-100, 100, 100_000)
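
As a sanity check on the cell above: `integrate_f` applies the composite trapezoidal rule, and this polynomial has a closed-form antiderivative, $x^5/5 - 3x^2/2$, so the exact integral over $[-100, 100]$ is $4 \times 10^9$. The sketch below repeats the two functions so that it is self-contained; `antiderivative`, `exact`, `approx` and `rel_err` are illustrative names that are not part of the original notebook.

```python
def f(x):
    return x**4 - 3*x

def integrate_f(a, b, n):
    # composite trapezoidal rule with n equal sub-intervals
    dx = (b - a) / n
    dx2 = dx / 2
    s = f(a) * dx2
    for i in range(1, n):
        s += f(a + i * dx) * dx
    s += f(b) * dx2
    return s

def antiderivative(x):
    # closed-form antiderivative of x**4 - 3*x (illustrative helper)
    return x**5 / 5 - 3 * x**2 / 2

exact = antiderivative(100) - antiderivative(-100)
approx = integrate_f(-100, 100, 100_000)
rel_err = abs(approx - exact) / abs(exact)
print(exact, approx, rel_err)
```

The same check can be reused on every faster variant below to confirm that speed was not bought with a wrong answer.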

Now, let's time this:

In [None]:
%timeit integrate_f(-100, 100, 100_000)

Not too bad, but this can add up. Let's see if Cython can do better:

In [None]:
%load_ext cython

In [None]:
%%cython

def f2(x):
    y = x**4 - 3*x
    return y
    
def integrate_f2(a, b, n):
    dx = (b - a) / n
    dx2 = dx / 2
    s = f2(a) * dx2
    for i in range(1, n):
        s += f2(a + i * dx) * dx
    s += f2(b) * dx2
    return s

In [None]:
f2

In [None]:
import sys
sys.modules[f2.__module__]

In [None]:
integrate_f2(-100, 100, 100_000)

In [None]:
%timeit integrate_f2(-100, 100, 100_000)

That's a little bit faster, which is nice since all we did was run Cython on exactly the same code. But can we do better?

### manual type specialization

In [None]:
%%cython

def f3(double x):
    y = x**4 - 3*x
    return y
    
def integrate_f3(double a, double b, int n):
    dx = (b - a) / n
    dx2 = dx / 2
    s = f3(a) * dx2
    for i in range(1, n):
        s += f3(a + i * dx) * dx
    s += f3(b) * dx2
    return s

In [None]:
%timeit integrate_f3(-100, 100, 100_000)

The final bit of "easy" Cython optimization is "declaring" the variables inside the function:

In [None]:
%%cython

def f4(double x):
    y = x**4 - 3*x
    return y
    
def integrate_f4(double a, double b, int n):
    cdef double dx = (b - a) / n
    cdef double dx2 = dx / 2
    cdef double s = f4(a) * dx2
    cdef int i

    for i in range(1, n):
        s += f4(a + i * dx) * dx
    s += f4(b) * dx2
    return s

In [None]:
%timeit integrate_f4(-100, 100, 100_000)

3× speedup with so little effort is pretty nice. What else can we do?

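Cython has a handy `-a` flag (for "annotate") that produces an HTML view of your code in which lines that interact with the Python interpreter are highlighted in yellow; the darker the yellow, the more Python overhead. This gives clues about why your code is slow.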

`%%cython -a`

In [None]:
%%cython -a

def f4(double x):
    y = x**4 - 3*x
    return y
    
def integrate_f4(double a, double b, int n):
    cdef:
        double dx = (b - a) / n
        double dx2 = dx / 2
        double s = f4(a) * dx2
        int i = 0
    for i in range(1, n):
        s += f4(a + i * dx) * dx
    s += f4(b) * dx2
    return s

## Exercise 1!

Head over to `cython-primes/exercise.ipynb`. See instructions there.

That's a lot of yellow still! How do we reduce this?

## Function specialization

In [None]:
%%cython -a

def f5(double x):
    y = (x*x*x - 3)*x
    return y
def integrate_f5(double a, double b, int n):
    cdef:
        double dx = (b - a) / n
        double dx2 = dx / 2
        double s = f5(a) * dx2
        int i = 0
    for i in range(1, n):
        s += f5(a + i * dx) * dx
    s += f5(b) * dx2
    return s

In [None]:
%timeit integrate_f5(-100, 100, 100_000)

In [None]:
f5

### summary of python vs. cython

```
  pure python:                 35 ms
  python-compatible cython:    24 ms 
  specialization of arguments: 18 ms
  full type specialization:    13 ms
  c-only function:              6.1 ms
  simplified expression form:   0.178 ms
```
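
The last line of the summary comes from rewriting $x^4 - 3x$ as $(x^3 - 3)x$ using explicit multiplications, which presumably helps because the compiled code no longer needs a general power routine. A minimal sketch in plain Python that only checks the two forms agree; the names `f_power`, `f_factored`, `samples` and `agree` are hypothetical, and the speedup itself has to be measured on the compiled versions.

```python
import math

def f_power(x):
    # original form, uses the power operator
    return x**4 - 3*x

def f_factored(x):
    # factored form, uses only multiplications and one subtraction
    return (x*x*x - 3)*x

samples = [-100.0, -1.5, 0.0, 0.25, 3.0, 100.0]
agree = all(
    math.isclose(f_power(x), f_factored(x), rel_tol=1e-12, abs_tol=1e-12)
    for x in samples
)
print(agree)
```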

## Exercise 2!

Head over to `cython-fibbo/exercise.ipynb`. Watch out — this one is tricky.

# Using Cython in production code

In [None]:
%%script false

# setup.py — don't run this in the notebook

from setuptools import setup, Extension
from Cython.Build import cythonize

import numpy as np

# distutils was removed in Python 3.12; setuptools provides the same API
setup(
  ext_modules = cythonize([
    Extension("integrate_f5", ["integrate_f5.pyx"],
              include_dirs=[np.get_include()],
              extra_compile_args=[],
              extra_link_args=[]),
  ])
)

# run with 'python setup.py build_ext -i'

# Exercise 3

Navigate to `cython-distrib/` in a terminal and follow the instructions in the `README` file there.

# Cython architecture

![Cython architecture](cython_architecture_small.png)

## Dealing with numpy arrays

In [None]:
def nth_prime(n):
    n_found = 0
    candidate = 2
    while True:
        good = True
        for div in range(2, candidate):
            if candidate % div == 0:
                good = False
                break
        if good:
            n_found += 1
            if n_found == n:
                return candidate
        # try with the next number
        candidate += 1

In [None]:
import numpy as np

def nth_prime_sieve(n):
    n_found = 0
    candidate = 2
    sieve = np.empty(n-1, dtype=int)
    
    while True:
        good = True
        for div in sieve[:n_found]:
            if candidate % div == 0:
                good = False
                break
        if good:
            n_found += 1
            if n_found == n:
                return candidate

            sieve[n_found-1] = candidate

        # try with the next number
        candidate += 1
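
A quick consistency check for the two cells above, assuming both are meant to return the same sequence of primes. The functions are repeated (without comments) so that the snippet runs on its own; `primes_naive` and `primes_sieve` are illustrative names.

```python
import numpy as np

def nth_prime(n):
    n_found = 0
    candidate = 2
    while True:
        good = True
        for div in range(2, candidate):
            if candidate % div == 0:
                good = False
                break
        if good:
            n_found += 1
            if n_found == n:
                return candidate
        candidate += 1

def nth_prime_sieve(n):
    n_found = 0
    candidate = 2
    # holds the primes found so far; only the first n_found entries are read
    sieve = np.empty(n-1, dtype=int)
    while True:
        good = True
        for div in sieve[:n_found]:
            if candidate % div == 0:
                good = False
                break
        if good:
            n_found += 1
            if n_found == n:
                return candidate
            sieve[n_found-1] = candidate
        candidate += 1

primes_naive = [nth_prime(k) for k in range(1, 11)]
primes_sieve = [nth_prime_sieve(k) for k in range(1, 11)]
print(primes_naive, primes_sieve)
```

The second version only divides by primes already found, stored in a preallocated NumPy array, which is the kind of array access the Cython exercises go on to type.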

# Exercise

Head over to `cython-dot/` and open `exercise.ipynb` there. Follow instructions.

# Part 2: numba

In [None]:
from numba import jit

@jit
def f(x):
    y = x**4 - 3*x
    return y
    
@jit
def integrate_f7(a, b, n):
    dx = (b - a) / n
    dx2 = dx / 2
    s = f(a) * dx2
    for i in range(1, n):
        s += f(a + i * dx) * dx
    s += f(b) * dx2
    return s

In [None]:
%%timeit -n 1 -r 1

integrate_f7(-100, 100, 100_000)

In [None]:
%timeit integrate_f7(-100, 100, 100_000) 
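
The two timing cells above differ because the first call to a `@jit` function includes compilation for the argument types it sees, while later calls with the same types reuse the compiled code. A rough sketch of the same measurement with `time.perf_counter`; the `ImportError` fallback and the names `timed`, `first` and `second` are assumptions added so the snippet still runs without Numba, in which case nothing is compiled and both timings are of plain Python.

```python
import time

try:
    from numba import jit
except ImportError:
    # Assumption for illustration: without Numba, use a no-op decorator
    def jit(func):
        return func

@jit
def f(x):
    return x**4 - 3*x

@jit
def integrate_f7(a, b, n):
    dx = (b - a) / n
    dx2 = dx / 2
    s = f(a) * dx2
    for i in range(1, n):
        s += f(a + i * dx) * dx
    s += f(b) * dx2
    return s

def timed(func, *args):
    # return the result together with the wall-clock time of one call
    start = time.perf_counter()
    result = func(*args)
    return result, time.perf_counter() - start

r1, first = timed(integrate_f7, -100, 100, 100_000)   # includes compilation
r2, second = timed(integrate_f7, -100, 100, 100_000)  # compiled code only
print(f"first call: {first:.4f} s, second call: {second:.6f} s")
```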

In [None]:
integrate_f7

# Numba architecture

![Numba architecture](numba_architecture_small.png)

# Why numba?

- native python code
- type flexibility


# Exercise

Open `numba-prime/exercise.ipynb` and follow the instructions there.

# Exercise

Open `numba-fibbo/exercise.ipynb` and follow the instructions there. Warning: this is not as simple as it looks.

# numba nopython and python modes

In [None]:
import numba

@numba.jit
def f(x):
    y = x*5 + x
    return y

In [None]:
f(1)

In [None]:
import numpy as np
x = np.eye(3)
print('x:', x)
print()
print('f(x):', f(x))

In [None]:
f('abc')

In [None]:
f

In [None]:
f.signatures

In [None]:
f.nopython_signatures

In [None]:
import numba

@numba.jit(numba.types.int32(numba.types.int32))
def f(x):
    y = x**4 - 3*x
    return y

In [None]:
f(33)

In [None]:
f(33.5)

In [None]:
f(np.eye(3))

In [None]:
f.signatures

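When `jit()` is called with an explicit signature, compilation is *eager*: it happens at decoration time rather than on the first call.

This gives precise control over the types the function accepts.

It is also possible to require `nopython` mode, in which case Numba raises an error if the function cannot be compiled without falling back to the Python interpreter (since Numba 0.59 this is the default behaviour of `@jit`):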

In [None]:
@jit(nopython=True)
def f(x):
    y = x**4 - 3*x
    return y

# Let's not forget `numpy` (and C)

In [None]:
import numpy as np
def f(x):
    y = x**4 - 3*x
    return y

def integrate_f8(a, b, n):   
    dx = (b - a) / n
    dx2  = dx / 2
    x = np.linspace(a, b, n)
    s = f(x)
    s = s[0]*dx2 + s[1:-1].sum()*dx + s[-1]*dx2 
    
    return s

integrate_f8(-100, 100, 100_000)

In [None]:
%timeit integrate_f8(-100, 100, 100_000)

In [None]:
### summary of python vs. cython vs. numba vs. C

pure python:                  33 ms
python-compatible cython:     24 ms 
specialization of arguments:  18 ms
full type specialization:     13 ms
c-only function:               6 ms
simplified expression form:    0.178 ms

numba jit:                     0.170 ms

numpy:                         7 ms
numpy simplified expression:   0.500 ms
    
plain C (-O0):                 7.3 ms
C simplified expression (-O0): 1.5 ms
C simplified expression (-O3): 0.200 ms
                               0.164 ms with -march=native
                               0.052 ms with -ffast-math
                               # https://gcc.gnu.org/wiki/FloatingPointMath

# Exercise

Figure out why `integrate_f8` returns a result that differs slightly from the results of the previous functions.

# Side demo

The `c-integrate/` directory contains C code that can be compiled and used as a benchmark to compare against Cython and Numba.

# Automatic parallelization in numba

In [None]:
def trig_ident_np(x):
    return (np.sin(x)**2 + np.cos(x)**2 +
            np.sin(x)**2 + np.cos(x)**2 +
            np.sin(x)**2 + np.cos(x)**2 +
            np.sin(x)**2 + np.cos(x)**2).sum()/4

@jit
def trig_ident_jit(x):
    s = 0    
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            s += (np.sin(x[i,j])**2 + np.cos(x[i,j])**2 +
                  np.sin(x[i,j])**2 + np.cos(x[i,j])**2 +
                  np.sin(x[i,j])**2 + np.cos(x[i,j])**2 +
                  np.sin(x[i,j])**2 + np.cos(x[i,j])**2) / 4
    return s

@jit(parallel=True)
def trig_ident_jitp(x):
    return (np.sin(x)**2 + np.cos(x)**2 +
            np.sin(x)**2 + np.cos(x)**2 +
            np.sin(x)**2 + np.cos(x)**2 +
            np.sin(x)**2 + np.cos(x)**2).sum()/4

In [None]:
x = np.random.randn(5,5)
x

In [None]:
trig_ident_np(x)

In [None]:
x = np.random.randn(500, 50_000)

In [None]:
%timeit -r 1 trig_ident_np(x)

In [None]:
%timeit trig_ident_jit(x)

In [None]:
%timeit trig_ident_jitp(x)

# Exercise

Open `numba-dot/exercise.ipynb`, see instructions therein.

# Stuff I didn't talk about, slide 1 / n

## Cython

* releasing the GIL    
![A bullfinch, CC BY-SA 3.0 https://commons.wikimedia.org/w/index.php?title=User:Sp.herp](Red-headed_Bullfinch_small.jpg)


* threads in Cython
![Threads, CC BY 2.0 https://www.flickr.com/people/10506540@N07](Embroidery_Floss_Multi-Colored_10-21-09_IMG_8048_small.jpg)

* OpenMP
![OpenMP logo](openmp_lg_transparent_small.gif)

* Numba for the GPU
![Nvidia card](NVIDIA-Tesla-K20X_small.jpg)


# MPI vs. multiprocessing vs. threading vs. numba

# Concluding remarks

Some pros and cons about Cython and Numba

- Cython pros:
  * very wide support
  * easy to distribute compiled code to most users
  * well-developed optimization workflow (e.g., `%%cython -a`)
- Cython cons:
  * need to use a new language
  * compiled code


- Numba pros:
  * quite easy to use, especially if you're starting from Cython code
  * often eye-popping, face-melting performance
- Numba cons:
  * problematic to install outside of conda/pip
  * hard to optimise: if it's slow, you have to guess (though the developers are helpful on the mailing list)
  * many parts of Python still unsupported, e.g. dicts.
  * project still young and some people are paranoid that it could disappear

# Documentation

- Exercises and repo: https://github.com/ASPP/aspp-cython-numba
- This notebook: https://github.com/ASPP/aspp-cython-numba/blob/master/lecture/Python%20vs%20Cython%20vs%20Numba.ipynb
- Cython:
  - https://cython.readthedocs.io/en/latest/
  - https://cython.readthedocs.io/en/latest/src/userguide/numpy_tutorial.html
- Numba:
  - http://numba.pydata.org/numba-doc/latest/index.html