# Profiling, Cython, and Numba 🚀
### Zbyszek & Jakob
### ASPP 2022, Bilbao, Spain

## Outline

* Introduction
* Profiling
* Speed up Python code using Cython
 * Basic principles
 * Interacting with NumPy arrays
* Using Numba to speed up Python code

 * ~Release the GIL and parallelize easily~ *(moved to parallel lecture)*
 * ~Wrap C/C++ code~ *(not relevant enough)*

## Introduction

* Sometimes, it seems like the execution speed of some script is *the* thing which keeps you from your next scientific breakthrough
* Both Cython and Numba are tools to make your code faster -> "optimization"
* So when should you optimize your code?

oral exercise: give examples in which scenarios you would benefit from optimization

## The three rules of optimization
(adapted from Sebastian Witowski, EuroPython 2016)

#### 1. Don't.
 * Optimization comes with costs.
 * Likely you don't need it.
 * Invest in better hardware.

oral exercise: give examples for costs associated with optimization

#### 2. Don't yet.
 * Is your code finished?
 * Did you write tests?
 * Are you sure it's worth the investment?

#### 3. Profile
* Don't guess which part of your code you should optimize!
* Measure. Measure. Measure.

## Runtime profilers

- profilers monitor the execution of your script and record, for example, how much time is spent in each function
- here we consider [py-spy](https://github.com/benfred/py-spy), a sampling-based profiler for Python
  - simply put, `py-spy` examines your program at regular intervals and records which part is currently being executed
- you can apply it to your script with `py-spy record -o profile.svg -- python myprogram.py`
  - to make measurements accurate, it needs to collect enough data; you can control the sampling rate using the `-r` argument
- after measuring, `py-spy` will produce a "flamegraph" like the following
![flamegraph](./figures/flamegraph.svg)
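
To make the idea of sampling concrete, here is a minimal toy sketch of a sampling profiler. It is not how `py-spy` is implemented (`py-spy` inspects the process from the outside); it only illustrates "look at the program at regular intervals and record what is running". The names `sample_calls` and `busy` are hypothetical, and `sys._current_frames()` is a CPython-specific helper.

```python
import collections
import sys
import threading
import time


def sample_calls(target, interval=0.001):
    """Run target() while a helper thread samples the caller's current frame."""
    counts = collections.Counter()
    thread_id = threading.get_ident()  # the thread that will execute target()
    done = threading.Event()

    def sampler():
        while not done.is_set():
            frame = sys._current_frames().get(thread_id)
            if frame is not None:
                # Record the name of the function that is executing right now.
                counts[frame.f_code.co_name] += 1
            time.sleep(interval)

    helper = threading.Thread(target=sampler, daemon=True)
    helper.start()
    try:
        target()
    finally:
        done.set()
        helper.join()
    return counts


def busy(duration=0.3):
    """Hypothetical workload: spin for roughly `duration` seconds."""
    end = time.perf_counter() + duration
    total = 0
    while time.perf_counter() < end:
        total += 1
    return total


samples = sample_calls(busy)
print(samples.most_common(3))
```

Functions that run longer collect more samples, which is what the width of a bar in a flamegraph represents.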

## Example: numerical integration

![RiemannSum](figures/MidRiemann2.svg)

Riemann sum: $\int_a^b dx f(x) \approx \sum_{i = 0}^{n - 1} f(a + (i + 0.5) \Delta x) \Delta x$ with $\Delta x = (b - a)/n$

here $a=0, b=2, n=4$
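
As a minimal sketch, the midpoint Riemann sum above can be written directly in Python. The integrand `square` is an assumption chosen for illustration (the function shown in the figure is not specified here), and `midpoint_riemann` is a hypothetical name, not the function from the course script.

```python
def midpoint_riemann(f, a, b, n):
    """Approximate the integral of f over [a, b] using n midpoint rectangles."""
    dx = (b - a) / n
    # Evaluate f at the midpoint of each of the n bins and sum the areas.
    return sum(f(a + (i + 0.5) * dx) for i in range(n)) * dx


def square(x):
    """Illustrative integrand; its exact integral over [0, 2] is 8/3."""
    return x * x


approx = midpoint_riemann(square, 0, 2, 4)
print(approx)  # 2.625
```

With only four bins the estimate is already within about 0.05 of the exact value 8/3.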

### Example implementation
(see [./profiling/numerical_integration.py](./profiling/numerical_integration.py))

Where do you think the bottlenecks are? *(don't do this at home!)*

In [8]:
!pygmentize ./profiling/numerical_integration.py

import argparse

import matplotlib.pyplot as plt
import numpy as np


def parse_arguments():
    parser = argparse.ArgumentParser(
        description="Measure error of numerical integration."
    )
    parser.add_argument(
        "n_max",
        type=int,
        help="maximal number of bins to use for integration",
    )
    parser.add_argument(
        "a", type=float, help="lower bound for integration"
    )


## Demo

Jakob will demonstrate a typical profiling/optimization workflow based on this script.

- time
- py-spy
- notebook (timeit/lprun)
- time (of improved version)

## Exercise

It's time to put theory into practice. We have prepared an example script (see [./profiling/count_words.py](./profiling/count_words.py)) which counts the number of occurrences of words in a text.

1. Familiarize yourself with the script.
2. Guess which parts are slow and should be optimized. *(don't do this at home.)*
3. Use the workflow (time -> py-spy -> timeit/lprun -> time) we have just demonstrated to reduce the script's execution time. **Make sure not to break the tests.**

Afterwards we will discuss the exercise jointly.

## Exercise discussion

What did we learn?
- ...

## Profiling conclusion

- before optimizing, first finish your code & write tests
- then *measure* to find slow functions
- optimize only the slowest functions & know when to stop!
- most profilers can be invoked stand-alone and within ipython
- `time` and `%timeit`, and also `import timeit; timeit.timeit('some_func()', globals=globals())`
- [py-spy](https://github.com/benfred/py-spy) is just one of many profilers; alternatives:
  - [cProfile](https://docs.python.org/3/library/profile.html) + [snakeviz](https://github.com/jiffyclub/snakeviz)
  - [pyinstrument](https://github.com/joerick/pyinstrument)
- here we focus on profiling *runtime*, but maybe you are limited by *memory*
  - [memray](https://github.com/bloomberg/memray)

- 80/20 rule
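
For measurements outside a notebook, the standard library is often enough. The following sketch uses `timeit` for wall-clock timing and `cProfile` with `pstats` for a per-function breakdown; `slow_sum` is a hypothetical stand-in for your own function.

```python
import cProfile
import io
import pstats
import timeit


def slow_sum(n):
    """Hypothetical workload: sum of squares in a plain Python loop."""
    total = 0
    for i in range(n):
        total += i * i
    return total


# Wall-clock time for 10 calls; timeit accepts a callable as well as a string.
elapsed = timeit.timeit(lambda: slow_sum(10_000), number=10)
print(f"10 calls took {elapsed:.4f} s")

# Deterministic profiling: record every function call while enabled.
profiler = cProfile.Profile()
profiler.enable()
slow_sum(100_000)
profiler.disable()

# Collect the five most expensive entries (by cumulative time) as text.
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(5)
report = stream.getvalue()
print(report)
```

Unlike a sampling profiler, `cProfile` instruments every call, so it adds noticeable overhead to code that makes many small function calls.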

### What to do (in order of complexity):
- do nothing
- buy better hardware
- data structures and algorithms
- memoization / caching
- vectorization (`numpy`!!)
- libraries (`blas` vs. `openblas` vs. `atlas` vs. Intel `mkl`)
- parallelization
- GPUs
- cython / numba / pythran
- low-level code
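
As one example from the cheaper end of this list, memoization can be a single decorator. This sketch uses `functools.lru_cache` on a deliberately naive recursive Fibonacci function; `fib` and `call_count` are hypothetical names used only for illustration.

```python
import functools

call_count = 0  # counts how often the function body actually runs


@functools.lru_cache(maxsize=None)
def fib(n):
    """Naive recursive Fibonacci; the cache removes the repeated work."""
    global call_count
    call_count += 1
    return n if n < 2 else fib(n - 1) + fib(n - 2)


print(fib(30))     # 832040
print(call_count)  # 31: each n from 0 to 30 is computed exactly once
```

Without the decorator the same call would execute the function body more than a million times.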


## Cython

In [14]:
def f(x):
    return x ** 4 - 3 * x

def integrate_f(func, a, b, n):
    s = 0
    dx = (b - a) / n
    
    s += func(a) * dx/2
    for i in range(1, n):
        s += func(a + i * dx) * dx
    s += func(b) * dx/2
    return s

In [15]:
integrate_f(f, -10, +10, 1_000_000)

≈ 40000.0000003

In [16]:
%timeit integrate_f(f, -10, +10, 1_000_000)

303 ms ± 19.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [17]:
%load_ext cython

The cython extension is already loaded. To reload it, use:
  %reload_ext cython


In [19]:
%%cython

def f2(x):
    return x ** 4 - 3 * x

def integrate_f2(func, a, b, n):
    s = 0
    dx = (b - a) / n
    
    s += func(a) * dx/2
    for i in range(1, n):
        s += func(a + i * dx) * dx
    s += func(b) * dx/2
    return s

In [22]:
f2, integrate_f2

(<function _cython_magic_5c6075df9c1dca66216cfb2e434e0104.f2>,
 <function _cython_magic_5c6075df9c1dca66216cfb2e434e0104.integrate_f2>)

In [23]:
import sys
sys.modules[f2.__module__]

<module '_cython_magic_5c6075df9c1dca66216cfb2e434e0104' (/home/zbyszek/.cache/ipython/cython/_cython_magic_5c6075df9c1dca66216cfb2e434e0104.cpython-310-x86_64-linux-gnu.so)>

In [None]:
!file /home/zbyszek/.cache/ipython/cython/_cython_magic_a9dc65ed82a290407cecd88aeb8605c0.cpython-310-x86_64-linux-gnu.so

In [24]:
integrate_f2(f2, -10, +10, 1_000_000)

≈ 40000.0000003

In [25]:
%timeit integrate_f2(f2, -10, +10, 1_000_000)

241 ms ± 22.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [28]:
%%cython

def f3(double x):
    return x ** 4 - 3 * x

def integrate_f3(func, double a, double b, int n):
    cdef:
        double s = 0
        double dx = (b - a) / n
    
    s += func(a) * dx/2
    for i in range(1, n):
        s += func(a + i * dx) * dx
    s += func(b) * dx/2
    return s

In [29]:
integrate_f3(f3, -10, +10, 1_000_000)

≈ 40000.0000003

In [30]:
%timeit integrate_f3(f3, -10, +10, 1_000_000)

93.3 ms ± 1.07 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


In [31]:
%%cython

def f4(double x):
    return x ** 4 - 3 * x

def integrate_f4(func, double a, double b, int n):
    cdef double s = 0
    cdef double dx = (b - a) / n
    
    s += func(a) * dx/2
    
    cdef int i
    for i in range(1, n):
        s += func(a + i * dx) * dx
    
    s += func(b) * dx/2
    
    return s

In [32]:
integrate_f4(f4, -10, +10, 1_000_000)

≈ 40000.0000003

In [34]:
%timeit integrate_f4(f4, -10, +10, 1_000_000)

93.5 ms ± 3.24 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


## Exercise 01-cython-primes

Please open `01-cython-primes/exercise.ipynb` and follow the instructions therein.

### Cython function type specialization

In [None]:
%%cython -a

cdef double f4(double x):
    return x ** 4 - 3 * x

def integrate_f4(double a, double b, int n) -> double:
    cdef:
        double dx = (b - a) / n
        double dx2 = dx / 2
        double s
        int i

    s = f4(a) * dx2
    for i in range(1, n):
        s += f4(a + i * dx) * dx
    s += f4(b) * dx2
    return s


In [None]:
%timeit integrate_f4(-10, +10, 1_000_000)

### Cython formula optimization

## Exercise: 02-cython-distrib

Please open a terminal, `cd` to `02-cython-distrib/`, and follow the instructions in `README`.




### When `setup.py` and when `meson.build`?

[<img src="images/logo-over-white.svg" width="100"/>](images/logo-over-white.svg)

- setuptools is (still) the standard in the Python ecosystem
- excellent integration with PyPI and other Python packages
- automatic downloads from PyPI
- clumsy integration with non-Python libraries
- weak support for optional dependencies and partial rebuilds

[<img src="images/Meson_(software)_logo_2019.svg" width="180"/>](images/Meson_(software)_logo_2019.svg)

- Meson is arguably the best available build system for compiled code
- excellent integration with pkgconfig and other system libraries
- integration with PyPI via pip, somewhat clumsy
- excellent support for user configuration, optional dependencies, and partial rebuilds

Thus, setuptools is a good solution for Python projects with some Cython code and no dependencies on system libraries. Meson is a good solution for self-contained Python and/or Cython code, possibly alongside other non-Python libraries and executables.

# Cython and Numpy Arrays

Let's start by summing up an array

In [62]:
%%cython -a

import cython

@cython.wraparound(False)
@cython.boundscheck(False)
def mysum(double [::1] arr):
    cdef size_t N = arr.size
    cdef double sum = 0
    for i in range(1, N-1):
        sum += arr[i]
    sum += arr[0]
    sum += arr[N-1]
        
    return sum

Let's write a "mean filter"


$$ \{ x_0, x_1, \ldots, x_{n-2}, x_{n-1} \} \longrightarrow \{ \frac{x_0 + x_1}{2}, \frac{x_0 + x_1 + x_2}{3}, \frac{x_1 + x_2 + x_3}{3}, \ldots, \frac{x_{i-1} + x_i + x_{i+1}}{3}, \ldots, \frac{x_{n-3} + x_{n-2} + x_{n-1}}{3}, \frac{x_{n-2} + x_{n-1}}{2} \} $$

In [63]:
import numpy as np

def mean3filter(arr):
    arr_out = np.empty_like(arr)
    
    arr_out[0] =  (arr[0] + arr[1]) / 2
    for i in range(1, arr.shape[0] - 1):
        arr_out[i] = arr[i-1:i+2].sum() / 3
    arr_out[-1] = (arr[-2] + arr[-1]) / 2

    return arr_out
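
Before reaching for Cython, it can help to have a vectorized NumPy version as a baseline and as a reference for checking results. The following is a sketch under the assumption that the input is one-dimensional with at least two elements; `mean3filter_vectorized` is a hypothetical name.

```python
import numpy as np


def mean3filter_vectorized(arr):
    """Mean filter with window 3, using slices instead of a Python loop."""
    arr = np.asarray(arr, dtype=float)
    out = np.empty_like(arr)
    # The two edges average over the two available neighbours only.
    out[0] = (arr[0] + arr[1]) / 2
    out[-1] = (arr[-2] + arr[-1]) / 2
    # Interior points: add three shifted views of the array element-wise.
    out[1:-1] = (arr[:-2] + arr[1:-1] + arr[2:]) / 3
    return out


data = np.arange(6, dtype=float)
filtered = mean3filter_vectorized(data)
print(filtered)
```

Converting to `float` up front avoids silently truncating the averages when the input has an integer dtype, since `np.empty_like` otherwise keeps the input's dtype.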

# Wrapping external code in Cython

In [64]:
f3

<function _cython_magic_3a32b0ec1be700f82bc5f623ce70867b.f3>

# Numba

In [76]:
import numba

@numba.jit
def f(x):
    return x ** 4 - 3 * x

@numba.jit
def integrate_f(func, a, b, n):
    s = 0
    dx = (b - a) / n
    
    s += func(a) * dx/2
    for i in range(1, n):
        s += func(a + i * dx) * dx
    s += func(b) * dx/2
    return s

In [72]:
%timeit integrate_f(f, -10, +10, 1_000_000)

1.37 ms ± 16.7 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)


In [77]:
f

CPUDispatcher(<function f at 0x7f4620c55750>)

In [83]:
f.nopython_signatures

[(int64,) -> int64, (float64,) -> float64]

In [85]:
x = np.eye(3)
x

array([[1., 0., 0.],
       [0., 1., 0.],
       [0., 0., 1.]])

In [86]:
f(x)

array([[-2.,  0.,  0.],
       [ 0., -2.,  0.],
       [ 0.,  0., -2.]])

In [87]:
f.nopython_signatures

[(int64,) -> int64,
 (float64,) -> float64,
 (array(float64, 2d, C),) -> array(float64, 2d, C)]
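
The growing list of signatures suggests how to think about a dispatcher: one specialization per combination of argument types, created on first use and reused afterwards. The toy class below illustrates only this bookkeeping in pure Python. It is not Numba's implementation and compiles nothing; `ToyDispatcher` and `poly` are hypothetical names.

```python
class ToyDispatcher:
    """Keep one 'specialization' per tuple of argument type names."""

    def __init__(self, py_func):
        self.py_func = py_func
        self.specializations = {}

    def __call__(self, *args):
        signature = tuple(type(arg).__name__ for arg in args)
        if signature not in self.specializations:
            # A real JIT compiler would generate machine code for this
            # signature here; the toy version just stores the Python function.
            self.specializations[signature] = self.py_func
        return self.specializations[signature](*args)

    @property
    def signatures(self):
        """Signatures seen so far, in the order they were first used."""
        return list(self.specializations)


@ToyDispatcher
def poly(x):
    return x ** 4 - 3 * x


poly(2)    # first call with an int adds ('int',)
poly(2.0)  # first call with a float adds ('float',)
poly(3)    # reuses the existing ('int',) entry
print(poly.signatures)  # [('int',), ('float',)]
```

This is also why the first call with a new argument type is slower than later calls: the specialization has to be created before it can run.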