# Profiling, Cython, and Numba 🚀
### Zbyszek & Jakob
### ASPP 2022, Bilbao, Spain

## Outline

* Introduction
* Profiling
* Speed up Python code using Cython
  * Basic principles
  * Interacting with NumPy arrays
* Using Numba to speed up Python code

## Introduction

* By now you are the *Master of Research*(tm).
![Master of research](figures/mor.png)
* Using your newly gained skills you can confidently transform any idea into a great manuscript.

* It seems like the only thing that's holding you back is the **execution speed** of your scripts!
* Both Cython and Numba are tools to make your code faster -> **optimization**.

## Exercise

Who thinks that they would benefit from reduced execution time?

Please raise your hand.

## The three rules of optimization
(adapted from Sebastian Witowski, EuroPython 2016)

#### 1. Don't.
 * Likely you don't need it.
 * Optimization comes with costs.

## Exercise

What are the costs associated with optimization?

#### 2. Don't yet.
 * Is your code finished?
 * Did you write tests?
 * Are you sure it's worth the investment?

#### 3. Profile
* Don't guess which part of your code you should optimize!
* Measure. Measure. Measure.

## Runtime profilers

- profilers monitor the execution of your script and record, for example, how much time is spent in each function
- here we consider [py-spy](https://github.com/benfred/py-spy), a sampling-based profiler for Python
  - simply put, `py-spy` examines your program at regular intervals and records which part is currently being executed
- you can apply it to your script with `py-spy record -o profile.svg -- python myprogram.py`
  - to make the timings accurate it needs to collect enough data; you can control the "sampling rate" using the `-r` argument
- after measuring, `py-spy` will produce a "flamegraph" like the following
![flamegraph](./figures/flamegraph.svg)
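
To try the command above, one could start from a toy script like the following sketch. The file name `myprogram.py` and the functions `cheap`, `expensive`, and `main` are made up for illustration; the only point is that one function clearly dominates the runtime, so it should show up as the widest block in the flamegraph.

```python
"""Toy script for trying out a sampling profiler (hypothetical myprogram.py)."""
import math


def cheap(n):
    # Fast: the summation runs in C inside the built-in sum().
    return sum(range(n))


def expensive(n):
    # Slow: a pure-Python loop with two math calls per iteration.
    total = 0.0
    for i in range(1, n):
        total += math.sqrt(i) * math.sin(i)
    return total


def main(repeats=5):
    # Increase `repeats` if the run is too short to collect enough samples.
    for _ in range(repeats):
        cheap(10_000)
        expensive(200_000)


if __name__ == "__main__":
    main()
```

In the resulting flamegraph, most samples would be expected to fall inside `expensive`, while `cheap` may barely be visible.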

## Demo

Using a simple script, Jakob will explain how to read flamegraphs.

## Example: numerical integration

![RiemannSum](figures/MidRiemann2.svg)

Riemann sum: $\int_a^b dx f(x) \approx \sum_{i = 0}^{n - 1} f(a + (i + 0.5) \Delta x) \Delta x$ with $\Delta x = (b - a)/n$

here $a=0, b=2, n=4$
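
As a worked example of the formula, here is a minimal sketch that evaluates the midpoint sum for $a=0, b=2, n=4$. The integrand $f(x) = x^2$ and the function name `midpoint_riemann_sum` are assumptions chosen for illustration; the figure itself does not fix a particular $f$.

```python
def midpoint_riemann_sum(f, a, b, n):
    """Approximate the integral of f over [a, b] using n midpoint rectangles."""
    dx = (b - a) / n
    # Evaluate f at the midpoint of each of the n subintervals.
    return sum(f(a + (i + 0.5) * dx) for i in range(n)) * dx


def square(x):
    # Assumed example integrand; its exact integral over [0, 2] is 8/3.
    return x ** 2


approximation = midpoint_riemann_sum(square, 0, 2, 4)
print(approximation)  # prints 2.625
```

With only four rectangles, the result 2.625 is already close to the exact value $8/3 \approx 2.667$.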

### Example implementation
(see [./profiling/numerical_integration.py](./profiling/numerical_integration.py))

Where do you think the bottlenecks are? *(don't do this at home!)*

## Demo

With your help, Jakob will demonstrate a typical profiling/optimization workflow based on this script.

## Exercise 00

It's time to put theory into practice. We have prepared an example script (see [./profiling/count_words.py](./profiling/count_words.py)) which counts the number of occurrences of words in a text.

0. Fork & clone this repository.
1. Familiarize yourself with the script.
2. Guess which functions are slow and should be optimized. *(don't do this at home.)*
3. Use the workflow (time -> py-spy -> timeit/lprun -> time) we have just demonstrated to reduce the script's execution time. **Make sure not to break the tests.**
4. Commit your changes in a new branch and create a PR. Include the duration before/after optimization in the PR message.

Afterwards we will discuss the exercise jointly.

## Exercise discussion

What did we learn?
- ...

## Profiling conclusion

- Before optimizing, first finish your code & write tests!
- Then *measure* to find slow functions. **Profiling is easy!**
- Only optimize the slowest functions & *know when to stop*!
- Most profilers can be invoked stand-alone and within IPython
  - `time` (commandline)
  - `%timeit`
  - `import timeit; timeit.timeit('some_func()', globals=globals())`
- [py-spy](https://github.com/benfred/py-spy) is just one of many profilers; alternatives:
  - [cProfile](https://docs.python.org/3/library/profile.html) + [snakeviz](https://github.com/jiffyclub/snakeviz)
  - [pyinstrument](https://github.com/joerick/pyinstrument)
- Here we focus on profiling *runtime*, but maybe you are limited by *memory*
  - [memray](https://github.com/bloomberg/memray)
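
Outside of IPython, the standard library already covers the basics. The following sketch times a function with `timeit` and then profiles it with `cProfile`; the function `slow_sum` is a made-up stand-in for whatever you actually want to measure.

```python
import cProfile
import io
import pstats
import timeit


def slow_sum(n):
    # Hypothetical workload: a deliberately naive pure-Python loop.
    total = 0
    for i in range(n):
        total += i * i
    return total


# timeit: run the call 20 times per measurement, repeat 3 times, and keep
# the best measurement to reduce noise from other processes.
calls_per_run = 20
timings = timeit.repeat(lambda: slow_sum(10_000), number=calls_per_run, repeat=3)
best = min(timings) / calls_per_run
print(f"best time per call: {best:.2e} s")

# cProfile: record per-function statistics for a single call.
profiler = cProfile.Profile()
profiler.runcall(slow_sum, 100_000)

# Print the five most expensive entries, sorted by cumulative time.
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(5)
report = stream.getvalue()
print(report)
```

Unlike `py-spy`, `cProfile` instruments every function call, so its overhead can distort timings for code that makes many small calls.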

### Optimization: what to do (in order of increasing complexity)

- Do nothing
- "Vectorization" (`numpy`!!)
- Data structures and algorithms
- Memoization / caching
- Non-Python libraries (`blas` vs. `openblas` vs. `atlas` vs. Intel `mkl`)
- Buy better hardware
- **Cython / Numba** / pythran
- **Parallelization** (->tomorrow)
- GPUs
- Low-level code
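
To illustrate one of the cheaper options from this list, here is a minimal memoization sketch using `functools.lru_cache`. The recursive Fibonacci function and the `call_count` counter are assumptions for illustration; they only serve to show how caching removes repeated work.

```python
from functools import lru_cache

call_count = 0  # counts how often the function body actually runs


@lru_cache(maxsize=None)
def fib(n):
    """Naive recursive Fibonacci; the cache makes each n be computed only once."""
    global call_count
    call_count += 1
    return n if n < 2 else fib(n - 1) + fib(n - 2)


result = fib(30)
print(result)      # prints 832040
print(call_count)  # prints 31
```

Without the decorator, the same call would execute the function body more than a million times; with it, each argument from 0 to 30 is evaluated exactly once.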


## Cython

In [None]:
%load_ext cython

In [None]:
def f(x):
    return x**4 - 3 * x

def integrate_f(func, a, b, n):
    dx = (b - a) / n
    s = 0.0
    for i in range(n):
        s += func(a + (i + 0.5) * dx) * dx
    return s

In [None]:
%timeit integrate_f(f, -10, +10, 1_000_000)

In [None]:
%timeit integrate_f2(f2, -10, +10, 1_000_000)

In [None]:
f2, integrate_f2

In [None]:
integrate_f3(f3, -10, +10, 1_000_000)

In [None]:
%timeit integrate_f3(f3, -10, +10, 1_000_000)

In [None]:
integrate_f4(f4, -10, +10, 1_000_000)

In [None]:
%timeit integrate_f4(f4, -10, +10, 1_000_000)

## Exercise 01-cython-primes

Please open [01-cython-primes/exercise.ipynb](01-cython-primes/exercise.ipynb) and follow the instructions therein.

### Cython function type specialization

In [None]:
%timeit integrate_f4(-10, +10, 1_000_000)

### Cython formula optimization

## Exercise 02-cython-distrib

Please open a terminal, `cd` to `02-cython-distrib/`, and follow the instructions:

In [None]:
from IPython import display
display.display(display.Markdown(open('02-cython-distrib/README.md').read()))





### When `setup.py` and when `meson.build`?

[<img src="images/logo-over-white.svg" width="100"/>](images/logo-over-white.svg)

- setuptools is (still) the standard in the Python ecosystem
- excellent integration with PyPI and other Python packages
- automatic downloads from PyPI
- clumsy integration with non-Python libraries
- weak support for optional dependencies and partial rebuilds

[<img src="images/Meson_(software)_logo_2019.svg" width="180"/>](images/Meson_(software)_logo_2019.svg)

- Meson is arguably the best available build system for compiled code
- excellent integration with pkgconfig and other system libraries
- integration with PyPI via pip, somewhat clumsy
- excellent support for user configuration, optional dependencies, and partial rebuilds

Thus, setuptools is a good solution for Python projects with some Cython code and no dependencies on system libraries. Meson is a good solution for self-contained Python and/or Cython code, possibly alongside other non-Python libraries and executables.

# Cython and Numpy Arrays

Let's start by summing up an array

Let's write a "mean filter"


$$ \{ x_0, x_1, \dots, x_{n-2}, x_{n-1} \} \longrightarrow \{ \frac{x_0 + x_1}{2}, \frac{x_0 + x_1 + x_2}{3}, \frac{x_1 + x_2 + x_3}{3}, \dots, \frac{x_{i-1} + x_i + x_{i+1}}{3}, \dots, \frac{x_{n-3} + x_{n-2} + x_{n-1}}{3}, \frac{x_{n-2} + x_{n-1}}{2} \} $$

In [None]:
import numpy as np

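def mean3filter(arr):
    """Smooth `arr` by averaging each element with its direct neighbours."""
    # Note: the output inherits the dtype of `arr`, so pass a float array
    # to avoid truncating the averages.
    arr_out = np.empty_like(arr)

    # The two edge elements have only one neighbour each.
    arr_out[0] = (arr[0] + arr[1]) / 2
    for i in range(1, arr.shape[0] - 1):
        arr_out[i] = arr[i-1:i+2].sum() / 3
    arr_out[-1] = (arr[-2] + arr[-1]) / 2

    return arr_out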
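
When speeding up a function like this, it helps to have an independent reference to check results against. The following sketch expresses the same filter with NumPy slicing instead of an explicit loop; the name `mean3filter_vectorized` is hypothetical, and the conversion to `float` is an added assumption so that integer input is not truncated.

```python
import numpy as np


def mean3filter_vectorized(arr):
    """Three-point mean filter written with array slices (illustrative sketch)."""
    arr = np.asarray(arr, dtype=float)
    out = np.empty_like(arr)

    # Edge elements are averaged with their single neighbour.
    out[0] = (arr[0] + arr[1]) / 2
    out[-1] = (arr[-2] + arr[-1]) / 2

    # Interior elements: left neighbour + element itself + right neighbour.
    out[1:-1] = (arr[:-2] + arr[1:-1] + arr[2:]) / 3
    return out


example = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
filtered = mean3filter_vectorized(example)
print(filtered)  # values: 1.5, 2.0, 3.0, 4.0, 4.5
```

Comparing the output of an optimized implementation against such a reference with `np.allclose` is a cheap way to make sure the optimization did not change the results.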

## Exercise 03-cython-mean3filter

Please open [03-cython-mean3filter/exercise.ipynb](03-cython-mean3filter/exercise.ipynb) and follow the instructions therein.

# Wrapping external code in Cython

In [None]:
f3

# Numba

In [None]:
%timeit integrate_f(f, -10, +10, 1_000_000)
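
As a rough sketch of what a Numba version of the integration example might look like, the functions can be decorated with `numba.njit`. The names `f_numba` and `integrate_numba` are hypothetical, the integrand is called directly instead of being passed as an argument to keep the example simple, and the `ImportError` fallback is an added assumption so that the snippet still runs (uncompiled) where Numba is not installed.

```python
try:
    from numba import njit
except ImportError:
    # Assumed fallback: without Numba, run the plain Python functions.
    def njit(func):
        return func


@njit
def f_numba(x):
    return x**4 - 3 * x


@njit
def integrate_numba(a, b, n):
    # Same midpoint rule as integrate_f, with the integrand fixed to f_numba.
    dx = (b - a) / n
    s = 0.0
    for i in range(n):
        s += f_numba(a + (i + 0.5) * dx) * dx
    return s


# The first call includes the compilation time; time a later call instead.
result = integrate_numba(-10.0, 10.0, 100_000)
print(result)
```

The exact value of the integral is 40000, since the odd term $-3x$ cancels over the symmetric interval, so the printed result should be very close to that.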

# Architecture of Cython and Numba

[<img src="images/cython_architecture.png" width="400"/>](images/cython_architecture.png)

[<img src="images/numba_architecture.png" width="400" />](images/numba_architecture.png)