# High performance Python code

The two biggest benefits of Python are:

1. Optimized for readability which makes reading/writing code relatively easy
2. Incredible ecosystem of packages that have developed around vanilla Python

However, one trade-off is that Python developers have historically been forced to write C/Fortran/CUDA code in order to achieve maximum performance. A number of advances make this much less true today than it was 5 or 10 years ago.

We will talk about two of these advances:

* Just in time (JIT) compilation
* More support within Python for specialized hardware like GPUs

## Numba (CPU)

Pieces of this section were taken from the [QuantEcon lecture on Numba](https://python-programming.quantecon.org/numba.html).

If you'd like to know more about these tools, we recommend reading the QuantEcon lecture and the (very good) [Numba documentation](https://numba.readthedocs.io/en/stable/index.html#).

In [None]:
import numba
import numpy as np
import pandas as pd

## Compiled vs Interpreted

You may have heard about the differences between "compiled programming languages" and "interpreted programming languages".

* A compiled language is run in a few steps:
  1. Programmer writes the code
  2. Compiler converts that code into machine code
  3. Computer runs machine code. Note that once the code is compiled, it can be run whenever one wants without the compilation step
* An interpreted language runs code differently:
  1. Programmer writes code
  2. Computer "runs" the code by
    * An "interpreter" reads the code line-by-line
    * For each line, the interpreter figures out what the inputs are and tries to convert it to machine code
    * Computer runs the machine code
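As a small illustration of the "translate once, run many times" idea, Python's built-in `compile` turns source text into a code object that can be executed repeatedly without re-parsing the text. Note that CPython produces bytecode for its own virtual machine here rather than machine code, so this is only an analogy for the compiled workflow. The names `source`, `code_obj`, and `namespace` are ours.

```python
# Source code as plain text (what the programmer writes)
source = "result = sum(i * i for i in range(10))"

# "Compile" step: parse the text once into a reusable code object.
# In CPython this is bytecode, not native machine code.
code_obj = compile(source, "<example>", "exec")

# "Run" step: execute the already-compiled object as often as we like
namespace = {}
for _ in range(3):
    exec(code_obj, namespace)

print(namespace["result"])  # 285
```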

**Pros and cons of compiled**

* Once the compiler has run, the code is already machine code and runs very fast (as fast as possible given the code you wrote)
* For very large programs, compilation is an upfront cost that can take minutes or hours
* Compiled programs can only be shared within similar hardware architecture and operating systems (though as long as there's a compiler for the hardware/OS, one could recompile the code)

**Pros and cons of interpreted**

* As long as there is an interpreter for the hardware/operating system, interpreted code can be easily shared
* Significantly slower than compiled code because of the back and forth to read the code line-by-line (which has to be redone each time the code is run!)
* Easier to interact with your code (and more importantly, your data!) because you can run one line at a time

## Just-in-time compiled (JIT)

JIT is a relatively modern development which has the goal of bridging some of the gaps between compiled and interpreted.

Rather than compiling the code ahead of time or interpreting it line-by-line, JIT compiles small chunks of the code right before it runs them.

For example, imagine that we have a function `mc_approximate_pi` that approximates the value of pi using Monte Carlo methods. We might even want to run this function multiple times to average across the approximations. JIT handles each call to the function as follows:

1. Check the input types to the function
2. The first time it sees particular types of inputs to the function, it compiles the function assuming those types as inputs and stores this compiled code
3. The computer then runs the function using the compiled code. If it has seen these input types before, it can jump directly to this step.
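The three steps above can be sketched in plain Python. This is only a toy model of the bookkeeping, not how Numba is implemented: the decorator `jit_sketch` and its `compile_count` attribute are hypothetical names, and the "compilation" is just a counter increment standing in for the expensive step.

```python
def jit_sketch(func):
    cache = {}  # maps a tuple of argument types to a "compiled" function

    def wrapper(*args):
        # Step 1: check the input types
        arg_types = tuple(type(a) for a in args)

        # Step 2: "compile" only the first time these types are seen
        if arg_types not in cache:
            wrapper.compile_count += 1  # stand-in for real compilation
            cache[arg_types] = func

        # Step 3: run the stored version
        return cache[arg_types](*args)

    wrapper.compile_count = 0
    return wrapper


@jit_sketch
def add(a, b):
    return a + b


add(1, 2)      # first (int, int) call: compiles
add(3, 4)      # (int, int) seen before: reuses the stored version
add(1.0, 2.0)  # new types (float, float): compiles again

print(add.compile_count)  # 2
```

Keying the cache on argument types is what makes the first call with a new type slow and every later call with that type fast.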

### Our favorite JIT tools

* `Numba`: [Numba](https://numba.pydata.org/) is a Python package that adds JIT compilation capabilities for a subset of the Python programming language. The priority has been tools for scientific computing (`numpy`, etc.). The main drawback is that only certain packages work with JIT.
* `Julia`: [Julia](https://julialang.org/) is an exciting language that is built entirely around JIT compilation. Because the language is built around JIT, all packages interact nicely with one another while maintaining their JIT capabilities.

### Numba

As mentioned, Numba is a Python package that adds JIT compilation to a subset of the language using the LLVM compiler library.


What works within Numba?

* Many Python objects, including: lists, tuples, dictionaries, integers, floats, strings
* Python logic, including: `if.. elif.. else`, `while`, `for .. in`, `break`, `continue`
* NumPy arrays
* Many (but not all!) NumPy functions

For more information, read these sections from the documentation

* [Supported Python features](https://numba.readthedocs.io/en/stable/reference/pysupported.html)
* [Supported NumPy  features](https://numba.readthedocs.io/en/stable/reference/numpysupported.html)

When to use Numba?

* Loops!!!
* Can facilitate parallelization (we won't talk about this today)
* GPU code generation (we won't talk about this today)
* Did we say loops yet?

### Example

Let's begin with the example we described above by writing a function that approximates pi.

Imagine that the value of $\pi$ has been lost and that you're tasked with finding it. How would you do it?
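One way to reason about it, sketched here under the assumption that $x$ and $y$ are drawn independently and uniformly from $[0, 1)$: the point $(x, y)$ lands in the unit square, and the part of the unit circle inside that square is a quarter circle with area $\pi / 4$. The probability of landing inside the quarter circle is the ratio of the two areas, so the fraction of $n$ draws that land inside gives an estimate of $\pi$:

$$
P\left(x^2 + y^2 < 1\right) = \frac{\pi / 4}{1} = \frac{\pi}{4}
\quad \Longrightarrow \quad
\pi \approx 4 \cdot \frac{\#\{\text{draws with } x^2 + y^2 < 1\}}{n}
$$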

In [None]:
def calculate_pi(n=1_000_000):
    """
    Approximates pi by drawing two random numbers and
    determining whether the sum of their squares is
    less than one (which tells us whether the point is
    inside the upper-right quadrant of the unit circle).
    The fraction of draws inside the quarter circle
    approximates its area, which we can then multiply by
    4 to get the area of the circle (which is pi since r=1)
    """
    in_circ = 0

    # Iterate for many samples
    for i in range(n):
        # Draw random numbers
        x = np.random.random()
        y = np.random.random()

        if (x**2 + y**2) < 1:
            in_circ += 1

    return 4 * (in_circ / n)


In [None]:
%%time

# Vanilla Python function
calculate_pi(5_000_000)

In [None]:
%%time

# JIT function
calculate_pi_numba = numba.jit(calculate_pi)
calculate_pi_numba(5_000_000)

In [None]:
%%time

calculate_pi_numba(5_000_000)

Why was the second run faster?

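Remember the order in which JIT works: the first time it sees a particular function with given input types, it has to compile the function. The first timing therefore includes the compilation, while the second run reuses the compiled code.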

### Pandas?

Does Numba work with Pandas?

We continue with our example of drawing random numbers to approximate $\pi$. We will now draw $I$ distinct groups of $N$ points and track whether each point was in the circle.

In [None]:
def fill_dataframe(N, I):
    # Create empty dataframe
    df = pd.DataFrame(index=np.arange(N), columns=np.arange(I))

    for n in range(N):
        for i in range(I):
            x = np.random.rand()
            y = np.random.rand()

            df.at[n, i] = int((x**2 + y**2) < 1)

    return df

In [None]:
# Our I approximations of pi
N = 1_000
I = 10

4 * fill_dataframe(N, I).mean(axis="index")

In [None]:
fill_dataframe_numba = numba.jit(fill_dataframe, nopython=False)

fill_dataframe_numba(N, I)

**Object mode vs no Python mode**

* Object mode: Allows Numba to call out to the Python interpreter when it sees something that it doesn't recognize. The cost is that this is slow and forces Numba to give up certain optimizations.
* No Python mode: If Numba sees an object that it doesn't recognize, it throws an error. This strictness allows Numba to make additional optimizations.

Numba's default behavior used to be to compile things in "object" mode but, as of version 0.59, the default was reversed to no Python mode because it is the main use case (and how the developers recommend people use it).

**Alternate way to fill the DataFrame**


In [None]:
@numba.jit(nopython=True)
def simulate_pi_draws(N, I):
    pi_out = np.empty((N,  I), dtype="int")

    for n in range(N):
        for i in range(I):
            x = np.random.rand()
            y = np.random.rand()

            pi_out[n, i] = int((x**2 + y**2) < 1)
    
    return pi_out


def fill_dataframe2(N, I):
    pi_draws = simulate_pi_draws(N, I)

    df = pd.DataFrame(
        data=pi_draws, index=np.arange(N), columns=np.arange(I)
    )

    return df

In [None]:
%%time

4*fill_dataframe(10_000, 25).mean(axis=0).mean()

In [None]:
%%time

4*fill_dataframe_numba(10_000, 25).mean(axis=0).mean()

In [None]:
%%time

4*fill_dataframe2(10_000, 25).mean(axis=0).mean()

Additionally, `numba` makes it easy to parallelize some frequent use cases with the use of `prange`.

In [None]:
from numba import prange

numba.get_num_threads()

In [None]:
@numba.jit(nopython=True)
def evaluate_pi_st(N, I):
    pi = np.empty(I)

    for i in range(I):
        pi_approx = 0.0
        for n in range(N):
            x = np.random.rand()
            y = np.random.rand()

            pi_approx += 4*((x**2 + y**2) < 1)/N

        pi[i] = pi_approx

    return np.mean(pi)

evaluate_pi_st(25, 5);


@numba.jit(nopython=True, parallel=True)
def evaluate_pi_mt(N, I):
    pi = np.empty(I)

    for i in prange(I):

        pi_approx = 0.0
        for n in range(N):
            x = np.random.rand()
            y = np.random.rand()

            pi_approx += 4*((x**2 + y**2) < 1)/N

        pi[i] = pi_approx

    return np.mean(pi)

evaluate_pi_mt(25, 5);

In [None]:
%%time

evaluate_pi_st(50_000, 10_000)

In [None]:
%%time

evaluate_pi_mt(50_000, 10_000)

In [None]:
3.54 / 0.4

## Numba (GPU)

Historically, writing code that could run on a GPU has been very difficult and required specialized tools.

This is becoming less and less true for certain classes of problems. We will talk about some of the tools that make it very easy to leverage your GPU for scientific computing.

One note: Using `numba` to write GPU code requires an NVIDIA GPU. NVIDIA has some proprietary tooling called CUDA that `numba` uses.

In [None]:
from numba import cuda

![cpu-vs-gpu](cpu-vs-gpu.png)

CPUs and GPUs are built with different purposes in mind. CPUs are optimized to perform relatively large tasks in serial while GPUs are structured to do many relatively small tasks all at once.

In [None]:
gpu = cuda.current_context()

gpu.device.name  # I have 4864 cuda cores on my GPU

**CUDA Kernels**

The "small" task that a GPU executes across all of its cores is referred to as a kernel.

A kernel takes arrays as inputs and it does not have a return value. This means that the output that we would like from the kernel should be written into one of the arrays.

Let's start with a simple example where we add one to whatever value is currently in the array.

In [None]:
@cuda.jit
def increment_by_one(x):
    pos = cuda.grid(1)
    if pos < x.size:
        x[pos] = x[pos] + 1.0

### Threads and blocks

You might be wondering in the function above what the lines

```
    pos = cuda.grid(1)
    if pos < x.size:
       ...
```

are for.

In GPU computing, the work is divided into blocks.

Each block is assigned a certain number of threads, and each thread does a portion of the work. The number of threads per block should be a multiple of 32 (with a maximum of 1024 threads per block) because the hardware schedules threads in groups of 32.

The kernel runs on every thread of every block, but the total number of threads, `threads_per_block * nblocks`, will generally not equal the size of the array because `threads_per_block` should be a multiple of 32. The check `pos < x.size` makes sure that any extra threads do nothing rather than write past the end of the array.
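To make the indexing concrete, here is a plain-Python sketch (no GPU required) that mimics what `cuda.grid(1)` computes in a one-dimensional launch, namely `block_index * threads_per_block + thread_index`. The function `simulate_launch` and its arguments are hypothetical names used only for illustration, and a real GPU runs these threads concurrently rather than in nested loops.

```python
def simulate_launch(array_size, threads_per_block):
    # Round up so that every element is covered by some thread
    blocks = (array_size + threads_per_block - 1) // threads_per_block

    x = [1.0] * array_size
    idle_threads = 0

    for block_index in range(blocks):
        for thread_index in range(threads_per_block):
            # Equivalent of `pos = cuda.grid(1)` in a 1-D launch
            pos = block_index * threads_per_block + thread_index

            if pos < array_size:
                x[pos] = x[pos] + 1.0
            else:
                # Threads past the end of the array must do nothing
                idle_threads += 1

    return blocks, idle_threads, x


blocks, idle_threads, x = simulate_launch(10_000, 256)
print(blocks, idle_threads)  # 40 240
```

With 10,000 elements and 256 threads per block, 40 blocks provide 10,240 threads, so the last 240 threads fail the bounds check and leave the array untouched.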

**Choosing a number of threads/blocks**

In [None]:
%%time

test = np.ones(10_000)

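# Use a multiple of 32 threads per block and round the block count up
threads_per_block = 256
blocks = (10_000 + threads_per_block - 1) // threads_per_block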

for n in range(25):
    increment_by_one[blocks, threads_per_block](test)

cuda.synchronize()

In [None]:
test

Each time we call the function above with a NumPy array, the array is copied from CPU memory to GPU memory and then back again.

Many compute applications are "memory bound" rather than "compute bound", which means that moving data quickly enough is often the binding constraint on speeding up our code.

When possible, you should move your array to the GPU and not move it back until you're done.
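A rough back-of-the-envelope count shows why this matters. It assumes one host-to-device copy before each launch and one device-to-host copy after it whenever a NumPy array is passed directly to a kernel. The variable names are ours, and the figures count bytes only, ignoring the fixed latency of each transfer.

```python
n_elements = 10_000
bytes_per_element = 8  # float64, the dtype of np.ones(10_000)
n_launches = 25

array_bytes = n_elements * bytes_per_element

# Passing the NumPy array directly: copy in and out on every launch
bytes_moved_host_array = n_launches * 2 * array_bytes

# Calling cuda.to_device once up front and copy_to_host once at the end
bytes_moved_device_array = 2 * array_bytes

print(bytes_moved_host_array)    # 4000000
print(bytes_moved_device_array)  # 160000
print(bytes_moved_host_array // bytes_moved_device_array)  # 25
```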

In [None]:
%%time

test = np.ones(10_000)
test_cuda = cuda.to_device(test)

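# Use a multiple of 32 threads per block and round the block count up
threads_per_block = 256
blocks = (10_000 + threads_per_block - 1) // threads_per_block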

for n in range(25):
    increment_by_one[blocks, threads_per_block](test_cuda)

cuda.synchronize()
test_result = test_cuda.copy_to_host()

In [None]:
test_result

### Back to computing pi

We now return to our example of computing pi, except this time we do it on the GPU.

In [None]:
from numba.cuda.random import create_xoroshiro128p_states, xoroshiro128p_uniform_float32

In [None]:
@cuda.jit(device=True)
def in_circle(x, y):
    if (x**2 + y**2) <= 1:
        return 1
    else:
        return 0


@cuda.jit
def compute_pi_kernel(rng_states, iterations, out):
    pos = cuda.grid(1)

    if pos < out.size:
        inside = 0
        for i in range(iterations):
            x = xoroshiro128p_uniform_float32(rng_states, pos)
            y = xoroshiro128p_uniform_float32(rng_states, pos)

            inside += in_circle(x, y)

        out[pos] = 4.0 * (inside / iterations)


def compute_pi_on_gpu(N, I):
    # Threads/blocks
    threads_per_block = 256
    blocks = (I//threads_per_block + 1)

    # Set up RNG
    rng_states = create_xoroshiro128p_states(I, seed=20220116)

    out = cuda.device_array(I)
    compute_pi_kernel[blocks, threads_per_block](rng_states, N, out)
    cuda.synchronize()

    return np.mean(out.copy_to_host())

# Make sure things get compiled
compute_pi_on_gpu(50, 10);

**Comparing the speeds**

In [None]:
N = 50_000
I = 50_000

In [None]:
%%time

evaluate_pi_st(N, I)

In [None]:
%%time

evaluate_pi_mt(N, I)

In [None]:
%%time

compute_pi_on_gpu(N, I)