# Python Programming

## 1. Introduction to Python

::::{hint} Prerequisites

:::{dropdown} **Scalar variables**: integers (`int`), floating point numbers (`float`), strings (`str`), *etc.*

```python
42  # int
1.0  # float
'these'
"are"
"""all
valid strings."""
```
:::

:::{dropdown} **Collections**: `list`, `tuple`, `dict`, `set`, *etc.*

```python
>>> # Lists are mutable, ordered sequences
>>> my_things = [1, 2, 'banana']
>>> my_things.append(3.14)
>>> my_things
[1, 2, 'banana', 3.14]
```
```python
>>> # Tuples are immutable, ordered sequences
>>> truthy_values = ('yes', 'ja', True, 1.0)
>>> del truthy_values[0]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: 'tuple' object doesn't support item deletion
```
```python
>>> # Dictionaries: key-value pairs or hash-maps
>>> game_status = {"jill": 99, "t-rex": "RIP"}
```
```python
>>> # Sets are unordered collections of unique elements
>>> fruits = {"apple", "banana", "banana"}
>>> fruits
{'apple', 'banana'}
```
:::

:::{dropdown} **Control structures**: `if-elif-else`, `for` loop, `while` loop, *etc.*

```python

# Conditional statements
if 0.5 < schrödingers_cat <= 1.0:
    print("Alive")
elif 0.0 <= schrödingers_cat <= 0.5:
    print("Dead")
else:
    print("Undead")

# Loops
for i_day in range(365):
    print("Hi, good morning!")

while True:
    print("Na", end="")
    if input() == "stop":
        break
print("Batman!")
```

:::

:::{dropdown} **Functions**: built-in functions (`len`, `type`, ...) and user-defined functions.

```python
>>> greeting = "hello world"
>>> len(greeting)
11
>>> type(greeting)
<class 'str'>
```
:::

:::{dropdown} **Decorators**: special syntax which modifies the behaviour of a function or a class.
:class: dropdown

```python
@something
def foo():
    ...
```

```{seealso}

- `functools.lru_cache` for an example of a decorator,
- `functools.wraps` and `contextlib.contextmanager` to create your own decorators.
```
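
As a minimal sketch of how such a decorator can be written, the hypothetical `shout` below wraps a function and upper-cases its return value. The names `shout` and `greet` are illustrative only and not part of any library.

```python
import functools


def shout(func):
    """Hypothetical decorator: upper-case the string returned by `func`."""

    @functools.wraps(func)  # keep the name and docstring of `func`
    def wrapper(*args, **kwargs):
        return func(*args, **kwargs).upper()

    return wrapper


@shout
def greet(name):
    """Return a greeting."""
    return f"hello {name}"


print(greet("world"))  # HELLO WORLD
print(greet.__name__)  # greet
```

Writing `@shout` above `greet` is shorthand for `greet = shout(greet)`.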

:::

::::


Python also encourages us to structure code using
- Type annotations: hints which can be used by a type-checker or a compiler.
- Classes: simple encapsulation, inheritance, *etc.*
- Modules: a `.py` file containing valid Python code, which typically does not execute anything on import.
- Packages: installable, reusable collections of modules and other files, distributable via PyPI or conda-forge

**but this is not necessary for the purpose of this webinar.**

::::{hint} Jupyter and IPython 

This webinar is demonstrated in a Jupyter notebook, which uses an IPython (Interactive Python) kernel (a console running in the background).
All Python syntax is valid in IPython, which adds some special features on top.

Here, we will make use of the special variable `_` and **magic commands** ✨ `%time`, `%timeit` and `%%timeit`.

:::: 

The special variable `_` stores the value of the previous output.

In [None]:
1 + 2

In [None]:
print(_)

Line magic `%time` records the wall-time for executing a Python expression _once_.

In [None]:
%time sum(range(100))

Line magic `%timeit` lets you run a micro-benchmark of a single Python expression.

In [None]:
xs = list(range(100))

In [None]:
%timeit sum(xs)

It can also be used as the cell magic `%%timeit`, which runs a micro-benchmark of a block of Python statements.

In [None]:
%%timeit
total = 1
for i in range(90):
    total += i / total
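
Outside IPython, a similar micro-benchmark can be sketched with the standard-library `timeit` module. The repeat and loop counts below are arbitrary choices for illustration, not the values `%timeit` would pick automatically.

```python
import timeit

xs = list(range(100))

# Run `sum(xs)` 10,000 times per measurement and take 5 measurements
timings = timeit.repeat("sum(xs)", globals={"xs": xs}, number=10_000, repeat=5)

# Report the best per-call time, in seconds
best_per_call = min(timings) / 10_000
print(f"best: {best_per_call:.3e} s per call")
```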

## 2. Numpy, Pandas, Matplotlib

In [None]:
# Code for display_timings()

import matplotlib.pyplot as plt


def display_timings(**kwargs):
    """Plot average wall times (with standard deviations) of `%timeit -o` results."""
    keys = list(kwargs)
    fmt_keys = [key.replace("_", " ") for key in keys]
    i_keys = list(range(len(keys)))
    times = [kwargs[key].average for key in keys]
    errors = [kwargs[key].stdev for key in keys]

    fig, ax = plt.subplots(figsize=(8, min(len(keys), 8)))
    ax.barh(i_keys, times)
    ax.errorbar(times, i_keys, xerr=errors, fmt="ro")
    ax.set(xscale="log", xlabel="avg. wall time (seconds)")
    ax.set_yticks(i_keys, fmt_keys)

### 2.1 Numpy

:::{important} `NumPy` is a Python library for arrays
It can be used to perform a wide variety of (efficient) mathematical operations and linear algebra on arrays and matrices.
:::

In [None]:
a = list(range(10000))
b = [0] * 10000

Why use **loops**...

In [None]:
%%timeit -q -o
for i in range(len(a)):
    b[i] = a[i] ** 2

In [None]:
time_loops = _

In [None]:
import numpy as np

a = np.arange(10000)

... when you can **vectorize**!

In [None]:
%%timeit -q -o
b = a**2

In [None]:
time_vec = _

In [None]:
display_timings(loops_over_lists=time_loops, vectorized_math_with_arrays=time_vec)

:::{note}
NumPy is conventionally imported as `np`.
:::
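
To check that the loop and the vectorized expression compute the same thing, here is a small sketch on a 10-element input. The variable names are illustrative.

```python
import numpy as np

values = list(range(10))

# Loop version: square each element one at a time
squares_loop = [0] * len(values)
for i in range(len(values)):
    squares_loop[i] = values[i] ** 2

# Vectorized version: one expression over the whole array
squares_vec = np.arange(10) ** 2

print(squares_loop)
print(squares_vec.tolist())
```

Both print the same list; the vectorized form simply moves the loop out of the Python interpreter and into NumPy's compiled code.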

#### Creating arrays

Python sequences such as lists or tuples can be **transformed** into a NumPy array.

NumPy arrays can be **1D, 2D, ..., n-dimensional**.

In [None]:
a = np.array([1, 2, 3])  # 1-dimensional array (rank 1)
b = np.array([[1, 2, 3], [4, 5, 6]])  # 2-dimensional array (rank 2)

print(b.shape)  # the shape (rows,columns)
print(b.size)  # number of elements

A NumPy array is **homogeneous**, that is, all its elements share a single data type.

In [None]:
a.dtype
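
One consequence of a single data type, sketched below with illustrative names: when the input mixes integers and floats, NumPy upcasts all elements to a common type.

```python
import numpy as np

ints = np.array([1, 2, 3])
mixed = np.array([1, 2.5, 3])  # the integers are upcast to floats

print(ints.dtype)  # an integer type; the exact width depends on the platform
print(mixed.dtype)  # float64
print(mixed)
```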

Arrays can also be **generated**.

In [None]:
np.eye(3)  # Identity "matrix"

In [None]:
a = np.arange(16)
a

In [None]:
a = a.reshape(4, 4)

b = np.random.rand(16).reshape(4, 4)
b

#### Array maths and vectorization

In [None]:
c = np.add(a, b)  # equivalent to `a + b`
c

Other common mathematical operations include:
- elementwise operations:
    - `-` (`np.subtract()`)
    - `*` (`np.multiply()`)
    - `/` (`np.divide()`)
    - `**` (`np.power()`)
- `.T` (`np.transpose()`)
- `np.sqrt()`, `np.sum()`, `np.mean()`, `np.std()`, `np.max()`, `np.min()`
- `@` (`np.matmul()`; same as `np.dot()` for 2-D arrays)

In [None]:
# Elementwise multiplication (not matrix multiplication)
a * b

In [None]:
np.dot(a, b)  # equivalent to `a @ b` for 2-D arrays

In [None]:
np.matmul(a, b)
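
To make the difference between the elementwise `*` and the matrix product `@` concrete, here is a small sketch on two hand-picked 2x2 matrices. The names `m` and `swap` are illustrative.

```python
import numpy as np

m = np.array([[1, 2], [3, 4]])
swap = np.array([[0, 1], [1, 0]])  # permutation matrix that swaps columns

elementwise = m * swap  # multiplies matching entries
matrix_product = m @ swap  # rows of `m` times columns of `swap`

print(elementwise.tolist())  # [[0, 2], [3, 0]]
print(matrix_product.tolist())  # [[2, 1], [4, 3]]
```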

### 2.2 Pandas

:::{important} `Pandas` is for tabular data

It provides intuitive data structures and functions for reading in, manipulating and
performing high-performance analysis of tabular data.
:::

#### Code example to analyze the Titanic passenger data

In [None]:
import pandas as pd

url = "https://raw.githubusercontent.com/pandas-dev/pandas/master/doc/data/titanic.csv"
titanic = pd.read_csv(url, index_col="Name")

In [None]:
titanic.shape

Tabular data can be **heterogeneous**.

In [None]:
titanic.dtypes

In [None]:
# show the first 5 rows of the dataframe

titanic.head()

In [None]:
# print summary statistics for each column

titanic.describe()

In [None]:
titanic[["Age", "Sex", "Survived", "Pclass"]].groupby(["Survived", "Sex"]).aggregate(
    "median"
)
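
The split-apply-combine pattern behind `groupby(...).aggregate(...)` is easier to follow on a tiny table. The sketch below uses hypothetical toy data, not rows from the Titanic dataset.

```python
import pandas as pd

# Hypothetical toy data, not taken from the Titanic dataset
passengers = pd.DataFrame(
    {
        "Sex": ["female", "female", "male", "male", "male"],
        "Survived": [1, 1, 0, 0, 1],
        "Age": [20, 30, 40, 60, 10],
    }
)

# Split rows into (Survived, Sex) groups, then take the median age of each group
median_age = passengers.groupby(["Survived", "Sex"])["Age"].median()
print(median_age)
```

Only the group combinations present in the data appear in the result, so there is no row for `(0, "female")` here.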

### 2.3 Matplotlib

`Matplotlib` is a comprehensive library for creating static, animated, and interactive visualizations in Python.

#### Visualization of the Titanic passenger data

In [None]:
import matplotlib.pyplot as plt

plt.figure(figsize=(9, 6))
plt.hist(
    [
        titanic[titanic["Survived"] == 1]["Age"],
        titanic[titanic["Survived"] == 0]["Age"],
    ],
    stacked=True,
    bins=30,
    label=["Survived", "Dead"],
)
plt.xlabel("Age (in Years)")
plt.ylabel("Number of passengers on Titanic")
plt.legend()

# GPU Programming using Python

There are several options for GPU programming with Python. For general background, see:
- [GPU Programming: When, Why and How?](https://enccs.github.io/gpu-programming/)
- [GPU Programming (Carpentries)](https://arc.leeds.ac.uk/lesson-gpu-programming/)

## 1. **`cuDF`** and **`cuML`** libraries in ![RAPIDS](img/RAPIDS-logo.png)

:::{important} [RAPIDS](https://rapids.ai/) is a high-level package collection

It exposes CUDA functionality through Python bindings.

**It only supports NVIDIA GPUs.**

- **`cuDF`** is the dataframe library for manipulating tabular datasets on the GPU. cuDF provides a **Pandas**-like API for loading, joining, aggregating, filtering, and manipulating data.
- **`cuML`** is a suite of libraries that implement algorithms and mathematical primitive functions for training machine learning models and making predictions, with an API similar to **`scikit-learn`**.

:::

### 1.1 Timing the Titanic

In [None]:
%timeit titanic[["Age", "Sex", "Survived", "Pclass"]].groupby(["Survived", "Sex"]).aggregate("mean")

In [None]:
import cudf

url = "https://raw.githubusercontent.com/pandas-dev/pandas/master/doc/data/titanic.csv"
titanic_gpu = cudf.read_csv(url, index_col="Name")

In [None]:
%timeit titanic_gpu[["Age", "Sex", "Survived", "Pclass"]].groupby(["Survived", "Sex"]).aggregate("mean")

:::{hint} GPU version was slower. Why?

1. The size of the data needs to be justifiably big for a GPU to be more performant than a CPU. The data is only {eval}`titanic.memory_usage().sum() / 1024` KB big.
2. Aggregation cannot be parallelized as efficiently as elementwise operations.
:::

### 1.2 Timing the Taxis

Let's try to analyze NYC taxi data from <https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page> to compute the mean trip duration.

Dictionary of the data: <https://www.nyc.gov/assets/tlc/downloads/pdf/data_dictionary_trip_records_yellow.pdf>

In [None]:
import pandas as pd

trips = pd.read_parquet("./yellow_tripdata_2024-07.parquet")
trips.head()

In [None]:
trips.shape

In [None]:
time_duration = %timeit -o (trips.tpep_dropoff_datetime - trips.tpep_pickup_datetime).mean()

In [None]:
import cudf

trips_gpu = cudf.read_parquet("./yellow_tripdata_2024-07.parquet")

In [None]:
time_duration_gpu = %timeit -o (trips_gpu.tpep_dropoff_datetime - trips_gpu.tpep_pickup_datetime).mean()

**How about an aggregate operation as before?**

In [None]:
time_agg = %timeit -o trips[["passenger_count", "trip_distance", "RatecodeID", "total_amount"]].groupby(["RatecodeID"]).aggregate("mean")

In [None]:
time_agg_gpu = %timeit -o trips_gpu[["passenger_count", "trip_distance", "RatecodeID", "total_amount"]].groupby(["RatecodeID"]).aggregate("mean")

In [None]:
display_timings(
    aggregate=time_agg,
    aggregate_gpu=time_agg_gpu,
    trip_duration=time_duration,
    trip_duration_gpu=time_duration_gpu,
)

:::{hint} Now, the GPU version was faster. Why?

1. The data for NYC Yellow taxis is {eval}`trips.memory_usage().sum() / 1024**2` **MB** big which is a moderately big dataset.
2. Even for the aggregate method, we see a modest performance gain.
:::

In [None]:
del titanic, titanic_gpu, trips, trips_gpu

In [None]:
import gc
gc.collect()

## 2. Numba

:::{important} `Numba` is an open-source just-in-time (JIT) compiler

- It translates a subset of Python and NumPy into fast machine code using LLVM.
- `Numba` offers options for parallelising Python code for CPUs and GPUs, with minor code changes.
:::

### 2.1 `numba.jit()` decorator

Numba provides several utilities for code generation, and its central feature is the `numba.jit()` decorator.

In [None]:
import numpy as np

mx = np.arange(10000).reshape(100, 100)


def go_slow(a):  # Plain Python: runs in the interpreter, no compilation
    result = 0.0
    for i in range(a.shape[0]):
        result += np.sin(a[i, i])
    return result


time_slow = %timeit -o go_slow(mx)

In [None]:
from numba import jit


@jit(nopython=True)  # compiled to machine code on the first call
def go_fast(a):
    result = 0.0
    for i in range(a.shape[0]):
        result += np.sin(a[i, i])
    return result


time_fast = %timeit -o go_fast(mx)

In [None]:
display_timings(go_slow=time_slow, go_fast=time_fast)

### 2.2 `ufunc` and `gufunc`

Another feature of Numba is generating NumPy universal functions.

There are two types of universal functions:
- Those which operate on scalars are “universal functions” (`ufunc`), created with the `@vectorize` decorator.
- Those which operate on higher-dimensional arrays and scalars are “generalized universal functions” (`gufunc`), created with the `@guvectorize` decorator.

In [None]:
import math
import numpy as np
import numba


# a simple version without using numba
def func_cpu(x, y):
    return math.pow(x, 3.0) + 4 * math.sin(y)


@np.vectorize(otypes=[float])
def func_numpy_cpu(x, y):
    return math.pow(x, 3.0) + 4 * math.sin(y)


# def func_numpy(x, y):
#     return np.pow(x, 3.0) + 4 * np.sin(y)


@numba.vectorize([numba.float64(numba.float64, numba.float64)], target="cpu")
def func_numba_cpu(x, y):
    return math.pow(x, 3.0) + 4 * math.sin(y)


@numba.vectorize([numba.float64(numba.float64, numba.float64)], target="cuda")
def func_numba_gpu(x, y):
    return math.pow(x, 3.0) + 4 * math.sin(y)

:::{note}
The commented-out variant `func_numpy`, which uses NumPy functions (`np.pow` and `np.sin`), would be vectorized automatically and perform better than the first two, making it a better formulation for this purpose.
**We don't do that here to illustrate vectorized functions which may be required for custom algorithms**.
:::

In [None]:
N = 10000000
mx = np.random.rand(N)
result = np.empty_like(mx)

In [None]:
%%timeit -r 1 -q -o
for i in range(N):
    result[i] = func_cpu(mx[i], mx[i])

In [None]:
time_cpu = _

In [None]:
%timeit -q -o result_numpy_cpu = func_numpy_cpu(mx, mx)

In [None]:
time_numpy_cpu = _

In [None]:
%timeit -q -o result_numba_cpu = func_numba_cpu(mx, mx)

In [None]:
time_numba_cpu = _

In [None]:
%timeit -q -o result = func_numba_gpu(mx, mx)

In [None]:
time_numba_gpu = _

In [None]:
display_timings(
    cpu=time_cpu,
    numpy_cpu=time_numpy_cpu,
    numba_cpu=time_numba_cpu,
    numba_gpu=time_numba_gpu,
)

:::{note}
- Using `ufunc` (or `gufunc`) for GPU programming may not always yield optimal performance due to automatic handling of data transfer and kernel launching.
- In practical applications, not every function can be constructed as a `ufunc`.
:::

### 2.3 An example for vector addition with manual data transfer

Sometimes, for better performance, one needs to tune the kernel launch parameters and manage data transfer manually.

In [None]:
import numpy as np
import numba.cuda  # also binds `numba`; makes sure the CUDA submodule is loaded


@numba.cuda.jit
def func(a, b, c):
    """GPU vectorized addition. Computes C = A + B"""
    # like threadIdx.x + (blockIdx.x * blockDim.x)
    thread_id = numba.cuda.grid(ndim=1)
    size = len(c)

    if thread_id < size:
        c[thread_id] = a[thread_id] + b[thread_id]

Below, we explicitly move two arrays to the device memory.

In [None]:
N = 10000000
a = numba.cuda.to_device(np.random.random(N))
b = numba.cuda.to_device(np.random.random(N))
c = numba.cuda.device_array_like(a)

In [None]:
type(a)

In [None]:
%timeit -r 1 func.forall(len(a))(a, b, c)
print(c.copy_to_host())

In [None]:
nthreads = 256  # Enough threads per block for several warps per block
nblocks = (len(a) // nthreads) + 1  # Enough blocks to cover entire vector

%timeit -r 1 func[nblocks, nthreads](a, b, c)
print(c.copy_to_host())
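
The indexing and launch arithmetic used above can be sketched on the CPU in pure Python. The helpers `launch_config` and `simulated_add_kernel` are hypothetical stand-ins written for this illustration. They are not part of Numba, and the nested loops only mimic what the GPU threads do in parallel. Ceiling division is used here instead of `// nthreads + 1`, so no extra block is launched when the length is an exact multiple of the block size.

```python
def launch_config(n_elements, threads_per_block=256):
    """Return (blocks, threads) so that blocks * threads >= n_elements."""
    blocks = (n_elements + threads_per_block - 1) // threads_per_block  # ceiling division
    return blocks, threads_per_block


def simulated_add_kernel(a, b, c, blocks, threads):
    """CPU stand-in for the kernel: loop over every (block, thread) pair."""
    for block_idx in range(blocks):
        for thread_idx in range(threads):
            thread_id = thread_idx + block_idx * threads  # global thread index
            if thread_id < len(c):  # surplus threads in the last block do nothing
                c[thread_id] = a[thread_id] + b[thread_id]


a = list(range(1000))
b = [2 * x for x in a]
c = [0] * len(a)

blocks, threads = launch_config(len(a))
simulated_add_kernel(a, b, c, blocks, threads)

print(blocks, threads)  # 4 256
print(c[:3], c[-1])  # [0, 3, 6] 2997
```

With 1000 elements and 256 threads per block, 4 blocks provide 1024 threads, and the bounds check keeps the 24 surplus threads from writing past the end of the array.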

In [None]:
del a, b, c
gc.collect()

## 3. Jax

:::{important} `Jax` offers a drop-in replacement for NumPy and a JIT compiler

Although `Jax` was originally designed to build neural networks with built-in support for auto-differentiation, it
can also be used to accelerate other generic computational workloads.

- Like CuPy (introduced later), `import jax.numpy as jnp` provides access to a large subset of optimized NumPy functions.
- Like Numba, we decorate with `@jax.jit` to JIT compile. Compiled functions can run on a GPU (CUDA and experimental ROCm support), TPU or CPU, depending on what is available.
- Unlike Numba, Jax can only work with certain kinds of code: [pure functions, for-loops written differently, etc.](https://jax.readthedocs.io/en/latest/notebooks/Common_Gotchas_in_JAX.html).


:::

### 3.1 Jax as NumPy

In [None]:
import numpy as np

data = np.random.random((10, 10_000))
data[5, 42] = np.nan
data[7, 1111] = np.nan

In [None]:
# compute the 90th percentile along axis 0 (one value per column), ignoring NaNs
time_numpy = %timeit -o np.nanpercentile(data, 90, axis=0)

In [None]:
import jax.numpy as jnp

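# block_until_ready() waits for JAX's asynchronous dispatch, so the timing covers the whole computation
time_jax = %timeit -o -r 1 jnp.nanpercentile(data, 90, axis=0).block_until_ready()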

In [None]:
display_timings(numpy_percentile=time_numpy, jax_percentile=time_jax)

### 3.2 Jax as JIT compiler

Here we will revisit the example function shown in [section 2.2](#id-2-2-ufunc-and-gufunc).

In [None]:
import numpy as np
import jax.numpy as jnp
from jax import jit


def func_numpy(x, y):
    return np.power(x, 3.0) + 4 * np.sin(y)


@jit
def func_jax(x, y):
    return jnp.power(x, 3.0) + 4 * jnp.sin(y)

In [None]:
N = 10000000
mx = np.random.rand(N)

In [None]:
time_numpy = %timeit -o func_numpy(mx, mx)

In [None]:
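# block_until_ready() waits for JAX's asynchronous dispatch, so the timing covers the whole computation
time_jax = %timeit -o func_jax(mx, mx).block_until_ready()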

In [None]:
import jax

jax.devices()

In [None]:
dmx = jax.device_put(mx)

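# block_until_ready() waits for JAX's asynchronous dispatch, so the timing covers the whole computation
time_jax_gpu = %timeit -o -n 50 func_jax(dmx, dmx).block_until_ready()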

In [None]:
display_timings(numpy=time_numpy, jax=time_jax, jax_with_array_in_gpu=time_jax_gpu)

In [None]:
del mx, dmx
gc.collect()

## 4. CuPy

:::{important} `CuPy` is a NumPy/SciPy-compatible array library for GPU-accelerated computing with Python. 
- It was developed for NVIDIA GPUs and has experimental support for AMD GPUs.
- In most cases, you only need to replace `numpy` and `scipy` with `cupy` and `cupyx.scipy` in your Python code.
:::

:::{seealso}
Tutorials:
- https://docs.cupy.dev/en/stable/user_guide/basic.html
- https://arc.leeds.ac.uk/lesson-gpu-programming/02-cupy/index.html
- https://carpentries-incubator.github.io/lesson-gpu-programming/cupy.html
:::

**Replacement of numpy with cupy**

In [None]:
import cupy as cp
import numpy as np

lst = [1, 2, 3, 4, 5, 6, 7, 8, 9]

# creating arrays
lst_cpu = np.array(lst)
lst_gpu = cp.array(lst)

In [None]:
# calculate the Euclidean norm
lst_cpu_norm = np.linalg.norm(lst_cpu)
lst_gpu_norm = cp.linalg.norm(lst_gpu)

print("Using Numpy: ", lst_cpu_norm)
print("Using Cupy:  ", lst_gpu_norm)

Same answer, same decimal precision.

**Speed comparison between cupy and numpy**

In [None]:
# NumPy and CPU Runtime
x_cpu = np.random.random((3000, 1000))
time_numpy = %timeit -o -r 1 np.linalg.norm(x_cpu)

In [None]:
# CuPy and GPU Runtime
x_gpu = cp.random.random((3000, 1000))
time_cupy = %timeit -o -r 1 -n 10 cp.linalg.norm(x_gpu)

In [None]:
display_timings(norm_numpy=time_numpy, norm_cupy=time_cupy)

**Interfacing with user-defined Kernels**

In [None]:
import cupy as cp

x1 = cp.arange(25, dtype=cp.float32).reshape(5, 5)
x2 = cp.arange(25, dtype=cp.float32).reshape(5, 5)

x2

In [None]:
y = cp.zeros((5, 5), dtype=cp.float32)
y

**Simpler approach**: we use `ElementwiseKernel`

In [None]:
add_elemwise = cp.ElementwiseKernel(
    "float32 x1, float32 x2", "float32 y", "y = x1 + x2", "my_add_elemwise"
)

In [None]:
add_elemwise(x1, x2)

**Complicated approach**: we use `RawKernel` which is essentially CUDA / C++ code

In [None]:
add_kernel = cp.RawKernel(
    r"""
extern "C" __global__
void my_add(const float* x1, const float* x2, float* y) {
    int tid = blockDim.x * blockIdx.x + threadIdx.x;
    y[tid] = x1[tid] + x2[tid];
}
""",
    "my_add",
)

In [None]:
add_kernel((5,), (5,), (x1, x2, y))  # grid, block and arguments

y

## 5. PyCUDA

:::{important} [PyCUDA](https://pypi.org/project/pycuda/) is a Python programming environment for CUDA
- It gives users access to NVIDIA’s CUDA parallel computing API from Python.
- PyCUDA is a powerful library but only runs on NVIDIA GPUs.
- Knowledge of CUDA programming is needed.
:::

In [None]:
# Step 1: Initialization

import pycuda.driver as cuda
import pycuda.autoinit
from pycuda.compiler import SourceModule

In [None]:
# Step 2: Transferring data

# 2.1: Generating numbers with single precision
import numpy as np

mx_cpu = np.random.randn(4, 4)
mx_cpu = mx_cpu.astype(np.float32)
print(mx_cpu, end="\n")

# 2.2: Allocation of memory on GPU
mx_gpu = cuda.mem_alloc(mx_cpu.nbytes)

# 2.3: Transferring data from CPU (host) to GPU (device)
cuda.memcpy_htod(mx_gpu, mx_cpu)

In [None]:
# Step 3: Executing a kernel on GPU

# 3.1 Definition of the kernel
mod = SourceModule(
    """
  __global__ void doublify(float *a)
  {
    int idx = threadIdx.x + threadIdx.y*4;
    a[idx] *= 2.0;
  }
  """
)

# 3.2 Get a handle to the compiled kernel, then launch it with a single 4x4 block of threads
doublify = mod.get_function("doublify")

doublify(mx_gpu, block=(4, 4, 1), grid=(1, 1))

In [None]:
# Step 4: Transferring data from GPU (device) to CPU (host)

mx_doubled = np.empty_like(mx_cpu)
cuda.memcpy_dtoh(mx_doubled, mx_gpu)

print(mx_cpu, "\n\n", mx_doubled)
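
The index formula in the `doublify` kernel maps each thread of the 4x4 block to one element of the flattened, row-major matrix. The pure-Python sketch below mimics that mapping on the CPU with illustrative variable names. The nested loops stand in for threads that run in parallel on the GPU.

```python
block_dim = (4, 4)  # (x, y) threads per block, as in block=(4, 4, 1)

values = [float(i) for i in range(16)]  # stand-in for the flattened 4x4 matrix
visited = []

for thread_y in range(block_dim[1]):
    for thread_x in range(block_dim[0]):
        idx = thread_x + thread_y * 4  # same formula as in the kernel
        values[idx] *= 2.0
        visited.append(idx)

print(sorted(visited) == list(range(16)))  # True: each element is touched exactly once
print(values[:4])  # [0.0, 2.0, 4.0, 6.0]
```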

In [None]:
# Bonus: Abstracting Away the Complications
# Using a pycuda.gpuarray to achieve the same effect with less writing

import pycuda.gpuarray as gpuarray

mx_gpu = gpuarray.to_gpu(np.random.randn(4, 4).astype(np.float32))
mx_doubled = (2 * mx_gpu).get()

print(mx_gpu, "\n\n", mx_doubled)